Practicals - Multiple sequence alignment
Contents
- Introduction
- Prerequisites
- Resources
- Global multiple alignment with ClustalX
- Local multiple alignment with dialign
- Local multiple alignment with MUSCLE
- Exercises
Introduction
In this tutorial, we will first analyze a family of well-conserved
proteins (Homoserine O-succinyltransferase), and then address some
more complex cases.
Prerequisites
This tutorial uses some material which was collected in the tutorials
on biomolecular databases.
Resources
- We will use ClustalX to align multiple sequences. ClustalX
should be installed on your computer. If this is not the case,
download the latest release from the ClustalX web site.
- ClustalX includes some basic facilities to infer phylogenetic
trees, but does not support the visualization of such trees. For this,
we will use two user-friendly programs that were specifically designed
to display phylogenetic trees.
- NJplot is the companion program of ClustalX for visualizing
phylogenetic trees.
The Windows version of ClustalX already includes NJplot. For
other operating systems, you can download NJplot from its web site.
NJplot has some nice features, which make it slightly more
convenient to use than TreeView. In particular, it can display
the branch lengths and bootstrap values on the tree.
- TreeView is another program for displaying and
manipulating phylogenetic trees. This program should be installed on
your computer. If this is not the case, download the latest release
from the TreeView web site.
Beware: the dendrogram-visualization program TreeView
should not be confused with the microarray-visualization program of
the same name developed by Michael Eisen, which will be used for
the practical on microarrays.
Global multiple alignment with ClustalX
Aligning highly conserved sequences: Homoserine O-succinyltransferase
- Connect to SRS and retrieve
the sequences of all proteins annotated as "Homoserine O-succinyltransferase"
in the description field of UniProt/Swiss-Prot (the Swiss-Prot section of
UniProt). Save these sequences in FASTA format.
- Start ClustalX and load the sequence file. Explore the ClustalX
menus to locate the different functionalities of the program.
- Do the complete alignment. Notice that the program exports
two files:
- .dnd: the guide tree (Newick format)
- .aln: the aligned sequences (Clustal format)
- Analyze the result. Locate the conserved residues, terminal and
internal gaps. Compare the column content with the profile below the
alignment.
- In the menu Quality, select the command Calculate low
scoring segments. Analyze the result. Are low-scoring segments
associated with some specific columns? With some specific sequences?
- In the Edit menu, run Select all sequences.
- In the Quality menu, run the command Save column
score to file. The score profile is saved in a file with
extension .qscores. Open this file with a text editor. This
file contains a textual representation of the alignment, together with
the score associated with each column.
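To get a feeling for what a per-column score measures, the sketch below reads Clustal-format text (as found in a .aln file) and computes, for each column, the fraction of sequences that carry the most frequent residue. This is a simplified stand-in and not the score that ClustalX writes to the .qscores file. The function names (parse_clustal, column_conservation) and the three toy sequences are assumptions made for illustration.

```python
from collections import Counter


def parse_clustal(text):
    """Return {name: aligned_sequence} from Clustal-format text."""
    seqs = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and the consensus line
        # (the consensus line starts with whitespace).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        # Alignments are written in blocks: concatenate the chunks per name.
        seqs[name] = seqs.get(name, "") + chunk
    return seqs


def column_conservation(seqs):
    """Per column: share of sequences carrying the most frequent non-gap residue."""
    scores = []
    for column in zip(*seqs.values()):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Toy alignment (hypothetical sequences, not real proteins).
example = """CLUSTAL X (1.83) multiple sequence alignment

seqA   MKC-LV
seqB   MRCALV
seqC   MKC-IV
       *:*  *
"""

alignment = parse_clustal(example)
for position, score in enumerate(column_conservation(alignment), start=1):
    print(position, round(score, 2))
```

Columns 1, 3 and 6 of the toy alignment are fully conserved and get a score of 1.0, whereas the column containing two gaps gets the lowest score.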
Aligning distant proteins: the Zinc cluster proteins
The Homoserine O-succinyltransferase family that we analyzed above is
quite well conserved, and ClustalX returned a good alignment over the
whole sequence length.
We will now analyze a more difficult case: the family of proteins
containing a binuclear Zinc cluster domain.
- Open a new ClustalX application, and load the file containing
Saccharomyces cerevisiae Zinc cluster protein sequences (this
file has been obtained during the practical on biomolecular databases).
- We will now align ~60 sequences. To build the guide tree,
ClustalX performs n*(n-1)/2 pairwise alignments (1770 for n=60,
1891 for n=62). By default, these alignments are produced with a dynamic
programming algorithm, which takes a time proportional to the product
of the sequence lengths. As a consequence, it will take a few
minutes to build the guide tree. This is OK if you have time for a
coffee break, but for this practical we will use an alternative:
ClustalX includes an option for building the guide tree with a
faster (but less accurate) algorithm. We will select this option.
In the menu Alignment, open the option Alignment
parameters > Pairwise parameters. A dialog box appears, with a
pop-up menu Pairwise Alignments. Select the
option Fast-Approximate.
Beware: the option "Fast-Approximate" is sub-optimal, and
should be used only for alignments involving many sequences. Since
computer speed is regularly increasing, the actual time will depend
very much on your particular hardware. We thus cannot give general
recommendations. A good way to treat this choice is to test the
required time with the Slow-Accurate option, and, in case it
would take too much time, to stop the process and restart with the
option Fast-Approximate.
- Run the command Do Guide Tree Only. ClustalX will
perform a pairwise alignment between each pair of input sequences, and
then apply a hierarchical clustering algorithm in order to determine
the order in which sequences will be incorporated in the multiple
alignment. The progress of the process is displayed at the bottom of
the ClustalX window, so you can estimate if the processing speed is
reasonable enough.
- Once the guide tree has been built, run the command Do
Alignment from Guide Tree. You can also follow the progress of the
task on the status bar at the bottom of the ClustalX window.
Analyze the results. How does it compare with the Homoserine
O-succinyltransferase alignment? You can evaluate the degree of
conservation at each column of the alignment by inspecting the
conservation profile (bottom panel of the ClustalX window). Try to
locate the well-conserved regions. Do they extend over the whole
alignment, or are they localized to particular segments? Why?
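The arithmetic behind the guide-tree step can be checked with a few lines of Python. The sketch below counts the sequence pairs, estimates the total number of dynamic-programming cells to fill, and includes a minimal global alignment score (Needleman-Wunsch with a linear gap penalty) to show why each pairwise alignment costs a time proportional to the product of the two lengths. The function names, the uniform length of 500 residues and the match/mismatch/gap values are illustrative assumptions. ClustalX itself relies on substitution matrices and on separate gap opening and gap extension penalties.

```python
from itertools import combinations


def count_pairs(n):
    """Number of unordered pairs among n sequences: n*(n-1)/2."""
    return n * (n - 1) // 2


def dp_cells(lengths):
    """Total dynamic-programming cells over all pairs (sum of length products)."""
    return sum(a * b for a, b in combinations(lengths, 2))


def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty.

    The two nested loops fill len(s) * len(t) cells, hence the cost
    proportional to the product of the sequence lengths.
    """
    previous = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        current = [i * gap]
        for j in range(1, len(t) + 1):
            diagonal = previous[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


print(count_pairs(60))                        # 1770
print(count_pairs(62))                        # 1891
print(dp_cells([500] * 62))                   # 472750000 (assumed 500-residue sequences)
print(global_alignment_score("ACGT", "AGT"))  # 1 (three matches, one gap)
```

With these assumed lengths, the slow and accurate option has to fill several hundred million cells, which explains why the guide tree takes noticeable time on a set of this size.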
Improving the alignment by re-aligning selected sequences or columns
ClustalX includes some features that do not exist in the precursor
program (ClustalW): after having obtained the automatic alignment with
the procedure above, some additional functions allow you to estimate
the quality of the alignment, in a column-wise or in a sequence-wise
way.
The command Calculate low-scoring segments estimates the
correspondence between each piece of individual sequence and the
corresponding columns of the alignment, and highlights poorly aligned
segments. Low-scoring segments can reveal two types of problems.
- Misaligned sequences: some segments of individual sequences may
not fit the profiles of the columns with which they were aligned. Such
misalignment may result from imperfections of the progressive
alignment: sequences incorporated early may have been aligned in a
sub-optimal way, and re-aligning them once the complete alignment has
been obtained might return a better result. For this problem, some
improvement might be obtained by re-aligning the misaligned sequences
with the rest of the alignment (command Alignment > Realign
Selected Sequences).
- Misaligned columns: some columns may be poorly aligned
due to the presence of indels (insertions or deletions) that were not
properly treated during the progressive alignment. The result might be
improved by re-aligning these columns with the command Realign
selected residue range.
We will apply this strategy to the multiple alignment of yeast Zn
cluster proteins obtained in the previous section.
- Run the command Calculate low scoring segments. Carefully
examine the positions which are not darkened. Notice that
cysteine occurs at several high-scoring positions. Count the number of
positions where this amino acid is conserved.
- A few sequences are not well aligned with the conserved
domain. Select these sequences in the left panel of ClustalX. Run the
command Realign selected sequences. These sequences now appear
at the bottom of the alignment.
Questions: did you obtain any improvement by realigning selected
sequences? How do you interpret this result?
- ClustalX also allows you to realign selected columns. This is
particularly useful if the similarity is restricted to some segments
of the sequence alignment, due to the presence of some conserved
domain within larger sequences that also contain non-conserved
parts. This is exactly our case: the Zn cluster domain is located on
the left side of the alignment, and, from the right of this domain,
the alignment looks rather bad.
In the alignment, select the columns containing the conserved
positions, and extend the selection a bit further (for example, select
columns 70 to 160), and run the command Realign selected residue
range in the Alignment menu.
Questions: did you obtain an improvement by realigning
selected residues? How do you interpret this result?
- The recalculation may have introduced some columns completely
made of gaps. You can remove these with the command Remove gap only
columns in the Edit menu. Try it and analyze the new
alignment.
- In the left panel, select the GAL4 sequence. In the Quality
menu, run the command Save column score to file. Open the
resulting file with a text editor. Notice that the highest scoring
positions are concentrated in the beginning of the sequence.
- Open a session in SRS, and select the UniProt entry for
GAL4 in Saccharomyces cerevisiae. Compare the highest scoring
residues (in the clustal alignment) with the UniProt features of
GAL4. Which features do you find in the high-scoring segments of the
alignment?
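The effect of the command Remove gap only columns used in the steps above can be illustrated with a short sketch. It assumes that the alignment is held as a list of equal-length strings with "-" as the gap character; the function name and the toy sequences are hypothetical, and this is not ClustalX code.

```python
def remove_gap_only_columns(rows, gap="-"):
    """Drop every column in which all sequences carry a gap."""
    kept = [col for col in zip(*rows) if any(ch != gap for ch in col)]
    if not kept:
        return ["" for _ in rows]
    # Transpose the kept columns back into one string per sequence.
    return ["".join(chars) for chars in zip(*kept)]


# Toy alignment in which columns 3 and 4 contain only gaps.
aligned = ["MK--CA", "MR--C-", "M---CA"]
print(remove_gap_only_columns(aligned))  # ['MKCA', 'MRC-', 'M-CA']
```

Columns that contain at least one residue are kept untouched, so the relative alignment of the remaining positions does not change.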
Inferring a phylogenetic tree with ClustalX
The multiple alignment generated by ClustalX can be used to infer a
phylogenetic tree for the set of input sequences. By default, ClustalX
stores the alignment result in a file with extension .aln,
which is located in the same folder as your input sequences (the fasta
file that was loaded as input sequences).
After having generated the multiple alignment, you can directly
run the command Tree > Draw Tree, in order to obtain a
phylogenetic tree.
The phylogenetic tree computed with the command "Draw Tree" is
stored in a file with extension .ph in the same folder as
your input sequences. Note that the extension .ph refers to
the phylogenetic tree, as opposed to the extension .dnd
which indicates the guide tree (dnd stands for "dendrogram").
In order to visualize the result, you need to open a helper program
(NJplot or TreeView, see the
section Resources).
Beware: the NJ-tree generated with the
command "Draw Tree" differs from the guide tree generated before the
alignment. Indeed, the guide tree was based on pairwise
alignments between each pair of input sequences, whereas the
"post-alignment" NJ-tree is built by comparing residues between each
pair of sequences in the multiple alignment. The
"post-alignment" NJ-tree is thus more robust than the guide tree. In
short, the guide tree should not be considered as valid for inferring
phylogeny. The NJ-tree is a good first step towards phylogeny
inference. However, NJ is still a very basic approach:
phylogeny inference includes many other methods that are based on
more complex evolutionary models and are likely to return more
reliable hypotheses.
More details are given in the tutorial
on phylogeny.
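The "post-alignment" tree starts from distances measured between pairs of rows of the multiple alignment. As a rough sketch of that idea, the code below computes the proportion of differing residues (p-distance) for each pair of aligned sequences, ignoring columns where either sequence has a gap. This is an assumption-laden illustration: the exact distance calculation in ClustalX, including its options for gap handling and corrections, may differ, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return 0.0
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(seqs):
    """Symmetric {name: {name: distance}} table for all aligned sequences."""
    names = list(seqs)
    dist = {name: {name: 0.0} for name in names}
    for n1, n2 in combinations(names, 2):
        d = p_distance(seqs[n1], seqs[n2])
        dist[n1][n2] = d
        dist[n2][n1] = d
    return dist


# Toy aligned sequences (hypothetical).
seqs = {"A": "MKC-LV", "B": "MRCALV", "C": "MKC-IV"}
dist = distance_matrix(seqs)
for name in seqs:
    print(name, [round(dist[name][other], 2) for other in seqs])
```

A distance table of this kind is the input of a distance-based tree-building method such as neighbor joining.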
Local multiple alignment with dialign
To be written
Local multiple alignment with MUSCLE
To be written
Exercises
- With SRS,
retrieve the sequences of all the proteins from Swiss-Prot having the
word "aspartokinase" in their description. Align these sequences with
ClustalX and analyze the result. You can distinguish two parts in the
alignment: one segment is found in all the proteins whereas the other
one is found in a subset only. How do you interpret this result? (Don't
hesitate to read the Swiss-Prot annotations for these proteins in
order to understand their content.)
- With SRS, retrieve all the Zn(2)Cys(6) binuclear cluster proteins
from Schizosaccharomyces pombe. Align these proteins with
ClustalX.
- Align the two previous alignments (Zn(2)Cys(6) binuclear cluster
proteins from Saccharomyces cerevisiae and Schizosaccharomyces
pombe, respectively).
Jacques van Helden
(van-helden.j@univmed.fr), last
revised Dec 11, 2008.