Practicals - Multiple sequence alignment

Introduction
Prerequisites
Resources
Global multiple alignment with ClustalX
Local multiple alignment with dialign
Local multiple alignment with MUSCLE
Exercises

Introduction

In this tutorial, we will first analyze a family of well-conserved proteins (Homoserine O-succinyltransferase), and then address some more complex cases.

[back to contents]

Prerequisites

This tutorial uses some material which was collected in the tutorials on biomolecular databases.

[back to contents]

Resources

We will use ClustalX to align multiple sequences. ClustalX should be installed on your computer. If this is not the case, the latest release can be found at
ClustalX includes some basic facilities to infer phylogenetic trees, but does not support the visualization of such trees. For this, we will use two user-friendly programs that were specifically designed to display phylogenetic trees.
1. NJplot is the companion program of ClustalX for visualizing phylogenetic trees. The Windows version of ClustalX already includes NJplot. For other operating systems, you can download NJplot from its Web site.
  NJplot has some nice features, which make it slightly more convenient to use than TreeView. In particular, it can display the branch lengths and bootstrap values on the tree.
2. TreeView is another program for displaying and manipulating phylogenetic trees. This program should be installed on your computer. If this is not the case, the latest release can be obtained here.
  Beware: the dendrogram-visualization program TreeView should not be confused with the microarray-visualization program of the same name developed by Michael Eisen, which will be used for the practical on microarrays.

[back to contents]

Global multiple alignment with ClustalX

Aligning highly conserved sequences: Homoserine O-succinyltransferase

Connect to SRS and retrieve the sequences of all proteins annotated as "Homoserine O-succinyltransferase" in the description field of UniProt/Swiss-Prot (the Swiss-Prot section of UniProt). Save these sequences in fasta format.
Start ClustalX and load the sequence file. Explore the ClustalX menus to locate the different functionalities of the program.
Do the complete alignment. Notice that the program exports two files:
- .dnd the guide tree (Newick format)
- .aln the aligned sequences (clustal format)
Analyze the result. Locate the conserved residues, and the terminal and internal gaps. Compare the column content with the profile below the alignment.
In the menu Quality, select the command Calculate low scoring segments. Analyze the result. Are low scoring segments associated with some specific columns? With some specific sequences?
In the Edit menu, run Select all sequences.
In the Quality menu, run the command Save column score to file. The score profile is saved in a file with extension .qscores. Open this file with a text editor. This file contains a textual representation of the alignment, together with the scores associated with each column.
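
Before loading the fasta file into ClustalX, it can be useful to check how many sequences were actually saved, and how long they are. The Python sketch below parses fasta-formatted text and prints the identifier and length of each record. The function name read_fasta and the two toy records are assumptions made for illustration; they are not part of the tutorial material.

```python
def read_fasta(text):
    """Parse fasta-formatted text into a dict {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = ""
        elif name is not None:
            # Sequence lines are concatenated under the current identifier
            records[name] += line
    return records


# Toy records (made-up sequences, for illustration only)
example = """>seq1 hypothetical protein A
MPIRVPDELP
AVNFLREENV
>seq2 hypothetical protein B
MPIKIPDDLP
"""

sequences = read_fasta(example)
print("number of sequences:", len(sequences))
for name, seq in sequences.items():
    print(name, len(seq))
```

To apply it to a real file, read the file content into a string first (for example with open(path).read()) and pass it to read_fasta.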

[back to contents]

Aligning distant proteins: the Zinc cluster proteins

The Homoserine O-succinyltransferase family that we analyzed above is quite well conserved, and ClustalX returned a good alignment over the whole sequence length.

We will now analyze a more difficult case: the family of proteins containing a binuclear Zinc cluster domain.

Open a new ClustalX application, and load the file containing Saccharomyces cerevisiae Zinc cluster protein sequences (this file has been obtained during the practical on biomolecular databases).

We will now align about 60 sequences. To build the guide tree, ClustalX performs n*(n-1)/2 pairwise alignments, for example 1891 alignments for n = 62 sequences (1770 for n = 60). By default, these alignments are produced with a dynamic programming algorithm, whose running time is proportional to the product of the two sequence lengths. As a consequence, building the guide tree will take a few minutes. This is OK if you have time for a coffee break, but for this practical we will use an alternative: ClustalX includes an option for building the guide tree with a faster (but less accurate) algorithm. We will select this option.
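
The arithmetic above can be checked with a short Python sketch. It counts the number of sequence pairs for a given n, and estimates the total dynamic programming workload as the sum, over all pairs, of the product of the two sequence lengths. The function names and the example lengths are hypothetical; the real cost also depends on the implementation and on your hardware.

```python
from itertools import combinations


def pairwise_count(n):
    """Number of distinct pairs among n sequences: n*(n-1)/2."""
    return n * (n - 1) // 2


def dp_cells(lengths):
    """Total number of dynamic programming cells over all sequence pairs.

    Each pairwise alignment fills a matrix whose size is the product
    of the two sequence lengths.
    """
    return sum(a * b for a, b in combinations(lengths, 2))


for n in (10, 60, 62, 200):
    print(n, "sequences ->", pairwise_count(n), "pairwise alignments")

# Hypothetical sequence lengths, for illustration only
example_lengths = [100, 200, 300]
print("matrix cells to fill:", dp_cells(example_lengths))
```

The number of pairs grows quadratically with the number of sequences, which is why the fast approximate option becomes attractive for large families.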

In the menu Alignment, open the option Alignment parameters > Pairwise parameters. A dialog box appears, with a pop-up menu Pairwise Alignments. Select the option Fast-Approximate.
Beware: the option "Fast-Approximate" is sub-optimal, and should be used only for alignments involving many sequences. Since computer speed is regularly increasing, the actual time will depend very much on your particular hardware. We therefore cannot give general recommendations. A good way to make this choice is to test the required time with the Slow-Accurate option and, if it takes too much time, to stop the process and restart with the option Fast-Approximate.
Run the command Do Guide Tree Only. ClustalX will perform a pairwise alignment between each pair of input sequences, and then apply a hierarchical clustering algorithm in order to determine the order in which sequences will be incorporated in the multiple alignment. The progress of the process is displayed at the bottom of the ClustalX window, so you can estimate whether the processing speed is reasonable.
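
To illustrate how a hierarchical clustering turns pairwise distances into an order of incorporation, here is a toy Python sketch of average-linkage clustering (UPGMA). It is given only to show the principle: the exact clustering method and distance measure implemented in ClustalX may differ, and the sequence names and distance values below are made up.

```python
from itertools import combinations


def average_linkage_tree(names, dist):
    """Cluster items by average linkage.

    names: list of sequence labels
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    Returns (merges, tree), where merges lists the successive groupings
    and tree is a nested tuple describing the final hierarchy.
    """
    # Each cluster is identified by its (nested tuple) tree and stores its members
    clusters = {name: [name] for name in names}

    def average_distance(c1, c2):
        m1, m2 = clusters[c1], clusters[c2]
        total = sum(dist[frozenset((a, b))] for a in m1 for b in m2)
        return total / (len(m1) * len(m2))

    merges = []
    while len(clusters) > 1:
        # Pick the two closest clusters and merge them
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        merges.append((c1, c2, average_distance(c1, c2)))
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    tree = next(iter(clusters))
    return merges, tree


# Made-up pairwise distances between four hypothetical sequences
toy_dist = {
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "C")): 0.4,
    frozenset(("A", "D")): 0.6,
    frozenset(("B", "C")): 0.5,
    frozenset(("B", "D")): 0.7,
    frozenset(("C", "D")): 0.3,
}

merges, tree = average_linkage_tree(["A", "B", "C", "D"], toy_dist)
for left, right, d in merges:
    print("merge", left, "with", right, "at distance", round(d, 3))
print("tree:", tree)
```

In a progressive alignment, the most similar sequences (here A and B) are aligned first, and more distant sequences or groups are added later, following the tree from the leaves to the root.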

Once the guide tree has been built, run the command Do Alignment from Guide Tree. You can also follow the progress of the task on the status bar at the bottom of the ClustalX window.

Analyze the results. How does it compare with the Homoserine O-succinyltransferase alignment? You can evaluate the degree of conservation at each column of the alignment by inspecting the conservation profile (bottom panel of the ClustalX window). Try to locate the well-conserved regions. Do they extend over the whole alignment, or are they localized to particular segments? Why?
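
The idea of a column-wise conservation profile can be sketched in a few lines of Python. The measure used below (fraction of sequences carrying the most frequent residue of the column, gaps being counted as mismatches) is a deliberately simple assumption; it is not the quality score computed by ClustalX. The toy alignment is made up.

```python
from collections import Counter


def column_conservation(aligned):
    """Return one conservation value (between 0 and 1) per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            # Column made only of gaps
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Made-up aligned sequences, for illustration only
toy_alignment = [
    "MKCAC-RL",
    "MRCSCGRL",
    "M-CTC-KL",
]

scores = column_conservation(toy_alignment)
for position, score in enumerate(scores, start=1):
    print(position, round(score, 2))

# Columns where all sequences share the same residue
conserved = [i + 1 for i, s in enumerate(scores) if s == 1.0]
print("fully conserved columns:", conserved)
```

In a family with a single conserved domain, such a profile would show high values clustered in one segment and low values elsewhere.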

[back to contents]

Improving the alignment by re-aligning selected sequences or columns

ClustalX includes some features that do not exist in the precursor program (ClustalW): after having obtained the automatic alignment with the procedure above, some additional functions allow you to estimate the quality of the alignment, in a column-wise or in a sequence-wise way.

The command Calculate low-scoring segments estimates the correspondence between each piece of individual sequence and the corresponding columns of the alignment, and highlights poorly aligned segments. Low-scoring segments can reveal two types of problems.

Misaligned sequences: some segments of individual sequences may not fit the profiles of the columns with which they were aligned. Such misalignment may result from imperfections of the progressive alignment: sequences incorporated early may have been aligned in a sub-optimal way, and re-aligning them once the complete alignment has been obtained might return a better result. For this problem, some improvement might be obtained by re-aligning the misaligned sequences with the rest of the alignment (command Alignment > Realign Selected Sequences).
Misaligned columns: some columns may be poorly aligned due to the presence of indels (insertions or deletions) that were not properly treated during the progressive alignment. The result might be improved by re-aligning these columns with the command Realign selected residue range.

We will apply this strategy with the multiple alignment of yeast Zn cluster proteins obtained in the previous section.

Run the command Calculate low scoring segments. Examine carefully the positions which are not darkened. Notice that cysteine occurs at several high scoring positions. Count the number of positions where this amino acid is conserved.

A few sequences are not well aligned with the conserved domain. Select these sequences in the left panel of ClustalX. Run the command Realign selected sequences. These sequences appear now at the bottom of the alignment.

Questions:

ClustalX also allows you to realign selected columns. This is particularly useful if the similarity is restricted to some segments of the sequence alignment, due to the presence of a conserved domain within larger sequences that also contain non-conserved parts. This is exactly our case: the Zn cluster domain is located on the left side of the alignment and, to the right of this domain, the alignment looks rather bad.

In the alignment, select the columns containing the conserved positions, and extend the selection a bit further (for example, select columns 70 to 160), and run the command Realign selected residue range in the Alignment menu.

Questions:

The recalculation may have introduced some columns completely made of gaps. You can remove these with the command Remove gap only columns in the Edit menu. Try it and analyze the new alignment.
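
The effect of removing gap-only columns can be illustrated with a short Python sketch. It assumes that the alignment is available as a list of equal-length strings and that gaps are written with the character "-"; the function name and the toy alignment are made up for illustration.

```python
def remove_gap_only_columns(aligned, gap="-"):
    """Drop the columns in which every sequence has a gap."""
    kept = [col for col in zip(*aligned) if any(c != gap for c in col)]
    if not kept:
        # Every column was made of gaps only
        return ["" for _ in aligned]
    # Transpose the kept columns back into rows
    return ["".join(row) for row in zip(*kept)]


# Made-up alignment with two gap-only columns (positions 3 and 4)
toy_alignment = [
    "MK--C-L",
    "MR--CGL",
    "M---C-L",
]

cleaned = remove_gap_only_columns(toy_alignment)
for row in cleaned:
    print(row)
```

Columns that contain at least one residue are kept, even if most sequences have a gap at that position.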

In the left panel, select the GAL4 sequence. In the Quality menu, run the command Save column score to file. Open the resulting file with a text editor. Notice that the highest scoring positions are concentrated at the beginning of the sequence.

Open a session in SRS, and select the UniProt entry for GAL4 in Saccharomyces cerevisiae. Compare the highest scoring residues (in the clustal alignment) with the UniProt features of GAL4. Which features do you find in the high-scoring segments of the alignment?

[back to contents]

Inferring a phylogenetic tree with ClustalX

The multiple alignment generated by ClustalX can be used to infer a phylogenetic tree for the set of input sequences. By default, ClustalX stores the alignment result in a file with extension .aln, which is located in the same folder as your input sequences (the fasta file that was loaded as input sequences).

After having generated the multiple alignment, you can directly run the command Tree > Draw Tree, in order to obtain a phylogenetic tree.

The phylogenetic tree computed with the command "Draw Tree" is stored in a file with extension .ph in the same folder as your input sequences. Note that the extension .ph refers to the phylogenetic tree, as opposed to the extension .dnd, which indicates the guide tree (dnd stands for "dendrogram").

In order to visualize the result, you need to open a helper program (NJplot or TreeView, see the section Resources).

Beware: the NJ-tree (neighbor-joining tree) generated with the command "Draw Tree" differs from the guide tree generated before the alignment. Indeed, the guide tree was based on separate pairwise alignments between each pair of input sequences, whereas the "post-alignment" NJ-tree is built by comparing residues between each pair of sequences in the multiple alignment. The "post-alignment" NJ-tree is thus more robust than the guide tree. In short, the guide tree should not be considered valid for inferring phylogeny. The NJ-tree is a good first step towards phylogeny inference. However, NJ is still a very basic approach: phylogeny inference includes many other methods, which are based on more complex evolutionary models and are likely to return more reliable hypotheses.
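
Comparing residues between each pair of sequences in the multiple alignment amounts to computing a distance matrix, which is the input of the neighbor-joining method. The Python sketch below computes a simple distance (proportion of differing residues, ignoring the columns where either sequence has a gap). This particular distance is an assumption chosen for illustration: the distance and the optional corrections applied by ClustalX may differ. The sequence names and the toy alignment are made up.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing residues over the columns without gaps."""
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else float("nan")


# Made-up aligned sequences, for illustration only
toy_alignment = {
    "seq1": "MKC-L",
    "seq2": "MRCGL",
    "seq3": "M-CAV",
}

distances = {
    (a, b): p_distance(toy_alignment[a], toy_alignment[b])
    for a, b in combinations(toy_alignment, 2)
}
for (a, b), d in distances.items():
    print(a, b, round(d, 3))
```

Since all distances are measured on the same multiple alignment, they are directly comparable with each other, which is not guaranteed for distances derived from independent pairwise alignments.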

More details are given in the tutorial on phylogeny.

[back to contents]

Local multiple alignment with dialign

To be written

[back to contents]

Local multiple alignment with MUSCLE

To be written

[back to contents]

Exercises

With SRS, retrieve the sequences of all the proteins from Swiss-Prot having the word "aspartokinase" in their description. Align these sequences with ClustalX and analyze the result. You can distinguish two parts in the alignment: one segment is found in all the proteins, whereas the other one is found in a subset only. How do you interpret this result? (Don't hesitate to read the Swiss-Prot annotations for these proteins in order to understand their content.)
With SRS, retrieve all the Zn(2)Cys(6) binuclear cluster proteins from Schizosaccharomyces pombe. Align these proteins with ClustalX.
Align the two previous alignments with each other (Zn(2)Cys(6) binuclear cluster proteins from Saccharomyces cerevisiae and Schizosaccharomyces pombe, respectively).

[back to contents]

Jacques van Helden (van-helden.j@univmed.fr), last revised Dec 11, 2008.
