Online Lectures on Bioinformatics

Multiple sequence alignmentIntroductionFor many genes a database search will reveal a whole number of homologous sequences. One then wishes to learn about the evolution and the sequence conservation in such a group. This question surpasses what can reasonably be achieved by the sequence comparison methods described in the previous sections. Pairwise comparisons do not readily show positions that are conserved among a whole set of sequences and tend to miss subtle similarities that become visible when observed simultaneously among many sequences. Thus one wants to simultaneously compare several sequences. A multiple alignment arranges a set of sequences in a scheme where positions believed to be homologous are written in a common column. Like in a pairwise alignment, when a sequence does not possess an amino acid in a particular position this is again denoted by a dash. Like for pairwise alignments there are conventions regarding the scoring of a multiple alignment. In one approach, one simply adds the scores of all the induced pairwise aligments contained in a multiple alignment. For a linear gap penalty this amounts to scoring each column of the alignment by the sum of the amino acid pair scores in this column. The corresponding score is called the sum of pairs (SP) score. (see section on Algorithms for SPoptimal multiple alignments). Although it would be biologically meaningful, the distinctions between global, local and other forms of alignment are rarely made in a multiple alignment. The reason for this will become apparent below when we describe the computational difficulties in computing multiple alignments. Note that the full set of optimal pairwise alignments among a given set of sequences will generally overdetermine the multiple alignment. If one wishes to assemble a multiple alignment from pairwise alignments one has to avoid "closing loops", i.e. one can put together pairwise alignments as long as no new pairwise alignment is included to a sequence which is already part the multiple alignment. In particular, pairwise alignments can be merged when they align one sequence to all others, when a linear order of the given sequence is maintained, or when sequences pairs with pairwise alignments form a tree. While all these schemes allow for the ready definition of algorithms that output multiply aligned sequences, they do not inlcude any information stemming from the simultaneous analysis of several sequences. The alternative approach is to generalize the dynamic programming optimization procedure applied for pairwise alignment to the delineation of a multiple alignment that maximizes a score. The algorithm used is a straightforward generalization of the global alignment algorithm presented in the section Algorithms for the comparison of two sequences. This is easy to see, in particular, for the case of columnoriented scoring function avoiding affine gap penalty in favor of the simpler linear one. With this scoring, the arrangment of gaps and letters in a column can be represented by a boolean vector indicating which sequences contain a gap in a particular column. Given the letters that are being compared, one needs to evaluate the scores for all these arrangements. However, the computational complexity of this algorithm is rather forbidding. For n sequences it is proportional to 2n times the product of the lengths of all sequences.
In practice this algorithm can only be run for a modest number of sequences being compared.
There exists software to compare three sequences with this algorithm
that additionally implements a spacesaving technique.
For more than three sequences algorithms have been developed that
aim at reducing the search space while still optimizing the given scoring function.
The most prominent program of this kind is MSA2 [1].
An alternative approach is used by DCA [2]
which implements a divideandconquer philosophy. The search space is repeatedly
subdivided by identifying strongholds for the alignment.
For a more detailed description of these concepts, also look at the section on
Algorithms for SPoptimal multiple alignments.
exercises None of these approaches, however, would work independent of the number of sequences to be aligned. The most common remedy is reducing the multiple alignment problem to an iterated application of the pairwise alignment algorithm. However, in doing so, one also aims at drawing on the increased amount of information contained in a set of sequences. Instead of simply merging pairwise alignments of sequences, the notion of a profile has been introduced in order to grasp the conservation patterns within subgroups of sequences. A profile is essentially a representation of an already computed multiple alignment of a subgroup. This alignment is "frozen" for the remaining computation. Other sequences or other profiles can be compared to a given profile based on a generalized scoring scheme defined for this purpose. Two such schemes are in use, one based on average scores and one based on information theoretic score [3]. Note that profile scoring schemes respect conservation patterns. Given a profile and a single sequence, the two can be aligned using the basic dynamic programming algorithm together with the accompanying scoring scheme. The result will be an alignment between the two that can readily be converted into a multiple alignment now comprising the sequences underlying the profile plus the new one. Likewise, two profiles can be aligned with each other resulting in a multiple alignment containing all sequences from both profiles. With these tools various multiple alignment strategies can be implemented. Most commonly, a hierachical tree is generated for the given sequences which is then used as a guide for iterative profile construction and alignment. The construction of such a tree is described in the section Phylogenetic Trees and Multiple Alignments.
The above alignment strategy was introduced in papers by Taylor, Barton, Corpet, and Higgins.
Higgins' program Clustal [4] has meanwhile become the de facto standard for multiple sequence alignment.
Another program in use is Dialign [5].
exercises Dialign is different in that it aims at the delineation of regions of similarity among the given sequences. Since iterative profile alignment tends to be guided by a hierachical tree, this step of the computation is also influencing the final result. Usually this tree is computed based on pairwise comparisons and the resulting scores. Subsequently this score matrix is used as input to a clustering procedure like single linkage clustering or UPGMA [6]. However, it is well understood that in an evolutionary sense such a hierarchic clustering does not necessarily result in a biologically valid tree. Thus, when allowing this tree to determine the multiple alignment there is the danger of directing further evolutionary analysis of this alignment into the wrong direction. Consequently, the question has arisen of a common formulation of evolutionary reconstruction and multiple sequence alignment.
The cleanest although biologically somewhat simplistic model attempts to reconstruct
ancestral sequences to attribute to the inner nodes of a tree.
Such reconstructed sequences at the same time determine the multiple alignment
among the sequences.
In this generalized tree alignment one aims at minimizing the sum
of the edges of this tree, where each edge is annotated with the alignment
distance between the sequences at its incident nodes.
As to be expected, the computational complexity of this problem
again makes its solution unpractical but approximation algorithms are known.
The practical effort in this direction go back to the work of Sankoff [7].
Jotun Hein [8], Schwikowski and Vingron [9], and Tao Jiang [10]
produced programs relying on these ideas.
Comments are very welcome. luz@molgen.mpg.de 