previous section previous page next page next section
CMB

Online Lectures on Bioinformatics

navigation


Algorithms for the comparison of two sequences


This section provides the reader with a formal notation of the algorithms to compare two sequences, which are based on dynamic programming. Furthermore gap-functions, score- and distance-functions are discussed in detail.

Introduction: Type I and type II alignments

Consider the following example of an alignment between two sequences "RDISLVKNAGI" and "RNILVSDAKNVGI"


R D I S L V - - - K N A G I 
R N I - L V S D A K N V G I

As before dashes represent insertions or deletions (shortly called indels or gaps). The order in which the residues occur in the alignment is identical to the one in their respective sequences. Two residues that are printed in the same column are called matched.

An alternative alignment for the first three pairs could be . As allowing for this second variant has consequences on the formulation of algorithms, some authors distinguish two types of alignments. Kruskal [Kru83] calls them trace and alignment, we want to use "type I" and "type II" alignment because we view both versions as having equal rights to be called alignment. In simple terms a type I alignment is one that does not allow for adjacent gaps in opposite sequences, whereas a type II alignment does. A type I alignment can therefore be represented as a sequence of successive residue pairs that do not violate the order of the residues in the sequences. In at least one of the sequences no residue may be skipped when going to the next pair. The above example seen as a type I alignment can be expressed by listing the index-pairs for the matched residues:


( (1,1), (2,2), (3,3), (5,4), (6,5), (7,9), (8,10), (9,11), (10,12), (11,13))


E.g. (5,4) means that the 5th residue of the first sequence is matched with the 4th residue of the second sequence. This representation will be chosen to define the type I alignment. To describe a type II alignment such a representation does not suffice. In the case of for example the pairs (1,1),(3,3) do not describe the alignment properly. Instead it is necessary to also represent the gap character, say by the index 0. One can use the notation ( (1,1),(2,0),(0,2),(3,3) ) which exactly describes the given type II alignment. A type II alignment is therefore defined as any sequence of pairs (following the order of residues in the sequences) that may contain either a residue or a gap-character. Pairing two gap-characters does not make sense in the context of sequence comparison and therefore is forbidden in the alignmnent. Type II alignments are richer than type I in that some type II alignments cannot be described in the type I formalism. If one restricts type II alignments to those that do not contain two adjacent columns where one has a gap-character in one sequence and the adjacent one in the other column, then each such alignment corresponds to a type I alignment. The distinction between these two kinds of alignments will be important to explain some of the differences among algorithms for sequence comparison as they are given in the literature.

In the case of the 20 amino acids, identical matched residues are not as frequent as in DNA. Consequently weighting schemes have been developed which attribute a value to a pair of matched amino acids. One such scheme was devised by M. Dayhoff [DBH83] and is based on exchange frequencies between amino acids. This 20 x 20 matrix attributes different positive values (ranging from +2 to +17) to exact matches and values between -8 and +7 for mismatches. The score of an alignment is then made up of the weights for the matching pairs in the alignment minus a penalty for every gap introduced. The gap penalty will in general be a function g of the length of the gap. The example above scored by the Dayhoff matrix would give:


The intention behind such a scoring scheme is that the alignment which optimizes this score should best represent the biological similarity between the two sequences. Finding such an optimal alignment is the central task in sequence comparison.

Most of the subsequent algorithms are based on the representation of an alignment as a path in a comparison matrix as shown in the below. The two kinds of alignment lead to two slightly different versions of this representation. For a type I alignment the grid points of the comparison matrix are thought of as labeled with the corresponding residue pair. A path may move from one grid point to any one to its bottom right without skipping more than one row or column simultaneously. The possible arcs from each point are depicted in the top right of the matrix. For a type II alignment there are only three moves allowed which are shown in the top left of the right matrix. Interpreting these as matching the residues in a row/column (diagonal arc), deleting the residue corresponding to the column of an arc (horizontal move) or deleting the residue corresponding to the row of an arc (vertical move) yields exactly the type II alignments corresponding to a path in this matrix.





Interpreting these as matching the residues in a row/column (diagonal arc), deleting the residue corresponding to the column of an arc (horizontal move) or deleting the residue corresponding to the row of an arc (vertical move) yields exactly the type II alignments corresponding to a path in this matrix. The following sections will define the two types of alignments and their scores. Then for each of them a directed graph will be introduced according the above description together with a one-to-one mapping between alignments and paths in the graph. The arcs will be given weights such that under the mapping the length of a path is just the score of the corresponding alignment. The optimal path in such a graph therefore also defines the optimal alignment of the sequences. Using this framework we can give a common description and proof to the algorithms introduced in the literature ([NeW70], [San72], [Sel74], [WSB76], [Got82], [Wat84a]).


Comments are very welcome.
luz@molgen.mpg.de