Online Lectures on Bioinformatics

Algorithms for the comparison of two sequencesThis section provides the reader with a formal notation of the algorithms to compare two sequences, which are based on dynamic programming. Furthermore gapfunctions, score and distancefunctions are discussed in detail. Introduction: Type I and type II alignmentsConsider the following example of an alignment between two sequences "RDISLVKNAGI" and "RNILVSDAKNVGI"
As before dashes represent insertions or deletions (shortly called indels or gaps). The order in which the residues occur in the alignment is identical to the one in their respective sequences. Two residues that are printed in the same column are called matched.
An alternative alignment for the first three pairs
could be
.
As allowing for this second
variant has consequences on the formulation of algorithms, some authors
distinguish two types of alignments.
Kruskal [Kru83] calls them trace and alignment,
we want to use "type I" and "type II" alignment because we view both
versions as having equal rights to be called alignment.
In simple terms a type I alignment is one that does not allow for
adjacent gaps in opposite sequences, whereas a type II alignment does.
A type I alignment can therefore be represented as a sequence of successive
residue pairs that do not violate the order of the residues in the sequences.
In at least one of the sequences no residue may be skipped when going to
the next pair. The above example seen as a type I alignment can be expressed by
listing the indexpairs for the matched residues:
E.g. (5,4) means that the 5th residue of the first sequence is matched with the 4th residue of the second sequence. This representation will be chosen to define the type I alignment. To describe a type II alignment such a representation does not suffice. In the case of for example the pairs (1,1),(3,3) do not describe the alignment properly. Instead it is necessary to also represent the gap character, say by the index 0. One can use the notation ( (1,1),(2,0),(0,2),(3,3) ) which exactly describes the given type II alignment. A type II alignment is therefore defined as any sequence of pairs (following the order of residues in the sequences) that may contain either a residue or a gapcharacter. Pairing two gapcharacters does not make sense in the context of sequence comparison and therefore is forbidden in the alignmnent. Type II alignments are richer than type I in that some type II alignments cannot be described in the type I formalism. If one restricts type II alignments to those that do not contain two adjacent columns where one has a gapcharacter in one sequence and the adjacent one in the other column, then each such alignment corresponds to a type I alignment. The distinction between these two kinds of alignments will be important to explain some of the differences among algorithms for sequence comparison as they are given in the literature.
In the case of the 20 amino acids, identical matched
residues are not as frequent as in DNA. Consequently weighting schemes have
been developed which attribute a value to a pair of matched amino acids.
One such scheme was devised by M. Dayhoff [DBH83] and is based on exchange
frequencies between amino acids. This 20 x 20
matrix
attributes different positive values (ranging from +2 to +17)
to exact matches and values between 8 and +7 for mismatches.
The score of an alignment is then
made up of the weights for the matching pairs in the alignment minus a penalty
for every gap introduced. The gap penalty will in general be a function g of the length of the gap.
The example above scored by the Dayhoff matrix would give:
The intention behind such a scoring scheme is that the alignment which optimizes this score should best represent the biological similarity between the two sequences. Finding such an optimal alignment is the central task in sequence comparison.
Most of the subsequent algorithms are based on the representation of
an alignment as a path in a comparison matrix as shown in the below.
The two kinds of alignment lead to two slightly different versions
of this representation.
For a type I alignment the grid points of the comparison matrix
are thought of as labeled with the corresponding residue pair.
A path may move from one grid point to any
one to its bottom right without skipping more than one row or column
simultaneously. The possible arcs from each point are depicted in the
top right of the matrix. For a type II alignment there are only three moves
allowed which are shown in the top left of the right matrix.
Interpreting these
as matching the residues in a row/column (diagonal arc), deleting
the residue corresponding to the column of an arc (horizontal move) or
deleting the residue corresponding to the row of an arc (vertical move)
yields exactly the type II alignments corresponding to a path in this matrix.
Interpreting these
as matching the residues in a row/column (diagonal arc), deleting
the residue corresponding to the column of an arc (horizontal move) or
deleting the residue corresponding to the row of an arc (vertical move)
yields exactly the type II alignments corresponding to a path in this matrix.
The following sections will define the two types of alignments and their scores.
Then for each of them a directed graph will be introduced according the above
description together
with a onetoone mapping between alignments and paths in the graph.
The arcs will be given weights
such that under the mapping the length of a path is just the score of the
corresponding alignment. The optimal path in such a graph therefore also
defines the optimal alignment of the sequences. Using this framework we can
give a common description and proof to the algorithms introduced in the
literature ([NeW70], [San72], [Sel74], [WSB76], [Got82], [Wat84a]).
Comments are very welcome. luz@molgen.mpg.de 