Online Lectures on Bioinformatics
|
Algorithms for the comparison of two sequencesThis section provides the reader with a formal notation of the algorithms to compare two sequences, which are based on dynamic programming. Furthermore gap-functions, score- and distance-functions are discussed in detail. Introduction: Type I and type II alignmentsConsider the following example of an alignment between two sequences "RDISLVKNAGI" and "RNILVSDAKNVGI"
As before dashes represent insertions or deletions (shortly called indels or gaps). The order in which the residues occur in the alignment is identical to the one in their respective sequences. Two residues that are printed in the same column are called matched.
An alternative alignment for the first three pairs
E.g. (5,4) means that the 5th residue of the first sequence is matched with the 4th residue of the second sequence. This representation will be chosen to define the type I alignment. To describe a type II alignment such a representation does not suffice. In the case of
In the case of the 20 amino acids, identical matched
residues are not as frequent as in DNA. Consequently weighting schemes have
been developed which attribute a value to a pair of matched amino acids.
One such scheme was devised by M. Dayhoff [DBH83] and is based on exchange
frequencies between amino acids. This 20 x 20
matrix
attributes different positive values (ranging from +2 to +17)
to exact matches and values between -8 and +7 for mismatches.
The score of an alignment is then
made up of the weights for the matching pairs in the alignment minus a penalty
for every gap introduced. The gap penalty will in general be a function g of the length of the gap.
The example above scored by the Dayhoff matrix would give:
The intention behind such a scoring scheme is that the alignment which optimizes this score should best represent the biological similarity between the two sequences. Finding such an optimal alignment is the central task in sequence comparison.
Most of the subsequent algorithms are based on the representation of
an alignment as a path in a comparison matrix as shown in the below.
The two kinds of alignment lead to two slightly different versions
of this representation.
For a type I alignment the grid points of the comparison matrix
are thought of as labeled with the corresponding residue pair.
A path may move from one grid point to any
one to its bottom right without skipping more than one row or column
simultaneously. The possible arcs from each point are depicted in the
top right of the matrix. For a type II alignment there are only three moves
allowed which are shown in the top left of the right matrix.
Interpreting these
as matching the residues in a row/column (diagonal arc), deleting
the residue corresponding to the column of an arc (horizontal move) or
deleting the residue corresponding to the row of an arc (vertical move)
yields exactly the type II alignments corresponding to a path in this matrix.
Interpreting these
as matching the residues in a row/column (diagonal arc), deleting
the residue corresponding to the column of an arc (horizontal move) or
deleting the residue corresponding to the row of an arc (vertical move)
yields exactly the type II alignments corresponding to a path in this matrix.
The following sections will define the two types of alignments and their scores.
Then for each of them a directed graph will be introduced according the above
description together
with a one-to-one mapping between alignments and paths in the graph.
The arcs will be given weights
such that under the mapping the length of a path is just the score of the
corresponding alignment. The optimal path in such a graph therefore also
defines the optimal alignment of the sequences. Using this framework we can
give a common description and proof to the algorithms introduced in the
literature ([NeW70], [San72], [Sel74], [WSB76], [Got82], [Wat84a]).
Comments are very welcome. luz@molgen.mpg.de |