Online Lectures on Bioinformatics
|
Algorithms for the comparison of two sequences
Type I alignment
Let Definition: Given
with
When an alignment does not begin in (1,1) and end in (|s|, |t|), there are gaps at the left and/or right end of the alignment. Mathematically there is no reason to treat these any different from the gaps in the interior of the alignment. It is, however, often the case that one of the two sequences compared is known to correspond only to a part of the other one. In such a case attributing a penalty to these end-gaps is not intended. The above definition of alignment score leaves end-gaps without penalty. Alternatively one can define the score in such a way that gaps at the ends of the sequences are penalized just like any other one:
A typical gap function ("gap penalty") would, e.g., be g(n) = a + b(n-2) with a > b. The function w normally is a matrix like the Dayhoff-matrix [DBH83] mentioned above.
The biological task of aligning two sequences can now be stated as the problem
of finding the maximal score that an alignment can achieve as well as
identifying at least one alignment where this score is assumed.
To achieve this,
a type I alignment is represented as a path in a certain directed graph
to be called a type I graph.
The vertices of the type I graph are the points of the lattice
{1, ..., |s|} x {1, ..., |t|}
plus {(0,0), (|s|+1, |t|+1)}.
The latter two points function as source and sink, i.e. from (0,0) there are arcs to all points (i,j) with
The weights for the "starting" and "ending" edges depend on whether or not end-gaps are penalized. If they are, an edge from (0,0) to (i,j) is given w(i,j) - G((0,0),(i,j)) and an arc leading to
Under these conventions there is a one-to-one mapping between a type I
alignment and a path in a type I graph. The path corresponding to a given
alignment is made up of the vertices with the same
indices as the residue pairs
in the alignment. The source and sink as first
and last elements of the path are not reflected in the alignment.
With the above definitions the score of the alignment is the score of
the corresponding path. Therefore a standard algorithm to find an optimal path
in a directed acyclic graph with weights attached to the edges
when applied to the type I graph will provide also an optimal type I alignment.
We will formulate the corresponding algorithm for the case of an alignment
score that does not attribute a penalty to end-gaps.
proceeding first along the first row, then the second etc. The maximal element in the last row and column is the score of the alignment. An alignment corresponding to that score can be recovered by backtracking from this point to the one that gave rise to the maximal value and proceeding that way until the first row or column is reached. Proof: We use the algorithm to find an optimal path in a directed graph that was introduced above. Let the vertices vi of the graph be numbered such that an edge always points to a higher numbered vertex. A length lij is associated to an edge pointing from vi to vj. Let l(vi) denote the length of the optimal path from v1 to vi. Then
l(vi) = max (l(vj) + lij),
where the maximum is taken over all j such that vj is a predecessor of vi. This can be translated into the setting of the type I graph. Let the nodes be ordered (1,1), (1,2), ..., (1, |t|), (2,1),..., (|s|, |t|). The predecessors to a node (i,j) are all (i-k,j-1), (i-1,j-1) and (i-1,j-k). For the nodes (i,j) with
When this computation is done for all vertices of the graph in the order defined above the optimal scores are sequentially assigned until, when the last node is reached, the final score is known. To also calculate an optimal path (not necessarily the only one) with its equivalent alignment another array of pointers has to be kept. There at every assignment of the maximal value to l(i,j) the cell that gave rise to this maximum is stored. Backtracking from a point (i,j) along the path specified leads to the alignment that had given rise to the score in the point (i,j). As we are not penalizing end-gaps the edges leading into the sink are irrelevant. Therefore backtracking starts from the maximal element in the margin ((i, |t|) or (|s|, j)) of the array. q.e.d. This algorithm corresponds to the one proposed by Needleman and Wunsch [NeW70]. The forward part of this algorithm visits each cell of the array once. In each of these steps approximately i+j values are checked, which will certainly be less than |s| + |t| operations. Therefore when n is the longer of the two sequences the calculation of the alignment score is of time complexity O(n3). The backtracking can at most involve |s| + |t| steps.
Comments are very welcome. luz@molgen.mpg.de |