previous section previous page next page next section

Online Lectures on Bioinformatics


Algorithms for the comparison of two sequences

Type I alignment

Let be a finite alphabet and and be a finite word (a so-called sequence) over this alphabet.
An ordered set of index-pairs with and is a type I alignment if

Given , the so-called gap function, and the weight function on the residue pairs, the score s of a type I alignment is defined as


When an alignment does not begin in (1,1) and end in (|s|, |t|), there are gaps at the left and/or right end of the alignment. Mathematically there is no reason to treat these any different from the gaps in the interior of the alignment. It is, however, often the case that one of the two sequences compared is known to correspond only to a part of the other one. In such a case attributing a penalty to these end-gaps is not intended. The above definition of alignment score leaves end-gaps without penalty. Alternatively one can define the score in such a way that gaps at the ends of the sequences are penalized just like any other one:

A typical gap function ("gap penalty") would, e.g., be g(n) = a + b(n-2) with a > b. The function w normally is a matrix like the Dayhoff-matrix [DBH83] mentioned above.

The biological task of aligning two sequences can now be stated as the problem of finding the maximal score that an alignment can achieve as well as identifying at least one alignment where this score is assumed. To achieve this, a type I alignment is represented as a path in a certain directed graph to be called a type I graph. The vertices of the type I graph are the points of the lattice {1, ..., |s|} x {1, ..., |t|} plus {(0,0), (|s|+1, |t|+1)}. The latter two points function as source and sink, i.e. from (0,0) there are arcs to all points (i,j) with and from any point at the lower or right margin of the grid there is an arc to the sink (|s+1|, |t+1|). In general arcs point from a node to any other one that could follow in an alignment, that is from (i,j) to (i+k,j+1), (i+1,j+1) and (i+1,j+k) (where k > 1 and neither i+k nor j+k exceed |s| and |t| and respectively). With the exception of the arcs from the source or into the sink, weights e((xi,yi),(xj,yj)) are assigned to an edge by the rule:

The weights for the "starting" and "ending" edges depend on whether or not end-gaps are penalized. If they are, an edge from (0,0) to (i,j) is given w(i,j) - G((0,0),(i,j)) and an arc leading to is given . If end-gaps should be free, then the first kind will receive w(i,j) and the ones into the sink will be attributed 0.

Under these conventions there is a one-to-one mapping between a type I alignment and a path in a type I graph. The path corresponding to a given alignment is made up of the vertices with the same indices as the residue pairs in the alignment. The source and sink as first and last elements of the path are not reflected in the alignment. With the above definitions the score of the alignment is the score of the corresponding path. Therefore a standard algorithm to find an optimal path in a directed acyclic graph with weights attached to the edges when applied to the type I graph will provide also an optimal type I alignment. We will formulate the corresponding algorithm for the case of an alignment score that does not attribute a penalty to end-gaps.

Theorem: (Algorithm A-I)
The score of the optimal type I alignment between two sequences s and t can be calculated by successively computing the entries of an array , L(1,j) = w(1,j) and L(i,1) = w(i,1) under the rule

proceeding first along the first row, then the second etc. The maximal element in the last row and column is the score of the alignment. An alignment corresponding to that score can be recovered by backtracking from this point to the one that gave rise to the maximal value and proceeding that way until the first row or column is reached.
We use the algorithm to find an optimal path in a directed graph that was introduced above. Let the vertices vi of the graph be numbered such that an edge always points to a higher numbered vertex. A length lij is associated to an edge pointing from vi to vj. Let l(vi) denote the length of the optimal path from v1 to vi. Then

l(vi) = max (l(vj) + lij),

where the maximum is taken over all j such that vj is a predecessor of vi. This can be translated into the setting of the type I graph. Let the nodes be ordered (1,1), (1,2), ..., (1, |t|), (2,1),..., (|s|, |t|). The predecessors to a node (i,j) are all (i-k,j-1), (i-1,j-1) and (i-1,j-k). For the nodes (i,j) with we define l(i,j) = w(sxi,tyj) to reflect the fact that end gaps are not penalized. Then for

When this computation is done for all vertices of the graph in the order defined above the optimal scores are sequentially assigned until, when the last node is reached, the final score is known. To also calculate an optimal path (not necessarily the only one) with its equivalent alignment another array of pointers has to be kept. There at every assignment of the maximal value to l(i,j) the cell that gave rise to this maximum is stored. Backtracking from a point (i,j) along the path specified leads to the alignment that had given rise to the score in the point (i,j). As we are not penalizing end-gaps the edges leading into the sink are irrelevant. Therefore backtracking starts from the maximal element in the margin ((i, |t|) or (|s|, j)) of the array. q.e.d.
This algorithm corresponds to the one proposed by Needleman and Wunsch [NeW70].

The forward part of this algorithm visits each cell of the array once. In each of these steps approximately i+j values are checked, which will certainly be less than |s| + |t| operations. Therefore when n is the longer of the two sequences the calculation of the alignment score is of time complexity O(n3). The backtracking can at most involve |s| + |t| steps.

Comments are very welcome.