previous section previous page next page next section
CMB

Online Lectures on Bioinformatics

navigation


Algorithms for the comparison of two sequences


General gap functions for type II alignments

Biologically it is important to penalize a long gap not as the sum of single gaps because it is likely to be due to one evolutionary event. The above algorithm for type II alignments falls short of doing so. In fact the paradigm of the type II graph breaks down when gap functions are introduced. First the set of edges would have to be extended to include long horizontal or vertical edges (from (i,j) to (i,j+k) and to (i+k,j)). The attribution of weights to these edges, however, is ambiguous as it depends on the preceding edge in a path. If, e.g., a vertical edge was preceded by another vertical edge, this corresponds to a longer gap, the weight of which is not necessarily the sum of the two edges.

In [WSB76] it is shown that in spite of this complication an algorithm similar to the one for single indels exists to calculate an optimal type II alignment with more general gap functions. They proved that the following algorithm using a gap function g with produces the optimal alignment ([WSB76], [Wat84b]):
Algorithm A-II:


Without having to give the exact (rather lengthy) proof for this algorithm here, there still is a way to motivate the above formula for a more restricted class of biologically reasonable gap functions. When assessing the score of an alignment a gap will always be seen as one entity of maximal size and not as two (or more) adjacent gaps of smaller size. The penalty for the gap should therefore never exceed the sum of the penalties given to its parts. Mathematically this leads to the requirement of subadditivity:


This condition implies that the maximal vertical or horizontal edge out of those that make up a certain gap is the cheapest one and therefore will be chosen by the optimization procedure. The problem of ambiguities in the assignment of weights to the edges in the graph is hereby circumvented. For a gap function of the form


subadditivity is certainly fulfilled as


Such a gap function will be called linear. Waterman [Wat84a] introduces also subadditive gap functions.

The explicit algorithm for linear gap functions can now be written in the following form ([Got82], [Wat84a]):


with


This algorithm then involves inspection of a row and column up to the current point for every matrix entry. Gotoh [Got82] showed that this step can be abbreviated to comparison of one new and one old value per row/column. This is due to the fact that the penalties for the gaps of various length all change by the same additive constant when the algorithm proceeds from cell (i,j) to (i+1,j) or (i,j+1).


We can therefore formulate the algorithm to calculate a type II alignment under linear gap penalties as follows.
Algorithm A-II-1:


where


Checking only 5 values at every matrix entry this algorithm has a time complexity of O(n2) just like algorithm A-II-s. However, it has the great advantage of allowing for the more useful linear gap penalty functions. This has made it the most commonly used form of sequence alignment algorithm in molecular biology. The same improvement that led to the quadratic complexity here can also be applied to algorithm A-I when linear gap penalties are used.


Alignment Quiz: 2.
Alignment Quiz: 2.


Dynamic Programming in C
Dynamic Programming in C



Comments are very welcome.
luz@molgen.mpg.de