Online Lectures on Bioinformatics

Algorithms for the comparison of two sequences
Type I alignment
Let
be a finite alphabet and
and
be a finite word (a socalled sequence) over this alphabet.
Definition: Given , the socalled gap function, and the weight function on the residue pairs, the score s of a type I alignment is defined as with When an alignment does not begin in (1,1) and end in (s, t), there are gaps at the left and/or right end of the alignment. Mathematically there is no reason to treat these any different from the gaps in the interior of the alignment. It is, however, often the case that one of the two sequences compared is known to correspond only to a part of the other one. In such a case attributing a penalty to these endgaps is not intended. The above definition of alignment score leaves endgaps without penalty. Alternatively one can define the score in such a way that gaps at the ends of the sequences are penalized just like any other one: A typical gap function ("gap penalty") would, e.g., be g(n) = a + b(n2) with a > b. The function w normally is a matrix like the Dayhoffmatrix [DBH83] mentioned above.
The biological task of aligning two sequences can now be stated as the problem
of finding the maximal score that an alignment can achieve as well as
identifying at least one alignment where this score is assumed.
To achieve this,
a type I alignment is represented as a path in a certain directed graph
to be called a type I graph.
The vertices of the type I graph are the points of the lattice
{1, ..., s} x {1, ..., t}
plus {(0,0), (s+1, t+1)}.
The latter two points function as source and sink, i.e. from (0,0) there are arcs to all points (i,j) with
and from any point
at the lower or right margin of the grid there is an arc to the sink (s+1, t+1).
In general arcs point from a node to any other one that could follow in an
alignment, that
is from (i,j) to (i+k,j+1), (i+1,j+1) and (i+1,j+k) (where k > 1 and
neither i+k nor j+k exceed s and t
and
respectively).
With the exception of the arcs from the source or into the sink, weights
e((x_{i},y_{i}),(x_{j},y_{j}))
are assigned to an edge by the rule:
The weights for the "starting" and "ending" edges depend on whether or not endgaps are penalized. If they are, an edge from (0,0) to (i,j) is given w(i,j)  G((0,0),(i,j)) and an arc leading to is given . If endgaps should be free, then the first kind will receive w(i,j) and the ones into the sink will be attributed 0.
Under these conventions there is a onetoone mapping between a type I
alignment and a path in a type I graph. The path corresponding to a given
alignment is made up of the vertices with the same
indices as the residue pairs
in the alignment. The source and sink as first
and last elements of the path are not reflected in the alignment.
With the above definitions the score of the alignment is the score of
the corresponding path. Therefore a standard algorithm to find an optimal path
in a directed acyclic graph with weights attached to the edges
when applied to the type I graph will provide also an optimal type I alignment.
We will formulate the corresponding algorithm for the case of an alignment
score that does not attribute a penalty to endgaps.
proceeding first along the first row, then the second etc. The maximal element in the last row and column is the score of the alignment. An alignment corresponding to that score can be recovered by backtracking from this point to the one that gave rise to the maximal value and proceeding that way until the first row or column is reached. Proof: We use the algorithm to find an optimal path in a directed graph that was introduced above. Let the vertices v_{i} of the graph be numbered such that an edge always points to a higher numbered vertex. A length l_{ij} is associated to an edge pointing from v_{i} to v_{j}. Let l(v_{i}) denote the length of the optimal path from v_{1} to v_{i}. Then
l(v_{i}) = max (l(v_{j}) + l_{ij}),
where the maximum is taken over all j such that v_{j} is a predecessor of v_{i}. This can be translated into the setting of the type I graph. Let the nodes be ordered (1,1), (1,2), ..., (1, t), (2,1),..., (s, t). The predecessors to a node (i,j) are all (ik,j1), (i1,j1) and (i1,jk). For the nodes (i,j) with we define l(i,j) = w(s_{xi},t_{yj}) to reflect the fact that end gaps are not penalized. Then for When this computation is done for all vertices of the graph in the order defined above the optimal scores are sequentially assigned until, when the last node is reached, the final score is known. To also calculate an optimal path (not necessarily the only one) with its equivalent alignment another array of pointers has to be kept. There at every assignment of the maximal value to l(i,j) the cell that gave rise to this maximum is stored. Backtracking from a point (i,j) along the path specified leads to the alignment that had given rise to the score in the point (i,j). As we are not penalizing endgaps the edges leading into the sink are irrelevant. Therefore backtracking starts from the maximal element in the margin ((i, t) or (s, j)) of the array. q.e.d. This algorithm corresponds to the one proposed by Needleman and Wunsch [NeW70]. The forward part of this algorithm visits each cell of the array once. In each of these steps approximately i+j values are checked, which will certainly be less than s + t operations. Therefore when n is the longer of the two sequences the calculation of the alignment score is of time complexity O(n^{3}). The backtracking can at most involve s + t steps.
Comments are very welcome. luz@molgen.mpg.de 