# Pairwise sequence comparison

## Sequence Alignment

A sequence alignment is a scheme of writing one sequence on top of another where the residues in one position are deemed to have a common evolutionary origin. If the same letter occurs in both sequences then this position has been conserved in evolution. If the letters differ it is assumed that the two derive from an ancestral letter (which could be one of the two or neither). Homologous sequences may have different length, though, which is generally explained through insertions or deletions in sequences. Thus, a letter or a stretch of letters may be paired up with dashes in the other sequence to signify such an insertion or deletion. Since an insertion in one sequence can always be seen as a deletion in the other one frequently uses the term "indel" [1].

```BANANA-
-ANANAS
```
In such a simple evolutionarily motivated scheme, an alignment mediates the definition of a distance for two sequences. One generally assigns 0 to a match, some negative number to a mismatch and a larger negative number to an indel. By adding these values along an alignment one obtains a score for this alignment. A distance function for two sequences can be defined by looking for the alignment which yields the minimum score. Luckily, using dynamic programing this minimization can be effected without explicitly enumerating all possible alignment of two sequences. The next section(s) will give a formal description of this algorithm. There is also a Java-Applet to visualize the dynamic programming algorithm at this site.

The idea of assigning a score to an alignment and then minimizing over all alignments is at the heart of all biological sequence alignment. However, many more considerations have influenced the definition of the scores and made sequence alignment applicable to a wide range of biological settings. Firstly, note that one may either define a distane or a similarity function to an alignment. The difference lies more in the interpretation of the values. A distance function will define negative values for mismatches or gaps and then aim at minimizing this distance. A similarity function will give high values to matches and low values to gaps and then maximize the resulting score. The basic structure of the algorithm is the same for both cases. In 1981, Smith and Waterman showed that for global alignment, i.e. when a score is computed over the entire length of both sequences, the two concepts are in fact equivalent. Thus, it is now customary to choose the setting that gives more freedom in appropriately modeling the biological setting that one is interested in.

In the similarity framework one can easily distinguish among the different possible mismatches and also among different kinds of matches. For example, a match between two Tryptophanes is usually seen to be more important than a match between two Alanines. For amino acids, scoring matrices have been defined that assign a score to each possible pair of amino acids. A famous such matix is the so-called Dayhoff matrix (Dayhoff, 1978). Other more recent matrices are the BLOSUM series of matrices or Gonnet's matrices. For every matrix one needs to find appropriate penalties for gaps.

The treatment of gaps deserves special care. The famous algorithm by Needleman and Wunsch [2] did not impose any restrictions on the penalty assigned to a gap of a certain length. For reasons of computational speed this was later specified to assigning a cost function linear in the number of deleted (inserted) residues (Sankoff [3], Sellers [4]). This amounts to penalizing every single indel. However, since a single indel tends to be penalized such that it is considerable inferior to a mismatch, this choice resulted in longer gaps being quite expensive and thus unrealistically rare. As a remedy, one mostly uses a gap penalty function which charges a gap open penalty for every gap that is introduced and penalizes the length with a gap extension penalty which is charged for every inserted or deleted letter in that gap. Clearly, this results in an affine linear function in the gap length, frequently written as g(k) = a + b*k.

With the variant of the dynamic programming algorithm first published by Gotoh [5] it became possible to compute optimal alignments with affine linear gap penalties time proportional to the product of the lengths of the two sequences to be aligned. This was a speed-up by one order of magnitude compared to a naive algorithm using this more general gap function. A further breakthrough in alignment algorithms development was an algorithm that could compute an optimal alignment using computer memory only proportional to the length of one sequence instead of their product. This algorithm by Myers and Miller [6] is based on work by Hirschberg.

exercises 2-3