# Pairwise sequence comparison

## Dot plots

Dot plots are probably the oldest way of comparing sequences (Maizel and Lenk). A dot plot is a visual representation of the similarities between two sequences. Each axis of a rectangular array represents one of the two sequences to be compared. A window length is fixed, together with a criterion for when two sequence windows are deemed similar. Whenever a window in one sequence resembles a window in the other sequence, a dot or short diagonal is drawn at the corresponding position of the array. Thus, when two sequences share similarity over their entire length, a diagonal line extends from one corner of the dot plot to the diagonally opposite corner. If two sequences share only patches of similarity, these show up as shorter diagonal stretches.
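The procedure can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any particular program: the function names `dot_plot` and `render`, the default window length of 4, and the criterion "at least `min_matches` identical positions" are all assumptions chosen for the example.

```python
def dot_plot(seq_a, seq_b, window=4, min_matches=4):
    """Return the set of (i, j) where the windows starting at seq_a[i] and
    seq_b[j] have at least `min_matches` identical positions."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(a == b for a, b in zip(seq_a[i:i + window],
                                                 seq_b[j:j + window]))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots, window=4):
    """Draw the dots as text: seq_a along the horizontal axis, seq_b vertical."""
    cols = len(seq_a) - window + 1
    rows = len(seq_b) - window + 1
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for i in range(cols))
        for j in range(rows)
    )


# Comparing a short toy sequence with itself gives an unbroken main diagonal.
dots = dot_plot("ACGTTGCA", "ACGTTGCA")
print(render("ACGTTGCA", "ACGTTGCA", dots))
```

Lowering `min_matches` below `window` tolerates mismatches inside a window, at the price of more background dots.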

Figure 1 shows an example of a dot plot, in which the alpha chain of human hemoglobin is compared to the beta chain of human hemoglobin. For this computation, the window length was set to 31, and matches and mismatches were assigned similarity values of +5 and -4, respectively. The grey values of the dots scale with the similarity of the two windows. One can clearly discern a diagonal trace along the entire length of the two sequences. Note the places where this trace jumps to another diagonal of the array. These jumps correspond to positions where one sequence has more (or fewer) letters than the other.

Figure 1: Dot plot of two coding DNA sequences: the alpha chain of human hemoglobin is assigned to the horizontal axis and the beta chain of human hemoglobin to the vertical axis (see also text). The dot plot was created with dotter.
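To make the scoring used for Figure 1 concrete, the sketch below scores every pair of windows with +5 per match and -4 per mismatch. With a window length of 31, the scores therefore range from -124 to +155. The function names and the linear mapping from score to grey level are assumptions made for this illustration; how dotter itself maps scores to grey values is not described here.

```python
def window_scores(seq_a, seq_b, window=31, match=5, mismatch=-4):
    """Score every pair of windows; scores[j][i] belongs to the window
    starting at seq_a[i] (horizontal) and seq_b[j] (vertical)."""
    scores = []
    for j in range(len(seq_b) - window + 1):
        row = []
        for i in range(len(seq_a) - window + 1):
            row.append(sum(match if a == b else mismatch
                           for a, b in zip(seq_a[i:i + window],
                                           seq_b[j:j + window])))
        scores.append(row)
    return scores


def grey_level(score, window=31, match=5, mismatch=-4):
    """Map a window score linearly onto 0.0 (worst) .. 1.0 (best).
    The linear mapping is an assumption made for this sketch."""
    lowest, highest = window * mismatch, window * match
    return (score - lowest) / (highest - lowest)


# Two matches and one mismatch in a window of length 3: 5 + 5 - 4.
print(window_scores("ACG", "ACT", window=3))
```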

Dot plots are a very powerful method of comparing two sequences. They do not predispose the analysis in any way, which makes them an ideal first-pass analysis method. Based on the dot plot, the user can decide whether the case at hand is one of global, i.e. beginning-to-end, similarity or of local similarity. Local similarity denotes the existence of similar regions that are embedded in sequences which otherwise lack similarity. Sequences may also contain regions of self-similarity, frequently termed internal repeats. A dot plot comparison of a sequence with itself reveals internal repeats as several parallel diagonals.
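The effect of internal repeats can be checked without drawing anything. The sketch below compares a sequence with itself and counts exact window matches per diagonal, identified by its offset from the main diagonal. The function name `self_diagonals`, the exact-match criterion, and the toy sequence are assumptions for illustration only.

```python
from collections import Counter


def self_diagonals(seq, window=7):
    """Count exact window matches of `seq` against itself, grouped by
    diagonal offset k = j - i (k > 0; the main diagonal k = 0 is skipped)."""
    starts = range(len(seq) - window + 1)
    counts = Counter()
    for i in starts:
        for j in starts:
            if j > i and seq[i:i + window] == seq[j:j + window]:
                counts[j - i] += 1
    return dict(counts)


# Hypothetical toy sequence: the unit "GATTACA" repeated three times.
print(self_diagonals("GATTACA" * 3))
```

For this toy sequence the matches fall on the diagonals at offsets 7 and 14, i.e. one parallel diagonal per additional copy of the repeat unit. Only the upper half is reported because a self-comparison is symmetric about the main diagonal.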

exercise 1

Instead of simply deciding whether two windows are similar, a quality function may be defined. In the simplest case, this could be the number of matches in the window. For amino acid sequences, the physical relatedness between amino acids may give rise to a quantification of the similarity of two windows. For example, when a similarity matrix on the amino acids (like the Dayhoff matrix, see below) is used, one might sum up its values along the window. However, when this similarity matrix contains unequal values for exact matches, exactly matching windows receive different quality scores. The dot plot methods of Argos and Patthy are intricate designs that reflect the physical relatedness of amino acids. The program dotter, which can be downloaded from the EBI ftp server, is an X-windows based program that displays dot plots for DNA, for proteins, and for the comparison of DNA to protein.
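A quality function that sums matrix values along the window might look like the sketch below. The values in `TOY_MATRIX` are invented for this example and are not taken from the Dayhoff matrix or any other published matrix; the function names are likewise hypothetical. The example shows how unequal values for exact matches give two perfectly matching windows different qualities.

```python
# Toy similarity values, invented for this sketch (NOT the Dayhoff matrix).
TOY_MATRIX = {
    ("A", "A"): 2,
    ("W", "W"): 10,
    ("A", "W"): -5,
}


def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up a symmetric similarity value for residues a and b."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def window_quality(win_a, win_b, matrix=TOY_MATRIX):
    """Sum the similarity values along two windows of equal length."""
    if len(win_a) != len(win_b):
        raise ValueError("windows must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(win_a, win_b))


print(window_quality("AAA", "AAA"))  # 6: exact match of a low-scoring residue
print(window_quality("WWW", "WWW"))  # 30: exact match of a high-scoring residue
print(window_quality("AWA", "AAW"))  # 2 - 5 - 5 = -8
```

One consequence is that a fixed score threshold treats identical windows differently depending on their composition.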