
Online Lectures on Bioinformatics


Biological preliminaries


In the current context we can give only an extremely brief introduction to the basic notions of molecular biology. An overview can be found in any modern textbook on biology, biochemistry or molecular biology (e.g. [ABL89], [Str88]). [Goa86] gives a short review of computational methods in biological sequence analysis, and several books summarizing problems and methods have been published recently ([Doo86], [Hei87], [Les88]).

DNA (deoxyribonucleic acid) and proteins are biological macromolecules built as long linear chains of chemical components. In the case of DNA these components are the so-called nucleotides, of which there are four different kinds, each denoted by one of the letters A, C, G and T. Proteins are made up of 20 different amino acids (or "residues"), which are denoted by 20 different letters of the alphabet.

Table 1.1
The nucleotides

DNA   adenine   guanine   cytosine   thymine
RNA   adenine   guanine   cytosine   uracil

Table 1.2
The twenty amino acids

No.   One-letter code   Three-letter code   Name
 1    A                 Ala                 Alanine
 2    C                 Cys                 Cysteine
 3    D                 Asp                 Aspartic acid
 4    E                 Glu                 Glutamic acid
 5    F                 Phe                 Phenylalanine
 6    G                 Gly                 Glycine
 7    H                 His                 Histidine
 8    I                 Ile                 Isoleucine
 9    K                 Lys                 Lysine
10    L                 Leu                 Leucine
11    M                 Met                 Methionine
12    N                 Asn                 Asparagine
13    P                 Pro                 Proline
14    Q                 Gln                 Glutamine
15    R                 Arg                 Arginine
16    S                 Ser                 Serine
17    T                 Thr                 Threonine
18    V                 Val                 Valine
19    W                 Trp                 Tryptophan
20    Y                 Tyr                 Tyrosine

DNA plays a fundamental role in the processes of life in two respects. First, it contains the templates for the synthesis of proteins, which are essential molecules for any organism. Although they are grouped under one name, proteins come in a wide variety. What they have in common are their building blocks, the amino acids. Each of the 20 amino acids is coded for by one or more triplets (also called codons) of the nucleotides making up DNA. The end of a chain is coded for by a separate set of triplets, the stop codons. Based on this translation table (the genetic code), the linear string of DNA is translated into a linear string of amino acids, i.e. a protein. For example, the triplet ATG codes for methionine, whereas TAA, TAG and TGA each signal the end of the chain.
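
The translation step described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses the standard genetic code, starts reading at the first nucleotide, and ignores everything a real cell does around translation. The names `CODON_TABLE` and `translate` and the input sequence are made up for this sketch.

```python
# Standard genetic code in compact form: the 64 codons are enumerated with
# the bases in the order T, C, A, G at each of the three positions, and
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet to its one-letter amino acid code (hypothetical helper).
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Reading starts at the first nucleotide and ends at the first stop
    codon, or when fewer than three nucleotides remain.
    """
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up input: ATG (Met), GCT (Ala), TGG (Trp), TAA (stop).
print(translate("ATGGCTTGGTAA"))  # prints MAW
```

Several codons map to the same amino acid, which is why the table has 64 entries for only 20 amino acids plus the stop signal.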


The amino acid sequence of a protein, also called its primary structure, is only one level at which the protein can be described. To fulfill its natural role a protein assumes a certain three-dimensional structure, which is referred to as its tertiary structure. The term "secondary structure" refers to the local folding of the amino acid chain into small regular elements. The major classes of secondary structure are called beta strands and alpha helices. The three-dimensional (tertiary) structure of a protein is usually built up of elements of alpha and/or beta structure together with loop regions in between them. It is the three-dimensional folding of the chain that determines the biological function of a protein.


The second role in which DNA is essential to life is as a medium to transmit information (namely the building plans for proteins) from generation to generation. In 1953 Watson and Crick discovered the double-helical structure of DNA. The linear chain does not usually occur on its own but is paired with a complementary strand. The complementarity stems from the ability of the nucleotides to form specific pairs (A-T and G-C). The pair of complementary strands then forms the famous double helix. Each strand therefore carries the entire information, and the biochemical machinery guarantees that the information can be copied over and over again even when the "original" molecule has long since vanished.
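
The pairing rule (A-T and G-C) means that one strand determines the other, which a short Python sketch can make concrete. The function names are hypothetical. The reversal in `reverse_complement` reflects the usual convention that the two strands run in opposite directions and that each is written in its own reading direction; this detail is not discussed in the text above and is added here as an assumption.

```python
# Translation table implementing the pairing rule A-T and G-C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Return the partner nucleotide for every position of the strand."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand):
    """Return the complementary strand written in its own reading direction."""
    return complement(strand)[::-1]


strand = "ATGC"  # made-up example sequence
print(complement(strand))          # prints TACG
print(reverse_complement(strand))  # prints GCAT
```

Applying `reverse_complement` twice gives back the original string, which mirrors the statement that each strand carries the entire information.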

During this process of copying, changes (known as mutations) are introduced into the DNA sequence. The kinds of mutations which are important to sequence comparison are base changes, insertions of nucleotides into the chain and deletions from the chain. The elementary operations allowed in the definition of sequence similarity are chosen to correspond to these events. To visualize the relationship between two similar sequences they are represented in the form of an alignment:




The two amino acid sequences compared here are the alpha chain of human hemoglobin (abbreviated HAHU) and its beta chain (HBHU). Since the sequences are approximately 150 amino acids long, the alignment is split into blocks, each of which shows part of the first sequence in the upper line and part of the second sequence in the lower line. Residues on top of each other in one block are regarded as equivalent. Some residues are conserved (the amino acids in the column are identical), some have been exchanged, and part of the chain has been deleted from one sequence or (equivalently) inserted in the other. Insertions or deletions are indicated by a letter paired with a dash, the gap character. An alignment can also be interpreted as representing the operations necessary to transform one sequence into another using the same operations as evolution does.
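
Reading an alignment as a list of operations can be sketched as follows. The two aligned strings are short made-up examples, not the hemoglobin sequences, and the labels and the function name `classify_columns` are assumptions chosen for this illustration. Insertions and deletions are named from the point of view of transforming the upper sequence into the lower one.

```python
from collections import Counter

GAP = "-"  # the gap character


def classify_columns(upper, lower):
    """Label each alignment column with the operation it represents."""
    if len(upper) != len(lower):
        raise ValueError("aligned sequences must have the same length")
    operations = []
    for a, b in zip(upper, lower):
        if a == GAP and b == GAP:
            raise ValueError("a column must not consist of two gaps")
        if a == GAP:
            operations.append("insertion")  # residue present only in lower
        elif b == GAP:
            operations.append("deletion")   # residue present only in upper
        elif a == b:
            operations.append("conserved")
        else:
            operations.append("exchange")
    return operations


# Made-up aligned pair, written one above the other as in an alignment.
upper = "HEAG-WL"
lower = "HE-GKYL"
print(Counter(classify_columns(upper, lower)))
```

For this toy pair the columns comprise four conserved residues, one exchange (W/Y), one deletion (A) and one insertion (K).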

There are of course limits to the changes a sequence can accommodate. These limits stem from the fact that the protein a piece of DNA codes for still has to fulfill its biological function, which in general means that its tertiary structure must not be altered. It is now well understood that sequences which agree in more than 50 % of the positions are likely to share the same overall fold [ChL86]. On the other hand, not every set of changes of that extent will preserve a three-dimensional structure. The changes that occur naturally have been selected for the functioning of the corresponding proteins, and among these there are known cases in which the sequence similarity between two proteins sharing the same overall fold is barely visible. Identifying pairs of sequences which, owing to their sequence similarity, are likely candidates for having the same tertiary structure is the other major motivation for sequence alignment.
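
The 50 % criterion mentioned above can be checked for a given alignment with a small Python sketch. The choice of denominator is an assumption of this sketch: here only columns without a gap are counted, whereas other conventions (all columns, or the length of the shorter sequence) give somewhat different values. The function name and the aligned strings are made up.

```python
def fraction_identical(upper, lower, gap="-"):
    """Fraction of gap-free alignment columns holding identical residues."""
    if len(upper) != len(lower):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns in which both sequences contribute a residue.
    pairs = [(a, b) for a, b in zip(upper, lower) if a != gap and b != gap]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return identical / len(pairs)


# Made-up aligned pair: 4 identical residues in 5 gap-free columns.
identity = fraction_identical("HEAG-WL", "HE-GKYL")
print(f"{identity:.0%} identical")  # prints 80% identical
print("above the 50 % level" if identity > 0.5 else "at or below the 50 % level")
```

As the text points out, a value below the threshold does not rule out a common fold; the number is only an indication in one direction.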
