In the current context we can only give an extremely brief introduction
to the basic notions of molecular biology. An overview can be found in
any modern textbook on biology, biochemistry or molecular biology
(e.g. [ABL89], [Str88]). [Goa86] is a short review of computational
methods in biological sequence analysis, and several books summarizing
problems and methods have recently been published ([Doo86], [Hei87], [Les88]).
DNA (deoxyribonucleic acid) and proteins are biological macromolecules
built as long linear chains of chemical components.
In the case of DNA these components are the so-called nucleotides,
of which there are four different ones, each denoted by one
of the letters A, C, G and T.
Proteins are made up of 20 different amino acids (or "residues") which
are denoted by 20 different letters of the alphabet.
[Omitted table or figure: the twenty amino acids]
DNA plays a fundamental role in the processes of life in two respects. First, it contains the templates for the synthesis of proteins, which are essential molecules for any organism. Although they are grouped under a single name, proteins are a widely varied class of molecules. What they have in common are their building blocks, the amino acids. Each of the 20 amino acids is coded for by one or more triplets of the nucleotides making up DNA; these triplets are also called codons. The end of a chain is coded for by a separate set of codons (the stop codons). Based on this translation table (the genetic code), the linear string of DNA is translated into a linear string of amino acids, i.e. a protein. Here is an example:
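The translation step can be illustrated with a minimal Python sketch. It assumes the standard genetic code, written in the usual compact form with the bases ordered T, C, A, G, and it reads the coding strand codon by codon until a stop codon is reached. The names `CODON_TABLE` and `translate` and the DNA fragment in the usage line are hypothetical and serve only as an illustration; the fragment was chosen so that it encodes the residues VLSPADK.

```python
from itertools import product

# Standard genetic code in compact form: codons enumerated with the
# bases in the order T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA string into one-letter amino acid codes.

    Reading starts at position 0 and stops at the first stop codon or
    when fewer than three nucleotides remain.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Illustrative fragment (an assumption, not taken from the text).
print(translate("GTGCTGTCTCCTGCCGACAAGTAA"))  # -> VLSPADK
```

Several codons map to the same amino acid in this table (for example, four codons for V), which is the redundancy referred to above as "one or more triplets".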
The amino acid sequence of a protein, also called its primary structure,
is only one level at which it can be looked at.
To fulfill its natural role a protein assumes a certain
three-dimensional structure, which is referred to as its tertiary structure.
The term "secondary structure" refers to the local folding of the
amino acid chain into small regular elements. The major classes of secondary
structure are called beta strands and alpha helices.
The three-dimensional (tertiary) structure of a protein is usually built up of
elements of alpha and/or beta structure together with loop regions in between
them. It is the three-dimensional folding of the chain which determines
the biological function of a protein.
The second role in which DNA is essential to life is as a medium to transmit
information (namely the building plans for proteins)
from generation to generation. In 1953 Watson and Crick discovered the
double-helical structure of DNA. The linear chain does not really occur on its
own but is paired with a complementary strand. The complementarity
stems from the ability of the nucleotides to establish specific pairs (A-T and
G-C). The pair of complementary strands then forms the famous double helix.
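Because of the specific pairing, one strand determines the other. A minimal Python sketch of this relationship follows; the function names are hypothetical. The second function additionally reverses the result, reflecting the fact that the two strands of the double helix run in opposite directions, so the partner strand is conventionally written reversed.

```python
# Pairing rules from the text: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the position-by-position pairing partner of a strand."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own reading direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # -> TACG
print(reverse_complement("ATGC"))  # -> GCAT
```

Applying `complement` twice returns the original strand, which is the sense in which either strand alone suffices to reconstruct the pair.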
Each strand therefore carries the entire information and the biochemical
machinery guarantees that the information can be
copied over and over again even when the "original" molecule has long since
ceased to exist. During this process of copying, changes (known as mutations)
are introduced into the DNA sequence.
The kinds of mutations which are important to sequence comparison
are base changes, insertions of nucleotides into the chain and deletions
from the chain. The elementary operations allowed in the definition of
sequence similarity are chosen to correspond to these events.
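To make the correspondence concrete, the following Python sketch counts the minimum number of base changes, insertions and deletions needed to turn one sequence into another. It is only an illustration under a strong assumption: every operation is given the same cost of 1, whereas the text does not fix any particular scoring. The function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of unit-cost edit operations turning a into b.

    Allowed operations: change one letter, insert one letter,
    delete one letter (dynamic programming, one row at a time).
    """
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, letter_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string: i deletions
        for j, letter_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                         # delete letter_a
                curr[j - 1] + 1,                     # insert letter_b
                prev[j - 1] + (letter_a != letter_b),  # keep or change
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGGT"))  # one base change -> 1
print(edit_distance("ACGT", "ACT"))   # one deletion    -> 1
```

With other cost assignments the same recurrence still applies; only the three added terms change.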
To visualize the relationship between two similar sequences, they are
represented in the form of an alignment:
V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DL HAHU
VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDL HBHU

SH-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV HAHU
STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHV HBHU

DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR HAHU
DPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH HBHU
The two amino acid sequences compared here are the alpha chain of human
hemoglobin (abbreviated HAHU) and its beta chain (HBHU). Since the sequences
are approximately 150 amino acids long, the alignment is split into blocks:
each block of lines contains part of the first sequence in the upper line and
the corresponding part of the second sequence in the lower line. Residues on
top of each other in one block are aligned with each other.
Some residues are conserved (the amino acids in the column are identical),
some have been exchanged, and parts of the chain have been deleted from
one sequence or (equivalently) inserted into the other. Insertions or deletions
are indicated by a letter paired with a dash, the gap character.
An alignment can also be interpreted as representing the
operations necessary to transform one sequence into the other, using
the same operations as evolution does.
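This reading of an alignment can be sketched in Python: each column is turned into one operation, and applying the operations in order reproduces the second sequence. The function names and the operation labels ("keep", "exchange", "insert", "delete") are hypothetical choices made for this illustration.

```python
def alignment_operations(top: str, bottom: str, gap: str = "-"):
    """Read an alignment column by column as operations on the top sequence.

    Each operation is a tuple (name, top_letter, bottom_letter).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    operations = []
    for x, y in zip(top, bottom):
        if x == gap and y == gap:
            raise ValueError("a column may not consist of two gaps")
        if x == gap:
            operations.append(("insert", x, y))    # letter only in bottom
        elif y == gap:
            operations.append(("delete", x, y))    # letter only in top
        elif x == y:
            operations.append(("keep", x, y))      # conserved residue
        else:
            operations.append(("exchange", x, y))  # residue replaced
    return operations


def apply_operations(operations) -> str:
    """Carry out the operations; the result is the ungapped bottom row."""
    return "".join(y for name, _, y in operations if name != "delete")


ops = alignment_operations("V-LS", "VHLT")
print(ops)
# -> [('keep', 'V', 'V'), ('insert', '-', 'H'),
#     ('keep', 'L', 'L'), ('exchange', 'S', 'T')]
print(apply_operations(ops))  # -> VHLT
```

Swapping the two rows turns every insertion into a deletion and vice versa, which is why the two are equivalent descriptions of the same gap.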
There are of course limits to the changes a sequence can accommodate. These limits stem from the fact that the protein a piece of DNA codes for still has to fulfill its biological function, which in general means that its tertiary structure must not be altered. It is well established that sequences which agree in more than 50 % of their positions are likely to share the same overall fold [ChL86]. On the other hand, not every set of changes of that extent will preserve a three-dimensional structure. The changes that occur naturally have been selected for the functioning of the corresponding proteins, and among these there are known cases in which the sequence similarity between two proteins sharing the same overall fold is barely visible. Identifying pairs of sequences which, owing to their sequence similarity, are likely candidates for having the same tertiary structure is the other major motivation for sequence alignment.
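The notion of two sequences agreeing in a certain percentage of positions can be made concrete with a short Python sketch. It assumes one particular convention, namely that the percentage is taken over all alignment columns including those with gaps; other conventions (for example, counting only columns without gaps) are equally common and give different values. The function name `percent_identity` is hypothetical, and the two strings are the rows of the HAHU/HBHU alignment with the blocks joined.

```python
def percent_identity(top: str, bottom: str, gap: str = "-") -> float:
    """Percentage of alignment columns holding two identical residues.

    Assumption: the denominator is the total number of columns,
    gap columns included.
    """
    if len(top) != len(bottom) or not top:
        raise ValueError("aligned rows must be non-empty and of equal length")
    identical = sum(1 for x, y in zip(top, bottom) if x == y and x != gap)
    return 100.0 * identical / len(top)


# Rows of the hemoglobin alignment, blocks concatenated.
HAHU = (
    "V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DL"
    "SH-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV"
    "DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
)
HBHU = (
    "VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDL"
    "STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHV"
    "DPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH"
)

print(f"{percent_identity(HAHU, HBHU):.1f} % identical columns")
```

A value computed this way depends on the alignment as well as on the sequences, so it is best read as a property of the pair together with the chosen alignment.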