Online Lectures on Bioinformatics

Phylogenetic Trees and Multiple AlignmentsApproximating additive metricsLet a distance matrix on the species be given. In practice, the distance matrix between molecular sequences will not be additive. A tree T is searched whose distance matrix approximates the given one. Although the true difficulty lies in finding the best topology one may formulate two alternative formalizations of the approximation task when the topology is given. Let be the pathedge incidence matrix of the given tree topology where by a path a leaftoleaf path is meant. Each row of matrix corresponds to a path between two species and each column corrsponds to an edge. If an edge lies on a certain path the corresponding matrix entry is 1 and otherwise it is 0. The vector of edgelengths of an edgeweighted tree shall be denoted by a columnvector e. Here the above example again:
The distance matrix on a set of objects constitutes
an additive metric
a tree with topology
,
such that
with
whereat the distance vector d contains the distances between the objects.
As mentioned above, the distance matrices on biological data
generally constitute biased additive metrics.
Thus the general equation for an arbitrary distance vector
e
and for a given tree topology
will rarely be solvable.
The first reaction would be to approximate the solution in a least
squares sense, i.e. to choose edgelengths
e, such that the
euclidian distance between
d and the treelike distances
is minimized [FitchMargoliash]:
Alternatively, Waterman et al. have formulated a linear
program. The objective function of the linear program is similar in spirit
to the Steiner tree setup: the overall length of the tree is minimized. One
constraint is that the edges should have positive lengths. The other
constraint is derived from the fact that the alignment distance between two
sequences underestimates the number of changes that have occurred between
the two sequences during their development. This results in the inequality
,
elementwise, which is the other constraint for the linear
program. Solving this linear program then yields an assignment for edge
lengths under a given topology. Finding the best topology still requires
checking them all.
Comments are very welcome. luz@molgen.mpg.de 