previous section previous page next page next section

Online Lectures on Bioinformatics


Phylogenetic Trees and Multiple Alignments

Approximating additive metrics

Let a distance matrix on the species be given. In practice, the distance matrix between molecular sequences will not be additive. A tree T is searched whose distance matrix approximates the given one. Although the true difficulty lies in finding the best topology one may formulate two alternative formalizations of the approximation task when the topology is given.

Let $ \mathbb{A} $ be the path-edge incidence matrix of the given tree topology where by a path a leaf-to-leaf path is meant. Each row of matrix $ \mathbb{A} $ corresponds to a path between two species and each column corrsponds to an edge. If an edge lies on a certain path the corresponding matrix entry is 1 and otherwise it is 0. The vector of edge-lengths of an edge-weighted tree shall be denoted by a column-vector e.

Here the above example again:


The distance matrix on a set of objects constitutes an additive metric $ \quad \Longleftrightarrow \quad \exists $ a tree with topology $ \mathbb{A} $, such that $ \exists \quad \textbf{e} $ with

\begin{displaymath}\mathbb{A}\ \textbf{e} \ = \ \textbf{d} \ , \end{displaymath}

whereat the distance vector d contains the distances between the objects.

As mentioned above, the distance matrices on biological data generally constitute biased additive metrics. Thus the general equation for an arbitrary distance vector e and for a given tree topology $ \mathbb{A} $ will rarely be solvable. The first reaction would be to approximate the solution in a least squares sense, i.e. to choose edge-lengths e, such that the euclidian distance between d and the tree-like distances $ \mathbb{A}\ \textbf{e} $ is minimized [Fitch-Margoliash]:

\begin{displaymath}\Vert \ \mathbb{A}\ \textbf{e} \ - \ \textbf{d} \ \Vert \ \to \ min \end{displaymath}

Alternatively, Waterman et al. have formulated a linear program. The objective function of the linear program is similar in spirit to the Steiner tree setup: the overall length of the tree is minimized. One constraint is that the edges should have positive lengths. The other constraint is derived from the fact that the alignment distance between two sequences underestimates the number of changes that have occurred between the two sequences during their development. This results in the inequality $ \mathbb{A}\ \textbf{e} \ \geq \textbf{d} $, elementwise, which is the other constraint for the linear program. Solving this linear program then yields an assignment for edge lengths under a given topology. Finding the best topology still requires checking them all.

Comments are very welcome.