Maximum parsimony
From Wikipedia, the free encyclopedia
Introduction
Maximum parsimony, often simply referred to as "parsimony," is a commonly used, non-parametric statistical method for estimating phylogenies. It is part of a class of character-based tree estimation methods which use a matrix of discrete phylogenetic characters to infer one or more optimal phylogenetic trees for a set of taxa (commonly a set of species or reproductively-isolated populations of a single species). These methods operate by evaluating candidate phylogenetic trees according to an explicit optimality criterion; the tree with the most favorable score is taken as the best estimate of the phylogenetic relationships of the included taxa. Maximum parsimony is used with most kinds of phylogenetic data; until recently, it was the only widely-used character-based tree estimation method used for morphological data.
Estimating phylogenies is a non-trivial problem. A huge number of possible phylogenetic trees exist for any reasonably sized set of taxa (for ten taxa there are already more than two million distinct unrooted bifurcating trees). These must be searched to find a tree that best fits the data according to the optimality criterion. However, the data themselves do not lead to a simple, arithmetic solution to the problem. Ideally, we would expect the distribution of whatever evolutionary characters we examine (phenotypic traits, alleles, etc.) to directly follow the branching pattern of evolution. Thus we could say that if two organisms possess a shared character, they should be more closely related to each other than to a third organism that lacks this character (provided that character was not present in the last common ancestor of all three, in which case it would be a symplesiomorphy). We would predict that bats and monkeys are more closely related to each other than either is to a fish, because they both possess hair (a synapomorphy). However, we cannot say that bats and monkeys are more closely related to one another than they are to whales on the grounds that they share hair, because we believe the last common ancestor of all three had hair (for this group, hair is a symplesiomorphy).
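The size of the search space mentioned above can be made concrete. The following is a minimal sketch, assuming fully bifurcating trees with labeled tips; the function names are illustrative and do not come from any particular phylogenetics package. It uses the standard counting result that n taxa admit (2n-5)!! unrooted and (2n-3)!! rooted bifurcating trees.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating trees for n labeled taxa: (2n-5)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n <= 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def num_rooted_trees(n_taxa: int) -> int:
    """Count rooted bifurcating trees: rooting is equivalent to adding one extra tip."""
    return num_unrooted_trees(n_taxa + 1)


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

For ten taxa this gives 2,027,025 unrooted trees, and the count grows faster than exponentially as taxa are added, which is why exhaustive evaluation quickly becomes impractical.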
However, the well-understood phenomena of convergent evolution and evolutionary reversals (collectively termed homoplasy) add an unpleasant wrinkle to the problem of estimating phylogeny. For a number of reasons, two organisms can possess a trait not present in their last common ancestor: if we naively took the presence of this trait as evidence of a relationship, we would reconstruct an incorrect tree. Real phylogenetic data include substantial homoplasy, with different parts of the data suggesting sometimes very different relationships. Methods used to estimate phylogenetic trees are explicitly intended to resolve the conflict within the data by picking the phylogenetic tree that is the best fit to all the data overall, accepting that some data simply will not fit.
Data that do not fit a tree perfectly are not simply "noise"; they can contain relevant phylogenetic signal in some parts of a tree, even if they conflict with the tree overall. In the whale example given above, the lack of hair in whales is homoplastic: it reflects a return to the condition present in ancient ancestors of mammals, who lacked hair. This similarity between whales and ancient mammal ancestors is in conflict with the tree we accept, since it implies that the mammals with hair should form a group excluding whales. However, among the whales, the reversal to hairlessness actually correctly associates the various types of whales (including dolphins and porpoises) into the group Cetacea. Still, the determination of the best-fitting tree, and thus of which data do not fit the tree, is a complex process. Maximum parsimony is one method developed to do this.
Character Data
The input data used in a maximum parsimony analysis are in the form of "characters" for a range of taxa. There is no generally agreed-upon definition of a phylogenetic character, but operationally a character can be thought of as an attribute, an axis along which taxa are observed to vary. These attributes can be physical (morphological), molecular/genetic, physiological, behavioral, etc. The only widespread agreement on characters seems to be that variation used for character analysis should reflect heritable variation. Whether it must be directly heritable, or whether indirect inheritance (e.g., learned behaviors) is acceptable, is not entirely resolved.
Each character is divided into discrete character states, into which the variations observed are classified. Character states are often formulated as descriptors, describing the condition of the character substrate. For example, the character "eye color" might have the states "blue" and "brown." Characters can have two or more states (they can have only one, but these characters lend nothing to a maximum parsimony analysis, and are often excluded).
Coding characters for phylogenetic analysis is not an exact science, and there are numerous issues. Typically, taxa are scored with the same state if they are more similar to one another in that particular attribute than each is to taxa scored with a different state. This is not straightforward when character states are not clearly delineated, or when they fail to capture all of the possible variation in a character. How would one score the previously mentioned character for a taxon (or individual) with hazel eyes? Or green? As noted above, character coding is generally based on similarity: hazel and green eyes might be lumped with blue because they are more similar to that color (being light), and the character could then be recoded as "eye color: light; dark." Alternatively, there can be multistate characters, such as "eye color: brown; hazel; blue; green."
Ambiguities in character state delineation and scoring can be a major source of confusion, dispute, and error in phylogenetic analysis using character data. Note that, in the above example, "eyes: present; absent" is also a possible character, which creates issues because "eye color" is not applicable if eyes are not present. For such situations, a "?" ("unknown") is scored, although sometimes "X" or "-" (the latter usually in sequence data) are used to distinguish cases where a character cannot be scored from a case where the state is simply unknown. Current implementations of maximum parsimony generally treat unknown values in the same manner: the reasons the data are unknown have no particular effect on analysis. Effectively, the program treats a ? as if it held the state that would involve the fewest extra steps in the tree (see below), although this is not an explicit step in the algorithm.
Genetic data are particularly amenable to character-based phylogenetic methods such as maximum parsimony because protein and nucleotide sequences are naturally discrete: a particular position in a nucleotide sequence can be either adenine, cytosine, guanine, or thymine/uracil, or a sequence gap; a position (residue) in a protein sequence will be one of the standard amino acids or a sequence gap. Thus, character scoring is rarely ambiguous, except in cases where sequencing methods fail to produce a definitive assignment for a particular sequence position. Sequence gaps are sometimes treated as characters, although there is no consensus on how they should be coded.
Characters can be treated as unordered or ordered. For a binary (two-state) character, this makes little difference. For a multi-state character, unordered characters can be thought of as having an equal "cost" (in terms of number of "evolutionary events") to change from any one state to any other; complementarily, they do not require passing through intermediate states. Ordered characters have a particular sequence in which the states must occur through evolution, such that going between some states requires passing through an intermediate. This can be thought of complementarily as having different costs to pass between different pairs of states. In the eye-color example above, it is possible to leave it unordered, which imposes the same evolutionary "cost" to go from brown-blue, green-blue, green-hazel, etc. Alternately, it could be ordered brown-hazel-green-blue; this would normally imply that it would cost two evolutionary events to go from brown-green, three from brown-blue, but only one from brown-hazel. This can also be thought of as requiring eyes to evolve through a "hazel stage" to get from brown to green, and a "green stage" to get from hazel to blue, etc.
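The difference between the two treatments can be written down as a pair of cost functions. This is a minimal sketch using the eye-color example above; the state order and the function names are assumptions made for illustration only.

```python
# Hypothetical ordering of the eye-color states, taken from the example in the text.
STATES = ["brown", "hazel", "green", "blue"]


def unordered_cost(a: str, b: str) -> int:
    """Unordered character: any change between different states costs one step."""
    return 0 if a == b else 1


def ordered_cost(a: str, b: str) -> int:
    """Ordered (linear) character: cost is the number of intermediate transitions."""
    return abs(STATES.index(a) - STATES.index(b))


if __name__ == "__main__":
    for target in STATES[1:]:
        print("brown ->", target,
              "unordered:", unordered_cost("brown", target),
              "ordered:", ordered_cost("brown", target))
```

Under the ordered treatment, brown to green costs two steps and brown to blue costs three, while the unordered treatment charges one step for every change.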
There is a lively debate on the utility and appropriateness of character ordering, but no general consensus. Some authorities order characters when there is a clear logical, ontogenetic, or evolutionary transition among the states (for example, "legs: short; medium; long"). Some accept only some of these criteria. Some run an unordered analysis, and order characters that show a clear order of transition in the resulting tree (which practice might be accused of circular reasoning). Some authorities refuse to order characters at all, suggesting that it biases an analysis to require evolutionary transitions to follow a particular path.
It is also possible to apply differential weighting to individual characters. This is usually done relative to a "cost" of 1. Thus, some characters might be seen as more likely to reflect the true evolutionary relationships among taxa, and thus they might be weighted at a value of 2 or more; changes in these characters would then count as two evolutionary "steps" rather than one when calculating tree scores (see below). There has been much discussion in the past about character weighting. Most authorities now weight all characters equally, although exceptions are common. For example, allele frequency data are sometimes pooled in bins and scored as an ordered character. In these cases, the character itself is often downweighted so that small changes in allele frequencies count less than major changes in other characters. Also, the third codon position in a coding nucleotide sequence is particularly labile, and is sometimes downweighted, or given a weight of 0, on the assumption that it is more likely to exhibit homoplasy. In some cases, repeated analyses are run, with characters reweighted in inverse proportion to the degree of homoplasy discovered in the previous analysis (termed successive weighting); this is another technique that might be considered circular reasoning.
Character state changes can also be weighted individually. This is often done for nucleotide sequence data; it has been empirically determined that certain base changes (A-C, A-T, G-C, G-T, and the reverse changes) occur much less often than others. These changes are therefore often weighted more. As shown above in the discussion of character ordering, ordered characters can be thought of as a form of character state weighting.
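Both kinds of weighting described above reduce to simple arithmetic on step counts. The sketch below is illustrative only: the weight of 2 for the rarer base changes and the function names are assumptions, not values prescribed by any particular program.

```python
PURINES = {"A", "G"}  # the rarer changes listed above (A-C, A-T, G-C, G-T) cross this boundary


def is_transversion(a: str, b: str) -> bool:
    """True when the change swaps a purine for a pyrimidine or vice versa."""
    return (a in PURINES) != (b in PURINES)


def change_cost(a: str, b: str, rare_weight: int = 2, common_weight: int = 1) -> int:
    """Cost of one base change, with the rarer class weighted more (assumed weights)."""
    if a == b:
        return 0
    return rare_weight if is_transversion(a, b) else common_weight


def weighted_length(steps_per_character, character_weights) -> int:
    """Tree length when each character's step count is multiplied by its weight."""
    return sum(s * w for s, w in zip(steps_per_character, character_weights))


if __name__ == "__main__":
    print(change_cost("A", "G"), change_cost("A", "C"))
    # Third character given weight 0, as is sometimes done for third codon positions.
    print(weighted_length([1, 2, 1], [1, 1, 0]))
```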
Analysis
A maximum parsimony analysis runs in a very straightforward fashion. Trees are scored according to the degree to which they imply a parsimonious distribution of the character data. The most parsimonious tree for the dataset represents the preferred hypothesis of relationships among the taxa in the analysis.
Trees are scored (evaluated) by using a simple algorithm to determine how many "steps" (evolutionary transitions) are required to explain the distribution of each character. A step is, in essence, a change from one character state to another, although with ordered characters some transitions require more than one step. Contrary to popular belief, the algorithm does not explicitly assign particular character states to nodes (branch junctions) on a tree: the least number of steps can involve multiple, equally costly assignments and distributions of evolutionary transitions. What is optimized is the total number of changes.
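One well-known way to count steps for unordered characters is Fitch's algorithm; the text does not name a specific algorithm, so treating it as the method here is an assumption. The sketch below represents a tree as nested tuples of taxon names and treats "?" as compatible with any state. The small data matrix (wings, hair, and a hypothetical third character shared by the mammals) is invented for illustration.

```python
def fitch_steps(tree, states, all_states):
    """Return (possible_state_set, steps) for one unordered character.

    tree: a taxon name (str) or a 2-tuple of subtrees.
    states: dict mapping taxon name -> state, with "?" meaning unknown.
    """
    if isinstance(tree, str):
        s = states[tree]
        # An unknown value is treated as "any state", so it never forces a step.
        return (set(all_states) if s == "?" else {s}), 0
    left_set, left_steps = fitch_steps(tree[0], states, all_states)
    right_set, right_steps = fitch_steps(tree[1], states, all_states)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    # No shared state: at least one change is needed at this node.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, matrix, all_states):
    """Sum the minimum step counts over all characters (columns) of the matrix."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for i in range(n_chars):
        column = {taxon: row[i] for taxon, row in matrix.items()}
        total += fitch_steps(tree, column, all_states)[1]
    return total


# Illustrative matrix; columns: wings, hair, hypothetical third mammal character.
matrix = {"bird": "100", "bat": "111", "crocodile": "000", "human": "011"}
tree_a = (("human", "bat"), ("bird", "crocodile"))
tree_b = (("bird", "bat"), ("human", "crocodile"))

if __name__ == "__main__":
    print(tree_length(tree_a, matrix, "01"), tree_length(tree_b, matrix, "01"))
```

With these invented data, the tree grouping humans with bats needs four steps and the tree grouping birds with bats needs five, so the former is preferred. Note that the function returns sets of possible states rather than a single assignment, in keeping with the point that the step count does not depend on choosing one particular reconstruction.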
There are far more possible phylogenetic trees than can be searched exhaustively once an analysis includes more than eight taxa or so. A number of algorithms are therefore used to search among the possible trees. Many of these involve taking an initial tree (usually the favored tree from the last iteration of the algorithm), and perturbing it to see if the change produces a better score (for parsimony, fewer steps).
The trees resulting from parsimony search are unrooted: they show all the possible relationships of the included taxa, but they lack any statement on relative times of divergence. A particular branch is chosen to root the tree by the user: this branch is then taken to be outside all the other branches of the tree, which together form a monophyletic group. This imparts a sense of relative time to the tree. Incorrect choice of a root can result in incorrect relationships on the tree, even if the tree is itself correct in its unrooted form.
Problems with Maximum Parsimony Phylogeny Estimation
Maximum parsimony is a very simple approach, and is popular for this reason. However, it is not statistically consistent. That is, it is not guaranteed to produce the true tree with high probability, given sufficient data.
Consistency, here meaning the monotonic convergence on the correct answer with the addition of more data, is a desirable property of any statistical method. As demonstrated in 1978 by Felsenstein, maximum parsimony can be inconsistent under certain conditions. The category of situations in which this is known to occur is called long branch attraction. It occurs, for example, where two taxa (A and C) sit on long branches (a high level of substitutions) while another two (B and D) sit on short branches, and where A and B diverged from a common ancestor, as did C and D.
Assume for simplicity that we are considering a single binary character (it can either be + or -). Because the distance from B to D is small, in the vast majority of all cases, B and D will be the same. Here, we will assume that they are both + (+ and - are assigned arbitrarily and swapping them is only a matter of definition). If this is the case, there are four remaining possibilities. A and C can both be +, in which case all taxa are the same and all the trees have the same length. A can be + and C can be -, in which case only one taxon differs from the others, and we cannot learn anything, as all trees have the same length. Similarly, A can be - and C can be +. The only remaining possibility is that A and C are both -. In this case, however, parsimony groups A and C together, and B and D together. As a consequence, when the true tree is of this type, the more data we collect (i.e. the more characters we study), the more we tend towards the wrong tree.
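The argument can be checked numerically. The following sketch assumes a symmetric two-state model in which a branch is summarized by a single probability that its two ends differ; the states + and - are coded as 0 and 1, and the branch values (0.4 for long, 0.05 for short) are arbitrary illustrative choices. The true tree joins A with B and C with D.

```python
from itertools import product


def t(u: int, v: int, p: float) -> float:
    """Probability of ending in state v from state u on a branch with change probability p."""
    return p if u != v else 1.0 - p


def pattern_probability(pattern, branch):
    """Probability of observing states (a, b, c, d) on the true tree ((A,B),(C,D))."""
    a, b, c, d = pattern
    total = 0.0
    # Sum over the unobserved states x and y of the two internal nodes.
    for x, y in product((0, 1), repeat=2):
        total += (0.5 * t(x, a, branch["A"]) * t(x, b, branch["B"])
                  * t(x, y, branch["internal"])
                  * t(y, c, branch["C"]) * t(y, d, branch["D"]))
    return total


def informative_support(branch):
    """Expected frequency of characters favoring each of the three possible groupings."""
    groupings = {"AB|CD": (0, 0, 1, 1), "AC|BD": (0, 1, 0, 1), "AD|BC": (0, 1, 1, 0)}
    # Factor 2 accounts for the mirror-image pattern with 0 and 1 swapped.
    return {name: 2 * pattern_probability(p, branch) for name, p in groupings.items()}


long_branches = {"A": 0.4, "C": 0.4, "B": 0.05, "D": 0.05, "internal": 0.05}
equal_branches = {"A": 0.05, "C": 0.05, "B": 0.05, "D": 0.05, "internal": 0.05}

if __name__ == "__main__":
    print(informative_support(long_branches))
    print(informative_support(equal_branches))
```

With the long-branch values, characters supporting the incorrect grouping of A with C are expected more often than those supporting the true grouping of A with B, so adding characters strengthens the wrong answer. With equal short branches the true grouping dominates.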
Several other methods of phylogeny estimation are available, including maximum likelihood, Bayesian phylogeny inference, neighbour joining, and quartet methods. Of these, the first two both use a likelihood function, and, if used properly, are theoretically immune to long-branch attraction. These methods are both parametric, meaning that they rely on an explicit model of character evolution. It has been shown that, for some suboptimal models, these methods can also be inconsistent.
Another complication with maximum parsimony is that finding the most parsimonious tree is an NP-hard problem. The only currently available, efficient way of obtaining a solution, given an arbitrarily large set of taxa, is to use heuristic methods, which do not guarantee that the most parsimonious tree will be recovered.
Criticism
It has been asserted (in a previous version of this article) that a major problem, especially for paleontology, is that maximum parsimony "assumes that the only way two species can share the same character is if they are genetically related." Although this statement is confusingly worded, it appears to assert that phylogenetic applications of parsimony assume that all similarity is homologous (other interpretations, such as the assertion that two organisms might NOT be related at all, are nonsensical). This is emphatically not the case: as with any form of character-based phylogeny estimation, parsimony is used to test the homologous nature of similarities by finding the phylogenetic tree which best accounts for all of the similarities.
To use an example cited in a previous version of this article: birds and bats have wings, while crocodiles and humans do not. If this were the only datum available, maximum parsimony would tend to group crocodiles with humans, and birds with bats (as would any other method of phylogenetic inference). We believe that humans are actually more closely related to bats (which are mammals) than to crocodiles or birds (which are reptiles). Our belief is founded on additional data that were not considered in the one-character example (using wings). If even a tiny fraction of these additional data, including information on skeletal structure, soft-tissue morphology, integument, behaviour, genetics, etc., were included in the analysis, the faint phylogenetic signal produced by the presence of wings in birds and bats would be overwhelmed by the preponderance of data supporting the (human, bat)(bird, crocodile) tree.
It is often stated that parsimony is not relevant to phylogenetic inference because "evolution is not parsimonious." In most cases, there is no explicit alternative proposed; if no alternative is available, any statistical method is preferable to none at all. Additionally, it is not clear what would be meant if the statement "evolution is parsimonious" were in fact true. In at least some cases, the implication is that more character changes may have occurred historically than are predicted using the parsimony criterion. Because parsimony phylogeny estimation reconstructs the minimum number of changes necessary to explain a tree, this is in fact quite true. However, it has been shown through simulation studies, testing with known in vitro viral phylogenies, and congruence with other methods, that the accuracy of parsimony is in most cases not compromised by this. Parsimony analysis uses the number of character changes on trees to choose the best tree, but it does not require that exactly that many changes, and no more, produced the tree. In most cases, parsimony exhibits minimal bias as a result of choosing the tree with the fewest changes.
An analogy can be drawn with choosing a contractor based on their initial (nonbinding) estimate of the cost of a job. The actual finished cost is very likely to be higher than the estimate. Despite this, choosing the contractor who furnished the lowest estimate should theoretically result in the lowest final project cost. This is because, in the absence of other data, we would assume that all of the relevant contractors have the same risk of cost overruns. In practice, of course, unscrupulous business practices may bias this result; in phylogenetics, too, some particular phylogenetic problems (for example, long branch attraction above) may potentially bias results. In both cases, however, there is no way to tell if the result is going to be biased, or the degree to which it will be biased, based on the estimate itself. With parsimony too, there is no way to tell that the data are positively misleading, without comparison to other evidence.
Along the same lines, parsimony is often characterized as implicitly adopting the philosophical position that evolutionary change is rare, or that homoplasy (convergence and reversal) is minimal in evolution. This is not entirely true: parsimony minimizes the number of convergences and reversals that are assumed by the preferred tree, but this may result in a relatively large number of such homoplastic events. It would be more appropriate to say that parsimony assumes only the minimum amount of change implied by the data. As above, this does not require that these were the only changes that occurred; it simply does not infer changes for which there is no evidence. The shorthand for describing this is that "parsimony minimizes assumed homoplasies, it does not assume that homoplasy is minimal."
Parsimony is also sometimes associated with the notion that "the simplest possible explanation is the best," a generalisation of Occam's Razor. Parsimony does prefer the solution that requires the least number of unsubstantiated assumptions and unsupportable conclusions, the solution that goes the least theoretical distance beyond the data. This is a very common approach to science, especially when dealing with systems that are so complex as to defy simple models. Parsimony does not, however, necessarily produce a "simple" assumption. Indeed, as a general rule, most character datasets are so "noisy" that no truly "simple" solution is possible.
Alternatives
There are several other methods for inferring phylogenies based on discrete character data. Each offers potential advantages and disadvantages. Most of these methods have particularly avid proponents and detractors; parsimony especially has been advocated as philosophically superior (most notably by ardent cladists).
- Maximum Likelihood
Among the most popular alternative phylogenetic methods is maximum likelihood phylogenetic inference, sometimes simply called "likelihood" or "ML." Maximum likelihood is an optimality criterion, as is parsimony. Mechanically, maximum likelihood analysis functions much like parsimony analysis, in that trees are scored based on a character dataset, and the tree with the best score is selected. Maximum likelihood is a parametric statistical method, in that it employs an explicit model of character evolution. Such methods are potentially much more powerful than non-parametric statistical methods like parsimony, but only if the model used is a reasonable approximation of the processes that produced the data. Maximum likelihood has probably surpassed parsimony in popularity with nucleotide sequence data, and Bayesian phylogenetic inference, which uses the likelihood function, is becoming almost as prevalent.
Likelihood is the relative counterpart to absolute probability. If we know the number of possible outcomes of a test (N), and we know the number of those outcomes that fit a particular criterion (n), we can say that the probability of that criterion being met by an execution of that test is n/N. Thus, the probability of heads in the toss of a fair coin is 50% (1/2). What if we don't know the number of possible outcomes? Obviously, we cannot then calculate probabilities. However, if we observe that one outcome happens twice as often as the other over an arbitrarily large number of tests, we can say that that outcome is twice as likely. Likelihoods are proportional to the true probabilities: if an outcome is twice as likely, we can say that it is twice as probable, even though we cannot say how probable it is.
Practically, the probability of a tree cannot be calculated directly. The probability of the data given a tree can be calculated if you assume a specific set of probabilities of character change (a model). The critical part of likelihood analysis is that the probability of the data given the tree is the likelihood of the tree given the data. Thus, the tree that has the highest probability of producing the observed data is the most likely tree.
Maximum likelihood, as implemented in phylogenetics, uses a stochastic model that gives the probability of a particular character changing at any given point on a tree. This model can have a potentially large number of parameters, which can account for differences in the probabilities of particular states, the probabilities of particular changes, and differences in the probabilities of change among characters.
A likelihood tree has meaningful branch lengths (i.e. it is a phylogram); these lengths are usually interpreted as being proportional to the average probability of change for characters on that branch (thus, on a branch of length 1, we would expect an average of one change per character, which is a lot). The state of each character is plotted on the tree, and the probability of that distribution of character states is calculated using the model and the branch lengths (which can be altered to maximize the probability of the data). This is the probability of that character, given the tree. The probabilities of all of the characters are multiplied together; in practice they are negative log-transformed and added (which ranks trees in the same way), because the numbers become very small very quickly. The product is the probability of the data, given the tree, or the likelihood of the tree, and the sum is its negative log-transformed equivalent. The tree with the highest likelihood (lowest negative log-transformed likelihood) given the data is preferred.
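The reason for working with logarithms can be seen in a few lines. This is a minimal sketch that assumes the per-character probabilities have already been computed under some model; the probability values and function names are invented for illustration.

```python
import math


def naive_likelihood(char_probs):
    """Multiply per-character probabilities directly (prone to floating-point underflow)."""
    result = 1.0
    for p in char_probs:
        result *= p
    return result


def neg_log_likelihood(char_probs):
    """Sum of negative log probabilities; a lower value means a higher likelihood."""
    return -sum(math.log(p) for p in char_probs)


# Hypothetical per-character probabilities of the same data on two candidate trees.
tree_1_probs = [0.02, 0.10, 0.05]
tree_2_probs = [0.01, 0.10, 0.05]

if __name__ == "__main__":
    print(neg_log_likelihood(tree_1_probs), neg_log_likelihood(tree_2_probs))
    many = [1e-3] * 200  # 200 characters, each with probability 0.001
    print(naive_likelihood(many), neg_log_likelihood(many))
```

With 200 such characters the direct product underflows to zero in double-precision arithmetic, while the summed negative logs remain an ordinary number that can still be compared between trees.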
In the above analogy regarding choosing a contractor, maximum likelihood would be analogous to gathering data on the final cost of broadly comparable jobs performed by each contractor over the past year, and selecting the contractor with the lowest average cost for those comparable jobs. This method would be highly dependent on how comparable the jobs are, but, if they are properly chosen, it will produce a better estimate of the actual cost of the job. Further, it would not be misled by bias in contractor estimates, because it is based on the final cost, not on the (potentially biased) estimates.
In practice, maximum likelihood tends to favor trees that are very similar to the most parsimonious tree(s) for the same dataset. It has been shown to outperform parsimony in certain situations where the latter is known to be biased, including long-branch attraction. Note, however, that the performance of likelihood is dependent on the quality of the model employed; an incorrect model can produce a biased result. Studies have shown that incorporating a parameter to account for differences in rate of evolution among characters is often critical to accurate estimation of phylogenies; failure to model this or other crucial parameters may produce incorrect or biased results. Model parameters are usually estimated from the data, and the number (and type) of parameters is often determined using the hierarchical likelihood ratio test. The consequences of mis-specified models are just beginning to be explored in detail.
Likelihood is generally regarded as a more desirable method than parsimony, in that it is statistically consistent, has a better statistical foundation, and allows complex modelling of evolutionary processes. A major drawback is that ML is still quite slow relative to parsimony methods, sometimes requiring days to run large datasets. Maximum likelihood phylogenetic inference was proposed in the mid-twentieth century, but it has only been a popular method for phylogenetic inference since the 1990s, when computational power caught up with the tremendous demands of ML analysis. Newer algorithms and implementations are bringing analysis times for large datasets into acceptable ranges. Until these methods gain widespread acceptance, parsimony will probably be preferred for extremely large datasets, especially when bootstrapping is used to assess confidence in the results.
One area where parsimony still holds much sway is in the analysis of morphological data. Until recently, stochastic models of character change were not available for non-molecular data. New methods, proposed by Paul Lewis, make essentially the same assumptions that parsimony analysis does, but do so within a likelihood framework. These models are not, however, widely implemented, and, unless modified, they require the modification of existing datasets (to deal with ordered characters, and with the tendency not to record autapomorphies in morphological datasets).
Maximum likelihood has been criticised as assuming neutral evolution implicitly in its adoption of a stochastic model of evolution. This is not necessarily the case: as with parsimony, assuming a stochastic model does not presume that all evolution is stochastic. In practice, likelihood is robust to deviations from stochasticity. It performs well even on coding sequences that include sites believed to be under selection.
A related objection (often brought up by parsimony-only advocates) is the idea that evolution is too complex or too poorly understood to be modeled. This objection probably rests on a misunderstanding of the term "model." While it is customary to think of models as representing the mechanics of a process, this is not necessarily literally the case. In fact, a model is often selected not so much for its faithful reproduction of the phenomenon as its ability to make predictions. In practice, it is best not to try and exactly fit a model to a process, because there is a trade-off between number of parameters in a model and its statistical power. Stochasticity may be a reasonably good fit to evolutionary data at a broad level, even if it does not accurately mirror the process at finer scales.
By analogy, no one claims that the human foot varies only in length and width, but differing combinations of length and width values can be combined to fit a wide variety of feet. In some cases, a slightly wider overall foot may be better fitted by increasing overall size rather than instep width, while a foot with a narrower heel might be better fit by a wider instep and a smaller shoe. Adding several more measurements would probably improve shoe fit somewhat, but would be impractical from a business standpoint. With increasingly precise fitting, differences between feet would make selling matched pairs of shoes impossible, and differences through time would mean that a proper fit at purchase might not be a proper fit when worn.
Parsimony has recently been shown to be more likely to recover the true tree in the face of profound changes in evolutionary ("model") parameters (e.g., the rate of evolutionary change) within a tree (Kolaczkowski and Thornton, 2004). This is particularly troublesome, since it is generally agreed that such changes may be a significant feature of deep divergences. Likelihood has had substantial success recovering known in vitro viral phylogenies, simulated phylogenies, and phylogenies confirmed by other methods. It seems likely therefore that this potential complication does not strongly bias results for more shallow divergences. Several research groups are currently exploring ways to incorporate profound shifts in evolutionary parameters into likelihood analysis.
Bayesian phylogenetics uses the likelihood function, and is normally implemented using the same models of evolutionary change used in Maximum Likelihood. It is very different, however, in both theory and application. Bayesian statistics is interesting because it takes into account one's a priori beliefs about the expected results of a test (called the prior probability), and gives a revised estimate of probabilities based on the results of a test (posterior probabilities). This is quite different from frequentist statistics, but is rather similar to the way in which people ordinarily address questions.
Bayesian phylogenetic analysis uses Bayes' theorem, which, simply put, relates the posterior probability of a tree to the likelihood of data, and the prior probability of the tree and model of evolution. However, unlike parsimony and likelihood methods, Bayesian analysis does not produce a single tree or set of equally optimal trees. Bayesian analysis uses the likelihood of trees in a Markov chain Monte Carlo (MCMC) simulation to sample trees in proportion to their likelihood, thereby producing a credible sample of trees. Following the mathematical application of Bayes' theorem, particular relationships (usually taken to mean particular branches or clades) occur within this set of trees in proportion to their posterior probability. Thus, if a particular grouping appears in 759 of 1000 trees resulting from a Bayesian analysis, this group has a posterior probability of 75.9%. Unlike other measures of support (such as bootstrap percentages), this value can be interpreted directly as the probability that that relationship represents the real phylogeny of the organisms, given the data, the model, and the prior probabilities.
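Reading a posterior probability off a tree sample amounts to counting. The sketch below assumes each sampled tree has already been reduced to the set of groupings (clades) it contains; that representation, the taxon names, and the function name are illustrative assumptions and do not reflect the output format of any particular Bayesian program.

```python
def clade_support(tree_sample, clade):
    """Fraction of sampled trees that contain the given clade."""
    clade = frozenset(clade)
    hits = sum(1 for clades in tree_sample if clade in clades)
    return hits / len(tree_sample)


# Two hypothetical topologies, each stored as the set of clades it contains.
with_clade = {frozenset({"human", "bat"}), frozenset({"bird", "crocodile"})}
without_clade = {frozenset({"bird", "bat"}), frozenset({"human", "crocodile"})}

# A made-up sample of 1000 trees in which 759 contain the (human, bat) grouping.
sample = [with_clade] * 759 + [without_clade] * 241

if __name__ == "__main__":
    print(clade_support(sample, {"human", "bat"}))
```

A real analysis would first discard the trees sampled before the chain settled and would parse clades from tree files; both steps are omitted here.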
The straightforward interpretation of Bayesian posterior probabilities, the automatic production of a confidence set of trees, and the relative computational ease of the Markov chain Monte Carlo approach (broadly comparable in computational time to a single ML analysis) are rapidly bringing Bayesian analysis into the mainstream. Much effort is being expended on making Bayesian analyses more flexible; an especially promising line of inquiry, one shared with ML analysis, is the exploration of integrating likelihood estimates over nuisance parameters (branch lengths, model parameters); this should improve estimates of the variables of interest (usually the tree).
One commonly cited drawback of Bayesian analysis is the need to explicitly set out a set of prior probabilities for the range of potential outcomes. The idea of incorporating prior probabilities into an analysis has been suggested as a potential source of bias. This is, in fact, a misunderstanding of the point of Bayesian analysis, which is to assess the support for changing an a priori hypothesis. Still, it is possible to specify uninformative priors, which do not prefer any particular hypothesis. Arguably, some hypotheses are more likely than others (e.g., it is unlikely that mollusks will be found to be vertebrates), and a reasonable analysis should probably reflect this. Bayesian methods involve other potential issues, such as the evaluation of "convergence," the point at which the MCMC process stops searching for the "space" of credible solutions and begins to build the credible sample. At present, there is no objective way to evaluate convergence, and it remains to be seen if subjective methods are effective.
- Distance Methods
Non-parametric distance methods were originally applied to phenetic data using a matrix of pairwise distances. These distances are then reconciled to produce a tree (a phylogram, with informative branch lengths). The distance matrix can come from a number of different sources, including measured distance (for example from immunological studies) or morphometric analysis, various pairwise distance formulae (such as Euclidean distance) applied to discrete morphological characters, or genetic distance from sequence, restriction fragment, or allozyme data. For phylogenetic character data, raw distance values can be calculated by simply counting the number of pairwise differences in character states (Manhattan distance).
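As a minimal sketch of the raw-distance calculation just described, the following Python counts differing character states for every pair of taxa. The character matrix, the taxon names, and the function names are invented for illustration; missing or polymorphic states are not handled.

```python
from itertools import combinations

def count_differences(states_a, states_b):
    """Count the characters at which two taxa have different states."""
    if len(states_a) != len(states_b):
        raise ValueError("taxa must be scored for the same characters")
    return sum(1 for a, b in zip(states_a, states_b) if a != b)

def distance_matrix(characters):
    """Build a symmetric dict of raw pairwise distances keyed by (taxon, taxon)."""
    taxa = sorted(characters)
    dist = {(t, t): 0 for t in taxa}
    for a, b in combinations(taxa, 2):
        d = count_differences(characters[a], characters[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist

# Hypothetical binary character matrix: one string of states per taxon.
characters = {
    "A": "00110",
    "B": "00111",
    "C": "11010",
}

raw = distance_matrix(characters)
print(raw[("A", "B")], raw[("A", "C")], raw[("B", "C")])  # 1 3 4
```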
In general, pairwise distance data are an underestimate of the path-distance between taxa on a phylogram. Pairwise distances effectively "cut corners" in a manner analogous to geographic distance: the distance between two cities may be 100 miles "as the crow flies," but a traveler may actually be obligated to travel 120 miles because of the layout of roads, the terrain, stops along the way, etc. Between pairs of taxa, some character changes that took place in ancestral lineages will be undetectable, because later changes have erased the evidence (often called multiple hits and back-mutations in sequence data). This problem is common to all phylogenetic estimation, but it is particularly acute for distance methods, because only two samples are used for each distance calculation; other methods benefit from evidence of these hidden changes found in other taxa not considered in pairwise comparisons. For nucleotide and amino acid sequence data, the same stochastic models of nucleotide change used in maximum likelihood analysis can be employed to "correct" distances, rendering the analysis "semi-parametric."
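One of the simplest such corrections is the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The sketch below assumes that model purely as an example (the text above does not single out any particular model), uses two made-up aligned sequences, and ignores gaps and ambiguous bases.

```python
import math

def p_distance(seq_a, seq_b):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)

def jukes_cantor_distance(p):
    """Correct an observed proportion p for multiple hits under Jukes-Cantor.

    The correction is undefined once p reaches 0.75, the saturation level
    expected for unrelated sequences under this model.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

# Hypothetical aligned sequences differing at 2 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
corrected = jukes_cantor_distance(p)
print(p, corrected)  # the corrected distance exceeds the observed proportion
```

The corrected value is always at least as large as the observed proportion, which reflects the "cut corners" that the uncorrected pairwise distance cannot see.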
Several simple algorithms exist to construct a tree directly from pairwise distances, including UPGMA and neighbor joining (NJ), but these will not necessarily produce the best tree for the data. To counter potential complications noted above, and to find the best tree for the data, distance analysis can also incorporate a tree-search protocol that seeks to satisfy an explicit optimality criterion. Two optimality criteria are commonly applied to distance data, minimum evolution (ME) and least-squares. Least squares is part of a broader class of regression-based methods lumped together here for simplicity. These regression formulae minimize the residual differences between path-distances along the tree and pairwise distances in the data matrix, effectively "fitting" the tree to the empirical distances. In contrast, ME accepts the tree with the shortest sum of branch lengths, and thus minimizes the total amount of evolution assumed. ME is closely akin to parsimony, and under certain conditions, ME analysis of distances based on a discrete character dataset will favor the same tree as conventional parsimony analysis of the same data.
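The two criteria can be illustrated by scoring candidate trees against a distance matrix. The Python below is a simplified sketch: the trees, branch lengths, and observed distances are invented, the function names are hypothetical, and the branch lengths are fixed by hand, whereas a real analysis would estimate them for each topology before comparing scores.

```python
def path_distances(edges, leaves):
    """Path length between every pair of leaves on a tree given as
    (node, node, branch_length) edges."""
    adjacency = {}
    for u, v, length in edges:
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))
    result = {}
    for start in leaves:
        dist = {start: 0.0}
        stack = [start]
        while stack:  # depth-first walk outward from the starting leaf
            node = stack.pop()
            for neighbor, length in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + length
                    stack.append(neighbor)
        for leaf in leaves:
            result[(start, leaf)] = dist[leaf]
    return result

def least_squares_score(edges, observed):
    """Sum of squared residuals between observed and path distances."""
    leaves = sorted({taxon for pair in observed for taxon in pair})
    fitted = path_distances(edges, leaves)
    return sum((observed[pair] - fitted[pair]) ** 2 for pair in observed)

def tree_length(edges):
    """Sum of branch lengths, the quantity minimized under minimum evolution."""
    return sum(length for _, _, length in edges)

# Hypothetical observed distances for four taxa.
observed = {("A", "B"): 3.0, ("A", "C"): 3.0, ("A", "D"): 5.0,
            ("B", "C"): 4.0, ("B", "D"): 6.0, ("C", "D"): 4.0}

# Two candidate unrooted trees; x and y are internal nodes.
tree_ab_cd = [("A", "x", 1.0), ("B", "x", 2.0), ("x", "y", 1.0),
              ("C", "y", 1.0), ("D", "y", 3.0)]
tree_ac_bd = [("A", "x", 1.0), ("C", "x", 1.0), ("x", "y", 1.0),
              ("B", "y", 2.0), ("D", "y", 3.0)]

print(least_squares_score(tree_ab_cd, observed))  # 0.0, a perfect fit
print(least_squares_score(tree_ac_bd, observed))  # 4.0, a worse fit
print(tree_length(tree_ab_cd))                    # 8.0
```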
Phylogeny estimation using distance methods has produced a number of controversies. UPGMA assumes an ultrametric tree (a tree where all the path-lengths from the root to the tips are equal). If the rate of evolution were equal in all sampled lineages (a molecular clock), and if the tree were completely balanced (equal numbers of taxa on both sides of any split, to counter the node density effect), UPGMA should not produce a biased result. These expectations are not met by most datasets, and although UPGMA is somewhat robust to their violation, it is not commonly used for phylogeny estimation.
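Whether a distance matrix is consistent with an ultrametric tree can be checked with the three-point condition: for every triple of taxa, the two largest of the three pairwise distances must be equal. The following Python is a sketch of that check; the two small matrices are invented, and the tolerance parameter is an assumption added to allow for rounding in real data.

```python
from itertools import combinations

def is_ultrametric(dist, taxa, tol=1e-9):
    """Three-point condition: in every triple of taxa, the two largest
    pairwise distances are equal (within a tolerance)."""
    for a, b, c in combinations(taxa, 3):
        d = sorted([dist[frozenset((a, b))],
                    dist[frozenset((a, c))],
                    dist[frozenset((b, c))]])
        if abs(d[2] - d[1]) > tol:
            return False
    return True

def as_distances(pairs):
    """Key distances by unordered taxon pair so lookups are symmetric."""
    return {frozenset(pair): value for pair, value in pairs.items()}

taxa = ["A", "B", "C", "D"]

# Hypothetical clock-like distances: every tip is equally far from the root.
clock_like = as_distances({("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 6,
                           ("B", "C"): 6, ("B", "D"): 6, ("C", "D"): 4})

# Hypothetical distances from a tree with unequal rates among lineages.
unequal_rates = as_distances({("A", "B"): 3, ("A", "C"): 3, ("A", "D"): 5,
                              ("B", "C"): 4, ("B", "D"): 6, ("C", "D"): 4})

print(is_ultrametric(clock_like, taxa))     # True
print(is_ultrametric(unequal_rates, taxa))  # False
```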
Neighbor-joining is a form of star decomposition, and can very quickly produce reasonable trees. It is very often used on its own. However, it lacks any sort of tree search and optimality criterion, and so there is no guarantee that the recovered tree is the one that best fits the data. A more appropriate analytical procedure would be to use NJ to produce a starting tree, then employ a tree search using an optimality criterion, to ensure that the best tree is recovered.
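The first step of neighbor joining, choosing which pair of taxa to join, can be sketched as follows. The code covers only this pair-selection step, using the standard criterion Q(i, j) = (n - 2) d(i, j) - sum of row i - sum of row j; it does not compute branch lengths or update the matrix. The five-taxon distance matrix and the function name are illustrative assumptions.

```python
from itertools import combinations

def nj_pair_to_join(taxa, dist):
    """Return the pair of taxa with the lowest Q value, and that value.

    dist is a nested dict of symmetric pairwise distances.
    """
    n = len(taxa)
    row_sums = {t: sum(dist[t][u] for u in taxa if u != t) for t in taxa}
    best_pair, best_q = None, None
    for a, b in combinations(taxa, 2):
        q = (n - 2) * dist[a][b] - row_sums[a] - row_sums[b]
        if best_q is None or q < best_q:
            best_pair, best_q = (a, b), q
    return best_pair, best_q

# Illustrative distance matrix for five taxa.
taxa = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {t: dict(zip(taxa, row)) for t, row in zip(taxa, rows)}

pair, q_value = nj_pair_to_join(taxa, dist)
print(pair, q_value)  # ('a', 'b') -50
```

Note that the pair joined first (a and b) is not the pair with the smallest raw distance (d and e, at 3): the criterion also rewards pairs that are jointly far from everything else, which is what lets NJ cope with unequal rates where simple clustering would not.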
Many scientists eschew distance methods. In some cases, this is for esoteric philosophical reasons. More practically, distance methods are avoided because they do not use the character data directly, and information locked in the distribution of character states can be lost in the pairwise comparisons. Further, the relationship between individual characters and the tree is lost in the process of reducing characters to distances. Also, at least some distance methods (for example UPGMA) have fairly strong biases.
Further, some complex phylogenetic relationships may produce biased distances. On any phylogram, branch lengths will be underestimated because some changes cannot be discovered at all when species go unsampled, whether because of experimental design or extinction (a phenomenon called the node density effect). However, even if pairwise distances from genetic data are "corrected" using stochastic models of evolution as mentioned above, they may more easily sum to a different tree than one produced from analysis of the same data and model using maximum likelihood. This is because pairwise distances are not independent; each branch on a tree is represented in the distance measurements of all taxa it separates. Error resulting from any characteristic of that branch that might confound phylogeny (stochastic variability, change in evolutionary parameters, an abnormally long or short branch length) will be propagated through all of the relevant distance measurements. The resulting distance matrix may then better fit an alternate tree.
Although there are a number of circumstances when distance methods can be expected to produce inadequate results, in practice they are extremely fast, and they often produce a reasonable estimate of phylogeny. One advantage over all other techniques is the availability of LogDet distances; these distances account for the possibility that the rate at which particular nucleotides are incorporated into sequences may vary over the tree. Distance methods are extremely popular among some molecular systematists, a substantial number of whom use NJ without an optimization stage almost exclusively. With the increasing speed of character-based analyses, some of the advantages of distance methods will probably wane. However, the nearly instantaneous NJ implementations, the ability to incorporate an evolutionary model in a speedy analysis, LogDet distances, and the occasional need to summarize relationships with a single number all mean that distance methods will probably stay in the mainstream for a long time to come.
References
- J. Felsenstein. (1978) Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool., 27:401-410.
- B. Kolaczkowski and J. W. Thornton. (2004) Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature, 431:980-984.


