HowTo/InferringModelsForDiscreteData

From Comparative Phylogenetics in R
Jump to: navigation, search

Modeling Discrete Character Evolution in R

The evolution of a discrete character along a branch of a phylogeny can be modeled as a Markovian process, where the probability of moving the current state to a different state is governed by a rate matrix. We can learn about discrete character evolution by calculating and comparing the likelihoods under models which impose different requirements on this matrix. For example, we might compare a Jukes-Cantor model (where the probability of going from state 0 to state 1 is equal to the probability of going in the reverse direction) to an asymetrical model where the two rates are allowed to differ. If we found the latter model was a significantly better fit, we might conclude that the trait in question evolves in a directional fashion.

Estimating transition rates and calculating the likelihood of different models for discrete character evolution can be accomplished with the packages geiger (the fitDiscrete function) and ape (the ace function).

Estimating a Jukes-Cantor model with the geiger package (load the package before you begin)

  library(geiger)

We will use the Geospiza dataset for this example and we assume that you have already loaded the sample Geospiza phylogeny and tipdata into the workspace. If not, save the linked files to your working directory and load them into memory with these commands.

  geotree <- read.nexus("geospiza.nex")
  geodata <- read.table("geospiza.txt")

So first let's create a vector with character states for "Geospiza". The dataset has 14 species. We can see a list of the species with the following command:

   geotree$tip.label

This command prints out the tip.labels for the taxa in the geospiza.tree, so you should now see a list of 14 taxa.

Now we can assign states for our discrete character to these 14 species with the commands:

     char1<-c(1,1,1,1,1,0,1,0,0,1,0,1,1,1)
     names(char1)<-geotree$tip.label
     char1

The first line binds the 14 character states into a single vector called x. Then we assign the names (tip labels) from the tree to these 14 values. So now when you type the last command, char1, R will show you the content of that object, a vector of the values with the names of species associated with the values.

More commonly you will not be typing the character states into the command line but extracting them from a table. To do this, the process would be similar.

   MyData=read.table("MyData.txt",row.names=1)
   char1<-MyData[,1]
   names(char1)<-row.names(MyData)

The first line reads in the table, indicating that the first column contains the row.names (here, your taxa with names matching the tip.labels on your tree). The next line designates the first column ([,1]) after the row names as character 1. Then you can associate the taxon names to the character states.

Now you can fit a Jukes-Cantor model to this data and the tree by typing:

    ER<-fitDiscrete(geotree, char1)
    ER

Geiger will return to you the log likelihood (lnL) for this model (-9.61) and the estimated transition rate, q (17.06).

Estimating a model with asymmetrical transition rates in Geiger

Now let's apply a model where the forward (0->1) rate is allowed to differ from the reverse rate (1->0). Type:

    ARD<-fitDiscrete(geotree,char1,model="ARD")
    ARD

ARD stand for all rates different. Now the function returns a likelihood of -7.97 and an estimated rate matrix:

    $Trait1$q
               [,1]      [,2]
    [1,] -17.908356 17.908356
    [2,]   5.995629 -5.995629

So we see that the estimated q for 0->1 is 17.9 and for 1->0 is 6.

Comparing models with the likelihood ratio test

We can use a chi-squared test to compare the likelihoods of these two models by typing:

    1-pchisq(2*(ARD$Trait1$lnl-ER$Trait1$lnl),1)

We find that the p-value (with 1 degree of freedom) is 0.07, so the asymmetrical model is not significantly better than the symmetrical (Jukes-Cantor) model.

Fitting other discrete model variants

Geiger includes other model options, such as transforming the tree with Pagel's lambda, which multiplies all internal branches of the tree by lambda while leaving tip branches as their original length. .

    LAMBDA<-fitDiscrete(geotree, char1, treeTransform="lambda")
    LAMBDA

Geiger estimates lambda as 1.06x10-6 (I get 4.54x10-05 . . . --Bsidlauskas 12:22, 11 March 2008 (EDT)). We can again use the likelihood ratio test to compare this model with the Jukes-Cantor model.

    1-pchisq(2*(LAMBDA$Trait1$lnl-ER$Trait1$lnl),1)

The p-value is 0.295, so adding lambda to the model does not significantly improve its fit.

To see a description of the other transformations available in fitDiscrete (such as Pagel's kappa and delta, and models in which rates vary across the tree, read the help file for the function.

  help(fitDiscrete)

Related Links:

References

Pagel, M. 1994. Detecting correlated evolution on phylogenies: a general method for the comparative analysis of discrete characters. Proc. R. Soc. London B 255: 37-45.

Pagel, M. 1999 Inferring the historical patterns of biological evolution. Nature', 401, 877-884.


Back to the HowTo/Table of Contents