Phylogenetics methods: likelihood

Erick Matsen

NOTE: these slides have been superseded by a combination of the Lewis lectures and part of the methods intro.

Likelihood setup

  • Come up with a statistical model of experiment
  • Parametrize that model
  • Evaluate likelihood under various parameter values

Example: picking peaches

Say that, after harvesting 20 peaches, we have 6 ripe ones.

Model this with the binomial distribution: let \(p\) be the probability that a harvested peach is ripe, and assume each draw is independent.

The likelihood of getting the observed result is \[ { {20} \choose 6} \, p^6 \, (1-p)^{20-6}. \] Recall: \({ {20} \choose 6}\) is the number of ways of choosing 6 items out of 20.
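
As a minimal sketch, the expression above can be evaluated directly for a few candidate values of \(p\). The function name `binomial_likelihood` and the particular values of \(p\) tried here are illustrative choices, not part of the original slides.

```python
from math import comb

def binomial_likelihood(p, n=20, k=6):
    """Probability of exactly k ripe peaches in n independent draws."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Evaluate the likelihood at a few hypothetical values of p.
for p in (0.1, 0.3, 0.5):
    print(f"p = {p:.1f}: likelihood = {binomial_likelihood(p):.4f}")
```

Each printed value is one point on the likelihood curve for the fixed data (6 ripe out of 20).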

Peach picking likelihood surface

The maximum likelihood estimate of the parameter(s) of interest is the parameter value(s) that maximize the likelihood.

Fiddle with the dials until it looks good.

Questions

  1. What is the maximum likelihood (ML) estimate of \(p\) given our experiment?
  2. Would the ML estimate be different if we got 60 ripe peaches out of 200?
  3. Intuitively, would the shape of the likelihood curve be different with this larger dataset?
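
One way to check your answers numerically is a simple grid search over \(p\) for both datasets. This is only a sketch: the grid resolution, the helper names, and the cutoff of 2 log-likelihood units used to describe the width of each curve are assumptions made for illustration.

```python
from math import comb, log

def log_likelihood(p, n, k):
    """Log of the binomial likelihood for k ripe peaches out of n."""
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

def grid_summary(n, k, steps=999):
    """Return (best p on the grid, low, high), where [low, high] spans the
    grid values whose log-likelihood is within 2 units of the maximum."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]  # 0.001 .. 0.999
    lls = [log_likelihood(p, n, k) for p in grid]
    best = max(lls)
    p_hat = grid[lls.index(best)]
    near = [p for p, ll in zip(grid, lls) if ll >= best - 2.0]
    return p_hat, min(near), max(near)

small = grid_summary(20, 6)     # 6 ripe out of 20
large = grid_summary(200, 60)   # 60 ripe out of 200
print("20 peaches: ", small)
print("200 peaches:", large)
```

Comparing the two printed tuples shows where each curve peaks and how wide the range of well-supported values of \(p\) is.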

Likelihood recap

  • Maximum likelihood is a way of inferring unknown parameters
  • To apply likelihood, we need a model of the system under investigation
  • In general, the “likelihood” is the probability of generating the data under the given parameters, written \(P(D \mid \theta)\), where \(D\) is the data and \(\theta\) are the parameters.

Setup for likelihood-based phylogenetics

The phylogenetic likelihood of a tree is the probability of generating the observed data given that tree (under the sequence evolution model).

Note that the UW’s own Joe Felsenstein was the first to formalize this and develop efficient algorithms.

Sequence evolution models describe the rate of substitution from one symbol to another

\[ Q={\begin{pmatrix}-\mu _{A}&\mu _{AG}&\mu _{AC}&\mu _{AT}\\\mu _{GA}&-\mu _{G}&\mu _{GC}&\mu _{GT}\\\mu _{CA}&\mu _{CG}&-\mu _{C}&\mu _{CT}\\\mu _{TA}&\mu _{TG}&\mu _{TC}&-\mu _{T}\end{pmatrix}} \]

\[ Q_{\text{HKY}}={\begin{pmatrix}{*}&{\kappa \pi _{G}}&{\pi _{C}}&{\pi _{T}}\\{\kappa \pi _{A}}&{*}&{\pi _{C}}&{\pi _{T}}\\{\pi _{A}}&{\pi _{G}}&{*}&{\kappa \pi _{T}}\\{\pi _{A}}&{\pi _{G}}&{\kappa \pi _{C}}&{*}\end{pmatrix}} \]
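
The HKY matrix above can be built in a few lines. In this sketch the rows and columns are ordered A, G, C, T as in the matrices above, and each \(*\) is taken to be the diagonal entry that makes its row sum to zero. The base frequencies and the value of \(\kappa\) are made-up numbers for illustration, and the function names are hypothetical.

```python
BASES = "AGCT"          # same row/column order as the matrices above
PURINES = {"A", "G"}

def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (x in PURINES) == (y in PURINES)

def hky_rate_matrix(pi, kappa):
    """Build the 4x4 HKY rate matrix as a list of rows.

    pi maps each base to its equilibrium frequency; kappa scales transitions.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                q[i][j] = (kappa if is_transition(x, y) else 1.0) * pi[y]
        # The "*" entries: each diagonal makes its row sum to zero.
        q[i][i] = -sum(q[i])
    return q

# Hypothetical parameter values, chosen only for illustration.
pi = {"A": 0.3, "G": 0.2, "C": 0.2, "T": 0.3}
Q = hky_rate_matrix(pi, kappa=2.0)
for base, row in zip(BASES, Q):
    print(base, [round(v, 3) for v in row])
```

With these rates, \(\pi_i Q_{ij} = \pi_j Q_{ji}\) for every pair of bases, which is the reversibility property.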

Different models for different data

  • Nucleotide models are fit “on the fly”
    • e.g. F81, HKY, GTR
  • Protein models are typically pre-made
    • e.g. JTT (Jones, Taylor, and Thornton), and WAG (Whelan and Goldman) matrices
  • Codon models are a great idea
    • Position matters!
    • e.g. SRD06 model

Model hierarchy, from Posada and Crandall

Calculating likelihood of a single column
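
A minimal sketch of the single-column calculation, using Felsenstein's pruning recursion: each node stores, for every possible base, the probability of the tips below it. For brevity this sketch swaps in the Jukes-Cantor (JC69) model, which has a simple closed-form transition probability, rather than the models named earlier. The nested-list tree encoding, the example tree, and its branch lengths are assumptions made for illustration.

```python
from math import exp

BASES = "AGCT"

def jc69_prob(x, y, t):
    """JC69 probability of going from base x to base y along a branch of
    length t (expected substitutions per site)."""
    e = exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def partial(node):
    """Return {base: P(tips below node | node has that base)}.

    A node is either a single base character (a tip) or a list of
    (child, branch_length) pairs (an internal node).
    """
    if isinstance(node, str):
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        child_partial = partial(child)
        for x in BASES:
            result[x] *= sum(jc69_prob(x, y, t) * child_partial[y] for y in BASES)
    return result

def column_likelihood(root):
    """Average the root partials over the uniform JC69 base frequencies."""
    return sum(0.25 * v for v in partial(root).values())

# Hypothetical tree ((A:0.1, A:0.1):0.05, (G:0.1, C:0.2):0.05) for one column.
tree = [([("A", 0.1), ("A", 0.1)], 0.05), ([("G", 0.1), ("C", 0.2)], 0.05)]
print(column_likelihood(tree))
```

The recursion visits each node once, so the cost grows linearly with the number of tips instead of summing over every assignment of bases to internal nodes.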

Likelihood of an alignment
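
If each column is assumed to evolve independently, the likelihood of the whole alignment is the product of the per-column likelihoods. Writing \(D_i\) for column \(i\) of an alignment with \(n\) columns, \(T\) for the tree, and \(\theta\) for the model parameters (notation introduced here for illustration):

\[ P(D \mid T, \theta) = \prod_{i=1}^{n} P(D_i \mid T, \theta), \qquad \log P(D \mid T, \theta) = \sum_{i=1}^{n} \log P(D_i \mid T, \theta). \]

In practice the sum of logs is the form usually computed, since a product of many small probabilities underflows quickly.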

Likelihood phylogenetics recap

  • In likelihood phylogenetics, we explicitly model the mutation process
  • This allows “complex” models to be used
  • Statistical basis allows us to make formal statements about uncertainty
  • But on the other hand, our models are oversimplified!

Crazy but typical model assumptions

  • differences between sequences only appear by point mutation
  • evolution happens on each column independently
  • sequences are evolving according to reversible models (this excludes selection and directional evolution of base composition)
  • the evolutionary process is identical on all branches of the tree