Phylogenetics methods: likelihood

Erick Matsen

NOTE: these slides have been superseded by combining the Lewis lectures with some of the methods intro.

Likelihood setup

Come up with a statistical model of the experiment
Parametrize that model
Evaluate likelihood under various parameter values

Example: picking peaches

Say that, after harvesting 20 peaches, we have 6 ripe ones.

Model using the binomial distribution. Say \(p\) is the probability of getting a ripe peach, and each draw is independent.

The probability of getting the observed result is \[ { {20} \choose 6} \, p^6 \, (1-p)^{20-6}. \] Viewed as a function of \(p\) with the data held fixed, this is the likelihood. Recall: \({ {20} \choose 6}\) is the number of ways of choosing 6 items out of 20.
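
To make the formula concrete, here is a minimal sketch that evaluates it at a few candidate values of \(p\). The function name `binomial_likelihood` and the particular values of \(p\) are illustrative choices, not part of the original slides.

```python
from math import comb

def binomial_likelihood(p, n=20, k=6):
    """Probability of k ripe peaches out of n when each is ripe with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Evaluate the likelihood at a few candidate values of p (arbitrary choices).
for p in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"p = {p:.1f}  likelihood = {binomial_likelihood(p):.4f}")
```

Each printed value is one point on the likelihood surface shown on the next slide.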

Peach picking likelihood surface

The maximum likelihood estimate of the parameter(s) of interest is the parameter value(s) that maximize the likelihood.

~~The maximum likelihood estimate of the parameter(s) of interest is the parameter value(s) that maximize the likelihood.~~

Fiddle with the dials until it looks good.

Questions

What is the maximum likelihood (ML) estimate of \(p\) given our experiment?
Would the result of this ML estimate be different if we got 60 ripe peaches out of 200?
Intuitively, would the shape of the likelihood curve be different with this larger dataset?
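
One way to explore these questions numerically is sketched below. It grid-searches the log-likelihood for each dataset and reports the range of \(p\) whose log-likelihood lies within 2 units of the maximum. The helper names, the grid resolution, and the cutoff of 2 are all assumptions made for illustration; the cutoff is only a convenient way to compare how peaked the two curves are.

```python
from math import comb, log

def log_likelihood(p, n, k):
    """Binomial log-likelihood of k ripe peaches out of n."""
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

def summarize(n, k, steps=999):
    """Grid-search the ML estimate and measure how wide the likelihood peak is."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]  # 0.001 .. 0.999
    values = [log_likelihood(p, n, k) for p in grid]
    best = max(values)
    p_hat = grid[values.index(best)]
    # Values of p whose log-likelihood is within 2 units of the maximum
    # (an arbitrary cutoff, used here only to compare widths).
    near = [p for p, v in zip(grid, values) if v >= best - 2.0]
    return p_hat, min(near), max(near)

for n, k in ((20, 6), (200, 60)):
    p_hat, low, high = summarize(n, k)
    print(f"{k}/{n}: ML estimate {p_hat:.3f}, near-maximum range {low:.3f} to {high:.3f}")
```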

Likelihood recap

Maximum likelihood is a way of inferring unknown parameters
To apply likelihood, we need a model of the system under investigation
In general, the “likelihood” is the probability of generating the data under the given parameters, written \(P(D | \theta)\), where \(D\) is the data and \(\theta\) are the parameters.

Setup for likelihood-based phylogenetics

The phylogenetic likelihood of a tree is the probability of generating the observed data given that tree (under the sequence evolution model).

Note that the UW’s own Joe Felsenstein was the first to formalize this and develop efficient algorithms.

Sequence evolution models describe the rate of substitution from one symbol to another

\[ Q={\begin{pmatrix}-\mu _{A}&\mu _{AG}&\mu _{AC}&\mu _{AT}\\\mu _{GA}&-\mu _{G}&\mu _{GC}&\mu _{GT}\\\mu _{CA}&\mu _{CG}&-\mu _{C}&\mu _{CT}\\\mu _{TA}&\mu _{TG}&\mu _{TC}&-\mu _{T}\end{pmatrix}} \]

\[ Q_{\text{HKY}}={\begin{pmatrix}{*}&{\kappa \pi _{G}}&{\pi _{C}}&{\pi _{T}}\\{\kappa \pi _{A}}&{*}&{\pi _{C}}&{\pi _{T}}\\{\pi _{A}}&{\pi _{G}}&{*}&{\kappa \pi _{T}}\\{\pi _{A}}&{\pi _{G}}&{\kappa \pi _{C}}&{*}\end{pmatrix}} \]
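
The sketch below builds the HKY matrix with the same state order as above (A, G, C, T) and fills each \(*\) entry so that its row sums to zero, as a rate matrix requires. It then goes one step beyond the slide: for a continuous-time Markov model, the substitution probabilities along a branch of length \(t\) are \(P(t) = e^{Qt}\), approximated here with a truncated Taylor series and repeated squaring. The function names, the values of \(\kappa\) and \(\pi\), and the unnormalized time units are assumptions for illustration only.

```python
def hky_rate_matrix(kappa, pi):
    """Build the HKY rate matrix with states ordered A, G, C, T.

    pi maps each base to its equilibrium frequency; kappa scales transitions
    (A<->G and C<->T) relative to transversions.
    """
    bases = "AGCT"
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    Q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(bases):
        for j, y in enumerate(bases):
            if i != j:
                Q[i][j] = (kappa if (x, y) in transitions else 1.0) * pi[y]
        Q[i][i] = -sum(Q[i])  # diagonal makes each row sum to zero
    return Q

def mat_mul(A, B):
    """Multiply two 4x4 matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transition_probabilities(Q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by a truncated Taylor series plus repeated squaring."""
    scale = t / 2**squarings
    M = [[q * scale for q in row] for row in Q]
    P = [[float(i == j) for j in range(4)] for i in range(4)]  # identity matrix
    term = [row[:] for row in P]
    for n in range(1, terms + 1):
        term = [[x / n for x in row] for row in mat_mul(term, M)]  # M^n / n!
        P = [[P[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    for _ in range(squarings):
        P = mat_mul(P, P)  # undo the scaling: exp(M)^(2^squarings)
    return P

# Illustrative parameter values (not from the slides).
pi = {"A": 0.3, "G": 0.2, "C": 0.2, "T": 0.3}
Q = hky_rate_matrix(2.0, pi)
for row in transition_probabilities(Q, 0.5):
    print([round(x, 4) for x in row])
```

Each row of the printed matrix sums to 1, and for long branches every row approaches the equilibrium frequencies \(\pi\).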

Different models for different data

Nucleotide models are fit “on the fly”
- e.g. F81, HKY, GTR
Protein models are typically pre-made
- e.g. JTT (Jones, Taylor, and Thornton), and WAG (Whelan and Goldman) matrices
Codon models are a great idea
- Position matters!
- e.g. SRD06 model

Model hierarchy, from Posada and Crandall

Calculating likelihood of a single column
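
A minimal sketch of this calculation using Felsenstein's pruning algorithm is given below. To stay short it swaps in the Jukes-Cantor (JC69) model, which has a closed-form substitution probability, rather than any of the models named earlier. The tree encoding (an internal node is a tuple of `(child, branch_length)` pairs, a leaf is a taxon name), the taxon names, and the branch lengths are all hypothetical.

```python
from math import exp

BASES = "ACGT"

def jc69_prob(x, y, t):
    """JC69 probability that base x is replaced by base y along a branch of
    length t (in expected substitutions per site)."""
    e = exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def partial_likelihoods(node, column):
    """Return {base: P(observed tips below node | node has this base)}."""
    if isinstance(node, str):  # leaf: the observed base has probability 1
        return {b: 1.0 if column[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, length in node:
        below = partial_likelihoods(child, column)
        for b in BASES:
            # Sum over the unknown base at the child end of this branch.
            result[b] *= sum(jc69_prob(b, c, length) * below[c] for c in BASES)
    return result

def column_likelihood(tree, column):
    """Average the root partial likelihoods over the uniform JC69 base frequencies."""
    root = partial_likelihoods(tree, column)
    return sum(0.25 * root[b] for b in BASES)

# Hypothetical rooted tree ((human:0.1, chimp:0.1):0.05, mouse:0.3).
inner = (("human", 0.1), ("chimp", 0.1))
tree = ((inner, 0.05), ("mouse", 0.3))
column = {"human": "A", "chimp": "A", "mouse": "G"}
print(column_likelihood(tree, column))
```

The recursion visits each node once, so the cost grows linearly with the number of taxa instead of summing over every possible assignment of bases to internal nodes.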

Likelihood of an alignment
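
Under the assumption, listed at the end of these slides, that each column evolves independently, the likelihood of the whole alignment is the product of the per-column likelihoods. In the notation below, which is introduced here for illustration, \(T\) is the tree with its branch lengths, \(D_i\) is the \(i\)th of \(n\) alignment columns, and \(\theta\) are the parameters of the sequence evolution model: \[ P(D | T, \theta) = \prod_{i=1}^{n} P(D_i | T, \theta), \qquad \log P(D | T, \theta) = \sum_{i=1}^{n} \log P(D_i | T, \theta). \] In practice the sum of logs is the form usually computed, since a product of many small probabilities underflows floating-point arithmetic.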

Likelihood phylogenetics recap

In likelihood phylogenetics, we explicitly model the mutation process
This allows “complex” models to be used
Statistical basis allows us to make formal statements about uncertainty
But on the other hand, our models are oversimplified!

Crazy but typical model assumptions

differences between sequences only appear by point mutation
evolution happens on each column independently
sequences are evolving according to reversible models (this excludes selection and directional evolution of base composition)
the evolutionary process is identical on all branches of the tree