Leveraging the discrete and continuous structure of phylogenetic trees for the analysis of metagenomic data

Frederick “Erick” A. Matsen IV
Fred Hutchinson Cancer Research Center, Seattle

@ematsen       slides at http://matsen.fhcrc.org/research.html

We go out on a field expedition to 4 locations

At each location we observe some organisms

Can organize samples based on organisms seen: rodent samples belong together, as do primates

Reason: evolutionary history (a.k.a. phylogeny)

Phylogeny can be used to organize samples

Talk overview

Map ecological samples to various spaces associated with a relevant phylogenetic tree, and perform sample comparison and ordination in those inner product spaces.

 

Joint work with Steve Evans (Berkeley)

Data type

Read “same” DNA segment from bugs in sample

Compare sequence “reads” with public databases

Map sequences to tree

Data type recap

Part 1: comparison in tree-metric space

First, let’s abstract a little for convenience

Think of each collection of sequences as an empirical distribution of positions on a phylogenetic tree. [Note that I’ve simplified and un-rooted the tree, and changed the distribution.]

Two samples: how to compare?

(Hint: think of samples as probability distributions on a metric space.)

Kantorovich-Rubinstein/Earth-mover

Minimal amount of work to move probability mass in one configuration to another.

Comparison in tree-metric space

Map sample \(s\) to a probability distribution on a phylogenetic tree equipped with the length measure, i.e. \(s \mapsto f_s(x)\), and find distances between samples using a distance in that space:

 

\[ d(r,s) = d_{\text{KR}(\mu)}(f_r(x),f_s(x)) \]

 

where \(d_{\text{KR}(\mu)}\) is the KR distance with respect to the length measure \(\mu\) on the tree.
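On a tree, the KR distance reduces to a sum over edges: each edge contributes its branch length times the absolute difference in the mass the two samples place on one side of it (Evans & Matsen 2012). A minimal Python sketch of that sum follows. It assumes a toy rooted tree stored as a child-to-parent dictionary and mass placed only at nodes. The names `TREE`, `subtree_mass`, and `kr_distance` are illustrative and do not come from any package used in the talk.

```python
# Toy tree (an assumption for illustration): child -> (parent, branch length).
# The node "root" has no entry because it has no edge above it.
TREE = {
    "A": ("x", 1.0),
    "B": ("x", 1.0),
    "x": ("root", 2.0),
    "C": ("root", 3.0),
}


def subtree_mass(tree, masses):
    """For each non-root node, return the total mass at or below that node."""
    below = {node: masses.get(node, 0.0) for node in tree}

    def depth(node):
        # Number of edges between the node and the root.
        d = 0
        while node in tree:
            node = tree[node][0]
            d += 1
        return d

    # Visit the deepest nodes first so children are folded into their parents.
    for node in sorted(tree, key=depth, reverse=True):
        parent = tree[node][0]
        if parent in below:
            below[parent] += below[node]
    return below


def kr_distance(tree, p, q):
    """Sum over edges of branch length times |mass difference below the edge|."""
    below_p = subtree_mass(tree, p)
    below_q = subtree_mass(tree, q)
    return sum(
        length * abs(below_p[node] - below_q[node])
        for node, (_, length) in tree.items()
    )


if __name__ == "__main__":
    # All mass at leaf A versus all mass at leaf C.
    print(kr_distance(TREE, {"A": 1.0}, {"C": 1.0}))
```

With all of one sample's mass at leaf A and all of the other's at leaf C, the result is the path length between the two leaves, which matches the "work to move mass" picture.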

 

Lozupone, C. A., Hamady, M., Kelley, S. T. & Knight, R. Quantitative and Qualitative Diversity Measures Lead to Different Insights into Factors That Structure Microbial Communities. Appl. Environ. Microbiol. 73, 1576–1585 (2007).

 

Evans, S. N. & Matsen, F. A. The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples. J. R. Stat. Soc. Series B 74, 569–592 (2012).

Summary so far:

Part 1: Comparison in tree-metric space

\[ d(r,s) = d_{\text{KR}(\mu)}(f_r(x),f_s(x)) \]

Part 2: Principal components analysis in edge-space

Now map samples into edge space \({\mathbb R}^{|E|}\)

The image \(g_s\) of a sample \(s\) for an edge \(e\) is the amount of mass on one side of \(e\) minus the amount of mass on the other.

 

We map samples into \(|E|\) dimensional real space:

\[ {\mathbb R}^{|S|} \longrightarrow \left({\mathbb R}^{|E|}, \langle g_r, g_s \rangle_E := \sum_e g_r(e) g_s(e) \right). \]
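As a rough illustration of the edge-space map, the Python sketch below computes \(g_s(e)\) for every edge of a toy tree, along with the inner product \(\langle g_r, g_s \rangle_E\). It assumes each edge is described by the set of leaves on its side away from the root, that mass sits only on leaves, and that the sign convention is "away-from-root side minus root side". The slides do not fix a sign convention, so that choice is an assumption, as are the names `EDGE_LEAVES`, `edge_vector`, and `edge_inner`.

```python
# Toy tree (an assumption for illustration): each edge is named by the node
# at its far end and maps to the set of leaves on the away-from-root side.
EDGE_LEAVES = {
    "A": {"A"},
    "B": {"B"},
    "x": {"A", "B"},
    "C": {"C"},
}


def edge_vector(masses, edge_leaves=EDGE_LEAVES):
    """Return g_s as a list with one entry per edge, in sorted edge order."""
    total = sum(masses.values())
    vector = []
    for edge in sorted(edge_leaves):
        far_side = sum(
            mass for leaf, mass in masses.items() if leaf in edge_leaves[edge]
        )
        # Mass on one side of the edge minus mass on the other side.
        vector.append(far_side - (total - far_side))
    return vector


def edge_inner(g_r, g_s):
    """Ordinary dot product in edge space."""
    return sum(a * b for a, b in zip(g_r, g_s))


if __name__ == "__main__":
    g_r = edge_vector({"A": 0.5, "B": 0.5})
    g_s = edge_vector({"C": 1.0})
    print(g_r, g_s, edge_inner(g_r, g_s))
```

Each sample becomes one row of a samples-by-edges matrix, which is the input to the principal components step.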

Principal components in \({\mathbb R}^{|E|}\)

\[ {\mathbb R}^{|S|} \longrightarrow \left({\mathbb R}^{|E|}, \langle g_r, g_s \rangle_E := \sum_e g_r(e) g_s(e) \right). \]

 

The eigenvectors of the covariance matrix in \({\mathbb R}^{|E|}\) are the principal components.
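To make the covariance step concrete, here is a small pure-Python sketch that builds the covariance matrix of edge-space vectors and finds its leading eigenvector. It uses power iteration as a simple stand-in for a full eigendecomposition, so it recovers only the first principal component. The function names and the tiny data set are illustrative assumptions, not the implementation used in the talk.

```python
def covariance(rows):
    """Sample covariance matrix; each row is one sample, each column one edge."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def leading_component(cov, iterations=200):
    """Power iteration: returns (eigenvalue, unit eigenvector) of largest eigenvalue.

    Assumes the arbitrary start vector is not orthogonal to that eigenvector.
    """
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary nonzero start
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break  # all variance is zero; nothing to find
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return eigenvalue, v


if __name__ == "__main__":
    # Four hypothetical samples that vary only along the first edge.
    samples = [[1.0, 0.0], [-1.0, 0.0], [2.0, 0.0], [-2.0, 0.0]]
    print(leading_component(covariance(samples)))
```

The entries of the returned eigenvector are per-edge weights, which is what the colored, thickened tree drawings display: sign as color and magnitude as thickness.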

 

Can visualize eigenvectors (in \({\mathbb R}^{|E|}\)) using color to represent sign and thickness to represent magnitude.

 

Visualizing projected samples

Summary so far:

Part 1: Comparison in tree-metric space

\[ S \longrightarrow {\mathbb L}^2(T); \ \ d(r,s) = d_{\text{KR}(\mu)}(f_r(x),f_s(x)). \]

 

Part 2: Principal components analysis in edge-space

\[ {\mathbb R}^{|S|} \longrightarrow \left({\mathbb R}^{|E|}, \langle g_r, g_s \rangle_E := \sum_e g_r(e) g_s(e) \right). \]

Part 3: Principal components analysis in tree-metric space

Motivation I: introducing an unobserved taxon changes edge-space

Motivation II: branch lengths matter

For a point \(x\) on the tree, the amount of mass on one side of \(x\) minus the amount of mass on the other for sample \(s\) defines a real-valued function \(g_s(x)\) on the tree.

 

 

Extending linearly again gives our map:

 

\({\mathbb R}^{|S|} \longrightarrow \left({\mathbb L}^2(T), \ \ \langle g_r,g_s \rangle_\mu := \int_T g_r(x) g_s(x) d\mu(x)\right)\)

“Length” PCA: PCA after mapping to tree-metric space

\[ {\mathbb R}^{|S|} \longrightarrow \left({\mathbb L}^2(T), \ \ \langle g_r,g_s \rangle_\mu := \int_T g_r(x) g_s(x) d\mu(x)\right) \]

 

This takes some work, but is made possible because these functions are piecewise constant in our implementation.
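Because the functions are piecewise constant, the integral in the inner product collapses to a finite sum of segment length times the product of the two function values on that segment. The Python sketch below assumes the simplest case, where each function is constant along whole edges, so the segments are the edges themselves. If mass sits in the interior of an edge, the edge would need to be cut into finer segments. The function names and the example numbers are hypothetical.

```python
import math


def length_inner(g_r, g_s, lengths):
    """Inner product in L^2(T) for functions that are constant on each segment."""
    return sum(l * a * b for l, a, b in zip(lengths, g_r, g_s))


def length_scale(g, lengths):
    """Rescale each coordinate by sqrt(length).

    After this rescaling the ordinary dot product equals length_inner, so a
    standard PCA routine can be applied to the rescaled vectors.
    """
    return [math.sqrt(l) * a for l, a in zip(lengths, g)]


if __name__ == "__main__":
    # Hypothetical per-segment values for two samples and the segment lengths.
    g_r = [0.0, 0.0, -1.0, 1.0]
    g_s = [-1.0, -1.0, 1.0, -1.0]
    lengths = [1.0, 1.0, 3.0, 2.0]
    print(length_inner(g_r, g_s, lengths))
```

Under this assumption, length PCA amounts to edge PCA with each edge weighted by its branch length, which is one way to see why branch lengths now matter.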

 

Thank you Brian C. Claywell!

 

 

(Manuscript soon on arXiv.)

Application to data

Data set of vaginal microbial swabs

Data from David Fredricks’ lab, with Sujatha Srinivasan

  • Investigates bacterial vaginosis
  • 222 women from the King County STD clinic
  • Sequences of the 16S rRNA gene from vaginal swabs
  • Samples sequenced using 454 pyrosequencing (thousands of reads per sample)

 

 

Length principal components on vaginal swabs

Map ecological samples to various spaces associated with a relevant phylogenetic tree, and perform sample comparison and ordination in those inner product spaces.

 

Thank you

Steven N. Evans and Brian C. Claywell


Group: Connor McCoy and Christopher Small

Fred Hutchinson Cancer Research Center: David Fredricks, Noah Hoffman, Martin Morgan, and Sujatha Srinivasan

Funding: NIH Human Microbiome Project and NSF Algorithms for Threat Detection


Go see Susan Holmes, who will be speaking on related work at 9:15am on Tuesday in session 276.


You can view this talk at matsen.fhcrc.org under “Research.”

We have an open postdoc position.