# Talk overview

Map ecological samples to various spaces associated with a relevant phylogenetic tree, and perform sample comparison and ordination in those inner product spaces.

Data type

# Data type recap

Part 1: Comparison in tree-metric space

# First, let’s abstract a little for convenience

Think of each collection of sequences as an empirical distribution of positions on a phylogenetic tree. [Note that I’ve simplified and un-rooted the tree, and changed the distribution.]

# Two samples: how to compare?

(Hint: think of samples as probability distributions on a metric space.)

# Kantorovich-Rubinstein (earth mover's) distance

The minimal amount of work needed to move the probability mass from one configuration to the other, where work is mass times distance moved.

# Comparison in tree-metric space

Map sample $$s$$ to a probability distribution on a phylogenetic tree equipped with the length measure, i.e. $$s \mapsto f_s(x)$$, and find distances between samples using a distance in that space:

$d(r,s) = d_{\text{KR}(\mu)}(f_r(x),f_s(x))$

where $$d_{\text{KR}(\mu)}$$ is the KR distance with respect to the length measure $$\mu$$ on the tree.
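As a minimal sketch of how such a distance can be computed: when each sample's mass sits on the nodes of a (arbitrarily rooted) tree, the earth mover's distance reduces to a sum over edges of the edge length times the absolute difference in the mass lying below that edge. The toy tree, the `parent`/`length` array encoding, and the function names are assumptions made for illustration and are not taken from the talk's implementation.

```python
def subtree_mass(parent, mass):
    """Total mass in the subtree rooted at each node.

    Assumes node 0 is the root and parent[i] < i for every other node,
    so a single backwards sweep accumulates children into parents.
    """
    below = list(mass)
    for node in range(len(parent) - 1, 0, -1):
        below[parent[node]] += below[node]
    return below


def kr_distance(parent, length, p, q):
    """Earth mover's distance between node-mass distributions p and q.

    Edge i joins node i to parent[i] and has length[i]; the mass that
    must cross edge i is |P(below i) - Q(below i)|.
    """
    below_p = subtree_mass(parent, p)
    below_q = subtree_mass(parent, q)
    return sum(
        length[i] * abs(below_p[i] - below_q[i])
        for i in range(1, len(parent))
    )


# Hypothetical toy tree: root 0 with children 1 and 2; node 1 has leaves 3 and 4.
parent = [-1, 0, 0, 1, 1]
length = [0.0, 1.0, 2.0, 0.5, 0.5]  # length[0] is unused (root has no edge)

r = [0.0, 0.0, 0.0, 0.5, 0.5]  # sample r: mass split between leaves 3 and 4
s = [0.0, 0.0, 1.0, 0.0, 0.0]  # sample s: all mass on leaf 2

print(kr_distance(parent, length, r, s))  # 3.5
```

Each half unit of mass in `r` travels a path of length 0.5 + 1.0 + 2.0 = 3.5 to reach leaf 2, which agrees with the edge-by-edge sum.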

Lozupone, C. A., Hamady, M., Kelley, S. T. & Knight, R. Quantitative and Qualitative Diversity Measures Lead to Different Insights into Factors That Structure Microbial Communities. Appl. Environ. Microbiol. 73, 1576–1585 (2007).

Evans, S. N. & Matsen, F. A. The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples. J. R. Stat. Soc. Series B 74, 569–592 (2012).

# Summary so far:

Part 1: Comparison in tree-metric space

$d(r,s) = d_{\text{KR}(\mu)}(f_r(x),f_s(x))$

Part 2: Principal components analysis in edge-space

# Now map samples into edge space $${\mathbb R}^{|E|}$$

For an edge $$e$$, the image $$g_s(e)$$ of a sample $$s$$ is the amount of mass on one side of $$e$$ minus the amount of mass on the other.

We map samples into $$|E|$$-dimensional real space:

${\mathbb R}^{|S|} \longrightarrow \left({\mathbb R}^{|E|}, \langle g_r, g_s \rangle_E := \sum_e g_r(e) g_s(e) \right).$
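The edge-space map can be sketched as follows for a sample whose mass sits on the nodes of a rooted toy tree. The sign convention (mass below the edge minus mass above it, relative to an arbitrary root), the array encoding, and the function names are assumptions for illustration. Flipping the convention on an edge negates that coordinate for every sample, so the products $$g_r(e) g_s(e)$$ are unaffected.

```python
def edge_vector(parent, mass):
    """Map a node-mass sample to its vector g_s in edge space.

    Edge e joins node e to parent[e]; node 0 is the root and
    parent[i] < i for every other node. Coordinate e is the mass
    below the edge minus the mass above it.
    """
    below = list(mass)
    for node in range(len(parent) - 1, 0, -1):
        below[parent[node]] += below[node]
    total = below[0]
    return [below[e] - (total - below[e]) for e in range(1, len(parent))]


def inner_product(g_r, g_s):
    """Standard inner product on edge space: sum over edges of g_r(e) * g_s(e)."""
    return sum(a * b for a, b in zip(g_r, g_s))


# Hypothetical toy tree: root 0 with children 1 and 2; node 1 has leaves 3 and 4.
parent = [-1, 0, 0, 1, 1]
r = [0.0, 0.0, 0.0, 0.5, 0.5]  # mass split between leaves 3 and 4
s = [0.0, 0.0, 1.0, 0.0, 0.0]  # all mass on leaf 2

g_r = edge_vector(parent, r)  # [1.0, -1.0, 0.0, 0.0]
g_s = edge_vector(parent, s)  # [-1.0, 1.0, -1.0, -1.0]
print(inner_product(g_r, g_s))  # -2.0
```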

# Principal components in $${\mathbb R}^{|E|}$$

${\mathbb R}^{|S|} \longrightarrow \left({\mathbb R}^{|E|}, \langle g_r, g_s \rangle_E := \sum_e g_r(e) g_s(e) \right).$

The eigenvectors of the covariance matrix in $${\mathbb R}^{|E|}$$ are the principal components.

We can visualize each eigenvector (a vector in $${\mathbb R}^{|E|}$$) directly on the tree, using edge color to represent the sign of each coefficient and edge thickness to represent its magnitude.
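A dependency-free sketch of the principal components step, assuming the samples have already been mapped to edge-space vectors. The four sample vectors are hypothetical toy data, and power iteration is used here only to keep the example self-contained; it finds just the leading component, whereas in practice a full symmetric eigendecomposition (for example `numpy.linalg.eigh`) would be the usual choice.

```python
import math


def covariance(rows):
    """Sample covariance matrix of the rows, plus the centered rows."""
    n, m = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - mean[j] for j in range(m)] for row in rows]
    cov = [
        [sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(m)]
        for j in range(m)
    ]
    return cov, centered


def leading_component(cov, iterations=200):
    """Leading eigenpair of a symmetric PSD matrix by power iteration."""
    m = len(cov)
    v = [float(i + 1) for i in range(m)]  # arbitrary deterministic start
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(m)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    eigenvalue = sum(
        v[j] * sum(cov[j][k] * v[k] for k in range(m)) for j in range(m)
    )
    return eigenvalue, v


# Hypothetical edge-space vectors: one row per sample, one column per edge.
G = [
    [1.0, -1.0, 0.0, 0.0],
    [-1.0, 1.0, -1.0, -1.0],
    [1.0, -1.0, 1.0, -1.0],
    [0.0, 0.0, 0.0, -1.0],
]

cov, centered = covariance(G)
eigenvalue, component = leading_component(cov)

# Coordinates of each sample along the first principal component.
scores = [sum(c[j] * component[j] for j in range(len(component))) for c in centered]
print(eigenvalue, component, scores)
```

The entries of `component` are per-edge coefficients, so their signs and magnitudes are what the color and thickness encoding would display on the tree.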

# Summary so far:

Part 1: Comparison in tree-metric space

$S \longrightarrow {\mathbb L}^2(T); \ \ d(r,s) = d_{\text{KR}(\mu)}(f_r(x),f_s(x)).$

Part 2: Principal components analysis in edge-space

${\mathbb R}^{|S|} \longrightarrow \left({\mathbb R}^{|E|}, \langle g_r, g_s \rangle_E := \sum_e g_r(e) g_s(e) \right).$

Part 3: Principal components analysis in tree-metric space

# Mass difference at a point $$x$$ of the tree

For sample $$s$$, the amount of mass on one side of $$x$$ minus the amount of mass on the other defines a real-valued function $$g_s(x)$$ on the tree.

Extending linearly again gives our map:

${\mathbb R}^{|S|} \longrightarrow \left({\mathbb L}^2(T), \ \ \langle g_r,g_s \rangle_\mu := \int_T g_r(x) g_s(x) d\mu(x)\right)$

# “Length” PCA: PCA after mapping to tree-metric space

${\mathbb R}^{|S|} \longrightarrow \left({\mathbb L}^2(T), \ \ \langle g_r,g_s \rangle_\mu := \int_T g_r(x) g_s(x) d\mu(x)\right)$

This takes some work, but is made possible because these functions are piecewise constant in our implementation.
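One way to see why piecewise constancy helps, as a sketch rather than a description of the actual implementation: assume the tree can be cut into finitely many segments $$I_1, \dots, I_n$$ with lengths $$\ell_i = \mu(I_i)$$, on each of which every $$g_s$$ takes a single value $$g_s(i)$$ (this notation is introduced here for illustration). The integral then collapses to a finite weighted sum:

$\langle g_r, g_s \rangle_\mu = \int_T g_r(x) g_s(x) \, d\mu(x) = \sum_{i=1}^{n} \ell_i \, g_r(i) \, g_s(i) = \sum_{i=1}^{n} \left(\sqrt{\ell_i} \, g_r(i)\right) \left(\sqrt{\ell_i} \, g_s(i)\right).$

Under that assumption, length PCA amounts to ordinary PCA in $${\mathbb R}^n$$ after scaling coordinate $$i$$ by $$\sqrt{\ell_i}$$. When all mass sits at nodes, the segments are simply the edges.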

Thank you Brian C. Claywell!

(Manuscript soon on arXiv.)

Application to data

# Data set of vaginal microbial swabs

Data from David Fredricks’ lab, with Sujatha Srinivasan

• Investigates bacterial vaginosis
• 222 women from the King County STD clinic
• Sequences of the 16S rRNA gene from vaginal swabs
• Samples sequenced using 454 pyrosequencing (thousands of reads per sample)

# Length principal components on vaginal swabs

Map ecological samples to various spaces associated with a relevant phylogenetic tree, and perform sample comparison and ordination in those inner product spaces.

# Thank you

Steven N. Evans and Brian C. Claywell

Group: Connor McCoy and Christopher Small

Fred Hutchinson Cancer Research Center: David Fredricks, Noah Hoffman, Martin Morgan, and Sujatha Srinivasan

Funding: NIH Human Microbiome Project and NSF Algorithms for Threat Detection

Go see Susan Holmes, who will be speaking on related work at 9:15am on Tuesday in session 276.

You can view this talk at matsen.fhcrc.org under “Research.”

We have an open postdoc position.