Frederick “Erick” A. Matsen IV

Fred Hutchinson Cancer Research Center, Seattle

*Map ecological samples to various spaces associated with a relevant phylogenetic tree, and perform sample comparison and ordination in those inner product spaces.*

Data type

Part 1: Comparison in tree-metric space

Think of each collection of sequences as an empirical distribution of positions on a phylogenetic tree. [Note that I’ve simplified and un-rooted the tree, and changed the distribution.]

(Hint: think of samples as probability distributions on a metric space.)

The Kantorovich-Rubinstein (KR, or “earth mover’s”) distance is the minimal amount of work needed to move probability mass from one configuration to another.

Map sample \(s\) to a probability distribution on a phylogenetic tree equipped with the length measure, i.e. \(s \mapsto f_s(x)\), and find distances between samples using a distance in that space:

\[ d(r,s) = d_{\text{KR}(\mu)}(f_r(x),f_s(x)) \]

where \(d_{\text{KR}(\mu)}\) is the KR distance with respect to the length measure \(\mu\) on the tree.
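As a rough illustration of the "minimal work" idea, here is a minimal Python sketch. It assumes, for simplicity, that all mass sits at nodes of the tree (real placements may sit anywhere along an edge) and that the tree is stored with an arbitrary rooting. The function name `kr_distance` and the dictionary-based tree encoding are hypothetical, chosen only for this example. On a tree, every unit of surplus mass on one side of an edge has to cross that edge, so the distance reduces to a sum over edges of edge length times the absolute mass difference.

```python
def kr_distance(parent, length, mass_r, mass_s):
    """Earth mover's distance between two mass distributions on a tree.

    parent: dict mapping each node to its parent (the root maps to None)
    length: dict mapping each non-root node to the length of the edge above it
    mass_r, mass_s: dicts mapping nodes to probability mass (each sums to 1)
    """
    def depth(v):
        d = 0
        while parent[v] is not None:
            v = parent[v]
            d += 1
        return d

    # Net surplus of r over s sitting at each node.
    diff = {v: mass_r.get(v, 0.0) - mass_s.get(v, 0.0) for v in parent}
    total = 0.0
    # Visit deepest nodes first so each subtree is summed before its parent.
    for v in sorted(parent, key=depth, reverse=True):
        p = parent[v]
        if p is None:
            continue
        # All surplus below this edge must cross it: cost = length * |surplus|.
        total += length[v] * abs(diff[v])
        diff[p] += diff[v]
    return total


# Hypothetical three-leaf example: ((A:1, B:1)a:0.5, C:2)root
parent = {"root": None, "a": "root", "C": "root", "A": "a", "B": "a"}
length = {"a": 0.5, "C": 2.0, "A": 1.0, "B": 1.0}
print(kr_distance(parent, length, {"A": 1.0}, {"C": 1.0}))  # 3.5, the A-to-C path length
```

The choice of root does not affect the result, because only the absolute mass difference across each edge enters the sum.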

Part 2: Principal components analysis in edge-space

The image \(g_s\) of a sample \(s\) assigns to each edge \(e\) the amount of mass on one side of \(e\) minus the amount of mass on the other.

We map samples into \(|E|\)-dimensional real space:
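A small Python sketch of this edge map follows. It assumes mass sits at nodes and that "one side" means the side away from an arbitrarily chosen root; the name `edge_image` and the dictionary encoding of the tree are hypothetical. Flipping the rooting only flips signs of individual coordinates.

```python
def edge_image(parent, mass):
    """Return g_s(e) for every edge, keyed by the node below the edge.

    parent: dict mapping each node to its parent (the root maps to None)
    mass: dict mapping nodes to the mass of one sample
    """
    def depth(v):
        d = 0
        while parent[v] is not None:
            v = parent[v]
            d += 1
        return d

    # Accumulate the mass in the subtree below each node, deepest nodes first.
    below = {v: mass.get(v, 0.0) for v in parent}
    for v in sorted(parent, key=depth, reverse=True):
        if parent[v] is not None:
            below[parent[v]] += below[v]

    total = sum(mass.values())
    # Mass on the far side of the edge minus mass on the root side.
    return {v: below[v] - (total - below[v])
            for v in parent if parent[v] is not None}


# Hypothetical three-leaf example: ((A, B)a, C)root
parent = {"root": None, "a": "root", "C": "root", "A": "a", "B": "a"}
g = edge_image(parent, {"A": 0.5, "B": 0.25, "C": 0.25})
print(g)
```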

\[ {\mathbb R}^{|S|} \longrightarrow \left({\mathbb R}^{|E|}, \langle g_r, g_s \rangle_E := \sum_e g_r(e) g_s(e) \right). \]

The eigenvectors of the covariance matrix in \({\mathbb R}^{|E|}\) are the principal components.
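To make the eigenvector statement concrete, here is a minimal NumPy sketch. It assumes the edge images have already been collected into a matrix with one row per sample and one column per edge; the function name `edge_pca` and the toy matrix are hypothetical and not taken from the talk.

```python
import numpy as np


def edge_pca(G):
    """Principal components of samples embedded in edge space.

    G: (n_samples, n_edges) array whose row for sample s holds g_s(e).
    Returns eigenvalues (descending), eigenvectors (as columns), and the
    coordinates of each sample along the principal components.
    """
    G = np.asarray(G, dtype=float)
    centered = G - G.mean(axis=0)                    # center each edge coordinate
    cov = centered.T @ centered / (G.shape[0] - 1)   # |E| x |E| sample covariance
    vals, vecs = np.linalg.eigh(cov)                 # symmetric solver, ascending order
    order = np.argsort(vals)[::-1]                   # largest variance first
    vals, vecs = vals[order], vecs[:, order]
    scores = centered @ vecs                         # sample coordinates for plotting
    return vals, vecs, scores


# Toy data: four samples that differ only along the direction (1, 1, 0).
G = np.array([[1, 1, 0], [-1, -1, 0], [2, 2, 0], [-2, -2, 0]])
vals, vecs, scores = edge_pca(G)
print(vals[0], vecs[:, 0])
```

Each column of `vecs` has one entry per edge, which is what makes it possible to draw a component directly on the tree. The sign of an eigenvector is arbitrary, so only relative signs across edges carry meaning.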

Can visualize eigenvectors (in \({\mathbb R}^{|E|}\)) using color to represent sign and thickness to represent magnitude.

Part 1: Comparison in tree-metric space

\[ S \longrightarrow {\mathbb L}^2(T); \ \ d(r,s) = d_{\text{KR}(\mu)}(f_r(x),f_s(x)). \]

Part 2: Principal components analysis in edge-space

\[ {\mathbb R}^{|S|} \longrightarrow \left({\mathbb R}^{|E|}, \langle g_r, g_s \rangle_E := \sum_e g_r(e) g_s(e) \right). \]

Part 3: Principal components analysis in tree-metric space

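As in edge space, the amount of mass on one side of a point \(x\) on the tree minus the amount of mass on the other side defines a real-valued function \(g_s(x)\) for \(s\).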

Extending linearly again gives our map:

\[ {\mathbb R}^{|S|} \longrightarrow \left({\mathbb L}^2(T), \ \ \langle g_r,g_s \rangle_\mu := \int_T g_r(x) g_s(x) d\mu(x)\right) \]

Performing principal components analysis in \({\mathbb L}^2(T)\) takes some work, but is made possible because these functions are piecewise constant in our implementation.
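To see why piecewise constancy helps, note that the integral defining \(\langle g_r,g_s \rangle_\mu\) collapses to a finite sum of segment length times the product of the two constant values. The Python sketch below assumes the tree has already been cut into segments on which both functions are constant; the segment identifiers, the name `l2_inner`, and the numbers are hypothetical and do not describe the actual implementation.

```python
def l2_inner(seg_length, g_r, g_s):
    """Inner product of two piecewise-constant functions on a tree.

    seg_length: dict mapping segment id to the length of that segment
    g_r, g_s: dicts mapping segment id to the constant value on the segment
    """
    # The integral over each segment is length * value_r * value_s.
    return sum(seg_length[k] * g_r[k] * g_s[k] for k in seg_length)


# Hypothetical example with four segments (here, whole edges).
seg_length = {"a": 0.5, "C": 2.0, "A": 1.0, "B": 1.0}
g_r = {"a": 0.5, "C": -0.5, "A": 0.0, "B": -0.5}
g_s = {"a": 1.0, "C": -1.0, "A": 1.0, "B": -1.0}
print(l2_inner(seg_length, g_r, g_s))  # 1.75
```

Under the same assumption, scaling each segment's coordinate by the square root of its length turns this inner product into an ordinary dot product, so standard PCA routines can then be applied.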

Thank you Brian C. Claywell!

(Manuscript soon on arXiv.)

Application to data

Data from David Fredricks’ lab, with Sujatha Srinivasan

- Investigates bacterial vaginosis
- 222 women from the King County STD clinic
- Sequences of the 16S rRNA gene from vaginal swabs
- Samples sequenced using 454 pyrosequencing (thousands of reads per sample)

**Steven N. Evans and Brian C. Claywell**

*Group*: Connor McCoy and Christopher Small

*Fred Hutchinson Cancer Research Center*: David Fredricks, Noah Hoffman, Martin Morgan, and Sujatha Srinivasan

*Funding*: NIH Human Microbiome Project and NSF Algorithms for Threat Detection

Go see Susan Holmes, who will be speaking on related work at 9:15am on Tuesday in session 276.

You can view this talk at matsen.fhcrc.org under “Research.”

We have an **open postdoc position**.