Prospective studies are expensive, slow, and entail complex moral issues. They do not lend themselves to rapid vaccine development.
How might we guide vaccine development without disease exposure?
Antibody-making B cells: a key part of adaptive immunity.
What can we learn from B cells without battle-testing them?
These are needed for likelihood-based phylogenetic inference.
Which sites can be changed?
Plenty: a total of about 15 million unique 130nt sequences from memory B cell populations of three healthy individuals A, B, and C.
Investigate overall mutation patterns of the B cell repertoire.
Models like this are used throughout phylogenetic inference.
In fact, most don’t.
This is different from traditional phylogenetics.
Our “trees” have an observed read on the bottom and the corresponding “ancestral” germline sequence on top, connected by a branch, representing some amount of divergence.
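As a rough illustration of what the branch on such a two-taxon "tree" measures, the sketch below computes the raw proportion of differing sites between a read and its germline sequence, then applies the standard Jukes-Cantor correction. The sequences and function names are hypothetical, and JC69 is used only because it is the simplest model to show; it is not claimed to be the model chosen in this work.

```python
import math

def p_distance(read, germline):
    """Proportion of differing sites, skipping positions with non-ACGT characters."""
    if len(read) != len(germline):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(read.upper(), germline.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(p):
    """Jukes-Cantor branch length (expected substitutions per site) from p-distance."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical aligned germline / read pair (not real data).
germline = "ACGTACGTAC"
read = "ACGTTCGTAA"

p = p_distance(read, germline)
d = jc69_distance(p)
print(f"p-distance = {p:.3f}, JC69 branch length = {d:.4f}")
```

The corrected distance is slightly larger than the raw proportion because it accounts for multiple substitutions at the same site.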
We will test various models for the V, D, and J segments to select an appropriate evolutionary model for somatic hypermutation to eventually use in phylogenetic inference.
Identical model ranking across individuals (using AIC / BIC).
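For reference, ranking models by AIC and BIC can be sketched as follows. The model names, log-likelihoods, parameter counts, and sample size are made-up placeholders for illustration, not results from this analysis.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical fits: model name -> (maximized log-likelihood, free parameters).
fits = {
    "JC69": (-10500.0, 1),
    "HKY85": (-10320.0, 5),
    "GTR": (-10300.0, 9),
}
n_obs = 130  # hypothetical sample size used in the BIC penalty

# Lower scores are better under both criteria.
ranking_aic = sorted(fits, key=lambda m: aic(*fits[m]))
ranking_bic = sorted(fits, key=lambda m: bic(*fits[m], n_obs))

print("AIC ranking:", ranking_aic)
print("BIC ranking:", ranking_bic)
```

Running the same ranking separately for each individual and comparing the resulting orderings is one way to check whether model preference is consistent across data sets.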
(Each point is a single entry of one of the matrices, for a pair of individuals.)
Each cell may carry two IGH alleles, but only one is expressed.
Subdivide V genes:
and fit each subset separately.
Inspired by the work of Kosakovsky Pond et al. (2010) on evolutionary “fingerprinting.”
Unproductive rearrangements are more likely to be either unchanged from germline or more divergent.
We can’t answer that directly, but we can look across the repertoire at which sites have tolerated change.
\[ \omega \equiv \frac{dN}{dS} \equiv \frac{\hbox{rate of non-synonymous substitution}}{\hbox{rate of synonymous substitution}} \]
We want to estimate this value for each site:
millions of unique sequences
(rules out otherwise lovely tools like PAML, HYPHY):
“FUBAR [HYPHY] allows us to analyze larger data sets than other methods: We illustrate this on a large influenza hemagglutinin data set (3,142 sequences)” – Murrell et al., 2013
Per-site inference is made difficult by a complicated mutation process
We can use this to tell us about the neutral mutation process.
\[ \omega_l = \frac{\lambda_l^{(N-I)} / \lambda_l^{(N-O)}}{\lambda_l^{(S-I)} / \lambda_l^{(S-O)}} \]
Say we are doing a per-county smoking survey.
Assume that \(\lambda_l\), the substitution rate at site \(l\), comes from a Gamma distribution with shape \(\alpha\) and rate \(\beta\):
\[ \lambda_l \sim \mathrm{Gamma}(\alpha, \beta). \]
Model total substitution counts (sampled via stochastic mapping) for a site as Poisson with rate \(\lambda_l\):
\[ C_l \sim \mathrm{Poisson}(\lambda_l). \]
Fit \(\hat \alpha\) and \(\hat \beta\) to all data, then draw rates \(\lambda_l\) from the posterior:
\[ \lambda_l \mid C_l \sim \mathrm{Gamma}(C_l + \hat \alpha, 1 + \hat \beta). \]
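The Gamma-Poisson steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the counts are invented, and \(\hat \alpha\) and \(\hat \beta\) are fit by matching moments of the marginal (negative binomial) distribution, a simple stand-in for whatever fitting procedure was actually used. Note that Python's `random.gammavariate` takes a shape and a scale, so the rate is inverted.

```python
import random
import statistics

def fit_gamma_moments(counts):
    """Moment estimates of the Gamma(shape alpha, rate beta) prior.

    Marginally, mean = alpha/beta and variance = alpha/beta + alpha/beta**2.
    """
    m = statistics.fmean(counts)
    v = statistics.pvariance(counts)
    if v <= m:
        raise ValueError("counts are not overdispersed; moment fit is undefined")
    beta = m / (v - m)
    alpha = m * beta
    return alpha, beta

def posterior_params(count, alpha, beta):
    """Conjugate update: Gamma(count + alpha, 1 + beta), as (shape, rate)."""
    return count + alpha, 1.0 + beta

def draw_rate(count, alpha, beta, rng):
    """Draw one rate from the posterior for a single site."""
    shape, rate = posterior_params(count, alpha, beta)
    return rng.gammavariate(shape, 1.0 / rate)  # gammavariate expects a scale

# Hypothetical per-site substitution counts (not real data).
counts = [0, 1, 0, 3, 12, 2, 0, 7, 1, 4]

alpha_hat, beta_hat = fit_gamma_moments(counts)
rng = random.Random(0)
draws = [draw_rate(c, alpha_hat, beta_hat, rng) for c in counts]

for c in (0, 12):
    shape, rate = posterior_params(c, alpha_hat, beta_hat)
    print(f"count={c:2d}  posterior mean rate={shape / rate:.3f}")
```

The posterior mean for a site is pulled from its raw count toward the overall mean, so sites with extreme counts are shrunk the most in absolute terms.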
\[ \omega_l = \frac{\lambda_l^{(N-I)} / \lambda_l^{(N-O)}}{\lambda_l^{(S-I)} / \lambda_l^{(S-O)}} \]
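One way to carry posterior uncertainty through to \(\omega_l\) is to draw the four rates independently and form the ratio for each draw. The sketch below assumes the superscripts denote nonsynonymous (N) or synonymous (S) substitutions in in-frame (I) or out-of-frame (O) sequences, and that the four rates are already on a comparable scale. The counts and prior parameters are hypothetical.

```python
import random
import statistics

KEYS = ("N-I", "N-O", "S-I", "S-O")

def point_omega(lam_ni, lam_no, lam_si, lam_so):
    """omega = (N-I / N-O) / (S-I / S-O) for a single set of rates."""
    return (lam_ni / lam_no) / (lam_si / lam_so)

def omega_draws(counts, priors, n_draws, seed=0):
    """Posterior draws of omega for one site.

    counts: dict key -> substitution count at this site
    priors: dict key -> (alpha_hat, beta_hat), Gamma shape and rate
    """
    rng = random.Random(seed)
    out = []
    for _ in range(n_draws):
        lam = {}
        for key in KEYS:
            alpha, beta = priors[key]
            # Posterior is Gamma(count + alpha, rate 1 + beta); pass scale = 1/rate.
            lam[key] = rng.gammavariate(counts[key] + alpha, 1.0 / (1.0 + beta))
        out.append(point_omega(lam["N-I"], lam["N-O"], lam["S-I"], lam["S-O"]))
    return out

# Hypothetical counts and priors for a single site (not real data).
site_counts = {"N-I": 5, "N-O": 20, "S-I": 10, "S-O": 10}
site_priors = {key: (1.0, 1.0) for key in KEYS}

draws = omega_draws(site_counts, site_priors, n_draws=1000)
print(f"posterior median omega = {statistics.median(draws):.3f}")
```

A posterior concentrated below 1 would suggest that nonsynonymous changes at the site are tolerated less often in in-frame sequences than the out-of-frame baseline predicts.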
For more details, the paper is up on arXiv.
We have a postdoc opening to work on molecular evolution methods for HIV vaccine experimental design, and probably another for B cell work.
status | A | B | C
---|---|---|---
functional | 4,139,983 | 4,861,800 | 3,748,306
out-of-frame | 533,919 | 794,845 | 558,246
stop | 104,525 | 169,423 | 112,901
Each dot is a pair of genes.
> summary(dist.dna(allele_01, pairwise.deletion=TRUE, model='raw'))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.003846 0.201300 0.344600 0.304700 0.384900 0.539500