B cell receptors: a grand and beautiful evolutionary process
Frederick “Erick” Matsen
Fred Hutchinson Cancer Research Center
http://matsen.group/
@ematsen
Antibodies bind antigens
Too many antigens to code for directly
B cell diversification process
You have an enzyme which incorporates random nucleotides into your genome.
It’s called Terminal deoxynucleotidyl transferase or TdT.
You have an enzyme which is purposebuilt to randomly mutate your genome.
It’s called ActivationInduced Deaminase or AID.
We can sequence BCRs in high throughput
Determining homology is hard!
The mutation process is crazy!
High contextsensitivity of mutation
Big, beautiful, messy data for the study of evolution
We don’t need lots of little techniques…
we only need models and one principle: likelihood.
How can we learn using probabilistic models?
HMM intro: dishonest casino
HMM intro: dishonest casino
HMM intro: dishonest casino
HMM intro: dishonest casino
Generative probabilistic model \(\rightarrow\) inference

Develop probabilistic model of biological process and how data is generated.

Find parameter choices that maximize the likelihood of generating the observed data.
When a field of bioinformatics is mature, it becomes statistics.
E.g. Maximumlikelihood phylogenetics, HMMER, DESeq2, etc.
VDJ rearrangement
Biased die roll \(\leftrightarrow\) read base (under mutation)
Switching dice \(\leftrightarrow\) changing between V, D, J genes
VDJ annotation problem: from where did each nucleotide come?
A: Take maximumlikelihood HMM path.
Find clonal families
Find clonal families
Did two sequences come from the same VDJ recombination?
Say we are given two sequences
Double roll \(\leftrightarrow\) Pair HMM
Double roll \(\leftrightarrow\) Pair HMM
Pick double roll hypothesis if it has higher likelihood of generating data.
(But we only know the sequences, not the annotations!)
Do two sequences come from a single rearrangement event?
Probability of generating observed sequence \(x\) from HMM:
\[
\mathbb P(x) = \sum_{\text{paths}\ \sigma} \mathbb P(x;\sigma),
\]
Probability of generating two sequences \(x\) and \(y\) from the same path through the HMM (i.e. from the same rearrangement event):
\[
\mathbb P(x,y) = \sum_{\text{paths}\ \sigma} \mathbb P(x,y;\sigma),
\]
\[
\text{Calculate: } \frac{\mathbb P(x, y)}{\mathbb P(x) \mathbb P(y)} = \frac{\mathbb P(\text{single rearrangement})}{\mathbb P(\text{independent rearrangements})}
\]
Do sets of sequences come from a
single rearrangement event?
\[
\frac{\mathbb P(A \cup B)}{\mathbb P(A) \mathbb P(B)} = \frac{\mathbb P(A \cup B \  \ \text{single rearrangement})}{\mathbb P(A,B \  \ \text{independent rearrangements})}
\]
Use this for agglomerative clustering; stop when the ratio < 1.
What are probabilities?
Distributions are reproducibly weird!
Distributions are reproducibly weird!
Sparse data & ML parameter inference
can be dangerous
Say we are flipping a coin with probability \(p\) of giving heads:
\[
\Pr(X=k)={\binom {n}{k}}p^{k}(1p)^{nk}
\]
Introductory aside: Bayesian estimation
Priors guide parameter inferences in the
absence of evidence to the contrary.
How to regularize per site?
Highly observed sites inform estimation for rarely observed ones.
–>
Persite rate estimates for 1 individual
data from Stern, Yaari, Heiden … O’Connor (2014). B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes. Science Trans Med, 6(248), 248ra107.
Percolumn mutation prior mean
Percolumn mutation prior mean
Betweenindividual consistency (thickness)
Persite prior mean between individuals
Natural selection inference
Would like persite selection inference
\[
\omega \equiv \frac{dN}{dS} \equiv \frac{\hbox{rate of nonsynonymous substitution}}{\hbox{rate of synonymous substitution}}
\]
but these rates need to be scaled to the neutral substitution process.
Productive vs. outofframe receptors
Each cell may carry two IGH alleles, but only one is expressed.
\[
\omega \equiv \frac{dN}{dS} \equiv \frac{\hbox{rate of nonsynonymous substitution}}{\hbox{rate of synonymous substitution}}
\]
Outofframe reads can be used to infer neutral mutation rate!
\(\omega_l\) is a ratio of rates in terms of observed neutral process
 \(\lambda_l^{(NI)}\): nonsynonymous inframe rate for site \(l\)
 \(\lambda_l^{(NO)}\): nonsynonymous outofframe rate for site \(l\)
 \(\lambda_l^{(SI)}\): synonymous inframe rate for site \(l\)
 \(\lambda_l^{(SO)}\): synonymous outofframe rate for site \(l\)
\[
\omega_l = \frac{\lambda_l^{(NI)} / \lambda_l^{(NO)}}{\lambda_l^{(SI)} / \lambda_l^{(SO)}}
\]
Natural selection on “framework” region
Natural selection on “framework” region
Natural selection on “framework” region
Conclusion
B cell receptors are “drafted” and “revised” randomly, but
 … with remarkably consistent deletion and insertion patterns
… with remarkably consistent substitution and selection
We can learn about these processes using modelbased inference.
Thank you
 National Science Foundation and National Institute of Health
 University of Washington Center for AIDS Research (CFAR)
 University of Washington eScience Institute
 Ralph & M. (2016). Consistency of VDJ Rearrangement and Substitution Parameters Enables Accurate B Cell Receptor Sequence Annotation. PLOS Computational Biology.
 Ralph & M. (2016). Likelihoodbased inference of Bcell clonal families. arXiv, and in press, PLOS Computational Biology.
 McCoy, Bedford, Minin, Bradley, Robins, & M. (2015). Quantifying evolutionary constraints on Bcell affinity maturation. Phil Trans Royal Soc London (B).
http://matsen.group/ http://BT.CR/
The hierarchical setup
 Substitution rate \(\lambda_s \sim \operatorname{Gamma}(\omega_c, \theta_c)\)
 Gamma mode \(\omega_c \sim \operatorname{Lognormal}(1, 1)\)
 Gamma dispersion \(\theta_c \sim \text{Lévy}(3)\)
 Stationary distribution \(\pi_s \sim \operatorname{Dirichlet}(3, 3, 3, 3)\)
 Branch length \(t \sim \operatorname{Exponential}(0.1)\)
Points = rates; triangles = 95% CI of fit prior
Points = rates; triangles = 95% CI of fit prior
Are fit priors actually helpful?

Hold out data from three very frequently observed genes:
IGHV323D, IGHV461, and IGHV551

Infer “gold standard” parameters for the three genes with a fit prior and, separately, a fixed prior using these data. Set aside.

Fit parameters for the data set using progressively more heldout data using a fit and a fixed prior.

Which one is closer to its goldstandard value?
Data from Adaptive Biotechnologies.
Fit priors are helpful for < 1000 reads
Tree principal components plot:
Points are germline genes.
Genes are positioned according to a principal coordinates projection of the within host evolution of that gene.
Note: the phylogenetic structure does not directly influence the position of the points.
Data: Adaptive individual A, subset to genes with 5000 or more observations
Random facts
 Mean length of D segment in individual A’s naive repertoire is 16.61.
 Subject A’s naive sequences were 37% CDR3
 Divergence between the various germline V genes:
> summary(dist.dna(allele_01, pairwise.deletion=TRUE, model='raw'))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.003846 0.201300 0.344600 0.304700 0.384900 0.539500