Antibodies bind antigens

Too many antigens to code for directly

B cell diversification process

You have an enzyme which incorporates random nucleotides into your genome.

It’s called Terminal deoxynucleotidyl transferase or TdT.

You have an enzyme which is purpose-built to randomly mutate your genome.

It’s called Activation-Induced Deaminase or AID.

Tas et al. 2016

We can sequence BCRs in high throughput

Determining homology is hard!

The mutation process is crazy!

High context-sensitivity of mutation

Big, beautiful, messy data for the study of evolution

We don’t need lots of little techniques…

we only need models and one principle: likelihood.

How can we learn using probabilistic models?

HMM intro: dishonest casino

Generative probabilistic model \(\rightarrow\) inference

Develop probabilistic model of biological process and how data is generated.
Find parameter choices that maximize the likelihood of generating the observed data.

When a field of bioinformatics is mature, it becomes statistics.
E.g. Maximum-likelihood phylogenetics, HMMER, DESeq2, etc.

VDJ rearrangement

Biased die roll \(\leftrightarrow\) read base (under mutation)
Switching dice \(\leftrightarrow\) changing between V, D, J genes

VDJ annotation problem: from where did each nucleotide come?

A: Take maximum-likelihood HMM path.

Find clonal families

Did two sequences come from the same VDJ recombination?

Say we are given two sequences

Double roll \(\leftrightarrow\) Pair HMM

Pick double roll hypothesis if it has higher likelihood of generating data.

What about BCRs?

(But we only know the sequences, not the annotations!)

Do two sequences come from a single rearrangement event?

Probability of generating observed sequence \(x\) from HMM:

\[ \mathbb P(x) = \sum_{\text{paths}\ \sigma} \mathbb P(x;\sigma), \]

Probability of generating two sequences \(x\) and \(y\) from the same path through the HMM (i.e. from the same rearrangement event):

\[ \mathbb P(x,y) = \sum_{\text{paths}\ \sigma} \mathbb P(x,y;\sigma), \]

\[ \text{Calculate: } \frac{\mathbb P(x, y)}{\mathbb P(x) \mathbb P(y)} = \frac{\mathbb P(\text{single rearrangement})}{\mathbb P(\text{independent rearrangements})} \]

Do sets of sequences come from a
single rearrangement event?

\[ \frac{\mathbb P(A \cup B)}{\mathbb P(A) \mathbb P(B)} = \frac{\mathbb P(A \cup B \ | \ \text{single rearrangement})}{\mathbb P(A,B \ | \ \text{independent rearrangements})} \]

Use this for agglomerative clustering; stop when the ratio < 1.

What are probabilities?

Distributions are reproducibly weird!

Sparse data & ML parameter inference
can be dangerous

Say we are flipping a coin with probability \(p\) of giving heads:

\[ \Pr(X=k)={\binom {n}{k}}p^{k}(1-p)^{n-k} \]

Introductory aside: Bayesian estimation

Erick, click here

Priors guide parameter inferences in the
absence of evidence to the contrary.

How to regularize per site?

Highly observed sites inform estimation for rarely observed ones.

–>

Per-site rate estimates for 1 individual

data from Stern, Yaari, Heiden … O’Connor (2014). B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes. Science Trans Med, 6(248), 248ra107.

Per-column mutation prior mean

Between-individual consistency (thickness)

Per-site prior mean between individuals

Natural selection inference

Would like per-site selection inference

\[ \omega \equiv \frac{dN}{dS} \equiv \frac{\hbox{rate of non-synonymous substitution}}{\hbox{rate of synonymous substitution}} \]

but these rates need to be scaled to the neutral substitution process.

Productive vs. out-of-frame receptors

Each cell may carry two IGH alleles, but only one is expressed.

\[ \omega \equiv \frac{dN}{dS} \equiv \frac{\hbox{rate of non-synonymous substitution}}{\hbox{rate of synonymous substitution}} \]

Out-of-frame reads can be used to infer neutral mutation rate!

\(\omega_l\) is a ratio of rates in terms of observed neutral process

\(\lambda_l^{(N-I)}\): nonsynonymous in-frame rate for site \(l\)
\(\lambda_l^{(N-O)}\): nonsynonymous out-of-frame rate for site \(l\)
\(\lambda_l^{(S-I)}\): synonymous in-frame rate for site \(l\)
\(\lambda_l^{(S-O)}\): synonymous out-of-frame rate for site \(l\)

\[ \omega_l = \frac{\lambda_l^{(N-I)} / \lambda_l^{(N-O)}}{\lambda_l^{(S-I)} / \lambda_l^{(S-O)}} \]

Natural selection on “framework” region

Conclusion

B cell receptors are “drafted” and “revised” randomly, but
… with remarkably consistent deletion and insertion patterns
… with remarkably consistent substitution and selection

We can learn about these processes using model-based inference.

Thank you

National Science Foundation and National Institute of Health
University of Washington Center for AIDS Research (CFAR)
University of Washington eScience Institute

Ralph & M. (2016). Consistency of VDJ Rearrangement and Substitution Parameters Enables Accurate B Cell Receptor Sequence Annotation. PLOS Computational Biology.
Ralph & M. (2016). Likelihood-based inference of B-cell clonal families. arXiv, and in press, PLOS Computational Biology.
McCoy, Bedford, Minin, Bradley, Robins, & M. (2015). Quantifying evolutionary constraints on B-cell affinity maturation. Phil Trans Royal Soc London (B).

http://matsen.group/ http://B-T.CR/

Addenda

HMM method outperforms VJ-CDR3 method

The hierarchical setup

Substitution rate \(\lambda_s \sim \operatorname{Gamma}(\omega_c, \theta_c)\)
Gamma mode \(\omega_c \sim \operatorname{Log-normal}(1, 1)\)
Gamma dispersion \(\theta_c \sim \text{Lévy}(3)\)
Stationary distribution \(\pi_s \sim \operatorname{Dirichlet}(3, 3, 3, 3)\)
Branch length \(t \sim \operatorname{Exponential}(0.1)\)

Points = rates; triangles = 95% CI of fit prior

Are fit priors actually helpful?

Hold out data from three very frequently observed genes:
IGHV3-23D, IGHV4-61, and IGHV5-51
Infer “gold standard” parameters for the three genes with a fit prior and, separately, a fixed prior using these data. Set aside.
Fit parameters for the data set using progressively more held-out data using a fit and a fixed prior.
Which one is closer to its gold-standard value?

Data from Adaptive Biotechnologies.

Fit priors are helpful for < 1000 reads

Tree principal components plot:

Points are germline genes.

Genes are positioned according to a principal coordinates projection of the within host evolution of that gene.

Note: the phylogenetic structure does not directly influence the position of the points.

Data: Adaptive individual A, subset to genes with 5000 or more observations

Random facts

Mean length of D segment in individual A’s naive repertoire is 16.61.
Subject A’s naive sequences were 37% CDR3
Divergence between the various germ-line V genes:

> summary(dist.dna(allele_01, pairwise.deletion=TRUE, model='raw'))
Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.003846 0.201300 0.344600 0.304700 0.384900 0.539500

B cell receptors: a grand and beautiful evolutionary process

Antibodies bind antigens

Too many antigens to code for directly

B cell diversification process

We can sequence BCRs in high throughput

Determining homology is hard!

The mutation process is crazy!

High context-sensitivity of mutation

Big, beautiful, messy data for the study of evolution

We don’t need lots of little techniques…

HMM intro: dishonest casino

HMM intro: dishonest casino

HMM intro: dishonest casino

HMM intro: dishonest casino

VDJ rearrangement

Find clonal families

Find clonal families

Say we are given two sequences

Double roll \(\leftrightarrow\) Pair HMM

Double roll \(\leftrightarrow\) Pair HMM

Do two sequences come from a single rearrangement event?

Do sets of sequences come from a single rearrangement event?

What are probabilities?

Distributions are reproducibly weird!

Distributions are reproducibly weird!

Distributions are reproducibly weird!

Sparse data & ML parameter inference can be dangerous

Introductory aside: Bayesian estimation

How to regularize per site?

Per-site rate estimates for 1 individual

Per-column mutation prior mean

Per-column mutation prior mean

Between-individual consistency (thickness)

Per-site prior mean between individuals

Natural selection inference

Would like per-site selection inference

Productive vs. out-of-frame receptors

\(\omega_l\) is a ratio of rates in terms of observed neutral process

Natural selection on “framework” region

Natural selection on “framework” region

Natural selection on “framework” region

Conclusion

Thank you

Addenda

HMM method outperforms VJ-CDR3 method

The hierarchical setup

Points = rates; triangles = 95% CI of fit prior

Points = rates; triangles = 95% CI of fit prior

Are fit priors actually helpful?

Fit priors are helpful for < 1000 reads

Random facts

Do sets of sequences come from a
single rearrangement event?

Sparse data & ML parameter inference
can be dangerous