B cell receptors: a grand and beautiful evolutionary process

Frederick “Erick” Matsen
Fred Hutchinson Cancer Research Center

Antibodies bind antigens

Too many antigens to code for directly

B cell diversification process

You have an enzyme which incorporates random nucleotides into your genome.


It’s called Terminal deoxynucleotidyl transferase or TdT.

You have an enzyme which is purpose-built to randomly mutate your genome.


It’s called Activation-Induced Deaminase or AID.

We can sequence BCRs in high throughput

Determining homology is hard!

The mutation process is crazy!

High context-sensitivity of mutation

Big, beautiful, messy data for the study of evolution

We don’t need lots of little techniques…

we only need models and one principle: likelihood.

How can we learn using probabilistic models?

HMM intro: dishonest casino

HMM intro: dishonest casino

HMM intro: dishonest casino

HMM intro: dishonest casino

Generative probabilistic model \(\rightarrow\) inference

  1. Develop probabilistic model of biological process and how data is generated.
  2. Find parameter choices that maximize the likelihood of generating the observed data.

When a field of bioinformatics is mature, it becomes statistics.
E.g. Maximum-likelihood phylogenetics, HMMER, DESeq2, etc.

VDJ rearrangement

Biased die roll \(\leftrightarrow\) read base (under mutation)
Switching dice \(\leftrightarrow\) changing between V, D, J genes


VDJ annotation problem: from where did each nucleotide come?

A: Take maximum-likelihood HMM path.

Find clonal families

Find clonal families

Did two sequences come from the same VDJ recombination?

Say we are given two sequences

Double roll \(\leftrightarrow\) Pair HMM

Double roll \(\leftrightarrow\) Pair HMM

Pick double roll hypothesis if it has higher likelihood of generating data.

What about BCRs?

(But we only know the sequences, not the annotations!)

Do two sequences come from a single rearrangement event?

Probability of generating observed sequence \(x\) from HMM:

\[ \mathbb P(x) = \sum_{\text{paths}\ \sigma} \mathbb P(x;\sigma), \]

Probability of generating two sequences \(x\) and \(y\) from the same path through the HMM (i.e. from the same rearrangement event):

\[ \mathbb P(x,y) = \sum_{\text{paths}\ \sigma} \mathbb P(x,y;\sigma), \]

\[ \text{Calculate: } \frac{\mathbb P(x, y)}{\mathbb P(x) \mathbb P(y)} = \frac{\mathbb P(\text{single rearrangement})}{\mathbb P(\text{independent rearrangements})} \]

Do sets of sequences come from a
single rearrangement event?


\[ \frac{\mathbb P(A \cup B)}{\mathbb P(A) \mathbb P(B)} = \frac{\mathbb P(A \cup B \ | \ \text{single rearrangement})}{\mathbb P(A,B \ | \ \text{independent rearrangements})} \]


Use this for agglomerative clustering; stop when the ratio < 1.

What are probabilities?

Distributions are reproducibly weird!

Distributions are reproducibly weird!

Distributions are reproducibly weird!

Sparse data & ML parameter inference
can be dangerous

Say we are flipping a coin with probability \(p\) of giving heads:

\[ \Pr(X=k)={\binom {n}{k}}p^{k}(1-p)^{n-k} \]

Introductory aside: Bayesian estimation

Erick, click here


Priors guide parameter inferences in the
absence of evidence to the contrary.

How to regularize per site?

Highly observed sites inform estimation for rarely observed ones.


Per-site rate estimates for 1 individual

data from Stern, Yaari, Heiden … O’Connor (2014). B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes. Science Trans Med, 6(248), 248ra107.

Per-column mutation prior mean

Per-column mutation prior mean

Between-individual consistency (thickness)

Per-site prior mean between individuals

Natural selection inference



Would like per-site selection inference


\[ \omega \equiv \frac{dN}{dS} \equiv \frac{\hbox{rate of non-synonymous substitution}}{\hbox{rate of synonymous substitution}} \]

but these rates need to be scaled to the neutral substitution process.

Productive vs. out-of-frame receptors


Each cell may carry two IGH alleles, but only one is expressed.


\[ \omega \equiv \frac{dN}{dS} \equiv \frac{\hbox{rate of non-synonymous substitution}}{\hbox{rate of synonymous substitution}} \]


Out-of-frame reads can be used to infer neutral mutation rate!

\(\omega_l\) is a ratio of rates in terms of observed neutral process

  • \(\lambda_l^{(N-I)}\): nonsynonymous in-frame rate for site \(l\)
  • \(\lambda_l^{(N-O)}\): nonsynonymous out-of-frame rate for site \(l\)
  • \(\lambda_l^{(S-I)}\): synonymous in-frame rate for site \(l\)
  • \(\lambda_l^{(S-O)}\): synonymous out-of-frame rate for site \(l\)



\[ \omega_l = \frac{\lambda_l^{(N-I)} / \lambda_l^{(N-O)}}{\lambda_l^{(S-I)} / \lambda_l^{(S-O)}} \]

Natural selection on “framework” region

Natural selection on “framework” region

Natural selection on “framework” region



  • B cell receptors are “drafted” and “revised” randomly, but

  • … with remarkably consistent deletion and insertion patterns
  • … with remarkably consistent substitution and selection


We can learn about these processes using model-based inference.

Thank you

  • National Science Foundation and National Institute of Health
  • University of Washington Center for AIDS Research (CFAR)
  • University of Washington eScience Institute


  • Ralph & M. (2016). Consistency of VDJ Rearrangement and Substitution Parameters Enables Accurate B Cell Receptor Sequence Annotation. PLOS Computational Biology.
  • Ralph & M. (2016). Likelihood-based inference of B-cell clonal families. arXiv, and in press, PLOS Computational Biology.
  • McCoy, Bedford, Minin, Bradley, Robins, & M. (2015). Quantifying evolutionary constraints on B-cell affinity maturation. Phil Trans Royal Soc London (B).

http://matsen.group/     http://B-T.CR/


HMM method outperforms VJ-CDR3 method

The hierarchical setup

  • Substitution rate \(\lambda_s \sim \operatorname{Gamma}(\omega_c, \theta_c)\)
  • Gamma mode \(\omega_c \sim \operatorname{Log-normal}(1, 1)\)
  • Gamma dispersion \(\theta_c \sim \text{Lévy}(3)\)
  • Stationary distribution \(\pi_s \sim \operatorname{Dirichlet}(3, 3, 3, 3)\)
  • Branch length \(t \sim \operatorname{Exponential}(0.1)\)

Points = rates; triangles = 95% CI of fit prior

Points = rates; triangles = 95% CI of fit prior

Are fit priors actually helpful?

  • Hold out data from three very frequently observed genes:
    IGHV3-23D, IGHV4-61, and IGHV5-51
  • Infer “gold standard” parameters for the three genes with a fit prior and, separately, a fixed prior using these data. Set aside.
  • Fit parameters for the data set using progressively more held-out data using a fit and a fixed prior.
  • Which one is closer to its gold-standard value?


Data from Adaptive Biotechnologies.

Fit priors are helpful for < 1000 reads

Tree principal components plot:

Points are germline genes.

Genes are positioned according to a principal coordinates projection of the within host evolution of that gene.


Note: the phylogenetic structure does not directly influence the position of the points.

Data: Adaptive individual A, subset to genes with 5000 or more observations

Random facts

  • Mean length of D segment in individual A’s naive repertoire is 16.61.
  • Subject A’s naive sequences were 37% CDR3
  • Divergence between the various germ-line V genes:
> summary(dist.dna(allele_01, pairwise.deletion=TRUE, model='raw'))
Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.003846 0.201300 0.344600 0.304700 0.384900 0.539500