Predicting T Cell Receptor Specificity with Deep Learning

Published in

Becoming Human: Artificial Intelligence Magazine

5 min readFeb 23, 2019

T cells attacking a cancer cell (Image source: https://www.immunology.org/publications/immunology-news/immunology-news-september-2016/car-tcell-immunotherapy-%E2%80%93-perspective)

Our immune system protects us from various infectious diseases and cancers, and T cells are “foot soldiers” of the immune system and are essential for protective immunity. T cells recognize infected and cancer cells via the interactions between T cell receptors (TCRs) that are expressed on the surface of T cells and peptide-major histocompatibility complex (MHC) complexes that are expressed on the surface of target cells. The peptides that are presented by MHC molecules are termed T cell epitopes and are derived from pathogenic or tumor antigens. There are many tools available to predict T cell epitopes, i.e. which peptides derived from the antigen of interest will be presented by host MHC molecules. However, the antigen specificity of T cells is much less well studied. In other words, for a given TCR, can we predict what epitope it recognizes? In this post, we will try to tackle this problem using deep learning.

T cell recognition of peptide-MHC complex (Image adapted from Immatics)

Data

I downloaded the data from VDJdb, which is a curated database hosting over thirty-thousand TCR sequences with known antigen specificities from human, monkey, and mouse. As shown below, each entry is a TCR. The majority of TCRs consists of two chains including alpha (TRA) and beta (TRB), of which TRB is more important for determining the antigen specificity of TCRs. Each TCR chain has a variable (V) segment and a joining (J) segment, and the V segment dictates the antigen specificity of a TCR. There are three complementarity-determining regions (CDRs), namely CDR1, CDR2, and CDR3, of which CDR3 is the most important for determining antigen specificity. The database also includes the MHC and epitope that the TCR binds to as well as the organism and antigen that the epitope is derived from. Please note that each TCR is specific for one epitope, whereas one epitope can be recognized by multiple TCRs. In the screenshot below, all the TCRs recognize the epitope LLWNGPMAV derived from the NS4B of yellow fever virus.

A snapshot of VDJdb. Each entry is a TCR.

Defining the problem

There are a few hundred unique epitopes (i.e. TCR specificities) in the database. Here, we consider a simplified case: we pick two epitopes and build a model using deep learning to predict which one of the two epitopes a TCR recognizes, which is essentially a binary classification problem like cats and dogs.

Our TCR specificity prediction problem is basically binary classification, instead of cat versus dog, we predict two epitopes. (Image source: https://www.google.com/about/main/machine-learning-qa/)

Building a deep learning model based on embedding and CNN

We only consider human TRB chains in this post and will first count the number of TRB chains for each epitope and then pick the top two. We will then build a classifier for these two epitopes, which are essentially our cat and dog.

As we can see, the epitopes NLVPMVATV and GILGFVFTL have a few thousand TCRs and should give us enough data to train our model. For each TCR, we will use its CDR3 sequence, V segment, J segment, and MHC alpha chain as features for prediction. Since CDR3 sequences have different lengths, we add 0 paddings to make them of the same length. Let’s take a look at the preprocessed data:

Preprocessed data

We will use embeddings for V segments, J segments, and MHC alpha chains. For CDR3 sequences, we will first use embeddings for the amino acids that constitute CDR3 sequences and then apply 1-dimensional convolutional neural network (CNN) to the sequences. Finally, we will concatenate them together and feed them through a few fully connected layers. Please read my previous posts Detecting Pneumonia with Deep Learning and Predicting peptide immunogenicity with deep learning for more details on CNN and embeddings. The overall neural network structure looks like below:

Implementation

The model was implemented using the fastai library, and the implementation details can be found in the accompanying notebook.

Results

The model achieved an AUC score of 0.77 on a test set that was not touched during training. The confusion matrix shows that the classifier performs better on NLVPMVATV-specific TCRs than GLGFVFTL-specific ones.

Conclusions

Here we built a binary classifier for TCRs that are specific for two epitopes. The ultimate goal would be to predict the specificity of any TCR and decipher the determinants of TCR specificity. This type of research would have broad applications for the development of novel therapies and vaccines against infectious diseases and cancers.