Evaluation of machine learning methods to predict peptide binding to MHC Class I proteins
Bhattacharya, Sivakumar, Tokheim, Beleva Guthrie, Anagnostou, Velculescu, Karchin. Presenter: Laura Zhou
Objectives
Assess the performance of the most widely used and recently published machine learning methods for predicting peptide-MHC binding:
Prediction performance (Kendall-Tau, F1, AUC)
Prediction speed
Methods compared:
NetMHC / NetMHCpan: single-layer fully connected neural networks
MHCflurry: neural network
MHCnuggets (authors' method): gated recurrent unit (GRU) neural network
SMMPMBEC: Bayesian framework for regularized least-squares regression
HLA-CNN: deep-learning convolutional neural network
Outline
Motivation
Biology Background
Methods: MHCnuggets
Methods: Prediction Performance Evaluators
Application: Data Overview
Results: Prediction Performance and Speed
Motivation
Peptide binding to the Major Histocompatibility Complex (MHC) is critical to the human immune response.
Recent advances in cancer immunotherapy have increased the need to determine which peptides will bind to MHC proteins.
In particular, peptides carrying somatic mutations specific to a patient's tumor (neo-antigens) can inform treatment.
Experimental characterization of peptide-MHC binding is expensive and time-consuming, motivating in silico (computer modeling/simulation) methods to predict peptide-MHC affinities.
MHC and Immune Process Intro
MHC proteins are present on almost all cells and present antigens from pathogens (on macrophages) or self-peptides (on the body's own cells) to the body's receptors.
A T cell recognizes the foreign fragment attached to the MHC molecule and stimulates an immune response.
MHC is critical to the activation of cytotoxic T cells and to determining an organism's immunological response to transplanted organs.
MHC Intro cont.
MHC I: mediates destruction of infected or malignant host cells (cellular immunity) by interacting with CD8 molecules on the surfaces of cytotoxic T cells; binds peptides of 8-11 amino acids.
MHC II: creates immunological memory after an initial response to a specific pathogen, producing an enhanced response to subsequent encounters with that pathogen (adaptive immunity), by interacting with CD4 molecules on the surfaces of helper T cells; binds longer peptides (typically ~13-25 amino acids).
Binding affinity: the strength of the interaction between a peptide and MHC.
MHCnuggets Method
Gated recurrent unit (GRU) neural network architecture
Accepts peptides of any length
Each amino acid represented as a 21-dimensional smoothed one-hot encoded vector (0.9 and 0.005 replace 1 and 0)
Each network has a fully connected layer of 64 hidden units
Final output layer is a single sigmoid unit: $y = \max(0, 1 - \log_{50{,}000}(\mathrm{IC50}))$
One-hot encoding: a process where categorical variables are converted into a form that allows ML algorithms to do a better job in prediction; essentially it is reference coding
(The reason for using 0.9 and 0.005 rather than 1 and 0 is not stated)
Example: a 9-amino-acid peptide becomes a 21×9 matrix of 0.9s and 0.005s, as in the sketch below
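A minimal Python sketch of this smoothed one-hot encoding, assuming a 21-symbol alphabet (20 amino acids plus one extra symbol); the alphabet ordering and the helper name are illustrative, not taken from the paper:

```python
import numpy as np

# Hypothetical alphabet: 20 amino acids plus 'X' for unknown/padding.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"

def smooth_one_hot(peptide, on=0.9, off=0.005):
    """Encode a peptide as a 21 x len(peptide) matrix of 0.9/0.005 values."""
    mat = np.full((len(ALPHABET), len(peptide)), off)
    for pos, residue in enumerate(peptide):
        mat[ALPHABET.index(residue), pos] = on
    return mat

# A 9-residue peptide yields the 21 x 9 matrix described on the slide.
print(smooth_one_hot("SIINFEKLL").shape)  # (21, 9)
```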
MHCnuggets cont.
A separate network is trained for each MHC allele, with a transfer learning protocol applied (sketched below):
1. Train a network for the allele with the largest number of training examples
2. Use these weights to train networks for all other MHC alleles
3. Test all training examples of every allele for performance (e.g., AUC) using all networks from step 2
4. If the network that performed best was not the original network, do a second round of training with the best-performing network's weights
Number of hidden units, dropout rate, and number of training epochs were estimated by 3-fold cross-validation on the HLA allele with the largest number of entries
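A Python-style sketch of this protocol; `train_network`, `auc_on`, and the data layout are hypothetical stand-ins, not the authors' code:

```python
def transfer_learn(data_by_allele):
    """data_by_allele: dict mapping each MHC allele to its training examples."""
    # Step 1: train on the allele with the most training examples.
    base = max(data_by_allele, key=lambda a: len(data_by_allele[a]))
    nets = {base: train_network(data_by_allele[base])}
    # Step 2: initialize every other allele's network from the base weights.
    for allele in data_by_allele:
        if allele != base:
            nets[allele] = train_network(data_by_allele[allele],
                                         init=nets[base].weights)
    # Step 3: score each allele's training examples against all networks.
    for allele in data_by_allele:
        best = max(nets, key=lambda b: auc_on(nets[b], data_by_allele[allele]))
        # Step 4: retrain from the best network's weights if it is not its own.
        if best != allele:
            nets[allele] = train_network(data_by_allele[allele],
                                         init=nets[best].weights)
    return nets
```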
Performance Assessment Statistics
We have $n$ pairs of experimental and predicted binding affinities $(x_{\mathrm{exp},k}, y_{\mathrm{pred},k})$.
Kendall-Tau correlation: $\tau = \dfrac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}$, where $n_0 = n(n-1)/2$, $n_1 = \sum_i t_i(t_i - 1)/2$, and $n_2 = \sum_j u_j(u_j - 1)/2$.
$n_c$: number of concordant pairs; $n_d$: number of discordant pairs; $n$: total number of experimental/predicted binding affinity pairs; $t_i$: number of tied values in the $i$th group of tied experimental affinities; $u_j$: number of tied values in the $j$th group of tied predicted affinities.
F1 score: $F_1 = \dfrac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$, where precision = (number of correctly predicted binders) / (total number of predicted binders) and recall = (number of correctly predicted binders) / (total number of true binders).
AUC: the area under the ROC curve of true positive rate versus false positive rate at every possible score threshold, where FPR = (number of non-binders predicted as binders) / (number of true non-binders).
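A short sketch (not the authors' evaluation code) of all three statistics using scipy and scikit-learn, with the 500 nM binder cutoff used later in the deck:

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import f1_score, roc_auc_score

def evaluate(ic50_true, ic50_pred, cutoff=500.0):
    """Kendall-Tau on continuous affinities; F1 and AUC on binarized ones."""
    ic50_true, ic50_pred = np.asarray(ic50_true), np.asarray(ic50_pred)
    tau, _ = kendalltau(ic50_true, ic50_pred)   # tie-aware, as in the formula
    binder_true = ic50_true < cutoff            # IC50 < 500 nM => binder
    binder_pred = ic50_pred < cutoff
    f1 = f1_score(binder_true, binder_pred)
    # AUC ranks by score; negate IC50 so that a higher score means binder.
    auc = roc_auc_score(binder_true, -ic50_pred)
    return tau, f1, auc
```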
Data
Immune Epitope Database (IEDB): a public set of experimentally characterized peptides and peptide-MHC binding affinities (calculated based on immunofluorescent arrays).
Affinities represented by the half-maximal inhibitory concentration (IC50) of the peptide for the MHC molecule, in nanomolar (nM) units.
IC50: a measure of the effectiveness of a substance in inhibiting a specific biological or biochemical function; high IC50 = low binding affinity.
Processing: Kim et al. generated the data and partitioned it into training and test sets, then processed it with a shell script to remove any peptide in the test set with identical length and >80% sequence identity to a peptide in the training set.
Data Snapshot
Data cont.
Final training (benchmark) set: 106 unique MHC alleles; 137,654 IC50 measurements.
Final test set: 51 unique MHC alleles; 26,888 IC50 measurements.
All peptides in the benchmark set consist of 8-11 amino acid residues.
Applied MHCnuggets
Networks trained for 200 epochs
Backpropagation with the Adam optimizer and a learning rate of 0.001
Regularization with dropout and recurrent dropout probabilities of 0.2
Implemented with the Keras Python package
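A minimal Keras sketch consistent with these settings and the architecture described earlier (GRU over 21-dimensional encodings, a 64-unit fully connected layer, a single sigmoid output); the GRU size and the loss function are assumptions, not taken from the slides:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(None, 21)),   # variable-length peptides, 21-dim encoding
    layers.GRU(64, dropout=0.2, recurrent_dropout=0.2),  # GRU size is assumed
    layers.Dense(64, activation="relu"),                 # 64 hidden units
    layers.Dense(1, activation="sigmoid"),               # y in [0, 1]
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="mse")  # regression on transformed IC50; loss is assumed
# model.fit(X_train, y_train, epochs=200)
```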
Transfer learning, illustrated:
1. Train a network for the allele with the largest number of training examples (Figure A)
2. Use these weights to train networks for all other MHC alleles (Figure A)
3. Test all training examples of every allele for performance (e.g., AUC) using all networks from step 2 (Figure B)
4. If the network that performed best was not the original network, do a second round of training with the best-performing network's weights (Figure C)
The first network trained was for HLA-A*02:01 binding peptides (A0201); its trained weights were used to initialize 50 additional allele-specific networks.
If an allele's prediction performance was not best with its own network, it was retrained (e.g., B4403 or A2501).
Assessment 1: Prediction performance
Predictors trained on the same set of peptide-MHC pairs (except NetMHC and NetMHCpan)
Predictions of peptide-MHC binding affinity assessed with:
Kendall-Tau correlation (continuous prediction)
F1 (binary prediction)
AUC (binary prediction)
Note: binary classifications are calculated from the continuous values: IC50 < 500 nM means binder, otherwise non-binder
Kendall-Tau and F1 are more stringent than AUC
Results: Prediction Performance
NetMHC, NetMHCpan, MHCflurry, and MHCnuggets have the best prediction performance
Results: Prediction speed
Runtime computed for each method on all peptide-MHC pairs in the test set for MHC allele HLA-A*02:01, repeated five times and averaged
Extended to predicting neo-antigens from somatic mutations identified by whole-exome sequencing
For each sample: ~4,128 peptides and 6 potential MHC Class I alleles
TCGA: The Cancer Genome Atlas; HNSC: head and neck cancer samples
Summary of results
MHCnuggets: best prediction performance and speed
Accepts peptides of different lengths (no padding or cutting), which is more biologically suitable
HLA-CNN: lowest prediction performance but fastest speed
Its deep-learning convolutions assume position is not important, but residue position is very important in peptide binding
Did not compare with sNebula and PSSMHCpan
Supplementary Slides
Peptide diversity
Used network modularity: one network for each of the 51 MHC alleles in the test set
Nodes: peptide sequences found in the training set of each allele
Edges: drawn if sequence identity between peptides was > 0 (i.e., they shared at least one amino acid at the same position); weighted by sequence identity, normalized by peptide length
Peptides were cut and padded to length nine (though MHCnuggets did not need this); a sketch of the construction follows
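A sketch (details assumed) of this network construction using networkx; `peptides` would hold one allele's training peptides, already cut/padded to length nine:

```python
import itertools
import networkx as nx

def sequence_identity(p, q):
    """Fraction of positions with identical residues (equal-length peptides)."""
    return sum(a == b for a, b in zip(p, q)) / len(p)

def build_peptide_network(peptides):
    g = nx.Graph()
    g.add_nodes_from(peptides)
    for p, q in itertools.combinations(peptides, 2):
        w = sequence_identity(p, q)
        if w > 0:  # edge only if they share >= 1 residue at the same position
            g.add_edge(p, q, weight=w)
    return g
```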
Peptide diversity cont.
Weighted modularity Q measures how closely knit communities are: large Q means dense connections within communities and sparse connections between them, i.e., higher peptide diversity.
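The slides do not reproduce the formula; the standard Newman weighted modularity, presumably what Q denotes here, is:

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),$$

where $A_{ij}$ is the edge weight between peptides $i$ and $j$, $k_i = \sum_j A_{ij}$ is the weighted degree of node $i$, $m = \tfrac{1}{2}\sum_{ij} A_{ij}$ is the total edge weight, and $\delta(c_i, c_j) = 1$ when nodes $i$ and $j$ are in the same community (0 otherwise).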
Results of allele-specific differences
Number of training examples for each allele
Imbalance between binder and non-binder examples in the training or test sets
Peptide sequence diversity in the training examples
MHCnuggets transfer learning
Transfer learning improved prediction performance (Figure S3, which compares the change in each prediction performance statistic with and without transfer learning)
Brief comparison of other methods
NetMHC/NetMHCpan:
Single-layer fully connected neural networks
Pad/cut operations applied at every possible position, choosing the strongest affinity, for peptides shorter or longer than 9 amino acids
Artificially generate non-binding peptides
NetMHC: separate networks trained for each MHC allele; NetMHCpan: a single network trained for all MHC alleles
MHCflurry:
Neural network that discovers informative amino acid residue encodings and predicts peptide-MHC I binding affinities
Same pad/cut operations as NetMHC/NetMHCpan, but uses the geometric mean of all adjusted versions of the peptide
Augments training data with peptides randomly generated from a uniform distribution
Other methods cont.
SMMPMBEC: Bayesian framework for regularized least-squares regression
HLA-CNN: deep-learning convolutional neural network with an embedding layer, two 1D convolution layers (32 filters each), and a fully connected layer; a network is trained for each MHC allele and each peptide length (a sketch follows)
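A minimal Keras sketch of that HLA-CNN description; the kernel sizes, embedding dimension, and output activation are assumptions not given on the slide:

```python
from tensorflow import keras
from tensorflow.keras import layers

PEPTIDE_LEN = 9            # one network per MHC allele and peptide length
VOCAB, EMBED_DIM = 21, 15  # embedding dimension is an assumption

model = keras.Sequential([
    keras.Input(shape=(PEPTIDE_LEN,)),                    # integer-encoded residues
    layers.Embedding(VOCAB, EMBED_DIM),                   # embedding layer
    layers.Conv1D(32, kernel_size=3, activation="relu"),  # 1D conv, 32 filters
    layers.Conv1D(32, kernel_size=3, activation="relu"),  # 1D conv, 32 filters
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),                # fully connected output
])
```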
Method: NetMHC
Single-layer fully connected neural network
Encodes 9-amino-acid peptides as a 378-length input vector, combining a smoothed one-hot encoding (0.9 and 0.05 replace 1 and 0) and a BLOSUM-62 encoding of each amino acid
Padding or cutting is applied at every possible position for peptides shorter or longer than 9 residues, and the strongest predicted affinity is selected (see the sketch below)
Separate networks are trained for each MHC allele
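An illustrative sketch of the "cut at every possible position" idea for peptides longer than nine residues; `predict_ic50` is a hypothetical scoring function, and deleting a single contiguous block is an assumed reading of the slide, not NetMHC's documented algorithm:

```python
def cut_variants(peptide, target_len=9):
    """All 9-mers obtained by deleting one contiguous block of residues."""
    k = len(peptide) - target_len
    return [peptide[:i] + peptide[i + k:] for i in range(target_len + 1)]

# Score every variant and keep the strongest affinity (lowest IC50):
# best = min(cut_variants("KVFGSLAFLP"), key=predict_ic50)
```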
Method: NetMHCpan
Method: MHCflurry
Method: SMMPMBEC
Method: HLA-CNN (Note: other neural network methods were developed; see supplementary slides)
Other neural network methods developed