CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK, DTU
T cell epitope predictions using bioinformatics (neural networks and hidden Markov models)
Morten Nielsen, CBS, BioCentrum, DTU
Processing of intracellular proteins
  MHC binding
What makes a peptide a potential and effective epitope?
  Part of a pathogen protein
  Successful processing
    – Proteasome cleavage
    – TAP binding
  Binds to MHC molecule
  Protein function
    – Early in replication
  Sequence conservation in evolution
From proteins to immunogens
  Lauemøller et al.,
  % processed, 0.5% bind MHC, 50% CTL response
  => 1/2000 peptides are immunogenic
MHC Class I and II
  Class I
    – Peptides 8-12 amino acids long
    – Intracellular pathogen presentation
    – Broad range of bioinformatics prediction tools
  Class II
    – Peptides 13+ amino acids long
    – Intravesicular pathogen presentation
    – Few prediction tools
MHC class I with peptide
Prediction of HLA binding specificity
  Simple motifs
    – Allowed/non-allowed amino acids
  Extended motifs
    – Amino acid preferences (SYFPEITHI)
    – Anchor/preferred/other amino acids
  Hidden Markov models
    – Peptide statistics from sequence alignment
  Neural networks
    – Can take sequence correlations into account
SYFPEITHI database
  Anchors: required for binding
  Auxiliary anchors: help binding
Pattern recognition
  10 peptides from MHCpep database
    – Bind to the MHC complex A*0201
  ALAKAAAAM ALAKAAAAV GMNERPILV GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
  Which of the following are most likely to bind?
    1. FLLTRILTI
    2. WLDQVPFSV
    3. TVILGVLLL
  Regular expression
    – X1 [LMIV]2 X3 ... X8 [MVL]9
    – Predicts that 2 and 3 will bind and 1 will not bind
    – Cannot tell if 2 is more likely to bind than 3
  In truth, 1 and 2 bind, and 1 binds the strongest. 3 does not bind
  A probabilistic model can capture this!
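The regular-expression motif on this slide can be tried directly. The sketch below is a minimal illustration, assuming 9-mer peptides and Python's standard `re` module; the function name `matches_motif` is hypothetical and not part of the original material.

```python
import re

# Anchor motif from the slide: any residue at P1, [LMIV] at P2,
# any residue at P3-P8, and [MVL] at P9 (9-mers only).
MOTIF = re.compile(r".[LMIV].{6}[MVL]")


def matches_motif(peptide: str) -> bool:
    """Return True if the whole peptide satisfies the simple anchor motif."""
    return MOTIF.fullmatch(peptide) is not None


candidates = ["FLLTRILTI", "WLDQVPFSV", "TVILGVLLL"]
for number, peptide in enumerate(candidates, start=1):
    print(number, peptide, matches_motif(peptide))
```

The motif gives only a yes/no answer, so it cannot rank peptides 2 and 3 against each other, and it rejects peptide 1 outright because of the I at P9.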
Probability estimation
  ALAKAAAAM ALAKAAAAV GMNERPILV GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Weight matrices
  Estimate amino acid frequencies from alignment
  Now a weight matrix is given as $W_{ij} = \log(p_{ij} / q_j)$
    – Here i is a position in the motif, and j an amino acid. $q_j$ is the background frequency for amino acid j.
  In nature not all amino acids are found equally often
    – P_A = 0.07, P_W =
    – Finding 6% A is hence not significant, but 6% W is highly significant
  W is an L x 20 matrix, where L is the motif length
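As a rough sketch of the log-odds definition above, the following builds W from a set of aligned 9-mers. Two things are assumptions made only to keep the example runnable: the background q_j defaults to a uniform 1/20 (the slide's point is precisely that real backgrounds are not uniform), and a small constant `pseudo` is added to every count so that unobserved amino acids do not produce log(0). The function name `weight_matrix` is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Peptides from the "Probability estimation" slide.
PEPTIDES = ["ALAKAAAAM", "ALAKAAAAV", "GMNERPILV", "GILGFVFTM",
            "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]


def weight_matrix(peptides, background=None, pseudo=0.01):
    """Return W as a list with one dict per position: W[i][j] = log(p_ij / q_j)."""
    if background is None:
        # Assumption: uniform background frequencies, q_j = 1/20.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    total = len(peptides) + pseudo * len(AMINO_ACIDS)
    matrix = []
    for i in range(len(peptides[0])):
        counts = Counter(p[i] for p in peptides)
        row = {}
        for aa in AMINO_ACIDS:
            p_ij = (counts[aa] + pseudo) / total  # smoothed frequency at position i
            row[aa] = math.log(p_ij / background[aa])
        matrix.append(row)
    return matrix


W = weight_matrix(PEPTIDES)
print(len(W), "positions x", len(W[0]), "amino acids")
print("P2 L:", round(W[1]["L"], 2), " P2 W:", round(W[1]["W"], 2))
```

Passing a dictionary of measured background frequencies as `background` replaces the uniform assumption.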
Scoring sequences to a weight matrix
  [Weight matrix table; columns: A R N D C Q E G H I L K M F P S T W Y V]
  ILYQVPFSV
  ALPYWNFAT
  MTAQWWLDA
  Which peptide is most likely to bind? Which peptide second?
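Scoring a peptide against a weight matrix means summing, over positions, the entry selected by the residue at that position. The numeric matrix from this slide is not available here, so the sketch below substitutes a matrix estimated from the seven peptides on the "Probability estimation" slide, with a uniform background and a small smoothing constant (both assumptions). The resulting scores illustrate the mechanics only and need not reproduce the ranking intended on the slide. The names `build_matrix` and `score_peptide` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order shown on the slide

TRAINING = ["ALAKAAAAM", "ALAKAAAAV", "GMNERPILV", "GILGFVFTM",
            "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]


def build_matrix(peptides, pseudo=0.01):
    """Stand-in log-odds matrix: log(p / q) with q = 1/20 (assumption)."""
    total = len(peptides) + pseudo * len(AMINO_ACIDS)
    matrix = []
    for i in range(len(peptides[0])):
        counts = Counter(p[i] for p in peptides)
        matrix.append({aa: math.log((counts[aa] + pseudo) / total * len(AMINO_ACIDS))
                       for aa in AMINO_ACIDS})
    return matrix


def score_peptide(matrix, peptide):
    """Sum the matrix entries picked out by the residue at each position."""
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))


matrix = build_matrix(TRAINING)
queries = ["ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"]
scores = {peptide: score_peptide(matrix, peptide) for peptide in queries}
for peptide in sorted(queries, key=scores.get, reverse=True):
    print(peptide, round(scores[peptide], 2))
```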
Weight-matrix construction
  Example from real life
  10 peptides from MHCpep database
  Bind the MHC complex
  Estimate sequence motif and weight matrix
  Evaluate on 528 peptides (not included in training)
  ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Pseudo-count and sequence weighting
  ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV (similar sequences, weight 1/5)
  GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
  Limited amount of data
  Poor or biased sampling of sequence space
  I is not found at position P9. Does this mean that I is forbidden?
  No! Use a Blosum substitution matrix to estimate the pseudo frequency of I at P9
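One simple way to arrive at the weight of 1/5 for the five near-identical peptides is to cluster sequences by identity and give each member a weight of one over its cluster size. The sketch below does this with single-linkage clustering. The identity threshold of 0.62 and the function names are assumptions for illustration; the slide does not say which weighting scheme was actually used.

```python
PEPTIDES = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
            "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]


def identity(a, b):
    """Fraction of positions at which two equal-length peptides agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_weights(peptides, threshold=0.62):
    """Single-linkage clustering; each peptide gets weight 1 / (its cluster size)."""
    parent = list(range(len(peptides)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(peptides)):
        for j in range(i + 1, len(peptides)):
            if identity(peptides[i], peptides[j]) >= threshold:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(peptides)):
        sizes[find(i)] = sizes.get(find(i), 0) + 1
    return [1.0 / sizes[find(i)] for i in range(len(peptides))]


weights = cluster_weights(PEPTIDES)
for peptide, weight in zip(PEPTIDES, weights):
    print(peptide, weight)
print("effective number of sequences:", sum(weights))
```

The sum of the weights gives an effective number of sequences that is smaller than the raw count whenever the data are redundant.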
Low count correction using Blosum matrices
  [Table: Blosum62 substitution frequencies for I, L and V]
  Every time, for instance, L or V is observed, I is also likely to occur
  Estimate low (pseudo) count correction using this approach
  As more data are included, the pseudo count correction becomes less important
  [Equation not recovered; its parameters are N_eff, the number of sequences, and a weight on the prior or pseudo count]
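The following sketch shows one common way to combine observed and pseudo frequencies, assumed here because the equation on the slide did not survive: the pseudo frequency is g_b = sum_a f_a * q(b|a), and the corrected frequency is p_a = (alpha * f_a + beta * g_a) / (alpha + beta) with alpha = N_eff - 1. The substitution probabilities are invented toy numbers over a three-letter alphabet, not Blosum62 values, and all names are hypothetical.

```python
# Toy conditional substitution probabilities q(b | a). These numbers are made up
# for illustration only and are NOT taken from Blosum62.
TOY_SUBSTITUTION = {
    "I": {"I": 0.6, "L": 0.2, "V": 0.2},
    "L": {"I": 0.2, "L": 0.7, "V": 0.1},
    "V": {"I": 0.3, "L": 0.1, "V": 0.6},
}


def pseudo_frequencies(observed, substitution):
    """g_b = sum over a of f_a * q(b | a)."""
    return {b: sum(observed[a] * substitution[a][b] for a in observed)
            for b in substitution}


def corrected_frequencies(observed, substitution, n_eff, beta):
    """Blend observed and pseudo frequencies; alpha = n_eff - 1 is an assumption."""
    alpha = n_eff - 1
    g = pseudo_frequencies(observed, substitution)
    return {a: (alpha * observed[a] + beta * g[a]) / (alpha + beta) for a in observed}


# Hypothetical observed frequencies at P9: I never seen, L and V seen.
observed_p9 = {"I": 0.0, "L": 0.2, "V": 0.8}
few_data = corrected_frequencies(observed_p9, TOY_SUBSTITUTION, n_eff=6, beta=50)
many_data = corrected_frequencies(observed_p9, TOY_SUBSTITUTION, n_eff=500, beta=50)
print("I at P9 with few sequences: ", round(few_data["I"], 3))
print("I at P9 with many sequences:", round(many_data["I"], 3))
```

With few sequences the corrected frequency of I is pulled toward its pseudo frequency; with many sequences it moves back toward the observed value, which is the behaviour described in the last bullet.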
Example from real life (cont.)
  Raw sequence counting
    – No sequence weighting
    – No pseudo count
    – Prediction accuracy 0.45
  Sequence weighting
    – No pseudo count
    – Prediction accuracy 0.5
Example from real life (cont.)
  Sequence weighting and pseudo count
    – Prediction accuracy 0.60
  Sequence weighting, pseudo count and anchor weighting
    – Prediction accuracy 0.72
Example from real life (cont.)
  Sequence weighting, pseudo count and anchor weighting
    – Prediction accuracy 0.72
  Motif found on all data (485)
    – Prediction accuracy 0.79
Training on small data sets
  [Figure panels: Class I, Class II]
  Using a biased weight matrix with differential weight on anchor positions gives reliable performance for N ~ 20-50
  Lundegaard et al. 2004
How to predict
  The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations).
    – Two adjacent amino acids may, for example, compete for the space in a pocket in the MHC molecule.
  Artificial neural networks (ANN) are ideally suited to take such correlations into account
Neural networks
  Neural networks can learn higher order correlations!
    – What does this mean?
  0 0 => 0; 0 1 => 1; 1 0 => 1; 1 1 => 0
  No linear function can learn this pattern
Learning higher order correlation
  0 0 => 0; 1 0 => 1; 1 1 => 0; 0 1 => 1
  [Figure: network without hidden layer (inputs X1, X2; weights W1, W2). Has no solution!]
  [Figure: network with hidden layer (inputs X1, X2; weights W11, W12, W21, W22; hidden units h1, h2; output weights V1, V2). Solution]
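The pattern on this slide is the XOR function. The sketch below shows one hand-picked set of weights for a network with two hidden units that reproduces it, and a brute-force search over a small weight grid that finds no single threshold unit doing the same. The specific weights, the step activation and the grid are illustrative assumptions, not the values from the slide's figure.

```python
from itertools import product

XOR = {(0, 0): 0, (1, 0): 1, (1, 1): 0, (0, 1): 1}


def step(x):
    """Threshold activation: 1 if the input is positive, otherwise 0."""
    return 1 if x > 0 else 0


def hidden_layer_net(x1, x2):
    """Two hidden units followed by one output unit, with hand-picked weights."""
    h1 = step(x1 + x2 - 0.5)    # on if at least one input is on (OR)
    h2 = step(x1 + x2 - 1.5)    # on only if both inputs are on (AND)
    return step(h1 - h2 - 0.5)  # OR but not AND


def linear_unit_exists(grid):
    """Search the grid for w1, w2, bias such that step(w1*x1 + w2*x2 + bias) is XOR."""
    for w1, w2, bias in product(grid, repeat=3):
        if all(step(w1 * x1 + w2 * x2 + bias) == y for (x1, x2), y in XOR.items()):
            return True
    return False


grid = [k / 2 for k in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
print("hidden-layer net:", {inputs: hidden_layer_net(*inputs) for inputs in XOR})
print("single linear unit found on grid:", linear_unit_exists(grid))
```

The search comes up empty on any grid, since a single unit would need bias <= 0, w1 + bias > 0 and w2 + bias > 0, which together force w1 + w2 + bias > 0 and contradict the 1 1 => 0 case.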
Mutual information
  $I(i,j) = \sum_{aa_i} \sum_{aa_j} P(aa_i, aa_j) \log \frac{P(aa_i, aa_j)}{P(aa_i) P(aa_j)}$
  Peptides (positions P1 and P6 marked on the slide):
  ALWGFFPVA ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS YMNGTMSQV GILGFVFTL WLSLLVPFV FLPSDFFPS
  P(G1) = 2/9 = 0.22, ..
  P(V6) = 4/9 = 0.44, ..
  P(G1,V6) = 2/9 = 0.22
  P(G1)*P(V6) = 8/81 = 0.10
  log(0.22/0.10) > 0
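The mutual information between two positions can be computed directly from the nine peptides on this slide. This is a minimal sketch using raw frequencies with no pseudo counts or sequence weighting; the natural logarithm and the function name `mutual_information` are choices made here, not taken from the slide.

```python
import math
from collections import Counter

PEPTIDES = ["ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
            "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS"]


def mutual_information(peptides, i, j):
    """I(i,j) = sum over observed pairs of P(a,b) * log(P(a,b) / (P(a) * P(b)))."""
    n = len(peptides)
    count_i = Counter(p[i] for p in peptides)
    count_j = Counter(p[j] for p in peptides)
    count_ij = Counter((p[i], p[j]) for p in peptides)
    total = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        # Unobserved pairs have P(a,b) = 0 and contribute nothing to the sum.
        total += p_ab * math.log(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return total


# Positions are 0-based in the code: P1 is index 0 and P6 is index 5.
p_g1 = sum(p[0] == "G" for p in PEPTIDES) / len(PEPTIDES)
p_v6 = sum(p[5] == "V" for p in PEPTIDES) / len(PEPTIDES)
p_g1_v6 = sum(p[0] == "G" and p[5] == "V" for p in PEPTIDES) / len(PEPTIDES)
print("P(G1), P(V6), P(G1,V6):", round(p_g1, 2), round(p_v6, 2), round(p_g1_v6, 2))
print("I(P1, P6):", round(mutual_information(PEPTIDES, 0, 5), 3))
```

With only nine sequences the estimate is heavily inflated by sampling noise, which is why comparing against the same number of random peptides is informative.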
Epitope predictions
  Mutual information
  [Figure panels: 313 binding peptides, 313 random peptides]
Choice of method
  Neural networks are superior when trained on large data sets
  Simple and extended motif methods when little or no data is available
  HMM/weight matrices with position-specific differential weight otherwise
    – Increase weight on anchor positions
Evaluation of prediction accuracy
  ROC curves
    – True positive proportion = TP/AP
    – False positive proportion = FP/AN
    – [Figure: ROC curves with A_roc = 0.5 and A_roc = 0.8; regions labelled TP, FP, AP, AN]
  Pearson correlation
Construction of ROC curves
  True positive proportion = TP/AP
  False positive proportion = FP/AN
  [Figure: ROC curves with A_roc = 0.5 and A_roc = 0.8]
  [Table columns: Number, Sequence, Assignment, Prediction; numeric values not recovered]
  ILYQVPFSV YLEPGPVTV GLMTAVYLV YLDLALMSV GLYSSTVPV HLYQGCQVV RMYGVLPWI FLPWHRLFL LLPSLFLLL ILSSLGLPV
  FLLTRILTI ILDEAYVMA VVMGTLVAL MALLRLPLV MLQDMAILT KILSVFFLA ILTVILGVL ALAKAAAAA LVSLLTFMI ALPYWNFAT
  >0.5: AP (16); <0.5: AN (4)
  TP=3, FP=0; TP=11, FP=1; TP=16, FP=4
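A ROC curve is built by sorting the peptides by predicted score and lowering the threshold one peptide at a time, recording FP/AN and TP/AP at each step. The sketch below does this and integrates the curve with the trapezoid rule. The scores and labels are hypothetical, since the numeric columns of the slide's table are not available, and the code assumes no tied scores.

```python
def roc_points(scores, labels):
    """Return (FP/AN, TP/AP) points as the threshold sweeps from high to low."""
    actual_positives = sum(labels)
    actual_negatives = len(labels) - actual_positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / actual_negatives, tp / actual_positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical predictions; label 1 means measured binder (AP), 0 non-binder (AN).
scores = [0.9, 0.8, 0.7, 0.6, 0.1]
labels = [1, 1, 0, 1, 0]
points = roc_points(scores, labels)
print(points)
print("A_roc:", area_under_curve(points))
```

In the same terms, the three thresholds listed on the slide correspond to the points (0/4, 3/16), (1/4, 11/16) and (4/4, 16/16).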
Epitope predictions
  Sequence motif and HMMs
  Sequence motif: cc: 0.76, A_roc: 0.92
  HMM: cc: 0.80, A_roc: 0.95
Epitope prediction
  Neural networks: cc: 0.91, A_roc: 0.98
Evaluation of prediction accuracy
Location of class I epitopes
  GP1200 protein
  Structure (1GM9)
Hepatitis C virus
  Epitope predictions
MHC Class II binding
  TEPITOPE. Virtual matrices (Hammer, J., Current Opinion in Immunology 7, , 1995)
  PROPRED. Quantitative matrices (Singh H, Raghava GP, Bioinformatics 2001 Dec;17(12):1236-7)
    – Web interface
  Gibbs sampler (Nielsen et al., Bioinformatics. Improved prediction of MHC class I and II epitopes using a novel Gibbs sampler approach)
MHC class II prediction
  Complexity of problem
    – Peptides of different length
    – Weak motif signal
  Alignment crucial
  Gibbs Monte Carlo sampler
  RFFGGDRGAPKRG
  YLDPLIRGLLARPAKLQV
  KPGQPPRLLIYDASNRATGIPA
  GSLFVYNITTNKYKAFLDKQ
  SALLSSDITASVNCAK
  PKYVHQNTLKLAT
  GFKGEQGPKGEP
  DVFKELKVHHANENI
  SRYWAIRTRSGGI
  TYSTNEIDLQLSQEDGQTIE
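To make the alignment problem concrete, here is a minimal textbook-style Gibbs motif sampler applied to the peptides above. It is a sketch only and not the published method cited in this talk: it assumes a 9-residue binding core, a uniform background, a flat pseudo count of 1, and plain Gibbs resampling without any temperature schedule. All names are hypothetical.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
CORE = 9  # assumed length of the binding core

PEPTIDES = ["RFFGGDRGAPKRG", "YLDPLIRGLLARPAKLQV", "KPGQPPRLLIYDASNRATGIPA",
            "GSLFVYNITTNKYKAFLDKQ", "SALLSSDITASVNCAK", "PKYVHQNTLKLAT",
            "GFKGEQGPKGEP", "DVFKELKVHHANENI", "SRYWAIRTRSGGI",
            "TYSTNEIDLQLSQEDGQTIE"]


def log_odds(cores, pseudo=1.0):
    """Log-odds matrix from aligned cores, uniform background (assumption)."""
    total = len(cores) + pseudo * len(AMINO_ACIDS)
    matrix = []
    for i in range(CORE):
        counts = Counter(core[i] for core in cores)
        matrix.append({aa: math.log((counts[aa] + pseudo) / total * len(AMINO_ACIDS))
                       for aa in AMINO_ACIDS})
    return matrix


def core_score(matrix, core):
    """Sum of matrix entries along one candidate core."""
    return sum(matrix[i][aa] for i, aa in enumerate(core))


def gibbs_align(peptides, sweeps=200, seed=0):
    """Return one core start offset per peptide after repeated resampling."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(p) - CORE + 1) for p in peptides]
    for _ in range(sweeps):
        for k, peptide in enumerate(peptides):
            # Build the motif from every peptide except the one being resampled.
            others = [p[s:s + CORE]
                      for n, (p, s) in enumerate(zip(peptides, starts)) if n != k]
            matrix = log_odds(others)
            offsets = list(range(len(peptide) - CORE + 1))
            weights = [math.exp(core_score(matrix, peptide[o:o + CORE])) for o in offsets]
            starts[k] = rng.choices(offsets, weights=weights)[0]
    return starts


starts = gibbs_align(PEPTIDES, sweeps=50, seed=1)
for peptide, start in zip(PEPTIDES, starts):
    print(peptide[start:start + CORE], "from", peptide)
```

With only ten peptides and a weak motif, different seeds can settle on different alignments, which illustrates why the alignment step is described as crucial.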
Class II binding motif
  RFFGGDRGAPKRG
  YLDPLIRGLLARPAKLQV
  KPGQPPRLLIYDASNRATGIPA
  GSLFVYNITTNKYKAFLDKQ
  SALLSSDITASVNCAK
  PKYVHQNTLKLAT
  GFKGEQGPKGEP
  DVFKELKVHHANENI
  SRYWAIRTRSGGI
  TYSTNEIDLQLSQEDGQTI
  [Figure panels: Gibbs sampler motif; Alignment by Gibbs sampler]
MHC class II predictions
  Allele DRB1_0401
  [Figure: Accuracy]
Summary
  The binding motif of class I MHC is well characterized by HMM/weight matrices
    – Even when limited data are available
  Neural networks can be trained to predict MHC binding with high accuracy
    – NN can include higher order sequence correlations
  The MHC class II peptide binding motif can be described using a Gibbs sampler algorithm