CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK (DTU): T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Presentation transcript:

T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)
Morten Nielsen, CBS, BioCentrum, DTU

Processing of intracellular proteins
MHC binding

What makes a peptide a potential and effective epitope?
Part of a pathogen protein
Successful processing
  –Proteasome cleavage
  –TAP binding
Binds to MHC molecule
Protein function
  –Early in replication
Sequence conservation in evolution

From proteins to immunogens
Lauemøller et al.
20% processed
0.5% bind MHC
50% CTL response
=> 1/2000 peptides are immunogenic (0.20 x 0.005 x 0.50 = 0.0005)

MHC Class I and II
Class I
  –Peptides 8-12 amino acids long
  –Intracellular pathogen presentation
  –Broad range of bioinformatical prediction tools
Class II
  –Peptides 13+ amino acids long
  –Intravesicular pathogen presentation
  –Few prediction tools

MHC class I with peptide

Prediction of HLA binding specificity
Simple Motifs
  –Allowed/non allowed amino acids
Extended motifs
  –Amino acid preferences (SYFPEITHI)
  –Anchor/Preferred/other amino acids
Hidden Markov models
  –Peptide statistics from sequence alignment
Neural networks
  –Can take sequence correlations into account

SYFPEITHI database
Anchors: Required for binding
Auxiliary anchor: Helps binding

Pattern recognition
10 peptides from MHCpep database
  –Bind to the MHC complex A*0201
ALAKAAAAM ALAKAAAAV GMNERPILV GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Which of the following are most likely to bind?
  1. FLLTRILTI
  2. WLDQVPFSV
  3. TVILGVLLL
Regular expression
  –X1 [LMIV]2 X3 … X8 [MVL]9
  –Predicts that 2 and 3 will bind and 1 will not bind
  –Cannot tell whether 2 or 3 is more likely to bind
Truth is that 1 and 2 bind, and 1 binds the strongest; 3 does not bind
A probabilistic model can capture this!
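The regular-expression test on this slide can be sketched in a few lines of Python. The pattern below is a direct transcription of X1 [LMIV]2 X3 … X8 [MVL]9; the variable names are illustrative only. It reproduces the yes/no answer of the motif, and nothing more, which is the limitation the slide points out.

```python
import re

# Simple motif from the slide: any residue at P1, [LMIV] at P2,
# any residue at P3-P8, [MVL] at P9.
MOTIF = re.compile(r"^.[LMIV].{6}[MVL]$")

candidates = ["FLLTRILTI", "WLDQVPFSV", "TVILGVLLL"]

# A regular expression only gives a binary answer per peptide.
matches = {pep: bool(MOTIF.match(pep)) for pep in candidates}

for pep, hit in matches.items():
    print(pep, "matches" if hit else "does not match")
```

Peptide 1 fails only because of I at P9, although it is the strongest binder according to the slide; a binary motif cannot express "less preferred but allowed".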

Probability estimation
ALAKAAAAM ALAKAAAAV GMNERPILV GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Weight matrices
Estimate amino acid frequencies from alignment
Now a weight matrix is given as $W_{ij} = \log(p_{ij}/q_j)$
  –Here i is a position in the motif, and j an amino acid. $p_{ij}$ is the estimated frequency of amino acid j at position i, and $q_j$ is the background frequency for amino acid j.
In nature not all amino acids are found equally often
  –$P_A = 0.07$, $P_W = \ldots$
  –Finding 6% A is hence not significant, but 6% W is highly significant
W is an L x 20 matrix, where L is the motif length
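As a minimal sketch of the log-odds formula above, the following Python builds a weight matrix from the seven 9-mers that survive in this transcript. Two things are assumptions and not taken from the slides: the background is set to a uniform 1/20 (a real application would use measured frequencies such as the 0.07 quoted for A), and unobserved amino acids get a small floor frequency so the logarithm is defined. The natural logarithm is used; the slide does not state the base.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# The seven 9-mers listed in the transcript for this example.
peptides = ["ALAKAAAAM", "ALAKAAAAV", "GMNERPILV", "GILGFVFTM",
            "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]

# ASSUMPTION: uniform background q_j = 1/20 for every amino acid.
background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def weight_matrix(seqs, q, floor=1e-3):
    """Return W with W[i][j] = log(p_ij / q_j).

    p_ij is the observed frequency of amino acid j at position i.
    ASSUMPTION: p_ij is floored at `floor` to avoid log(0).
    """
    length = len(seqs[0])
    matrix = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        row = {}
        for aa in AMINO_ACIDS:
            p = max(counts[aa] / len(seqs), floor)
            row[aa] = math.log(p / q[aa])
        matrix.append(row)
    return matrix

W = weight_matrix(peptides, background)
print(len(W), "positions x", len(W[0]), "amino acids")
print("Score for L at P2:", round(W[1]["L"], 2))
```

The floor is a crude stand-in for the pseudo-count correction discussed on later slides.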

Scoring sequences to a weight matrix
Matrix columns: A R N D C Q E G H I L K M F P S T W Y V
ILYQVPFSV
ALPYWNFAT
MTAQWWLDA
Which peptide is most likely to bind? Which peptide second?
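Scoring a peptide against a weight matrix amounts to summing one matrix entry per position. The sketch below illustrates this with a hypothetical toy matrix: the numeric values of the slide's matrix are not in the transcript, so the weights here are invented for illustration and only reward typical anchor residues at P2 and P9. The resulting ranking is therefore a property of the toy weights, not the slide's answer.

```python
# HYPOTHETICAL weights for illustration only; the slide's matrix values
# are not available. Positions without an entry contribute `default`.
toy_matrix = [{} for _ in range(9)]
toy_matrix[1] = {"L": 2.0, "M": 1.5, "I": 1.0, "V": 0.5}   # P2
toy_matrix[8] = {"V": 2.0, "L": 1.5, "I": 1.0, "M": 0.5}   # P9

def score(peptide, matrix, default=0.0):
    """Sum the weight of the observed amino acid at each position."""
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must equal motif length")
    return sum(matrix[i].get(aa, default) for i, aa in enumerate(peptide))

peptides = ["ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"]
scores = {pep: score(pep, toy_matrix) for pep in peptides}
ranked = sorted(peptides, key=scores.get, reverse=True)

for pep in ranked:
    print(pep, scores[pep])
```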

Weight-matrix construction
Example from real life
10 peptides from MHCpep database
Bind the MHC complex
Estimate sequence motif and weight matrix
Evaluate on 528 peptides (not included in training)
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Pseudo-count and sequence weighting
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV } Similar sequences, weight 1/5 each
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Limited number of data
Poor or biased sampling of sequence space
I is not found at position P9. Does this mean that I is forbidden?
No! Use Blosum substitution matrix to estimate pseudo frequency of I at P9
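One simple way to arrive at the weight of 1/5 for the five near-identical peptides is to cluster the sequences and give each member a weight of 1 divided by its cluster size. The sketch below does this with a greedy clustering; the threshold of 8 identical positions out of 9 and the function names are assumptions chosen for this example, not parameters stated on the slide.

```python
peptides = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
            "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]

def identity(a, b):
    """Number of positions at which two equal-length peptides agree."""
    return sum(x == y for x, y in zip(a, b))

def cluster_weights(seqs, min_identical=8):
    """Greedy clustering: a sequence joins the first cluster whose
    representative (first member) it matches at >= min_identical positions.
    Each sequence then gets weight 1 / (size of its cluster)."""
    clusters = []  # each cluster is a list of indices into seqs
    for idx, seq in enumerate(seqs):
        for cluster in clusters:
            if identity(seq, seqs[cluster[0]]) >= min_identical:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    weights = [0.0] * len(seqs)
    for cluster in clusters:
        for idx in cluster:
            weights[idx] = 1.0 / len(cluster)
    return weights

weights = cluster_weights(peptides)
for pep, w in zip(peptides, weights):
    print(pep, w)
print("Sum of weights:", sum(weights))
```

With these weights the five ALAKAAAA* peptides together count as a single sequence, so they no longer dominate the estimated frequencies.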

Low count correction using Blosum matrices
Table: Blosum62 substitution frequencies among I, L and V
Every time, for instance, L/V is observed, I is also likely to occur
Estimate low (pseudo) count correction using this approach
As more data are included the pseudo count correction becomes less important
$N_{eff}$: Number of sequences
$\beta$: Weight on prior or pseudo count
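The slide's formula did not survive in the transcript, so the sketch below uses one common form of the correction as an assumption: the pseudo frequency is g_a = sum_b f_b q(a|b), and the corrected frequency is p_a = (alpha f_a + beta g_a) / (alpha + beta) with alpha = N_eff - 1. The substitution frequencies are hypothetical values over a three-letter toy alphabet, not Blosum62 entries, and beta = 50 is an arbitrary illustrative choice.

```python
ALPHABET = ["I", "L", "V"]

# HYPOTHETICAL conditional substitution frequencies q(a | b), indexed as
# SUBST[b][a]. These are NOT Blosum62 values; each row sums to 1.
SUBST = {
    "I": {"I": 0.6, "L": 0.2, "V": 0.2},
    "L": {"I": 0.2, "L": 0.7, "V": 0.1},
    "V": {"I": 0.3, "L": 0.1, "V": 0.6},
}

def pseudo_frequency(observed):
    """g_a = sum_b f_b * q(a | b)."""
    return {a: sum(observed[b] * SUBST[b][a] for b in ALPHABET)
            for a in ALPHABET}

def corrected(observed, n_eff, beta):
    """ASSUMED form: p_a = (alpha*f_a + beta*g_a) / (alpha + beta),
    with alpha = n_eff - 1."""
    alpha = n_eff - 1
    g = pseudo_frequency(observed)
    return {a: (alpha * observed[a] + beta * g[a]) / (alpha + beta)
            for a in ALPHABET}

# I is never observed, but L and V are, so I receives a pseudo frequency.
observed = {"I": 0.0, "L": 0.5, "V": 0.5}
few = corrected(observed, n_eff=6, beta=50)
many = corrected(observed, n_eff=1000, beta=50)
print("p(I) with few sequences: ", round(few["I"], 4))
print("p(I) with many sequences:", round(many["I"], 4))
```

The second call illustrates the slide's last point: as the number of sequences grows, the observed frequencies dominate and the correction fades.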

Example from real life (cont.)
Raw sequence counting
  –No sequence weighting
  –No pseudo count
  –Prediction accuracy 0.45
Sequence weighting
  –No pseudo count
  –Prediction accuracy 0.5

Example from real life (cont.)
Sequence weighting and pseudo count
  –Prediction accuracy 0.60
Sequence weighting, pseudo count and anchor weighting
  –Prediction accuracy 0.72

Example from real life (cont.)
Sequence weighting, pseudo count and anchor weighting
  –Prediction accuracy 0.72
Motif found on all data (485)
  –Prediction accuracy 0.79

Training on small data sets
Panels: Class I, Class II
Using a biased weight matrix with differential weight on anchor positions gives reliable performance for N ~ 20-50
Lundegaard et al. 2004

How to predict
The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations).
  –Two adjacent amino acids may, for example, compete for the space in a pocket in the MHC molecule.
Artificial neural networks (ANN) are ideally suited to take such correlations into account

Neural networks
Neural networks can learn higher order correlations!
  –What does this mean?
0 0 => 0
0 1 => 1
1 0 => 1
1 1 => 0
No linear function can learn this pattern

Learning higher order correlation
0 0 => 0; 1 0 => 1; 1 1 => 0; 0 1 => 1
Network without hidden layer (inputs X1, X2; weights W1, W2): Has no solution!
Network with hidden layer (inputs X1, X2; hidden units h1, h2; weights W11, W12, W21, W22; output weights V1, V2): Solution
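The contrast between the two networks can be checked with a short sketch. The weights of the hidden-layer network below are hand-picked for illustration (they are not the values from the slide's figure), and the units use a simple step activation. The search over single-unit weights covers only a finite grid, so it illustrates rather than proves that no linear solution exists.

```python
import itertools

# The pattern from the slide (exclusive or).
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(x):
    """Threshold activation: fire when the input is positive."""
    return 1 if x > 0 else 0

def linear_unit(x1, x2, w1, w2, b):
    """Single unit without hidden layer."""
    return step(w1 * x1 + w2 * x2 + b)

def hidden_layer_net(x1, x2):
    """Two hidden units with hand-picked weights (an assumption):
    h1 fires if at least one input is on, h2 only if both are on."""
    h1 = step(x1 + x2 - 0.5)
    h2 = step(x1 + x2 - 1.5)
    return step(h1 - 2 * h2 - 0.5)

# Try every (w1, w2, b) on a grid from -4 to 4 in steps of 0.5.
grid = [k / 2 for k in range(-8, 9)]
linear_solutions = [
    (w1, w2, b)
    for w1, w2, b in itertools.product(grid, repeat=3)
    if all(linear_unit(x1, x2, w1, w2, b) == y for (x1, x2), y in XOR.items())
]

print("Linear solutions found on grid:", len(linear_solutions))
for (x1, x2), y in XOR.items():
    print(x1, x2, "=>", hidden_layer_net(x1, x2), "(target", y, ")")
```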

Mutual information
$I(i,j) = \sum_{aa_i} \sum_{aa_j} P(aa_i, aa_j) \log \frac{P(aa_i, aa_j)}{P(aa_i) P(aa_j)}$
Peptides (positions P1 and P6 marked): ALWGFFPVA ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS YMNGTMSQV GILGFVFTL WLSLLVPFV FLPSDFFPS
P(G1) = 2/9 = 0.22, ...
P(V6) = 4/9 = 0.44, ...
P(G1,V6) = 2/9 = 0.22
P(G1)*P(V6) = 8/81 = 0.10
log(0.22/0.10) > 0
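The mutual information between two positions can be computed directly from the nine peptides on this slide. The sketch below estimates all probabilities as raw frequencies and uses the natural logarithm; both choices, and the function name, are assumptions made for this example.

```python
import math
from collections import Counter

peptides = ["ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
            "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS"]

def mutual_information(seqs, i, j):
    """I(i,j) = sum over observed pairs of P(a,b) * log(P(a,b) / (P(a) P(b))).

    i and j are 0-based positions; probabilities are raw frequencies.
    Pairs that never occur contribute nothing to the sum.
    """
    n = len(seqs)
    pair_counts = Counter((s[i], s[j]) for s in seqs)
    counts_i = Counter(s[i] for s in seqs)
    counts_j = Counter(s[j] for s in seqs)
    total = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        total += p_ab * math.log(p_ab / (p_a * p_b))
    return total

# P1 and P6 in the slide's 1-based numbering are indices 0 and 5.
mi_1_6 = mutual_information(peptides, 0, 5)
print("I(P1, P6) =", round(mi_1_6, 3))
```

With only nine peptides the estimate is noisy; the slide's comparison of binding and random peptides uses 313 sequences each.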

Epitope predictions
Mutual information
Panels: 313 binding peptides, 313 random peptides

Choice of method
Neural networks are superior when trained on many data
Simple and extended motif method when little or no data is available
HMM/weight matrices with position specific differential weight otherwise
  –Increase weight on anchor positions

Evaluation of prediction accuracy
Roc curves
  –True positive proportion = TP/AP
  –False positive proportion = FP/AN
  –TP: true positives, FP: false positives, AP: actual positives, AN: actual negatives
  –Example curves: A_roc = 0.5, A_roc = 0.8
Pearson correlation
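The Pearson correlation named on this slide compares predicted and measured values directly, without a binder/non-binder threshold. A minimal sketch follows; the measured and predicted numbers are hypothetical values invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# HYPOTHETICAL measured and predicted binding values, for illustration only.
measured = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
predicted = [0.85, 0.7, 0.65, 0.3, 0.35, 0.05]

cc = pearson(measured, predicted)
print("cc =", round(cc, 3))
```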

Construction of ROC curves
True positive proportion = TP/AP
False positive proportion = FP/AN
Example curves: A_roc = 0.5, A_roc = 0.8
Table columns: Number, Sequence, Assignment, Prediction
Sequences: ILYQVPFSV YLEPGPVTV GLMTAVYLV YLDLALMSV GLYSSTVPV HLYQGCQVV RMYGVLPWI FLPWHRLFL LLPSLFLLL ILSSLGLPV FLLTRILTI ILDEAYVMA VVMGTLVAL MALLRLPLV MLQDMAILT KILSVFFLA ILTVILGVL ALAKAAAAA LVSLLTFMI ALPYWNFAT
Assignment > 0.5: AP (16); assignment < 0.5: AN (4)
Example thresholds: TP=3, FP=0; TP=11, FP=1; TP=16, FP=4
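The construction of a ROC curve can be sketched as follows: sort the peptides by prediction score, lower the threshold one peptide at a time, and record TP/AP against FP/AN at each step. The labels and scores below are hypothetical (the numeric columns of the slide's table are not in the transcript), and the sketch assumes there are no tied scores.

```python
def roc_points(labels, scores):
    """Return (FP/AN, TP/AP) points while the threshold sweeps from the
    highest score to the lowest. labels: 1 = actual positive, 0 = actual
    negative. ASSUMPTION: no two scores are equal."""
    actual_pos = sum(labels)
    actual_neg = len(labels) - actual_pos
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for k in order:
        if labels[k] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / actual_neg, tp / actual_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve (A_roc)."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# HYPOTHETICAL data: 5 actual positives and 3 actual negatives.
labels = [1, 1, 1, 0, 1, 1, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]

points = roc_points(labels, scores)
a_roc = area_under_curve(points)
print("A_roc =", round(a_roc, 4))
```

A random predictor gives a curve along the diagonal (A_roc = 0.5), and a perfect ranking gives A_roc = 1.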

Epitope predictions
Sequence motif and HMM’s
Sequence motif: cc: 0.76, A_roc: 0.92
HMM: cc: 0.80, A_roc: 0.95

Epitope prediction. Neural Networks
cc: 0.91, A_roc: 0.98

Evaluation of prediction accuracy

Location of class I epitopes
GP1200 protein
Structure (1GM9)

Hepatitis C virus. Epitope predictions

MHC Class II binding
TEPITOPE. Virtual matrices (Hammer, J., Current Opinion in Immunology 7, 1995)
PROPRED. Quantitative matrices (Singh H, Raghava GP, Bioinformatics 2001 Dec;17(12):1236-7)
  –Web interface
Gibbs sampler (Nielsen et al., Bioinformatics, "Improved prediction of MHC class I and II epitopes using a novel Gibbs sampler approach")

MHC class II prediction
Complexity of problem
  –Peptides of different length
  –Weak motif signal
Alignment crucial
Gibbs Monte Carlo sampler
RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQV
KPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTIE

Class II binding motif
RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQV
KPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTI
Figure panels: Gibbs sampler motif; Alignment by Gibbs sampler
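The idea of aligning peptides of different length by Monte Carlo sampling can be sketched as follows. This is a heavily simplified stand-in for a Gibbs sampler, not the published method: it picks one 9-mer core per peptide, scores the alignment by its information content against a uniform background, and uses a Metropolis acceptance rule with a fixed temperature. The core length, temperature, step count, scoring function and all names are assumptions made for this example.

```python
import math
import random
from collections import Counter

CORE = 9  # ASSUMPTION: binding core of nine residues

peptides = ["RFFGGDRGAPKRG", "YLDPLIRGLLARPAKLQV", "KPGQPPRLLIYDASNRATGIPA",
            "GSLFVYNITTNKYKAFLDKQ", "SALLSSDITASVNCAK", "PKYVHQNTLKLAT",
            "GFKGEQGPKGEP", "DVFKELKVHHANENI", "SRYWAIRTRSGGI",
            "TYSTNEIDLQLSQEDGQTIE"]

def cores(seqs, offsets):
    """Cut the 9-mer core out of each peptide at its current offset."""
    return [s[o:o + CORE] for s, o in zip(seqs, offsets)]

def information(seqs, offsets):
    """Sum over core positions of sum_a p_a * log(p_a / 0.05).
    ASSUMPTION: uniform background of 1/20; no pseudo counts or weighting."""
    aligned = cores(seqs, offsets)
    n = len(aligned)
    total = 0.0
    for i in range(CORE):
        counts = Counter(core[i] for core in aligned)
        for count in counts.values():
            p = count / n
            total += p * math.log(p / 0.05)
    return total

def sample(seqs, steps=2000, temperature=0.1, seed=0):
    """Metropolis Monte Carlo over core offsets (single-peptide shift moves)."""
    rng = random.Random(seed)
    offsets = [rng.randrange(len(s) - CORE + 1) for s in seqs]
    current = information(seqs, offsets)
    for _ in range(steps):
        k = rng.randrange(len(seqs))
        proposal = list(offsets)
        proposal[k] = rng.randrange(len(seqs[k]) - CORE + 1)
        new = information(seqs, proposal)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the loss in information grows.
        if new >= current or rng.random() < math.exp((new - current) / temperature):
            offsets, current = proposal, new
    return offsets, current

offsets, final_score = sample(peptides)
for pep, core in zip(peptides, cores(peptides, offsets)):
    print(core, "from", pep)
print("Information of alignment:", round(final_score, 3))
```

With only ten peptides the recovered cores should not be read as a real binding motif; the sketch only shows how the sampling moves and the scoring fit together.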

MHC class II predictions
Allele DRB1_0401
Accuracy

Summary
The class I MHC binding motif is well characterized by HMM/weight matrices
  –This holds even when limited data is available
Neural networks can be trained to predict MHC binding with high accuracy
  –NN can include higher order sequence correlations
The MHC Class II peptide binding motif can be described using a Gibbs sampler algorithm