CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK (DTU)
T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)
Morten Nielsen, CBS, BioCentrum, DTU
Processing of intracellular proteins; MHC binding
What makes a peptide a potential and effective epitope?
  Part of a pathogen protein
  Successful processing
    – Proteasome cleavage
    – TAP binding
  Binds to MHC molecule
  Protein function
    – Early in replication
  Sequence conservation in evolution
  (Figure: SARS virus)
From proteins to immunogens
  Lauemøller et al.
    – Processed: percentage not given (the 1/2000 total implies about 20%)
    – 0.5% bind MHC
    – 50% give a CTL response
  => About 1 in 2000 peptides is immunogenic
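The arithmetic behind the 1/2000 figure can be checked with a short sketch. The 0.5% and 50% fractions come from the slide; the processed fraction is missing from the slide text, so the 20% used here is an assumption back-computed from the stated total.

```python
from fractions import Fraction

# Fractions from the slide. `processed` is an ASSUMPTION: the value is
# missing in the source and is chosen so the product matches 1/2000.
processed = Fraction(20, 100)
bind_mhc = Fraction(5, 1000)       # 0.5% bind MHC
ctl_response = Fraction(50, 100)   # 50% give a CTL response

# Each step filters the survivors of the previous one, so fractions multiply.
immunogenic = processed * bind_mhc * ctl_response
print(immunogenic)  # 1/2000
```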
Location of class I epitopes
  GP1200 protein
  Structure (1GM9)
MHC class I with peptide
  Anchor positions
Prediction of HLA binding specificity
  Simple motifs
    – Allowed/non-allowed amino acids
  Extended motifs
    – Amino acid preferences (SYFPEITHI)
    – Anchor/preferred/other amino acids
  Hidden Markov models
    – Peptide statistics from sequence alignment
  Neural networks
    – Can take sequence correlations into account
Where to get data?
  SYFPEITHI database
    – 3500 peptides known to bind to MHC class I and II
    – Only published data
  MHCpep
    – 13000 peptides known to bind to MHC class I and II
    – Published data and direct submission
    – No update since 1998
  Binding affinity assays
    – Quantitative data: how strongly does a peptide bind to the MHC molecule?
    – Costly, and people do not publish negative results
Databases and web resources
  HLA Informatics Group, ANRI (HLA sequence database)
  IMGT/HLA Database (HLA sequence database)
  SYFPEITHI (Database of HLA Class I and II peptides)
  MHCPEP (Database of HLA Class I and II peptides)
  BIMAS (HLA Class I predictor)
  SYFPEITHI (HLA Class I predictor)
  NetMHC (HLA Class I prediction)
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence information SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV 
FIYAGSLSA KLVANNTRL FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV
Sequence logo
  Height of a column is its information content: $I = \log_2 20 + \sum_a p_a \log_2 p_a$
  Relative height of letter $a$ is $p_a$
  Highly useful tool to visualize sequence motifs
  (Figure: logo for MHC class I HLA-A0201, with the high information positions marked)
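The column height can be sketched in a few lines. This assumes base-2 logarithms (heights in bits), a sum over the amino acids observed in the column, and raw frequencies without any small-sample correction; the function names and the two example columns are made up for illustration.

```python
import math
from collections import Counter

def column_information(column):
    """Information content in bits: log2(20) + sum_a p_a * log2(p_a)."""
    counts = Counter(column)
    n = len(column)
    entropy_term = sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(20) + entropy_term

def letter_heights(column):
    """Each letter gets a share of the column height proportional to p_a."""
    counts = Counter(column)
    n = len(column)
    info = column_information(column)
    return {aa: info * c / n for aa, c in counts.items()}

conserved = "LLLLLLLLLL"  # hypothetical column: one amino acid only
mixed = "LLLLLMMMMM"      # hypothetical column: two amino acids, 50/50

print(column_information(conserved))  # maximum height, log2(20)
print(column_information(mixed))      # exactly one bit lower
print(letter_heights(mixed))
```

A fully conserved column reaches the maximum of log2(20) bits, while a column with all 20 amino acids at equal frequency would have height 0.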
Characterizing a binding motif
  10 peptides known to bind MHC:
    ALAKAAAAM
    ALAKAAAAN
    ALAKAAAAR
    ALAKAAAAT
    ALAKAAAAV
    GMNERPILT
    GILGFVFTM
    TLNAWVKVV
    KLNEPVLLL
    AVVPFIVSV
  What can we learn?
    1. A at P1 favors binding?
    2. I is not allowed at P9?
    3. K at P4 favors binding?
Sequence information
  Description of binding motif
  Example (amino acid frequencies at P1):
    $P_A = 6/10$
    $P_G = 2/10$
    $P_T = P_K = 1/10$
    $P_C = P_D = \dots = P_V = 0$
  Problems
    – Few data
    – Data redundancy/duplication
  Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
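The raw counting in this example can be reproduced with a short sketch. The peptides are the 10 binders from the slide; the function name `position_frequencies` is made up for illustration.

```python
from collections import Counter

peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def position_frequencies(peptides, position):
    """Raw amino acid frequencies at a 1-based peptide position (P1..P9)."""
    column = [p[position - 1] for p in peptides]
    counts = Counter(column)
    return {aa: c / len(column) for aa, c in counts.items()}

p1 = position_frequencies(peptides, 1)
p9 = position_frequencies(peptides, 9)
print(p1)          # A dominates, largely because of the five near-identical peptides
print("I" in p9)   # I is never observed at P9 in this small sample
```

Amino acids that are absent from a column simply do not appear in the dictionary, which corresponds to a raw frequency of 0.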
Sequence information
  Raw sequence counting
  Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Pseudo-count and sequence weighting
  Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
  Poor or biased sampling of sequence space
    – The first five peptides (ALAKAAAAM to ALAKAAAAV) are similar sequences: weight 1/5 each
  I is not found at position P9. Does this mean that I is forbidden?
    – No! Use the Blosum substitution matrix to estimate the pseudo frequency of I at P9
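A minimal sketch of how a substitution matrix can give I a non-zero frequency at P9. It assumes one common scheme: the pseudo frequency is $g_a = \sum_b f_b \, q(a|b)$ and the final estimate is $p_a = (\alpha f_a + \beta g_a)/(\alpha + \beta)$. The slide does not state the exact formula, so this is an assumption, as are the weights `alpha` and `beta`. The conditional probabilities in `TOY_SUBSTITUTION` are invented placeholders, not real Blosum values.

```python
peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

# TOY conditional probabilities q(a|b). These are NOT real Blosum values;
# they only mimic the idea that V, L and M are often replaced by I.
TOY_SUBSTITUTION = {
    "V": {"V": 0.6, "I": 0.3, "L": 0.1},
    "L": {"L": 0.6, "I": 0.2, "V": 0.2},
    "M": {"M": 0.6, "L": 0.2, "I": 0.2},
}

def observed_frequencies(peptides, position):
    """Raw frequencies f_a at a 1-based position."""
    column = [p[position - 1] for p in peptides]
    return {aa: column.count(aa) / len(column) for aa in set(column)}

def pseudo_frequencies(observed, substitution):
    """g_a = sum_b f_b * q(a|b)."""
    pseudo = {}
    for b, f_b in observed.items():
        row = substitution.get(b, {b: 1.0})  # residues without a row map to themselves
        for a, q_ab in row.items():
            pseudo[a] = pseudo.get(a, 0.0) + f_b * q_ab
    return pseudo

def blend(observed, pseudo, alpha, beta):
    """p_a = (alpha * f_a + beta * g_a) / (alpha + beta)."""
    letters = set(observed) | set(pseudo)
    return {
        a: (alpha * observed.get(a, 0.0) + beta * pseudo.get(a, 0.0)) / (alpha + beta)
        for a in letters
    }

f9 = observed_frequencies(peptides, 9)
g9 = pseudo_frequencies(f9, TOY_SUBSTITUTION)
p9 = blend(f9, g9, alpha=9.0, beta=1.0)  # alpha and beta are hypothetical weights
print("I" in f9, p9["I"])  # I is unobserved, yet ends up with a small non-zero frequency
```

With a real Blosum-derived table in place of the toy one, the same two functions would apply unchanged.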
The Blosum matrix
Sequence weighting
  Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
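One way to obtain the weights is to cluster similar peptides and give each member a weight of 1/(cluster size). The sketch below assumes single-linkage clustering on the fraction of identical positions, with a hypothetical threshold of 0.6. The slides do not specify the clustering method, so both choices are assumptions.

```python
peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def identity(a, b):
    """Fraction of positions where two equal-length peptides agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_weights(peptides, threshold=0.6):
    """Single-linkage clustering; each peptide gets weight 1 / cluster size."""
    clusters = []
    for pep in peptides:
        hits = [c for c in clusters if any(identity(pep, m) >= threshold for m in c)]
        merged = [pep]
        for c in hits:  # join every cluster the new peptide is similar to
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return {pep: 1 / len(c) for c in clusters for pep in c}

def weighted_frequencies(peptides, weights, position):
    """Weighted amino acid frequencies at a 1-based position."""
    totals = {}
    for pep in peptides:
        aa = pep[position - 1]
        totals[aa] = totals.get(aa, 0.0) + weights[pep]
    norm = sum(totals.values())
    return {aa: t / norm for aa, t in totals.items()}

weights = cluster_weights(peptides)
w1 = weighted_frequencies(peptides, weights, 1)
print(weights["ALAKAAAAM"], weights["GMNERPILT"])
print(w1)  # A drops from 6/10 (raw) to 1/3 once the similar peptides are down-weighted
```

The five ALAKAAAA* peptides end up in one cluster and together count as a single sequence, so A and G become equally frequent at P1.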
Pseudo counts
  Sequence weighting and pseudo count
    – Prediction accuracy 0.60
  Motif found on all data (485)
    – Prediction accuracy 0.79
Weight matrices
  Estimate amino acid frequencies from alignment
  Now a weight matrix is given as $W_{ij} = \log(p_{ij}/q_j)$
    – Here $i$ is a position in the motif, and $j$ an amino acid. $q_j$ is the background frequency for amino acid $j$.
  $W$ is an $L \times 20$ matrix, where $L$ is the motif length
  Score sequences to the weight matrix by looking up and adding $L$ values from the matrix
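The construction and the scoring can be sketched on a toy problem. To keep the example short it assumes a 4-letter alphabet, a motif of length 3, a uniform background, and natural logarithms; the frequencies are invented and would normally come from an alignment after sequence weighting and pseudo counts.

```python
import math

# HYPOTHETICAL per-position frequencies p_ij (each column sums to 1).
frequencies = [
    {"A": 0.4, "L": 0.2, "K": 0.2, "V": 0.2},
    {"A": 0.1, "L": 0.7, "K": 0.1, "V": 0.1},
    {"A": 0.1, "L": 0.1, "K": 0.1, "V": 0.7},
]
# HYPOTHETICAL uniform background q_j over the toy alphabet.
background = {"A": 0.25, "L": 0.25, "K": 0.25, "V": 0.25}

def build_weight_matrix(frequencies, background):
    """W_ij = log(p_ij / q_j) for every position i and amino acid j."""
    return [
        {aa: math.log(p / background[aa]) for aa, p in column.items()}
        for column in frequencies
    ]

def score(peptide, matrix):
    """Look up one value per position and add them up."""
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))

W = build_weight_matrix(frequencies, background)
for peptide in ["ALV", "KLV", "KKK"]:
    print(peptide, round(score(peptide, W), 3))
```

Peptides that match the preferred amino acid at every position get the highest score, and amino acids rarer than background contribute negative values. All frequencies must be non-zero for the logarithm to be defined, which is one practical reason for pseudo counts.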
Scoring sequences to a weight matrix
  (Figure: weight matrix with columns A R N D C Q E G H I L K M F P S T W Y V)
  Peptides: ILYQVPFSV, ALPYWNFAT, MTAQWWLDA
  Which peptide is most likely to bind? Which peptide second?
How to predict
  The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations).
    – Two adjacent amino acids may, for example, compete for the space in a pocket in the MHC molecule.
  Artificial neural networks (ANN) are ideally suited to take such correlations into account
Neural networks
  Neural networks can learn higher order correlations!
    – What does this mean?
      A A => 0
      A C => 1
      C A => 1
      C C => 0
  No linear function can learn this pattern
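The pattern above is the XOR problem. A single linear threshold unit w1*x1 + w2*x2 + b would need b <= 0, w1 + b > 0, w2 + b > 0 and w1 + w2 + b <= 0; adding the two middle conditions gives w1 + w2 + b > -b >= 0, a contradiction. The sketch below shows that one hidden layer is enough. It assumes the encoding A = 0, C = 1 and uses hand-set weights rather than trained ones.

```python
def step(x):
    """Threshold activation: fires when the summed input is positive."""
    return 1 if x > 0 else 0

ENCODE = {"A": 0, "C": 1}  # assumed numeric encoding of the two letters

def network(first, second):
    """2-2-1 network with hand-set (not trained) weights."""
    x1, x2 = ENCODE[first], ENCODE[second]
    h_or = step(x1 + x2 - 0.5)    # hidden unit 1: at least one C
    h_and = step(x1 + x2 - 1.5)   # hidden unit 2: both are C
    return step(h_or - h_and - 0.5)  # output: one C but not two

for pair in ["AA", "AC", "CA", "CC"]:
    print(pair, "=>", network(pair[0], pair[1]))
```

The hidden units turn the input into features (OR, AND) that are linearly separable, which is what lets the output unit solve the problem.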
Neural networks
  (Figure: network diagram with weights w11, w12, w21, w22 and v1, v2)
Evaluation of prediction accuracy
  ROC curves
    – True positive proportion = TP/AP
    – False positive proportion = FP/AN
    – (Figure: example curves with $A_{roc} = 0.5$ and $A_{roc} = 0.8$)
  Pearson correlation
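Both measures can be sketched in plain Python. The sketch assumes AP and AN denote all actual positives and all actual negatives, and it computes the area under the ROC curve through the equivalent pairwise formulation (the fraction of positive/negative pairs ranked correctly, ties counting one half). The scores and labels are invented for illustration.

```python
import math

def rates_at_threshold(scores, labels, threshold):
    """Return (TP/AP, FP/AN) when predicting 'binder' for score >= threshold."""
    tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= threshold)
    fp = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= threshold)
    ap = sum(1 for l in labels if l == 1)
    an = sum(1 for l in labels if l == 0)
    return tp / ap, fp / an

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise ranking of positives vs negatives."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# HYPOTHETICAL predictions and binder labels (1 = binder, 0 = non-binder).
predicted = [0.9, 0.8, 0.4, 0.3, 0.2]
is_binder = [1, 1, 0, 1, 0]

print(rates_at_threshold(predicted, is_binder, 0.5))
print(roc_auc(predicted, is_binder))
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation; the Pearson coefficient is the natural choice when measured binding affinities, not just binder labels, are available.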
Epitope predictions: sequence motif and HMMs
  Sequence motif: cc: 0.76, $A_{roc}$: 0.92
  HMM: cc: 0.80, $A_{roc}$: 0.95
Epitope prediction: neural networks
  cc: 0.91, $A_{roc}$: 0.98
Evaluation of prediction accuracy
Hepatitis C virus. Epitope predictions
Proteasomal cleavage
  Netchop (
    – Epitopes have strong C-terminal cleavage
    – Epitopes can have strong internal cleavage sites
  Selection strategy
    – High binding peptides
    – High cleavage probability at the C terminus
  Example (S marks a predicted cleavage site):
    NMVPFFPPV
    ..S.....S
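The selection strategy amounts to a two-criterion filter. In the sketch below the `Candidate` type, the thresholds, and all scores are hypothetical; the numbers are invented for illustration and are not output from NetMHC or NetChop.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    peptide: str
    binding_score: float     # hypothetical predicted MHC binding, 0..1
    c_term_cleavage: float   # hypothetical cleavage probability at the C terminus

def select(candidates, min_binding=0.5, min_cleavage=0.5):
    """Keep peptides that pass both criteria, strongest binders first."""
    keep = [
        c for c in candidates
        if c.binding_score >= min_binding and c.c_term_cleavage >= min_cleavage
    ]
    return sorted(keep, key=lambda c: c.binding_score, reverse=True)

# INVENTED scores, for illustration only.
candidates = [
    Candidate("NMVPFFPPV", 0.80, 0.90),
    Candidate("GILGFVFTL", 0.95, 0.30),  # strong binder, weak C-terminal cleavage
    Candidate("KKREEAPSL", 0.10, 0.85),  # well cleaved, weak binder
    Candidate("SLYNTVATL", 0.70, 0.60),
]

for c in select(candidates):
    print(c.peptide, c.binding_score, c.c_term_cleavage)
```

Requiring both criteria removes strong binders that are unlikely to be generated by the proteasome, at the cost of a shorter candidate list.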
What now?
  29 March. Introduction to hidden Markov models and weight matrices
  5 April. Introduction to neural networks
  12 April. Introduction to the project
  10 May. Hand in the project