Gibbs Clustering Massimo Andreatta, Morten Nielsen CBS, Department of Systems biology DTU, Denmark
Class II MHC binding MHC class II binds peptides in the class II antigen presentation pathway Binds peptides of length 9-18 (even whole proteins can bind!) Binding cleft is open Binding core is 9 aa
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 3 Conventional Gibbs sampling MHC class II binding SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT... DFAAQVDYPSTGLY 9 AAs SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT... DFAAQVDYPSTGLY
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 4 Gibbs sampling - sequence alignment Why sampling? SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT... DFAAQVDYPSTGLY 50 sequences 12 amino acids long try all possible combinations with a 9-mer overlap 4 50 ~ possible combinations...computationally unfeasible
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 5 Gibbs sampling - sequence alignment State transition SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT move to state +1
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 6 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT move to state +1 Accept or reject the move? Note that the probability of going to the new state depends on the previous state only Gibbs sampling - sequence alignment State transition
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 7 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT move to state +1 Gibbs sampling - sequence alignment Numerical example - 1 Accept move with Prob = 100%
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 8 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT move to state +1 Gibbs sampling - sequence alignment Numerical example - 2 Accept move with Prob = 63.8%
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 9 Gibbs sampling - sequence alignment What is the MC temperature? iteration T it’s a scalar decreased during the simulation
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 10 Gibbs sampling - sequence alignment What is the MC temperature? iteration T it’s a scalar decreased during the simulation E.g. same dE=-0.3 but at different temperatures t 1 =0.4 t 3 =0.02 t 2 =0.1
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 11 Z f(Z) Move freely around states when the system is “warm”, then cool it off to force it into a state of high fitness
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 12 Does it work? HLA-DRB3*01:01
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 13 It works High TemperatureLow Temperature
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 14 Gibbs sampler. Prediction accuracy
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 15 Next generation peptide-chip technology
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 16 Identification of receptor binding motifs
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 17 Data interpretation PeptideSignal LMLSLFEQSLSCQAQ9 QGTDATKSIIFEAER-12 RLEEAQAYLAAGQHD10 EISELRTKVQEQQKQ44 FAGAKKIFGSLAFLP70 VRASSRVSGSFPEDS-7 CKAFFKRSIQGHNDY86 CEGCKAFFKRSIQGH100 RLSEADIRGFVAAVV-7
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 18 Or with multiple specificities? PeptideSignal LMLSLFEQSLSCQAQ9 QGTDATKSIIFEAER-12 RLEEAQAYLAAGQHD10 EISELRTKVQEQQKQ44 FAGAKKIFGSLAFLP70 VRASSRVSGSFPEDS-7 CKAFFKRSIQGHNDY86 CEGCKAFFKRSIQGH100 RLSEADIRGFVAAVV-7
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 19 Gibbs clustering (multiple specificities) --ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS---- AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK- ----SLFIGLKGDIRESTV-- --DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF ---SFSCIAIGIITLYLG IDQVTIAGAKLRSLN-- WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP Cluster 2 Cluster 1 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK Multiple motifs !
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 20 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 21 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQD V --ELLEFHYYLSSKLNK--- --MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- g1g1 g2g2 gNgN 3Simple shiftRemove peptidePhase shift
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 22 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQDV --ELLEFHYYLSSKLNK--- --MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- g1g1 g2g2 gNgN MMGMFNMLSTVLGVS 33. Finding the optimal configuration (the MC moves)
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 23 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQDV --ELLEFHYYLSSKLNK AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- g1g1 g2g2 gNgN MMGMFNMLSTVLGVS 4. Random shift of the core MMGMFNMLSTVLGVS 3. Finding the optimal configuration (the MC moves) 5. Score new core to log-odds matrices 6. Accept or reject move
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 24 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQD V --ELLEFHYYLSSKLNK--- --MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- g1g1 g2g2 gNgN MMGMFNMLSTVLGVS 3Simple shiftRemove peptidePhase shift
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 25 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQD V --ELLEFHYYLSSKLNK AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- g1g1 g2g2 gNgN MMGMFNMLSTVLGVS 4b. Random shift of the core MMGMFNMLSTVLGVS 3Simple shiftRemove peptidePhase shift
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 26 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQD V --ELLEFHYYLSSKLNK AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- g1g1 g2g2 gNgN MMGMFNMLSTVLGVS 4b. Random shift of the core MMGMFNMLSTVLGVS 5b. Score new core to all log-odds matrices E1E1 E2E2 E3E3 3Simple shiftRemove peptidePhase shift
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 27 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQD V --ELLEFHYYLSSKLNK AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- -MMGMFNMLSTVLGVS--- g1g1 g2g2 gNgN MMGMFNMLSTVLGVS 4b. Random shift of the core MMGMFNMLSTVLGVS 5b. Score new core to all log-odds matrices E1E1 E2E2 E3E3 6b. Accept or reject move 3Simple shiftRemove peptidePhase shift
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 28 The scoring function -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL - -SLFIGLKGDIRESTV-- -SFSCIAIGIITLYLG-- KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- P = Probability of accepting the move Avoid small specialized clusters ( = 10) Maximize intra cluster similarity whilst minimize inter cluster similarity
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 29 A mixture of 500 9mer peptides. How many motifs? OR
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 30 A mixture of 500 9mer peptides. How many motifs?
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 31 Mixture of 100 binders for the two alleles ATDKAAAAYA*0101 EVDQTKIQYA*0101 AETGSQGVYB*4402 ITDITKYLYA*0101 AEMKTDAATB*4402 FEIKSAKKFB*4402 LSEMLNKEYA*0101 GELDRWEKIB*4402 LTDSSTLLVA*0101 FTIDFKLKYA*0101 TTTIKPVSYA*0101 EEKAFSPEVB*4402 AENLWVPVYB*4402 Two MHC class I alleles: HLA-A*0101 and HLA-B*4402
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 32 Two MHC class I alleles: HLA-A*0101 and HLA-B*4402 A0101B4402 G 1 G 2 Mixed
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 33 Two MHC class I alleles: HLA-A*0101 and HLA-B* A0101B4402 G 1 G 2 Resolved Mixed
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 34 Five MHC class I alleles A010 1 A020 1 A030 1 B0702B4402 G 0 G 1 G 2 G 3 G 4
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 35 Five MHC class I alleles A A020 1 A030 1 B0702B4402 G 0 G 1 G 2 G 3 G 4 HLA-B % HLA-B % HLA-A % HLA-A % HLA-A %
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 36 The right number of specificities
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 37 How many motifs? =0
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 38 How many motifs? OR >0
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 39 Adding in alignment (MHC class II) HLA-DRB1*03:01HLA-DRB1*04:01 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 40 Dealing with noisy data Experimental data often contain false positives Outliers do not match any recurrent motif Introduce a garbage bin to collect ou tliers
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 41 Dealing with noisy data Introduce a garbage bin to collect outliers SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK g1g2gntrash Move to the trash cluster peptides that do not match any motif
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 42 Dealing with noisy data 50 random sequences are added to the data set 200 binders to 3 MHC class I alleles 3 allelesHLA-A0101HLA-B0702HLA-B4001Random g g g trash
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 43 Dealing with noisy data 50 random sequences are added to the data set 200 binders to 3 MHC class I alleles 3 allelesHLA-A0101HLA-B0702HLA-B4001Random g g g trash DHHFTPQII NAFGWENAY SQTSYQYLI ELPIVTPAL ADKNLIKCS
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 44 Dealing with noisy data
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 45 SH3 domains - SH3 domains are small interaction modules abundantly found in eukaryotes - They preferentially bind proline- rich regions (minimum consensus sequence is PxxP) - Two main classes of ligands exist: Surface representation of the Hck SH3 domain bound to an artificial high- affinity peptide ligand, in green. Schmidt et al. (2007) class I +x Px P class II Px Px+ +: R or K : hydrophobic x: any amino acid
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 46 SH3 domains - Gibbs clustering - Large data set of 2,457 unique peptide sequences * binding to Src SH3 domain - 12 amino acids long peptides, unaligned with respect to the binding core WVTAPRSLPVLP GSWVVDISNVED NYSGNRPLPGIW RSPIVRQLPSLP RALPVMPTNGPM AKSRPLPMVGLV VRPLPSPEGFGQ YRGRMLPVIWGT RALPLPRAYEGI VPLLPIRNGAVN PVRVLPNIPLSV... *Kim, T., et al. (2011) MUSI: an integrated system for identifying multiple specificity from large peptide or nucleic acid data sets, Nucleic Acids Res, 1-8.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 47 SH3 domains Gibbs clustering - 1 cluster class I +x Px P class II Px Px+ 2,350 sequences (107 to trash)
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 48 SH3 domains Gibbs clustering - 2 clusters class I +x Px P class II Px Px+ (63 to trash) 466 sequences Class II 1,928 sequences Class I
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 49 SH3 domains Gibbs clustering - 3 clusters class I +x Px P class II Px Px+ 1,404 sequences560 sequences (48 to trash) 445 sequences
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 50 SH3 domains class I +x Px P class II Px Px+ Optimal solution has two specificities
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 51
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 52 Acknowledgements Massimo Andreatta, PhD (for doing all the work) John Sidney (La Jolla Institute for Allergy and Immunology, California, USA) for the binding affinity validation of the predicted MHC class I outliers. The European Union 7th Framework Program FP7/ [grant number ]