Gibbs Clustering Massimo Andreatta, Morten Nielsen CBS, Department of Systems biology DTU, Denmark.

Slides:



Advertisements
Similar presentations
Cross validation, training and evaluation of data driven prediction methods Morten Nielsen Department of Systems Biology, DTU.
Advertisements

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence information, logos and Hidden Markov Models Morten Nielsen, CBS, BioCentrum,
Hotspot Hunter: a computational system for large-scale screening and selection of candidate immunological hotspots in pathogen proteomes G.L. Zhang, A.M.
Gibbs sampling Morten Nielsen, CBS, BioSys, DTU. Class II MHC binding MHC class II binds peptides in the class II antigen presentation pathway Binds peptides.
PREDetector : Prokaryotic Regulatory Element Detector Samuel Hiard 1, Sébastien Rigali 2, Séverine Colson 2, Raphaël Marée 1 and Louis Wehenkel 1 1 Bioinformatics.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E 1-month Practical Course: Genome Analysis Sequence comparison by ‘Sequence Harmony’
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU T cell Epitope predictions using bioinformatics (Neural Networks and hidden.
Undergraduate Exercises with Trp Cage Paula Evans, Chet Fornari, Jeff Hansen, Jennifer Inlow, Larry Merkle.
05/27/2006 Modeling and Determining the Structures of Proteins and Macromolecular Assemblies Depts. of Biopharmaceutical Sciences and Pharmaceutical Chemistry.
Chemotaxis Pathway How can physics help? Davi Ortega.
MHC Polymorphism Ole Lund. Objectives What is HLA polymorphism? What is it good for? How does it make life difficult for vaccine design? Definition of.
1 Profile Hidden Markov Models For Protein Structure Prediction Colin Cherry
Profiles for Sequences
Optimization methods Morten Nielsen Department of Systems biology, DTU.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Gibbs sampling for motif finding in biological sequences Christopher Sheldahl.
Structural bioinformatics
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark Immunological Bioinformatics Processing, combined.
. Sequence Alignment via HMM Background Readings: chapters 3.4, 3.5, 4, in the Durbin et al.
MHC Polymorphism. MHC Class I pathway Figure by Eric A.J. Reits.
Transcription factor binding motifs (part I) 10/17/07.
A Very Basic Gibbs Sampler for Motif Detection Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute.
Thomas Blicher Center for Biological Sequence Analysis
Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.
Tertiary protein structure modelling May 31, 2005 Graded papers will handed back Thursday Quiz#4 today Learning objectives- Continue to learn how to manipulate.
Selection of T Cell Epitopes Using an Integrative Approach Mette Voldby Larsen cand. scient. in biology ph.d. student.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Homology Modelling Thomas Blicher Center for Biological Sequence Analysis.
Project list 1.Peptide MHC binding predictions using position specific scoring matrices including pseudo counts and sequences weighting clustering (Hobohm)
Protein Structure Prediction Samantha Chui Oct. 26, 2004.
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
Cis-regultory module 10/24/07. TFs often work synergistically (Harbison 2004)
The Major Histocompatibility Complex And Antigen Presentation
Institute of Immunology, ZJU
SUPERVISED NEURAL NETWORKS FOR PROTEIN SEQUENCE ANALYSIS Lecture 11 Dr Lee Nung Kion Faculty of Cognitive Sciences and Human Development UNIMAS,
Algorithms in Bioinformatics Morten Nielsen Department of Systems Biology, DTU.
Phosphoproteomics and motif mining Martin Miller Ph.d. student CBS DTU
The dynamic nature of the proteome
Scoring Matrices Scoring matrices, PSSMs, and HMMs BIO520 BioinformaticsJim Lund Reading: Ch 6.1.
Introduction to Bayesian statistics Yves Moreau. Overview The Cox-Jaynes axioms Bayes’ rule Probabilistic models Maximum likelihood Maximum a posteriori.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
National Taiwan University Department of Computer Science and Information Engineering Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Identification of Cancer-Specific Motifs in
Analysis of Complex Proteomic Datasets Using Scaffold Free Scaffold Viewer can be downloaded at:
Project list 1.Peptide MHC binding predictions using position specific scoring matrices including pseudo counts and sequences weighting clustering (Hobohm)
Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae Speaker: Chunhui Cai.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BiC BioCentrum-DTU Technical University of Denmark 1/31 Prediction of significant positions in biological sequences.
Construction of Substitution Matrices
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
What is a Project Purpose –Use a method introduced in the course to describe some biological problem How –Construct a data set describing the problem –Define.
Multiple Mapping Method with Multiple Templates (M4T): optimizing sequence-to-structure alignments and combining unique information from multiple templates.
Biological Signal Detection for Protein Function Prediction Investigators: Yang Dai Prime Grant Support: NSF Problem Statement and Motivation Technical.
Computational engineering of bionanostructures Ram Samudrala University of Washington How can we analyse, design, & engineer peptides capable of specific.
Russell Group, Protein Evolution _________ ____ Rob Russell Cell Networks University of Heidelberg Interactions and Modules: the how and why of molecular.
Blosum matrices What are they? Morten Nielsen BioSys, DTU
Motif Search and RNA Structure Prediction Lesson 9.
Special Topics in Genomics Motif Analysis. Sequence motif – a pattern of nucleotide or amino acid sequences GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA.
Optimization methods Morten Nielsen Department of Systems biology, DTU IIB-INTECH, UNSAM, Argentina.
Prediction of T cell epitopes using artificial neural networks Morten Nielsen, CBS, BioCentrum, DTU.
Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and Discovery Program.
Mean Field Theory and Mutually Orthogonal Latin Squares in Peptide Structure Prediction N. Gautham Department of Crystallography and Biophysics University.
Sparse nonnegative matrix factorization for protein sequence motifs information discovery Presented by Wooyoung Kim Computer Science, Georgia State University.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
A Very Basic Gibbs Sampler for Motif Detection
Ligand Docking to MHC Class I Molecules
Secreted Fringe-like Signaling Molecules May Be Glycosyltransferases
Morten Nielsen, CBS, BioSys, DTU
Yvonne Groemping, Karine Lapouge, Stephen J. Smerdon, Katrin Rittinger 
Volume 5, Issue 3, Pages e5 (September 2017)
Improving SH3 domain ligand selectivity using a non-natural scaffold
Presentation transcript:

Gibbs Clustering Massimo Andreatta, Morten Nielsen CBS, Department of Systems biology DTU, Denmark

Class II MHC binding MHC class II binds peptides in the class II antigen presentation pathway Binds peptides of length 9-18 (even whole proteins can bind!) Binding cleft is open Binding core is 9 aa

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 3 Conventional Gibbs sampling MHC class II binding SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT... DFAAQVDYPSTGLY 9 AAs SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT... DFAAQVDYPSTGLY

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 4 Gibbs sampling - sequence alignment Why sampling? SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT... DFAAQVDYPSTGLY 50 sequences 12 amino acids long try all possible combinations with a 9-mer overlap 4 50 ~ possible combinations...computationally unfeasible

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 5 Gibbs sampling - sequence alignment State transition SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT move to state +1

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 6 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT move to state +1 Accept or reject the move? Note that the probability of going to the new state depends on the previous state only Gibbs sampling - sequence alignment State transition

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 7 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT move to state +1 Gibbs sampling - sequence alignment Numerical example - 1 Accept move with Prob = 100%

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 8 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT move to state +1 Gibbs sampling - sequence alignment Numerical example - 2 Accept move with Prob = 63.8%

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 9 Gibbs sampling - sequence alignment What is the MC temperature? iteration T it’s a scalar decreased during the simulation

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 10 Gibbs sampling - sequence alignment What is the MC temperature? iteration T it’s a scalar decreased during the simulation E.g. same dE=-0.3 but at different temperatures t 1 =0.4 t 3 =0.02 t 2 =0.1

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 11 Z f(Z) Move freely around states when the system is “warm”, then cool it off to force it into a state of high fitness

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 12 Does it work? HLA-DRB3*01:01

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 13 It works High TemperatureLow Temperature

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 14 Gibbs sampler. Prediction accuracy

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 15 Next generation peptide-chip technology

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 16 Identification of receptor binding motifs

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 17 Data interpretation PeptideSignal LMLSLFEQSLSCQAQ9 QGTDATKSIIFEAER-12 RLEEAQAYLAAGQHD10 EISELRTKVQEQQKQ44 FAGAKKIFGSLAFLP70 VRASSRVSGSFPEDS-7 CKAFFKRSIQGHNDY86 CEGCKAFFKRSIQGH100 RLSEADIRGFVAAVV-7

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 18 Or with multiple specificities? PeptideSignal LMLSLFEQSLSCQAQ9 QGTDATKSIIFEAER-12 RLEEAQAYLAAGQHD10 EISELRTKVQEQQKQ44 FAGAKKIFGSLAFLP70 VRASSRVSGSFPEDS-7 CKAFFKRSIQGHNDY86 CEGCKAFFKRSIQGH100 RLSEADIRGFVAAVV-7

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 19 Gibbs clustering (multiple specificities) --ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS---- AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK- ----SLFIGLKGDIRESTV-- --DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF ---SFSCIAIGIITLYLG IDQVTIAGAKLRSLN-- WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP Cluster 2 Cluster 1 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK Multiple motifs !

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 20 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 21 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQD V --ELLEFHYYLSSKLNK--- --MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- g1g1 g2g2 gNgN 3Simple shiftRemove peptidePhase shift

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 22 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQDV --ELLEFHYYLSSKLNK--- --MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- g1g1 g2g2 gNgN MMGMFNMLSTVLGVS 33. Finding the optimal configuration (the MC moves)

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 23 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQDV --ELLEFHYYLSSKLNK AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- g1g1 g2g2 gNgN MMGMFNMLSTVLGVS 4. Random shift of the core MMGMFNMLSTVLGVS 3. Finding the optimal configuration (the MC moves) 5. Score new core to log-odds matrices 6. Accept or reject move

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 24 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQD V --ELLEFHYYLSSKLNK--- --MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- g1g1 g2g2 gNgN MMGMFNMLSTVLGVS 3Simple shiftRemove peptidePhase shift

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 25 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQD V --ELLEFHYYLSSKLNK AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- g1g1 g2g2 gNgN MMGMFNMLSTVLGVS 4b. Random shift of the core MMGMFNMLSTVLGVS 3Simple shiftRemove peptidePhase shift

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 26 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQD V --ELLEFHYYLSSKLNK AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- g1g1 g2g2 gNgN MMGMFNMLSTVLGVS 4b. Random shift of the core MMGMFNMLSTVLGVS 5b. Score new core to all log-odds matrices E1E1 E2E2 E3E3 3Simple shiftRemove peptidePhase shift

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 27 The algorithm SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK 1. List of peptides2. create N random groups ----IDQVTIAGAKLRSLN- WIQKETLVTFKNPHAKKQD V --ELLEFHYYLSSKLNK AKSSPAYPSVLGQTI-- -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- ---ESLHNPYPDYHWLRT- RHLIFCHSKKKCDELAAK- ----VFRLKGGAPIKGVTF --LNKFISPKSVAGRFA-- DGEEEVQLIAAVPGK---- -MMGMFNMLSTVLGVS--- g1g1 g2g2 gNgN MMGMFNMLSTVLGVS 4b. Random shift of the core MMGMFNMLSTVLGVS 5b. Score new core to all log-odds matrices E1E1 E2E2 E3E3 6b. Accept or reject move 3Simple shiftRemove peptidePhase shift

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 28 The scoring function -SLFIGLKGDIRESTV-- ---SFSCIAIGIITLYLG KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL - -SLFIGLKGDIRESTV-- -SFSCIAIGIITLYLG-- KMLLDNINTPEGIIP--- -LNKYVHGTWRSILP--- --NKVKSLRILNTRRKL- P = Probability of accepting the move Avoid small specialized clusters (  = 10) Maximize intra cluster similarity whilst minimize inter cluster similarity

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 29 A mixture of 500 9mer peptides. How many motifs? OR

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 30 A mixture of 500 9mer peptides. How many motifs?

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 31 Mixture of 100 binders for the two alleles ATDKAAAAYA*0101 EVDQTKIQYA*0101 AETGSQGVYB*4402 ITDITKYLYA*0101 AEMKTDAATB*4402 FEIKSAKKFB*4402 LSEMLNKEYA*0101 GELDRWEKIB*4402 LTDSSTLLVA*0101 FTIDFKLKYA*0101 TTTIKPVSYA*0101 EEKAFSPEVB*4402 AENLWVPVYB*4402 Two MHC class I alleles: HLA-A*0101 and HLA-B*4402

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 32 Two MHC class I alleles: HLA-A*0101 and HLA-B*4402 A0101B4402 G 1 G 2 Mixed

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 33 Two MHC class I alleles: HLA-A*0101 and HLA-B* A0101B4402 G 1 G 2 Resolved Mixed

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 34 Five MHC class I alleles A010 1 A020 1 A030 1 B0702B4402 G 0 G 1 G 2 G 3 G 4

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 35 Five MHC class I alleles A A020 1 A030 1 B0702B4402 G 0 G 1 G 2 G 3 G 4 HLA-B % HLA-B % HLA-A % HLA-A % HLA-A %

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 36 The right number of specificities

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 37 How many motifs? =0

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 38 How many motifs? OR >0

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 39 Adding in alignment (MHC class II) HLA-DRB1*03:01HLA-DRB1*04:01 SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 40 Dealing with noisy data Experimental data often contain false positives Outliers do not match any recurrent motif Introduce a garbage bin to collect ou tliers

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 41 Dealing with noisy data Introduce a garbage bin to collect outliers SLFIGLKGDIRESTV DGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLN WIQKETLVTFKNPHAKKQD V KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFA ESLHNPYPDYHWLRT NKVKSLRILNTRRKL MMGMFNMLSTVLGVS AKSSPAYPSVLGQTI RHLIFCHSKKKCDELAAK g1g2gntrash Move to the trash cluster peptides that do not match any motif

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 42 Dealing with noisy data 50 random sequences are added to the data set 200 binders to 3 MHC class I alleles 3 allelesHLA-A0101HLA-B0702HLA-B4001Random g g g trash

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 43 Dealing with noisy data 50 random sequences are added to the data set 200 binders to 3 MHC class I alleles 3 allelesHLA-A0101HLA-B0702HLA-B4001Random g g g trash DHHFTPQII NAFGWENAY SQTSYQYLI ELPIVTPAL ADKNLIKCS

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 44 Dealing with noisy data

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 45 SH3 domains - SH3 domains are small interaction modules abundantly found in eukaryotes - They preferentially bind proline- rich regions (minimum consensus sequence is PxxP) - Two main classes of ligands exist: Surface representation of the Hck SH3 domain bound to an artificial high- affinity peptide ligand, in green. Schmidt et al. (2007) class I +x  Px  P class II  Px  Px+ +: R or K  : hydrophobic x: any amino acid

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 46 SH3 domains - Gibbs clustering - Large data set of 2,457 unique peptide sequences * binding to Src SH3 domain - 12 amino acids long peptides, unaligned with respect to the binding core WVTAPRSLPVLP GSWVVDISNVED NYSGNRPLPGIW RSPIVRQLPSLP RALPVMPTNGPM AKSRPLPMVGLV VRPLPSPEGFGQ YRGRMLPVIWGT RALPLPRAYEGI VPLLPIRNGAVN PVRVLPNIPLSV... *Kim, T., et al. (2011) MUSI: an integrated system for identifying multiple specificity from large peptide or nucleic acid data sets, Nucleic Acids Res, 1-8.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 47 SH3 domains Gibbs clustering - 1 cluster class I +x  Px  P class II  Px  Px+ 2,350 sequences (107 to trash)

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 48 SH3 domains Gibbs clustering - 2 clusters class I +x  Px  P class II  Px  Px+ (63 to trash) 466 sequences Class II 1,928 sequences Class I

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 49 SH3 domains Gibbs clustering - 3 clusters class I +x  Px  P class II  Px  Px+ 1,404 sequences560 sequences (48 to trash) 445 sequences

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 50 SH3 domains class I +x  Px  P class II  Px  Px+ Optimal solution has two specificities

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 51

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 52 Acknowledgements Massimo Andreatta, PhD (for doing all the work) John Sidney (La Jolla Institute for Allergy and Immunology, California, USA) for the binding affinity validation of the predicted MHC class I outliers. The European Union 7th Framework Program FP7/ [grant number ]