Motif discovery GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.

Slides:



Advertisements
Similar presentations
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Advertisements

Motif discovery Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Understanding Science Through the Lens of Computation Richard M. Karp Nov. 3, 2007.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Gibbs sampling for motif finding in biological sequences Christopher Sheldahl.
Genes and Regulatory Elements
Motif Finding. Regulation of Genes Gene Regulatory Element RNA polymerase (Protein) Transcription Factor (Protein) DNA.
Transcription factor binding motifs (part I) 10/17/07.
DNA Regulatory Binding Motif Search Dong Xu Computer Science Department 109 Engineering Building West
A Very Basic Gibbs Sampler for Motif Detection Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute.
Complexity (Running Time)
Biological Sequence Pattern Analysis Liangjiang (LJ) Wang March 8, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 16.
Exploring Protein Sequences Tutorial 5. Exploring Protein Sequences Multiple alignment –ClustalW Motif discovery –MEME –Jaspar.
Motif finding : Lecture 2 CS 498 CXZ. Recap Problem 1: Given a motif, finding its instances Problem 2: Finding motif ab initio. –Paradigm: look for over-represented.
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
A Statistical Method for Finding Transcriptional Factor Binding Sites Authors: Saurabh Sinha and Martin Tompa Presenter: Christopher Schlosberg CS598ss.
Motif search and discovery Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington.
Sequence comparison: Local alignment Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Multiple testing correction
Guiding Motif Discovery by Iterative Pattern Refinement Zhiping Wang, Mehmet Dalkilic, Sun Kim School of Informatics, Indiana University.
Advanced Algorithms and Models for Computational Biology -- a machine learning approach Computational Genomics III: Motif Detection Eric Xing Lecture 6,
Expectation Maximization and Gibbs Sampling – Algorithms for Computational Biology Lecture 1- Introduction Lecture 2- Hashing and BLAST Lecture 3-
Motif search Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Motif discovery Tutorial 5. Motif discovery MEME Creates motif PSSM de-novo (unknown motif) MAST Searches for a PSSM in a DB TOMTOM Searches for a PSSM.
Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Guiding motif discovery by iterative pattern refinement Zhiping Wang Advisor: Sun Kim, Mehmet Dalkilic School of Informatics, Indiana University.
Local Multiple Sequence Alignment Sequence Motifs
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Computational Biology, Part 3 Representing and Finding Sequence Features using Frequency Matrices Robert F. Murphy Copyright  All rights reserved.
Intro to Probabilistic Models PSSMs Computational Genomics, Lecture 6b Partially based on slides by Metsada Pasmanik-Chor.
Pattern Discovery and Recognition for Understanding Genetic Regulation Timothy L. Bailey Institute for Molecular Bioscience University of Queensland.
1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003.
Projects
A Very Basic Gibbs Sampler for Motif Detection
Gibbs sampling.
Motifs BCH364C/394P - Systems Biology / Bioinformatics
Learning Sequence Motif Models Using Expectation Maximization (EM)
RNA1: Last week's take home lessons
Transcription factor binding motifs
LESSON 12 - Loops and Simulations
Sequence comparison: Significance of similarity scores
Sequence comparison: Traceback and local alignment
Motif search GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Motif p-values GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Sequence comparison: Multiple testing correction
Winter 2018 CISC101 12/1/2018 CISC101 Reminders
Sequence comparison: Dynamic programming
Tips & Tricks to make your life easier
Motif discovery GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Finding regulatory modules
While loops Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Sequence comparison: Local alignment
Sequence comparison: Traceback
Quarter 1.
Sequence comparison: Multiple testing correction
Motif search GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Sequence comparison: Significance of similarity scores
False discovery rate estimation
Transcription factor binding motifs
Motifs BCH339N Systems Biology / Bioinformatics – Spring 2016
Lecture 17 – Practice Exercises 3
CS a-spring-midterm-survey
Presentation transcript:

Motif discovery GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble

One-minute response: Stuff folks liked The motif section is pretty interesting. It helps to talk about what we can actually apply these methods to. Going stepwise over the code is really helping. Good explanation and review of what we learned last time. Writing loops is fun (so far). I like reviewing p-values in the beginning since it can get confusing when switching applications. Looping practice is really helpful. Enjoyed learning about the p-value construction of motifs. x2 I liked comparing different ways to get p-values. Good step-by-step explanation of how to determine p-value from motif sequences. Python problems are getting more challenging, which is nice. X2 I like the ”bonus problems” at the – I get to them at home.

One-minute response: Pacing I thought the pace of both sections was good. Pace is just right. Lecture was good pace. Loved the pace and subject matter today. Good pacing, good content. Everything was presented at an appropriate pace & level. Best pacing so far, in my opinion. Pacing was good today. I like having lots of time to work on the sample problems. Pacing was good. Need more time to do the Python exercises.

One-minute response: Suggestions It would be great to have one or two more complex programming homework problems rather than several simple ones. People might be a little embarrassed to answer in-class questions when the answer is in front of them. I wish the class was two hours so we could spend a full hour on coding. It would have been nice to go over representing a motif as a PSSM briefly before learning about the p-value stuff. I got a little confused in the converting scores to p-values slides. x3 While loops are confusing me. Dynamic programming sample exercise could be helpful. I am still not sold on ever using a while loop instead of a for loop. I like talking about the key steps in a solution. If you could annotate those on the slides, that would be great. Homework solutions are very helpful.

Dynamic programming for motif p-values 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 1 2 3 4 5 A 10 6 8 C 7 G T

Dynamic programming for motif p-values 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 1 2 3 4 5 A 10 6 8 C 7 G T All length-1 sequences

Dynamic programming for motif p-values 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 1 2 3 4 5 A 10 6 8 C 7 G T All length-2 sequences starting with A

Dynamic programming for motif p-values 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 1 2 3 4 5 A 10 6 8 C 7 G T All length-2 sequences starting with A or C

Dynamic programming for motif p-values 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 1 2 3 4 5 A 10 6 8 C 7 G T All length-2 sequences

Dynamic programming for motif p-values 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 1 2 3 4 5 A 10 6 8 C 7 G T

Motif discovery problem Given sequences Find motif seq. 1 seq. 2 seq. 3 IGRGGFGEVY at position 515 LGEGCFGQVV at position 430 VGSGGFGQVY at position 682 seq. 1 seq. 2 seq. 3

Motif discovery problem (hard version) Given: a sequence or family of sequences. Find: the number of motifs the width of each motif the locations of motif occurrences

Why is this hard? Input sequences are long (thousands or millions of residues). Motif may be subtle Instances are short. Instances are only slightly similar. ? ?

xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx HAHU V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA HAOR M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA HADK V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA HBHU VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD HBOR VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG HBDK VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT MYHU G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED MYOR G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED IGLOB M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD GPUGNI A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE GPYL GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ GGZLB M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE   xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx HAHU QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL HAOR QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL HADK QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL HBHU AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL HBOR AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL HBDK AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL MYHU EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII MYOR EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII IGLOB T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR GPUGNI NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG GPYL NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE GGZLB Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x HAHU VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R HAOR VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R HADK VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R HBHU VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H HBOR IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H HBDK IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H MYHU QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G MYOR HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G IGLOB TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L GPUGNI ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E GPYL AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA GGZLB EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E Globin motifs

Transcription Factor Binding Sites A Concrete Example: Transcription Factor Binding Sites We are given a set of promoters from co-regulated genes. TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3

Transcription Factor Binding Sites A Concrete Example: Transcription Factor Binding Sites An unknown transcription factor binds to positions unknown to us, on either DNA strand. 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3

Transcription Factor Binding Sites A Concrete Example: Transcription Factor Binding Sites The DNA binding motif of the transcription factor can be described by a position-specific scoring matrix (PSSM). 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3

Transcription Factor Binding Sites A Concrete Example: Transcription Factor Binding Sites The sequence motif discovery problem is to discover the sites (or the motif) given just the sequences. 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3

Gibbs sampling

Alternating approach Guess an initial weight matrix Use weight matrix to predict instances in the input sequences Use instances to predict a weight matrix Repeat 2 & 3 until satisfied.

Initialization Randomly guess an instance si from each of t input sequences {S1, ..., St}. sequence 1 ACAGTGT TTAGACC GTGACCA ACCCAGG CAGGTTT sequence 2 sequence 3 sequence 4 sequence 5

Gibbs sampler Initially: randomly guess an instance si from each of t input sequences {S1, ..., St}. Steps 2 & 3 (search): Throw away an instance si: remaining (t - 1) instances define weight matrix. Weight matrix defines instance probability at each position of input string Si Pick new si according to probability distribution Return highest-scoring motif seen

Sampler step illustration: ACAGTGT TAGGCGT ACACCGT ??????? CAGGTTT A C G T .45 .05 .25 .65 .85 ACAGTGT TAGGCGT ACACCGT ACGCCGT CAGGTTT sequence 4 11% ACGCCGT:20% ACGGCGT:52%

The MEME Algorithm MEME uses expectation maximization (EM) to discover sequence motifs. 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3

The MEME Algorithm Step 1: Randomly guess the positions (and strands) of the sites. 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3

The MEME Algorithm Step 2: Build a PSSM from the sites. Alignment PSSM 1 AAAAGAGTCA 2 AAATGACTCA AAGTGAGTCA AAAAGAGTCA GGATGAGTCA N AAATGAGTCA 12 … w i j PSSM Count Matrix A C G T Step 2: Build a PSSM from the sites. 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3

The MEME Algorithm Step 3: Scan each sequence with the motif. A C G T 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3

Expectation-maximization best_score = 0 best_pssm = 0 for index in range(0, len(sequence) - width): pssm = make_pssm(sequence[index:index+width]) counts = [] while (not equal(pssm, counts)): counts = scan(sequence, pssm) pssm = make_pssm(counts) if (score_pssm(pssm) > best_score): best_score = score_pssm(pssm) best_pssm = pssm

MEME algorithm do for (width = min; width *= 2; width < max) Consider various widths do for (width = min; width *= 2; width < max) for each possible starting point run 1 iteration of EM select candidate starting points for each candidate run EM to convergence select best motif erase motif occurrences until (motif score < threshold) Heuristic to speed things up Find multiple motifs in one data set

Comparison Both EM and Gibbs sampling involve iterating over two steps Convergence: EM converges when the PSSM stops changing. Gibbs sampling runs until you ask it to stop. Solution: EM may not find the motif with the highest score. Gibbs sampling will provably find the motif with the highest score, if you let it run long enough.