Published by Sylvia Byrd. Modified over 9 years ago.
1
1 Prediction of Regulatory Elements Controlling Gene Expression Martin Tompa Computer Science & Engineering Genome Sciences University of Washington Seattle, Washington, U.S.A.
2
2 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter
3
3 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter
4
4 DNA, Genes, and Proteins DNA: program for cell processes. Proteins: execute cell processes. [Figure: a DNA strand containing a gene, which gives rise to a protein.]
5
5 Regulation of Genes What turns genes on (producing a protein) and off? When is a gene turned on or off? Where (in which cells) is a gene turned on? At what rate is the gene product produced?
6
6 Regulation of Genes Gene Regulatory Element Transcription Factor (Protein) DNA RNA polymerase (Protein)
7
7 Regulation of Genes DNA Regulatory Element Gene Transcription Factor (Protein) RNA polymerase (Protein)
8
8 Regulation of Genes RNA polymerase (Protein) DNA New protein Regulatory Element Gene Transcription Factor (Protein)
9
9 Goal Identify regulatory elements in DNA sequences. These are: Binding sites for proteins Short sequences (5-25 nucleotides) Up to 1000 nucleotides (or farther) from gene Inexactly repeating patterns (“motifs”)
10
10 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter
11
11 2 Types of Motif Discovery 1.Motif discovery by overrepresentation One species Multiple (co-regulated) genes 2.Motif discovery by phylogenetic footprinting Multiple species One gene
12
12 Overrepresentation: Daf-19 Binding Sites in C. elegans
che-2     GTTGTCATGGTGAC
daf-19    GTTTCCATGGAAAC
osm-1     GCTACCATGGCAAC
osm-6     GTTACCATAGTAAC
F02D8.3   GTTTCCATGGTAAC
[Figure axis label: -150]
13
13 Phylogenetic Footprinting: Regulatory Element of Growth Hormone Gene [Figure: sequences from chicken, rat, human, dog, and sheep near position -200; the conserved element appears as AGGGGATA or AGGGTATA.]
14
14 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter
15
15 MEME (Multiple EM for Motif Elicitation) Bailey & Elkan, 1995 Very general iterative method based on Expectation Maximization Available at meme.sdsc.edu/meme/website/intro.html
16
16 Overrepresented Motifs Given sequences $X = \{X_1, X_2, \ldots, X_n\}$, find statistically overrepresented motifs of length $k$. For simplicity, assume –Exactly one motif instance per sequence –Sequences over DNA alphabet
17
17 Hidden Information $Z = \{Z_{ij}\}$, where
$$Z_{ij} = \begin{cases} 1, & \text{if a motif instance starts at position } j \text{ of } X_i \\ 0, & \text{otherwise} \end{cases}$$
Iterate over probabilistic models that could generate $X$ and $Z$, trying to converge on this solution.
18
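18 Model Parameters Motif profile: $4 \times k$ matrix $\theta = (\theta_{rp})$, $r \in \{A, C, G, T\}$, $1 \le p \le k$, where $\theta_{rp} = \Pr(\text{residue } r \text{ in position } p \text{ of motif})$. Background distribution: $\theta_{r0} = \Pr(\text{residue } r \text{ in random nonmotif position})$.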
19
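19 Profile Example
Motif instances: GTTGTC, GTTTCC, GCTACC, GTTACC, GTTTCC
Profile θ (rows: residue; columns: motif positions 1 to 6)
A   0    0    0    0.4  0    0
C   0    0.2  0    0    0.8  1
G   1    0    0    0.2  0    0
T   0    0.8  1    0.4  0.2  0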
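The profile on this slide can be recomputed directly from the five motif instances. The sketch below is only an illustration: the function name `build_profile` and the dictionary layout are assumptions, not part of MEME.

```python
from collections import Counter

def build_profile(instances, alphabet="ACGT"):
    """Return {residue: [frequency at each motif position]} for equal-length instances."""
    k = len(instances[0])
    if any(len(s) != k for s in instances):
        raise ValueError("all motif instances must have the same length")
    profile = {r: [0.0] * k for r in alphabet}
    for p in range(k):
        # Count how often each residue occurs in column p of the instances.
        counts = Counter(s[p] for s in instances)
        for r in alphabet:
            profile[r][p] = counts[r] / len(instances)
    return profile

# The five instances shown on the slide.
instances = ["GTTGTC", "GTTTCC", "GCTACC", "GTTACC", "GTTTCC"]
theta = build_profile(instances)
for r in "ACGT":
    print(r, theta[r])
```

Each column of the resulting matrix sums to 1, since every instance contributes exactly one residue per position.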
20
20 Overview: Expectation Maximization Goal: Find profile θ and motif positions Z that have maximum likelihood At each iteration: –E-step: From θ predict likely motif positions Z –M-step: From sequences at positions Z compute new profile θ
21
21 Expectation Maximization Goal: Find $\theta$, $Z$ that maximize $\Pr(X, Z \mid \theta)$. At iteration $t$: –E-step: $Z^{(t)} = E(Z \mid X, \theta^{(t)})$ –M-step: Find $\theta^{(t+1)}$ that maximizes $\Pr(X, Z^{(t)} \mid \theta^{(t+1)})$
22
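22 E-step Details
$$Z_{ij}^{(t)} = \frac{\Pr(X_i \mid Z_{ij} = 1, \theta^{(t)})}{\sum_{j'} \Pr(X_i \mid Z_{ij'} = 1, \theta^{(t)})}$$
To evaluate $\Pr(X_i \mid Z_{ij} = 1, \theta^{(t)})$, use $\theta_1^{(t)}, \theta_2^{(t)}, \ldots, \theta_k^{(t)}$ for the $k$ positions of $X_i$ starting at $j$, and use $\theta_0^{(t)}$ for all other positions.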
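As a rough sketch of the E-step for a single sequence, the code below scores every start position with the motif profile inside the window and the background distribution elsewhere, then normalizes. The function name `e_step_row`, the dictionary layout, and the toy profile are assumptions made for illustration; the profile must contain no zero entries (for example, after adding pseudocounts) because the code works in log space.

```python
import math

def e_step_row(x, theta, theta0):
    """Posterior weights Z_ij over start positions j for one sequence x.

    theta:  {residue: [Pr(residue at motif position p) for p in 0..k-1]}
    theta0: {residue: Pr(residue at a non-motif position)}
    """
    k = len(next(iter(theta.values())))
    # Log-likelihood of the whole sequence under the background model.
    log_bg = sum(math.log(theta0[r]) for r in x)
    log_like = []
    for j in range(len(x) - k + 1):
        ll = log_bg
        for p in range(k):
            r = x[j + p]
            # Swap the background term for the motif term inside the window.
            ll += math.log(theta[r][p]) - math.log(theta0[r])
        log_like.append(ll)
    # Normalize over start positions (subtract the max for numerical stability).
    m = max(log_like)
    weights = [math.exp(ll - m) for ll in log_like]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy profile of length k = 2 that favors the dimer "AC".
theta = {"A": [0.7, 0.1], "C": [0.1, 0.7], "G": [0.1, 0.1], "T": [0.1, 0.1]}
theta0 = {r: 0.25 for r in "ACGT"}
z = e_step_row("GGACT", theta, theta0)
print(z)
```

Because the background terms are common to every start position, they cancel in the normalization; they are kept here only to mirror the formula.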
23
23 M-step Details If $Z_{ij}^{(t)} \in \{0, 1\}$, it would be straightforward: calculate profile $\theta_1, \theta_2, \ldots, \theta_k$ from motif instances and $\theta_{r0}$ from the frequency of $r$ outside of motif instances. But $Z_{ij}^{(t)} \in [0, 1]$, so weight these frequencies by the appropriate values of $Z_{ij}^{(t)}$.
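The weighted counting described here might look like the following sketch. The function name `m_step`, the `pseudocount` parameter, and the toy inputs are assumptions for illustration and are not taken from the MEME implementation.

```python
def m_step(seqs, Z, k, alphabet="ACGT", pseudocount=0.01):
    """Re-estimate the motif profile and background from soft assignments Z.

    Z[i][j] is the weight that the motif in seqs[i] starts at position j.
    """
    motif = {r: [0.0] * k for r in alphabet}
    background = {r: 0.0 for r in alphabet}
    for x, z in zip(seqs, Z):
        for r in x:
            background[r] += 1.0       # first count every residue as background
        for j, w in enumerate(z):
            for p in range(k):
                r = x[j + p]
                motif[r][p] += w       # expected count of r at motif position p
                background[r] -= w     # that share is no longer background
    theta = {r: [0.0] * k for r in alphabet}
    for p in range(k):
        col = sum(motif[r][p] + pseudocount for r in alphabet)
        for r in alphabet:
            theta[r][p] = (motif[r][p] + pseudocount) / col
    bg_total = sum(background[r] + pseudocount for r in alphabet)
    theta0 = {r: (background[r] + pseudocount) / bg_total for r in alphabet}
    return theta, theta0

# Hypothetical toy input: two sequences with soft weights over start positions.
seqs = ["GGACT", "ACGGG"]
Z = [[0.05, 0.05, 0.85, 0.05], [0.7, 0.1, 0.1, 0.1]]
theta, theta0 = m_step(seqs, Z, k=2)
print(theta["A"][0], theta["C"][1])
```

With weights restricted to 0 and 1 and `pseudocount=0.0`, this reduces to the straightforward frequency count over motif instances.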
24
24 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter
25
25 Gibbs Sampler Lawrence et al., 1993 Very general iterative method, related to Markov Chain Monte Carlo (MCMC) Available at bayesweb.wadsworth.org/gibbs/gibbs.html
26
26 One Iteration of Gibbs Sampler n motif instances each of length k GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAG CACGGGGGAGCCTGGAGGGGATCCGGAGGGGTG GGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAG GGAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGG GGCTGGGGTGGCGGTGGGAGCCCAGGACGTTG
27
27 One Iteration of Gibbs Sampler n motif instances each of length k Remove one at random Form profile of remaining n-1 Let p i be the probability with which g[i.. i+k-1] fits profile GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAG CACGGGGGAGCCTGGAGGGGATCCGGAGGGGTG GGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAG GGAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGG GGCTGGGGTGGCGGTGGGAGCCCAGGACGTTG i
28
28 One Iteration of Gibbs Sampler
n motif instances, each of length k
–Remove one at random
–Form profile of remaining n-1
–Let $p_i$ be the probability with which g[i..i+k-1] fits profile
–Choose to start replacement at i with probability proportional to $p_i$
GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAG
CACGGGGGAGCCTGGAGGGGATCCGGAGGGGTG
GGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAG
GGAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGG
GGCTGGGGTGGCGGTGGGAGCCCAGGACGTTG
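A minimal sketch of one such iteration follows, using the five sequences from the slide. The function name, the pseudocount smoothing, the motif length k = 8, and the random seed are all assumptions for illustration. The fit of a window is scored with the profile alone, as the slide states; published samplers typically also divide by a background model, which is omitted here.

```python
import random

def gibbs_iteration(seqs, starts, k, rng, alphabet="ACGT", pseudocount=1.0):
    """Resample one motif start in place; return the index of the resampled sequence."""
    i = rng.randrange(len(seqs))                       # remove one instance at random
    others = [s[a:a + k] for n, (s, a) in enumerate(zip(seqs, starts)) if n != i]
    # Profile of the remaining n-1 instances, smoothed with pseudocounts.
    profile = []
    for p in range(k):
        col = {r: pseudocount for r in alphabet}
        for inst in others:
            col[inst[p]] += 1.0
        total = sum(col.values())
        profile.append({r: c / total for r, c in col.items()})
    # Probability that each window g[j..j+k-1] of the removed sequence fits the profile.
    g = seqs[i]
    weights = []
    for j in range(len(g) - k + 1):
        w = 1.0
        for p in range(k):
            w *= profile[p][g[j + p]]
        weights.append(w)
    # Choose the replacement start with probability proportional to its weight.
    starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return i

seqs = [
    "GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAG",
    "CACGGGGGAGCCTGGAGGGGATCCGGAGGGGTG",
    "GGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAG",
    "GGAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGG",
    "GGCTGGGGTGGCGGTGGGAGCCCAGGACGTTG",
]
k = 8                                # assumed motif length
rng = random.Random(0)               # fixed seed so runs are repeatable
starts = [rng.randrange(len(s) - k + 1) for s in seqs]
for _ in range(200):
    gibbs_iteration(seqs, starts, k, rng)
print([s[a:a + k] for s, a in zip(seqs, starts)])
```

Sampling in proportion to the weights, rather than always taking the best window, is what lets the procedure escape poor initial choices.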
29
29 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter
30
30 FootPrinter Blanchette & Tompa, 2002 First algorithm explicitly designed for phylogenetic footprinting Available at bio.cs.washington.edu/software.html
31
31 Phylogenetic Footprinting (Tagle et al. 1988) Functional regions of DNA evolve slower than nonfunctional ones.
32
32 Phylogenetic Footprinting (Tagle et al. 1988) Functional regions of DNA evolve slower than nonfunctional ones. Consider a set of orthologous (i.e., corresponding) sequences from different species Identify unusually well conserved substrings (i.e., ones that have not changed much over the course of evolution)
33
33 CLUSTALW multiple sequence alignment (rbcS gene) [Figure: four blocks of a CLUSTALW alignment of rbcS sequences from cotton, pea, tobacco, ice-plant, turnip, wheat, duckweed, and larch.]
34
34 FootPrinter Inputs: –evolutionary tree T –corresponding regulatory regions at leaves Output: motifs well conserved w.r.t. T.
35
35 Finding Short Motifs AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACGT... (Rabbit) GAACGGAGTACGT... (Mouse) TCGTGACGGTGAT... (Rat) Size of motif sought: k = 4
36
36 Most Parsimonious Solution “Parsimony score”: 1 mutation AGTCGTACGTGAC... AGTAGACGTGCCG... ACGTGAGATACGT... GAACGGAGTACGT... TCGTGACGGTGAT... ACGG ACGT
37
37 Substring Parsimony Problem Given: phylogenetic tree T, set of orthologous sequences at leaves of T, length k of motif, threshold d. Problem: Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d. This problem is NP-hard.
38
38 FootPrinter’s Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975) $W_u[s]$ = best parsimony score for subtree rooted at node $u$, if $u$ is labeled with string $s$. Each node's table has $4^k$ entries. [Figure: tree over leaf sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG; every node carries a table of scores for k-mers such as ACGG and ACGT, with values 0, 1, 2, or $+\infty$.]
39
39
$$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$$
Running Time: with $n$ = number of species, $l$ = average sequence length, and $k$ = motif length, the total time is $O(nk(4^k + l))$.
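The recurrence for $W_u[s]$ can be sketched as a straightforward bottom-up table computation, summing over the children of each node the cheapest way to label that child. This is the naive version, not FootPrinter's optimized algorithm. The leaf sequences are those from the earlier "Finding Short Motifs" slide; the tree topology, the nested-tuple encoding, and the function names are assumptions for illustration.

```python
from itertools import product
import math

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def parsimony_table(node, leaves, kmers):
    """Return W_u as {s: best parsimony score of the subtree at node if labeled s}."""
    if isinstance(node, str):                      # leaf: node is a species name
        k = len(kmers[0])
        seq = leaves[node]
        present = {seq[j:j + k] for j in range(len(seq) - k + 1)}
        return {s: (0 if s in present else math.inf) for s in kmers}
    W = {s: 0 for s in kmers}
    for child in node:                             # internal node: tuple of children
        Wv = parsimony_table(child, leaves, kmers)
        finite = [(t, w) for t, w in Wv.items() if w < math.inf]
        for s in kmers:
            # Cheapest label t for this child, plus mutations along the edge.
            W[s] += min(w + hamming(s, t) for t, w in finite)
    return W

leaves = {
    "Human": "AGTCGTACGTGAC",
    "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT",
    "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}
tree = (("Human", "Chimp"), "Rabbit", ("Mouse", "Rat"))   # assumed topology
kmers = ["".join(p) for p in product("ACGT", repeat=4)]   # all 4^k strings, k = 4
W_root = parsimony_table(tree, leaves, kmers)
best = min(W_root.values())
print(best, W_root["ACGT"])
```

Recovering the actual k-mer chosen at each leaf would require a traceback over the tables, which is left out of this sketch.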
40
40 Improvements Better algorithm reduces time from $O(nk(4^{2k} + l))$ to $O(nk(4^k + l))$. By restricting to motifs with parsimony score at most $d$, greatly reduce the number of table entries computed (exponential in $d$, polynomial in $k$). Amenable to many useful extensions (e.g., allow insertions and deletions).
41
41 Application to β-actin Gene
Gilthead sea bream (678 bp)
Medaka fish (1016 bp)
Common carp (696 bp)
Grass carp (917 bp)
Chicken (871 bp)
Human (646 bp)
Rabbit (636 bp)
Rat (966 bp)
Mouse (684 bp)
Hamster (1107 bp)
42
42 [Figure: full sequences from common carp, chicken, and human, with the motifs found by FootPrinter marked according to their parsimony score over 10 vertebrates (0, 1, or 2).]
43
43 Motifs Absent from Some Species Find motifs –with small parsimony score –that span a large part of the tree Example: in tree of 10 species spanning 760 Myrs, find all motifs with –score 0 spanning at least 250 Myrs –score 1 spanning at least 350 Myrs –score 2 spanning at least 450 Myrs –score 3 spanning at least 550 Myrs
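The score and span thresholds in this example amount to a simple filter over candidate motifs. In the sketch below, the threshold table comes from the slide, while the function name and the candidate list are hypothetical and invented purely for illustration.

```python
# Thresholds from the slide: parsimony score -> minimum span in Myrs.
MIN_SPAN_MYRS = {0: 250, 1: 350, 2: 450, 3: 550}

def passes(score, span_myrs, thresholds=MIN_SPAN_MYRS):
    """True if a motif with this parsimony score spans enough of the tree."""
    return score in thresholds and span_myrs >= thresholds[score]

# Hypothetical candidates as (parsimony score, span in Myrs).
candidates = [(0, 300), (1, 300), (2, 450), (3, 760), (4, 760)]
kept = [c for c in candidates if passes(*c)]
print(kept)
```

The trade-off is that a motif allowed more mutations must be conserved over a larger portion of the tree to be reported.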
44
44 Application to c-fos Gene
Asked for motifs of length 10, with
–0 mutations over tree of size 6
–1 mutation over tree of size 11
–2 mutations over tree of size 16
–3 mutations over tree of size 21
–4 mutations over tree of size 26
[Figure: tree over puffer fish, chicken, pig, mouse, hamster, and human, with branch lengths 10, 2, 7, 2, 2, 2, 1, 0, 1, 1.]
Found:
–0 mutations over tree of size 8
–1 mutation over tree of size 16
–3 mutations over tree of size 21
–4 mutations over tree of size 28
45
45 Application to c-fos Gene
Motif                 Score   Conserved in          Known?
CAGGTGCGAATGTTC       0       4 mammals
TTCCCGCCTCCCCTCCCC    0       4 mammals             yes
GAGTTGGCTGcagcc       3       puffer + 4 mammals
GTTCCCGTCAATCcct      1       chicken + 4 mammals   yes
CACAGGATGTcc          4       all 6                 yes
AGGACATCTG            1       chicken + 4 mammals   yes
GTCAGCAGGTTTCCACG     0       4 mammals             yes
TACTCCAACCGC          0       4 mammals
metK in B. subtilis
46
46 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter
47
47 MicroFootPrinter Neph & Tompa, 2006 Designed specifically for phylogenetic footprinting in prokaryotic genomes Front end to FootPrinter Available at bio.cs.washington.edu/software.html
48
48 Microbial Footprinting
1454 prokaryotes with genomes completely sequenced (as of 2/17/2011)
–For any prokaryotic gene of interest, plenty of close genes in other species available
–Relatively simple genomes
MicroFootPrinter
–undergraduate Computational Biology Capstone project
–Goal: simple interface for microbiologists
–User specifies species and gene of interest
–Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters
49
49 Demo MicroFootPrinter home. Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester) –chvI (two component response regulator) –ropB (outer membrane protein)
50
50 Sample chvI motif
Parsimony score: 2   Span: 41.10   Significance score: 4.22
B. henselae        -151   GCTACAATTT
R. etli            -90    GCCACAATTT
R. leguminosarum   -106   GCCACAATTT
S. meliloti        -119   GCCACAATTT
S. medicae         -118   GCCACAATTT
A. tumefaciens     -105   GCCACAATTT
M. loti            -80    GCCACATTTT
M. sp.             -87    GCCACATTTT
O. anthropi        -158   GCCACATTTT
B. suis            -38    GCCACATTTT
B. melitensis      -156   GCCACATTTT
B. abortus         -156   GCCACATTTT
B. ovis            -156   GCCACATTTT
B. canis           -38    GCCACATTTT
51
51 Sample ropB motif
Parsimony score: 1   Span: 20.70   Significance score: 1.34
Jannaschia sp.     -151   CACATTTTGG
R. etli            -134   CACAATTTGG
R. leguminosarum   -135   CACAATTTGG
A. tumefaciens     -131   CACATTTTGG
S. meliloti        -128   CACATTTTGG
S. medicae         -128   CACATTTTGG
52
52 Combined ChvI Motif
ropB:     CACATTTTGG
chvI:     GCCACAATTT
Atu1221:  TTGTCACAAT
ultimate: GYCACAWTTTGG
Y = {C,T}, W = {A,T}
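One way to check that the three sites are consistent with the combined motif is to slide each site along the consensus and test the overlapping positions against the degenerate codes Y and W. The function name, the `min_overlap` parameter, and its value of 8 are assumptions for illustration.

```python
# Degenerate codes used on the slide: Y = {C,T}, W = {A,T}.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "W": "AT"}

def matching_offsets(site, consensus, min_overlap=8):
    """Offsets at which every overlapping base of site agrees with the consensus.

    Offset `off` places site[0] under consensus[off]; negative offsets let the
    site overhang the left end of the consensus.
    """
    offsets = []
    for off in range(min_overlap - len(site), len(consensus) - min_overlap + 1):
        pairs = [(site[i], consensus[i + off])
                 for i in range(len(site)) if 0 <= i + off < len(consensus)]
        if len(pairs) >= min_overlap and all(b in IUPAC[c] for b, c in pairs):
            offsets.append(off)
    return offsets

consensus = "GYCACAWTTTGG"
sites = {"ropB": "CACATTTTGG", "chvI": "GCCACAATTT", "Atu1221": "TTGTCACAAT"}
for name, site in sites.items():
    print(name, matching_offsets(site, consensus))
```

A positive offset means the site begins inside the consensus, while a negative offset means the site extends upstream of it.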