1 Prediction of Regulatory Elements Controlling Gene Expression Martin Tompa Computer Science & Engineering Genome Sciences University of Washington Seattle,


1 Prediction of Regulatory Elements Controlling Gene Expression Martin Tompa Computer Science & Engineering Genome Sciences University of Washington Seattle, Washington, U.S.A.

2 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter

3 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter

4 DNA, Genes, and Proteins DNA: program for cell processes. Proteins: execute cell processes. [Figure: a stretch of DNA containing a gene, and the protein produced from it.]

5 Regulation of Genes What turns genes on (producing a protein) and off? When is a gene turned on or off? Where (in which cells) is a gene turned on? At what rate is the gene product produced?

6 Regulation of Genes Gene Regulatory Element Transcription Factor (Protein) DNA RNA polymerase (Protein)

7 Regulation of Genes DNA Regulatory Element Gene Transcription Factor (Protein) RNA polymerase (Protein)

8 Regulation of Genes RNA polymerase (Protein) DNA New protein Regulatory Element Gene Transcription Factor (Protein)

9 Goal Identify regulatory elements in DNA sequences. These are: Binding sites for proteins Short sequences (5-25 nucleotides) Up to 1000 nucleotides (or farther) from gene Inexactly repeating patterns (“motifs”)

10 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter

11 2 Types of Motif Discovery 1. Motif discovery by overrepresentation: one species, multiple (co-regulated) genes. 2. Motif discovery by phylogenetic footprinting: multiple species, one gene.

12 Overrepresentation: Daf-19 Binding Sites in C. elegans
che-2    GTTGTCATGGTGAC
daf-19   GTTTCCATGGAAAC
osm-1    GCTACCATGGCAAC
osm-6    GTTACCATAGTAAC
F02D     GTTTCCATGGTAAC

13 Phylogenetic Footprinting: Regulatory Element of Growth Hormone Gene -200 Chicken Rat Human Dog Sheep AGGGGATA AGGGTATA

14 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter

15 MEME (Multiple EM for Motif Elicitation) Bailey & Elkan, 1995 Very general iterative method based on Expectation Maximization Available at meme.sdsc.edu/meme/website/intro.html

16 Overrepresented Motifs Given sequences X = {X_1, X_2, …, X_n}, find statistically overrepresented motifs of length k. For simplicity, assume –Exactly one motif instance per sequence –Sequences over DNA alphabet

17 Hidden Information Z = {Z_ij}, where
$$Z_{ij} = \begin{cases} 1, & \text{if a motif instance starts at position } j \text{ of } X_i \\ 0, & \text{otherwise} \end{cases}$$
Iterate over probabilistic models that could generate X and Z, trying to converge on this solution.

18 Model Parameters Motif profile: 4×k matrix $\theta = (\theta_{rp})$, with $r \in \{A,C,G,T\}$ and $1 \le p \le k$, where $\theta_{rp}$ = Pr(residue r in position p of motif). Background distribution: $\theta_{r0}$ = Pr(residue r in random nonmotif position).

19 Profile Example Motif instances: GTTGTC, GTTTCC, GCTACC, GTTACC, GTTTCC. [Figure: the profile θ built from these five instances.]
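The profile matrix itself did not survive in the transcript, so here is a minimal sketch of how such a profile could be computed from the five instances on this slide. The function name `build_profile` is hypothetical, and the sketch uses plain column frequencies with no pseudocounts, which is an assumption rather than what MEME necessarily does.

```python
from collections import Counter

ALPHABET = "ACGT"


def build_profile(instances):
    """Return {residue: [frequency at each motif position]}.

    Plain relative frequencies per column; no pseudocounts (an assumption).
    """
    k = len(instances[0])
    if any(len(s) != k for s in instances):
        raise ValueError("all motif instances must have the same length")
    profile = {r: [0.0] * k for r in ALPHABET}
    for p in range(k):
        counts = Counter(s[p] for s in instances)
        for r in ALPHABET:
            profile[r][p] = counts[r] / len(instances)
    return profile


# The five motif instances listed on the slide.
instances = ["GTTGTC", "GTTTCC", "GCTACC", "GTTACC", "GTTTCC"]
theta = build_profile(instances)

# Print one row per residue, one column per motif position.
for r in ALPHABET:
    print(r, " ".join(f"{x:.1f}" for x in theta[r]))
```

Each column sums to 1. For instance, position 4 of the motif comes out as A 0.4, G 0.2, T 0.4, reflecting the disagreement among the instances at that position.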

20 Overview: Expectation Maximization Goal: Find profile θ and motif positions Z that have maximum likelihood At each iteration: –E-step: From θ predict likely motif positions Z –M-step: From sequences at positions Z compute new profile θ

21 Expectation Maximization Goal: Find θ, Z that maximize $\Pr(X, Z \mid \theta)$. At iteration t: –E-step: $Z^{(t)} = E(Z \mid X, \theta^{(t)})$ –M-step: Find $\theta^{(t+1)}$ that maximizes $\Pr(X, Z^{(t)} \mid \theta^{(t+1)})$

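22 E-step Details
$$Z_{ij}^{(t)} = \frac{\Pr(X_i \mid Z_{ij}=1, \theta^{(t)})}{\sum_{j'} \Pr(X_i \mid Z_{ij'}=1, \theta^{(t)})}$$
To evaluate $\Pr(X_i \mid Z_{ij}=1, \theta^{(t)})$: use $\theta_1^{(t)}, \theta_2^{(t)}, \ldots, \theta_k^{(t)}$ for the k positions of $X_i$ starting at j, and use $\theta_0^{(t)}$ for all other positions of $X_i$.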
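A minimal sketch of the E-step for a single sequence, under the slide's assumption of exactly one motif instance per sequence. The function name `e_step_row`, the toy profile, and the uniform background are all illustrative assumptions. The sketch works in log space and requires every profile entry to be strictly positive.

```python
import math


def e_step_row(seq, theta, background):
    """Return Z_ij for every start position j of one sequence X_i.

    theta[r][p] is Pr(residue r at motif position p); background[r] is
    Pr(residue r at a nonmotif position). All entries must be > 0.
    """
    k = len(next(iter(theta.values())))
    # Log-likelihood of the whole sequence if every position were background.
    log_bg_total = sum(math.log(background[r]) for r in seq)
    log_like = []
    for j in range(len(seq) - k + 1):
        ll = log_bg_total
        for p in range(k):
            r = seq[j + p]
            # Swap the background term for the motif term inside the window.
            ll += math.log(theta[r][p]) - math.log(background[r])
        log_like.append(ll)
    # Normalize over all start positions (the denominator of the E-step).
    top = max(log_like)
    weights = [math.exp(ll - top) for ll in log_like]
    total = sum(weights)
    return [w / total for w in weights]


# Toy inputs (assumptions): a length-2 motif that prefers "AT".
example_theta = {
    "A": [0.7, 0.1],
    "C": [0.1, 0.1],
    "G": [0.1, 0.1],
    "T": [0.1, 0.7],
}
example_background = {r: 0.25 for r in "ACGT"}
z_row = e_step_row("CCATCC", example_theta, example_background)
print([round(z, 3) for z in z_row])
```

The posterior mass concentrates on the start position of "AT", while the other windows share the remainder equally.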

23 M-step Details If $Z_{ij}^{(t)} \in \{0,1\}$, it would be straightforward: calculate the profile θ_1, θ_2, …, θ_k from the motif instances, and θ_r0 from the frequency of r outside of motif instances. But $Z_{ij}^{(t)} \in [0,1]$, so weight these frequencies by the appropriate values of $Z_{ij}^{(t)}$.
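The weighting described above can be sketched as follows. The function name `m_step` and the `pseudocount` parameter are assumptions for illustration. Each window contributes to the motif counts in proportion to its Z value, and each position contributes to the background in proportion to the probability that no motif window covers it.

```python
ALPHABET = "ACGT"


def m_step(seqs, Z, k, pseudocount=0.01):
    """Re-estimate (theta, background) from expected counts weighted by Z.

    Z[i][j] is the current estimate that the motif in seqs[i] starts at j.
    """
    motif_counts = {r: [pseudocount] * k for r in ALPHABET}
    bg_counts = {r: pseudocount for r in ALPHABET}
    for seq, z_row in zip(seqs, Z):
        # Motif columns: every window adds weight z to the residues it covers.
        for j, z in enumerate(z_row):
            for p in range(k):
                motif_counts[seq[j + p]][p] += z
        # Background: weight each position by Pr(it lies outside the motif).
        for i, r in enumerate(seq):
            first = max(0, i - k + 1)
            last = min(i, len(z_row) - 1)
            covered = sum(z_row[j] for j in range(first, last + 1))
            bg_counts[r] += 1.0 - covered
    theta = {r: [0.0] * k for r in ALPHABET}
    for p in range(k):
        column_total = sum(motif_counts[r][p] for r in ALPHABET)
        for r in ALPHABET:
            theta[r][p] = motif_counts[r][p] / column_total
    bg_total = sum(bg_counts.values())
    background = {r: bg_counts[r] / bg_total for r in ALPHABET}
    return theta, background


# Toy check with hard assignments (Z entries are 0 or 1), the easy case
# described on the slide: both motif instances are "AT".
toy_seqs = ["CCATCC", "GATGGG"]
toy_Z = [[0, 0, 1, 0, 0], [0, 1, 0, 0, 0]]
toy_theta, toy_background = m_step(toy_seqs, toy_Z, k=2, pseudocount=0.0)
print(toy_theta)
print(toy_background)
```

With fractional Z values the same code applies unchanged; a small positive `pseudocount` keeps every entry nonzero so that a following E-step can take logarithms.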

24 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter

25 Gibbs Sampler Lawrence et al., 1993 Very general iterative method, related to Markov Chain Monte Carlo (MCMC) Available at bayesweb.wadsworth.org/gibbs/gibbs.html

26 One Iteration of Gibbs Sampler n motif instances each of length k GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAG CACGGGGGAGCCTGGAGGGGATCCGGAGGGGTG GGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAG GGAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGG GGCTGGGGTGGCGGTGGGAGCCCAGGACGTTG

27 One Iteration of Gibbs Sampler n motif instances, each of length k. Remove one at random. Form profile of remaining n-1. Let p_i be the probability with which g[i..i+k-1] fits the profile. GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAG CACGGGGGAGCCTGGAGGGGATCCGGAGGGGTG GGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAG GGAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGG GGCTGGGGTGGCGGTGGGAGCCCAGGACGTTG

28 One Iteration of Gibbs Sampler n motif instances, each of length k. Remove one at random. Form profile of remaining n-1. Let p_i be the probability with which g[i..i+k-1] fits the profile, where g is the sequence whose instance was removed. Choose to start the replacement at i with probability proportional to p_i. GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAG CACGGGGGAGCCTGGAGGGGATCCGGAGGGGTG GGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAG GGAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGG GGCTGGGGTGGCGGTGGGAGCCCAGGACGTTG
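The steps of one iteration can be sketched as below. The names `profile_from` and `gibbs_iteration`, the add-one pseudocount, the motif length k = 8, and the starting positions are assumptions for illustration. The sketch scores each window by its plain probability under the profile, as the slide states, and does not reproduce other details of the published Gibbs sampler.

```python
import random

ALPHABET = "ACGT"


def profile_from(instances, k, pseudocount=1.0):
    """Column-wise residue probabilities, smoothed with a pseudocount."""
    columns = []
    for p in range(k):
        counts = {r: pseudocount for r in ALPHABET}
        for s in instances:
            counts[s[p]] += 1
        total = sum(counts.values())
        columns.append({r: c / total for r, c in counts.items()})
    return columns


def gibbs_iteration(seqs, starts, k, rng):
    """Resample the motif start in one randomly chosen sequence."""
    held_out = rng.randrange(len(seqs))  # remove one instance at random
    others = [seqs[i][starts[i]:starts[i] + k]
              for i in range(len(seqs)) if i != held_out]
    profile = profile_from(others, k)    # profile of the remaining n-1
    g = seqs[held_out]
    weights = []
    for i in range(len(g) - k + 1):
        p_i = 1.0
        for p in range(k):
            p_i *= profile[p][g[i + p]]  # how well g[i..i+k-1] fits
        weights.append(p_i)
    # Choose the replacement start with probability proportional to p_i.
    new_start = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    updated = list(starts)
    updated[held_out] = new_start
    return updated


# The five sequences shown on the slide; k = 8 is an assumed motif length.
example_seqs = [
    "GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAG",
    "CACGGGGGAGCCTGGAGGGGATCCGGAGGGGTG",
    "GGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAG",
    "GGAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGG",
    "GGCTGGGGTGGCGGTGGGAGCCCAGGACGTTG",
]
rng = random.Random(0)
example_starts = [0] * len(example_seqs)
for _ in range(200):
    example_starts = gibbs_iteration(example_seqs, example_starts, 8, rng)
print(example_starts)
```

Because the replacement is sampled rather than chosen greedily, repeated runs with different seeds may settle on different motif positions.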

29 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter

30 FootPrinter Blanchette & Tompa, 2002 First algorithm explicitly designed for phylogenetic footprinting Available at bio.cs.washington.edu/software.html

31 Phylogenetic Footprinting (Tagle et al. 1988) Functional regions of DNA evolve slower than nonfunctional ones.

32 Phylogenetic Footprinting (Tagle et al. 1988) Functional regions of DNA evolve slower than nonfunctional ones. Consider a set of orthologous (i.e., corresponding) sequences from different species Identify unusually well conserved substrings (i.e., ones that have not changed much over the course of evolution)

33 CLUSTALW multiple sequence alignment (rbcS gene) [Figure: CLUSTALW alignment of the rbcS sequences from cotton, pea, tobacco, ice-plant, turnip, wheat, duckweed, and larch, shown in four alignment blocks.]

34 FootPrinter Inputs: –evolutionary tree T –corresponding regulatory regions at leaves Output: motifs well conserved w.r.t. T.

35 Finding Short Motifs AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACGT... (Rabbit) GAACGGAGTACGT... (Mouse) TCGTGACGGTGAT... (Rat) Size of motif sought: k = 4

36 Most Parsimonious Solution “Parsimony score”: 1 mutation AGTCGTACGTGAC... AGTAGACGTGCCG... ACGTGAGATACGT... GAACGGAGTACGT... TCGTGACGGTGAT... ACGG ACGT

37 Substring Parsimony Problem Given: phylogenetic tree T, set of orthologous sequences at leaves of T, length k of motif, threshold d. Problem: Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d. This problem is NP-hard.

38 FootPrinter’s Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975) W_u[s] = best parsimony score for subtree rooted at node u, if u is labeled with string s. [Figure: phylogenetic tree with leaf sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG. Each node u carries a table W_u with 4^k entries, one per k-mer s, with ACGG and ACGT shown as sample entries; leaf entries are 0 or +∞.]

39
$$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$$
Running Time: total time $O(n k (4^{k} + l))$, where n is the number of species, l is the average sequence length, and k is the motif length.
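The recurrence can be sketched as a direct dynamic program over the tree. This is the straightforward version that tries every pair of k-mers, not FootPrinter's optimized implementation. The names `hamming` and `parsimony_table`, the nested-tuple tree encoding, the use of Hamming distance for d(s, t), and the tree topology below are all assumptions; the five leaf sequences and k = 4 are taken from the "Finding Short Motifs" slide.

```python
from itertools import product

INF = float("inf")


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def parsimony_table(node, k, kmers):
    """Return W_u as a dict mapping each k-mer s to W_u[s].

    A leaf is a sequence string; an internal node is a tuple of children.
    """
    if isinstance(node, str):
        # Leaf: score 0 for k-mers that occur in the sequence, else infinity.
        present = {node[i:i + k] for i in range(len(node) - k + 1)}
        return {s: (0 if s in present else INF) for s in kmers}
    child_tables = [parsimony_table(child, k, kmers) for child in node]
    table = {}
    for s in kmers:
        total = 0
        for w_v in child_tables:
            # min over labels t of the child: W_v[t] + d(s, t)
            total += min(w + hamming(s, t)
                         for t, w in w_v.items() if w != INF)
        table[s] = total
    return table


k = 4
kmers = ["".join(p) for p in product("ACGT", repeat=k)]
human, chimp = "AGTCGTACGTGAC", "AGTAGACGTGCCG"
rabbit, mouse, rat = "ACGTGAGATACGT", "GAACGGAGTACGT", "TCGTGACGGTGAT"
# Assumed topology: ((Human, Chimp), (Rabbit, (Mouse, Rat))).
tree = ((human, chimp), (rabbit, (mouse, rat)))
root_table = parsimony_table(tree, k, kmers)
best_score = min(root_table.values())
print(best_score, [s for s in kmers if root_table[s] == best_score])
```

No 4-mer occurs in all five sequences, so a score of 0 is impossible; labeling every internal node ACGT costs a single mutation (ACGG in rat), which matches the parsimony score of 1 on the "Most Parsimonious Solution" slide.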

40 Improvements Better algorithm reduces time from $O(n k (4^{2k} + l))$ to $O(n k (4^{k} + l))$. By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k). Amenable to many useful extensions (e.g., allow insertions and deletions).

41 Application to β-actin Gene Gilthead sea bream (678 bp), Medaka fish (1016 bp), Common carp (696 bp), Grass carp (917 bp), Chicken (871 bp), Human (646 bp), Rabbit (636 bp), Rat (966 bp), Mouse (684 bp), Hamster (1107 bp)

42 [Figure: the sequences from common carp, chicken, and human, with conserved motifs highlighted and coded by parsimony score over 10 vertebrates (0, 1, or 2). Highlighted motifs include TTGGCATGGCTT, TTGACTCAGGAT, AGTCATTCCAAAT, TGTAAATTATGT, and CCTGTACACTGAC.]

43 Motifs Absent from Some Species Find motifs –with small parsimony score –that span a large part of the tree Example: in tree of 10 species spanning 760 Myrs, find all motifs with –score 0 spanning at least 250 Myrs –score 1 spanning at least 350 Myrs –score 2 spanning at least 450 Myrs –score 3 spanning at least 550 Myrs

44 Application to c-fos Gene Species: puffer fish, chicken, pig, mouse, hamster, human. Asked for motifs of length 10, with: 0 mutations over tree of size 6; 1 mutation over tree of size 11; 2 mutations over tree of size 16; 3 mutations over tree of size 21; 4 mutations over tree of size 26. Found: 0 mutations over tree of size 8; 1 mutation over tree of size 16; 3 mutations over tree of size 21; 4 mutations over tree of size 28.

45 Application to c-fos Gene
Motif                Score   Conserved in          Known?
CAGGTGCGAATGTTC      0       4 mammals
TTCCCGCCTCCCCTCCCC   0       4 mammals             yes
GAGTTGGCTGcagcc      3       puffer + 4 mammals
GTTCCCGTCAATCcct     1       chicken + 4 mammals   yes
CACAGGATGTcc         4       all 6                 yes
AGGACATCTG           1       chicken + 4 mammals   yes
GTCAGCAGGTTTCCACG    0       4 mammals             yes
TACTCCAACCGC         0       4 mammals
metK in B. subtilis

46 Outline Regulation of genes Motif discovery by overrepresentation –MEME –Gibbs sampling Motif discovery by phylogenetic footprinting –FootPrinter –MicroFootPrinter

47 MicroFootPrinter Neph & Tompa, 2006 Designed specifically for phylogenetic footprinting in prokaryotic genomes Front end to FootPrinter Available at bio.cs.washington.edu/software.html

48 Microbial Footprinting 1454 prokaryotes with genomes completely sequenced (as of 2/17/2011) –For any prokaryotic gene of interest, plenty of close genes in other species available –Relatively simple genomes. MicroFootPrinter –Undergraduate Computational Biology Capstone project –Goal: simple interface for microbiologists –User specifies species and gene of interest –Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters

49 Demo MicroFootPrinter home. Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester) –chvI (two component response regulator) –ropB (outer membrane protein)

50 Sample chvI motif
Parsimony score: 2
Span:
Significance score: 4.22
B. henselae               GCTACAATTT
R. etli            -90    GCCACAATTT
R. leguminosarum   -106   GCCACAATTT
S. meliloti        -119   GCCACAATTT
S. medicae         -118   GCCACAATTT
A. tumefaciens     -105   GCCACAATTT
M. loti            -80    GCCACATTTT
M. sp.             -87    GCCACATTTT
O. anthropi        -158   GCCACATTTT
B. suis            -38    GCCACATTTT
B. melitensis      -156   GCCACATTTT
B. abortus         -156   GCCACATTTT
B. ovis            -156   GCCACATTTT
B. canis           -38    GCCACATTTT

51 Sample ropB motif
Parsimony score: 1
Span: 20.70
Significance score: 1.34
Jannaschia sp.     -151   CACATTTTGG
R. etli            -134   CACAATTTGG
R. leguminosarum   -135   CACAATTTGG
A. tumefaciens     -131   CACATTTTGG
S. meliloti        -128   CACATTTTGG
S. medicae         -128   CACATTTTGG

52 Combined ChvI Motif
ropB:     CACATTTTGG
chvI:     GCCACAATTT
Atu1221:  TTGTCACAAT
ultimate: GYCACAWTTTGG
Y = {C,T}, W = {A,T}
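To see how the three motifs line up against the combined consensus, the following sketch slides each one along GYCACAWTTTGG and reports the offsets at which every overlapping base is allowed by the corresponding IUPAC code. The function name `matching_offsets` and the minimum overlap of 8 are assumptions for illustration.

```python
# IUPAC nucleotide codes: each code maps to the set of bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "N": "ACGT",
}


def matching_offsets(consensus, site, min_overlap):
    """Offsets at which site agrees with consensus over their overlap.

    An offset is the consensus index aligned with site[0]; a negative
    offset means the site starts to the left of the consensus.
    """
    hits = []
    for offset in range(min_overlap - len(site),
                        len(consensus) - min_overlap + 1):
        pairs = [(consensus[i], site[i - offset])
                 for i in range(len(consensus))
                 if 0 <= i - offset < len(site)]
        if len(pairs) >= min_overlap and all(
                base in IUPAC[code] for code, base in pairs):
            hits.append(offset)
    return hits


consensus = "GYCACAWTTTGG"
sites = {"ropB": "CACATTTTGG", "chvI": "GCCACAATTT", "Atu1221": "TTGTCACAAT"}
offsets = {name: matching_offsets(consensus, s, 8) for name, s in sites.items()}
for name, found in offsets.items():
    print(name, found)
```

Under these assumptions each motif fits at exactly one offset: chvI at 0, ropB at 2, and Atu1221 at -2, the last because its two leading bases fall to the left of the consensus.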