Gene Regulation and Microarrays …after which we come back to multiple alignments for finding regulatory motifs.

Slides:



Advertisements
Similar presentations
Gene Regulation and Microarrays. Finding Regulatory Motifs Given a collection of genes with common expression, Find the TF-binding motif in common......
Advertisements

Hidden Markov Model in Biological Sequence Analysis – Part 2
CS5263 Bioinformatics Probabilistic modeling approaches for motif finding.
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
March 03 Identification of Transcription Factor Binding Sites Presenting: Mira & Tali.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Microarray technology and analysis of gene expression data Hillevi Lindroos.
CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays.
Lecture 6, Thursday April 17, 2003
Introduction to DNA Microarrays Todd Lowe BME 88a March 11, 2003.
Gene Regulation and Microarrays. Overview A. Gene Expression and Regulation B. Measuring Gene Expression: Microarrays C. Finding Regulatory Motifs.
Challenges for computer science as a part of Systems Biology Benno Schwikowski Institute for Systems Biology Seattle, WA.
Motif Finding. Regulation of Genes Gene Regulatory Element RNA polymerase (Protein) Transcription Factor (Protein) DNA.
Gibbs Sampling in Motif Finding. Gibbs Sampling Given:  x 1, …, x N,  motif length K,  background B, Find:  Model M  Locations a 1,…, a N in x 1,
1 Regulatory Motif Finding Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting, Blanchette & Tompa (2002) Statistical.
Identification of a Novel cis-Regulatory Element Involved in the Heat Shock Response in Caenorhabditis elegans Using Microarray Gene Expression and Computational.
Genome annotation. What we have GATCAATGATGATAGGAATTGAAAGTGTCTTAATTACAATCCCTGTGCAATTATTAATAACTTTTTTGTT CACCTGTTCCCAGAGGAAACCTCAAGCGGATCTAAAGGAGGTATCTCCTCAAAAGCATCCTCTAATGTCA.
Microarrays. Regulation of Gene Expression Cells respond to environment Heat Food Supply Responds to environmental conditions Various external messages.
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
Identification of regulatory elements. Transcriptional Regulation Strongest regulation happens during transcription Best place to regulate: No energy.
Comparative Motif Finding
Transcription factor binding motifs (part I) 10/17/07.
(Regulatory-) Motif Finding
MOPAC: Motif-finding by Preprocessing and Agglomerative Clustering from Microarrays Thomas R. Ioerger 1 Ganesh Rajagopalan 1 Debby Siegele 2 1 Department.
Regulatory motif discovery 6.095/ Computational Biology: Genomes, Networks, Evolution Lecture 10Oct 12, 2005.
CS262 Lecture 18, Win07, Batzoglou Sequence Logos Height of each letter proportional to its frequency Height of all letters proportional to information.
Clustering (Gene Expression Data) 6.095/ Computational Biology: Genomes, Networks, Evolution LectureOctober 4, 2005.
Experimental methods in genome analysis. Genomic sequences are boring GATCAATGATGATAGGAATTGAAAGTGTCTTAATTACAATCCCTGTGCAATTATTAATAACTTTTTTGTT CACCTGTTCCCAGAGGAAACCTCAAGCGGATCTAAAGGAGGTATCTCCTCAAAAGCATCCTCTAATGTCA.
Fuzzy K means.
Microarray analysis 2 Golan Yona. 2) Analysis of co-expression Search for similarly expressed genes experiment1 experiment2 experiment3 ……….. Gene i:
(Regulatory-) Motif Finding. Clustering of Genes Find binding sites responsible for common expression patterns.
Promoter Analysis using Bioinformatics, Putting the Predictions to the Test Amy Creekmore Ansci 490M November 19, 2002.
CS262 Lecture 17, Win07, Batzoglou Gene Regulation and Microarrays.
Motif finding : Lecture 2 CS 498 CXZ. Recap Problem 1: Given a motif, finding its instances Problem 2: Finding motif ab initio. –Paradigm: look for over-represented.
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
Learning Structure in Bayes Nets (Typically also learn CPTs here) Given the set of random variables (features), the space of all possible networks.
Detecting binding sites for transcription factors by correlating sequence data with expression. Erik Aurell Adam Ameur Jakub Orzechowski Westholm in collaboration.
Applying statistical tests to microarray data. Introduction to filtering Recall- Filtering is the process of deciding which genes in a microarray experiment.
Learning Regulatory Networks that Represent Regulator States and Roles Keith Noto and Mark Craven K. Noto and M. Craven, Learning Regulatory.
Expectation Maximization and Gibbs Sampling – Algorithms for Computational Biology Lecture 1- Introduction Lecture 2- Hashing and BLAST Lecture 3-
Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm Mathieu Blanchette Martin Tompa Computer Science & Engineering University of.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
JM - 1 Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching Jarek Meller Jarek Meller Division of Biomedical.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
1 FINAL PROJECT- Key dates –last day to decided on a project * 11-10/1- Presenting a proposed project in small groups A very short presentation (Max.
Comparative Sequence Analysis in Molecular Biology Martin Tompa Computer Science & Engineering Genome Sciences University of Washington Seattle, Washington,
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
1 Finding Regulatory Motifs. 2 Copyright notice Many of the images in this power point presentation are from Bioinformatics and Functional Genomics by.
Gene expression. The information encoded in a gene is converted into a protein  The genetic information is made available to the cell Phases of gene.
1 Prediction of Regulatory Elements Controlling Gene Expression Martin Tompa Computer Science & Engineering Genome Sciences University of Washington Seattle,
Gene Expression Analysis. 2 DNA Microarray First introduced in 1987 A microarray is a tool for analyzing gene expression in genomic scale. The microarray.
Gene Expression and Networks. 2 Microarray Analysis Supervised Methods -Analysis of variance -Discriminate analysis -Support Vector Machine (SVM) Unsupervised.
Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.
Flat clustering approaches
CS 6243 Machine Learning Advanced topic: pattern recognition (DNA motif finding)
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Transcription factor binding motifs (part II) 10/22/07.
Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.
CS5263 Bioinformatics Lecture 11 Motif finding. HW2 2(C) Click to find out K and lambda.
1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003.
CS5263 Bioinformatics Lecture 19 Motif finding. (Sequence) motif finding Given a set of sequences Goal: find sequence motifs that appear in all or the.
A Very Basic Gibbs Sampler for Motif Detection
Learning Sequence Motif Models Using Expectation Maximization (EM)
(Regulatory-) Motif Finding
Motif finding in groups of related sequences
Presentation transcript:

Gene Regulation and Microarrays …after which we come back to multiple alignments for finding regulatory motifs

Overview A. Gene Expression and Regulation B. Measuring Gene Expression: Microarrays C. Finding Regulatory Motifs

A. Regulation of Gene Expression

Cells respond to environment Heat Food Supply Responds to environmental conditions Various external messages

Genome is fixed – Cells are dynamic A genome is static –Every cell in our body has a copy of same genome A cell is dynamic –Responds to external conditions –Most cells follow a cell cycle of division Cells differentiate during development

Gene regulation … is responsible for the dynamic cell Gene expression varies according to: –Cell type –Cell cycle –External conditions –Location

Where gene regulation takes place Opening of chromatin Transcription Translation Protein stability Protein modifications

Transcriptional Regulation Strongest regulation happens during transcription Best place to regulate: No energy wasted making intermediate products However, slowest response time After a receptor notices a change: 1.Cascade message to nucleus 2.Open chromatin & bind transcription factors 3.Recruit RNA polymerase and transcribe 4.Splice mRNA and send to cytoplasm 5.Translate into protein

Transcription Factors Binding to DNA Transcription regulation: Certain transcription factors bind DNA Binding recognizes DNA substrings: Regulatory motifs

Promoter and Enhancers Promoter necessary to start transcription Enhancers can affect transcription from afar

Regulation of Genes Gene Regulatory Element RNA polymerase (Protein) Transcription Factor (Protein) DNA

Regulation of Genes Gene RNA polymerase Transcription Factor (Protein) Regulatory Element DNA

Regulation of Genes Gene RNA polymerase Transcription Factor Regulatory Element DNA New protein

Example: A Human heat shock protein TATA box: positioning transcription start TATA, CCAAT: constitutive transcription GRE: glucocorticoid response MRE: metal response HSE: heat shock element TATASP1 CCAAT AP2 HSE AP2CCAAT SP1 promoter of heat shock hsp GENE

The Cell as a Regulatory Network Genes = wires Motifs = gates ABMake DC If C then D If B then NOT D If A and B then D D Make BD If D then B C gene D gene B

The Cell as a Regulatory Network (2)

B. DNA Microarrays Measuring gene transcription in a high- throughput fashion

What is a microarray

What is a microarray (2) A 2D array of DNA sequences from thousands of genes Each spot has many copies of same gene Allow mRNAs from a sample to hybridize Measure number of hybridizations per spot

How to make a microarray Method 1: Printed Slides (Stanford) –Use PCR to amplify a 1Kb portion of each gene –Apply each sample on glass slide Method 2: DNA Chips (Affymetrix) –Grow oligonucleotides (20bp) on glass –Several words per gene (choose unique words) If we know the gene sequences, Can sample all genes in one experiment!

Goal of Microarray Experiments Measure level of gene expression across many different conditions: –Expression Matrix M: {genes}  {conditions}: M ij = |gene i | in condition j Deduce gene function Deduce gene regulatory networks – parts and connections-level description of biology

Steps Towards Achieving this Goal 1.Removing noise from gene expression levels 2.Feature Extraction 3.Clustering of genes/conditions 4.Analysis a.Statistical significance of clusters b.Finding regulatory sequence motifs c.Building regulatory networks d.Experimental verification

1. Removing Noise from Gene Expression Levels Expression levels vary with time, labs, concentrations, chemicals used Noise model: M ij = c i (a ij g i T i +  ij ) –M ij, T ij : observed and true level gene j, chip i –g i, c j : mult. error constant for gene i, chip j –a ij,  ij : error terms Parameter Estimation –c j : spike in control probes –g i : control experiment of known concentration –  ij, a ij : minimize according to normal distribution

2. Feature Extraction Sample Correlation –Expression level can be different, but genes related; or similar, but genes unrelated Select most relevant features –In clustering genes, most meaningful chips –In clustering conditions, most meaningful genes

3. Clustering of Genes and Conditions Unsupervised: –Hierarchical clustering –K-means clustering –Self Organizing Maps (SOMs) –Singular Value Decomposition (SVD) Supervised: –Support Vector Machines Could be useful to separate patient from non-patient genes and samples

Results of Clustering Gene Expression Human tumor patient and normal cells; various conditions Cluster or Classify genes according to tumors Cluster tumors according to genes

4. Analysis of Clustered Data Statistical Significance of Clusters Regulatory motifs responsible for common expression Regulatory Networks Experimental Verification

C. Finding Regulatory Motifs Tiny Multiple Local Alignments of Many Sequences

Finding Regulatory Motifs Given a collection of genes with common expression, Find the TF-binding motif in common......

Characteristics of Regulatory Motifs Tiny Highly Variable ~Constant Size –Because a constant-size transcription factor binds Often repeated Low-complexity-ish

Problem Definition Probabilistic Motif: M ij ; 1  i  W 1  j  4 M ij = Prob[ letter i, pos j ] Find best M, and positions p 1,…, p N in sequences Combinatorial Motif M: m 1 …m W Some of the m i ’s blank Find M that occurs in all s i with  k differences Given a collection of promoter sequences s 1,…, s N of genes with common expression

Essentially a Multiple Local Alignment Find “best” multiple local alignment Alignment score defined differently in probabilistic/combinatorial cases......

Algorithms Probabilistic 1.Expectation Maximization: MEME 2.Gibbs Sampling: AlignACE, BioProspector Combinatorial CONSENSUS, TEIRESIAS, SP-STAR, others

Discrete Approaches to Motif Finding

Discrete Formulations Given sequences S = {x 1, …, x n } A motif W is a consensus string w 1 …w K Find motif W * with “best” match to x 1, …, x n Definition of “best”: d(W, x i ) = min hamming dist. between W and a word in x i d(W, S) =  i d(W, x i )

Approaches Exhaustive Searches CONSENSUS MULTIPROFILER, TEIRESIAS, SP-STAR, WINNOWER

Exhaustive Searches Pattern-driven algorithm: For W = AA…A to TT…T (4 K possibilities) Find d( W, S ) Report W* = argmin( d(W, S) ) Running time: O( K N 4 K ) (where N =  i |x i |)

Exhaustive Searches (2) 2. Sample-driven algorithm: For W = a K-long word in some x i Find d( W, S ) Report W* = argmin( d( W, S ) ) OR Report a local improvement of W * Running time: O( K N 2 )

Exhaustive Searches (3) Problem with sample-driven approach: If: –True motif does not occur in data, and –True motif is “weak” Then, –random strings may score better than any instance of true motif

CONSENSUS (1) Algorithm: Cycle 1: For each word W in S For each word W’ in S Create alignment (gap free) of W, W’ Keep the C 1 best alignments, A 1, …, A C1 ACGGTTG,CGAACTT,GGGCTCT … ACGCCTG,AGAACTA,GGGGTGT …

CONSENSUS (2) Algorithm (cont’d): Cycle l: For each word W in S For each alignment A j from cycle l-1 Create alignment (gap free) of W, A j Keep the C l best alignments A 1, …, A cl

CONSENSUS (3) C 1, …, C n are user-defined heuristic constants Running time: O(N 2 ) + O(N C 1 ) + O(N C 2 ) + … + O(N C n ) = O( N 2 + NC total ) Where C total =  i C i, typically O(nC), where C is a big constant

MULTIPROFILER Extended sample-driven approach Given a K-long word W, define: N a (W) = words W’ in S s.t. d(W,W’)  a Idea: Assume W is occurrence of true motif W * Will use N a (W) to correct “errors” in W

MULTIPROFILER (2) Assume W differs from true motif W * in at most L positions Define: A wordlet G of W is a L-long pattern with blanks, differing from W Example: K = 7; L = 3 W = ACGTTGA G = --A--CG

MULTIPROFILER (2) Algorithm: For each W in S: For L = 1 to L max Find all “strong” L-long wordlets G in N a (W) Modify W by the wordlet G -> W’ Compute d(W’, S) Report W * = argmin d(W’, S) Step 1: Smaller motif-finding problem; Use exhaustive search

Expectation Maximization in Motif Finding

Expectation Maximization (1) The MM algorithm, part of MEME package uses Expectation Maximization Algorithm (sketch): 1.Given genomic sequences find all K-long words 2.Assume each word is motif or background 3.Find likeliest motif & background models, and classification of words

Expectation Maximization (2) Given sequences x 1, …, x N, Find all k-long words X 1,…, X n Define motif model: M = (M 1,…, M K ) M i = (M i1,…, M i4 )(assume {A, C, G, T}) where M ij = Prob[ motif position i is letter j ] Define background model: B = B 1, …, B 4 B i = Prob[ letter j in background sequence ]

Expectation Maximization (3) Define Z i0 = { 1, if X i is motif; 0, otherwise } Z i1 = { 0, if X i is motif; 1, otherwise } Given a word X i = a[1]…a[K], P[ X i, Z i0 =1 ] = M 1a[1] …M ka[K] P[ X i, Z i1 =1 ] = (1 - ) B a[1] …B a[K]

Expectation Maximization (4) Define: Parameter space  = (M,B) Objective: Maximize log likelihood of model:

Expectation Maximization (5) Maximize expected likelihood, in iteration of two steps: Expectation: Find expected value of log likelihood: Maximization: Maximize expected value over ,

E-step Expectation Maximization (6): E-step Expectation: Find expected value of log likelihood: where expected values of Z can be computed as follows:

M-step Expectation Maximization (7): M-step Maximization: Maximize expected value over  and independently For, this is easy:

M-step Expectation Maximization (8): M-step For  = (M, B), define c jk = E[ # times letter k appears in motif position j] c 0k = E[ # times letter k appears in background] It easily follows: to not allow any 0’s, add pseudocounts

Initial Parameters Matter! Consider the following “artificial” example: x 1, …, x N contain: –2 K patterns A…A, A…AT,……, T…T –2 K patterns C…C, C…CG,……, G…G –D << 2 K occurrences of K-mer ACTG…ACTG Some local maxima:  ½; B = ½C, ½G; M i = ½A, ½T, i = 1,…, K  D/2 k+1 ; B = ¼A,¼C,¼G,¼T; M 1 = 100% A, M 2 = 100% C, M 3 = 100% T, etc.

Overview of EM Algorithm 1.Initialize parameters  = (M, B), : –Try different values of from N -1/2 upto 1/(2K) 2.Repeat: a.Expectation b.Maximization 3.Until change in  = (M, B), falls below  4.Report results for several “good”

Conclusion One iteration running time: O(NK) –Usually need < N iterations for convergence, and < N starting points. –Overall complexity: unclear – typically O(N 2 K) - O(N 3 K) EM is a local optimization method Initial parameters matter MEME: Bailey and Elkan, ISMB 1994.

Gibbs Sampling in Motif Finding

Gibbs Sampling (1) Given: –x 1, …, x N, –motif length K, –background B, Find: –Model M –Locations a 1,…, a N in x 1, …, x N Maximizing log-odds likelihood ratio:

Gibbs Sampling (2) AlignACE: first statistical motif finder BioProspector: improved version of AlignACE Algorithm (sketch): 1.Initialization: a.Select random locations in sequences x 1, …, x N b.Compute an initial model M from these locations 2.Sampling Iterations: a.Remove one sequence x i b.Recalculate model c.Pick a new location of motif in x i according to probability the location is a motif occurrence

Gibbs Sampling (3) Initialization: Select random locations a 1,…, a N in x 1, …, x N For these locations, compute M: That is, M kj is the number of occurrences of letter j in motif position k, over the total

Gibbs Sampling (4) Predictive Update: Select a sequence x = x i Remove x i, recompute model: where  j are pseudocounts to avoid 0s, and B =  j  j M

Gibbs Sampling (5) Sampling: For every K-long word x j,…,x j+k-1 in x: Q j = Prob[ word | motif ] = M(1,x j )  …  M(k,x j+k-1 ) P i = Prob[ word | background ] B(x j )  …  B(x j+k-1 ) Let Sample a random new position a i according to the probabilities A 1,…, A |x|-k+1. 0|x| Prob

Gibbs Sampling (6) Running Gibbs Sampling: 1.Initialize 2.Run until convergence 3.Repeat 1,2 several times, report common motifs

Advantages / Disadvantages Very similar to EM Advantages: Easier to implement Less dependent on initial parameters More versatile, easier to enhance with heuristics Disadvantages: More dependent on all sequences to exhibit the motif Less systematic search of initial parameter space

Gibbs Sampling vs. Viterbi Training Consider model as a (K+1)-state HMM: Background Pos 1Pos K …… Viterbi Training: 1.Find best  * = argmax(Prob[x,  ]) in all sequences 2.Recalculate parameters Gibbs: one sequence, sample from Prob[x,  ]

Repeats, and a Better Background Model Repeat DNA can be confused as motif –Especially low-complexity CACACA… AAAAA, etc. Solution: more elaborate background model 0 th order: B = { p A, p C, p G, p T } 1 st order: B = { P(A|A), P(A|C), …, P(T|T) } … K th order: B = { P(X | b 1 …b K ); X, b i  {A,C,G,T} } Has been applied to EM and Gibbs (up to 3 rd order)

Applications

Application 1: Motifs in Yeast Group: Tavazoie et al. 1999, G. Church’s lab, Harvard Data: Microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.) 15 time points across two cell cycles

Processing of Data 1.Selection of 3,000 genes Genes with most variable expression were selected 2.Clustering according to common expression K-means clustering 30 clusters, genes/cluster Clusters correlate well with known function 3.AlignACE motif finding 600-long upstream regions 50 regions/trial

Motifs in Periodic Clusters

Motifs in Non-periodic Clusters

Application 2: Discovery of Heat Shock Motif in C. Elegans Group: GuhaThakurta et al. 2002, C.D. Link’s lab & colleagues Data: Microarrays on 11,917 genes from C. Elegans Isolated genes upregulated in heat shock

Processing of Data, and Results Isolated 28 genes upregulated in heat shock during 5 separate experiments Motif finding with CONSENSUS and ANNSpec on 500- long upstream regions 2 motifs found: –TTCTAGAA: known heat shock factor (HSF) –GGGTGTC: previously unreported Conserved in comparison with C. Briggsae Validation by in vitro mutagenesis of a GFP reporter

Phylogenetic Footprinting (Slides by Martin Tompa)

Phylogenetic Footprinting (Tagle et al. 1988) Functional sequences evolve slower than nonfunctional ones Consider a set of orthologous sequences from different species Identify unusually well conserved regions

Substring Parsimony Problem Given: phylogenetic tree T, set of orthologous sequences at leaves of T, length k of motif threshold d Problem: Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d. This problem is NP-hard.

Small Example AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACGT... (Rabbit) GAACGGAGTACGT... (Mouse) TCGTGACGGTGAT... (Rat) Size of motif sought: k = 4

Solution Parsimony score: 1 mutation AGTCGTACGTGAC... AGTAGACGTGCCG... ACGTGAGATACGT... GAACGGAGTACGT... TCGTGACGGTGAT... ACGG ACGT

CLUSTALW multiple sequence alignment (rbcS gene) CottonACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA AGGCTTTACCATT PeaGTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA AGG--TTAGCACA TobaccoTAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA ATGGCTTAGCACC Ice-plantTCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC TurnipATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA A GGAGC WheatTATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA DuckweedTCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA LarchTAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC CottonCAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A PeaC---AAAACTTTTCAATCT TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT A TobaccoAAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA Ice-plantATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA TurnipCAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT A WheatGCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC DuckweedATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT LarchTTCTCGTATAAGGCCACCA TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA CottonACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA PeaGGCAGTGGCC---AACTAC CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA TobaccoGGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG Ice-plantGGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG TurnipCACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA WheatCACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG DuckweedTTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC LarchCGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA CottonT-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC PeaTATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC TobaccoCATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA Ice-plantTCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC LarchTCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA TurnipTATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG WheatGTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC DuckweedCATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

An Exact Algorithm (generalizing Sankoff and Rousseau 1975) W u [s] =best parsimony score for subtree rooted at node u, if u is labeled with string s. AGTCGTACGTG ACGGGACGTGC ACGTGAGATAC GAACGGAGTAC TCGTGACGGTG … ACGG: 2 ACGT: 1... … ACGG : 0 ACGT : 2... … ACGG : 1 ACGT : 1... … ACGG: +  ACGT: 0... … ACGG: 1 ACGT: k entries … ACGG: 0 ACGT: + ... … ACGG:  ACGT :0...

W u [s] =  min ( W v [t] + d(s, t) ) v : child t of u Recurrence

O(k  4 2k ) time per node W u [s] =  min ( W v [t] + d(s, t) ) v : child t of u Running Time

O(k  4 2k ) time per node Number of species Average sequence length Motif length Total time O(n k (4 2k + l )) W u [s] =  min ( W v [t] + d(s, t) ) v : child t of u Running Time

Improvements Better algorithm reduces time from O(n k (4 2k + l )) to O(n k (4 k + l )) By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k) Amenable to many useful extensions (e.g., allow insertions and deletions)

Application to  -actin Gene Gilthead sea bream (678 bp) Medaka fish (1016 bp) Common carp (696 bp) Grass carp (917 bp) Chicken (871 bp) Human (646 bp) Rabbit (636 bp) Rat (966 bp) Mouse (684 bp) Hamster (1107 bp)

Common carp ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAAC A TTGGCATGGCTT TTGTTATTTTTGGCGC TTGACTCAGG AT C T AAAAACTGGAAC G GCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGGACTCAATGTTTTTTTTTTTTTTT TTTCTTT AGTCATTCCAAAT GTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTA T GTAAATTATGT AACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAA GGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTGGGGCCA A CCTGTACACTGAC T AATTCAAATAAAAGT GCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCC CTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC Chicken ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAG A TTGGCATGGCTT TATTTGTTTTTTCTTTTGGC GC TTGACTCAGGAT T A AAAAACTGGAAT G GTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGAACGCCCCCAAAGTTCTACAATG CATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAAT AGTCATTCCAAAT ATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCA GCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTC TGTAAATTATGT AACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACAC ACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGATAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGG GGAGGGAGGGGCT A CCTGTACACTGAC T TAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGC TGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGT GATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCT GGGCTCAGTGGGACTGCAGCTGTGCT Human GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAG A TTGGCATGGCTT TATTTGTTTTTTTTGTTTTGTT TTGGTTTTTTTTTTTTTTTTGGC TTGACTCAGGAT T T AAAAACTGGAAC G GTGAAGGTGACAGCAGTCGGTTGGAGCGAGCATCCCCCAAAGTTCA CAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAAT AGTCATTCCAAAT ATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTC TCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCG TGTAAATTATGT AATGCAAAATTTTTTTAATCTTCGCCTTAATA CTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTCCCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGG AGGCAGCCAGGGCTT A CCTGTACACTGAC T TGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGT TGGGGGCAGCAGAGGGTG Parsimony score over 10 vertebrates: 0 1 2

Limits of Motif Finders Given upstream regions of coregulated genes: –Increasing length makes motif finding harder – random motifs clutter the true ones –Decreasing length makes motif finding harder – true motif missing in some sequences 0 gene ???

Limits of Motif Finders A (k,d)-motif is a k-long motif with d random differences per copy Motif Challenge problem: Find a (15,4) motif in N sequences of length L CONSENSUS, MEME, AlignACE, & most other programs fail for N = 20, L = 1000