CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays.

Presentation transcript:

Finding Regulatory Motifs

Regulatory Motif Discovery
[Figure labels: DNA; group of co-regulated genes; common subsequence]
Find motifs within groups of co-regulated genes.
Slide credits: M. Kellis

Characteristics of Regulatory Motifs
- Tiny
- Highly variable
- ~Constant size, because a constant-size transcription factor binds
- Often repeated
- Low-complexity-ish

Sequence Logos
- Height of each letter is proportional to its frequency
- Height of all letters together is proportional to the information content at that position
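The two height rules can be sketched in a few lines. This is a minimal illustration, assuming a uniform background and no small-sample correction, so the information content of a column is 2 bits minus its entropy; the function names are hypothetical and not taken from any logo tool.

```python
import math

def column_information(freqs):
    """Information content (bits) of one motif column over {A, C, G, T}.

    Assumption: uniform background, no small-sample correction.
    """
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy  # log2(4) = 2 bits is the maximum for DNA

def letter_heights(freqs):
    """Height of each letter: its frequency times the column's information."""
    info = column_information(freqs)
    return {letter: p * info for letter, p in freqs.items()}

column = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
print(column_information(column))  # total stack height in bits
print(letter_heights(column))      # per-letter heights
```

A fully conserved column gets the full 2 bits, while a uniform column has height 0 and disappears from the logo.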

Problem Definition
Given a collection of promoter sequences $s_1, \dots, s_N$ of genes with common expression:

Probabilistic motif: $M_{ij}$; $1 \le i \le W$, $1 \le j \le 4$
  $M_{ij} = \mathrm{Prob}[\text{letter } j, \text{pos } i]$
  Find the best $M$, and positions $p_1, \dots, p_N$ in the sequences

Combinatorial motif $M$: $m_1 \dots m_W$
  Some of the $m_i$'s are blank
  Find $M$ that occurs in all $s_i$ with $\le k$ differences

Discrete Approaches to Motif Finding

Discrete Formulations
Given sequences $S = \{x_1, \dots, x_n\}$
A motif $W$ is a consensus string $w_1 \dots w_K$
Find the motif $W^*$ with the "best" match to $x_1, \dots, x_n$
Definition of "best":
  $d(W, x_i)$ = minimum Hamming distance between $W$ and any word in $x_i$
  $d(W, S) = \sum_i d(W, x_i)$

Exhaustive Searches
1. Pattern-driven algorithm:
   For W = AA…A to TT…T ($4^K$ possibilities)
     Find $d(W, S)$
   Report $W^* = \arg\min_W d(W, S)$
Running time: $O(K N 4^K)$ (where $N = \sum_i |x_i|$)
Advantage: finds the provably "best" motif $W$
Disadvantage: time
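The pattern-driven search can be written directly from the definitions of $d(W, x_i)$ and $d(W, S)$. The sketch below is only an illustration with hypothetical function names; it assumes every sequence is at least K letters long and breaks ties in favor of the lexicographically first word.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(1 for x, y in zip(a, b) if x != y)

def dist_to_sequence(w, x):
    """d(W, x_i): minimum Hamming distance between w and any word in x."""
    k = len(w)
    return min(hamming(w, x[i:i + k]) for i in range(len(x) - k + 1))

def total_distance(w, seqs):
    """d(W, S): sum of d(W, x_i) over all sequences."""
    return sum(dist_to_sequence(w, x) for x in seqs)

def pattern_driven_search(seqs, k, alphabet="ACGT"):
    """Score all 4^k candidate words and return (best word, its distance)."""
    best_word, best_score = None, None
    for letters in product(alphabet, repeat=k):
        w = "".join(letters)
        score = total_distance(w, seqs)
        if best_score is None or score < best_score:
            best_word, best_score = w, score
    return best_word, best_score

print(pattern_driven_search(["TTACGTAA", "GGACGTCC", "ACGTTTTT"], 4))
```

The loop over `product(alphabet, repeat=k)` is the $4^K$ factor in the running time, which is why this approach is only practical for short motifs.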

Exhaustive Searches
2. Sample-driven algorithm:
   For W = any K-long word occurring in some $x_i$
     Find $d(W, S)$
   Report $W^* = \arg\min_W d(W, S)$, or report a local improvement of $W^*$
Running time: $O(K N^2)$
Advantage: time
Disadvantage: if the true motif is weak and does not occur in the data, then a random motif may score better than any instance of the true motif
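The sample-driven variant only changes the candidate set: instead of all $4^K$ words, it scores the K-long words that actually occur in the data. A minimal sketch with hypothetical names, again assuming every sequence has at least K letters (the optional local-improvement step is not shown):

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(1 for x, y in zip(a, b) if x != y)

def total_distance(w, seqs):
    """d(W, S): for each sequence, the best-matching word, summed over S."""
    k = len(w)
    return sum(
        min(hamming(w, x[i:i + k]) for i in range(len(x) - k + 1))
        for x in seqs
    )

def sample_driven_search(seqs, k):
    """Score only the k-long words that occur in some sequence."""
    candidates = {x[i:i + k] for x in seqs for i in range(len(x) - k + 1)}
    # Ties are broken alphabetically so the result is deterministic.
    best = min(candidates, key=lambda w: (total_distance(w, seqs), w))
    return best, total_distance(best, seqs)

print(sample_driven_search(["TTACGTAA", "GGACGTCC", "ACGTTTTT"], 4))
```

There are at most N candidate words and each is scored in O(K N) time, which gives the $O(K N^2)$ bound on the slide.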

MULTIPROFILER
Extended sample-driven approach
Given a K-long word $W$, define:
  $N_\alpha(W)$ = words $W'$ in $S$ such that $d(W, W') \le \alpha$
Idea: assume $W$ is an occurrence of the true motif $W^*$
  Will use $N_\alpha(W)$ to correct "errors" in $W$
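As a small illustration of the neighborhood definition only (not of the full MULTIPROFILER procedure), the set $N_\alpha(W)$ can be collected by scanning every K-long word in $S$. The function name is hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(1 for x, y in zip(a, b) if x != y)

def alpha_neighbors(w, seqs, alpha):
    """All words in the sequences within Hamming distance alpha of w."""
    k = len(w)
    return [
        x[i:i + k]
        for x in seqs
        for i in range(len(x) - k + 1)
        if hamming(w, x[i:i + k]) <= alpha
    ]

print(alpha_neighbors("ACGT", ["ACGTAC", "TCGTTT"], 1))
```

Note that W itself is included whenever it occurs in the data, since its distance to itself is 0.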

MULTIPROFILER
Assume $W$ differs from the true motif $W^*$ in at most L positions
Define: a wordlet $G$ of $W$ is an L-long pattern with blanks, differing from $W$ (L specified letters, each different from the letter of $W$ at that position)
  - L is smaller than the word length K
Example: K = 7; L = 3
  W = ACGTTGA
  G = --A--CG

MULTIPROFILER
Algorithm:
For each $W$ in $S$:
  For L = 1 to $L_{max}$:
    1. Find the α-neighbors of $W$ in $S$ → $N_\alpha(W)$
    2. Find all "strong" L-long wordlets $G$ in $N_\alpha(W)$
    3. For each wordlet $G$:
       a. Modify $W$ by the wordlet $G$ → $W'$
       b. Compute $d(W', S)$
Report $W^* = \arg\min d(W', S)$
Step 2 above is a smaller motif-finding problem; use exhaustive search

Expectation Maximization in Motif Finding

Expectation Maximization
[Figure labels: all K-long words; motif; background]
Algorithm (sketch):
1. Given genomic sequences, find all K-long words
2. Assume each word is motif or background
3. Find the likeliest
   - motif model
   - background model
   - classification of words into either motif or background

Expectation Maximization
Given sequences $x_1, \dots, x_N$, find all K-long words $X_1, \dots, X_n$
Define the motif model:
  $M = (M_1, \dots, M_K)$
  $M_i = (M_{i1}, \dots, M_{i4})$ (assume {A, C, G, T})
  where $M_{ij} = \mathrm{Prob}[\text{letter } j \text{ occurs in motif position } i]$
Define the background model:
  $B = (B_1, \dots, B_4)$
  $B_j = \mathrm{Prob}[\text{letter } j \text{ in background sequence}]$
[Figure labels: motif columns $M_1 \dots M_K$ and background $B$, each a distribution over A, C, G, T]

Expectation Maximization
Define
  $Z_{i1}$ = 1 if $X_i$ is motif; 0 otherwise
  $Z_{i2}$ = 0 if $X_i$ is motif; 1 otherwise
Given a word $X_i = x[s] \dots x[s+K-1]$,
  $P[X_i, Z_{i1} = 1] = \lambda \, M_{1,x[s]} \cdots M_{K,x[s+K-1]}$
  $P[X_i, Z_{i2} = 1] = (1 - \lambda) \, B_{x[s]} \cdots B_{x[s+K-1]}$
Let $\lambda_1 = \lambda$; $\lambda_2 = (1 - \lambda)$

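Expectation Maximization
Define:
  Parameter space $\theta = (M, B)$
  $\theta_1$: motif; $\theta_2$: background
Objective: maximize the log likelihood of the model:
  $$\log P(X_1, \dots, X_n, Z \mid \theta, \lambda) = \sum_{i=1}^{n} \sum_{j=1}^{2} Z_{ij} \log\big(\lambda_j \, P(X_i \mid \theta_j)\big)$$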

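Expectation Maximization
Maximize the expected likelihood, iterating two steps:
Expectation: find the expected value of the log likelihood:
  $$E[\log P(X_1, \dots, X_n, Z \mid \theta, \lambda)] = \sum_{i=1}^{n} \sum_{j=1}^{2} E[Z_{ij}] \log\big(\lambda_j \, P(X_i \mid \theta_j)\big)$$
Maximization: maximize the expected value over $\theta$, $\lambda$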

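Expectation Maximization: E-step
Expectation: find the expected value of the log likelihood:
  $$E[\log P(X_1, \dots, X_n, Z \mid \theta, \lambda)] = \sum_{i=1}^{n} \sum_{j=1}^{2} E[Z_{ij}] \log\big(\lambda_j \, P(X_i \mid \theta_j)\big)$$
where the expected values of $Z$ can be computed as follows:
  $$Z^*_{ij} = E[Z_{ij}] = \frac{\lambda_j \, P(X_i \mid \theta_j)}{\lambda_1 \, P(X_i \mid \theta_1) + \lambda_2 \, P(X_i \mid \theta_2)}$$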

Expectation Maximization: M-step
Maximization: maximize the expected value over $\theta$ and $\lambda$ independently
For $\lambda$, this has the following solution (we won't prove it):
  $$\lambda_j^{NEW} = \frac{1}{n} \sum_{i=1}^{n} Z^*_{ij}$$
Effectively, $\lambda^{NEW}$ is the expected # of motifs per position, given our current parameters

Expectation Maximization: M-step
For $\theta = (M, B)$, define
  $c_{jk}$ = E[# times letter k appears in motif position j]
  $c_{0k}$ = E[# times letter k appears in background]
The $c_{jk}$ values are calculated easily from the $Z^*$ values
It then follows:
  $$M_{jk}^{NEW} = \frac{c_{jk}}{\sum_{k'=1}^{4} c_{jk'}}, \qquad B_{k}^{NEW} = \frac{c_{0k}}{\sum_{k'=1}^{4} c_{0k'}}$$
To not allow any 0's, add pseudocounts

Initial Parameters Matter!
Consider the following artificial example:
6-mers $X_1, \dots, X_n$ (n = 2000):
  - 990 words "AAAAAA"
  - 990 words "CCCCCC"
  - 20 words "ACACAC"
Some local maxima:
  $\lambda$ = 49.5%; B = 100/101 C, 1/101 A; M = 100% AAAAAA
  $\lambda$ = 1%; B = 50% C, 50% A; M = 100% ACACAC

Overview of EM Algorithm
1. Initialize parameters $\theta = (M, B)$, $\lambda$:
   - Try different values of $\lambda$ from $N^{-1/2}$ up to $1/(2K)$
2. Repeat:
   a. Expectation
   b. Maximization
3. Until the change in $\theta = (M, B)$, $\lambda$ falls below $\epsilon$
4. Report results for several "good" $\lambda$
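The loop above can be sketched for the two-component word model as follows. This is a minimal illustration rather than the lecture's or any tool's implementation: the function name, the pseudocount value, the tolerance, and the dictionary representation of $M$ and $B$ are all assumptions, and the initial $M$ and $B$ must assign nonzero probability to every letter.

```python
import math

ALPHABET = "ACGT"

def em_motif(words, M, B, lam, tol=1e-6, max_iter=200, pseudo=1e-3):
    """Fit the motif/background mixture over K-long words.

    M is a list of K dicts (letter -> probability), B is one such dict,
    and lam is the prior probability that a word is a motif occurrence.
    """
    K, n = len(M), len(words)
    for _ in range(max_iter):
        # E-step: posterior probability that each word is a motif occurrence.
        z = []
        for w in words:
            p_motif = lam * math.prod(M[i][c] for i, c in enumerate(w))
            p_back = (1.0 - lam) * math.prod(B[c] for c in w)
            z.append(p_motif / (p_motif + p_back))
        # M-step: expected letter counts, each started at the pseudocount.
        c_motif = [{a: pseudo for a in ALPHABET} for _ in range(K)]
        c_back = {a: pseudo for a in ALPHABET}
        for w, zi in zip(words, z):
            for i, c in enumerate(w):
                c_motif[i][c] += zi
                c_back[c] += 1.0 - zi
        new_lam = sum(z) / n
        new_M = [{a: col[a] / sum(col.values()) for a in ALPHABET}
                 for col in c_motif]
        total = sum(c_back.values())
        new_B = {a: c_back[a] / total for a in ALPHABET}
        # Stop once no parameter moved by more than tol.
        delta = max(
            [abs(new_lam - lam)]
            + [abs(new_M[i][a] - M[i][a]) for i in range(K) for a in ALPHABET]
            + [abs(new_B[a] - B[a]) for a in ALPHABET]
        )
        M, B, lam = new_M, new_B, new_lam
        if delta < tol:
            break
    return M, B, lam

words = ["AAAAAA"] * 990 + ["CCCCCC"] * 990 + ["ACACAC"] * 20
M0 = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1} for _ in range(6)]
B0 = {a: 0.25 for a in ALPHABET}
M, B, lam = em_motif(words, M0, B0, 0.3)
print("".join(max(col, key=col.get) for col in M), round(lam, 3))
```

Run on the artificial 6-mer data with an A-rich starting motif, the fit settles on an all-A motif; a different starting point can end elsewhere, which is why several initializations are tried.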

Gibbs Sampling in Motif Finding

Gibbs Sampling
Given:
  - $x_1, \dots, x_N$,
  - motif length K,
  - background B
Find:
  - Model M
  - Locations $a_1, \dots, a_N$ in $x_1, \dots, x_N$
maximizing the log-odds likelihood ratio:
  $$\sum_{i=1}^{N} \sum_{k=1}^{K} \log \frac{M(k, x^i_{a_i + k - 1})}{B(x^i_{a_i + k - 1})}$$

Gibbs Sampling
AlignACE: first statistical motif finder
BioProspector: improved version of AlignACE
Algorithm (sketch):
1. Initialization:
   a. Select random locations in sequences $x_1, \dots, x_N$
   b. Compute an initial model M from these locations
2. Sampling iterations:
   a. Remove one sequence $x_i$
   b. Recalculate the model
   c. Pick a new location of the motif in $x_i$ according to the probability that the location is a motif occurrence

Gibbs Sampling
Initialization:
Select random locations $a_1, \dots, a_N$ in $x_1, \dots, x_N$
For these locations, compute M:
  $$M_{kj} = \frac{1}{N + B} \Big( \beta_j + \sum_{i=1}^{N} \mathbf{1}\big[x^i_{a_i + k - 1} = j\big] \Big)$$
where $\beta_j$ are pseudocounts to avoid 0s, and $B = \sum_j \beta_j$
That is, $M_{kj}$ is the number of occurrences of letter j in motif position k, over the total

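Gibbs Sampling
Predictive update:
Select a sequence $x = x_i$
Remove $x_i$ and recompute the model:
  $$M_{kj} = \frac{1}{N - 1 + B} \Big( \beta_j + \sum_{s \ne i} \mathbf{1}\big[x^s_{a_s + k - 1} = j\big] \Big)$$
where $\beta_j$ are pseudocounts to avoid 0s, and $B = \sum_j \beta_j$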

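Gibbs Sampling
Sampling:
For every K-long word $x_j, \dots, x_{j+K-1}$ in $x$:
  $Q_j = \mathrm{Prob}[\text{word} \mid \text{motif}] = M(1, x_j) \times \dots \times M(K, x_{j+K-1})$
  $P_j = \mathrm{Prob}[\text{word} \mid \text{background}] = B(x_j) \times \dots \times B(x_{j+K-1})$
Let
  $$A_j = \frac{Q_j / P_j}{\sum_{j'=1}^{|x|-K+1} Q_{j'} / P_{j'}}$$
Sample a random new position $a_i$ according to the probabilities $A_1, \dots, A_{|x|-K+1}$
[Figure: probability plotted against position, from 0 to |x|]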

Gibbs Sampling
Running Gibbs sampling:
1. Initialize
2. Run until convergence
3. Repeat 1, 2 several times; report common motifs
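One run of the sampler (initialize, then repeat the predictive update and sampling steps) might look like the sketch below. It is a simplified illustration, not AlignACE or BioProspector: the function names, the pseudocount of 0.5 per letter, the fixed iteration count used in place of a convergence test, and the zeroth-order background dictionary are all assumptions.

```python
import random

ALPHABET = "ACGT"

def build_model(seqs, starts, k, skip=None, beta=0.5):
    """Motif matrix from the current sites, leaving out sequence `skip`."""
    counts = [{a: beta for a in ALPHABET} for _ in range(k)]
    for idx, (x, a) in enumerate(zip(seqs, starts)):
        if idx == skip:
            continue
        for pos in range(k):
            counts[pos][x[a + pos]] += 1
    # Each column is normalized by (number of sites used) + (sum of pseudocounts).
    return [{a: col[a] / sum(col.values()) for a in ALPHABET} for col in counts]

def gibbs_motif(seqs, k, background, iters=2000, seed=0):
    """Return (start positions, motif matrix) after a fixed number of updates."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(x) - k + 1) for x in seqs]
    for _ in range(iters):
        i = rng.randrange(len(seqs))              # sequence to resample
        M = build_model(seqs, starts, k, skip=i)  # predictive update
        x = seqs[i]
        weights = []
        for j in range(len(x) - k + 1):
            q = p = 1.0
            for pos in range(k):
                q *= M[pos][x[j + pos]]           # Prob[word | motif]
                p *= background[x[j + pos]]       # Prob[word | background]
            weights.append(q / p)
        # random.choices normalizes the weights, giving the probabilities A_j.
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts, build_model(seqs, starts, k)

seqs = ["TTTTACGTACTTTT", "GGACGTACGGGGGG", "CCCCCCACGTACCC", "ACGTACAAAAAAAA"]
background = {a: 0.25 for a in ALPHABET}
starts, M = gibbs_motif(seqs, 6, background, iters=500, seed=1)
print(starts, "".join(max(col, key=col.get) for col in M))
```

Because the outcome depends on the random seed, the whole run is normally repeated several times and the motifs that recur are reported.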

Advantages / Disadvantages
Very similar to EM
Advantages:
  - Easier to implement
  - Less dependent on initial parameters
  - More versatile, easier to enhance with heuristics
Disadvantages:
  - More dependent on all sequences exhibiting the motif
  - Less systematic search of the initial parameter space

Repeats, and a Better Background Model
Repeat DNA can be confused as motif
  - Especially low-complexity CACACA…, AAAAA, etc.
Solution: a more elaborate background model
  0th order: $B = \{ p_A, p_C, p_G, p_T \}$
  1st order: $B = \{ P(A|A), P(A|C), \dots, P(T|T) \}$
  …
  Kth order: $B = \{ P(X \mid b_1 \dots b_K);\ X, b_i \in \{A, C, G, T\} \}$
Has been applied to EM and Gibbs (up to 3rd order)
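A higher-order background can be estimated by counting which letter follows each context of the chosen length. The sketch below is one plain way to do that; the function name and the add-one pseudocount default are assumptions, and contexts never seen in the training sequences are simply absent from the result.

```python
from collections import defaultdict

def markov_background(seqs, order, alphabet="ACGT", pseudo=1.0):
    """Estimate P(X | previous `order` letters) from the given sequences.

    Returns a dict mapping each observed context string to a dict of
    next-letter probabilities. With order=0 the only context is "".
    """
    counts = defaultdict(lambda: {a: pseudo for a in alphabet})
    for x in seqs:
        for i in range(order, len(x)):
            counts[x[i - order:i]][x[i]] += 1
    return {
        ctx: {a: c[a] / sum(c.values()) for a in alphabet}
        for ctx, c in counts.items()
    }

model = markov_background(["CACACACACA", "AAAAAAAA"], order=1)
print(model["C"])  # after C, the letter A dominates in this toy data
```

Under a first-order model trained on repeat-rich sequence, a word such as CACACA becomes likely under the background, so it no longer stands out as a motif candidate.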

Limits of Motif Finders
Given upstream regions of coregulated genes:
  - Increasing length makes motif finding harder: random motifs clutter the true ones
  - Decreasing length makes motif finding harder: the true motif is missing in some sequences
Motif Challenge problem:
  Find a (15,4) motif in N sequences of a given length
[Figure: upstream region of unknown length ("???") ending at position 0, the gene]

Example Application: Motifs in Yeast
Group: Tavazoie et al. 1999, G. Church's lab, Harvard
Data:
  - Microarrays on 6,220 mRNAs from yeast, Affymetrix chips (Cho et al.)
  - 15 time points across two cell cycles
1. Clustering genes according to common expression
   - K-means clustering -> 30 clusters, genes/cluster
   - Clusters correlate well with known function
2. AlignACE motif finding
   - 600-long upstream regions

Motifs in Periodic Clusters

Motifs in Non-periodic Clusters

Motifs are preferentially conserved across evolution
[Figure: multiple alignment of the region between GAL10 and GAL1 in four yeast species (Scer, Spar, Smik, Sbay), with conserved columns starred; the GAL4, MIG1, and TBP sites are marked as factor footprints and fall within conservation islands]
Is this enough to discover motifs? No.
Slide credits: M. Kellis

Comparison-based Regulatory Motif Discovery
- Study known motifs
- Derive conservation rules
- Discover novel motifs
Slide credits: M. Kellis

Known motifs are frequently conserved
Across the human promoter regions, the Errα motif:
  - appears 434 times
  - is conserved 162 times
  Conservation rate: 37%
[Figure: Errα motif instance aligned across human, dog, mouse, and rat]
Compare to random control motifs:
  - Conservation rate of control motifs: 6.8%
  - Errα enrichment: 5.4-fold
  - Errα p-value < (25 standard deviations under binomial)
Motif Conservation Score (MCS)
Slide credits: M. Kellis
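The "25 standard deviations under binomial" figure can be checked from the counts on the slide. The sketch below assumes it is the usual normal-approximation z-score, with the control conservation rate as the binomial success probability; the function name is hypothetical, and this is not claimed to be the exact MCS definition.

```python
import math

def conservation_zscore(total, conserved, control_rate):
    """Standard deviations by which the conserved count exceeds the
    binomial expectation at the control motifs' conservation rate."""
    expected = total * control_rate
    sd = math.sqrt(total * control_rate * (1.0 - control_rate))
    return (conserved - expected) / sd

total, conserved, control_rate = 434, 162, 0.068
rate = conserved / total
print(f"conservation rate: {rate:.1%}")
print(f"enrichment: {rate / control_rate:.1f}-fold")
print(f"z-score: {conservation_zscore(total, conserved, control_rate):.1f}")
```

With 434 instances and a 6.8% control rate, roughly 30 conserved instances are expected by chance, so 162 observed lies about 25 standard deviations above that.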

Finding conserved motifs in whole genomes
M. Kellis PhD Thesis on yeasts, X. Xie & M. Kellis on mammals
1. Define seed "mini-motifs"
2. Filter and isolate mini-motifs that are more conserved than average
3. Extend mini-motifs to full motifs
4. Validate against known databases of motifs & annotations
5. Report novel motifs
[Figure labels: CTACGA; N]
Slide credits: M. Kellis

Test 1: Intergenic conservation
[Figure: axes "Total count" and "Conserved count"; CGG-11-CCG labeled]
Slide credits: M. Kellis

Test 2: Intergenic vs. Coding
[Figure: axes "Coding Conservation" and "Intergenic Conservation"; CGG-11-CCG labeled; region labeled "Higher Conservation in Genes"]
Slide credits: M. Kellis

Test 3: Upstream vs. Downstream
[Figure: axes "Downstream Conservation" and "Upstream Conservation"; CGG-11-CCG labeled; regions labeled "Most Patterns" and "Downstream motifs?"]
Slide credits: M. Kellis

Constructing full motifs
2,000 mini-motifs → 72 full motifs
[Figure: pipeline through Test 1, Test 2, Test 3, then Extend, Collapse, and Merge steps, illustrated with the patterns CTA-CGA, CTGRC, CGAA, ACCTGCGAA, CTGRCCGAA, and CTRAYCGAA]
Slide credits: M. Kellis

Summary for promoter motifs

Rank  Discovered motif  Known TF motif  Tissue enrichment / distance bias
1     RCGCAnGCGY        NRF-1           Yes
2     CACGTG            MYC             Yes
3     SCGGAAGY          ELK-1           Yes
4     ACTAYRnnnCCCR     -               Yes
5     GATTGGY           NF-Y            Yes
6     GGGCGGR           SP1             Yes
7     TGAnTCA           AP-1            Yes
8     TMTCGCGAnR        -               Yes
9     TGAYRTCA          ATF3            Yes
10    GCCATnTTG         YY1             Yes
11    MGGAAGTG          GABP            Yes
12    CAGGTG            E12             Yes
13    CTTTGT            LEF1            Yes
14    TGACGTCA          ATF3            Yes
15    CAGCTG            AP-4            Yes
16    RYTTCCTG          C-ETS-2         Yes
17    AACTTT            IRF1(*)         Yes
18    TCAnnTGAY         SREBP-1         Yes
19    GKCGCn(7)TGAYG    -               Yes
20    GTGACGY           E4F1            Yes
21    GGAAnCGGAAnY      -               Yes
22    TGCGCAnK          -               Yes
23    TAATTA            CHX10           Yes
24    GGGAGGRR          MAZ             Yes
25    TGACCTY           ERRA            Yes

Motifs with no known TF match (-) are new.

174 promoter motifs:
  - 70 match known TF motifs
  - 115 show expression enrichment
  - 60 show positional bias
  → 75% have evidence
Control sequences:
  - < 2% match known TF motifs
  - < 5% show expression enrichment
  - < 3% show positional bias
  → < 7% false positives
Most discovered motifs are likely to be functional
Slide credits: M. Kellis