Download presentation
Presentation is loading. Please wait.
1
CS262 Lecture 18, Win07, Batzoglou Sequence Logos Height of each letter proportional to its frequency Height of all letters proportional to information content at that position
2
CS262 Lecture 18, Win07, Batzoglou Problem Definition Probabilistic Motif: M ij ; 1 i W 1 j 4 M ij = Prob[ letter j, pos i ] Find best M, and positions p 1,…, p N in sequences Combinatorial Motif M: m 1 …m W Some of the m i ’s blank Find M that occurs in all s i with k differences Given a collection of promoter sequences s 1,…, s N of genes with common expression
3
CS262 Lecture 18, Win07, Batzoglou Discrete Approaches to Motif Finding
4
CS262 Lecture 18, Win07, Batzoglou Discrete Formulations Given sequences S = {x 1, …, x n } A motif W is a consensus string w 1 …w K Find motif W * with “best” match to x 1, …, x n Definition of “best”: d(W, x i ) = min hamming dist. between W and any word in x i d(W, S) = i d(W, x i )
5
CS262 Lecture 18, Win07, Batzoglou Exhaustive Searches 1. Pattern-driven algorithm: For W = AA…A to TT…T (4 K possibilities) Find d( W, S ) Report W* = argmin( d(W, S) ) Running time: O( K N 4 K ) (where N = i |x i |) Advantage: Finds provably “best” motif W Disadvantage: Time
6
CS262 Lecture 18, Win07, Batzoglou Exhaustive Searches 2. Sample-driven algorithm: For W = any K-long word occurring in some x i Find d( W, S ) Report W* = argmin( d( W, S ) ) or, Report a local improvement of W * Running time: O( K N 2 ) Advantage: Time Disadvantage:If the true motif is weak and does not occur in data then a random motif may score better than any instance of true motif
7
CS262 Lecture 18, Win07, Batzoglou MULTIPROFILER Extended sample-driven approach Given a K-long word W, define: N α (W) = words W’ in S s.t. d(W,W’) α Idea: Assume W is occurrence of true motif W * Will use N α (W) to correct “errors” in W
8
CS262 Lecture 18, Win07, Batzoglou Example AGATCGCATCGTAGCTGCATGTAGCTGC GCATGCTACTAGTCGTTTTACTACATAG ACGTGACTAGATAGCATGAGCGATCGTA AGCTAGCTAGCGATCACTGTGCGTGTCA ATGCTACTGTCTGTGCTGTGATTATCGA GGAGTCCACTAGCATCGATGGGCATCGG TATGCATCGATAACGCTGCGATTATGCG TAGAGTGCAGTATGATGCTGCTAGCGTG CTATCTACTATCGAGATTTTATCGTCCG TATTTTACAAGCATGCTGCTTTTAGCAG GGCGGCATCATCTGAGGACGGGATTACT TTTGTGGTCTGTACTATGATCCCAATGC W* = GATTACA W = GATCGCA d = 3 d = 4 d = 2 d = 4 d = 3 d = 2 d = 3 d = 4 d = 3 d = 1 Let = 3 N (W): √ √ √ √ √ √ √ TACTACA GATCACT GATAACG GAGTGCA GATTTTA GATTACT GATCCCA
9
CS262 Lecture 18, Win07, Batzoglou MULTIPROFILER Assume W differs from true motif W * in at most L positions Define: A wordlet G of W is a L-long pattern with blanks, differing from W L is smaller than the word length K Example: K = 7; L = 3 W = ACGTTGA G = --A--CG
10
CS262 Lecture 18, Win07, Batzoglou MULTIPROFILER Algorithm: For each W in S: For L = 1 to L max 1.Find the α- neighbors of W in S N α (W) 2.Find all “strong” L-long wordlets G in N a (W) 3.For each wordlet G, 1.Modify W by the wordlet G W’ 2.Compute d(W’, S) Report W * = argmin d(W’, S) Step 2 above: Smaller motif-finding problem; Use exhaustive search
11
CS262 Lecture 18, Win07, Batzoglou Example AGATCGCATCGTAGCTGCATGTAGCTGC GCATGCTACTAGTCGTTTTACTACATAG ACGTGACTAGATAGCATGAGCGATCGTA AGCTAGCTAGCGATCACTGTGCGTGTCA ATGCTACTGTCTGTGCTGTGATTATCGA GGAGTCCACTAGCATCGATGGGCATCGG TATGCATCGATAACGCTGCGATTATGCG TAGAGTGCAGTATGATGCTGCTAGCGTG CTATCTACTATCGAGATTTTATCGTCCG TATTTTACAAGCATGCTGCTTTTAGCAG GGCGGCATCATCTGAGGACGGGATTACT TTTGTGGTCTGTACTATGATCCCAATGC W* = GATTACA W = GATCGCA N (W): TACTACA d = 0 GATCACT d = 1 GATAACG d = 1 GAGTGCA d = 1 GATTTTA d = 1 GATTACT d = 0 GATCCCA d = 2 L = ---TA--
12
CS262 Lecture 18, Win07, Batzoglou Example AGATCGCATCGTAGCTGCATGTAGCTGC GCATGCTACTAGTCGTTTTACTACATAG ACGTGACTAGATAGCATGAGCGATCGTA AGCTAGCTAGCGATCACTGTGCGTGTCA ATGCTACTGTCTGTGCTGTGATTATCGA GGAGTCCACTAGCATCGATGGGCATCGG TATGCATCGATAACGCTGCGATTATGCG TAGAGTGCAGTATGATGCTGCTAGCGTG CTATCTACTATCGAGATTTTATCGTCCG TATTTTACAAGCATGCTGCTTTTAGCAG GGCGGCATCATCTGAGGACGGGATTACT TTTGTGGTCTGTACTATGATCCCAATGC W* = GATTACA W = GATCGCA N (W): TACTACA d = 0 GATCACT d = 1 GATAACG d = 1 GAGTGCA d = 1 GATTTTA d = 1 GATTACT d = 0 GATCCCA d = 2 L = ---TA--
13
CS262 Lecture 18, Win07, Batzoglou Expectation Maximization in Motif Finding
14
CS262 Lecture 18, Win07, Batzoglou All K-long words motif background Expectation Maximization Algorithm (sketch): 1.Given genomic sequences list all k-long words 2.Assume each word is motif or background 3.Find likeliest Motif Model Background Model classification of words into either Motif or Background S: A bag of k-long words
15
CS262 Lecture 18, Win07, Batzoglou Expectation Maximization Given sequences x 1, …, x N, List all k-long words X 1,…, X n Define motif model: M = (M 1,…, M K ) M i = (M i1,…, M i4 ) (assume {A, C, G, T}) where M ij = Prob[ letter j occurs in motif position i ] Define background model: B = B 1, …, B 4 B i = Prob[ letter j in background sequence ] motif background ACGTACGT M1M1 MKMK M1M1 B 1 –
16
CS262 Lecture 18, Win07, Batzoglou Expectation Maximization Define Z i1 = { 1, if X i is motif; 0, otherwise } Z i2 = { 0, if X i is motif; 1, otherwise } Given a word X i = x[s]…x[s+k], P[ X i, Z i1 =1 ] = M 1x[s] …M kx[s+k] P[ X i, Z i2 =1 ] = (1 – ) B x[s] …B x[s+k] Let 1 = ; 2 = (1 – ) motif background ACGTACGT M1M1 MKMK M1M1 B 1 –
17
CS262 Lecture 18, Win07, Batzoglou Expectation Maximization Define: Parameter space = (M, B) 1 : Motif; 2 : Background Objective: Maximize log likelihood of model: ACGTACGT M1M1 MKMK M1M1 B 1 –
18
CS262 Lecture 18, Win07, Batzoglou Expectation Maximization Maximize expected likelihood, in iteration of two steps: Expectation: Find expected value of log likelihood: Maximization: Maximize expected value over ,
19
CS262 Lecture 18, Win07, Batzoglou Expectation: Find expected value of log likelihood: where expected values of Z can be computed as follows: Expectation Maximization: E-step
20
CS262 Lecture 18, Win07, Batzoglou Expectation Maximization: M-step Maximization: Maximize expected value over and independently For, this has the following solution: (we won’t prove it) Effectively, NEW is the expected # of motifs per position, given our current parameters
21
CS262 Lecture 18, Win07, Batzoglou For = (M, B), define c jk = E[ # times letter k appears in motif position j] c 0k = E[ # times letter k appears in background] c ij values are calculated easily from Z* values It then follows: to not allow any 0’s, add pseudocounts Expectation Maximization: M-step
22
CS262 Lecture 18, Win07, Batzoglou Initial Parameters Matter! Consider the following artificial example: 6-mers X 1, …, X n :(n = 2000) 990 words “AAAAAA” 990 words “CCCCCC” 20 words “ACACAC” Some local maxima: = 49.5%; B = 100/101 C, 1/101 A M = 100% AAAAAA = 1%; B = 50% C, 50% A M = 100% ACACAC
23
CS262 Lecture 18, Win07, Batzoglou Overview of EM Algorithm 1.Initialize parameters = (M, B), : Try different values of from N -1/2 up to 1/(2K) 2.Repeat: a.Expectation b.Maximization 3.Until change in = (M, B), falls below 4.Report results for several “good”
24
CS262 Lecture 18, Win07, Batzoglou Gibbs Sampling in Motif Finding
25
CS262 Lecture 18, Win07, Batzoglou Gibbs Sampling Given: Sequences x 1, …, x N, motif length K, background B, Find: Model M Locations a 1,…, a N in x 1, …, x N Maximizing log-odds likelihood ratio:
26
CS262 Lecture 18, Win07, Batzoglou Gibbs Sampling AlignACE: first statistical motif finder BioProspector: improved version of AlignACE Algorithm (sketch): 1.Initialization: a.Select random locations in sequences x 1, …, x N b.Compute an initial model M from these locations 2.Sampling Iterations: a.Remove one sequence x i b.Recalculate model c.Pick a new location of motif in x i according to probability the location is a motif occurrence
27
CS262 Lecture 18, Win07, Batzoglou Gibbs Sampling Initialization: Select random locations 1,…, N in x 1, …, x N For these locations, compute M: where j are pseudocounts to avoid 0s, and B = j j That is, M kj is the number of occurrences of letter j in motif position k, over the total
28
CS262 Lecture 18, Win07, Batzoglou Gibbs Sampling Predictive Update: Select a sequence x = x i Remove x i, recompute model: where j are pseudocounts to avoid 0s, and B = j j M
29
CS262 Lecture 18, Win07, Batzoglou Gibbs Sampling Sampling: For every K-long word x j,…,x j+k-1 in x: Q j = Prob[ word | motif ] = M(1,x j ) … M(k,x j+k-1 ) P i = Prob[ word | background ] B(x j ) … B(x j+k-1 ) Let Sample a random new position a i according to the probabilities A 1,…, A |x|-k+1. 0|x| Prob
30
CS262 Lecture 18, Win07, Batzoglou Gibbs Sampling Running Gibbs Sampling: 1.Initialize 2.Run until convergence 3.Repeat 1,2 several times, report common motifs
31
CS262 Lecture 18, Win07, Batzoglou Advantages / Disadvantages Very similar to EM Advantages: Easier to implement Less dependent on initial parameters More versatile, easier to enhance with heuristics Disadvantages: More dependent on all sequences to exhibit the motif Less systematic search of initial parameter space
32
CS262 Lecture 18, Win07, Batzoglou Repeats, and a Better Background Model Repeat DNA can be confused as motif Especially low-complexity CACACA… AAAAA, etc. Solution: more elaborate background model 0 th order: B = { p A, p C, p G, p T } 1 st order: B = { P(A|A), P(A|C), …, P(T|T) } … K th order: B = { P(X | b 1 …b K ); X, b i {A,C,G,T} } Has been applied to EM and Gibbs (up to 3 rd order)
33
CS262 Lecture 18, Win07, Batzoglou Example Application: Motifs in Yeast Group: Tavazoie et al. 1999, G. Church’s lab, Harvard Data: Microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.) 15 time points across two cell cycles 1.Clustering genes according to common expression K-means clustering -> 30 clusters, 50-190 genes/cluster Clusters correlate well with known function 2.AlignACE motif finding 600-long upstream regions
34
CS262 Lecture 18, Win07, Batzoglou Motifs in Periodic Clusters
35
CS262 Lecture 18, Win07, Batzoglou Motifs in Non-periodic Clusters
36
CS262 Lecture 18, Win07, Batzoglou Motifs are preferentially conserved across evolution Scer TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACA Spar CTATGTTGATCTTTTCAGAATTTTT-CACTATATTAAGATGGGTGCAAAGAAGTGTGATTATTATATTACATCGCTTTCCTATCATACACA Smik GTATATTGAATTTTTCAGTTTTTTTTCACTATCTTCAAGGTTATGTAAAAAA-TGTCAAGATAATATTACATTTCGTTACTATCATACACA Sbay TTTTTTTGATTTCTTTAGTTTTCTTTCTTTAACTTCAAAATTATAAAAGAAAGTGTAGTCACATCATGCTATCT-GTCACTATCACATATA * * **** * * * ** ** * * ** ** ** * * * ** ** * * * ** * * * Scer TATCCATATCTAATCTTACTTATATGTTGT-GGAAAT-GTAAAGAGCCCCATTATCTTAGCCTAAAAAAACC--TTCTCTTTGGAACTTTCAGTAATACG Spar TATCCATATCTAGTCTTACTTATATGTTGT-GAGAGT-GTTGATAACCCCAGTATCTTAACCCAAGAAAGCC--TT-TCTATGAAACTTGAACTG-TACG Smik TACCGATGTCTAGTCTTACTTATATGTTAC-GGGAATTGTTGGTAATCCCAGTCTCCCAGATCAAAAAAGGT--CTTTCTATGGAGCTTTG-CTA-TATG Sbay TAGATATTTCTGATCTTTCTTATATATTATAGAGAGATGCCAATAAACGTGCTACCTCGAACAAAAGAAGGGGATTTTCTGTAGGGCTTTCCCTATTTTG ** ** *** **** ******* ** * * * * * * * ** ** * *** * *** * * * Scer CTTAACTGCTCATTGC-----TATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCT Spar CTAAACTGCTCATTGC-----AATATTGAAGTACGGATCAGAAGCCGCCGAGCGGACGACAGCCCTCCGACGGAATATTCCCCTCCGTGCGTCGCCGTCT Smik TTTAGCTGTTCAAG--------ATATTGAAATACGGATGAGAAGCCGCCGAACGGACGACAATTCCCCGACGGAACATTCTCCTCCGCGCGGCGTCCTCT Sbay TCTTATTGTCCATTACTTCGCAATGTTGAAATACGGATCAGAAGCTGCCGACCGGATGACAGTACTCCGGCGGAAAACTGTCCTCCGTGCGAAGTCGTCT ** ** ** ***** ******* ****** ***** *** **** * *** ***** * * ****** *** * *** Scer TCACCGG-TCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAA-----TACTAGCTTTT--ATGGTTATGAA Spar TCGTCGGGTTGTGTCCCTTAA-CATCGATGTACCTCGCGCCGCCCTGCTCCGAACAATAAGGATTCTACAAGAAA-TACTTGTTTTTTTATGGTTATGAC Smik ACGTTGG-TCGCGTCCCTGAA-CATAGGTACGGCTCGCACCACCGTGGTCCGAACTATAATACTGGCATAAAGAGGTACTAATTTCT--ACGGTGATGCC Sbay GTG-CGGATCACGTCCCTGAT-TACTGAAGCGTCTCGCCCCGCCATACCCCGAACAATGCAAATGCAAGAACAAA-TGCCTGTAGTG--GCAGTTATGGT ** * ** *** * * ***** ** * * ****** ** * * ** * * ** *** Scer GAGGA-AAAATTGGCAGTAA----CCTGGCCCCACAAACCTT-CAAATTAACGAATCAAATTAACAACCATA-GGATGATAATGCGA------TTAG--T Spar AGGAACAAAATAAGCAGCCC----ACTGACCCCATATACCTTTCAAACTATTGAATCAAATTGGCCAGCATA-TGGTAATAGTACAG------TTAG--G Smik CAACGCAAAATAAACAGTCC----CCCGGCCCCACATACCTT-CAAATCGATGCGTAAAACTGGCTAGCATA-GAATTTTGGTAGCAA-AATATTAG--G Sbay GAACGTGAAATGACAATTCCTTGCCCCT-CCCCAATATACTTTGTTCCGTGTACAGCACACTGGATAGAACAATGATGGGGTTGCGGTCAAGCCTACTCG **** * * ***** *** * * * * * * * * ** Scer TTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG--ATGATTTTT-GATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCAC-----TT Spar GTTTT--TCTTATTCCTGAGACAATTCATCCGCAAAAAATAATGGTTTTT-GGTCTATTAGCAAACATATAAATGCAAAAGTTGCATAGCCAC-----TT Smik TTCTCA--CCTTTCTCTGTGATAATTCATCACCGAAATG--ATGGTTTA--GGACTATTAGCAAACATATAAATGCAAAAGTCGCAGAGATCA-----AT Sbay TTTTCCGTTTTACTTCTGTAGTGGCTCAT--GCAGAAAGTAATGGTTTTCTGTTCCTTTTGCAAACATATAAATATGAAAGTAAGATCGCCTCAATTGTA * * * *** * ** * * *** *** * * ** ** * ******** **** * Scer TAACTAATACTTTCAACATTTTCAGT--TTGTATTACTT-CTTATTCAAAT----GTCATAAAAGTATCAACA-AAAAATTGTTAATATACCTCTATACT Spar TAAATAC-ATTTGCTCCTCCAAGATT--TTTAATTTCGT-TTTGTTTTATT----GTCATGGAAATATTAACA-ACAAGTAGTTAATATACATCTATACT Smik TCATTCC-ATTCGAACCTTTGAGACTAATTATATTTAGTACTAGTTTTCTTTGGAGTTATAGAAATACCAAAA-AAAAATAGTCAGTATCTATACATACA Sbay TAGTTTTTCTTTATTCCGTTTGTACTTCTTAGATTTGTTATTTCCGGTTTTACTTTGTCTCCAATTATCAAAACATCAATAACAAGTATTCAACATTTGT * * * * * * ** *** * * * * ** ** ** * * * * * *** * Scer TTAA-CGTCAAGGA---GAAAAAACTATA Spar TTAT-CGTCAAGGAAA-GAACAAACTATA Smik TCGTTCATCAAGAA----AAAAAACTA.. Sbay TTATCCCAAAAAAACAACAACAACATATA * * ** * ** ** ** Gal10Gal1 Gal4 GAL10 GAL1 TBP GAL4 MIG1 TBP MIG1 Factor footprint Conservation island slide credits: M. Kellis Is this enough to discover motifs? No.
37
CS262 Lecture 18, Win07, Batzoglou Comparison-based Regulatory Motif Discovery Study known motifs Derive conservation rules Discover novel motifs slide credits: M. Kellis
38
CS262 Lecture 18, Win07, Batzoglou Known motifs are frequently conserved Across the human promoter regions, the Err motif: appears 434 times is conserved 162 times Human Dog Mouse Rat Err Conservation rate: 37% Compare to random control motifs –Conservation rate of control motifs: 6.8% –Err enrichment: 5.4-fold –Err p-value < 10 -50 (25 standard deviations under binomial) Motif Conservation Score (MCS) slide credits: M. Kellis
39
CS262 Lecture 18, Win07, Batzoglou Finding conserved motifs in whole genomes M. Kellis PhD Thesis on yeasts, X. Xie & M. Kellis on mammals 1.Define seed “mini-motifs” 2.Filter and isolate mini-motifs that are more conserved than average 3.Extend mini-motifs to full motifs 4.Validate against known databases of motifs & annotations 5.Report novel motifs CTACGA N slide credits: M. Kellis
40
CS262 Lecture 18, Win07, Batzoglou Test 1: Intergenic conservation Total count Conserved count CGG-11-CCG slide credits: M. Kellis
41
CS262 Lecture 18, Win07, Batzoglou Test 2: Intergenic vs. Coding Coding Conservation Intergenic Conservation CGG-11-CCG Higher Conservation in Genes slide credits: M. Kellis
42
CS262 Lecture 18, Win07, Batzoglou Test 3: Upstream vs. Downstream CGG-11-CCG Downstream motifs? Most Patterns Downstream Conservation Upstream Conservation slide credits: M. Kellis
43
CS262 Lecture 18, Win07, Batzoglou Extend Collapse Full Motifs Constructing full motifs 2,000 Mini-motifs 72 Full motifs 6 CTA CGA R R CTGRC CGAA ACCTGCGAACTGRCCGAACTRAY CGAA Y 5 Extend Collapse Merge Test 1Test 2Test 3 slide credits: M. Kellis
44
CS262 Lecture 18, Win07, Batzoglou Summary for promoter motifs RankDiscovered Motif Known TF motif Tissue Enrichment Distance bias 1RCGCAnGCGYNRF-1Yes 2CACGTGMYCYes 3SCGGAAGYELK-1Yes 4ACTAYRnnnCCCRYes 5GATTGGYNF-YYes 6GGGCGGRSP1Yes 7TGAnTCAAP-1Yes 8TMTCGCGAnRYes 9TGAYRTCAATF3Yes 10GCCATnTTGYY1Yes 11MGGAAGTGGABPYes 12CAGGTGE12Yes 13CTTTGTLEF1Yes 14TGACGTCAATF3Yes 15CAGCTGAP-4Yes 16RYTTCCTGC-ETS-2Yes 17AACTTTIRF1(*)Yes 18TCAnnTGAYSREBP-1Yes 19GKCGCn(7)TGAYGYes 20GTGACGYE4F1Yes 21GGAAnCGGAAnYYes 22TGCGCAnKYes 23TAATTACHX10Yes 24GGGAGGRRMAZYes 25TGACCTYERRAYes 174 promoter motifs 70 match known TF motifs 115 expression enrichment 60 show positional bias 75% have evidence Control sequences < 2% match known TF motifs < 5% expression enrichment < 3% show positional bias < 7% false positives Most discovered motifs are likely to be functional New slide credits: M. Kellis
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.