1 RNA Secondary Structure
aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc

2 CS262 Lecture 20, Win07, Batzoglou Stochastic Context Free Grammars
In analogy to HMMs, we can assign probabilities to transitions. Given the grammar
  X1 → s11 | … | s1n
  …
  Xm → sm1 | … | smn
we can assign a probability to each rule, such that
  P(Xi → si1) + … + P(Xi → sin) = 1

3 CS262 Lecture 20, Win07, Batzoglou The CYK algorithm (Cocke–Younger–Kasami)
Initialization: for i = 1 to N and any nonterminal V,
  γ(i, i, V) = log P(V → xi)
Iteration: for i = 1 to N − 1, for j = i + 1 to N, for any nonterminal V,
  γ(i, j, V) = maxX maxY max{i ≤ k < j} [ γ(i, k, X) + γ(k+1, j, Y) + log P(V → XY) ]
Termination: log P(x | θ, π*) = γ(1, N, S),
where π* is the optimal parse tree (recovered by tracing back through the maximizations above).
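
The recurrence above can be written down almost verbatim. Below is a minimal Python sketch, assuming a Chomsky-normal-form SCFG given as two hypothetical dictionaries of log-probabilities (emit_logp for V → a rules, rule_logp for V → XY rules); indices are 0-based rather than the 1-based indices on the slide.

```python
from collections import defaultdict

def cyk(x, emit_logp, rule_logp, start="S"):
    """CYK for an SCFG in Chomsky normal form.
    emit_logp[(V, a)]    = log P(V -> a)
    rule_logp[(V, X, Y)] = log P(V -> X Y)
    Returns gamma(1, N, S) = log P(x | theta, pi*) and a traceback table."""
    N = len(x)
    gamma = defaultdict(lambda: float("-inf"))   # gamma[(i, j, V)], 0-based, inclusive
    back = {}

    # Initialization: gamma(i, i, V) = log P(V -> x_i)
    for i in range(N):
        for (V, a), lp in emit_logp.items():
            if a == x[i]:
                gamma[(i, i, V)] = lp

    # Iteration over increasing span lengths; max over rules V -> X Y and split points k
    for span in range(2, N + 1):
        for i in range(N - span + 1):
            j = i + span - 1
            for (V, X, Y), lp in rule_logp.items():
                for k in range(i, j):
                    s = gamma[(i, k, X)] + gamma[(k + 1, j, Y)] + lp
                    if s > gamma[(i, j, V)]:
                        gamma[(i, j, V)] = s
                        back[(i, j, V)] = (k, X, Y)

    # Termination: best log-probability parse of the whole sequence rooted at the start symbol
    return gamma[(0, N - 1, start)], back
```

Tracing `back` from (0, N − 1, S) recovers π*, the optimal parse tree.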

4 CS262 Lecture 20, Win07, Batzoglou Evaluation
Recall HMMs:
  Forward:  fl(i) = P(x1…xi, πi = l)
  Backward: bk(i) = P(xi+1…xN | πi = k)
  Then P(x) = Σk fk(N) ak0 = Σk a0k ek(x1) bk(1)
Analogue in SCFGs:
  Inside:  a(i, j, V) = P(xi…xj is generated by nonterminal V)
  Outside: b(i, j, V) = P(x, excluding xi…xj, is generated by S, with the excluded part rooted at V)

5 CS262 Lecture 20, Win07, Batzoglou The Inside Algorithm
To compute a(i, j, V) = P(xi…xj is produced by V):
  a(i, j, V) = ΣX ΣY Σ{i ≤ k < j} a(i, k, X) a(k+1, j, Y) P(V → XY)

6 CS262 Lecture 20, Win07, Batzoglou Algorithm: Inside
Initialization: for i = 1 to N and V a nonterminal,
  a(i, i, V) = P(V → xi)
Iteration: for i = 1 to N − 1, for j = i + 1 to N, for V a nonterminal,
  a(i, j, V) = ΣX ΣY Σ{i ≤ k < j} a(i, k, X) a(k+1, j, Y) P(V → XY)
Termination: P(x | θ) = a(1, N, S)
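
A minimal sketch of the inside recursion in Python, under the same assumptions as the CYK sketch above (Chomsky-normal-form grammar, hypothetical dictionary representation, 0-based indices), this time with probabilities rather than log-probabilities:

```python
import numpy as np

def inside(x, emit_p, rule_p, nonterms, start="S"):
    """Inside algorithm: a[i, j, V] = P(x_i..x_j is generated by nonterminal V).
    emit_p[(V, a)] = P(V -> a); rule_p[(V, X, Y)] = P(V -> X Y).
    Returns the inside table and P(x | theta)."""
    N = len(x)
    idx = {V: n for n, V in enumerate(nonterms)}
    a = np.zeros((N, N, len(nonterms)))

    # Initialization: a(i, i, V) = P(V -> x_i)
    for i in range(N):
        for V in nonterms:
            a[i, i, idx[V]] = emit_p.get((V, x[i]), 0.0)

    # Iteration: sum over rules V -> X Y and split points i <= k < j
    for span in range(2, N + 1):
        for i in range(N - span + 1):
            j = i + span - 1
            for (V, X, Y), p in rule_p.items():
                for k in range(i, j):
                    a[i, j, idx[V]] += a[i, k, idx[X]] * a[k + 1, j, idx[Y]] * p

    # Termination: P(x | theta) = a(1, N, S)
    return a, a[0, N - 1, idx[start]]
```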

7 CS262 Lecture 20, Win07, Batzoglou The Outside Algorithm
b(i, j, V) = P(x1…xi−1, xj+1…xN, where the "gap" xi…xj is rooted at V)
When V is the right-hand nonterminal of a production Y → XV:
  b(i, j, V) = ΣX ΣY Σ{k < i} a(k, i − 1, X) b(k, j, Y) P(Y → XV)

8 CS262 Lecture 20, Win07, Batzoglou Algorithm: Outside
Initialization:
  b(1, N, S) = 1
  For any other V, b(1, N, V) = 0
Iteration: for i = 1 to N, for j = N down to i, for V a nonterminal,
  b(i, j, V) = ΣX ΣY Σ{k < i} a(k, i − 1, X) b(k, j, Y) P(Y → XV)
             + ΣX ΣY Σ{k > j} a(j + 1, k, X) b(i, k, Y) P(Y → VX)
Termination: for any i,
  P(x | θ) = ΣX b(i, i, X) P(X → xi)
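
A companion sketch of the outside recursion, consuming the inside table `a` from the sketch above; again a minimal illustration under the same assumed grammar representation, iterating from wider gaps to narrower ones so that parent cells are ready when needed.

```python
import numpy as np

def outside(x, rule_p, nonterms, a, start="S"):
    """Outside algorithm: b[i, j, V] = P(x_1..x_{i-1}, x_{j+1}..x_N, gap rooted at V).
    `a` is the inside table produced by inside() above; indices are 0-based, inclusive."""
    N = len(x)
    idx = {V: n for n, V in enumerate(nonterms)}
    b = np.zeros((N, N, len(nonterms)))

    # Initialization: b(1, N, S) = 1, all other b(1, N, V) = 0
    b[0, N - 1, idx[start]] = 1.0

    # Iteration: from the widest gaps down to single positions
    for span in range(N - 1, 0, -1):
        for i in range(N - span + 1):
            j = i + span - 1
            # V is the right child of a rule Y -> X V; X covers x_k..x_{i-1}, k < i
            for (Y, X, V), p in rule_p.items():
                for k in range(i):
                    b[i, j, idx[V]] += a[k, i - 1, idx[X]] * b[k, j, idx[Y]] * p
            # V is the left child of a rule Y -> V X; X covers x_{j+1}..x_k, k > j
            for (Y, V, X), p in rule_p.items():
                for k in range(j + 1, N):
                    b[i, j, idx[V]] += a[j + 1, k, idx[X]] * b[i, k, idx[Y]] * p

    # Termination check: for any i, P(x | theta) = sum_X b(i, i, X) P(X -> x_i)
    return b
```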

9 CS262 Lecture 20, Win07, Batzoglou Learning for SCFGs
We can now estimate c(V) = expected number of times V is used in the parse of x1…xN:
  c(V) = (1 / P(x | θ)) Σ{1 ≤ i ≤ N} Σ{i ≤ j ≤ N} a(i, j, V) b(i, j, V)
  c(V → XY) = (1 / P(x | θ)) Σ{1 ≤ i ≤ N} Σ{i < j ≤ N} Σ{i ≤ k < j} b(i, j, V) a(i, k, X) a(k+1, j, Y) P(V → XY)

10 CS262 Lecture 20, Win07, Batzoglou Learning for SCFGs
Then, we can re-estimate the parameters with EM:
  Pnew(V → XY) = c(V → XY) / c(V)
  Pnew(V → a) = c(V → a) / c(V)
              = Σ{i: xi = a} b(i, i, V) P(V → a)  /  Σ{1 ≤ i ≤ N} Σ{i ≤ j ≤ N} a(i, j, V) b(i, j, V)
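
Putting the pieces together, here is a minimal sketch of one EM re-estimation step built from the inside and outside tables computed above; it assumes a single training sequence and the same hypothetical dictionary representation of the grammar.

```python
def em_step(x, emit_p, rule_p, nonterms, a, b, start="S"):
    """One EM re-estimation step: P_new(V -> XY) = c(V -> XY) / c(V),
    P_new(V -> a) = c(V -> a) / c(V), with counts from the inside (a) and outside (b) tables."""
    N = len(x)
    idx = {V: n for n, V in enumerate(nonterms)}
    px = a[0, N - 1, idx[start]]                      # P(x | theta)

    # c(V) = (1 / P(x)) * sum_{i <= j} a(i, j, V) b(i, j, V)
    # (entries with i > j are zero in both tables, so summing the full matrix is safe)
    cV = {V: (a[:, :, idx[V]] * b[:, :, idx[V]]).sum() / px for V in nonterms}

    # c(V -> X Y) = (1 / P(x)) * sum_{i < j} sum_{i <= k < j} b(i,j,V) a(i,k,X) a(k+1,j,Y) P(V -> XY)
    new_rule = {}
    for (V, X, Y), p in rule_p.items():
        c = 0.0
        for i in range(N):
            for j in range(i + 1, N):
                for k in range(i, j):
                    c += b[i, j, idx[V]] * a[i, k, idx[X]] * a[k + 1, j, idx[Y]] * p
        new_rule[(V, X, Y)] = (c / px) / cV[V]

    # c(V -> a) = (1 / P(x)) * sum_{i: x_i = a} b(i, i, V) P(V -> a)
    new_emit = {}
    for (V, ch), p in emit_p.items():
        c = sum(b[i, i, idx[V]] * p for i in range(N) if x[i] == ch)
        new_emit[(V, ch)] = (c / px) / cV[V]

    return new_rule, new_emit
```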

11 CS262 Lecture 20, Win07, Batzoglou Summary: SCFG and HMM algorithms
  GOAL                HMM algorithm    SCFG algorithm
  Optimal parse       Viterbi          CYK
  Estimation          Forward          Inside
                      Backward         Outside
  Learning            EM: Fw/Bck       EM: Ins/Outs
  Memory complexity   O(N K)           O(N^2 K)
  Time complexity     O(N K^2)         O(N^3 K^3)
where K is the number of states in the HMM / the number of nonterminals in the SCFG.

12 CS262 Lecture 20, Win07, Batzoglou Physics-based models for RNA structure prediction
Physics models define a free energy, ΔE(x, y):
(1) Decompose y into secondary structure elements
(2) Assess the "free energy" contribution of the features of each element
(3) ΔE(x, y) = sum of free energies for all features of all elements in y
(Figure: a helix contributes stacking terms such as ΔE_GC/GC; a hairpin loop contributes ΔE_hairpin_length[4].)
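
As a toy illustration of step (3), the sketch below sums tabulated free-energy contributions over the features of a decomposed structure. The feature names and numerical values are made up for the example; real values come from measured nearest-neighbor parameter tables.

```python
# Hypothetical free-energy contributions (kcal/mol), for illustration only.
delta_E = {
    ("stack", "GC/GC"): -3.3,
    ("stack", "AU/UA"): -1.1,
    ("hairpin_length", 4): 5.6,
}

def fold_energy(features):
    """Step (3): Delta E(x, y) = sum of free energies of all features of all elements in y."""
    return sum(delta_E[f] for f in features)

# A toy decomposition: two stacked base pairs closing a 4-nucleotide hairpin loop.
print(fold_energy([("stack", "GC/GC"), ("stack", "AU/UA"), ("hairpin_length", 4)]))
```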

13 CS262 Lecture 20, Win07, Batzoglou The Zuker algorithm – main ideas
Models the energy of a fold in terms of specific features:
1. Pairs of base pairs (stacked pairs)
2. Bulges
3. Loops (size, composition)
4. Interactions between stem and loop
(Figure: each feature is located by positions i and j (and j'), and by a length l, on the 5'→3' strands.)

14 CS262 Lecture 20, Win07, Batzoglou Comparison of SCFGs and Physics Models
Stochastic context-free grammars score a structure by P(x, y); physics-based models score it by ΔE(x, y). Either way, we find the highest-scoring pseudoknot-free structure via dynamic programming.

15 CS262 Lecture 20, Win07, Batzoglou Comparison of SCFGs and Physics Models
SCFGs:
+ train parameters automatically
+ easily tuned to specific data
− grammar rules arbitrary
− less accurate in practice
Physics-based models:
− parameters measured manually
− necessarily general-purpose
+ energetics determines structure
+ most accurate in practice
Can we get the accuracy of physics-based methods and the parameter-learning ease of probabilistic methods?

16 CS262 Lecture 20, Win07, Batzoglou Conditional log-linear models (CLLMs) -- CONTRAfold
In CONTRAfold, the logarithm of the probability of a structure y is a linear function of features.
In energy-based (physics-based) models, the free energy of a structure y is a linear function of features.
In both cases, parameters dotted with features give the "score" of structure y.

17 CS262 Lecture 20, Win07, Batzoglou Conditional log-linear models (CLLMs) -- CONTRAfold
In CONTRAfold, the logarithm of the probability of a structure y is a linear function of features (parameters · features = "score" of structure y).
Key points:
(1) We can define any set of features that we suspect to be energetically important
(2) Parameters w are learned automatically, and "useless" features are discarded
(3) The model is extensible
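
A minimal sketch of the log-linear idea: the score of a structure is the dot product w·F(x, y), and the conditional probability is the exponentiated score normalized over candidate structures. The feature names and weights below are hypothetical, and the explicit enumeration of candidates is only for illustration (CONTRAfold computes the normalizer with inside-style dynamic programming).

```python
import math

def score(w, F):
    """Linear 'score' of a structure y: parameters w dotted with feature counts F(x, y)."""
    return sum(w.get(f, 0.0) * c for f, c in F.items())

def conditional_probability(w, F_y, candidate_features):
    """CLLM: P(y | x) = exp(score(x, y)) / sum over candidates y' of exp(score(x, y'))."""
    Z = sum(math.exp(score(w, F)) for F in candidate_features)
    return math.exp(score(w, F_y)) / Z

# Hypothetical features and learned weights, for illustration only.
w = {"helix_GC/GC": 2.1, "hairpin_len_4": -1.3}
F_y1 = {"helix_GC/GC": 2, "hairpin_len_4": 1}     # feature counts of one candidate structure
F_y2 = {"helix_GC/GC": 0, "hairpin_len_4": 1}     # feature counts of another
print(conditional_probability(w, F_y1, [F_y1, F_y2]))
```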

18 CS262 Lecture 20, Win07, Batzoglou Posterior Decoding for Structure Prediction
1. Compute a base-pairing probability matrix for sequence x
2. Find the secondary structure maximizing γ · E[correct pairings] + E[correct unpaired bases]
(Figure: base-pairing probability matrix for AGCAGAGUGG…, and the structures predicted for γ = 1, γ = 8, and γ = 1024.)
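
A sketch of step 2, assuming the base-pairing probability matrix P is upper-triangular (P[i, j] for i < j): a Nussinov-style recursion maximizes γ times the posterior mass of the chosen pairs plus the posterior mass of the chosen unpaired bases. This illustrates the maximum-expected-accuracy idea; the exact objective used by CONTRAfold may differ in its details.

```python
import numpy as np

def expected_accuracy_fold(P, gamma=8.0):
    """Posterior decoding sketch: maximize gamma * E[correct pairings] + E[correct unpaired bases]
    over pseudoknot-free structures. P[i, j] (i < j) is the posterior pairing probability.
    Returns the best achievable score (a traceback would recover the structure itself)."""
    N = P.shape[0]
    q = 1.0 - P.sum(axis=0) - P.sum(axis=1)      # posterior probability that each base is unpaired
    M = np.zeros((N, N))
    for i in range(N):
        M[i, i] = q[i]                           # a single base can only be unpaired
    for span in range(1, N):
        for i in range(N - span):
            j = i + span
            best = M[i + 1, j] + q[i]            # base i left unpaired
            best = max(best, M[i, j - 1] + q[j]) # base j left unpaired
            inner = M[i + 1, j - 1] if i + 1 <= j - 1 else 0.0
            best = max(best, inner + gamma * P[i, j])        # i pairs with j
            for k in range(i + 1, j):                        # bifurcation into two substructures
                best = max(best, M[i, k] + M[k + 1, j])
            M[i, j] = best
    return M[0, N - 1]
```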

19 CS262 Lecture 20, Win07, Batzoglou Modeling multiple sequences
Given: aligned sequences x1, …, xn.
Compute sequence weights based on the evolutionary tree, then modify the scoring function appropriately.
(Figure: a phylogenetic tree relating the aligned sequences, x1 … x6 in the example.)

20 CS262 Lecture 20, Win07, Batzoglou Evaluation Methodology
Rfam (RNase P, 5S rRNA, GcvB RNA, U14 snoRNA, tmRNA, …): 503 annotated families.
From these, 151 independent examples (Example #1 … Example #151) are split into Fold A and Fold B for two-fold cross-validation.

21 CS262 Lecture 20, Win07, Batzoglou Comparison of methods for RNA structure prediction
(Figure: sensitivity vs. specificity of CONTRAfold (single and multi), Pfold (single and multi), Alifold (multi), RNAfold (single), and Mfold.)

22 CS262 Lecture 20, Win07, Batzoglou Probabilistic Models for Biosequences
HMM -- DNA gene structure.
  Sequence: atcgacggcggctacgtatctcagcagctgctgtgt agtgctgtcgctgtgctgtgtcgagctgagagaga gagcgtcgtagctcgatcactgtggatggctgatat agctacgagcgcggcgctatgtctatgtgttgctatg
  Parse: exon, intron, intergenic (EXON/INTRON segmentation of the DNA)
Prob-CFG -- RNA secondary structure.
  Sequence: agucuucacugaaugugauggacacuccgagacu
  Parse: ((((..(((.....))).((((...)))).)))) (Hairpin, Loop, Stem, …)
Pair-HMM -- pairwise alignment.
  Sequence, Sequence:
    AELTMPG--M-MTFVMEIKVT--HPGN----FH-WTWQ--LYTRLL
    AELTMPGNMMNMTFV---KVTDMHPGNTPKDFHDWT-QLVLY--VL
  Parse: MMMMMMMXXMXMMMMYYYMMMXXMMMMXXXXMMXMMYMXXMMYYMM (Match, GapX, GapY)
(The slide also shows the same protein alignment displayed in wrapped blocks.)

23 CS262 Lecture 20, Win07, Batzoglou Finding Regulatory Motifs using Comparative Genomics

24 Motifs are preferentially conserved across evolution
(Figure: multiple alignment of the GAL1–GAL10 intergenic region in four yeast species (Scer, Spar, Smik, Sbay), with asterisks marking conserved positions; factor footprints for GAL4, MIG1, and TBP and conservation islands are highlighted.)
Is this enough to discover motifs? No.
slide credits: M. Kellis

25 CS262 Lecture 20, Win07, Batzoglou Known motifs are frequently conserved
Across the human promoter regions, the ERRα motif:
  appears 434 times
  is conserved (in the human/dog/mouse/rat comparison) 162 times
Conservation rate: 37%
Compare to random control motifs:
– Conservation rate of control motifs: 6.8%
– ERRα enrichment: 5.4-fold
– ERRα p-value < 10^-50 (25 standard deviations under the binomial)
Motif Conservation Score (MCS)
slide credits: M. Kellis
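
The enrichment arithmetic on this slide is easy to reproduce. The sketch below assumes SciPy is available for the binomial tail probability; the numbers are taken directly from the slide.

```python
import math
from scipy.stats import binom

# From the slide: the ERRalpha motif appears 434 times in human promoter regions
# and is conserved in the human/dog/mouse/rat comparison 162 times.
n, k = 434, 162
motif_rate = k / n                      # ~0.37 conservation rate
control_rate = 0.068                    # conservation rate of random control motifs

enrichment = motif_rate / control_rate  # ~5.4-fold enrichment
# Binomial p-value: probability of seeing >= k conserved instances out of n
# if the motif were conserved only at the control rate.
p_value = binom.sf(k - 1, n, control_rate)
# Number of standard deviations above the binomial expectation.
z = (k - n * control_rate) / math.sqrt(n * control_rate * (1 - control_rate))
print(f"enrichment={enrichment:.1f}x, p-value={p_value:.1e}, z={z:.1f} sd")
```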

26 CS262 Lecture 20, Win07, Batzoglou Finding conserved motifs in whole genomes
1. Define seed "mini-motifs"
2. Filter: isolate mini-motifs that are more conserved than average
3. Extend mini-motifs to full motifs
4. Validate against known databases of motifs & annotations
5. Report novel motifs
(Figure: an example mini-motif seed, CTACGA, with wildcard N positions.)
slide credits: M. Kellis

27 CS262 Lecture 20, Win07, Batzoglou Test 1: Intergenic conservation
(Figure: conserved count vs. total count in intergenic regions for each mini-motif; the motif CGG-11-CCG is highlighted.)
slide credits: M. Kellis

28 CS262 Lecture 20, Win07, Batzoglou Test 2: Intergenic vs. Coding
(Figure: intergenic conservation vs. coding conservation for each mini-motif; CGG-11-CCG is highlighted, and a region is labeled "Higher conservation in genes".)
slide credits: M. Kellis

29 CS262 Lecture 20, Win07, Batzoglou Test 3: Upstream vs. Downstream
(Figure: downstream conservation vs. upstream conservation for each mini-motif; CGG-11-CCG is highlighted, most patterns cluster together, and a region is labeled "Downstream motifs?".)
slide credits: M. Kellis

30 CS262 Lecture 20, Win07, Batzoglou Constructing full motifs
Mini-motifs that pass Tests 1–3 are extended, collapsed, and merged into full motifs: roughly 2,000 mini-motifs yield 72 full motifs.
(Figure: example mini-motifs around the seed CTA/CGA being extended and collapsed into consensus motifs containing degenerate bases such as R and Y.)
slide credits: M. Kellis

31 CS262 Lecture 20, Win07, Batzoglou Summary for promoter motifs
  Rank  Discovered motif   Known TF motif   Tissue enrichment / Distance bias
  1     RCGCAnGCGY         NRF-1            Yes
  2     CACGTG             MYC              Yes
  3     SCGGAAGY           ELK-1            Yes
  4     ACTAYRnnnCCCR      (none)           Yes
  5     GATTGGY            NF-Y             Yes
  6     GGGCGGR            SP1              Yes
  7     TGAnTCA            AP-1             Yes
  8     TMTCGCGAnR         (none)           Yes
  9     TGAYRTCA           ATF3             Yes
  10    GCCATnTTG          YY1              Yes
  11    MGGAAGTG           GABP             Yes
  12    CAGGTG             E12              Yes
  13    CTTTGT             LEF1             Yes
  14    TGACGTCA           ATF3             Yes
  15    CAGCTG             AP-4             Yes
  16    RYTTCCTG           c-ETS-2          Yes
  17    AACTTT             IRF1 (*)         Yes
  18    TCAnnTGAY          SREBP-1          Yes
  19    GKCGCn(7)TGAYG     (none)           Yes
  20    GTGACGY            E4F1             Yes
  21    GGAAnCGGAAnY       (none)           Yes
  22    TGCGCAnK           (none)           Yes
  23    TAATTA             CHX10            Yes
  24    GGGAGGRR           MAZ              Yes
  25    TGACCTY            ERRα             Yes
Overall: 174 promoter motifs; 70 match known TF motifs, 115 show expression enrichment, 60 show positional bias → 75% have evidence.
Control sequences: < 2% match known TF motifs, < 5% show expression enrichment, < 3% show positional bias → < 7% false positives.
Most discovered motifs are likely to be functional.
slide credits: M. Kellis

