2
RNA secondary structure prediction and runtime optimization
Greg Goldgof
October 5, 2006
CS374 Presentation, Stanford University
3
Presentation Overview
- Background on RNA secondary structure prediction
- CONTRAfold: probabilistic RNA folding
- How is RNA folding done from an algorithmic perspective?
- Other RNA folding methods: physics-based methods and SCFGs
- CandidateFold: RNA folding in O(n^2)
- Genome-wide accessible motif detection
4
What are RNA and mRNA?
- RNA is a polymer of the nucleotides A, U, C, and G, transcribed from DNA
- Traditional role: messenger molecule (mRNA)
- Transcription example: DNA GATTACA → RNA GAUUACA
5
What is RNA secondary structure/folding?
[Figure: a folded RNA with its structural elements labeled: helix (stem), hairpin loop, bulge loop, internal loop, multi-branch loop]
6
Pseudoknots
Pseudoknots will not be treated in this talk. Neither paper deals with them.
7
Non-coding RNA (RNA genes)
- RNA enzymes (catalytic RNA): ribosomal RNA (rRNA), transfer RNA (tRNA)
- RNAi (RNA-mediated gene regulation): micro RNA (miRNA), short-interfering RNA (siRNA)
- Alternative splicing: small-nuclear RNA (snRNA)
- Others: snoRNA, eRNA, srpRNA, tmRNA, gRNA
Structure is essential to function for many ncRNAs
9
CONTRAfold
Problem: Given an RNA sequence, predict the most likely secondary structure
Example input: AUCCCCGUAUCGAUC AAAAUCCAUGGGUAC CCUAGUGAAAGUGUA UAUACGUGCUCUGAU UCUUUACUGAGGAGU CAGUGAACGAACUGA
10
How does CONTRAfold work?
CONTRAfold looks at features that indicate a good structure. For example:
- C-G base pairings
- A-U base pairings
- Helices of length 5
- Hairpin loops of size 9
- Bulge loops of size 2
- CG/GC base-pair stacking interactions
These examples are called thermodynamic parameters because they represent free energy values
11
How does CONTRAfold choose a structure?
Every feature f_i is associated with a weight w_i. The probability of a structure y, given a sequence x, is determined by the following relationship:
$$P(y \mid x) \propto \exp\left( \sum_i w_i \, f_i(x, y) \right)$$
- y: structure, x: sequence
- w_i: weight of feature i
- f_i(x, y): number of occurrences of feature i in structure y generated from sequence x
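The relationship above can be illustrated with a small sketch. It assumes a hand-enumerated set of candidate structures, each described only by its feature counts; the feature names, weights, and the function name `structure_probabilities` are hypothetical. CONTRAfold itself normalizes over all possible structures with dynamic programming rather than by enumeration.

```python
import math


def structure_probabilities(feature_counts, weights):
    """Return P(y | x) for each candidate structure y.

    feature_counts: {structure_name: {feature_name: count}}  (hypothetical data)
    weights:        {feature_name: weight}                   (hypothetical data)
    """
    # Score of a structure = sum_i w_i * f_i(x, y)
    scores = {
        name: sum(weights.get(feat, 0.0) * count for feat, count in counts.items())
        for name, counts in feature_counts.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    unnormalized = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


# Illustrative weights and candidates, not values from CONTRAfold.
weights = {"CG_pair": 1.0, "AU_pair": 0.5}
candidates = {
    "two_CG_pairs": {"CG_pair": 2},
    "two_AU_pairs": {"AU_pair": 2},
    "unfolded": {},
}
probs = structure_probabilities(candidates, weights)
```

The structure with the largest weighted feature sum receives the largest probability, and the probabilities over the candidate set sum to 1.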
12
How does CONTRAfold choose a structure? Cont’d
- Considers all structures and finds the optimal structure via dynamic programming in O(n^3)
- Added bonus: a probability is associated with each base
[Figure legend: low-confidence bases are drawn lighter, high-confidence bases darker]
13
Parameter γ allows a trade-off between sensitivity and specificity
Sensitivity = (# correct base pairings) / (# true base pairings)
Specificity = (# correct base pairings) / (# predicted base pairings)
[Figure: predicted structures for the example sequence at γ = 1, γ = 8, and γ = 1024]
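The two definitions above can be computed directly from sets of base pairs. This is a minimal sketch; the function name and the convention of representing a base pair as an (i, j) tuple are assumptions made for illustration.

```python
def sensitivity_specificity(predicted_pairs, true_pairs):
    """Return (sensitivity, specificity) as defined on the slide.

    Each base pair is an (i, j) tuple of sequence positions.
    """
    predicted = set(predicted_pairs)
    true = set(true_pairs)
    correct = len(predicted & true)
    # Guard against empty sets to avoid division by zero.
    sensitivity = correct / len(true) if true else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical example: 2 of 3 predicted pairs are among the 4 true pairs.
sens, spec = sensitivity_specificity(
    predicted_pairs=[(0, 8), (1, 7), (3, 5)],
    true_pairs=[(0, 8), (1, 7), (2, 6), (10, 15)],
)
```

Predicting more pairs tends to raise sensitivity at the cost of specificity, which is the trade-off that γ controls.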
14
CONTRAfold learns how to predict good structures
- CONTRAfold trains on a set of published examples of known RNA structures taken from a database called Rfam (RNA families)
- A training set is a collection of known correct solutions that a program learns from
- CONTRAfold learns the relative value, or weight, of each of its features
- CONTRAfold determines the weight for each feature that maximizes its performance on the training set
15
CONTRAfold Performance
17
Other Methods
- Stochastic context-free grammars
- Physics-based models
18
Physics-based models
- All features reflect thermodynamic interactions
- Features are experimentally determined in the lab, rather than learned
- Until CONTRAfold, the best-performing method
Disadvantages compared to CONTRAfold
- Thermodynamic weights are difficult to calculate
- No incorporation of non-thermodynamic features
- Cannot be tailored to specific families of RNAs, since the weights are always the same
- Cannot trade off between sensitivity and specificity
- No probabilities associated with each base pairing
19
Stochastic context-free grammars
- Based on grammar rules with associated probabilities
- Rules: S → aSu | cSg | aS | uS | … | Su | SS | ε
- Probabilities (P): .21 .15 .11 .08 .03 .22 .02
- Let’s generate a structure for the sequence acuguaucuag:
  S → aS → acSg → acuSag → acuSuag → acugScuag → acuguScuag → acuguaScuag → acuguauScuag → acuguaucuag
  Resulting structure: .(((...).))
- We select the set of transformations that has the highest probability of generating the input sequence. This set gives us our structure.
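To make the link between a derivation and a structure concrete, here is a small sketch that replays the derivation above and emits the sequence together with its dot-bracket structure. The step encoding ("left", "right", "pair", "end") and the function name are hypothetical, the bifurcation rule S → SS is not handled, and rule probabilities are left out because the slide's probability-to-rule mapping is incomplete.

```python
def apply_derivation(steps):
    """Replay a derivation that uses a single nonterminal S.

    Each step is one of:
      ("pair", x, y)  for S -> x S y  (x and y are base-paired)
      ("left", x)     for S -> x S    (x is unpaired)
      ("right", x)    for S -> S x    (x is unpaired)
      ("end",)        for S -> epsilon
    Returns (sequence, dot_bracket_structure).
    """
    left_seq, left_struct = [], []
    right_seq, right_struct = [], []  # built outside-in, reversed at the end
    for step in steps:
        kind = step[0]
        if kind == "pair":
            left_seq.append(step[1])
            left_struct.append("(")
            right_seq.append(step[2])
            right_struct.append(")")
        elif kind == "left":
            left_seq.append(step[1])
            left_struct.append(".")
        elif kind == "right":
            right_seq.append(step[1])
            right_struct.append(".")
        elif kind == "end":
            break
        else:
            raise ValueError(f"unknown step kind: {kind!r}")
    sequence = "".join(left_seq) + "".join(reversed(right_seq))
    structure = "".join(left_struct) + "".join(reversed(right_struct))
    return sequence, structure


# The derivation from the slide: S -> aS -> acSg -> acuSag -> acuSuag -> ...
derivation = [
    ("left", "a"), ("pair", "c", "g"), ("pair", "u", "a"), ("right", "u"),
    ("pair", "g", "c"), ("left", "u"), ("left", "a"), ("left", "u"), ("end",),
]
sequence, structure = apply_derivation(derivation)
```

In a full SCFG parser, the probability of a derivation would be the product of the probabilities of the rules it uses, and the most probable derivation would be found with a CYK-style dynamic program rather than given by hand.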
20
Stochastic context-free grammars cont’d
- Like CONTRAfold, transformation probabilities can be automatically trained
- Therefore, they can also be optimized for specific datasets
- They provide an associated probability for a given structure
Disadvantages compared to CONTRAfold
- Grammar rules of SCFGs are less expressive than the features of CONTRAfold or physics-based methods
- Poor accuracy: always dominated by physics-based models
21
Advantages of CONTRAfold
- High accuracy
- Automated training of parameters
- Can be tuned to specific data
- Provides associated probabilities for each base pairing
- Ability to control the sensitivity/specificity trade-off
- Can incorporate both physics-based and non-thermodynamic parameters
23
How is RNA folding done? Simple Nussinov Folding Algorithm
- Only scores interactions between paired bases
- Useful for demonstrating the general structure of more complex folding algorithms
The score for the optimal structure from base i to base j is the highest-scoring of four cases:
- Base i is unpaired: consider pairing between i+1 and j
- Base j is unpaired: consider pairing between i and j-1
- Pair i and j: add δ(i, j), the score for a pairing between i and j, and consider pairing between i+1 and j-1
- i and j begin a bifurcation: consider every possible bifurcation point k and sum the scores from each folded structure
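The four cases described on the Nussinov slides (i unpaired, j unpaired, i paired with j, bifurcation at k) can be sketched as follows. This is a minimal illustration under stated assumptions: δ(i, j) is taken to be 1 for Watson-Crick and G-U pairs and 0 otherwise, a minimum hairpin loop of 3 unpaired bases is enforced, and the names `nussinov` and `traceback` are chosen for this example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def delta(a, b):
    """Assumed pairing score: 1 for an allowed pair, 0 otherwise."""
    return 1 if (a, b) in PAIRS else 0


def nussinov(seq, min_loop=3):
    """Fill S, where S[i][j] is the best score for the subsequence i..j."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j], S[i][j - 1])      # i unpaired, j unpaired
            if delta(seq[i], seq[j]):
                best = max(best, S[i + 1][j - 1] + delta(seq[i], seq[j]))
            for k in range(i + 1, j):                 # bifurcation at k
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S


def traceback(seq, S, min_loop=3):
    """Recover one optimal structure in dot-bracket notation."""
    structure = ["."] * len(seq)
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if S[i][j] == S[i + 1][j]:
            stack.append((i + 1, j))
        elif S[i][j] == S[i][j - 1]:
            stack.append((i, j - 1))
        elif delta(seq[i], seq[j]) and S[i][j] == S[i + 1][j - 1] + 1:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if S[i][j] == S[i][k] + S[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return "".join(structure)


seq = "GGGAAAUCC"
table = nussinov(seq)
best_score = table[0][len(seq) - 1]
fold = traceback(seq, table)
```

The three nested loops over span, i, and k are where the cubic running time comes from.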
26
How is RNA folding done? What is the runtime of the Nussinov algorithm?
For a given sequence of length n we must consider:
- All possible values of i: O(n)
- For each i, all possible values of j: O(n)
- For each i, j pair, all possible values of k: O(n)
O(n) * O(n) * O(n) → O(n^3)
27
A more sophisticated algorithm We want to take into account more advanced features than just base-pairings.
28
What is V(i, j)?
- eh = energy of a hairpin closed at i and j
- es = energy of the stacked pair i, j and i+1, j-1
- ebi = energy of a bulge or interior loop that begins at i, j and is closed at i’, j’
- The same old bifurcation equation, but i is paired to j
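The slide equations for V(i, j) did not survive extraction. One common way to combine the four terms above, in the style of the standard Zuker recurrences, is sketched below. It assumes that V(i, j) is the minimum energy of a structure on i..j in which i is paired with j, and that W(i, j) is the minimum energy of any structure on i..j; the index ranges are assumptions and multi-loop penalty terms are omitted.

```latex
V(i,j) = \min
\begin{cases}
  eh(i,j) \\
  es(i,j) + V(i+1,\,j-1) \\
  \min\limits_{i < i' < j' < j} \; ebi(i,j,i',j') + V(i',j') \\
  \min\limits_{i < k < j-1} \; W(i+1,\,k) + W(k+1,\,j-1)
\end{cases}
\qquad
W(i,j) = \min
\begin{cases}
  W(i+1,\,j) \\
  W(i,\,j-1) \\
  V(i,j) \\
  \min\limits_{i < k < j} \; W(i,k) + W(k+1,\,j)
\end{cases}
```

W plays the role of the Nussinov score with maximization replaced by energy minimization, and the last case of V is the bifurcation case with the pair (i, j) closing both branches.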
32
What is its runtime?
- Still only O(n^3), because we are only recursing on i, j, and k
- The interior-loop equation is theoretically O(n); however, it is standard to bound RNA interior loops by a constant (30), making it O(1)
34
CandidateFold
What does it do?
- Produces the same folding as the complex model in O(n^2 ψ(n)), where ψ(n) is shown to be a constant
How does it do it?
- Imposes some constraints on W and V
- Rather than trying all k in the bifurcation steps (from W and from V), they keep a list of candidate positions, reducing this step to O(1) time
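The candidate-list idea can be illustrated on a simplified problem. The sketch below is not the algorithm of Wexler et al., which works on the energy model with W and V; it applies the same kind of pruning to Nussinov-style base-pair maximization. The function name, the pairing rule, and the minimum loop size are assumptions. A position k is kept as a candidate for end position j only when closing the pair (k, j) is strictly better than every other way of folding k..j, so the split step scans the candidate list instead of every k.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def sparse_fold_score(seq, min_loop=3):
    """Return (max number of base pairs, total number of stored candidates)."""
    n = len(seq)
    if n == 0:
        return 0, 0
    N = [[0] * n for _ in range(n)]

    def get(i, j):
        # Score of the (possibly empty) subsequence i..j.
        return N[i][j] if i <= j else 0

    # candidates[j] holds (k, closed_score) for pairs (k, j) worth splitting at.
    candidates = [[] for _ in range(n)]
    for j in range(n):
        for i in range(j, -1, -1):
            best = get(i, j - 1)                      # j is unpaired
            for k, closed in candidates[j]:           # j pairs with a candidate k > i
                best = max(best, get(i, k - 1) + closed)
            if j - i > min_loop and (seq[i], seq[j]) in PAIRS:
                closed = get(i + 1, j - 1) + 1        # i pairs with j
                if closed > best:
                    # No other fold of i..j does as well, so i becomes a candidate.
                    candidates[j].append((i, closed))
                    best = closed
            N[i][j] = best
    return N[0][n - 1], sum(len(c) for c in candidates)


score, num_candidates = sparse_fold_score("GGGAAAUCC")
```

The score matches the full cubic recurrence, because any non-candidate pair is dominated either by leaving j unpaired or by a split at a later candidate. The speed-up depends on the candidate lists staying short, which is the property the ψ(n) factor captures.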
35
CandidateFold
- What is the advantage of CandidateFold? Much faster RNA folding
- What is an application of high-speed RNA folding? Accessible motif finding
37
What is an RNA regulatory motif?
- Motif: a conserved sequence element
- A regulator binds to a regulatory motif
- RNA regulatory motif: a motif used to regulate translation
[Figure: an RNA (G A U U A C A ...) containing the regulatory motif AUUAC, shown with a regulatory protein and with a microRNA (U A A U G)]
38
What is an accessible motif? If a sequence is part of an intramolecular hybridization, it is unlikely to bind to regulators We define a motif as “accessible” if none of its nucleotides is hybridized as part of the folding
39
Accessible motifs cont’d Therefore, only accessible sequences should be scanned for regulatory motifs
41
How do Wexler et al. detect regulatory motifs?
Problem: Given a set of mRNAs G, a parameter k denoting motif window size, and a pre-defined energy threshold δ, find the regulatory motifs
Stage 1: Process sequence set G to extract all “accessible windows”
- Run a sliding window of size k across each mRNA sequence
- Find the minimal energy fold for the sequence, assuming none of the bases in the window are paired
- If the energy of this folding minus the energy of a normal folding of the mRNA < δ, then accept the window
Stage 2: Search for regulatory motifs among the “accessible windows”
- Motif finding will be discussed in later lectures
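Stage 1 can be sketched as follows. The example makes a loud simplification: instead of a thermodynamic energy model, it uses a stand-in energy equal to minus the maximum number of base pairs, computed Nussinov-style. The function names, the pairing rule, and the minimum loop size are assumptions made for illustration, not the method of Wexler et al.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, blocked=frozenset(), min_loop=3):
    """Maximum number of base pairs when positions in `blocked` must stay unpaired."""
    n = len(seq)
    if n == 0:
        return 0
    S = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j], S[i][j - 1])
            if (seq[i], seq[j]) in PAIRS and i not in blocked and j not in blocked:
                best = max(best, S[i + 1][j - 1] + 1)
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]


def accessible_windows(seq, k, delta):
    """Return start positions of size-k windows accepted by the Stage 1 test."""
    normal_energy = -max_pairs(seq)  # stand-in for the unconstrained fold energy
    accepted = []
    for start in range(len(seq) - k + 1):
        window = frozenset(range(start, start + k))
        constrained_energy = -max_pairs(seq, blocked=window)
        # Accept if forcing the window open costs less than the threshold delta.
        if constrained_energy - normal_energy < delta:
            accepted.append(start)
    return accepted


windows = accessible_windows("GGGAAAUCC", k=3, delta=1)
```

Here the only accepted window is the hairpin loop AAA, since every other window overlaps a paired base. Each window requires one constrained fold of the whole sequence, which is why a fast folding algorithm matters at genome scale.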
42
Results: Degradation Related Motifs
43
Results: Tissue Specific microRNAs Silique: A long, slender, many-seeded, cylindrical fruit of the Mustard Family
44
The End
45
Works Cited
CB Do, DA Woods, S Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14): e90-e98, 2006.
Y Wexler, C Zilberstein, M Ziv-Ukelson. A Study of Accessible Motifs and RNA Folding Complexity. RECOMB 2006, LNBI 3909: 473-487, 2006.