1
Noncoding RNA Genes Pt. 2 SCFGs CS374 Vincent Dorie
2
Motivation Noncoding RNA genes can be anywhere Noncoding RNA genes can do anything
3
Location rRNA, snRNA Exons? Introns Viral vectors
4
Function
5
Function, pt. 2
6
Overview “RSEARCH: Finding homologs of single structured RNA sequences” by Klein and Eddy (2003) “Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars” by Holmes and Rubin (2002)
7
Comparison - Methodology RSEARCH DART (Stemloc) Sequence
8
Comparison, Pt. 2 - Uses RSEARCH Find parts of a genome which may be homologous to query sequence More practical in comparative genomics DART (Stemloc) Investigate a specific sequence suspected of being homologous to query sequence
9
Comparison, Pt. 3 - Complexity RSEARCH $O((M - B)LD + BLD^2)$ to scan, $O(M^4)$ to calculate statistics DART (Stemloc) Between $O(LM)$ and $O(L^3 M^3)$
10
Background: Context Free Grammars Four-tuple {N, T, S, P} N is a set of nonterminals T is a set of terminals S is the start symbol, S ∈ N P is a set of productions
11
Context Free Grammars, pt. 2 Sample Grammar N = {S, A, B} T = {a, u, c, g, } P = { S -> A | B, A -> aAc | aBc | g, B -> g }
12
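Context Free Grammars, pt. 3 Parse Trees Parse: aagcc [Figure: two parse trees for aagcc. Both begin S -> A -> aAc; the inner A then expands either as aAc with A -> g, or as aBc with B -> g.]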
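The ambiguity of the sample grammar can be checked mechanically. The following is a minimal sketch, not part of the original slides: it enumerates every parse tree of a string by brute force, using only the three production sets listed on the previous slide. The function names `parses` and `expand` are illustrative, and the code assumes the grammar has no empty productions.

```python
# Productions of the sample grammar; uppercase symbols are nonterminals.
GRAMMAR = {
    "S": ["A", "B"],
    "A": ["aAc", "aBc", "g"],
    "B": ["g"],
}

def parses(symbol, s):
    """Return every parse tree (nested tuples) deriving string s from symbol."""
    trees = []
    for rhs in GRAMMAR[symbol]:
        for children in expand(rhs, s):
            trees.append((symbol, children))
    return trees

def expand(rhs, s):
    """Return every way to derive s from the symbol sequence rhs."""
    if not rhs:
        return [()] if not s else []
    head, rest = rhs[0], rhs[1:]
    results = []
    if head in GRAMMAR:
        # Nonterminal: try each non-empty prefix, leaving at least one
        # character for every remaining symbol (no empty productions).
        for cut in range(1, len(s) - len(rest) + 1):
            for sub in parses(head, s[:cut]):
                for tail in expand(rest, s[cut:]):
                    results.append((sub,) + tail)
    elif s and s[0] == head:
        # Terminal: must match the next character exactly.
        for tail in expand(rest, s[1:]):
            results.append((head,) + tail)
    return results

for tree in parses("S", "aagcc"):
    print(tree)
```

Running it on `aagcc` lists two distinct trees, which is why a stochastic grammar needs to assign probabilities to parses rather than only to strings.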
13
Stochastic CFG Each production associated with a probability Probabilities for all productions starting from a given nonterminal sum to one Superset of HMM Assigns a probability to a parse E.g. S -> A, 0.3 | B, 0.7
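As a rough illustration of how an SCFG scores a parse, the sketch below multiplies the probabilities of the productions used. Only the S -> A (0.3) and S -> B (0.7) values come from the slide; the probabilities for the A productions are assumptions chosen so that they sum to one, and the helper names are hypothetical.

```python
import math

PROBS = {
    ("S", "A"): 0.3, ("S", "B"): 0.7,                        # from the slide
    ("A", "aAc"): 0.5, ("A", "aBc"): 0.2, ("A", "g"): 0.3,   # assumed values
    ("B", "g"): 1.0,                                         # B has one production
}

def is_normalized(probs):
    """Check that productions sharing a left-hand side sum to one."""
    totals = {}
    for (lhs, _), p in probs.items():
        totals[lhs] = totals.get(lhs, 0.0) + p
    return all(math.isclose(t, 1.0) for t in totals.values())

def parse_probability(productions):
    """Probability of a parse: the product of its production probabilities."""
    return math.prod(PROBS[p] for p in productions)

# The two parses of "aagcc" under the sample grammar.
parse_1 = [("S", "A"), ("A", "aAc"), ("A", "aAc"), ("A", "g")]
parse_2 = [("S", "A"), ("A", "aAc"), ("A", "aBc"), ("B", "g")]

print(is_normalized(PROBS))
print(parse_probability(parse_1), parse_probability(parse_2))
```

Summing the two parse probabilities gives the probability of the string itself, which is the quantity the Inside algorithm computes without enumerating parses.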
14
Pairwise (profile) SCFG Terminals in each production can exist in each of two strings E.g. W -> x_i y_k V x_j y_l
15
RSEARCH: pSCFG Simplified Each secondary structure specifies (most of) a grammar, creating a “Model Architecture” Eschews probabilistic interpretation Problem becomes fitting target to model architecture
17
Node Types vs. Node States Node types are what we want to do given the model (e.g. MATP is match pair) Node states represent what happens when scanning a target sequence E.g. node type is MATP, but the target sequence doesn’t have a pair in that location -> insert a gap
18
Node States Set of node states possible for node type
19
Gap Classes Gap class per node type/state pair
20
Transition Scores Gap class determines transition scores Gap penalties are affine
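An affine penalty charges more to open a gap than to extend one. The sketch below assumes the common convention in which the first gap position pays the open penalty and each further position pays the extension penalty; the slides do not state RSEARCH's exact convention or values, so the function name and the numbers in the usage lines are purely illustrative.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty for one gap of `length` positions under an affine model.

    Assumed convention: the first position costs gap_open and every
    additional position costs gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend

# Illustrative values only: one long gap is cheaper than several short ones.
print(affine_gap_penalty(4, 10, 2))        # a single gap of length 4
print(4 * affine_gap_penalty(1, 10, 2))    # four separate gaps of length 1
```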
21
Emission Scores Emission scores determined empirically
23
Parameterizing the Model Emission Scores Substitution Matrices Scores are observed / random
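"Observed / random" is most naturally read as a log-odds ratio, as in standard substitution matrices. The sketch below assumes that reading: the score is the base-2 logarithm of the observed substitution frequency divided by the frequency expected if the two residues occurred independently. The function name and the frequencies are illustrative, not values from RIBOSUM.

```python
import math

def log_odds(observed, background_i, background_j):
    """Log-odds score in bits: log2(observed / expected-by-chance).

    observed      -- frequency of residue i aligned to residue j
    background_*  -- frequency of each residue on its own (assumed independent)
    """
    return math.log2(observed / (background_i * background_j))

# With uniform backgrounds of 0.25, chance alone predicts 0.0625.
print(log_odds(0.125, 0.25, 0.25))    # seen twice as often as chance
print(log_odds(0.03125, 0.25, 0.25))  # seen half as often as chance
```

Positive scores mark substitutions seen more often than chance, negative scores mark those seen less often.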
24
RIBOSUM Matrices Start with MSA Whose MSA? RIBOSUM[X, Y] Sequences X% identical are reweighted to sum to 1 Only sequences Y% identical are counted in making matrices
25
Model Parameters Gap open penalty (single and pair) Gap extension penalty (single and pair) Internal start penalty Internal end penalty
26
Solution Guess and check “We might have been able to derive a more robust parameter set had we used a more comprehensive set of tests, but the long running time required by RSEARCH makes such an approach infeasible.”
27
Digression: Biostatistics Confidence intervals Expectation values
28
Gumbel Distribution Parameterized by λ and K: $E = K N e^{-\lambda x}$, $P = 1 - e^{-E}$
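The two formulas translate directly into code. This is a minimal sketch: `lam` stands for λ, `K` and `N` are the constants from the slide (N being the search-space size), `x` is the hit score, and the numeric values in the usage lines are made up for illustration.

```python
import math

def e_value(x, lam, K, N):
    """Expected number of chance hits scoring at least x: E = K * N * exp(-lam * x)."""
    return K * N * math.exp(-lam * x)

def p_value(e):
    """Probability of at least one chance hit: P = 1 - exp(-E).

    expm1 keeps precision when E is very small.
    """
    return -math.expm1(-e)

# Illustrative parameters only.
e = e_value(x=30.0, lam=0.7, K=0.1, N=2.1e7)
print(e, p_value(e))
```

For small E the two values nearly coincide, so the P-value only adds information when E approaches or exceeds one.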
29
Gumbel Distribution, pt. 2 K and λ depend on G+C content of target database For database with heterogeneous G+C content, compute K and λ for G+C bins
30
Putting it All Together Run against database substrings of length two times the query Greedily take K best, non-overlapping hits Recover alignments Report: score, position in database, alignment, E-value, P-value Statistics need to be calculated for every query and target database
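The greedy selection step can be sketched as follows. This is an assumption about the general technique rather than RSEARCH's actual code: hits are `(score, start, end)` tuples with half-open coordinates, the highest-scoring hit is kept first, and any later hit that overlaps a kept one is discarded. The name `greedy_hits` and the sample hits are hypothetical.

```python
def greedy_hits(hits, k):
    """Pick up to k best-scoring, mutually non-overlapping hits.

    hits -- iterable of (score, start, end) with half-open [start, end)
    """
    chosen = []
    for score, start, end in sorted(hits, reverse=True):
        if len(chosen) == k:
            break
        # Keep the hit only if it lies entirely left or right of each kept hit.
        if all(end <= s or start >= e for _, s, e in chosen):
            chosen.append((score, start, end))
    return chosen

hits = [(10, 0, 50), (9, 40, 90), (8, 60, 100), (7, 200, 250)]
print(greedy_hits(hits, 2))
```

Here the second-best hit is dropped because it overlaps the best one, so the third-best hit is reported instead.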
31
Time For a 113 nt sequence against a $2.1 \times 10^7$ nt database, 2.9 CPU days. 2% computing statistics For a 330 nt sequence against a $2.1 \times 10^7$ nt database, 38 CPU days. 7% computing statistics Parallelized to 33 minutes and 7.4 hours respectively
32
Shifting Gears Fold Envelopes Pre-enumerates the pSCFG search space Presents conditional versions of dynamic programming algorithms User defined complexity
33
Fold Envelopes, pt. 2 Conceptualize search over grammars and parse trees Each node W_u in the tree accounts for a subsequence Inside sequence: accounts for X_{i..j} Outside sequence: accounts for X_{0..i} and X_{j..L}
34
Analogy: Message Passing Inside algorithm: likelihood of sequence over all possible parses Cocke-Younger-Kasami algorithm: maximum likelihood parse of a sequence Inside-Outside algorithm: expected number of times each grammar production is used Use fold envelopes to limit messages by restricting subsequences considered
35
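The Inside Algorithm To compute $a(i, j, V) = P(x_i \ldots x_j \text{ produced by } V)$: $a(i, j, V) = \sum_{X} \sum_{Y} \sum_{k=i}^{j-1} a(i, k, X)\, a(k+1, j, Y)\, P(V \to XY)$ [Figure: V spans x_i..x_j and is split at k into X over x_i..x_k and Y over x_{k+1}..x_j. Credit: Batzoglou]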
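The recurrence can be sketched for a single sequence and a grammar in Chomsky normal form. Everything named here is an assumption for illustration: `binary` maps `(V, X, Y)` to P(V -> XY), `unary` maps `(V, terminal)` to P(V -> terminal), and the optional `envelope` is a set of allowed `(i, j)` spans standing in for a fold envelope. A real pairwise SCFG, as in Stemloc, indexes subsequences of two sequences at once and is considerably more involved.

```python
from collections import defaultdict

def inside(seq, binary, unary, start="S", envelope=None):
    """Inside probability of seq under a CNF stochastic grammar.

    a[(i, j, V)] holds P(V derives seq[i..j]) with inclusive indices.
    If envelope is given, spans longer than one symbol are only filled
    when (i, j) is in the envelope; all other spans keep probability 0.
    """
    n = len(seq)
    a = defaultdict(float)
    # Base case: single symbols come from terminal productions.
    for i, ch in enumerate(seq):
        for (V, t), p in unary.items():
            if t == ch:
                a[(i, i, V)] += p
    # Recursion: build longer spans from every split point k.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            if envelope is not None and (i, j) not in envelope:
                continue
            for (V, X, Y), p in binary.items():
                for k in range(i, j):
                    a[(i, j, V)] += a[(i, k, X)] * a[(k + 1, j, Y)] * p
    return a[(0, n - 1, start)]

# Toy grammar (assumed): S -> S S with 0.5, S -> a with 0.5.
binary = {("S", "S", "S"): 0.5}
unary = {("S", "a"): 0.5}
print(inside("aaa", binary, unary))
print(inside("aaa", binary, unary, envelope={(0, 1), (0, 2)}))
```

In the second call the envelope excludes the span (1, 2), so one of the two parses of `aaa` is never considered and the reported probability halves. That is the trade the envelope makes: fewer cells to fill in exchange for ignoring parses outside it.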
36
Constructing Fold Envelopes Constrain to possible secondary structures Constrain to primary sequence alignment
37
Summary RSEARCH to find a set of possible homologs, sorted by score and statistics Fold Envelopes permit greater search depth in case of unfolded comparisons RSEARCH employs simplified pSCFGs Fold Envelopes are useful over full spectrum of comparisons but represent more computationally complex situations