Exploiting Conserved Structure for Faster Annotation of Non-Coding RNAs Without Loss of Accuracy
Zasha Weinberg and Walter L. Ruzzo
Presented by: Jeff Bonis, CISC841 - Bioinformatics
What Are Non-Coding RNAs (ncRNAs)?
"Functional molecules that do not code for proteins"
Examples: transfer RNA (tRNA), spliceosomal RNA, microRNA, regulatory RNA elements
Over 100 known ncRNA families
Secondary Structure of ncRNAs
Conserved, and therefore useful for identifying homologs
Secondary structure is functionally important to RNAs
Base pairing is important in pattern searching
e.g. 16S rRNA, part of the small subunit of the prokaryotic ribosome
What Techniques Exist?
Two models that predict homologs in ncRNA families:
–Covariance Models (CMs)
–Easy RNA Profile IdentificatioN (ERPIN) - http://tagc.univ-mrs.fr/erpin/
Both use a multiple alignment of family members with secondary structure annotation
A statistical model is built from this multiple alignment
Both display high sensitivity and specificity
What about ERPIN?
A DP algorithm matches the statistical profile against a target database and returns the solutions and their scores
Cannot take into account non-consensus bulges in helices (caused by indels)
Needs user-specified score thresholds, which compromises accuracy
CMs
"Specify a tree-like SCFG architecture suited for modelling consensus RNA secondary structures"
Cannot accommodate pseudoknots
Very slow algorithm
Which model should be improved?
The Covariance Model (CM) is chosen because its limitation, the inability to handle pseudoknots, matters little: pseudoknots contain little information anyway
Goal: address the slow speed without sacrificing accuracy
CMs are used in Rfam - http://rfam.wustl.edu
–8-gigabase genome database called RFAMSEQ
–Searching it for tRNAs with a raw CM takes over a year on a Pentium 4
–Over 100 ncRNA families
Previous improvements on speed
BLAST-based heuristic
–Known members are BLASTed against RFAMSEQ
–The CM is run on the resulting set
BLAST misses family members, especially where there is low sequence conservation
tRNAscan-SE - http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
–Uses two heuristic-based programs for tRNA searches
–The CM is run on the resulting set
–May miss tRNAs that the CM would find
How to improve sensitivity?
The authors previously developed rigorous filters with 100% sensitivity relative to the set the CM would find
Filters are based on profile HMMs
–A profile HMM is built from the CM, then run on the database
–Much of the database is filtered out; the CM runs on the remaining set
The HMM filter is based on sequence conservation
–It scanned effectively for 126 of the 139 ncRNA families in Rfam
–The other 13 display low sequence conservation but strong conservation of secondary structure, which an HMM cannot take into account
–Heuristic methods also miss these ncRNAs
How can these special biological situations be accounted for?
The authors propose 3 innovations to overcome these setbacks
–2 techniques include secondary structure information in filtering, at the expense of CPU time
Sub-CMs
–Hybrid filtering composed of CMs and profile HMMs
Store-pair
–Uses additional HMM states for modeling key base pairs
–A third technique helps reduce scan time
Runs filters in series, quickest first, ending with the most selective
Choosing the series is a shortest path problem
Results
The techniques worked for 11 of the 13 previously missed Rfam families
–Also found new hits missed by BLAST
For tRNAscan-SE, provided a rigorous scan for 3 of the 4 CMs, finding missed hits
On average 100 times faster than a raw CM scan
Uncovers members missed by heuristics
What are CMs anyway?
"Statistical models that can detect when a positional sequence and secondary structure resemble a given multiple RNA alignment"
Described in terms of stochastic context-free grammars (SCFGs)
Transformational grammars
–Rules: describe the grammar, of the form Si -> xL Si+1 xR, where xL and xR are the left and right nucleotides
–Terminals: symbols in the actual string (nucleotides)
–Non-terminals: abstract symbols (states)
–Parse: the series of steps to obtain the final output
Example (sketched in code below):
–RNA molecules CAG or GAC
–S1 -> c S2 g | g S2 c; S2 -> a
–Parse: S1 -> c S2 g -> cag
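To make the toy grammar above concrete, here is a minimal sketch that enumerates every string the grammar can derive together with its parse probability. The rule probabilities 0.6 and 0.4 are made up for illustration; the slide does not specify them.

```python
# Toy SCFG from this slide, with hypothetical rule probabilities.
# Each rule of the form S -> xL S' xR emits a left and a right nucleotide
# around a sub-state, exactly like a CM pair state; rules with no sub-state
# end the derivation.
rules = {
    # state: list of (left_emit, next_state, right_emit, probability)
    "S1": [("c", "S2", "g", 0.6),   # models the base pair c-g
           ("g", "S2", "c", 0.4)],  # models the base pair g-c
    "S2": [("a", None, "", 1.0)],   # loop nucleotide, no further state
}

def generate(state):
    """Yield (string, probability) for every complete parse from `state`."""
    for left, nxt, right, p in rules[state]:
        if nxt is None:
            yield left + right, p
        else:
            for inner, q in generate(nxt):
                yield left + inner + right, p * q

for seq, prob in generate("S1"):
    print(seq, prob)   # cag 0.6, then gac 0.4
```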
How are CMs used?
Each rule is assigned a probability
–Rules more consistent with the family have higher probability
The probability of a parse is the product of the probabilities of all the rules it uses
CMs use log-odds ratios and sum the scores instead of multiplying (see the sketch below)
The CM Viterbi algorithm requires a window length as input, which upper-bounds the family members' length and affects scan time
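As a small illustration of the log-odds scoring described above, the sketch below scores the parse S1 -> c S2 g -> cag from the previous slide. The rule probabilities and the uniform 0.25-per-nucleotide null model are assumptions for the example, not values from the paper.

```python
import math

# A parse is treated as a list of rules, each with its model probability and
# the number of nucleotides it emits. Instead of multiplying probabilities,
# the score is the sum of per-rule log-odds ratios against the null model,
# which avoids numerical underflow on long parses.
parse = [
    # (rule, model probability, nucleotides emitted)
    ("S1 -> c S2 g", 0.6, 2),
    ("S2 -> a",      1.0, 1),
]

def log_odds_score(parse, background=0.25):
    score = 0.0
    for _rule, p, n_emitted in parse:
        score += math.log2(p / (background ** n_emitted))
    return score

print(log_odds_score(parse))   # about 5.26 bits: more CM-like than random
```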
How are profile HMMs and CMs combined?
Given a CM, a profile HMM is created whose Viterbi score upper-bounds the CM's Viterbi score
–This guarantees 100% sensitivity relative to the CM
Filtering (sketched below):
–At each nucleotide position in the database, the HMM is used to compute an upper bound on the CM score of the subsequence starting there
–A CM scan is applied to all subsequences whose upper bound exceeds some threshold
–Subsequences below the threshold are filtered out
Profile HMMs are represented by regular grammars, which cannot emit paired nucleotides, e.g.
–CM: S1 -> a S2 u | c S2 g; S2 -> ε
–HMM: S1L -> a S2L | c S2L; S2L -> S1R; S1R -> g | u
The CM is expanded into a left and a right HMM
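The filtering loop amounts to the following sketch. `hmm_upper_bound` and `cm_score` are placeholder functions standing in for the HMM and CM Viterbi scans, not code from the paper; the point is that because the HMM score upper-bounds the CM score, discarding windows whose bound falls below the CM reporting threshold can never lose a CM hit.

```python
def rigorous_filter_scan(db_seq, window_len, threshold,
                         hmm_upper_bound, cm_score):
    """Scan db_seq with the cheap HMM bound first, then run the expensive
    CM only on the windows that survive."""
    hits = []
    for start in range(len(db_seq) - window_len + 1):
        window = db_seq[start:start + window_len]
        if hmm_upper_bound(window) < threshold:
            continue                 # safe to skip: CM score <= bound < threshold
        score = cm_score(window)     # costly CM Viterbi on the surviving window
        if score >= threshold:
            hits.append((start, score))
    return hits
```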
How can these be supplemented?
Selecting an optimal series of filters
–The filtering fraction (the fraction of the database left over) and the run time of a filter are estimated by running the filter on a training sequence
–Minimize the expected total CPU time
–Assumptions: the estimated fractions and CPU times are constant across sequences; a filter's fraction is not affected by the previously run filters
The optimal sequence of filters is found by solving a shortest path problem (see the sketch below)
–Nodes are the filters and the CM
–Edge weights are CPU times
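A minimal sketch of the filter-series optimization, under my own simplifying assumption that the candidate filters are indexed from least to most selective and that the surviving fraction after running filter i is frac[i] regardless of which earlier filters ran. The expected cost of running filter j right after filter i is then frac[i] * cpu[j], and the cheapest series ending with the full CM is a shortest path over the filters; the numbers in the usage example are invented.

```python
def best_filter_series(frac, cpu, cm_cpu):
    """frac[i], cpu[i]: estimated surviving fraction and per-nucleotide CPU
    time of filter i (measured on a training sequence); cm_cpu: per-nucleotide
    CPU time of the raw CM. Returns (expected cost, ordered filter indices)."""
    n = len(frac)
    # cost[j]: cheapest expected time of a series that ends by running filter j
    cost, prev = [float("inf")] * n, [None] * n
    for j in range(n):
        cost[j], prev[j] = cpu[j], None      # run filter j first, on the whole DB
        for i in range(j):                   # or run it after a cheaper filter i
            c = cost[i] + frac[i] * cpu[j]
            if c < cost[j]:
                cost[j], prev[j] = c, i
    # finish by running the CM on whatever survives the last filter
    best_total, best_last = cm_cpu, None     # baseline: no filter at all
    for i in range(n):
        total = cost[i] + frac[i] * cm_cpu
        if total < best_total:
            best_total, best_last = total, i
    series = []
    while best_last is not None:             # backtrack the chosen series
        series.append(best_last)
        best_last = prev[best_last]
    return best_total, series[::-1]

# e.g. a fast HMM filter and a slower, more selective sub-CM filter
print(best_filter_series(frac=[0.10, 0.01], cpu=[1.0, 5.0], cm_cpu=1000.0))
# -> (11.5, [0, 1]): HMM filter, then sub-CM filter, then the CM
```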
Sub-CM technique
Exploits information in hairpins (bulges and internal loops)
Much information is stored in short hairpins, which need only part of the CM's states
The filter grammar contains both HMM and CM parts (roughly sketched below)
The window length of the sub-CM is crucial
HMMs are created manually after sub-CMs are found
–Automating this is a future project
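As a rough illustration only: the hybrid idea can be pictured as scoring each database window with an HMM on the flanking regions plus a structure-aware sub-CM on one short hairpin. The split positions and the three scoring functions below are placeholders of my own, not the paper's construction.

```python
def hybrid_filter_score(window, splits, hmm_left, sub_cm, hmm_right):
    """Try each allowed placement (i, j) of the hairpin region window[i:j]
    and keep the best combined score; the sub-CM part is the only piece
    that sees base pairing."""
    best = float("-inf")
    for i, j in splits:
        score = hmm_left(window[:i]) + sub_cm(window[i:j]) + hmm_right(window[j:])
        best = max(best, score)
    return best
```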
Store-pair technique
An HMM with extra states can reflect base pairs (see the sketch below)
e.g. S1L[C] -> g S1L[C] has score negative infinity (a rule that emits g cannot lead into a state remembering C)
5 states are added per HMM state, but this number can be reduced
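A minimal sketch of the store-pair idea with made-up scores: the left HMM state is split into one copy per nucleotide it might have emitted, so the matching right state can charge a pair-dependent score instead of scoring the two halves independently. The scores and the pair-score table are hypothetical, not the paper's parameters.

```python
NEG_INF = float("-inf")

# pair_score[x][y]: hypothetical log-odds score for the base pair (x, y)
pair_score = {
    "a": {"a": -3.0, "c": -3.0, "g": -2.5, "u":  2.0},
    "c": {"a": -3.0, "c": -3.0, "g":  2.2, "u": -2.5},
    "g": {"a": -2.5, "c":  2.2, "g": -3.0, "u":  1.0},
    "u": {"a":  2.0, "c": -2.5, "g":  1.0, "u": -3.0},
}

def left_emission_score(emitted, stored):
    """Score of a left-state rule that emits `emitted` and enters the copy
    of the next state remembering `stored`. Emitting one nucleotide while
    entering the copy that remembers a different one is impossible, hence
    the negative-infinity score quoted on this slide."""
    return 0.0 if emitted == stored else NEG_INF

def right_emission_score(emitted, stored):
    """The right state sees which nucleotide the left half stored, so it can
    charge the full pair score for the two halves together."""
    return pair_score[stored][emitted]

# the pair c...g, split across the left and right HMM halves
print(left_emission_score("c", "c") + right_emission_score("g", "c"))  # 2.2
print(left_emission_score("g", "c"))  # -inf, like the rule on this slide
```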