CISC667, F05, Lec19, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) RNA secondary structure

Facts about RNAs
Mainly as “information carrier” in protein synthesis
  – mRNA
  – tRNA
Also as catalysts
  – Ribozymes
Also as regulators
  – RNAi
Single strand
  – Intra-molecular pairing
  – Secondary structure
  – Essential for sequence stability and function
For RNAs, similarity in secondary structure plays a more important role than primary sequence in determining homology.

RNA stem loop
Hybridization pairing: A-U and C-G, where “-” stands for a pairing and “x” for no pairing.

seq1: C A G G A A A C U G
seq2: G C U G C A A A G C
seq3: G C U G C A A C U G

[Figure: stem-loop drawings of seq1, seq2, and seq3; the stem positions of seq3 are marked “x” (no pairing).]

In this example, seq1 and seq2 fold into a similar structure, whereas seq3 does not. Pairwise alignments that disregard such structural restrictions may be misleading: seq2 and seq3 have 70% sequence identity and seq1 and seq3 have 60%, whereas seq1 and seq2 have only 30%.
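
The identity figures quoted for the three sequences can be checked with a short sketch. The function names `percent_identity` and `stem_pairs` are made up for this illustration, and the sketch assumes an ungapped, position-by-position comparison and a stem made of the three outermost positions on each side.

```python
# Illustrative sketch: names and the fixed 3-pair stem are assumptions,
# not part of the original slides.
SEQ1 = "CAGGAAACUG"
SEQ2 = "GCUGCAAAGC"
SEQ3 = "GCUGCAACUG"

# Watson-Crick pairs only, as on the slide (A-U and C-G).
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (no gaps handled)")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def stem_pairs(seq, n_pairs=3):
    """Count how many of the n_pairs outermost positions can base-pair."""
    return sum((seq[d], seq[-1 - d]) in PAIRS for d in range(n_pairs))


if __name__ == "__main__":
    for name_a, a, name_b, b in [("seq2", SEQ2, "seq3", SEQ3),
                                 ("seq1", SEQ1, "seq3", SEQ3),
                                 ("seq1", SEQ1, "seq2", SEQ2)]:
        print(name_a, name_b, percent_identity(a, b))
    for name, s in [("seq1", SEQ1), ("seq2", SEQ2), ("seq3", SEQ3)]:
        print(name, "stem pairs:", stem_pairs(s))
```

The pair with the highest sequence identity (seq2 and seq3) does not share a stem, while the pair with the lowest identity (seq1 and seq2) does.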

Representations for palindromic structures
1) [Figure: stem-loop drawing]
2) C A G G A A A C U G
3) ( ( ( . . . . ) ) )
There is a one-to-one correspondence between RNA secondary structures and well-balanced parenthesis expressions, where the balancing parentheses correspond to base pairings via hydrogen bonds.
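
The correspondence between balanced parentheses and base pairs can be sketched with a stack. The function name `pairs_from_brackets` is an assumption for this example; positions are 0-based.

```python
def pairs_from_brackets(structure):
    """Return the sorted list of (i, j) base pairs encoded by a
    dot-bracket string, where '(' opens a pair, ')' closes the most
    recently opened one, and '.' is an unpaired base."""
    stack = []
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    seq = "CAGGAAACUG"
    for i, j in pairs_from_brackets("(((....)))"):
        print(f"{seq[i]}{i} pairs with {seq[j]}{j}")
```

For the slide's example this recovers the three stem pairs C-G, A-U, and G-C around the GAAA loop.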

Secondary structure prediction
Base pair maximization algorithm [Nussinov]

$m_{i,j}$: maximal number of base pairs that can be formed for the sequence $s_i \ldots s_j$.

$$d_{i,j} = \begin{cases} 1, & \text{if } s_i \text{ and } s_j \text{ are paired} \\ 0, & \text{otherwise} \end{cases}$$

Initialization
$$m_{i,i-1} = 0 \quad \text{for } i \in [2, L]$$
$$m_{i,i} = 0 \quad \text{for } i \in [1, L]$$

Recursion
$$m_{i,j} = \max \begin{cases} m_{i+1,j} \\ m_{i,j-1} \\ m_{i+1,j-1} + d_{i,j} \\ \max_{i<k<j} \{ m_{i,k} + m_{k+1,j} \} & \text{(bifurcation)} \end{cases}$$

Example: GGGAAAUCC
[Figure: dynamic programming table for GGGAAAUCC, with the sequence along both axes, and the resulting folded structure.]
Time complexity: $O(L^3)$, where $L$ is the sequence length
Space complexity: $O(L^2)$
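
A minimal sketch of the base pair maximization recursion, run on the example sequence GGGAAAUCC. Some assumptions: only A-U and C-G pairs are allowed, no minimum loop length is enforced (so adjacent bases may pair, as the recursion permits), and the function names are made up for this illustration. The traceback returns one optimal structure; others with the same number of pairs may exist.

```python
# Illustrative sketch, not a reference implementation.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def nussinov_table(seq):
    """Fill m[i][j] = max number of base pairs in seq[i..j] (0-based).
    The lower triangle stays 0, which covers the m[i][i-1] = 0 case."""
    n = len(seq)
    m = [[0] * n for _ in range(n)]
    for span in range(1, n):              # span = j - i
        for i in range(n - span):
            j = i + span
            best = max(m[i + 1][j], m[i][j - 1])
            d = 1 if (seq[i], seq[j]) in PAIRS else 0
            best = max(best, m[i + 1][j - 1] + d)
            for k in range(i + 1, j):     # bifurcation
                best = max(best, m[i][k] + m[k + 1][j])
            m[i][j] = best
    return m


def traceback(seq, m):
    """Recover one optimal structure in dot-bracket notation."""
    structure = ["."] * len(seq)
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if m[i][j] == m[i + 1][j]:
            stack.append((i + 1, j))
        elif m[i][j] == m[i][j - 1]:
            stack.append((i, j - 1))
        elif (seq[i], seq[j]) in PAIRS and m[i][j] == m[i + 1][j - 1] + 1:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if m[i][j] == m[i][k] + m[k + 1][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return "".join(structure)


def max_pairs(seq):
    """Maximum number of base pairs for the whole sequence."""
    return nussinov_table(seq)[0][len(seq) - 1] if seq else 0


if __name__ == "__main__":
    s = "GGGAAAUCC"
    table = nussinov_table(s)
    print(max_pairs(s), traceback(s, table))
```

The three nested loops over `span`, `i`, and `k` give the cubic running time, and the table `m` gives the quadratic space.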

Secondary structure prediction
Minimum energy algorithm [Zuker]

$E_{i,j}$: minimum energy for an optimal fold formed by the sequence $s_i \ldots s_j$.

$$e_{i,j} = \begin{cases} -5, & \text{if } s_i, s_j \text{ is CG or GC} \\ -4, & \text{if } s_i, s_j \text{ is AU or UA} \\ -1, & \text{if } s_i, s_j \text{ is GU or UG} \end{cases}$$

Initialization
$$E_{i,i-1} = 0 \quad \text{for } i \in [2, L]$$
$$E_{i,i} = 0 \quad \text{for } i \in [1, L]$$

Recursion
$$E_{i,j} = \min \begin{cases} E_{i+1,j} \\ E_{i,j-1} \\ E_{i+1,j-1} + e_{i,j} \\ \min_{i<k<j} \{ E_{i,k} + E_{k+1,j} \} & \text{(bifurcation)} \end{cases}$$
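
The energy recursion has the same shape as base pair maximization, with `min` in place of `max` and pair energies in place of pair counts. The sketch below uses the three energy values from the slide. It assumes that the pairing case is considered only for the listed pairs (the slide does not define $e_{i,j}$ for other combinations) and that no minimum loop length applies. The name `min_energy` is made up for this example.

```python
# Pair energies as given on the slide; any other combination is treated
# as "cannot pair" (an assumption of this sketch).
PAIR_ENERGY = {
    ("C", "G"): -5, ("G", "C"): -5,
    ("A", "U"): -4, ("U", "A"): -4,
    ("G", "U"): -1, ("U", "G"): -1,
}


def min_energy(seq):
    """Minimum total pairing energy for seq under the simplified model."""
    n = len(seq)
    if n == 0:
        return 0
    # Lower triangle stays 0, covering the E[i][i-1] = 0 initialization.
    E = [[0] * n for _ in range(n)]
    for span in range(1, n):              # span = j - i
        for i in range(n - span):
            j = i + span
            best = min(E[i + 1][j], E[i][j - 1])
            e = PAIR_ENERGY.get((seq[i], seq[j]))
            if e is not None:
                best = min(best, E[i + 1][j - 1] + e)
            for k in range(i + 1, j):     # bifurcation
                best = min(best, E[i][k] + E[k + 1][j])
            E[i][j] = best
    return E[0][n - 1]


if __name__ == "__main__":
    print(min_energy("GGGAAAUCC"))
```

This is a toy energy model. The full Zuker algorithm scores stacking and loop contributions rather than isolated pairs.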

Motivation:
Both DNA and protein sequences can be considered as sentences in a “language”.
We know the “language” is not random.
What’s the grammar of the language?
“The linguistics of DNA” – David Searls, American Scientist, 80(

Chomsky hierarchy of transformational grammars (1956)
Regular grammars (no nested structure, cannot handle palindromes)
  – W → aW, or W → a
Context-free grammars
  – W → β
Context-sensitive grammars
  – α1 W α2 → α1 β α2
Unrestricted grammars
  – α1 W α2 → γ
Notations:
  – a: any terminal
  – α, γ: any string of non-terminals and/or terminals, including the null string
  – β: any string of non-terminals and/or terminals, not including the null string
Note: Chomsky normal form: any CFG can be written such that all productions have one of the following forms:
  – W_v → W_y W_z
  – W_v → a

To capture the palindromic structure, a context-free grammar can be given as follows.
S → a W1 u | c W1 g | g W1 c | u W1 a
W1 → a W2 u | c W2 g | g W2 c | u W2 a
W2 → a W3 u | c W3 g | g W3 c | u W3 a
W3 → gaaa | gcaa

seq1: C A G G A A A C U G
seq2: G C U G C A A A G C

[Figure: parse tree for seq1 (c a g g a a a c u g), with S at the root and W1, W2, W3 nested beneath it.]
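
Because every production in this grammar either wraps a complementary pair around the next non-terminal or emits one of the two loops, membership can be checked by peeling pairs off both ends. The recognizer below is a sketch for this specific grammar only, not a general CFG parser; the names `derives` and `in_language` are made up for the example.

```python
# Terminal pairs allowed by the S, W1 and W2 productions.
PAIRS = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")}
# Strings produced by W3.
LOOPS = {"gaaa", "gcaa"}


def derives(seq, level=0):
    """True if the non-terminal at `level` (0 = S, 1 = W1, 2 = W2,
    3 = W3) can derive `seq`."""
    if level == 3:
        return seq in LOOPS
    if len(seq) < 2 or (seq[0], seq[-1]) not in PAIRS:
        return False
    return derives(seq[1:-1], level + 1)


def in_language(seq):
    """True if S derives the (case-insensitive) sequence."""
    return derives(seq.lower())


if __name__ == "__main__":
    for name, s in [("seq1", "CAGGAAACUG"),
                    ("seq2", "GCUGCAAAGC"),
                    ("seq3", "GCUGCAACUG")]:
        print(name, in_language(s))
```

The grammar accepts seq1 and seq2 and rejects seq3, whose outermost bases (G and G) cannot be produced by any S production.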

Stochastic grammars
Motivations:
  – Languages contain irregularities (exceptions to the grammar).
  – Such irregularities can be taken care of by “growing” the grammar: introducing new non-terminals and new production rules.
  – It is useful to differentiate productions that account for a large portion of the language from those that are rare exceptions.
  – This can be achieved by assigning probabilities to the productions. For each non-terminal, the probabilities of all its possible productions must add to 1.
Given a sentence x, a stochastic grammar with parameters θ assigns a probability P(x|θ) that the sentence belongs to the language specified by the grammar, whereas a conventional grammar just gives a yes-or-no answer.
  – $\sum_{x \in \text{language}} P(x \mid \theta) = 1$
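
As an illustration, probabilities can be attached to a small stem-loop grammar with three pair-emitting non-terminals (S, W1, W2) and one loop-emitting non-terminal (W3 → gaaa | gcaa). All probability values below are invented for the example, and the same pair probabilities are reused for S, W1, and W2. Since each sentence of this grammar has exactly one derivation, P(x|θ) is just the product of the probabilities of the productions used.

```python
from itertools import product

# Hypothetical production probabilities (not from the lecture).
# Each non-terminal's productions sum to 1.
PAIR_PROBS = {("c", "g"): 0.3, ("g", "c"): 0.3,
              ("a", "u"): 0.2, ("u", "a"): 0.2}   # used for S, W1, W2
LOOP_PROBS = {"gaaa": 0.6, "gcaa": 0.4}            # used for W3


def stem_loop_probability(seq, n_pairs=3):
    """P(x | theta) for the 3-pair stem-loop grammar; 0.0 if x is not
    in the language."""
    seq = seq.lower()
    if len(seq) != 2 * n_pairs + 4:
        return 0.0
    prob = 1.0
    for d in range(n_pairs):
        prob *= PAIR_PROBS.get((seq[d], seq[-1 - d]), 0.0)
    prob *= LOOP_PROBS.get(seq[n_pairs:-n_pairs], 0.0)
    return prob


def total_probability():
    """Sum P(x | theta) over every sentence the grammar can generate."""
    total = 0.0
    for p1, p2, p3 in product(PAIR_PROBS, repeat=3):
        for loop in LOOP_PROBS:
            left = p1[0] + p2[0] + p3[0]
            right = p3[1] + p2[1] + p1[1]
            total += stem_loop_probability(left + loop + right)
    return total


if __name__ == "__main__":
    print(stem_loop_probability("CAGGAAACUG"))
    print(stem_loop_probability("GCUGCAACUG"))
    print(total_probability())
```

A sequence outside the language gets probability 0, and summing over all 128 sentences of the language gives 1 up to floating-point rounding.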

Stochastic CFG for sequence modeling
Calculate the optimal alignment of a sequence to a parameterized SCFG. [Cocke-Younger-Kasami algorithm]
Calculate the probability that a given sequence belongs to the language specified by an SCFG. [inside algorithm, and inside-outside algorithm]
Given a set of example sequences/structures, estimate optimal probability parameters for an SCFG.
Note:
  – The issues addressed above by SCFGs are quite similar to those for HMMs; in fact, it can be proved that HMMs are equivalent to stochastic regular grammars.
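
A minimal sketch of the inside algorithm for an SCFG in Chomsky normal form, which sums the probabilities of all parses of a sequence. The rule representation (dictionaries keyed by tuples), the function name `inside_probability`, and the toy grammar S → S S (0.4) | a (0.6) are all assumptions made for this illustration. Replacing the sum over split points and rules with a maximum gives the CYK score of the single best parse.

```python
from collections import defaultdict


def inside_probability(seq, start, binary_rules, terminal_rules):
    """Probability that `start` derives `seq` under a CNF SCFG.

    binary_rules:   {(V, Y, Z): prob} for productions V -> Y Z
    terminal_rules: {(V, a): prob}    for productions V -> a
    """
    n = len(seq)
    if n == 0:
        return 0.0
    # alpha[(i, j, V)] = probability that V derives seq[i..j] (0-based).
    alpha = defaultdict(float)
    for i, symbol in enumerate(seq):
        for (v, a), p in terminal_rules.items():
            if a == symbol:
                alpha[(i, i, v)] += p
    for span in range(2, n + 1):          # length of the subsequence
        for i in range(n - span + 1):
            j = i + span - 1
            for (v, y, z), p in binary_rules.items():
                total = 0.0
                for k in range(i, j):     # split into [i..k] and [k+1..j]
                    total += alpha[(i, k, y)] * alpha[(k + 1, j, z)]
                alpha[(i, j, v)] += p * total
    return alpha[(0, n - 1, start)]


# Toy grammar (hypothetical): S -> S S with prob 0.4, S -> 'a' with prob 0.6.
TOY_BINARY = {("S", "S", "S"): 0.4}
TOY_TERMINAL = {("S", "a"): 0.6}

if __name__ == "__main__":
    for x in ["a", "aa", "aaa"]:
        print(x, inside_probability(x, "S", TOY_BINARY, TOY_TERMINAL))
```

For "aaa" the toy grammar has two parse trees, and the inside value is the sum of their probabilities. Like base pair maximization, the table is filled from short subsequences to long ones.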

Resources
Lecture notes:
RNA informatics:
Software:
  – Vienna RNA fold