Presentation is loading. Please wait.

Presentation is loading. Please wait.

CISC 467/667 Intro to Bioinformatics (Spring 2007) RNA secondary structure CISC667, S07, Lec19, Liao.

Similar presentations


Presentation on theme: "CISC 467/667 Intro to Bioinformatics (Spring 2007) RNA secondary structure CISC667, S07, Lec19, Liao."— Presentation transcript:

1 CISC 467/667 Intro to Bioinformatics (Spring 2007) RNA secondary structure
CISC667, S07, Lec19, Liao

2 Mainly as “information carrier” in protein synthesis
Facts about RNAs Mainly as “information carrier” in protein synthesis mRNA tRNA Also as catalysts Ribozymes Also as regulator RNAi Single strand Intra-molecular pairing Secondary structure Essential for sequence stability and function Similarity in secondary structure plays a more important role than primary sequence in determining homology for RNAs CISC667, S07, Lec19, Liao

3 CISC667, S07, Lec19, Liao

4 CISC667, S07, Lec19, Liao

5 CISC667, S07, Lec19, Liao

6 hybridization pairing: A-U, and C-G
RNA stem loop hybridization pairing: A-U, and C-G where “-” stands for a pairing, and “x” for no pairing. In the above example, seq1 and seq2 fold into a similar structure, whereas seq3 does not. Pairwise alignments disregarding such structural restrictions may be misleading; seq2 and seq3 have 70% sequence identity, seq1 and seq3 have 60%, whereas seq1 and seq2 have only 30%. G C A U U A C G C A G A U x C C x U G x G seq1 seq2 seq3 seq1 C A G G A A A C U G seq2 G C U G C A A A G C seq3 G C U G C A A C U G CISC667, S07, Lec19, Liao

7 Representations for palindromic structures 1)
2) 3) G C A U C A G G A A A C U G C A G G A A A C U G ( ( ( ) ) ) There is a 1-1 correspondence between RNA secondary structures and well-balanced parenthesis expressions, where the balancing parentheses correspond to base pairings via hydrogen bonds. CISC667, S07, Lec19, Liao

8 Secondary structure prediction
Base pair maximization algorithm [Nussinov] mi,j: maximal # of base pairs that can be formed for sequence si…sj. di,j = 1, if si and sj are paired = 0, otherwise Initialization mi,i-1 = for i= [2, L] mi,i = for i = [1,L] Recursion mi,j= max {mi+1,j, mi,j-1, mi+1,j-1+ di,j, maxi<k<j {mi,k+ mk+1,j} // bifurcation } CISC667, S07, Lec19, Liao

9 Dynamic programming table
1 2 3 4 5 6 7 8 9 G G G A A A U C C 1 G 1 2 3 2 G 1 2 3 3 G 1 2 2 4 A 1 1 1 A A 5 A 1 1 1 6 A 1 1 1 A U 7 U G C 8 C G C 9 C G Time Complexity: O(L3), where L is sequence length Space complexity: O(L2) CISC667, S07, Lec19, Liao

10 Secondary structure prediction
Minimum energry algorithm [Zuker] Ei,j: minimum energy for an optimal fold formed by sequence si…sj. ei,j = -5, if si , sj is CG or GC = -4, if si , sj is AU or UA = -1, if si , sj is GU or UG Initialization ei,i-1 = for i= [2, L] ei,i = for i = [1,L] Recursion Ei,j= min {Ei+1,j, Ei,j-1, Ei+1,j-1+ ei,j, mini<k<j {Ei,k+ Ek+1,j} // bifurcation } CISC667, S07, Lec19, Liao

11 We know the “language” is not random
Motivation: Both DNA and protein sequences can be considered as sentences in a “language”. We know the “language” is not random What’s the grammar of the language? “The linguistics of DNA” – David Searls, American Scientist, 80( CISC667, S07, Lec19, Liao

12 Chomsky hierarchy of transformational grammars (1956)
Regular expression (no recursive structure, cannot handle palindromes) W  aW, or W  a Context-free W   Context-sensitive 1W 2  1  2 Unrestricted grammars 1W 2   Notations: a: any terminals; , : any string of non-terminals and/or terminals, including the null string; : any string of non-terminals and/or terminals, not including the null string. Note: Chomsky normal form: any CFG can be written such that all productions have the following forms: Wv  Wy Wz Wv a CISC667, S07, Lec19, Liao

13 To capture the palindromic structure, a context-free grammar can be given as follows.
S  aW1u | cW1g | gW1c | uW1a W1 aW2u| cW2g| gW2c| uW2a W2 aW3u| cW3g| gW3c| uW3a W3  gaaa | gcaa seq1 C A G G A A A C U G seq2 G C U G C A A A G C c a g g a a a c u g w3 w2 w1 S CISC667, S07, Lec19, Liao

14 Stochastic grammars Motivations:
Irregularities (or exceptions to the grammar) in languages Such irregularities can be taken care of by “growing” your grammar; introducing new non-terminals and new production rules. It is useful to differentiate productions that account for a large portion of the language from those that are rare exceptions. This can be achieved by assigning probabilities to various productions. For any non-terminals, the probabilities of all its possible productions must add to 1. Given a sentence x, a stochastic grammar  parsing the sentence will assign a probability P(x|) that the sentence belongs to the language specified by the grammar, whereas, a conventional grammar will just give a yes-or-no answer. x language P(x| ) =1 CISC667, S07, Lec19, Liao

15 Stochastic CFG for sequence modeling
Calculate optimal alignment of a sequence to a parameterized SCFG. [Cocke-Younger-Kasami algorithm] Calculate probability for a given sequence to belong to the language specified by SCFG. [inside algorithm, and inside-outside algorithm] Given a set of example sequences/structures, estimate optimal probability parameters for a SCFG. Note: The issues addressed above by SCFGs are quite similar to those for HMMs; actually it can be proved that HMMs are equivalent to stochastic regular grammars. CISC667, S07, Lec19, Liao

16 Resources Lecture notes: RNA informatics: Software: -Vienna RNA fold CISC667, S07, Lec19, Liao


Download ppt "CISC 467/667 Intro to Bioinformatics (Spring 2007) RNA secondary structure CISC667, S07, Lec19, Liao."

Similar presentations


Ads by Google