CISC 467/667 Intro to Bioinformatics (Spring 2007) RNA secondary structure CISC667, S07, Lec19, Liao.


Facts about RNAs
- Mainly as “information carrier” in protein synthesis
  - mRNA
  - tRNA
- Also as catalysts
  - Ribozymes
- Also as regulators
  - RNAi
- Single strand
  - Intra-molecular pairing
  - Secondary structure
  - Essential for sequence stability and function
- Similarity in secondary structure plays a more important role than primary sequence in determining homology for RNAs


RNA stem loop
Hybridization pairing: A-U and C-G

            seq1     seq2     seq3
    stem    C - G    G - C    G x G
            A - U    C - G    C x U
            G - C    U - A    U x C
    loop    GAAA     GCAA     GCAA

where “-” stands for a pairing, and “x” for no pairing.

    seq1  C A G G A A A C U G
    seq2  G C U G C A A A G C
    seq3  G C U G C A A C U G

In the above example, seq1 and seq2 fold into a similar structure, whereas seq3 does not. Pairwise alignments that disregard such structural restrictions may be misleading: seq2 and seq3 have 70% sequence identity, seq1 and seq3 have 60%, whereas seq1 and seq2 have only 30%.
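The identity figures and the folding claim above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `identity` and `stem_pairs` are made up here, and the stem length of 3 is an assumption read off the example sequences.

```python
# Sequences from the stem-loop example.
SEQS = {
    "seq1": "CAGGAAACUG",
    "seq2": "GCUGCAAAGC",
    "seq3": "GCUGCAACUG",
}

# Hybridization pairing as stated on the slide: A-U and C-G.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def identity(a, b):
    """Fraction of identical positions in an ungapped, equal-length comparison."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def stem_pairs(seq, stem_len=3):
    """Count complementary pairs when the first stem_len bases face the last stem_len bases."""
    return sum((seq[k], seq[-1 - k]) in PAIRS for k in range(stem_len))


if __name__ == "__main__":
    names = list(SEQS)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print(a, b, f"{identity(SEQS[a], SEQS[b]):.0%}")
    for name, seq in SEQS.items():
        print(name, "stem pairs:", stem_pairs(seq))
```

Sequence identity ranks seq2 and seq3 as the closest pair, while the stem check groups seq1 with seq2, which is the point of the slide.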

Representations for palindromic structures
1) [Figure: two-dimensional stem-loop drawing showing the paired bases]
2) [Figure: pairings marked on the linear sequence]
       C A G G A A A C U G
3) Parenthesis expression:
       C A G G A A A C U G
       ( ( (         ) ) )

There is a 1-1 correspondence between RNA secondary structures and well-balanced parenthesis expressions, where the balancing parentheses correspond to base pairings via hydrogen bonds.
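The correspondence between balanced parentheses and base pairings can be sketched with a stack. The function name `pairs_from_brackets` is hypothetical, and the use of "." for unpaired positions follows the common dot-bracket convention rather than anything shown on the slide.

```python
def pairs_from_brackets(structure):
    """Return the sorted list of 1-based (i, j) base pairs encoded by a parenthesis string.

    "(" opens a pair, ")" closes the most recently opened one,
    and any other character (e.g. ".") marks an unpaired position.
    """
    stack = []
    pairs = []
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {pos}")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError(f"unbalanced '(' at position {stack[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    seq = "CAGGAAACUG"
    for i, j in pairs_from_brackets("(((....)))"):
        print(f"{seq[i - 1]}{i} pairs with {seq[j - 1]}{j}")
```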

Secondary structure prediction
Base pair maximization algorithm [Nussinov]

$m_{i,j}$: maximal number of base pairs that can be formed for the subsequence $s_i \ldots s_j$.

$d_{i,j} = 1$ if $s_i$ and $s_j$ can pair, and $d_{i,j} = 0$ otherwise.

Initialization:
$m_{i,i-1} = 0$ for $i = 2, \ldots, L$
$m_{i,i} = 0$ for $i = 1, \ldots, L$

Recursion:
$$
m_{i,j} = \max
\begin{cases}
m_{i+1,j} \\
m_{i,j-1} \\
m_{i+1,j-1} + d_{i,j} \\
\max_{i<k<j} \{ m_{i,k} + m_{k+1,j} \} & \text{(bifurcation)}
\end{cases}
$$
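A minimal Python sketch of this recursion, assuming only A-U and C-G pairs (as on the stem-loop slide) and no minimum loop length. The names `nussinov` and `max_pairs` are illustrative, and indices are 0-based here rather than 1-based.

```python
def nussinov(seq, pairs=frozenset({"AU", "UA", "CG", "GC"})):
    """Fill the base-pair maximization table m for seq and return it.

    m[i][j] is the maximal number of pairs for seq[i..j] (0-based, inclusive).
    """
    L = len(seq)
    m = [[0] * L for _ in range(L)]  # the diagonal m[i][i] starts at 0

    def cell(i, j):
        # Covers the initialization m[i][i-1] = 0 for empty subsequences.
        return m[i][j] if i <= j else 0

    for span in range(1, L):            # fill diagonals of increasing length
        for i in range(L - span):
            j = i + span
            d = 1 if seq[i] + seq[j] in pairs else 0
            best = max(cell(i + 1, j), cell(i, j - 1), cell(i + 1, j - 1) + d)
            for k in range(i + 1, j):   # bifurcation
                best = max(best, m[i][k] + m[k + 1][j])
            m[i][j] = best
    return m


def max_pairs(seq):
    """Maximal number of base pairs for the whole sequence."""
    return nussinov(seq)[0][-1] if seq else 0


if __name__ == "__main__":
    print(max_pairs("GGGAAAUCC"))
```

The three nested loops over `span`, `i`, and `k` are where the cubic running time comes from.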

Dynamic programming table

           1  2  3  4  5  6  7  8  9
           G  G  G  A  A  A  U  C  C
    1  G                    1  2  3
    2  G                    1  2  3
    3  G                    1  2  2
    4  A                    1  1  1
    5  A                    1  1  1
    6  A                    1  1  1
    7  U
    8  C
    9  C

(Zero entries are left blank.)

[Figure: folded structure for GGGAAAUCC, with base pairs A-U, G-C, and G-C, an A A loop, and one unpaired G]

Time complexity: $O(L^3)$, where $L$ is the sequence length
Space complexity: $O(L^2)$
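The table only gives the score; a structure is recovered by tracing back through it. The sketch below is one possible traceback, assuming A-U and C-G pairs only and preferring the pairing case when several choices are optimal. The name `fold` and the dot-bracket output are illustrative choices, and other equally optimal structures may exist.

```python
PAIRS = {"AU", "UA", "CG", "GC"}


def fold(seq):
    """Return (structure, score): one optimal structure in dot-bracket form and its pair count."""
    L = len(seq)
    m = [[0] * L for _ in range(L)]

    def cell(i, j):
        return m[i][j] if i <= j else 0   # empty subsequence scores 0

    def d(i, j):
        return 1 if seq[i] + seq[j] in PAIRS else 0

    # Fill step: same recursion as the base pair maximization algorithm.
    for span in range(1, L):
        for i in range(L - span):
            j = i + span
            best = max(cell(i + 1, j), cell(i, j - 1), cell(i + 1, j - 1) + d(i, j))
            for k in range(i + 1, j):
                best = max(best, m[i][k] + m[k + 1][j])
            m[i][j] = best

    # Traceback step: revisit the choice that produced each cell.
    structure = ["."] * L
    stack = [(0, L - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if d(i, j) and cell(i + 1, j - 1) + 1 == m[i][j]:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        elif cell(i + 1, j) == m[i][j]:
            stack.append((i + 1, j))
        elif cell(i, j - 1) == m[i][j]:
            stack.append((i, j - 1))
        else:
            for k in range(i + 1, j):     # bifurcation
                if m[i][k] + m[k + 1][j] == m[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return "".join(structure), (m[0][L - 1] if L else 0)


if __name__ == "__main__":
    print(fold("GGGAAAUCC"))
```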

Secondary structure prediction
Minimum energy algorithm [Zuker]

$E_{i,j}$: minimum energy of an optimal fold formed by the subsequence $s_i \ldots s_j$.

$e_{i,j} = -5$ if $s_i, s_j$ is CG or GC
$e_{i,j} = -4$ if $s_i, s_j$ is AU or UA
$e_{i,j} = -1$ if $s_i, s_j$ is GU or UG

Initialization:
$E_{i,i-1} = 0$ for $i = 2, \ldots, L$
$E_{i,i} = 0$ for $i = 1, \ldots, L$

Recursion:
$$
E_{i,j} = \min
\begin{cases}
E_{i+1,j} \\
E_{i,j-1} \\
E_{i+1,j-1} + e_{i,j} \\
\min_{i<k<j} \{ E_{i,k} + E_{k+1,j} \} & \text{(bifurcation)}
\end{cases}
$$
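The pair-energy recursion on this slide can be sketched the same way as base pair maximization, with min in place of max. This is a sketch of the simplified recursion as stated here, not of a full nearest-neighbor energy model; treating every other base combination as contributing 0 is an assumption, and the name `min_energy` is made up.

```python
# Pair energies from the slide; any other combination is assumed to contribute 0.
ENERGY = {"CG": -5, "GC": -5, "AU": -4, "UA": -4, "GU": -1, "UG": -1}


def min_energy(seq):
    """Minimum total pair energy for seq under the simplified recursion."""
    L = len(seq)
    if L == 0:
        return 0
    E = [[0] * L for _ in range(L)]

    def cell(i, j):
        return E[i][j] if i <= j else 0   # E[i][i-1] = 0 for empty subsequences

    for span in range(1, L):
        for i in range(L - span):
            j = i + span
            e = ENERGY.get(seq[i] + seq[j], 0)
            best = min(cell(i + 1, j), cell(i, j - 1), cell(i + 1, j - 1) + e)
            for k in range(i + 1, j):     # bifurcation
                best = min(best, E[i][k] + E[k + 1][j])
            E[i][j] = best
    return E[0][L - 1]


if __name__ == "__main__":
    print(min_energy("GGGAAAUCC"))
```

For GGGAAAUCC the best fold uses two G-C pairs and one A-U pair, so the score is the sum of those three pair energies.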

Motivation
- Both DNA and protein sequences can be considered as sentences in a “language”.
- We know the “language” is not random.
- What’s the grammar of the language?

“The linguistics of DNA” – David Searls, American Scientist, 80(

Chomsky hierarchy of transformational grammars (1956)
- Regular grammars (no recursive structure, cannot handle palindromes): $W \to aW$, or $W \to a$
- Context-free: $W \to \beta$
- Context-sensitive: $\alpha_1 W \alpha_2 \to \alpha_1 \beta \alpha_2$
- Unrestricted grammars: $\alpha_1 W \alpha_2 \to \gamma$

Notations:
- $a$: any terminal;
- $\alpha$, $\gamma$: any string of non-terminals and/or terminals, including the null string;
- $\beta$: any string of non-terminals and/or terminals, not including the null string.

Note: Chomsky normal form: any CFG can be written such that all productions have one of the following forms:
$W_v \to W_y W_z$
$W_v \to a$

To capture the palindromic structure, a context-free grammar can be given as follows.

$S \to a W_1 u \mid c W_1 g \mid g W_1 c \mid u W_1 a$
$W_1 \to a W_2 u \mid c W_2 g \mid g W_2 c \mid u W_2 a$
$W_2 \to a W_3 u \mid c W_3 g \mid g W_3 c \mid u W_3 a$
$W_3 \to gaaa \mid gcaa$

    seq1  C A G G A A A C U G
    seq2  G C U G C A A A G C

[Figure: parse tree for c a g g a a a c u g, with S at the root and W1, W2, W3 nested beneath it]
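A small recursive recognizer makes the derivation concrete. The grammar below is transcribed from the slide; the dictionary layout and the function name `derive` are illustrative assumptions, not part of the lecture.

```python
PAIRS = [("a", "u"), ("c", "g"), ("g", "c"), ("u", "a")]

# Each pair rule is (left terminal, inner non-terminal, right terminal);
# W3 has two terminal-only rules for the loop.
GRAMMAR = {
    "S":  [(x, "W1", y) for x, y in PAIRS],
    "W1": [(x, "W2", y) for x, y in PAIRS],
    "W2": [(x, "W3", y) for x, y in PAIRS],
    "W3": [("gaaa",), ("gcaa",)],
}


def derive(symbol, s):
    """Return the list of productions that derive s from symbol, or None if there is none."""
    for rhs in GRAMMAR[symbol]:
        if len(rhs) == 1:                      # terminal-only rule (the loop)
            if s == rhs[0]:
                return [f"{symbol} -> {rhs[0]}"]
        else:                                  # pair rule wrapping an inner non-terminal
            left, inner, right = rhs
            if len(s) >= 2 and s[0] == left and s[-1] == right:
                rest = derive(inner, s[1:-1])
                if rest is not None:
                    return [f"{symbol} -> {left}{inner}{right}"] + rest
    return None


if __name__ == "__main__":
    for name, seq in [("seq1", "caggaaacug"), ("seq2", "gcugcaaagc"), ("seq3", "gcugcaacug")]:
        print(name, derive("S", seq))
```

seq1 and seq2 are derivable, while seq3 is rejected at the first production because its outermost bases (g and g) match none of the pair rules.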

Stochastic grammars
Motivations:
- Irregularities (or exceptions to the grammar) in languages.
- Such irregularities can be taken care of by “growing” the grammar: introducing new non-terminals and new production rules.
- It is useful to differentiate productions that account for a large portion of the language from those that are rare exceptions. This can be achieved by assigning probabilities to the various productions. For any non-terminal, the probabilities of all its possible productions must add up to 1.
- Given a sentence $x$, a stochastic grammar $\theta$ parsing the sentence assigns a probability $P(x \mid \theta)$ that the sentence belongs to the language specified by the grammar, whereas a conventional grammar just gives a yes-or-no answer.

$$\sum_{x \in \text{language}} P(x \mid \theta) = 1$$
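To illustrate, the stem-loop context-free grammar from the earlier slide can be turned into a stochastic grammar by attaching a probability to each production. All probability values below are invented for illustration, and using the same pair probabilities for S, W1, and W2 is a simplifying assumption. Because that grammar gives each sentence a single derivation, the sentence probability is just the product of the probabilities of the productions used.

```python
from itertools import product

# Hypothetical production probabilities; each set sums to 1 for its non-terminal.
PAIR_PROB = {("c", "g"): 0.4, ("g", "c"): 0.4, ("a", "u"): 0.1, ("u", "a"): 0.1}
LOOP_PROB = {"gaaa": 0.7, "gcaa": 0.3}


def sentence_probability(seq):
    """P(x | grammar): product of production probabilities along the unique derivation."""
    s = seq.lower()
    if len(s) != 10:                  # three pairs plus a four-base loop
        return 0.0
    p = 1.0
    for depth in range(3):            # productions from S, W1, W2
        p *= PAIR_PROB.get((s[depth], s[-1 - depth]), 0.0)
    return p * LOOP_PROB.get(s[3:7], 0.0)   # production from W3


def language():
    """Enumerate every sentence the grammar can generate."""
    for p1, p2, p3 in product(PAIR_PROB, repeat=3):
        for loop in LOOP_PROB:
            yield p1[0] + p2[0] + p3[0] + loop + p3[1] + p2[1] + p1[1]


if __name__ == "__main__":
    print(sentence_probability("CAGGAAACUG"))
    print(sentence_probability("GCUGCAACUG"))
    print(sum(sentence_probability(x) for x in language()))
```

A sentence outside the language gets probability 0, and the probabilities over the whole language sum to 1 up to floating-point rounding.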

Stochastic CFG for sequence modeling
- Calculate the optimal alignment of a sequence to a parameterized SCFG. [Cocke-Younger-Kasami algorithm]
- Calculate the probability that a given sequence belongs to the language specified by the SCFG. [inside algorithm, and inside-outside algorithm]
- Given a set of example sequences/structures, estimate optimal probability parameters for an SCFG.

Note: The issues addressed above by SCFGs are quite similar to those for HMMs; in fact, it can be proved that HMMs are equivalent to stochastic regular grammars.
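As a rough sketch of the second task, the inside algorithm sums the probabilities of all parse trees for a sequence under an SCFG in Chomsky normal form. The toy grammar below (S → S S with probability 0.3, S → a with 0.4, S → u with 0.3) and the names `BINARY`, `EMIT`, and `inside_probability` are hypothetical and chosen only to keep the example small.

```python
from collections import defaultdict

# Hypothetical toy SCFG in Chomsky normal form.
BINARY = {"S": {("S", "S"): 0.3}}       # W_v -> W_y W_z with probability t_v(y, z)
EMIT = {"S": {"a": 0.4, "u": 0.3}}      # W_v -> a with probability e_v(a)


def inside_probability(x, start="S", binary=BINARY, emit=EMIT):
    """P(x | grammar), summed over all parse trees, via the inside algorithm."""
    L = len(x)
    if L == 0:
        return 0.0
    nonterminals = set(binary) | set(emit)
    # alpha[i][j][v]: probability that non-terminal v generates x[i..j] (inclusive)
    alpha = [[defaultdict(float) for _ in range(L)] for _ in range(L)]

    for i, symbol in enumerate(x):                 # subsequences of length 1
        for v in nonterminals:
            alpha[i][i][v] = emit.get(v, {}).get(symbol, 0.0)

    for span in range(1, L):                       # longer subsequences
        for i in range(L - span):
            j = i + span
            for v in nonterminals:
                total = 0.0
                for (y, z), t in binary.get(v, {}).items():
                    for k in range(i, j):          # split into x[i..k] and x[k+1..j]
                        total += t * alpha[i][k][y] * alpha[k + 1][j][z]
                alpha[i][j][v] = total
    return alpha[0][L - 1][start]


if __name__ == "__main__":
    for x in ["a", "au", "aua"]:
        print(x, inside_probability(x))
```

Replacing the sum over rules and split points with a max (and keeping back-pointers) would give a CYK-style search for the single most probable parse.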

Resources
- Lecture notes:
- RNA informatics:
- Software: Vienna RNA fold
