Some new sequencing technologies

Molecular Inversion Probes

The availability of large collections of single nucleotide polymorphisms (SNPs), along with the recent large-scale linkage disequilibrium mapping efforts, has brought the promise of whole-genome association studies to the forefront of current thinking in human genetics. ParAllele (now part of Affymetrix) has developed a novel technology based on the concept of Molecular Inversion Probes that enables up to 20,000 SNPs to be scored in a single assay. This unprecedented level of multiplexing is made possible through exquisite enzymological specificity, using a unimolecular interaction that is insensitive to cross-reactivity among multiple probe molecules. The technology has been demonstrated to exhibit high accuracy while enabling a high rate of conversion of individual SNPs into working multiplexed assays.

Molecular Inversion Probes are so named because the oligonucleotide probe central to the process undergoes a unimolecular rearrangement from a molecule that cannot be amplified (step 1) into a molecule that can be amplified (step 6). This rearrangement is mediated by hybridization to genomic DNA (step 2) and an enzymatic "gap fill" process that occurs in an allele-specific manner (step 3). The resulting circularized probe can be separated from cross-reacted or unreacted probes by a simple exonuclease reaction (step 4). Figure 1 shows these steps.

Applications of Molecular Inversion Probes: the technology is invaluable as a high-throughput SNP genotyping method for both targeted and whole-genome SNP analysis projects, as well as for allele quantitation.

Single Molecule Array for Genotyping (Solexa)

Genomic DNA is extracted from sample cells taken from an individual. In a single-tube reaction, the genomic DNA is processed into single-stranded oligonucleotide fragments, which are prepared for attachment to Solexa's Single Molecule Arrays using proprietary primer and anchor molecules. Hundreds of millions of molecules, representing the entire genome of the individual, are deposited and attached at discrete sites on a Single Molecule Array.

Fluorescently labelled nucleotides and a polymerase enzyme are added to the array. Complementary nucleotides base-pair to the first base of each oligonucleotide fragment and are added to the primer by the enzyme; remaining free nucleotides are removed. Laser light of a specific wavelength for each base excites the label on the incorporated nucleotides, which fluoresce. This fluorescence is detected by a CCD camera that rapidly scans the entire array to identify the incorporated nucleotide on each fragment; the fluorescence is then removed. The identity of the incorporated nucleotide reveals the identity of the base in the sample sequence to which it is paired; in this example, the first base is C (cytosine).

This cycle of incorporation, detection and identification is repeated approximately 25 times to determine the first 25 bases of each oligonucleotide fragment. By sequencing all molecules on the array simultaneously, the first 25 bases are determined for hundreds of millions of oligonucleotide fragments. These sequences are aligned and compared to the reference sequence using Solexa's proprietary bioinformatics system, so that known and unknown single nucleotide polymorphisms (SNPs), together with other genetic variations, can be readily determined.

Nanopore Sequencing
http://www.mcb.harvard.edu/branton/index.htm

The underlying principle of nanopore sequencing is that a single-stranded DNA or RNA molecule can be electrophoretically driven through a nano-scale pore in such a way that the molecule traverses the pore in strict linear sequence, as illustrated in Figure 1. Because a translocating molecule partially obstructs or blocks the nanopore, it alters the pore's electrical properties.

Figure 1: A nanopore sensor for sequencing DNA. A channel or nanopore in an insulating membrane separates two compartments filled with ionic solution. In response to a voltage bias (labeled "-" and "+") across the membrane, ssDNA molecules (yellow) in the "-" compartment are driven, one at a time, into and through the channel. Embedded in the membrane, an electrically connected nanotube (orange) that abuts the nanopore serves as a sensor to identify the nucleotides in the translocating DNA molecules. Elevated temperatures and denaturants maintain the DNA in an unstructured, single-stranded form.

Pyrosequencing

The Pyrosequencing™ technology is a relatively new DNA sequencing method, commercialized and today marketed by Biotage AB. The technique uses the cooperation of four different enzymes and the phenomenon of bioluminescence to monitor the incorporation of nucleotides into DNA. A short description of the steps in the Pyrosequencing process is given below.

Initial step: The reaction mixture consists of the four enzymes (DNA polymerase, ATP sulfurylase, luciferase and apyrase), the substrates needed for the reactions, and the single-stranded DNA to be sequenced.

Step 1 (Polymerase): One of the four nucleotides (dATP, dCTP, dGTP or dTTP) is added to the reaction mixture. If the added nucleotide is complementary to the next base in the template strand, it is incorporated and inorganic pyrophosphate (PPi) is released.

Step 2 (ATP sulfurylase): The PPi is converted into ATP by the enzyme ATP sulfurylase.

Step 3 (Luciferase): Luciferase catalyzes a reaction in which ATP is used to generate light. The amount of light is proportional to the amount of ATP, and hence, via the PPi, to the number of incorporated nucleotides. The light is detected by a CCD camera.

Step 4 (Apyrase): Remaining dNTPs and ATP are degraded by apyrase before the next nucleotide in the iterative cycle is added to the reaction mixture.

Pyrosequencing on a chip
Mostafa Ronaghi, Stanford Genome Technologies Center; 454 Life Sciences

Polony Sequencing

"Polonies" are tiny colonies of DNA, about one micron in diameter, grown on a glass microscope slide (the word itself is a contraction of "polymerase colony"). To create them, researchers first pour a solution containing chopped-up DNA onto the slide. Adding an enzyme called polymerase causes each piece to copy itself repeatedly, creating millions of polonies, each dot containing only copies of the original piece of DNA. The polonies are then exposed to a series of chemically labeled probes that light up when run through a scanning machine, identifying each nucleotide base in the strand of code, much as dusting with powder allows crime-scene investigators to bring up fingerprints on a surface. Prior to sequencing, the dsDNA is denatured and the unbound copy strands are washed away; the template strands survive the washing because they are covalently linked to the slide.

Some future directions for sequencing

1. Personalized genome sequencing
   Find your ~1,000,000 single nucleotide polymorphisms (SNPs)
   Find your rearrangements
   Goals:
     Link genome with phenotype
     Provide personalized diet and medicine (???)
     (designer babies, big-brother insurance companies)
   Timeline:
     Inexpensive sequencing: 2010-2015
     Genotype-phenotype association: 2010-???
     Personalized drugs: 2015-???

Some future directions for sequencing

2. Environmental sequencing
   Find your flora: the organisms living in your body
     External organs: skin, mucous membranes
     Gut, mouth, etc.
     Normal flora: >200 species, trillions of individuals
   Flora-disease and flora-(non-optimal health) associations
   Timeline:
     Inexpensive research sequencing: today
     Research & associations: within the next 10 years
     Personalized sequencing: 2015+
   Find the diversity of organisms living in different environments
     Hard to isolate individually
     Assembly of all organisms at once

Some future directions for sequencing

3. Organism sequencing
   Sequence a large fraction of all organisms
   Deduce ancestors; reconstruct ancestral genomes
   Synthesize ancestral genomes; clone them (Jurassic Park!)
   Study the evolution of function
     Find functional elements within a genome
     Find how those evolved in different organisms
     Find how modules/machines composed of many genes evolved

RNA Secondary Structure

aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc

RNA and Translation

RNA and Splicing

Elements of RNA secondary structure: stems, hairpin loops, interior loops, bulge loops, multi-branched loops

Secondary structure vs. tertiary structure

Modeling RNA Secondary Structure: Context-Free Grammars

A Context-Free Grammar

S → AB
A → aAc | a
B → bBd | b

Nonterminals: S, A, B
Terminals: a, b, c, d
Production rules: the 5 rules above

Derivation: start from the S nonterminal; repeatedly apply any production rule, replacing a nonterminal by the corresponding right-hand side, until no more nonterminals are present:

S → AB → aAcB → … → aaaacccB → aaaacccbBd → … → aaaacccbbbbbdddd

This grammar produces exactly the strings a^(i+1) c^i b^(j+1) d^j, for i, j ≥ 0.
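As a quick sanity check, here is a minimal Python sketch (the helper name `derive` is illustrative, not from the slides) that builds the string obtained by expanding A i times and B j times:

```python
# Toy CFG from the slide: S -> AB, A -> aAc | a, B -> bBd | b.
# Applying A -> aAc i times (then A -> a) and B -> bBd j times
# (then B -> b) yields exactly a^(i+1) c^i b^(j+1) d^j.

def derive(i, j):
    a_part = "a" * (i + 1) + "c" * i
    b_part = "b" * (j + 1) + "d" * j
    return a_part + b_part  # S -> AB concatenates the two halves

print(derive(3, 4))  # aaaacccbbbbbdddd, the string derived on the slide
```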

Example: modeling a stem-loop

S → a W1 u
W1 → c W2 g
W2 → g W3 c
W3 → g L c
L → agugc

[Figure: the stem-loop ACGG AGUGC CCGU, with the 4-bp stem ACGG:UGCC closed by the loop AGUGC]

What if the stem-loop can have other letters in place of the ones shown?

Example: modeling a stem-loop

S → a W1 u | g W1 u
W1 → c W2 g
W2 → g W3 c | g W3 u
W3 → g L c | a L u
L → agucg | agccg | cugugc

More generally, any 4-long stem with a 3-5-long loop:

S → a W1 u | g W1 u | g W1 c | c W1 g | u W1 g | u W1 a
W1 → a W2 u | g W2 u | g W2 c | c W2 g | u W2 g | u W2 a
W2 → a W3 u | g W3 u | g W3 c | c W3 g | u W3 g | u W3 a
W3 → a L u | g L u | g L c | c L g | u L g | u L a
L → a L1 | c L1 | g L1 | u L1
L1 → a L2 | c L2 | g L2 | u L2
L2 → a | c | g | u | aa | … | uu | aaa | … | uuu

[Figure: three example stem-loops matched by these grammars]

A parse tree: alignment of CFG to sequence

S → a W1 u
W1 → c W2 g
W2 → g W3 c
W3 → g L c
L → agugc

[Figure: the parse tree with root S and nonterminals W1, W2, W3, L laid over the sequence ACGGAGUGCCCGU]

Alignment scores for parses!

We can define each rule X → s, where s is a string, to have a score. Example:

W → g W' c : 3   (forms 3 hydrogen bonds)
W → a W' u : 2   (forms 2 hydrogen bonds)
W → g W' u : 1   (forms 1 hydrogen bond)
W → x W' z : -1, when (x, z) is not an a/u, g/c, or g/u pair

Questions:
How do we best align a CFG to a sequence? (Dynamic programming)
How do we set the parameters? (Stochastic CFGs)

The Nussinov Algorithm

Let's forget CFGs for a moment.
Problem: find the RNA structure with the maximum (weighted) number of nested pairings.

[Figure: a folded RNA structure with nested base pairs, for the sequence ACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACCGCGAGAGGGAAGACUCGUAUAAGCG]

The Nussinov Algorithm

Given a sequence X = x1…xN, define the DP matrix:
F(i, j) = maximum number of weighted bonds if xi…xj folds optimally

Two cases, if i < j:
1. xi is paired with xj:
   F(i, j) = s(xi, xj) + F(i+1, j-1)
2. xi is not paired with xj:
   F(i, j) = max over {k: i ≤ k < j} of F(i, k) + F(k+1, j)

The Nussinov Algorithm

Initialization:
F(i, i-1) = 0, for i = 2 to N
F(i, i) = 0, for i = 1 to N

Iteration:
For l = 2 to N:
  For i = 1 to N - l + 1:
    j = i + l - 1
    F(i, j) = max of:
      F(i+1, j-1) + s(xi, xj)
      max over {i ≤ k < j} of F(i, k) + F(k+1, j)

Termination:
The best structure is given by F(1, N).
(A traceback is needed to recover the structure; refer to the Durbin book.)
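A compact Python sketch of this recurrence, with pair scores taken from the hydrogen-bond weights on the earlier scoring slide (the traceback that recovers the structure is omitted):

```python
# Nussinov folding: maximize the weighted number of nested pairings.

def s(x, y):
    # Bond scores: g/c = 3, a/u = 2, g/u = 1; 0 (rather than -1) for
    # non-pairs, since leaving bases unpaired is covered by the bifurcation case.
    scores = {frozenset("gc"): 3, frozenset("au"): 2, frozenset("gu"): 1}
    return scores.get(frozenset((x, y)), 0)

def nussinov(x):
    n = len(x)
    F = [[0] * n for _ in range(n)]          # F[i][j] = best score for x[i..j]
    for l in range(2, n + 1):                # subsequence length, short to long
        for i in range(n - l + 1):
            j = i + l - 1
            best = F[i + 1][j - 1] + s(x[i], x[j])   # case 1: x_i pairs with x_j
            for k in range(i, j):                    # case 2: bifurcation at k
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F[0][n - 1] if n else 0

print(nussinov("gggaaauccc"))  # maximum weighted pairing score for a toy string
```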

The Nussinov Algorithm and CFGs

Define the following grammar, with scores:

S → g S c : 3 | c S g : 3
S → a S u : 2 | u S a : 2
S → g S u : 1 | u S g : 1
S → S S : 0 | a S : 0 | c S : 0 | g S : 0 | u S : 0 | ε : 0

Note: ε is the "empty" string.

Then the Nussinov algorithm finds the optimal parse of a string with this grammar.

The Nussinov Algorithm

Initialization:
F(i, i-1) = 0, for i = 2 to N
F(i, i) = 0, for i = 1 to N        (uses S → a | c | g | u)

Iteration:
For l = 2 to N:
  For i = 1 to N - l + 1:
    j = i + l - 1
    F(i, j) = max of:
      F(i+1, j-1) + s(xi, xj)      (uses S → a S u | …)
      max over {i ≤ k < j} of F(i, k) + F(k+1, j)      (uses S → S S)

Termination:
The best structure is given by F(1, N).

Stochastic Context-Free Grammars

In analogy to HMMs, we can assign probabilities to productions.

Given a grammar
X1 → s11 | … | s1n
…
Xm → sm1 | … | smn

we can assign a probability to each rule, such that
P(Xi → si1) + … + P(Xi → sin) = 1

Example

S → a S b : ½
S → a : ¼
S → b : ¼

This gives a probability distribution over all strings x:
If x = a^n b^(n+1), then P(x) = 2^(-n) × ¼ = 2^(-(n+2))
If x = a^(n+1) b^n, the same
Otherwise, P(x) = 0
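A small sketch that evaluates this distribution directly, using exact fractions (the helper name `prob` is illustrative, not from the slides):

```python
from fractions import Fraction

# Toy SCFG from the slide: S -> aSb (1/2), S -> a (1/4), S -> b (1/4).
# Every derivable string has the form a^n b^(n+1) or a^(n+1) b^n and has
# exactly one parse, so P(x) = (1/2)^min(#a, #b) * (1/4).

def prob(x):
    n_a = len(x) - len(x.lstrip("a"))      # number of leading a's
    n_b = len(x) - n_a                      # the rest must be all b's
    if x != "a" * n_a + "b" * n_b or abs(n_a - n_b) != 1:
        return Fraction(0)                  # not derivable by the grammar
    return Fraction(1, 2) ** min(n_a, n_b) * Fraction(1, 4)

print(prob("aabbb"))  # 1/16 = 2^-(2+2), matching the formula with n = 2
print(prob("ab"))     # 0: equal numbers of a's and b's cannot be derived
```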

Computational Problems

Calculate an optimal alignment of a sequence and an SCFG (DECODING)
Calculate Prob[ sequence | grammar ] (EVALUATION)
Given a set of sequences, estimate the parameters of an SCFG (LEARNING)

Normal Forms for CFGs

Chomsky Normal Form:
X → Y Z
X → a

All productions are either to 2 nonterminals or to 1 terminal.

Theorem (technical): every CFG has an equivalent grammar in Chomsky Normal Form.
(The grammar in normal form produces exactly the same set of strings.)

Example of converting a CFG to C.N.F.

Original grammar:
S → ABC
A → Aa | a
B → Bb | b
C → CAc | c

Converted:
S → A S'
S' → B C
A → A A | a
B → B B | b
C → D C' | c
C' → c
D → C A

[Figure: parse trees of the same string under the original and the converted grammar]
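A small sketch (with an assumed dictionary encoding of grammars, not from the slides) that checks the converted grammar above really is in Chomsky normal form:

```python
# Grammar as a dict: nonterminal -> list of right-hand sides (tuples of symbols).
# CNF: every right-hand side is two nonterminals, or a single terminal.

def is_cnf(rules):
    nonterminals = set(rules)
    for rhss in rules.values():
        for rhs in rhss:
            two_nts = len(rhs) == 2 and all(sym in nonterminals for sym in rhs)
            one_term = len(rhs) == 1 and rhs[0] not in nonterminals
            if not (two_nts or one_term):
                return False
    return True

converted = {"S": [("A", "S'")], "S'": [("B", "C")],
             "A": [("A", "A"), ("a",)], "B": [("B", "B"), ("b",)],
             "C": [("D", "C'"), ("c",)], "C'": [("c",)], "D": [("C", "A")]}
print(is_cnf(converted))  # True
```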

Another example

Original grammar:
S → ABC
A → C | aA
B → bB | b
C → cCd | c

Converted:
S → A S'
S' → B C
A → C' C'' | c | A' A
A' → a
B → B' B | b
B' → b
C → C' C'' | c
C' → c
C'' → C D
D → d

Decoding: the CYK algorithm

Given x = x1…xN and an SCFG G, find the most likely parse of x (the most likely alignment of G to x).

Dynamic programming variable:
γ(i, j, V) = likelihood of the most likely parse of xi…xj, rooted at nonterminal V

Then γ(1, N, S) is the likelihood of the most likely parse of x by the grammar.

The CYK algorithm (Cocke-Younger-Kasami)

Initialization:
For i = 1 to N, for any nonterminal V:
  γ(i, i, V) = log P(V → xi)

Iteration (over span lengths, shortest first):
For l = 1 to N-1:
  For i = 1 to N - l:
    j = i + l
    For any nonterminal V:
      γ(i, j, V) = maxX maxY max over {i ≤ k < j} of γ(i, k, X) + γ(k+1, j, Y) + log P(V → XY)

Termination:
log P(x | θ, π*) = γ(1, N, S),
where π* is the optimal parse tree (recovered by tracing back through the maximizations).

[Figure: nonterminal V spanning (i, j), split at k into X over (i, k) and Y over (k+1, j)]
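A Python sketch of this CYK for a grammar in Chomsky normal form (the dictionary encoding of the grammar is an assumption for illustration; the traceback that recovers π* is omitted):

```python
import math
from collections import defaultdict

# Assumed encoding: emit[V][a] = P(V -> a);
# binary[V] = list of (X, Y, p) triples with p = P(V -> X Y).

def cyk(x, emit, binary, S):
    N = len(x)
    NEG = float("-inf")
    g = defaultdict(lambda: NEG)            # g[i, j, V] = best log-likelihood
    for i in range(N):                       # initialization: V -> x_i
        for V, probs in emit.items():
            if x[i] in probs:
                g[i, i, V] = math.log(probs[x[i]])
    for l in range(1, N):                    # span length j - i, shortest first
        for i in range(N - l):
            j = i + l
            for V, rules in binary.items():
                best = NEG
                for X, Y, p in rules:
                    for k in range(i, j):    # split point
                        best = max(best, g[i, k, X] + g[k + 1, j, Y] + math.log(p))
                g[i, j, V] = best
    return g[0, N - 1, S]                    # log-likelihood of the best parse
```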

An SCFG for predicting RNA structure

S → a S | c S | g S | u S | ε
  | S a | S c | S g | S u
  | a S u | c S g | g S u | u S g | g S c | u S a
  | S S

Adjust the probability parameters to reflect bond strength, etc.

This grammar makes no distinction between non-paired bases, bulges, and loops. It can be modified to model these events:
L: loop nonterminal
H: hairpin nonterminal
B: bulge nonterminal
etc.

CYK for RNA folding

Initialization:
γ(i, i-1) = log P(ε)

Iteration (over span lengths, shortest first):
For l = 1 to N:
  For i = 1 to N - l + 1:
    j = i + l - 1
    γ(i, j) = max of:
      γ(i+1, j-1) + log P(xi S xj)
      γ(i, j-1) + log P(S xj)
      γ(i+1, j) + log P(xi S)
      max over {i < k < j} of γ(i, k) + γ(k+1, j) + log P(S S)
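The same recurrence as a self-contained Python sketch. The probability values below are made up for illustration only (they sum to 1 but are not trained parameters):

```python
import math

# Assumed rule probabilities for the single-nonterminal grammar above.
P_PAIR = {p: 0.12 for p in [("a", "u"), ("u", "a"), ("c", "g"),
                            ("g", "c"), ("g", "u"), ("u", "g")]}   # S -> x S y
P_LEFT = {b: 0.04 for b in "acgu"}     # S -> x S
P_RIGHT = {b: 0.02 for b in "acgu"}    # S -> S x
P_BIF, P_EPS = 0.02, 0.02              # S -> S S, S -> eps  (total mass = 1.0)

def rna_cyk(x):
    N = len(x)
    g = {(i, i - 1): math.log(P_EPS) for i in range(N + 1)}   # empty spans
    for l in range(1, N + 1):                  # span length
        for i in range(N - l + 1):
            j = i + l - 1
            cand = [g[i, j - 1] + math.log(P_RIGHT[x[j]]),    # S x_j
                    g[i + 1, j] + math.log(P_LEFT[x[i]])]     # x_i S
            if l >= 2 and (x[i], x[j]) in P_PAIR:             # x_i S x_j
                cand.append(g[i + 1, j - 1] + math.log(P_PAIR[x[i], x[j]]))
            for k in range(i + 1, j):                         # S S bifurcation
                cand.append(g[i, k] + g[k + 1, j] + math.log(P_BIF))
            g[i, j] = max(cand)
    return g[0, N - 1]                         # log-likelihood of the best parse

print(rna_cyk("gcgcaaagcgc"))
```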

Evaluation

Recall HMMs:
Forward: fl(i) = P(x1…xi, πi = l)
Backward: bk(i) = P(xi+1…xN | πi = k)
Then: P(x) = Σk fk(N) ak0 = Σl a0l el(x1) bl(1)

Analogues in SCFGs:
Inside: a(i, j, V) = P(xi…xj is generated by nonterminal V)
Outside: b(i, j, V) = P(x, excluding xi…xj, is generated by S, with the excluded part rooted at V)

The Inside Algorithm

To compute a(i, j, V) = P(xi…xj is produced by V):

a(i, j, V) = ΣX ΣY Σk a(i, k, X) a(k+1, j, Y) P(V → XY)

[Figure: V spanning (i, j), split at k into X over (i, k) and Y over (k+1, j)]

Algorithm: Inside

Initialization:
For i = 1 to N, V a nonterminal:
  a(i, i, V) = P(V → xi)

Iteration (over span lengths, shortest first):
For l = 1 to N-1:
  For i = 1 to N - l:
    j = i + l
    For V a nonterminal:
      a(i, j, V) = ΣX ΣY Σ{i ≤ k < j} a(i, k, X) a(k+1, j, Y) P(V → XY)

Termination:
P(x | θ) = a(1, N, S)
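A Python sketch of the inside algorithm, using the same assumed grammar encoding as the CYK sketch above (sums of probabilities replace maxima of log-probabilities):

```python
# Assumed encoding: emit[V][a] = P(V -> a);
# binary[V] = list of (X, Y, p) triples with p = P(V -> X Y).

def inside(x, emit, binary, S):
    N = len(x)
    nonterms = set(emit) | set(binary)
    a = {}                                   # a[i, j, V] = P(V =>* x_i..x_j)
    for i in range(N):
        for V in nonterms:
            a[i, i, V] = emit.get(V, {}).get(x[i], 0.0)
    for l in range(1, N):                    # span length j - i, shortest first
        for i in range(N - l):
            j = i + l
            for V in nonterms:
                total = 0.0
                for X, Y, p in binary.get(V, []):
                    for k in range(i, j):    # all split points
                        total += a.get((i, k, X), 0.0) * a.get((k + 1, j, Y), 0.0) * p
                a[i, j, V] = total
    return a[0, N - 1, S]                    # P(x | grammar)
```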

The Outside Algorithm

b(i, j, V) = P(x1…xi-1, xj+1…xN, where the "gap" xi…xj is rooted at V)

For the case where V is the right child in a production Y → X V:
b(i, j, V) = ΣX ΣY Σ{k < i} a(k, i-1, X) b(k, j, Y) P(Y → XV)

[Figure: Y spanning (k, j), with X over (k, i-1) and V over the gap (i, j)]

Algorithm: Outside

Initialization:
b(1, N, S) = 1
For any other V, b(1, N, V) = 0

Iteration:
For i = 1 to N:
  For j = N down to i:
    For V a nonterminal:
      b(i, j, V) = ΣX ΣY Σ{k < i} a(k, i-1, X) b(k, j, Y) P(Y → XV)
                 + ΣX ΣY Σ{k > j} a(j+1, k, X) b(i, k, Y) P(Y → VX)

Termination:
For any i, it holds that:
P(x | θ) = ΣX b(i, i, X) P(X → xi)
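A matching Python sketch of the outside algorithm; it consumes the inside table `a` produced by the sketch above and visits spans from largest to smallest (same assumed grammar encoding):

```python
from collections import defaultdict

def outside(x, emit, binary, S, a):
    N = len(x)
    nonterms = set(emit) | set(binary)
    b = defaultdict(float)                 # unassigned entries default to 0
    b[0, N - 1, S] = 1.0                   # initialization: the full span
    for l in range(N - 2, -1, -1):         # span length j - i, largest first
        for i in range(N - l):
            j = i + l
            for V in nonterms:
                total = 0.0
                for Y, rules in binary.items():
                    for X, Z, p in rules:
                        if Z == V:         # rule Y -> X V: V is the right child
                            for k in range(i):
                                total += a.get((k, i - 1, X), 0.0) * b[k, j, Y] * p
                        if X == V:         # rule Y -> V Z: V is the left child
                            for k in range(j + 1, N):
                                total += a.get((j + 1, k, Z), 0.0) * b[i, k, Y] * p
                b[i, j, V] = total
    # Sanity check (termination): for any i,
    # P(x) = sum over V of b[i, i, V] * P(V -> x_i).
    return b
```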

Learning for SCFGs

We can now estimate
c(V) = expected number of times V is used in the parse of x1…xN:

c(V) = (1 / P(x | θ)) Σ{1 ≤ i ≤ N} Σ{i ≤ j ≤ N} a(i, j, V) b(i, j, V)

c(V → XY) = (1 / P(x | θ)) Σ{1 ≤ i < j ≤ N} Σ{i ≤ k < j} b(i, j, V) a(i, k, X) a(k+1, j, Y) P(V → XY)

Learning for SCFGs

Then we can re-estimate the parameters with EM:

Pnew(V → XY) = c(V → XY) / c(V)

Pnew(V → a) = c(V → a) / c(V)
            = Σ{i: xi = a} b(i, i, V) P(V → a) / Σ{1 ≤ i ≤ N} Σ{i ≤ j ≤ N} a(i, j, V) b(i, j, V)

Summary: SCFG and HMM algorithms

GOAL            HMM algorithm     SCFG algorithm
Optimal parse   Viterbi           CYK
Estimation      Forward           Inside
                Backward          Outside
Learning        EM: Fw/Bck        EM: Ins/Outs
Memory          O(N K)            O(N^2 K)
Time            O(N K^2)          O(N^3 K^3)

where K is the number of states in the HMM, or the number of nonterminals in the SCFG.

The Zuker algorithm – main ideas

Models the energy of an RNA fold:
Instead of base pairs, it scores pairs of base pairs (more accurate)
Separate score for bulges
Separate score for different-size and different-composition loops
Separate score for interactions between a stem and the beginning of a loop

One can also do all of this with an SCFG, and train it on real data.