
Computational approaches for RNA energy parameter estimation Mirela Andronescu Department of Computer Science Supervisors Anne Condon Holger Hoos Committee.


1 Computational approaches for RNA energy parameter estimation
Mirela Andronescu, Department of Computer Science
Supervisors: Anne Condon, Holger Hoos
Committee: David Mathews, Kevin Murphy

2 RNA structure
RNA sequence: 5’ ACGUAGCGA…3’
Secondary structure: a set of base pairs (A-U, C-G, G-U)
Tertiary structure

3 Overview
Input: RNA sequence (5’ ACUGCUAGC UGCGUUGC… 3’); output: predicted structure
Energy model + prediction algorithm: 60% accuracy
New energy model + prediction algorithm: 71% accuracy

4 Roles of RNA structures and thermodynamics
– Translation
– Catalysis
– Splicing
– Gene silencing

5 Determining RNA secondary structure
Experimentally
– X-ray crystallography, NMR, chemical & structure probing (expensive)
Computationally
– Comparative sequence analysis, given many homologous sequences
– Thermodynamic approaches, using an energy model

6 Thermodynamic RNA secondary structure prediction
Assumption
– RNAs fold into their minimum free energy structures
Common approach
– Dynamic programming algorithm, O(n^3) [Zuker & Stiegler, 1981; Lyngso et al, 1999]
Based on an energy model
– The Turner model [Mathews et al, 1999, 2004]
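
The dynamic programming idea can be illustrated with a much simpler stand-in. The sketch below is a Nussinov-style recursion that maximizes the number of nested base pairs; it is not the Zuker & Stiegler free energy minimization named on the slide, and it uses no Turner parameters. The function name and the `min_loop` default of 3 unpaired bases per hairpin are assumptions made for illustration. It shares the cubic-time table-filling shape of the energy-based algorithms.

```python
def max_base_pairs(seq, min_loop=3):
    """Nussinov-style DP: maximum number of nested base pairs in seq.

    Illustrative only: counts pairs instead of minimizing free energy.
    """
    allowed = {("A", "U"), ("U", "A"), ("C", "G"),
               ("G", "C"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j is unpaired
            score = best[i][j - 1]
            # Case 2: j pairs with some k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1] if n else 0


# Small hairpin: three stacked pairs closing a loop of three unpaired bases
print(max_base_pairs("GGGAAAUCC"))
```

Replacing the pair count with loop-based free energy terms is what turns this recursion into an energy-model-driven predictor.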

7 The Turner model [Mathews et al, 1999, 2004]
Energy model:
– Features (stacked pair AG/CU)
– Parameters θ (-2.1 kcal/mol)
– Energy function ΔG(θ) = c^T θ
[Structure shown: 3’ UTR protein-binding RNA from Rfam]
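
A minimal sketch of the linear energy function ΔG(θ) = c^T θ, assuming c holds the number of times each feature occurs in a structure. Only the stacked pair AG/CU value of -2.1 kcal/mol comes from the slide; the feature names and the hairpin value are hypothetical placeholders, not Turner parameters.

```python
def free_energy(counts, theta):
    """Linear energy model: sum over features of count * parameter (kcal/mol)."""
    return sum(c * theta[feature] for feature, c in counts.items())


# theta: one parameter per feature (kcal/mol)
theta = {
    "stack_AG_CU": -2.1,   # value shown on the slide
    "hairpin_len4": 5.6,   # hypothetical value, for illustration only
}

# c: feature counts for one hypothetical structure
counts = {"stack_AG_CU": 2, "hairpin_len4": 1}

print(free_energy(counts, theta))
```

Because the energy is linear in θ, changing a parameter changes the energy of every structure that contains that feature, in proportion to its count.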

8 The Turner model [Mathews et al, 1999, 2004]
Obtained by
– Linear regression from experimental data
– Biological knowledge
Limitations
– No thorough computational method was used
– Many parameters have been extrapolated
– Large amounts of data were not exploited
Accuracy on our data set: 60%
Our goal: Improve the RNA energy model

9 Contributions
– Databases (Ch 3)
– Parameter estimation algorithms (Ch 4)
– Parameter estimation for models without pseudoknots (Ch 5)
– Model selection and feature relationships (Ch 6)
– Parameter estimation for models with pseudoknots (Ch 7)

10 RNA STRAND
Structural data from 8 public databases
– RNA sequences with known secondary structures, unknown free energies
– Determined by comparative sequence analysis, X-ray crystallography, NMR
– 4600 RNAs, avg. length 530 nucleotides
[Andronescu et al, BMC Bioinformatics 2008]

11 RNA THERMO
Thermodynamic data from 58 papers
– RNA sequences with known secondary structures, measured free energies
– Determined by optical melting experiments [Turner lab & collaborators]
– 1300 RNAs, avg. length 17 nucleotides

12 Outline
– Databases: RNA STRAND and RNA THERMO (Ch 3)
– Parameter estimation algorithms (Ch 4)
– Parameter estimation for models without pseudoknots (Ch 5)
– Model selection and feature relationships (Ch 6)
– Parameter estimation for models with pseudoknots (Ch 7)

13 Parameter estimation problem
Given
– A structural set S (seq + str)
– A thermodynamic set T (seq + str + free energy)
– A model with a fixed set of features (e.g. Turner99 with 363 features) and a free energy function (e.g. linear in the parameters θ)
Estimate (learn) parameters θ that maximize avg. accuracy when measured on a reference set
Sn = # correctly predicted bp / # true bp
PPV = # correctly predicted bp / # predicted bp
F-measure = harmonic mean (Sn, PPV) = 2*Sn*PPV/(Sn+PPV)
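
The three accuracy measures can be sketched as follows, assuming each structure is represented as a set of (i, j) base pair index tuples. That representation, the function name, and the choice to return 0.0 when a denominator is zero are assumptions; the slide does not specify them.

```python
def accuracy(predicted, reference):
    """Return (Sn, PPV, F-measure) for predicted vs. reference base pairs."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)  # correctly predicted base pairs
    sn = correct / len(reference) if reference else 0.0
    ppv = correct / len(predicted) if predicted else 0.0
    # Harmonic mean of Sn and PPV
    f = 2 * sn * ppv / (sn + ppv) if sn + ppv > 0 else 0.0
    return sn, ppv, f


# Hypothetical example: 3 true pairs, 2 predicted, both predictions correct
reference = {(0, 8), (1, 7), (2, 6)}
predicted = {(0, 8), (1, 7)}
sn, ppv, f = accuracy(predicted, reference)
print(sn, ppv, f)
```

Averaging the F-measure over all structures in the reference set gives the "avg. accuracy" that the estimated θ is meant to maximize.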

14 Constraint Generation (CG)
Idea: for all (x, y_known) in S, y_known should have lower free energy than all other structures y
– Predict low energy structures with the current θ
– Solve a constrained quadratic optimization problem
   min (Σ δ^2 + Σ (free energy error for T)^2 + regularizer)
   subject to ΔG(x, y_known, θ) < ΔG(x, y, θ) + δ, for all (x, y_known) in S
– Repeat until convergence
[Andronescu et al, Bioinformatics 2007]
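
One way to write the quadratic problem more explicitly is sketched below. The notation is an assumption added for readability: k indexes the generated constraints, δ_k is the slack of constraint k, e is the measured free energy of an entry in T, and R(θ) stands for the regularizer, whose form the slide does not give. Any relative weighting of the three terms is omitted.

```latex
\begin{aligned}
\min_{\theta,\,\delta}\quad
  & \sum_{k} \delta_k^{2}
    + \sum_{(x,\,y,\,e)\,\in\,T} \bigl(\Delta G(x, y, \theta) - e\bigr)^{2}
    + R(\theta) \\
\text{subject to}\quad
  & \Delta G(x, y_{\text{known}}, \theta) < \Delta G(x, y, \theta) + \delta_k
    \quad \text{for all } (x, y_{\text{known}}) \in S
    \text{ and each predicted } y \neq y_{\text{known}}
\end{aligned}
```

With a linear energy function, ΔG(x, y, θ) = c(x, y)^T θ, every term is at most quadratic in θ and δ and every constraint is linear.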

15 Boltzmann Likelihood (BL)
The probability of a structure y is a Boltzmann function: [equation not preserved in the transcript]
P(structural data) = [equation not preserved in the transcript]
Solve a non-linear optimization problem with a unique optimum
   max (P(structural data) × P(thermo data) × regularizer)
Similar approach (CONTRAfold) proposed by [Do et al, 2006]
– No thermo data was used
– Free energies are not predicted correctly
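
Since the slide's equation did not survive, the sketch below assumes the standard Boltzmann form P(y) = exp(-ΔG(y)/RT) / Z, with Z the sum of exp(-ΔG/RT) over the candidate structures. The candidate structures and their energies are hypothetical, and the explicit enumeration is for illustration only: for real sequences the sum runs over all possible structures and is computed by dynamic programming rather than by listing them.

```python
import math

R = 0.0019872   # gas constant in kcal/(mol*K)
T = 310.15      # 37 degrees Celsius in Kelvin (assumed temperature)


def boltzmann_probabilities(energies, rt=R * T):
    """Map each structure to exp(-dG/RT) / Z over the given candidates."""
    weights = {y: math.exp(-g / rt) for y, g in energies.items()}
    z = sum(weights.values())  # partition function over the candidates
    return {y: w / z for y, w in weights.items()}


# Hypothetical free energies (kcal/mol) for three candidate structures
energies = {"((...))": -2.0, "(.....)": -0.5, ".......": 0.0}
for structure, p in boltzmann_probabilities(energies).items():
    print(structure, round(p, 3))
```

Lower free energy means higher probability, so maximizing the probability of the known structures pushes θ toward parameters under which those structures are energetically favored.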

16 Outline
– Databases: RNA STRAND and RNA THERMO (Ch 3)
– Parameter estimation algorithms: CG and BL (Ch 4)
– Parameter estimation for models without pseudoknots (Ch 5)
– Model selection and feature relationships (Ch 6)
– Parameter estimation for models with pseudoknots (Ch 7)

17 Parameter estimation for models without pseudoknots
Sensitivity = # correctly predicted bp / # true bp
PPV = # correctly predicted bp / # predicted bp

Parameters            Trained on   F      RMSE
BL*                   STrain+T     0.69   1.34
CG*                   STrain+T     0.68   0.98
CONTRAfold 2.0        SProc        0.68   6.02
CONTRAfold 1.1        151Rfam      0.61   9.17
Turner99                           0.60   1.24
CG 07 [Andr. 2007]    SProc+T      0.65   1.03

BL* gives the highest accuracy on average, an increase of 9% from the Turner99 parameters.
Set from RNA STRAND: # str: 2500, avg len: 330, std len: 500

18 Runtime analysis

Parameter estimation algorithm   CPU time
Boltzmann Likelihood (BL)        1-8 months
Constraint Generation (CG)       1-3 days

BL is at least 10 times slower than CG, but slightly more accurate.
Reference machine: a 3GHz Intel Xeon CPU (1MB cache and 2GB RAM)

19 Outline
– Databases: RNA STRAND and RNA THERMO (Ch 3)
– Parameter estimation algorithms: CG and BL (Ch 4)
– Parameter estimation for models without pseudoknots (Ch 5): 9% better F-measure
– Model selection and feature relationships (Ch 6)
– Parameter estimation for models with pseudoknots (Ch 7)

20 Model selection
Explore parsimonious and lavish models
For lavish models, use feature relationships

Model          # features   BL F-measure
Parsimonious   79           0.646
Turner99       363          0.684
Lavish         7802         0.683

21 Feature relationships
Link features not covered by thermo set T with those that are covered
BL: max (P(structural data) × P(thermo data) × regularizer)

22 Model selection and feature relationships

Parameters                   Trained on   F      RMSE
BL-FR* (# features = 7726)   STrain+T     0.71   1.51
BL*                          STrain+T     0.69   1.34
CG*                          STrain+T     0.68   0.98
CONTRAfold 2.0               SProc        0.68   6.02
CONTRAfold 1.1               151Rfam      0.61   9.17
Turner99                                  0.60   1.24
CG 07 [Andr. 2007]           SProc+T      0.65   1.03

Modeling feature relationships improves prediction by an additional 1.3% (10.6% from the Turner99 parameters).

23 Outline
– Databases: RNA STRAND and RNA THERMO (Ch 3)
– Parameter estimation algorithms: CG and BL (Ch 4)
– Parameter estimation for models without pseudoknots (Ch 5): 9% better F-measure
– Model selection and feature relationships (Ch 6): 11% better F-measure
– Parameter estimation for models with pseudoknots (Ch 7)

24 Parameter estimation for models with pseudoknots
Models (Turner features + additional features for pseudoknots)
– Dirks & Pierce [Dirks and Pierce, 2003]
– Cao & Chen [Cao and Chen, 2006]
Prediction algorithm
– HotKnots [Ren et al, 2005]
Parameter estimation algorithm
– CG modified for this problem
– BL was much harder to implement

25 Parameter estimation for models with pseudoknots

Params        With pknots           Without pknots         All
              Short*     Long       Short*     Long
              #str=78    #str=20    #str=261   #str=87     #str=446
              Len=48     Len=170    Len=58     Len=124     Len=74
Initial D&P   0.62       0.51       0.71       0.69        0.68
New D&P       0.80       0.56       0.81       0.68        0.77
Initial C&C   0.77       0.54       0.71       0.68        0.71
New C&C       0.75       0.54       0.81       0.71        0.77

* Short means at most 100 nucleotides
Improvements on average:
– Dirks & Pierce parameters by 9%
– Cao & Chen parameters by 6%

26 Conclusions
– Databases: RNA STRAND and RNA THERMO (Ch 3)
– Parameter estimation algorithms: CG and BL (Ch 4)
– Parameter estimation for models without pseudoknots (Ch 5): 9% better F-measure
– Model selection and feature relationships (Ch 6): 11% better F-measure
– Parameter estimation for models with pseudoknots (Ch 7): 9% and 6% better F

27 Applications
CG 07 [Andr 2007] is part of RNA Vienna WebSuites
Many other software packages benefit from this work
– MFE and suboptimal secondary structure prediction
– Simulation of folding pathways, sampling and clustering
– Prediction of hybridization efficiency, target availability of siRNA

28 Directions for future work
No single parameter set (or algorithm) results in better accuracy for all structures
– Combine parameter sets and algorithms
Explore other models
– Models for multi-loops are not accurate
Accuracy of data is questionable
– Obtain / generate / pre-process data more accurately

29 Acknowledgments
Supervisors:
– Anne Condon, Holger Hoos
Committee:
– Dave Mathews, Kevin Murphy
Collaborators:
– Vera Bereg, Cristina Pop, Alex Brown
Members of the BETA lab and CS department
UBC and IBM Research for funding

30 Additional slides

31 RNAs play diverse roles
Messenger RNA, ribosomal RNA, transfer RNA [contexo.info]

32 RNA structure plays a role in splicing [Bruce R. Korf, Human Genetics and Genomics] [Rogic et al, 2008]

33 RNAs can act as catalysts (ribozymes) [James & Al-Shamkhani]

34 RNA hybridization thermodynamics [Lu and Mathews, 2008]

35 RNA STRAND

Database (source)   RNA type       No.    Median len
Gutell DB           rRNA, intron   1056   1500
tmRDB               tmRNA          726    360
Sprinzl tRNA DB     tRNA           622    76
RNase P DB          RNase P RNA    454    330
SRP DB              SRP RNA        383    270
Rfam                Various        313    60
PDB, NDB            Various        1112   50
RNA STRAND          All of the above   4666   300

36 Design of optical melting experiments
– 16% of multi-loops in RNA STRAND have 5 or more branches
– 30% of internal loops have ≥ 7 unpaired bases
– 13% of internal loops have asymmetry ≥ 3
– Pseudoknots (22 experiments, only 4 features out of the 11 DP are covered)

37 Analysis of RNA THERMO

38 Analysis of RNA THERMO

39 Schematic representation of data

40 Other BL results (M363)

Set          Train                  RMSE   S-Test   S-STR
BL* rho=1    S-Full-Train           1.34   0.679    0.694
BL rho=5     S-Full-Train           1.07   0.677    0.687
BL rho=1     S-Full-Train-nopkstr   1.16   0.668    0.679

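41 Accuracy on classes

Class         #      Len    BL-FR*   CG*    CF2    T99
tRNA          582    80     0.79     0.81   0.77   0.60
RNaseP        387    332    0.61     0.60   0.67   0.55
SRP RNA       357    223    0.74     0.69   0.64   0.71
tmRNA         269    363    0.59     0.50   0.52   0.39
16S rRNA      187    1276   0.50     0.48   [?]    0.39
5S rRNA       117    118    0.88     0.78   0.73   [?]
Ham. riboz.   114    52     0.64     0.67   0.66   0.65
GI intron     78     362    0.60     0.61   0.62   0.56
23S rRNA      52     2684   0.55     0.53   0.59   0.47
All           2518   331    0.71     0.68   [?]    0.60

[?] marks a value missing from the transcript. In the 5S rRNA row only three accuracy values survive, so their column placement is uncertain.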

42 Correlations between parameters

43 Accuracy vs length, no pseudoknots

44 Accuracy vs length, no pseudoknots

45 Correlation accuracies, all

46 Correlation accuracies, all

47 Correlation accuracies, 0-200

48 Correlation accuracies, 200-700

49 Correlation accuracies, 700-2000

50 Correlation accuracies, 2000-4000

51 Sensitivity to the structural set size

52 Feature relationships

53 Feature relationships

54 Feature relationships

55 Feature relationships [Davis and Znosko, 2007]

56 Feature relationships [Christiansen and Znosko, 2008]
– Complete set of sequence symmetric tandem mismatches and improved model for predicting sequence asymmetric mismatches

57 Model selection and feature relationships
Panels: 1/64 of STrain; 1/4 of STrain

58 HotKnots predictions
Panels: Initial D&P; Initial C&C

59 DP vs CC, new parameters
Panels: with pseudoknots; without pseudoknots

60 DP vs CC, initial parameters
Panels: with pseudoknots; without pseudoknots

61 DP, new vs initial
Panels: with pseudoknots; without pseudoknots

62 CC, new vs initial
Panels: with pseudoknots; without pseudoknots

63 Pseudoknots

64 Runtime

65 Runtime

66 Parameter correlations [Andr 07]

67 Feature counts [Andr 07]

68 Accuracy vs iterations [Andr 07]

