Combining RNA and Protein selection models

Outline
The Central Idea in Comparative Molecular Biology & Genomics
Three basic applications: protein secondary structure, RNA secondary structure, gene structure
Combining evolution constraints: protein-protein, RNA-protein
Combining structure descriptions

Modelling Sequence Evolution
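Biological setup: two observed sequences, TGGTT and TCGTA; the ancestral sequence a is unknown.
Each site evolves as a continuous-time Markov chain on the state space {A,C,G,T}, with transition probabilities Pi,j(t).
[Figure: tree relating the sequences, with branch lengths t1 and t2.]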

Jukes-Cantor 69: Total Symmetry
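Rate matrix R (rows: FROM, columns: TO):

         A       C       G       T
  A    -3*α      α       α       α
  C      α     -3*α      α       α
  G      α       α     -3*α      α
  T      α       α       α     -3*α

Transition probabilities after time t, with a = α*t:
P(equal) = ¼(1 + 3exp(-4*a)) ~ 1 - 3a
P(diff.) = ¼(1 - exp(-4*a)) ~ a for each specific different nucleotide, so ~ 3a for any difference
Stationary distribution: (1,1,1,1)/4.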
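The Jukes-Cantor probabilities can be checked numerically. The sketch below assumes the standard closed form for a rate α per specific substitution; the function name `jc69_probs` and the example values are illustrative, not taken from the slides.

```python
import math

def jc69_probs(alpha, t):
    """Return (P(same base), P(one specific different base)) after time t."""
    a = alpha * t
    decay = math.exp(-4.0 * a)
    p_same = 0.25 * (1.0 + 3.0 * decay)
    p_diff = 0.25 * (1.0 - decay)
    return p_same, p_diff

# Hypothetical example: a short branch, where P(diff) is close to a = alpha*t.
p_same, p_diff = jc69_probs(alpha=0.01, t=1.0)
print(p_same, p_diff)
```

For long branches both values approach 1/4, the stationary distribution.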

Comparison of Evolutionary Objects.
[Figure: for each kind of object, the observable is contrasted with the unobservable.]
Protein secondary structure: Goldman, Thorne & Jones, 96
RNA secondary structure: Knudsen & Hein, 99; Eddy & others
Gene structure: Pedersen & Hein, 03; Haussler & others
Multiple levels of selection (protein-protein, RNA-protein): Pedersen, Meyer, Forsberg, Hein, …

Structure Description: Grammars
A grammar is a finite set of rules generating strings:
A starting symbol.
A set of substitution rules applied to variables in the present string.
A string is finished when it contains no variables.
Regular grammars: protein secondary structure, gene structure.
Context-free grammars: RNA secondary structure.

Simple String Generators
Non-terminals (capital), terminals (small).
i. Start with S.
   S --> aT | bS
   T --> aS | bT | ε   (ε is the empty string; this rule is implied by the last step below)
   One sentence (odd number of a's): S --> aT --> aaS --> aabS --> aabaT --> aaba
ii. S --> aSa | bSb | aa | bb
   One sentence (even-length palindrome): S --> aSa --> abSba --> abaaba
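The derivation of aaba in grammar (i) can be replayed mechanically. This is a minimal sketch: the empty right-hand side for T is an assumption read off the final derivation step, and `derive` is a hypothetical helper name.

```python
# Rules of grammar (i). "" is the empty string that ends a derivation from T
# (an assumption inferred from the step aabaT --> aaba).
RULES = {"S": ["aT", "bS"], "T": ["aS", "bT", ""]}

def derive(start, choices):
    """Rewrite the single variable in the string with each chosen right-hand side."""
    string = start
    steps = [string]
    for rhs in choices:
        var = next(ch for ch in string if ch in RULES)  # the only variable present
        if rhs not in RULES[var]:
            raise ValueError(f"{var} --> {rhs!r} is not a rule")
        string = string.replace(var, rhs, 1)
        steps.append(string)
    return steps

steps = derive("S", ["aT", "aS", "bS", "aT", ""])
print(" --> ".join(steps))
```

Because the grammar is regular, every intermediate string contains exactly one variable, at its right end, which is why a single lookup suffices.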

Stochastic Grammars
The grammars above classify every string as belonging to the language or not. Every variable has a finite set of substitution rules. Assigning a probability to the use of each rule assigns probabilities to the strings in the language. If a string has a unique derivation, its probability is the product of the probabilities of the rules applied.
i. Start with S.
   S --> (0.3)aT | (0.7)bS
   T --> (0.2)aS | (0.4)bT | (0.2)ε
   S --> aT --> aaS --> aabS --> aabaT --> aaba
   with probability 0.3 * 0.2 * 0.7 * 0.3 * 0.2 = 0.00252
ii. S --> (0.3)aSa | (0.5)bSb | (0.1)aa | (0.1)bb
   S --> aSa --> abSba --> abaaba
   with probability 0.3 * 0.5 * 0.1 = 0.015
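The product rule for a derivation can be sketched as follows. The rule probabilities are the ones printed on the slide; the 0.2 attached to the empty right-hand side of T is an assumption (as printed, T's probabilities sum to 0.8), and the function name is hypothetical.

```python
import math

# (variable, right-hand side) -> probability, as printed on the slide.
# ("T", "") is the assumed terminating rule with the trailing 0.2.
PROBS_I = {
    ("S", "aT"): 0.3, ("S", "bS"): 0.7,
    ("T", "aS"): 0.2, ("T", "bT"): 0.4, ("T", ""): 0.2,
}
PROBS_II = {
    ("S", "aSa"): 0.3, ("S", "bSb"): 0.5, ("S", "aa"): 0.1, ("S", "bb"): 0.1,
}

def derivation_probability(rules_used, probs):
    """Probability of a derivation: the product of its rule probabilities."""
    return math.prod(probs[rule] for rule in rules_used)

# S --> aT --> aaS --> aabS --> aabaT --> aaba
p_i = derivation_probability(
    [("S", "aT"), ("T", "aS"), ("S", "bS"), ("S", "aT"), ("T", "")], PROBS_I)
# S --> aSa --> abSba --> abaaba
p_ii = derivation_probability(
    [("S", "aSa"), ("S", "bSb"), ("S", "aa")], PROBS_II)
print(p_i, p_ii)
```

When a string has several derivations, its probability is the sum of such products over all of them.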

Gene Describers
Simple Prokaryotic Genes: [figure not reproduced]
Simple Eukaryotic Genes: [figure not reproduced]

Secondary Structure Generators
S --> LS | L
F --> dFd | LS
L --> s | dFd
Here s is a single unpaired base and d ... d are two bases that pair with each other.
Show alternative grammars.
Put on Chomsky normal form.
Knudsen & Hein, 99
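To see what kind of structures this grammar generates, one can sample from it. The sketch below writes s as "." and a pair d ... d as "(" ... ")" (dot-bracket notation); the rule probabilities are illustrative assumptions, not the values estimated by Knudsen & Hein.

```python
import random

# Right-hand sides with ASSUMED probabilities (chosen so sampling terminates).
RULES = {
    "S": [(0.5, ["L", "S"]), (0.5, ["L"])],
    "F": [(0.5, ["(", "F", ")"]), (0.5, ["L", "S"])],
    "L": [(0.9, ["."]), (0.1, ["(", "F", ")"])],
}

def sample_structure(rng):
    """Expand S left to right, choosing rules at random; return dot-bracket text."""
    stack = ["S"]
    out = []
    while stack:
        sym = stack.pop()
        if sym not in RULES:          # terminal symbol: emit it
            out.append(sym)
            continue
        u = rng.random()
        acc = 0.0
        for p, rhs in RULES[sym]:
            acc += p
            if u < acc:
                break
        stack.extend(reversed(rhs))   # leftmost symbol is expanded next
    return "".join(out)

structure = sample_structure(random.Random(0))
print(structure)
```

Every sampled string has balanced brackets, because brackets are only ever produced together by the rule dFd.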

Structure Dependent Evolution Models
1. Protein Secondary Structure Dependent (Goldman, Thorne & Jones)
   α, β & loop each have their own mutation rate matrix (20 × 20): Rα, Rβ & Rloop
2. RNA Secondary Structure Dependent
   i. R singlet,singlet (4 × 4)
   ii. R doublet,doublet (16 × 16) (base pair conserving relative to R singlet,singlet × R singlet,singlet)
3. Gene Structure Dependent
   i. Rnon-coding{ATG-->GTG}
   ii. Rcoding{ATG-->GTG}
   iii. Other structural categories, regulatory signals, …

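The Genetic Code
3 classes of sites: 4 (all four nucleotides give the same amino acid), 2-2 (the nucleotides split into two synonymous pairs) and 1-1-1-1 (every change alters the amino acid).
Examples on the slide (codons partly lost): 4: … (3rd position); 2-2: … (3rd position); 1-1-1-1: TA… (2nd position).
Problems:
i. Not all sites fit into those categories.
ii. A change in one site can change the status of another.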

Kimura’s 2 parameter model & Li’s Model.
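Rates: a for transitions, b for transversions. [Figure: substitution diagram with arrows labelled a and b; the matching probability diagram is not reproduced.]
Selection on the 3 kinds of sites, (a,b) --> (?,?):
1-1-1-1: (f*a, f*b)
2-2: (a, f*b)
4: (a, b)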
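The two-parameter model and the selection factor f can be sketched in a few lines. This uses the textbook Kimura parametrisation (transition rate alpha, rate beta for each of the two transversions), which may be scaled differently from the formulas on the next slide; the function names are hypothetical.

```python
import math

def k2p_probs(alpha, beta, t):
    """Kimura two-parameter probabilities after time t.

    Returns (P(identity), P(transition), P(one specific transversion)).
    """
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    p_same = 0.25 * (1.0 + e1 + 2.0 * e2)
    p_ts = 0.25 * (1.0 + e1 - 2.0 * e2)
    p_tv = 0.25 * (1.0 - e1)
    return p_same, p_ts, p_tv

def li_site_rates(alpha, beta, f):
    """Rates (transition, transversion) per site class under selection factor f."""
    return {
        "1-1-1-1": (f * alpha, f * beta),  # every change is a replacement
        "2-2": (alpha, f * beta),          # only transversions are replacements
        "4": (alpha, beta),                # every change is synonymous
    }

print(k2p_probs(0.3, 0.1, 1.0))
print(li_site_rates(0.3, 0.1, 0.5))
```

Setting alpha equal to beta recovers the Jukes-Cantor probabilities.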

alpha-globin from rabbit and mouse.
Ser Thr Glu Met Cys Leu Met Gly Gly
TCA ACT GAG ATG TGT TTA ATG GGG GGA
  *   *  *    *  *  *         * **
TCG ACA GGG ATA TAT CTA ATG GGT ATA
Ser Thr Gly Ile Tyr Leu Met Gly Ile

Sites     Total  Conserved    Transitions  Transversions
1-1-1-1   274    246 (.8978)  12 (.0438)   16 (.0584)
2-2        77     51 (.6623)  21 (.2727)    5 (.0649)
4          78     47 (.6026)  16 (.2051)   15 (.1923)

Z(αt,βt) = .50[1+exp(-2αt) - 2exp(-t(α+β))] (transition)
Y(αt,βt) = .25[1-exp(-2βt)] (transversion)
X(αt,βt) = .25[1+exp(-2αt) + 2exp(-t(α+β))] (identity)

L(observations,a,b,f) = C(429; 274,77,78) *
   {X(a*f,b*f)^246 * Y(a*f,b*f)^12 * Z(a*f,b*f)^16} *
   {X(a,b*f)^51 * Y(a,b*f)^21 * Z(a,b*f)^5} *
   {X(a,b)^47 * Y(a,b)^16 * Z(a,b)^15}
where a = αt and b = βt.

Estimated parameters: a, b, 2*b, (a + 2*b) and f, with transition and transversion rates (a*f, 2*b*f), (a, 2*b*f) and (a, 2*b) for the three site classes. The numerical estimates are missing from this transcript.

Expected number of replacement and synonymous substitutions:
Replacement sites: (0.3742/0.6744)*77 = (value missing)
Silent sites: (value missing)
Ks = (value missing), Ka = .1127
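The likelihood above can be evaluated directly from the nine counts. The sketch below is an assumption-laden illustration: it uses the textbook Kimura probabilities (transition rate a, rate b for each of the two transversions, both already multiplied by t) rather than the X, Y, Z expressions as printed, drops the constant multinomial coefficient, and maximises over a coarse grid instead of a proper optimiser.

```python
import math

# (conserved, transitions, transversions) per site class, from the likelihood.
COUNTS = {"1-1-1-1": (246, 12, 16), "2-2": (51, 21, 5), "4": (47, 16, 15)}

def k2p(a, b):
    """Textbook Kimura probabilities: (identity, transition, any transversion)."""
    e1 = math.exp(-4.0 * b)
    e2 = math.exp(-2.0 * (a + b))
    return (0.25 * (1 + e1 + 2 * e2), 0.25 * (1 + e1 - 2 * e2), 0.5 * (1 - e1))

def log_likelihood(a, b, f):
    """Log-likelihood of the counts, without the multinomial constant."""
    scaled = {"1-1-1-1": (a * f, b * f), "2-2": (a, b * f), "4": (a, b)}
    total = 0.0
    for cls, counts in COUNTS.items():
        probs = k2p(*scaled[cls])
        total += sum(n * math.log(p) for n, p in zip(counts, probs))
    return total

# Coarse grid 0.05 .. 1.00 for each parameter (an arbitrary choice).
GRID = [i * 0.05 for i in range(1, 21)]
best = max(((log_likelihood(a, b, f), a, b, f)
            for a in GRID for b in GRID for f in GRID), key=lambda r: r[0])
print("log-lik %.2f at a=%.2f b=%.2f f=%.2f" % best)
```

A real analysis would replace the grid with a numerical optimiser and report standard errors; the grid only shows how the three site classes share the parameters a, b and f.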

Three Questions
1. What is the probability of the data?
2. What is the most probable "hidden" configuration?
3. What is the probability of a specific "hidden" state?
HMM/Stochastic Regular Grammar: [Figure: observations O1 … O10 emitted by hidden states H1, H2, H3.]
Stochastic Context Free Grammars: [Figure: a variable W covering positions i to j of a sequence 1 … L, split into WL and WR at i' and j'.]
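For an HMM the three questions are answered by the forward algorithm, the Viterbi algorithm and the forward-backward posterior. The sketch below uses a toy two-state model; the state names and every probability are illustrative assumptions, not parameters from the talk.

```python
# Toy two-state HMM; all numbers are ASSUMED for illustration.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.2, "noncoding": 0.8}}
EMIT = {"coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def forward(obs):
    """f[k][s] = P(obs[0..k], hidden state at k is s)."""
    f = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({s: EMIT[s][x] * sum(prev[r] * TRANS[r][s] for r in STATES)
                  for s in STATES})
    return f

def backward(obs):
    """b[k][s] = P(obs[k+1..] | hidden state at k is s)."""
    b = [{s: 1.0 for s in STATES}]
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(TRANS[s][r] * EMIT[r][x] * nxt[r] for r in STATES)
                     for s in STATES})
    return b

def viterbi(obs):
    """Return (joint probability, path) of the most probable hidden path."""
    best = {s: (START[s] * EMIT[s][obs[0]], [s]) for s in STATES}
    for x in obs[1:]:
        new = {}
        for s in STATES:
            p, path = max(((best[r][0] * TRANS[r][s], best[r][1]) for r in STATES),
                          key=lambda item: item[0])
            new[s] = (p * EMIT[s][x], path + [s])
        best = new
    return max(best.values(), key=lambda item: item[0])

obs = "GCGATA"
f, b = forward(obs), backward(obs)
p_data = sum(f[-1].values())                        # question 1
p_best, best_path = viterbi(obs)                    # question 2
posterior = [{s: f[k][s] * b[k][s] / p_data for s in STATES}
             for k in range(len(obs))]              # question 3
print(p_data, best_path, posterior[0])
```

For stochastic context-free grammars the same three questions are answered by the inside, CYK and inside-outside algorithms, which recurse over subsequences i … j instead of prefixes.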

Comparative Gene Finding Jakob Skou Pedersen & Hein, 2004

Knudsen & Hein, 99

From Knudsen & Hein (1999)

Knudsen and Hein, 2003

Why combine RNA & Protein Models?
Short term/long term evolution discrepancies.
Separating selective effects.
Analyzing one level without interference from the other level.
Predicting gene structure and RNA structure better.
Annotation of viral genomes.

Combining Levels of Selection.
Assume multiplicativity: fA,B = fA*fB
Protein-Protein:
   Hein & Støvlbæk, 1995: Codon Nucleotide Independence Heuristic
   Jensen & Pedersen, 2001: Contagious Dependence
Protein-RNA:
   Singlets and doublets
   Contagious Dependence

Overlapping Coding Regions
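Hein & Stoevlbaek, 95
Rates (transition, transversion) for a site, by its class in each reading frame. Columns give the class in the frame with selection factor f1, rows the class in the frame with selection factor f2:

            1-1-1-1          2-2            4
1-1-1-1   (f1f2a, f1f2b)   (f2a, f1f2b)   (f2a, f2b)
2-2       (f1a, f1f2b)     (a, f1f2b)     (a, f2b)
4         (f1a, f1b)       (a, f1b)       (a, b)

Example: Gag & Pol from HIV. [Figure: the overlapping gag and pol reading frames, with site counts by class.]
MLE: a, b, a+2b and fgag (values missing from this transcript); fpol = .229
Ziheng Yang has an alternative model, in which sites are lumped into the same category if they have the same configuration of positions and reading frames.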
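Under the multiplicativity assumption, the rate pair for any combination of site classes follows from one rule: a frame contributes its factor to the transition rate when the site is 1-1-1-1 in that frame, and to the transversion rate when the site is 1-1-1-1 or 2-2. A minimal sketch, with a hypothetical function name and made-up parameter values:

```python
def site_rates(class1, class2, a, b, f1, f2):
    """Return (transition rate, transversion rate) for a site that is of
    class1 in reading frame 1 and class2 in reading frame 2."""
    ts, tv = a, b
    for cls, f in ((class1, f1), (class2, f2)):
        if cls == "1-1-1-1":      # every change is a replacement in this frame
            ts *= f
            tv *= f
        elif cls == "2-2":        # only transversions are replacements
            tv *= f
        elif cls != "4":          # fourfold sites are unconstrained
            raise ValueError(f"unknown site class: {cls}")
    return ts, tv

# Hypothetical values, chosen to make the products easy to read.
CLASSES = ("1-1-1-1", "2-2", "4")
for c2 in CLASSES:
    print(c2, [site_rates(c1, c2, a=1.0, b=1.0, f1=0.5, f2=0.25) for c1 in CLASSES])
```

With a = b = 1 the printed pairs show directly which factors apply to each of the nine cells.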

HIV2 Analysis
Hasegawa, Kishino & Yano substitution model.
Parameters: α*t, β*t, pA, pC, pG, pT
Selection factors (with s.d.) for GAG, POL, VIF, VPR, TAT, REV, VPU, ENV and NEF.
Estimated distance per site.
[Table values are missing from this transcript.]

Evolution under double constraints
Codon Nucleotide Independence Heuristic
Singlet: Ri,j = f * qi,j
Doublet: R(i1,i2),(j1,j2) = f1 * f2 * q(i1,i2),(j1,j2)

Structure Prediction: Hepatitis C Analysis
[Figure: predicted RNA secondary structure for a Hepatitis C sequence, with nucleotides drawn along stems and loops.]

Evolution Models: A hierarchy of hypotheses
[Figure: hierarchy of nested models 1 to 5 for singlet and doublet positions, compared by likelihood (the L values are missing from this transcript). Recoverable labels: ts/tv = 2.00; three ts/tv ratios for positions 1, 2, 3 = 1.50, 1.26, 3.05; three (ts/tv, equilibrium) sets; doublet/singlet ratio 0.173, 0.415, 0.414, 0.292; codon factors (f1: 0.24, f2: 0.14); duplet distortion; number of parameters 7, 9, 15, 17.]

Combined RNA & Protein Structure
Gene structure fixed, RNA structure stochastic: presently being implemented with viral analysis in mind.
Both RNA & gene structure stochastic: would imply gene finding as well. A grammar for overlapping genes is a new phenomenon.
Gene structure stochastic, RNA structure fixed: an untypical situation.
A challenge for the future: structure evolution.

Open Problems
Stacking substitution models: in principle a 4^4 × 4^4 matrix (65,536 entries!!) is needed, but proper parametrisation and symmetries could reduce this substantially. [Figure: two stacked base pairs formed by nucleotides N1, N2, N3, N4.]
Other sets of constraints: regulatory signals.
Combining with alignment. [Figure: example alignment.]

References.
Hein, J & J. Stoevlbaek (1995) “A maximum-likelihood approach to analyzing nonoverlapping and overlapping reading frames” J. Mol. Evol.
Jensen, JL & Pedersen (2001) “Probabilistic models of DNA sequence evolution with context dependent rates of substitution” Adv. Appl. Prob.
Katz and Burge (2003) “Widespread Selection for Local RNA Secondary Structure in Coding Regions of Bacterial Genes” Genome Research
Kirby, AK, SV Muse & W. Stephan (1995) “Maintenance of pre-mRNA secondary structure by epistatic selection” PNAS
Knudsen & Hein (1999) “Predicting RNA Structure using Stochastic Context Free Grammars and Molecular Evolution” Bioinformatics
Knudsen and Hein (2003) “Pfold: RNA secondary structure prediction using stochastic context-free grammars” Nucleic Acids Research
New Influenza gene article???
Meyer and Durbin (2002) “Comparative Ab Initio prediction of Gene Structure using pair HMMs” Bioinformatics
Moulton, V., Zuker, M., Steel, M., Penny, D. and Pointon, R. (2000) “Metrics on RNA Structures” J. Computational Biology 7(1)
Pedersen, AMK & JL Jensen (2001) “A Dependent-Rates Model and an MCMC-Based Methodology for the Maximum-Likelihood Analysis of Sequences with Overlapping Reading Frames” Mol. Biol. Evol.
Pedersen, JS & J. Hein (2003) “Gene finding with a Hidden Markov Model of genome structure and evolution” Bioinformatics
Pedersen, Forsberg, Meyer, Simmonds and Hein (2003) “An evolutionary model for protein coding regions with RNA secondary structure” Manuscript in preparation
Pedersen, Forsberg, Meyer, Simmonds and Hein (2003) “Structure Models” Manuscript in preparation
Schadt, E. & K. Lange (2002) “Codon and Rate Variation Models in Molecular Phylogeny” Mol. Biol. Evol.
Savill, NJ et al. (2001) “RNA Sequence Evolution With Secondary Structure Constraints: Comparison of Substitution Rate Models Using Maximum-Likelihood Methods” Genetics, Jan.
Simmonds, P. and DB Smith (July 1999) “Structural Constraints on RNA Virus Evolution” J. of Virology
Tillier, ERM & RA Collins (1998) “High Apparent Rate of Simultaneous Compensatory Base-Pair Substitutions in Ribosomal RNA” Genetics
Yang, Z. et al. (1995) “Molecular Evolution of the Hepatitis B Virus Genome” J. Mol. Evol.

Acknowledgements
1. Comparative RNA Structure: Bjarne Knudsen
2. Comparative Gene Structure: Jakob Skou Pedersen
3. Integrating Levels of Selection & Structure: Jakob Skou Pedersen, Irmtraud Meyer, Roald Forsberg
