1 Towards a model for -1 frameshift sites
Alain Denise (1,2), Michaël Bekaert (1), Laure Bidou (1), Guillemette Duchateau-Nguyen (1), Jean-Paul Forest (2), Christine Froidevaux (2), Isabelle Hatin (1), Jean-Pierre Rousset (1), Michel Termier (1)
(1) IGM (Institut de Génétique et Microbiologie) (2) LRI (Laboratoire de Recherche en Informatique) Université Paris-Sud, Orsay
2 Translation. mRNA, 5’ to 3’: CAU AUG GAU UAC AUG GUC UAA GAU
3 Translation. CAU AUG GAU UAC AUG GUC UAA GAU. The ribosome reads bases by triplets (or codons), 5’ to 3’, from a START codon
4 Translation. CAU AUG GAU UAC AUG GUC UAA GAU. The ribosome synthesizes one amino acid per codon
9 Translation. CAU AUG GAU UAC AUG GUC UAA GAU. The synthesis goes on until a STOP codon is read. 1 mRNA gives 1 protein
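The reading process on slides 2 to 9 can be sketched in a few lines of Python. This is an illustration only, not part of the original talk: the function name `read_orf` is hypothetical, and the sketch returns the codons that are read rather than the amino acids, to avoid embedding a full codon table.

```python
from typing import List

STOP_CODONS = {"UAA", "UAG", "UGA"}

def read_orf(mrna: str) -> List[str]:
    """Return the codons read from the first AUG up to (not including) a STOP."""
    start = mrna.find("AUG")          # the ribosome starts at a START codon
    if start == -1:
        return []
    codons = []
    for i in range(start, len(mrna) - 2, 3):   # one triplet at a time
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:      # reading ends at the first STOP codon
            break
        codons.append(codon)
    return codons

# The mRNA shown on the slides, without spaces.
print(read_orf("CAUAUGGAUUACAUGGUCUAAGAU"))
# ['AUG', 'GAU', 'UAC', 'AUG', 'GUC']
```

Each codon in the returned list would correspond to one amino acid of the single protein encoded by this mRNA.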
10 Experimental fact: some mRNAs encode two distinct proteins with the same 5’ end
11 Programmed -1 frameshifting. Non-deterministic event. [Figure: ORF1a in the 0 phase, from START 0 to STOP 0 (usual translation); ORF1b in the -1 phase, ending at STOP -1 (-1 frameshift).] 1 mRNA gives 2 distinct proteins in a precise ratio
12 Typical -1 frameshift site [Brierley, 1989] [Figure, 5’ to 3’: AUG; slippery sequence NNX XXY YYZ (P); spacer (SP); secondary structure with labels S1, L1, S2, L2, L’1]
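As a rough illustration of the slippery-sequence pattern X XXY YYZ, a regular expression can locate candidate heptamers. This is a sketch under assumptions: the slide gives no restriction on which bases X, Y and Z may be, so the pattern below only requires three identical X, three identical Y and any Z; `find_slippery` is a hypothetical name.

```python
import re

# X XXY YYZ: three identical bases, three identical bases, then any base.
# The lookahead makes overlapping candidates visible as well.
SLIPPERY = re.compile(r"(?=(([ACGU])\2\2([ACGU])\3\3[ACGU]))")

def find_slippery(seq: str):
    """Return (position, heptamer) for every candidate slippery sequence."""
    return [(m.start(), m.group(1)) for m in SLIPPERY.finditer(seq)]

# Start of the IBV site as shown on the slides: UAU UUA AAC GGG UAC
print(find_slippery("UAUUUAAACGGGUAC"))
# [(2, 'UUUAAAC')]
```

A stricter pattern would need the allowed bases for Y and Z, which are not stated here.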
13 IBV frameshift site [Figure, 5’ to 3’: AUG; slippery sequence UAU UUA AAC; spacer GGGUAC; pseudoknot with stems S1 and S2. The two S1 arms are drawn as UGACGAUGGGG and GCUGAUACCCC.]
14 PK picture ?
15 Translation with frameshift [Figure, 5’ to 3’: AUG; the ribosome reads UAU UUA AAC GGG UAC in the 0 phase; the pseudoknot lies downstream.]
17 Translation with frameshift [Same figure, with the -1 shift marked.]
18 Translation with frameshift. Reading in the -1 phase, 5’ to 3’: UA UUU AAA CGG GUA CGG GGU AGC AGU
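The two readings of the IBV site can be reproduced with a short sketch. The sequence is the one shown on the slides (the 0-phase codons followed by the 5’ side of stem 1). The function names are hypothetical, and the exact position at which the ribosome steps back (just after AAC) is an assumption made for illustration.

```python
def codons(seq, offset):
    """Split seq into complete triplets, starting at index offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def read_with_shift(seq, shift_at):
    """Read 0-phase codons up to index shift_at, step back one base, go on."""
    before = [seq[i:i + 3] for i in range(0, shift_at, 3)]
    after = codons(seq, shift_at - 1)   # the -1 shift re-reads one base
    return before + after

SITE = "UAUUUAAACGGGUAC" + "GGGGUAGCAGU"

print(codons(SITE, 0))
# ['UAU', 'UUA', 'AAC', 'GGG', 'UAC', 'GGG', 'GUA', 'GCA']
print(codons(SITE, 2))
# ['UUU', 'AAA', 'CGG', 'GUA', 'CGG', 'GGU', 'AGC', 'AGU']
print(read_with_shift(SITE, 9))
# ['UAU', 'UUA', 'AAC', 'CGG', 'GUA', 'CGG', 'GGU', 'AGC', 'AGU']
```

The second list is the -1 phase, and the third is what a ribosome would read if it shifted immediately after AAC: the C is read twice.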
22 Goals: to improve the known model for viral frameshift sites; to identify new frameshift sites in viral and non-viral genomes
23 Our approach: biological sequences; formal models; prediction tools; in silico and in vivo validation; applications to other genomes. [Figure labels: represent, explain, predict]
24 IBV frameshift site: spacer 5’ GGGUAC 3’
25 Spacer consensus
Virus          Spacer
HAST-1         UAC AAA
BEV            UGU UG
EAV            UGA GAG
HCV            GAG UC
IBV            GGG UAC
MHV            GGG UU
TGEV           GAG
RCNMV          UAG GC
BWYV           GGA GUG
PLRV           GGG CAA
BLV            UAA UAG A
FIV            UGG AAG GC
HIV-1          GGG AAG AU
HTLV-2         UCC UUA A
JSR            UGG GUG A
MMTV gag-pro   UUG UAA A
MMTV pro-pol   UGA U
RSV            UAG GGA
SRV-1          GGA CUG A
Consensus      UGG UAG A
               GAA GUA
26 Lab experiments [Figure: two pSV40 lacZ-luc constructs. Test construct: FS signal, luc in the -1 phase. Control construct: FS signal N, luc in the 0 phase. lacZ: expression reporter; luc: FS reporter.]
27 Spacer: lab experiments
Construct       Spacer   Relative FS rate
wild-type IBV   GGGUA    100
U mutant        UGGUA    100
A mutant        AGGUA    55
C mutant        CGGUA    32
CC mutant       CCGUA    70
CCU mutant      CCUUA    49
28 Refining the model: machine learning. To identify relevant properties that characterize FS sites. Disjunctive learning: not all sequences frameshift for the same reasons [Giedroc et al., 2000]
29 Annotating data: spacer 5’ GGGUAC 3’
30 Example of data: SP
SP = GGGUAC
–number of A = 1; C = 1; G = 3; U = 1
–% of A = 17; C = 17; G = 50; U = 17
–first = G
–last = C
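The spacer attributes listed on this slide (base counts, percentages, first and last base) can be derived mechanically from the sequence. A minimal sketch follows; the function name `annotate` and the dictionary layout are assumptions, not the format used by the authors.

```python
from collections import Counter

def annotate(seq):
    """Compute simple composition attributes for one sub-sequence."""
    counts = Counter(seq)
    return {
        "length": len(seq),
        "count": {b: counts.get(b, 0) for b in "ACGU"},
        # percentages rounded to the nearest integer
        "percent": {b: round(100 * counts.get(b, 0) / len(seq)) for b in "ACGU"},
        "first": seq[0],
        "last": seq[-1],
    }

print(annotate("GGGUAC"))
```

The same function could be applied to the other annotated regions (stems and loops), since it only depends on the primary sequence.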
31 Annotating data: stem 1 [Figure, 5’ to 3’: the two arms of stem 1, drawn as UGACGAUGGGG and GCUGAUACCCC]
32 Example of data: stem 1
S1 =
–5' side: GGGGUAGCAGU
–3' side: CCCCAUAGUCG
–stability: -20.7 kcal/mol
33 Annotating data: full sequence [Figure, 5’ to 3’: slippery sequence U UUA AAC; spacer GGGUAC; pseudoknot, with the stem 1 arms drawn as UGACGAUGGGG and GCUGAUACCCC]
34 Example of data : FS rate FS rate = 22 %
35 GloBo: disjunctive learning algorithm; suited to small amounts of data; won the PTE challenge on analogous data
36 Example of rules
If SP length 5 and number of G in S1.5’ bottom half 3 and number of G in S1.5’ 4 and %T in S2.5’ 30 and %G in S2.5’ 70 then FS rate 5%
If %G in S1.5' bottom half 80 and %C in L1 45 then FS rate 5%
If SP length 5 and S1.3' length 6 and %C in S1.3' 45 then FS rate 5%
...
37 Covering and prediction
If SP length 5 and number of G in S1.5’ bottom half 3 and number of G in S1.5’ 4 and %T in S2.5’ 30 and %G in S2.5’ 70 then FS rate 5%
Covering of examples: 70 %
Examples predicted in test set: 80 %
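To make the notion of covering concrete, a rule can be treated as a conjunction of conditions and its covering as the percentage of examples that satisfy all of them. The sketch below is hypothetical throughout: the comparison operators are not legible on the slide, so the operators, thresholds, attribute names and toy examples are invented for illustration and are not the authors' rule or data.

```python
from typing import Callable, Dict, List

Example = Dict[str, float]
Condition = Callable[[Example], bool]

def rule_covers(conditions: List[Condition], example: Example) -> bool:
    """A conjunctive rule covers an example if every condition holds."""
    return all(cond(example) for cond in conditions)

def covering(conditions: List[Condition], examples: List[Example]) -> float:
    """Percentage of examples covered by the rule."""
    covered = sum(rule_covers(conditions, e) for e in examples)
    return 100.0 * covered / len(examples)

# Hypothetical rule: operators and thresholds are assumptions.
toy_rule = [
    lambda e: e["sp_length"] <= 5,
    lambda e: e["pct_C_S1_3p"] >= 45,
]

# Invented examples, not experimental data.
toy_examples = [
    {"sp_length": 5, "pct_C_S1_3p": 50},
    {"sp_length": 6, "pct_C_S1_3p": 50},
    {"sp_length": 5, "pct_C_S1_3p": 30},
    {"sp_length": 4, "pct_C_S1_3p": 45},
]

print(covering(toy_rule, toy_examples))  # 50.0
```

Applying the same function to a held-out test set would give a figure comparable in spirit to the "examples predicted in test set" value.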
38 Is R1 relevant for frameshift?
Construct       Stem 1 5’-side   Relative FS rate   R1
wild-type IBV   GGGGU AUCAGU     100                yes
mutant 1        GGUCG AUCAGU     41                 yes
mutant 2        GGGGU UCUACA     55                 yes
mutant 3        GCUCG AUCAGU     36                 no
mutant 4        GCCCU AUCAGU     73                 no
39 Covering and prediction
If SP length 5 and S1.3' length 6 and %C in S1.3' 45 then FS rate 5%
Covering of examples: 45 %
Examples predicted in test set: 40 %
40 Conclusion Spacer: –correlation between primary sequence and FS rate has been established –systematic experimentation going on
41 Conclusion: biological sequences; formal models; prediction tools; in silico and in vivo validation; applications to other genomes
42 GloBo rule covering Run 1 Run 2 Run 3... Rule 1 70 % 80 % 80 % Rule 2 35 % 35 % 40 % Rule 3 45 % 45 % 65 % Rule 4 40 % 50 % 40 % Rule 5 55 % 45 % Rule 6 40 % Average covering of Rule 1 = 80 %
43 Examples of rule 1 SP length 5 and number of G in S1.5’ bottom half 3 and number of G in S1.5’ 4 and %T in S2.5’ 30 and %C in S2.3’ % SP length 5 and number of G in S1.5’ bottom half 3 and %C in S1.5’ 45 and number of T in S2.5’ 1 80 % SP length 5 and S1.5' length 6 and number of G in S1.5’ 4 and number of T in S2.5' 1 and %C in S2.3’ %
45 Conclusion and perspectives Spacer: –correlation between primary sequence and FS rate has been established –systematic experimentation going on Learning: –relevant rules –experimentation enriches data –quantitative approach
46 Future work Interaction between sub-sequences Kinetics of frameshift
47 Current model and future work [Figure: AUG, NNX XXY YYZ, NNN; labels P, SP, S1, L1, S2, L2, L’1]
48 Outline: biological problem and motivation of study; existing work; towards building a finer model; conclusion and future work
49 Translation. CAU AUG GAU UAC AUG GUC UAA GAU. Protein synthesis begins with a START triplet. Each codon then gives an amino acid. The process ends with a STOP triplet. 1 mRNA gives 1 protein [Figure labels: mRNA, protein]
50 Spacer: only its length has been systematically studied so far; its primary sequence is relevant as well
51 Ongoing work: a program that looks for potential frameshift sites. Main issues:
–to select a reasonable number of candidate sequences
–to find actual pseudoknots in a reliable way [Isambert and Siggia, 2001]
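A very crude candidate scanner illustrates why both issues arise. The sketch below is not the program mentioned on the slide: it looks for an X XXY YYZ heptamer, then checks whether a short arm after the spacer has a perfect reverse complement further downstream. The spacer range, arm length and window size are arbitrary assumptions, and a single stem is a much weaker criterion than a real pseudoknot.

```python
import re

SLIPPERY = re.compile(r"(?=(([ACGU])\2\2([ACGU])\3\3[ACGU]))")
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def revcomp(s):
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return "".join(PAIR[b] for b in reversed(s))

def candidate_sites(seq, spacer_range=(5, 7), arm_len=4, window=40):
    """Return (slippery position, spacer length) for each crude candidate."""
    hits = []
    for m in SLIPPERY.finditer(seq):
        start = m.start()
        for sp in range(spacer_range[0], spacer_range[1] + 1):
            arm_start = start + 7 + sp            # heptamer, then spacer
            arm = seq[arm_start:arm_start + arm_len]
            if len(arm) < arm_len:
                continue
            downstream = seq[arm_start + arm_len:arm_start + arm_len + window]
            if revcomp(arm) in downstream:        # possible stem partner
                hits.append((start, sp))
    return hits

# IBV regions P, SP, S1.5', L1, S2.5', S1.3' as given in the data example.
IBV = "UUUAAAC" + "GGGUAC" + "GGGGUAGCAGU" + "G" + "GAGGCUCG" + "GCUGAUACCCC"
print(candidate_sites(IBV))  # [(0, 6), (0, 7)]
```

Even on this one site the scanner reports two spacer lengths, and on a whole genome such a loose test would return many candidates.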
52 3G 4G
Observations (in vitro)
IBV       ..gggguaucagu....gcugauacccc..   30%
MHV       ..cgggguacaag....cuuguacccug..   30%
RSV       ..gggccacug....caguggccc..       5%
Constructs preserving the guanine distribution (ΔG° in kcal/mol; rate in vivo)
IBV       ..gggguaucagu....gcugguacccc..   -20.7   22%
mutant1   ..ggucgaucagu....gcuggucgacc..   -20.3   9%
mutant2   ..gggguucuaca....uguagaacccc..   -22.4   12%
mutant3   ..gcgcgcccgcc....ggcgggcgcgc..   -30.7   x%
Constructs NOT preserving the guanine distribution
mutant4   ..gcucgaucagu....gcuggucgagc..   -20.3   8%
mutant5   ..gcccuaucagu....gcugguagggc..   -20.7   16%
mutant6   ..gccggcccccc....ggggggccggc..   -31.7   x%
53 Spacer: lab experiments (mouse)
Spacer   FS efficiency
GGGTAC   14 ± %
AGGTAC   13 ± %
CGGTAC   8.9 ± %
CCGTAC   12.5 ± %
CCTTAC   21 ± %
54 Recent studies Scanning databases to count frameshift-like sites: [Hammell et al. 1999] Using Stochastic Context-Free Grammars: [Liphardt 1999]
55 Why do we study frameshifting? To properly annotate genomes. To find frameshift sites in other organisms
56 First results: pointed to new relevant attributes, such as the position of the first mismatch in S1
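The "position of first mismatch in S1" attribute can be computed from the two sides of the stem. The sketch below assumes the convention of the stem 1 data example, where the 3' side is written so that each base faces its partner on the 5' side. Whether a G-U wobble pair counts as a match is an assumption, exposed here as a parameter; the function name is hypothetical.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}

def first_mismatch(five_side, three_side, allow_wobble=True):
    """1-based position of the first non-pairing column, or None if all pair."""
    allowed = (WATSON_CRICK | WOBBLE) if allow_wobble else WATSON_CRICK
    for pos, pair in enumerate(zip(five_side, three_side), start=1):
        if pair not in allowed:
            return pos
    return None

# Stem 1 sides as written in the stem 1 data example.
print(first_mismatch("GGGGUAGCAGU", "CCCCAUAGUCG"))  # 7
```

For the sides as transcribed here, the first six columns pair and the seventh (G facing A) does not.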
57 Example of data: IBV
family= Coronaviridae
genus= Coronavirus
name= Infectious avian bronchitis virus
gene1= ORF1a
gene2= ORF1b
article= Review Brierley 1995
wild type= yes
modified part= none
P= {UUUAAAC}
SP= {GGGUAC}
S1.5'= {GGGGUAGCAGU}
L1= {G}
S2.5'= {GAGGCUCG}
L1'= {}
S1.3'= {GCUGAUACCCC}
L2= {UUGCUAGUGGAUGUGAUCCUGAUGUUGUAAAG}
S2.3'= {CGAGCCUU}
S1= { stem1= GGGGTAGCAGT stem2= CCCCATAGTCG stability= -20.7 }
S2= { stem1= GAGGCTCG stem2= TTCCGAGC stability= unknown }
global stability= unknown
definite secondary structure= yes
L1.folding= no
L1'.folding= no
L2.folding= no
efficiency= RRL 30%
efficiency= XO 30%
58 Spacer
59 Example of rules
if SP length 5 and number of Gs in S1.5’ bottom half 3 and number of Gs in S1.5’ 4 and %T in S2.5’ 30 and %C in S2.3’ 75
or % G in S1.5' bottom half 80 and %C in L1 45
or SP length 5 and S1.3' length 6 and %C in S1.3'
or SP length 5 and number of Gs in S1.5’ bottom half 3 and %C in S1.3’ 70 and %G in S2.3’ 45
or number of As in S1.5' = 0 and number of As in S2.3' = 0
then %FS 5
60 GloBo: main ideas
–takes each example as a seed
–adds other examples to the seed's subset as long as the least general generalization of the subset covers no counterexample
–heuristically selects subsets so that all examples are covered
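The three ideas above can be sketched as follows. This is not the GloBo implementation: it assumes numeric attributes, takes the least general generalization to be the per-attribute interval hull, adds examples in index order, and uses a greedy selection as the heuristic. All names and the toy data are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Example = Dict[str, float]
Hull = Dict[str, Tuple[float, float]]

def lgg(examples: List[Example]) -> Hull:
    """Least general generalization, taken here as a per-attribute interval."""
    keys = examples[0].keys()
    return {k: (min(e[k] for e in examples), max(e[k] for e in examples))
            for k in keys}

def covers(hull: Hull, example: Example) -> bool:
    return all(lo <= example[k] <= hi for k, (lo, hi) in hull.items())

def grow_from_seed(seed: int, positives, negatives) -> FrozenSet[int]:
    """Add examples to the seed while the LGG covers no counterexample."""
    members = [seed]
    for i in range(len(positives)):
        if i == seed:
            continue
        candidate = lgg([positives[j] for j in members + [i]])
        if not any(covers(candidate, n) for n in negatives):
            members.append(i)
    return frozenset(members)

def select_cover(subsets: List[FrozenSet[int]], n: int) -> List[FrozenSet[int]]:
    """Greedy heuristic: keep taking the subset covering most uncovered examples."""
    uncovered = set(range(n))
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(s & uncovered))
        chosen.append(best)
        uncovered -= best
    return chosen

# Invented toy data: three examples and one counterexample.
positives = [{"sp_len": 5, "g_s1": 4}, {"sp_len": 6, "g_s1": 4}, {"sp_len": 5, "g_s1": 1}]
negatives = [{"sp_len": 5, "g_s1": 2}]

subsets = [grow_from_seed(i, positives, negatives) for i in range(len(positives))]
chosen = select_cover(subsets, len(positives))
print([sorted(s) for s in chosen])  # [[0, 1], [2]]
```

Each chosen subset would yield one rule (its generalization), and the disjunction of these rules covers every example, which matches the disjunctive flavour of the learned rules.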