Published by Garrett Roth. Modified over 9 years ago.
1
Algorithmics of -1 frameshift RNA sequences
Michaël Bekaert (1), Laure Bidou (1), Alain Denise (1,2), Guillemette Duchateau-Nguyen (1), Céline Fabret (1), Jean-Paul Forest (2), Christine Froidevaux (2), Isabelle Hatin (1), Jean-Pierre Rousset (1), Michel Termier (1)
(1) IGM (Institut de Génétique et Microbiologie)
(2) LRI (Laboratoire de Recherche en Informatique)
Université Paris-Sud, Orsay
2
Flow of genetic information
Replication: DNA to DNA. Transcription: DNA to RNA. Translation: RNA to protein.
DNA sequence: CATATGGATTACATGGTCTAAGAT
RNA sequence: CAU AUG GAU UAC AUG GUC UAA GAU
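The transcription step on this slide can be sketched in a few lines. This is a minimal illustration, assuming the DNA shown is the coding strand (so the mRNA is the same sequence with T replaced by U); the function names `transcribe` and `codons` are hypothetical.

```python
def transcribe(dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def codons(rna: str, start: int = 0) -> list[str]:
    """Split an RNA string into complete triplets, beginning at index start."""
    return [rna[i:i + 3] for i in range(start, len(rna) - 2, 3)]


rna = transcribe("CATATGGATTACATGGTCTAAGAT")
print(" ".join(codons(rna)))  # CAU AUG GAU UAC AUG GUC UAA GAU
```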
3
Translation
mRNA: 5’ CAU AUG GAU UAC AUG GUC UAA GAU 3’
4
Translation
5’ CAU AUG GAU UAC AUG GUC UAA GAU 3’
The ribosome reads bases by triplets (codons), starting from a START codon
5
Translation
5’ CAU AUG GAU UAC AUG GUC UAA GAU 3’
The ribosome synthesizes one amino acid per codon
6
Translation
5’ CAU AUG GAU UAC AUG GUC UAA GAU 3’
10
Translation
5’ CAU AUG GAU UAC AUG GUC UAA GAU 3’
The synthesis goes on until a STOP codon is read
1 mRNA gives 1 protein
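The reading process described on the translation slides (start at a START codon, one amino acid per codon, stop at a STOP codon) can be sketched as follows. This is a simplified illustration: it assumes translation begins at the first AUG found in the string, and the codon table is deliberately partial, covering only the codons of the slide's example.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Partial codon table: only the codons that occur in the slide's example.
AMINO_ACID = {"AUG": "Met", "GAU": "Asp", "UAC": "Tyr", "GUC": "Val"}


def translate(rna: str) -> list[str]:
    """Read triplets from the first AUG until a STOP codon (or the end)."""
    start = rna.find("AUG")
    if start < 0:
        return []  # no START codon: nothing is translated
    protein = []
    for i in range(start, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            break
        # Codons missing from the partial table are shown as "???".
        protein.append(AMINO_ACID.get(codon, "???"))
    return protein


print("-".join(translate("CAUAUGGAUUACAUGGUCUAAGAU")))  # Met-Asp-Tyr-Met-Val
```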
11
Experimental fact
Some mRNAs encode two distinct proteins that share the same beginning
12
Programmed -1 frameshifting
Non-deterministic event
Usual translation (0 phase): ORF1a, from START to the 0-phase STOP
-1 frameshift (-1 phase): ORF1b, read on to the -1-phase STOP
1 mRNA gives 2 distinct proteins in a precise ratio
13
Typical -1 frameshift site [Brierley, 1989]
5’ AUG ... NNX XXY YYZ (slippery sequence, P), spacer (SP), secondary structure (stems S1 and S2; loops L1, L’1 and L2) 3’
14
IBV frameshift site
5’ AUG ... UAU UUA AAC (slippery sequence) GGGUAC, followed by a pseudoknot (stems S1 and S2) 3’
[Figure: drawing of the pseudoknot]
15
Translation with frameshift
5’ AUG ... UAU UUA AAC GGG UAC, followed by the pseudoknot 3’
17
Translation with frameshift
5’ UAU UUA AAC GGG UAC, followed by the pseudoknot 3’
-1 shift
18
Translation with frameshift
5’ UA UUU AAA CGG GUA CGG GGU AGC AGU 3’
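The change of reading phase shown on these slides can be reproduced with a short sketch. It assumes the shift happens right after the 0-phase codon AAC of the slippery sequence has been read (index 9 of the fragment); the function name and the `slip_at` parameter are hypothetical, and `slip_at` is expected to be a multiple of 3.

```python
def read_with_minus_one_shift(rna: str, slip_at: int) -> list[str]:
    """Read triplets in the 0 phase up to index slip_at, then step back
    one base and keep reading triplets in the -1 phase."""
    before = [rna[i:i + 3] for i in range(0, slip_at, 3)]
    after = [rna[i:i + 3] for i in range(slip_at - 1, len(rna) - 2, 3)]
    return before + after


# IBV fragment from the slides: slippery sequence, spacer, start of stem 1.
fragment = "UAUUUAAACGGGUACGGGGUAGCAGU"
print(" ".join(read_with_minus_one_shift(fragment, slip_at=9)))
```

After the shift, the codons read are CGG GUA CGG GGU AGC AGU, which are the -1 phase triplets displayed on this slide.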
22
Translation: mRNA and ribosome
[Figure] Adapted from Frank et al. by Giedroc et al.
23
[Diagram: workflow linking biological or random sequences, folding, folded sequences, wild-type folded sequences, mutant sequences, score matrix, folded and sorted sequences, rules, voting, model, new FS sites, and in silico and in vivo validation]
24
Search for FS sites: the easy part
Slippery sequence X XXY YYZ in -1 phase with respect to the START codon:
ATG ... NNX XXY YYZ
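The "easy part" can be sketched with a regular expression. This is an illustrative sketch, not the authors' implementation: it assumes a DNA alphabet, takes `start` as the index of the A of ATG, and keeps only heptamers whose first base is the third base of a 0-phase codon (NNX XXY YYZ). The pattern as written also accepts X equal to Y; any further constraint on the bases would be an extra assumption.

```python
import re

# X XXY YYZ: three identical bases, three identical bases, then any base.
# The lookahead lets overlapping candidates be reported.
SLIPPERY = re.compile(r"(?=(([ACGT])\2\2([ACGT])\3\3[ACGT]))")


def slippery_sites(dna: str, start: int) -> list[tuple[int, str]]:
    """Return (position, heptamer) pairs located downstream of the START
    codon and positioned as NNX XXY YYZ relative to it."""
    hits = []
    for m in SLIPPERY.finditer(dna, start + 3):
        if (m.start() - start) % 3 == 2:
            hits.append((m.start(), m.group(1)))
    return hits


print(slippery_sites("ATGGCTTTAAACGGG", start=0))  # [(5, 'TTTAAAC')]
```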
25
Search for FS sites: the not-so-easy part
Search for secondary structure: how does a sequence such as AGGACCT fold?
26
Example of a folded structure
[Figure] Picture from Lyngsø and Pedersen, 2000
27
Folding algorithms
– Algorithms that require aligned sequences
– Zuker’s
– Rivas & Eddy’s
28
Algorithms that require aligned sequences
Not relevant to our problem, since we fold only one sequence at a time
29
Folding using Zuker’s model
– Tractable model based on additive energy minimization
– One sequence gives one folding
– Bases are either single-stranded or paired with a single other base
– Matching interactions must not cross (i.e. pseudoknots are not allowed)
30
Base-pair interactions: nested, disjoint, crossing
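The three cases can be told apart from the positions of the paired bases alone. The sketch below assumes each base pair is given as a pair of sequence indices and that the four positions are distinct; the function names are hypothetical. A structure contains a pseudoknot as soon as two of its pairs cross.

```python
from itertools import combinations


def relation(p: tuple[int, int], q: tuple[int, int]) -> str:
    """Classify two base pairs as 'nested', 'disjoint' or 'crossing'."""
    (i, j), (k, l) = sorted([tuple(sorted(p)), tuple(sorted(q))])
    if j < k:          # i < j < k < l
        return "disjoint"
    if l < j:          # i < k < l < j
        return "nested"
    return "crossing"  # i < k < j < l


def has_pseudoknot(pairs: list[tuple[int, int]]) -> bool:
    """True if at least two base pairs of the structure cross."""
    return any(relation(p, q) == "crossing" for p, q in combinations(pairs, 2))
```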
31
Zuker’s algorithm
Does not find our pseudoknots, even if the two stems are looked for separately
32
Seeking pseudoknots
Rivas and Eddy 1999
– extends Zuker’s algorithm
– accounts for pseudoknots using a more complex recursion (steep time and memory requirements)
– does not work for our problem, probably due to a lack of biological experiments to set the thermodynamic parameters
33
Orpheo Seeks stems separately with adequate parameters
34
Score matrix
      A    T    C    G
A    -6    2   -6   -6
T     2   -6   -6    0
C    -6   -6   -6    4
G    -6    0    4   -6
35
Smith-Waterman algorithm
     A   G   G   A   C   C   T
A        0   0   0   0   0   2
G            0   0   4   6   0
G                0  10   4   0
A                    0   0   2
C                        0   0
C                            0
T
36
Smith-Waterman algorithm
Same table for AGGACCT as on the previous slide
Best local score: 10, giving the stem AGG paired with CCT around the loop A
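The table for AGGACCT can be reproduced with a Smith-Waterman-style recursion applied to a single sequence. The sketch below is an assumption about how the scores combine, not Orpheo's actual code: a cell (i, j) extends the stem whose outer pair is (i-1, j+1), scores never drop below 0, and gaps and bulges are left out because the slides give no gap penalties. The names `stem_matrix` and `best_stem` are hypothetical.

```python
# Pairing scores from the score-matrix slide; every other pair scores -6.
PAIR_SCORE = {("A", "T"): 2, ("T", "A"): 2,
              ("C", "G"): 4, ("G", "C"): 4,
              ("G", "T"): 0, ("T", "G"): 0}
MISMATCH = -6


def stem_matrix(seq: str) -> list[list[int]]:
    """H[i][j] (i < j) is the best score of a gapless stem whose innermost
    pair is (i, j); only the upper triangle is filled."""
    n = len(seq)
    H = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n - 1, i, -1):
            outer = H[i - 1][j + 1] if i > 0 and j < n - 1 else 0
            score = PAIR_SCORE.get((seq[i], seq[j]), MISMATCH)
            H[i][j] = max(0, outer + score)
    return H


def best_stem(seq: str) -> tuple[int, list[tuple[int, int]]]:
    """Return the best score and its base pairs, outermost pair first."""
    H = stem_matrix(seq)
    n = len(seq)
    score, i, j = max((H[a][b], a, b) for a in range(n) for b in range(a + 1, n))
    pairs = []
    while i >= 0 and j < n and H[i][j] > 0:  # walk outward from the best cell
        pairs.append((i, j))
        i, j = i - 1, j + 1
    return score, pairs[::-1]


print(best_stem("AGGACCT"))  # (10, [(0, 6), (1, 5), (2, 4)])
```

On AGGACCT this yields the values 2, 6 and 10 along the anti-diagonal, i.e. the pairs A-T, G-C and G-C closing on the single-base loop A.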
37
Finding pseudoknots anyway
Scores learnt on wild-type sequences
– GC different from CG
– GC score in stem 1 = #GC in stem 1 / stem 1 length
Accounts for bulges and gaps
Needs a threshold to select relevant stems
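The ratio above can be illustrated as follows. This sketch assumes a stem is given as a list of (5’ base, 3’ base) pairs and that "stem length" means the number of base pairs; both are assumptions, as is the function name. Keeping the orientation of each pair is what makes GC distinct from CG.

```python
def oriented_pair_score(pairs: list[tuple[str, str]], five: str, three: str) -> float:
    """Fraction of base pairs whose 5' base is `five` and 3' base is `three`."""
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if (a, b) == (five, three))
    return hits / len(pairs)


# Illustrative stem (not taken from the data set): G-C, G-C, C-G.
stem = [("G", "C"), ("G", "C"), ("C", "G")]
gc = oriented_pair_score(stem, "G", "C")  # 2 of 3 pairs
cg = oriented_pair_score(stem, "C", "G")  # 1 of 3 pairs
```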
39
Finding pseudoknots anyway
[Figure: positions of S1.5’, S1.3’, S2.5’ and S2.3’ within windows of 20 nt and 50 nt; other labels: HL, from L2]
40
Orpheo
– Finds known sites
– Fast: 2 minutes on both strands of S. cerevisiae
– Distinguishes 5’ from 3’ and so implicitly accounts for triple interactions
– Yields around 200 candidates in yeast (including one with 13% efficiency)
41
[Diagram: workflow linking biological or random sequences, folding, folded sequences, wild-type folded sequences, mutant folded sequences, score matrix, folded and sorted sequences, rules, voting, model, new FS sites, and in silico and in vivo validation]
42
Example of a rule
if SP length 5 and number of Gs in S1.5’ bottom half 3 and number of Gs in S1.5’ 4 and %T in S2.5’ 30 and %C in S2.3’ 75
or %G in S1.5’ bottom half 80 and %C in L1 45
or SP length 5 and S1.3’ length 6 and %C in S1.3’
or SP length 5 and number of Gs in S1.5’ bottom half 3 and %C in S1.3’ 70 and %G in S2.3’ 45
or number of As in S1.5’ = 0 and number of As in S2.3’ = 0
then %FS 5
43
[Diagram: workflow linking biological or random sequences, folding, folded sequences, wild-type folded sequences, mutant folded sequences, score matrix, folded and sorted sequences, rules, voting, new FS sites, and in silico and in vivo validation]
44
Refining the model: machine learning
To identify relevant properties that characterize FS sites
Disjunctive learning: not all sequences frameshift for the same reasons [Giedroc et al., 2000] (or do they? [Michiels et al., 2001])
45
Covering and prediction
If SP length 5 and number of G in S1.5’ bottom half 3 and number of G in S1.5’ 4 and %T in S2.5’ 35 and %G in S1.5’ 75
then FS rate 5%
Covering of examples: 70 %
Examples predicted in test set: 80 %
Counterexamples in test set: 0 %
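Evaluating such a rule and measuring how many examples it covers can be sketched as below. The rule is modeled as a disjunction of conjunctions of tests on a feature dictionary. Everything specific here is hypothetical: the feature names, the thresholds, the direction of each comparison (the comparison operators are not legible in the slide text), and the toy examples.

```python
from typing import Callable

Example = dict[str, float]
Condition = Callable[[Example], bool]
Rule = list[list[Condition]]  # outer list: "or"; inner lists: "and"


def rule_fires(rule: Rule, example: Example) -> bool:
    """True if at least one conjunction of the rule holds for the example."""
    return any(all(cond(example) for cond in conj) for conj in rule)


def coverage(rule: Rule, examples: list[Example]) -> float:
    """Fraction of the examples on which the rule fires."""
    return sum(rule_fires(rule, e) for e in examples) / len(examples)


# Hypothetical rule: operators and thresholds are placeholders.
toy_rule: Rule = [
    [lambda e: e["sp_length"] <= 5, lambda e: e["g_in_s1_5p"] >= 4],
    [lambda e: e["a_in_s1_5p"] == 0],
]

# Toy examples for illustration only, not data from the study.
toy_examples = [
    {"sp_length": 5, "g_in_s1_5p": 4, "a_in_s1_5p": 2},
    {"sp_length": 7, "g_in_s1_5p": 2, "a_in_s1_5p": 0},
    {"sp_length": 7, "g_in_s1_5p": 2, "a_in_s1_5p": 1},
    {"sp_length": 6, "g_in_s1_5p": 5, "a_in_s1_5p": 3},
]
print(coverage(toy_rule, toy_examples))  # 0.5
```

The same `coverage` function applied to a held-out set of examples, or to counterexamples, gives the other two percentages reported on the slide.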
46
Search for protein patterns
Goal: to find new frameshift sites outside the known consensus
[Diagram: ORF 1 in the 0 phase, from START to the 0-phase STOP; ORF 2 in the -1 phase, up to the -1-phase STOP; label: known protein patterns]
47
Validation on random sequences
Hypothesis: biologically relevant sequences have been selected and thus are not random
If a feature is relevant, it departs from the mean observed on random sequences
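One common way to test this kind of hypothesis is to compare a statistic measured on the real sequence with its distribution over shuffled copies. The sketch below is an assumption about the procedure, not the authors' protocol: it uses a plain shuffle that preserves base composition and reports a z-score; the `statistic` argument stands for any numeric feature of a sequence.

```python
import random
import statistics
from typing import Callable


def shuffled(seq: str, rng: random.Random) -> str:
    """Return a random permutation of seq (same base composition)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def z_score(observed: float, null_values: list[float]) -> float:
    """Distance of the observed value from the null mean, in standard deviations."""
    mean = statistics.mean(null_values)
    sd = statistics.pstdev(null_values)
    return (observed - mean) / sd if sd > 0 else 0.0


def compare_to_random(seq: str, statistic: Callable[[str], float],
                      n_shuffles: int = 100, seed: int = 0) -> float:
    """z-score of statistic(seq) against n_shuffles shuffled copies of seq."""
    rng = random.Random(seed)
    null = [statistic(shuffled(seq, rng)) for _ in range(n_shuffles)]
    return z_score(statistic(seq), null)
```

A large absolute z-score suggests that the feature is unlikely to arise from base composition alone.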
48
Examples of rule 1
– SP length 5 and number of G in S1.5’ bottom half 3 and number of G in S1.5’ 4 and %T in S2.5’ 30 and %C in S2.3’ 75: 70 %
– SP length 5 and number of G in S1.5’ bottom half 3 and %C in S1.5’ 45 and number of T in S2.5’ 1: 80 %
– SP length 5 and S1.5’ length 6 and number of G in S1.5’ 4 and number of T in S2.5’ 1 and %C in S2.3’ 70: 80 %
50
Experimental results published in Bioinformatics (to be completed)
51
Conclusion and perspectives
Spacer:
– correlation between primary sequence and FS rate has been established
– systematic experimentation going on
Learning:
– relevant rules
– experimentation enriches data
– quantitative approach (get real…)
52
Example of data
IBV
family= Coronaviridae
genus= Coronavirus
name= Infectious avian bronchitis virus
gene1= ORF1a
gene2= ORF1b
article= Review Brierley 1995
wild type= yes
modified part= none
P= {UUUAAAC}
SP= {GGGUAC}
S1.5'= {GGGGUAGCAGU}
L1= {G}
S2.5'= {GAGGCUCG}
L1'= {}
S1.3'= {GCUGAUACCCC}
L2= {UUGCUAGUGGAUGUGAUCCUGAUGUUGUAAAG}
S2.3'= {CGAGCCUU}
S1= { stem1= GGGGTAGCAGT stem2= CCCCATAGTCG stability= -20,7 }
S2= { stem1= GAGGCTCG stem2= TTCCGAGC stability= unknown }
global stability= unknown
definite secondary structure= yes
L1.folding= no
L1'.folding= no
L2.folding= no
efficiency= RRL 30%
efficiency= XO 30%
54
GloBo: main ideas
– Takes each example as a seed
– Agglomerates other examples in a subset if the least general generalization does not cover counterexamples
– Heuristically selects subsets to cover all examples
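The three ideas above can be illustrated with a toy sketch. It is not the GloBo implementation: it assumes examples are numeric feature vectors, takes the least general generalization of a set of examples to be the smallest axis-aligned box containing them, and uses a greedy set cover as the selection heuristic. All function names are hypothetical.

```python
Box = list[tuple[float, float]]  # one (low, high) interval per feature


def lgg(box: Box, example: tuple[float, ...]) -> Box:
    """Smallest box containing both the current box and the example."""
    return [(min(lo, x), max(hi, x)) for (lo, hi), x in zip(box, example)]


def covers(box: Box, example: tuple[float, ...]) -> bool:
    return all(lo <= x <= hi for (lo, hi), x in zip(box, example))


def grow_from_seed(seed, positives, negatives):
    """Start from one example and add every other example whose inclusion
    keeps the generalization free of counterexamples."""
    box = [(x, x) for x in positives[seed]]
    members = {seed}
    for idx, ex in enumerate(positives):
        if idx == seed:
            continue
        candidate = lgg(box, ex)
        if not any(covers(candidate, neg) for neg in negatives):
            box, members = candidate, members | {idx}
    return box, members


def globo_sketch(positives, negatives) -> list[Box]:
    """Grow one subset per seed, then greedily pick subsets until every
    positive example is covered."""
    subsets = [grow_from_seed(i, positives, negatives) for i in range(len(positives))]
    uncovered = set(range(len(positives)))
    chosen = []
    while uncovered:
        box, members = max(subsets, key=lambda s: len(s[1] & uncovered))
        chosen.append(box)
        uncovered -= members
    return chosen


# Toy data for illustration only: two clusters separated by a counterexample.
rules = globo_sketch([(1, 1), (2, 2), (8, 8), (9, 9)], [(5, 5)])
print(rules)  # [[(1, 2), (1, 2)], [(8, 9), (8, 9)]]
```

Each selected box plays the role of one conjunction, and the list of boxes forms the disjunction, which matches the "or" structure of the rules shown earlier.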