Presentation is loading. Please wait.

Presentation is loading. Please wait.

Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure.

Similar presentations


Presentation on theme: "Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure."— Presentation transcript:

1 Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure Prediction Signal Finding Overlapping Annotations: Protein Genes Protein-RNA Combining Grammars

2 Ab Initio Gene prediction Ab initio gene prediction: prediction of the location of genes (and the amino acid sequence it encodes) given a raw DNA sequence.....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggtgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccatggggtctct gttccctctgtcgctgctgttttttttggcggccgcctacccgggagttgggagcgcgctgggacgccggactaagcgggcgcaaagccccaagggtagccctctcgcgccctccgggacctcagtgcccttctgggtgcgcatgagcccg gagttcgtggctgtgcagccggggaagtcagtgcagctcaattgcagcaacagctgtccccagccgcagaattccagcctccgcaccccgctgcggcaaggcaagacgctcagagggccgggttgggtgtcttaccagctgctcgac gtgagggcctggagctccctcgcgcactgcctcgtgacctgcgcaggaaaaacacgctgggccacctccaggatcaccgcctacagtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtgggg gaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagaaccgccccacagcgtgattttggagcctccggtcttaaagggcaggaaatacactttgcgctgccacgtgacgcaggtg ttcccggtgggctacttggtggtgaccctgaggcatggaagccgggtcatctattccgaaagcctggagcgcttcaccggcctggatctggccaacgtgaccttgacctacgagtttgctgctggaccccgcgacttctggcagcccgtgat ctgccacgcgcgcctcaatctcgacggcctggtggtccgcaacagctcggcacccattacactgatgctcggtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacag accaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagcttggagccccgcgcccacagctttggcctccggttccatcgctgcccttgtagggatcctcctcactgtgggcgctgcgta cctatgcaagtgcctagctatgaagtcccaggcgtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg.... Input data 5'....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccATGGGGTCTCTGTTC CCTCTGTCGCTGCTGTTTTTTTTGGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGCGGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCG CCCTCCGGGACCTCAGTGCCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAAGTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAG CCGCAGAATTCCAGCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGGGTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTC GCGCACTGCCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCGCCTACAgtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggaa gggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGTGATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCT GCCACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCATGGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGATC TGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACTTCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTGGTCCGCAA CAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgtt ccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCCCACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCACTGTGGGCG CTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGCGTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctgg ggaaatggccatacatggtgg.... 3' 5' 3' Intron ExonUTR and intergenic sequence Output:

3 Annotation levels Protein coding genes including alternative splicing RNA structure Regulatory signals – fast/slow, prediction of TF, binding constants,… Homologous Genomes Evolution of Feature – regulatory signals > RNA > protein Knowledge and annotation transfer – experimental knowledge might be present in other species Integration of levels – RNA structure of mRNA, signals in coding regions,.. Further complications Combining with non-homologous analysis – tests for common regulation. Levels of Annotation Combining specie and population perspective “Annotation”: Tagging regions and nucleotides with information about function, structure, knowledge, additional data,…. Selection Strength,… A T T C A A T T C A Epigenomics – methylation, histone modification

4 Observables, Hidden Variables, Evolution & Knowledge Evolution Hidden Variable Knowledge (Constraints) Observables x If knowledge deterministic

5 Genscan Exons of phase 0, 1 or 2 Initial exon Terminal exon Introns of phase 0, 1 or 2 Exon of single exon genes 5' UTR Promoter Poly-A signal 3' UTR Intergenic sequence State with length distribution Omitted: reverse strand part of the HMM

6 AGGTATATAATGCG..... P coding {ATG-->GTG} or AGCCATTTAGTGCG..... P non-coding {ATG-->GTG} Comparative Gene Annotation

7 Gene Finding & Protein Homology (Gelfand, Mironov & Pevzner, 1996) Spliced Alignment : 1. Define set of potential exons in new genome. 2. Make exon ordering graph - EOG. 3. Align EOG to protein database. Protein Database Exon Ordering Graph T Y G H L P L P M T Y G H L P T Y - - L P M T W Y Q

8 Simultaneous Alignment & Gene Finding Bafna & Huson, 2000, T.Scharling,2001 & Blayo,2002. Align by minimizing Distance/ Maximizing Similarity: Align genes with structure Known/unknown:

9 S --> LS L.869.131 F --> dFd LS.788.212 L --> s dFd.895.105 Secondary Structure Generators

10 Knudsen & Hein, 2003 From Knudsen et al. (1999) RNA Structure Application

11 Observing Evolution has 2 parts P(Further history of x): P(x): x http://www.stats.ox.ac.uk/research/genome/projects/currentprojects U C G A C A U A C

12 Hidden Markov Model for Overlapping Genes Scanning Only starts in AUG (0.06) Will Stop in “STOP” (1.0) TC [1,2,3] D [1,2] D [3,1] S [1] D [2,3] S [2] S [3] NC Virus genome 1 st reading frame 2 nd reading frame 3 rd reading frame TC [1,2,3] D [1,2] D [3,1] S [1] D [2,3] S [2] S [3] NC Hidden States Annotation NC 1 2 3 1,2 1,3 2,3 1,2,3 NC 1 2 3 1,2 1,3 2,3 1,2,3 NC 1 2 3 1,2 1,3 2,3 1,2,3 NC 1 2 3 1,2 1,3 2,3 1,2,3

13 Molecular Evolution: Known Reading Frames Selection rates on rates Assume multiplicativity of selection factors AG T C T Known fixed context throughout phylogeny Simplify Genetic Code: 4-fold 2-fold (1-1-1-1) 1-1-1-1 sites 2-2 4 1-1-1-1 2-2 4 (f 1 f 2 a, f 1 f 2 b) (f 2 a, f 1 f 2 b) (f 2 a, f 2 b) (f 1 a, f 1 f 2 b) (f 2 a, f 1 f 2 b) (a, f 2 b) (f 1 a, f 1 b) (a, f 1 b) (a, b) 1 st 2 nd

14 Un-known Reading Frames and varying selection. Selection Levels Coding Status 1 2 3 8 a (.95) (1-a)/7 Selection Levels 0.01 0.1 0.2 0.4 0.6 1.5 2.0 0.8 sequences k 1 AG T C T Coding Status AG T C T T C G Selection Levels Coding Status

15 GAG POL REV VPX TAT NEF VIFVPRENV HIV2 of 14 genomes: Evolution/Selection ParameterEstimate+/- 1.96 Error Transition5.790.19 Transversion1.030.05 Base SF0.730.06 SF STOP0.440.18  0.950.02 B.Selection Strengths for Genes and Positions A. Phylogeny and Evolutionary Parameters.

16 Gag Pol Vif Vpx Env Nef Vpr Tat Rev GenBank Single Sequence Phylo-HMM HIV2 of 14 genomes: Annotation Sensitivity: 0.9308 Specificity: 0.9939 LogLikelihood: -34939.32 ViterbiCont.:-34949.41 Sensitivity: 0.9542 Specificity: 0.9965 LogLikelihood: -75939.18 ViterbiCont.:--75945.77

17 = ATG Same evolutionary model as before, but different HMM topology 64 states 3 different types of transitions HMM extension: Stop/Start Skidding

18 de novo annotation: 81.5% sensitivity (without non- homologous genes) 98.5% specificity  = 0.23  = 0.06  = 0.71 Knowing HIV1 (fixing the Viterbi path for one cube): 97.6% sensitivity (without non- homologous genes) 99.9% specificity Annotation Results: HIV1 vs. HIV2

19 HMM Extension II: Introns Introns will almost always be 3k long 27 states 729 states Pair HMM Single Sequence HMM

20 Conserved RNA Structure in Protein Coding Genes Problem: Gene Structure Known, RNA Structure Unknown. Genome: Exons: RNA Structure: Protein-RNA Evolution: DoubletsSinglet Contagious Dependence

21 RNA + Protein Evolution Codon Nucleotide Independence Heuristic Singlet R i,j =f* q i,j Doublet R (i1,i2),(j1,j2) = f 1 * f 2 * q (i1,i2),(j1,j2) Non-structural Structural Structure/non-Structure Grammars Prediction of stem-paring regions for different number of sequences 8 5 3

22 Combining Grammars: Multiple Hidden Layers Present Approach: Two “independent” annotations SCFG: RNA Structure HMM: Protein Structure Ideal Approach: Combined Annotation Combine SCFG & HMM: RNA, Gene Structure Joanna Davies

23 Combining Grammars: Solution Attempts Independence is non-trivial to define as they in principle are competing alternative models. HMM SCFG Let X be the stochastic variable giving the HMM annotation. Let Y be the stochastic variable giving the SCFG annotation. Is No. Combined Grammars (HMM, SCGF) --> SCFG have been devised, but does not work well, have arbitrary designs and are very large. Combinations of Viterbi and Posterior Decoding arises. Joanna Davies

24 http://www.stats.ox.ac.uk/__data/assets/file/0016/3328/combinedHMMartifact.pdf


Download ppt "Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure."

Similar presentations


Ads by Google