Published by Ezra Cross. Modified over 9 years ago.
1
Comparative Genomics & Annotation
The Foundation of Comparative Genomics
The main methodological tasks of CG
Annotation: Protein Gene Finding, RNA Structure Prediction, Signal Finding
Overlapping Annotations: Protein Genes, Protein-RNA
Combining Grammars
2
Ab Initio Gene prediction Ab initio gene prediction: prediction of the location of genes (and the amino acid sequence it encodes) given a raw DNA sequence.....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggtgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccatggggtctct gttccctctgtcgctgctgttttttttggcggccgcctacccgggagttgggagcgcgctgggacgccggactaagcgggcgcaaagccccaagggtagccctctcgcgccctccgggacctcagtgcccttctgggtgcgcatgagcccg gagttcgtggctgtgcagccggggaagtcagtgcagctcaattgcagcaacagctgtccccagccgcagaattccagcctccgcaccccgctgcggcaaggcaagacgctcagagggccgggttgggtgtcttaccagctgctcgac gtgagggcctggagctccctcgcgcactgcctcgtgacctgcgcaggaaaaacacgctgggccacctccaggatcaccgcctacagtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtgggg gaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagaaccgccccacagcgtgattttggagcctccggtcttaaagggcaggaaatacactttgcgctgccacgtgacgcaggtg ttcccggtgggctacttggtggtgaccctgaggcatggaagccgggtcatctattccgaaagcctggagcgcttcaccggcctggatctggccaacgtgaccttgacctacgagtttgctgctggaccccgcgacttctggcagcccgtgat ctgccacgcgcgcctcaatctcgacggcctggtggtccgcaacagctcggcacccattacactgatgctcggtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacag accaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagcttggagccccgcgcccacagctttggcctccggttccatcgctgcccttgtagggatcctcctcactgtgggcgctgcgta cctatgcaagtgcctagctatgaagtcccaggcgtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg.... 
Input data 5'....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccATGGGGTCTCTGTTC CCTCTGTCGCTGCTGTTTTTTTTGGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGCGGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCG CCCTCCGGGACCTCAGTGCCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAAGTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAG CCGCAGAATTCCAGCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGGGTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTC GCGCACTGCCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCGCCTACAgtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggaa gggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGTGATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCT GCCACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCATGGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGATC TGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACTTCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTGGTCCGCAA CAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgtt ccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCCCACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCACTGTGGGCG CTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGCGTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctgg ggaaatggccatacatggtgg.... 3' 5' 3' Intron ExonUTR and intergenic sequence Output:
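The annotated output above appears to mark exons in uppercase and everything else (introns, UTRs, intergenic sequence) in lowercase. A minimal sketch of reading such a case-annotated string back into coordinates follows. The function name `case_segments`, the two labels, and the toy sequence are illustrative assumptions, not part of the slides.

```python
from itertools import groupby


def case_segments(annotated):
    """Split a case-annotated sequence into (label, start, end) runs.

    Assumption: uppercase = exon, lowercase = anything else
    (intron, UTR, intergenic). Coordinates are 0-based, end-exclusive.
    """
    seq = "".join(annotated.split())  # drop whitespace left over from layout
    segments = []
    pos = 0
    for is_upper, run in groupby(seq, key=str.isupper):
        length = len(list(run))
        label = "exon" if is_upper else "other"
        segments.append((label, pos, pos + length))
        pos += length
    return segments


# Hypothetical toy input: two exons separated by an intron-like stretch.
toy = "ttgcc" + "ATGGGGTCT" + "gtgagggacag" + "AACCGCTAA" + "aggg"
for label, start, end in case_segments(toy):
    print(label, start, end)
```

This only recovers what the case convention encodes. It cannot distinguish introns from UTRs or intergenic sequence, which the slide shows with separate legend entries.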
3
Levels of Annotation
"Annotation": tagging regions and nucleotides with information about function, structure, knowledge, additional data, selection strength, ...
Annotation levels:
Protein coding genes, including alternative splicing
RNA structure
Regulatory signals: fast/slow, prediction of TF, binding constants, ...
Epigenomics: methylation, histone modification
Homologous genomes:
Evolution of feature: regulatory signals > RNA > protein
Knowledge and annotation transfer: experimental knowledge might be present in other species
Integration of levels: RNA structure of mRNA, signals in coding regions, ...
Further complications:
Combining with non-homologous analysis: tests for common regulation
Combining species and population perspective
4
Observables, Hidden Variables, Evolution & Knowledge
[Figure: Evolution, Hidden Variable, Knowledge (Constraints), Observables x]
If knowledge deterministic
5
Genscan
States of the HMM:
Exons of phase 0, 1 or 2
Initial exon
Terminal exon
Introns of phase 0, 1 or 2
Exon of single exon genes
5' UTR
Promoter
Poly-A signal
3' UTR
Intergenic sequence
Legend: state with length distribution
Omitted: reverse strand part of the HMM
6
Comparative Gene Annotation
AGGTATATAATGCG.....
AGCCATTTAGTGCG.....
P coding {ATG-->GTG} or P non-coding {ATG-->GTG}
7
Gene Finding & Protein Homology (Gelfand, Mironov & Pevzner, 1996)
Spliced Alignment:
1. Define set of potential exons in new genome.
2. Make exon ordering graph (EOG).
3. Align EOG to protein database.
[Figure: Exon Ordering Graph aligned to a Protein Database]
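The first two steps of spliced alignment can be sketched in a few lines. In this sketch the alignment to a protein database (step 3) is replaced by a precomputed score per candidate exon, which is a simplifying assumption and not the published method. The names `exon_ordering_graph` and `best_chain`, the coordinates and the scores are all hypothetical.

```python
def exon_ordering_graph(exons):
    """Build the exon ordering graph.

    exons: list of (start, end) candidate exons, end-exclusive, end > start.
    Returns {i: [j, ...]} where exon j may follow exon i without overlap.
    """
    graph = {i: [] for i in range(len(exons))}
    for i, (_, end_i) in enumerate(exons):
        for j, (start_j, _) in enumerate(exons):
            if end_i <= start_j:
                graph[i].append(j)
    return graph


def best_chain(exons, scores):
    """Highest-scoring path through the graph by dynamic programming.

    scores[i] stands in for how well exon i matches a target protein
    (assumption: a real spliced alignment would compute this jointly).
    """
    graph = exon_ordering_graph(exons)
    # Every predecessor starts earlier, so start order is a valid DP order.
    order = sorted(range(len(exons)), key=lambda i: exons[i][0])
    best = {i: scores[i] for i in order}
    prev = {i: None for i in order}
    for i in order:
        for j in graph[i]:
            if best[i] + scores[j] > best[j]:
                best[j] = best[i] + scores[j]
                prev[j] = i
    last = max(order, key=lambda i: best[i])
    total = best[last]
    chain = []
    while last is not None:
        chain.append(last)
        last = prev[last]
    return chain[::-1], total


# Hypothetical candidate exons and scores.
exons = [(0, 10), (5, 20), (12, 30), (25, 40), (32, 50)]
scores = [3, 4, 2, 5, 1]
print(best_chain(exons, scores))
```

The graph is acyclic because an edge always points to an exon that starts later, which is what makes the single left-to-right pass sufficient.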
8
Simultaneous Alignment & Gene Finding
Bafna & Huson, 2000; T. Scharling, 2001; Blayo, 2002.
Align by minimizing distance / maximizing similarity.
Align genes with structure known/unknown.
9
Secondary Structure Generators
S --> LS  (0.869) | L   (0.131)
F --> dFd (0.788) | LS  (0.212)
L --> s   (0.895) | dFd (0.105)
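To see what this grammar generates, one can sample from it. The sketch below uses the rule probabilities from the slide and writes the result in dot-bracket notation, taking `s` as an unpaired position (`.`) and each `d...d` pair as a base pair (`(` and `)`). That reading of the terminals, the function name `sample_structure` and the explicit stack are assumptions made for illustration.

```python
import random

# Rule probabilities as listed on the slide; terminals rendered as dot-bracket.
RULES = {
    "S": [(0.869, ["L", "S"]), (0.131, ["L"])],
    "F": [(0.788, ["(", "F", ")"]), (0.212, ["L", "S"])],
    "L": [(0.895, ["."]), (0.105, ["(", "F", ")"])],
}


def sample_structure(rng):
    """Expand the start symbol S left to right and return a dot-bracket string."""
    stack = ["S"]  # explicit stack instead of recursion, to avoid depth limits
    out = []
    while stack:
        symbol = stack.pop()
        if symbol in RULES:
            (p_first, first), (_, second) = RULES[symbol]
            rhs = first if rng.random() < p_first else second
            stack.extend(reversed(rhs))  # leftmost symbol is expanded next
        else:
            out.append(symbol)
    return "".join(out)


print(sample_structure(random.Random(1)))
```

Every opening bracket is pushed together with its closing partner, so each sampled string is a well-formed secondary structure.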
10
RNA Structure Application
Knudsen & Hein, 2003
[Figure from Knudsen et al. (1999)]
11
Observing Evolution has 2 parts: P(Further history of x) and P(x).
http://www.stats.ox.ac.uk/research/genome/projects/currentprojects
12
Hidden Markov Model for Overlapping Genes
Scanning: only starts in AUG (0.06); will stop in “STOP” (1.0).
Virus genome with 1st, 2nd and 3rd reading frames.
Hidden states: NC; S [1]; S [2]; S [3]; D [1,2]; D [2,3]; D [3,1]; TC [1,2,3]
Annotation: NC; 1; 2; 3; 1,2; 1,3; 2,3; 1,2,3
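The eight annotation labels are the subsets of the three reading frames, with the empty subset standing for non-coding (NC). A short sketch of that enumeration, with the hypothetical helper names `coding_states` and `label`:

```python
from itertools import combinations

FRAMES = (1, 2, 3)


def coding_states(frames=FRAMES):
    """All subsets of reading frames, ordered by size: (), (1,), ..., (1, 2, 3)."""
    states = []
    for size in range(len(frames) + 1):
        states.extend(combinations(frames, size))
    return states


def label(state):
    """Label a state as on the slide: 'NC' for the empty set, else '1,2' etc."""
    return "NC" if not state else ",".join(str(frame) for frame in state)


print([label(state) for state in coding_states()])
```

With three frames this gives 2^3 = 8 states, matching the annotation row of the slide.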
13
Molecular Evolution: Known Reading Frames
Selection factors act on rates; assume multiplicativity of selection factors.
Known fixed context throughout phylogeny.
Simplify Genetic Code: sites are 4-fold (4), 2-fold (2-2) or non-degenerate (1-1-1-1).
Rate pairs by site type in the 1st reading frame (columns) and the 2nd reading frame (rows):
            1-1-1-1               2-2                  4
1-1-1-1     (f1 f2 a, f1 f2 b)    (f2 a, f1 f2 b)      (f2 a, f2 b)
2-2         (f1 a, f1 f2 b)       (a, f1 f2 b)         (a, f2 b)
4           (f1 a, f1 b)          (a, f1 b)            (a, b)
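The multiplicativity assumption can be written as a small function. This sketch assumes that `a` and `b` are the transition and transversion rates, that a 1-1-1-1 site scales both rates by the frame's selection factor, that a 2-2 site scales only the second rate, and that a 4-fold site scales neither. The function name `rate_pair` and the numeric values are hypothetical.

```python
def rate_pair(type1, type2, a, b, f1, f2):
    """Rate pair at a site with degeneracy type1 in frame 1 and type2 in frame 2.

    Selection factors multiply: f1 applies through frame 1, f2 through frame 2.
    """
    first, second = a, b
    for site_type, factor in ((type1, f1), (type2, f2)):
        if site_type == "1-1-1-1":   # every change alters the amino acid
            first *= factor
            second *= factor
        elif site_type == "2-2":     # only the second kind of change is selected
            second *= factor
        elif site_type != "4":       # 4-fold: no selection from this frame
            raise ValueError(f"unknown site type: {site_type}")
    return first, second


# Hypothetical parameter values, chosen only to make the arithmetic visible.
print(rate_pair("1-1-1-1", "4", a=2.0, b=1.0, f1=0.5, f2=0.25))
print(rate_pair("2-2", "2-2", a=2.0, b=1.0, f1=0.5, f2=0.25))
```

Under these assumptions a site that is 2-2 in both frames gets (a, f1 f2 b), and a site that is 4-fold in both frames keeps the unselected pair (a, b).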
14
Unknown Reading Frames and Varying Selection
Coding status: 8 states (1, 2, 3, ..., 8); a (0.95) for keeping the current state, (1-a)/7 for each of the other seven.
Selection levels: 0.01, 0.1, 0.2, 0.4, 0.6, 0.8, 1.5, 2.0
[Figure: sequences 1, ..., k, with a coding status and a selection level along the sites]
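One reading of "a (.95)" and "(1-a)/7" on this slide is a transition matrix over eight states that keeps the current state with probability a and moves to each other state with equal probability. The sketch below builds that matrix under this assumption. The function name `switching_matrix` is hypothetical, and the slide does not confirm that a sits on the diagonal.

```python
def switching_matrix(n_states=8, a=0.95):
    """n_states x n_states matrix: a on the diagonal, (1 - a)/(n_states - 1) elsewhere."""
    off_diagonal = (1.0 - a) / (n_states - 1)
    return [
        [a if i == j else off_diagonal for j in range(n_states)]
        for i in range(n_states)
    ]


matrix = switching_matrix()
print(matrix[0][:3])
print(sum(matrix[0]))
```

Each row sums to one by construction, since a + 7 (1-a)/7 = 1.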
15
HIV2 of 14 genomes: Evolution/Selection
A. Phylogeny and Evolutionary Parameters
Parameter      Estimate   +/- 1.96 Error
Transition     5.79       0.19
Transversion   1.03       0.05
Base SF        0.73       0.06
SF STOP        0.44       0.18
a              0.95       0.02
B. Selection Strengths for Genes and Positions
Genes: GAG, POL, REV, VPX, TAT, NEF, VIF, VPR, ENV
16
HIV2 of 14 genomes: Annotation
Genes: Gag, Pol, Vif, Vpx, Env, Nef, Vpr, Tat, Rev
Annotation tracks: GenBank, Single Sequence, Phylo-HMM
                 Single Sequence   Phylo-HMM
Sensitivity      0.9308            0.9542
Specificity      0.9939            0.9965
LogLikelihood    -34939.32         -75939.18
ViterbiCont.     -34949.41         -75945.77
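Sensitivity and specificity figures like these can be computed per nucleotide by comparing a predicted annotation with a reference one. The sketch below uses the common definitions TP/(TP+FN) and TN/(TN+FP). Whether the slides use exactly these definitions is an assumption, since gene-finding papers sometimes report TP/(TP+FP) as specificity. The function name and the toy annotations are hypothetical.

```python
def sensitivity_specificity(reference, predicted):
    """Per-position sensitivity and specificity.

    reference, predicted: equal-length sequences of booleans, True = coding.
    Assumes the reference contains at least one coding and one non-coding site.
    """
    if len(reference) != len(predicted):
        raise ValueError("annotations must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    tn = sum(1 for r, p in pairs if not r and not p)
    fp = sum(1 for r, p in pairs if not r and p)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical 10-position annotations.
reference = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, False, True]
print(sensitivity_specificity(reference, predicted))
```

For overlapping genes the same comparison could be run once per reading frame, treating each frame's coding status as its own boolean track.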
17
HMM extension: Stop/Start Skidding
Same evolutionary model as before, but different HMM topology.
64 states
3 different types of transitions
18
Annotation Results: HIV1 vs. HIV2
de novo annotation: 81.5% sensitivity (without non-homologous genes), 98.5% specificity.
= 0.23, = 0.06, = 0.71
Knowing HIV1 (fixing the Viterbi path for one cube): 97.6% sensitivity (without non-homologous genes), 99.9% specificity.
19
HMM Extension II: Introns
Introns will almost always be 3k long.
Single Sequence HMM: 27 states
Pair HMM: 729 states
20
Conserved RNA Structure in Protein Coding Genes
Problem: Gene Structure Known, RNA Structure Unknown.
[Figure: Genome, Exons, RNA Structure]
Protein-RNA Evolution: Doublets, Singlet, Contagious Dependence
21
RNA + Protein Evolution
Codon Nucleotide Independence Heuristic
Singlet: R_{i,j} = f * q_{i,j}
Doublet: R_{(i1,i2),(j1,j2)} = f_1 * f_2 * q_{(i1,i2),(j1,j2)}
Structure/non-Structure Grammars: Structural, Non-structural
Prediction of stem-pairing regions for different numbers of sequences (8, 5, 3)
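The two rate formulas can be evaluated directly once base rates are given. The sketch below is a minimal illustration, assuming `q` is a table of nucleotide substitution rates, `q2` a table of base-pair substitution rates, and `f`, `f1`, `f2` the selection factors from the coding context. The function names and all numeric values are hypothetical.

```python
def singlet_rate(f, q, i, j):
    """R_{i,j} = f * q_{i,j} for an unpaired position."""
    return f * q[(i, j)]


def doublet_rate(f1, f2, q2, pair_i, pair_j):
    """R_{(i1,i2),(j1,j2)} = f1 * f2 * q_{(i1,i2),(j1,j2)} for a base pair.

    f1 and f2 are the selection factors of the two paired positions.
    """
    return f1 * f2 * q2[(pair_i, pair_j)]


# Hypothetical base rates, for illustration only.
q = {("A", "G"): 0.2}
q2 = {(("G", "C"), ("A", "U")): 0.4}

print(singlet_rate(0.5, q, "A", "G"))
print(doublet_rate(0.5, 0.25, q2, ("G", "C"), ("A", "U")))
```

The doublet rate carries one factor per paired position because each of the two nucleotides sits in its own codon context.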
22
Combining Grammars: Multiple Hidden Layers
Present Approach: Two “independent” annotations
SCFG: RNA Structure
HMM: Protein Structure
Ideal Approach: Combined Annotation
Combine SCFG & HMM: RNA, Gene Structure
Joanna Davies
23
Combining Grammars: Solution Attempts
Independence is non-trivial to define, as the HMM and the SCFG are in principle competing alternative models.
Let X be the stochastic variable giving the HMM annotation.
Let Y be the stochastic variable giving the SCFG annotation.
Are X and Y independent? No.
Combined grammars (HMM, SCFG) --> SCFG have been devised, but they do not work well, have arbitrary designs and are very large.
Combinations of Viterbi and posterior decoding arise.
Joanna Davies
24
http://www.stats.ox.ac.uk/__data/assets/file/0016/3328/combinedHMMartifact.pdf