Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure.

Slides:



Advertisements
Similar presentations
Gene Prediction: Similarity-Based Approaches
Advertisements

Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne & Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy & co. Meyer and Durbin.
BIOINFORMATICS GENE DISCOVERY BIOINFORMATICS AND GENE DISCOVERY Iosif Vaisman 1998 UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL Bioinformatics Tutorials.
Towards realistic codon models: among site variability and dependency of synonymous and nonsynonymous rates Itay Mayrose Adi Doron-Faigenboim Eran Bacharach.
Ab initio gene prediction Genome 559, Winter 2011.
Profiles for Sequences
Hidden Markov Models in Bioinformatics Example Domain: Gene Finding Colin Cherry
Finding Eukaryotic Open reading frames.
HMM Sampling and Applications to Gene Finding and Alignment European Conference on Computational Biology 2003 Simon Cawley * and Lior Pachter + and thanks.
Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken from (and rapidly mixed) Larry Hunter, Tom Madej, William Stafford Noble,
How many transcripts does it take to reconstruct the splice graph? Introduction Alternative splicing is the process by which a single gene may be used.
. Class 1: Introduction. The Tree of Life Source: Alberts et al.
Hidden Markov Models Sasha Tkachev and Ed Anderson Presenter: Sasha Tkachev.
Hidden Markov Models I Biology 162 Computational Genetics Todd Vision 14 Sep 2004.
1 Gene Finding Charles Yan. 2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where.
© 2006 W.W. Norton & Company, Inc. DISCOVER BIOLOGY 3/e
Gene Identification Lab
CSE182-L10 Gene Finding.
Comparative ab initio prediction of gene structures using pair HMMs
Finding genes in human using the mouse Finding genes in mouse using the human Lior Pachter Department of Mathematics U.C. Berkeley.
Eukaryotic Gene Finding
Phylogenetic Shadowing Daniel L. Ong. March 9, 2005RUGS, UC Berkeley2 Abstract The human genome contains about 3 billion base pairs! Algorithms to analyze.
Lecture 12 Splicing and gene prediction in eukaryotes
Deepak Verghese CS 6890 Gene Finding With A Hidden Markov model Of Genomic Structure and Evolution. Jakob Skou Pedersen and Jotun Hein.
Eukaryotic Gene Finding
Gene Finding Genome Annotation. Gene finding is a cornerstone of genomic analysis Genome content and organization Differential expression analysis Epigenomics.
Biological Motivation Gene Finding in Eukaryotic Genomes
Hidden Markov Models In BioInformatics
Gene Structure and Identification
International Livestock Research Institute, Nairobi, Kenya. Introduction to Bioinformatics: NOV David Lynn (M.Sc., Ph.D.) Trinity College Dublin.
Applications of HMMs Yves Moreau Overview Profile HMMs Estimation Database search Alignment Gene finding Elements of gene prediction Prokaryotes.
Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure.
Genome Annotation BBSI July 14, 2005 Rita Shiang.
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
Chapter 10 Transcription RNA processing Translation Jones and Bartlett Publishers © 2005.
DNA sequencing. Dideoxy analogs of normal nucleotide triphosphates (ddNTP) cause premature termination of a growing chain of nucleotides. ACAGTCGATTG ACAddG.
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.
Mark D. Adams Dept. of Genetics 9/10/04
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.
From Genomes to Genes Rui Alves.
5 Open Problems in Bioinformatics Pedigrees from Genomes Comparative Genomics of Alternative Splicing Viral Annotation Evolving Turing Patterns Protein.
Eukaryotic Gene Structure. 2 Terminology Genome – entire genetic material of an individual Transcriptome – set of transcribed sequences Proteome – set.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
Transcription Vocabulary of transcription: transcription - synthesis of RNA under the direction of DNA messenger RNA (mRNA) - carries genetic message from.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Motif Search and RNA Structure Prediction Lesson 9.
(H)MMs in gene prediction and similarity searches.
Finding genes in the genome
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
RNA, Transcription, and the Genetic Code. RNA = ribonucleic acid -Nucleic acid similar to DNA but with several differences DNARNA Number of strands21.
The Central Dogma of Molecular Biology DNA  RNA  Protein  Trait.
ECFG for G ene Identification using DART Yuri Bendana, Sharon Chao, Karsten Temme.
Using public resources to understand associations Dr Luke Jostins Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015.
Gene Expression : Transcription and Translation 3.4 & 7.3.
Genetic Code and Interrupted Gene Chapter 4. Genetic Code and Interrupted Gene Aala A. Abulfaraj.
Biological Motivation Gene Finding in Eukaryotic Genomes Rhys Price Jones Anne R. Haake.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
bacteria and eukaryotes
Annotating The data.
Genome Annotation (protein coding genes)
Eukaryotic Gene Structure
Interpolated Markov Models for Gene Finding
Eukaryotic Gene Finding
Catalogues, Homology & Molecular Evolution.
Ab initio gene prediction
Recitation 7 2/4/09 PSSMs+Gene finding
Introduction to Bioinformatics II
From DNA to Protein Class 4 02/11/04 RBIO-0002-U1.
4. HMMs for gene finding HMM Ability to model grammar
Presentation transcript:

Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure Prediction Signal Finding Overlapping Annotations: Protein Genes Protein-RNA Combining Grammars

Ab Initio Gene prediction Ab initio gene prediction: prediction of the location of genes (and the amino acid sequence it encodes) given a raw DNA sequence.....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggtgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccatggggtctct gttccctctgtcgctgctgttttttttggcggccgcctacccgggagttgggagcgcgctgggacgccggactaagcgggcgcaaagccccaagggtagccctctcgcgccctccgggacctcagtgcccttctgggtgcgcatgagcccg gagttcgtggctgtgcagccggggaagtcagtgcagctcaattgcagcaacagctgtccccagccgcagaattccagcctccgcaccccgctgcggcaaggcaagacgctcagagggccgggttgggtgtcttaccagctgctcgac gtgagggcctggagctccctcgcgcactgcctcgtgacctgcgcaggaaaaacacgctgggccacctccaggatcaccgcctacagtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtgggg gaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagaaccgccccacagcgtgattttggagcctccggtcttaaagggcaggaaatacactttgcgctgccacgtgacgcaggtg ttcccggtgggctacttggtggtgaccctgaggcatggaagccgggtcatctattccgaaagcctggagcgcttcaccggcctggatctggccaacgtgaccttgacctacgagtttgctgctggaccccgcgacttctggcagcccgtgat ctgccacgcgcgcctcaatctcgacggcctggtggtccgcaacagctcggcacccattacactgatgctcggtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacag accaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagcttggagccccgcgcccacagctttggcctccggttccatcgctgcccttgtagggatcctcctcactgtgggcgctgcgta cctatgcaagtgcctagctatgaagtcccaggcgtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg.... Input data 5'....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccATGGGGTCTCTGTTC CCTCTGTCGCTGCTGTTTTTTTTGGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGCGGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCG CCCTCCGGGACCTCAGTGCCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAAGTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAG CCGCAGAATTCCAGCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGGGTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTC GCGCACTGCCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCGCCTACAgtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggaa gggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGTGATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCT GCCACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCATGGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGATC TGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACTTCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTGGTCCGCAA CAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgtt ccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCCCACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCACTGTGGGCG CTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGCGTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctgg ggaaatggccatacatggtgg.... 3' 5' 3' Intron ExonUTR and intergenic sequence Output:

Annotation levels Protein coding genes including alternative splicing RNA structure Regulatory signals – fast/slow, prediction of TF, binding constants,… Homologous Genomes Evolution of Feature – regulatory signals > RNA > protein Knowledge and annotation transfer – experimental knowledge might be present in other species Integration of levels – RNA structure of mRNA, signals in coding regions,.. Further complications Combining with non-homologous analysis – tests for common regulation. Levels of Annotation Combining specie and population perspective “Annotation”: Tagging regions and nucleotides with information about function, structure, knowledge, additional data,…. Selection Strength,… A T T C A A T T C A Epigenomics – methylation, histone modification

Observables, Hidden Variables, Evolution & Knowledge Evolution Hidden Variable Knowledge (Constraints) Observables x If knowledge deterministic

Genscan Exons of phase 0, 1 or 2 Initial exon Terminal exon Introns of phase 0, 1 or 2 Exon of single exon genes 5' UTR Promoter Poly-A signal 3' UTR Intergenic sequence State with length distribution Omitted: reverse strand part of the HMM

AGGTATATAATGCG..... P coding {ATG-->GTG} or AGCCATTTAGTGCG..... P non-coding {ATG-->GTG} Comparative Gene Annotation

Gene Finding & Protein Homology (Gelfand, Mironov & Pevzner, 1996) Spliced Alignment : 1. Define set of potential exons in new genome. 2. Make exon ordering graph - EOG. 3. Align EOG to protein database. Protein Database Exon Ordering Graph T Y G H L P L P M T Y G H L P T Y - - L P M T W Y Q

Simultaneous Alignment & Gene Finding Bafna & Huson, 2000, T.Scharling,2001 & Blayo,2002. Align by minimizing Distance/ Maximizing Similarity: Align genes with structure Known/unknown:

S --> LS L F --> dFd LS L --> s dFd Secondary Structure Generators

Knudsen & Hein, 2003 From Knudsen et al. (1999) RNA Structure Application

Observing Evolution has 2 parts P(Further history of x): P(x): x U C G A C A U A C

Hidden Markov Model for Overlapping Genes Scanning Only starts in AUG (0.06) Will Stop in “STOP” (1.0) TC [1,2,3] D [1,2] D [3,1] S [1] D [2,3] S [2] S [3] NC Virus genome 1 st reading frame 2 nd reading frame 3 rd reading frame TC [1,2,3] D [1,2] D [3,1] S [1] D [2,3] S [2] S [3] NC Hidden States Annotation NC ,2 1,3 2,3 1,2,3 NC ,2 1,3 2,3 1,2,3 NC ,2 1,3 2,3 1,2,3 NC ,2 1,3 2,3 1,2,3

Molecular Evolution: Known Reading Frames Selection rates on rates Assume multiplicativity of selection factors AG T C T Known fixed context throughout phylogeny Simplify Genetic Code: 4-fold 2-fold ( ) sites (f 1 f 2 a, f 1 f 2 b) (f 2 a, f 1 f 2 b) (f 2 a, f 2 b) (f 1 a, f 1 f 2 b) (f 2 a, f 1 f 2 b) (a, f 2 b) (f 1 a, f 1 b) (a, f 1 b) (a, b) 1 st 2 nd

Un-known Reading Frames and varying selection. Selection Levels Coding Status a (.95) (1-a)/7 Selection Levels sequences k 1 AG T C T Coding Status AG T C T T C G Selection Levels Coding Status

GAG POL REV VPX TAT NEF VIFVPRENV HIV2 of 14 genomes: Evolution/Selection ParameterEstimate+/ Error Transition Transversion Base SF SF STOP  B.Selection Strengths for Genes and Positions A. Phylogeny and Evolutionary Parameters.

Gag Pol Vif Vpx Env Nef Vpr Tat Rev GenBank Single Sequence Phylo-HMM HIV2 of 14 genomes: Annotation Sensitivity: Specificity: LogLikelihood: ViterbiCont.: Sensitivity: Specificity: LogLikelihood: ViterbiCont.:

= ATG Same evolutionary model as before, but different HMM topology 64 states 3 different types of transitions HMM extension: Stop/Start Skidding

de novo annotation: 81.5% sensitivity (without non- homologous genes) 98.5% specificity  = 0.23  = 0.06  = 0.71 Knowing HIV1 (fixing the Viterbi path for one cube): 97.6% sensitivity (without non- homologous genes) 99.9% specificity Annotation Results: HIV1 vs. HIV2

HMM Extension II: Introns Introns will almost always be 3k long 27 states 729 states Pair HMM Single Sequence HMM

Conserved RNA Structure in Protein Coding Genes Problem: Gene Structure Known, RNA Structure Unknown. Genome: Exons: RNA Structure: Protein-RNA Evolution: DoubletsSinglet Contagious Dependence

RNA + Protein Evolution Codon Nucleotide Independence Heuristic Singlet R i,j =f* q i,j Doublet R (i1,i2),(j1,j2) = f 1 * f 2 * q (i1,i2),(j1,j2) Non-structural Structural Structure/non-Structure Grammars Prediction of stem-paring regions for different number of sequences 8 5 3

Combining Grammars: Multiple Hidden Layers Present Approach: Two “independent” annotations SCFG: RNA Structure HMM: Protein Structure Ideal Approach: Combined Annotation Combine SCFG & HMM: RNA, Gene Structure Joanna Davies

Combining Grammars: Solution Attempts Independence is non-trivial to define as they in principle are competing alternative models. HMM SCFG Let X be the stochastic variable giving the HMM annotation. Let Y be the stochastic variable giving the SCFG annotation. Is No. Combined Grammars (HMM, SCGF) --> SCFG have been devised, but does not work well, have arbitrary designs and are very large. Combinations of Viterbi and Posterior Decoding arises. Joanna Davies