Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure.

Slides:



Advertisements
Similar presentations
Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne & Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy & co. Meyer and Durbin.
Advertisements

1 DNA Analysis Amir Golnabi ENGS 112 Spring 2008.
Ka-Lok Ng Dept. of Bioinformatics Asia University
Hidden Markov Models in Bioinformatics
McPromoter – an ancient tool to predict transcription start sites
RNA structure prediction. RNA functions RNA functions as –mRNA –rRNA –tRNA –Nuclear export –Spliceosome –Regulatory molecules (RNAi) –Enzymes –Virus –Retrotransposons.
Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken from (and rapidly mixed) Larry Hunter, Tom Madej, William Stafford Noble,
. Class 1: Introduction. The Tree of Life Source: Alberts et al.
Gene Finding. Finding Genes Prokaryotes –Genome under 10Mb –>85% of sequence codes for proteins Eukaryotes –Large Genomes (up to 10Gb) –1-3% coding for.
Hidden Markov Models I Biology 162 Computational Genetics Todd Vision 14 Sep 2004.
1 Gene Finding Charles Yan. 2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where.
Gene Identification Lab
Comparative ab initio prediction of gene structures using pair HMMs
Sónia Martins Bruno Martins José Cruz IGC, February 20 th, 2008.
Eukaryotic Gene Finding
Lecture 12 Splicing and gene prediction in eukaryotes
Deepak Verghese CS 6890 Gene Finding With A Hidden Markov model Of Genomic Structure and Evolution. Jakob Skou Pedersen and Jotun Hein.
(CHAPTER 12- Brooker Text)
Eukaryotic Gene Finding
Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure.
Gene Finding Genome Annotation. Gene finding is a cornerstone of genomic analysis Genome content and organization Differential expression analysis Epigenomics.
Biological Motivation Gene Finding in Eukaryotic Genomes
Sequencing a genome and Basic Sequence Alignment
Hidden Markov Models In BioInformatics
Comparative Biology observable Parameters:time rates, selection Unobservable Evolutionary Path observable Most Recent Common Ancestor ? ATTGCGTATATAT….CAG.
RNA Secondary Structure What is RNA? Definition of RNA secondary Structure RNA molecule evolution Algorithms for base pair maximisation Chomsky’s Linguistic.
Genome Organization and Evolution. Assignment For 2/24/04 Read: Lesk, Chapter 2 Exercises 2.1, 2.5, 2.7, p 110 Problem 2.2, p 112 Weblems 2.4, 2.7, pp.
1 The Interrupted Gene. Ex Biochem c3-interrupted gene Introduction Figure 3.1.
Schedule Bioinformatics and Computational Biology: History and Biological Background (JH) The Parsimony criterion GKN Stochastic Models of.
DNA sequencing. Dideoxy analogs of normal nucleotide triphosphates (ddNTP) cause premature termination of a growing chain of nucleotides. ACAGTCGATTG ACAddG.
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
Sequencing a genome and Basic Sequence Alignment
A Biology Primer Part III: Transcription, Translation, and Regulation Vasileios Hatzivassiloglou University of Texas at Dallas.
Mark D. Adams Dept. of Genetics 9/10/04
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.
From Genomes to Genes Rui Alves.
Eukaryotic Gene Prediction Rui Alves. How are eukaryotic genes different? DNA RNA Pol mRNA Ryb Protein.
Hidden Markov Models in Bioinformatics
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
What is Bioinformatics?
Applications of HMMs in Computational Biology BMI/CS 576 Colin Dewey Fall 2010.
GENE REGULATION RESULTS IN DIFFERENTIAL GENE EXPRESSION, LEADING TO CELL SPECIALIZATION Eukaryotic DNA.
Motif Search and RNA Structure Prediction Lesson 9.
MicroRNA Prediction with SCFG and MFE Structure Annotation Tim Shaw, Ying Zheng, and Bram Sebastian.
(H)MMs in gene prediction and similarity searches.
Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne & Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy & co. Meyer and Durbin.
Finding genes in the genome
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
ECFG for G ene Identification using DART Yuri Bendana, Sharon Chao, Karsten Temme.
Gene prediction 10. June 2004 Irmtraud Meyer University of Oxford
Biological Motivation Gene Finding in Eukaryotic Genomes Rhys Price Jones Anne R. Haake.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
bacteria and eukaryotes
Genome Annotation (protein coding genes)
The Transcriptional Landscape of the Mammalian Genome
What is a Hidden Markov Model?
Stochastic Context-Free Grammars for Modeling RNA
Eukaryotic Gene Finding
Stochastic Context-Free Grammars for Modeling RNA
Catalogues, Homology & Molecular Evolution.
Ab initio gene prediction
Introduction to Bioinformatics II
Combining RNA and Protein selection models
Hidden Markov Models in Bioinformatics min
CISC 667 Intro to Bioinformatics (Fall 2005) Hidden Markov Models (IV)
Profile HMMs GeneScan TMMOD
4. HMMs for gene finding HMM Ability to model grammar
.1Sources of DNA and Sequencing Methods 2 Genome Assembly Strategy and Characterization 3 Gene Prediction and Annotation 4 Genome Structure 5 Genome.
Gene Structure.
Gene Structure.
Presentation transcript:

Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure Prediction Signal Finding Overlapping Annotations: Protein Genes Protein-RNA Combining Grammars

Ab Initio Gene prediction Ab initio gene prediction: prediction of the location of genes (and the amino acid sequence it encodes) given a raw DNA sequence.....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggtgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccatggggtctct gttccctctgtcgctgctgttttttttggcggccgcctacccgggagttgggagcgcgctgggacgccggactaagcgggcgcaaagccccaagggtagccctctcgcgccctccgggacctcagtgcccttctgggtgcgcatgagcccg gagttcgtggctgtgcagccggggaagtcagtgcagctcaattgcagcaacagctgtccccagccgcagaattccagcctccgcaccccgctgcggcaaggcaagacgctcagagggccgggttgggtgtcttaccagctgctcgac gtgagggcctggagctccctcgcgcactgcctcgtgacctgcgcaggaaaaacacgctgggccacctccaggatcaccgcctacagtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtgggg gaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagaaccgccccacagcgtgattttggagcctccggtcttaaagggcaggaaatacactttgcgctgccacgtgacgcaggtg ttcccggtgggctacttggtggtgaccctgaggcatggaagccgggtcatctattccgaaagcctggagcgcttcaccggcctggatctggccaacgtgaccttgacctacgagtttgctgctggaccccgcgacttctggcagcccgtgat ctgccacgcgcgcctcaatctcgacggcctggtggtccgcaacagctcggcacccattacactgatgctcggtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacag accaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagcttggagccccgcgcccacagctttggcctccggttccatcgctgcccttgtagggatcctcctcactgtgggcgctgcgta cctatgcaagtgcctagctatgaagtcccaggcgtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg.... Input data 5'....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccATGGGGTCTCTGTTC CCTCTGTCGCTGCTGTTTTTTTTGGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGCGGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCG CCCTCCGGGACCTCAGTGCCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAAGTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAG CCGCAGAATTCCAGCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGGGTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTC GCGCACTGCCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCGCCTACAgtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggaa gggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGTGATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCT GCCACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCATGGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGATC TGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACTTCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTGGTCCGCAA CAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgtt ccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCCCACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCACTGTGGGCG CTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGCGTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctgg ggaaatggccatacatggtgg.... 3' 5' 3' Intron ExonUTR and intergenic sequence Output:

Annotation levels Protein coding genes including alternative splicing RNA structure Regulatory signals – fast/slow, prediction of TF, binding constants,… Homologous Genomes Evolution of Feature – regulatory signals > RNA > protein Knowledge and annotation transfer – experimental knowledge might be present in other species Integration of levels – RNA structure of mRNA, signals in coding regions,.. Further complications Combining with non-homologous analysis – tests for common regulation. Levels of Annotation Combining specie and population perspective “Annotation”: Tagging regions and nucleotides with information about function, structure, knowledge, additional data,…. Selection Strength,… A T T C A A T T C A Epigenomics – methylation, histone modification

Observables, Hidden Variables, Evolution & Knowledge Evolution Hidden Variable Knowledge (Constraints) Observables x If knowledge deterministic

Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne & Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy & co. Meyer and Durbin 02 Pedersen …, 03 Siepel & Haussler 03 Pedersen, Meyer, Forsberg…, Simmonds 2004a,b McCauley …. Firth & Brown Conditional Modelling Needs: Footprinting -Signals (Blanchette) AGGTATATAATGCG..... P coding {ATG-->GTG} or AGCCATTTAGTGCG..... P non-coding {ATG-->GTG}

Grammars: Finite Set of Rules for Generating Strings Regular finished – no variables General (also erasing) Context Free Context Sensitive Ordinary letters: & Variables: i.A starting symbol: ii. A set of substitution rules applied to variables in the present string:

Simple String Generators Context Free Grammar  S--> aSa bSb aa bb One sentence (even length palindromes): S--> aSa --> abSba --> abaaba Variables (capital) Letters (small) Regular Grammar: Start with S S --> aT bS T --> aS bT  One sentence – odd # of a’s: S-> aT -> aaS –> aabS -> aabaT -> aaba Regular Context Free

Stochastic Grammars The grammars above classify all string as belonging to the language or not. All variables has a finite set of substitution rules. Assigning probabilities to the use of each rule will assign probabilities to the strings in the language. i. Start with S. S --> (0.3)aT (0.7)bS T --> (0.3)aS (0.4)bT (0.3)  If there is a 1-1 derivation (creation) of a string, the probability of a string can be obtained as the product probability of the applied rules. ii.  S--> (0.3)aSa (0.5)bSb (0.1)aa (0.1)bb S -> aT -> aaS –> aabS -> aabaT -> aaba *0.3 *0.7 *0.3 S -> aSa -> abSba -> abaaba *0.3*0.5 *0.1

Hidden Markov Models in Bioinformatics Definition Four Key Algorithms Summing over Unknown States Most Probable Unknown States Marginalizing Unknown States Optimizing Parameters O 1 O 2 O 3 O 4 O 5 O 6 O 7 O 8 O 9 O 10 H1H1 H2H2 H3H3

What is the probability of the data? O 1 O 2 O 3 O 4 O 5 O 6 O 7 O 8 O 9 O 10 H1H1 H2H2 The probability of the observed is, which could be hard to calculate. However, these calculations can be considerably accelerated. Let the probability of the observations (O 1,..O k ) conditional on H k =j. Following recursion will be obeyed:

HMM Examples Simple Eukaryotic Simple Prokaryotic Gene Finding: Burge and Karlin, 1996 Intron length > 50 bp required for splicing Length distribution is not geometric

Genscan Exons of phase 0, 1 or 2 Initial exon Terminal exon Introns of phase 0, 1 or 2 Exon of single exon genes 5' UTR Promoter Poly-A signal 3' UTR Intergenic sequence State with length distribution Omitted: reverse strand part of the HMM

AGGTATATAATGCG..... P coding {ATG-->GTG} or AGCCATTTAGTGCG..... P non-coding {ATG-->GTG} Comparative Gene Annotation

SCFG Analogue to HMM calculations HMM/Stochastic Regular Grammar: O 1 O 2 O 3 O 4 O 5 O 6 O 7 O 8 O 9 O 10 H1H1 H2H2 H3H3 W i j 1 L SCFG - Stochastic Context Free Grammars: WLWL WRWR i’i’j’j’

S --> LS L F --> dFd LS L --> s dFd Secondary Structure Generators

Knudsen & Hein, 2003 From Knudsen et al. (1999) RNA Structure Application

Observing Evolution has 2 parts P(Further history of x): U C G A C A U A C P(x): x