Eukaryotic Gene Finding

Eukaryotic Gene Finding Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/

Prokaryotic vs. Eukaryotic Genes
Prokaryotes:
- small genomes
- high gene density
- no introns (or splicing)
- no RNA processing
- similar promoters
- overlapping genes
Eukaryotes:
- large genomes
- low gene density
- introns (splicing)
- RNA processing
- heterogeneous promoters
- polyadenylation

Pre-mRNA Splicing
[Figure: assembly of the spliceosome and catalysis. Labels: U1 snRNP, U2 snRNP, U2AF65, SR proteins, 5' splice signal, branch signal, polyY tract, 3' splice signal, exonic enhancers, exonic repressor, intronic enhancers, intronic repressor, intron definition, exon definition]

Some Statistics
- On average, a vertebrate gene is about 30 kb long.
- The coding region takes up about 1 kb.
- Exon sizes vary from tens of base pairs to kilobases.
- An average 5' UTR is about 750 bp and an average 3' UTR is about 450 bp, but both can be much longer.

Human Splice Signal Motifs

Semi-Markov HMM Model

Genscan HSMM

GenScan States
- N: intergenic region
- P: promoter
- F: 5' untranslated region
- Esngl: single exon (intronless), from translation start to stop codon
- Einit: initial exon, from translation start to donor splice site
- Ek: phase k internal exon, from acceptor splice site to donor splice site
- Eterm: terminal exon, from acceptor splice site to stop codon
- Ik: phase k intron (0: between codons; 1: after the first base of a codon; 2: after the second base of a codon)
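The intron phase definitions above can be illustrated with a short sketch. The function name `intron_phases` and the example exon lengths are hypothetical, and the lengths are assumed to count coding bases only (starting at the translation start).

```python
def intron_phases(coding_exon_lengths):
    """Return the phase of the intron that follows each coding exon.

    Phase is the number of bases of the current codon already read
    when the intron begins: 0 = between codons, 1 = after the first
    base of a codon, 2 = after the second base of a codon.
    """
    phases = []
    coding_bases = 0
    # There is no intron after the last exon, so skip it.
    for length in coding_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


# Hypothetical three-exon gene with 100, 50 and 30 coding bases.
print(intron_phases([100, 50, 30]))
```

In this toy gene the first intron interrupts a codon after its first base (phase 1) and the second falls between codons (phase 0), which is why an internal exon state needs to carry a phase.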

GenScan features
- Models both strands at once
- Each state may output a string of symbols (according to some probability distribution)
- Explicit intron/exon length modeling
- Advanced splice site modeling
- Parameters learned from annotated genes
- Separate parameter training for different C+G content groups
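To see why explicit length modeling matters, the sketch below compares the length distribution implied by an ordinary HMM state with a self-loop (always geometric) against an empirical distribution estimated from observed lengths, which is what a semi-Markov state can use. The function names and the toy exon lengths are assumptions for illustration, not GenScan's actual parameters.

```python
from collections import Counter


def geometric_length_pmf(stay_prob, length):
    """P(duration = length) for an HMM state with self-transition stay_prob."""
    return (stay_prob ** (length - 1)) * (1.0 - stay_prob)


def empirical_length_pmf(observed_lengths):
    """Estimate P(duration = length) directly from observed segment lengths."""
    counts = Counter(observed_lengths)
    total = len(observed_lengths)
    return {length: n / total for length, n in counts.items()}


# Toy exon lengths (hypothetical), clustered around 120 bp.
exon_lengths = [120, 120, 150, 90, 120]
empirical = empirical_length_pmf(exon_lengths)

# A geometric distribution with the same mean (120 bp) spreads its mass
# thinly and always puts the highest probability on length 1.
stay = 1.0 - 1.0 / 120.0
print(empirical[120], geometric_length_pmf(stay, 120))
print(geometric_length_pmf(stay, 1) > geometric_length_pmf(stay, 120))
```

The geometric form cannot represent a peaked exon length distribution, whereas a state that emits a whole string of symbols can draw the string's length from any distribution.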

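GenScan Signal Modeling
- PSSM (position-specific scoring matrix): P(S) = P_1(S_1) · P_2(S_2) · … · P_n(S_n), used for the polyA signal, translation initiation/termination signals, and promoters
- WAM (weight array model): P(S) = P_1(S_1) · P_2(S_2 | S_1) · … · P_n(S_n | S_{n-1}), used for 5' and 3' splice sites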
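The two signal models differ only in whether each position depends on its predecessor. A minimal sketch of both formulas in log space follows; the function names, the two-position toy motif and all probabilities are invented for illustration and are not trained GenScan values.

```python
import math


def pssm_log_prob(seq, pos_probs):
    """log P(S) = sum_i log P_i(S_i); pos_probs[i] maps base -> probability."""
    return sum(math.log(pos_probs[i][base]) for i, base in enumerate(seq))


def wam_log_prob(seq, first_probs, cond_probs):
    """log P(S) = log P_1(S_1) + sum_{i>=2} log P_i(S_i | S_{i-1}).

    cond_probs[i - 1] maps (previous_base, base) -> probability for the
    0-based position i >= 1.
    """
    total = math.log(first_probs[seq[0]])
    for i in range(1, len(seq)):
        total += math.log(cond_probs[i - 1][(seq[i - 1], seq[i])])
    return total


# Toy two-position motif (hypothetical numbers).
pos1 = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
pos2 = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}

# In the WAM, position 2 depends on position 1: G is very likely after A.
uniform = {b: 0.25 for b in "ACGT"}
after_a = {"A": 0.05, "C": 0.025, "G": 0.9, "T": 0.025}
cond = [{(p, b): (after_a if p == "A" else uniform)[b]
         for p in "ACGT" for b in "ACGT"}]

print(pssm_log_prob("AG", [pos1, pos2]))
print(wam_log_prob("AG", pos1, cond))
```

Working in log space avoids numerical underflow when the motif is long, and the WAM reduces to the PSSM when every conditional table ignores the previous base.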

HMM-based Gene Finding
- GENSCAN (Burge 1997)
- FGENESH (Solovyev 1997)
- HMMgene (Krogh 1997)
- GENIE (Kulp 1996)
- GENMARK (Borodovsky & McIninch 1993)
- VEIL (Henderson, Salzberg, & Fasman 1997)

GenomeScan
- Idea: enhance gene prediction with external information. DNA regions with homology to known proteins are more likely to be coding exons.
- Combine probabilistic "extrinsic" information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan).
- Focus on the "typical case" when homologous but not identical proteins are available.

GeneWise [Birney, Amitai]
- Motivation: use a good database of the protein world (Pfam) to help annotate genomic DNA
- The GeneWise algorithm aligns a profile HMM directly to the DNA

Sample GeneWise Output

Developing GeneWise Model
- Start with a Pfam domain HMM
- Replace amino acid emissions with codon emissions
- Allow for sequencing errors (deletions/insertions)
- Add a 3-state intron model
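One way to picture the step from amino acid emissions to codon emissions is sketched below. It spreads each amino acid's probability evenly over its synonymous codons in the standard genetic code; that even split, the function name `codon_emissions` and the toy input are assumptions made for illustration, since the slides do not say how GeneWise weights codons.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_emissions(aa_probs):
    """Turn an amino acid emission table into a codon emission table.

    Each amino acid's probability is divided evenly among its synonymous
    codons (an illustrative assumption). Stop codons are excluded.
    """
    synonyms = {}
    for codon, aa in CODON_TABLE.items():
        synonyms.setdefault(aa, []).append(codon)
    return {codon: aa_probs.get(aa, 0.0) / len(synonyms[aa])
            for codon, aa in CODON_TABLE.items() if aa != "*"}


# Hypothetical match state that emits Met or Leu with equal probability.
emissions = codon_emissions({"M": 0.5, "L": 0.5})
print(emissions["ATG"], emissions["CTG"])
```

Methionine has a single codon, so ATG keeps the full 0.5, while leucine's 0.5 is shared among its six codons. A real model would more plausibly use species-specific codon usage in place of the even split.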

GeneWise Model

GeneWise Intron Model
[Figure: intron model with 5' site, central spacer, PY tract, and 3' site]

GeneWise Model
- Viterbi algorithm -> "best" alignment of DNA to protein domain
- Alignment gives exact exon-intron boundaries
- Parameters learned from species-specific statistics
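The Viterbi step can be sketched on a much smaller model than GeneWise's. The code below is a generic log-space Viterbi decoder run on a hypothetical two-state HMM (E for exon, I for intron) with invented probabilities; it is not the GeneWise profile HMM, only an illustration of how the single best state path, and hence the boundary between segments, is recovered.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for an observation sequence."""
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in obs[1:]:
        pointers, new_score = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            pointers[s] = prev
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][symbol]
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


def logs(table):
    """Convert a probability table to log probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Toy model (hypothetical): exons are G/C-rich, introns are A/T-rich.
states = ["E", "I"]
log_start = logs({"E": 0.5, "I": 0.5})
log_trans = {"E": logs({"E": 0.9, "I": 0.1}), "I": logs({"E": 0.1, "I": 0.9})}
log_emit = {"E": logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
            "I": logs({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4})}

best_log_prob, path = viterbi("GGCCAATT", states, log_start, log_trans, log_emit)
print("".join(path))
```

The position where the decoded path switches from E to I plays the role of a predicted exon-intron boundary, which is the sense in which the best alignment yields exact boundaries.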

GeneWise problems
- Only provides partial prediction, and only where the homology lies
- Does not find "more" genes
- Pseudogenes and retrotransposons are picked up
- CPU intensive
- Solution: pre-filter with BLAST

Summary
- Genes are complex structures that are difficult to predict with the required level of accuracy/confidence
- Different approaches to gene finding:
  - Ab initio: GenScan
  - Ab initio modified by BLAST homologies: GenomeScan
  - Homology guided: GeneWise