Introduction to ab initio and evidence-based gene finding Wilson Leung08/2015.

Slides:



Advertisements
Similar presentations
An Introduction to Bioinformatics Finding genes in prokaryotes.
Advertisements

Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Ab initio gene prediction Genome 559, Winter 2011.
Genome analysis and annotation. Genome Annotation Which sequences code for proteins and structural RNAs ? What is the function of the predicted gene products.
Gene Prediction Methods G P S Raghava. Prokaryotic gene structure ORF (open reading frame) Start codon Stop codon TATA box ATGACAGATTACAGATTACAGATTACAGGATAG.
Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken from (and rapidly mixed) Larry Hunter, Tom Madej, William Stafford Noble,
1 Gene Finding Charles Yan. 2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where.
CSE182-L10 Gene Finding.
Comparative ab initio prediction of gene structures using pair HMMs
“Gene Finding in Novel Genomes” by Ian Korf Presented by: Christine Lee SoCAL BSI 2004.
Eukaryotic Gene Finding
Lecture 12 Splicing and gene prediction in eukaryotes
Eukaryotic Gene Finding
Genome Annotation BCB 660 October 20, From Carson Holt.
Gene Finding Genome Annotation. Gene finding is a cornerstone of genomic analysis Genome content and organization Differential expression analysis Epigenomics.
Biological Motivation Gene Finding in Eukaryotic Genomes
Gene Structure and Identification
Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure.
Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah) iPlant: Josh Stein (CSHL) Matt Vaughn.
Genome Annotation BBSI July 14, 2005 Rita Shiang.
BME 110L / BIOL 181L Computational Biology Tools October 29: Quickly that demo: how to align a protein family (10/27)
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
1 Transcript modeling Brent lab. 2 Overview Of Entertainment  Gene prediction Jeltje van Baren  Improving gene prediction with tiling arrays Aaron Tenney.
Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.
Exploring Alternative Splicing Features using Support Vector Machines Feature for Alternative Splicing Alternative splicing is a mechanism for generating.
Genome Annotation Rosana O. Babu.
Srr-1 from Streptococcus. i/v nonpolar s serine (polar uncharged) n/s/t polar uncharged s serine (polar uncharged) e glutamic acid (neg. charge) sserine.
Mark D. Adams Dept. of Genetics 9/10/04
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.
From Genomes to Genes Rui Alves.
Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015.
Overview of the Drosophila modENCODE hybrid assemblies Wilson Leung01/2014.
Annotation of Drosophila primer
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Eukaryotic Gene Prediction Rui Alves. How are eukaryotic genes different? DNA RNA Pol mRNA Ryb Protein.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
Annotation of Drosophila virilis Chris Shaffer GEP workshop, 2006.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Fgenes++ pipelines for automatic annotation of eukaryotic genomes Victor Solovyev, Peter Kosarev, Royal Holloway College, University of London Softberry.
Bioinformatics Workshops 1 & 2 1. use of public database/search sites - range of data and access methods - interpretation of search results - understanding.
Primer on Annotation of Drosophila Genes GEP Workshop – January 2016 Wilson Leung and Chris Shaffer.
Applications of HMMs in Computational Biology BMI/CS 576 Colin Dewey Fall 2010.
Gene Structure and Identification III BIO520 BioinformaticsJim Lund Previous reading: 1.3, , 10.4,
Hidden Markov Model and Its Application in Bioinformatics Liqing Department of Computer Science.
Genome Annotation Assessment in Drosophila melanogaster by Reese, M. G., et al. Summary by: Joe Reardon Swathi Appachi Max Masnick Summary of.
(H)MMs in gene prediction and similarity searches.
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Gene Finding in Chimpanzee Evidence based improvement of ab initio gene predictions Chris Shaffer06/2009.
Introducing Hidden Markov Models First – a Markov Model State : sunny cloudy rainy sunny ? A Markov Model is a chain-structured process where future states.
Basics of Genome Annotation Daniel Standage Biology Department Indiana University.
Biological Motivation Gene Finding in Eukaryotic Genomes Rhys Price Jones Anne R. Haake.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
Web Databases for Drosophila
Introduction to ab initio and evidence-based gene finding
Annotation for D. virilis
bacteria and eukaryotes
Genome Annotation (protein coding genes)
What is a Hidden Markov Model?
Hidden Markov Models (HMM)
Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of.
GEP Annotation Workflow
Eukaryotic Gene Finding
Ab initio gene prediction
Introduction to Bioinformatics II
CISC 667 Intro to Bioinformatics (Fall 2005) Hidden Markov Models (IV)
Profile HMMs GeneScan TMMOD
4. HMMs for gene finding HMM Ability to model grammar
Genome Annotation and the Human Genome
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Presentation transcript:

Introduction to ab initio and evidence-based gene finding Wilson Leung08/2015

Outline Overview of computational gene predictions Different types of eukaryotic gene predictors Common types of gene prediction errors

Computational gene predictions Identify genes in genomic sequence Protein-coding genes Non-coding RNA genes Regulatory regions (enhancers, promoters) Predictions must be confirmed experimentally Eukaryotic gene predictions have high error rates Two major types of RefSeq records NM/NP = experimentally confirmed XM/XP = computational predictions

Primary goal of computational gene prediction algorithms Label each nucleotide in a genomic sequence Identify the most likely sequence of labels (i.e. optimal path) TTTCACACGTAAGTATAGTGTGTGA Sequence EEEEEEEESSIIIIIIIIIIIIIII Path 1 EEEEEEEEEEEESSIIIIIIIIIII Path 2 EEEEEEEEEEEEEEEEESSIIIIII Path 3 Exon (E)5’ Splice Site (S)Intron (I) Labels

Basic properties of gene prediction algorithms Predictions must satisfy a set of biological constraints Coding region must begin with a start codon Initial exon must occur before splice sites and introns Coding region must end with a stop codon Model these rules using a finite state machine (FSM) Use species-specific characteristics to improve the accuracy of gene predictions Distribution of exon and intron sizes Base frequencies ( e.g., GC content, codon bias)

Prokaryotic gene predictions Prokaryotes have relatively simple gene structure Single open reading frame Alternative start codons: AUG, GUG, UUG Gene finders can predict most prokaryotic genes accurately (> 90% sensitivity and specificity) Glimmer Salzberg S., et al., Microbial gene identification using interpolated Markov models, NAR. (1998) 26,

Eukaryotic gene predictions have high error rates Gene finders generally do a poor job (<50%) predicting genes in eukaryotes More variations in the gene models Alternative splicing (multiple isoforms) Non-canonical splice sites Alternate start codon ( e.g., Fmr1 ) Stop codon read through ( e.g., gish ) Nested genes ( e.g., ko ) Trans-splicing ( e.g., mod(mdg4) ) Pseudogenes

Types of eukaryotic gene predictors Ab initio GENSCAN, geneid, SNAP, GlimmerHMM Evidence-based (extrinsic) Augustus, genBlastG, Exonerate, GenomeScan Comparative genomics Twinscan/N-SCAN, SGP2 Combine ab initio and evidence-based approaches EVM, GLEAN, Gnomon, JIGSAW, MAKER

Ab initio gene prediction Ab initio = from the beginning Predict genes using only the genomic DNA sequence Search for signals of protein coding regions Typically based on a probabilistic model Hidden Markov Models (HMM) Support Vector Machines (SVM) GENSCAN Burge, C. and Karlin, S., Prediction of complete gene structures in human genomic DNA, JMB. (1997), 268, 78-94

Hidden Markov Models (HMM) A type of supervised machine learning algorithm Uses Bayesian statistics Makes classifications based on characteristics of training data Many types of applications Speech and gesture recognition Bioinformatics Gene predictions Sequence alignments ChIP-seq analysis Protein folding

Supervised machine learning Norvig, P. How to write a spelling corrector. Use aggregated data of previous search results to predict the search term and the correct spelling

GEP curriculum on HMM Developed by Anton Weisstein (Truman State University) and Zane Goodwin (TA for Bio 4342) Use a HMM to predict a splice donor site Use Excel to experiment with different emission and transition probabilities See the Beyond Annotation section of the GEP web site

GENSCAN HMM Model GENSCAN uses the following information to construct gene models: Promoter, splice site and polyadenylation signals Hexamer frequencies and base compositions Probability of coding and non-coding DNA Distribution of gene, exon and intron lengths Burge, C. and Karlin, S., Prediction of complete gene structures in human genomic DNA, JMB. (1997) 268, 78-94

Use multiple HMMs to describe different parts of a gene Stanke M., Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. (2003) 19 Suppl 2:ii

Evidence-based gene predictions Use sequence alignments to improve predictions EST, cDNA or protein from closely-related species Exon sensitivity : Percent of real exons identified Exon specificity : Percent of predicted exons that are correct Yeh RF, et al., Computational Inference of Homologous Gene Structures in the Human Genome, Genome Res. (2001) 11,

Augustus gene prediction service Stanke, M., et al., Using native and syntenically mapped cDNA alignments to improve de novo gene finding, Bioinformatics. (2008) 24,

Predictions using comparative genomics Use whole genome alignments from one or more informant species CONTRAST predicts 50% of genes correctly Requires high quality whole genome alignments and training data Flicek P, Gene prediction: compare and CONTRAST. Genome Biology 2007, 8, 233

Generate consensus gene models Gene predictors have different strengths and weaknesses Create consensus gene models by combining results from multiple gene finders and sequence alignments GLEAN Eisik CG, Mackey AJ, Reese JT, et al. Creating a honey bee consensus gene set. Genome Biology 2007, 8 :R13 GLEAN-R (reconciled) reference gene sets for 11 Drosophila species available at FlyBase

GLEAN-R prediction for the ey ortholog in D. erecta Single GLEAN-R prediction per genomic location Models have not been confirmed experimentally GLEAN-R RefSeq records have XM and XP prefixes

Automated annotation pipelines NCBI Gnomon gene prediction pipeline Integrate biological evidence into the predicted gene models Examples:Eannot NCBI Gnomon Ensembl UCSC Gene Build EGASP results for the Ensembl pipeline: 71.6% gene sensitivity 67.3% gene specificity

New Gnomon predictions for eight Drosophila species Based on RNA-Seq data from either the same or closely-related species D. simulans, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. willistoni, D. virilis, and D. mojavensis Predictions include untranslated regions and multiple isoforms Records not yet available through the NCBI RefSeq database

Final models Gene predictions Common problems with gene finders Split single gene into multiple predictions Fused with neighboring genes Missing exons Over predict exons or genes Missing isoforms

Non- canonical splice donors and acceptors Donor siteCount GC599 AT27 GA14 Acceptor siteCount AC33 TG28 AT17 Frequency of non-canonical splice sites in FlyBase Release 6.06 (Number of unique introns: 71,476) Many gene predictors only predict genes with canonical splice donor ( GT ) and acceptor ( AG ) sites Check Gene Record Finder or FlyBase for genes that use non-canonical splice sites in D. melanogaster

Annotate unusual features in gene models using D. melanogaster as a reference Examine the “ Comments on Gene Model ” section of the FlyBase Gene Report Non-canonical start codon: Stop codon read through:

Nested genes in Drosophila D. mel. D. erecta

Trans-spliced gene in Drosophila A special type of RNA processing where exons from two primary transcripts are ligated together

Gene prediction results for the GEP annotation projects Gene prediction results are available through the GEP UCSC Genome Browser mirror Under the Genes and Gene Prediction Tracks section Access the predicted peptide sequence by clicking on the feature, then click on the Predicted Protein link Original gene predictor output available inside the folder Genefinder in the annotation package The Genscan folder contains a PDF with a graphical schematic of the gene predictions

Summary Gene predictors can quickly identify potentially interesting features within a genomic sequence The predictions are hypotheses that must be confirmed experimentally Eukaryotic gene predictors generally can accurately identify internal exons Much lower sensitivity and specificity when predicting complete gene models

Questions?