Reese, E-GASP 2005: Short comparison GASP ’99 – EGASP ’05. Martin Reese, Omicia Inc., 5980 Horton Street, Emeryville, CA 94602.



The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster
Martin G. Reese, Nomi L. Harris, George Hartzell, Suzanna E. Lewis (later added: Josep Abril)
Drosophila Genome Center, Department of Molecular and Cell Biology, 539 Life Sciences Addition, University of California, Berkeley

The genome annotation experiment “GASP” 1999
- Annotation of 2.9 Mb of Drosophila melanogaster genomic DNA (EGASP ’05: 44 separate regions)
- Open to everybody, announced on several mailing lists
- Participants can use any analysis methods they like (gene finding programs, homology searches, by-eye assessment, combination methods, etc.) and should disclose their methods
- “CASP”-like
- 12 participating groups (EGASP ’05: at least 20 groups)

Reese, E-GASP URL:

Goals of the experiment
- Compare and contrast various genome annotation methods
- Objectively assess the state of the art in gene finding and functional site prediction
- Identify outstanding problems in computational methods for the annotation process

Adh contig
- 2.9 Mb of contiguous Drosophila sequence from the Adh region, one of the best-studied genomic regions
- From chromosome 2L (34D-36A)
- Ashburner et al. (to appear in Genetics)
- 222 gene annotations as of July 22, 1999 (EGASP ’05: ~450 genes)
- 375,585 bases are coding (12.95%) (EGASP ’05: ENCODE region, 30 Mb)
- We chose the Adh region because it was thought to be typical: a representative test bed for evaluating annotation techniques.

Reese, E-GASP Adh paper (to appear in Genetics) URL:

Submissions
- “MAGPIE” team: T. Gaasterland et al.
- Computational Genomics Group, The Sanger Centre: V. Solovyev
- University of Erlangen: U. Ohler
- Genome Annotation Group, The Sanger Centre: E. Birney
- Oak Ridge National Laboratory, “GRAIL”: R. Mural et al.
- CBS, Technical University of Denmark, “HMMGene”: A. Krogh
- Georgia Institute of Technology, “GeneMark.hmm”: M. Borodovsky
- IMIM, Spain, “GeneID”: Roderic Guigó et al.
- Fred Hutchinson Cancer Center, “BLOCKS”: Henikoff & Henikoff
- GSF, Neuherberg, Germany: M. Scherf
- Mount Sinai School of Medicine: Gary Benson
- UCB/UC Santa Cruz/Neomorphic, “Genie”: M. Reese and D. Kulp

Reese, E-GASP Submission classes

Reese, E-GASP Submission classes (cont.)

Measuring success
- By nucleotide: sensitivity/specificity (Sn/Sp)
- By exon: Sn/Sp; missed exons (ME), wrong exons (WE)
- By gene: Sn/Sp; missed genes (MG), wrong genes (WG); average overlap statistics
Based on Burset and Guigó (1996), “Evaluation of gene structure prediction programs”. Genomics, 34(3).
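The base-level measure can be illustrated with a minimal Python sketch. It assumes the gene-finding convention in which Sn = TP / (TP + FN) and Sp = TP / (TP + FP), so "specificity" here is the fraction of predicted coding bases that are really coding. The function names and the interval coordinates are hypothetical and are not taken from the GASP evaluation code.

```python
def covered_bases(intervals):
    """Return the set of positions covered by 1-based inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def nucleotide_sn_sp(annotated, predicted):
    """Base-level sensitivity and specificity for coding intervals (assumed convention)."""
    actual = covered_bases(annotated)
    pred = covered_bases(predicted)
    tp = len(actual & pred)  # coding bases predicted as coding
    fn = len(actual - pred)  # coding bases that were missed
    fp = len(pred - actual)  # non-coding bases predicted as coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


# Hypothetical toy data: two annotated coding exons and two predicted ones.
annotated = [(101, 200), (301, 400)]
predicted = [(101, 200), (351, 450)]
sn, sp = nucleotide_sn_sp(annotated, predicted)
print(f"Sn={sn:.2f} Sp={sp:.2f}")
```

Using position sets keeps the sketch short; for megabase-scale regions an interval-arithmetic implementation would be the more practical choice.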

Definition: “Joined” and “split” genes
SG = (# predicted genes that overlap actual genes) / (# actual genes that overlap one or more predicted genes)
JG = (# actual genes that overlap predicted genes) / (# predicted genes that overlap one or more actual genes)
- SG > 1: tendency to split actual genes into separate gene predictions
- JG > 1: tendency to join multiple actual genes into one prediction
Inspired by Hayes and Guigó (1999), unpublished.
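The two ratios can be sketched in Python as follows. This is an illustration under simplifying assumptions: each gene is reduced to a single 1-based inclusive (start, end) span, strand is ignored, and any shared base counts as an overlap. The names `overlaps` and `split_and_joined` are hypothetical.

```python
def overlaps(a, b):
    """True if two 1-based inclusive (start, end) spans share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def split_and_joined(actual_genes, predicted_genes):
    """Return (SG, JG) for gene spans, or (None, None) if nothing overlaps."""
    # Predicted genes that overlap at least one actual gene.
    pred_hit = [p for p in predicted_genes
                if any(overlaps(p, a) for a in actual_genes)]
    # Actual genes that overlap at least one predicted gene.
    act_hit = [a for a in actual_genes
               if any(overlaps(a, p) for p in predicted_genes)]
    if not pred_hit:  # no overlaps at all, so both ratios are undefined
        return None, None
    sg = len(pred_hit) / len(act_hit)
    jg = len(act_hit) / len(pred_hit)
    return sg, jg


# Hypothetical case 1: one actual gene predicted as two separate genes.
sg_split, jg_split = split_and_joined([(1, 1000)], [(1, 400), (600, 1000)])

# Hypothetical case 2: two actual genes merged into one prediction.
sg_join, jg_join = split_and_joined([(2000, 3000), (4000, 5000)], [(2000, 5000)])

print(sg_split, jg_split, sg_join, jg_join)
```

Because both ratios are computed over the whole test set, splitting at one locus and joining at another can cancel out, so SG and JG close to 1 do not by themselves show that every gene boundary is right.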

Results: Base level
- Sensitivity: low variability among predictors; ~95% coverage of the proteome
- Specificity: ~90%; programs that are more like Genscan (used for the original annotation) might do better?
- EGASP ’05 comparison: Sn 93% (“9_101_1”), Sp 92% (“20_79_1”)

Results: Exon level
- Higher variability among predictors
- Up to ~75% sensitivity (both exon boundaries correct)
- 55% specificity; low because partial exon overlaps do not count
- Missing exons below 5%
- Many wrong exons (~20%)
- EGASP ’05 comparison: Sn 89.8% (“14_87_3”), Sp 88% (“20_78_3”)
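To make the point about partial overlaps concrete, here is a small Python sketch of exon-level scoring. It assumes that an exon counts as correct only if both boundaries match exactly, that a missed exon is an annotated exon with no overlapping prediction, and that a wrong exon is a predicted exon with no overlapping annotation. The function name and the coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two 1-based inclusive (start, end) spans share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_stats(annotated, predicted):
    """Return exon-level Sn, Sp, ME and WE as a dict (assumed definitions)."""
    actual = set(annotated)
    pred = set(predicted)
    correct = actual & pred  # both boundaries identical
    missed = [a for a in actual if not any(overlaps(a, p) for p in pred)]
    wrong = [p for p in pred if not any(overlaps(p, a) for a in actual)]
    n_act, n_pred = len(actual), len(pred)
    return {
        "Sn": len(correct) / n_act if n_act else 0.0,
        "Sp": len(correct) / n_pred if n_pred else 0.0,
        "ME": len(missed) / n_act if n_act else 0.0,
        "WE": len(wrong) / n_pred if n_pred else 0.0,
    }


# Hypothetical toy data: one exact match, one partial overlap (301-390 vs. 301-400),
# two annotated exons with no prediction, and one prediction with no annotation.
annotated = [(101, 200), (301, 400), (501, 600), (701, 800)]
predicted = [(101, 200), (301, 390), (901, 950)]
stats = exon_stats(annotated, predicted)
print(stats)
```

In this toy case the partially overlapping exon lowers both Sn and Sp without being counted as missed or wrong, which is why exon-level specificity can be far below base-level specificity for the same prediction.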


Results: Gene level
- 60% of actual genes predicted completely correctly
- Specificity only 30-40%
- 5-10% missed genes (comparable to the Sanger Centre)
- 40% wrong genes; many short genes overpredicted (possibly not annotated in Standard 3)
- Splitting genes is a bigger problem than joining genes
- EGASP ’05 comparison: Sn 71% (“36_46_1”), Sp 66% (“34_55_3”)

Reese, E-GASP DRO – Human comparison

Reese, E-GASP Results (protein homology): Gene level

Discussion
- Good predictive improvements
- “Expression” improves predictions
- “Gene finding” tools became “automatic annotation” tools
- Gene sensitivity/specificity at roughly 70% is excellent
- No correct answer / real gold standard (like CASP)
- Superb community

Open questions
- How many protein-coding genes/loci were missed?
- How many human protein-coding loci are there in total? (Drosophila: <14,500)
- How many array-detected transcripts are there, and what is their function (coding or non-coding)?
- Can we get an exhaustive alternative splicing “gold standard”?