Reese, E-GASP
Short comparison GASP ‘99 - EGASP ‘05
  Martin Reese
  Omicia Inc
  Horton Street
  Emeryville, CA 94602
The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster
  Martin G. Reese, Nomi L. Harris, George Hartzell, Suzanna E. Lewis
  Later added: Josep April
  Drosophila Genome Center, Department of Molecular and Cell Biology, 539 Life Sciences Addition, University of California, Berkeley
The genome annotation experiment “GASP” 1999
  Annotation of 2.9 Mb of Drosophila melanogaster genomic DNA
    EGASP: 44 separate regions
  Open to everybody, announced on several mailing lists
  Participants can use any analysis methods they like (gene finding programs, homology searches, by-eye assessment, combination methods, etc.) and should disclose their methods.
  “CASP”-like
  12 participating groups
    EGASP: at least 20 groups
Goals of the experiment
  Compare and contrast various genome annotation methods
  Objective assessment of the state of the art in gene finding and functional site prediction
  Identify outstanding problems in computational methods for the annotation process
Adh contig
  2.9 Mb contiguous Drosophila sequence from the Adh region, one of the best studied genomic regions
  From chromosome 2L (34D-36A)
  Ashburner et al. (to appear in Genetics)
  222 gene annotations (as of July 22, 1999)
  ~450 genes
  375,585 bases are coding (12.95%)
  ENCODE region 30 Mb
  We chose the Adh region because it was thought to be typical: a representative test bed to evaluate annotation techniques.
Adh paper (to appear in Genetics) URL:
Submissions
  “MAGPIE” Team: T. Gaasterland et al.
  Computational Genomics Group, The Sanger Centre: V. Solovyev
  University of Erlangen: U. Ohler
  Genome Annotation Group, The Sanger Centre: E. Birney
  Oak Ridge National Laboratory “GRAIL”: R. Mural et al.
  CBS, Technical University of Denmark “HMMGene”: A. Krogh
  Georgia Institute of Technology “GeneMark.hmm”: M. Borodovsky
  IMIM, Spain “GeneID”: Roderic Guigó et al.
  Fred Hutchinson Cancer Center “BLOCKS”: Henikoff & Henikoff
  GSF, Neuherberg, Germany: M. Scherf
  Mount Sinai School of Medicine: Gary Benson
  UCB/UC Santa Cruz/Neomorphic “Genie”: M. Reese and D. Kulp
Submission classes
Measuring success
  By nucleotide
    Sensitivity/Specificity (Sn/Sp)
  By exon
    Sn/Sp
    Missed exons (ME), wrong exons (WE)
  By gene
    Sn/Sp
    Missed genes (MG), wrong genes (WG)
  Average overlap statistics
  Based on Burset and Guigó (1996), “Evaluation of gene structure prediction programs”. Genomics, 34(3)
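As a rough illustration of the nucleotide- and exon-level measures listed above, here is a minimal Python sketch. It assumes the usual gene-finding convention from Burset and Guigó, where Sn = TP/(TP+FN) and Sp = TP/(TP+FP), and where an exon counts as correct only if both boundaries match. The function names, the half-open `(start, end)` interval representation, and the toy coordinates are illustrative assumptions, not part of the GASP evaluation code. Gene-level measures are not covered.

```python
def overlaps(x, y):
    """True if half-open intervals x=(start, end) and y=(start, end) share a base."""
    return x[0] < y[1] and y[0] < x[1]


def covered(intervals):
    """Set of all base positions covered by a list of half-open intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def base_level(actual, predicted):
    """Nucleotide-level (Sn, Sp): shared coding bases over actual / predicted coding bases."""
    a, p = covered(actual), covered(predicted)
    tp = len(a & p)
    return tp / len(a), tp / len(p)


def exon_level(actual, predicted):
    """Exon-level Sn, Sp, missed-exon (ME) and wrong-exon (WE) fractions.

    An exon is correct only if both boundaries match exactly; it is
    missed (or wrong) only if it overlaps nothing in the other set.
    """
    a, p = set(actual), set(predicted)
    correct = len(a & p)
    missed = sum(1 for x in a if not any(overlaps(x, y) for y in p))
    wrong = sum(1 for y in p if not any(overlaps(x, y) for x in a))
    return {
        "Sn": correct / len(a),
        "Sp": correct / len(p),
        "ME": missed / len(a),
        "WE": wrong / len(p),
    }


# Toy example (hypothetical coordinates)
actual_exons = [(0, 100), (200, 300)]
predicted_exons = [(0, 100), (250, 320), (400, 450)]
base_sn, base_sp = base_level(actual_exons, predicted_exons)
exon_stats = exon_level(actual_exons, predicted_exons)
```

In the toy example, the second predicted exon overlaps an actual exon but has the wrong boundaries, so it lowers exon-level Sn and Sp without counting as a wrong exon. This is the same effect the exon-level results attribute to partial overlaps.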
Definition: “Joined” and “split” genes
  \[ \mathrm{SG} = \frac{\#\ \text{predicted genes that overlap actual genes}}{\#\ \text{actual genes that overlap one or more predicted genes}} \]
  \[ \mathrm{JG} = \frac{\#\ \text{actual genes that overlap predicted genes}}{\#\ \text{predicted genes that overlap one or more actual genes}} \]
  SG > 1: tendency to split actual genes into separate gene predictions
  JG > 1: tendency to join multiple actual genes into one prediction
  Inspired by Hayes and Guigó (1999), unpublished.
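The two ratios can be sketched in a few lines of Python. This is one literal reading of the definitions on the slide, counting each gene once. Under that reading JG is simply the reciprocal of SG, and the original evaluation may have counted overlaps differently. The function name `split_join`, the representation of genes as half-open `(start, end)` spans, and the toy coordinates are assumptions for illustration only.

```python
def overlaps(x, y):
    """True if half-open intervals x=(start, end) and y=(start, end) share a base."""
    return x[0] < y[1] and y[0] < x[1]


def split_join(actual, predicted):
    """Return (SG, JG) for lists of actual and predicted gene spans.

    Assumes at least one actual gene overlaps a predicted gene;
    otherwise both ratios are undefined (division by zero).
    """
    # Predicted genes that overlap at least one actual gene
    pred_hit = [p for p in predicted if any(overlaps(p, a) for a in actual)]
    # Actual genes that overlap at least one predicted gene
    act_hit = [a for a in actual if any(overlaps(a, p) for p in predicted)]
    sg = len(pred_hit) / len(act_hit)
    jg = len(act_hit) / len(pred_hit)
    return sg, jg


# Toy example (hypothetical coordinates): the first actual gene is split
# into two predictions and the third actual gene is missed entirely.
actual_genes = [(0, 100), (200, 300), (500, 600)]
predicted_genes = [(0, 40), (60, 100), (200, 300)]
sg, jg = split_join(actual_genes, predicted_genes)
```

Here three predicted genes overlap two actual genes, so SG is above 1 and JG is below 1, indicating a tendency to split rather than join.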
Results: Base level
  Sensitivity:
    Low variability among predictors
    ~95% coverage of the proteome
  Specificity ~90%
    Programs that are more like Genscan (used for original annotation) might do better?
  Sn 93% “9_101_1”
  Sp 92% “20_79_1”
Results: Exon level
  Higher variability among predictors
  Up to ~75% sensitivity (both exon boundaries correct)
  55% specificity
    Low specificity because partial exon overlaps do not count
  Missing exons below 5%
  Many wrong exons (~20%)
  Sn 89.8% “14_87_3”
  Sp 88% “20_78_3”
Results: Gene level
  60% of actual genes predicted completely correctly
  Specificity only 30-40%
  5-10% missed genes (comparable to Sanger Centre)
  40% wrong genes, a lot of short genes overpredicted (possibly not annotated in Standard 3)
  Splitting genes is a bigger problem than joining genes
  Sn 71% “36_46_1”
  Sp 66% “34_55_3”
DRO – Human comparison
Results (protein homology): Gene level
Discussion
  Good predictive improvements
  “expression” improves predictions
  “gene finding” tools became “automatic annotation” tools
  Gene sensitivity/specificity at roughly 70% is excellent
  No correct answer/real gold standard (like CASP)
  Superb community
Open questions
  How many protein coding genes/loci missed?
  How many total human protein coding loci are there? (Dro < 14,500)
  How many array-detected transcripts are there, and what is their function (coding or non-coding)?
  Can we get an exhaustive alternative splicing “gold standard”?