1
Reese, E-GASP 2005
Short comparison: GASP ‘99 - EGASP ‘05
Martin Reese (mreese@omicia.com)
Omicia Inc., 5980 Horton Street, Emeryville, CA 94602
2
The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster
Martin G. Reese (mgreese@lbl.gov)
Nomi L. Harris (nlharris@lbl.gov)
George Hartzell (hartzell@cs.berkeley.edu)
Suzanna E. Lewis (suzi@fruitfly.berkeley.edu)
Later added: Josep Abril
Drosophila Genome Center, Department of Molecular and Cell Biology, 539 Life Sciences Addition, University of California, Berkeley
3
The genome annotation experiment “GASP” 1999
Annotation of 2.9 Mb of Drosophila melanogaster genomic DNA (EGASP: 44 separate regions)
Open to everybody, announced on several mailing lists
Participants can use any analysis methods they like (gene finding programs, homology searches, by-eye assessment, combination methods, etc.) and should disclose their methods
“CASP”-like
12 participating groups (EGASP: at least 20 groups)
4
URL: http://www-hgc.lbl.gov/homes/reese/genome-annotation
5
Goals of the experiment
Compare and contrast various genome annotation methods
Objective assessment of the state of the art in gene finding and functional site prediction
Identify outstanding problems in computational methods for the annotation process
6
Adh contig
2.9 Mb contiguous Drosophila sequence from the Adh region, one of the best studied genomic regions
From chromosome 2L (34D-36A)
Ashburner et al. (to appear in Genetics)
222 gene annotations (as of July 22, 1999)
~450 genes
375,585 bases are coding (12.95%)
ENCODE region: 30 Mb
We chose the Adh region because it was thought to be typical: a representative test bed to evaluate annotation techniques.
7
Adh paper (to appear in Genetics)
URL: http://www.fruitfly.org/publications/PDF/ADH.pdf
8
Submissions
“MAGPIE” Team: T. Gaasterland et al.
Computational Genomics Group, The Sanger Centre: V. Solovyev
University of Erlangen: U. Ohler
Genome Annotation Group, The Sanger Centre: E. Birney
Oak Ridge National Laboratory “GRAIL”: R. Mural et al.
CBS, Technical University of Denmark “HMMGene”: A. Krogh
Georgia Institute of Technology “GeneMark.hmm”: M. Borodovsky
IMIM, Spain “GeneID”: Roderic Guigó et al.
Fred Hutchinson Cancer Center “BLOCKS”: Henikoff & Henikoff
GSF, Neuherberg, Germany: M. Scherf
Mount Sinai School of Medicine: Gary Benson
UCB/UC Santa Cruz/Neomorphic “Genie”: M. Reese and D. Kulp
9
Submission classes
10
Submission classes (cont.)
11
Measuring success
By nucleotide: Sensitivity/Specificity (Sn/Sp)
By exon: Sn/Sp, missed exons (ME), wrong exons (WE)
By gene: Sn/Sp, missed genes (MG), wrong genes (WG)
Average overlap statistics
Based on Burset and Guigó (1996), “Evaluation of gene structure prediction programs”. Genomics, 34(3), 353-367.
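The nucleotide-level measure can be illustrated with a minimal sketch. It assumes the usual gene-finding convention, in which Sn = TP / (TP + FN) and Sp = TP / (TP + FP), counted over coding bases. The function names, the inclusive (start, end) interval format, and the toy coordinates are hypothetical and do not come from the GASP evaluation itself.

```python
def covered_positions(intervals):
    """Return the set of base positions covered by inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def base_level_sn_sp(annotated, predicted):
    """Nucleotide-level sensitivity and specificity for coding intervals.

    Sn = TP / (TP + FN): fraction of annotated coding bases that were predicted.
    Sp = TP / (TP + FP): fraction of predicted coding bases that are annotated.
    """
    actual = covered_positions(annotated)
    pred = covered_positions(predicted)
    tp = len(actual & pred)   # coding in both annotation and prediction
    fn = len(actual - pred)   # annotated coding bases that were not predicted
    fp = len(pred - actual)   # predicted coding bases that are not annotated
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    return sn, sp


# Toy data (hypothetical coordinates, not taken from the Adh region).
annotated = [(100, 199), (300, 399)]
predicted = [(150, 199), (300, 449)]
sn, sp = base_level_sn_sp(annotated, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

A set of positions keeps the sketch short; for a contig of several megabases an interval-based sweep would use far less memory.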
12
Definition: “Joined” and “split” genes
SG = (# predicted genes that overlap actual genes) / (# actual genes that overlap one or more predicted genes)
JG = (# actual genes that overlap predicted genes) / (# predicted genes that overlap one or more actual genes)
SG > 1: tendency to split actual genes into separate gene predictions
JG > 1: tendency to join multiple actual genes into one prediction
Inspired by Hayes and Guigó (1999), unpublished.
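As a rough illustration of the two ratios, the sketch below takes the slide's wording literally and counts each gene once. That reading is an assumption: under it JG is simply the reciprocal of SG, so the evaluation may well have counted overlaps differently (for example, per overlapping pair). The function names and gene coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def split_join_ratios(actual_genes, predicted_genes):
    """Return (SG, JG), counting each gene once (an assumed reading of the slide)."""
    # Predicted genes that overlap at least one actual gene.
    pred_hits = sum(
        1 for p in predicted_genes if any(overlaps(p, a) for a in actual_genes)
    )
    # Actual genes that overlap at least one predicted gene.
    actual_hits = sum(
        1 for a in actual_genes if any(overlaps(a, p) for p in predicted_genes)
    )
    if pred_hits == 0 or actual_hits == 0:
        raise ValueError("no overlapping genes; ratios are undefined")
    sg = pred_hits / actual_hits
    jg = actual_hits / pred_hits
    return sg, jg


# Toy data: the first actual gene is split into two predictions,
# the third actual gene is missed, and the last prediction is a wrong gene.
actual_genes = [(100, 500), (1000, 1500), (2000, 2200)]
predicted_genes = [(100, 250), (300, 500), (1000, 1400), (3000, 3100)]
sg, jg = split_join_ratios(actual_genes, predicted_genes)
print(f"SG = {sg:.2f}, JG = {jg:.2f}")
```

Missed and wrong genes fall out of both counts, so the ratios describe only the genes that were found at all.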
13
Results: Base level
Sensitivity: low variability among predictors, ~95% coverage of the proteome
Specificity: ~90%
Programs that are more like Genscan (used for original annotation) might do better?
Sn 93% “9_101_1”
Sp 92% “20_79_1”
14
Results: Exon level
Higher variability among predictors
Up to ~75% sensitivity (both exon boundaries correct)
55% specificity
Low specificity because partial exon overlaps do not count
Missing exons below 5%
Many wrong exons (~20%)
Sn 89.8% “14_87_3”
Sp 88% “20_78_3”
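The effect of not counting partial overlaps can be seen in a small sketch of exon-level scoring. It assumes the common convention that an exon is correct only when both boundaries match exactly, that a missed exon is an actual exon with no overlapping prediction, and that a wrong exon is a predicted exon with no overlapping actual exon. Function names and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level_metrics(actual_exons, predicted_exons):
    """Exon-level Sn, Sp, missed exons (ME) and wrong exons (WE) as fractions."""
    actual = set(actual_exons)
    predicted = set(predicted_exons)
    if not actual or not predicted:
        raise ValueError("need at least one actual and one predicted exon")
    # Correct exons: both boundaries identical in annotation and prediction.
    correct = actual & predicted
    # Missed exons: actual exons that no prediction touches at all.
    missed = [a for a in actual if not any(overlaps(a, p) for p in predicted)]
    # Wrong exons: predicted exons that touch no actual exon at all.
    wrong = [p for p in predicted if not any(overlaps(p, a) for a in actual)]
    return {
        "sn": len(correct) / len(actual),
        "sp": len(correct) / len(predicted),
        "me": len(missed) / len(actual),
        "we": len(wrong) / len(predicted),
    }


# Toy data: (300, 420) overlaps the actual exon (300, 400) but misses its
# 3' boundary, so it is neither correct nor wrong.
actual_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted_exons = [(100, 200), (300, 420), (500, 600), (900, 950)]
metrics = exon_level_metrics(actual_exons, predicted_exons)
print(metrics)
```

In this toy case one prediction is off by 20 bases at a single boundary, yet it lowers both Sn and Sp without registering as a missed or wrong exon.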
15
Results: Gene level
Sn 71% “36_46_1”
Sp 66% “34_55_3”
16
Results: Gene level
60% of actual genes predicted completely correctly
Specificity only 30-40%
5-10% missed genes (comparable to Sanger Centre)
40% wrong genes; many short genes overpredicted (possibly not annotated in Standard 3)
Splitting genes is a bigger problem than joining genes
Sn 71% “36_46_1”
Sp 66% “34_55_3”
17
DRO – Human comparison
18
Results (protein homology): Gene level
19
Discussion
Good predictive improvements
“Expression” improves predictions
“Gene finding” tools became “automatic annotation” tools
Gene sensitivity/specificity at roughly 70% is excellent
No correct answer/real gold standard (like CASP)
Superb community
20
Open questions
How many protein coding genes/loci are missed?
How many total human protein coding loci are there? (Dro <14,500)
How many array-detected transcripts are there, and what is their function (coding or non-coding)?
Can we get an exhaustive alternative splicing “gold standard”?