1
Annotations, Subsystems based approach
Rob Edwards, Argonne National Labs, San Diego State University
2
The Problem (I) The 1,000th genome will be sequenced soon.
[Figure: number of complete genomes (y-axis, ticks from 2,000 to 5,000) by year (x-axis, 1996 to 2008)]
3
The Problem (II) Growth of sequencing versus annotation. www.nmpdr.org
4
The SEED and NMPDR
The SEED: 504 complete genomes
  Archaea 30
  Bacteria 445
  Eukarya 29
NMPDR: 62 complete genomes
  Campylobacter 10
  Listeria 16
  Staphylococcus 11
  Streptococcus 14
  Vibrio
5
The questions:
How do you generate consistent annotations for 1,000 genomes?
How do you measure consistent annotations for 1,000 genomes?
6
Basic biology lacI lacZ lacY lacA
7
Different types of clustering
[Figure labels: < 80%, < 80%]
8
Occurrence of clustering in different genomes
[Figure: fraction of genes in clusters (axis 0 to 1) and number of genomes (axis 0 to 120) per group. Series: clusters of genes w/ maximum 80% identity; genes in subsystems in clusters; total number of genomes in group. Groups: Aquificae, Chlamydiae, Chloroflexi, Deinococcus-Thermus, Firmicutes, Actinobacteria, Bacteroidetes, Cyanobacteria, Spirochaetes, Thermotogae, Proteobacteria, Average]
9
The Subsystems Approach to Annotation
Subsystem is a generalization of “pathway”
  collection of functional roles jointly involved in a biological process or complex
Functional Role is the abstract biological function of a gene product
  atomic, or user-defined, examples:
    6-phosphofructokinase (EC )
    LSU ribosomal protein L31p
    Streptococcal virulence factors
  Does not (usually) contain “putative”, “thermostable”, etc.
Populated subsystem is complete spreadsheet of functions and roles across organisms
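The "populated subsystem" spreadsheet can be pictured as a small data structure: one column per functional role, one row per organism, each cell holding the genes that play that role. The sketch below is only an illustration under assumptions; the class name `Subsystem`, its fields, and the genome and gene identifiers are hypothetical and are not the SEED's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subsystem:
    """A named set of functional roles (hypothetical layout, not the SEED schema)."""
    name: str
    roles: List[str]
    # Populated spreadsheet: genome -> role -> list of gene identifiers.
    cells: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)

    def assign(self, genome: str, role: str, gene: str) -> None:
        """Connect a gene in a genome to one of this subsystem's roles."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role of {self.name!r}")
        self.cells.setdefault(genome, {}).setdefault(role, []).append(gene)

    def row(self, genome: str) -> List[List[str]]:
        """One spreadsheet row: genes per role, in role order (empty list if none)."""
        by_role = self.cells.get(genome, {})
        return [by_role.get(role, []) for role in self.roles]


# Illustrative data only: the subsystem name, genome and gene IDs are made up.
toy = Subsystem("toy subsystem", ["6-phosphofructokinase", "LSU ribosomal protein L31p"])
toy.assign("genome_A", "6-phosphofructokinase", "gene_0001")
print(toy.row("genome_A"))
```

An empty cell in a row is informative in this layout: it marks a role for which no gene has been connected in that organism.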
10
Thermotoga: ~40% of genes covered in 304 different subsystems
11
Subsystems developed based on
Wet lab
Chromosomal context
Metabolic context
Phylogenetic context
Microarray data
Proteomics data
…
12
How do we measure annotations?
13
Natural Metrics
Number of subsystems defined
Number of functional roles defined
Number of genes connected to functional roles
14
Annotations for NMPDR Genomes
15
Applied Metrics
Number of solid connections of gene to functional role, where “solid” is
  supported by experimental data
  connected to functional role and in chromosomal cluster with genes implementing functional roles from the same subsystem
  only gene in genome connected to a functional role in an active variant of a subsystem
Reactions, GO terms, Articles, Other databases
  cross references (number and diversity)
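As a rough illustration of the "solid connection" count, the sketch below treats the three criteria as alternatives, so a connection is solid if any one holds. The slide does not say whether they combine with "or" or "and", so that reading is an assumption, and the `Connection` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Connection:
    """A gene-to-functional-role connection (hypothetical record layout)."""
    gene: str
    role: str
    has_experimental_support: bool = False
    clustered_with_same_subsystem: bool = False
    only_gene_for_role_in_active_variant: bool = False


def is_solid(c: Connection) -> bool:
    # Assumption: any single criterion is enough to call the connection solid.
    return (
        c.has_experimental_support
        or c.clustered_with_same_subsystem
        or c.only_gene_for_role_in_active_variant
    )


def count_solid(connections: Iterable[Connection]) -> int:
    """Applied metric: number of solid gene-to-role connections."""
    return sum(1 for c in connections if is_solid(c))


# Illustrative data only: gene IDs and flags are made up.
example = [
    Connection("gene_1", "6-phosphofructokinase", has_experimental_support=True),
    Connection("gene_2", "LSU ribosomal protein L31p"),
    Connection("gene_3", "LSU ribosomal protein L31p", clustered_with_same_subsystem=True),
]
print(count_solid(example))
```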
16
Applied Metrics
17
The Importance of Consistency
Consistency: same genes connected to same functional role
Enables communication
Required for most comparative genomics assays
18
hisA www.nmpdr.org
FIG function:
  Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC )
Other functions in RefSeq:
  phosphoribosylformimino-5-aminoimidazole carboxamide
  phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase
  phosphoribosylformimino-5-aminoimidazole carboxamide ribotide...
  1-(5-phosphoribosyl)-5-[(5- phosphoribosylamino)methylideneamino] imidazole-4-carboxamide isomerase
  N-(5-phospho-L-ribosyl-formimino)-5-amino-1-(5- phosphoribosyl)-4-imidazolecarboxamide isomerase
  N-(5'-phospho-L-ribosyl-formimino)-5-amino-1-(5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
  N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'- phosphoribosyl)-4-imidazolecarboxamide isomerase
  N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
  N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'-phosphoribosyl)-4- imidazolecarboxamide isomerase
  Phosphoribosyl isomerase A [1-[5-phosphoribosyl]-5-[[5-phosphoribosylamino]methylideneamino] imidazole-4-carboxamide isomerase]
19
Measuring Consistency
Define a set of protein families such that each family contains genes playing the same function
Attach functional roles to protein families
Measure the consistency of the annotations made to genes within each family
  "consistency" is the odds that two proteins from the same family have the same function
Evaluate both families and functions.
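One way to make the consistency measure concrete is to read "odds" as a probability: the chance that two distinct proteins drawn at random from the same family carry exactly the same annotation string. The sketch below follows that reading, which is an assumption, as are the function name `family_consistency` and the choice to return `None` for families with fewer than two members.

```python
from collections import Counter
from typing import Optional, Sequence


def family_consistency(annotations: Sequence[str]) -> Optional[float]:
    """Probability that two distinct family members share the same annotation.

    `annotations` holds one function string per protein in the family.
    Returns None when the family has fewer than two members (no pairs exist).
    """
    n = len(annotations)
    if n < 2:
        return None
    counts = Counter(annotations)
    # Ordered pairs of distinct proteins that carry the same annotation.
    same_pairs = sum(c * (c - 1) for c in counts.values())
    return same_pairs / (n * (n - 1))


# Illustrative family: three proteins agree, one carries a different name.
print(family_consistency(["isomerase", "isomerase", "isomerase", "carboxamide"]))  # 0.5
```

Because the comparison is on exact strings, names that differ only in spacing or punctuation count as disagreements, which is the situation the hisA slide illustrates.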
20
Consistency among databases
21
Number of RefSeq proteins in families
22
How to measure accuracy
If everything were called “hypothetical protein” the database would be 100% consistent
Need to measure accuracy (specificity) as well as consistency
  Sample 100 proteins at random from “curated” set (i.e. those that are believed to be correct)
  Manually inspect annotations to score correctness
TIGR/SEED joint project for annotation consistency server
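The sampling step can be sketched as follows. Manual inspection cannot be automated, so the `is_correct` callback stands in for a curator's judgement; that callback, the function name `sample_accuracy`, and the identifiers in the example are all hypothetical.

```python
import random
from typing import Callable, Iterable, Optional


def sample_accuracy(
    curated_ids: Iterable[str],
    is_correct: Callable[[str], bool],
    sample_size: int = 100,
    seed: Optional[int] = None,
) -> float:
    """Estimate accuracy as the fraction of sampled annotations judged correct.

    `is_correct` is a placeholder for manual inspection of one protein's annotation.
    """
    ids = list(curated_ids)
    if not ids:
        raise ValueError("curated set is empty")
    rng = random.Random(seed)  # seed only to make the draw reproducible
    sample = rng.sample(ids, min(sample_size, len(ids)))
    correct = sum(1 for protein_id in sample if is_correct(protein_id))
    return correct / len(sample)


# Illustrative run: made-up IDs, with a fixed set of "reviewed as wrong" entries.
ids = [f"protein_{i}" for i in range(100)]
wrong = {f"protein_{i}" for i in range(20)}
print(sample_accuracy(ids, lambda pid: pid not in wrong, sample_size=100, seed=0))  # 0.8
```

With 100 identifiers and a sample of 100, every protein is drawn, so the estimate in the example equals the true fraction; with a larger curated set the result varies with the sample.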
23
Acknowledgements
SEED: http://www.theseed.org/
NMPDR:
RAST:
metaRAST:
ANL
  Rick Stevens
  Bob Olsen
  Folker Meyer
  Daniela Bartels
  Tobi Paczian
  Daniel Paarmann
  Terry Disz
FIG
  Veronika Vonstein
  Ross Overbeek
Annotators