Annotations, Subsystems based approach Rob Edwards Argonne National Labs San Diego State University.

1 Annotations, Subsystems based approach Rob Edwards Argonne National Labs San Diego State University

2 The Problem (I) [Figure: number of complete genomes (y-axis, 1,000 to 5,000) plotted by year (x-axis, 1996 to 2008).] The 1,000th genome will be sequenced soon. www.nmpdr.org www.theseed.org

3 The Problem (II) Growth of sequencing versus annotation.

4 The SEED and NMPDR
The SEED: 504 complete genomes
– Archaea: 30
– Bacteria: 445
– Eukarya: 29
NMPDR: 62 complete genomes
– Campylobacter: 10
– Listeria: 16
– Staphylococcus: 11
– Streptococcus: 14
– Vibrio: 11

5 The questions: How do you generate consistent annotations for 1,000 genomes? How do you measure consistent annotations for 1,000 genomes?

6 Basic biology [Figure: genes lacZ, lacI, lacY, lacA]

7 Different types of clustering < 80%

8 Average occurrence of clustering in different genomes [Figure: for each group (Actinobacteria, Aquificae, Bacteroidetes, Chlamydiae, Chloroflexi, Cyanobacteria, Deinococcus-Thermus, Firmicutes, Spirochaetes, Thermotogae, Proteobacteria), the fraction of genes in clusters (axis 0 to 1) and the number of genomes (axis 0 to 120). Series: clusters of genes w/ maximum 80% identity; genes in subsystems in clusters; total number of genomes in group.]

9 The Subsystems Approach to Annotation
Subsystem is a generalization of “pathway”
– collection of functional roles jointly involved in a biological process or complex
Functional Role is the abstract biological function of a gene product
– atomic, or user-defined, examples:
  6-phosphofructokinase (EC 2.7.1.11)
  LSU ribosomal protein L31p
  Streptococcal virulence factors
– does not (usually) contain “putative”, “thermostable”, etc.
Populated subsystem is the complete spreadsheet of functions and roles across organisms
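The three definitions on this slide can be pictured as a small data structure. The following is a minimal sketch, not the SEED's actual schema: the class name `Subsystem`, its methods, the genome names and the gene IDs are all hypothetical, and the two roles are grouped into one "demo" subsystem only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subsystem:
    """A named collection of functional roles plus its populated spreadsheet."""
    name: str
    roles: List[str]  # functional role names, in column order
    # genome -> functional role -> gene IDs playing that role in that genome
    cells: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)

    def connect(self, genome: str, role: str, gene_id: str) -> None:
        """Connect a gene to one of this subsystem's functional roles."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role of subsystem {self.name!r}")
        self.cells.setdefault(genome, {}).setdefault(role, []).append(gene_id)

    def row(self, genome: str) -> List[List[str]]:
        """One spreadsheet row: the genes for each role, in column order."""
        by_role = self.cells.get(genome, {})
        return [by_role.get(role, []) for role in self.roles]


# Role names come from the slide; everything else is made up for illustration.
pfk = "6-phosphofructokinase (EC 2.7.1.11)"
l31 = "LSU ribosomal protein L31p"
demo = Subsystem(name="demo subsystem", roles=[pfk, l31])
demo.connect("genome_A", pfk, "gene_0001")
demo.connect("genome_A", l31, "gene_0002")
demo.connect("genome_B", l31, "gene_0107")

print(demo.row("genome_A"))
print(demo.row("genome_B"))
```

An empty cell in a row (as for `genome_B` and the first role) is the kind of gap a populated spreadsheet makes visible across organisms.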

10 Thermotoga: ~40% of genes covered in 304 different subsystems

11 Subsystems developed based on
– Wet lab
– Chromosomal context
– Metabolic context
– Phylogenetic context
– Microarray data
– Proteomics data
– …

12 How do we measure annotations?

13 Natural Metrics
– Number of subsystems defined
– Number of functional roles defined
– Number of genes connected to functional roles
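The three natural metrics are simple counts. A minimal sketch of how they could be tallied, assuming the annotations are available as a nested mapping (subsystem name to functional role to gene IDs); the mapping layout, the function name `natural_metrics` and all toy names below are assumptions made for illustration:

```python
from typing import Dict, List


def natural_metrics(subsystems: Dict[str, Dict[str, List[str]]]) -> Dict[str, int]:
    """Count subsystems, distinct functional roles, and distinct connected genes."""
    roles = set()
    genes = set()
    for role_to_genes in subsystems.values():
        for role, gene_ids in role_to_genes.items():
            roles.add(role)
            genes.update(gene_ids)
    return {
        "subsystems": len(subsystems),
        "functional_roles": len(roles),
        "genes_connected": len(genes),
    }


# Toy data: subsystem names, role names and gene IDs are hypothetical.
toy = {
    "subsystem_1": {"role_a": ["g1", "g2"], "role_b": ["g3"]},
    "subsystem_2": {"role_b": ["g3"], "role_c": []},
}
print(natural_metrics(toy))
```

Roles and genes are counted once even when they appear in more than one subsystem, and a role with no genes still counts as defined.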

14 Annotations for NMPDR Genomes

15 Applied Metrics
Number of solid connections of gene to functional role, where “solid” is
1. supported by experimental data
2. connected to functional role and in chromosomal cluster with genes implementing functional roles from the same subsystem
3. only gene in genome connected to a functional role in an active variant of a subsystem
Reactions, GO terms, articles, cross-references to other databases (number and diversity)
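The "solid" test above can be read as a predicate over gene-to-role connections. The sketch below assumes that meeting any one of the three criteria is enough, which the slide does not state explicitly; the `Connection` record and its field names are hypothetical, and each flag is assumed to have been computed elsewhere.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Connection:
    """One gene connected to one functional role (hypothetical record)."""
    gene_id: str
    role: str
    has_experimental_support: bool             # criterion 1
    clustered_with_same_subsystem: bool        # criterion 2
    only_gene_for_role_in_active_variant: bool  # criterion 3


def is_solid(c: Connection) -> bool:
    # Assumption: any single criterion makes the connection "solid".
    return (
        c.has_experimental_support
        or c.clustered_with_same_subsystem
        or c.only_gene_for_role_in_active_variant
    )


def count_solid(connections: Iterable[Connection]) -> int:
    return sum(1 for c in connections if is_solid(c))


# Toy connections with made-up gene IDs and role names.
toy_connections = [
    Connection("g1", "role_a", True, False, False),
    Connection("g2", "role_a", False, True, False),
    Connection("g3", "role_b", False, False, False),
]
print(count_solid(toy_connections))
```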

16 Applied Metrics

17 The Importance of Consistency
– Consistency: same genes connected to same functional role
– Enables communication
– Required for most comparative genomics assays

18 hisA
FIG function: Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC 5.3.1.16)
Other functions in RefSeq:
– phosphoribosylformimino-5-aminoimidazole carboxamide
– phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase
– phosphoribosylformimino-5-aminoimidazole carboxamide ribotide...
– 1-(5-phosphoribosyl)-5-[(5-phosphoribosylamino)methylideneamino] imidazole-4-carboxamide isomerase
– N-(5-phospho-L-ribosyl-formimino)-5-amino-1-(5-phosphoribosyl)-4-imidazolecarboxamide isomerase
– N-(5'-phospho-L-ribosyl-formimino)-5-amino-1-(5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
– Phosphoribosyl isomerase A [1-[5-phosphoribosyl]-5-[[5-phosphoribosylamino]methylideneamino] imidazole-4-carboxamide isomerase]

19 Measuring Consistency
– Define a set of protein families such that each family contains genes playing the same function
– Attach functional roles to protein families
– Measure the consistency of the annotations made to genes within each family
1. “consistency” is the odds that two proteins from the same family have the same function
2. Evaluate both families and functions.
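One way to make the definition of "consistency" concrete is to read "odds" as a probability: the fraction of protein pairs within a family that carry the same annotation. The sketch below takes that reading and compares annotation strings by exact match; both choices, and the function name `family_consistency`, are assumptions rather than the project's stated method.

```python
from collections import Counter
from typing import List, Optional


def family_consistency(annotations: List[str]) -> Optional[float]:
    """Fraction of unordered protein pairs in one family with identical annotations.

    Returns None for families with fewer than two members (no pairs to compare).
    """
    n = len(annotations)
    if n < 2:
        return None
    counts = Counter(annotations)
    same_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    total_pairs = n * (n - 1) // 2
    return same_pairs / total_pairs


# Toy family of four proteins: three share one annotation, one differs.
family = ["function X", "function X", "function X", "function Y"]
print(family_consistency(family))  # 3 matching pairs out of 6
```

With exact string matching, spelling variants of the same function count as disagreements, so a family annotated with several synonyms for one enzyme would score low even though every member is described correctly.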

20 Consistency among databases

21 Number of RefSeq proteins in families

22 How to measure accuracy
– If everything was called “hypothetical protein” the database would be 100% consistent
– Need to measure accuracy (specificity) as well as consistency
– Sample 100 proteins at random from “curated” set (i.e. that are believed to be correct)
– Manually inspect annotations to score correctness
– TIGR/SEED joint project for annotation consistency server
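The sampling step can be sketched as follows. The scoring itself is manual in the approach described on the slide, so the `is_correct` callback here is only a stand-in for a human curator's verdict; the function name, the protein IDs and the fixed seed are assumptions made for illustration.

```python
import random
from typing import Callable, Sequence


def estimate_accuracy(
    curated_ids: Sequence[str],
    is_correct: Callable[[str], bool],
    sample_size: int = 100,
    seed: int = 0,
) -> float:
    """Estimate accuracy as the fraction of a random sample judged correct."""
    if not curated_ids:
        raise ValueError("curated set is empty")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(sample_size, len(curated_ids))
    sample = rng.sample(list(curated_ids), k)  # sampling without replacement
    n_correct = sum(1 for protein_id in sample if is_correct(protein_id))
    return n_correct / k


# Hypothetical curated set; the lambda stands in for manual inspection.
curated = [f"protein_{i:04d}" for i in range(500)]
print(estimate_accuracy(curated, is_correct=lambda protein_id: True))
```

A sample of 100 gives only a rough estimate, so the result is better read as an indicator than as a precise accuracy figure.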

23 Acknowledgements
FIG: Veronika Vonstein, Ross Overbeek, Annotators
ANL: Rick Stevens, Bob Olsen, Folker Meyer, Daniela Bartels, Tobi Paczian, Daniel Paarmann, Terry Disz
SEED: http://www.theseed.org/
NMPDR: http://www.nmpdr.org/
RAST: http://www.nmpdr.org/anno-server
metaRAST: http://metagenomics.theseed.org/

