Annotating Metagenomes Using the SEED. Rob Edwards. Department of Computer Sciences, San Diego State University; Mathematics and Computer Sciences Division, Argonne National Laboratory. NSF/EU Cyberinfrastructure Meeting, Washington, DC.
How much has been sequenced? [Figure: number of known sequences by year. Labels: first bacterial genome; 100 bacterial genomes; 1,000 bacterial genomes; environmental sequencing.]
How much will be sequenced? [Figure labels: 100 people; everybody in San Diego; everybody in USA; all cultured Bacteria; one genome from every species; most major microbial environments.]
What do we want from annotations? Consistent, accurate, available, reliable.
Consistent
The Importance of Consistency. Consistency: the same genes are connected to the same functional role. It enables communication and is required for most comparative genomics assays.
hisA. FIG function: Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC ). Other functions in RefSeq: phosphoribosylformimino-5-aminoimidazole carboxamide; phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase; phosphoribosylformimino-5-aminoimidazole carboxamide ribotide...; 1-(5-phosphoribosyl)-5-[(5-phosphoribosylamino)methylideneamino]imidazole-4-carboxamide isomerase; N-(5-phospho-L-ribosyl-formimino)-5-amino-1-(5-phosphoribosyl)-4-imidazolecarboxamide isomerase; N-(5'-phospho-L-ribosyl-formimino)-5-amino-1-(5'-phosphoribosyl)-4-imidazolecarboxamide isomerase; Phosphoribosyl isomerase A [1-[5-phosphoribosyl]-5-[[5-phosphoribosylamino]methylideneamino]imidazole-4-carboxamide isomerase]
Measuring Consistency. Define a set of protein families such that each family contains genes performing the same function. Attach functional roles to the protein families. Measure the consistency of the annotations made to genes within each family. 1. "Consistency" is the odds that two proteins from the same family have the same function. 2. Evaluate both families and functions.
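As a rough illustration of the measure described on this slide, the sketch below reads "consistency" as the fraction of unordered protein pairs within a family that carry an identical function string. That reading, the function names `family_consistency` and `database_consistency`, and the toy data are all assumptions made for illustration, not the SEED's actual implementation.

```python
from collections import Counter
from math import comb


def family_consistency(functions):
    """Fraction of unordered protein pairs in one family with identical annotations.

    `functions` is a list of function strings, one per protein in the family.
    Returns None when the family has fewer than two proteins (no pairs exist).
    """
    n = len(functions)
    if n < 2:
        return None
    # Proteins sharing a function string form comb(count, 2) agreeing pairs.
    agreeing = sum(comb(count, 2) for count in Counter(functions).values())
    return agreeing / comb(n, 2)


def database_consistency(families):
    """Pool agreeing pairs over all families (hypothetical aggregation choice)."""
    agreeing = total = 0
    for functions in families.values():
        agreeing += sum(comb(c, 2) for c in Counter(functions).values())
        total += comb(len(functions), 2)
    return agreeing / total if total else None


# Toy data: four annotations in one family, two in another (made-up labels).
toy_families = {
    "family_1": ["isomerase", "isomerase", "isomerase", "carboxamide"],
    "family_2": ["kinase", "kinase"],
}
per_family = {name: family_consistency(f) for name, f in toy_families.items()}
overall = database_consistency(toy_families)
```

Comparing exact strings is deliberately strict: the RefSeq name variants listed for hisA would all count as disagreements unless they were first normalized to a shared functional role.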
Consistency among databases
Accurate
How to measure accuracy. If everything were called "hypothetical protein", the database would be 100% consistent, so we need to measure accuracy (specificity) as well as consistency. Sample 100 proteins at random from the "curated" set (i.e. those believed to be correct). Manually inspect the annotations to score correctness.
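The sampling step could be sketched as follows. The protein identifiers, the fixed seed, and the helper names `draw_sample` and `accuracy_estimate` are hypothetical; the manual inspection itself cannot be automated, so the scores are supplied by a curator as a list of booleans. The standard error shown is an addition for illustration (a simple binomial approximation) and is not part of the original slide.

```python
import random
from math import sqrt


def draw_sample(curated_ids, k=100, seed=0):
    """Draw k distinct protein IDs at random from the curated set.

    A fixed seed keeps the sample reproducible so curators inspect the same set.
    """
    rng = random.Random(seed)
    return rng.sample(list(curated_ids), k)


def accuracy_estimate(scores):
    """Return (accuracy, standard error) from manual correct/incorrect scores."""
    n = len(scores)
    p = sum(scores) / n
    # Binomial standard error; an assumption for illustration only.
    return p, sqrt(p * (1 - p) / n)


# Hypothetical curated set of 1,000 protein IDs.
curated = [f"protein_{i}" for i in range(1000)]
to_inspect = draw_sample(curated)

# Made-up scores standing in for a curator's judgments on the 100 proteins.
manual_scores = [True] * 90 + [False] * 10
accuracy, stderr = accuracy_estimate(manual_scores)
```

With only 100 proteins inspected, the estimate is coarse, which is worth keeping in mind when comparing accuracy between databases.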
Available
Free service. User registration/log in; free to upload sequences in several formats; automatically annotates sequences; download in several formats. Complete genomes too. Soon to come: plasmids, phages, other short genomes.
Metagenome Metabolic Reconstruction
Metabolic potential in environments
Phylogenomics
Comparing Metagenomes to Genomes (or other metagenomes!)
Reliable (Believable)
Metabolic potential in environments
From sequences to environments. [Figure: plot with axes CDA 60.2% and CDA 21.7%. Function labels: Sulfur, Respiration, Capsule, Motility, Membrane transport, Stress, Signaling, Phosphorus, RNA. Environment labels: Mine, Saltern, Marine, Microbialites, Coral, Fish, Animals, Freshwater.]
What do we want from annotations? Consistent, accurate, available, reliable. When do we want it? NOW
Acknowledgements. Environmental Genomics: Forest Rohwer, Rohwer lab members, all the labs that provided sequence. Metagenomics Annotation Server: Rick Stevens, Daniel Paarman, Folker Meyer, Bob Olsen. Statistics: Liz Dinsdale, Dana Hall, Beltran Rodriguez-Brito. FIG: Ross Overbeek, Veronika Vonstein, annotators.
Subsystems make up metabolism. [Figure source: Wikipedia, "Metabolism".]