What's going on in the environment? Getting a grip on microbial physiology with genomics and metagenomics. Rob Edwards (Fellowship for Interpretation of Genomes; San Diego State University; Burnham Institute for Medical Research; IMEC, LLC). SIO, San Diego, May 2006
Outline: Sequencing statistics scare skeptics; The SEED database; Some simply stunning Subsystems; Mysterious missing methionine metabolism; Marine metabolism mined from metagenomics; Fabulous four-five-four for facile functional findings; Marine phage most puzzling
The Players: FIG (Fellowship for Interpretation of Genomes); NMPDR (National Microbial Pathogen Data Resource); BRC (NIH Bioinformatics Resource Centers); SEED (The SEED database)
How Many Genomes Have Been Sequenced? (Table: rows Archaea, Bacteria, Eukarya; columns Complete, Draft, Total)
When will the 1,000th microbial genome be sequenced? (Plot: number of complete genomes versus year)
The SEED database, developed by FIG. Current version: 580 Bacteria (342 complete); 38 Archaea (26 complete); 562 Eukarya (29 complete); 1335 Viruses; 2 Environmental Genomes
The problem: How do you generate consistent annotations for 1,000 genomes?
Basic biology: lacZ, lacI, lacY, lacA
Different types of clustering < 80 %
Occurrence of clustering in different genomes (Plot by group: Actinobacteria, Aquificae, Bacteroidetes, Chlamydiae, Chloroflexi, Cyanobacteria, Deinococcus-Thermus, Firmicutes, Spirochaetes, Thermotogae, Proteobacteria, Average. Series: clusters of genes with maximum 80% identity; genes in subsystems in clusters; total number of genomes in group. Axes: fraction of genes in clusters; number of genomes)
The Subsystems Approach to Annotation
  Subsystem is a generalization of “pathway”
    – collection of functional roles jointly involved in a biological process or complex
  Functional Role is the abstract biological function of a gene product
    – atomic, or user-defined
    – examples: 6-phosphofructokinase (EC ); LSU ribosomal protein L31p; Streptococcal virulence factors
  Does not contain “putative”, “thermostable”, etc.
  Populated subsystem is the complete spreadsheet of functions and roles
Subsystems developed based on: wet lab; chromosomal context; metabolic context; phylogenetic context; microarray data; proteomics data; …
Example Subsystem: Histidine Degradation. Conversion of histidine to glutamate. Functional roles defined in table. Inclusion in subsystem is only by functional role. Controlled vocabulary …
Subsystem Spreadsheet
  Column headers taken from table of functional roles
  Rows are selected genomes or organisms
  Cells are populated with specific, annotated genes
  Functional variants defined by the annotated roles
  Variant code -1 indicates subsystem is not functional
  Clustering shown by color
  Role columns: HutH, HutU, HutI, GluF, HutG, NfoD, ForI
  Organism | Variant | Annotated genes (in row order; empty cells not marked)
  Bacteroides thetaiotaomicron | 1 | Q8A4B3, Q8A4A9, Q8A4B1, Q8A4B0
  Desulfotalea psychrophila | 1 | gi, gi, gi, gi
  Halobacterium sp. | 2 | Q9HQD5, Q9HQD8, Q9HQD6, Q9HQD7
  Deinococcus radiodurans | 2 | Q9RZ06, Q9RZ02, Q9RZ05, Q9RZ04
  Bacillus subtilis | 2 | P10944, P25503, P42084, P42068
  Caulobacter crescentus | 3 | P58082, Q9A9MI, P58079, Q9A9M0, Q9A9L9
  Pseudomonas putida | 3 | Q88CZ7, Q88CZ6, Q88CZ9, Q88D00, Q88CZ3
  Xanthomonas campestris | 3 | Q8PAA7, P58988, Q8PAA6, Q8PAA8, Q8PAA5
  Listeria monocytogenes | |
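One way to picture a populated subsystem is as a small table keyed by organism and functional role. The sketch below is only an illustration under assumptions: the class and function names are hypothetical, the gene identifiers are placeholders rather than real accessions, and the role assignments for "organism A" are invented. It relies only on the conventions stated on the slide: columns are functional roles, cells hold annotated genes, and variant code -1 marks a non-functional subsystem.

```python
from dataclasses import dataclass, field

# Column headers as listed on the slide (histidine degradation subsystem).
ROLES = ["HutH", "HutU", "HutI", "GluF", "HutG", "NfoD", "ForI"]


@dataclass
class SubsystemRow:
    """One spreadsheet row: an organism, its variant code, and its genes."""
    organism: str
    variant: int                                # -1 means "subsystem not functional"
    genes: dict = field(default_factory=dict)   # role -> gene id (placeholders here)

    def roles_present(self):
        """Roles that have an annotated gene, in column order."""
        return [role for role in ROLES if role in self.genes]

    def roles_missing(self):
        """Roles with an empty cell, in column order."""
        return [role for role in ROLES if role not in self.genes]


def functional_rows(rows):
    """Keep only organisms whose variant code is not -1."""
    return [row for row in rows if row.variant != -1]


# Hypothetical rows; the identifiers and role assignments are made up.
rows = [
    SubsystemRow("organism A", 1, {"HutH": "geneA1", "HutU": "geneA2",
                                   "HutI": "geneA3", "GluF": "geneA4"}),
    SubsystemRow("organism B", -1),
]
```

Calling `rows[0].roles_missing()` lists the empty cells for an organism, which is the kind of gap a curator would inspect when deciding on a variant code.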
“The Populated Subsystem”
Subsystem Diagram: Three functional variants. Universal subset has three roles, followed by three alternative paths from IV to VI. No ForI known experimentally.
Subsystem Spreadsheet: Prediction from subsystems confirmed experimentally
How do bacteria make methionine? Acquire homoserine; convert cysteine to cystathionine; convert cystathionine to homocysteine; acquire met or convert homocysteine to methionine; sulfur and acetylhomoserine sulfhydrylase
Missing genes (marked “?” in the figure)
Cyanoseed:
Marineseed:
Subsystems are not just for gene clusters: predicted or measured co-regulation; genome context (virulence islands, prophages, conserved gene clusters); virulence mechanism; cellular localization; enzymatic activity; common phenotype; combinations of criteria
How much progress has been made? 541 subsystems encoded; 80–85% of the genes in core machinery are contained in subsystems; 30–35% of genes in NMPDR organism genomes and 20–30% of genes in other genomes are contained in subsystems
Metagenomics. Samples: 200 liters water; g fresh fecal matter. Steps: epifluorescent microscopy; concentrate and purify viruses; extract nucleic acids (DNA/RNA); LASL; sequence. Breitbart et al., multiple papers
Control datasets for metagenome comparisons: number of proteins in different datasets
  Bacteria: 952,758
  Archaea: 49,694
  Eukarya: 259,653
  Acid mine: 7,588
  Sargasso (without Shewanella, Burkholderia): 960,561
  Sorcerer II: ~13,000,000
Subsystems per million CDS
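Because the control datasets differ in size by orders of magnitude, raw subsystem counts are not directly comparable. A minimal sketch of the normalization the slide title suggests follows; it assumes "per million CDS" means the number of proteins assigned to a subsystem divided by the total proteins in the dataset, scaled to one million. The function name is hypothetical, and the dataset sizes are the ones listed on the control-datasets slide.

```python
# Dataset sizes (number of proteins) as listed on the control-datasets slide.
DATASET_PROTEINS = {
    "Bacteria": 952_758,
    "Archaea": 49_694,
    "Eukarya": 259_653,
    "Acid mine": 7_588,
    "Sargasso": 960_561,
}


def per_million_cds(hits, total_cds):
    """Scale a raw subsystem hit count to hits per one million CDS."""
    if total_cds <= 0:
        raise ValueError("total_cds must be positive")
    return hits * 1_000_000 / total_cds
```

For example, `per_million_cds(n, DATASET_PROTEINS["Acid mine"])` puts a count `n` from the small acid mine dataset on the same scale as a count from the much larger Sargasso dataset.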
Determination of Statistical Differences Between Metagenomes
  1. Take 10,000 proteins from sample 1
  2. Count frequency of each subsystem
  3. Repeat 20,000 times
  4. Repeat for sample 2
  5. Combine both samples
  6. Sample 10,000 proteins 20,000 times
  7. Build 95% CI
  8. Compare medians from samples 1 and 2 with 95% CI
  Rodriguez-Brito (2006). BMC Bioinformatics
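The resampling steps above can be sketched in a few lines of Python. This is one plausible reading of the slide, not the published implementation: it assumes each protein is represented by its subsystem label, that sampling is with replacement, and that the 95% interval is built from the difference in counts between two draws from the pooled data. All function names are hypothetical. The defaults mirror the slide (10,000 proteins, 20,000 repeats), which is slow in pure Python, so smaller values are sensible for a first try.

```python
import random
from collections import Counter
from statistics import median


def resampled_medians(proteins, size, reps, rng):
    """Median count of each subsystem over `reps` resamples of `size` proteins."""
    counts = {s: [] for s in set(proteins)}
    for _ in range(reps):
        tally = Counter(rng.choices(proteins, k=size))  # sample with replacement
        for s in counts:
            counts[s].append(tally[s])  # Counter gives 0 for absent labels
    return {s: median(v) for s, v in counts.items()}


def null_interval(pooled, size, reps, rng, level=0.95):
    """Interval of count differences between two draws from the pooled data."""
    diffs = {s: [] for s in set(pooled)}
    for _ in range(reps):
        a = Counter(rng.choices(pooled, k=size))
        b = Counter(rng.choices(pooled, k=size))
        for s in diffs:
            diffs[s].append(a[s] - b[s])
    cut = int(reps * (1 - level) / 2)  # values trimmed from each tail
    interval = {}
    for s, values in diffs.items():
        values.sort()
        interval[s] = (values[cut], values[reps - 1 - cut])
    return interval


def compare_metagenomes(sample1, sample2, size=10_000, reps=20_000, seed=0):
    """Flag subsystems whose median difference falls outside the pooled interval."""
    rng = random.Random(seed)
    m1 = resampled_medians(sample1, size, reps, rng)
    m2 = resampled_medians(sample2, size, reps, rng)
    ci = null_interval(sample1 + sample2, size, reps, rng)
    result = {}
    for s, (low, high) in ci.items():
        diff = m1.get(s, 0) - m2.get(s, 0)
        result[s] = {
            "difference": diff,
            "low": low,
            "high": high,
            "significant": diff < low or diff > high,
        }
    return result
```

A call such as `compare_metagenomes(labels_a, labels_b, size=200, reps=300)` returns one entry per subsystem; a positive `difference` means the subsystem is relatively more common in the first sample.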
Sampling Sargasso and “SEED” metagenomes
Comparison of all Subsystems (Plot: more in Sargasso versus more in SEED)
Is serine being used as an osmolyte? Few trehalose, proline, sucrose synthetic genes. Serine is most abundant amino acid in ocean (Suttle, Keil). Serine is more effective osmoprotectant than glycine betaine (Yancey).
Metagenomics. Samples: 200 liters water; g fresh fecal matter. Steps: epifluorescent microscopy; concentrate and purify viruses; extract nucleic acids (DNA/RNA); 454 in place of LASL (“So 2004”); sequence. Breitbart et al., multiple papers
454 Sequence Data (only from Rohwer Lab, in one year)
  42 libraries
    – 22 microbial, 20 phage
  1,028,563,420 bp total
    – 33% of the human genome
    – 95% of all complete and partial bacterial genomes
    – 10% of community sequencing of JGI per year
  9,933,184 sequences
    – Average 236,511 per library
  Average read length bp
    – Av. read length has not increased in 12 months
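The totals on this slide allow a quick sanity check by simple division. The sketch below derives values from the slide's own totals; the mean read length it computes is a derived figure, not the number that appeared on the original slide, and the human genome size is an assumed round value of about 3.1 Gbp. The per-library mean from the raw totals comes out slightly below the 236,511 quoted on the slide.

```python
# Totals as given on the slide.
TOTAL_BP = 1_028_563_420
TOTAL_READS = 9_933_184
LIBRARIES = 42

# Assumption: approximate haploid human genome size, used only for scale.
HUMAN_GENOME_BP = 3_100_000_000

mean_read_length = TOTAL_BP / TOTAL_READS            # bases per read
reads_per_library = TOTAL_READS / LIBRARIES          # reads per library
fraction_of_human_genome = TOTAL_BP / HUMAN_GENOME_BP

print(f"mean read length: {mean_read_length:.1f} bp")
print(f"reads per library: {reads_per_library:,.0f}")
print(f"fraction of human genome: {fraction_of_human_genome:.0%}")
```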
The Soudan Mine, Minnesota. Red Stuff: oxidized. Black Stuff: reduced.
Red and Black Samples Are Different. Cloned and 454 sequenced 16S are indistinguishable. (Figure labels: Black stuff; Red; Cloned Red)
There are different amounts of metabolism in each environment
There are different amounts of substrates in each environment (Panels: Black Stuff; Red Stuff)
But are the differences significant? Sample 10,000 proteins from site 1; count frequency of each “subsystem”; repeat 20,000 times; repeat for sample 2; combine both samples; sample 10,000 proteins 20,000 times; build 95% CI; compare medians from sites 1 and 2 with 95% CI. Rodriguez-Brito (2006). BMC Bioinformatics
Subsystem differences & metabolism: Iron acquisition. Black Stuff: siderophore enterobactin biosynthesis; ferric enterobactin transport; ABC transporter ferrichrome; ABC transporter heme. Black stuff: ferrous iron (Fe2+, ferroan [(Mg,Fe)6(Si,Al)4O10(OH)8]). Red stuff: ferric iron (goethite [FeO(OH)])
Nitrification differentiates the samples Edwards (2006) BMC Genomics
The challenge is explaining the differences between samples. Red Sample: Arg, Trp, His; ubiquinone; FA oxidation; chemotaxis, flagella; methylglyoxal metabolism. Black Sample: Ile, Leu, Val; siderophores; glycerolipids; NiFe hydrogenase; phenylpropionate degradation
We can cheaply compare the important biochemistry happening in different environments. We don’t care which organisms are doing the metabolism, but we know what organisms are there.
Phages In The World's Oceans
  GOM: 41 samples, 13 sites, 5 years
  SAR: 1 sample, 1 site, 1 year
  BBC: 85 samples, 38 sites, 8 years
  ARC: 56 samples, 16 sites, 1 year
  LI: 4 sites, 1 year
Phages, Reefs, and Human Disturbance: The Northern Line Islands Expedition, 2005 (Map labels: Kingman, Palmyra, Washington, Fanning, Christmas)
16S rDNA at each island
16S rDNA of the Proteobacteria
Phages at each island
Christmas to Kingman Bias in No. Phage Hosts. Negative numbers mean relatively more phage hosts at Kingman.
Most Marine Phage Sequences are Novel
Phages are specific to environments. Phage Proteomic Tree v. 5 (Edwards, Rohwer). Groups: ssDNA; -like; T7-like; T4-like. Thanks: Mya Breitbart
Marine Single-Stranded DNA Viruses: 6% of SAR sequences are ssDNA phage (Chlamydia-like Microviridae); 40% of viral particles in SAR are ssDNA phage; several full-genome sequences were recovered via de novo assembly of these fragments; confirmed by PCR and sequencing
SAR Aligned Against the Chlamydia phi 4 genome: 12,297 sequence fragments hit using TBLASTX over a ~4.5 kb genome (Figure tracks: individual sequence reads; Chlamydia phi 4 genome; coverage; concatenated hits)
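A coverage track like the one on this slide can be derived from the genome coordinates of each hit. The sketch below is a generic illustration, not the code used for the figure: it assumes each TBLASTX hit has already been reduced to a (start, end) pair in 1-based inclusive genome coordinates, and the function names are hypothetical. Reversed pairs are accepted because hits on the opposite strand are often reported with start greater than end.

```python
def coverage_profile(hits, genome_length):
    """Per-position depth for (start, end) hits in 1-based inclusive coordinates."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    delta = [0] * (genome_length + 1)      # difference array: +1 at start, -1 after end
    for start, end in hits:
        low, high = sorted((start, end))   # tolerate reversed (minus-strand) pairs
        low = max(low, 1)
        high = min(high, genome_length)    # clip hits that run past the genome ends
        if low > high:
            continue                       # hit lies entirely outside the genome
        delta[low - 1] += 1
        delta[high] -= 1
    coverage = []
    depth = 0
    for change in delta[:genome_length]:
        depth += change                    # running sum turns deltas into depth
        coverage.append(depth)
    return coverage


def covered_fraction(coverage):
    """Fraction of genome positions touched by at least one hit."""
    return sum(1 for depth in coverage if depth > 0) / len(coverage)
```

With a genome length of about 4,500 and the hit coordinates parsed from the search output, `coverage_profile` gives the depth at every base and `covered_fraction` summarizes how much of the genome was recovered.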
Summary. You only need to remember: Subsystems are the best way to annotate genomes; 454 generates lots of data; we can use subsystems to find out what is going on in the environment
SDSU: Forest Rohwer, Beltran Brito-Rodriguez, Linda Wegley; USF: Mya Breitbart; University of Bielefeld: Folker Meyer, Lutz Krause; FIG: Veronika Vonstein, Ross Overbeek, Gordon Pusch; ANL: Rick Stevens, Bob Olsen, Terry Disz; Annotators: Gary Olsen, Andrei Ostermann, Olga Zagnitko, Olga Vassieva, Svetlana Gerdes, Ramy Aziz; UBC: Curtis Suttle, Amy Chan