What's going on in the environment? Getting a grip on microbial physiology with genomics and metagenomics Rob Edwards Fellowship.

Presentation transcript:

What's going on in the environment? Getting a grip on microbial physiology with genomics and metagenomics Rob Edwards Fellowship for Interpretation of Genomes, San Diego State University, Burnham Institute for Medical Research, IMEC, LLC SIO, San Diego, May 2006

Outline: Sequencing statistics scare skeptics; The SEED database; Some simply stunning Subsystems; Mysterious missing methionine metabolism; Marine metabolism mined from metagenomics; Fabulous four-five-four for facile functional findings; Marine phage most puzzling.

The Players
FIG: Fellowship for Interpretation of Genomes
NMPDR: Natl. Microbial Pathogen Data Resource
BRC: NIH Bioinformatics Resource Centers
SEED: The SEED database

How Many Genomes Have Been Sequenced? (Table: complete, draft, and total genome counts for Archaea, Bacteria, and Eukarya.)

When will the 1,000th microbial genome be sequenced? (Plot: number of complete genomes by year.)

The SEED database, developed by FIG. Current version: 580 Bacteria (342 complete); 38 Archaea (26 complete); 562 Eukarya (29 complete); 1335 Viruses; 2 Environmental Genomes.

The problem: How do you generate consistent annotations for 1,000 genomes?

Basic biology: lacZ, lacI, lacY, lacA

Different types of clustering < 80 %

Occurrence of clustering in different genomes. (Figure: fraction of genes in clusters and number of genomes for Actinobacteria, Aquificae, Bacteroidetes, Chlamydiae, Chloroflexi, Cyanobacteria, Deinococcus-Thermus, Firmicutes, Spirochaetes, Thermotogae, and Proteobacteria, plus the average. Series: clusters of genes with maximum 80% identity; genes in subsystems in clusters; total number of genomes in group.)

The Subsystems Approach to Annotation. A subsystem is a generalization of “pathway”: a collection of functional roles jointly involved in a biological process or complex. A functional role is the abstract biological function of a gene product. It can be atomic or user-defined; examples: 6-phosphofructokinase (EC ); LSU ribosomal protein L31p; Streptococcal virulence factors. A functional role does not contain “putative”, “thermostable”, etc. A populated subsystem is the complete spreadsheet of functions and roles.

Subsystems are developed based on: wet lab; chromosomal context; metabolic context; phylogenetic context; microarray data; proteomics data; …

Example Subsystem: Histidine Degradation. Conversion of histidine to glutamate. Functional roles are defined in a table. Inclusion in the subsystem is only by functional role. Controlled vocabulary …

Subsystem Spreadsheet. Column headers are taken from the table of functional roles. Rows are selected genomes or organisms. Cells are populated with specific, annotated genes. Functional variants are defined by the annotated roles. Variant code -1 indicates the subsystem is not functional. Clustering is shown by color.

Columns: Organism; Variant; HutH; HutU; HutI; GluF; HutG; NfoD; ForI. Cells left empty on the slide are not preserved, so each row lists its accessions in order.
Bacteroides thetaiotaomicron; variant 1; Q8A4B3, Q8A4A9, Q8A4B1, Q8A4B0
Desulfotalea psychrophila; variant 1; gi, gi, gi, gi
Halobacterium sp.; variant 2; Q9HQD5, Q9HQD8, Q9HQD6, Q9HQD7
Deinococcus radiodurans; variant 2; Q9RZ06, Q9RZ02, Q9RZ05, Q9RZ04
Bacillus subtilis; variant 2; P10944, P25503, P42084, P42068
Caulobacter crescentus; variant 3; P58082, Q9A9MI, P58079, Q9A9M0, Q9A9L9
Pseudomonas putida; variant 3; Q88CZ7, Q88CZ6, Q88CZ9, Q88D00, Q88CZ3
Xanthomonas campestris; variant 3; Q8PAA7, P58988, Q8PAA6, Q8PAA8, Q8PAA5
Listeria monocytogenes
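The spreadsheet layout described on this slide (roles as columns, genomes as rows, genes in cells, a variant code per row) can be sketched as a small data structure. This is only an illustration, not the SEED's actual schema: the class names, method names, organism labels, and gene IDs below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SubsystemRow:
    organism: str
    variant: int                      # -1 means the subsystem is not functional
    genes: Dict[str, str] = field(default_factory=dict)  # functional role -> gene ID


@dataclass
class Subsystem:
    name: str
    roles: List[str]                  # column headers, from the table of functional roles
    rows: List[SubsystemRow] = field(default_factory=list)

    def add_row(self, organism: str, variant: int, genes: Dict[str, str]) -> None:
        unknown = set(genes) - set(self.roles)
        if unknown:
            # Inclusion is only by functional role, so reject anything outside the vocabulary.
            raise ValueError(f"roles not in subsystem: {sorted(unknown)}")
        self.rows.append(SubsystemRow(organism, variant, dict(genes)))

    def functional_organisms(self) -> List[str]:
        """Organisms whose variant code marks the subsystem as functional."""
        return [row.organism for row in self.rows if row.variant != -1]

    def genes_for_role(self, role: str) -> Dict[str, str]:
        """Read one spreadsheet column: organism -> gene annotated with this role."""
        return {row.organism: row.genes[role] for row in self.rows if role in row.genes}


# Hypothetical rows; the gene IDs are placeholders, not real accessions.
hut = Subsystem("Histidine Degradation",
                ["HutH", "HutU", "HutI", "GluF", "HutG", "NfoD", "ForI"])
hut.add_row("Organism A", 1, {"HutH": "gene_001", "HutU": "gene_002"})
hut.add_row("Organism B", -1, {})
```

Keeping the role list on the subsystem and validating each row against it mirrors the controlled-vocabulary idea: a gene enters the spreadsheet only through a named functional role.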

“The Populated Subsystem”: the Histidine Degradation spreadsheet, with organisms as rows (Bacteroides thetaiotaomicron through Listeria monocytogenes), a variant code per row, and annotated genes under the role columns HutH, HutU, HutI, GluF, HutG, NfoD, and ForI.

Subsystem Diagram. Three functional variants. The universal subset has three roles, followed by three alternative paths from IV to VI. No ForI is known experimentally.

Subsystem Spreadsheet: prediction from subsystems confirmed experimentally. (The Histidine Degradation spreadsheet again: organisms as rows, role columns HutH, HutU, HutI, GluF, HutG, NfoD, and ForI.)

How do bacteria make methionine? Acquire homoserine; convert cysteine to cystathionine; convert cystathionine to homocysteine; acquire Met or convert homocysteine to methionine. Alternative: sulfur and acetylhomoserine sulfhydrylase.

Missing genes (marked “?” on the slide)

Cyanoseed:

Marineseed:

Subsystems are not just for gene clusters: predicted or measured co-regulation; genome context (virulence islands, prophages, conserved gene clusters); virulence mechanism; cellular localization; enzymatic activity; common phenotype; combinations of criteria.

How much progress has been made? 541 subsystems encoded. 80–85% of the genes in core machinery are contained in subsystems. 30–35% of genes in NMPDR organism genomes, and 20–30% of genes in other genomes, are contained in subsystems.

Metagenomics. Samples: 200 liters water; g fresh fecal matter. Steps shown: concentrate and purify viruses; epifluorescent microscopy; extract nucleic acids; DNA/RNA; LASL; sequence. (Breitbart et al., multiple papers)

Control datasets for metagenome comparisons (number of proteins in different datasets):
Bacteria: 952,758
Archaea: 49,694
Eukarya: 259,653
Acid mine: 7,588
Sargasso (without Shewanella, Burkholderia): 960,561
Sorcerer II: ~13,000,000

Subsystems per million CDS
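The normalization named in this slide title is a simple rescaling, which lets datasets of very different sizes be compared on one axis. A minimal sketch, assuming the protein counts from the control-datasets slide serve as the CDS totals; the function name and the hit count in the example are hypothetical.

```python
# Protein (CDS) counts per dataset, as listed on the control-datasets slide.
DATASET_SIZES = {
    "Bacteria": 952_758,
    "Archaea": 49_694,
    "Eukarya": 259_653,
    "Acid mine": 7_588,
    "Sargasso": 960_561,
}


def per_million_cds(subsystem_hits: int, dataset: str) -> float:
    """Scale a raw subsystem count to hits per million CDS in that dataset."""
    return subsystem_hits / DATASET_SIZES[dataset] * 1_000_000


# Hypothetical example: 1,200 proteins assigned to one subsystem in Sargasso.
sargasso_rate = per_million_cds(1_200, "Sargasso")
```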

Determination of Statistical Differences Between Metagenomes. (1) Take 10,000 proteins from sample 1. (2) Count the frequency of each subsystem. (3) Repeat 20,000 times. (4) Repeat for sample 2. (5) Combine both samples and sample 10,000 proteins 20,000 times. (6) Build the 95% CI. (7) Compare the medians from samples 1 and 2 with the 95% CI. Rodriguez-Brito (2006). BMC Bioinformatics
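The resampling procedure on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published implementation: each protein is represented only by its subsystem label, draws are made with replacement, the interval is a plain empirical percentile interval, and a subsystem is flagged when either sample's median falls outside the pooled 95% interval. All function names are hypothetical.

```python
import random
from collections import Counter
from statistics import median


def resample_counts(labels, sample_size, repeats, rng):
    """For each subsystem, collect its count in `repeats` random draws of `sample_size` proteins."""
    subsystems = sorted(set(labels))
    counts = {name: [] for name in subsystems}
    for _ in range(repeats):
        # Drawing with replacement is an assumption; the slide does not say.
        tally = Counter(rng.choices(labels, k=sample_size))
        for name in subsystems:
            counts[name].append(tally[name])
    return counts


def percentile_interval(values, level=0.95):
    """Empirical central interval covering `level` of the resampled counts."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int((1 - level) / 2 * last)], ordered[int((1 + level) / 2 * last)]


def compare_metagenomes(sample1, sample2, sample_size=10_000, repeats=20_000, seed=0):
    """Compare per-subsystem medians of two samples against the pooled 95% interval."""
    rng = random.Random(seed)
    counts1 = resample_counts(sample1, sample_size, repeats, rng)
    counts2 = resample_counts(sample2, sample_size, repeats, rng)
    pooled = resample_counts(list(sample1) + list(sample2), sample_size, repeats, rng)
    report = {}
    for name, values in pooled.items():
        low, high = percentile_interval(values)
        median1 = median(counts1.get(name, [0]))  # subsystem absent from a sample counts as 0
        median2 = median(counts2.get(name, [0]))
        outside = not (low <= median1 <= high) or not (low <= median2 <= high)
        report[name] = {"ci": (low, high), "median1": median1,
                        "median2": median2, "different": outside}
    return report
```

With the slide's defaults (10,000 proteins, 20,000 repeats) this pure-Python version is slow, so smaller values of `sample_size` and `repeats` are sensible when trying it out, for example `compare_metagenomes(a, b, sample_size=100, repeats=300)`.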

Sampling Sargasso and “SEED” metagenomes

Comparison of all Subsystems (axis labels: More in Sargasso; More in SEED)

Is serine being used as an osmolyte? There are few trehalose, proline, or sucrose synthetic genes. Serine is the most abundant amino acid in the ocean (Suttle, Keil). Serine is a more effective osmoprotectant than glycine betaine (Yancey).

Metagenomics. Samples: 200 liters water; g fresh fecal matter. Steps shown: concentrate and purify viruses; epifluorescent microscopy; extract nucleic acids; DNA/RNA; LASL (“so 2004”), now 454; sequence. (Breitbart et al., multiple papers)

454 Sequence Data (only from the Rohwer Lab, in one year). 42 libraries: 22 microbial, 20 phage. 1,028,563,420 bp total: 33% of the human genome; 95% of all complete and partial bacterial genomes; 10% of community sequencing of JGI per year. 9,933,184 sequences: average 236,511 per library. Average read length (bp): the average read length has not increased in 12 months.
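The totals on this slide allow a quick back-of-the-envelope check. The sketch below derives per-read and per-library averages from the stated totals; the human genome size of about 3.1 billion bp is an assumption, not a figure from the slide. Dividing the totals gives roughly 103.5 bp per read and about 236,504 reads per library, close to the 236,511 quoted on the slide.

```python
# Totals as stated on the slide.
TOTAL_BP = 1_028_563_420
TOTAL_READS = 9_933_184
LIBRARIES = 42

# Assumed approximate human genome size in bp (not given on the slide).
HUMAN_GENOME_BP = 3_100_000_000

mean_read_length = TOTAL_BP / TOTAL_READS          # bp per read
reads_per_library = TOTAL_READS / LIBRARIES        # mean reads per library
human_genome_fraction = TOTAL_BP / HUMAN_GENOME_BP # share of one human genome

print(f"mean read length: {mean_read_length:.1f} bp")
print(f"reads per library: {reads_per_library:,.0f}")
print(f"fraction of human genome: {human_genome_fraction:.0%}")
```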

The Soudan Mine, Minnesota. Red stuff: oxidized. Black stuff: reduced.

Red and Black Samples Are Different. Cloned and 454-sequenced 16S are indistinguishable. (Figure series: Black stuff; Red; Cloned Red.)

There are different amounts of metabolism in each environment

There are different amounts of substrates in each environment (panels: Black Stuff; Red Stuff)

But are the differences significant? (1) Sample 10,000 proteins from site 1. (2) Count the frequency of each “subsystem”. (3) Repeat 20,000 times. (4) Repeat for site 2. (5) Combine both samples and sample 10,000 proteins 20,000 times. (6) Build the 95% CI. (7) Compare the medians from sites 1 and 2 with the 95% CI. Rodriguez-Brito (2006). BMC Bioinformatics

Subsystem differences & metabolism: iron acquisition. Black Stuff: siderophore enterobactin biosynthesis; ferric enterobactin transport; ABC transporter ferrichrome; ABC transporter heme. Black stuff: ferrous iron (Fe²⁺, ferroan [(Mg,Fe)₆(Si,Al)₄O₁₀(OH)₈]). Red stuff: ferric iron (goethite [FeO(OH)]).

Nitrification differentiates the samples. Edwards (2006) BMC Genomics

The challenge is explaining the differences between samples.
Red Sample: Arg, Trp, His; ubiquinone; FA oxidation; chemotaxis, flagella; methylglyoxal metabolism.
Black Sample: Ile, Leu, Val; siderophores; glycerolipids; NiFe hydrogenase; phenylpropionate degradation.

We can cheaply compare the important biochemistry happening in different environments. We don’t care which organisms are doing the metabolism, but we know what organisms are there.

Phages in the World's Oceans. GOM: 41 samples, 13 sites, 5 years. SAR: 1 sample, 1 site, 1 year. BBC: 85 samples, 38 sites, 8 years. ARC: 56 samples, 16 sites, 1 year. LI: 4 sites, 1 year.

Phages, Reefs, and Human Disturbance: The Northern Line Islands Expedition, 2005. Islands: Christmas, Kingman, Palmyra, Washington, Fanning.

16S rDNA at each island

16S rDNA of the Proteobacteria

Phages at each island

Christmas to Kingman: bias in number of phage hosts. Negative numbers mean relatively more phage hosts at Kingman.

Most Marine Phage Sequences are Novel

Phages are specific to environments. Phage Proteomic Tree v. 5 (Edwards, Rohwer); groups labeled: ssDNA, -like, T7-like, T4-like. Thanks: Mya Breitbart

Marine Single-Stranded DNA Viruses. 6% of SAR sequences are ssDNA phage (Chlamydia-like Microviridae). 40% of viral particles in SAR are ssDNA phage. Several full-genome sequences were recovered via de novo assembly of these fragments, confirmed by PCR and sequencing.

SAR Aligned Against the Chlamydia phi 4 Genome. 12,297 sequence fragments hit using TBLASTX over a ~4.5 kb genome. (Figure panels: individual sequence reads; Chlamydia phi 4 genome; coverage; concatenated hits.)

Summary. You only need to remember: subsystems are the best way to annotate genomes; 454 generates lots of data; we can use subsystems to find out what is going on in the environment.

SDSU: Forest Rohwer, Beltran Rodriguez-Brito, Linda Wegley
USF: Mya Breitbart
University of Bielefeld: Folker Meyer, Lutz Krause
FIG: Veronika Vonstein, Ross Overbeek, Gordon Pusch
ANL: Rick Stevens, Bob Olsen, Terry Disz
Annotators: Gary Olsen, Andrei Ostermann, Olga Zagnitko, Olga Vassieva, Svetlana Gerdes, Ramy Aziz
UBC: Curtis Suttle, Amy Chan