Challenges for metagenomic data analysis and lessons from viral metagenomes [What would you do if sequencing were free?] Rob Edwards San Diego State University.

Slides:



Advertisements
Similar presentations
Tucson High School Biotechnology Course Spring 2010.
Advertisements

Metabarcoding 16S RNA targeted sequencing
Lesson Overview 1.3 Studying Life.
What's going on in the environment? Getting a grip on microbial physiology with genomics and metagenomics Rob Edwards Fellowship.
A Look into the Process of Marker Development Matt Robinson.
Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.
High Throughput Computational Sequence Analysis Rob Edwards Argonne National Laboratory San Diego State University.
High performance computational analysis of DNA sequences from different environments Rob Edwards Computer Science Biology edwards.sdsu.eduwww.theseed.org.
Metagenomics in Phage Ecology Peter Salamon SDSU REU 2007.
Annotating Metagenomes Using the NMPDR Rob Edwards Department of Computer Sciences, San Diego State University Mathematics and Computer Sciences Division,
THE GLOBAL MARINE VIRIOME Rob Edwards Dept. Biology, SDSU Computational Sciences Research Center, SDSU Center for Microbial Sciences, San Diego, Fellowship.
CHAPTER 15 Microbial Genomics Genomic Cloning Techniques Vectors for Genomic Cloning and Sequencing MS2, RNA virus nt sequenced in 1976 X17, ssDNA.
Central Dogma Information storage in biological molecules DNA RNA Protein transcription translation replication.
Metagenomics Rob Edwards MCS. The Soudan Mine, Minnesota Red Stuff Oxidized Black Stuff Reduced.
How We Annotated Genomes for Free: Fast and Accurate Functional Analysis Using Subsystems Technology Rob Edwards Depts of Computer Science And Biology,
Annotating Metagenomes Using the SEED Rob Edwards Department of Computer Sciences, San Diego State University Mathematics and Computer Sciences Division,
Lecture 1. Microorganisms: an overview Chapter 1. Microorganisms and Microbiology Chapter 2. An overview of microbial life. Cell and viral structures DNA.
Methods in Microbial Ecology
The Sorcerer II Global ocean sampling expedition Katrine Lekang Global Ocean Sampling project (GOS) Global Ocean Sampling project (GOS) CAMERA CAMERA METAREP.
The Microbiome and Metagenomics
Microbial Genomes Features Analysis Role of high-throughput sequencing Yeast - the eukaryotic model microbe Databases –TIGR CMR –NCBI Microbial Genomes.
From Haystacks to Needles AP Biology Fall Isolating Genes  Gene library: a collection of bacteria that house different cloned DNA fragments, one.
Molecular Microbial Ecology
Chapter 14 Genomes and Genomics. Sequencing DNA dideoxy (Sanger) method ddGTP ddATP ddTTP ddCTP 5’TAATGTACG TAATGTAC TAATGTA TAATGT TAATG TAAT TAA TA.
H = -Σp i log 2 p i. SCOPI Each one of the many microbial communities has its own structure and ecosystem, depending on the body environment it exists.
The Metagenomics RAST server: Annotation, Analysis, and Comparisons Perfect for Pyrosequencing Rob Edwards Department of Computer Science, San Diego State.
Probes can be designed in an evolutionary hierarchy.
Diversity of uncultured candidate division SR1 in anaerobic habitats James P. Davis Microbial & Molecular Genetics Oklahoma State University.
Accurate estimation of microbial communities using 16S tags Julien Tremblay, PhD
Microbial Diversity Scott Clingenpeel. Complete Genomes 2671 Prokaryotic genomes in GenBank 114 Eukaryotic genomes – 9 Plants – 52 Animals Do we really.
P. Tang ( 鄧致剛 ); RRC. Gan ( 甘瑞麒 ); PJ Huang ( 黄栢榕 ) Bioinformatics Center, Chang Gung University. Genome Sequencing Genome Resequencing De novo Genome.
NIS - BIOLOGY Lecture 57 – Lecture 58 DNA Technology Ozgur Unal 1.
Genome databases and webtools for genome analysis Become familiar with microbial genome databases Use some of the tools useful for analyzing genome Visit.
Shatha Khalil Ismael. Transformation Certain species of Gram- negative, gram- positive bacteria and some species of Archaea are transformable. The uptake.
Microbial genomics Genomics: study of entire genomes Logical next step after genetics: study of genes Genomics: 1) “Structural genomics” * Determine and.
Big Picture Of ≈1.7 million species classified so far, roughly 6000 are microbes True number of microbes is obviously larger than 6000 “Imagine if our.
Diversity and quantification of candidate division SR1 in various anaerobic environments James P. Davis and Mostafa Elshahed Microbiology and Molecular.
Current Challenges in Metagenomics: an Overview Chandan Pal 17 th December, GoBiG Meeting.
Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih.
Genome annotation and search for homologs. Genome of the week Discuss the diversity and features of selected microbial genomes. Link to the paper describing.
Metagenomics.
AP Environmental Science
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
Genomes & The Tree of Life
Accurate estimation of microbial communities using 16S tags
David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.
SGM Meeting, Warwick, April 2006
What is BLAST? Basic BLAST search What is BLAST?
Canadian Bioinformatics Workshops
MEGAN analysis of metagenomic data Daniel H. Huson, Alexander F. Auch, Ji Qi, et al. Genome Res
Metagenomics The study of metagenomes, genetic material recovered directly from environmental samples. Term: Coined in 1998 to refer to the idea that a.
Boundless Lecture Slides Free to share, print, make copies and changes. Get yours at Available on the Boundless Teaching Platform.
Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.
Computational Characterization of Short Environmental DNA Fragments Jens Stoye 1, Lutz Krause 1, Robert A. Edwards 2, Forest Rohwer 2, Naryttza N. Diaz.
First bacterial genome 100 bacterial genomes 1,000 bacterial genomes Number of known sequences Year How much has been sequenced? Environmental sequencing.
The SEED Family First bacterial genome 100 bacterial genomes 1,000 bacterial genomes Number of known sequences Year How.
Genomics, Metagenomics, And Google Rob Edwards San Diego State University, San Diego, CA Argonne National Laboratory, Argonne, IL
Risheng Chen et al BMC Genomics
Virginia Commonwealth University
Rob Edwards San Diego State University
The bioinformatics behind
Taxonomic distribution of large DNA viruses in the sea
Metagenomics Rob Edwards.
Comparative metagenomics quantifying similarities between environments
Genomic Data Manipulation
This presentation uses animations and is best viewed as a slide show.
Genomes and Their Evolution
H = -Σpi log2 pi.
Metagenomics Microbial community DNA extraction
Toward Accurate and Quantitative Comparative Metagenomics
Presentation transcript:

Challenges for metagenomic data analysis and lessons from viral metagenomes [What would you do if sequencing were free?] Rob Edwards San Diego State University Fellowship for Interpretation of Genomes The Burnham Institute for Medical Research

Outline How and why we sequence environments Viral metagenomics –Marine stories –Human stories Pyrosequencing –Mine story Is there a Future?

Why Metagenomics? What is there? How many are there? What are they doing?

How do you sequence the environment? Extract DNA

1.1 g ml g ml g ml g ml -1 CsCl step gradient

How do you sequence the environment? Extract DNA Create library

Linker-Amplified Shotgun Libraries (LASLs) This method produces high coverage libraries of over 1 million clones from as little as 1 ng DNA Soil Extraction Kit David Mead - Breitbart (2002) PNAS

How do you sequence the environment? Extract DNA Create library Sequence fragments

Outline How and why we sequence environments Viral metagenomics –Marine stories –Human stories Pyrosequencing –Mine story Is there a Future?

Why Phages? Phages are viruses that infect bacteria –10:1 ratio of phages:bacteria –10 31 phages on the planet Specific interactions (probably) –one virus : one host Small genome size –Higher coverage Horizontal gene transfer – bp DNA per year in the oceans

Uncultured Viruses 200 liters water g fresh fecal matter DNA/RNA LASL Sequence Epifluorescent Microscopy Concentrate and purify viruses Extract nucleic acids

Bioinformatics BLASTagainst NR –blastx, tblastn, tblastx BLAST against boutique databases –Complete phage genomes, ACLAME, Other libraries, 16S Parsing to present data in a useful format

BLAST and Parsing Submit BLAST to local and remote databases –Local (as fast as possible) –NCBI (one search every 3 seconds) Many concurrent searches –One search versus 1,000 searches Parse data into tables –Access to taxonomy etc

Known 22% Unknown 78% Most Viral Genes are Unknown Breitbart (2002) PNAS Rohwer (2003) Cell TBLAST (E<0.001) 3,093 sequences

60 billion base pairs 60 million sequences GenBank has more than doubled since 2001 …

GenBank has more than doubled since 2001 … but the fraction of unknowns remains constant Edwards (2005) Nature Rev. Microbiol.

All of the new genes in the databases are coming from environmental sequences

Outline How and why we sequence environments Viral metagenomics –Marine stories –Human stories Pyrosequencing –Mine story Is there a Future?

Human-associated viruses More bacteria than somatic cells by at least an order of magnitude More phages than bacteria by an order of magnitude Sample the bacteria in the intestine by sampling their phage

Most Viral DNA Sequences in Adult Human Feces are Unknown Phages Known 40% Unknown 60% Breitbart (2003) J. Bacteriol. TBLAST (E<0.001) 532 sequences Phages 94% Eukaryotic Viruses 6%

No bacteria or viruses in 1 st fecal sample Abundant bacterial and viral communities by 1 week of age Adults Versus Babies >10 8 VLP/g feces

Baby Feces Viruses Most sequences are unknown (≈70%) Similarities to phages from Lactococcus, Lactobacillus, Listeria, Streptococcus, and other Gram positive hosts From microarray studies, sequences are stable in the baby over a 3 month period Same types of phage as present in adult feces –one identical sequence to an unrelated adult!

DNA viruses in feces are phages. Feces ≠ intestines. RNA viruses?

Most Human RNA Viruses are Known Known 92% Unknown 8% TBLAST (E<0.001) ≈36,000 sequences Pepper Mild Mottle Virus 65% Other Plant Viruses 9% Other 26% Zhang (2006) PLoS Biology

Pepper Mild Mottle Virus (PMMV) ssRNA virus; ≈6 kb genome Related to Tobacco Mosaic Virus Infects members of Capsicum family Widely distributed – spread through seeds Fruits are small, malformed, mottled Rod-shaped virions TOBACCO MOSAIC VIRUS k/ppi/links/pplinks/virusems/ Viral particles in fecal sample

S1S2S3S4S5S6S7S8S9PMMV PMMV is common in Human Feces Fecal samples Extract total RNA RT-PCR for PMMV San Diego : 78% people are positive Singapore : 67% people are positive fold increase in feces compared to food PMMV copies per gram dry weight of feces

Indian curry Pork noodle red chili Chicken rice Chinese food Hong Kong chili sauce Hong Kong green chili Vegetarian chili Which Foods Contain PMMV? Chili powder Chili sauces NOT FOUND IN FRESH PEPPERS

Thesunmachine.net Koch’s Postulates

Human microbial metagenome is more important than human genome

Outline How and why we sequence environments Viral metagenomics –Marine stories –Human stories Pyrosequencing –Mine story Is there a Future?

How do you sequence the environment? Extract DNA Create library Sequence fragments Everything so far from 40,000 sequences This is so 2004

How do you sequence the environment? Extract DNA Pyrosequence

454 Pyrosequencing Emulsion-based PCR Luciferase-based sequencing } SDSU DNA extraction from environment Whole genome amplification } 454 Inc. Margulies (2005) Nature

454 Sequence Data (Only from Rohwer Lab) 21 libraries –10 microbial, 11 phage 597,340,328 bp total –20% of the human genome –50% of all complete and partial microbial genomes 5,769,035 sequences –Average 274,716 per library Average read length bp –Av. read length has not increased in 7 months

Growth of sequence data 6 million reads 600 million bp

Cost of sequencing One reaction = $10,000 One reaction = 250,000 reads 250 reads = $10 1 read = 4¢ 1 read = 100bp 1 bp = 0.04¢ ($400 per 1x 1,000,000 bp) Sanger sequencing ca. $1/rxn, 0.2¢/bp –real cost ca. $5/rxn, 1¢/bp 454 sequencing does cot require cloning, arraying etc.

Bioinformatics 597,340,328 bp total 5,769,035 sequences 7 months Existing tools are not sufficient

Current Pipeline Dereplicate BLAST against –16S –Complete phage –nr (SEED) –subsystems

Sequencing is cheap and easy. Bioinformatics is neither.

Outline How and why we sequence environments Viral metagenomics –Marine stories –Human stories Pyrosequencing –Mine story Is there a Future?

The Soudan Mine, Minnesota Red Stuff Oxidized Black Stuff Reduced

Red and Black Samples Are Different Cloned and 454 sequenced 16S are indistinguishable Black stuff Red Cloned Red

Annotation of metagenomes by subsystems A subsystem is a group of genes that work together –Metabolism –Pathway –Cellular structures –Anything an annotator thinks is interesting

There are different amounts of metabolism in each environment

There are different amounts of substrates in each environment Black Stuff Red Stuff

But are the differences significant? Sample 10,000 proteins from site 1 Count frequency of each subsystem Repeat 20,000 times Repeat for sample 2 Combine both samples Sample 10,000 proteins 20,000 times Build 95% CI Compare medians from sites 1 and 2 with 95% CI Rodriguez-Brito (2006). In Review

Examples of significantly different subsystems Red Stuff Arg, Trp, His Ubiquinone FA oxidation Chemotaxis, Flagella Methylglyoxal metabolism Black Stuff Ile, Leu, Val Siderophores Glycerolipids NiFe hydrogenase Phenylpropionate degradation

Subsystem differences & metabolism Iron acquisition Black Stuff Siderophore enterobactin biosynthesis ferric enterobactin transport ABC transporter ferrichrome ABC transporter heme Black stuff: ferrous iron (Fe 2+, ferroan [(Mg,Fe) 6 (Si,Al) 4 O 10 (OH) 8 ]) Red stuff: ferric iron (goethite [FeO(OH)])

Nitrification differentiates the samples Edwards (2006) In review

Not all biochemistry happens in a single organism Anaerobic methane oxidation Boetius et al. Nature, CH 4 + SO > HCO HS - + H 2 S Archaea CH 4 + H 2 O -> HCO OH + H 2 -> CO 2 + H 2 O Bacteria SO H 2 O -> HS - + OH + 2O 2

The challenge is explaining the differences between samples Red Sample Arg, Trp, His Ubiquinone FA oxidation Chemotaxis, Flagella Methylglyoxal metabolism Black Sample Ile, Leu, Val Siderophores Glycerolipids NiFe hydrogenase Phenylpropionate degradation

We are moving away from one organism one reaction and towards studying the biochemistry of whole environments Bacteria don’t live alone

Summary From 454 sequence: –Identify microbial composition –Identify metabolic function –Identify statistically significant differences in metabolism –Who, what, why of microbial ecology

Marine Near-shore water (~100 samples) Off-shore water (~50 samples) Near- and off-shore sediments Metazoan associated Corals Fish Human blood Human stool Sampling Sites Terrestrial/Soil Amazon rainforest Konza prairie Joshua Tree desert Singapore Air Freshwater Aquifer Glacial lake Extreme Hot springs (84 o C; 78 o C) Soda lake (pH 13) Solar saltern (>35% salt)

SDSU Forest Rohwer Mya Breitbart Beltran Rodriguez-Brito Rohwer Lab: Linda Wegley Florent Angly Matt Haynes Also at SDSU Anca Segall Willow Segall Stanley Maloy Math Peter Salamon Joe Mahaffy James Nulton Ben Felts David Bangor Steve Rayhawk Jennifer Mueller MIT: Ed DeLong NSF - Biotic Surveys and Inventories - Biological Oceanography - Biocomplexity FIG Veronika Vonstein Ross Overbeek Annotators Genome Institute of Singapore: Zhang Tao Charlie Lee Chia Lin Wei Yijun Ruan

Viral Community Structure Contigs assembled from fragments with >= 98% identity over 20 bp are a resampling of a single phage genome Contig specturm is the number of contigs that have one sequence, the number that have two sequences, and so on Use both analytical and Monte-Carlo simulations to predict community structure from contig spectrum The Math Guys (2006) In preparation

Determine the actual contig spectrum of the sample Predict a contig spectrum using a species abundance model Compute the error between the actual and predicted Adjust the parameters in the species abundance model to minimize errors Continue this procedure until we obtain the smallest error Find the smallest error, a global minimum Model parameters Error

Fecal Seawater Marine Sediments Viral Communities are Extremely Diverse Lots of rare viral genotypes

Cropland Earthworms Fossil Corals Forest Amphibians River Bacteria Sediment Viruses Seawater Viruses Fecal Viruses Bacteria on Corals Agriculture Soil Bacteria Soil Nematodes Rainforest Spiders Amazon Fish Rainforest Birds Forest Mammals Temperate Forest Beetles Shannon- Wiener Index