Genomics, Metagenomics, and Google
Rob Edwards
San Diego State University, San Diego, CA
Argonne National Laboratory, Argonne, IL
Outline
● Biology | Metagenomics | Yikes!
● (More biology?)
● Bioinformatics
● Things Google could do
● Things we do with Google
How much has been sequenced?
[Figure: number of known sequences by year; milestones marked: first bacterial genome, 100 bacterial genomes, 1,000 bacterial genomes, environmental sequencing]
How much will be sequenced?
[Figure: projected sequencing by year; milestones marked: 100 people, everybody in Google, everybody in USA, all cultured Bacteria, one genome from every species, most major microbial environments]
Why Metagenomics?
● What is there?
● How many are there?
● What are they doing?
● Experimental manipulations?
Human-associated viruses
● More bacteria than somatic (human) cells by at least an order of magnitude
● More viruses than bacteria by an order of magnitude
● Sample the things in the intestine by sampling the viruses
Most Viral DNA Sequences in Adult Human Feces are Unknown Phages
● Known 40%; Unknown 60%
● Phages 94%; Eukaryotic Viruses 6%
● Breitbart (2003) J. Bacteriol.
Most Human RNA Viruses are Known
● Known 92%; Unknown 8%
● Pepper Mild Mottle Virus 65%; Other Plant Viruses 9%; Other 26%
● Zhang (2006) PLoS Biology
Pepper Mild Mottle Virus (PMMV)
● ssRNA virus; ≈6 kb genome
● Related to Tobacco Mosaic Virus
● Infects members of the genus Capsicum
● Widely distributed; spread through seeds
● Fruits are small, malformed, mottled
● Rod-shaped virions
[Images: Tobacco mosaic virus (ppi/links/pplinks/virusems/); viral particles in fecal sample]
PMMV is Common in Human Feces
● Fecal samples: extract total RNA, then RT-PCR for PMMV
● San Diego: 78% of people are positive
● Singapore: 67% of people are positive
[Gel lanes: S1 to S9, PMMV]
[Figure: PMMV copies per gram dry weight of feces; fold increase in feces compared to food]
Which Foods Contain PMMV?
[Figure labels: Indian curry, pork noodle red chili, chicken rice, Chinese food, Hong Kong chili sauce, Hong Kong green chili, vegetarian chili]
● Chili powder
● Chili sauces
● NOT FOUND IN FRESH PEPPERS
Where Next?
● More (but not much more) biology?
● Less biology
● No biology
Phages, Reefs, Human Disturbance
Different Bacteria At Each Island
More People == More Pathogens
● Negative numbers mean relatively more phage hosts at Kingman
Bioinformatics Tools
The SEED Family
The metagenomics RAST server
Automated Processing
Computational Requirements
[Figure: hours of compute time vs. input size (MB)]
● ~19 hours of compute per input megabyte
Computational Time
How much so far
● Total: 2,740 metagenomes; 255,178,533 sequences; 65,595,200,612 bp (65.6 Gbp)
● Public: 299 metagenomes; 45,445,163 sequences; 19,341,509,132 bp (19 Gbp)
● Compute time (on a single CPU): 1,246,308 hours = 51,929 days = 142 years
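A quick check of the compute-time arithmetic, as a sketch rather than the server's actual accounting. It assumes the ~19 hours per input megabyte rate applies uniformly, that one base occupies one byte, and that a megabyte is 10^6 bytes; the constant and function names are illustrative only.

```python
HOURS_PER_MB = 19         # rate quoted for compute per input megabyte
BYTES_PER_BASE = 1        # assumption: one byte per sequenced base
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def compute_hours(total_bp: int) -> float:
    """Estimated single-CPU hours needed to process total_bp bases."""
    megabytes = total_bp * BYTES_PER_BASE / BYTES_PER_MB
    return megabytes * HOURS_PER_MB


total_bp = 65_595_200_612       # total bases processed so far
hours = compute_hours(total_bp)
days = hours / 24
years = days / 365

# Truncate to whole units when reporting.
print(f"{int(hours):,} hours = {int(days):,} days = {int(years)} years")
# prints: 1,246,308 hours = 51,929 days = 142 years
```

Under these assumptions the quoted totals follow directly from the per-megabyte rate.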
Metagenomics Tools
● Annotation & Subsystems
Lots of sequences, all pyrosequencing
From Sequences To Environments
[Figure: CDA plot (axes: CDA 60.2%, CDA 21.7%); functional categories: sulfur, respiration, capsule, motility, membrane transport, stress, signaling, phosphorus, RNA; environments: mine, saltern, marine, microbialites, coral, fish, animals, freshwater]
● Dinsdale et al., Nature 2008
Chickens, Cows, Mice, and People; Oh my!
Virulence Subsystems In The Intestines
● Qu et al., PNAS, 2009
Microbial Virulence Genes Discriminate Hosts
● Qu et al., PNAS, 2009
Sampling Sites
● Marine: near-shore water; off-shore water; near- and off-shore sediments
● Metazoan associated: corals; fish; human
● Terrestrial/Soil: NEON sites; urban
● Airborne
● Freshwater: aquifer; glacial lake
● Extreme: hot springs (84 °C; 78 °C); soda lake (pH 13); solar saltern (>35% salt)
Desire: Searching (Text)
● Searching for genes (names, functions, text strings)
● Searching for controlled vocabulary terms (Subsystems, GO terms)
● Federating disparate data
● NCBI, SEED, JGI, EBI, DDBJ
● Annotation clearinghouse
Web services
Desire: Searching (Sequence)
● Searching for [DNA, protein]
● A better BLAST search
● Separate word matching from extension/scoring
● Perfectly (embarrassingly) parallel
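Why sequence search is embarrassingly parallel can be sketched as follows: each database shard is searched independently and only the small per-shard hit lists are merged. This is a toy illustration, not BLAST; the shared 3-letter-word count stands in for a real alignment score, threads stand in for separate machines, and all function names and sequences are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def kmers(seq, k=3):
    """Set of all k-letter words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def search_shard(query, shard):
    """Score every sequence in one shard; needs no data from other shards."""
    query_words = kmers(query)
    return [(len(query_words & kmers(seq)), name) for name, seq in shard]


def parallel_search(query, shards, top=2):
    """Search all shards concurrently, then merge the best hits."""
    with ThreadPoolExecutor() as pool:
        per_shard = pool.map(lambda shard: search_shard(query, shard), shards)
        return heapq.nlargest(top, (hit for hits in per_shard for hit in hits))


# Hypothetical two-shard database.
shards = [
    [("a", "MKTAYIAKQR"), ("b", "WWWWWW")],
    [("c", "AYIAK"), ("d", "GGGG")],
]
print(parallel_search("MKTAYIAKQR", shards))
# prints: [(8, 'a'), (3, 'c')]
```

Because no shard depends on another, adding machines shortens the search roughly in proportion, up to the cost of the final merge.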
Desire: How BLAST Works
● Protein sequence
● Find all words in the protein sequence (>3 letters by default)
● Filter for words above a threshold
● Extend while score is above another threshold
● Calculate & report final score for alignment (high scoring pairs)
[Diagram labels: Map, Reduce]
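The word-match, filter, extend, and score steps can be sketched as a toy map/reduce pipeline. This is a simplified illustration under stated assumptions, not the BLAST implementation: scoring is +2 for a match and -1 for a mismatch instead of a substitution matrix, extension is ungapped, and the split (word matching as the map step, extension and scoring as the reduce step) is one plausible reading of the diagram. All names, thresholds, and sequences are hypothetical.

```python
from collections import defaultdict

WORD_SIZE = 3  # toy word length


def letter_score(a, b):
    """Toy scoring: +2 match, -1 mismatch (real BLAST uses a matrix)."""
    return 2 if a == b else -1


def map_seeds(query, subject_id, subject, threshold=6):
    """Map step: emit (subject_id, seed) for word pairs scoring >= threshold."""
    for q in range(len(query) - WORD_SIZE + 1):
        for s in range(len(subject) - WORD_SIZE + 1):
            word = sum(letter_score(query[q + i], subject[s + i])
                       for i in range(WORD_SIZE))
            if word >= threshold:
                yield subject_id, (q, s)


def extend(query, subject, q, s, max_drop=2):
    """Ungapped extension; stop once the score falls max_drop below the best."""
    best = sum(letter_score(query[q + i], subject[s + i])
               for i in range(WORD_SIZE))
    start, end = 0, WORD_SIZE  # offsets from the seed start; end is exclusive
    run, i = best, WORD_SIZE
    while q + i < len(query) and s + i < len(subject):  # extend right
        run += letter_score(query[q + i], subject[s + i])
        if run > best:
            best, end = run, i + 1
        elif best - run > max_drop:
            break
        i += 1
    run, i = best, 1
    while q - i >= 0 and s - i >= 0:  # extend left
        run += letter_score(query[q - i], subject[s - i])
        if run > best:
            best, start = run, -i
        elif best - run > max_drop:
            break
        i += 1
    return best, q + start, q + end


def search(query, database):
    """Group seeds by subject, then reduce each group to its best pair."""
    grouped = defaultdict(list)
    for subject_id, subject in database.items():
        for key, seed in map_seeds(query, subject_id, subject):
            grouped[key].append(seed)
    # Reduce step: extend every seed and keep the highest-scoring result.
    return {sid: max(extend(query, database[sid], q, s) for q, s in seeds)
            for sid, seeds in grouped.items()}


database = {"hit": "GGMKTAYLAKQRGG", "miss": "WWWWWWWWWW"}
print(search("MKTAYIAKQR", database))
# prints: {'hit': (17, 0, 10)}  i.e. score 17 over query positions 0..10
```

Keeping the map step free of any extension logic is what lets word matching be distributed separately from scoring.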
Desire: Data Visualization
● Google App Engine // GWT to extract information
● Searching | Browsing | Annotation
● 1 MB limit too small
Doing: Data Mapping
● SEED/KML/PostGIS
● Satellite photosynthesis vs. photosynthesis genes
● Pathogens around Kiritimati island
● Liz Dinsdale (Biology); Bahador Nosrat (MSc student)
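One way sample sites could be exported as KML for display on a map, as a minimal sketch. The SEED and PostGIS sides are not shown, the function name is hypothetical, and the coordinates and descriptions are approximate placeholders rather than the project's data. Note that KML orders coordinates as longitude,latitude.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def sites_to_kml(sites):
    """Build a KML document with one Placemark per (name, lat, lon, note)."""
    ET.register_namespace("", KML_NS)  # emit KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lat, lon, note in sites:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        ET.SubElement(placemark, f"{{{KML_NS}}}description").text = note
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML expects longitude first, then latitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Approximate, illustrative coordinates only.
sites = [
    ("Kiritimati", 1.87, -157.4, "placeholder description"),
    ("Kingman", 6.38, -162.42, "placeholder description"),
]
kml_text = sites_to_kml(sites)
print(kml_text)
```

The resulting text can be saved with a .kml extension and opened in a KML viewer.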
Doing: Open Social
● Vasken Kamikisissian; Matt Seitz (Undergraduates)
Acknowledgements
● Environmental Genomics: Forest Rohwer; Brian White; Mya Breitbart; all the labs that provided sequence
● Metagenomics Annotation Server: Rick Stevens; Folker Meyer; Bob Olson; Daniel Paarman; Mark D'Souza; Jared Wilkening; Andreas Wilke
● Statistics & Web services: Liz Dinsdale; Robert Schmieder; Dana Hall; Beltran Rodriguez-Brito; Bahador Nosrat
● FIG: Ross Overbeek; Veronika Vonstein; Annotators
● Artist: Paula Morris
● Argonne Sequencing: Marc Domanus; Areej Ammar