Genomics, Metagenomics, And Google Rob Edwards San Diego State University, San Diego, CA Argonne National Laboratory, Argonne, IL

Presentation transcript:

Outline ● Biology | Metagenomics | Yikes! ● (More biology?) ● Bioinformatics ● Things Google could do ● Things we do with Google

How much has been sequenced? [Figure: number of known sequences by year; marked points: first bacterial genome, 100 bacterial genomes, 1,000 bacterial genomes, environmental sequencing]

How much will be sequenced? [Figure: projected sequencing by year; marked points: 100 people, all cultured Bacteria, one genome from every species, most major microbial environments, everybody in USA, everybody in Google]

Why Metagenomics? ● What is there? ● How many are there? ● What are they doing? ● Experimental manipulations?

Human-associated viruses ● More bacteria than somatic (human) cells by at least an order of magnitude ● More viruses than bacteria by an order of magnitude ● Sample the things in the intestine by sampling the viruses

Most Viral DNA Sequences in Adult Human Feces are Unknown Phages ● Known 40%, Unknown 60% ● Phages 94%, Eukaryotic Viruses 6% ● Breitbart (2003) J. Bacteriol.

Most Human RNA Viruses are Known ● Known 92%, Unknown 8% ● Pepper Mild Mottle Virus 65%, Other Plant Viruses 9%, Other 26% ● Zhang (2006) PLoS Biology

Pepper Mild Mottle Virus (PMMV) ● ssRNA virus; ≈6 kb genome ● Related to Tobacco Mosaic Virus ● Infects members of Capsicum family ● Widely distributed – spread through seeds ● Fruits are small, malformed, mottled ● Rod-shaped virions [Images: TOBACCO MOSAIC VIRUS ppi/links/pplinks/virusems/; viral particles in fecal sample]

PMMV is common in Human Feces ● Fecal samples, extract total RNA, RT-PCR for PMMV ● San Diego: 78% of people are positive ● Singapore: 67% of people are positive ● fold increase in feces compared to food [Figure: gel lanes S1 to S9 and PMMV control; PMMV copies per gram dry weight of feces]

Which Foods Contain PMMV? ● Indian curry ● Pork noodle red chili ● Chicken rice ● Chinese food ● Hong Kong chili sauce ● Hong Kong green chili ● Vegetarian chili ● Chili powder ● Chili sauces ● NOT FOUND IN FRESH PEPPERS

Where Next? ● More (but not much more) biology? ● Less biology ● No biology

Phages, Reefs, Human Disturbance

Different Bacteria At Each Island

More People == More Pathogens ● Negative numbers mean relatively more phage hosts at Kingman

Bioinformatics Tools

The SEED Family

The metagenomics RAST server

Automated Processing

Computational Requirements ● ~19 hours of compute per input megabyte [Figure: hours of compute time vs. input size (MB)]

Computational Time

How much so far ● Total: 2,740 metagenomes; 255,178,533 sequences; 65,595,200,612 bp (≈66 Gbp) ● Public: 299 metagenomes; 45,445,163 sequences; 19,341,509,132 bp (19 Gbp) ● Compute time (on a single CPU): 1,246,308 hours = 51,929 days = 142 years
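
The totals on this slide can be cross-checked against the rate of ~19 hours per input megabyte quoted on the "Computational Requirements" slide. The sketch below is only a back-of-the-envelope check, not the server's accounting: it assumes one base pair occupies roughly one byte, that a megabyte is 10^6 bytes, and a 365-day year. The function name `estimate_compute` is made up for illustration.

```python
HOURS_PER_MB = 19  # rate quoted on the "Computational Requirements" slide


def estimate_compute(total_bp, hours_per_mb=HOURS_PER_MB):
    """Estimate single-CPU compute time for a given number of base pairs.

    Assumptions (not stated on the slides): 1 bp ~ 1 byte, 1 MB = 10**6 bytes,
    1 year = 365 days.
    """
    megabytes = total_bp / 1_000_000
    hours = megabytes * hours_per_mb
    days = hours / 24
    years = days / 365
    return hours, days, years


# Total base pairs reported on the "How much so far" slide.
hours, days, years = estimate_compute(65_595_200_612)
print(f"{hours:,.0f} hours = {days:,.0f} days = {years:.0f} years")
```

Under these assumptions the estimate lands within a day of the slide's 1,246,308 hours, which suggests the reported compute time was derived from the total base-pair count.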

Metagenomics Tools ● Annotation & Subsystems

Lots of sequences, all pyrosequencing

From Sequences To Environments [Figure: axes CDA 60.2% and CDA 21.7%; subsystems: Sulfur, Respiration, Capsule, Motility, Membrane transport, Stress, Signaling, Phosphorus, RNA; environments: Mine, Saltern, Marine, Microbialites, Coral, Fish, Animals, Freshwater] Dinsdale et al, Nature 2008

Chickens, Cows, Mice, and People; Oh my!

Virulence Subsystems In The Intestines ● Qu et al, PNAS, 2009

Microbial Virulence Genes Discriminate Hosts ● Qu et al, PNAS, 2009

Sampling Sites ● Marine: near-shore water; off-shore water; near- and off-shore sediments ● Metazoan associated: corals; fish; human ● Terrestrial/Soil: NEON sites; urban ● Airborne ● Freshwater: aquifer; glacial lake ● Extreme: hot springs (84°C; 78°C); soda lake (pH 13); solar saltern (>35% salt)

Desire: Searching (Text) ● Searching for genes (names, functions, text strings) ● Searching for controlled vocabulary terms (Subsystems, GO terms) ● Federating disparate data ● NCBI, SEED, JGI, EBI, DDBJ ● Annotation clearinghouse

Web services

Desire: Searching (Sequence) ● Searching for [DNA, protein] ● A better BLAST search ● Separate word matching from extension/scoring ● Perfectly (embarrassingly) parallel
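
To illustrate why sequence search is called embarrassingly parallel here, the sketch below splits a toy database into chunks and searches each chunk independently, then merges the partial results. Everything in it is an illustrative assumption: `count_shared_words` is a stand-in for a real search, the function names are hypothetical, and threads are used only to keep the example small (a real deployment would spread chunks across processes or machines).

```python
from concurrent.futures import ThreadPoolExecutor


def count_shared_words(query, subject, k=3):
    """Stand-in for a real search: count k-letter words common to both sequences."""
    q_words = {query[i:i + k] for i in range(len(query) - k + 1)}
    s_words = {subject[i:i + k] for i in range(len(subject) - k + 1)}
    return len(q_words & s_words)


def search_chunk(query, chunk):
    """Search one chunk of the database; needs no data from any other chunk."""
    return {sid: count_shared_words(query, seq) for sid, seq in chunk}


def parallel_search(query, database, workers=4):
    """Split the database into independent chunks and merge the partial results."""
    items = list(database.items())
    chunks = [items[i::workers] for i in range(workers)]
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda c: search_chunk(query, c), chunks):
            results.update(partial)
    return results


database = {"a": "MKVLAAGIW", "b": "TTKVLAAPP", "c": "GGGGGG"}
print(parallel_search("MKVLAAGIW", database))
```

Because no chunk depends on another, adding workers changes only how the work is divided, not the result.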

Desire: How BLAST Works ● Protein sequence ● Find all words in the protein sequence (>3 letters by default) ● Filter for words above a threshold ● Extend while score is above another threshold ● Calculate & report final score for alignment (high scoring pairs) ● Map | Reduce
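
The steps on this slide can be sketched as a toy seed-and-extend search split into a map step (word matching) and a reduce step (extension and scoring). This is a simplified illustration, not BLAST itself: it assumes a word length of 3, uses exact word matches instead of scoring neighborhood words against a threshold, extends without gaps, and uses made-up match/mismatch scores rather than a substitution matrix. All function names and constants are hypothetical.

```python
from collections import defaultdict

WORD_SIZE = 3              # assumed word length
MATCH, MISMATCH = 2, -1    # toy scores; real BLAST uses a substitution matrix
X_DROP = 3                 # stop extending once the score falls this far below its best


def index_words(query, k=WORD_SIZE):
    """Record where every k-letter word occurs in the query."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def map_hits(query_index, subject_id, subject, k=WORD_SIZE):
    """Map step: emit (subject_id, query_pos, subject_pos) for each shared word."""
    for j in range(len(subject) - k + 1):
        for i in query_index.get(subject[j:j + k], ()):
            yield subject_id, i, j


def _extend_one_way(query, subject, qi, sj, step):
    """Best score gained by walking in one direction from (qi, sj)."""
    gain = best = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += MATCH if query[qi] == subject[sj] else MISMATCH
        if gain > best:
            best = gain
        elif best - gain >= X_DROP:
            break
        qi += step
        sj += step
    return best


def extend(query, subject, i, j, k=WORD_SIZE):
    """Ungapped extension of one word hit in both directions."""
    left = _extend_one_way(query, subject, i - 1, j - 1, -1)
    right = _extend_one_way(query, subject, i + k, j + k, +1)
    return k * MATCH + left + right


def reduce_hits(query, subjects, hits):
    """Reduce step: keep the highest-scoring extension for each subject."""
    best = {}
    for sid, i, j in hits:
        score = extend(query, subjects[sid], i, j)
        if score > best.get(sid, 0):
            best[sid] = score
    return best


def search(query, subjects):
    index = index_words(query)
    hits = (h for sid, seq in subjects.items() for h in map_hits(index, sid, seq))
    return reduce_hits(query, subjects, hits)


subjects = {"a": "MKVLAAGIW", "b": "XXXXXXXX", "c": "TTKVLAAPP"}
print(search("MKVLAAGIW", subjects))
```

Keeping word matching and extension in separate functions mirrors the slide's point that the two phases can be separated: the map step needs only the word index, while the reduce step needs only the hits for one subject.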

Desire: Data Visualization ● Google App Engine // GWT to extract information ● Searching | Browsing | Annotation ● 1Mb limit too small

Desire: Data Visualization

Doing: Data Mapping ● SEED/KML/PostGIS ● Liz Dinsdale (Biology); Bahador Nosrat (MSc student) ● Satellite photosynthesis vs. photosynthesis genes ● Pathogens around Kiritimati island

Doing: Open Social ● Vasken Kamikisissian; Matt Seitz (Undergraduates)

Acknowledgements ● Environmental Genomics: Forest Rohwer, Brian White, Mya Breitbart, all the labs that provided sequence ● Metagenomics Annotation Server: Rick Stevens, Folker Meyer, Bob Olson, Daniel Paarman, Mark D'Souza, Jared Wilkening, Andreas Wilke ● Statistics & Web services: Liz Dinsdale, Robert Schmieder, Dana Hall, Beltran Rodriguez-Brito, Bahador Nosrat ● FIG: Ross Overbeek, Veronika Vonstein, Annotators ● Artist: Paula Morris ● Argonne Sequencing: Marc Domanus, Areej Ammar