C A M E R A A Metagenomics Resource for Microbial Ecology Saul A. Kravitz J. Craig Venter Institute Rockville, Maryland USA KNAW Colloquium May 29, 2008.

Presentation transcript:

Goals
- Introduce you to CAMERA
- Encourage you to use CAMERA
- What can CAMERA do for you?

Presentation Outline
- Introduction to Metagenomics
- Global Ocean Sampling (GOS) Expedition
- CAMERA Capabilities and Features
  - Compute Resources
  - Data Resources
  - Tools Resources
- Looking Forward

Metagenomic Questions
- Within an environment
  - What biological functions are present (absent)?
  - What organisms are present (absent)?
- Compare data from (dis)similar environments
  - What are the fundamental rules of microbial ecology?
- Adapting to environmental conditions?
  - How?
  - Evidence and mechanisms for lateral transfer
- Search for novel proteins and protein families
  - And diversity within known families

Genomics vs Metagenomics
- Genomics – 'Old School'
  - Study of a single organism's genome
  - Genome sequence determined using shotgun sequencing and assembly
  - >1300 microbes sequenced, first in [...]
  - DNA usually obtained from pure cultures (<1%)
- Metagenomics
  - Application of genome sequencing methods to environmental samples (no culturing)
  - Environmental shotgun sequencing is the most widely used approach
  - Environmental Metadata provides key context

Complexity of Microbial Communities
- Simple (e.g., AMD (acid mine drainage), gutless worm)
  - Few species present (<10)
  - Diverse
  - → Variations on standard genomics techniques
- Complex (e.g., Soil or Marine)
  - Many species present (>10, often >1000)
  - Many closely related
  - → New techniques

Global Ocean Sampling Expedition

Global Ocean Sampling (GOS)
- 178 Total Sampling Locations
  - Phase 1: 7.7M reads, >6M proteins (3/07)
  - Phase 2-IO: 2.2M reads (3/08)
  - Phase 2: ~10M reads (future)
- Diverse Environments
  - Open ocean, estuary, embayment, upwelling, fringing reef, atoll…
Figure labels: 4/04, 3/07, 3/08

GOS: Sequence Diversity in the Ocean, Rusch et al (PLoS 2007)
- Most sequence reads are unique
  - Very limited assembly
  - Most sequences not taxonomically anchored
  - Relating shotgun data to reference genomes
  - Annotation challenging
- New Techniques Needed
  - Fragment Recruitment
  - Extreme Assembly to find pan genomes
  - Sample to Sample Comparisons

Comparison of Dominant Ribotypes

Comparison of Total Genomic Content

GOS Protein Analysis, Yooseph et al (PLoS 2007)
- Novel clustering process
  - Sequence similarity based
  - Predict proteins and group into related clusters
  - Include GOS and all known proteins
- Findings
  - GOS proteins cover ~all existing prokaryotic families
  - GOS expands diversity of known protein families
  - ~10% of large clusters are novel
    - Many are of viral origin
  - No saturation in the rate of novel protein family discovery

Added Protein Family Diversity: Rubisco homologs, Yooseph et al (PLoS 2007)
Figure labels: New Groups, GOS prokaryotes, Known eukaryotes, Known prokaryotes

GOS Viral Analysis (Williamson et al, PLoS ONE 2008)
- Study of dsDNA viruses from shotgun data
  - 155k viral proteins identified from 37 GOS I sites (~2.5%)
  - 59% of viral sequences were bacteriophage
- Viral acquisition and retention of host metabolic genes is common and widespread
  - Viruses have made these genes "their own"
  - Clade tightly with viral genes
- Codistribution of P-SSM4-like cyanophage and the dominant ecotype of Prochlorococcus in GOS samples

Viral acquisition of host genes: talC gene
Figure legend: GOS Viral, Public Viral, GOS Bacterial, Public Bacterial, Public Euk

Reference Genomes
- Overview: reference marine microbes (101 released)
  - Scaffold for GOS
  - Sequenced, assembled, autoannotated
- Isolation Metadata
  - Incomplete
- Bottlenecks
  - Availability of DNA
  - Purity of DNA
- Status and Data

Motivations for CAMERA
- Significant investment in sequencing
  - Only accessible to bioinformatics elite
  - Diversity of user sophistication and needs
- Bioinformatics and Computation Challenges
  - Assembly, annotation, comparative analysis, visualization
  - Dedicated compute resources
- Importance of Metadata
  - Metadata required for environmental analysis
  - Need to drive standards
- Compliance with Convention on Biological Diversity

Convention on Biological Diversity
- Sample in territorial waters?
  - Country granted certain rights by CBD
  - Sampling agreements may contain restrictions
- CAMERA users must acknowledge potential restrictions on commercial data use
- CAMERA maintains mapping of country-of-origin for all data objects

CAMERA
- "Convenient acronym for cumbersome name…" - Henry Nicholls, PLoS Biology
- Mission
  - Enable Research in Marine Microbiology
- Debuted March 2007

CAMERA Capabilities
- Compute Resources
  - node compute grid
  - Tb storage
- Data and Metadata Resources
  - Annotated Metagenomic and genomic data
- Tools Resources
  - Scalable BLAST
  - Fragment Recruitment
  - Metagenomic Annotation
  - Text Search

CAMERA Compute and Storage Complex at UCSD/Calit2
- 512 Processors, ~5 Teraflops
- ~200 Terabytes Storage
Source: Larry Smarr, Calit2

CAMERA Metagenomic Data Volume by Project

CAMERA Metagenomic Samples

CAMERA Users >2000 Registered Since March 2007

CAMERA Data Collections
- Metagenomic Sequence Collection
  - Reads and assemblies w/associated metadata
  - CAMERA-computed annotation
- Protein Clusters
  - Maintaining clusters from Yooseph et al (Yooseph and Li, '08)
- Genomic Data
  - Viral, Fungal, pico-Eukaryotes, Microbial
  - Moore Marine Genomes with Metadata
- Non-redundant sequence Collection
  - Genbank, Refseq, Uniprot/Swissprot, PDB etc

Standardizing Contextual Metadata
- Genomic Standards Consortium
  - Led by Dawn Field, NIEeS
  - Members from EU, UK, US
- Goals are to promote
  - Standardization of genomic descriptions
  - Exchange & Integration of genomic data
- Metadata standardization key enabler
  - MIMS: Min Info for Metagenomic Sample
  - GCDML: Standard format

Contextual Metadata Challenges
- Researchers Need to Collect and Submit
- Relevant metadata depends on study – MIMS
  - Specification of minimum metadata
- Standardize Exchange Format - GCDML
  - Comprehensive and extensible
  - Leverages Existing Ontologies, Validatable
- And…
  - Easy for a scientist to use...
- Need ongoing software support for tools

CAMERA Core Metadata by Project
- De facto Core
  - Latitude and Longitude
  - Collection date
  - Habitat and Geographic Location
- Missing metadata =
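A minimal sketch of how a de facto core like the one listed on this slide could be checked per sample. The field names, the record layout, and the example values are assumptions made for illustration; they are not CAMERA's schema or data.

```python
# Core fields named on the slide; the dictionary key names are assumptions.
CORE_FIELDS = ("latitude", "longitude", "collection_date",
               "habitat", "geographic_location")


def missing_core_metadata(sample):
    """Return the core fields that are absent or empty in one sample record."""
    return [field for field in CORE_FIELDS if sample.get(field) in (None, "")]


def coordinates_in_range(sample):
    """True if latitude and longitude are both present and physically valid."""
    lat, lon = sample.get("latitude"), sample.get("longitude")
    if lat is None or lon is None:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


# Hypothetical record, not real CAMERA data.
example = {"latitude": 10.0, "longitude": 20.0,
           "collection_date": "2008-01-01", "habitat": ""}

print(missing_core_metadata(example))  # lists the empty and the absent field
print(coordinates_in_range(example))
```

Reporting the missing fields by name, rather than a single pass/fail flag, mirrors the slide's per-project view of which metadata is missing.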

CAMERA Contextual Metadata

CAMERA 1.3

Scalable BLAST with Metadata
- Large searches permitted and encouraged
  - 454 FLX run vs "All Metagenomic"
  - Some larger tblastx jobs have run >20 hrs
  - 10kbp BLASTN vs All Metagenomic – 1 min
- BLAST XML or Tabular Export
- Searches against NRAA
  - BLAST XML output feeds MEGAN
- Searches against 'All Metagenomic'
  - GUI with metadata
  - Tabular with metadata
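To illustrate what "tabular with metadata" could look like downstream, here is a hedged sketch that parses BLAST tabular output and attaches sample metadata to each hit. It assumes the standard 12-column BLAST tabular layout; the actual columns of CAMERA's export, the `read_to_sample` mapping, and all example values are assumptions.

```python
import csv
import io

# Standard 12-column BLAST tabular layout; whether CAMERA's export
# matches it exactly is an assumption.
BLAST_COLUMNS = ("query", "subject", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore")
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_blast_tabular(handle):
    """Yield one dict per hit from tab-separated BLAST output, skipping comments."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(BLAST_COLUMNS, row))
        for name in INT_COLUMNS:
            hit[name] = int(hit[name])
        for name in FLOAT_COLUMNS:
            hit[name] = float(hit[name])
        yield hit


def attach_metadata(hits, read_to_sample, sample_metadata):
    """Add the sample id and that sample's metadata fields to every hit."""
    for hit in hits:
        sample_id = read_to_sample.get(hit["subject"])
        yield {**hit, "sample": sample_id, **sample_metadata.get(sample_id, {})}


# Hypothetical inputs for illustration only.
blast_text = "q1\tread_A\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n"
read_to_sample = {"read_A": "sample_1"}
sample_metadata = {"sample_1": {"habitat": "open ocean",
                                "latitude": 10.0, "longitude": 20.0}}

rows = list(attach_metadata(parse_blast_tabular(io.StringIO(blast_text)),
                            read_to_sample, sample_metadata))
print(rows[0]["subject"], rows[0]["sample"], rows[0]["habitat"])
```

Hits whose subject has no known sample keep `sample` set to `None`, so they can be filtered or reported separately.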

Scalable BLAST with Metadata

Integration of Metadata and Data

Browsing Large Data Collections: Fragment Recruitment Viewer
- Microbial Communities vs Reference Genomes
  - Millions of sequence reads vs Thousands of genomes
- Definition: A read is recruited to a sequence if:
  - End-to-end blastN alignment exists
- Rapid Hypothesis Generation and Exploration
  - How do cultured and wildtype genomes differ?
  - Insertions, deletions, translocations
  - Correlation with environmental factors
- Export sequence and annotation
Credits: Doug Rusch and Michael Press
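The recruitment definition on this slide can be sketched as a filter over BLASTN hits: keep a read only if its alignment spans the read end to end. The `Hit` record, the `slack` tolerance, the identity threshold, and the choice to keep one best hit per read are assumptions for illustration, not the published procedure.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLASTN alignment of a read (query) against a reference genome."""
    read_id: str
    read_length: int
    qstart: int      # 1-based alignment start on the read
    qend: int        # 1-based alignment end on the read
    ref_start: int   # alignment start on the reference
    ref_end: int     # alignment end on the reference (smaller on minus strand)
    pident: float    # percent identity


def is_end_to_end(hit, slack=0):
    """True if the alignment covers the whole read, allowing `slack` bases per end."""
    lo, hi = sorted((hit.qstart, hit.qend))
    return lo <= 1 + slack and hi >= hit.read_length - slack


def recruit(hits, min_identity=0.0, slack=0):
    """Return sorted (genomic position, percent identity, read id) plot points.

    Keeping only the highest-identity end-to-end hit per read is an assumption.
    """
    best = {}
    for hit in hits:
        if not is_end_to_end(hit, slack) or hit.pident < min_identity:
            continue
        if hit.read_id not in best or hit.pident > best[hit.read_id].pident:
            best[hit.read_id] = hit
    return sorted((min(h.ref_start, h.ref_end), h.pident, h.read_id)
                  for h in best.values())


# Hypothetical hits for illustration only.
hits = [
    Hit("r1", 100, 1, 100, 5000, 5099, 97.0),  # end-to-end
    Hit("r1", 100, 1, 100, 9000, 9099, 88.0),  # end-to-end, lower identity
    Hit("r2", 100, 20, 100, 700, 780, 99.0),   # partial alignment
    Hit("r3", 100, 1, 100, 399, 300, 82.0),    # end-to-end, minus strand
]
for position, identity, read_id in recruit(hits):
    print(position, identity, read_id)
```

Each returned tuple corresponds to one point in a recruitment plot of genomic position against sequence similarity.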

Fragment Recruitment Viewer
Figure axes: Sequence Similarity, Genomic Position
Credit: Doug Rusch, JCVI

Figure labels: Sequence Similarity, Genomic Position, Annotation, Geographic Legend

Prochlorococcus marinus str. MIT 9312
- Coloring by geography
- 80-95% identity cloud = GOS Indian Ocean
- Regions with no coverage
  - Where? Real?

Mate Status Highlights Differences
- Paired end (mate) sequencing
- Coloring by mate status
- Highlights cultured vs metagenomic differences
- Selective display of
  - Mates by status
  - Reads by sample

Mate Pairs Highlight Variation

What Genes are Involved

View by Sample

View by Sample: Filter by mate status

Annotation of Environmental Shotgun Data
- Gene Finding
  - Using Yooseph's Protein Clusters, and/or
  - Metagene
- Functional Assignment
  - Variation of JCVI prok annotation pipeline*
  - Leverages protein cluster annotation -- soon
- Quality Nearly Comparable to Prokaryotic Genomic Annotation

Protein Clusters as Gene Finder
- Identification and soft mask of ncRNAs
- Naïve identification of ORFs (60aa min)
- Add peptides to clusters incrementally
  - Yooseph and Li, 2008
- Predicted Genes based on ORFs in
  - Clusters of sufficient size
  - Clusters that satisfy additional filters
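The naive ORF step can be sketched as a six-frame scan for stop-free stretches of at least 60 codons. This is one plausible reading of "naïve identification of ORFs (60aa min)": treating an ORF as a run between stop codons, and counting runs truncated by the end of a read, are assumptions, and the ncRNA masking, clustering, and filtering steps are not shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string; non-ACGT characters are left unchanged."""
    return seq.translate(_COMPLEMENT)[::-1]


def naive_orfs(seq, min_codons=60):
    """Yield (strand, frame, start, end) for stop-free runs of >= min_codons codons.

    Coordinates are 0-based, end-exclusive, and relative to the scanned strand
    (for "-" that is the reverse complement). Runs cut off by the sequence end
    are reported too, since shotgun reads often truncate genes.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            pos = frame
            while pos + 3 <= len(s):
                if s[pos:pos + 3] in STOP_CODONS:
                    if (pos - run_start) // 3 >= min_codons:
                        yield (strand, frame, run_start, pos)
                    run_start = pos + 3
                pos += 3
            # Trailing run with no closing stop codon; pos is the end of
            # the last complete codon in this frame.
            if (pos - run_start) // 3 >= min_codons:
                yield (strand, frame, run_start, pos)


# Hypothetical read: a start codon, 60 alanine codons, and a stop codon.
read = "ATG" + "GCT" * 60 + "TAA"
for orf in naive_orfs(read):
    print(orf)
```

Because the scan is deliberately permissive, it also reports overlapping candidates in other frames; filtering those out is the role of the cluster-based steps listed on the slide.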

Protein Clusters Advantages and Disadvantages
- Weaknesses
  - Homology-based
  - Stateful (also a strength)
  - Less sensitive (for now)
- Strengths
  - More specific
  - Transitive Annotation
  - Learns over time
  - Easy to maintain

Search for Dehalogenase

Browse Clusters

Near Future
- More extensive data collection
- Summary views of data sets by
  - Annotation
  - Samples
  - Mate Status
  - Taxonomy
  - Habitat and other contextual metadata
- 16S datasets?

Credits
- JCVI CAMERA Team
  - Leonid Kagan, Michael Press, Todd Safford, Cristian Goina, Qi Yang, Sean Murphy, Jeff Hoover, Tanja Davidsen, Ramana Madupu, Sree Nampally, Nikhat Zhafar, Prateek Kumar
  - Doug Rusch, Shibu Yooseph, Aaron Halpern*, Granger Sutton, Shannon Williamson
  - Marv Frazier and Bob Friedman
- Calit2 CAMERA Team
  - Adam Brust, Michael Chiu, Brian Fox, Adam Dunne, Kayo Arima
  - Larry Smarr and Paul Gilna