18-21 August 2009 METAGENOMIC WORKSHOP James R. Cole, Ph.D. Ribosomal Database Project Center for Microbial Ecology Michigan State University
RDP Pyrosequencing Pipeline: Tools for high-throughput analysis
Additional Functions
– Shannon Index
– Rarefaction
– Alignment Merger
– Chao 1 Estimate
– Library Compare
– Dereplicate
Export Formats for Common Tools
– EstimateS
– SPADE
– Phylip
– PAST
– R
– Mothur
– Many others compatible!
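The Shannon Index and Chao 1 estimate listed above can be sketched from a list of per-OTU read counts. This is a minimal illustration using the textbook formulas (natural-log Shannon index and the bias-corrected Chao1 estimator); it is not taken from the RDP pipeline code, and the function names and the toy counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over OTUs with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1*(F1-1) / (2*(F2+1)).

    F1 and F2 are the numbers of OTUs seen exactly once and exactly twice.
    """
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical OTU table for one sample: reads per cluster.
otu_counts = [10, 5, 2, 2, 1, 1, 1]
print("Shannon:", round(shannon_index(otu_counts), 3))
print("Chao1:", chao1(otu_counts))
```

Both functions take only the abundance vector, so the same cluster output can feed either estimate.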
SPADE
PAST (PAlaeontological STatistics)
R
Cluster Based Method
[Figure: cluster-based comparison of the Pigeon Pea and Bare Fallow samples. Axis and panel labels: Position by cluster order (thousands), Similarity, Total abundance. Cluster classes: Species, Genus, Family, Novel. Taxa in legend: TM7, Clostridia, Unclassified Bacteria, Actinobacteria, Bacteroidetes, Acidobacteria, Unclassified Proteobacteria, Deltaproteobacteria, Gammaproteobacteria, Verrucomicrobia, Bacilli, Planctomycetes, Gemmatimonadetes, Unclassified Firmicutes, Betaproteobacteria, Alphaproteobacteria.]
Pipeline Performance
Processing Time
– 52 samples, 350, FLX reads
– Classifier: ~2 CPU hrs.
– Aligner: ~12 CPU hrs.
– Clustering: ~2 CPU hrs. (depends on sample sizes)
– SeqMatch: ~23 CPU hrs.
Usage Stats
380 users since June 2008
April 2009 stats:
– 182 initial process jobs
– 1243 cluster jobs
– 832 alignment jobs
– >11 million sequences aligned
RDP Pyro tools distributed to several major institutions
Analysis of 16S Variable Regions: Important features
[Figure: rRNA Gene Regions Processed by the RDP Pyrosequencing Pipeline. Regions v1v2, v3, v4 and v6 marked along the 16S rRNA gene (5' to 3'); y-axis: % of Sequence Covering Position.]
Variable Region Statistics (table shown on the V6, V3, V4 and V1-V2 slides)

                      V6     V3     V4     V1-V2
% Aligned             78%    89%    99%    84%
Identical Species     49%    36%    37%    6%
Different Operons     10%    12%    7%     32%

Rows whose values did not survive extraction: Canonical Positions, Conserved Base Pairs %, Size Range, % Missing Pairs (x10^-4), % Half Pairs (x10^-4), % Paired.

Statistics from 300,000 Sanger Sequences (RDP release 10.11)
Secondary-structure figures from
Identical Species: Chance a Species’ Sequence is Identical to at Least One Other Species. Based on 6,841 bacterial species type strain sequences. Strain information from “The Living Tree Project” projects/living-tree/
Different Operons: Chance Two Operons Differ in One Organism. Based on 561 completed genome sequences with two or more rRNA operons.
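The "Identical Species" statistic (chance a species' sequence is identical to at least one other species) can be illustrated with a small sketch. It assumes one region sequence per species, already trimmed to the same variable region; the function name and the toy sequences are hypothetical, and this is not the procedure used to produce the numbers on the slide.

```python
from collections import Counter


def fraction_identical(region_by_species):
    """Fraction of species whose region sequence is shared with another species.

    region_by_species maps a species name to its sequence for one variable
    region (assumption: exactly one representative sequence per species).
    """
    seq_counts = Counter(region_by_species.values())
    shared = sum(1 for seq in region_by_species.values() if seq_counts[seq] > 1)
    return shared / len(region_by_species)


# Hypothetical toy data: sp_A and sp_B cannot be told apart in this region.
toy_region = {
    "sp_A": "ACGT",
    "sp_B": "ACGT",
    "sp_C": "ACGA",
    "sp_D": "TTGA",
}
print(fraction_identical(toy_region))
```

Running the same function on each region's sequences gives one value per column of the table.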
Quality of Recovered Structure: V4, Sanger vs. FLX
Avg. Size: 207
% Missing Pairs: 0.3x x10 -4
% Half Pairs: 8x x10 -2
% Paired: 62
% Aligned: 99%
Introduction to the Short Read Archive (SRA) myRDP SRA Prepkit
SRA Submission Format
Six Different SRA Document Types: Submission, Study, Experiment, Analysis, Run, Sample
[Diagram: 1-to-N relationships between the document types]
myRDP SRA Prepkit
[Diagram labels: SEQUENCING PROJECT, SEQUENCE READS, METADATA, myRDP SRA PREPKIT, XML DOCUMENTS, myRDP SWS, SUBMIT, NCBI-SRA, EMBL-ERA]
Sample Attributes Prefilled: Genomic Standards Consortium MIMS (Minimum Information about a Metagenome Sequence)*
*Nature Biotechnology 26, (2008)
Functional Genes
FGPR Home Page Screenshot
FGPR Screenshots
[Screenshot callouts: seed sequences, active links to GenBank records, organism name, display/filter options, custom analysis]
Functional Gene Pipeline/Repository Sequence Analysis
[Screenshot callouts: interactive commands, sub-selection for further analysis, dynamic tree applet]
Functional Gene Processing
1) Remove Frameshifts
   1) tBLASTX
   2) GeneWise
2) Translate and align sequences
   1) HMMER
   2) MUSCLE
3) Determine conserved residues
   1) Entropy plot
4) Compare to reference sequences
   1) Determine functional subclass
Entropy (Dioxygenase Genes)
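An entropy plot of the kind used to find conserved residues can be sketched as per-column Shannon entropy over an aligned set of protein sequences. Low entropy marks a conserved position. This is a minimal sketch, not the pipeline's implementation; the function name, the gap handling (gaps are ignored) and the toy alignment are assumptions.

```python
import math
from collections import Counter


def column_entropy(aligned_seqs, gap="-"):
    """Shannon entropy in bits for each alignment column, ignoring gaps."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for col in range(length):
        residues = [s[col] for s in aligned_seqs if s[col] != gap]
        total = len(residues)
        if total == 0:
            # Column made only of gaps: report zero entropy.
            entropies.append(0.0)
            continue
        counts = Counter(residues)
        h = -sum((n / total) * math.log2(n / total) for n in counts.values())
        entropies.append(h)
    return entropies


# Hypothetical four-sequence alignment: column 0 is fully conserved.
toy_alignment = ["HDA", "HEA", "HDG", "HE-"]
for position, h in enumerate(column_entropy(toy_alignment)):
    print(position, round(h, 3))
```

Plotting the returned list against column position gives the entropy profile; conserved residues appear as dips toward zero.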
Taxomatic: Interactive Taxonomy Explorer
– Interactive distance matrix display
– Couples matrix with taxonomy information
– Allows rapid detection of taxonomic inconsistencies
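Coupling a distance matrix with taxonomy labels to spot inconsistencies can be sketched in a few lines. The sketch below uses an uncorrected p-distance on aligned sequences and flags any sequence whose nearest neighbor carries a different taxon label. Taxomatic itself is an interactive viewer; this nearest-neighbor check, the function names and the toy data are hypothetical stand-ins for illustration only.

```python
def p_distance(a, b):
    """Fraction of differing positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_neighbor_conflicts(seqs, taxa):
    """Return (id, nearest_id) pairs where the nearest sequence has another label."""
    ids = list(seqs)
    conflicts = []
    for i in ids:
        nearest = min(
            (j for j in ids if j != i),
            key=lambda j: p_distance(seqs[i], seqs[j]),
        )
        if taxa[nearest] != taxa[i]:
            conflicts.append((i, nearest))
    return conflicts


# Hypothetical data: s4 is labeled GenusX but sits next to s3 (GenusY).
toy_seqs = {
    "s1": "AAAAAAAA",
    "s2": "AAAAAAAT",
    "s3": "TTTTTTTT",
    "s4": "TTTTTTTA",
}
toy_taxa = {"s1": "GenusX", "s2": "GenusX", "s3": "GenusY", "s4": "GenusX"}
print(nearest_neighbor_conflicts(toy_seqs, toy_taxa))
```

A flagged pair only says the label and the distances disagree; deciding which label is wrong still needs the kind of visual inspection the tool provides.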
Taxomatic: Interactive Taxonomy Explorer (Integrated overlays)
Zoom and pan
Can zoom down to individual sequences
MEGAN: Taxonomic analysis through metagenomic data
MEGAN
– Modified k-nn LCA taxonomic classifier
– Requires BLAST result file
– Extracts taxonomy, COGs from matches
– Features from NCBI Prokaryotic Attributes Table
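The lowest-common-ancestor (LCA) idea behind this kind of classifier can be sketched as follows: keep the BLAST hits that score close to the best hit, then assign the read to the deepest taxon shared by all kept hits. This is a simplified illustration, not MEGAN's code. The parameter `top_fraction`, the function names and the toy hits are assumptions, and parsing the BLAST result file is left out.

```python
def lowest_common_ancestor(lineages):
    """Longest shared prefix of root-to-leaf lineages (lists of taxon names)."""
    if not lineages:
        return []
    lca = []
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            lca.append(ranks[0])
        else:
            break
    return lca


def assign_read(hits, top_fraction=0.9):
    """Assign a read from (bit_score, lineage) hits.

    Hits scoring at least top_fraction of the best bit score are kept
    (hypothetical threshold), and the read is placed at their LCA.
    """
    if not hits:
        return []
    best = max(score for score, _ in hits)
    kept = [lineage for score, lineage in hits if score >= top_fraction * best]
    return lowest_common_ancestor(kept)


# Hypothetical hits for one read; the weak Bacillus hit is filtered out.
toy_hits = [
    (200, ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"]),
    (195, ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"]),
    (90, ["Bacteria", "Firmicutes", "Bacilli", "Bacillus"]),
]
print(assign_read(toy_hits))
```

Loosening the threshold pulls more distant hits into the set and pushes the assignment toward the root, which is the main trade-off in this approach.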
MEGAN Screenshot1
MEGAN Screenshot2
MEGAN Screenshot3
Metagenomics Analysis Pipelines: Sequence Comparison
General Considerations
What databases are used?
– GenBank nr (not good)
– Pfam, TIGRfam, FIGfam?
What search strategy is used?
– BLAST, HMMER, Additional tools?
Will they process my data?
– Will my data become public?
HMMER vs BLAST
BMC Genomics Aziz
The SEED & RAST
Subsystems: Pathway database
– Expert annotation
– Curated simultaneously across many genomes
FIGfams: Database of protein families
– Derived from Subsystems database
– Controlled addition of new family members
RAST: Genome annotation system
– Uses FIGfams for gene annotation
– Uses Subsystems for pathway annotation
The SEED & RAST
BMC RAST Fig. 2
BMC RAST Fig. 4
JGI’S IMG/M HOME
CAMERA HOME
CAMERA DASHBOARD
CAMERA PROJECT SAMPLES
Metadata: Data about data
Metadata Standards
– Minimum Information about a Microarray Experiment (MIAME)
– Minimum Information about a genome sequence (MIGS)
– Minimum Information about a metagenome sequence (MIMS)
Nature Biotechnology 26, (2008)
Box 1 Minimum Information about a Genome Sequence (MIGS): Habitat Specific Attributes
MIMS extension: select to report a set of uniform measurements for a given habitat:
Water body: (temperature, pH, salinity, pressure, chlorophyll, conductivity, light intensity, dissolved organic carbon (DOC), current, atmospheric data, density, alkalinity, dissolved oxygen, particulate organic carbon (POC), phosphate, nitrate, sulfates, sulfides, primary production) (integer, unit)
Soil Metadata Survey
To help establish a set of suggested attributes for soil sequence data
In cooperation with:
- The Genomic Standards Consortium
- The International Soils Metagenome Sequencing Consortium (Terragenome)
Soil Metadata Survey Summary
[Figure: survey attributes plotted by Importance (Not to Very) against Difficulty to obtain (Easy to Hard)]
VERY IMPORTANT / EASY TO OBTAIN
Chemical: pH (in water or calcium chloride)
Biological: plant cover (native)
Soil/Geological: horizon
Geographical: latitude and longitude, elevation
Management: land use (e.g., urban, agriculture, forestry), tillage (type), crops (current, rotation), fertilizers (type and annual amount)
Climate: mean and seasonal rainfall, mean and seasonal temperatures
Sampling: depth, composite design, moisture content at sampling, area represented by composite sample, weight of sample used for DNA extraction
Technology Issues: Limitations of Pyrosequencing
Gomez-Alvarez ISME Article
Gomez-Alvarez Fig. 1
Figure 1 (a) Alignment of five sequences in a cluster demonstrates the types of sequencing errors and length variation (highlighted in gray) included in a cluster. (b) Number of reads in a cluster versus the cluster number, ordered from the largest to smallest sized cluster; both axes are plotted on a log10 scale. (c) The best BLAST match and COG affiliation for four of the most abundant clusters in replicate soil metagenomes. (d) Distribution of exact duplicate and all replicate reads in a metagenomic dataset from soil (this study) and seawater metagenomes (Frias-Lopez et al., 2008; Mou et al., 2008). *Rep, technical replicates; +Sp, biological replicates. The number of reads in each category is presented in Table 1.
Gomez-Alvarez Table 1 (left)
Gomez-Alvarez, V., Teal, T.K., Schmidt, T.M. (July 2009) Systematic artifacts in metagenomes from complex microbial communities. ISME Journal advance online publication. doi: /ismej
Table 1 Total numbers of reads, exact duplicates and all replicate sequences, including duplicates, from representative metagenomic data sets
Table columns: Habitat (metagenome), Number of reads
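Counting exact duplicates and replicate reads, the two categories reported in Table 1, can be sketched as below. Exact duplicates are identical reads. For replicates, the sketch groups reads that share the same leading bases, which is only a crude proxy for reads that start at the same position and differ in length or by sequencing errors; it is not the clustering procedure used in the article. The function names, `prefix_len` and the toy reads are hypothetical.

```python
from collections import Counter


def count_exact_duplicates(reads):
    """Number of reads that are extra copies of an identical read."""
    counts = Counter(reads)
    return sum(n - 1 for n in counts.values())


def count_prefix_replicates(reads, prefix_len=20):
    """Number of extra reads sharing the same first prefix_len bases.

    Crude stand-in for replicate detection: reads that start identically
    but differ in length are counted as replicates of one another.
    """
    counts = Counter(read[:prefix_len] for read in reads)
    return sum(n - 1 for n in counts.values())


# Hypothetical reads: two identical, one longer read with the same start.
toy_reads = ["ACGTACGTAA", "ACGTACGTAA", "ACGTACGTAAGG", "TTTTCCCCGG"]
print(count_exact_duplicates(toy_reads))
print(count_prefix_replicates(toy_reads, prefix_len=8))
```

Exact duplicates are always a subset of the prefix-based replicates, matching the "all replicate sequences, including duplicates" wording of the table.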
PyroNoise Article
Pyro Fig. 1
Figure 1 | OTU number as a function of percentage sequence difference for 90 pyrosequenced 16S rRNA gene clones of known sequence. (a,b) Results are repeated for complete linkage (a) and average linkage algorithms (b).
Pyro Fig. 2
Figure 2 | Proportion of sequences assigned to the correct OTU as a function of percentage sequence difference for pyrosequenced 16S rRNA gene clones of known sequence. (a,b) Results are repeated for complete linkage (a) and average linkage algorithms (b).
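How the OTU count depends on the linkage rule and the distance cutoff can be sketched with a small agglomerative clustering routine. This is a plain, unoptimized illustration of complete versus average linkage on a precomputed distance matrix; it is not the PyroNoise algorithm, and the function name and the toy matrix are hypothetical.

```python
def otu_count(dist, cutoff, linkage="complete"):
    """Count clusters after merging until the closest pair exceeds cutoff.

    dist is a symmetric matrix of pairwise sequence differences (fractions).
    linkage is "complete" (largest pairwise distance between clusters) or
    "average" (mean pairwise distance between clusters).
    """
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(a, b):
        pairs = [dist[i][j] for i in a for j in b]
        if linkage == "complete":
            return max(pairs)
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        d, x, y = min(
            (cluster_distance(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > cutoff:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return len(clusters)


# Hypothetical distances: sequences 0-2 form a chain, sequence 3 is distant.
toy_dist = [
    [0.00, 0.02, 0.04, 0.20],
    [0.02, 0.00, 0.02, 0.20],
    [0.04, 0.02, 0.00, 0.20],
    [0.20, 0.20, 0.20, 0.00],
]
print(otu_count(toy_dist, 0.035, "complete"))
print(otu_count(toy_dist, 0.035, "average"))
```

At the same cutoff the two rules can return different OTU numbers, because complete linkage refuses a merge as soon as any single pair of sequences is too far apart.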
Pyro Table 1
Quince, C., Lanzén, A., Curtis, T.P., Davenport, R.J., Hall, N., Head, I.M., Read, L.F., and Sloan, W.T. (2009) Accurate determination of microbial diversity from 454 pyrosequencing data. Nature Methods Advanced Online Publication Aug doi: /NMETH.1361