Metagenomics
What is metagenomics Cloning genes from the environment, screening for function 16S sequencing Random community genomics Eukaryotic metagenomics
Screening from the environment Random fragments of DNA Clone into a vector Low copy vectors BACs YACs
BACs Science Creative Quarterly
Screening from the environment Random fragments of DNA Clone into a vector Low copy vectors BACs YACs Screen for a phenotype e.g. Diversa patents > 1,000 amylase genes Why did Diversa sequence whale-falls?
Screening from the environment Expression host? Pathway or single gene? Get what you select But remember … A selection is worth a thousand screens
16S sequencing Catalogs the bacteria that are present PCR amplify the 16S gene with standard primers Sequence the primers Compare to known databases
Ribosomes Ribosomes are made of proteins and RNA Prokaryotic ribosome: Large subunit: 50S 5S and 23S rRNA Small subunit: 30S 16S rRNA
30S Thermus aquaticus subunit Blue: protein Orange: rRNA
16S rRNA secondary structure E. coli 16S rRNA secondary structure Highly conserved Base pairs = stems No pairing = loops
16S rRNA secondary structure E. coli 16S rRNA secondary structure V7 V6 V5 V8 V4 V9 V3 V1 Variable regions in the 16S rRNA. Vn – 9 regions forward/rev primers V2
16S Primers 27F – 1492R full length 1,465 base pairs 967F – 1046R V6 region 1380F – 1510R V9 region 1,465 base pairs 79 base pairs 130 base pairs
Variable regions = Variable results! V1-V3 V3-V5 V6-V9
16S databases Greengenes SILVA – ARB VAMPS http://greengenes.lbl.gov/ Gary Andersen, Lawrence Berkeley National Laboratory SILVA – ARB http://www.arb-silva.de/ Frank Oliver Glöckner, MPI, Bremen, Germany VAMPS http://vamps.mbl.edu/ Mitch Sogin, Woods Hole, USA Ribosomal Database Project (RDP) http://rdp.cme.msu.edu/ James Cole, Michigan State University, USA
16S sequencing Cheap Easy Portable PCR bias Variable regions give variable answers Only tells you which organisms are present & abundance Does not explain much of the variance of the data What does 16S sequencing actually tell you?
What does 16S sequencing tell you?
What does 16S sequencing tell you?
What is metagenomics Cloning genes from the environment, screening for function 16S sequencing Random community genomics Eukaryotic metagenomics
16S sequencing is not good for functions
How much of the data? Topography of [fungi and] bacteria on the skin Study = 5,000 taxa 14 skin sites 10 people 3 skin types 5,000 variables The remainder of the variance (85.1%) is explained by a few taxa each Each dimension only adds marginal information Findley et al, Nature 2013 doi: 10.1038/nature12171 They don't explain the meaning of j-q
How much of the data? Nine biomes paper Variance: 1,040,665 reads total (from 45 samples) 30 subsystems 9 biomes 30 variables Fewer of the variables explain more of the data The variables are distinctive for each environment Dinsdale et al., Nature 2008 doi:10.1038/nature06810
Shotgun sequencing (HiSeq) Movies courtesy Will Trimble, Argonne National Labs http://www.mcs.anl.gov/~trimble/flowcell/
16S sequencing (MiSeq) Movies courtesy Will Trimble, Argonne National Labs http://www.mcs.anl.gov/~trimble/flowcell/
Shotgun + 16S (HiSeq) Movies courtesy Will Trimble, Argonne National Labs http://www.mcs.anl.gov/~trimble/flowcell/
There is no 16S for viruses Rohwer and Edwards, 2002. The phage proteomic tree. doi: 10.1128/JB.184.16.4529-4535.2002
Random community genomics 26262626 200 liters water 5-500 g fresh fecal matter Concentrate and purify viruses or bacteria Extract nucleic acids DNA/RNA LASL Epifluorescent Microscopy Sequence
How do you sequence the environment? 27272727 How do you sequence the environment? Extract DNA Soil extraction kit Water extraction kit Create library LASLs fosmids Sequence fragments
Linker-Amplified Shotgun Libraries (LASLs) 28282828 Soil Extraction Kit This method produces high coverage libraries of over 1 million clones from as little as 1 ng DNA David Mead - Breitbart (2002) PNAS
Early Attempts at a Metagenomics Platform http://phage.sdsu.edu/~rob/cgi-bin/remoteblast.cgi Submit BLAST to local and remote databases Local (as fast as possible) NCBI (one search every 3 seconds) Many concurrent searches One search versus 1,000 searches Parse data into tables for Excel Access to taxonomy etc
Human-associated viruses 30303030 Human-associated viruses More bacteria than somatic cells by at least an order of magnitude More phages than bacteria by an order of magnitude Sample the bacteria in the intestine by sampling their phage
Most Viral DNA Sequences in Adult Human Feces are Unknown Phages 31313131 Most Viral DNA Sequences in Adult Human Feces are Unknown Phages Eukaryotic Viruses 6% Known 40% Unknown 60% Phages 94% Breitbart (2003) J. Bacteriol.
Abundance of viruses in twins Reyes et al, Nature 2010
Abundance of viruses in twins Microbial samples in guts don't change very much Reyes et al, Nature 2010
Abundance of viruses in twins Phage samples in guts change a lot Reyes et al, Nature 2010
Abundance of viruses in twins Microbial Phage Reyes et al, Nature 2010
Most Human RNA Viruses are Known 36363636 Other Plant Viruses 9% Unknown 8% Other 26% Pepper Mild Mottle Virus 65% Known 92% Zhang (2006) PLoS Biology
Pepper Mild Mottle Virus (PMMV) 37373737 ssRNA virus; ≈6 kb genome Related to Tobacco Mosaic Virus Infects members of Capsicum family Widely distributed – spread through seeds Fruits are small, malformed, mottled Rod-shaped virions Viral particles in fecal sample TOBACCO MOSAIC VIRUS http://www.rothamsted.bbsrc.ac.uk/ppi/links/pplinks/virusems/
PMMV is common in Human Feces 38383838 Fecal samples Extract total RNA RT-PCR for PMMV S1 S2 S3 S4 S5 S6 S7 S8 S9 PMMV Add san diego samples San Diego : 78% people are positive Singapore : 67% people are positive 10-50 fold increase in feces compared to food 106-109 PMMV copies per gram dry weight of feces
Which Foods Contain PMMV? 39393939 Chili powder Chili sauces NOT FOUND IN FRESH PEPPERS Hong Kong chili sauce Hong Kong green chili Pork noodle red chili Vegetarian chili Indian curry Chicken rice Chinese food
PMMV is Present at High Concentrations in Raw Sewage and Treated Wastewater PMMoV was detected in 100% of raw sewage samples collected among the dif states and in most of the treated effluent except for FLK Note dif efficiency in removing PMMoV below detection levels of the assayn (100). Except for FLK, all effluent values were > 10^4. According to these results PMMoV cannot be used as an indicator of untreated Concentrations of PMMoV are >1 million copies per milliliter raw sewage ated wastewater or raw sewage… Rosario et al. AEM (2009) 40
Different PMMV families Lib3 Contig[0064] Lib3 Contig[0064] Lib2 Contig[0070] Lib2 Contig[0070] AB084456.1 AB084456.1 AB062049.1 AB062049.1 I I AB062051.1 AB062051.1 AF103778 AF103778 AY632863.1 AY632863.1 AB119482.1 AB119482.1 AJ429088.1 AJ429088.1 AB062054.1 AB062054.1 CoatProtein CoatProtein AB069853.1 AB069853.1 AB062052.1 AB062052.1 II II Diverse populations Differences between individuals and over time AB000709.2 AB000709.2 M87827.1 M87827.1 AJ429087.1 AJ429087.1 AF525080.1 AF525080.1 Lib2_2217 Lib2_2217 Lib3_Contig[0494] Lib3_Contig[0494] Lib3_Contig[1213] Lib3_Contig[1213] Lib2_Contig[0458] Lib2_Contig[0458] Lib2_Contig[1099] Lib2_Contig[1099] Lib3_65 Lib3_65 Lib3_Contig[0273] Lib3_Contig[0273] Lib3_Contig[0078] Lib3_Contig[0078] Lib3_Contig[0863] Lib3_Contig[0863] AJ308228.1 AJ308228.1 III III AB062053.1 AB062053.1 Same person 6 months apart AJ429089.1 AJ429089.1 X72587.1 X72587.1 Lib2_1377 Lib2_1377 Library 1 Library 2 Library 3 Lib2_2914 Lib2_2914 Lib1_2299 Lib1_2299 Lib3_928 Lib3_928 Lib2_1656 Lib2_1656 Lib2_2549 Lib2_2549 Lib3_462 Lib3_462 Lib2_492 Lib2_492 Lib3_Contig[0655] Lib3_Contig[0655] Lib2_133 Lib2_133 Lib1_Contig[0253] Lib1_Contig[0253] No colors:GenBank It’s a diverse group of PMMoV strains w/in individual and between individuals Lib1_Contig[0123] Lib1_Contig[0123] Lib1_Contig[0279] Lib1_Contig[0279] Lib1_Contig[0107 Lib1_Contig[0107 IV IV Lib1_Contig[0052 Lib1_Contig[0052 ] ] ] Lib1_Contig[0004 ] Lib1_Contig[0004 ] ] Lib2_Contig[0995] Lib2_Contig[0995] Lib1_Contig[0009] Lib1_Contig[0009] Lib1_Contig[0166] Lib1_Contig[0166] Lib1_Contig[0657] Lib1_Contig[0657] Lib1_1449 Lib1_1449 Lib1_2211 Lib1_2211 Lib1_Contig[0029] Lib1_Contig[0029] Lib1_1733 Lib1_1733 Lib1_Contig[0076] Lib1_Contig[0076] Lib1_1168 Lib1_1168 Lib1_Contig[0261] Lib1_Contig[0261] Lib1_2361 Lib1_2361 Lib2 1468 Lib2 1468 Lib2 Contig[0031] Lib2 Contig[0031] Lib2 Contig[1202] Lib2 Contig[1202] V V Lib1_Contig[0005] Lib1_Contig[0005] Lib1_Contig[0558] Lib1_Contig[0558] AF103776.1 AF103776.1 AB062050.1 AB062050.1 0.1 0.1 41
Human-fecal borne PMMV can infect plants Spread of infection to Hungarian wax pepper evident within 1 week Infected leaf was positive by RT-PCR for PMMV Animals may serve as vectors for plant viruses Fecal sample Viral concentrate Plant leaf inoculation Total RNA PMMV RT-PCR Leave pic!!! Still infectious Infected leaf Control 42
Koch’s Postulates 43434343 Thesunmachine.net http://www.sweatnspice.com
Random community genomics
Eukaryotic metagenomics ITS sequences Internal transcribed spacer regions http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4113289/ Individual genes Cox1 Exome sequencing Pull out ESTs and sequence
Why Metagenomics? What is there? How many are there? 46464646 Why Metagenomics? What is there? How many are there? What are they doing? Experimental manipulations Diagnostics
Sequencing costs decreasing http://genome.gov/sequencingcosts
How much has been sequenced? 48484848 100 bacterial genomes Environmental sequencing Number of known sequences First bacterial genome 1,000 bacterial genomes Year
How much will be sequenced? 49494949 Everybody in USA Everybody in San Diego One genome from every species 100 people All cultured Bacteria Most major microbial environments Training people for the future X-Prize competition, sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 per genome Year
Most pipelines work the same way!
Metagenomics Processing Merge paired-end reads Preprocessing Functional Assignments Taxonomic assignments Contamination removal Gene Prediction Contig Clustering Binning reads
Metagenomics Quality control – Prinseq Statistics Deconseq Annotation FOCUS Real time metagenomics mg-rast Super FOCUS Statistics STAMP Population genomes crAss metabat ContigClustering
Metagenomics Processing AbundanceBin CompostBin concoct crAss tetra Contig clustering FASTQC FastX Toolkit fitGCP NGS QC Toolkit Non-pareil Prinseq QC-Chain Streaming Trim Preprocessing FragGeneScan GlimmerMG MetaGeneAnnotator MetaGeneMark MetaGun Orphelia Prodigal Gene Prediction CARMA myTaxa FOCUS PhylopythiaS KRAKEN phymmbl LMAT RAIphy MEGAN TACOA Metaplan Taxy Taxonomic assignment CLAMS Sequedex DiScRIBinATE SORT-ITEMS genometa SPANNER GSMer SPHINX PPLACER TaxSOM RTMg Treephyler Functional assignment