Accurate estimation of microbial communities using 16S tags Julien Tremblay, PhD
16S rRNA as phylogenetic marker gene. Escherichia coli 16S rRNA primary and secondary structure. 70S ribosome subunits: 30S (21 proteins, 16S rRNA) and 50S (34 proteins, 5S rRNA, 23S rRNA). 16S rRNA is highly conserved between different species of bacteria and archaea. Falk Warnecke
16S rRNA in environmental microbiology (Sanger clone libraries) Falk Warnecke bp length
Next generation sequencing (NGS) Illumina M 150bp reads/lane $ 0.5M 450bp reads $$ Read length Throughput
Game plan to survey microbial diversity. 16S rRNA variable regions: V1, V2, V3, V4, V5, V6, V7, V8, V9. Generate amplicons of a given variable region from a bacterial community (many millions of sequences). Reduce the dataset by dereplication/clustering (cluster sizes: x 10,000, x 800, x 1,200, x 200, x 2,000, x 1,000, x 1, x 10). Identification (BLAST, RDP classifier). Amplicon tags = deeper, cheaper, faster.
Rare biosphere. High sequencing depth of NGS reveals “rare” OTUs. [Figure: rank-abundance plot; labels: High abundance, Low abundance, Rare biosphere.] Sequencing error? Chimeras? Background noise? Relatively small size of amplicons.
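A rank-abundance table like the one plotted on this slide can be sketched as follows. This is a minimal illustration, not the method used in the talk: the OTU names are hypothetical, the counts reuse the example cluster sizes from the game-plan slide, and the 0.1% cutoff for "rare" is an assumed value chosen only for the example.

```python
def rank_abundance(otu_counts):
    """Return (rank, otu_id, count, relative_abundance), most abundant first."""
    total = sum(otu_counts.values())
    # Sort by descending count; break ties by name so the order is stable.
    ranked = sorted(otu_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, otu, n, n / total)
            for rank, (otu, n) in enumerate(ranked, start=1)]


def rare_otus(otu_counts, max_relative_abundance=0.001):
    """List OTUs below an (assumed) relative-abundance cutoff."""
    total = sum(otu_counts.values())
    return sorted(otu for otu, n in otu_counts.items()
                  if n / total < max_relative_abundance)


# Hypothetical OTU names; counts taken from the game-plan slide.
example_counts = {
    "otu_a": 10000, "otu_b": 800, "otu_c": 1200, "otu_d": 200,
    "otu_e": 2000, "otu_f": 1000, "otu_g": 1, "otu_h": 10,
}

if __name__ == "__main__":
    for row in rank_abundance(example_counts):
        print(row)
    print("rare:", rare_otus(example_counts))
```

Whether the OTUs in the low-abundance tail are real organisms or sequencing artifacts is exactly the question raised on the slide; the cutoff only labels them, it does not decide.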
Rare bias sphere? Control experiment: estimate the rare biosphere in a single strain of E. coli. Primer pairs: 27F/342R (V1 & V2) and 1114F/1392R (V8). Is the rare biosphere an artifact of NGS error? Kunin et al. (2009), Environ. Microbiol.: it should not be, if relatively stringent clustering parameters are applied. Subject to controversy: is rare always real? Quince et al. (2009), Nat. Methods.
PyroTagger (for 454 amplicons): unzip, validate; remove low-quality reads; redundancy removal; PyroClust & Uclust; remove chimeras; sample comparison, post-processing. pyrotagger.jgi-psf.org
Classification and barcode separation. Sequences of cluster (OTU) representatives: BLAST vs GreenGenes and Silva databases, dereplicated at 99.5%. Distribution of microbial phyla in the dataset. Also see the QIIME pipeline.
Illumina tags (itags). Typical 454 run: 450,000 – 500,000 reads. “Typical” Illumina run: GAIIx 10,000,000 – 40,000,000 reads/lane; HiSeq ~350,000,000 reads/lane; MiSeq (available soon) ~4,000,000 reads/lane. Move 16S tag sequencing to the Illumina platform. HiSeq = huge output compared to 454 (suitable for big projects; indexes (barcodes)/libraries). MiSeq = moderately high throughput (more suitable?). Higher throughput requires a more efficient clustering algorithm (SeqObs).
Illumina tags (itags). 454 = “1” read; Illumina = “2” reads => have to be assembled. Both reads need to be of good quality. 454: ACGTGGTACTACGTGAT…. ~ bp. Illumina: ACGTGGTACTACGTGATAGTGTAT ~252 bp.
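Assembling the two Illumina reads of a pair can be sketched as an overlap search between read 1 and the reverse complement of read 2. This is a minimal illustration under stated assumptions, not the SeqObs implementation: the function names, `min_overlap` and `max_mismatch_fraction` are hypothetical, and real assemblers also weigh base quality scores, which are ignored here.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_fraction=0.1):
    """Merge a forward read with its reverse-strand mate.

    Tries the longest overlap first and returns the merged sequence,
    or None when no acceptable overlap is found.
    """
    rc2 = reverse_complement(read2)
    for overlap in range(min(len(read1), len(rc2)), min_overlap - 1, -1):
        tail, head = read1[-overlap:], rc2[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_fraction * overlap:
            # Keep read 1 through the overlap, then append the rest of read 2.
            return read1 + rc2[overlap:]
    return None


if __name__ == "__main__":
    # Toy amplicon: the slide's example sequence plus a made-up tail.
    amplicon = "ACGTGGTACTACGTGATAGTGTATCCGATTAGC"
    read1 = amplicon[:22]
    read2 = reverse_complement(amplicon[12:])
    print(merge_pair(read1, read2))
```

Pairs that return `None` would be discarded, which is one reason both reads need to be of good quality.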
itags clustering. Sort by alphabetical order; 100% identity (reduces dataset by 80%); then 97% identity. Edward Kirton, JGI
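The two steps on this slide can be sketched as follows: sorting places identical reads next to each other so they collapse in one pass (100% identity), and the unique sequences are then grouped at 97%. This is a rough sketch under stated assumptions, not the SeqObs algorithm: the function names are hypothetical, identity is computed position by position on equal-length sequences without alignment, and the greedy centroid scheme is a swapped-in simplification.

```python
from itertools import groupby


def dereplicate(reads):
    """Collapse identical reads into (sequence, count), most abundant first."""
    # After sorting, identical sequences are adjacent, so groupby collapses them.
    unique = [(seq, sum(1 for _ in group)) for seq, group in groupby(sorted(reads))]
    unique.sort(key=lambda item: (-item[1], item[0]))
    return unique


def identity(a, b):
    """Fraction of matching positions; 0.0 for unequal or empty sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(unique, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold."""
    clusters = []  # each entry is [centroid_sequence, total_read_count]
    for seq, count in unique:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster[1] += count
                break
        else:
            # No centroid was close enough: this sequence seeds a new cluster.
            clusters.append([seq, count])
    return clusters


if __name__ == "__main__":
    base = "ACGT" * 10
    variant = "ACGT" * 9 + "ACGA"  # one mismatch in 40 bp = 97.5% identity
    reads = [base, variant, base, "T" * 40, base]
    print(greedy_cluster(dereplicate(reads)))
```

Because the unique sequences are processed most abundant first, abundant sequences become centroids and low-count near-duplicates are absorbed into them.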
Number of reads >> number of clusters. [Figure: x-axis: number of reads (millions); annotation: “Clustering happens here!”] Edward Kirton, JGI
Benefits of parallelization. [Figure: processing time (min.) vs. number of reads (millions).] Edward Kirton, JGI
MiSeq validation. Exploratory experiments using 11 wetland samples. Validate reproducibility between runs.
MiSeq validation. Beta diversity (UniFrac distances). [Figure panels: Run 1, Run 2.]
itags. Validating SeqObs output by comparing with PyroTagger results. Samples: Synthetic communities, Termite gut, Surface Sediments, Compost, Sludge. Pipelines: 454 PyroTagger (V8 region); Illumina GAIIx SeqObs pipeline (V4, V5 and V9 regions); Illumina MiSeq SeqObs pipeline (V4 region).
Comparing 454 with Illumina: GAIIx vs 454 region
Comparing 454 with Illumina. The primer pair chosen for a variable region is likely to affect the results. In silico PCR on the 16S Greengenes database.
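In silico PCR can be sketched as a search for the forward primer and for the reverse complement of the reverse primer in each reference sequence. This is a minimal illustration, not the procedure used for the slide: the function names are hypothetical, the primers and template in the usage example are toy sequences rather than real 16S primers, and only exact matches (allowing IUPAC degenerate bases in the primer) are reported.

```python
import re

# IUPAC nucleotide codes expanded into regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer.upper())


def in_silico_pcr(template, forward, reverse):
    """Return the predicted amplicon (primers included), or None."""
    fwd = re.search(primer_regex(forward), template)
    if fwd is None:
        return None
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement downstream of the forward primer site.
    downstream = template[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(reverse)), downstream)
    if rev is None:
        return None
    return template[fwd.start():fwd.end() + rev.end()]


if __name__ == "__main__":
    template = "TTTTACGTACGGGGCCCCTTGACAAAAA"  # toy reference sequence
    print(in_silico_pcr(template, "ACGTRC", "TGTCAA"))
```

Running this over every sequence in a reference database and counting which taxa return an amplicon gives a rough view of how primer choice biases the result.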
itags: confidence level. Read lengths (bp): GAIIx ~110 bp; MiSeq 5’ reads 150 bp; MiSeq assembled reads ~250 bp. E values
Challenges. Short size of amplicon. What filtering parameters to use (stringency level)? Balance between filter stringency and keeping as much data as we can. Whole new dimension for the rare biosphere? Handling large numbers of samples (tens of thousands). Cost of barcoded primers (will need lots of barcodes), handling. Huge amount of samples: statistical models…
Acknowledgments Susannah Tringe Edward Kirton Feng Chen Kanwar Singh Rob Knight lab (Univ. of Colorado) Thanks!
16S rRNA Dangl lab, UNC