Robert Edgar Independent scientist
Data reduction Make tractable for downstream analysis Read dereplication & error-correction Metagenomics Identify protein families de novo Community sequencing: identify OTUs
Challenges USEARCH solutions
Bacterial chromosome 16S gene Primers 16S segments Environmental sample with bacteria PCR Amplified segments Biological sequences Chimeric artifacts formed from ≥2 biological sequences during PCR Reads
Error correction Chimeras Big problem with 16S / 18S / ITS Covered this morning: UCHIME Other PCR errors Sequencer error Bad base calls, indels, homopolymers Cluster at 97% (3% radius) One cluster = one OTU = one species (maybe!)
Bigger dot = more reads 3% Radius 3% = species Centroid, ideally should be most abundant = most likely to be biological. Differs from rep. seq. due to: Sequencing error Biological variation
Which OTU? Ambiguous assignments
Abundant sequences <3% different 2% Arbitrary choice of OTU rep. seq. Outliers create spurious OTU(s)
Full-length 16S gene (~1500nt)
Next-gen reads of hypervariable region (~300nt) Variation greater in short region, may be > 3%.
Variation between populations Healthy Diseased
Bacterial chromosome 16S gene Duplication > 3% diverged Paralogs and segmental duplications Two OTUs for one species
Alignment variation and defining % identity
Program A and Program B can align the same pair of sequences differently:
  GATTACA--
  GAATTAACA    3 diffs or 5 diffs?
  GA-TTA-CA
  GAATTAACA    No diffs or 2 diffs?
Different programs produce different results from the same algorithm and the same input data because alignments and the definition of % identity vary. This can bias validation, e.g. Schloss & Westcott (2011) AEM.
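The ambiguity in the counts above can be reproduced with a small sketch. The function name `count_differences` and its two flags are hypothetical, and "terminal gap" is assumed here to mean a gap column lying outside the first and last columns where both rows carry a letter; real programs may define it differently.

```python
def count_differences(row_a, row_b, count_internal_gaps=True, count_terminal_gaps=True):
    """Count differing columns between two rows of a pairwise alignment.

    Mismatches always count. Gap columns count or not depending on the flags,
    which is one way two programs can disagree on % identity.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # Columns where both rows have a letter; anything outside them is terminal.
    both = [i for i, (a, b) in enumerate(zip(row_a, row_b)) if a != "-" and b != "-"]
    if not both:
        raise ValueError("rows share no aligned letters")
    first, last = both[0], both[-1]
    diffs = 0
    for i, (a, b) in enumerate(zip(row_a, row_b)):
        if a == "-" or b == "-":
            terminal = i < first or i > last
            if terminal and count_terminal_gaps:
                diffs += 1
            elif not terminal and count_internal_gaps:
                diffs += 1
        elif a != b:
            diffs += 1
    return diffs


# First alignment from the slide: 3 mismatches plus 2 terminal gaps.
with_terminal = count_differences("GATTACA--", "GAATTAACA")
without_terminal = count_differences("GATTACA--", "GAATTAACA", count_terminal_gaps=False)
# Second alignment from the slide: no mismatches, 2 internal gaps.
with_internal = count_differences("GA-TTA-CA", "GAATTAACA")
without_internal = count_differences("GA-TTA-CA", "GAATTAACA", count_internal_gaps=False)
```

The same pair of sequences yields 5, 3, 2 or 0 differences depending on the alignment and the counting rule, so a 97% threshold does not mean the same thing in every program.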
Hard to define an OTU or an optimal set of OTUs [Figure: phylogenetic tree of A, B and C with distances 1.5%, 4%, 2.5%]
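A minimal sketch of why a fixed radius does not give a unique OTU. The slide does not say which distance belongs to which pair, so the assignment below (A-B 1.5%, B-C 2.5%, A-C 4%) is an assumption chosen so that the distances add up along a tree; the helper name `members_within` is hypothetical.

```python
# Assumed pairwise distances in percent (the pairing is an assumption, see above).
dist = {
    frozenset("AB"): 1.5,
    frozenset("BC"): 2.5,
    frozenset("AC"): 4.0,
}


def members_within(centroid, names, radius=3.0):
    """Return the sequences that fall inside the radius around a given centroid."""
    return sorted(
        n for n in names
        if n == centroid or dist[frozenset((centroid, n))] <= radius
    )


# The OTU you get depends on which sequence is chosen as centroid.
otus_by_centroid = {c: members_within(c, "ABC") for c in "ABC"}
```

Under these assumed distances, centroid B captures all three sequences, while centroids A and C each exclude the other, so three different OTUs are all consistent with a 3% radius.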
Hard to define an OTU or an optimal set of OTUs [Figure: tree of A, B and C] Optimal OTUs per Schloss & Westcott’s MCC measure can be non-monophyletic.
OTUs are hacks Do not exist in nature Cannot be defined and validated robustly But can still be useful!
One program, one binary Suite of high-throughput algorithms Search, clustering, dereplication, chimera detection… Orders of magnitude faster than BLAST Free for academic use (32-bit)
Sort sequences Greedy list removal
Clusters Database Input sequences In RAM for fast access. Cluster assignments written sequentially to file, not stored in RAM. Typical state: one database sequence per cluster (centroid).
Clusters Database Input sequences Initial state: empty database = no clusters. Input sequences processed in file order.
Database USEARCH Clusters Next input sequence searched against database. USEARCH algorithm: very fast database search (>>BLAST). Input sequences
Clusters Hit: input sequence assigned to cluster & discarded. Database Hit Input sequences Record written to output file(s). Optional: alignment, other info.
Database No hit Clusters Input sequences No hit: query added to database, becomes centroid of new cluster.
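The hit / no-hit loop described on the preceding slides can be sketched as follows. This is not the USEARCH implementation: the database search is replaced by a linear scan, identity is assumed to be 1 minus edit distance over the longer length, and the first centroid that passes the threshold is taken as the hit. The names `edit_distance`, `identity` and `greedy_cluster` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming (stand-in for a real aligner)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def identity(a, b):
    """Assumed identity definition: 1 - edit distance / longer length."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def greedy_cluster(seqs, threshold=0.97):
    """Process sequences in input order; return centroids and assignment records."""
    centroids = []    # the database: one sequence per cluster
    assignments = []  # (input index, cluster index), written in input order
    for idx, seq in enumerate(seqs):
        hit = None
        for c_idx, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                hit = c_idx  # hit: assign to this cluster
                break
        if hit is None:
            centroids.append(seq)  # no hit: query becomes a new centroid
            hit = len(centroids) - 1
        assignments.append((idx, hit))
    return centroids, assignments


# Toy input with a loose threshold so the short sequences can cluster.
toy = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "AAAAAAAAAA"]
toy_centroids, toy_assignments = greedy_cluster(toy, threshold=0.85)
```

Because each centroid is simply the first member encountered, reordering the input can change both the centroids and the cluster boundaries.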
Very fast Input order matters Centroid is always first member found How to sort?
Longest sequences are typically outliers and tend to split OTUs.
  Centroid: CENTROID ‑‑‑‑‑‑‑
  Seq1:     CENTROIDINSERTED
  Seq2:     CENTROIDTERMINAL
If you don’t sort by length, fragments can become centroids and member sequences may have many differences.
Most abundant sequence is likely to be biological & a good choice of centroid
If read errors are rare: Abundance = size of dereplication cluster If read errors are common: Have a circular problem: Abundance needs clustering, but clustering needs abundance.
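For the rare-error case, abundance by dereplication can be sketched in a few lines. This assumes exact full-length matching of read strings; the function name `dereplicate` is hypothetical, and ties are broken alphabetically only to make the output deterministic.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads and return (sequence, abundance), most abundant first."""
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


toy_reads = ["ACGT", "ACGT", "ACGA", "ACGT", "ACGA", "TTTT"]
unique_by_abundance = dereplicate(toy_reads)
```

When read errors are common, most reads are unique and these counts stop reflecting amplicon abundance, which is the circular problem stated above.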
  Biological sequence  GATGACGTCA AGTCATAGG
  Read 1               GATTACGTCA-AGTCAAAGG
  Read 2               GATGACGACA-AGTCATAG-
  Read 3               GGTGACGTCAAAG-CATAGG
  Consensus            GATGACGTCA AGTCATAGG
Calculate consensus sequence. UCLUST can do this for each cluster.
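A minimal sketch of a column-wise majority consensus, applied to the three reads on the slide. It assumes the reads are already aligned to equal length with "-" for gaps, and that columns where the majority symbol is a gap are dropped; how UCLUST actually computes its consensus is not shown in the slides. The function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(aligned_rows):
    """Majority vote per alignment column; gap-majority columns are omitted."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    out = []
    for column in zip(*aligned_rows):
        # Ties are broken by first occurrence in the column (arbitrary choice).
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != "-":
            out.append(symbol)
    return "".join(out)


slide_reads = [
    "GATTACGTCA-AGTCAAAGG",  # Read 1
    "GATGACGACA-AGTCATAG-",  # Read 2
    "GGTGACGTCAAAG-CATAGG",  # Read 3
]
slide_consensus = consensus(slide_reads)
```

Each read carries two or three errors, but no error is shared by two reads, so the vote recovers the biological sequence given on the slide.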
Dereplicate: sort by length & run UCLUST Longest sequences are centroids in first round. Tend to be outliers & split a natural OTU.
Find consensus sequences Consensus sequences converge on most abundant sequence in cluster, most likely to be a correct amplicon sequence. Common for two clusters to converge on same consensus sequence: merges an OTU that was split in first round.
Before taking consensus… …after.
Consensus sequences ≈ denoised amplicons Amplicon abundance ≈ cluster size Circular problem solved. Filter chimeras Abundances needed by de novo UCHIME as well
Sort by abundance Run UCLUST at 97% Centroid is final OTU.
Assign reads to OTUs: USEARCH at 97%. Outliers need special treatment: can be assigned to closest OTU, or reclustered at 97%. Most reads match an OTU.
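The read-assignment step might look like the sketch below. It is an illustration only: identity is assumed to be the fraction of matching positions between equal-length, pre-aligned reads, each read goes to its closest OTU, and reads below the threshold are collected as outliers for separate handling. The names `hamming_identity` and `assign_reads` are hypothetical.

```python
def hamming_identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def assign_reads(reads, otus, threshold=0.97):
    """Assign each read to its closest OTU centroid, or to the outlier list."""
    assigned = {name: [] for name in otus}
    outliers = []
    for read in reads:
        best_name, best_id = None, 0.0
        for name, centroid in otus.items():
            ident = hamming_identity(read, centroid)
            if ident > best_id:
                best_name, best_id = name, ident
        if best_name is not None and best_id >= threshold:
            assigned[best_name].append(read)
        else:
            outliers.append(read)  # needs special treatment, e.g. reclustering
    return assigned, outliers


# Toy data with a loose threshold so 10-nt sequences can match.
toy_otus = {"OTU1": "A" * 10, "OTU2": "C" * 10}
toy_reads = ["A" * 10, "A" * 9 + "C", "C" * 10, "ACACACACAC"]
toy_assigned, toy_outliers = assign_reads(toy_reads, toy_otus, threshold=0.85)
```

The outlier list is where the two options named above apply: assign each outlier to its closest OTU anyway, or recluster the outliers at 97%.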
Python script, runs multiple USEARCH steps Very fast and highly scalable 10^6 reads in minutes on a laptop Ad hoc, but good biological results Other algorithms are also ad hoc Average linkage “standard” but not justified by theory Does not address read error correction, other challenges
Technical issues Clustering threshold for error correction 97% seems to work well so far But can merge distinct amplicons… …degrades abundance estimate Higher threshold might be better if read errors rare Minimum cluster size threshold Clusters <4 reads discarded after error-correction step Rare species / false-positive trade-off
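The minimum cluster size filter can be sketched as below, assuming cluster sizes are held in a plain dictionary keyed by a cluster label. The function name `discard_small_clusters` and the labels are hypothetical; the default of 4 follows the "<4 reads discarded" rule on the slide.

```python
def discard_small_clusters(cluster_sizes, min_size=4):
    """Keep clusters with at least min_size reads; report how many reads were dropped."""
    kept = {name: size for name, size in cluster_sizes.items() if size >= min_size}
    discarded_reads = sum(size for size in cluster_sizes.values() if size < min_size)
    return kept, discarded_reads


toy_sizes = {"c1": 120, "c2": 4, "c3": 3, "c4": 1}
kept_clusters, dropped_reads = discard_small_clusters(toy_sizes)
```

Raising `min_size` removes more spurious clusters but also more genuinely rare species, which is the trade-off noted above.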
Not like QIIME or mothur Not a complete suite of analysis tools Not "packaged" specifically for 16S Lower-level algorithms Typically used by "pipelines" Multiple steps Typical step is USEARCH command or file conversion Implemented by scripts (bash, perl, Python...).
Task comparison across tools (author): USEARCH (Edgar), QIIME (Knight), mothur (Schloss), Pyronoise (Quince), Perseus (Quince), ESPRIT (Sun)
Tasks: reads to OTUs; filtered reads to OTUs; phylotype; error correction; chimera filter (reference database); chimera filter (de novo); compare populations (UniFrac); diversity (α, β)