Independent scientist

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Reference mapping and variant detection Peter Tsai Bioinformatics Institute, University of Auckland.
BLAST Sequence alignment, E-value & Extreme value distribution.
BOAT - Optimistic Decision Tree Construction Gehrke, J. Ganti V., Ramakrishnan R., Loh, W.
Targeted Data Introduction  Many mapping, alignment and variant calling algorithms  Most of these have been developed for whole genome sequencing and.
Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.
Detection of chimeric sequences from PCR artefacts
Bin Ma, CTO Bioinformatics Solutions Inc. June 5, 2011.
Heuristic alignment algorithms and cost matrices
Similar Sequence Similar Function Charles Yan Spring 2006.
Prediction of Local Structure in Proteins Using a Library of Sequence-Structure Motifs Christopher Bystroff & David Baker Paper presented by: Tal Blum.
Sequence alignment, E-value & Extreme value distribution
BLAST: Basic Local Alignment Search Tool Urmila Kulkarni-Kale Bioinformatics Centre University of Pune.
Evaluating Classifiers
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Considerations for Analyzing Targeted NGS Data Introduction Tim Hague, CTO.
Discussion on Metagenomic Data for ANGUS Course Adina Howe.
Accurate estimation of microbial communities using 16S tags Julien Tremblay, PhD
1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.
Data Analysis 1 Mark Stamp. Topics  Experimental design o Training set, test set, n-fold cross validation, thresholding, imbalance, etc.  Accuracy o.
SISAP’08 – Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding Ahmet Sacan and I. Hakki Toroslu
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Scoring Matrices April 23, 2009 Learning objectives- 1) Last word on Global Alignment 2) Understand how the Smith-Waterman algorithm can be applied to.
Chapter 14: DNA Amplification by Polymerase Chain Reaction.
Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
1 RECONSTRUCTION OF APPLICATION LAYER MESSAGE SEQUENCES BY NETWORK MONITORING Jaspal SubhlokAmitoj Singh University of Houston Houston, TX Fermi National.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
Modelling protein tertiary structure Ram Samudrala University of Washington.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Sequence Alignment.
Accurate estimation of microbial communities using 16S tags
Blast 2.0 Details The Filter Option: –process of hiding regions of (nucleic acid or amino acid) sequence having characteristics.
David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.
Step 3: Tools Database Searching
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Presented by Samuel Chapman. Pyrosequencing-Intro The core idea behind pyrosequencing is that it utilizes the process of complementary DNA extension on.
Discussion on Genomic/Metagenomic Data for ANGUS Course Adina Howe.
Population sequencing using short reads: HIV as a case study Vladimir Jojic et.al. PSB 13: (2008) Presenter: Yong Li.
Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.
Bacterial chromosome 16S rRNA gene   Primers 16S rRNA gene segments PCR Sequencing Sample with bacteria.
Robert Edgar Independent scientist
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Rob Edwards San Diego State University
7. Performance Measurement
Lesson: Sequence processing
Micelle PCR reduces artifact formation in 16S microbiota profiling
Preprocessing Data Rob Schmieder.
Quality Control & Preprocessing of Metagenomic Data
Pre-assembly analyses
Research in Computational Molecular Biology , Vol (2008)
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Welcome to Introduction to Bioinformatics
Design and Analysis of Single-Cell Sequencing Experiments
Ab initio gene prediction
Independent scientist
Independent scientist
Dr Tan Tin Wee Director Bioinformatics Centre
Indexing 1.
Basic Local Alignment Search Tool (BLAST)
Basic Local Alignment Search Tool
Sequence alignment, E-value & Extreme value distribution
Benchmarks of OTU picking tools on artificial communities.
Precision and Recall.
More on Maxent Env. Variable importance:
Presentation transcript:

Independent scientist STAMPS 2017 Robert Edgar Independent scientist robert@drive5.com www.drive5.com Chimeras Chime ras

Chimeras Created during PCR Fragment primes different extension A B A Fragment of A anneals to B and primes synthesis of new strand A B A B Chimeric A+B template, amplified in following rounds of PCR

Chimeras Annealing requires complementary bases Cross-over at conserved, homologous locus Chimeras align well to known sequences Hard to distinguish from biological variants

Chimeras in practice Frequency depends on PCR conditions choice of polymerase, template concentration also on community structure (less so) Typical frequencies 5% of reads 50% of clusters 10%+ of OTUs after filtering by UCHIME Lower freq. possible but unusual "Extreme" mock community (DADA2 paper) Each strain PCR'd separately

Most chimeras are bi Bimera=2 segs, trimera=3... >2 form when parent is chimeric Lahr & Katz (2009) found many 3+ in 700bp amplicons Possibly fakes? Very rare in V4 (250bp) >2 almost always singleton reads which should be discarded before clustering anyway Lahr & Katz (2009) doi 10.2144/000113219

Detection algorithms "Reference" "De-novo" Reference database provided by the user Ideally should be free of chimeras can be a circular problem... "De-novo" Database constructed from sequences in the reads

Chimera detection algorithms û û û û

UCHIME2 Update of UCHIME uses top hit as a control new modes = heuristics + parameter settings

UCHIME2 algorithm

Perfect and fake models Perfect model if identical to query query may or may not be chimeric Fake model if query not chimeric & score > 0 model is better match than top hit Perfect fake if not chimeric & exact match Fake and perfect fake models very common Error-free prediction impossible in principle!

Fake models If query is not chimeric and is 97% identical to ref. db., 48% probability of a perfect fake. At 99% id, almost always a perfect fake, so better coverage makes problem worse!

Goals for chimera filtering How to compromise FPs and FNs? OTU pipeline, 97% clusters Chimera >3% diverged harmful always causes spurious OTU Chimeras <3% diverged can also be harmful sometimes cause spurious OTU 3% A B A+B chimera

Goals for chimera filtering False positives: discard good OTUs False negatives: cause spurious OTUs FPs and FNs equally harmful Not typical for bioinformatics! Sensitivity of 90% sounds good, but... 90% sensitivity = 10% FNs hundreds or thousands of spurious chimeric OTUs

Chimera divergence Parent divergence Top-hit divergence PD = 100% – (parent identity) If similar parents, harder to detect Chimera can very similar to one parent even if large PD Top-hit divergence D = nr. diffs between chimera & top hit Better indicates hard to detect (small D) De novo: top hit usually a parent

Chimera divergence Low-divergence chimeras most common hardest to detect Majority have D < 10, most common is D=1 D = nr. diffs with top hit

Measuring accuracy ChimeraSlayer & UCHIME benchmark Sensitivity to simulated bimeras parents always in reference database not realistic! coverage is sparse in practice Error rate false positives on leave-one-out test not realistic!

New benchmark design Measure dependence of accuracy on: D = divergence, especially small D S = similarity to closest reference sequence Sensitivity when: "Step-parent" for segment is 100%, 99% ... 90% id (S) False-positives when: Closest reference sequence is 100%, 99% ... 90% id (S) Focus on short tags (V4, ITS)

New benchmark design Split reference db. into subsets X and Y so that top hit similarity X ↔ Y = S, e.g. S=95% Make simulated bimeras C from parents in Y with divergences D = 1%, 2% ... 10% Measure TPs with query=C, db=X Measure FPs with query=Y, db=X

V4,S = 95%

Benchmark results

Benchmark results

Benchmark results High error rates if parents not in db Should use largest possible db (SILVA 1.8M) Gold (5k) misguided default for CS &UCHIME

Reference and/or de-novo? At <100% identity, fake models common All databases have sparse coverage Even SILVA Reference mode has high error rates De-novo on filtered reads also high error rates Because even low error rate degrades accuracy De-novo on denoised reads very effective Must find all exact models Conceptually different algorithm from classic UCHIME

97% OTU clustering: use UPARSE Better than UCHIME2 for 97% OTUs No need to distinguish read errors from low-div chimeras

Chimeras and denoising Ideal algorithm Reads Remove sequencing errors Amplicon sequences Remove PCR errors (point errors and chimeras) Biological sequences But, in practice can't distinguish PCR errors from sequencing error.

Chimeras and denoising Practical algorithm Reads Remove sequencing errors and PCR point errors Pseud0-amplicon sequences (Kludge, these sequences don't exist in reality) Remove chimeras Biological sequences

Abundance skew Suppose x is parent with lower abundance X is amplified in all N rounds Chimera formed in k-th round starts with abundance 1 Amplified in subsequent N - k round

Abundance skew Parents have higher abundance than chimera (Abundance of least abundant parent) / (chimera abundance) Minimum value is 2 Chimera forms in first round Parent has abundance = 1 Expect skew > 2 for almost all chimeras Chimeras unlikely for very low-abundance parent At most 1/N chimeras form in first round Probably fewer because lower amplicon density

UNOISE chimera filter Require exact match to 2-seg model Skew ≥ 2 (UNOISE2) Skew ≥ 16 (UNOISE3) Increased because I now believe reduces false positives due to fake models

DADA2 chimera filter Exact match to 2-seg model (div < 3) Zero or one diffs with 2-seg model (div ≥ 3) Abundance skew > 1

Abundance skew distribution

Denoiser false positives

Difference matters in practice Abundance skew Chimeras in soil dataset (from Edgar 2016, UCHIME2 paper)

Denoiser false positives Expected drop in frequency with lower ab. skew Almost all false positives due to fake models!