Robert Edgar Independent scientist

• Data reduction
• Make tractable for downstream analysis
• Read dereplication & error-correction
• Metagenomics
• Identify protein families de novo
• Community sequencing: identify OTUs

• Challenges
• USEARCH solutions

[Figure: environmental sample with bacteria → bacterial chromosome with 16S gene and primers → PCR → amplified 16S segments (biological sequences, plus chimeric artifacts formed from ≥2 biological sequences during PCR) → reads.]

• Error correction
• Chimeras
• Big problem with 16S / 18S / ITS
• Covered this morning: UCHIME
• Other PCR errors
• Sequencer error
• Bad base calls, indels, homopolymers
• Cluster at 97% (3% radius)
• One cluster = one OTU = one species (maybe!)

Bigger dot = more reads. Radius = 3%; 3% = species. Centroid: ideally should be the most abundant sequence = most likely to be biological. Differs from rep. seq. due to: sequencing error; biological variation.

Which OTU? Ambiguous assignments

Abundant sequences <3% different (2% in the figure). Arbitrary choice of OTU rep. seq. Outliers create spurious OTU(s).

Full-length 16S gene (~1500nt)

Next-gen reads of hypervariable region (~300nt) Variation greater in short region, may be > 3%.

Variation between populations Healthy Diseased

Paralogs and segmental duplications: a duplicated 16S gene on the bacterial chromosome that has diverged by > 3% gives two OTUs for one species.

Alignment variation and defining % identity

Program A and Program B align the same pair of sequences differently:

G A T T A C A - -
G A A T T A A C A
3 diffs or 5 diffs?

G A - T T A - C A
G A A T T A A C A
No diffs or 2 diffs?

Different programs produce different results from the same algorithm & same input data because alignments and %id definition vary. This can bias validation, e.g. Schloss & Westcott (2011) AEM.
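The "3 diffs or 5 diffs?" question can be made concrete with a small sketch. The function below is hypothetical (it is not taken from any of the programs discussed) and takes two already-aligned rows; the `count_gaps` switch is an assumed name for the choice between counting gapped columns as differences and ignoring them.

```python
def percent_identity(row_a, row_b, count_gaps=True):
    """Percent identity of two aligned rows of equal length.

    count_gaps=True  : each column with one gap counts as a difference.
    count_gaps=False : columns containing a gap are ignored entirely.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = 0
    columns = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # column is empty in both rows
        has_gap = a == "-" or b == "-"
        if has_gap and not count_gaps:
            continue  # this definition skips gapped columns
        columns += 1
        if not has_gap and a == b:
            matches += 1
    return 100.0 * matches / columns if columns else 0.0


# The two alignments of GATTACA against GAATTAACA from the slide.
first = ("GATTACA--", "GAATTAACA")   # 3 mismatches + 2 gap columns
second = ("GA-TTA-CA", "GAATTAACA")  # 0 mismatches + 2 gap columns

for top, bottom in (first, second):
    print(percent_identity(top, bottom, count_gaps=True),
          percent_identity(top, bottom, count_gaps=False))
```

The same pair of sequences yields four different identity values depending on the alignment and the definition, which is why results from different programs are not directly comparable.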

Hard to define an OTU or an optimal set of OTUs. [Figure: phylogenetic tree of sequences A, B and C with pairwise distances 1.5%, 4% and 2.5%.]
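A short sketch of why three such distances make an OTU hard to define. The transcript does not preserve which pair each distance belongs to, so the mapping below (A-B 1.5%, B-C 2.5%, A-C 4%) is an assumption for illustration; the point holds for any assignment, because two pairs fall inside the 3% radius and one does not.

```python
# Assumed pairwise distances in percent (the pairing is hypothetical).
DIST = {
    frozenset("AB"): 1.5,
    frozenset("BC"): 2.5,
    frozenset("AC"): 4.0,
}


def members_within(centroid, names, dist, radius=3.0):
    """Names that fall inside `radius` percent of the chosen centroid."""
    return sorted(
        n for n in names
        if n == centroid or dist[frozenset((centroid, n))] <= radius
    )


for centroid in "ABC":
    print(centroid, members_within(centroid, "ABC", DIST))
```

With B as centroid all three sequences form one OTU; with A or C as centroid the third sequence falls outside the radius, so the number of OTUs depends on which sequence is picked as the centroid.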

Hard to define an OTU or an optimal set of OTUs. Optimal OTUs per Schloss & Westcott’s MCC measure can be non-monophyletic. [Figure: sequences A, B and C.]

• OTUs are hacks
• Do not exist in nature
• Cannot be defined and validated robustly
• But can still be useful!

• One program, one binary
• Suite of high-throughput algorithms
• Search, clustering, dereplication, chimera detection…
• Orders of magnitude faster than BLAST
• Free for academic use (32-bit)

• Sort sequences
• Greedy list removal

[Diagram labels: Input sequences, Database, Clusters.]

Database: held in RAM for fast access. Cluster assignments are written sequentially to file, not stored in RAM. Typical state: one database sequence per cluster (the centroid).

Initial state: empty database = no clusters. Input sequences are processed in file order.

The next input sequence is searched against the database. USEARCH algorithm: very fast database search (>>BLAST).

Hit: the input sequence is assigned to the cluster and discarded. A record is written to the output file(s). Optional: alignment, other info.

No hit: the query is added to the database and becomes the centroid of a new cluster.

• Very fast
• Input order matters
• Centroid is always first member found
• How to sort?
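The hit / no-hit loop described on the preceding slides can be sketched in a few lines. This is a minimal illustration, not the UCLUST implementation: `toy_identity` is a hypothetical stand-in for the USEARCH database search (it compares positions without aligning), and taking the best-scoring centroid above the threshold is an assumption.

```python
def toy_identity(a, b):
    """Fraction of matching positions over the longer length (no alignment)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


def greedy_cluster(seqs, threshold=0.97, identity=toy_identity):
    """Process sequences in input order; return (centroids, assignments)."""
    centroids = []    # the "database": one sequence per cluster
    assignments = []  # cluster index for each input sequence
    for seq in seqs:
        best_index, best_score = None, 0.0
        for index, centroid in enumerate(centroids):
            score = identity(seq, centroid)
            if score >= threshold and score > best_score:
                best_index, best_score = index, score
        if best_index is None:
            # No hit: the query becomes the centroid of a new cluster.
            centroids.append(seq)
            best_index = len(centroids) - 1
        assignments.append(best_index)
    return centroids, assignments


# Toy sequences: a and c each differ from b at 2 positions, and from
# each other at 4 positions.
a, b, c = "AAAAAAAAAA", "AAAAAAAACC", "AAAAAACCCC"
centroids_abc, _ = greedy_cluster([a, b, c], threshold=0.75)
centroids_bac, _ = greedy_cluster([b, a, c], threshold=0.75)
print(len(centroids_abc), len(centroids_bac))
```

The same three sequences give two clusters in one input order and one cluster in the other, which is the sense in which input order matters and sorting becomes the key design decision.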

Longest sequences typically outliers, tend to split OTUs.

Centroid: CENTROID -------
Seq1: CENTROIDINSERTED
Seq2: CENTROIDTERMINAL

If you don’t sort by length, fragments can become centroids and member sequences may have many differences.

Most abundant sequence is likely to be biological & a good choice of centroid

• If read errors are rare:
• Abundance = size of dereplication cluster
• If read errors are common:
• Have a circular problem:
• Abundances need clustering, but
• Clustering needs abundances.
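For the case where read errors are rare, abundance by dereplication can be sketched as below. The function name is hypothetical, and the sketch assumes reads are the same length and trimmed identically, so only exact string matches are collapsed.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads; return (sequence, abundance) pairs,
    most abundant first, ties broken alphabetically."""
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


reads = ["ACGT", "ACGT", "ACGA", "ACGT", "ACGA", "TTTT"]
for sequence, abundance in dereplicate(reads):
    print(sequence, abundance)
```

When read errors are common, most erroneous reads become singletons in this count, which is why abundance then needs a clustering step first.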

Biological sequence: G A T G A C G T C A A G T C A T A G G
Read 1: G A T T A C G T C A - A G T C A A A G G
Read 2: G A T G A C G A C A - A G T C A T A G -
Read 3: G G T G A C G T C A A A G - C A T A G G
Consensus: G A T G A C G T C A A G T C A T A G G

Calculate consensus sequence. UCLUST can do this for each cluster.
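A minimal sketch of a consensus calculation, using the three aligned reads from this slide. It takes a simple per-column majority vote and drops columns where a gap wins; this is an illustrative assumption, not a description of how UCLUST computes its consensus.

```python
from collections import Counter


def consensus(aligned_reads):
    """Column-wise majority vote over equal-length aligned reads.

    Columns where the gap character wins are dropped from the output.
    Ties go to the symbol seen first in the column.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must have equal length")
    result = []
    for column in zip(*aligned_reads):
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != "-":
            result.append(symbol)
    return "".join(result)


reads = [
    "GATTACGTCA-AGTCAAAGG",  # Read 1
    "GATGACGACA-AGTCATAG-",  # Read 2
    "GGTGACGTCAAAG-CATAGG",  # Read 3
]
print(consensus(reads))
```

Each read carries different errors, so every erroneous base or gap is outvoted by the other two reads and the consensus recovers the biological sequence shown on the slide.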

Dereplicate: sort by length & run UCLUST Longest sequences are centroids in first round. Tend to be outliers & split a natural OTU.

Find consensus sequences Consensus sequences converge on most abundant sequence in cluster, most likely to be a correct amplicon sequence. Common for two clusters to converge on same consensus sequence: merges an OTU that was split in first round.

Before taking consensus… …after.

Consensus sequences ≈ denoised amplicons Amplicon abundance ≈ cluster size Circular problem solved. Filter chimeras Abundances needed by de novo UCHIME as well

Sort by abundance Run UCLUST at 97% Centroid is final OTU.

Assign reads to OTUs: USEARCH at 97%. Outliers need special treatment: can be assigned to closest OTU, or reclustered at 97%. Most reads match an OTU.
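The read-assignment step might look like the following sketch. The names are hypothetical, `toy_identity` is a position-by-position stand-in for a real USEARCH search, and outliers are simply collected so that either treatment mentioned above (closest OTU or reclustering) can be applied afterwards.

```python
def toy_identity(a, b):
    """Fraction of matching positions over the longer length (no alignment)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(1 for x, y in zip(a, b) if x == y) / longest


def assign_reads(reads, otus, threshold=0.97, identity=toy_identity):
    """Assign each read to its closest OTU if identity >= threshold.

    Returns (assigned, outliers): a dict of OTU -> list of reads, and
    the list of reads that matched no OTU at the threshold.
    """
    assigned = {otu: [] for otu in otus}
    outliers = []
    for read in reads:
        best = max(otus, key=lambda otu: identity(read, otu), default=None)
        if best is not None and identity(read, best) >= threshold:
            assigned[best].append(read)
        else:
            outliers.append(read)
    return assigned, outliers


otus = ["A" * 40, "C" * 40]
reads = ["A" * 39 + "G", "C" * 40, "A" * 20 + "C" * 20]
assigned, outliers = assign_reads(reads, otus)
print({otu[:1]: len(members) for otu, members in assigned.items()}, len(outliers))
```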

• Python script, runs multiple USEARCH steps
• Very fast and highly scalable
• 10^6 reads in minutes on a laptop
• Ad hoc, but good biological results
• Other algorithms are also ad hoc
• Average linkage “standard” but not justified by theory
• Does not address read error correction, other challenges

• Technical issues
• Clustering threshold for error correction
• 97% seems to work well so far
• But can merge distinct amplicons…
• …degrades abundance estimate
• Higher threshold might be better if read errors rare
• Minimum cluster size threshold
• Clusters <4 reads discarded after error-correction step
• Rare species / false-positive trade-off
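The minimum cluster size filter is simple enough to sketch directly. The function and the cluster labels are hypothetical; only the cutoff of 4 reads comes from the slide.

```python
def filter_small_clusters(cluster_sizes, min_size=4):
    """Keep clusters with at least `min_size` reads.

    Returns (kept, discarded) as two dicts of cluster -> read count.
    """
    kept = {c: n for c, n in cluster_sizes.items() if n >= min_size}
    discarded = {c: n for c, n in cluster_sizes.items() if n < min_size}
    return kept, discarded


sizes = {"otu1": 1200, "otu2": 4, "otu3": 3, "otu4": 1}
kept, discarded = filter_small_clusters(sizes)
print(sorted(kept), sorted(discarded))
```

Raising `min_size` removes more spurious clusters at the cost of also discarding genuinely rare species, which is the trade-off named in the last bullet.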

• Not like QIIME or mothur
• Not a complete suite of analysis tools
• Not "packaged" specifically for 16S
• Lower-level algorithms
• Typically used by "pipelines"
• Multiple steps
• Typical step is USEARCH command or file conversion
• Implemented by scripts (bash, perl, Python...).

Task comparison across tools (author in parentheses): USEARCH (Edgar), QIIME (Knight), mothur (Schloss), Pyronoise (Quince), Perseus (Quince), ESPRIT (Sun).
Tasks compared: reads to OTUs; filtered reads to OTUs; phylotype; error correction; chimera filter (reference database); chimera filter (de novo); compare populations (UniFrac); diversity (α, β).