Independent scientist

Slides:



Advertisements
Similar presentations
Mutation Analysis Server Nagarajanlab. © Copyright 2005, Washington University School of Medicine. 2 Agenda Mutation pipeline overview High level design.
Advertisements

We processed six samples in triplicate using 11 different array platforms at one or two laboratories. we obtained measures of array signal variability.
SOLiD Sequencing & Data
Introduction to Short Read Sequencing Analysis
Review: What influences confidence intervals?
Sequence similarity.
1 Bayesian inference of genome structure and application to base composition variation Nick Smith and Paul Fearnhead, University of Lancaster.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Chapter 7 Correlational Research Gay, Mills, and Airasian
Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers
Delon Toh. Pitfalls of 2 nd Gen Amplification of cDNA – Artifacts – Biased coverage Short reads – Medium ~100bp for Illumina – 700bp for 454.
CS 6293 Advanced Topics: Current Bioinformatics
Chapter 5 Multiple Sequence Alignment.
Introduction to Short Read Sequencing Analysis
Accurate estimation of microbial communities using 16S tags Julien Tremblay, PhD
June 11, 2013 Intro to Bioinformatics – Assembling a Transcriptome Tom Doak Carrie Ganote National Center for Genome Analysis Support.
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
Common parameters At the beginning one need to set up the parameters.
Construction of Substitution Matrices
Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data.
Quality Control Hubert DENISE
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Spatial Smoothing and Multiple Comparisons Correction for Dummies Alexa Morcom, Matthew Brett Acknowledgements.
Fish -Nearly half of all vertebrates species : marine, freshwater species (Fish base) FISH-BOL(Fish Barcode of Life Initiative) -Establish.
Accurate estimation of microbial communities using 16S tags
Construction of Substitution matrices
GG 313 beginning Chapter 5 Sequences and Time Series Analysis Sequences and Markov Chains Lecture 21 Nov. 12, 2005.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
SNP Scores. Overall Score Coverage Score * 4 optional scores ▫Read Balance Score  = 1 if reads are balanced in each direction ▫Allele Balance Score 
4-1 MGMG 522 : Session #4 Choosing the Independent Variables and a Functional Form (Ch. 6 & 7)
Canadian Bioinformatics Workshops
Convenience Sample of 4 Adults and 6 Infants. Adults 4 visits over 2 weeks; infants 2 visits over 2 weeks Adult specimens: 1) plaque (by method, teeth,
Presented by Samuel Chapman. Pyrosequencing-Intro The core idea behind pyrosequencing is that it utilizes the process of complementary DNA extension on.
Introduction to Illumina Sequencing
Robert Edgar Independent scientist
16S rRNA Experimental Design
Next-generation sequencing technology
Confidence Intervals Cont.
DNA Sequencing Second generation techniques
454 GS FLX Ion Torrent PGM Illumina MiSeq DNA Input
GENETIC MARKERS (RFLP, AFLP, RAPD, MICROSATELLITES, MINISATELLITES)
Short Read Sequencing Analysis Workshop
Lesson: Sequence processing
Metagenomics: From Bench to Data Analysis 19-23rd September S rRNA-based surveys for Community Analysis: How Quantitative are they? Dr.
Amos Tanay Nir Yosef 1st HCA Jamboree, 8/2017
Biases and their Effect on Biological Interpretation
Sequencing technologies
Preprocessing Data Rob Schmieder.
Quality Control & Preprocessing of Metagenomic Data
Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es.
04/10/
Presented By: Chinua Umoja
EDNA analyze Wang Ying & Huang Junman.
1. Estimation ESTIMATION.
Next-generation sequencing technology
Sequence comparison: Local alignment
Sequencing technology and assembly
The FASTQ format and quality control
2nd (Next) Generation Sequencing
Review: What influences confidence intervals?
CSE373: Data Structures and Algorithms Lecture 2: Math Review; Algorithm Analysis Dan Grossman Fall 2013.
Independent scientist
Identification of Paralogs in RADseq data
Independent scientist
BF nd (Next) Generation Sequencing
Canadian Bioinformatics Workshops
Fragment Assembly 7/30/2019.
MGS 3100 Business Analysis Regression Feb 18, 2016
The Variant Call Format
Presentation transcript:

Independent scientist STAMPS 2017 Robert Edgar Independent scientist robert@drive5.com www.drive5.com Next-Gen Reads

FASTQ files Text file with four lines per read Label Sequence + Q scores Format not fully standardized Different conventions for representing Q scores as letters Software may have different max & min Q scores Typical is Q2 to Q40

Illumina read Index (barcode) MiSeq#141 Spot coords. 1 = R1 2 = R2 1. Label 2. Sequence 3. + 4. Quals

Demultiplexing Assign reads to samples Using barcode sequence (tag, index) Embedded in PCR primer Incorporated into sequencing construct Illumina: usually separate read Single- or dual-index One or two index reads 454, Ion, PacBio... usually in the read

Illumina demultiplexing -rw-rw-r-- 1 robert robert 60M Apr 14 07:29 Sample1_S17_L001_R1_001.fastq -rw-rw-r-- 1 robert robert 31M Apr 14 07:28 Sample1_S17_L001_R2_001.fastq -rw-rw-r-- 1 robert robert 43M Apr 14 07:28 Sample2_S18_L001_R1_001.fastq -rw-rw-r-- 1 robert robert 23M Apr 14 07:28 Sample2_S18_L001_R2_001.fastq -rw-rw-r-- 1 robert robert 50M Apr 14 07:29 Sample3_S19_L001_R1_001.fastq -rw-rw-r-- 1 robert robert 26M Apr 14 07:28 Sample3_S19_L001_R2_001.fastq Linux directory listing with read files created by MiSeq FASTQ = read file format R1 = forward read R2 = reverse read R1 + R2 for each sample Barcodes from index reads identified and discarded by Illumina software Remaining sequence is 16S (you should delete primer-binding bases)

Quality (Phred) scores Integer Q2 .. Q40 Represents P_error, probability base is wrong Q40: P_error = 0.0001 99.99% good Q30: P_error = 0.001 99.9% good Q20: P_error = 0.01 99% good Q10: P_error = 0.1 10% wrong Q3: P_error = 0.5 50% wrong Q2: P_error = 0.66 66% wrong!!

Quality filtering Discard poor-quality data Poor quality = high probability of error(s) low Q scores Genomics can mask out low-Q positions e.g. for SNP-calling

Quality filtering Amplicon sequencing different scenario Need pair-wise comparisons for most analyses pairs of reads, or reads & database to calculate identity or determine if sequences identical Masked / ambiguous positions (Ns) problematic Variable length (e.g. truncated at low Q) also problematic OTU clustering "Harmful" reads >3% errors create spurious OTUs High diversity in harmful reads Many spurious OTUs even if harmful reads small fraction

Truncating at low Q is bad idea Read quality often falls towards end of read Popular (but bad!) to truncate when Q low QIIME, sickle Do A and B have identical sequences? If Yes, dubious tail gets high abundance If No, good prefix gets low abundance

Global trimming Similar/identical reads should be globally alignable with few/no terminal gaps Comparisons unambiguous Cannot have A identical (or >97% similar) to prefix of B Unpaired reads: truncate to fixed length Trim low-quality tails Important for 454, Ion Torrent, Pac Bio Often not needed for Illumina

Global trimming Full-length amplicons with varying length ok e.g. overlapping paired reads trim to primers ok no terminal gaps when same / closely related Fwd V4 Rev Fwd V4 Rev Fwd V4 Rev Fwd V4 Rev

Quality filtering methods þ Minimum Q Ok if Q is large, e.g. Q≥20 (P_error=1%), not small e.g. Q3 Don't truncate at variable lengths Unnecessarily stringent, expected errors is better ý Average Q over sliding window Sickle Math mistake -- averaging logarithms makes no sense Errors dominated by small Qs, lost by averaging

Quality filtering methods ý QIIME filter Truncate ( ) read if >3 consecutive bases with Q≤3 Q=3 means P_error = 50% Allows reads with many errors! ý PANDAseq method t = geometric mean (!?) of P_correct along read ≥ 0.6 P_error = 0.4 Much too high, allows reads with many errors Better with higher t, but not as good as expected errors See Edgar & Flyvbjerg 2014 for details

þ Expected error filtering A T C 20 3 10 40 25 2 Avg. Q = 23 (Perror = 0.005), seems pretty good... But, the most probable number of errors = 1 because of the Q3 and Q2 bases.

Expected errors Expected errors = mean 50% errs

Expected errors Expected errors (E) in a read E = mean over large set with random errors according to Q scores real-valued (because it's an average) always > 0 can be < 1

Expected errors Surprisingly easy to calculate E Sum the error probabilities E = sum P_error Most probable number of errors E* E* = largest integer ≤ E = floor(E) Proofs in Edgar & Flyvbjerg 2014.

Expected error filter Discard reads with E>1 Keep reads with E*=0 Most probable number of errors = zero Typical performance on MiSeq 2x250 V4 keeps ~75% of the reads 2/3 of filtered reads are correct (zero errors) 1/3 have one or more bad bases

Expected error filter Works well if Q scores are accurate Illumina Q scores are pretty good 454, I-T, PacBio not so good Q scores predict indel errors, not substitutions filtering not so effective expected error filter still best method Max E=1 suggested default Not a requirement! (note for comparative validation) Larger E for less stringent filtering Smaller E for very stringent filtering

Expected error filtering Critics: allege too stringent high cost in sensitivity, diversity Reads are not lost! Most filtered reads map to OTUs after clustering Filtering is critically important to suppress spurious OTUs High sensitivity to rare species not possible Contaminants, cross-talk... Limit of resolution abundance > ~0.5% of reads

Expected vs. measured errors

Quality filter performance QIIME and PANDAseq filters leave tens of thousands of reads with >3% errors, thousands of spurious OTUs

Paired read merging / assembly

Paired read merging Two observations of each base in overlap Should increase/decrease Q if match/mismatch Use Bayes' Theorem to get posterior P_error Correct equations in Edgar & Flyvbjerg (2015) Previous papers got this wrong! Aligned region has highest quality R1s good quality R2s lower quality Position in read (MiSeq 2x300 after merging by USEARCH)

Paired read assemblers

Do we need full overlap? V4 is ~250nt 2x250 PE reads give full overlap Better error correction? Accurate OTUs with UPARSE on R2s only! Longer amplicons ok, e.g. V3-V4 (400nt) better resolution

Dereplication Find the unique sequences in the reads and their abundances Abundance is a very useful signal High-abundance sequences almost certainly correct if (and only if!) globally trimmed Errors increasingly common at lower abundances Combine reads from all samples Strongest abundance signal

Singletons Abundance = 1 Random errors usually singletons Not usually reproduced by chance Systematic errors may have ab. > 1 Polymerase errors & chimeras (amplified by PCR) Sequencing error usually pretty random

Discard singletons After filtering, many reads with >3% errors Sequencer error Polymerase copying errors Chimeras Most of these are singletons Discard singletons before clustering Necessary to minimize spurious OTUs Most singletons map to OTUs after clustering, not lost!

Discard singletons Critics: allege high cost in sensitivity, diversity Effect on sensitivity minimal / meaningless By definition, found once in one sample! Ecologically irrelevant (or not possible to interpret) Sensitivity is < 100% even with singletons Sampling effects, e.g. rare species missed Primer mismatches ("universal" = ~80% - 90%) Some / many rare species missing regardless Estimators like Chao-1 nonsense for 16S

Delete primer-binding sequences PCR tends to substitute mismatches Delete n bases if primer is length n USEARCH fastx_truncate command