KMERSTREAM Streaming algorithms for k-mer abundance estimation Páll joint work with Bjarni V. Halldórsson.



Error rates vs. Quality values
• What error rates can we expect from NGS?
  • Specifically, whole genome sequencing with Illumina sequencing technology
• How informative are quality values?
  • Rubbish?
  • Worth using for analysis?

Quality values
• A probability estimate that the basecall is correct
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACT
    +
    !''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>C
• Phred scale: Pr[base call incorrect] ≈ 10^(-Q/10)
• '!' = ASCII 33, bad basecall
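The Phred relation on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the common Phred+33 encoding suggested by the '!' = 33 note; the function name `phred_to_error_prob` is hypothetical and not part of KmerStream.

```python
def phred_to_error_prob(quality_string, offset=33):
    """Convert a FASTQ quality string to per-base error probabilities.

    Assumes Phred+offset encoding: Q = ord(char) - offset and
    Pr[base call incorrect] = 10 ** (-Q / 10).
    """
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]


# Quality line from the slide's FASTQ example.
qual = "!''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>C"
probs = phred_to_error_prob(qual)

# The first character is '!' (Q = 0), i.e. no confidence in the base call.
print(probs[0])
```

A '5' character (ASCII 53) decodes to Q = 20, an estimated error probability of about 0.01.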

Error rates
• What percentage of basecalls are correct?
• How to estimate?
  • Align reads to a reference
    • Count mismatches and non-alignments
    • Correct for SNPs and variants
  • Reference free
    • Whole genome assembly?

K-mer counting
• Count k-mers, want k large, say ~31.
[Example with k = 3]
  Read:   GATTTGGGGTTCAAAGCAGTA
  3-mers: GAT ATT TTT TTG TGG ...
  Read:   ATTTGGGGTTGATT
  3-mers: ATT TTT TTG TGG GGG GGT GTT TTG TGA GAT ATT
  Counts: GAT 2, ATT 3, TTT 2, TTG 3, TGG 2, ...
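The 3-mer example on this slide can be reproduced with a plain dictionary counter. This is a sketch of exact counting only (the memory-hungry baseline, not the streaming approach); `count_kmers` is a hypothetical helper name.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (sliding window of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# The two reads shown on the slide, with k = 3 for readability.
reads = ["GATTTGGGGTTCAAAGCAGTA", "ATTTGGGGTTGATT"]
counts = count_kmers(reads, 3)

for kmer in ["GAT", "ATT", "TTT", "TTG", "TGG"]:
    print(kmer, counts[kmer])
```

Storing one entry per distinct k-mer is what makes this approach expensive for whole genomes at k around 31.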

Errors and k-mers
• Basecall errors impact many k-mers
  Read:   GATTTGGGGTTCAAAGCAGTA
  k-mers: GATTT ATTTG TTTGG TTGGG TGGGG GGGGT ... AAGCAG AGCAGT GCAGTA
  Read:   GATTTGGGGTTCAAAGCAGTA
  k-mers: GATTT ATTTG TTTGG TTGGG TGGGG GGGGT GGGTT GGTTC
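To make the point concrete, the sketch below counts how many k-mers change when a single base in the slide's read is substituted. The helper names and the chosen positions are illustrative assumptions, not taken from the talk.

```python
def kmers(read, k):
    """Return the list of k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def affected_kmers(read, pos, new_base, k):
    """Count k-mers that differ after substituting new_base at index pos."""
    mutated = read[:pos] + new_base + read[pos + 1:]
    return sum(a != b for a, b in zip(kmers(read, k), kmers(mutated, k)))


read = "GATTTGGGGTTCAAAGCAGTA"
k = 5

middle = affected_kmers(read, 10, "A", k)  # interior base: covered by k windows
first = affected_kmers(read, 0, "C", k)    # first base: covered by one window
last = affected_kmers(read, 20, "C", k)    # last base: covered by one window
print(middle, first, last)
```

An interior error corrupts k k-mers, while an error at either end of the read corrupts only one, so errors at read ends leave fewer traces in k-mer statistics.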

Errors and k-mers
• Basecall errors are not independent
  • Multiple errors more likely
  • Ends of reads contain more errors
• K-mer error rate underestimates true basecall error rate
  • Discounts reads with many errors or errors at the ends
  • Can be off by a factor of 2

Frequency histograms
• When sequencing at normal coverage, ~30x, most true k-mers will have high coverage and most error k-mers will have a coverage of 1.

Naïve method
• Assumptions:
  • Sampling from a genome of size G
  • Coverage of each position follows a Poisson distribution, Poi(λ)
  • Each sampled k-mer is an error with probability ε, independently
  • An error k-mer is the true k-mer with a single random nucleotide substitution

Naïve model
• Probability that a k-mer has coverage 1:
  ε Pr[error k-mer has cov 1] + (1-ε) Pr[true k-mer has cov 1]
[Diagram: genome of length G; sample a random position; with probability 1-ε produce the correct k-mer, with probability ε introduce one error (example k-mers TGAC, TGGC)]

Frequency moments
• From the frequency histogram we define
  • f_i = number of k-mers with coverage i
  • f_1 = number of singletons
  • F_0 = number of distinct k-mers = Σ f_i
  • F_1 = number of all k-mers = Σ i f_i
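The definitions above translate directly into code once exact k-mer counts are available. The sketch below uses a small made-up count table (the values are illustrative, not data from the talk), and `frequency_moments` is a hypothetical helper name.

```python
from collections import Counter


def frequency_moments(kmer_counts):
    """Compute the histogram f_i and the moments f_1, F_0, F_1.

    kmer_counts maps each distinct k-mer to its coverage.
    """
    hist = Counter(kmer_counts.values())      # f_i: number of k-mers with coverage i
    f1 = hist[1]                              # singletons
    F0 = sum(hist.values())                   # distinct k-mers = sum of f_i
    F1 = sum(i * f for i, f in hist.items())  # all k-mers = sum of i * f_i
    return hist, f1, F0, F1


# Toy counts, made up for illustration.
kmer_counts = {"GAT": 2, "ATT": 3, "TTT": 2, "GGG": 1, "TGA": 1}
hist, f1, F0, F1 = frequency_moments(kmer_counts)
print(f1, F0, F1)
```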

Fitting the model
• 3 unknown parameters: G, λ, ε
• 3 k-mer frequency statistics: f_1, F_1, F_0

Computing the moments
• Count all k-mers? Very memory intensive
• Sample k-mers (à la KmerGenie)
• Streaming algorithm, KmerStream
  • Estimates f_1, F_0, F_1 directly without storing any k-mers
  • Accuracy can be specified (default ~2%)

KmerStream
• Very fast, 5-10 s per million reads
• Low memory overhead, ~11M
• One pass over the dataset
• Uses hashing to sample k-mers adaptively
• Lossy counting similar to Bloom filter
• Does not keep track of k-mers
• 2-3x faster than KmerGenie, 10x better memory
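The idea of using a hash to sample k-mers adaptively can be illustrated with a classic adaptive-sampling estimator for the number of distinct items (F_0). This is not KmerStream's actual data structure: it is a simplified stand-in that keeps a bounded set of hash values and halves the sampling rate whenever the set overflows. The names `kmer_hash` and `estimate_distinct` and the `capacity` parameter are assumptions made for this sketch.

```python
import hashlib


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer string."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def estimate_distinct(kmers, capacity=1024):
    """Estimate the number of distinct k-mers in one pass.

    Keeps only hashes divisible by 2**level. When the sample grows
    past `capacity`, the level is raised and the sample is pruned,
    so memory stays bounded regardless of stream length.
    """
    level = 0
    sample = set()
    for kmer in kmers:
        h = kmer_hash(kmer)
        if h % (1 << level) == 0:
            sample.add(h)
            while len(sample) > capacity:
                level += 1
                sample = {x for x in sample if x % (1 << level) == 0}
    # Each retained hash stands for roughly 2**level distinct k-mers.
    return len(sample) * (1 << level)


stream = ["GAT", "ATT", "GAT", "TTT", "ATT"]
print(estimate_distinct(stream))  # exact while the sample fits in capacity
```

Repeated k-mers hash to the same value, so duplicates never inflate the sample. A larger `capacity` trades memory for accuracy.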

Validation
• Sampled reads from a PhiX sequencing lane at 30x coverage, repeated 1000 times.
[Figure: KmerStream estimates vs. true k-mer counts]

Real data
• Sequenced at deCODE genetics: 2656 individuals, at 10x to 30x coverage.
• KmerStream was run on all samples, and the model was fit to estimate k-mer error rates for k=31

K-mer error rates

Quality cutoff
• Keep only k-mers in reads where quality is above q.
• Run for q = 0, 13, 20, 30.
• Should correspond to upper bounds on the error of 1.0, 0.05, 0.01, 0.001
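One way to read the cutoff rule is sketched below: a k-mer is kept only if every base in it has a quality of at least q. The slide does not spell out the exact rule (for example, strict versus non-strict comparison), so this interpretation, the Phred+33 offset, and the helper names are all assumptions.

```python
def error_bound(q):
    """Upper bound on per-base error probability implied by Phred quality q."""
    return 10 ** (-q / 10)


def high_quality_kmers(seq, qual, k, q, offset=33):
    """Return the k-mers of seq whose bases all have quality >= q."""
    scores = [ord(c) - offset for c in qual]
    kept = []
    for i in range(len(seq) - k + 1):
        if min(scores[i:i + k]) >= q:
            kept.append(seq[i:i + k])
    return kept


# The cutoffs from the slide and their implied error bounds.
for q in (0, 13, 20, 30):
    print(q, round(error_bound(q), 3))

# Toy read: 'I' encodes Q40, '!' encodes Q0 (values made up for illustration).
kept = high_quality_kmers("GATTACA", "IIII!II", k=3, q=20)
print(kept)
```

A single low-quality base removes every k-mer that overlaps it, so stricter cutoffs shrink the usable k-mer set quickly.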

K-mer error rates
• Moving from q0 to q13: huge improvement
• Moving from q20 to q30: not recommended, 50% of samples had an increased error rate

Wrap up
• Quality values are informative
  • Can get a speedup by prioritizing processing based on quality values, e.g. alignment
• Error rates are highly variable
• Quality value cutoffs can be chosen on a case-by-case basis with minimal overhead.

Thank you
• Paper on bioRxiv
• Code on github.com/pmelsted/KmerStream
• Ph.D. position available: “Streaming algorithms for whole genome assembly.”