KMERSTREAM Streaming algorithms for k-mer abundance estimation Páll joint work with Bjarni V. Halldórsson
Error rates vs. Quality values What error rates can we expect from NGS Specifically whole genome sequencing with Illumina sequencing technology How informative are quality values Rubbish? Worth using for analysis?
Quality values A probability estimate that the basecall is correct GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACT + !''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>C Phred scale, Pr[base call incorrect] ~ 10 -Q/10 !=33, bad basecall
Error rates What percentage of basecalls are correct How to estimate Align reads to a reference Count mismatches and non-alignments Correct for snps and variants. Reference free Whole genome assembly?
K-mer counting Count k-mers, want k large, say ~31. GATTTGGGGTTCAAAGCAGTA GAT ATT TTT TTG TGG... GAT 2 ATT 3 TTT 2 TTG 3 TGG 2... ATTTGGGGTTGATT ATT TTT TTG TGG GGG GGT GTT TTG TGA GAT ATT
Errors and k-mers Basecall errors impact many k-mers GATTTGGGGTTCAAAGCAGTA GATTT ATTTG TTTGG TTGGG TGGGG GGGGT... AAGCAG AGCAGT GCAGTA GATTTGGGGTTCAAAGCAGTA GATTT ATTTG TTTGG TTGGG TGGGG GGGGT GGGTT GGTTC
Errors and k-mers Basecall errors are not independent Multiple errors more likely Ends of reads contain more errors K-mer error rate underestimates true basecall error rate Discounts reads with many errors or errors at the ends Can be off by a factor of 2
Frequency histograms Sequencing at normal coverage, ~30x, most true k-mers will have high coverage and most error k-mers will have coverage of 1
Naïve method Assumptions: Sampling from a genome of size G Poisson distribution, Poi(λ), of coverage of each position Each k-mer sampled is an error with prob ε independently. When we sample an error k-mer, it is replaced by a single nucleotide substitution at random
Naïve model Probability that a k-mer has coverage 1 ε Pr[error k-mer has cov 1] + (1-ε) Pr[true k-mer has cov 1] ε1-ε TGAC TGGC Genome length G Sample random position Produce correct k-merIntroduce one error
Frequency moments From the frequency histogram we define f i = number of k-mers with coverage i f 1 = number of singletons F 0 = number of distinct k-mers = Σ f i F1 = number of all k-mers = Σ i f i
Fitting the model 3 unknown parameters G, λ, ε 3 k-mer frequency statistics, f 1, F 1, F 0
Computing the moments Count all k-mers? – very memory intensive Sample k-mers (à la KmerGenie) Streaming algorithm, KmerStream Estimates f 1, F 0, F 1 directly without storing any k-mers Accuracy can be specified (default ~2%)
KmerStream Very fast, 5-10s per million reads Low memory overhead, ~11M One pass over the dataset Uses hashing to sample k-mers adaptively Lossy counting similar to Bloom filter Does not keep track of k-mers 2-3x faster than KmerGenie, 10x better memory
Validation Sampled reads from PhiX sequencing lane at 30x coverage, repeated 1000 times. KmerStream estimates True kmer counts
Real data Sequenced at deCODE genetics, 2656 individuals, sequenced at 10x to 30x coverage. KmerStream run for all samples, model fit to estimate k-mer error rates for k=31
K-mer error rates
Quality cutoff Keep only k-mers in reads where quality is above q. Run for q = 0, 13, 20, 30. Should correspond to upper bound on error of 1.0, 0.05, 0.01, 0.001
K-mer error rates Moving from q0 to q13 huge improvement q20 to q30 not recommended, 50% samples increased error rate
Wrap up Quality values are informative Can get speed up by prioritizing processing based on quality values e.g. alignment Error rates are highly variable Quality value cutoffs can be done on a case by case basis with minimal overhead.
Thank you Paper on bioRxiv Code on github.com/pmelsted/KmerStream Ph.D. position available “Streaming algorithms for whole genome assembly.”