
TRC: Trace – Reference Compression

Presentation on theme: "TRC: Trace – Reference Compression"— Presentation transcript:

1 TRC: Trace – Reference Compression
Roye Rozov

2 Review: Bloom Filters
memory-efficient probabilistic data structure
bit array of size m, h hash functions, n entries
operations:
insert(s) – set the h hashed positions to 1
query(s) – if all h positions are 1, the BF 'accepts' s
due to possible collisions, the probability of a false-positive accept is F ≈ (1 − e^(−hn/m))^h
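The insert/query operations above can be sketched as follows; this is a minimal illustration, not a production filter, and the choice of deriving the h positions by salting SHA-256 is an assumption for the example:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: bit array of size m, h hash functions."""
    def __init__(self, m, h):
        self.m, self.h = m, h
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, s):
        # Derive h positions by salting a hash with the function index
        # (illustrative choice; real filters use faster non-crypto hashes).
        for i in range(self.h):
            d = hashlib.sha256(f"{i}:{s}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def insert(self, s):
        for p in self._positions(s):
            self.bits[p // 8] |= 1 << (p % 8)

    def query(self, s):
        # True = 'accept' (possibly a false positive); False = definitely absent.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(s))
```

With m much larger than h·n, the false-positive rate F ≈ (1 − e^(−hn/m))^h stays small, but accepts can never be fully trusted.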

3 Review: Bloom Filter application in assembly
assembling large genomes has become very memory-intensive (often requiring > 100 GB)
methods typically employ de Bruijn graphs (dBGs), where k-mers are nodes and (k−1)-overlaps between nodes are edges
bottleneck is typically holding ~10^9 k-mers in memory for graph traversal or k-mer counting
Draw example with read of length 5, k = 3
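The read-length-5, k = 3 example from the speaker note can be sketched directly; the read sequence "ACGTT" is a made-up example:

```python
def kmers(read, k):
    """All k-length substrings of a read (the dBG nodes);
    consecutive k-mers overlap by k-1 characters (the dBG edges)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

# A read of length 5 with k = 3 yields 3 nodes; each adjacent pair
# shares a (k-1)-overlap, which is an edge in the de Bruijn graph.
print(kmers("ACGTT", 3))  # ['ACG', 'CGT', 'GTT']
```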

4 Mention several places you may need to store k-mer representations – to recover strings from k-mers using overlaps, and to correct errors / remove tips using counts. Also mention that storing only nodes saves memory.

5 BF-based traversal – using the 8 possible queries for each node's extensions (given a starting point)
Chengxi Ye et al., Exploiting sparseness in de novo genome assembly, 2012
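The 8-query neighbor enumeration can be sketched as below; `bf_query` stands in for the Bloom filter's membership test, and a plain set is used as the oracle in the usage example (an assumption for illustration):

```python
def neighbors(bf_query, node):
    """Enumerate the 8 possible dBG neighbors of a k-mer (4 forward,
    4 backward extensions) and keep those the Bloom filter accepts.
    False positives can create spurious branches during traversal."""
    out = []
    for base in "ACGT":
        fwd = node[1:] + base      # forward extension: shift left, append base
        bwd = base + node[:-1]     # backward extension: shift right, prepend base
        if bf_query(fwd):
            out.append(fwd)
        if bf_query(bwd):
            out.append(bwd)
    return out

# Usage with a set as a stand-in oracle:
kmer_set = {"CGT", "GTT"}
print(neighbors(kmer_set.__contains__, "ACG"))  # ['CGT']
```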

6 Minia assembler
Punch line: reduction of memory use from > 100 GB to ~5 GB

7 Review: Compression of NGS reads
NGS throughput has grown at a staggering pace over the last decade – it is now common practice to send physical disks instead of sharing files electronically, as raw data files are too large; 1000 Genomes data is available on AWS to avoid downloads
large costs associated with long-term data storage and transfer have increased interest in new NGS compression methods
several recent reference-based (CRAM, Goby, Quip (-r)) and reference-free methods (Quip, SCALCE, BEETL)
we focus on reference-free lossless compression of read sequences; we ignore "N" characters and multiplicity information

8 Reference-based vs. reference-free approaches
Reference-based – align reads to a reference and encode diffs
Reference-free – boosting schemes for general compressors (LZ / Huffman coding / arithmetic coding based), graph/assembly-based schemes
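The reference-based idea (store a position plus diffs instead of the raw sequence) can be sketched as below; the brute-force scan over every offset is an assumption for illustration, since a real tool would use an index or aligner:

```python
def encode_read(read, ref):
    """Reference-based encoding sketch: find the offset minimizing
    substitutions and store (position, list of (index, base) diffs)."""
    best = None
    for pos in range(len(ref) - len(read) + 1):
        diffs = [(i, read[i]) for i in range(len(read)) if read[i] != ref[pos + i]]
        if best is None or len(diffs) < len(best[1]):
            best = (pos, diffs)
    return best

def decode_read(pos, diffs, length, ref):
    """Rebuild the read by copying from the reference and applying diffs."""
    seq = list(ref[pos:pos + length])
    for i, b in diffs:
        seq[i] = b
    return "".join(seq)
```

When reads match the reference closely, the (position, diffs) pair is far smaller than the read itself, which is the source of the compression gain.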

9 TRC Intuition: BF as a Read Oracle; store only a trace of the reads
Reads → BF → 1
queries: AA…….A? AA…….C? … TT……..T?
"Do (1) or do not (0). There is no try."
– Star Wars: The Empire Strikes Back

10 Problems with querying an oracle
Time: when decoding, there are 4^len possible queries, where len ≈ 100
Space: FPs need to be stored explicitly; their number is proportional to the number of queries performed
⇒ the search space must be limited

11 Use the reference genome to limit the search space
Substitutions checked at e positions on the right end; the reverse-complement strand is also scanned
FPs/FNs stored in an additional data structure for decoding

12 Genome position queries
[Diagram: length-l windows of the reference genome are queried against the BF; outcomes are classified as TP, FP (within the e right-end positions), and TN]
FNs come from reads with unchecked mutations or indels
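The limited query set implied by the two slides above can be sketched as follows; the function name and the exact enumeration order are assumptions, and TRC's real enumeration may differ in detail:

```python
def rc(s):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(s))

def candidate_queries(ref, l, e):
    """Queries a TRC-style decoder would issue: every length-l window of
    the reference on both strands, plus single substitutions at the e
    rightmost positions of each window. This bounds the 4^l search space."""
    for strand in (ref, rc(ref)):
        for i in range(len(strand) - l + 1):
            w = strand[i:i + l]
            yield w
            for j in range(l - e, l):          # vary only the e right-end positions
                for b in "ACGT":
                    if b != w[j]:
                        yield w[:j] + b + w[j + 1:]
```

Each window yields 1 + 3e queries instead of 4^l, so the FP count (proportional to the number of queries) stays manageable.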

13 Bloom filter cost
r = m/n (bits per BF entry)
F minimized with h = (m/n)·ln(2); r = O(log(1/F))
at this value F ≈ (1 − e^(−hn/m))^h = (1/2)^h ≈ 0.6185^r
Mention r scales as log2(1/F); the smallest BF size is the least compressible
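The F ≈ 0.6185^r relation above can be verified numerically; the tiny helper below is a worked check, not part of TRC (and it treats h as continuous, whereas in practice h must be an integer):

```python
import math

def optimal_fp(r):
    """FP rate F = (1 - e^(-hn/m))^h at the optimal h = r*ln(2),
    where r = m/n bits per entry (so n/m = 1/r)."""
    h = r * math.log(2)
    return (1 - math.exp(-h / r)) ** h

# At the optimum, (1 - e^(-ln 2))^(r ln 2) = (1/2)^(r ln 2) ~ 0.6185^r,
# so each extra bit per entry multiplies F by ~0.6185.
```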

14 Cascading BFs
BF1 (n entries), BF2 (2GF), BF3 (nF), BF4 (2GF²), ……
Total cost = (nr + 2GFr)(1 + F + F² + …)
Assuming F = c^r (corresponding to the minimum cost for standard BFs),
bits per element ≈ (1 + (2Gc^r/n)) · (r / (1 − c^r))
For decoding, s is a read iff the last BF that accepts it is odd-numbered
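The parity-based decode rule at the end of the slide can be sketched as below; plain membership predicates stand in for the Bloom filters (an assumption for illustration):

```python
def is_read(filters, s):
    """Cascading-BF decode rule: walk the cascade until a filter rejects s;
    s is classified as a read iff the last accepting filter is odd-numbered
    (1-based). filters[i] is a membership predicate standing in for a BF."""
    last = 0
    for i, accepts in enumerate(filters, start=1):
        if accepts(s):
            last = i
        else:
            break
    return last % 2 == 1
```

The intuition: even-numbered filters store the false positives of the filter above them, so an accept by BF2 cancels an accept by BF1 unless BF3 re-confirms it, and so on down the cascade.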

15 Timing and compression ratios
Coverage               10     20     30     40     50
Quip time (min)        1.5    2.9    4.5    6.2    7.2
Fastqz time (min)      7      12     16     22     36
TRC -e 20 time (min)   68     72     74     75
Quip -r time (min)     33     69     104    143    175
Quip ratio             0.22   0.21
Fastqz ratio           0.15   0.13   0.12
TRC -e 20 ratio        0.10   0.11
Quip -r ratio          0.043
Improves on reference-based methods because of the time needed to align; better compression than reference-free methods
TRC – ours; Quip (default) – reference-free, LZ type; Quip -r – reference-based; fastqz – reference-free, LZMA

16 FPs/FNs for 1 BF, e = 20, 2 BFs
coverage  BF comp. size (MB)  |reads| (M)  |Unq| (M)  |TP| (M)  |FP| (M)  |FN| (M)  bits/read  bits/unq read
10        17.4                6.3          6.2        5.2       0.63      1.0       73.8       75.0
20        31.9                12.6         12.2       10.2      1.3       2.0       72.6
30        48.5                18.9         18.0       15.0      1.7       3.0       70.3
40        61.9                25.2         23.6       19.7      2.3       4.0       69.7       74.4
50        111.8               31.5         29.1       24.1      2.5       5.0       76.0       82.3
Most of the cost is due to FPs/FNs – nearly one third of the reads
FNs depend on the error rate

17 Error free results – varying coverage
coverage  |R| (M)  |Unq| (M)  observed cBF+FP cost (MB)  |FP_i|  expected bits per read  observed bits per unique read
10        6.3      6.1        10.4                       3716    11.5                    13.6
20        12.6     11.9       15.7                       6266    9.76                    10.6
30        18.9     17.5       20.6                       8298    8.72                    9.42
40        25.2     22.7       23.0                       21382   7.98                    8.11
50        31.5     27.6       29.2                       14687   7.42                    8.46
Run time in each case: ~7 min

18 Next steps
Apply a reference-free compression tool to the FPs/FNs
Analyze the optimal parameter choice for the cBF; a standard BF is currently used
Optimize the choice of r and e together
Reference-free idea – for a chosen k ≤ 15, store k-mer counts along with the BFs to take the place of the reference
Using the counts distribution and each k-mer as a read start, guess the next base until full read length is reached
Need to show it is feasible to recover a read with a small number of guesses, to avoid an overabundance of FPs
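The guess-the-next-base idea above can be sketched greedily as follows; the function name is hypothetical, and real data would need backtracking whenever several extensions have nonzero counts:

```python
def extend_read(start_kmer, counts, read_len):
    """Reference-free sketch: grow a read from a starting k-mer by always
    taking the highest-count next base. counts maps k-mer -> abundance."""
    read = start_kmer
    k = len(start_kmer)
    while len(read) < read_len:
        # Score the 4 possible extensions of the read's last (k-1)-mer.
        ext = {b: counts.get(read[-(k - 1):] + b, 0) for b in "ACGT"}
        best = max(ext, key=ext.get)
        if ext[best] == 0:
            break                      # dead end: no supported extension
        read += best
    return read
```

Showing that the number of wrong guesses stays small is exactly the feasibility question the slide raises, since every wrong guess that the BF accepts becomes an FP to store.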

