Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections
Jouni Sirén (1), Niko Välimäki (1), Veli Mäkinen (1), and Gonzalo Navarro (2)
(1) University of Helsinki
(2) University of Chile

Compressed Self-Indexes
A compressed self-index combines a text and its full-text index in a single structure.
These data structures support several operations:
  Count the number of occurrences of a pattern.
  Locate the occurrences.
  Display a part of the text.
They often require space proportional to the high-order entropy of the text.
Many applications: text databases, pattern discovery, sequence analysis, information retrieval, data mining…

Highly Repetitive Collections
Collections of highly similar sequences, such as individual genomes.
Possibly gigabytes or terabytes in size.
Entropy is not a good measure of their compressibility.
Existing self-indexes do not handle such collections well.
LZ77-based compressors do (at least in principle).

Our New Self-Indexes
Modified versions of existing indexes:
  New index   Based on
  RLCSA       CSA [Sadakane 2000, 2003]
  RLWT        SSA [Mäkinen & Navarro 2004, 2005]
  RLFM+       RLFM [Mäkinen & Navarro 2004, 2005]
Based on run-length encoding of Ψ or a wavelet tree over the Burrows-Wheeler transform.
Main objective: overhead should be relative to the compressed size.
We only consider counting queries for now.

Experimental Results: Size (MB)
DNA: 25 x 16 MB with mutation rate 0.001

Experimental Results: Size (MB)
Source code for 75 versions of OpenSSH

Experimental Results: Counting Time (µs)
Averages over 1000 patterns of length 10

Technical Details: RLCSA
We use differential encoding of the function Ψ, which is defined so that SA[Ψ(i)] = SA[i] + 1.
A run in Ψ starting at position i becomes Ψ(i) - Ψ(i - 1) followed by a run of 1s.
Run-length encoding is used on the runs of 1s.
The resulting integers are encoded using δ-coding.
The encoding takes R (δ(σn / R) + δ(n / R)) bits, where R is the number of runs and δ(p) = log p + O(log log p).
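The encoding steps on this slide can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the RLCSA implementation: the suffix array is built by naive sorting, the names `psi_and_runs`, `decode_runs` and `elias_delta` are hypothetical, and adding k·n to the Ψ values of the k-th character (so that every difference is positive) is an assumption suggested by the σn term in the bound rather than something the slides spell out.

```python
def elias_delta(x):
    """Elias delta code of a positive integer, returned as a bit string."""
    assert x >= 1
    n_bits = x.bit_length()          # length of x in binary
    len_bits = n_bits.bit_length()   # length of that length
    # unary prefix, then n_bits in binary, then x without its leading 1
    return "0" * (len_bits - 1) + bin(n_bits)[2:] + bin(x)[3:]

def psi_and_runs(text):
    """Return (psi, runs); runs is a list of [difference, run length] pairs."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])   # naive suffix array
    rank = [0] * n
    for i, pos in enumerate(sa):
        rank[pos] = i
    psi = [rank[(pos + 1) % n] for pos in sa]       # SA[psi[i]] = SA[i] + 1 (mod n)
    alphabet = sorted(set(text))
    # Assumption: offset by k*n for the k-th character so values strictly increase.
    values = [psi[i] + alphabet.index(text[sa[i]]) * n for i in range(n)]
    runs, prev = [], -1
    for v in values:
        if runs and v == prev + 1:
            runs[-1][1] += 1                        # extend the run of 1s
        else:
            runs.append([v - prev, 1])              # new run starts with a difference
        prev = v
    return psi, runs

def decode_runs(runs, n):
    """Invert psi_and_runs: rebuild psi from the [difference, length] pairs."""
    psi, prev = [], -1
    for delta, length in runs:
        start = prev + delta
        psi.extend((start + k) % n for k in range(length))
        prev = start + length - 1
    return psi

# Single terminator, so that psi is increasing inside every character range.
text = "CCAATTGACAT" * 3 + "$"
psi, runs = psi_and_runs(text)
bits = "".join(elias_delta(d) + elias_delta(length) for d, length in runs)
print(len(text), len(runs), len(bits))
```

Decoding the pairs with `decode_runs(runs, len(text))` gives back Ψ exactly, and the repeated copies make the number of runs smaller than the text length.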

Technical Details: RLCSA
Absolute values of Ψ are sampled once every B bits of compressed data.
The samples take O((|Ψ| / B + σ) log n) bits, where |Ψ| is the size of the encoded Ψ in bits.
To retrieve Ψ(i), we first binary search the samples and then scan through the sequence of differences.
Count(P) queries take O(|P| (log (|Ψ| / B) + B)) time by using backward searching.
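A minimal sketch of sampled access to Ψ and of counting by backward search follows. It makes several simplifying assumptions that are not taken from the slides: samples are stored once every `block` runs instead of once every B bits, the runs are kept as a plain Python list instead of δ-coded bits, Ψ values are offset by k·n for the k-th character, and the names `SampledPsi`, `build_index`, `lower_bound` and `count` are hypothetical.

```python
from bisect import bisect_right

class SampledPsi:
    """Run-length encoded increasing sequence, sampled once every `block` runs."""
    def __init__(self, values, block=4):
        self.block, self.runs, prev = block, [], -1
        for v in values:
            if self.runs and v == prev + 1:
                self.runs[-1][1] += 1
            else:
                self.runs.append([v - prev, 1])
            prev = v
        self.sample_index, self.sample_prev = [], []
        i, prev = 0, -1
        for r, (delta, length) in enumerate(self.runs):
            if r % block == 0:
                self.sample_index.append(i)    # position where run r starts
                self.sample_prev.append(prev)  # value just before run r
            i += length
            prev += delta + length - 1

    def __getitem__(self, i):
        s = bisect_right(self.sample_index, i) - 1   # binary search the samples
        pos, prev = self.sample_index[s], self.sample_prev[s]
        for delta, length in self.runs[s * self.block:]:  # scan the differences
            if i < pos + length:
                return prev + delta + (i - pos)
            pos += length
            prev += delta + length - 1
        raise IndexError(i)

def build_index(text, block=4):
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])    # naive suffix array
    rank = [0] * n
    for i, pos in enumerate(sa):
        rank[pos] = i
    alphabet = sorted(set(text))
    values = [rank[(pos + 1) % n] + alphabet.index(text[pos]) * n for pos in sa]
    char_range = {}                                  # c -> half-open range of rows
    for i, pos in enumerate(sa):
        lo, _ = char_range.get(text[pos], (i, i))
        char_range[text[pos]] = (lo, i + 1)
    return SampledPsi(values, block), char_range, n

def lower_bound(psi, lo, hi, target, n):
    """First row i in [lo, hi) with psi(i) >= target, or hi if there is none."""
    while lo < hi:
        mid = (lo + hi) // 2
        if psi[mid] % n < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

def count(index, pattern):
    """Number of occurrences of a non-empty pattern, by backward search."""
    psi, char_range, n = index
    if not pattern:
        raise ValueError("pattern must be non-empty")
    if pattern[-1] not in char_range:
        return 0
    sp, ep = char_range[pattern[-1]]
    for c in reversed(pattern[:-1]):
        if c not in char_range:
            return 0
        lo, hi = char_range[c]
        sp, ep = lower_bound(psi, lo, hi, sp, n), lower_bound(psi, lo, hi, ep, n)
        if sp >= ep:
            return 0
    return ep - sp

text = "CCAATTGACAT" * 3 + "$"
index = build_index(text)
print(count(index, "CAT"), count(index, "AT"), count(index, "GG"))
```

Each backward step performs two binary searches over one character range, and every probe retrieves Ψ through a sample lookup followed by a short scan, which mirrors the two terms in the stated time bound.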

Runs in Ψ and BWT
A natural compressibility measure: we are using run-length encoding on the runs!
Bounded by high-order entropy: R(T) ≤ nH_k(T) + σ^k.
  Not that interesting, as R(T) ≤ n in any case.
Useful for highly repetitive collections:
  An edit operation creates O(log_σ n) new runs (expected case).
  Experiments suggest the bound is loose.
Edit operations include duplications, point mutations, insertions, deletions, translocations, LZ77 phrases…

Edit Operations: Duplication
A text and its Burrows-Wheeler transform:
  CCAATTGACAT$
  T C G C A C A $ T A T A
We append a duplicate:
  CCAATTGACAT$CCAATTGACAT$
  TT CC GG CC AA CC AA $$ TT AA TT AA
Another duplicate:
  CCAATTGACAT$CCAATTGACAT$CCAATTGACAT$
  TTT CCC GGG CCC AAA CCC AAA $$$ TTT AAA TTT AAA
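The duplication example can be checked with a short script. This is a toy sketch that builds the BWT by naive suffix sorting, which is only practical for tiny inputs; the helper names `bwt` and `count_runs` are hypothetical.

```python
from itertools import groupby

def bwt(text):
    """Burrows-Wheeler transform by naive suffix sorting; text must end with '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] wraps around to the final '$' when i == 0
    return "".join(text[i - 1] for i in sa)

def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for _ in groupby(s))

base = "CCAATTGACAT$"
for copies in (1, 2, 3):
    transformed = bwt(base * copies)
    print(copies, transformed, count_runs(transformed))
```

For one, two and three copies the transform has 12, 24 and 36 characters but always 12 runs: each duplicate lengthens the existing runs without creating new ones.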

Edit Operations: Point Mutation
A mutation occurs:
  CCAATTGACAT$CCAATGGACAT$CCAATTGACAT$
Contexts containing the mutation change:
  CCAATTGACAT$CCAATGGACAT$CCAATTGACAT$
BWT changes:
  Before: TTT CCC GGG CCC AAA CCC AAA $$$ TTT AAA TTT AAA
  After:  TTT CCC GGG CCC AAA CCC AAA $$$ TGTT AAA TT AAA
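The effect of the mutation on the runs can be reproduced with a small script. It is a sketch under one assumption that the slides do not state explicitly: the end markers of the three sequences are treated as distinct symbols ordered by their position in the collection, which is the convention that reproduces the transform shown here. The helper names `collection_bwt` and `count_runs` are hypothetical.

```python
from itertools import groupby

def collection_bwt(sequences):
    """BWT of a collection; the k-th '$' sorts before the (k+1)-th (assumption)."""
    text = "".join(seq + "$" for seq in sequences)
    keys, seen = [], 0
    for ch in text:
        if ch == "$":
            keys.append((0, seen))     # terminators sort first, in text order
            seen += 1
        else:
            keys.append((1, ord(ch)))
    sa = sorted(range(len(text)), key=lambda i: keys[i:])   # naive suffix sorting
    return "".join(text[i - 1] for i in sa)

def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for _ in groupby(s))

original = ["CCAATTGACAT"] * 3
mutated = ["CCAATTGACAT", "CCAATGGACAT", "CCAATTGACAT"]
before = collection_bwt(original)
after = collection_bwt(mutated)
print(before, count_runs(before))
print(after, count_runs(after))
```

The single substitution changes only a few positions of the transform, and the number of runs grows from 12 to 14.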

Future Work
How to support locate and display?
Space-efficient construction?
  The collection might not fit into memory!
Suffix tree operations?
Niko Välimäki will discuss some of these problems on Thursday.