Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections Jouni Sirén 1, Niko Välimäki 1, Veli Mäkinen 1, and Gonzalo Navarro.

Presentation transcript:

Slide 1: Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections
Jouni Sirén (1), Niko Välimäki (1), Veli Mäkinen (1), and Gonzalo Navarro (2)
(1) University of Helsinki, (2) University of Chile

Slide 2: Compressed Self-Indexes
Combine a text and its full-text index.
Data structures supporting several operations:
  Count the number of occurrences of a pattern.
  Locate the occurrences.
  Display a part of the text.
Often require space proportional to the high-order entropy of the text.
Many applications: text databases, pattern discovery, sequence analysis, information retrieval, data mining…

Slide 3: Highly Repetitive Collections
Collections of highly similar sequences, such as individual genomes.
Possibly gigabytes or terabytes in size.
Entropy is not a good measure of their compressibility.
Existing self-indexes do not handle such collections well.
LZ77-based compressors do (at least in principle).

Slide 4: Our New Self-Indexes
Modified versions of existing indexes:
  RLCSA, a modified CSA [Sadakane 2000, 2003]
  RLWT, a modified SSA [Mäkinen & Navarro 2004, 2005]
  RLFM+, a modified RLFM [Mäkinen & Navarro 2004, 2005]
Based on run-length encoding of Ψ or a wavelet tree over the Burrows-Wheeler transform.
Main objective: overhead should be relative to the compressed size.
We only consider counting queries for now.

Slide 5: Experimental Results: Size (MB)
DNA, 25 x 16 MB with mutation rate 0.001.

Slide 6: Experimental Results: Size (MB)
Source code for 75 versions of OpenSSH.

Slide 7: Experimental Results: Counting Time (µs)
Averages over 1000 patterns of length 10.

Slide 8: Technical Details: RLCSA
We use differential encoding of the function Ψ, which is defined so that SA[Ψ(i)] = SA[i] + 1.
A run in Ψ starting at position i becomes Ψ(i) - Ψ(i - 1) followed by a run of 1s.
Run-length encoding is used on the runs of 1s.
The resulting integers are encoded using δ-coding.
The encoding takes R (δ(σn / R) + δ(n / R)) bits, where δ(p) = log p + O(log log p).
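The encoding on this slide can be illustrated with a minimal Python sketch. It is not the RLCSA implementation: the function names are made up for illustration, the suffix array is built naively, the last suffix is assumed to wrap around to position 0, and the differences are kept as plain signed integers over the whole of Ψ rather than being handled per character and δ-coded into a bit stream. `delta_code_length` only reports how many bits the Elias δ-code of a positive integer would take.

```python
def suffix_array(text):
    """Naive suffix array: fine for a toy text, far too slow for real collections."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def psi_array(text):
    """Psi[i] is the rank of the suffix that starts one position after suffix SA[i]."""
    sa = suffix_array(text)
    n = len(text)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    # Assumption of this sketch: the last suffix wraps around to position 0.
    return [rank[(pos + 1) % n] for pos in sa]


def encode_psi(psi):
    """Return (delta, ones) pairs: delta starts a run, ones counts the 1-differences after it."""
    pairs = []
    prev = 0
    i = 0
    while i < len(psi):
        delta = psi[i] - prev  # may be negative here; signed ints are a simplification
        j = i + 1
        while j < len(psi) and psi[j] == psi[j - 1] + 1:
            j += 1
        pairs.append((delta, j - i - 1))
        prev = psi[j - 1]
        i = j
    return pairs


def decode_psi(pairs):
    """Inverse of encode_psi."""
    psi = []
    prev = 0
    for delta, ones in pairs:
        prev += delta
        psi.append(prev)
        for _ in range(ones):
            prev += 1
            psi.append(prev)
    return psi


def delta_code_length(p):
    """Length in bits of the Elias delta code of an integer p >= 1."""
    n = p.bit_length() - 1  # floor(log2 p)
    return n + 2 * ((n + 1).bit_length() - 1) + 1


text = "CCAATTGACAT$" * 3
psi = psi_array(text)
pairs = encode_psi(psi)
print(len(psi), "values of psi stored as", len(pairs), "(delta, run) pairs")
```

On a repetitive text most of Ψ collapses into a few long runs, which is why the bound on the slide is stated in terms of the number of runs R rather than the text length n.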

Slide 9: Technical Details: RLCSA
Absolute values of Ψ are sampled once every B bits of compressed data.
The samples take O((|Ψ| / B + σ) log n) bits.
To retrieve Ψ(i), we first binary search the samples and then scan through the sequence of differences.
Count(P) queries take O(|P| (log (|Ψ| / B) + B)) time by using backward searching.
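Backward searching over Ψ can be sketched as follows. This is an illustration under simplifying assumptions, not the RLCSA code: Ψ is stored as a plain Python list, so each access is a direct lookup instead of a binary search over samples followed by a scan of differences, and `build_index` and `count` are hypothetical names. The sketch relies on the fact that Ψ is increasing inside the range of suffixes that start with the same character, and it is meant for patterns that do not contain the terminator `$`.

```python
from bisect import bisect_left


def build_index(text):
    """Return (psi, ranges): psi as a plain list, ranges[c] = half-open rank range of suffixes starting with c."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive construction
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    psi = [rank[(pos + 1) % n] for pos in sa]
    ranges = {}
    for r, pos in enumerate(sa):
        c = text[pos]
        lo, _ = ranges.get(c, (r, r))
        ranges[c] = (lo, r + 1)
    return psi, ranges


def count(pattern, psi, ranges):
    """Count occurrences of pattern by processing it from its last character to its first."""
    if not pattern:
        return len(psi)
    if pattern[-1] not in ranges:
        return 0
    sp, ep = ranges[pattern[-1]]  # suffixes starting with the last character
    for c in reversed(pattern[:-1]):
        if c not in ranges:
            return 0
        lo, hi = ranges[c]
        # Keep the suffixes starting with c whose successor lies in [sp, ep).
        # psi is increasing on [lo, hi), so two binary searches suffice.
        sp, ep = bisect_left(psi, sp, lo, hi), bisect_left(psi, ep, lo, hi)
        if sp >= ep:
            return 0
    return ep - sp


text = "CCAATTGACAT$CCAATGGACAT$CCAATTGACAT$"
psi, ranges = build_index(text)
for pattern in ("CAT", "TTG", "GG"):
    print(pattern, count(pattern, psi, ranges))
```

Each pattern character costs two binary searches here. In the compressed setting described on the slide, every probe of Ψ additionally pays for locating a sample and scanning up to B bits, which is where the log (|Ψ| / B) + B factor comes from.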

Slide 10: Runs in Ψ and BWT
A natural compressibility measure: we are using run-length encoding on the runs!
Bounded by high-order entropy: R(T) ≤ nH_k(T) + σ^k.
  Not that interesting, as R(T) ≤ n in any case.
Useful for highly repetitive collections:
  An edit operation creates O(log_σ n) new runs (expected case).
  Experiments suggest the bound is loose.
Edit operations include duplications, point mutations, insertions, deletions, translocations, LZ77 phrases…

Slide 11: Edit Operations: Duplication
A text and its Burrows-Wheeler transform:
  CCAATTGACAT$
  T C G C A C A $ T A T A
We append a duplicate:
  CCAATTGACAT$CCAATTGACAT$
  TT CC GG CC AA CC AA $$ TT AA TT AA
Another duplicate:
  CCAATTGACAT$CCAATTGACAT$CCAATTGACAT$
  TTT CCC GGG CCC AAA CCC AAA $$$ TTT AAA TTT AAA
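The duplication example can be reproduced with a short Python sketch. It assumes a naive construction that sorts the suffixes of the concatenated text (with `$` smaller than the letters) and takes the character preceding each suffix, wrapping around for the first one; `bwt` and `count_runs` are illustrative names, not part of any index implementation.

```python
def bwt(text):
    """Burrows-Wheeler transform via naive suffix sorting (toy texts only)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] wraps to the last character when i == 0.
    return "".join(text[i - 1] for i in sa)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


base = "CCAATTGACAT$"
for copies in (1, 2, 3):
    transformed = bwt(base * copies)
    print(copies, transformed, count_runs(transformed))
```

The transform grows with every appended copy, but the number of runs stays the same, so a run-length encoding of the BWT grows only in the lengths it stores.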

Slide 12: Edit Operations: Point Mutation
A mutation occurs:
  CCAATTGACAT$CCAATGGACAT$CCAATTGACAT$
Contexts containing the mutation change:
  CCAATTGACAT$CCAATGGACAT$CCAATTGACAT$
BWT changes:
  Before: TTT CCC GGG CCC AAA CCC AAA $$$ TTT AAA TTT AAA
  After:  TTT CCC GGG CCC AAA CCC AAA $$$ TGTT AAA TT AAA
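The effect of the point mutation on the runs can be checked with a small Python sketch. It is an illustration only: the BWT is built by naive suffix sorting of the concatenated text, and `bwt` and `run_lengths` are hypothetical helper names. Because this construction breaks ties between the `$` terminators by the text that follows them, the order of symbols inside a block may differ from the slide, while the run count in this example comes out the same.

```python
from itertools import groupby


def bwt(text):
    """Burrows-Wheeler transform via naive suffix sorting (toy texts only)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)


def run_lengths(s):
    """Run-length encoding of s as (character, length) pairs."""
    return [(c, len(list(group))) for c, group in groupby(s)]


original = "CCAATTGACAT$" * 3
# Replace the second T of the second copy (index 17) with G, as on the slide.
mutated = original[:17] + "G" + original[18:]

before = run_lengths(bwt(original))
after = run_lengths(bwt(mutated))
print(len(before), "runs before the mutation")
print(len(after), "runs after the mutation")
print(after)
```

Only the blocks whose contexts contain the mutated position are split, so a single edit adds a handful of runs rather than changing the whole transform.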

Slide 13: Future Work
How to support locate and display?
Space-efficient construction? The collection might not fit into memory!
Suffix tree operations?
Niko Välimäki will discuss some of these problems on Thursday.