Compressed Suffix Arrays for Massive Data Jouni Sirén SPIRE 2009.

Presentation transcript:

Compressed Suffix Arrays for Massive Data Jouni Sirén SPIRE 2009

Constructing Really Big Compressed Suffix Arrays Jouni Sirén SPIRE 2009

Outline
1) Introduction
2) Merging Compressed Suffix Arrays
3) Incremental CSA Construction for a Single Sequence
4) Conclusions

Compressed Suffix Arrays
Suffix Arrays (SA) are full-text indexes for arbitrary texts. They consist of pointers to the suffixes of the text in lexicographic order.
Quite large: 4 to 8 times the size of the text.
Compressed Suffix Arrays (CSA) can be much smaller on highly repetitive texts.
Example: the Finnish-language Wikipedia with version history (fiwiki). The text takes 42 GB and the SA 210 or 336 GB. A CSA requires just 2.1 GB, including the text.
How to build a CSA for such a huge collection?
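As a minimal sketch of the definition above, the following Python builds a suffix array by sorting the start positions of all suffixes. The function name `suffix_array`, the sample text, and the naive comparison sort are illustrative assumptions; this is not one of the construction algorithms discussed in the talk, and it would not scale to large inputs.

```python
def suffix_array(text):
    """Return the start positions of the suffixes of text in lexicographic order.

    Naive approach: materialises every suffix as a sort key, so it is only
    suitable for tiny examples.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "abbaa$"  # "$" is the end marker and sorts before the letters in ASCII
print(suffix_array(text))
# → [5, 4, 3, 0, 2, 1]  ("$", "a$", "aa$", "abbaa$", "baa$", "bbaa$")
```

Each entry is a pointer into the text, which is why the plain suffix array costs several bytes per text character.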

The Standard Approach
Build a regular SA and compress it.
Quite fast: O(n) or O(n log n) in theory.
  ~1 MB/s for the best algorithms on current hardware.
  About 12 hours for fiwiki.
Memory intensive: need space for the text and the SA.
  At least 5 to 9 times the size of the text.
  At least 250 GB for fiwiki.
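The memory figures can be checked with simple arithmetic. The sketch below assumes one byte per text character and 5-byte or 8-byte suffix pointers; these pointer widths are an assumption chosen because they reproduce the 210 GB and 336 GB suffix array sizes quoted for fiwiki, not something the slides state.

```python
TEXT_GB = 42  # size of the fiwiki text, from the slides


def suffix_array_gb(text_gb, pointer_bytes):
    """Size of a plain SA: one pointer per text position (1 byte per character assumed)."""
    return text_gb * pointer_bytes


for pointer_bytes in (5, 8):  # assumed pointer widths
    sa_gb = suffix_array_gb(TEXT_GB, pointer_bytes)
    print(pointer_bytes, sa_gb, TEXT_GB + sa_gb)
# prints "5 210 252" and "8 336 378"
```

With 5-byte pointers the text plus the SA already needs 252 GB, which is consistent with the "at least 250 GB" estimate.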

Other Approaches
Secondary memory, dynamic indexes, direct CSA construction …
Only a few implementations are available.
Most are quite slow in practice.
  Often just 50 to 100 kB/s.
  5 to 10 days for fiwiki.
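The running-time estimates follow directly from the quoted throughputs. This is a back-of-the-envelope sketch that assumes 1 GB = 10^9 bytes and a constant construction speed; the helper name `build_seconds` is hypothetical.

```python
TEXT_BYTES = 42e9  # fiwiki, assuming 1 GB = 10**9 bytes


def build_seconds(text_bytes, bytes_per_second):
    """Construction time at a constant throughput."""
    return text_bytes / bytes_per_second


# Standard approach at ~1 MB/s, reported in hours.
hours_fast = build_seconds(TEXT_BYTES, 1e6) / 3600
# Slower approaches at 100 kB/s and 50 kB/s, reported in days.
days_slow = [build_seconds(TEXT_BYTES, rate) / 86400 for rate in (100e3, 50e3)]

print(round(hours_fast, 1), [round(d, 1) for d in days_slow])
# → 11.7 [4.9, 9.7]
```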

The Best Algorithms So Far
Space-efficient BWT construction (Kärkkäinen 2007).
  Keeps the text in memory.
Distributed SA construction (Kulla and Sanders 2007).
  Keeps the text, the SA, and a lot of temporary data in distributed memory.
Merge suffix arrays in secondary memory (Gonnet et al. 1992).
  I/O volume is O(n^2 / M) words, which quickly becomes slow when the input grows past memory size.

Our Approach
Build several smaller CSAs and merge them.
Allows efficient parallel and distributed implementations.
Easy to get O(n log n) time in |CSA| + O(n) bits.
Results for the 42 GB fiwiki example:
  9.6 hours and 32 GB on 8 cores.
  25 hours and 8 GB on 2 cores.
Supports bulk insertions and deletions.

Burrows-Wheeler Transform
CSAs are based on a permutation of the text called the Burrows-Wheeler Transform (BWT).
BWT is related to SA: BWT[i] = T[SA[i] - 1].
A CSA is essentially a compressed representation of the BWT supporting rank_c(i) and select_c(i).
We generalize BWT for multiple texts.
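The relation BWT[i] = T[SA[i] - 1] can be sketched directly in Python. The helper names, the 0-based indexing, and the exact conventions for `rank` (occurrences in the prefix before position i) and `select` (position of the k-th occurrence, counting from 1) are assumptions made for this illustration; a real CSA answers these queries on compressed data rather than by scanning a string.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """BWT[i] = T[SA[i] - 1]; index -1 wraps around to the end marker."""
    return "".join(text[i - 1] for i in sa)


def rank(bwt, c, i):
    """Number of occurrences of c in bwt[0:i] (linear scan, for illustration)."""
    return bwt[:i].count(c)


def select(bwt, c, k):
    """Position of the k-th occurrence of c (k >= 1), or -1 if there is none."""
    pos = -1
    for _ in range(k):
        pos = bwt.find(c, pos + 1)
        if pos == -1:
            return -1
    return pos


text = "abbaa$"
bwt = bwt_from_sa(text, suffix_array(text))
print(bwt, rank(bwt, "a", 3), select(bwt, "b", 2))
# → aab$ba 2 4
```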

BWT of a Collection
Let $_i < $_(i+1) < c for every c ∈ Σ, and let T_1 = ababbaa$_1, T_2 = abbaa$_2, T_3 = babba$_3.
[Figure: the rotations of the concatenation T_1T_2T_3 in sorted order.]
The BWT of T_1T_2T_3 is aaaaabbb$_3bb$_1bbb$_2aaaa.
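The BWT of the concatenation can be reproduced with a few lines of Python. As an encoding assumption, the end markers $_1, $_2, $_3 are written as the digits "1", "2", "3", which sort before the letters and in the required order; the naive sort is for illustration only.

```python
# End markers $_1, $_2, $_3 are encoded as the digits "1", "2", "3" (assumption).
T1, T2, T3 = "ababbaa1", "abbaa2", "babba3"
text = T1 + T2 + T3

# Sort suffix start positions; every marker is unique, so no suffix is a
# prefix of another and the order matches the order of the rotations.
sa = sorted(range(len(text)), key=lambda i: text[i:])

# BWT[i] = T[SA[i] - 1], wrapping around at the start of the text.
bwt = "".join(text[i - 1] for i in sa)
print(bwt)
# → aaaaabbb3bb1bbb2aaaa
```

Replacing each digit d by $_d gives the string shown on the slide.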

Generalized BWT
Each character had a unique sort key:
  The suffix starting at the next position for regular characters.
  The next text in cyclical order for end markers.
What if we used the same sort keys as in the BWTs of individual texts? What about {T_1T_2, T_3}?

Collection         BWT
T_1T_2T_3          aaaaabbb$_3bb$_1bbb$_2aaaa
{T_1, T_2, T_3}    aaaaabbb$_1bb$_2bbb$_3aaaa
{T_1T_2, T_3}      aaaaabbb$_2bb$_1bbb$_3aaaa
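The three BWTs listed for this example can be reproduced with the sketch below. It assumes that the generalized BWT sorts the suffixes of all texts together and takes, for each suffix, the preceding character within its own text (wrapping around cyclically). The function name `generalized_bwt` and the encoding of $_1, $_2, $_3 as the digits "1", "2", "3" are illustrative assumptions.

```python
def generalized_bwt(texts):
    """BWT of a collection: sort all suffixes of all texts together and emit
    the character preceding each suffix in its own text (cyclically)."""
    entries = []
    for t in texts:
        for i in range(len(t)):
            entries.append((t[i:], t[i - 1]))  # (sort key, preceding character)
    entries.sort()  # keys are distinct because every end marker is unique
    return "".join(prev for _, prev in entries)


# End markers $_1, $_2, $_3 are encoded as "1", "2", "3" (assumption).
T1, T2, T3 = "ababbaa1", "abbaa2", "babba3"

print(generalized_bwt([T1 + T2 + T3]))  # → aaaaabbb3bb1bbb2aaaa
print(generalized_bwt([T1, T2, T3]))    # → aaaaabbb1bb2bbb3aaaa
print(generalized_bwt([T1 + T2, T3]))   # → aaaaabbb2bb1bbb3aaaa
```

Only the positions of the end markers differ between the three results, because only the first suffix of each text has a preceding character that depends on how the texts are grouped.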

Some Bit Vectors
A CSA is a compressed representation of the bit vectors Ψ_c.
[Figure: the bit vectors Ψ_a and Ψ_b for BWT(T_1T_2, T_3) = aaaaabbb$bb$bbb$aaaa, together with a bit vector I that marks the positions coming from BWT(T_3).]
BWT(T_1T_2) and BWT(T_3) are subsequences of BWT(T_1T_2, T_3).
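The subsequence property can be checked on the running example. The sketch below computes the merged order by brute force, tags each suffix with its origin, and reads off an indicator vector. It assumes that I has a 1 wherever the merged BWT takes its character from BWT(T_3); the function name and the digit encoding of the end markers are also assumptions.

```python
def merged_order(collection_a, text_b):
    """Sort the suffixes of collection A and text B together.

    Each entry is (suffix, preceding character in its own text, source),
    where source is 0 for A and 1 for B.
    """
    entries = [(t[i:], t[i - 1], 0) for t in collection_a for i in range(len(t))]
    entries += [(text_b[i:], text_b[i - 1], 1) for i in range(len(text_b))]
    entries.sort()
    return entries


# End markers $_1, $_2, $_3 are encoded as "1", "2", "3" (assumption).
order = merged_order(["ababbaa1abbaa2"], "babba3")

merged_bwt = "".join(prev for _, prev, _ in order)
indicator = "".join(str(src) for _, _, src in order)
bwt_a = "".join(prev for _, prev, src in order if src == 0)
bwt_b = "".join(prev for _, prev, src in order if src == 1)

print(merged_bwt)  # → aaaaabbb2bb1bbb3aaaa
print(indicator)   # → 00100100010010010100
print(bwt_a)       # → aaaabb2b1bbaaa
print(bwt_b)       # → abbb3a
```

Splitting the merged BWT by the indicator bits yields exactly the two individual BWTs, which is what makes merging by interleaving possible.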

Merging BWTs
If we have BWT(T_1T_2) and BWT(T_3), we can use I to merge them to get BWT(T_1T_2, T_3).
The rank of a suffix of T_3 among the suffixes of {T_1T_2, T_3} is the sum of its ranks among the suffixes of each part (Hon et al. 2007).
We can use backward searching to get the ranks among the suffixes of T_1T_2.
We get the ranks among the suffixes of T_3 implicitly from BWT(T_3).
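The rank-sum property can be verified by brute force on the running example. This sketch uses 0-based ranks (the number of smaller suffixes) and encodes the end markers as digits; both choices are assumptions for the illustration, and explicit sorted suffix lists stand in for the backward search.

```python
from bisect import bisect_left


def sorted_suffixes(text):
    """All suffixes of text in lexicographic order (naive)."""
    return sorted(text[i:] for i in range(len(text)))


suffixes_a = sorted_suffixes("ababbaa1abbaa2")  # T_1T_2
suffixes_b = sorted_suffixes("babba3")          # T_3
merged = sorted(suffixes_a + suffixes_b)

for rank_in_b, suffix in enumerate(suffixes_b):
    # Number of suffixes of A that are smaller than this suffix of B.
    rank_in_a = bisect_left(suffixes_a, suffix)
    # Rank in the merged collection = rank among A + rank among B.
    assert merged.index(suffix) == rank_in_a + rank_in_b

print([merged.index(s) for s in suffixes_b])
# → [2, 5, 9, 12, 15, 17]
```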

The Algorithm
Merge(BWT(A), BWT(B), B):
    # The ranks among the suffixes of A
    I := BackwardSearch(BWT(A), B)
    Sort(I)
    # Add the ranks among the suffixes of B
    for i := 1 to |B|:
        I[i] := I[i] + i
    BWT(A, B) := Interleave(BWT(A), BWT(B), I)
    return BWT(A, B)
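The pseudocode can be turned into a small runnable sketch. It assumes plain Python strings for the BWTs, 0-based ranks, digits as end markers, and that the end marker of B does not occur in A. The rank queries are linear scans here, whereas an actual CSA would answer them on compressed bit vectors, so this shows the logic and not the space or time behaviour. The name `merge_bwts` is hypothetical.

```python
from collections import Counter


def merge_bwts(bwt_a, bwt_b, text_b):
    """Merge BWT(A) and BWT(B) into BWT(A, B), where B is a single text."""
    counts = Counter(bwt_a)

    def smaller(c):
        # Number of characters in A that are smaller than c.
        return sum(n for ch, n in counts.items() if ch < c)

    # Backward search: rank of every suffix of B among the suffixes of A,
    # starting from the suffix that consists of B's end marker only.
    r = smaller(text_b[-1])
    ranks = [r]
    for c in reversed(text_b[:-1]):
        # Suffixes of A smaller than cX: those starting with a smaller
        # character, plus those starting with c and continuing with a
        # suffix smaller than X.
        r = smaller(c) + bwt_a[:r].count(c)
        ranks.append(r)

    # Sort, then add the rank among the suffixes of B.
    ranks.sort()
    from_b = {r + i for i, r in enumerate(ranks)}

    # Interleave: positions in from_b take the next character of BWT(B).
    it_a, it_b = iter(bwt_a), iter(bwt_b)
    return "".join(next(it_b) if p in from_b else next(it_a)
                   for p in range(len(bwt_a) + len(bwt_b)))


# A = T_1T_2, B = T_3, with $_1, $_2, $_3 encoded as "1", "2", "3".
print(merge_bwts("aaaabb2b1bbaaa", "abbb3a", "babba3"))
# → aaaaabbb2bb1bbb3aaaa
```

The result agrees with BWT(T_1T_2, T_3) from the earlier example once the digits are read as end markers.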

A Related Algorithm
Hon et al.: A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays. Algorithmica 48, pp. 23–36, 2007.
Incremental construction of CSAs for a single sequence.
The algorithm can be improved with some of our ideas.

A Comparison between the Algorithms

Our Algorithm                                  Hon et al.
Merge CSAs                                     Incremental construction
Multiple sequences                             A single sequence
Split between sequences                        Split anywhere
CSA(T_1), CSA(T_2), array of |T_2| integers    CSA(T_1), 4 arrays of |T_2| integers
Merge bit vectors                              Merge values of Ψ

Improving the Algorithm of Hon et al.
We already have CSA(Y), and want to build CSA(XY).
We call the first l = |X| suffixes of T the long suffixes of T. Let SA_l(T) be the subsequence of SA(T) containing them.
SA_l(XY) can be built by using X, SA_l(Y), and the length l prefix of Y (Hon et al. 2007).
We could build a CSA_l(XY) instead.
We could also build all the CSAs before merging them.
Hence the peak memory usage becomes CSA(Y), CSA_l(XY), and an array of l integers.
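To make the definition of SA_l concrete, the sketch below extracts the long suffixes from a naively built suffix array. The sample strings X and Y and the helper names are illustrative assumptions; the algorithm of Hon et al. builds SA_l(XY) without constructing the full SA(XY), which this sketch does not attempt.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def long_suffixes(sa, l):
    """SA_l(T): the subsequence of SA(T) pointing to the first l suffixes of T."""
    return [i for i in sa if i < l]


X, Y = "ab", "baa$"  # hypothetical split of T = XY
sa = suffix_array(X + Y)
print(sa, long_suffixes(sa, len(X)))
# → [5, 4, 3, 0, 2, 1] [0, 1]
```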

The Improved Algorithm

Our Algorithm                                  Hon et al.
Merge CSAs                                     Incremental construction
Multiple sequences                             A single sequence
Split between sequences                        Split anywhere
CSA(T_1), CSA(T_2), array of |T_2| integers    CSA(T_1), 4 arrays of |T_2| integers
Merge bit vectors                              Merge values of Ψ

Conclusions
We have presented a practical space-efficient algorithm for merging two CSAs.
Improved space-efficient CSA construction:
  A parallel implementation can index tens of gigabytes in a reasonable time.
  A distributed implementation should be able to handle multiple terabytes.
Also possible:
  Bulk insertions and deletions on a static CSA.
  Merge wavelet trees and build FM-indexes directly.