Compressed Suffix Arrays and Suffix Trees. Roberto Grossi, Jeffrey Scott Vitter.

2 Outline: Reminders; Motivation; Compression results; Time & space bounds; Compressed Suffix Tree; Compressed Suffix Array; Proof of bounds

3 Reminder - Symbols: T = t_1 t_2 ... t_{n-1} is a text of length n-1; the end-of-file symbol # is at the n-th position; T[i,n] is suffix i of text T, for i = 1,…,n

4 Reminder - Symbols: P = p_1 p_2 ... p_m is a pattern of length m; 0 < ε ≤ 1

5 Reminder - Main Goal: search for string pattern P within text T; support fast queries; text T is fully scanned only once

6 Reminder – Suffix Trees: the leaf with value i represents suffix T[i,n]; build time O(n); search time O(m); structure space O(n)

7 Reminder – Suffix Arrays: the suffixes are ordered lexicographically; SA[i] = the starting position in T of the i-th suffix in that order. Example: Σ = {a,b}, a < # < b, T = bbba#; sorted suffixes: a#, #, ba#, bba#, bbba#
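As a rough illustration of this definition, the Python sketch below builds SA for the slide's text T = bbba# by sorting all suffixes directly. The names ORDER and suffix_array are made up for this sketch, and the naive sort is only meant to show what the array contains, not how it is built efficiently.

```python
# Symbol order used on the slide: a < # < b (ORDER is a name chosen for this sketch).
ORDER = {"a": 0, "#": 1, "b": 2}

def suffix_array(text):
    """Return the 1-based starting positions of the suffixes in sorted order."""
    def key(i):
        # Compare suffixes symbol by symbol using the slide's order.
        return [ORDER[c] for c in text[i - 1:]]
    return sorted(range(1, len(text) + 1), key=key)

T = "bbba#"
SA = suffix_array(T)
for rank, start in enumerate(SA, start=1):
    print(rank, start, T[start - 1:])   # i, SA[i], and the i-th suffix
```

The printed suffixes appear in the order a#, #, ba#, bba#, bbba#, matching the slide.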

8 Reminder – Suffix Arrays: build time O(n log n); search time O(m + log n); structure space O(n)
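A minimal sketch of searching with a suffix array follows. It uses plain binary search, which costs O(m log n) symbol comparisons rather than the O(m + log n) quoted on the slide (that bound needs extra longest-common-prefix information, omitted here). The function names and the ORDER table are assumptions of this sketch.

```python
ORDER = {"a": 0, "#": 1, "b": 2}   # a < # < b, as on the earlier slide

def suffix_array(text):
    """Naive 1-based suffix array, for illustration only."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])

def find_occurrences(text, sa, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    p = [ORDER[c] for c in pattern]
    m = len(p)

    def prefix(start):
        # First m symbols of the suffix that starts at 1-based position start.
        return [ORDER[c] for c in text[start - 1:start - 1 + m]]

    lo, hi = 0, len(sa)
    while lo < hi:                      # leftmost suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(sa[mid]) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # leftmost suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(sa[mid]) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

T = "bbba#"
SA = suffix_array(T)
print(find_occurrences(T, SA, "bb"))
```

All suffixes that start with the pattern are contiguous in SA, which is why two binary searches are enough to delimit them.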

9 Motivation: the structures so far are fast to search but greedy in space; there is a need for space-efficient text indexing; goal: reduce both space and query time

10 Compressed Suffix Tree: build time O(n); search time O(m/log n + (log n)^ε); structure space (ε^{-1} + O(1))*n bits

11 Compressed Suffix Tree: build suffix array; build compressed suffix tree; Patricia tries; compress suffix array

12 CSA Basic Operations: Compress(T,SA) returns a succinct representation of SA, retaining T and discarding SA; Lookup(i) returns SA[i] using the compressed SA

13 CSA Primary Measures: for Compress, the preprocessing time and the space of the compressed SA; for lookup, the query time

14 Compressed Suffix Array: build time O(n); structure space ½*n*log log n + O(n) bits; lookup time O(log log n)

15 Suffix Arrays Optimization: main idea; decomposition scheme; recursive structure of permutations

16 Decomposition Scheme: levels k = 0,…,l; SA_0 = SA (the original SA); n_0 = n, where n = |T|; assumption: n is a power of 2; n_k = n/2^k; SA_k is a permutation of {1,2,…,n_k}

17 SA_k Succinct Representation, 4 main steps: 1. Produce bit vector B_k; 2. Map the 0’s of B_k to 1’s; 3. Count the 1’s in each prefix of B_k, using function rank_k(j); 4. ‘Pack’ SA_k

18 Step #1: Produce bit vector B_k: |B_k| = n_k; B_k[i] = 1 if SA_k[i] is even; B_k[i] = 0 if SA_k[i] is odd. Example: T = bba#, SA_0 = 2 4 3 1, B_0 = 1 1 0 0
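Step #1 can be sketched in a couple of lines of Python, taking the SA_0 values printed on the slide as input. The function name bit_vector is chosen for this sketch, and Python lists are 0-based while the slides count from 1.

```python
def bit_vector(sa_k):
    """B_k[i] = 1 if SA_k[i] is even, 0 if it is odd."""
    return [1 if value % 2 == 0 else 0 for value in sa_k]

sa0 = [2, 4, 3, 1]        # SA_0 as printed on the slide
b0 = bit_vector(sa0)
print(b0)
```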

19 Step #2: Map the 0’s of B_k to 1’s: new function Ψ_k(i), i = 1,…,n_k; Ψ_k(i) = j if SA_k[i] is odd and SA_k[j] = SA_k[i]+1; Ψ_k(i) = i otherwise (SA_k[i] is even). Example: T = bba#, SA_0 = 2 4 3 1, B_0 = 1 1 0 0, Ψ_0 = 1 2 2 1
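The definition of Ψ_k can be sketched as follows, again on the SA_0 values from the slide. The helper name psi and the dummy entry at index 0 (used to keep the slide's 1-based indexing) are choices made for this sketch.

```python
def psi(sa_k):
    """Return Psi_k as a list indexed from 1 (entry 0 is an unused placeholder)."""
    # position[v] = the 1-based index j with SA_k[j] = v
    position = {value: j for j, value in enumerate(sa_k, start=1)}
    result = [0]
    for i, value in enumerate(sa_k, start=1):
        if value % 2 == 1:
            result.append(position[value + 1])   # odd entry: jump to its even successor
        else:
            result.append(i)                     # even entry: maps to itself
    return result

sa0 = [2, 4, 3, 1]        # SA_0 as printed on the slide
psi0 = psi(sa0)
print(psi0[1:])
```

Every odd entry is thus mapped to an index whose bit in B_k is 1, which is the sense in which the 0's are mapped to 1's.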

20 Step #3: Count the 1’s of B_k: recall function rank_k(j), j = 1,…,n_k; rank_k(j) = number of 1’s in the first j bits of B_k. Example: T = bba#, SA_0 = 2 4 3 1, B_0 = 1 1 0 0, rank_0 = 1 2 2 2
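For illustration, rank_k can be tabulated with a running sum. This sketch stores every rank explicitly, which is simpler but larger than the succinct rank structure the talk relies on; rank_table is a name chosen here.

```python
from itertools import accumulate

def rank_table(b_k):
    """Entry j-1 of the result holds rank_k(j), the number of 1s among the first j bits."""
    return list(accumulate(b_k))

b0 = [1, 1, 0, 0]         # B_0 from the slide
rank0 = rank_table(b0)
print(rank0)
```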

21 Step #4: ‘Pack’ SA_k: take the even values of SA_k and divide them by 2; the result is a new permutation of {1,2,…,n_{k+1}}, with n_{k+1} = n_k/2 = n/2^{k+1}; store the new permutation in SA_{k+1} and remove SA_k; |SA_{k+1}| = |SA_k|/2
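A small sketch of the packing step, using the SA_0 values from the earlier slides as input; pack is a name chosen for this sketch.

```python
def pack(sa_k):
    """Keep the even values of SA_k, in order, and halve them to obtain SA_{k+1}."""
    return [value // 2 for value in sa_k if value % 2 == 0]

sa0 = [2, 4, 3, 1]        # SA_0 as printed on the earlier slides
sa1 = pack(sa0)
print(sa1)
```

The result has half as many entries and is again a permutation, so the same four steps can be applied to it.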

22 Example: level 0, steps 1-3

23 Example: level 0, step 4

24 Lemma: Reconstruct SA_k: the results of phase k are B_k, Ψ_k, rank_k and SA_{k+1}; SA_k is reconstructed as SA_k[i] = 2*SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] - 1), for i = 1,…,n_k

25 Proof, case 1, B_k[i] = 1: to show SA_k[i] = 2*SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] - 1). By step #4, SA_k[i]/2 is stored in the rank_k(i)-th entry of SA_{k+1}, so SA_k[i] = 2*SA_{k+1}[rank_k(i)]; by step #2, Ψ_k(i) = i

26 Proof, case 2, B_k[i] = 0: to show SA_k[i] = 2*SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] - 1). Let Ψ_k(i) = j; by step #2, SA_k[i] = SA_k[j] - 1 and B_k[j] = 1; applying case 1 to j gives SA_k[j] = 2*SA_{k+1}[rank_k(j)]

27 Example, case 1, B_k[i] = 1: SA_0[2] = ? B_0[2] = 1, Ψ_0(2) = 2, rank_0(2) = 1; SA_0[2]/2 is stored in the 1st entry of SA_1; SA_0[2] = 2*SA_1[1] = 2*8 = 16

28 Example, case 2, B_k[i] = 0: SA_0[3] = ? B_0[3] = 0, Ψ_0(3) = 14, rank_0(14) = 6; SA_0[14] = 2*SA_1[6] = 2*16 = 32; SA_0[3] = SA_0[14] - 1 = 32 - 1 = 31
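The lemma and the two examples can be checked mechanically. The sketch below assumes that the slides' running example is the 32-symbol text from Grossi and Vitter's paper (the figures are missing from this transcript, so the text itself is an assumption); with that text the quoted values SA_0[2] = 16 and SA_0[3] = 31 come out as stated. All function names are chosen for this sketch.

```python
from itertools import accumulate

ORDER = {"a": 0, "#": 1, "b": 2}                  # a < # < b
T = "abbabbabbabbabaaabababbabbbabba#"            # assumed example text, n = 32

def suffix_array(text):
    """Naive 1-based suffix array, for illustration only."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])

def level(sa_k):
    """Return (B_k, Psi_k, rank_k, SA_{k+1}); every list has a dummy entry at index 0."""
    pos = {v: i for i, v in enumerate(sa_k) if i}             # value -> index
    b = [0] + [1 - v % 2 for v in sa_k[1:]]                   # 1 for even values
    psi = [0] + [i if b[i] else pos[sa_k[i] + 1] for i in range(1, len(sa_k))]
    rank = list(accumulate(b))                                # rank[j] = 1s in b[1..j]
    nxt = [0] + [v // 2 for v in sa_k[1:] if v % 2 == 0]      # packed SA_{k+1}
    return b, psi, rank, nxt

sa0 = [0] + suffix_array(T)
b0, psi0, rank0, sa1 = level(sa0)

# Apply the lemma to every index i = 1..n_0.
rebuilt = [0] + [2 * sa1[rank0[psi0[i]]] + (b0[i] - 1) for i in range(1, len(sa0))]
print(rebuilt == sa0, rebuilt[2], rebuilt[3])
```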

29 Example - Decomposition

30 Determining l: n_0 = n = 32; n_3 = 4 ~ n/log n; SA_l, with about n/log n entries of log n bits each, can be stored in ≤ n bits; conclusion: l = ⌈log log n⌉
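The arithmetic behind this choice of l can be sketched as below, assuming base-2 logarithms, n a power of 2, and l rounded up to an integer. The function name is chosen for this sketch.

```python
import math

def number_of_levels(n):
    """l = ceil(log2(log2(n))): the first level at which n / 2**l <= n / log2(n)."""
    return math.ceil(math.log2(math.log2(n)))

n = 32
l = number_of_levels(n)
n_l = n // 2 ** l                                  # entries left in SA_l
bits_for_sa_l = n_l * math.ceil(math.log2(n))      # log n bits per entry
print(l, n_l, bits_for_sa_l)
```

For n = 32 the last level keeps 4 entries of 5 bits each, which stays below n bits.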

31 CSA Structure: levels k = 0,1,…,l-1 store B_k, Ψ_k and rank_k; the final level k = l stores only SA_l

32 CSA Structure & Build: B_k takes n_k bits per vector, O(n_k) build time; rank_k takes O(n_k (log log n_k)/log n_k) bits (as shown before), O(n_k) build time; SA_l takes (n/2^l)*log n bits

33 CSA Structure Space - Ψ_k: list method; 2^{2^k} lists, one for each possible ‘prefix’ of the suffixes; the number of lists increases with k; L_k = concatenation of all 2^{2^k} lists; |L_k| = n_k/2, so |L_k| decreases with k

34 CSA Structure Space - Ψ_k: for i = 1,…,n_k/2, let j be the index of the i-th 1 in B_k; the pattern formed by the symbols of T in positions 2^k*(SA_k[j]-1),…,2^k*SA_k[j]-1 is matched to a list, and j is added to that list

35 Level 0: a list = {2,14,15,18,23,28,30,31}; b list = {7,8,10,13,16,17,21,27}

36 Levels 1, 2: Level 1: aa = {} (empty list), ab = {9}, ba = {1,6,12,14}, bb = {2,4,5}; Level 2: abba = {5,8}, baba = {1}, babb = {4}
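The list construction can be sketched as follows. As an assumption, the text is taken to be the 32-symbol example from Grossi and Vitter's paper, since the slides' figures are not part of this transcript; the function names are chosen for this sketch. Each index j with an even SA_k[j] is filed under the 2^k symbols of T that precede position 2^k*SA_k[j].

```python
ORDER = {"a": 0, "#": 1, "b": 2}                  # a < # < b
T = "abbabbabbabbabaaabababbabbbabba#"            # assumed example text, n = 32

def suffix_array(text):
    """Naive 1-based suffix array, for illustration only."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])

def pack(sa_k):
    """Even values of SA_k, halved: this is SA_{k+1}."""
    return [v // 2 for v in sa_k if v % 2 == 0]

def pattern_lists(text, sa_k, k):
    """Map each pattern to the increasing list of indices j filed under it."""
    width = 2 ** k
    lists = {}
    for j, value in enumerate(sa_k, start=1):
        if value % 2 == 0:
            end = width * value                          # 1-based text position
            pattern = text[end - width - 1:end - 1]      # the width symbols before it
            lists.setdefault(pattern, []).append(j)
    return lists

def concatenate(lists):
    """L_k: the non-empty lists joined in pattern order (a < # < b)."""
    ordered = sorted(lists, key=lambda p: [ORDER[c] for c in p])
    return [j for p in ordered for j in lists[p]]

sa0 = suffix_array(T)
sa1 = pack(sa0)
sa2 = pack(sa1)
lists0 = pattern_lists(T, sa0, 0)
lists1 = pattern_lists(T, sa1, 1)
lists2 = pattern_lists(T, sa2, 2)
print(lists0, lists1, lists2, sep="\n")
```

Only non-empty lists are created here, so the empty aa list of level 1 does not appear in the dictionary.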

37 Reconstruct Ψ_k: if B_k[i] = 1, then Ψ_k(i) = i; if B_k[i] = 0, let h = number of 0’s in the first i bits of B_k (h = i - rank_k(i)); then Ψ_k(i) = L_k[h]

38 Example: Reconstruct Ψ_k: Ψ_0(25) = ? B_0[25] = 0; h = 25 - rank_0(25) = 25 - 12 = 13; Ψ_0(25) = L_0[13] = 16

39 Example: Reconstruct Ψ_k: rank_0(16) = 8, SA_1[8] = ? Ψ_1(8) = ? B_1[8] = 0; h = 8 - rank_1(8) = 8 - 5 = 3; Ψ_1(8) = L_1[3] = 6
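A sketch of recovering Ψ_0 from B_0, rank_0 and L_0 follows. L_0 is typed in from the level 0 lists on the slides; the 32-symbol text is assumed to be the example from Grossi and Vitter's paper, and the function names are chosen for this sketch.

```python
from itertools import accumulate

ORDER = {"a": 0, "#": 1, "b": 2}                  # a < # < b
T = "abbabbabbabbabaaabababbabbbabba#"            # assumed example text, n = 32

def suffix_array(text):
    """Naive 1-based suffix array, for illustration only."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])

sa0 = suffix_array(T)
b0 = [0] + [1 - v % 2 for v in sa0]               # 1-based, dummy entry at index 0
rank0 = list(accumulate(b0))                      # rank0[j] = number of 1s in b0[1..j]

# L_0 = the a list followed by the b list, as given on the slides.
L0 = [2, 14, 15, 18, 23, 28, 30, 31] + [7, 8, 10, 13, 16, 17, 21, 27]

def psi_from_lists(i, b, rank, concatenated):
    """Psi_k(i) recovered from B_k, rank_k and L_k."""
    if b[i] == 1:
        return i
    h = i - rank[i]                               # number of 0s among the first i bits
    return concatenated[h - 1]                    # Python lists are 0-based

print(25 - rank0[25], psi_from_lists(25, b0, rank0, L0))
```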

40 Lemma: s sorted integers, w bits per number, s < 2^w; the integers can be stored in s(2 + w - log s) + O(s/log log s) bits; retrieving the h-th integer takes O(1) time
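One common way to reach a bound of this shape is an Elias-Fano style split: the low w - ⌊log s⌋ bits of each integer are stored verbatim, and the high bits are stored as unary-coded gaps. Whether this is exactly the construction behind the slide's lemma is an assumption; the sketch also retrieves by a linear scan, whereas O(1) retrieval needs an additional select structure (the source of the lower-order term). Names are chosen for this sketch.

```python
def encode(sorted_ints, w):
    """Split each w-bit integer into a unary-coded high part and a verbatim low part."""
    s = len(sorted_ints)
    high_bits = s.bit_length() - 1                # floor(log2 s)
    low_bits = w - high_bits
    lows = [x & ((1 << low_bits) - 1) for x in sorted_ints]
    unary, previous = [], 0
    for x in sorted_ints:
        q = x >> low_bits
        unary.extend([0] * (q - previous) + [1])  # gap in zeros, then a 1 per integer
        previous = q
    return unary, lows, low_bits

def retrieve(encoded, h):
    """Return the h-th integer (1-based) by locating the h-th 1 in the unary part."""
    unary, lows, low_bits = encoded
    ones = 0
    for position, bit in enumerate(unary):
        ones += bit
        if bit and ones == h:
            q = position + 1 - h                  # zeros seen so far = high part
            return (q << low_bits) | lows[h - 1]
    raise IndexError(h)

values = [7, 8, 10, 13, 16, 17, 21, 27]           # a sorted list, w = 5 bits each
enc = encode(values, 5)
used_bits = len(enc[0]) + len(enc[1]) * enc[2]
print(used_bits, [retrieve(enc, h) for h in range(1, len(values) + 1)])
```

For this input, s = 8 and w = 5, so the leading term s(2 + w - log s) is 32 bits.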

41 Store L_k: storing the integers takes n(1/2 + 3/2^{k+1}) + O(n/(2^k log log n)) bits; retrieving the h-th integer takes O(1) time; preprocessing time O(n/2^k + 2^{2^k})

42 CSA Structure - Summary (bits): B_k: n_k; rank_k: O(n_k (log log n_k)/log n_k); SA_l: (n/2^l)*log n; Ψ_k: n(½ + 3/2^{k+1}) + O(n/(2^k log log n))

43 Summing it up…: n*log n/2^l + ½*l*n + 5n + O(n/log log n) bits in total; with l = log log n the first two terms are at most ½*n*log log n + n, giving ½*n*log log n + O(n) bits of storage
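The 5n term can be traced back to the per-level sizes in the summary; the following is a sketch of that bookkeeping, using only the bounds quoted on the slides and leaving the lower-order terms aside.

```latex
\begin{align*}
\sum_{k=0}^{l-1} |B_k| &= \sum_{k=0}^{l-1} \frac{n}{2^k} < 2n, \\
\sum_{k=0}^{l-1} n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right)
  &= \frac{1}{2}\,l\,n + 3n \sum_{k=0}^{l-1} \frac{1}{2^{k+1}}
  < \frac{1}{2}\,l\,n + 3n, \\
|SA_l| \cdot \log n &= \frac{n \log n}{2^l}.
\end{align*}
```

Adding the three lines gives less than n*log n/2^l + ½*l*n + 5n, to which the rank_k structures and the O(n/(2^k log log n)) terms contribute only lower-order amounts.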

44 Preprocess - Summary: B_k: O(n_k); rank_k: O(n_k); Ψ_k: O(n/2^k + 2^{2^k}); summing over levels 0,…,l-1 gives preprocessing time O(n)

45 lookup(i): lookup(i) refers to SA_0[i], so SA_0[i] must be reconstructed; new procedure rlookup(i,k): recursive, based on the lemma for reconstructing SA_k

46 rlookup(i,k): if k = l, return SA_l[i]; else return 2*rlookup(rank_k(Ψ_k(i)), k+1) + (B_k[i] - 1)
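The recursion can be tried end to end with the sketch below. It assumes the 32-symbol example text from Grossi and Vitter's paper and l = ⌈log log n⌉, and it keeps B_k, Ψ_k and rank_k as plain Python lists, so it illustrates the lookup logic only and not the space bounds. Function names other than lookup and rlookup are chosen for this sketch.

```python
import math
from itertools import accumulate

ORDER = {"a": 0, "#": 1, "b": 2}                  # a < # < b
T = "abbabbabbabbabaaabababbabbbabba#"            # assumed example text, n = 32

def suffix_array(text):
    """Naive 1-based suffix array, for illustration only."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])

def build_csa(sa0):
    """Return ([(B_k, Psi_k, rank_k) for k < l], SA_l); lists carry a dummy entry 0."""
    n = len(sa0)                                  # assumed to be a power of 2
    l = math.ceil(math.log2(math.log2(n)))
    levels, sa_k = [], [0] + sa0
    for _ in range(l):
        pos = {v: i for i, v in enumerate(sa_k) if i}
        b = [0] + [1 - v % 2 for v in sa_k[1:]]
        psi = [0] + [i if b[i] else pos[sa_k[i] + 1] for i in range(1, len(sa_k))]
        levels.append((b, psi, list(accumulate(b))))
        sa_k = [0] + [v // 2 for v in sa_k[1:] if v % 2 == 0]   # pack into SA_{k+1}
    return levels, sa_k

def rlookup(i, k, levels, sa_l):
    """SA_k[i], computed from level k downwards."""
    if k == len(levels):                          # k = l: read SA_l directly
        return sa_l[i]
    b, psi, rank = levels[k]
    return 2 * rlookup(rank[psi[i]], k + 1, levels, sa_l) + (b[i] - 1)

def lookup(i, levels, sa_l):
    return rlookup(i, 0, levels, sa_l)

sa0 = suffix_array(T)
levels, sa_l = build_csa(sa0)
print(lookup(5, levels, sa_l))
```

Each call descends one level and does a constant number of table reads, which is where the l+1 steps per lookup come from.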

47 Reconstruct SA_k: by the lemma, SA_k[i] = 2*SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] - 1); lookup(i) = rlookup(i,0)

48 Example - lookup(i): lookup(5) = rlookup(5,0), l = 3; rlookup(5,0) = 2*rlookup(rank_0(Ψ_0(5)),1) + (B_0[5] - 1) = 2*rlookup(10,1) + (-1)

49 Example – cont.: rlookup(10,1) = 2*rlookup(rank_1(Ψ_1(10)),2) + (B_1[10] - 1) = 2*rlookup(7,2) + (-1)

50 Example - cont.: rlookup(7,2) = 2*rlookup(rank_2(Ψ_2(7)),3) + (B_2[7] - 1) = 2*rlookup(2,3) + (-1)

51 Example - cont.: rlookup(2,3) = SA_3[2] = 3; lookup(5) = 2*(2*(2*3 + (-1)) + (-1)) + (-1) = 2*(2*5 + (-1)) + (-1) = 2*9 + (-1) = 17

52 lookup(i): lookup(i) = rlookup(i,0); l+1 levels, O(1) time per level; O(log log n) lookup time

53 Compressed Suffix Array: build time O(n); structure space ½*n*log log n + O(n) bits; lookup time O(log log n)

54 Compressed Suffix Tree: build time O(n); search time O(m/log n + (log n)^ε); structure space (ε^{-1} + O(1))*n bits