Compressed suffix arrays and suffix trees with applications to text indexing and string matching.


Roberto Grossi and Jeffrey Scott Vitter

Agenda  A (very) short review of suffix arrays  Introduction  Problem definition  Information-theoretic reasoning – "Simple" solution, round 2  Compressed suffix arrays in ½·n·log log n + O(n) bits with O(log log n) access time  Rank and select: problem definitions – Rank data structure  Compressed suffix arrays in ε⁻¹·n + O(n) bits with O(log^ε n) access time  Select data structure (if time permits)

Short review of suffix arrays  A suffix array is a sorted array of the suffixes of a string S, represented by an array of pointers (start positions) to the suffixes of S  For example, the string telaviv has suffixes telaviv (S0), elaviv (S1), laviv (S2), aviv (S3), viv (S4), iv (S5), v (S6); the suffix array lists these start positions in lexicographic order of the suffixes
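A minimal sketch of building a suffix array by sorting suffix start positions (my own illustration, not from the slides; 0-based indices, Python's default character order, quadratic-ish running time that is fine for small examples):

# Build a suffix array by sorting the start positions of all suffixes of s.
def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

# telaviv -> [3, 1, 5, 2, 0, 6, 4]: aviv, elaviv, iv, laviv, telaviv, v, viv
print(suffix_array("telaviv"))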

Introduction  Part of the succinct data structures branch  Motivation: DNA/genome strings (small alphabet, very long strings)  Mainly a theoretical article

Problem definition  The algorithm is composed of two operations – compress – lookup  compress: – given a suffix array SA, compress it to obtain its succinct representation  lookup(i): – given the compressed representation, return SA[i]

Some definitions  We will deal (at first) with a binary alphabet – Σ = {a, b}  We add a special end-of-string symbol #  and set the order between the characters to be – a < # < b (*)  Basic RAM model – log n word size – word lookup and arithmetic in constant time
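A tiny illustration of the assumed ordering a < # < b, using a translation table so that Python's sort respects it (the helper name and the sample string are mine):

# a < # < b, as assumed above
ORDER = {"a": 0, "#": 1, "b": 2}

def suffix_array_bin(s):
    # sort suffix start positions under the custom character order
    return sorted(range(len(s)), key=lambda i: [ORDER[c] for c in s[i:]])

# suffixes of abba# in sorted order: a#, abba#, #, ba#, bba#
print(suffix_array_bin("abba#"))   # [3, 0, 4, 2, 1]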

Information-theoretic reasoning  Table: the 16 binary strings of length four (with # appended) together with their corresponding suffix arrays.

Information-theoretic reasoning (2)  A plain suffix array takes n·log n bits  There is a one-to-one correspondence between the suffix array and the string – construction details  Number of possible suffix arrays: 2^(n-1) – a perfect compression would use n bits (the string itself) – but then the cost of a lookup is Ω(n), see the previous lecture

“Simple” solution, round 2 — a different approach  Let's pack every log n bits together to create a new alphabet  The text length becomes n/log n and the pattern length m/log n  The suffix array then takes O(n) bits  Searching becomes hard (alignment) – the text is aligned but the pattern isn't: log n cases

“Simple” solution, round 2  If the text isn't aligned, the pattern occurs k bits to the right of a word boundary  We need to pad the pattern with k bits and check that variant  So we need to check 2^k cases  k ≈ log n ⇒ n different cases to check  and that assumes we know how much to pad!

General framework  Abstract data type optimization [Jacobson '89]  If the number of distinct data structures is C(n), then each data structure can be encoded in O(log C(n)) bits  This does not guarantee the time complexity of the supported operations

Compressed suffix arrays in ½·n·log log n + O(n) bits and O(log log n) access time  A recursive method by nature – takes advantage of the relation between suffixes  Let SA_0 be the uncompressed suffix array  and N_0 its size (assume it is a power of 2)  In phase k of the compression we start with SA_k, of size N_k = n/2^k, and create SA_{k+1}, of size N_{k+1} = N_k/2; SA_{k+1} holds a permutation of {1..N_{k+1}}

SA_{k+1} construction  Create the bit vector B_k: B_k[i] = 1 iff SA_k[i] is even  Create the rank vector: rank_k(j) counts the number of 1 bits in the first j bits of B_k  Create the Ψ_k vector – it stores the 0-to-1 companion relation: for an odd entry i, Ψ_k(i) is the position j with SA_k[j] = SA_k[i] + 1  Store the even values from SA_k, halved, in SA_{k+1}
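A sketch of one compression phase (my own packaging, not the paper's code): indices are 1-based to match the slides, the input SA_k is a permutation of 1..N_k with N_k even, and Ψ_k is stored as a plain array rather than succinctly:

# One phase: from SA_k compute B_k, rank_k, Psi_k and SA_{k+1}.
def compress_phase(sa):
    # sa is 1-based: sa[0] is a placeholder, sa[1..N] is a permutation of 1..N (N even)
    pos = {v: i for i, v in enumerate(sa) if i > 0}       # pos[v] = position of value v
    B, rank, psi, sa_next = [None], [None], [None], [None]
    ones = 0
    for i in range(1, len(sa)):
        even = (sa[i] % 2 == 0)
        B.append(1 if even else 0)
        ones += even
        rank.append(ones)
        psi.append(i if even else pos[sa[i] + 1])         # companion holding value sa[i] + 1
        if even:
            sa_next.append(sa[i] // 2)                    # keep the even values, halved, in order
    return B, rank, psi, sa_next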

An Example  The 32 chars string T  abbabbabbabbabaaabababbabbbabba#

An example  Table: the text T together with SA_0, B_0, rank_0, and Ψ_0.

Example (continued)  Table: the remaining part of the table for T.

How to compute SA_k from SA_{k+1}  Lemma 1 – Given the suffix array SA_k, let B_k, rank_k, Ψ_k, and SA_{k+1} be the result of the transformation performed in phase k. We can reconstruct SA_k from SA_{k+1} by the formula SA_k[i] = 2·SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1) – Split into two cases:  SA_k[i] is even (B_k[i] = 1): then Ψ_k(i) = i and the formula gives 2·SA_{k+1}[rank_k(i)]  SA_k[i] is odd (B_k[i] = 0): then Ψ_k(i) points to the companion even entry, and the formula gives that value minus 1

Example (continued)  Tables for the next levels: SA_1, B_1, rank_1, Ψ_1; SA_2, B_2, rank_2, Ψ_2; and SA_3.

Compress – We keep l = O(log log n) levels – All levels except SA_l are stored only implicitly – For each of the levels 0..l−1 we save B_j, rank_j, Ψ_j – rank_j and Ψ_j are themselves stored implicitly – The size of SA_l is N_l = n/2^l ≈ n/log n entries of log n bits each, i.e., O(n) bits

lookup  just compute recursively Sa k [i] from Sa k+1 [i]  Recursion depth loglogn  All data structure going to be used have o(1) access time  O(loglogn) lookup cost

How the data is stored  The B_k bit vector is stored explicitly – O(N_k) space – O(1) lookup – O(N_k) preprocessing time  The rank_k vector is stored implicitly using Jacobson's rank data structure – O(N_k·(log log N_k)/log N_k) space – O(1) lookup – O(N_k) preprocessing time  The Ψ_k vector is stored implicitly (using rank and select)

Ψ_k vector representation

Let's take a look

An example, revisited  Table: the text T together with SA_0, B_0, rank_0, and Ψ_0.

Example (continued)  Table: the remaining part of the table for T.

So what can we do with all the lists?  Concatenate them together in lexicographic order to form the list L_k  L_1 = {9, 1, 6, 12, 14, 2, 4, 5}  Let's see how to compute Ψ_k(i) – If SA_k[i] is even (B_k[i] = 1), it is simply i – Otherwise, because all the stored prefix patterns are in sorted order, the list L_k contains, up to position i, entries for all the odd suffixes before i: h = i − rank_k(i) – So we can look up the h-th entry in L_k  and it gives the answer

Simple example  L_2 = {5, 8, 2, 4}  rank_2 = {1, 1, 1, 2, 3, 3, 3, 4}  B_2 = {1, 0, 0, 1, 1, 0, 0, 1}  Ψ_2 = {1, 5, 8, 4, 5, 1, 4, 8}  Ψ_2(3) = ?  rank_2(3) = 1, h = 3 − 1 = 2, L_2[2] = 8  Ψ_2(3) = 8
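The same computation in a few lines of Python, reproducing the numbers of this toy example (lists are padded with a leading None so that indices are 1-based as on the slide):

L2    = [None, 5, 8, 2, 4]
rank2 = [None, 1, 1, 1, 2, 3, 3, 3, 4]
B2    = [None, 1, 0, 0, 1, 1, 0, 0, 1]

def psi2(i):
    if B2[i] == 1:            # even entry: Psi is the identity
        return i
    h = i - rank2[i]          # number of odd entries up to and including i
    return L2[h]

print(psi2(3))                # rank2[3] = 1, h = 2, L2[2] = 8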

Rank and select  Given a bit vector of length n  rank(i) is the number of 1 bits among the first i bits  select(i) returns the index of the i-th 1 bit
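For reference, the naive O(n)-time versions of the two operations (the structures described next answer them in O(1) time with little extra space); positions are 1-based as in the slides:

def rank(bits, i):
    # number of 1 bits among the first i bits
    return sum(bits[:i])

def select(bits, i):
    # position of the i-th 1 bit (1-based)
    seen = 0
    for pos, b in enumerate(bits, start=1):
        seen += b
        if b and seen == i:
            return pos
    raise ValueError("fewer than i ones")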

Ψ k vector representation  Lemma 2 Given s integers in sorted order, each containing w bits,where s<2 w each containing w bits,where s<2 w we can store them with at most s(2+w-floor(logs))+O(s/loglogs) bits so that retrieving the hth integer takes constant time

Ψ_k vector representation  Take the first z = ⌊log s⌋ bits of each integer, creating q_1..q_s  It is easy to see that q_1 ≤ q_2 ≤ … ≤ q_s < 2^z ≤ s (we take the most significant bits, after all)  The remaining w − z bits of each integer form r_i – Figure: each S_i split into its high part q_i and its low part r_i

Ψ_k vector representation  Store the r_i in a simple array: (w − z)·s bits  Store q_1..q_s in a table supporting rank and select in constant time  The table Q is implemented as follows: instead of storing the numbers themselves, we store the differences q_1, q_2 − q_1, q_3 − q_2, …, q_s − q_{s−1} in unary representation (0^i 1)  and add a select data structure on top

Ψ_k vector representation  To get q_i we simply do select(i) and count the number of zeros before the i-th 1:  q_i = select(i) − rank(select(i)) = select(i) − i

Ψ_k vector representation  The size of the unary string for the Q table is s + q_s ≤ s + 2^z ≤ 2s bits, plus the select overhead of O(s/log log s)  So we can output S_i easily:  S_i = q_i·2^{w−z} + r_i
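A sketch of the Lemma 2 encoding (essentially the idea behind Elias-Fano coding): the top z = ⌊log₂ s⌋ bits q_i are gap-encoded in unary, the remaining w − z bits r_i are stored verbatim; a real implementation answers select on the unary string in O(1), here it is done by a naive scan:

import math

def encode(vals, w):
    # vals: s sorted w-bit integers
    s = len(vals)
    z = int(math.floor(math.log2(s)))
    lo = w - z
    q = [v >> lo for v in vals]                   # top z bits, non-decreasing
    r = [v & ((1 << lo) - 1) for v in vals]       # low w-z bits, stored verbatim
    unary, prev = "", 0
    for qi in q:                                  # encode q_1, q_2-q_1, ... as 0^gap 1
        unary += "0" * (qi - prev) + "1"
        prev = qi
    return unary, r, lo

def access(unary, r, lo, h):
    # retrieve the h-th integer (1-based): q_h = select(h) - h, then reattach r_h
    pos = [i for i, c in enumerate(unary, 1) if c == "1"][h - 1]
    return ((pos - h) << lo) | r[h - 1]

vals = [3, 9, 12, 21, 30, 47]                     # sorted 6-bit integers
unary, r, lo = encode(vals, 6)
print([access(unary, r, lo, h) for h in range(1, 7)])   # [3, 9, 12, 21, 30, 47]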

Ψ k vector representation  Lemma 3 We can store the concatenated list L k used for Ψ k in n*(1/2+3/2 k+1 )+O(n/2 k loglogn), so accessing the hth element will take constant time, with preprocessing time o(n/2k+2 2 k )  There are 2 2 k lists, number them,(even the empty ones)

Ψ k vector representation  Lemma 3 We can store the concatenated list L k used for Ψ k in n*(1/2+3/2 k+1 )+O(n/2 k loglogn), so accessing the hth element will take constant time, with preprocessing time o(n/2k+2 2 k )  There are 2 2 k lists, number them,(even the empty ones)  Each X i integer in the lists, 1<x i <N k will be transformed into a new integer by appending it ’ s list int representation  X` bit size is, 2 K +logn k,

Ψ k vector representation  Lemma 3 We can store the concatenated list L k used for Ψ k in n*(1/2+3/2 k+1 )+O(n/2 k loglogn), so accessing the hth element will take constant time, with preprocessing time o(n/2k+2 2 k )  There are 2 2 k lists, number them,(even the empty ones)  Each X i integer in the lists, 1<x i <N k will be transformed into a new integer by appending it ’ s list int representation  X` bit size is, 2 K +logn k,  After concatenating all the lists,we have a N k /2 sorted numbers sized 2 K +logn k bits  Using lemma 2 we get.  O(1) access time  And a space bound of n(1/2+3/2 k+1 )+O(n/2 k loglogn) bits

Sum it up (space complexity): summing over the l = O(log log n) levels gives ½·n·log log n + O(n) bits in total

Rank data structure  Due to Jacobson  Given a bit vector of length n, rank(i) is the number of 1 bits among the first i bits  A multilevel approach  We slice the bit string into chunks of log²n bits  Between chunks we keep a running rank counter  Each chunk is divided into sub-chunks of ½·log n bits  and a counter is kept for each sub-chunk, relative to the start of its chunk  At the bottom level a simple lookup table is used
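A sketch of the two-level idea (chunk counters, sub-chunk counters, and a final scan over one sub-chunk standing in for the bottom-level lookup table; the class name and exact chunk sizes are my own choices):

import math

class Rank:
    def __init__(self, bits):
        self.bits = bits
        logn = max(int(math.log2(max(len(bits), 2))), 1)
        self.small = max(logn // 2, 1)        # sub-chunk of ~(1/2) log n bits
        self.big = self.small * 2 * logn      # chunk of ~log^2 n bits (a multiple of small)
        self.big_cnt, self.small_cnt = [], []
        total = inside = 0
        for i, b in enumerate(bits):
            if i % self.big == 0:
                self.big_cnt.append(total)    # 1s before this chunk
                inside = 0
            if i % self.small == 0:
                self.small_cnt.append(inside) # 1s before this sub-chunk, inside its chunk
            total += b
            inside += b

    def rank(self, i):                        # 1s among the first i bits
        if i == 0:
            return 0
        j = i - 1
        start = (j // self.small) * self.small
        return (self.big_cnt[j // self.big] + self.small_cnt[j // self.small]
                + sum(self.bits[start:i]))    # the scan replaces the lookup table

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1]
r = Rank(bits)
print(all(r.rank(i) == sum(bits[:i]) for i in range(len(bits) + 1)))   # True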

Rank lookup (figure): a query combines a log²n-bit chunk counter, a ½·log n-bit sub-chunk counter, and the lookup table to produce the output.

Rank analysis  Chunk counters: O(n/log n) bits  Sub-chunk counters: O(n·log log n/log n) bits  Lookup table: o(n) bits  Total extra space: o(n), with O(1) query time

Compressed suffix arrays in ε⁻¹·n + O(n) bits and O(log^ε n) access time  To break the space barrier we need to keep fewer levels ⇒ longer lookups  Let's keep only three compressed levels: SA_0, SA_{l'}, SA_l, where l = ⌈log log n⌉ and l' = ⌈½·log log n⌉  We use a dictionary data structure that can tell whether an element is a member and also supports a rank query, both in O(1) time  The space of each dictionary is a lower-order term (o(n) bits)  We keep two dictionaries, D_0 and D_l, recording which items survive to the next level (from SA_0 to SA_{l'}, and from SA_{l'} to SA_l)

The Ψ'_k function  We define the Ψ'_k function, which maps each 1 (even entry) to its companion 0  We define the φ_k function as the combination of Ψ_k and Ψ'_k: φ_k(i) is the position j with SA_k[j] = SA_k[i] + 1, for every i  To implement it we just need to merge the indexes in L_k and L'_k

Example

The φ_k function implementation  Lemma 4: We can store the concatenated list used for φ_k – for k = 0 in n + O(n/log log n) bits – for k > 0 in n·(1 + 1/2^{k−1}) + O((n/2^k)·log log n) bits, with preprocessing time O(n/2^k + 2^{2^k}) – If k > 0, simply use Lemma 3 – For k = 0:  Encode a and # as 0, and b as 1  Create an n-bit vector named L  L[f] = 0 iff the f-th entry of the list for φ_0 corresponds to a or #  We add select_1 and select_0 data structures on top of it: O(n/log log n) extra bits  We also keep the number of 0s in L as c_0  A query φ_0(j) is answered as follows:  if j ≤ c_0, return select_0(j)  if j > c_0, return select_1(j − c_0)

The lookup algorithm  To compute SA[i], we start walking the φ_0 function: i, i', i'', i''', … with SA_0[i] + 1 = SA_0[i'], and so on  until reaching an entry found in the dictionary D_0 – let s be the walk length – and r the entry's rank in the dictionary (how many items were already passed on to the next level)  Using r we start walking the next level – let s' be that walk length – and r' the rank of the entry reached in D_l  We return SA[i] = 2^{l'}·(2^{l−l'}·SA_l[r'] − s') − s  The walk lengths satisfy max(s, s') < 2^{l'} = O(√(log n))  So the query time is O(√(log n))
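A sketch of the three-level walk (my own packaging and names, under the assumptions of the slide): phi0 and phi_mid are the neighbour functions of levels 0 and l', D0 and Dl are dictionaries mapping surviving positions to their ranks, sa_last is the explicit SA_l, and lp, l stand for l' and l:

def lookup3(i, phi0, D0, phi_mid, Dl, sa_last, lp, l):
    s = 0
    while i not in D0:                 # walk level 0 until an entry kept in SA_{l'}
        i = phi0(i)
        s += 1
    j, t = D0[i], 0
    while j not in Dl:                 # walk level l' until an entry kept in SA_l
        j = phi_mid(j)
        t += 1
    # SA_{l'}[...] = 2^(l-l') * SA_l[Dl[j]] - t, and the level-0 value is 2^l' * SA_{l'}[...] - s
    return (1 << lp) * ((1 << (l - lp)) * sa_last[Dl[j]] - t) - s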

The general multilevel build  For every 0 < ε < 1  Assume εl is an integer, so 2^{εl} ≤ 2·log^ε n  Create the levels 0, εl, 2εl, …, l  The number of levels is 1/ε + 1 = O(1/ε) ⇒ a lookup cost of O(log^ε n)

The General multilevel Build

Select data structure  select(i)- returns the i 1 bit in the string  Same idea as rank, a bit more complicated  multilevel approach  At the first level we record the position of every lognloglogn th bit, –Total space o(N/loglogn)  Between each two bits, we keep the following data,  If the distance between them r>(lognloglogn) 2 –we keep the absolute pos of all the indexes between them  log 2 nloglogn –Other wise we keep, the relative position of each logrloglogn th bit  Total space logr*loglogn <log 2 nloglogn = r/loglogn r<N !!!  Then we keep one more level (the same notions) –Block size comes to the size of (lgn) 4

Select data structure  After that, we keep a lookup table  For every logn/d pattern we save (d>=2) –Number of 1 bits, –the location of the ith 1 bit in the pattern  Same as before the space is O(n 1/d lognloglogn)  The lookup is then very simple, just walk the levels,  Get a block and ask a query about him using the lookup table.  Space complexity, O(n/loglogn)