COMP9319 Web Data Compression and Search Text indexing and optimization Slides modified from the slides from haimk, Tel-Aviv U. and the slides from Pang Ko and Srinivas Aluru, Iowa State U.
Text search Pattern matching directly: Brute force, BM, KMP, Regular expressions. Indices for pattern matching: Inverted files, Signature files, Suffix trees and Suffix arrays
Readings Remember, as usual, references / original readings are available at: cd ~cs9319/Papers
Signature files Definition: Word-oriented index structure based on hashing. Uses linear search. Suitable for texts that are not very large. Structure: Based on a hash function that maps words to bit masks. The text is divided into blocks. The bit mask of a block is obtained by bitwise ORing the signatures of all the words in the text block. A word is not in a block if some 1 bit of the query mask is not set in the block mask.
Signature files Example
Text: This is a text. A text has many words. Words are made from letters
Text signature: block 1 = 000101, block 2 = 110101, block 3 = 100100, block 4 = 101101
Signature function: h(text) = 000101, h(many) = 110000, h(words) = 100100, h(made) = 001100, h(letters) = 100001
Signature files False drop Problem: The corresponding bits are set even though the word is not there! The design should ensure that the probability of a false drop is low. Also the signature file should be as short as possible. Enhance the hashing function to minimize the error probability.
Signature files Searching: For a single word, hash the word to a bit mask W. For phrases, hash each word in the query to a bit mask and take the bitwise OR of all the query masks as W. Compare W to the bit masks Bi of all the text blocks. If all the bits set in W are also set in Bi, then the text block may contain the word. For all candidate text blocks, an online traversal must be performed to verify whether the actual matches are there. Construction: Cut the text into blocks. Generate an entry of the signature file for each block. This entry is the bitwise OR of the signatures of all the words in the block.
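The construction and search steps can be sketched in Python. This is only an illustration under stated assumptions: the word masks are copied from the example slide, while the block boundaries, the helper names (`tokenize`, `block_mask`, `search`) and the decision to give unlisted words an all-zero mask are choices made here, not part of the original slides.

```python
# Word masks copied from the example slide. Words without a listed mask
# (e.g. "this", "is") get mask 0 here, which is an assumption of this sketch.
SIGNATURE = {
    "text": 0b000101,
    "many": 0b110000,
    "words": 0b100100,
    "made": 0b001100,
    "letters": 0b100001,
}

# Hypothetical block boundaries, chosen so that the block masks equal
# the four signatures shown on the slide.
BLOCKS = ["This is a text.", "A text has many", "words. Words are", "made from letters."]


def tokenize(block):
    """Lower-case the words of a block and strip simple punctuation."""
    return [w.strip(".,").lower() for w in block.split()]


def word_mask(word):
    return SIGNATURE.get(word.lower(), 0)


def block_mask(block):
    """Bitwise OR of the signatures of all the words in the block."""
    mask = 0
    for w in tokenize(block):
        mask |= word_mask(w)
    return mask


def build_signature_file(blocks):
    return [block_mask(b) for b in blocks]


def search(word, blocks, sig_file):
    """Return (candidate block numbers, verified block numbers), 0-based."""
    w = word_mask(word)
    # A block is a candidate if every bit set in w is also set in its mask.
    candidates = [i for i, b in enumerate(sig_file) if b & w == w]
    # Candidates must still be verified against the text (false drops).
    hits = [i for i in candidates if word.lower() in tokenize(blocks[i])]
    return candidates, hits


SIG_FILE = build_signature_file(BLOCKS)
print([format(m, "06b") for m in SIG_FILE])
print(search("words", BLOCKS, SIG_FILE))
```

With these masks, a query for "words" selects three candidate blocks but only one survives verification; the other two are false drops.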
Suffix trees and suffix arrays
Trie A tree representing a set of strings: { aeef, ad, bbfe, bbfg, c } [Figure: trie for this set of strings]
Trie (Cont) Assume no string is a prefix of another. Each edge is labeled by a letter, no two edges outgoing from the same node are labeled the same. Each string corresponds to a leaf. [Figure: the same trie, one letter per edge]
Compressed Trie Compress unary nodes, label edges by strings. [Figure: the trie before and after compression, with string labels such as bbf and eef on the compressed edges]
Suffix tree Given a string s a suffix tree of s is a compressed trie of all suffixes of s To make these suffixes prefix-free we add a special character, say $, at the end of s
Suffix tree (Example) Let s=abab. A suffix tree of s is a compressed trie of all suffixes of s=abab$: { $, b$, ab$, bab$, abab$ } [Figure: suffix tree of abab$]
Trivial algorithm to build a Suffix tree Insert the suffixes one at a time, from the longest to the shortest:
Put the longest suffix, abab$, in
Put the suffix bab$ in
Put the suffix ab$ in
Put the suffix b$ in
Put the suffix $ in
[Figures: the tree after each insertion]
We will also label each leaf with the starting point of the corresponding suffix: abab$ is 1, bab$ is 2, ab$ is 3, b$ is 4, $ is 5
Analysis Takes O(n^2) time to build. We will see how to do it in O(n) time
What can we do with it ? Exact string matching: Given a Text T, |T| = n, preprocess it such that when a pattern P, |P|=m, arrives you can quickly decide whether it occurs in T. We may also want to find all occurrences of P in T
Exact string matching In preprocessing we just build a suffix tree in O(n) time. Given a pattern P = ab we traverse the tree according to the pattern. [Figure: suffix tree of abab$ with leaves labeled 1 to 5]
If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(m+k) time: O(m) to follow the pattern and O(k) to visit the subtree. [Figure: suffix tree of abab$ with the subtree below ab highlighted]
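The matching procedure can be tried out on a naive, uncompressed suffix trie. This is a sketch rather than the linear-time construction: building the trie this way takes O(n^2) time and space, and the names `Node`, `build_suffix_trie` and `occurrences` are made up for this example. Because the trie is not compressed, collecting the leaves can also cost more than O(k).

```python
class Node:
    """One trie node; `start` is set only on leaves (1-based suffix start)."""

    def __init__(self):
        self.children = {}
        self.start = None


def build_suffix_trie(text):
    """Insert every suffix of text + '$', longest first (trivial algorithm)."""
    s = text + "$"
    root = Node()
    for begin in range(len(s)):
        node = root
        for ch in s[begin:]:
            node = node.children.setdefault(ch, Node())
        node.start = begin + 1  # label the leaf with the suffix start
    return root


def occurrences(root, pattern):
    """Walk down along the pattern, then collect all leaves below that node."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            return []  # got stuck: the pattern does not occur
        node = node.children[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        if cur.start is not None:
            found.append(cur.start)
        stack.extend(cur.children.values())
    return sorted(found)


trie = build_suffix_trie("abab")
print(occurrences(trie, "ab"))
```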
Generalized suffix tree Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of each s ∈ S. To make these suffixes prefix-free we add a special char, say $, at the end of s. To associate each suffix with a unique string in S, add a different special char to each s
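Generalized suffix tree (Example) Let s1=abab and s2=aab. Here is a generalized suffix tree for s1 and s2, built over the suffixes { $, #, b$, b#, ab$, ab#, bab$, aab#, abab$ } [Figure: generalized suffix tree; leaves ending in $ carry the suffix positions 1 to 5 of s1, leaves ending in # the positions 1 to 4 of s2]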
So what can we do with it ? Matching a pattern against a database of strings
Longest common substring (of two strings) Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring and vice versa. Find such node with largest “string depth”. [Figure: generalized suffix tree of s1 and s2]
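The idea of looking for the deepest node reachable from both strings can be illustrated with an uncompressed generalized suffix trie. This is a quadratic-space sketch, not the linear-time suffix tree approach; instead of terminators it marks every node with the set of strings whose suffixes pass through it (the `owners` field and the function name are assumptions of this example).

```python
def longest_common_substring(s1, s2):
    """Deepest trie node that lies on a suffix of s1 and on a suffix of s2."""
    root = {"kids": {}, "owners": set()}
    for owner, s in enumerate((s1, s2)):
        for begin in range(len(s)):
            node = root
            for ch in s[begin:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                node["owners"].add(owner)  # this node is below a suffix of `owner`

    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path  # `path` is the string depth label of this node
        for ch, child in node["kids"].items():
            if len(child["owners"]) == 2:  # reachable from both strings
                stack.append((child, path + ch))
    return best


print(longest_common_substring("abab", "aab"))
```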
Lowest common ancestor A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it
Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes. [Figure: generalized suffix tree of s1 and s2]
Finding maximal palindromes A palindrome: caabaac, cbaabc. Want to find all maximal palindromes in a string s. Let s = cbaaba. The maximal palindrome with center between i-1 and i is given by the LCP of the suffix at position i of s and the suffix at position m-i+1 of s^r, the reverse of s (m is the length of s including its terminator, as in s = cbaaba$)
Maximal palindromes algorithm Prepare a generalized suffix tree for s = cbaaba$ and s^r = abaabc#. For every i find the LCA of suffix i of s and suffix m-i+1 of s^r
Let s = cbaaba$ then s^r = abaabc# [Figure: generalized suffix tree of s and s^r, leaves labeled with suffix start positions 1 to 7]
Analysis O(n) time to identify all palindromes
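The LCP formulation can be checked with a few lines of Python. This sketch computes each LCP by direct character comparison, so it runs in O(n^2) time in the worst case; the O(n) bound relies on constant-time LCA queries on the generalized suffix tree, which are not implemented here. It only finds even-length palindromes (centers between two characters), and the function names are illustrative. With 0-based indexing and no terminators, the suffix of the reversed string starts at n - c for a center just before index c.

```python
def lcp_length(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def maximal_even_palindromes(s):
    """Map each center c (between s[c-1] and s[c]) to its maximal palindrome."""
    n = len(s)
    r = s[::-1]
    result = {}
    for c in range(1, n):
        # s[c:] reads rightwards from the center, r[n - c:] reads leftwards.
        half = lcp_length(s[c:], r[n - c:])
        if half > 0:
            result[c] = s[c - half:c + half]
    return result


print(maximal_even_palindromes("cbaaba"))
```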
Drawbacks Suffix trees consume a lot of space It is O(n) but the constant is quite big Notice that if we indeed want to traverse an edge in O(1) time then we need an array of ptrs. of size |Σ| in each node
Suffix array We lose some of the functionality but we save space. Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab. The suffix array gives the indices of the suffixes in sorted order: 3 1 4 2
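As a quick check of the definition, a suffix array can be produced by sorting the suffix start positions directly. This naive sketch is not one of the efficient constructions: comparing whole suffixes makes it O(n^2 log n) in the worst case. The function name is illustrative and positions are 1-based to match the slides.

```python
def naive_suffix_array(s):
    """1-based start positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


print(naive_suffix_array("abab"))
print(naive_suffix_array("mississippi$"))
```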
How do we build it ? Build a suffix tree Traverse the tree in DFS, lexicographically picking edges outgoing from each node and fill the suffix array. O(n) time
How do we search for a pattern ? If P occurs in T then all its occurrences are consecutive in the suffix array. Do a binary search on the suffix array. Takes O(m log n) time
Example Let S = mississippi, let P = issa
Sorted suffixes with their start positions; M and R mark the midpoint and the right end of the initial binary search range:
11  i
 8  ippi
 5  issippi
 2  ississippi
 1  mississippi
10  pi            M
 9  ppi
 7  sippi
 4  sissippi
 6  ssippi
 3  ssissippi     R
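The binary search can be sketched as two bound searches over the suffix array, each comparing the pattern with the first m characters of a suffix. This is an illustration only: the suffix array is built naively, and the function names are chosen for this example. Results are reported as 1-based positions to match the slides.

```python
def build_suffix_array(text):
    """Naive construction: 0-based suffix starts in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All 1-based positions where pattern occurs, via binary search on sa."""
    m = len(pattern)

    # Leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Leftmost suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Occurrences are consecutive in the suffix array: sa[first:lo].
    return sorted(p + 1 for p in sa[first:lo])


text = "mississippi"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "issi"))
print(find_occurrences(text, sa, "issa"))
```

Each comparison looks at no more than m characters and each search takes O(log n) steps, which gives the O(m log n) bound.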
Supra index Structure: Suffix arrays are a space-efficient implementation of suffix trees: simply an array containing all the pointers to the text suffixes, listed in lexicographical order. Suffix arrays are designed to allow binary searches done by comparing the contents of each pointer. Supra-indices: If the suffix array is large, this binary search can perform poorly because of the number of random disk accesses. To remedy this situation, the use of supra-indices over the suffix array has been proposed.
Supra index Example
Text: This is a text. A text has many words. Words are made from letters
Word start positions: 1 6 9 11 17 19 24 28 33 40 46 50 55 60
Suffix Array: 60 50 28 19 11 40 33
Supra-Index: lett text word
Space Efficient Linear Time Construction of Suffix Arrays A good paper by Pang Ko and Srinivas Aluru
Suffix Array Sorted order of suffixes of a string T. Represented by the starting position of the suffix.
Text:          M  I  S  S  I  S  S  I  P  P  I  $
Index:         1  2  3  4  5  6  7  8  9 10 11 12
Suffix Array: 12 11  8  5  2  1 10  9  7  4  6  3
Brief History Introduced by Manber and Myers in 1989. Takes O(n log n) time, and 8n bytes. Many other non-linear time algorithms.
Authors          Time         Space (bytes)
Manber & Myers   n log n      8n
Sadakane                      9n
String-sorting   n^2 log n    5n
Radix-sorting    n^2
Our Result Among the first linear-time direct suffix array construction algorithms. Solves an important open problem. For constant size alphabet, only uses 8n bytes. Easily implementable. Can also be used as a space efficient suffix tree construction algorithm.
Notation String T = t_1…t_n, over the alphabet Σ = {1…n}. t_n = ‘$’, where ‘$’ is a unique character. T_i = t_i…t_n denotes the i-th suffix of T. For strings α and β, α < β denotes that α is lexicographically smaller than β.
Overview Divide all suffixes of T into two types. Type S suffixes = {T_i | T_i < T_{i+1}}. Type L suffixes = {T_j | T_j > T_{j+1}}. The last suffix is both type S and L. Sort all suffixes of one of the types. Obtain lexicographical order of all suffixes from the sorted ones.
Identify Suffix Types The type of each suffix in T can be determined in one scan of the string.
Type:  L  S  L  L  S  L  L  S  L  L  L  L/S
Text:  M  I  S  S  I  S  S  I  P  P  I  $
M > I, so T_1 > T_2 and T_1 is type L
I < S, so T_2 < T_3 and T_2 is type S
S = S, so check the next character: S > I, so T_3 > T_4 > T_5 and both T_3 and T_4 are type L
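The classification can be sketched in a few lines. The slide describes a left-to-right scan that looks ahead over runs of equal characters; the version below scans right to left instead, which is a choice made for this sketch and yields the same types in a single pass. The last suffix, which is both S and L, is reported as S here, and the function name is illustrative.

```python
def suffix_types(t):
    """Return a string of 'S'/'L' marks, one per suffix of t.

    Assumes t ends with a unique terminator; the last suffix is marked 'S'.
    """
    n = len(t)
    types = ["S"] * n
    for i in range(n - 2, -1, -1):
        if t[i] < t[i + 1]:
            types[i] = "S"
        elif t[i] > t[i + 1]:
            types[i] = "L"
        else:
            # Equal characters: the suffix has the type of the next suffix.
            types[i] = types[i + 1]
    return "".join(types)


print(suffix_types("MISSISSIPPI$"))
```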
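Notation
Type:     S        S        S           S
Text:  M  I  S  S  I  S  S  I  P  P  I  $
Type S positions/characters: positions 2, 5, 8 and 12
Type S suffixes: the suffixes starting at type S positions
Type S substrings: the substrings running from one type S position up to the next type S position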
Sorting Type S Suffixes Sort all type S substrings. Replace each type S substring by its bucket number. New string is the sequence of bucket numbers. Sorting all type S suffixes = Sorting all suffixes of the new string.
Sorting Type S Substrings Sort each substring until the next type S character.
Text:    M  I  S  S  I  S  S  I  P  P  I  $
Bucket:     3        3        2           1
Bucket sort according to the 1st character: $ | I I I (break into 2 buckets)
Bucket sort according to the 2nd character: P | S S (break into 2 more buckets)
Bucket sort takes potentially O(n^2) time.
Substitute the substrings with the bucket numbers to obtain a new string. Apply sorting recursively to the new string.
Solution Observation: Each character participates in the bucket sort at most twice. Type L characters only participate in the bucket sort once. Solution: Sort all the characters once. Construct m lists according to the distance to the closest type S character to the left.
Illustration
Type:         S        S        S           S
Index:     1  2  3  4  5  6  7  8  9 10 11 12
Text:      M  I  S  S  I  S  S  I  P  P  I  $
Distance:  0  0  1  2  3  1  2  3  1  2  3  4   (0 where there is no type S position to the left)
Sorted order of characters: 12 2 5 8 11 1 9 10 3 4 6 7
Sort the type S substrings using the lists
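The distances and the m lists can be computed as follows. This is a sketch under assumptions: Python's stable sort stands in for the bucket sort of the characters, a distance of 0 is used where no type S position lies to the left, and the function names are invented for this example. Positions are 1-based to match the slides.

```python
def suffix_types(t):
    """'S'/'L' per suffix; the last suffix is marked 'S'."""
    types = ["S"] * len(t)
    for i in range(len(t) - 2, -1, -1):
        if t[i] > t[i + 1] or (t[i] == t[i + 1] and types[i + 1] == "L"):
            types[i] = "L"
    return types


def distances(types):
    """Distance from each position to the closest type S position on its left."""
    dist, last_s = [0] * len(types), None
    for i, ty in enumerate(types):
        if last_s is not None:
            dist[i] = i - last_s
        if ty == "S":
            last_s = i
    return dist


def m_lists(t):
    """Return (dist, lists); lists[d] holds buckets of 1-based positions."""
    dist = distances(suffix_types(t))
    # Stable sort by character: ties keep their text order.
    order = sorted(range(len(t)), key=lambda i: t[i])
    lists = {}
    for i in order:  # scan the sorted order once, bucketing by distance
        d = dist[i]
        if d == 0:
            continue
        buckets = lists.setdefault(d, [])
        if buckets and t[buckets[-1][-1] - 1] == t[i]:
            buckets[-1].append(i + 1)  # same character: same bucket
        else:
            buckets.append([i + 1])
    return dist, lists


dist, lists = m_lists("MISSISSIPPI$")
print(dist)
print(lists)
```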
Construct Suffix Array for all Suffixes The first suffix in the suffix array is a type S suffix. For 1 ≤ i ≤ n, if T_{SA[i]-1} is type L, move it to the current front of its bucket.
Sorted order of type S suffixes: 12 8 5 2
Buckets:       $  | I         | M | P    | S
Suffix array:  12 | 11 8 5 2  | 1 | 10 9 | 7 4 6 3
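This final step can be sketched on its own, given the type S suffixes already in sorted order. The sketch below is not the full linear-time algorithm: the sorted type S suffixes are obtained by a naive comparison sort in `suffix_array_via_s`, standing in for the recursive step, and all function names are made up here. It assumes the text ends with a unique terminator that is smaller than every other character.

```python
def suffix_types(t):
    """'S'/'L' per suffix; the last suffix is marked 'S'."""
    types = ["S"] * len(t)
    for i in range(len(t) - 2, -1, -1):
        if t[i] > t[i + 1] or (t[i] == t[i + 1] and types[i + 1] == "L"):
            types[i] = "L"
    return types


def induce_from_sorted_s(t, sorted_s):
    """sorted_s: 0-based type S positions in sorted suffix order."""
    n, types = len(t), suffix_types(t)
    counts = {}
    for ch in t:
        counts[ch] = counts.get(ch, 0) + 1
    head, tail, pos = {}, {}, 0
    for ch in sorted(counts):  # bucket boundaries by first character
        head[ch], tail[ch] = pos, pos + counts[ch] - 1
        pos += counts[ch]

    sa = [None] * n
    # Type S suffixes go to the end of their buckets, keeping their order.
    for p in reversed(sorted_s):
        sa[tail[t[p]]] = p
        tail[t[p]] -= 1
    # Left-to-right scan: put each type L predecessor at its bucket front.
    for i in range(n):
        p = sa[i]
        if p is not None and p > 0 and types[p - 1] == "L":
            ch = t[p - 1]
            sa[head[ch]] = p - 1
            head[ch] += 1
    return [p + 1 for p in sa]  # report 1-based positions


def suffix_array_via_s(t):
    types = suffix_types(t)
    s_positions = [i for i in range(len(t)) if types[i] == "S"]
    # Naive stand-in for the recursive sorting of type S suffixes.
    sorted_s = sorted(s_positions, key=lambda i: t[i:])
    return induce_from_sorted_s(t, sorted_s)


print(suffix_array_via_s("mississippi$"))
```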
Run-Time Analysis Identify types of suffixes -- O(n) time. Bucket sort type S (or L) substrings -- O(n) time. Construct suffix array from sorted type S (or L) suffixes -- O(n) time. T(n) = T(n/2) + O(n) = O(n)
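The recurrence unrolls into a geometric series. A short derivation, assuming the non-recursive work on a string of length n is at most cn for some constant c (a symbol introduced here), and that the recursive call receives a string of length at most n/2, presumably because the algorithm sorts whichever type has fewer suffixes:

```latex
\begin{align*}
T(n) &\le T(n/2) + cn \\
     &\le cn + \frac{cn}{2} + \frac{cn}{4} + \cdots \\
     &\le cn \sum_{k \ge 0} 2^{-k} = 2cn = O(n).
\end{align*}
```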
Space Requirement For each element in an array use 2 integers to mark the beginning and the end of its bucket. Use a Boolean array to mark the boundary of each bucket. This takes 12n bytes for |Σ|~n. At most 3n bits are needed for the Boolean arrays.
Constant size Alphabet No need to create the m lists, because bucket sort takes O(n) time. This reduces the space needed to 4n bytes for the 1st recursion step. Maximum space used at any time is 8n bytes (and a maximum of 3 Boolean arrays).
Conclusion Among the first suffix array construction algorithms that take O(n) time. The algorithm can be easily implemented in 8n bytes (plus a few Boolean arrays). Equal or less space than most non-linear time algorithms. Can be used as a space efficient suffix tree construction algorithm.
Exercise Consider the popular example string S: bananainpajamas$ Construct the suffix array of S using the linear time algorithm. Then compute the BWT(S). What’s the relationship between the suffix array and BWT ?
Solution Discussed in the lecture Will be available in the next lecture’s slides handout If you’ve missed this lecture, refer to the original paper for the details
Step – Identify the type of each suffix
Index:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Text:      b  a  n  a  n  a  i  n  p  a  j  a  m  a  s  $
Type:      L  S  L  S  L  S  S  S  L  S  L  S  L  S  L  L/S
Step – Compute the distance from S
Distance:  0  0  1  2  1  2  1  1  1  2  1  2  1  2  1  2
Step – Sort order of chars
Char:      $  a  a  a  a  a  a  b  i  j  m  n  n  n  p  s
Index:    16  2  4  6 10 12 14  1  7 11 13  3  5  8  9 15
Step – Construct m-Lists Scan this sorted order once and bucket it according to dist.
Step – Generate m-Lists
Char:      $  a  a  a  a  a  a  b  i  j  m  n  n  n  p  s
Index:    16  2  4  6 10 12 14  1  7 11 13  3  5  8  9 15
Distance:  2  0  2  2  2  2  2  0  1  1  1  1  1  1  1  1
List 1: [7],[11],[13],[3,5,8],[9],[15]
List 2: [16],[4,6,10,12,14]
Step – Sort S substrings
Bucket the S substrings by their first character: [16],[2,4,6,10,12,14],[7],[8]
List 1: [7],[11],[13],[3,5,8],[9],[15]
List 2: [16],[4,6,10,12,14]
After using List 1: [16],[6],[10],[12],[2,4],[14],[7],[8]
List 2 is useless here: it does not split any bucket further. Then? The type S substrings starting at 2 and 4 are identical (ana), so order them by what follows them. Consider 6 before 4: since 6 precedes 4, 4 precedes 2.
Result: [16],[6],[10],[12],[4],[2],[14],[7],[8]
Step – Generate the Suffix Array
Start with the sorted type S suffixes at the end of their buckets: [16],[6],[10],[12],[4],[2],[14],[7],[8]
Scan from left to right; each time the suffix before the current entry is type L, move it to the current front of its bucket:
From 16, add 15 (s): 16 6 10 12 4 2 14 7 8 15
From 6, add 5 (n):   16 6 10 12 4 2 14 7 5 8 15
From 10, add 9 (p):  16 6 10 12 4 2 14 7 5 8 9 15
From 12, add 11 (j): 16 6 10 12 4 2 14 7 11 5 8 9 15
From 4, add 3 (n):   16 6 10 12 4 2 14 7 11 5 3 8 9 15   (8 is type S and stays at the end of the n bucket)
From 2, add 1 (b):   16 6 10 12 4 2 14 1 7 11 5 3 8 9 15
From 14, add 13 (m): 16 6 10 12 4 2 14 1 7 11 13 5 3 8 9 15
Final answer
Index:         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Text:          b  a  n  a  n  a  i  n  p  a  j  a  m  a  s  $
Suffix Array: 16  6 10 12  4  2 14  1  7 11 13  5  3  8  9 15
What is the BWT(S) ?
BWT is easy! Take the position just before each suffix array entry (the position before 1 wraps around to 16):
BWT index:    15  5  9 11  3  1 13 16  6 10 12  4  2  7  8 14
BWT construction in linear time Read off the characters at those positions:
BWT:           s  n  p  j  n  b  m  $  a  a  a  a  a  i  n  a
BWT(S) = snpjnbm$aaaaaina
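The relationship between the suffix array and the BWT can be checked directly: the i-th BWT character is the character just before the i-th smallest suffix, wrapping around for the suffix that starts at position 1. In this sketch the suffix array is built by naive sorting (not in linear time) and the function names are illustrative; given a suffix array, the BWT step itself is a single linear pass.

```python
def naive_suffix_array(s):
    """1-based suffix array by direct sorting (not linear time)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def bwt_from_suffix_array(s, sa):
    """BWT[i] = character preceding suffix sa[i], cyclically.

    sa holds 1-based positions, so s[p - 2] is the preceding character;
    for p == 1 the index -1 wraps around to the last character.
    """
    return "".join(s[p - 2] for p in sa)


S = "bananainpajamas$"
SA = naive_suffix_array(S)
print(SA)
print(bwt_from_suffix_array(S, SA))
```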