COMP9319 Web Data Compression and Search

COMP9319 Web Data Compression and Search
Text indexing and optimization Slides modified from the slides from haimk, Tel-Aviv U. and the slides from Pang Ko and Srinivas Aluru, Iowa State U.

Text search Pattern matching directly Regular expressions
Brute force BM KMP Regular expressions Indices for pattern matching Inverted files Signature files Suffix trees and Suffix arrays

Readings Remember, as usual, references / original readings are available at: cd ~cs9319/Papers

Signature files Definition Structure
Word-oriented index structure based on hashing. Use liner search. Suitable for not very large texts. Structure Based on a Hash function that maps words to bit masks. The text is divided in blocks. Bit mask of block is obtained by bitwise ORing the signatures of all the words in the text block. Word not found, if no match between all 1 bits in the query mask and the block mask.

block 1 block 2 block3 block 4
Signature files Example: This is a text. A text has many words. Words are made from letters block block block block 4 000101 110101 100100 101101 Text signature h(text) = h(many) = h(words) = h(made) = h(letters) = Signature function

Signature files False drop Problem
The corresponding bits are set even though the word is not there! The design should insure that the probability of false drop is low. Also the Signature file should be as short as possible. Enhance the hashing function to minimize the error probability.

Signature files Searching Construction
For a single word, Hash word to a bit mask W. For phrases, Hash words in query to a bit mask. Bitwise OR of all the query masks to a bit mask W. Compare W to the bit masks Bi of all the text blocks. If all the bits set in W are also in Bi, then text block may contain the word. For all candidate text blocks, an online traversal must be performed to verify if the actual matches are there. Construction Cut the text in blocks. Generate an entry of the signature file for each block. This entry is the bitwise OR of the signatures of all the words in the block.

Suffix trees and suffix arrays

Trie A tree representing a set of strings. c { a aeef b ad bbfe bbfg e

Trie (Cont) Assume no string is a prefix of another c a b e b d e f c
Each edge is labeled by a letter, no two edges outgoing from the same node are labeled the same. Each string corresponds to a leaf. a b e b d e f c f e g

Compressed Trie Compress unary nodes, label edges by strings c c a a b
 c a a b e b d bbf d eef e f c f c e g e g

Suffix tree Given a string s a suffix tree of s is a compressed trie of all suffixes of s To make these suffixes prefix-free we add a special character, say $, at the end of s

Suffix tree (Example) Let s=abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$ $ { $ b$ ab$ bab$ abab$ } a b b $ a a $ b b $ $

Trivial algorithm to build a Suffix tree
Put the largest suffix in a b $ a b b a Put the suffix bab$ in a b b $ $

a b b a a b b $ $ Put the suffix ab$ in a b b a b $ a $ b $

a b b a b $ a $ b $ Put the suffix b$ in a b b $ a a $ b b $ $

a b b $ a a $ b b $ $ $ Put the suffix $ in a b b $ a a $ b b $ $

$ a b b $ a a $ b b $ $ We will also label each leaf with the starting point of the corres. suffix. $ a b b 5 $ a a $ b 4 b $ $ 3 2 1

Analysis Takes O(n2) time to build.
We will see how to do it in O(n) time

What can we do with it ? Exact string matching:
Given a Text T, |T| = n, preprocess it such that when a pattern P, |P|=m, arrives you can quickly decide when it occurs in T. W e may also want to find all occurrences of P in T

Exact string matching In preprocessing we just build a suffix tree in O(n) time $ a b b 5 $ a a $ b 4 b $ $ 3 2 1 Given a pattern P = ab we traverse the tree according to the pattern.

By traversing this subtree we get all k occurrences in O(n+k) time
$ a b b 5 $ a a $ b 4 b $ $ 3 2 1 If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(n+k) time

Generalized suffix tree
Given a set of strings S a generalized suffix tree of S is a compressed trie of all suffixes of s  S To make these suffixes prefix-free we add a special char, say $, at the end of s To associate each suffix with a unique string in S add a different special char to each s

Generalized suffix tree (Example)
Let s1=abab and s2=aab here is a generalized suffix tree for s1 and s2 # { $ # b$ b# ab$ ab# bab$ aab# abab$ } $ a b 5 4 # $ a b a b 3 b $ 4 # a # $ 2 b 1 $ 3 2 1

So what can we do with it ? Matching a pattern against a database of strings

Longest common substring (of two strings)
Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring and vice versa. # $ a b 5 4 # $ a b a b 3 b Find such node with largest “string depth” # $ 4 a # $ 2 b 1 $ 3 2 1

Lowest common ancestor
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes # $ a b 5 4 # $ a b a b 3 b 4 # $ a # $ 2 b 1 $ 3 2 1

Finding maximal palindromes
A palindrome: caabaac, cbaabc Want to find all maximal palindromes in a string s Let s = cbaaba The maximal palindrome with center between i-1 and i is the LCP of the suffix at position i of s and the suffix at position m-i+1 of sr

Maximal palindromes algorithm
Prepare a generalized suffix tree for s = cbaaba$ and sr = abaabc# For every i find the LCA of suffix i of s and suffix m-i+1 of sr

Let s = cbaaba$ then sr = abaabc#
7 a b 7 $ b a c # baaba$ c # 6 c # a $ a b 6 a $ 4 abc # 5 5 3 3 $ a $ c # 4 1 2 2 1

Analysis O(n) time to identify all palindromes

Drawbacks Suffix trees consume a lot of space
It is O(n) but the constant is quite big Notice that if we indeed want to traverse an edge in O(1) time then we need an array of ptrs. of size |Σ| in each node

Suffix array We loose some of the functionality but we save space.
Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The suffix array gives the indices of the suffixes in sorted order 3 1 4 2

How do we build it ? Build a suffix tree
Traverse the tree in DFS, lexicographically picking edges outgoing from each node and fill the suffix array. O(n) time

How do we search for a pattern ?
If P occurs in T then all its occurrences are consecutive in the suffix array. Do a binary search on the suffix array Takes O(mlogn) time

Example Let S = mississippi i ippi issippi Let P = issa ississippi
8 5 2 1 10 9 7 4 11 6 3 ippi issippi Let P = issa ississippi mississippi pi M ppi sippi sisippi ssippi ssissippi R

Supra index Structure Suffix arrays are space efficient implementation of suffix trees. Simply an array containing all the pointers to the text suffixes listed in lexicographical order. Supra-indices: If the suffix array is large, this binary search can perform poorly because of the number of random disk accesses. Suffix arrays are designed to allow binary searches done by comparing the contents of each pointer. To remedy this situation, the use of supra-indices over the suffix array has been proposed.

This is a text. A text has many words. Words are made from letters
Supra index Example This is a text. A text has many words. Words are made from letters suffix tree 1 3 5 6 60 50 28 19 11 40 33 60 50 28 19 11 40 33 Suffix Array lett text word Supra-Index 60 50 28 19 11 40 33 Suffix Array

Space Efficient Linear Time Construction of Suffix Arrays
A good paper by Pang Ko and Srinivas Aluru

Suffix Array Sorted order of suffixes of a string T.
Represented by the starting position of the suffix. Text M I S S I S S I P P I $ Index 1 2 3 4 5 6 7 8 9 10 11 12 Suffix Array 12 11 8 5 2 1 10 9 7 4 6 3

Brief History Introduced by Manber and Myers in 1989.
Takes O(n log n) time, and 8n bytes. Many other non-linear time algorithms. Authors Time Space (bytes) Manber & Myers n log n 8n Sadakane 9n String-sorting n2 log n 5n Radix-sorting n2

Our Result Among the first linear-time direct suffix array construction algorithms. Solves an important open problem. For constant size alphabet, only uses 8n bytes. Easily implementable. Can also be used as a space efficient suffix tree construction algorithm.

Notation String T = t1…tn. Over the alphabet Σ = {1…n}.
tn = ‘$’, ‘$’ is a unique character. Ti = ti…tn, denotes the i-th suffix of T. For strings  and ,  <  denotes  is lexicographically smaller than .

Overview Divide all suffixes of T into two types.
Type S suffixes = {Ti | Ti < Ti+1} Type L suffixes = {Tj | Tj > Tj+1} The last suffix is both type S and L. Sort all suffixes of one of the types. Obtain lexicographical order of all suffixes from the sorted ones.

Identify Suffix Types Type L S L L S L L S L L L L/S Text M I S S I S
$ The type of each suffix in T can be determined in one scan of the string. M > I T1 > T2 T1 is type L I < S T2 < T3 T2 is type S S = S, so check next character S > I T3 > T4 > T5 T3 and T4 are type L

Type S positions/ characters
Notation Type S S S S Text M I S S I S S I P P I $ Type S suffixes Type S positions/ characters Type S substrings

Sorting Type S Suffixes
Sort all type S substrings. Replace each type S substrings by its bucket number. New string is the sequence of bucket numbers. Sorting all type S suffixes = Sorting all suffixes of the new string.

Sorting Type S Substrings
Substitute the substrings with the bucket numbers to obtain a new string. Apply sorting recursively to the new string. Text 3 M 3 I 2 S S 1 I S S I P P I $ Bucket Sort $ I I I According to the 1st character P S S According to the 2nd character Sort each substring until the next type S character Bucket Sort takes potentially O(n2) time P S S I I I Break into 2 buckets $ Break into 2 more buckets

Solution Observation: Each character participates in the bucket sort at most twice. Type L characters only participate in the bucket sort once. Solution: Sort all the characters once. Construct m lists according the distance to the closest type S character to the left

Illustration Type S S S S Index Text M I S S I S S I P P I $ Distance
1 2 3 4 5 6 7 8 9 10 11 12 Text M I S S I S S I P P I $ Distance 1 2 3 1 2 3 1 2 3 4 Sorted Order of characters 12 2 5 8 11 1 9 10 3 4 6 7 Sort the type S substrings using the lists The Lists

Construct Suffix Array for all Suffixes
The first suffix in the suffix array is a type S suffix. For 1 ≤ i ≤ n, if TSA[i]-1 is type L, move it to the current front of its bucket $ I M P S 12 11 8 5 2 1 10 9 7 4 6 3 Sorted order of type S suffixes

Run-Time Analysis Identify types of suffixes -- O(n) time.
Bucket sort type S (or L) substrings -- O(n) time. Construct suffix array from sorted type S (or L) suffixes -- O(n) time. T(n) = T(n/2) + O(n) = O(n)

Space Requirement For each element in an array use 2 integers to mark the beginning and the end of their bucket. Use Boolean array to mark the boundary of each bucket. This takes 12n bytes for |Σ|~n. At most 3n bits is needed for the Boolean arrays.

Constant size Alphabet
No need to create the m lists. Because bucket sort takes O(n) time. Reducing the space needed to 4n bytes for the 1st recursion step. Maximum space used at anytime is 8n bytes (and maximum of 3 Boolean arrays).

Conclusion Among the first suffix array construction algorithm takes O(n) time. The algorithm can be easily implemented in 8n bytes (plus a few Boolean arrays). Equal or less space than most non-linear time algorithm. Can be used as a space efficient suffix tree construction algorithm.

Exercise bananainpajamas$ Consider the popular example string S:
Construct the suffix array of S using the linear time algorithm Then compute the BWT(S) What’s the relationship between the suffix array and BWT ?

Solution Discussed in the lecture
Will be available in the next lecture’s slides handout If you’ve missed this lecture, refer to the original paper for the details

Step – Identify the type of each suffix
LSLSLSSSLSLSLSLL/S bananainpajamas$ 1

Step – Compute the distance from S
LSLSLSSSLSLSLSLL/S bananainpajamas$

Step – Sort order of chars
LSLSLSSSLSLSLSLL/S bananainpajamas$ $a bijmn ps

Step – Construct m-Lists
LSLSLSSSLSLSLSLL/S bananainpajamas$ $a bijmn ps Scan this once and bucket it according to dist.

Step – Generate m-Lists
[7],[11],[13],[3,5,8],[9],[15] List 2 [16],[4,6,10,12,14] $a bijmn ps

Step – Sort S substrings
Bucket the S substrings [16],[2,4,6,10,12,14],[7],[8] $a bijmn ps

Bucket the S substrings [16],[2,4,6,10,12,14],[7],[8] After using List 1: [16],[6],[10],[12],[2,4],[14],[7],[8] List 2 useless. Then? List 1 [7],[11],[13],[3,5,8],[9],[15] List 2 [16],[4,6,10,12,14]

Bucket the S substrings [16],[2,4,6,10,12,14],[7],[8] After using List 1: [16],[6],[10],[12],[2,4],[14],[7],[8] List 2 useless. Consider 6 before 4: [16],[6],[10],[12],[4],[2],[14],[7],[8] List 1 [7],[11],[13],[3,5,8],[9],[15] List 2 [16],[4,6,10,12,14]

Step – Generate the Suffix Array
[16],[6],[10],[12],[4],[2],[14],[7],[8] $a bijmn ps $a ins

$a bijmn ps $a in s

$a bijmn ps $a in ps

$a bijmn ps $a ijn ps

$a bijmn ps $a ijn ps type S

$a bijmn ps $a bijn ps

$a bijmn ps $a bijmn ps

Final answer bananainpajamas$ Suffix Array:

Final answer bananainpajamas$ 1111111 1234567890123456 Suffix Array:
What is the BWT(S) ?

BWT is easy! bananainpajamas$ Suffix Array: BWT:

BWT construction in linear time
bananainpajamas$ BWT: snpjnbm$aaaaaina

COMP9319 Web Data Compression and Search

Similar presentations

Presentation on theme: "COMP9319 Web Data Compression and Search"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

COMP9319 Web Data Compression and Search

Similar presentations

Presentation on theme: "COMP9319 Web Data Compression and Search"— Presentation transcript:

Similar presentations

About project

Feedback