COMP9319 Web Data Compression and Search


Text indexing and optimization. Slides modified from the slides from haimk, Tel-Aviv U., and the slides from Pang Ko and Srinivas Aluru, Iowa State U.

Text search. Pattern matching directly: brute force, BM, KMP, regular expressions. Indices for pattern matching: inverted files, signature files, suffix trees and suffix arrays.

Readings Remember, as usual, references / original readings are available at: cd ~cs9319/Papers

Signature files. Definition: a word-oriented index structure based on hashing. Uses linear search. Suitable for not very large texts. Structure: based on a hash function that maps words to bit masks. The text is divided into blocks. The bit mask of a block is obtained by bitwise ORing the signatures of all the words in the text block. A word is not in a block if some 1 bit of the query mask is not set in the block mask.

Signature files example. Text: "This is a text. A text has many words. Words are made from letters". The text is cut into four blocks.
Signature function: h(text) = 000101, h(many) = 110000, h(words) = 100100, h(made) = 001100, h(letters) = 100001
Text signature:
block 1   This is a text.       000101
block 2   A text has many       110101
block 3   words. Words are      100100
block 4   made from letters     101101

Signature files: the false drop problem. The corresponding bits are set even though the word is not there! The design should ensure that the probability of a false drop is low. Also, the signature file should be as short as possible. Enhance the hashing function to minimize the error probability.

Signature files: searching and construction. Searching: for a single word, hash the word to a bit mask W. For phrases, hash each word in the query to a bit mask and bitwise OR all the query masks into a bit mask W. Compare W to the bit masks Bi of all the text blocks. If all the bits set in W are also set in Bi, then the text block may contain the word. For all candidate text blocks, an online traversal must be performed to verify that the actual matches are there. Construction: cut the text into blocks. Generate an entry of the signature file for each block. This entry is the bitwise OR of the signatures of all the words in the block.
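The construction and search steps can be sketched in a few lines of Python. This is only an illustration: the word signatures are the ones from the example slide, every other word is assumed to hash to an all-zero mask, and the four-block split and all function names are assumptions chosen so that the block masks come out as on the slide.

```python
# Word signatures from the example slide. Treating every other word as a
# stop word with an all-zero mask is an assumption of this sketch.
SIGNATURES = {
    "text": 0b000101,
    "many": 0b110000,
    "words": 0b100100,
    "made": 0b001100,
    "letters": 0b100001,
}

def normalize(word):
    """Lower-case a word and drop trailing punctuation."""
    return word.lower().strip(".,")

def mask_of(words_text):
    """Bitwise OR of the signatures of all words in a block or a query."""
    mask = 0
    for word in words_text.split():
        mask |= SIGNATURES.get(normalize(word), 0)
    return mask

def build_signature_file(blocks):
    """One bit mask per text block."""
    return [mask_of(block) for block in blocks]

def candidate_blocks(query, sig_file):
    """Blocks whose mask contains every 1 bit of the query mask W."""
    w = mask_of(query)
    return [i for i, b in enumerate(sig_file) if w & b == w]

def search(query, blocks, sig_file):
    """Verify each candidate by scanning its text, discarding false drops."""
    wanted = [normalize(w) for w in query.split()]
    hits = []
    for i in candidate_blocks(query, sig_file):
        present = {normalize(w) for w in blocks[i].split()}
        if all(w in present for w in wanted):
            hits.append(i)
    return hits

# Hypothetical block split (0-based block numbers in the results below).
BLOCKS = ["This is a text.", "A text has many", "words. Words are", "made from letters."]
SIG_FILE = build_signature_file(BLOCKS)

print(candidate_blocks("letters", SIG_FILE))  # [1, 3]: block 1 is a false drop
print(search("letters", BLOCKS, SIG_FILE))    # [3]
```

The query "letters" shows a false drop: the mask of the second block has both bits of h(letters) set (they come from "text" and "many"), so only the verification pass rules it out.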

Suffix trees and suffix arrays

Trie. A tree representing a set of strings, for example { aeef, ad, bbfe, bbfg, c }. [Figure: trie for this set of strings.]

Trie (cont.). Assume no string is a prefix of another. Each edge is labeled by a letter, and no two edges outgoing from the same node are labeled the same. Each string corresponds to a leaf. [Figure: trie for { aeef, ad, bbfe, bbfg, c }.]
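A trie with these properties can be sketched with nested dictionaries, one dictionary per node and one key per outgoing edge. The function names are illustrative, and the sketch relies on the slide's assumption that no string is a prefix of another.

```python
def build_trie(strings):
    """Nested-dict trie: each key is an edge label (one letter)."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})  # reuse the edge if it exists
    return root

def contains(trie, s):
    """True if s is one of the stored strings (i.e. its path ends at a leaf)."""
    node = trie
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return not node  # a leaf is an empty dict, given a prefix-free set

def leaves(trie, prefix=""):
    """All stored strings, visiting edges in lexicographic order."""
    if not trie:
        return [prefix]
    result = []
    for ch in sorted(trie):
        result.extend(leaves(trie[ch], prefix + ch))
    return result

trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(contains(trie, "bbfg"))  # True
print(contains(trie, "bbf"))   # False: the path ends at an internal node
print(leaves(trie))            # ['ad', 'aeef', 'bbfe', 'bbfg', 'c']
```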

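Compressed Trie. Compress unary nodes and label edges by strings. In the example, the chains of unary nodes collapse into edges labeled eef and bbf. [Figure: the trie for { aeef, ad, bbfe, bbfg, c } before and after compression.]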

Suffix tree. Given a string s, a suffix tree of s is a compressed trie of all suffixes of s. To make these suffixes prefix-free we add a special character, say $, at the end of s.

Suffix tree (Example). Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$: { $, b$, ab$, bab$, abab$ }. [Figure: suffix tree of abab$.]

Trivial algorithm to build a Suffix tree. Insert the suffixes of abab$ one at a time, longest first, splitting an edge wherever the new suffix diverges from an existing path.
Put the longest suffix abab$ in.
Put the suffix bab$ in.
Put the suffix ab$ in.
Put the suffix b$ in.
Put the suffix $ in.
We will also label each leaf with the starting point of the corresponding suffix: abab$ is 1, bab$ is 2, ab$ is 3, b$ is 4, $ is 5.
[Figure: the suffix tree of abab$ after each insertion.]

Analysis. Takes O(n^2) time to build. We will see how to do it in O(n) time.

What can we do with it? Exact string matching: given a text T, |T| = n, preprocess it such that when a pattern P, |P| = m, arrives you can quickly decide whether it occurs in T. We may also want to find all occurrences of P in T.

Exact string matching. In preprocessing we just build a suffix tree in O(n) time. Given a pattern P = ab we traverse the tree according to the pattern. [Figure: suffix tree of abab$ with leaves labeled 1 to 5.]

If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(m+k) time. [Figure: suffix tree of abab$ with leaves labeled 1 to 5.]
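The trivial construction and the pattern traversal can be sketched together. This is the O(n^2) algorithm, not the linear-time one; the `Node` class and function names are illustrative, and the input string is assumed not to contain `$`.

```python
class Node:
    def __init__(self, start=None):
        self.edges = {}     # first character -> (edge label, child node)
        self.start = start  # 1-based suffix start for leaves, None otherwise

def build_suffix_tree(s):
    """Trivial O(n^2) construction: insert the suffixes of s$ longest first."""
    text = s + "$"  # assumes '$' does not occur in s
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while True:
            first = rest[0]
            if first not in node.edges:
                node.edges[first] = (rest, Node(start=i + 1))  # new leaf
                break
            label, child = node.edges[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                node, rest = child, rest[k:]  # whole edge matched, go down
                continue
            # Mismatch inside the edge: split it at offset k.
            mid = Node()
            mid.edges[label[k]] = (label[k:], child)
            node.edges[first] = (label[:k], mid)
            node, rest = mid, rest[k:]
    return root

def find_occurrences(root, pattern):
    """Walk down along the pattern, then collect the leaves below."""
    node, rest = root, pattern
    while rest:
        if rest[0] not in node.edges:
            return []  # stuck: the pattern does not occur
        label, child = node.edges[rest[0]]
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return []
        node, rest = child, rest[k:]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.start is not None:
            found.append(current.start)
        stack.extend(child for _, child in current.edges.values())
    return sorted(found)  # sorted only to make the output easy to read

tree = build_suffix_tree("abab")
print(find_occurrences(tree, "ab"))  # [1, 3]
print(find_occurrences(tree, "aa"))  # []
```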

Generalized suffix tree. Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of s ∈ S. To make these suffixes prefix-free we add a special char, say $, at the end of s. To associate each suffix with a unique string in S, add a different special char to each s.

Generalized suffix tree (Example). Let s1 = abab and s2 = aab. The generalized suffix tree for s1 and s2 is the compressed trie of { $, #, b$, b#, ab$, ab#, bab$, aab#, abab$ }. [Figure: generalized suffix tree for abab$ and aab#; each leaf is labeled with the starting position of its suffix.]

So what can we do with it ? Matching a pattern against a database of strings

Longest common substring (of two strings). Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring, and vice versa. Find such a node with the largest "string depth". [Figure: generalized suffix tree for abab$ and aab#.]
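For experimenting with small inputs, the same answer can be obtained without building the tree. The sketch below swaps the generalized suffix tree for a plainly sorted list of the suffixes of s1#s2$: two suffixes that share a deep common ancestor in the tree are neighbours in sorted order, so it is enough to compare adjacent suffixes that come from different strings. It is a quadratic-time illustration under the assumption that `#` and `$` occur in neither string, not the linear-time method of the slide.

```python
def longest_common_substring(s1, s2):
    """Longest common substring via sorted suffixes of s1 + '#' + s2 + '$'."""
    text = s1 + "#" + s2 + "$"  # assumes '#' and '$' occur in neither string
    order = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for a, b in zip(order, order[1:]):
        # Skip neighbours that start in the same source string.
        if (a <= len(s1)) == (b <= len(s1)):
            continue
        k = 0
        # The unique terminators guarantee a mismatch before either end.
        while text[a + k] == text[b + k]:
            k += 1
        if k > len(best):
            best = text[a:a + k]
    return best

print(longest_common_substring("abab", "aab"))  # ab
```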

Lowest common ancestor A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes. [Figure: generalized suffix tree for abab$ and aab#.]

Finding maximal palindromes. A palindrome: caabaac, cbaabc. We want to find all maximal palindromes in a string s. Let s = cbaaba and let s^r denote the reverse of s. The maximal palindrome with center between i-1 and i is the LCP of the suffix at position i of s and the suffix at position m-i+1 of s^r.

Maximal palindromes algorithm. Prepare a generalized suffix tree for s = cbaaba$ and s^r = abaabc#. For every i find the LCA of suffix i of s and suffix m-i+1 of s^r.
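The idea can be tried out without the suffix tree by computing each LCP directly. The sketch below uses 0-based Python indices and no terminator, so the index into the reversed string is written as `m - i`; with LCA queries on the generalized suffix tree each LCP would take O(1), whereas this naive version is quadratic in the worst case. It only reports even-length palindromes, i.e. those whose center lies between two characters. Function names are illustrative.

```python
def lcp_length(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k

def maximal_even_palindromes(s):
    """Map each center i (between s[i-1] and s[i], 0-based) to the maximal
    palindrome around it; centers with no palindrome are omitted."""
    m = len(s)
    r = s[::-1]
    result = {}
    for i in range(1, m):
        # s[i:] reads rightwards from the center; r[m - i:] is
        # s[i-1], s[i-2], ... and reads leftwards from the center.
        k = lcp_length(s[i:], r[m - i:])
        if k:
            result[i] = s[i - k:i + k]
    return result

print(maximal_even_palindromes("cbaaba"))  # {3: 'baab'}
```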

Let s = cbaaba$, then s^r = abaabc#. [Figure: generalized suffix tree for cbaaba$ and abaabc#, with each leaf labeled by the starting position (1 to 7) of its suffix.]

Analysis O(n) time to identify all palindromes

Drawbacks Suffix trees consume a lot of space It is O(n) but the constant is quite big Notice that if we indeed want to traverse an edge in O(1) time then we need an array of ptrs. of size |Σ| in each node

Suffix array. We lose some of the functionality but we save space. Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab. The suffix array gives the indices of the suffixes in sorted order: 3 1 4 2.
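For small strings the definition can be checked directly by sorting the suffixes. This is a naive sketch (roughly O(n^2 log n) because whole suffixes are compared), not one of the construction algorithms discussed in these slides; positions are 1-based to match the slides.

```python
def suffix_array(s):
    """1-based starting positions of the suffixes of s in sorted order."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]

print(suffix_array("abab"))          # [3, 1, 4, 2]
print(suffix_array("MISSISSIPPI$"))  # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```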

How do we build it ? Build a suffix tree Traverse the tree in DFS, lexicographically picking edges outgoing from each node and fill the suffix array. O(n) time

How do we search for a pattern? If P occurs in T then all its occurrences are consecutive in the suffix array. Do a binary search on the suffix array. Takes O(m log n) time.
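A minimal sketch of the search: two binary searches locate the first and last suffix that starts with P, and each comparison looks at no more than m characters. The suffix array is built naively here just to keep the example self-contained; the function names are illustrative.

```python
def build_suffix_array(text):
    """Naive construction, 1-based positions (for illustration only)."""
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]

def find_occurrences(text, sa, pattern):
    """All 1-based positions where pattern occurs, in O(m log n + k)."""
    m = len(pattern)

    def prefix(rank):
        start = sa[rank] - 1
        return text[start:start + m]  # first m characters of that suffix

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])  # sorted only for readable output

text = "mississippi"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "issi"))  # [2, 5]
print(find_occurrences(text, sa, "issa"))  # []
```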

Example. Let S = mississippi and let P = issa. The binary search moves the pointers L, M and R over the sorted suffixes:
SA   Suffix
11   i
 8   ippi
 5   issippi
 2   ississippi
 1   mississippi
10   pi
 9   ppi
 7   sippi
 4   sissippi
 6   ssippi
 3   ssissippi

Supra index. Structure: suffix arrays are a space-efficient implementation of suffix trees, simply an array containing all the pointers to the text suffixes listed in lexicographical order. Suffix arrays are designed to allow binary searches done by comparing the contents of each pointer. Supra-indices: if the suffix array is large, this binary search can perform poorly because of the number of random disk accesses. To remedy this situation, the use of supra-indices over the suffix array has been proposed.

Supra index example. Text: "This is a text. A text has many words. Words are made from letters".
Word start positions: 1 6 9 11 17 19 24 28 33 40 46 50 55 60
Suffix array: 60 50 28 19 11 40 33
Supra-index: lett, text, word
[Figure: suffix tree, suffix array and supra-index for the text.]

Space Efficient Linear Time Construction of Suffix Arrays A good paper by Pang Ko and Srinivas Aluru

Suffix Array. Sorted order of suffixes of a string T. Represented by the starting position of the suffix.
Text          M  I  S  S  I  S  S  I  P  P  I  $
Index         1  2  3  4  5  6  7  8  9 10 11 12
Suffix Array 12 11  8  5  2  1 10  9  7  4  6  3

Brief History. Introduced by Manber and Myers in 1989. Takes O(n log n) time, and 8n bytes. Many other non-linear time algorithms.
Authors          Time         Space (bytes)
Manber & Myers   n log n      8n
Sadakane                      9n
String-sorting   n^2 log n    5n
Radix-sorting    n^2

Our Result Among the first linear-time direct suffix array construction algorithms. Solves an important open problem. For constant size alphabet, only uses 8n bytes. Easily implementable. Can also be used as a space efficient suffix tree construction algorithm.

Notation. String T = t_1…t_n, over the alphabet Σ = {1…n}. t_n = '$', where '$' is a unique character. T_i = t_i…t_n denotes the i-th suffix of T. For strings α and β, α < β denotes that α is lexicographically smaller than β.

Overview. Divide all suffixes of T into two types. Type S suffixes = {T_i | T_i < T_{i+1}}. Type L suffixes = {T_j | T_j > T_{j+1}}. The last suffix is both type S and L. Sort all suffixes of one of the types. Obtain the lexicographical order of all suffixes from the sorted ones.

Identify Suffix Types. The type of each suffix in T can be determined in one scan of the string.
Text  M  I  S  S  I  S  S  I  P  P  I  $
Type  L  S  L  L  S  L  L  S  L  L  L  L/S
M > I, so T_1 > T_2 and T_1 is type L.
I < S, so T_2 < T_3 and T_2 is type S.
S = S, so check the next character: S > I, so T_3 > T_4 > T_5, and T_3 and T_4 are type L.
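The classification can be sketched as a single pass. This version scans from right to left, so that on equal characters the type is simply copied from the next suffix; that is an implementation choice of this sketch, equivalent to the look-ahead described on the slide. The last suffix, which the slide marks as both L and S, is reported as S here.

```python
def suffix_types(text):
    """Type ('S' or 'L') of every suffix of text, which must end in '$'."""
    n = len(text)
    types = ["S"] * n  # the last suffix is both types; reported as S here
    for i in range(n - 2, -1, -1):
        if text[i] > text[i + 1]:
            types[i] = "L"
        elif text[i] < text[i + 1]:
            types[i] = "S"
        else:
            types[i] = types[i + 1]  # equal characters: same type as T_{i+1}
    return "".join(types)

print(suffix_types("MISSISSIPPI$"))  # LSLLSLLSLLLS
```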

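Notation. Type S suffixes: in MISSISSIPPI$ these are the suffixes starting at positions 2, 5, 8 and 12. Type S positions / characters: the positions where type S suffixes start, and the characters at those positions. Type S substrings: the substrings running from one type S position to the next, here ISSI, ISSI, IPPI$, plus the final $. [Figure: MISSISSIPPI$ with the type S positions marked.]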

Sorting Type S Suffixes. Sort all type S substrings. Replace each type S substring by its bucket number. The new string is the sequence of bucket numbers. Sorting all type S suffixes = sorting all suffixes of the new string.

Sorting Type S Substrings. Bucket sort the type S substrings: first according to the 1st character, then according to the 2nd character, and so on, each pass breaking buckets into more buckets. Sort each substring until the next type S character. Bucket sort takes potentially O(n^2) time. Substitute the substrings with the bucket numbers to obtain a new string; for MISSISSIPPI$ the type S substrings ISSI, ISSI, IPPI$, $ get bucket numbers 3, 3, 2, 1. Apply sorting recursively to the new string. [Figure: bucket sort of the type S substrings of MISSISSIPPI$.]

Solution. Observation: each character participates in the bucket sort at most twice. Type L characters only participate in the bucket sort once. Solution: sort all the characters once. Construct m lists according to the distance to the closest type S character to the left.

Illustration.
Index      1  2  3  4  5  6  7  8  9 10 11 12
Text       M  I  S  S  I  S  S  I  P  P  I  $
Type          S        S        S           S
Distance   0  0  1  2  3  1  2  3  1  2  3  4
Sorted order of characters: 12 2 5 8 11 1 9 10 3 4 6 7
Build the lists from the sorted order, then sort the type S substrings using the lists.
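The distance row and the lists can be sketched as follows. The function names are illustrative, the type string is passed in with the last suffix marked S, and Python's stable sort stands in for the single bucket sort of the characters. Positions in the output are 1-based, and within each list the positions stay grouped by character.

```python
def distances_to_left_s(types):
    """Distance from each position to the closest type S position strictly
    to its left (0 when there is none)."""
    dist, last_s = [], None
    for i, t in enumerate(types):
        dist.append(0 if last_s is None else i - last_s)
        if t == "S":
            last_s = i
    return dist

def build_lists(text, dist):
    """List d holds the positions at distance d, in sorted character order."""
    order = sorted(range(len(text)), key=lambda i: text[i])  # stable
    lists = {}
    for i in order:
        d = dist[i]
        if d == 0:
            continue
        groups = lists.setdefault(d, [])
        if groups and text[groups[-1][-1] - 1] == text[i]:
            groups[-1].append(i + 1)  # same character: same group
        else:
            groups.append([i + 1])
    return lists

print(distances_to_left_s("LSLLSLLSLLLS"))
# [0, 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4]
```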

Construct Suffix Array for all Suffixes. Bucket all suffixes by their first character ($, I, M, P, S) and place the sorted type S suffixes at the ends of their buckets. The first suffix in the suffix array is a type S suffix. For 1 ≤ i ≤ n, if T_{SA[i]-1} is type L, move it to the current front of its bucket. Result for MISSISSIPPI$: 12 11 8 5 2 1 10 9 7 4 6 3.
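This last step can be sketched on its own, given the type S suffixes already in sorted order. The function name and the use of dictionaries for the bucket pointers are assumptions of this sketch; positions are 1-based, and the text is assumed to end in a unique `$` that is smaller than every other character.

```python
def induce_suffix_array(text, sorted_s):
    """Full suffix array from the sorted type S suffixes (1-based positions)."""
    n = len(text)
    # Suffix types; the last suffix is treated as type S.
    is_s = [True] * n
    for i in range(n - 2, -1, -1):
        is_s[i] = text[i] < text[i + 1] or (text[i] == text[i + 1] and is_s[i + 1])
    # Bucket boundaries by first character.
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    head, tail, pos = {}, {}, 0
    for ch in sorted(counts):
        head[ch] = pos
        pos += counts[ch]
        tail[ch] = pos - 1
    sa = [None] * n
    # Type S suffixes go to the ends of their buckets, keeping sorted order.
    for p in reversed(sorted_s):
        ch = text[p - 1]
        sa[tail[ch]] = p
        tail[ch] -= 1
    # Left-to-right scan: a type L predecessor moves to the current front
    # of its own bucket.
    for i in range(n):
        p = sa[i]
        if p is not None and p > 1 and not is_s[p - 2]:
            ch = text[p - 2]
            sa[head[ch]] = p - 1
            head[ch] += 1
    return sa

print(induce_suffix_array("MISSISSIPPI$", [12, 8, 5, 2]))
# [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```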

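Run-Time Analysis. Identify types of suffixes -- O(n) time. Bucket sort type S (or L) substrings -- O(n) time. Construct suffix array from sorted type S (or L) suffixes -- O(n) time. The recursion runs on the string of bucket numbers, which has one symbol per type S (or L) suffix; choosing the type with fewer suffixes gives length at most n/2. T(n) = T(n/2) + O(n) = O(n).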

Space Requirement. For each element in an array, use 2 integers to mark the beginning and the end of its bucket. Use a Boolean array to mark the boundary of each bucket. This takes 12n bytes for |Σ| ~ n. At most 3n bits are needed for the Boolean arrays.

Constant size Alphabet. No need to create the m lists, because bucket sort takes O(n) time. This reduces the space needed to 4n bytes for the 1st recursion step. Maximum space used at any time is 8n bytes (and a maximum of 3 Boolean arrays).

Conclusion. Among the first suffix array construction algorithms that take O(n) time. The algorithm can be easily implemented in 8n bytes (plus a few Boolean arrays). Equal or less space than most non-linear time algorithms. Can be used as a space efficient suffix tree construction algorithm.

Exercise. Consider the popular example string S: bananainpajamas$. Construct the suffix array of S using the linear time algorithm. Then compute the BWT(S). What's the relationship between the suffix array and BWT?

Solution Discussed in the lecture Will be available in the next lecture’s slides handout If you’ve missed this lecture, refer to the original paper for the details

Step – Identify the type of each suffix
Index   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Text    b  a  n  a  n  a  i  n  p  a  j  a  m  a  s  $
Type    L  S  L  S  L  S  S  S  L  S  L  S  L  S  L  L/S

Step – Compute the distance from S
Index   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Text    b  a  n  a  n  a  i  n  p  a  j  a  m  a  s  $
Type    L  S  L  S  L  S  S  S  L  S  L  S  L  S  L  L/S
Dist    0  0  1  2  1  2  1  1  1  2  1  2  1  2  1  2

Step – Sort order of chars
Bucket      $  | a              | b | i | j  | m  | n     | p | s
Positions   16 | 2 4 6 10 12 14 | 1 | 7 | 11 | 13 | 3 5 8 | 9 | 15

Step – Construct m-Lists. Scan the sorted order of chars (16 | 2 4 6 10 12 14 | 1 | 7 | 11 | 13 | 3 5 8 | 9 | 15) once and bucket each position according to its distance.

Step – Generate m-Lists
Sorted order   16  2  4  6 10 12 14  1  7 11 13  3  5  8  9 15
Distance        2  0  2  2  2  2  2  0  1  1  1  1  1  1  1  1
List 1: [7],[11],[13],[3,5,8],[9],[15]
List 2: [16],[4,6,10,12,14]

Step – Sort S substrings
List 1: [7],[11],[13],[3,5,8],[9],[15]
List 2: [16],[4,6,10,12,14]
Bucket the S substrings by first character: [16],[2,4,6,10,12,14],[7],[8]
After using List 1: [16],[6],[10],[12],[2,4],[14],[7],[8]
List 2 is useless here: it cannot separate 2 and 4. Then? Consider 6 before 4: the substring starting at 4 ends at position 6 and the substring starting at 2 ends at position 4, and 6 is already sorted before 4.
Result: [16],[6],[10],[12],[4],[2],[14],[7],[8]

Step – Generate the Suffix Array
Place the sorted type S suffixes [16],[6],[10],[12],[4],[2],[14],[7],[8] at the ends of their buckets, then scan from left to right and move each type L predecessor to the current front of its bucket.
Scan 16: 15 is type L.   $: 16 | a: 6 10 12 4 2 14 | i: 7 | n: 8 | s: 15
Scan 6: 5 is type L.     $: 16 | a: 6 10 12 4 2 14 | i: 7 | n: 5 8 | s: 15
Scan 10: 9 is type L.    $: 16 | a: 6 10 12 4 2 14 | i: 7 | n: 5 8 | p: 9 | s: 15
Scan 12: 11 is type L.   $: 16 | a: 6 10 12 4 2 14 | i: 7 | j: 11 | n: 5 8 | p: 9 | s: 15
Scan 4: 3 is type L.     $: 16 | a: 6 10 12 4 2 14 | i: 7 | j: 11 | n: 5 3 8 | p: 9 | s: 15   (8, at the end of bucket n, is type S)
Scan 2: 1 is type L.     $: 16 | a: 6 10 12 4 2 14 | b: 1 | i: 7 | j: 11 | n: 5 3 8 | p: 9 | s: 15
Scan 14: 13 is type L.   $: 16 | a: 6 10 12 4 2 14 | b: 1 | i: 7 | j: 11 | m: 13 | n: 5 3 8 | p: 9 | s: 15

Final answer
Index          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Text           b  a  n  a  n  a  i  n  p  a  j  a  m  a  s  $
Suffix Array  16  6 10 12  4  2 14  1  7 11 13  5  3  8  9 15
What is the BWT(S)?

BWT is easy! Take the character just before each suffix, i.e. position SA[i]-1 (position 0 wraps around to 16).
Suffix Array   16  6 10 12  4  2 14  1  7 11 13  5  3  8  9 15
BWT position   15  5  9 11  3  1 13 16  6 10 12  4  2  7  8 14

BWT construction in linear time
Text           bananainpajamas$
BWT position   15  5  9 11  3  1 13 16  6 10 12  4  2  7  8 14
BWT            s  n  p  j  n  b  m  $  a  a  a  a  a  i  n  a
BWT(S) = snpjnbm$aaaaaina
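The relationship between the suffix array and the BWT fits in one line of Python: each BWT character is the character preceding the corresponding suffix, wrapping around for the suffix that starts at position 1. The function name is illustrative, and the suffix array is the 1-based one from the final answer of the exercise.

```python
def bwt_from_suffix_array(text, sa):
    """BWT of text, given its 1-based suffix array."""
    # Suffix p is preceded by the character at 1-based position p-1;
    # for p = 1 that wraps around to the last character.
    return "".join(text[p - 2] if p > 1 else text[-1] for p in sa)

S = "bananainpajamas$"
SA = [16, 6, 10, 12, 4, 2, 14, 1, 7, 11, 13, 5, 3, 8, 9, 15]
print(bwt_from_suffix_array(S, SA))  # snpjnbm$aaaaaina
```

Since the suffix array is available in O(n) time, one extra pass like this gives the BWT in linear time as well.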