String Processing II: Compressed Indexes Patrick Nichols Jon Sheffi Dacheng Zhao


Compressed Indexes - Nichols, Sheffi, Zhao 2 The Big Picture
We’ve seen ways of using complex data structures (suffix arrays and trees) to perform character string queries.
The Burrows-Wheeler transform (BWT) is a reversible transformation of the text, computed from its sorted suffixes.
Compressing the transformed text improves performance.

Lecture Outline
Motivation and compression
Review of suffix arrays
The BW transform (to and from)
Searching in compressed indexes
Conclusion
Questions

Motivation
Most interesting massive data sets contain string data (the web, the human genome, digital libraries, mailing lists).
There are incredible amounts of textual data out there (~1000 TB) (Ferragina).
Performing high-speed queries on such material is critical for many applications.

Why Compress Data?
Compression saves space (though disks are getting cheaper: < $1/GB).
I/O bottlenecks and Moore’s law make CPU operations “free”.
We want to minimize seeks and reads for indexes too large to fit in main memory.
More on compression in lecture 21.

Background
Last time, we saw the suffix array, which provides pointers to the ordered suffixes of a string T.
T = ababc
Suffixes in sorted order: T[1] = ababc, T[3] = abc, T[2] = babc, T[4] = bc, T[5] = c
A = [1 3 2 4 5]
Entry A[i] is the starting position in T of the i-th suffix in lexicographic order.
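The suffix array for the example can be checked with a few lines of Python. This is only a minimal sketch: the function name suffix_array is an illustrative choice, the construction is the naive sort of all suffixes (not a linear-time algorithm), and the result is shifted to the 1-based positions used on the slides.

```python
def suffix_array(text):
    """Return the 1-based start positions of the suffixes of text, in sorted order."""
    # Naive construction: sort the start positions by the suffix that begins there.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # Python indexes from 0; the slides index from 1.
    return [i + 1 for i in order]


print(suffix_array("ababc"))  # [1, 3, 2, 4, 5]
```

Sorting whole suffixes costs far more than linear time on long inputs, so this is suitable only for checking small examples by hand.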

Background
What’s wrong with suffix trees and arrays?
They use O(N log N) + N log |Σ| bits (an array of N numbers plus the text, assuming alphabet Σ).
This could be much more than the size of the uncompressed text, since usually log N = 32 and log |Σ| = 8.
We can use compression to use less space, and still do it in linear time!

BW-Transform
Why BWT? We can use the BWT to compress T in a provably optimal manner, using O(H_k(T)) + o(1) bits per input symbol in the worst case, where H_k(T) is the k-th order empirical entropy.
What is H_k? H_k is the maximum compression we can achieve using, for each character, a code which depends on the k characters preceding it.
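To make H_k concrete, the sketch below computes it for a small string using the usual definition of empirical entropy: H_0 is the entropy of the character frequencies, and H_k(T) is the sum over every length-k context w of (number of characters following w) times H_0 of those characters, divided by |T|. The function names h0 and hk are illustrative assumptions, not part of the lecture.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols):
    """Zeroth-order empirical entropy, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def hk(text, k):
    """k-th order empirical entropy: H_0 of the followers of each length-k context."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        # text[i-k:i] is the context; text[i] is the character that follows it.
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)


# In "ababc", 'a' is always followed by 'b' (0 bits), while 'b' is followed
# by 'a' or 'c' (1 bit each), so H_1 = (2 * 0 + 2 * 1) / 5.
print(hk("ababc", 1))  # 0.4
```

A text whose next character is fully determined by the previous k characters has H_k = 0, which is why higher-order bounds can be far smaller than H_0.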

The BW-Transform
1. Start with text T. Append the character #, which is lexicographically before all other characters in the alphabet Σ.
2. Generate all of the cyclic shifts of T# and sort them lexicographically, forming a matrix M with rows and columns equal to |T#| = |T| + 1.
3. Construct L, the transformed text of T, by taking the last column of M.

BW-Transform Example
Let T = ababc.
Cyclic shifts of T#:
ababc#
#ababc
c#abab
bc#aba
abc#ab
babc#a
M (sorted cyclic shifts of T#):
#ababc
ababc#
abc#ab
babc#a
bc#aba
c#abab

BW-Transform Example (continued)
For T = ababc, reading the columns of the sorted matrix M gives:
F = first column of M = #aabbc
L = last column of M = c#baab
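The three steps of the transform can be sketched directly in Python. This is a naive illustration that materializes the whole matrix M, so it takes quadratic space; the function name bwt and the sentinel parameter are assumptions made for the example. It relies on '#' sorting before letters and digits in ASCII.

```python
def bwt(text, sentinel="#"):
    """Return (M, F, L): sorted cyclic shifts of text + sentinel, first and last columns."""
    s = text + sentinel
    # All cyclic shifts of T#, sorted lexicographically.
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    first = "".join(row[0] for row in rows)
    last = "".join(row[-1] for row in rows)
    return rows, first, last


M, F, L = bwt("ababc")
print(M)  # ['#ababc', 'ababc#', 'abc#ab', 'babc#a', 'bc#aba', 'c#abab']
print(F)  # #aabbc
print(L)  # c#baab
```

A practical implementation would derive L from a suffix array instead of building M, since only the last column is needed.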

Inverse BW-Transform
Construct C[1…|Σ|], which stores in C[c] the cumulative number of occurrences in T of characters 1 through c-1.
Construct an LF-mapping LF[1…|T|+1], which maps each character to the character occurring previously in T, using only L and C.
Reconstruct T backwards by threading through the LF-mapping and reading the characters off of L.

Inverse BW-Transform: Construction of C
Store in C[c] the number of occurrences in T# of the characters {#, 1, …, c-1}.
In our example, T# = ababc# contains 1 #, 2 a, 2 b, 1 c:
      #  a  b  c
C = [ 0  1  3  5 ]
Notice that C[c] + n is the position of the nth occurrence of c in F (if any).
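A small sketch of the construction of C, assuming the characters are ordered by their ordinary Python sort order (so '#' comes first). The name build_c is hypothetical, and the table is stored as a dictionary keyed by character rather than as an array indexed by character rank.

```python
from collections import Counter


def build_c(last_column):
    """Map each character to the number of characters in the text that sort before it."""
    counts = Counter(last_column)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total  # everything counted so far is strictly smaller than ch
        total += counts[ch]
    return c_table


print(build_c("c#baab"))  # {'#': 0, 'a': 1, 'b': 3, 'c': 5}
```

Because L is a permutation of T#, counting characters in L gives the same table as counting them in T#.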

Inverse BW-Transform: Constructing the LF-mapping
Why and how the LF-mapping?
Notice that for every row of M, L[i] directly precedes F[i] in the text (thanks to the cyclic shifts).
Let L[i] = c, let r_i be the number of occurrences of c in the prefix L[1,i], and let M[j] be the r_i-th row of M that starts with c. Then the character in the first column F corresponding to L[i] is located at F[j].
How do we use this fact in the LF-mapping?

Inverse BW-Transform: Constructing the LF-mapping
So, define LF[1…|T|+1] as LF[i] = C[L[i]] + r_i.
C[L[i]] gets us the proper offset to the zeroth occurrence of L[i], and the addition of r_i gets us the r_i-th row of M that starts with c = L[i].

Inverse BW-Transform: Constructing the LF-mapping
LF[i] = C[L[i]] + r_i, with L = c#baab and C = [0 1 3 5] for (#, a, b, c):
LF[1] = C[L[1]] + 1 = 5 + 1 = 6
LF[2] = C[L[2]] + 1 = 0 + 1 = 1
LF[3] = C[L[3]] + 1 = 3 + 1 = 4
LF[4] = C[L[4]] + 1 = 1 + 1 = 2
LF[5] = C[L[5]] + 2 = 1 + 2 = 3
LF[6] = C[L[6]] + 2 = 3 + 2 = 5
LF[] = [6 1 4 2 3 5]
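The same table can be produced mechanically. The sketch below is one possible rendering of LF[i] = C[L[i]] + r_i; the function name lf_mapping is an assumption, and the returned list holds 1-based row numbers so that it can be compared with the values above (element i-1 of the list is LF[i]).

```python
def lf_mapping(last_column):
    """Return LF as a list of 1-based row numbers, one entry per row of M."""
    # C[c]: number of characters in the text that sort before c.
    counts = {}
    for ch in last_column:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    seen = {}  # running count of each character, i.e. r_i for the current row
    lf = []
    for ch in last_column:
        seen[ch] = seen.get(ch, 0) + 1
        lf.append(c_table[ch] + seen[ch])
    return lf


print(lf_mapping("c#baab"))  # [6, 1, 4, 2, 3, 5]
```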

Inverse BW-Transform: Reconstruction of T
Start with T[] blank.
Let u = |#T|.
Initialize s = 1 and T[u] = L[1]. We know that L[1] is the last character of T because M[1] = #T.
For each i = u-1, …, 1 do:
  s = LF[s] (threading backwards)
  T[i] = L[s] (read off the next letter back)
The array ends up holding #T: the final step writes # into T[1], and the text itself occupies T[2…u].

Inverse BW-Transform: Reconstruction of T
First step: s = 1, T = [_ _ _ _ _ c]
Second step: s = LF[1] = 6, T = [_ _ _ _ b c]
Third step: s = LF[6] = 5, T = [_ _ _ a b c]
Fourth step: s = LF[5] = 3, T = [_ _ b a b c]
And so on…
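The threading procedure can be sketched as follows. This is an illustrative version that uses 0-based indices internally, assumes the sentinel occurs exactly once and sorts before every other character (so that row 1 of M is #T), and strips the sentinel from the result. The name inverse_bwt is hypothetical.

```python
def inverse_bwt(last_column):
    """Rebuild the original text from L, assuming a unique smallest sentinel."""
    counts = {}
    for ch in last_column:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # 0-based LF: row index in M of the character L[i] when it appears in F.
    seen, lf = {}, []
    for ch in last_column:
        lf.append(c_table[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    u = len(last_column)
    out = [""] * u
    s = 0                      # row 0 is #T, so L[0] is the last character of T
    out[u - 1] = last_column[0]
    for i in range(u - 2, -1, -1):
        s = lf[s]              # thread backwards
        out[i] = last_column[s]
    # out now spells sentinel + text; drop the leading sentinel.
    return "".join(out[1:])


print(inverse_bwt("c#baab"))  # ababc
```

Each step costs O(1) once C and LF are available, which matches the linear-time reversal claimed in the summary.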

BW Transform Summary
The BW transform is reversible.
We can construct it in O(n) time.
We can reverse it to reconstruct T in O(n) time, using O(n) space.
Once we obtain L, we can compress L in a provably efficient manner.

So, what can we do with compressed data?
It’s compressed, hence saving us space; to search, simply decompress and search.
Search for the number of occurrences in the compressed (mostly compressed) data.
Locate where the occurrences are in the original string from the compressed (mostly compressed) data.

BWT_count Overview
BWT_count begins with the last character of the query P[1,p] and works backwards toward the first.
Simplistically, BWT_count looks for successively longer suffixes of P[1,p]. If a suffix of P[1,p] is not in T, quit.
Running time is O(p), because the running time of Occ(c, 1, k) is O(1).
Space needed = |compressed L| + space needed by Occ() = |compressed L| + O((u / log u) log log u).

Searching BWT-compressed text
Algorithm BWT_count(P[1,p])
  c = P[p], i = p
  sp = C[c] + 1, ep = C[c+1]
  while (sp ≤ ep) and (i ≥ 2) do
    c = P[i-1]
    sp = C[c] + Occ(c, 1, sp – 1) + 1
    ep = C[c] + Occ(c, 1, ep)
    i = i – 1
  if (ep < sp) then return “pattern not found” else return “found (ep – sp + 1) occurrences”
Invariant: at the i-th stage, sp points at the first row of M prefixed by P[i, p] and ep points to the last row of M prefixed by P[i, p].
Occ(c, 1, k) finds the number of occurrences of c in the range 1 to k in L.

BWT_Count example
P = ababc
Rows of M: 1 #ababc, 2 ababc#, 3 abc#ab, 4 babc#a, 5 bc#aba, 6 c#abab
      #  a  b  c
C = [ 0  1  3  5 ]
step     c  i  sp            ep
initial  c  5  6             6
while 1  b  4  3+1+1 = 5     3+2 = 5
while 2  a  3  1+1+1 = 3     1+2 = 3
while 3  b  2  3+0+1 = 4     3+1 = 4
while 4  a  1  1+0+1 = 2     1+1 = 2
Notice that Occ(c, 1, sp – 1) counts the rows beginning with c that sort before every row prefixed by cP[i,p], and Occ(c, 1, ep) counts the rows beginning with c that sort before or among them.
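The backward search traced in this example can be written out as a short Python sketch. It follows the 1-based sp and ep of the slides, uses a plain linear scan for Occ (so each step costs O(k) here, not O(1)), and computes ep for the first character as C[c] plus the number of occurrences of c, which equals C[c+1]. The function name bwt_count and the choice to raise on an empty pattern are assumptions for illustration.

```python
def bwt_count(pattern, last_column):
    """Count occurrences of pattern using only L and C (backward search)."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    counts = {}
    for ch in last_column:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    def occ(ch, k):
        # Occurrences of ch in L[1..k]; a linear scan stands in for the O(1) structure.
        return last_column[:k].count(ch)

    ch = pattern[-1]
    if ch not in c_table:
        return 0
    i = len(pattern)
    sp = c_table[ch] + 1
    ep = c_table[ch] + counts[ch]      # same as C[c+1]
    while sp <= ep and i >= 2:
        ch = pattern[i - 2]            # P[i-1] in the slides' 1-based notation
        if ch not in c_table:
            return 0
        sp = c_table[ch] + occ(ch, sp - 1) + 1
        ep = c_table[ch] + occ(ch, ep)
        i -= 1
    return max(0, ep - sp + 1)


L = "c#baab"                   # BWT of ababc#
print(bwt_count("ababc", L))   # 1
print(bwt_count("ab", L))      # 2
print(bwt_count("ca", L))      # 0
```

The loop never touches the original text, only L and C, which is what allows the search to run over the compressed representation.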

Running Time of Occ(c, 1, k)
We can do this trivially in O(log k) time with augmented B-trees by exploiting the continuous runs in L:
– One tree per character
– Nodes store ranges and the total number of said character in that range
By exploiting other techniques, we can reduce the time to O(1).
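As a rough illustration of trading space for Occ query time, the sketch below uses sampled checkpoints. Note that this is a simpler, swapped-in technique, not the augmented B-tree or the O(1) structure mentioned above: it stores the character counts at every step-th position of L and scans at most step - 1 characters per query. The class name OccTable and the parameter step are hypothetical.

```python
class OccTable:
    """Answer Occ(c, 1, k) from sampled prefix counts plus a short scan."""

    def __init__(self, last_column, step=4):
        self.last = last_column
        self.step = step
        self.checkpoints = [{}]          # counts for the empty prefix
        running = {}
        for idx, ch in enumerate(last_column, start=1):
            running[ch] = running.get(ch, 0) + 1
            if idx % step == 0:
                # Snapshot of the counts in L[1..idx].
                self.checkpoints.append(dict(running))

    def occ(self, ch, k):
        """Number of occurrences of ch in L[1..k] (1-based, inclusive)."""
        block = k // self.step
        count = self.checkpoints[block].get(ch, 0)
        # Scan only the characters after the nearest checkpoint at or before k.
        count += self.last[block * self.step:k].count(ch)
        return count


table = OccTable("c#baab", step=4)
print(table.occ("a", 5))  # 2
print(table.occ("b", 3))  # 1
```

A larger step saves space at the cost of longer scans; a smaller one does the opposite.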

Locating the Occurrences
Naïve solution: use BWT_count to find the number of occurrences and also sp and ep. Uncompress L, untransform M and calculate the position of each occurrence in the string.
Better solution (time O(p + occ log^2 u), space O(u / log u)):
1. Preprocess M by logically marking the rows of M which correspond to text positions 1 + i·n, where n = Θ(log^2 u) and i = 0, 1, …, u/n.
2. To find pos(s): if s is marked, done; otherwise, use LF to find the row s’ corresponding to the suffix T[pos(s) – 1, u]. Iterate v times until s’ points to a marked row; then pos(s) = pos(s’) + v.
Best solution (time O(p + occ log^ε u), space …): refine the better solution so that we still mark rows but also have “shortcuts” so that we can jump by more than one character at a time.

Finding Occurrences
Summary:
Mark and store the position of every Θ(log^2 u) rows in shifted T (the matrix M of shifted T is u+1 by u+1).
Compute M, L, LF, C.
Run BWT_count to obtain sp and ep.
For each row in [sp, ep], use LF[] to shift backwards until a marked row is reached.
Count the shifts; the position is # shifts + pos of the marked row.
Changing rows in L using LF[] is essentially shifting sequentially in T. Since marked rows are spaced Θ(log^2 u) apart, we shift at most Θ(log^2 u) times before we find a marked row.

Locating Occurrences Example
Rows of M: 1 #ababc, 2 ababc#, 3 abc#ab, 4 babc#a, 5 bc#aba, 6 c#abab
LF[] = [6 1 4 2 3 5]
sp = ep = 5; row 2 is marked, with pos(2) = 1. What is pos(5)?
pos(5) = 1 + pos(LF[5]) = 1 + pos(3)
pos(3) = 1 + pos(LF[3]) = 1 + pos(4)
pos(4) = 1 + pos(LF[4]) = 1 + pos(2)
pos(5) = 1 + 1 + 1 + pos(2) = 4
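The marked-row scheme can be sketched end to end. This is a toy version under stated assumptions: the index is built naively from sorted rotations, rows are 0-based internally, Occ is a linear scan, and the spacing of marked text positions is a plain parameter (sample_rate, a hypothetical name) instead of Θ(log^2 u). Text position 1 is always marked, so every LF walk terminates.

```python
def build_index(text, sample_rate=6, sentinel="#"):
    """Return (L, C, LF, marked) for text + sentinel; marked maps row -> 1-based position."""
    s = text + sentinel
    n = len(s)
    # Row r of M is the rotation starting at order[r] (0-based).
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    last = [s[i - 1] for i in order]
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    seen, lf = {}, []
    for ch in last:
        lf.append(c_table[ch] + seen.get(ch, 0))   # 0-based LF
        seen[ch] = seen.get(ch, 0) + 1
    # Mark rows whose text position is 1, 1 + sample_rate, 1 + 2*sample_rate, ...
    marked = {row: start + 1 for row, start in enumerate(order)
              if start % sample_rate == 0}
    return last, c_table, lf, marked


def locate(pattern, index):
    """Return the sorted 1-based positions of pattern in the text."""
    last, c_table, lf, marked = index
    lo, hi = 0, len(last)                          # half-open row range [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + last[:lo].count(ch)
        hi = c_table[ch] + last[:hi].count(ch)
        if lo >= hi:
            return []
    positions = []
    for row in range(lo, hi):
        steps = 0
        while row not in marked:                   # walk back one text position per step
            row = lf[row]
            steps += 1
        positions.append(marked[row] + steps)      # pos(s) = pos(s') + v
    return sorted(positions)


index = build_index("ababc")       # with sample_rate=6 only text position 1 is marked
print(locate("bc", index))         # [4]
print(locate("ab", index))         # [1, 3]
```

With sample_rate=6 and T = ababc, the query "bc" starts at row 5 and takes three LF steps to reach the marked row 2, reproducing pos(5) = 4 from the example.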

Conclusions
Free CPU operations make compression a great idea, given I/O bottlenecks.
The BW transform makes the index more amenable to compression.
We can perform string queries on a compressed index without any substantial performance loss.

Questions?