Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.



Outline
The Text Searching Problem
What is Full-Text Indexing?
Burrows-Wheeler Transform (BWT)
BWT as a Full-Text Index
Related Work

Text Searching
Text: acacaaccagtcacactagac……
Pattern: acac
Where does the pattern occur in the text?

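How fast can we search?
Let n be the length of the text and m be the length of the pattern.
We can find all positions where the pattern appears in O(n + m) time (e.g., Knuth-Morris-Pratt, Boyer-Moore).
Is O(n + m) time good? Yes: when the text cannot be preprocessed, it is optimal in the worst case, because every character of the text and the pattern may have to be read.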
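As an illustration of the O(n + m) bound, here is a minimal Python sketch of Knuth-Morris-Pratt matching. The function name `kmp_search` and the 0-based positions are choices made for this example, not taken from the slides.

```python
def kmp_search(text, pattern):
    """Return the 0-based start positions of every occurrence of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # failure[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of pattern[:i + 1].
    failure = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # Scan the text once; k is the number of pattern characters matched so far.
    positions = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            positions.append(i - m + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return positions


print(kmp_search("acacaaccagtcacactagac", "acac"))  # [0, 12]
```

The text pointer never moves backwards, which is why the scan costs O(n) after the O(m) preprocessing of the pattern.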

Text Searching (take 2)
Text: acacaaccagtcacactagac……
Pattern: acac
Where does the pattern occur in the text?
This time we know the text in advance and can preprocess it.

Can we do better?
Yes. There is a data structure for the text such that, once it is built, a pattern search takes only O(m + occ) time, where occ is the number of times the pattern appears in the text.
Such a data structure is called an index.
Is O(m + occ) time useful? Yes, if the text is very long and is searched many times for different patterns.

Full-Text Index
Deals with creating an index for a text in which every position corresponds to an appearance of at least one pattern (hence "full").
Word-Level Index
The text is a sequence of words; positions within a word do not correspond to an appearance of any pattern.
E.g., Text: Was it a cat I saw? (The pattern "at" has no appearance.)

Suffix Tree: An Optimal Full-Text Index
As mentioned, we can create an index for the text such that pattern searching can be done in O(m + occ) time, where occ is the number of appearances of the pattern.
This time is optimal.
One such index is the suffix tree, introduced by P. Weiner in 1973; E. McCreight gave a more space-efficient construction in 1976.

Suffix and Suffix Tree
Given a string S, a substring of S that ends at the last position is called a suffix of S.
If S consists of n chars, S has exactly n suffixes.
Theorem: If a pattern P appears at position j in S, then P appears at the beginning of the suffix of S that starts at position j.

E.g., S: acacaac#
Suffixes of S:
acacaac# (start at pos 1)
cacaac# (start at pos 2)
acaac# (start at pos 3)
caac# (start at pos 4)
aac# (start at pos 5)
ac# (start at pos 6)
c# (start at pos 7)
# (start at pos 8)
Suppose P = ac is a pattern. Then P appears at pos 1, pos 3 and pos 6 in S, i.e., at the beginning of the suffixes acacaac#, acaac# and ac#.
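The theorem can be checked directly on this example. The following Python sketch (the helper names `suffixes` and `occurrences` are made up for illustration) lists every suffix with its 1-based start position and keeps those that begin with the pattern.

```python
def suffixes(s):
    """Return (1-based start position, suffix) for every suffix of s."""
    return [(i + 1, s[i:]) for i in range(len(s))]


def occurrences(s, p):
    """Positions where p appears in s, found as prefixes of suffixes."""
    return [pos for pos, suffix in suffixes(s) if suffix.startswith(p)]


print(occurrences("acacaac#", "ac"))  # [1, 3, 6]
```

This brute-force check takes O(nm) time; it only illustrates the correspondence between pattern appearances and suffix prefixes that the index exploits.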

Suffix and Suffix Tree (2)
The suffix tree is an edge-labeled compact tree (no degree-1 nodes) with n leaves such that each leaf corresponds to a suffix.
Concatenating the edge labels along the path from the root to a leaf gives the corresponding suffix.
The edge labels leading to the children of a node start with different characters.
Example (next slide)

[Figure: The suffix tree of acacaac#]

Searching with Suffix Tree
To search for P, we match P character by character starting from the root.
If P is matched completely, the leaves under the stopping point are exactly the suffixes that correspond to appearances of P in the text.
We then traverse the subtree under the stopping point to report where P appears. Since the tree has no degree-1 nodes, a subtree with occ leaves has O(occ) nodes.
So, searching is done in O(m + occ) time, where occ is the number of appearances of P.
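The search procedure can be sketched in Python as follows. Note the swap: for brevity this uses an uncompacted suffix trie built naively in O(n^2) time and space, not the compact suffix tree of the slides, so only the matching and reporting logic carries over. The names `build_suffix_trie` and `search` are hypothetical.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts; the key None stores a leaf's 1-based start position."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1
    return root


def search(root, pattern):
    """Match pattern from the root, then collect every leaf below the stopping point."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # pattern does not appear in the text
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)  # a leaf: one appearance of the pattern
            else:
                stack.append(child)
    return sorted(found)  # sorted only to make the output easy to read


trie = build_suffix_trie("acacaac#")
print(search(trie, "ac"))   # [1, 3, 6]
print(search(trie, "aca"))  # [1, 3]
```

In a real suffix tree the unary paths are collapsed into single edges, which is what bounds the traversal below the stopping point by the number of reported positions.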

Is Suffix Tree good?
Yes, because its search time is optimal.
No, because of its space requirement: the space can be much larger than the text.
E.g., Text = human DNA. To store the text, we need 0.8 Gbyte. To store the suffix tree, we need 64 Gbyte!

Something Wrong??
Both the suffix tree and the text have n things, so they both need O(n) space… How come there is such a big difference??
Let us have a better analysis.
Let A be the alphabet (i.e., the set of distinct characters) of a text T.
E.g., in DNA, A = {a,c,g,t}

Something Wrong?? (2)
To store T, we need only n log |A| bits.
But to store the suffix tree, we need about n log n bits, because each of its O(n) pointers into the text takes log n bits.
When n is very large compared to |A|, there is a huge difference.
Question: Is there an index that supports fast searching, but occupies only O( n log |A| ) bits??
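A rough calculation makes the gap concrete. In this Python sketch the text length n = 3.2 billion is an assumption, chosen because it reproduces the 0.8 Gbyte figure for DNA at 2 bits per character; the function names are made up for the example.

```python
import math


def text_bits(n, alphabet_size):
    """Bits needed to store n characters over the given alphabet."""
    return n * math.ceil(math.log2(alphabet_size))


def pointer_bits(n):
    """Bits needed to store n pointers into a text of length n."""
    return n * math.ceil(math.log2(n))


n = 3_200_000_000      # assumed length of the DNA text
GBYTE = 8 * 10**9      # bits per Gbyte

print(text_bits(n, 4) / GBYTE)   # 0.8
print(pointer_bits(n) / GBYTE)   # 12.8
```

Under these assumptions a single array of n pointers already costs 16 times the text. A suffix tree stores several such words per suffix (child pointers, edge labels and so on), which is presumably how a figure like 64 Gbyte arises.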

Burrows-Wheeler Transform
Arrange the suffixes of the text in sorted order; the Burrows-Wheeler Transform is the array storing the character that precedes each suffix, in that order.
Example (next slide)

Text = acacaac#

BWT   Suffix in sorted order
c     #
c     aac#
a     ac#
c     acaac#
#     acacaac#
a     c#
a     caac#
a     cacaac#
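The table can be reproduced with a few lines of Python. This is a naive sketch that sorts the suffixes explicitly; it assumes the text ends with a unique terminator (here #) that sorts before every other character, and the function name `bwt` is chosen for the example.

```python
def bwt(text):
    """Burrows-Wheeler Transform via explicit suffix sorting.

    Assumes text ends with a unique terminator that is smaller than
    every other character (e.g. '#' for lowercase text).
    """
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] is the char preceding suffix i; for i == 0 it wraps to the terminator.
    return "".join(text[i - 1] for i in order)


print(bwt("acacaac#"))  # ccac#aaa
```

Sorting suffixes this way is quadratic or worse in the worst case; it is meant only to show the definition, not an efficient construction.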

BWT is useful
The BWT has been shown to compress more easily than the original text.
Also, given the position in the BWT array where the last character of the text appears, we can get back the original text.
How?

Text = acacaac#

BWT   Suffix in sorted order   Sorted BWT
c     #                        #
c     aac#                     a
a     ac#                      a
c     acaac#                   a
#     acacaac#                 a
a     c#                       c
a     caac#                    c
a     cacaac#                  c
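One way to answer the "How?" is the standard last-to-first mapping between the BWT column and the sorted BWT column: the k-th occurrence of a character in the BWT is the same text character as its k-th occurrence in the sorted column. The Python sketch below assumes the terminator is unique and smallest, so row 1 is the suffix "#" and its BWT entry is the last real character of the text. The name `inverse_bwt` is hypothetical.

```python
def inverse_bwt(b, terminator="#"):
    """Recover the text from its BWT, assuming a unique, smallest terminator."""
    n = len(b)
    # A stable sort pairs the k-th occurrence of each char in b with
    # the k-th occurrence of that char in the sorted column.
    order = sorted(range(n), key=lambda i: b[i])
    lf = [0] * n
    for sorted_row, bwt_row in enumerate(order):
        lf[bwt_row] = sorted_row
    # Row 0 is the suffix consisting of the terminator alone, so b[0]
    # is the character just before it. Walk backwards through the text.
    out = [terminator]
    row = 0
    for _ in range(n - 1):
        out.append(b[row])
        row = lf[row]
    return "".join(reversed(out))


print(inverse_bwt("ccac#aaa"))  # acacaac#
```

The text is produced from its last character to its first, one mapping step per character.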

BWT → Index
Ferragina and Manzini (2000) observe that the BWT can support pattern searching if we store some additional O(n)-bit arrays.
Precisely, let B[1..n] be the BWT. With the additional arrays, for any x and any character, we can count the occurrences of that character in B[1..x] in constant time.
Then, we can count the number of times that a pattern appears in the text in O(m) time. (How?)

Text = acacaac#, Pattern = aca

BWT   Suffix in sorted order   Sorted BWT
c     #                        #
c     aac#                     a
a     ac#                      a
c     acaac#                   a
#     acacaac#                 a
a     c#                       c
a     caac#                    c
a     cacaac#                  c
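The counting step is usually done by backward search: process the pattern from its last character to its first, maintaining the range of sorted suffixes that start with the part matched so far. The Python sketch below is an illustration under simplifying assumptions. It stores a full prefix-count table per character, which costs far more than the O(n)-bit arrays mentioned in the slides, and the names `backward_count`, `C` and `occ` are chosen for the example.

```python
def backward_count(b, pattern):
    """Count appearances of pattern in the text whose BWT is b."""
    n = len(b)
    alphabet = sorted(set(b))
    # C[ch] = number of characters in b that are strictly smaller than ch.
    C, total = {}, 0
    for ch in alphabet:
        C[ch] = total
        total += b.count(ch)
    # occ[ch][x] = number of occurrences of ch in b[:x] (a plain table here).
    occ = {ch: [0] * (n + 1) for ch in alphabet}
    for x, sym in enumerate(b):
        for ch in alphabet:
            occ[ch][x + 1] = occ[ch][x] + (sym == ch)
    # [lo, hi) is the range of sorted suffixes starting with the matched part.
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


print(backward_count("ccac#aaa", "aca"))  # 2
```

For the pattern aca the range narrows from the four suffixes starting with a, to the two starting with ca, to the two starting with aca (acaac# and acacaac#). Each pattern character costs two table lookups, which gives the O(m) counting time.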

BWT → Index (2)
They also show that, by storing another O(n)-bit array, we can report where the pattern appears in O(occ log n) time, where occ is the number of appearances.
So, searching is done in O(m + occ log n) time.
What is the space? O( n log |A| ) bits
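A common way to report positions is to keep the text position of only some sorted suffixes and to walk backwards through the text from an unsampled row until a sampled one is reached. The Python sketch below illustrates that idea; it is not claimed to be the exact structure of the FM-index paper. The sampling rule (every `step`-th text position), the plain count tables, and the names `build_index` and `locate` are assumptions of this example. Sampling roughly every log n positions is what would give the O(occ log n) reporting time.

```python
def build_index(text, step=2):
    """Build the BWT, count tables and sampled positions (naive construction)."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    b = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(b))
    C, total = {}, 0
    for ch in alphabet:
        C[ch] = total
        total += b.count(ch)
    occ = {ch: [0] * (n + 1) for ch in alphabet}
    for x, sym in enumerate(b):
        for ch in alphabet:
            occ[ch][x + 1] = occ[ch][x] + (sym == ch)
    # Keep the text position only for rows whose position is a multiple of step.
    samples = {row: pos for row, pos in enumerate(sa) if pos % step == 0}
    return b, C, occ, samples


def locate(index, pattern):
    """Return the 1-based positions where pattern appears."""
    b, C, occ, samples = index
    lo, hi = 0, len(b)
    for ch in reversed(pattern):          # backward search for the row range
        if ch not in C:
            return []
        lo, hi = C[ch] + occ[ch][lo], C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    positions = []
    for row in range(lo, hi):
        steps = 0
        while row not in samples:         # move to the suffix one position earlier
            ch = b[row]
            row = C[ch] + occ[ch][row]
            steps += 1
        positions.append(samples[row] + steps + 1)
    return sorted(positions)


index = build_index("acacaac#")
print(locate(index, "ac"))   # [1, 3, 6]
```

Text position 0 is always sampled, so every backward walk stops after at most `step - 1` moves.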

Related Work
Further compressing the index: space is now measured in terms of the entropy (or the randomness) of the text.
Supporting texts with large alphabets.
Efficient construction: the challenge is in minimizing working space.
More complex queries and operations: the library problem, the dictionary problem.

Pointers for Further Study
The Pizza & Chili website
The FM-index paper by P. Ferragina and G. Manzini, FOCS 2000
The CSA paper by R. Grossi and J.S. Vitter, STOC 2000
Discuss with me ^_^