1
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006
2
Outline: The Text Searching Problem; What is Full-Text Indexing?; Burrows-Wheeler Transform (BWT); BWT as a Full-Text Index; Related Work
3
Text Searching Text : acacaaccagtcacactagac…… Pattern: acac Where does the pattern occur in the text?
4
How fast can we search? Let n be the length of the text and m be the length of the pattern. We can find all positions where the pattern appears in O(n + m) time (e.g., Knuth-Morris-Pratt, or Boyer-Moore with the Galil rule). Is O(n + m) time good? Yes, because it is optimal in the worst case when the text cannot be preprocessed!
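As an illustration of the O(n + m) bound, here is a minimal sketch of Knuth-Morris-Pratt. The function name `kmp_search` and the 1-based output convention are choices made for this example only, not part of the original slides.

```python
def kmp_search(text, pattern):
    """Return all 1-based start positions of pattern in text in O(n + m) time."""
    m = len(pattern)
    if m == 0:
        return []

    # failure[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of pattern[:i + 1].
    failure = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    # Scan the text once; k is the number of pattern chars matched so far.
    positions = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            positions.append(i - m + 2)  # convert 0-based end to 1-based start
            k = failure[k - 1]
    return positions


print(kmp_search("acacaaccagtcacactagac", "acac"))
```

The text pointer never moves backwards, and each step of the inner `while` loop undoes an earlier match, which is where the linear bound comes from.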
5
Text Searching (take 2) Text: acacaaccagtcacactagac…… Pattern: acac Where does the pattern occur in the text? This time, we know the text in advance and can preprocess it
6
Can we do better? Yes. There is a data structure for the text such that, once it is built, a pattern search takes only O(m + occ) time, where occ = number of times the pattern appears in the text. Such a data structure is called an index. Is O(m + occ) time useful? Yes, if the text is very long and it is searched many times for different patterns
7
Full-Text Index: Deals with creating an index for a text in which each position in the text corresponds to an appearance of at least one pattern (hence "full"). Word-Level Index: The text is a sequence of words, and the positions within a word do not correspond to an appearance of any pattern. E.g., Text: Was it a cat I saw? (The pattern “at” does not have an appearance, because it starts inside a word)
8
Suffix Tree: An Optimal Full-Text Index As mentioned, we can create an index for the text such that pattern searching can be done in O(m + occ) time, where occ is the number of times the pattern appears. This time is optimal. One such index is the Suffix Tree. Introduced by P. Weiner in 1973; a simpler, more space-efficient construction was given by E. McCreight in 1976
9
Suffix and Suffix Tree Given a string S, a substring of S that ends at the last position is called a suffix of S If S consists of n chars, S has exactly n suffixes Theorem: If a pattern P appears at position j in S, P appears at the beginning of the suffix of S that starts at position j
10
E.g., S: acacaac#
Suffixes of S:
acacaac# (start at pos 1)
cacaac# (start at pos 2)
acaac# (start at pos 3)
caac# (start at pos 4)
aac# (start at pos 5)
ac# (start at pos 6)
c# (start at pos 7)
# (start at pos 8)
Suppose P = ac is a pattern. Then, P appears at pos 1, pos 3 and pos 6 in S, i.e., at the beginning of the suffixes acacaac#, acaac# and ac#.
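The theorem can be checked directly on this example. The sketch below lists every suffix with its 1-based start position and keeps those that begin with P. The helper names are illustrative, and the quadratic running time is only acceptable for a toy input.

```python
def suffixes(s):
    """Return (1-based start position, suffix) for every suffix of s."""
    return [(j + 1, s[j:]) for j in range(len(s))]


def occurrences_via_suffixes(s, p):
    """Positions j such that p is a prefix of the suffix of s starting at j."""
    return [pos for pos, suffix in suffixes(s) if suffix.startswith(p)]


print(occurrences_via_suffixes("acacaac#", "ac"))
```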
11
Suffix and Suffix Tree (2) The suffix tree is an edge-labeled compact tree (no internal node has exactly one child) with n leaves such that each leaf corresponds to a suffix. Concatenating the edge labels along the path from the root to a leaf gives the corresponding suffix. The edge labels leading to the children of a node start with different characters. Example (next slide)
12
[Figure: The Suffix Tree of acacaac#. Each of the 8 leaves is labeled with the start position (1 to 8) of its suffix.]
13
Searching with Suffix Tree To search for P, we match P starting from the root. If we can match P successfully in the tree, the leaves under the stop point are exactly the suffixes that correspond to an appearance of P in the text. Then, we traverse the tree under the stop point to report where P appears. So, searching is done in O(m + occ) time, where occ is the number of appearances of P
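The search procedure can be sketched with a simplified stand-in for the suffix tree. The code below builds an uncompacted suffix trie (one node per character, O(n^2) nodes in the worst case), not the compact suffix tree of the slides, so it does not achieve the stated space or time bounds. It only illustrates "match P from the root, then report the leaves below the stop point". The nested-dict representation and the `LEAF` key are assumptions of this example.

```python
LEAF = "leaf"  # hypothetical marker key; cannot clash with 1-character edge keys


def build_suffix_trie(text):
    """Insert every suffix of text (which must end with a unique terminator)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1  # 1-based start position of this suffix
    return root


def search(root, pattern):
    """Return the sorted 1-based positions where pattern appears."""
    # Step 1: match the pattern from the root.
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Step 2: report every leaf under the stop point.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == LEAF:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("acacaac#")
print(search(trie, "ac"))
```

In a real suffix tree, chains of single-child nodes are merged into one edge, so the subtree below the stop point has size proportional to the number of leaves in it.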
14
Is Suffix Tree good? Yes, because optimal search time No, because of space requirement… The space can be much larger than the text E.g., Text = DNA of Human To store the text, we need 0.8 Gbyte To store the suffix tree, we need 64 Gbyte!
15
Something Wrong?? Both the suffix tree and the text have n things, so they both need O(n) space… How come there is a big difference?? Let us have a better analysis. Let A be the alphabet (i.e., the set of distinct characters) of a text T. E.g., in DNA, A = {a,c,g,t}
16
Something Wrong?? (2) To store T, we need only n log |A| bits But to store the suffix tree, we will need n log n bits When n is very large compared to |A|, there is a huge difference Question: Is there an index that supports fast searching, but occupies O( n log |A| ) bits only??
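A back-of-the-envelope calculation makes the gap concrete. The value of n below is an assumption, chosen so that a 4-letter text at 2 bits per character takes the 0.8 Gbyte quoted earlier. The helper `bits_needed` is illustrative.

```python
def bits_needed(num_values):
    """Bits needed to distinguish num_values different values."""
    return max(1, (num_values - 1).bit_length())


n = 3_200_000_000   # assumed text length (gives 0.8 Gbyte at 2 bits per char)
alphabet_size = 4   # A = {a, c, g, t}

text_bytes = n * bits_needed(alphabet_size) // 8   # n log |A| bits
pointer_array_bytes = n * bits_needed(n) // 8      # n log n bits

print(text_bytes / 1e9)           # Gbyte for the text itself
print(pointer_array_bytes / 1e9)  # Gbyte for ONE array of n text positions
```

A single array of n text positions is already 16 times larger than the text here. A pointer-based suffix tree stores several such words per character, which is consistent with the tens of Gbyte quoted for human DNA.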
17
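Burrows-Wheeler Transform Arrange the suffixes of the text in sorted order. The Burrows-Wheeler Transform is the array that stores, for each suffix in that order, the character that precedes it in the text (the first suffix, which has no preceding character, gets the last character of the text). Example (next slide)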
18
Text = acacaac#

BWT  Suffix in sorted order
c    #
c    aac#
a    ac#
c    acaac#
#    acacaac#
a    c#
a    caac#
a    cacaac#
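The table can be reproduced with a few lines of code. This is a minimal sketch that sorts the suffixes naively, which is far too slow and memory-hungry for long texts. It assumes the text ends with a unique terminator (here `#`) that sorts before every other character.

```python
def bwt(text):
    """Burrows-Wheeler Transform of a text ending in a unique, smallest terminator."""
    # Start positions of the suffixes, in sorted order of the suffixes.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # For each sorted suffix, take the preceding character; for the suffix
    # starting at index 0, text[-1] wraps around to the terminator.
    return "".join(text[i - 1] for i in order)


print(bwt("acacaac#"))
```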
19
BWT is useful The BWT has been shown to be easier to compress than the original text. Also, given the position in the BWT array where the last character of the text appears, we can get back the original text. How?
20
Text = acacaac#

BWT  Sorted BWT  Suffix in sorted order
c    #           #
c    a           aac#
a    a           ac#
c    a           acaac#
#    a           acacaac#
a    c           c#
a    c           caac#
a    c           cacaac#
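One way to answer the "How?" is the standard last-to-first mapping: the k-th occurrence of a character in the BWT and the k-th occurrence of the same character in the sorted BWT are the same character of the text. The sketch below uses that fact to rebuild the text backwards. It assumes a unique terminator that sorts first, so the walk can start at row 0 instead of being told where the last character sits. The function and variable names are illustrative.

```python
def inverse_bwt(b, terminator="#"):
    """Recover the text from its BWT b (unique terminator that sorts first)."""
    # first[c] = row where the run of c starts in the sorted BWT.
    counts = {}
    for ch in b:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # lf[i] = row of the sorted BWT holding the same text character as b[i].
    seen, lf = {}, []
    for ch in b:
        lf.append(first[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    # Row 0 is the suffix made of the terminator alone, so b[0] is the
    # character just before it. Following lf walks the text right to left.
    out, row = [terminator], 0
    for _ in range(len(b) - 1):
        out.append(b[row])
        row = lf[row]
    return "".join(reversed(out))


print(inverse_bwt("ccac#aaa"))
```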
21
BWT Index Ferragina and Manzini (2000) observed that we can use the BWT to support pattern searching by storing some additional O(n)-bit arrays. Precisely, let B[1..n] be the BWT. With the additional arrays, for any x, we can count the number of occurrences of any char in B[1..x] in constant time. Then, we can count the number of times that a pattern appears in the text in O(m) time (How?)
22
Text = acacaac#, Pattern = aca

BWT  Sorted BWT  Suffix in sorted order
c    #           #
c    a           aac#
a    a           ac#
c    a           acaac#
#    a           acacaac#
a    c           c#
a    c           caac#
a    c           cacaac#
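The counting step is usually done by backward search: process the pattern from its last character to its first, maintaining the range of rows whose suffixes start with the part of the pattern seen so far. The sketch below follows that idea. Its `rank` helper scans the BWT and is therefore not constant time. In the actual index that query is answered in O(1) with the additional arrays, which this example does not build.

```python
def count_occurrences(b, pattern):
    """Count appearances of pattern in the text whose BWT is b."""
    # first[c] = row where the run of c starts in the sorted BWT.
    first, total = {}, 0
    for ch in sorted(set(b)):
        first[ch] = total
        total += b.count(ch)

    def rank(ch, x):
        """Number of occurrences of ch in the first x entries of b."""
        return b[:x].count(ch)  # O(x) here, O(1) in the real index

    lo, hi = 0, len(b)  # half-open row range matching the empty pattern
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + rank(ch, lo)
        hi = first[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


print(count_occurrences("ccac#aaa", "aca"))
```

For Pattern = aca on the table above, the row range (0-based, half-open) narrows from [1, 5) for "a" to [6, 8) for "ca" to [3, 5) for "aca", i.e. the suffixes acaac# and acacaac#.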
23
BWT Index They also show that, by storing another O(n)-bit array, we can report where the pattern appears in O(occ log n) time, where occ is the number of appearances. So, searching is done in O(m + occ log n) time. What is the space? O( n log |A| ) bits
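One common way to support reporting is to keep the text position for only a sample of the rows and to walk the last-to-first mapping from an unsampled row until a sampled one is reached. The sketch below illustrates that idea and is not necessarily the exact scheme of the FM-index paper. The names `build_index`, `locate`, and `sample_rate` are assumptions of this example, and `rank` is again a linear scan rather than a constant-time query.

```python
def build_index(text, sample_rate=4):
    """Build the BWT, the 'first' table, and sampled text positions."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # 0-based starts
    b = "".join(text[i - 1] for i in sa)
    first, total = {}, 0
    for ch in sorted(set(b)):
        first[ch] = total
        total += b.count(ch)
    # Keep the position only for rows whose suffix starts at a multiple
    # of sample_rate (position 0 is always kept).
    samples = {row: pos for row, pos in enumerate(sa) if pos % sample_rate == 0}
    return b, first, samples


def locate(index, pattern):
    """Return the sorted 1-based positions where pattern appears."""
    b, first, samples = index

    def rank(ch, x):
        return b[:x].count(ch)  # O(x) here, O(1) in the real index

    # Backward search for the row range of the pattern.
    lo, hi = 0, len(b)
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo, hi = first[ch] + rank(ch, lo), first[ch] + rank(ch, hi)
        if lo >= hi:
            return []

    positions = []
    for row in range(lo, hi):
        steps = 0
        while row not in samples:
            # One LF step moves to the suffix starting one position earlier.
            row = first[b[row]] + rank(b[row], row)
            steps += 1
        positions.append(samples[row] + steps + 1)  # back to 1-based
    return sorted(positions)


index = build_index("acacaac#")
print(locate(index, "aca"))
```

Each reported position costs fewer than `sample_rate` LF steps. With a sample rate near log n, about n / log n positions of log n bits each are stored, which is O(n) bits in total.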
24
Related Work Further compress the index: space is now measured in terms of the entropy (or the randomness) of a text. Support text with large alphabet. Efficient construction: the challenge is in minimizing working space. More complex queries and operations: Library problem, Dictionary problem
25
Pointers for Further Study The Pizza & Chili website http://pizzachili.di.unipi.it The FM-index paper by P. Ferragina and G. Manzini, FOCS 2000 The CSA paper by R. Grossi and J.S. Vitter, STOC 2000 Discuss with me ^_^ (email: wkhon@)