Published by Geoffrey Summers. Modified over 9 years ago.
1
A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes
Meng He, J. Ian Munro, and S. Srinivasa Rao
University of Waterloo
2
The Problem
  Initial problem: text searching. Finding occurrences of a pattern string in a large (static) document.
  Solution: text indexing. Trading space for time.
  New problem: succinct text indexes. Reducing the space cost.
3
Pattern Searching
  Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T.
  Three types of queries:
    Existential queries: Does P occur in T?
    Cardinality queries: How many times does P occur in T?
    Listing queries: Where does P occur in T?
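The three query types can be illustrated with a naive scan that uses no index at all. This is only a baseline sketch for fixing the definitions, not anything from the talk; the function names are hypothetical and positions are reported 1-based to match T[1..n].

```python
def listing(text: str, pattern: str) -> list[int]:
    """Listing query: every 1-based start position of pattern in text."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def cardinality(text: str, pattern: str) -> int:
    """Cardinality query: number of (possibly overlapping) occurrences."""
    return len(listing(text, pattern))


def existential(text: str, pattern: str) -> bool:
    """Existential query: does pattern occur at least once?"""
    return cardinality(text, pattern) > 0


if __name__ == "__main__":
    t = "abaaabbaaabaabb#"
    print(existential(t, "aba"), cardinality(t, "aba"), listing(t, "aba"))
```

The scan costs O(nm) time per query, which is the cost a text index is meant to avoid.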
4
Text Indexing
  Inverted files
    Word index
    Need to store the text as well as the index
  Suffix trees
    Efficient full-text index
    4n lg n to 6n lg n bits!
  Suffix arrays
    n lg n bits in basic form, but 3n lg n bits (with LCP data)
5
Applications
  Text databases: electronic encyclopedias, dictionaries, books, etc.
  Web search engines: Google, Altavista, etc.
  Bioinformatics: gene databases
  More…
6
Related Work
  Compressed Suffix Arrays
    Grossi & Vitter 2000
    Sadakane 2000
    Grossi, Gupta & Vitter 2003
  FM-index
    Ferragina & Manzini 2000 & 2001
7
Assumptions & Notation
  Alphabet: Σ = {a, b}
  Text: T[1..n], with T[n] = #, where a < # < b
  Pattern: P[1..m]
8
Permutations and Suffix Arrays
  An observation
    Permutations of 1..n: n!
    Suffix arrays of binary texts of length n: 2^(n-1)
    Not all permutations are suffix arrays
  An example
    A suffix array: 4, 7, 5, 1, 8, 3, 6, 2
      Text: abbaaba#
    A permutation: 4, 7, 1, 5, 8, 2, 3, 6
      Not a suffix array of any binary text
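The example can be reproduced with a direct sort of the suffixes under the order a < # < b. This is a minimal sketch for small inputs, not an efficient construction; `ORDER`, `suffix_array`, and `count_suffix_arrays` are names chosen here for illustration.

```python
from itertools import product

# Character order assumed on the notation slide: a < # < b.
ORDER = {"a": 0, "#": 1, "b": 2}


def suffix_array(text: str) -> list[int]:
    """1-based start positions of the suffixes of text, in sorted order."""
    keys = [ORDER[c] for c in text]
    # Comparing the key lists compares the suffixes under a < # < b.
    return sorted(range(1, len(text) + 1), key=lambda i: keys[i - 1:])


def count_suffix_arrays(n: int) -> int:
    """Distinct suffix arrays over all binary texts of length n ending in #."""
    texts = ("".join(p) + "#" for p in product("ab", repeat=n - 1))
    return len({tuple(suffix_array(t)) for t in texts})


if __name__ == "__main__":
    print(suffix_array("abbaaba#"))  # [4, 7, 5, 1, 8, 3, 6, 2]
    print(count_suffix_arrays(5))    # 16, i.e. 2^(5-1), far fewer than 5! = 120
```

Each text yields a different suffix array, which is why the count equals the number of texts, 2^(n-1).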
9
Two Features of Suffix Arrays
  Suffix array: 4 7 5 1 8 3 6 2
  Another permutation: 4 7 1 5 8 2 3 6
  Features illustrated: ascending-to-max, non-nesting
10
A Categorization Theorem
  A permutation is a suffix array iff it is:
    Ascending-to-max
    Non-nesting
  An immediate application: checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words in memory.
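The slides do not spell out the two properties, so the sketch below does not implement the theorem's O(n) test. It uses a different, simpler technique: reconstruct the only text the permutation could belong to, then rebuild that text's suffix array and compare. Under a < # < b, every entry before n in a suffix array must start with a and every entry after n must start with b, so the candidate text is forced. All function names are hypothetical, and the sort-based construction is far slower than O(n).

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b, as on the notation slide


def suffix_array(text: str) -> list[int]:
    """Naive 1-based suffix array under the order a < # < b."""
    keys = [ORDER[c] for c in text]
    return sorted(range(1, len(text) + 1), key=lambda i: keys[i - 1:])


def candidate_text(perm: list[int]) -> str:
    """The only binary text (ending in #) that perm could be the suffix array of."""
    n = len(perm)
    split = perm.index(n)  # rank of the suffix "#"
    text = [""] * n
    for rank, pos in enumerate(perm):
        text[pos - 1] = "a" if rank < split else "#" if rank == split else "b"
    return "".join(text)


def is_suffix_array(perm: list[int]) -> bool:
    """Reconstruct-and-compare check; not the O(n) check from the theorem."""
    if not perm or sorted(perm) != list(range(1, len(perm) + 1)):
        return False
    return suffix_array(candidate_text(perm)) == list(perm)


if __name__ == "__main__":
    print(is_suffix_array([4, 7, 5, 1, 8, 3, 6, 2]))  # True
    print(is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6]))  # False
```

Both permutations from the earlier example map to the same candidate text, abbaaba#, and only the first one matches that text's suffix array.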
11
Application: Space Efficient Suffix Array
  Text: abaaabbaaabaabb#
  SA:   8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
  Ba:   0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1
  Bb:   1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0
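The slide gives the bit vectors without defining them. The values shown are consistent with the following reading, which is an inference from the example rather than a definition taken from the talk: Ba[i] = 1 when the character preceding suffix SA[i] is a, and Bb[i] = 1 when it is b (both are 0 for the suffix starting at position 1). A short sketch under that assumption, with a hypothetical function name:

```python
def predecessor_bits(text: str, sa: list[int]) -> tuple[list[int], list[int]]:
    """Bit vectors marking, for each suffix in SA order, whether the
    character just before it in the text is 'a' (ba) or 'b' (bb).
    Assumed definition, inferred from the slide's example values."""
    ba, bb = [], []
    for pos in sa:  # pos is a 1-based start position
        prev = text[pos - 2] if pos > 1 else None  # no predecessor at pos 1
        ba.append(int(prev == "a"))
        bb.append(int(prev == "b"))
    return ba, bb


if __name__ == "__main__":
    text = "abaaabbaaabaabb#"
    sa = [8, 3, 9, 4, 12, 1, 10, 5, 13, 16, 7, 2, 11, 15, 6, 14]
    ba, bb = predecessor_bits(text, sa)
    print("Ba:", *ba)
    print("Bb:", *bb)
```

Under this reading the two vectors play the role of the Burrows-Wheeler transform of the text, split into one bit vector per character.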
12
Basic Searching Algorithm: Answering Cardinality Queries
  Basic idea: backward search
    Start from the end of the pattern P
    For i = m, m-1, …, 1, compute the interval [s, e] of SA whose corresponding suffixes are prefixed with P[i..m]
  Example:
    SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
    P = aba
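Backward search can be sketched as follows. This is the standard formulation using rank counts over per-character bit vectors, assumed here to be built as "preceding character is a / is b" per suffix; the talk's actual structures (n + o(n) bits with constant-time rank) are not reproduced, and plain prefix-sum arrays stand in for them. Intervals are 0-based and half-open, and all names are hypothetical.

```python
from itertools import accumulate

ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b


def suffix_array(text: str) -> list[int]:
    """Naive 1-based suffix array under the order a < # < b."""
    keys = [ORDER[c] for c in text]
    return sorted(range(1, len(text) + 1), key=lambda i: keys[i - 1:])


def build_index(text: str):
    """Return (rank, first, n); rank[c][k] counts suffixes among the first k
    in SA order that are preceded by character c."""
    bits = {"a": [], "b": []}
    for pos in suffix_array(text):
        prev = text[pos - 2] if pos > 1 else None
        for c in bits:
            bits[c].append(int(prev == c))
    rank = {c: [0] + list(accumulate(v)) for c, v in bits.items()}
    # Suffixes starting with a occupy the first block of SA, then "#", then b.
    first = {"a": 0, "b": text.count("a") + 1}
    return rank, first, len(text)


def count_occurrences(index, pattern: str) -> int:
    """Cardinality query by backward search, O(m) rank lookups."""
    rank, first, n = index
    lo, hi = 0, n  # half-open interval of SA covering all suffixes
    for c in reversed(pattern):  # i = m, m-1, ..., 1
        if c not in rank:
            return 0
        lo = first[c] + rank[c][lo]
        hi = first[c] + rank[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    idx = build_index("abaaabbaaabaabb#")
    print(count_occurrences(idx, "aba"))  # 2
```

For P = aba on the example text, the intervals in the slide's 1-based [s, e] notation are [1, 9] after reading a, [11, 13] after ba, and [6, 7] after aba, which are the SA entries 1 and 10.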
13
More Algorithms and Tradeoffs
  Answering listing queries
  Speeding up the reporting of occurrences of long patterns
  Self-indexing
  Time-space tradeoff: multi-level structure
14
Putting it all together
  Three index structures:

  Index     Space (bits)   Pattern searching
  Index 1   n + o(n)       O(m) (existential & cardinality queries only)
  Index 2   2n + o(n)      O(m + occ) if m = Ω(lg^(1+ε) n); O(m + occ lg n) otherwise
  Index 3   O(n)           O(m + occ) if m = Ω(lg^(1+ε) n); O(m + occ lg^λ n) otherwise
15
Conclusion
  Summary
    A theorem that characterizes a permutation as the suffix array of a binary string
    An efficient algorithm checking whether a permutation is a suffix array
    Three space efficient text indexing methods
16
Conclusions (Continued)
  Related subsequent work
    Generalization to larger alphabets
  Open problem
    An O(n)-bit text index supporting searching in O(m + occ) time
17
Thank You.