Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]
Why are string data interesting?
They are ubiquitous:
• Digital libraries and product catalogues
• Electronic white and yellow pages
• Specialized information sources (e.g. genomic or patent dbs)
• Web page repositories
• Private information dbs
• ...
String collections are growing at a staggering rate:
• ...more than 10Tb of textual data on the web
• ...more than 15Gb of base pairs in the genomic dbs
The need for an “index”
Brute-force scanning is not a viable approach:
–Fast single searches
–Multiple simple searches for complex queries
The index is a basic block of any IR system. An IR system also encompasses:
–IR models
–Ranking algorithms
–Query languages and operations
–...
We will concentrate only on “index design”!!
Two families of indexes
Types of data:
• Linguistic or tokenizable text
• Raw sequence of characters or bytes (DNA sequences, audio-video files, executables)
Types of query:
• Word-based query (exact word, word prefix or suffix, phrase)
• Character-based query (arbitrary substring, complex matches)
Two indexing approaches:
• Word-based indexes, here a concept of “word” must be devised!
  » Inverted files, Signature files or Bitmaps.
• Full-text indexes, no constraint on text and queries!
  » Suffix Array, Suffix tree, Hybrid indexes, or String B-tree.
Word-based: Inverted files (or lists)
Doc #1: Now is the time for all good men to come to the aid of their country
Doc #2: It was a dark and stormy night in the country manor. The time was past midnight
Query answering is a two-phase process (Vocabulary, then Postings):
• midnight AND time → 2
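The two-phase query answering on an inverted file can be sketched as follows. This is a minimal illustration, not the slide author's implementation: the function names `build_inverted_index` and `and_query`, and the lowercase alphanumeric tokenization rule, are assumptions made for the example.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word (the vocabulary) to the sorted doc ids containing it (its postings)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed notion of "word": maximal runs of letters/digits, lowercased.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

def and_query(index, *words):
    """Phase 1: look each word up in the vocabulary. Phase 2: intersect the postings."""
    lists = [set(index.get(w.lower(), ())) for w in words]
    return sorted(set.intersection(*lists)) if lists else []

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}
index = build_inverted_index(docs)
result = and_query(index, "midnight", "time")
```

Here `result` holds the ids of the documents containing both query words; a word that is missing from the vocabulary yields an empty answer without touching any postings.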
Full-text indexes
Their need is pervasive:
–Raw data: DNA sequences, Audio-Video files,...
–Linguistic texts: data mining, statistics,...
–Vocabulary for Inverted Lists
–XPath queries on XML documents
–Intrusion detection, Anti-viruses,...
Classes of indexes:
• Suffix array, Suffix tree (variants)
• Multi-level indexes: Short Pat array
• B-tree based data structures: Prefix B-tree, String B-tree
Terminology
• An alphabet, denoted ∑, is a set of (ordered) characters.
• A string S is an array of characters, S[1,n] = S[1] S[2] … S[n].
• S[i,j] = S[i] … S[j] is a substring of S.
• S[1,j] is a prefix of S; S[i,n] is a suffix of S.
• ∑* denotes all strings over alphabet ∑.
• Lexicographic order:
–Example: For ∑ = { a, b, c, …, z }, where a < b < c < … < z, the lexicographic order is the same as in a dictionary.
• SUF(T) = Sorted set of suffixes of T
Indexed string matching problem
• Let T be a set of K strings in ∑*, where N is the total length of all strings in T.
• String matching query on T: Given a pattern P, find all occurrences of P in the strings in T.
• Static problem: Store T in a data structure that supports string matching queries. Such a data structure is called a full-text index.
• Dynamic version: Also supports insertions and deletions of strings in the full-text index.
A simple but crucial observation
Pattern P[1,p] occurs at position i of T[1,n] iff P[1,p] is a prefix of the suffix T[i,n].
Occurrences of P in T = All suffixes of T having P as a prefix
[Figure: T = This is a visual example, with the occurrences of P marked at positions 3, 6, 12]
The string matching problem can thus be transformed into a “prefix matching problem” over all the suffixes.
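The observation can be checked directly with a few lines of code. This is only a brute-force sketch of the equivalence, not an index; the function name is hypothetical, and the pattern "is" is an assumption chosen because it reproduces the positions 3, 6, 12 shown for the example text.

```python
def occurrences_via_suffixes(text, pattern):
    """Return the 1-based positions i such that pattern is a prefix of the suffix text[i..n]."""
    n = len(text)
    return [i for i in range(1, n + 1) if text[i - 1:].startswith(pattern)]

T = "This is a visual example"
# Assumed pattern: "is" (not stated explicitly in the extracted slide text).
positions = occurrences_via_suffixes(T, "is")
```

Every reported position is the start of a suffix that has the pattern as a prefix, which is exactly the property that the suffix-based indexes in the following slides exploit.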
Suffix Array [Manber-Myers, 90]
Suffix array: an array of pointers to all the suffixes in the text, in their lexicographic order.
T = mississippi#
SUF(T): #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
• Storing SUF(T) explicitly takes Θ(N^2) space
• Suffix Array SA: array of ints (suffix pointers), 4N bytes
• Text T: N bytes
• Total: 5N bytes of space occupancy
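A suffix array for the running example can be produced with a naive sort of the suffix start positions. This is a sketch for illustration only: it is not the Manber-Myers construction and takes far more than O(N log N) time in the worst case, because each comparison may scan whole suffixes. Positions are 1-based to match the slides, and the code relies on '#' sorting before the letters in Python's character order.

```python
def build_suffix_array(text):
    """Naive construction: sort the 1-based start positions by the suffix they point to."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

T = "mississippi#"
SA = build_suffix_array(T)
# Dereferencing the pointers gives back SUF(T) in lexicographic order.
suffixes = [T[i - 1:] for i in SA]
```

Only the integer array `SA` and the text itself need to be stored; the list `suffixes` is materialized here just to show what the pointers denote.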
Two key properties [Manber-Myers, 90]
Prop 1. All suffixes in SUF(T) having prefix P are contiguous.
Prop 2. Their starting position in SUF(T) is the lexicographic position of P.
[Figure: T = mississippi#, P = si; the suffixes sippi# and sissippi# are adjacent in SUF(T)]
Searching in Suffix Array [Manber-Myers, 90]
Indirect binary search on SA: O(p log_2 N) time
T = mississippi#, P = si
• Each binary-search step costs 2 memory accesses (one to SA, one to T)
• First step shown: P is larger than the probed suffix
• Next step shown: P is smaller than the probed suffix
Listing the occurrences [Manber-Myers, 90]
T = mississippi#, P = si, occ = 2
Brute-force comparison: O(p x occ) time
[Figure labels: sippi and sissippi (P is a prefix); issippi (P is not a prefix)]
Suffix Array search:
• O(p (log_2 N + occ)) time
• O(log_2 N + occ) in practice
External memory, simple disk paging for SA:
• O((p/B) (log_2 N + occ)) I/Os
• (slide annotation: log_B N + occ/B)
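The indirect binary search can be sketched as two binary searches that delimit the contiguous SA range of suffixes having P as a prefix. This is a minimal sketch under assumptions: the helper names are hypothetical, the suffix array is built naively, and each step compares P against the first p characters of the probed suffix, which gives the O(p log_2 N) bound without any of the LCP-based refinements.

```python
def build_suffix_array(text):
    """Naive 1-based suffix array, used here only to set up the example."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def sa_search(text, sa, pattern):
    """Return the SA entries (1-based text positions) of all suffixes prefixed by pattern."""
    p = len(pattern)

    def probe(k):
        # Two accesses per step: one to SA, one to the text it points into.
        start = sa[k] - 1
        return text[start:start + p]

    lo, hi = 0, len(sa)
    while lo < hi:                      # leftmost suffix whose p-prefix is >= pattern
        mid = (lo + hi) // 2
        if probe(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                      # leftmost suffix whose p-prefix is > pattern
        mid = (lo + hi) // 2
        if probe(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sa[first:lo]

T = "mississippi#"
SA = build_suffix_array(T)
occ_si = sa_search(T, SA, "si")
```

The occurrences come out in suffix-array order rather than text order; sorting them afterwards is optional and costs O(occ log occ).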
Output-sensitive retrieval
Lcp[1,n-1] stores the length of the longest common prefix between suffixes adjacent in SA.
T = mississippi#, P = si, occ = 2
SUF(T): #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
• Compare against P
• Scan Lcp until Lcp[i] < p
• Suffix Array search: O((p/B) log_2 N + (occ/B)) I/Os
• 9N bytes of space
• (+) incremental search
• base B: tricky!!
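Output-sensitive retrieval with the Lcp array can be sketched as follows: locate the first suffix prefixed by P, then walk the Lcp array while the adjacent suffixes still share at least p characters, so no further pattern comparisons are needed. The function names are hypothetical and both arrays are built naively here; arrays are 0-based in the code while the stored text positions stay 1-based.

```python
def build_suffix_array(text):
    """Naive 1-based suffix array, used here only to set up the example."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def build_lcp_array(text, sa):
    """lcp_arr[i] = length of the longest common prefix of the suffixes SA[i] and SA[i+1]."""
    def lcp(a, b):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k
    return [lcp(text[sa[i] - 1:], text[sa[i + 1] - 1:]) for i in range(len(sa) - 1)]

def report_occurrences(text, sa, lcp_arr, pattern):
    """Find the first match by binary search, then scan Lcp until Lcp[i] < p."""
    p = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid] - 1:sa[mid] - 1 + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(sa) or not text.startswith(pattern, sa[lo] - 1):
        return []
    occ, i = [sa[lo]], lo
    while i < len(lcp_arr) and lcp_arr[i] >= p:   # no character of P is re-read here
        occ.append(sa[i + 1])
        i += 1
    return occ

T = "mississippi#"
SA = build_suffix_array(T)
LCP = build_lcp_array(T, SA)
occ_si = report_occurrences(T, SA, LCP, "si")
```

Because the scan touches consecutive Lcp and SA entries, it maps naturally onto sequential page reads, which is where the occ/B term comes from.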
Incremental search (case 1)
Incremental search using the LCP array: no rescanning of pattern chars
[Figure: SA range i..j with probe q; Min Lcp[i,q-1] obtained by Range Minima; suffixes smaller than P on one side, larger than P on the other; bounds known by induction]
The cost: O(1) memory accesses

Incremental search (case 2)
Incremental search using the LCP array: no rescanning of pattern chars
[Figure: SA range i..j with probe q; Min Lcp[i,q-1] obtained by Range Minima; bounds known by induction]
The cost: O(1) memory accesses

Incremental search (case 3)
Incremental search using the LCP array: no rescanning of pattern chars
[Figure: SA range i..j with probe q; Min Lcp[i,q-1] obtained by Range Minima; L further characters are compared, ending with suffix char > pattern char or suffix char < pattern char]
The cost: O(L) char cmp
Suffix Array search:
• O(log_2 N) binary steps
• O(p) total char-cmp for routing
• O((p/B) + log_2 N + (occ/B)) I/Os
• base B: more tricky (we will not consider this)
Summary: suffix array [Manber-Myers, 90]
• Space: O(N)
• String matching queries: O(p + log N + occ)
• Can be constructed in O(N log N) time.
• Static.
Hybrid Index: Short Pat array
Exploit internal memory: sample the suffix array and copy something in memory
• Mark one suffix every s entries of SA and copy a prefix of each marked suffix into memory M (the Supra-index); SA stays on disk
• Search for P: binary search inside SA + Supra-index
• O((p/B) + log_2 (N/s) + (occ/B)) I/Os
Parameter s depends on M and influences both performance and space!! (only a heuristic)
Tries
• Trie (name from the word “retrieval”): a data structure for storing a set of strings
–Let’s assume that all strings end with “$” (not in ∑)
Set of strings: { bear, bid, bulk, bull, sun, sunday }
[Figure: trie of the set, one character per edge: b-e-a-r-$, b-i-d-$, b-u-l-k-$, b-u-l-l-$, s-u-n-$, s-u-n-d-a-y-$]
Tries
• Properties of a trie:
–A multi-way tree.
–Each internal node has from 1 to |∑|+1 children.
–Each edge of the tree is labeled with a character.
–Each leaf node corresponds to a stored string, which is the concatenation of the characters on the path from the root to this node.
Analysis of the Trie
Given k strings of total length N:
• Size:
–O(N) in the worst-case
• Search, insertion, and deletion (string of length m):
–O(m) (assuming ∑ is constant)
• Observation:
–Having chains of one-child nodes is wasteful
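A plain trie with these costs can be sketched in a few lines. This is a minimal sketch, assuming dictionary-based child maps and hypothetical class names `TrieNode` and `Trie`; every string gets the terminator "$" appended, which is assumed not to occur in the strings themselves.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child node

class Trie:
    END = "$"                # terminator, assumed not to occur in stored strings

    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1  # counts the root

    def insert(self, s):
        """O(m) for a string of length m: one step per character."""
        node = self.root
        for ch in s + self.END:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]

    def contains(self, s):
        """O(m): follow the path spelled by s and its terminator."""
        node = self.root
        for ch in s + self.END:
            node = node.children.get(ch)
            if node is None:
                return False
        return True

words = ["bear", "bid", "bulk", "bull", "sun", "sunday"]
trie = Trie()
for w in words:
    trie.insert(w)
total_length = sum(len(w) + 1 for w in words)  # N, terminators included
```

The node count never exceeds N + 1, which is the O(N) size bound; the chains of one-child nodes (for example d-a-y-$ under "sun") are what compact tries remove.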
Compact Tries
• Compact Trie:
–Replace a chain of one-child nodes with an edge labeled with a string
–Each non-leaf node (except root) has at least two children
[Figure: the trie of { bear, bid, bulk, bull, sun, sunday } and its compact version with edge labels b, sun, ear$, id$, ul, k$, l$, day$, $]
Compact Tries
• Implementation:
–Strings are external to the structure in one array T, edges are labeled with indices in the array (from, to)
–Improves the space to O(k)
T: bear$bid$bulk$bull$sun$sunday$
[Figure: the edge labels b, sun, ear$, id$, ul, k$, l$, day$, $ are stored as (1,1), (20,22), (2,5), (7,9), (11,12), (13,14), (18,19), (27,30), (5,5)]
Patricia Tries
• Patricia trie:
–a compact trie where each edge’s label (from, to) is replaced by (T[from], to – from + 1)
T: bear$bid$bulk$bull$sun$sunday$
[Figure: the edge labels (1,1), (20,22), (2,5), (7,9), (11,12), (13,14), (18,19), (27,30), (5,5) become (b,1), (s,3), (e,4), (i,3), (u,2), (k,2), (l,2), (d,4), ($,1)]
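The two edge encodings can be checked against the example array. This sketch assumes the text array is the concatenation bear$bid$bulk$bull$sun$sunday$ with 1-based, inclusive (from, to) indices, as the index pairs in the figure suggest; the helper names are hypothetical.

```python
T = "bear$bid$bulk$bull$sun$sunday$"

# (from, to) index pairs as they appear on the compact-trie edges, 1-based and inclusive.
edge_ranges = [(1, 1), (20, 22), (2, 5), (7, 9), (11, 12),
               (13, 14), (18, 19), (27, 30), (5, 5)]

def edge_label(text, start, end):
    """Compact trie: the edge label is the substring text[start..end]."""
    return text[start - 1:end]

def patricia_label(text, start, end):
    """Patricia trie: keep only the first character and the label length."""
    return (text[start - 1], end - start + 1)

compact_labels = [edge_label(T, a, b) for a, b in edge_ranges]
patricia_labels = [patricia_label(T, a, b) for a, b in edge_ranges]
```

Both encodings use constant space per edge; the Patricia form additionally lets a search branch without reading T, at the price of having to verify the skipped characters afterwards.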
Suffix Trees [McCreight, 76]
• Suffix tree: a compact trie (or similar structure) of all suffixes of the text
–A Patricia trie of suffixes is sometimes called a Pat tree
[Figure: suffix tree of carrara$ with edge labels a, r, $, carrara$, rara$, a$, ra$, and the corresponding Pat tree labels (a,1), (r,1), ($,1), (c,8), (r,5), (a,2), (r,5), (a,1), (r,3), ($,1)]
Search in suffix trees
T = abababbc#, P = ba
[Figure: suffix tree of T; the search for P follows a path from the root]
• Search is a path traversal: O(p) time, and O(occ) time to list the occurrences
• O(N) space, but large in practice: ~15N
What about ST in external memory?
–Unbalanced tree topology
–Updates
–Packing?! Θ(p) I/Os for the traversal, Θ(occ) I/Os for the occurrences??
CPAT tree: ~5N on average; no (p/B), possibly no (occ/B), mainly static and space costly
Summary: compact trie
K strings of total length N:
• Space: O(K)
• String matching queries: O(p + occ) time
• Can be constructed in O(N) time.
• Update(S): O(|S|) time

Summary: suffix tree [McCreight, 76]
For a string of length n:
• Space: O(n)
• String matching queries: O(p + occ)
• Can be constructed in O(n) time.
• Static.
The String B-tree ( An I/O-efficient full-text index !!) [Ferragina-Grossi, 95]
The prologue
We are left with many open issues:
–Suffix Array: updates
–Hybrid: heuristic tuning of the performance
–Suffix tree: difficult packing and Θ(p) I/Os
B-tree is ubiquitous in large-scale applications:
–Atomic keys: integers, reals,...
–Prefix B-tree: bounded length keys (≤ 255 chars)
Suffix trees + B-trees? → String B-tree [Ferragina-Grossi, 95]
–Index unbounded length keys
–Good worst-case I/O-bounds in search and update
Some considerations
Strings have arbitrary length:
–A disk page cannot be guaranteed to store Θ(B) strings
–M may be unable to store even one single string
String storage:
–Pointers allow fitting Θ(B) strings per disk page
–String comparison needs disk access and may be expensive
String pointer organizations seen so far:
–Suffix array: simple but static and not optimal
–Patricia trie: sophisticated and much more efficient (optimal?)
Recall the problem, given a string collection:
–Search( P[1,p] ): retrieve all occurrences of P in the collection’s strings
–Update( S[1,t] ): insert or delete a text S from the collection
1º step: B-tree on string pointers
[Figure: strings stored on disk (AATCAGCGAATGCTGCTT, CTGTTGATGA), a B-tree over pointers to them; P = AT]
• O(log_B N) levels, O((p/B) log_2 B) I/Os per node
• Search(P): O((p/B) log_2 N) I/Os + O(occ/B) I/Os
• It is dynamic!! O(s (s/B) log_2 N) I/Os
2º step: The Patricia trie
Strings on disk: AGAAGA, AGAAGG, AGAC, GCGCAGA, GCGCAGG, GCGCGGA, GCGCGGGA
[Figure: compact trie over the strings, edges labeled by triples such as (1; 1,3), (4; 1,4), (6; 5,6), (5; 5,6), (2; 1,2), (3; 4,4), (1; 6,6), (2; 6,6), (4; 7,7), (5; 7,7), (7; 7,8), (6; 7,7)]

2º step: The Patricia trie
[Figure: Patricia trie over the same strings, keeping only branching characters and string depths; labels: mismatch, P’s position, max LCP with P]
Space of PT: O(k), not O(N)
Two-phase search, P = GCAC:
• First phase: no string access
• Second phase: O(p/B) I/Os
Just one string is checked!!
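The two-phase search can be sketched in memory as follows. This is a simplified illustration, not the String B-tree node layout: the class and function names are hypothetical, the strings are those read off the figure, each is terminated by "$" (assumed absent from the pattern), and the sketch stops at computing the longest common prefix rather than going on to locate P's lexicographic position.

```python
class Node:
    def __init__(self, depth, children=None, string_id=None):
        self.depth = depth              # length of the prefix shared by the whole subtree
        self.children = children or {}  # branching character -> child
        self.string_id = string_id      # set for leaves only

def lcp_len(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k

def build_patricia(strings, ids=None):
    """strings: sorted, distinct, '$'-terminated. Nodes keep no string data, only depths."""
    if ids is None:
        ids = list(range(len(strings)))
    if len(ids) == 1:
        return Node(len(strings[ids[0]]), string_id=ids[0])
    depth = lcp_len(strings[ids[0]], strings[ids[-1]])  # prefix common to the sorted range
    groups = {}
    for i in ids:
        groups.setdefault(strings[i][depth], []).append(i)
    return Node(depth, {c: build_patricia(strings, g) for c, g in groups.items()})

def blind_search(root, strings, pattern):
    # Phase 1: descend using branching characters only, with no string access.
    node = root
    while node.string_id is None:
        child = None
        if node.depth < len(pattern):
            child = node.children.get(pattern[node.depth])
        if child is None:
            child = next(iter(node.children.values()))  # any leaf below will do
        node = child
    # Phase 2: a single string is read and compared with the pattern.
    return node.string_id, lcp_len(strings[node.string_id], pattern)

strings = sorted(s + "$" for s in
                 ["AGAAGA", "AGAAGG", "AGAC", "GCGCAGA", "GCGCAGG", "GCGCGGA", "GCGCGGGA"])
root = build_patricia(strings)
leaf_id, max_lcp = blind_search(root, strings, "GCAC")
```

The leaf reached in the first phase may not be P's neighbour in sorted order, but it shares with P a common prefix as long as any stored string does, so the single comparison of the second phase yields the mismatch position needed to finish the search.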
3º step: B-tree + Patricia tree
[Figure: B-tree over the string pointers with a Patricia trie (PT) in each node; strings on disk; P = AT]
• O(log_B N) levels, O(p/B) I/Os per node
• Search(P): O((p/B) log_B N) I/Os + O(occ/B) I/Os
• Insert(S): O(s (s/B) log_B N) I/Os
4º step: Incremental Search (first case)
[Figure: PT at level i and PT at level i+1, down to the leaf level; Max_lcp(i) marks the prefix of P matched at level i; labels: adjacent]
Search(P): O(log_B N) I/Os just to go to the leaf level

4º step: Incremental Search (second case, inductive step)
[Figure: PT at level i and PT at level i+1; the first Max_lcp(i) characters of P are skipped when computing Max_lcp(i+1)]
• No rescanning
• i-th step: O((lcp_{i+1} - lcp_i)/B + 1) I/Os
• Search(P): O(p/B + log_B N) I/Os + O(occ/B) I/Os
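The overall search bound follows from the per-level cost by a telescoping sum. In the sketch below, H is an assumed name for the number of B-tree levels and lcp_i abbreviates Max_lcp(i); the derivation uses only that the matched prefix never shrinks from one level to the next and never exceeds p.

```latex
\sum_{i=0}^{H-1} O\!\left(\frac{\mathrm{lcp}_{i+1}-\mathrm{lcp}_{i}}{B}+1\right)
  \;=\; O\!\left(\frac{\mathrm{lcp}_{H}-\mathrm{lcp}_{0}}{B}+H\right)
  \;=\; O\!\left(\frac{p}{B}+\log_B N\right),
\qquad\text{since } 0 \le \mathrm{lcp}_{0} \le \mathrm{lcp}_{H} \le p
\text{ and } H = O(\log_B N).
```

Without the skip, each level would rescan P from the start and pay O(p/B) I/Os per level, giving the O((p/B) log_B N) bound of the previous step instead.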
Summary: String B-tree
String B-tree performance: [Ferragina-Grossi, 95]
–Search(P) takes O(p/B + log_B N + occ/B) I/Os
–Update(S) takes O( s log_B N ) I/Os
–Space is Θ(N/B) disk pages
Using the String B-tree in internal memory:
–Search(P) takes O(p + log_2 N + occ) time
–Update(S) takes O( s log_2 N ) time
–Space is Θ(N) bytes
It is a sort of dynamic suffix array
Summary
• Indexed string matching problem
–Word-based
–Full-text
• Internal memory data structures
–Suffix array
–Suffix tree
• External memory data structures
–Patricia trie and Pat tree
–Short Pat array
–String B-tree