Presentation is loading. Please wait.

Presentation is loading. Please wait.

Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Similar presentations


Presentation on theme: "Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]"— Presentation transcript:

1 Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

2 Why string data are interesting ? They are ubiquitous: l Digital libraries and product catalogues l Electronic white and yellow pages l Specialized information sources (e.g. Genomic or Patent dbs) l Web pages repositories l Private information dbs l... String collections are growing at a staggering rate: l...more than 10Tb of textual data in the web l...more than 15Gb of base pairs in the genomic dbs

3 The need for an “index” Brute-force scanning is not a viable approach: –Fast single searches –Multiple simple searches for complex queries The index is a basic block of any IR system. An IR system also encompasses: –IR models –Ranking algorithms –Query languages and operations –... We will concentrate only on “index design” !!

4 Two families of indexes Types of data Linguistic or tokenizable text Raw sequence of characters or bytes Word-based query Character-based query Types of query Two indexing approaches : Word-based indexes, here a concept of “word” must be devised ! » Inverted files, Signature files or Bitmaps. Full-text indexes, no constraint on text and queries ! » Suffix Array, Suffix tree, Hybrid indexes, or String B-tree. DNA sequences Audio-video files Executables Arbitrary substring Complex matches Exact word Word prefix or suffix Phrase

5 Word-based: Inverted files (or lists) Now is the time for all good men to come to the aid of their country Doc #1 It was a dark and stormy night in the country manor. The time was past midnight Doc #2  Query answering is a two-phase process: midnight AND time VocabularyPostings 2

6 Full-text indexes Their need is pervasive: –Raw data: DNA sequences, Audio-Video files,... –Linguistic texts: data mining, statistics,... –Vocabulary for Inverted Lists –Xpath queries on XML documents –Intrusion detection, Anti-viruses,... Classes of indexes:  Suffix array, Suffix tree (variants)  Multi-level indexes: Short Pat array  B-tree based data structures: Prefix B-tree, String B-tree

7 Terminology l An alphabet, denoted ∑ is a set of (ordered) characters. l A string S is an array of characters, S[1,n] = S[1] S[2] … S[n]. l S[i,j] = S[i] … S[j] is a substring of S. l S[1,j] is a prefix of S; S[i,n] is a suffix of S. l ∑* denotes all strings over alphabet ∑. l Lexicographic order: –Example: For ∑ = { a, b, c, …, z }, where a < b < c < … < z, the lexicographic order is the same as in a dictionary. SUF(T) = Sorted set of suffixes of T

8 Indexed string matching problem l Let T be a set of K strings in ∑*, where N is the total length of all strings in T. l String matching query on T: Given a pattern P find all occurrences of P in the strings in T. l Static problem: Store T in a data structure that supports string matching queries. Such a data structure is called a full-text index. l Dynamic version: Supports also insertions and deletions of strings in the full-text index.

9 A simple but crucial observation Pattern P[1,p] occurs at position i of T[1,n] iff P[1,p] is a prefix of the suffix T[i,n] T P i T[i,n] Occurrences of P in T = All suffixes of T having P as a prefix T = This is a visual example This is a visual example 3,6,12 Can transform the string matching problem to a “prefix matching problem” over all the suffixes.

10 Suffix Array [Manber-Myers, 90] Suffix array: an array of pointers to all the suffixes in the text in their lexicographic order. T = mississippi# # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# SUF(T) Suffix Array SA: array of ints, 4N bytes Text T: N bytes  5N bytes of space occupancy  (N 2 ) space SA 12 11 8 5 2 1 10 9 7 4 6 3 T = mississippi# suffix pointer 5

11 Two key properties [Manber-Myers, 90] Prop 1. All suffixes in SUF(T) having prefix P are contiguous. Prop 2. Starting position is the lexicographic one of P. P=si T = mississippi# # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# SUF(T) SA 12 11 8 5 2 1 10 9 7 4 6 3 T = mississippi#

12 Searching in Suffix Array [Manber-Myers, 90] Indirected binary search on SA: O(p log 2 N) time T = mississippi# SA 12 11 8 5 2 1 10 9 7 4 6 3 si P is larger 2 accesses for binary step

13 Searching in Suffix Array [Manber-Myers, 90] Indirected binary search on SA: O(p log 2 N) time T = mississippi# SA 12 11 8 5 2 1 10 9 7 4 6 3 si P is smaller

14 Listing the occurrences [Manber-Myers, 90] Brute-force comparison: O(p x occ) time T = mississippi# 4 6 7 SA 12 11 8 5 2 1 10 9 7 4 6 3 si occ=2 Suffix Array search O (p (log 2 N + occ)) time O (log 2 N + occ) in practice External memory Simple disk paging for SA O ((p/B) (log 2 N + occ)) I/Os issippi P is not a prefix P is a prefix sissippi P is a prefix sippi log B N + occ/B 12 11 8 5 2 1 10 9 7 4 6 3 12 11 8 5 2 1 10 9 7 4 6 3

15 SA 12 11 8 5 2 1 10 9 7 4 6 3 Lcp 0014001021300140010213 Lcp[1,n-1] stores the longest-common-prefix between suffixes adjacent in SA Output-sensitive retrieval T = mississippi# 4 6 7 # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# SUF(T) P=si Suffix Array search O ((p/B) log 2 N + (occ/B)) I/Os 9 N bytes of space Scan Lcp until Lcp[i] < p + : incremental search base B : tricky !! Compare against P 0014001021300140010213 12 11 8 5 2 1 10 9 7 4 6 3 12 11 8 5 2 1 10 9 7 4 6 3 0014001021300140010213 occ=2

16 q Incremental search (case 1) Incremental search using the LCP array: no rescanning of pattern chars SA Range Minima i j The cost: O (1) memory accesses P Min Lcp[i,q-1] > P’s known induct. < P’s > P’s

17 q Incremental search (case 2) Incremental search using the LCP array: no rescanning of pattern chars SA i j Min Lcp[i,q-1] The cost: O (1) memory accesses Range Minima known induct. < P’s > P’s P

18 Incremental search (case 3) Incremental search using the LCP array: no rescanning of pattern chars SA i j q Min Lcp[i,q-1] Suffix char > Pattern char Suffix char < Pattern char Suffix Array search O(log 2 N) binary steps O(p) total char-cmp for routing  O((p/B) + log 2 N + (occ/B)) I/Os base B : more tricky (we will not consider this) L The cost: O( L ) char cmp Range Minima < P’s > P’s P

19 Summary: suffix array [Manber-Myers, 90] l Space: O(N) l String matching queries: O(p + log N + occ) l Can be constructed O(N log N) time. l Static.

20 SA Hybrid Index: Short Pat array Exploit internal memory : sample the suffix array and copy something in memory M Disk P binary-search inside SA + Supra-index O((p/B) + log 2 (N/s) + (occ/B)) I/Os  Parameter s depends on M and influences both performance and space !! (only a huristic) s Copy a prefix of marked suffixes

21 Tries l Trie (name from the word “retrieval”): a data structure for storing a set of strings –Let’s assume that all strings end with “$” (not in  ) b s e a r $ i d $ u l k $ $ l u n d a y $ $ Set of strings: { bear, bid, bulk, bull, sun, sunday }

22 Tries l Properties of a trie: –A multi-way tree. –Each node has from 1 to  1 children. –Each edge of the tree is labeled with a character. –Each leaf node corresponds to the stored string, which is a concatenation of characters on a path from the root to this node.

23 Analysis of the Trie Given k strings of total length N: l Size: –O(N) in the worst-case l Search, insertion, and deletion (string of length m): –O(m) (assuming ∑ is constant) l Observation: –Having chains of one-child nodes is wasteful

24 Compact Tries l Compact Trie: –Replace a chain of one-child nodes with an edge labeled with a string –Each non-leaf node (except root) has at least two children b s e a r $ i d $ u l k $ $ l u n d a y $ $ b sun ear$ id$ ul k$ l$ day$ $

25 Compact Tries l Implementation: –Strings are external to the structure in one array, edges are labeled with indices in the array (from, to) –Improves the space to O(k) b sun ear$ id$ ul k$ l$ day$ $ 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 bear$bid$bulk$bull$sun$sunday$ T:T: 1,1 20,22 2,5 7,9 11,12 13,14 18,19 27,30 5,5

26 Patricia Tries l Patricia trie: –a compact trie where each edge’s label (from, to) is replaced by (T[from], to – from + 1) 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 bear bid bulk bull sun sunday T:T: b,1 s,3 e,4 i,3 u,2 k,2 l,2 d,4 _,1 1,1 20,22 2,5 7,9 11,12 13,14 18,19 27,30 5,5

27 Suffix Trees [McCreight, 76] l Suffix tree – a compact trie (or similar structure) of all suffixes of the text –Patricia trie of suffixes is sometimes called a Pat tree carrara$ 1 2 3 4 5 6 7 8 a r r $ carrara$ rara$ a$ rara$ a $ ra$ 2 5 7 1 6 4 3 (a,1)(r,1) ($,1) (c,8) (r,5) (a,2) 2 5 7 1 6 4 3 (r,5) (a,1) (r,3) ($,1)

28 Search in suffix trees T = abababbc# 1 3 5 7 9 3 2 4 c c c b a b b b b b a 2 4 1 35 a b c a b b c a b c b P = ba c b 0 1 (5,8) O(N) space O(p) time What about ST in external memory ? –Unbalanced tree topology –Updates CPAT tree ~ 5N on average No (p/B), possibly no (occ/B), mainly static and space costly 768 c Packing ?!  (p) I/Os  (occ) I/Os??  Search is a path traversal 2 4 and O(occ) time - Large space ~ 15N b a

29 Summary: compact trie K strings of total length N: l Space: O(K) l String matching queries: O(p + occ) time l Can be constructed O(N) time. l Update(S): O(|S|) time

30 Summary: suffix tree [McCreight, 76] For a string of length n: l Space: O(n) l String matching queries: O(p + occ) l Can be constructed O(n) time. l Static.

31 The String B-tree ( An I/O-efficient full-text index !!) [Ferragina-Grossi, 95]

32 The prologue We are left with many open issues: –Suffix Array: updates –Hybrid: Heuristic tuning of the performance –Suffix tree: difficult packing and  (p) I/Os B-tree is ubiquitous in large-scale applications: –Atomic keys: integers, reals,... –Prefix B-tree: bounded length keys (  255 chars) Suffix trees + B-trees ?String B-tree [Ferragina-Grossi, 95]  Index unbounded length keys  Good worst-case I/O-bounds in search and update

33 Some considerations Strings have arbitrary length: –Disk page cannot ensure the storage of  (B) strings –M may be unable to store even one single string String storage: –Pointers allow to fit  (B) strings per disk page –String comparison needs disk access and may be expensive String pointers organization seen so far:  Suffix array: simple but static and not optimal  Patricia trie: sophisticated and much efficient (optimal ?) Recall the problem:  is a string collection Search( P[1,p] ): retrieve all occurrences of P in  ’s strings Update( S[1,t] ): insert or delete a text S from 

34 1º step: B-tree on string pointers AATCAGCGAATGCTGCTT CTGTTGATGA 1 3 5 7 9 11 13 15 17 19 20 22 24 26 28 30 Disk 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23 29 2 26 13 20 25 6 18 3 14 21 23 29 13 20 18 3 23 P = AT O((p/B) log 2 B) I/Os O(log B N) levels Search(P) O ((p/B) log 2 N) I/Os O (occ/B) I/Os It is dynamic !! O(s (s/B) log 2 N) I/Os + B

35 2º step: The Patricia trie 1 AGAAGAAGAAGA AGAAGGAGAAGG AGACAGAC GCGCAGAGCGCAGA GCGCAGGGCGCAGG GCGCGGAGCGCGGA GCGCGGGAGCGCGGGA 6 5 3 6 4 0 Disk A G A 456723 (1; 1,3)(4; 1,4) (6; 5,6) (5; 5,6) (2; 1,2) (3; 4,4) (1; 6,6) (2; 6,6) (4; 7,7) (5; 7,7) (7; 7,8) (6; 7,7) G

36 2º step: The Patricia trie 1 AGAAGAAGAAGA AGAAGGAGAAGG AGACAGAC GCGCAGAGCGCAGA GCGCAGGGCGCAGG GCGCGGAGCGCGGA GCGCGGGAGCGCGGGA 6 5 3 Disk A 0 G 45623 A A A A 4 GA 6 7 GG G C Space PT O(k), not O(N) Two-phase search: P = GCACGCAC G 1 G 57 A Search(P): First phase: no string access Second phase: O(p/B) I/Os mismatch P’s position max LCP with P Just one string is checked !! G C

37 3º step: B-tree + Patricia tree AATCAGCGAATGCTGCTT CTGTTGATGA 1 3 5 7 9 11 13 15 17 19 20 22 24 26 28 30 Disk 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23 29 2 26 13 20 25 6 18 3 14 21 23 29 13 20 18 3 23 PT Search(P) O((p/B) log B N) I/Os O(occ/B) I/Os Insert(S) O( s (s/B) log B N) I/Os O(p/B) I/Os O(log B N) levels P = AT +

38 PT Level i Search(P) O(log B N) I/Os just to go to the leaf level 4º step: Incremental Search Max_lcp(i) P Level i+1 PT First case Leaf Level adjacent

39 Level i+1 PT Inductive Step PT Level i Search(P) O(p/B + log B N) I/Os O(occ/B) I/Os 4º step: Incremental Search Max_lcp(i) P Max_lcp(i+1) skip Max_lcp(i) i-th step: O(( lcp i+1 – lcp i )/B + 1) I/Os No rescanning Second case

40 Summary: String B-tree String B-tree performance: [Ferragina-Grossi, 95] –Search(P) takes O(p/B + log B N + occ/B) I/Os –Update(S) takes O( s log B N ) I/Os –Space is  (N/B) disk pages Using the String B-tree in internal memory: –Search(P) takes O(p + log 2 N + occ) time –Update(S) takes O( s log 2 N ) time –Space is  (N) bytes  It is a sort of dynamic suffix array

41 Summary l Indexed string matching problem –Word-based –Full-text l Internal memory data structures –Suffix array –Suffix tree l External memory data structures –Patricia trie and Pat tree –Short Pat array –String B-tree


Download ppt "Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]"

Similar presentations


Ads by Google