Download presentation
Presentation is loading. Please wait.
1
Tries
2
(Compacted) Trie 1 2 2 0 4 5 6 7 2 3 y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo systile syzygetic syzygial syzygy szaibelyite szczecin szomo [Fredkin, CACM 1960] ( 2 ; 3,5) Performance: Search ≈ O(|P|) time Space ≈ O(K + N)
3
(Compacted) Trie 1 2 2 0 4 5 6 7 2 3 y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo systile syzygetic syzygial syzygy szaibelyite szczecin szomo [Fredkin, CACM 1960] ( 2 ; 3,5)... But in practice… Search: random memory accesses Space: len + pointers + strings Performance: Search ≈ O(|P|) time Space ≈ O(K + N)
4
….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo…. systile szaielyite CT on a sample 2-level indexing Disk Internal Memory 2 limitations: Sampling rate ≈ lengths of sampled strings Trade-off ≈ speed vs space (because of bucket size) 2 advantages: Search ≈ typically 1 I/O Space ≈ Front-coding over buckets
5
An old idea: Patricia Trie 1 2 2 0 4 5 6 7 2 3 y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo [Morrison, J.ACM 1968]
6
A new search ….systile syzygetic syzygial syzygy szaibelyite szczecin szomo…. 2 2 0 y s 1 z s z 5 e i y a c o Search(P): Phase 1: tree navigation Phase 2: Compute LCP Phase 3: tree navigation Three-phase search: P = syzyyea 0 1 25 g < y P’s position Only 1 string is checked Trie Space ≈ #strings, NOT their length [Ferragina-Grossi, J.ACM 1999]
7
….Locality Preserving Front Coding…. PT on all strings 2-level indexing Disk Internal Memory A limitation is n < M Typically 1 I/O What about n > M
8
The String B-tree 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23 29 2 26 13 20 25 6 18 3 14 21 23 29 13 20 18 3 23 PT Search(P) O((p/B) log B n) I/Os O(occ/B) I/Os It is dynamic... 1 string checked : O(p/B) O(log B n) levels + Lexicographic position of P [Ferragina-Grossi, J.ACM 1999] Knuth, vol 3°, pag. 489: “elegant”
9
GA 1 6 5 3 6 4 0 456723 AGAGCGC GG AG C A G A G A On Front-Coding… AGAAGA 5 G 3 C 0 GCGCAGA 6 G 4 GGA 6 GA Knuth In-order visit + Path covering Front Coding 3 0 Compacted Trie = FC + tree structure What about other traversals ? FC +... is searchable
10
Why pre-order visit In Front-coding the Lcp information is encoded many times GA 1 6 5 3 6 4 0 456723 AGAGCGC GG AG C A G A G A 3 0 AGAAGA 1 G 3 C 4 GCGCAGA 1 G 3 GGA 1 GA Rear Coding
11
Text Indexing
12
What do we mean by “Indexing” ? Word-based indexes, here a notion of “word” must be devised ! » Inverted lists, Signature files, Bitmaps. Full-text indexes, no constraint on text and queries ! » Suffix Array, Suffix tree, String B-tree,...
13
Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes of T T = mississippi mississippi 4,7 P = si T[i,N] iff P is a prefix of the i-th suffix of T (ie. T[i,N]) T P i Pattern P occurs at position i of T From substring search To prefix search Reduction
14
The Suffix Tree T# = mississippi# 2 4 6 8 10 12 118 521109 74 63 0 4 # i ppi# ssi mississippi# 1 p i# pi# 2 1 s i ppi# ssippi# 3 si ssippi# ppi# 1 # ssippi# Label = Space: #nodes Search pattern P Maximal repeated substring = node
15
The Suffix Array Prop 1. All suffixes in SUF(T) with prefix P are contiguous. P=si T = mississippi# # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# SUF(T) Suffix Array SA: N log 2 N) bits Text T: N chars In practice, a total of 5N bytes SA 12 11 8 5 2 1 10 9 7 4 6 3 T = mississippi# suffix pointer 5 Prop 2. Starting position is the lexicographic one of P.
16
Searching a pattern Indirected binary search on SA: O(p) time per suffix cmp T = mississippi# SA 12 11 8 5 2 1 10 9 7 4 6 3 P = si P is larger 2 accesses per step
17
Searching a pattern Indirected binary search on SA: O(p) time per suffix cmp T = mississippi# SA 12 11 8 5 2 1 10 9 7 4 6 3 P = si P is smaller Suffix Array search O (log 2 N) binary-search steps Each step takes O(p) char cmp overall, O (p log 2 N) time + [Manber-Myers, ’90]
18
Locating the occurrences T = mississippi # 4 7 SA 12 11 8 5 2 1 10 9 7 4 6 3 si# occ=2 12 11 8 5 2 1 10 9 7 4 6 3 12 11 8 5 2 1 10 9 7 4 6 3 si$ Suffix Array search O (p + log 2 N + occ) time where # < < $ sissippi sippi
19
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA Text mining How long is the common prefix between T[i,...] and T[j,...] ? Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j Lcp 0114001021301140010213 # i# ippi# issippi# ississippi# mississippi pi# ppi# sippi# sissippi# ssippi# ssissippi# 12 11 8 5 2 1 10 9 7 4 6 3 SA Lcp(7,3) = 1 = min{2,1,3}
20
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA Text mining Does it exist a repeated substring of length ≥ L ? Maximal Lcp of a suffix is with its adjacent Search for Lcp[i] ≥ L Lcp 0114001021301140010213 # i# ippi# issippi# ississippi# mississippi pi# ppi# sippi# sissippi# ssippi# ssissippi# 12 11 8 5 2 1 10 9 7 4 6 3 SA
21
Lcp 0114001021301140010213 Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA Text mining Does exist a substring of length ≥ L occurring ≥ C times ? Exist ≥ C equal substrings of length ≥ L chars Exist ≥ C suffixes sharing a prefix of ≥ L chars These suffixes may be not contiguous, but... Their “block” has a common prefix of ≥ L chars Search for Lcp[i,i+C-2] whose entries are ≥ L # i# ippi# issippi# ississippi# mississippi pi# ppi# sippi# sissippi# ssippi# ssissippi# 12 11 8 5 2 1 10 9 7 4 6 3 SA L = 1, C = 4
22
How to construct SA from T ? # i# ippi# issippi# ississippi# mississippi pi# ppi# sippi# sissippi# ssippi# ssissippi# 12 11 8 5 2 1 10 9 7 4 6 3 SA Elegant but inefficient Obvious inefficiencies: (n 2 log n) time in the worst-case (n log n) cache misses or I/O faults Input: T = mississippi#
23
The skew algorithm The key problem: Compare efficiently two suffixes Brute-force = (n) time per cmp, (n 2 log n) total In order to sort the suffixes of S 1. Divide the suffixes of S in two groups S 0,2 = suffixes starting at positions 0 mod 3 or 2 mod 3 S 1 = suffixes starting at positions 1 mod 3 2a. Sort recursively S 0,2 (they are 2n/3) 2b. Sort S 1 : suffix(3i+1) = S[3i+1] suff(3i+2) 3. Merge the sorted S 0,2 with the sorted S 1 T(n) = O(split) + T(2n/3) + O(|S 1 |) + O(merge) = O(n)
24
Sort recursively S 0,2 We turn this problem into the SA-construction of a shorter string of length (2/3)n. S=AAT GTG AGA TGA $$$ RadixSort all triplets that start at positions 0,2 mod 3 T = {ATG, TGT, TGA, GAG, GAT, ATG, GA$, A$$} Sort(T) = (A$$, ATG, GA$, GAG, GAT, TGA, TGT) Assign lexicographic names (log n bits) A$$=1, ATG=2, GA$=3,… Build s 0,2 and encode it: ATG TGA GAT GA$ TGT GAG ATG A$$ 2 6 5 3 7 4 2 1 1 2 3 4 5 6 7 8 9 10 11 12 13
25
Sort recursively S 0,2 Given S=AAT GTG AGA TGA $$$ We have built: s 0,2 = ATG TGA GAT GA$ TGT GAG ATG A$$ enc(s 0,2 ) = 2 6 5 3 7 4 2 1 It is SA 0,2 = [12, 9, 2, 11, 6, 8, 5, 3] 1 2 3 4 5 6 7 8 9 10 11 12 13 A suffix of s 0,2 A suffix of enc(s 0,2 ) SA(enc(s 0,2 )) gives SA 0,2 Lex-order is preserved
26
Sort S 1 We turn this problem into the sort of pairs S=AAT GTG AGA TGA $$$ Key observation: suff(1) = = suff(7) = = SA 0,2 = [12, 9, 2, 11, 6, 8, 5, 3] Suffix of S 1 1 2 3 4 5 6 7 8 9 10 11 12 13 SA 1 = [1, 7, 4, 10]
27
The merge step To merge suffix s i in S 0,2 with suffix s k in S 1, note that If (i mod 3) = 2 s i+1 and s k+1 belong to S 0,2 If (i mod 3) = 0 s i+2 and s k+2 belong to S 0,2 their order can be derived from SA 0,2 in O(1) time SA 1 SA 0,2 SA T(n) = T(2n/3) + O(n) + O(merge) = O(n) S=AAT GTG AGA TGA $$$ 1 2 3 4 5 6 7 8 9 10 11 12 13
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.