Tries
(Compacted) Trie y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo systile syzygetic syzygial syzygy szaibelyite szczecin szomo [Fredkin, CACM 1960] ( 2 ; 3,5) Performance: Search ≈ O(|P|) time Space ≈ O(K + N)
(Compacted) Trie y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo systile syzygetic syzygial syzygy szaibelyite szczecin szomo [Fredkin, CACM 1960] ( 2 ; 3,5)... But in practice… Search: random memory accesses Space: len + pointers + strings Performance: Search ≈ O(|P|) time Space ≈ O(K + N)
….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo…. systile szaielyite CT on a sample 2-level indexing Disk Internal Memory 2 limitations: Sampling rate ≈ lengths of sampled strings Trade-off ≈ speed vs space (because of bucket size) 2 advantages: Search ≈ typically 1 I/O Space ≈ Front-coding over buckets
An old idea: Patricia Trie y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo [Morrison, J.ACM 1968]
A new search ….systile syzygetic syzygial syzygy szaibelyite szczecin szomo… y s 1 z s z 5 e i y a c o Search(P): Phase 1: tree navigation Phase 2: Compute LCP Phase 3: tree navigation Three-phase search: P = syzyyea g < y P’s position Only 1 string is checked Trie Space ≈ #strings, NOT their length [Ferragina-Grossi, J.ACM 1999]
….Locality Preserving Front Coding…. PT on all strings 2-level indexing Disk Internal Memory A limitation is n < M Typically 1 I/O What about n > M
The String B-tree PT Search(P) O((p/B) log B n) I/Os O(occ/B) I/Os It is dynamic... 1 string checked : O(p/B) O(log B n) levels + Lexicographic position of P [Ferragina-Grossi, J.ACM 1999] Knuth, vol 3°, pag. 489: “elegant”
GA AGAGCGC GG AG C A G A G A On Front-Coding… AGAAGA 5 G 3 C 0 GCGCAGA 6 G 4 GGA 6 GA Knuth In-order visit + Path covering Front Coding 3 0 Compacted Trie = FC + tree structure What about other traversals ? FC +... is searchable
Why pre-order visit In Front-coding the Lcp information is encoded many times GA AGAGCGC GG AG C A G A G A 3 0 AGAAGA 1 G 3 C 4 GCGCAGA 1 G 3 GGA 1 GA Rear Coding
Text Indexing
What do we mean by “Indexing” ? Word-based indexes, here a notion of “word” must be devised ! » Inverted lists, Signature files, Bitmaps. Full-text indexes, no constraint on text and queries ! » Suffix Array, Suffix tree, String B-tree,...
Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes of T T = mississippi mississippi 4,7 P = si T[i,N] iff P is a prefix of the i-th suffix of T (ie. T[i,N]) T P i Pattern P occurs at position i of T From substring search To prefix search Reduction
The Suffix Tree T# = mississippi# # i ppi# ssi mississippi# 1 p i# pi# 2 1 s i ppi# ssippi# 3 si ssippi# ppi# 1 # ssippi# Label = Space: #nodes Search pattern P Maximal repeated substring = node
The Suffix Array Prop 1. All suffixes in SUF(T) with prefix P are contiguous. P=si T = mississippi# # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# SUF(T) Suffix Array SA: N log 2 N) bits Text T: N chars In practice, a total of 5N bytes SA T = mississippi# suffix pointer 5 Prop 2. Starting position is the lexicographic one of P.
Searching a pattern Indirected binary search on SA: O(p) time per suffix cmp T = mississippi# SA P = si P is larger 2 accesses per step
Searching a pattern Indirected binary search on SA: O(p) time per suffix cmp T = mississippi# SA P = si P is smaller Suffix Array search O (log 2 N) binary-search steps Each step takes O(p) char cmp overall, O (p log 2 N) time + [Manber-Myers, ’90]
Locating the occurrences T = mississippi # 4 7 SA si# occ= si$ Suffix Array search O (p + log 2 N + occ) time where # < < $ sissippi sippi
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA Text mining How long is the common prefix between T[i,...] and T[j,...] ? Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j Lcp # i# ippi# issippi# ississippi# mississippi pi# ppi# sippi# sissippi# ssippi# ssissippi# SA Lcp(7,3) = 1 = min{2,1,3}
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA Text mining Does it exist a repeated substring of length ≥ L ? Maximal Lcp of a suffix is with its adjacent Search for Lcp[i] ≥ L Lcp # i# ippi# issippi# ississippi# mississippi pi# ppi# sippi# sissippi# ssippi# ssissippi# SA
Lcp Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA Text mining Does exist a substring of length ≥ L occurring ≥ C times ? Exist ≥ C equal substrings of length ≥ L chars Exist ≥ C suffixes sharing a prefix of ≥ L chars These suffixes may be not contiguous, but... Their “block” has a common prefix of ≥ L chars Search for Lcp[i,i+C-2] whose entries are ≥ L # i# ippi# issippi# ississippi# mississippi pi# ppi# sippi# sissippi# ssippi# ssissippi# SA L = 1, C = 4
How to construct SA from T ? # i# ippi# issippi# ississippi# mississippi pi# ppi# sippi# sissippi# ssippi# ssissippi# SA Elegant but inefficient Obvious inefficiencies: (n 2 log n) time in the worst-case (n log n) cache misses or I/O faults Input: T = mississippi#
The skew algorithm The key problem: Compare efficiently two suffixes Brute-force = (n) time per cmp, (n 2 log n) total In order to sort the suffixes of S 1. Divide the suffixes of S in two groups S 0,2 = suffixes starting at positions 0 mod 3 or 2 mod 3 S 1 = suffixes starting at positions 1 mod 3 2a. Sort recursively S 0,2 (they are 2n/3) 2b. Sort S 1 : suffix(3i+1) = S[3i+1] suff(3i+2) 3. Merge the sorted S 0,2 with the sorted S 1 T(n) = O(split) + T(2n/3) + O(|S 1 |) + O(merge) = O(n)
Sort recursively S 0,2 We turn this problem into the SA-construction of a shorter string of length (2/3)n. S=AAT GTG AGA TGA $$$ RadixSort all triplets that start at positions 0,2 mod 3 T = {ATG, TGT, TGA, GAG, GAT, ATG, GA$, A$$} Sort(T) = (A$$, ATG, GA$, GAG, GAT, TGA, TGT) Assign lexicographic names (log n bits) A$$=1, ATG=2, GA$=3,… Build s 0,2 and encode it: ATG TGA GAT GA$ TGT GAG ATG A$$
Sort recursively S 0,2 Given S=AAT GTG AGA TGA $$$ We have built: s 0,2 = ATG TGA GAT GA$ TGT GAG ATG A$$ enc(s 0,2 ) = It is SA 0,2 = [12, 9, 2, 11, 6, 8, 5, 3] A suffix of s 0,2 A suffix of enc(s 0,2 ) SA(enc(s 0,2 )) gives SA 0,2 Lex-order is preserved
Sort S 1 We turn this problem into the sort of pairs S=AAT GTG AGA TGA $$$ Key observation: suff(1) = = suff(7) = = SA 0,2 = [12, 9, 2, 11, 6, 8, 5, 3] Suffix of S SA 1 = [1, 7, 4, 10]
The merge step To merge suffix s i in S 0,2 with suffix s k in S 1, note that If (i mod 3) = 2 s i+1 and s k+1 belong to S 0,2 If (i mod 3) = 0 s i+2 and s k+2 belong to S 0,2 their order can be derived from SA 0,2 in O(1) time SA 1 SA 0,2 SA T(n) = T(2n/3) + O(n) + O(merge) = O(n) S=AAT GTG AGA TGA $$$