Published by Vivien Atkins. Modified over 6 years ago.
1
Contracted Suffix Trees: A Simple and Dynamic Text Indexing Data Structure
Andrzej Ehrenfeucht, University of Colorado, Boulder
Ross McConnell, Nissa Osheim, Sung-Whan Woo, Colorado State University
2
The Problem
Find all places in a text T where a pattern string P occurs as a substring.
Preprocessing of T is allowed.
n: length of T
m: length of P
k: number of occurrences of P in T
Assume the size of the alphabet Σ is fixed.
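As a baseline for the bounds discussed in the talk, here is a minimal sketch of the problem solved with no preprocessing at all. The function name `find_occurrences` and the 0-based indexing are choices made for this illustration; they do not come from the slides.

```python
def find_occurrences(text, pattern):
    """Return every 0-based index where pattern starts in text.

    Plain scan with no preprocessing: O(n * m) in the worst case.
    """
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text.startswith(pattern, i)]


T = "abaabababb"   # n = 10
P = "aba"          # m = 3
print(find_occurrences(T, P))   # [0, 3, 5], so k = 3
```

Occurrences may overlap, as the second and third matches do here.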
3
Previous Approaches
Suffix trees: O(n) to build; O(m + k) for a query
Compact DAWGs: O(n) to build; O(m + k) for a query
Suffix arrays: O(n) to build; O(log n + m + k) for a query (a better bound than the others when the alphabet size isn’t fixed)
4
Today’s Talk: Contracted Suffix Trees
O(n) to build; O(m + k) for a query
Can be updated efficiently when the text is edited
5
Time Bounds for Editing T
Let h(T) denote the length of the longest substring of T that occurs at least as many times as it is long.
Example: T = “abaabababb”
“aba” has length 3 and occurs three times.
No substring of length 4 occurs four times.
Therefore, h(T) = 3.
h(T) can be expected to behave like log n for most applications.
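The definition can be checked by brute force on small inputs. The sketch below assumes that "as many times as it is long" means "at least as many times" and that overlapping occurrences count separately. The function name `h_of_text` is made up for this example.

```python
from collections import Counter


def h_of_text(text):
    """Length of the longest substring that occurs at least as often as it is long."""
    best = 0
    for length in range(1, len(text) + 1):
        counts = Counter(text[i:i + length]
                         for i in range(len(text) - length + 1))
        if max(counts.values()) >= length:
            best = length
        else:
            # Any prefix of a qualifying substring also qualifies, so once
            # one length fails, every longer length fails as well.
            break
    return best


print(h_of_text("abaabababb"))   # 3
```

This scan is far from efficient; it is only meant to make the definition concrete.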
6
Time Bounds for Editing T
O([h(T)]^2 + bh(T)) if b consecutive characters are deleted
O([h(T')]^2 + bh(T')) if b consecutive characters are inserted, where T' is the text after the insertion
O([h(T)]^2) to move b consecutive characters to a new location in T
7
Notation: We refer to a node of a trie by the string that leads to it
Allows us to say things like, “Node A is a prefix of node B,” or “String aba is an ancestor of string abaa.”
[Figure: trie with nodes numbered 1 to 19 and edges labeled a or b; the nodes are named lambda, a, b, aa, ab, ba, bb, aaa, aba, abb, baa, bab, bba, abaa, abab, baba, babb, ababa, babab]
8
Building a position heap (special case)
[Figure: position heap with nodes numbered 1 to 19, built over a text on the alphabet {a, b} whose positions are numbered 19 down to 1 from left to right]
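A minimal sketch of the naive construction, using the convention that a node is named by the string leading to it. Several things here are assumptions of the sketch rather than details recovered from the slide: positions are counted from the right end of the text starting at 1, the root is the empty string and carries no position, and the example text is the one from the h(T) slide, not the 19-character text in the figure. The function name `build_position_heap` is hypothetical.

```python
def build_position_heap(text):
    """Map each node (named by the string leading to it) to its position."""
    n = len(text)
    heap = {"": None}                      # root: empty string, no position
    for pos in range(1, n + 1):            # shortest suffix first
        suffix = text[n - pos:]
        depth = 1
        # Walk down while the prefix is already a node. The whole suffix
        # can never be a node yet: that would need pos non-root nodes on
        # one path, and only pos - 1 have been inserted so far.
        while suffix[:depth] in heap:
            depth += 1
        heap[suffix[:depth]] = pos         # exactly one new node per position
    return heap


heap = build_position_heap("abaabababb")
for node, pos in sorted(heap.items(), key=lambda item: item[1] or 0):
    print(repr(node), pos)
```

String slicing makes this slower than a pointer-based trie, but it keeps the correspondence between nodes and strings explicit.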
9
Searching for “aba”
[Figure: the position heap during the search for “aba”]
13
Why wasn’t this discovered earlier?
[Figure: the position heap and the text, positions 19 down to 1]
14
IT DOESN’T WORK!
[Figure: the position heap and the text, positions 19 down to 1]
15
Observation: any missed occurrence of “aba” is an ancestor of “aba”
[Figure: the position heap with the node aba marked]
16
Solution: check each of the m ancestors of P to see whether it gives an occurrence of P
[Figure: the position heap with the node P and the path from the root to P marked]
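The naive query can be sketched as follows: every node at or below P is an occurrence, and each node strictly above P is checked directly against the text. This is an illustration under assumptions, not the authors' code: positions are counted from the right end starting at 1, the root carries no position, and the names `Node`, `build`, and `search` are invented here. A compact builder is included so that the example runs on its own.

```python
class Node:
    def __init__(self, pos=None):
        self.pos = pos            # position stored here (None at the root)
        self.children = {}        # edge character -> child Node


def build(text):
    """Insert suffixes shortest-first; each insertion adds one node."""
    n, root = len(text), Node()
    for pos in range(1, n + 1):
        node, i = root, n - pos
        while text[i] in node.children:       # walk down as far as possible
            node, i = node.children[text[i]], i + 1
        node.children[text[i]] = Node(pos)    # fell off the trie: new node
    return root


def search(root, text, pattern):
    """Return the positions (counted from the right) where pattern occurs."""
    n, node, path = len(text), root, []
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            break                             # pattern itself is not a node
        path.append(node)
    hits = []
    if node is not None and path:
        # Pattern is a node: it and all of its descendants are occurrences.
        stack = [path.pop()]
        while stack:
            cur = stack.pop()
            hits.append(cur.pos)
            stack.extend(cur.children.values())
    # What remains on the path are proper prefixes of the pattern; each
    # is checked against the text (at most m checks of O(m) each).
    hits += [a.pos for a in path if text.startswith(pattern, n - a.pos)]
    return sorted(hits)


T = "abaabababb"
print(search(build(T), T, "aba"))   # [5, 7, 10]
```

In this run the node aba and its child abaa contribute positions 7 and 10, while position 5 is found only by checking the ancestor ab.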
19
Ehrenfeucht:
Search: O(m^2 + k), “overly pessimistic for practical problems”
Construction of the data structure: O(n^2), “also overly pessimistic for practical problems”
“Can be taught to undergrad data structures students”
20
Relation to previous work
Coffman and Eve ’70: scheme for hashing
Ziv-Lempel: data compression scheme
21
Why is the height h O(h(T))?
Consider the longest path from the root to a leaf:
Depth h: at least one occurrence
Depth h-1: at least two occurrences
Depth h/2: at least h/2 occurrences
So h/2 <= h(T), and h = O(h(T)).
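The counting argument can be written out with the rounding made explicit. This is a sketch of the reasoning; the symbols v_0, ..., v_h for the nodes on the longest path are introduced here and do not appear in the slides.

```latex
\begin{align*}
&\text{Let } v_0, v_1, \dots, v_h \text{ be the nodes on a longest root-to-leaf path, with } v_d \text{ at depth } d.\\
&\text{Each of } v_d, v_{d+1}, \dots, v_h \text{ stores a distinct position whose suffix begins with the string } v_d,\\
&\text{so the string } v_d \text{ has length } d \text{ and occurs at least } h - d + 1 \text{ times in } T.\\
&\text{For } d = \lceil h/2 \rceil:\quad h - d + 1 = \lfloor h/2 \rfloor + 1 \ge \lceil h/2 \rceil = d,\\
&\text{hence } h(T) \ge \lceil h/2 \rceil \quad\text{and}\quad h \le 2\,h(T) = O(h(T)).
\end{align*}
```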
22
Hereditary property: every substring of a node of the position heap is also a node
[Figure: the position heap]
E.g., abaa is a node, and so are its substrings baa, aba, ab, ba, aa, b, a, and lambda.
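The property is easy to test empirically on small texts. The sketch below builds the set of node strings naively and then checks every substring of every node. It assumes suffixes are inserted shortest-first and uses a text chosen for the example, not the one in the figure. The function names are hypothetical.

```python
def position_heap_nodes(text):
    """Return the set of node strings of a position heap for text."""
    n, nodes = len(text), {""}
    for pos in range(1, n + 1):            # positions counted from the right
        suffix, depth = text[n - pos:], 1
        while suffix[:depth] in nodes:     # skip prefixes that are already nodes
            depth += 1
        nodes.add(suffix[:depth])          # shortest prefix that is not yet a node
    return nodes


def is_hereditary(nodes):
    """True if every substring of every node string is itself a node."""
    return all(s[i:j] in nodes
               for s in nodes
               for i in range(len(s))
               for j in range(i, len(s) + 1))


nodes = position_heap_nodes("abaabababb")
print("abaa" in nodes, is_hereditary(nodes))   # True True
```

A set that is closed under prefixes but not under substrings, such as {"", "a", "ab"}, fails the check because "b" is missing.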
23
So far:
O(nh(T)) to build the position heap
O(m^2 + k) for each query
Next: getting O(m+k) for queries
24
Getting O(m+k) for queries: “red pointers”
[Figure: the position heap with red pointers added, over the text with positions 19 down to 1]
25
Searching for “aba” and “bab”
[Figure: the position heap during the searches for “aba” and “bab”]
26
Searching on a long string, such as babbabbab
[Figure: the position heap]
Prefix babb has length 4 and occurs at positions {7, 10}.
Recurse on the rest of the string to see whether it occurs at positions {7-4, 10-4}: babb|abbab
Time: O(1) per character.
27
So far:
O(nh(T)) to build the position heap
O(m^2 + k) for each query
Next: O(n) to build the position heap
28
Primal position heap and dual heap
[Figure: a primal position heap and its dual heap on positions 1 to 7 of a text over {a, b}, with a table listing the position-heap node and the dual-heap node for each position]
The dual heap is a tree because of the hereditary property: every position has a parent in the dual (the next smaller suffix).
29
Linear algorithm for constructing the position heap
[Figure: primal heap and dual heap for a text of length 13 over {a, b}]
30
Primal-dual approach: amortized analysis gives an O(n) time bound
Red pointers are installed in O(n) time by a similar approach
31
So far:
O(n) to build the position heap
O(m + k) for each query
Next: how to modify the structure when the text is edited
32
Contracted Suffix Tree: Drop the requirement that the positions are inserted in order
[Figure: contracted suffix tree on positions 1 to 19 of a text over {a, b} that ends in $]
33
Contracted Suffix Tree
The hereditary property no longer applies
No dual
The locations are pointers, rather than position numbers
34
Modifying the contracted suffix tree when you edit T
[Figure sequence over eight slides: step-by-step update of a contracted suffix tree on a 12-position text ending in $ after an edit to T; the nodes that must be removed and reinserted are labeled “Collateral casualties”]
42
Dynamic texts: use contracted suffix tree
O(h(T)) collateral casualties; each takes O(h(T)) time to remove and reinsert: O([h(T)]^2)
To delete a block of b characters: each character takes O(h(T)) time, and there are O(h(T)) collateral casualties: O([h(T)]^2 + bh(T))
To move a block of b characters: leave everything where it is in the tree, except for the collateral casualties
43
Dynamic texts: use contracted suffix tree
O(n) to build an initial tree (position heap)
According to our paper, a data structure for the text must allow lookup by current position number (ugly)
O(m log n + k) for each query
New: we can use a doubly-linked list to represent the dynamic text and get O(m + k) for queries
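To illustrate why a doubly-linked text suits block edits, here is a minimal sketch in which each character lives in its own list cell, so that tree nodes could hold pointers to cells instead of position numbers. The class and method names (`Cell`, `LinkedText`, `move_block`) are hypothetical, and how the tree actually references the text is not specified in the slides beyond "pointers".

```python
class Cell:
    """One character of the text, linked to its neighbours."""
    __slots__ = ("ch", "prev", "next")

    def __init__(self, ch):
        self.ch, self.prev, self.next = ch, None, None


class LinkedText:
    def __init__(self, text=""):
        self.head, self.tail = Cell(None), Cell(None)    # sentinel cells
        self.head.next, self.tail.prev = self.tail, self.head
        for ch in text:
            self.insert_after(self.tail.prev, ch)

    def insert_after(self, cell, ch):
        new = Cell(ch)
        new.prev, new.next = cell, cell.next
        cell.next.prev = new
        cell.next = new
        return new

    def delete(self, cell):
        cell.prev.next, cell.next.prev = cell.next, cell.prev

    def move_block(self, first, last, dest):
        """Move cells first..last to just after dest (dest must be outside the block)."""
        first.prev.next, last.next.prev = last.next, first.prev   # unlink block
        first.prev, last.next = dest, dest.next                   # relink block
        dest.next.prev = last
        dest.next = first

    def cells(self):
        cell = self.head.next
        while cell is not self.tail:
            yield cell
            cell = cell.next

    def __str__(self):
        return "".join(c.ch for c in self.cells())


text = LinkedText("abcdef")
c = list(text.cells())
text.move_block(c[1], c[2], c[4])    # move "bc" to just after "e"
print(text)                          # adebcf
```

Given handles to the cells, moving a block costs a constant number of pointer updates regardless of b, and the cells keep their identity, so pointers held elsewhere stay valid.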
44
Why a linked-list representation of the text presents a problem for queries
[Figure: the position heap]
Prefix babb has length 4 and occurs at positions {7, 10}.
Recurse on the rest of the string to see whether it occurs at positions {7-4, 10-4}: babb|abbab
Time: O(1) per character.
45
Getting rid of the fancy dynamic text data structure: “Green pointers” on the contracted suffix tree
[Figure: the contracted suffix tree]
Prefix babb has length 4 and occurs at positions {7, 10}.
Recurse on the rest of the string to see whether it occurs at positions {7-4, 10-4}: babb|abb|ab
Time: O(1) per character.
46
Summary
O(n) to build an initial tree (position heap)
O([h(T)]^2 + bh(T)) to delete or insert a block of b characters
O([h(T)]^2) to move a block of b characters to a new position in the text
Simple data structure for the dynamic text
O(m + k) queries