Andrzej Ehrenfeucht, University of Colorado, Boulder


1 Contracted Suffix Trees: A Simple and Dynamic Text Indexing Data Structure
Andrzej Ehrenfeucht, University of Colorado, Boulder; Ross McConnell, Nissa Osheim, Sung-Whan Woo, Colorado State University

2 The Problem
Find all places in a text T where a pattern string P occurs as a substring. Preprocessing of T is allowed.
n: length of T
m: length of P
k: number of occurrences of P in T
Assume the size of the alphabet Σ is fixed.

3 Previous Approaches
Suffix trees: O(n) to build; O(m + k) for a query
Compact DAWGs: O(n) to build; O(m + k) for a query
Suffix arrays: O(n) to build; O(log n + m + k) for a query (a better bound than the others when the alphabet size isn’t fixed)

4 Today’s Talk: Contracted Suffix Trees
O(n) to build; O(m + k) for a query
Can be updated efficiently when the text is edited

5 Time Bounds for Editing T
Let h(T) denote the length of the longest substring of T that occurs at least as many times as it is long.
Example: T = “abaabababb”. “aba” has length 3 and occurs three times. No substring of length 4 occurs four times. Therefore, h(T) = 3.
h(T) can be expected to behave like log n for most applications.
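The definition of h(T) can be checked directly with a brute-force computation. The sketch below is only an illustration, not part of the talk: the function name `h` is chosen here to match the slide's notation, overlapping occurrences are counted, and "as many times as it is long" is read as "at least as many times".

```python
from collections import Counter


def h(text: str) -> int:
    """Length of the longest substring that occurs at least as many times as it is long."""
    best = 0
    n = len(text)
    for length in range(1, n + 1):
        # Count every (possibly overlapping) occurrence of each substring of this length.
        counts = Counter(text[i:i + length] for i in range(n - length + 1))
        if max(counts.values()) >= length:
            best = length
        else:
            # If no substring of this length qualifies, no longer one can:
            # a prefix of a qualifying substring would itself qualify.
            break
    return best


print(h("abaabababb"))  # the slide's example, where h(T) = 3
```

This takes far longer than anything in the talk (it enumerates all substrings), so it is useful only for experimenting with small texts.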

6 Time Bounds for Editing T
O([h(T)]^2 + b h(T)) if b consecutive characters are deleted
O([h(T')]^2 + b h(T')) if b consecutive characters are inserted, where T' is the text after the insertion
O([h(T)]^2) to move b consecutive characters to a new location in T

7 Notation: We refer to a node of a trie by the string that leads to it
Allows us to say things like, “Node A is a prefix of node B,” or “String aba is an ancestor of string abaa.”
[Figure: trie with nodes numbered 1 to 19, each referred to by the string that leads to it: lambda (the root), a, b, aa, ab, ba, bb, aaa, aba, abb, baa, bab, bba, abaa, abab, baba, babb, ababa, babab]

8 Building a position heap (special case)
[Figure: the position heap for a 19-position text over the alphabet {a, b}; each node is labeled with one of the positions 1 to 19]
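A minimal sketch of the simple (not the linear-time) construction, under stated assumptions: suffixes are inserted from shortest to longest, and each insertion walks down the heap along the suffix and adds one new node for the shortest prefix that is not yet a node. The dictionary-based node layout and the names `build_position_heap` and `node_labels` are choices made here for illustration. Positions are 0-based from the left, whereas the figures number positions differently.

```python
def build_position_heap(text: str) -> dict:
    """Insert the suffixes of text from shortest to longest, one new node each."""
    root = {"pos": None, "children": {}}
    for i in range(len(text) - 1, -1, -1):
        node, j = root, i
        # Follow the suffix text[i:] down the existing heap as far as possible.
        while j < len(text) and text[j] in node["children"]:
            node = node["children"][text[j]]
            j += 1
        # Every existing node belongs to a shorter suffix, so the walk always
        # stops before the end of the text and text[j] is well defined.
        node["children"][text[j]] = {"pos": i, "children": {}}
    return root


def node_labels(root: dict) -> dict:
    """Map each node's string (the path from the root) to its stored position."""
    labels = {}
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for ch, child in node["children"].items():
            labels[prefix + ch] = child["pos"]
            stack.append((prefix + ch, child))
    return labels


heap = build_position_heap("abaabababb")
for string, pos in sorted(node_labels(heap).items(), key=lambda kv: kv[1]):
    print(pos, string)
```

Each insertion costs time proportional to the depth it reaches, which is the source of the height-dependent construction bound discussed later in the talk.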

9-12 Searching for “aba”
[Figure: animation, over four slides, of a search for “aba” in the position heap]

13 Why wasn’t this discovered earlier?
[Figure: the position heap and the 19-position text]

14 IT DOESN’T WORK!
[Figure: the same position heap and text]

15 Observation: any missed occurrence of “aba” is an ancestor of “aba”
[Figure: the position heap with the node aba marked]

16-18 Solution: check all m ancestors of P to see whether each gives an occurrence of P
[Figure: animation, over three slides, of the ancestors of node P being checked against the text]

19 Ehrenfeucht:
Search: O(m^2 + k), “overly pessimistic for practical problems”
Construction of the data structure: O(n^2), “also overly pessimistic for practical problems”
“Can be taught to undergrad data structures students”
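The ancestor-checking search can be sketched as follows. This is an illustration under assumptions, not the authors' code: it uses a simple shortest-suffix-first construction, 0-based positions counted from the left, and the hypothetical names `build_position_heap` and `find_all`. Nodes that are proper prefixes of P are only candidates and must be verified against the text; the node for P itself and all of its descendants are occurrences without further checking.

```python
def build_position_heap(text: str) -> dict:
    """Insert the suffixes of text from shortest to longest, one new node each."""
    root = {"pos": None, "children": {}}
    for i in range(len(text) - 1, -1, -1):
        node, j = root, i
        while j < len(text) and text[j] in node["children"]:
            node = node["children"][text[j]]
            j += 1
        node["children"][text[j]] = {"pos": i, "children": {}}
    return root


def find_all(text: str, root: dict, pattern: str) -> list:
    """Return the sorted start positions of a non-empty pattern in text."""
    hits = []
    node, depth = root, 0
    while depth < len(pattern) and pattern[depth] in node["children"]:
        node = node["children"][pattern[depth]]
        depth += 1
        if depth < len(pattern) and text.startswith(pattern, node["pos"]):
            # An ancestor of P: its position is an occurrence only if the
            # text really continues with the rest of P, at O(m) per check.
            hits.append(node["pos"])
    if pattern and depth == len(pattern):
        # P is a node: it and every descendant store an occurrence of P.
        stack = [node]
        while stack:
            current = stack.pop()
            hits.append(current["pos"])
            stack.extend(current["children"].values())
    return sorted(hits)


text = "abaabababb"
heap = build_position_heap(text)
print(find_all(text, heap, "aba"))
```

With at most m ancestors, each verified in O(m) time, plus one visit per reported descendant, the cost matches the O(m^2 + k) bound quoted on the slide.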

20 Relation to previous work
Coffman and Eve ’70: scheme for hashing
Ziv-Lempel: data compression scheme

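21 Why is the height h O(h(T))?
Consider the longest path from the root to a leaf.
Depth h: at least one occurrence
Depth h-1: at least two occurrences
Depth h/2: at least h/2 occurrences
The node at depth h/2 is a string of length h/2 that occurs at least h/2 times, so h/2 <= h(T) and h = O(h(T)).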
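The height argument can be sanity-checked on small texts. The sketch below assumes a simple shortest-suffix-first construction and a brute-force h(T); the names `build_position_heap`, `height`, and `h` are hypothetical. It prints the height next to h(T) so the relation height <= 2 h(T), which is what the h/2 argument gives, can be inspected.

```python
from collections import Counter


def build_position_heap(text: str) -> dict:
    """Insert the suffixes of text from shortest to longest, one new node each."""
    root = {"pos": None, "children": {}}
    for i in range(len(text) - 1, -1, -1):
        node, j = root, i
        while j < len(text) and text[j] in node["children"]:
            node = node["children"][text[j]]
            j += 1
        node["children"][text[j]] = {"pos": i, "children": {}}
    return root


def height(node: dict) -> int:
    """Number of edges on the longest root-to-leaf path (recursive, small inputs only)."""
    if not node["children"]:
        return 0
    return 1 + max(height(child) for child in node["children"].values())


def h(text: str) -> int:
    """Brute-force h(T): longest substring occurring at least as often as its length."""
    best = 0
    for length in range(1, len(text) + 1):
        counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
        if max(counts.values()) < length:
            break
        best = length
    return best


for text in ("abaabababb", "aaaaa", "abababab", "abcabcabc"):
    print(text, height(build_position_heap(text)), h(text))
```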

22 Hereditary property: every substring of a node of the position heap is also a node
For example, abaa is a node, and so are its substrings aba, baa, ab, ba, aa, a, b, and lambda.
[Figure: the position heap with the node abaa and its substrings marked]
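The hereditary property can be tested on a small example. This is a sketch under assumptions: it uses a simple shortest-suffix-first construction, the text "abaabababb" from the h(T) example rather than the text in the figure, and the hypothetical names `build_position_heap`, `node_strings`, and `is_hereditary`. The root is included as the empty string, which the slides call lambda.

```python
def build_position_heap(text: str) -> dict:
    """Insert the suffixes of text from shortest to longest, one new node each."""
    root = {"pos": None, "children": {}}
    for i in range(len(text) - 1, -1, -1):
        node, j = root, i
        while j < len(text) and text[j] in node["children"]:
            node = node["children"][text[j]]
            j += 1
        node["children"][text[j]] = {"pos": i, "children": {}}
    return root


def node_strings(root: dict) -> set:
    """The set of strings that lead to a node, including "" for the root."""
    strings = {""}
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for ch, child in node["children"].items():
            strings.add(prefix + ch)
            stack.append((prefix + ch, child))
    return strings


def is_hereditary(strings: set) -> bool:
    """True if every non-empty substring of every member is itself a member."""
    return all(
        s[i:j] in strings
        for s in strings
        for i in range(len(s))
        for j in range(i + 1, len(s) + 1)
    )


nodes = node_strings(build_position_heap("abaabababb"))
print(sorted(nodes, key=lambda s: (len(s), s)))
print(is_hereditary(nodes))
```

The check passes only because the positions are inserted in order; a structure that drops that requirement gives up this property.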

23 So far: O(n h(T)) to build the position heap; O(m^2 + k) for each query
Next: getting O(m + k) for queries

24 Getting O(m + k) for queries: “red pointers”
[Figure: the position heap and the 19-position text, with “red pointers” added]

25 Searching for “aba” and “bab”
[Figure: searches for “aba” and “bab” in the position heap]

26 Searching on a long string, such as babbabbab
Prefix babb has length 4 and occurs at positions {7, 10}. Recurse on the rest of the string to see whether it occurs at positions {7-4, 10-4}: babb|abbab. Time: O(1) per character.
[Figure: the position heap with the search path for babb]

27 So far: O(n h(T)) to build the position heap; O(m^2 + k) for each query
Next: O(n) to build the position heap

28 Primal position heap and dual heap
[Figure: a primal position heap and its dual heap for a 7-position text over {a, b}; both heaps have the same nodes: lambda, a, b, ab, ba, bb, abb, bba]
The dual heap is a tree because of the hereditary property: every position has a parent in the dual (the next smaller suffix).

29 Linear algorithm for constructing the position heap
[Figure: the primal heap and the dual heap for a 13-position text over {a, b}]

30 Primal-dual approach: amortized analysis to get O(n) time bound
Red pointers are installed in O(n) time by a similar approach

31 So far: O(n) to build the position heap; O(m + k) for each query
Next: how to modify the structure when the text is edited

32 Contracted Suffix Tree: Drop the requirement that the positions are inserted in order
[Figure: a contracted suffix tree for a 19-position text over {a, b} that ends in $, with the positions inserted out of order]

33 Contracted Suffix Tree
The hereditary property no longer applies
No dual
The locations are pointers, rather than position numbers

34-41 Modifying the contracted suffix tree when you edit T
[Figure: animation, over eight slides, of a contracted suffix tree for a 12-position text ending in $ being updated after an edit; the nodes that must be removed and reinserted are marked “collateral casualties”]

42 Dynamic texts: use contracted suffix tree
O(h(T)) collateral casualties; each takes O(h(T)) time to remove and reinsert: O([h(T)]^2)
To delete a block of b characters: each character takes O(h(T)) time, and there are O(h(T)) collateral casualties: O([h(T)]^2 + b h(T))
To move a block of b characters: you leave everything where it is in the tree, except for the collateral casualties

43 Dynamic texts: use contracted suffix tree
O(n) to build an initial tree (position heap)
According to our paper, a data structure for the text must allow lookup by current position number (ugly)
O(m log n + k) for each query
New: We can use a doubly-linked list to represent the dynamic text and get O(m + k) for queries

44 Why a linked-list representation of the text presents a problem for queries
Prefix babb has length 4 and occurs at positions {7, 10}. Recurse on the rest of the string to see whether it occurs at positions {7-4, 10-4}: babb|abbab. Time: O(1) per character.
[Figure: the position heap with the search path for babb]

45 Getting rid of the fancy dynamic text data structure: “Green pointers” on the contracted suffix tree
Prefix babb has length 4 and occurs at positions {7, 10}. Recurse on the rest of the string to see whether it occurs at positions {7-4, 10-4}: babb|abb|ab. Time: O(1) per character.
[Figure: the contracted suffix tree with “green pointers”]

46 Summary
O(n) to build an initial tree (position heap)
O([h(T)]^2 + b h(T)) to delete or insert a block of b characters
O([h(T)]^2) to move a block of b characters to a new position in the text
Simple data structure for the dynamic text
O(m + k) queries

