Published by Clyde Palmer. Modified over 9 years ago.
Suffix Trees
Alon Efrat
Computer Science Department, University of Arizona

Purpose
Given a (very long) text R, preprocess it so that, once a query text P is given, we can efficiently decide whether P appears in R. (Later: also where P appears in R.)
Example: R = "HelloWorldWhatANiceDay"
IsIn("World") = YES
IsIn("Word") = NO
IsIn("l") = YES (note: it appears more than once)

Definition: a suffix
For a word R, a suffix is what is left of R after deleting zero or more of its leading characters.
All the (non-empty) suffixes of R = "Hello":
Hello
ello
llo
lo
o
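The definition above can be sketched in C as follows. This is only an illustration: the names suffix_at and print_suffixes are hypothetical and do not appear in the slides. It relies on the fact that a suffix of a NUL-terminated C string is simply a pointer into the same buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the suffix of r that starts at index k (0 <= k <= strlen(r)).
   No copying is needed: the suffix shares storage with r.
   (suffix_at is a hypothetical helper name, not from the slides.) */
const char *suffix_at(const char *r, size_t k) {
    assert(k <= strlen(r));
    return r + k;
}

/* Print every non-empty suffix of r, longest first, one per line. */
void print_suffixes(const char *r) {
    size_t n = strlen(r);
    for (size_t k = 0; k < n; k++)
        printf("%s\n", suffix_at(r, k));
}
```

Calling print_suffixes("Hello") lists the five suffixes shown above, from "Hello" down to "o".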

Algorithm for answering IsIn
Preprocessing: create an empty trie T. Given R = "HelloWorldWhatANiceDay", insert all suffixes of R into T.
Answering IsIn(P): just check whether P is in T, that is, return find(P). (Here, find is as studied in the lecture on tries.)
This works because P appears in R exactly when P is a prefix of some suffix of R.
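A minimal sketch of this preprocessing and query in C. Several things here are assumptions rather than part of the slides: the 256-slot child array (one slot per byte value), the helper names new_node, insert, build_suffix_trie and is_in, and the reading of find(P) as "the walk along P never leaves the trie". The names NODE and ar follow the pseudocode used later in the deck.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 256            /* assumption: one child slot per byte value */

typedef struct NODE {
    struct NODE *ar[ALPHABET];  /* ar[c] is the child reached by character c */
} NODE;

static NODE *new_node(void) {
    NODE *p = malloc(sizeof *p);
    assert(p != NULL);
    for (int i = 0; i < ALPHABET; i++)
        p->ar[i] = NULL;        /* a fresh node has no children */
    return p;
}

/* Insert the string s into the trie rooted at t. */
static void insert(NODE *t, const char *s) {
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (t->ar[c] == NULL)
            t->ar[c] = new_node();
        t = t->ar[c];
    }
}

/* Preprocessing: insert every suffix of r into a fresh trie. */
NODE *build_suffix_trie(const char *r) {
    NODE *t = new_node();
    for (size_t k = 0; r[k] != '\0'; k++)
        insert(t, r + k);
    return t;
}

/* IsIn(P): return 1 if we can walk all of P from the root, else 0. */
int is_in(const NODE *t, const char *p) {
    for (; *p; p++) {
        t = t->ar[(unsigned char)*p];
        if (t == NULL)
            return 0;
    }
    return 1;
}
```

The sketch never frees the trie and spends a full pointer array on every node, which keeps it short but makes it wasteful for long texts.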
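
Example
R = "hello". Suffixes: "hello", "ello", "llo", "lo", "o".
[Figure: the trie of these suffixes. The root has four children: h, e, l and o. The branches spell h-e-l-l-o, e-l-l-o, l-l-o, l-o and o; the l child of the root is shared by "llo" and "lo".]
Example: P = "ll" is found by following l, l from the root.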

Let's get greedy
Given a (very long) text R, preprocess it so that, once a query text P is given, we can efficiently find the location of P in R (if it appears at all). More specifically, report the index at which P starts in R. (If there is more than one answer, report the last one.)
Example: R = "HelloWorldWhatANiceDay"
Where("World") = 5, since "World" starts at index 5 of R (indices start at 0).
Where("Word") = NoWhere
Where("l") = 8 (it also appears in other places)

Algorithm for answering Where
Modify the trie so that each node also contains a field b_idx. When inserting into the trie a word s whose first character is at index k of R, set b_idx = k in every node along the insertion path. Inserting the suffixes in order of increasing k means later insertions overwrite earlier values, so each node ends up holding the last starting index.
Preprocessing: create an empty trie T. Given R = "HelloWorldWhatANiceDay", insert all suffixes of R into T.
Answering Where(P): search for P in T with find(P); if the search succeeds, return the value of b_idx at the node where it terminates. (Here, find is as studied in the lecture on tries.)
The resulting data structure is called an uncompressed suffix tree.
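The Where algorithm might look like this in C. As before, this is a sketch under assumptions: the 256-slot child array, the helper names, and the use of -1 (NOWHERE) to encode the slides' "NoWhere" are all choices made here for illustration. Suffixes are inserted in increasing order of k, so the value left in each node is the last starting index.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 256   /* assumption: one child slot per byte value */
#define NOWHERE  (-1)  /* hypothetical encoding of the slides' "NoWhere" */

typedef struct NODE {
    struct NODE *ar[ALPHABET];
    int b_idx;  /* start index in R of the last suffix inserted through here */
} NODE;

static NODE *new_node(void) {
    NODE *p = malloc(sizeof *p);
    assert(p != NULL);
    for (int i = 0; i < ALPHABET; i++)
        p->ar[i] = NULL;
    p->b_idx = NOWHERE;
    return p;
}

/* Insert the suffix r[k..] and stamp k on every node along its path. */
static void insert_suffix(NODE *t, const char *r, int k) {
    for (const char *s = r + k; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (t->ar[c] == NULL)
            t->ar[c] = new_node();
        t = t->ar[c];
        t->b_idx = k;   /* overwrites the value left by earlier suffixes */
    }
}

/* Preprocessing: insert all suffixes, in increasing order of start index. */
NODE *build_suffix_trie(const char *r) {
    NODE *t = new_node();
    for (int k = 0; r[k] != '\0'; k++)
        insert_suffix(t, r, k);
    return t;
}

/* Where(P): b_idx of the node where the walk along P ends, or NOWHERE. */
int where(const NODE *t, const char *p) {
    for (; *p; p++) {
        t = t->ar[(unsigned char)*p];
        if (t == NULL)
            return NOWHERE;
    }
    return t->b_idx;
}
```

On R = "HelloWorldWhatANiceDay" this reproduces the three answers from the "Let's get greedy" slide. An empty P is not treated specially: the walk stays at the root, whose b_idx is NOWHERE.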
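
Example
R = "hello". Suffixes: "hello", "ello", "llo", "lo", "o".
[Figure: the same trie with b_idx written at each node. Every node on the branch h-e-l-l-o has b_idx = 0, and every node on e-l-l-o has b_idx = 1. The l child of the root has b_idx = 3; below it, the l and o nodes of "llo" have b_idx = 2 and the o node of "lo" has b_idx = 3. The o child of the root has b_idx = 4.]
Example: P = "ll" ends at a node with b_idx = 2.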

So much memory?
The problem with this data structure comes from long paths. A path is a sequence of nodes in which every node but the last has a single child, and all nodes have the same value of b_idx.
[Figure: the example trie for "hello" with its paths highlighted, for instance the branch h-e-l-l-o, on which every node has b_idx = 0.]

More examples of paths
[Figure: paths in a trie, with nodes labeled by their b_idx values: 0, 0, 0, 0 and 1, 1.]

Solution
Recall that all strings in the tree are suffixes of the same text R.
Add two new fields to each node, c_idx and lng, such that if lng > 0 then, when computing the string of that node, we need to concatenate lng characters of R starting at position c_idx.
Example: R = "hello", with indices 0 1 2 3 4. The path h-e-l-l-o (b_idx = 0) collapses into a single node h with c_idx = 1 and lng = 4, which stands for h followed by R[1..4] = "ello".
[Figure: the trie before and after compressing this path.]
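To make the meaning of c_idx and lng concrete, here is a small C sketch that spells out the string contributed by one compressed node. It follows one reading of the slide's figure, namely that the edge character comes first and is followed by lng characters of R starting at c_idx. The function name node_label is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Write into out the string contributed by a compressed node:
   the edge character, then lng characters of r starting at c_idx.
   out must have room for lng + 2 bytes.
   (node_label is a hypothetical name, not from the slides.) */
void node_label(const char *r, char edge, int c_idx, int lng, char *out) {
    assert(c_idx >= 0 && lng >= 0);
    assert((size_t)c_idx + (size_t)lng <= strlen(r));
    out[0] = edge;
    memcpy(out + 1, r + c_idx, (size_t)lng);
    out[1 + lng] = '\0';
}
```

With r = "hello", edge = 'h', c_idx = 1 and lng = 4, the buffer receives "hello". The compressed characters are never stored in the node itself; they are read back from R on demand.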

Compressing the tree
Assume we are visiting a node v of the tree whose distance (number of edges) from the root in the uncompressed trie is k, and that v is the first node on a path. Then c_idx = b_idx + k: the suffix through v starts at index b_idx of R, and k of its characters have already been consumed on the way from the root to v.
So the function compress_tree should "know" the distance from the root (in the uncompressed tree) of the visited node.

We need a function compress_tree that accepts a node v of the tree and the depth of v in the uncompressed tree.
We also need a function check_path(NODE *p) that returns the length (in number of edges) of the path starting at *p. For example, if *p has two children, it returns 0.
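One possible check_path in C, using the definition of a path given earlier (every node but the last has a single child, and all nodes share the same b_idx). The trie-building code is repeated so the block stands alone. The 256-slot child array and the helpers new_node, build_suffix_trie and only_child are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 256   /* assumption: one child slot per byte value */

typedef struct NODE {
    struct NODE *ar[ALPHABET];
    int b_idx;  /* start index in R of the last suffix inserted through here */
} NODE;

static NODE *new_node(void) {
    NODE *p = malloc(sizeof *p);
    assert(p != NULL);
    for (int i = 0; i < ALPHABET; i++)
        p->ar[i] = NULL;
    p->b_idx = -1;
    return p;
}

/* Build the uncompressed suffix trie of r, recording b_idx on the way. */
NODE *build_suffix_trie(const char *r) {
    NODE *root = new_node();
    for (int k = 0; r[k] != '\0'; k++) {
        NODE *t = root;
        for (const char *s = r + k; *s; s++) {
            unsigned char c = (unsigned char)*s;
            if (t->ar[c] == NULL)
                t->ar[c] = new_node();
            t = t->ar[c];
            t->b_idx = k;
        }
    }
    return root;
}

/* Return the only child of p, or NULL if p has zero or several children. */
static NODE *only_child(const NODE *p) {
    NODE *found = NULL;
    for (int i = 0; i < ALPHABET; i++) {
        if (p->ar[i] != NULL) {
            if (found != NULL)
                return NULL;    /* second child seen: p branches */
            found = p->ar[i];
        }
    }
    return found;
}

/* Length, in edges, of the path starting at *p. */
int check_path(const NODE *p) {
    int len = 0;
    const NODE *next;
    while ((next = only_child(p)) != NULL && next->b_idx == p->b_idx) {
        len++;
        p = next;
    }
    return len;
}
```

For R = "hello", the h child of the root starts a path of 4 edges, the e child a path of 3 edges, and the l child (which has two children) a path of 0 edges.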

Compressing the tree (cont.)
compress_tree(NODE *p, int depth) {
    for each non-empty cell ar[i] of *p {
        if (check_path(p->ar[i]) > 0) {
            Let q be a pointer to the node at the end of the path,
            let h be the length of the path, and let d be the depth
            of q (in the uncompressed tree), so d = depth + 1 + h.
            All of q, d and h should be obtained from check_path (think how).
            Set p->ar[i] = q
            Free the unused nodes
            q->c_idx = q->b_idx + depth + 1
            q->lng = h
            compress_tree(q, d)
        } else {
            compress_tree(p->ar[i], depth + 1)
        }
    }
}
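A runnable C sketch of compress_tree, under several assumptions. The work of check_path is folded into a loop inside compress_tree so that q, h and d come out together; that is one possible answer to the "think how" hint, not necessarily the intended one. The 256-slot child array and the helpers new_node, build, only_child and count_nodes are hypothetical. Children that do not start a path are still visited recursively, so paths deeper in the tree are compressed too.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 256   /* assumption: one child slot per byte value */

typedef struct NODE {
    struct NODE *ar[ALPHABET];
    int b_idx;   /* start index in R of the last suffix through this node */
    int c_idx;   /* where in R the compressed characters start */
    int lng;     /* number of compressed characters (0 = none) */
} NODE;

static NODE *new_node(void) {
    NODE *p = malloc(sizeof *p);
    assert(p != NULL);
    for (int i = 0; i < ALPHABET; i++)
        p->ar[i] = NULL;
    p->b_idx = -1;
    p->c_idx = 0;
    p->lng = 0;
    return p;
}

/* Build the uncompressed suffix trie of r, recording b_idx on the way. */
NODE *build(const char *r) {
    NODE *root = new_node();
    for (int k = 0; r[k] != '\0'; k++) {
        NODE *t = root;
        for (const char *s = r + k; *s; s++) {
            unsigned char c = (unsigned char)*s;
            if (t->ar[c] == NULL)
                t->ar[c] = new_node();
            t = t->ar[c];
            t->b_idx = k;
        }
    }
    return root;
}

/* Return the only child of p, or NULL if p has zero or several children. */
static NODE *only_child(const NODE *p) {
    NODE *found = NULL;
    for (int i = 0; i < ALPHABET; i++) {
        if (p->ar[i] != NULL) {
            if (found != NULL)
                return NULL;
            found = p->ar[i];
        }
    }
    return found;
}

/* p sits at distance depth from the root of the uncompressed trie. */
void compress_tree(NODE *p, int depth) {
    for (int i = 0; i < ALPHABET; i++) {
        NODE *q = p->ar[i];
        if (q == NULL)
            continue;
        int h = 0;          /* length of the path starting at p->ar[i] */
        NODE *next;
        while ((next = only_child(q)) != NULL && next->b_idx == q->b_idx) {
            free(q);        /* every path node except the last is unused */
            q = next;
            h++;
        }
        if (h > 0) {
            p->ar[i] = q;   /* link directly to the end of the path */
            q->c_idx = q->b_idx + depth + 1;
            q->lng = h;
        }
        compress_tree(q, depth + 1 + h);  /* d = depth of q before compression */
    }
}

/* Count the nodes of the tree rooted at p (used to see the effect). */
int count_nodes(const NODE *p) {
    int n = 1;
    for (int i = 0; i < ALPHABET; i++)
        if (p->ar[i] != NULL)
            n += count_nodes(p->ar[i]);
    return n;
}
```

For R = "hello", calling compress_tree(root, 0) leaves the h child of the root with c_idx = 1 and lng = 4, and the tree shrinks from 15 nodes to 7.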
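
How large is the tree now?
Lemma: If T is a tree with no node of degree 1 (that is, no node with exactly one child), then the number of nodes is O(number of leaves).
In our scenario, the number of leaves is at most |R|, since every leaf is the end of some suffix of R.
So the size of the trie is O(|R|).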
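The lemma can be justified by a short counting argument. The sketch below reads "degree 1" as "exactly one child"; the symbols n, ℓ and i are introduced here and are not part of the slides.

```latex
Let $n$ be the number of nodes, $\ell$ the number of leaves, and
$i = n - \ell$ the number of internal nodes. Every internal node has at
least two children, and every node except the root is the child of
exactly one node, so counting edges gives
\[
  n - 1 \;=\; \sum_{v\ \mathrm{internal}} \#\mathrm{children}(v) \;\ge\; 2i .
\]
Substituting $n = i + \ell$ yields $i + \ell - 1 \ge 2i$, hence
\[
  i \le \ell - 1
  \qquad\text{and}\qquad
  n = i + \ell \le 2\ell - 1 .
\]
With $\ell \le |R|$ this gives $n \le 2|R| - 1 = O(|R|)$.
```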