Published by Godwin Stevens. Modified over 9 years ago.
1
Computing Left-Right Maximal Generic Words Takaaki Nishimoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan August 24-26, 2015 PSC2015
2
Characteristic String of Documents
A string that is common to several documents characterizes that set of documents.
T1 = praguestringabc
T2 = bacompscienceap
T3 = apscreenapscite
T4 = strconferenceab
T5 = wepscompresseda
3
d-Right-Maximal Generic Words
Let D = {T1, …, Tm} be a set of strings.
W_D(x) : number of distinct strings in D that have x as a substring.
Problem [Kucherov et al., SPIRE 2012]: Given a pattern P and a positive integer d (≤ m), compute all d-right-maximal extensions of P.
A string x is a d-right-maximal extension of P if
- P is a prefix of x,
- W_D(x) ≥ d, and
- W_D(xa) < d for any character a.
4
d-Right-Maximal Generic Words: Example
Let D = {T1, …, Tm} be a set of strings.
Problem [Kucherov et al., SPIRE 2012]: Given a pattern P and a positive integer d (≤ m), compute all d-right-maximal extensions of P.
T1 = ababaabaaaacb
T2 = cbaabacabaabc
T3 = bbabaaca
P = aa, d = 2
output = { aaba, aac }
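The definition can be checked directly on the example with a brute-force scan. The following is only a sketch for small inputs, not the O(n)-space data structure of Kucherov et al.; the function names `doc_count` and `right_maximal_extensions` are illustrative and do not come from the slides.

```python
def doc_count(docs, x):
    """W_D(x): number of distinct strings in docs that contain x."""
    return sum(1 for t in docs if x in t)


def right_maximal_extensions(docs, pattern, d):
    """All d-right-maximal extensions of pattern, straight from the definition."""
    alphabet = sorted(set("".join(docs)))
    answers = set()
    for t in docs:
        start = t.find(pattern)
        while start != -1:
            # Try every right extension of this occurrence of the pattern.
            for end in range(start + len(pattern), len(t) + 1):
                x = t[start:end]
                if doc_count(docs, x) < d:
                    break  # extending further can only lower W_D
                if all(doc_count(docs, x + a) < d for a in alphabet):
                    answers.add(x)
            start = t.find(pattern, start + 1)
    return answers


docs = ["ababaabaaaacb", "cbaabacabaabc", "bbabaaca"]
print(sorted(right_maximal_extensions(docs, "aa", 2)))  # ['aaba', 'aac']
```

The scan recomputes W_D for every candidate, so its cost grows quickly with the input size; it is meant only to make the three conditions concrete.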
5
d-Right-Maximal Generic Words
Theorem [Kucherov et al., SPIRE 2012]: There exists an O(n)-space data structure that computes all d-right-maximal extensions of P in O(|P| + rocc) time. The data structure can be constructed in O(n) time.
n : total length of the strings in D
rocc : number of d-right-maximal extensions of P
Each d-right-maximal extension corresponds to a branching node in the generalized suffix tree of D.
6
Generalized Suffix Tree (GST)
Each leaf of the generalized suffix tree of D corresponds to a suffix of a string in D.
Example: T1 = aabaab, T2 = aabab, T3 = babaaa
7
Generalized Suffix Tree (GST): notation
GST_D : generalized suffix tree of D
GST_D(u) : subtree rooted at a node u
str_D(u) : string represented by a node u in GST_D
weight_D(u) := W_D(str_D(u))
maxchild_D(u) : maximum weight of a child of u
L(P) : locus of P
8
d-Right-Maximal Generic Words
Each answer corresponds to a branching node in GST_D(L(P)).
Example: T1 = aabaab, T2 = aabab, T3 = babaaa
P = ab, d = 2
output = { abaa }
[Figure: GST_D with the locus L(P) marked; the answer nodes u satisfy weight_D(u) ≥ 2 and maxchild_D(u) < 2.]
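The weight/maxchild test can be tried on a small scale with an uncompacted suffix trie in place of the generalized suffix tree. This is a sketch under that simplifying assumption: the trie takes quadratic space, it has no terminal symbols, and `build_suffix_trie` and `right_maximal_by_trie` are illustrative names, not part of the slides.

```python
def build_suffix_trie(docs):
    """Uncompacted trie of all suffixes; each node stores the ids of the
    documents whose suffixes pass through it (so len(docs) = weight)."""
    root = {"children": {}, "docs": set()}
    for doc_id, t in enumerate(docs):
        for i in range(len(t)):
            node = root
            node["docs"].add(doc_id)
            for c in t[i:]:
                node = node["children"].setdefault(
                    c, {"children": {}, "docs": set()})
                node["docs"].add(doc_id)
    return root


def right_maximal_by_trie(docs, pattern, d):
    """Report nodes u below the locus of pattern with
    weight(u) >= d and maxchild(u) < d."""
    node = build_suffix_trie(docs)
    for c in pattern:  # walk down to the locus L(P)
        node = node["children"].get(c)
        if node is None:
            return set()
    answers, stack = set(), [(node, pattern)]
    while stack:
        u, s = stack.pop()
        if len(u["docs"]) < d:  # weight(u) < d: nothing below qualifies
            continue
        maxchild = max((len(ch["docs"]) for ch in u["children"].values()),
                       default=0)
        if maxchild < d:
            answers.add(s)
        else:
            stack.extend((ch, s + c) for c, ch in u["children"].items())
    return answers


print(right_maximal_by_trie(["aabaab", "aabab", "babaaa"], "ab", 2))  # {'abaa'}
```

In the trie every string is its own node, so the reported nodes coincide with the branching nodes of the compacted tree that the slide refers to.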
9
New Problem
Let D = {T1, …, Tm} be a set of strings.
Problem: Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.
Example:
T1 = ababaabaaaacb
T2 = cbaabacabaabc
T3 = bbabaaca
P = aa, d = 2
output = { baaba, abaab, babaa }
10
Our Contribution
Let D = {T1, …, Tm} be a set of strings.
Problem: Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.
Theorem: There exists an O(n log n)-space data structure that computes all d-left-right-maximal extensions of P in O(|P| + occ log^2 n + rocc log n) time.
n : total length of the strings in D
rocc : number of d-right-maximal extensions of P
occ : number of d-left-right-maximal extensions of P
11
d-Left-Right-Maximal Generic Words
An answer does not necessarily correspond to a branching node in GST_D(L(P)).
Example: T1 = aabaab, T2 = aabab, T3 = babaaa
P = ab, d = 2
output = { abaa, aaba }
12
Main Idea
Each d-left-right-maximal extension of P has a right (not necessarily maximal) extension of P as a suffix.
[Figure: an occurrence of P inside Ti, extended to the left and to the right.]
13
Main Idea
If we check the d-left-maximal extensions of all right extensions of P, we obtain all answers. We consider such extensions on the GST.
[Figure: an occurrence of P inside Ti, extended to the left and to the right.]
14
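Main Idea
For any branching right (not necessarily maximal) extension of P, we compute its d-left-maximal extensions.
D^R = {T1^R, …, Tm^R}
r(u) = L(str(u)^R) : locus in GST_{D^R} of the reversed string of a node u in GST_D(L(P))
Nodes v in the subtree of GST_{D^R} rooted at r(u) with weight_{D^R}(v) ≥ d and maxchild_{D^R}(v) < d are candidates for answers.
[Figure: GST_D with L(P) and a node u of weight ≥ d; GST_{D^R} with r(u) and the candidate nodes below it.]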
15
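Main Idea
REx : set of right extensions u of P in GST_D
cand(u) : set of candidates found below r(u) = L(str(u)^R) in GST_{D^R}
Cand(REx) = ∪_{u ∈ REx} cand(u) : set of candidates
[Figure: GST_D with L(P) and the nodes u of REx; GST_{D^R} with r(u) and cand(u).]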
16
Cand(REx)
Cand(REx) may contain non-answers. We want to remove such nodes from Cand(REx), so we characterize them.
17
Non-answers
The nodes in Cand(REx) that are not answers are exactly those that are not d-right-maximal.
We should check whether each candidate is d-right-maximal or not. To do so, we need information about the node r’(v) for each node v in GST_{D^R}.
r’(v) : node in GST_D such that str(v)^R = str(r’(v)) (it may be an implicit node)
[Figure: occurrences of P in Ti whose extensions are rejected (×).]
18
Remove non-answers
We check whether a node v is d-right-maximal or not by checking maxchild_D(r’(v)).
weight_{D^R}(v) ≥ d and maxchild_{D^R}(v) < d : v is d-left-maximal
maxchild_D(r’(v)) < d : v is d-right-maximal
[Figure: GST_D with L(P), u and r’(v); GST_{D^R} with r(u) = L(str(u)^R) and v.]
19
Remove non-answers
We define the following subset of the candidates, which contains only answers.
cand’(u) = {v | v ∈ cand(u) and maxchild_D(r’(v)) < d}
We compute cand’(u) by using a range reporting query.
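The two-stage filtering can be imitated on plain strings with a brute-force scan. The slides do not spell out the definition of a d-left-right-maximal extension, so the sketch below assumes the reading suggested by the preceding slides: x is an answer if it has a right extension of P with weight at least d as a suffix, W_D(x) ≥ d, and neither W_D(cx) nor W_D(xc) reaches d for any character c. All function names are illustrative.

```python
def doc_count(docs, x):
    """W_D(x): number of distinct strings in docs that contain x."""
    return sum(1 for t in docs if x in t)


def left_right_maximal(docs, pattern, d):
    alphabet = sorted(set("".join(docs)))
    subs = {t[i:j] for t in docs
            for i in range(len(t)) for j in range(i + 1, len(t) + 1)}
    # REx: right extensions of the pattern that occur in at least d documents.
    rex = {y for y in subs
           if y.startswith(pattern) and doc_count(docs, y) >= d}
    # Cand(REx): d-left-maximal extensions of some right extension in REx.
    cand = {x for x in subs
            if doc_count(docs, x) >= d
            and any(x.endswith(y) for y in rex)
            and all(doc_count(docs, c + x) < d for c in alphabet)}
    # Cand'(REx): drop the candidates that are not d-right-maximal.
    return {x for x in cand
            if all(doc_count(docs, x + c) < d for c in alphabet)}


docs = ["ababaabaaaacb", "cbaabacabaabc", "bbabaaca"]
result = left_right_maximal(docs, "aa", 2)
print({"baaba", "abaab", "babaa"} <= result)  # True
```

On the example with P = aa and d = 2 the result contains baaba, abaab and babaa, while aaba is rejected in the left-maximality step because baaba still occurs in two strings.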
20
Computing cand’(u)
preord(v) : preorder rank of v in GST_D’
end(v) : maximum preorder rank in GST_D’(v)
The nodes v in cand’(u) satisfy the following:
- preord(r(u)) ≤ preord(v) ≤ end(r(u))
- weight(v) ≥ d
- maxchild(v) < d
- maxchild(r’(v)) < d
Equivalently:
- preord(r(u)) ≤ preord(v) ≤ end(r(u))
- max{maxchild(v), maxchild(r’(v))} < d ≤ weight(v)
We compute the nodes that satisfy these formulas by using a segment intersection query.
21
Segment Intersection Query Problem
The nodes in GST_D’ correspond to horizontal segments. The query corresponds to a vertical segment.
preord(r(u)) ≤ preord(v) ≤ end(r(u))
max{maxchild(v), maxchild(r’(v))} < d ≤ weight(v)
[Figure: horizontal segments for the nodes; a vertical query segment at d spanning preord(r(u)) to end(r(u)).]
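The reduction can be illustrated with a naive linear scan over the segments. This is only a sketch of the geometry, not the O(log log n + k) structure cited on the next slide; the `Segment` type, the `stab` function and the numeric values are hypothetical and chosen for illustration.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    preord: int  # preorder rank of node v (position on the preorder axis)
    lo: int      # max{maxchild(v), maxchild(r'(v))}
    hi: int      # weight(v)


def stab(segments: List[Segment], first: int, last: int, d: int) -> List[int]:
    """Preorder ranks of the nodes whose segment (lo, hi] is crossed by the
    vertical query segment at threshold d spanning ranks first..last."""
    return [s.preord for s in segments
            if first <= s.preord <= last and s.lo < d <= s.hi]


# Hypothetical nodes: (preorder rank, lo, hi).
segments = [Segment(1, 2, 3), Segment(2, 1, 2), Segment(3, 0, 1),
            Segment(4, 1, 3), Segment(5, 1, 2)]
print(stab(segments, 2, 4, 2))  # [2, 4]
```

Each node contributes one segment, so the scan takes linear time per query; the point of the dedicated data structure is to avoid touching segments that are not reported.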
22
Computing cand’(u)
The number of horizontal segments is O(n).
Lemma [Chan, 2013]: A segment intersection query can be answered in O(log log n + k) time with an O(n)-space data structure, where n is the number of segments and k is the size of the output.
Lemma: For any node u in REx, cand’(u) can be computed in O(log log n + |cand’(u)|) time with an O(n)-space data structure.
23
Meaningful Right Extensions
We can obtain the set of answers by computing cand’(u) for all nodes u in REx. However, there are duplicates, and there are nodes u such that cand’(u) = ∅. We can skip such right extensions by using a range reporting query and a binary search on the GST.
Theorem: There exists an O(n log n)-space data structure that computes all d-left-right-maximal extensions of P in O(|P| + occ log^2 n + rocc log n) time.
24
Conclusion
Let D = {T1, …, Tm} be a set of strings.
Problem: Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.
Theorem: There exists an O(n log n)-space data structure that computes all d-left-right-maximal extensions of P in O(|P| + occ log^2 n + rocc log n) time.
n : total length of the strings in D
rocc : number of d-right-maximal extensions of P
occ : number of d-left-right-maximal extensions of P
25
Future Work
- Consider a more efficient algorithm.
- Can the single-document version (a special case of this problem) be solved more easily?
- Consider the minimal discriminating words problem for left-right extensions.
Thank you!
27
About Cand’(REx)
Cand’(REx) may contain duplicates because of the definition of REx. We want to remove such nodes from Cand’(REx), so we characterize them.
28
Duplicated Answers
If there is an answer in which P occurs at least twice, that answer is reported more than once.
Lemma: Let u be a node in REx such that P occurs in str(u) at least twice. For any node v such that str(v) is a proper suffix of str(u), cand’(u) ⊆ cand’(v).
[Figure: Ti with several occurrences of P; the redundant right extensions are crossed out (×).]
29
Checking P’s
We use the following lemma.
Lemma: Let u be a node in REx. preord(u1) < beg(L(P)) ≤ end(L(P)) < preord(u2) iff P occurs in str(u) exactly once (as a prefix of str(u)).
k : index such that SA_{str(u)}[k] = 1
u1 : node with str(u1) = str(u)[SA_{str(u)}[k−1]..|str(u)|]
u2 : node with str(u2) = str(u)[SA_{str(u)}[k+1]..|str(u)|]
30
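Checking P’s
[Figure: GST_D with L(P) and u; the suffix array SA of str(u) with the entries i, 1, j, where str(u1) and str(u2) are the suffixes adjacent to str(u) itself.]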
31
Checking P’s
[Figure: GST_D with L(P) and u; the suffix array SA of str(u) with the entries i, 1, j, in the case where str(u2) has P as a prefix.]
33
Concrete example (draft)
P = aa, d = 2
1. ababaabaaaacb
2. cbaabacabaabc
3. bbabaaca
34
d-Right-Maximal Generic Words
Each answer corresponds to a branching node in GST_D. Such nodes exist in GST_D(L(P)), the subtree rooted at the locus of P.
Example: P = aa, d = 2
T1 = ababaabaaaacb, T2 = cbaabacabaabc, T3 = bbabaaca
[Figure: part of GST_D below L(P), with node weights.]
35
Main Idea
We cannot use the related work [Kucherov et al.] directly.
[Figure: an occurrence of P inside Ti.]
36
Generic Words
T1 = praguestringabc
T2 = bacompscienceap
T3 = apscreenapscite
T4 = strconferenceab
T5 = wepscompresseda
[Figure: the common substrings are marked as generic (or characteristic).]
37
Main Idea
We define the set of candidates Cand(REx) as follows.
REx = {u | u ∈ GST_D(L(P)), weight_D(u) ≥ d} : right extensions of P
cand(u) = {v | v ∈ GST_{D^R}(r(u)), weight_{D^R}(v) ≥ d, maxchild_{D^R}(v) < d} : d-left-maximal extensions of a right extension of P
Cand(REx) = ∪_{u ∈ REx} cand(u) : candidates
38
Answers
We define a set of answers Cand’(REx) by removing non-answers from Cand(REx).
cand’(u) = {v | v ∈ GST_{D^R}(r(u)), weight_{D^R}(v) ≥ d, maxchild_{D^R}(v) < d, MFC(v) < d} : d-left-maximal extensions of a right extension of P, with the condition MFC(v) < d added
Cand’(REx) = ∪_{u ∈ REx} cand’(u) : answers (with duplicates)
39
Non-answers
MFC(v) : the maximum, over all characters c, of the number of strings in D that have str(v)^R c as a substring.
Example: str(v)^R is followed by “a” in 3 distinct strings and by “b” in 2 distinct strings, so MFC(v) = 3.
[Figure: occurrences of str(v)^R in T1, …, T5, each followed by a or b.]
40
Non-answers
MFC(v) : the maximum, over all characters c, of the number of strings in D that have str(v)^R c as a substring.
By using this information, the following lemma holds.
Lemma: For any node v ∈ cand(u) for some u ∈ REx, MFC(v) ≥ d iff str(v)^R is not an answer.
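MFC can be computed by brute force for a given string, which makes the right-maximality test in the lemma concrete. In this sketch the argument `x` stands for str(v)^R, and the function name `mfc` and the choice of alphabet (all characters that occur in D) are assumptions made for illustration.

```python
def mfc(docs, x):
    """Maximum over characters c of the number of strings in docs
    that contain x followed by c."""
    alphabet = sorted(set("".join(docs)))
    return max(sum(1 for t in docs if (x + c) in t) for c in alphabet)


docs = ["ababaabaaaacb", "cbaabacabaabc", "bbabaaca"]
d = 2
print(mfc(docs, "aab"))    # 2: "aaba" occurs in two strings, so aab is not d-right-maximal
print(mfc(docs, "baaba"))  # 1: no one-character right extension reaches d
```

With d = 2, the first string fails the test MFC < d and the second passes it.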
41
Extending on both sides is more natural. Why only the right side?