Inferring Strings from Suffix Trees and Links on a Binary Alphabet Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan.


1 Inferring Strings from Suffix Trees and Links on a Binary Alphabet Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan

2 Outline
Reverse Problems on String Data Structures
Suffix Tree, Suffix Links
Reverse Problem on Suffix Trees
Efficient Solution
Inferring a Labeling Function
Suffix Tour Graph
On a Binary Alphabet

3 Reverse Problems on String Data Structures (Hot Topic)
Direct Problem: Given a string, compute its data structure (border arrays, suffix arrays, DAWG, etc.).
Reverse Problem: Given a data structure, compute its string.
Solving reverse problems could lead to a deeper understanding of strings and data structures.

4 Reverse Problems on String Data Structures
Border Array: [Franek et al., 2002], [Duval et al., 2005]
Suffix Array: [Duval and Lefebvre, 2002], [Bannai et al., 2003], [Schürmann et al., 2005]
DAWG: [Bannai et al., 2003]
Parameterized Border Array: [I et al., 2009], [I et al., 2010]
KMP Failure Function: [Gawrychowski et al., 2010]
Runs: [Matsubara et al., 2010]
Palindromic Structure: [I et al., 2010]
Prefix Table: [Clement et al., 2009]
Cover Array: [Crochemore et al., 2010]
LPF Table: [He et al., 2011]
We consider the reverse problem on suffix trees.

5 Suffix Tree, Suffix Links
The suffix tree of w is the compacted trie that represents the suffixes of w. The suffix link of a node points to the node that represents the substring obtained by deleting the first character.
[Figure: suffix tree of w = ababaaa$ (positions 1 to 8) with its suffix links; each leaf is numbered with the start index of its suffix.]

6 Direct Problem on Suffix Trees
Input: A string w. Output: The suffix tree of w.
It can be solved in linear time [e.g. Ukkonen, 1995].
[Figure: the suffix tree of w = ababaaa$.]
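As a small illustration of the direct problem, the sketch below builds the suffix tree shape by naive grouping of suffixes. It runs in quadratic time and is not Ukkonen's linear-time algorithm. The function name `suffix_tree` and the nested-tuple encoding are assumptions made for this example only.

```python
def suffix_tree(w):
    """Naive suffix tree of w as nested tuples (illustrative, not linear time).

    A leaf is the 1-based start position of its suffix. An inner node is a
    tuple of its children, ordered by the first character of each edge.
    Assumes w ends with a unique terminator '$' that sorts before letters.
    """
    def build(starts, depth):
        if len(starts) == 1:
            return starts[0] + 1  # leaf: 1-based suffix index
        # Path compaction: skip characters shared by all suffixes in the group.
        # The root (depth 0) is never compacted.
        while depth > 0 and len({w[s + depth] for s in starts}) == 1:
            depth += 1
        groups = {}
        for s in starts:
            groups.setdefault(w[s + depth], []).append(s)
        return tuple(build(groups[c], depth + 1) for c in sorted(groups))

    return build(list(range(len(w))), 0)


tree = suffix_tree("ababaaa$")
print(tree)  # (8, (7, (6, 5), (3, 1)), (4, 2))
```

Dropping the leaf numbers from this structure gives the kind of unlabeled ordered tree that the reverse problem takes as input.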

7 Reverse Problem on Suffix Trees Input: An unlabeled ordered rooted tree T. Output: A string which realizes T (if such exists). A string w is said to realize T if the suffix tree of w is isomorphic to T.

8 Reverse Problem on Suffix Trees
Input: An unlabeled ordered rooted tree T and links f. Output: A string which realizes T and f (if such exists).
A string w is said to realize (T, f) if the suffix tree of w and its suffix links are isomorphic to T and f.
[Figure: an unlabeled tree T with link function f.]

9 Reverse Problem on Suffix Trees
Input: An unlabeled ordered rooted tree T and links f. Output: A string which realizes T and f (if such exists).
A string w is said to realize (T, f) if the suffix tree of w and its suffix links are isomorphic to T and f.
[Figure: T and link function f for w = ababaaa$ (positions 1 to 8), with leaves numbered by suffix start index.]

10 Reverse Problem on Suffix Trees
Input: An unlabeled ordered rooted tree T and links f for inner nodes. Output: A string which realizes T and f (if such exists).
A string w is said to realize (T, f) if the suffix tree of w and its suffix links for inner nodes are isomorphic to T and f.
[Figure: T with link function f given for inner nodes only; strings shown: ababaaa$, aaababa$, aababaa$, abaaaba$.]

11 How can we solve this problem? Input: An unlabeled ordered rooted tree T and links f for inner nodes. Output: A string which realizes T and f (if such exists).

12 How can we solve this problem?
Input: An unlabeled ordered rooted tree T and links f for inner nodes. Output: A string which realizes T and f (if such exists).
If we can infer a “correct” order of leaves, we can get a string.
[Figure: the tree for ababaaa$ with leaves numbered by suffix start index.]

13 How can we solve this problem?
Input: An unlabeled ordered rooted tree T and links f for inner nodes. Output: A string which realizes T and f (if such exists).
If we can infer a “correct” order of leaves, we can get a string.
A naïve solution that considers all permutations of the leaves takes O(n!) time.
We need to take into account the “constraints” on the order of leaves, which are implicitly given by the input (T, f).
We introduce suffix tour graphs to capture the constraints.
[Figure: the tree for ababaaa$ with leaves numbered by suffix start index.]
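To make the cost of brute force concrete, here is a minimal sketch of a different naive approach: instead of permuting leaves, it enumerates every binary string of the right length and keeps those whose suffix tree has the target shape. It ignores the link function f entirely, so it may return more strings than the slides' problem admits. The names `shape` and `realizers` are assumptions made for this example.

```python
from itertools import product


def shape(w):
    """Unlabeled ordered shape of the suffix tree of w; () denotes a leaf."""
    def build(starts, depth):
        if len(starts) == 1:
            return ()
        # Path compaction below the root.
        while depth > 0 and len({w[s + depth] for s in starts}) == 1:
            depth += 1
        groups = {}
        for s in starts:
            groups.setdefault(w[s + depth], []).append(s)
        return tuple(build(groups[c], depth + 1) for c in sorted(groups))

    return build(list(range(len(w))), 0)


def realizers(target, n, alphabet="ab"):
    """All strings x$ with |x| = n - 1 whose suffix tree has shape `target`.

    Exhaustive search over len(alphabet) ** (n - 1) candidates.
    """
    found = []
    for chars in product(alphabet, repeat=n - 1):
        w = "".join(chars) + "$"
        if shape(w) == target:
            found.append(w)
    return found


T = shape("ababaaa$")
found = realizers(T, 8)
```

Even on a binary alphabet the search space doubles with each extra character, which is why the slides look for constraints that prune it.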

14 Outline
Reverse Problems on String Data Structures
Suffix Tree, Suffix Links
Reverse Problem on Suffix Trees
Efficient Solution
Inferring a Labeling Function
Suffix Tour Graph
On a Binary Alphabet

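15 Notations
Input \( (T, f) \): an ordered rooted tree \( T \) and a link function \( f : V_{\mathrm{in}} \setminus \{\mathrm{root}\} \to V_{\mathrm{in}} \).
\( V \): the set of nodes of \( T \)
\( E \): the set of edges of \( T \)
\( \mathrm{root} \): the root node of \( T \)
\( V_{\mathrm{in}} \): the set of inner nodes of \( T \)
\( V_{\mathrm{leaf}} \): the set of leaf nodes of \( T \)
For any \( v \in V \), \( V(v) \), \( V_{\mathrm{in}}(v) \) and \( V_{\mathrm{leaf}}(v) \) respectively represent the set of nodes, inner nodes and leaf nodes of the subtree rooted at \( v \).
\( \mathrm{children}(v) \): the set of children of \( v \)
\( \mathrm{ch}_i(v) \): the \( i \)-th child of \( v \)
\( \mathrm{par}(v) \): the parent of \( v \)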

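16 Preconditions of an Input
1. The first child of the root node is a leaf: \( \mathrm{ch}_1(\mathrm{root}) \in V_{\mathrm{leaf}} \).
2. There exists a path of function \( f \) from any node \( v_0 \in V_{\mathrm{in}} \setminus \{\mathrm{root}\} \) to the root node: \( \forall v_0 \in V_{\mathrm{in}} \setminus \{\mathrm{root}\},\ \exists v_1, v_2, \ldots, v_k \) s.t. \( v_k = \mathrm{root} \) and \( v_i = f(v_{i-1}) \) for any \( 1 \le i \le k \).
[Figure: one input that satisfies the preconditions and one that does not.]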
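The two preconditions can be checked directly. The sketch below encodes the running example (the tree of ababaaa$) with hypothetical node names: inner nodes `root`, `A`, `AA`, `ABA`, `BA` and leaves `l0` to `l7` in left-to-right order. The function name and the dictionary encoding are assumptions made for this example, and the check follows each f-chain separately rather than aiming for linear time.

```python
# Hypothetical encoding of the running example (suffix tree of ababaaa$).
# Keys of `children` are the inner nodes; lists give children left to right.
children = {
    "root": ["l0", "A", "BA"],
    "A": ["l1", "AA", "ABA"],
    "AA": ["l2", "l3"],
    "ABA": ["l4", "l5"],
    "BA": ["l6", "l7"],
}
# Link function f on the non-root inner nodes.
f = {"A": "root", "AA": "A", "ABA": "BA", "BA": "A"}


def satisfies_preconditions(children, f, root="root"):
    inner = set(children)
    # Precondition 1: the first child of the root is a leaf.
    if children[root][0] in inner:
        return False
    # Precondition 2: following f from any non-root inner node reaches the root.
    for v0 in inner - {root}:
        seen, v = set(), v0
        while v != root:
            if v in seen or v not in f or f[v] not in inner:
                return False  # cycle, missing link, or link to a leaf
            seen.add(v)
            v = f[v]
    return True


print(satisfies_preconditions(children, f))  # True
```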

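17 Depth of Nodes
\( \mathrm{depth}(v) = \begin{cases} 0 & \text{if } v = \mathrm{root}, \\ \mathrm{depth}(f(v)) + 1 & \text{if } v \in V_{\mathrm{in}} \setminus \{\mathrm{root}\}, \\ \mathrm{depth}(\mathrm{par}(v)) + 1 & \text{if } v \in V_{\mathrm{leaf}}. \end{cases} \)
[Figure: the example tree drawn with nodes at depths 0 to 4.]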
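The depth definition translates into a short memoized recursion. This is a sketch on the same hypothetical encoding of the running example (node names `root`, `A`, `AA`, `ABA`, `BA`, `l0` to `l7` are assumptions); it relies on Precondition 2, since a cycle in f would make the recursion loop.

```python
# Hypothetical encoding of the running example (suffix tree of ababaaa$).
children = {
    "root": ["l0", "A", "BA"],
    "A": ["l1", "AA", "ABA"],
    "AA": ["l2", "l3"],
    "ABA": ["l4", "l5"],
    "BA": ["l6", "l7"],
}
f = {"A": "root", "AA": "A", "ABA": "BA", "BA": "A"}


def depths(children, f, root="root"):
    par = {c: p for p, cs in children.items() for c in cs}
    depth = {root: 0}

    def d(v):
        if v not in depth:
            if v in children:
                # Inner node: one deeper than the target of its link.
                depth[v] = d(f[v]) + 1
            else:
                # Leaf: one deeper than its parent.
                depth[v] = d(par[v]) + 1
        return depth[v]

    for v in par:
        d(v)
    return depth


dep = depths(children, f)
print(dep["ABA"], dep["l4"])  # 3 4
```

For inner nodes this value equals the length of the string the node represents, because following a suffix link removes exactly one character.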

18 In what follows…
Infer the first character of the string of each edge, namely, a labeling function \( g : E \to \Sigma \cup \{\$\} \).
[Figure: the suffix tree of ababaaa$ with the first character of each edge label highlighted.]

19 Conditions for g to hold
1. The edge from the root to its first child is labeled with $: \( g((\mathrm{root}, \mathrm{ch}_1(\mathrm{root}))) = \$ \).

20 Conditions for g to hold
1. The edge from the root to its first child is labeled with $: \( g((\mathrm{root}, \mathrm{ch}_1(\mathrm{root}))) = \$ \).
2. The labels for the children are sorted in lexicographical order: \( \forall v \in V,\ 1 \le i < |\mathrm{children}(v)| \), \( g((v, \mathrm{ch}_i(v))) < g((v, \mathrm{ch}_{i+1}(v))) \).
[Figure: a node v whose child edges are labeled a, c, d, e from left to right.]

21 Conditions for g to hold
3. Condition on links of parent-child nodes: \( \forall v \in V \setminus \{\mathrm{root}\} \), with \( v_p = \mathrm{par}(v) \), there exists \( u \in \mathrm{children}(f(v_p)) \) s.t. \( g((v_p, v)) = g((f(v_p), u)) \). In addition, if \( v \in V_{\mathrm{in}} \) then \( f(v) \in V(u) \).
[Figure: edge \( (v_p, v) \) and edge \( (f(v_p), u) \) carrying the same label c, with \( f(v) \) inside the subtree of u.]
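Conditions 1 to 3 can be checked mechanically for a given labeling. The sketch below does so on the hypothetical encoding of the running example (node names and the dictionary layout are assumptions). One further assumption: Condition 3 is skipped for children of the root, because f is not defined at the root in the notation above.

```python
# Hypothetical encoding of the running example (suffix tree of ababaaa$).
children = {
    "root": ["l0", "A", "BA"],
    "A": ["l1", "AA", "ABA"],
    "AA": ["l2", "l3"],
    "ABA": ["l4", "l5"],
    "BA": ["l6", "l7"],
}
f = {"A": "root", "AA": "A", "ABA": "BA", "BA": "A"}
# Labeling g: first character of each edge label.
g = {
    ("root", "l0"): "$", ("root", "A"): "a", ("root", "BA"): "b",
    ("A", "l1"): "$", ("A", "AA"): "a", ("A", "ABA"): "b",
    ("AA", "l2"): "$", ("AA", "l3"): "a",
    ("ABA", "l4"): "a", ("ABA", "l5"): "b",
    ("BA", "l6"): "a", ("BA", "l7"): "b",
}


def subtree(children, u):
    nodes = {u}
    for c in children.get(u, []):
        nodes |= subtree(children, c)
    return nodes


def satisfies_conditions_1_to_3(children, f, g, root="root"):
    # Condition 1: the edge to the first child of the root is labeled '$'.
    if g[(root, children[root][0])] != "$":
        return False
    # Condition 2: child labels strictly increase ('$' < 'a' < 'b' in ASCII).
    for v, cs in children.items():
        labels = [g[(v, c)] for c in cs]
        if any(x >= y for x, y in zip(labels, labels[1:])):
            return False
    # Condition 3 (assumption: skipped when the parent is the root).
    for vp, cs in children.items():
        if vp == root:
            continue
        for v in cs:
            match = [u for u in children[f[vp]] if g[(f[vp], u)] == g[(vp, v)]]
            if not match:
                return False
            if v in children and f[v] not in subtree(children, match[0]):
                return False
    return True


print(satisfies_conditions_1_to_3(children, f, g))  # True
```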

22 Labels for Inner Edges
By Condition 3, the labels for inner edges (edges from inner nodes to inner nodes) can be uniquely determined. If the determined labels contradict Condition 2, the input turns out to be invalid.
[Figure: two labelings of inner edges, one consistent with Condition 2 and one marked invalid.]

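23 L_g and D_g
When a labeling function g satisfies Conditions 1 to 3, we define the following values for any node v.
\( L_g(v) = |\{ u \in V_{\mathrm{leaf}} \mid f(\mathrm{par}(u)) = \mathrm{par}(v),\ g((\mathrm{par}(u), u)) = g((\mathrm{par}(v), v)) \}| \)
\( D_g(v) = \sum_{y \in V(v)} L_g(y) \)
\( L_g(v) \) is the number of leaves u in the following situation: the link of par(u) points to par(v), and the edges (par(u), u) and (par(v), v) carry the same label c.
[Figure: the example tree annotated with the values of \( L_g \).]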

24 L_g and D_g
[Figure: the example tree annotated with the values of \( L_g \) and \( D_g \); the root has \( D_g = 8 \).]
Constraints on the order of leaves: for each leaf u counted in \( L_g(v) \), the next leaf of u is in \( V_{\mathrm{leaf}}(v) \). Hence \( D_g(v) \) leaves in \( V_{\mathrm{leaf}}(v) \) have constraints on such u's.
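The values L_g and D_g can be computed with one pass over the leaves and one bottom-up pass over the tree. The sketch below uses the hypothetical encoding of the running example (node names are assumptions). A further assumption, not spelled out in the transcript's formula: a leaf whose parent is the root has no f(par(u)), and is counted at the root itself, which is what makes D_g(root) equal to the number of leaves in this example.

```python
# Hypothetical encoding of the running example (suffix tree of ababaaa$).
children = {
    "root": ["l0", "A", "BA"],
    "A": ["l1", "AA", "ABA"],
    "AA": ["l2", "l3"],
    "ABA": ["l4", "l5"],
    "BA": ["l6", "l7"],
}
f = {"A": "root", "AA": "A", "ABA": "BA", "BA": "A"}
g = {
    ("root", "l0"): "$", ("root", "A"): "a", ("root", "BA"): "b",
    ("A", "l1"): "$", ("A", "AA"): "a", ("A", "ABA"): "b",
    ("AA", "l2"): "$", ("AA", "l3"): "a",
    ("ABA", "l4"): "a", ("ABA", "l5"): "b",
    ("BA", "l6"): "a", ("BA", "l7"): "b",
}


def L_and_D(children, f, g, root="root"):
    par = {c: p for p, cs in children.items() for c in cs}
    L = {v: 0 for v in [root, *par]}
    for u, pu in par.items():
        if u in children:
            continue  # only leaves u are counted
        if pu == root:
            L[root] += 1  # assumption: leaves under the root count at the root
            continue
        for v in children[f[pu]]:
            if g[(f[pu], v)] == g[(pu, u)]:
                L[v] += 1
    D = {}

    def total(v):
        # D_g(v): sum of L_g over the subtree rooted at v.
        D[v] = L[v] + sum(total(c) for c in children.get(v, []))
        return D[v]

    total(root)
    return L, D


L, D = L_and_D(children, f, g)
print(L["AA"], D["A"], D["root"])  # 2 4 8
```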

25 Conditions for g to hold
4. The number of leaves of the subtree rooted at v must be at least \( D_g(v) \): \( |V_{\mathrm{leaf}}(v)| - D_g(v) \ge 0 \).
[Figure: the example tree annotated with \( L_g(v) \), \( D_g(v) \) and \( |V_{\mathrm{leaf}}(v)| - D_g(v) \).]

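26 Suffix Tour Graph
When a labeling function g satisfies Conditions 1 to 4, we define the suffix tour graph \( \mathrm{STG}_g = (V_G, E_G) \) w.r.t. g.
\( V_G = V \)
\( E_G = \{ (u, v) \mid u \in V_{\mathrm{leaf}},\ f(\mathrm{par}(u)) = \mathrm{par}(v),\ g((\mathrm{par}(u), u)) = g((\mathrm{par}(v), v)) \} \cup \{ (u, v)^k \mid (u, v) \in E,\ k = |V_{\mathrm{leaf}}(v)| - D_g(v) \} \)
Here \( (u, v)^k \) denotes k copies of the edge (u, v).
[Figure: \( \mathrm{STG}_g \) for the example tree, with each tree edge annotated by its multiplicity k.]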

29 Suffix Tour Graph
\( \mathrm{STG}_g \), defined as on slide 26, is an Eulerian graph (possibly disconnected).
[Figure: \( \mathrm{STG}_g \) for the example tree.]

30 Necessary and Sufficient Condition for (T, f) and g to be valid
\( \mathrm{STG}_g \) is an Eulerian graph (possibly disconnected).
Condition: \( \mathrm{STG}_g \) has an Eulerian cycle that contains the root and all leaves.
When there exists such a cycle, a correct order of leaves that realizes (T, f) and g can be obtained from the order in which the cycle visits the leaves.
[Figure: \( \mathrm{STG}_g \) for the example tree, with leaves numbered by the order of the tour.]

31 Necessary and Sufficient Condition for (T, f) and g to be valid
\( \mathrm{STG}_g \) is an Eulerian graph (possibly disconnected).
Condition: \( \mathrm{STG}_g \) has an Eulerian cycle that contains the root and all leaves.
When there exists such a cycle, a correct order of leaves that realizes (T, f) and g can be obtained from the order in which the cycle visits the leaves.
[Figure: example of an invalid labeling function g and its \( \mathrm{STG}_g \).]

32 Computing an Eulerian Cycle
An Eulerian cycle can be computed in time linear in the graph size. We also showed that the size of \( \mathrm{STG}_g \) is linear in the input size.
Given g, we can check whether g is valid by constructing \( \mathrm{STG}_g \) and then computing an Eulerian cycle, in time linear in the input size.
What remains is to find a valid labeling function g.
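The validity check can be sketched end to end: build STG_g, run Hierholzer's algorithm from the root, and read the string off the order in which leaves are visited. This is an illustration on the hypothetical encoding of the running example, not the authors' implementation. Assumptions: node names are invented, a leaf directly under the root gets an edge back to the root so the tour can close, and the leaf sequence is rotated so that the $ leaf comes last.

```python
from collections import defaultdict

# Hypothetical encoding of the running example (suffix tree of ababaaa$).
children = {
    "root": ["l0", "A", "BA"],
    "A": ["l1", "AA", "ABA"],
    "AA": ["l2", "l3"],
    "ABA": ["l4", "l5"],
    "BA": ["l6", "l7"],
}
f = {"A": "root", "AA": "A", "ABA": "BA", "BA": "A"}
g = {
    ("root", "l0"): "$", ("root", "A"): "a", ("root", "BA"): "b",
    ("A", "l1"): "$", ("A", "AA"): "a", ("A", "ABA"): "b",
    ("AA", "l2"): "$", ("AA", "l3"): "a",
    ("ABA", "l4"): "a", ("ABA", "l5"): "b",
    ("BA", "l6"): "a", ("BA", "l7"): "b",
}


def infer_string(children, f, g, root="root"):
    par = {c: p for p, cs in children.items() for c in cs}
    leaves = [v for v in par if v not in children]
    out = defaultdict(list)  # adjacency lists of STG_g (a multigraph)
    L = defaultdict(int)
    # Leaf edges (u, v): f(par(u)) = par(v) and equal edge labels.
    for u in leaves:
        pu = par[u]
        if pu == root:
            targets = [root]  # assumption: closes the tour at the root
        else:
            targets = [v for v in children[f[pu]] if g[(f[pu], v)] == g[(pu, u)]]
        for v in targets:
            out[u].append(v)
            L[v] += 1

    # Tree edges (par(v), v) with multiplicity k = |V_leaf(v)| - D_g(v).
    def walk(v):
        n_leaf, d = (0 if v in children else 1), L[v]
        for c in children.get(v, []):
            c_leaf, c_d = walk(c)
            n_leaf, d = n_leaf + c_leaf, d + c_d
        if v != root:
            if n_leaf - d < 0:
                raise ValueError("Condition 4 is violated at " + v)
            out[par[v]].extend([v] * (n_leaf - d))
        return n_leaf, d

    walk(root)
    n_edges = sum(len(vs) for vs in out.values())
    # Hierholzer's algorithm, iterative form.
    stack, tour = [root], []
    while stack:
        v = stack[-1]
        if out[v]:
            stack.append(out[v].pop())
        else:
            tour.append(stack.pop())
    tour.reverse()
    if len(tour) != n_edges + 1 or tour[-1] != root:
        return None  # no Eulerian cycle through the root uses every edge
    seq = [v for v in tour if v not in children]
    if set(seq) != set(leaves):
        return None
    i = seq.index(children[root][0])  # rotate so the '$' leaf is last
    seq = seq[i + 1:] + seq[:i + 1]
    # First character of a suffix = label of the root edge above its leaf.
    first = {}

    def mark(v, c):
        if v in children:
            for ch in children[v]:
                mark(ch, c)
        else:
            first[v] = c

    for ch in children[root]:
        mark(ch, g[(root, ch)])
    return "".join(first[v] for v in seq)


print(infer_string(children, f, g))  # ababaaa$
```

Each edge is pushed and popped once, so the tour step is linear in the size of the graph, in line with the claim on this slide.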

33 On a Binary Alphabet
In the case of binary alphabets, due to Conditions 1 to 4 it suffices to consider at most five labeling functions.
[Figures, slides 33 to 37: \( \mathrm{STG}_g \) of the example tree under different candidate labeling functions g.]

38 Theorem
On a binary alphabet, the reverse problem on suffix trees can be solved in linear time.

39 Summary
We introduced suffix tour graphs, which lead to an efficient solution of the reverse problem on suffix trees. (Note that they can be applied to non-binary cases.)
On a binary alphabet, we showed that the problem can be solved in time linear in the input size.
Open Problems
What about non-binary cases? ⇒ It seems to be difficult, since the number of labeling functions g increases combinatorially.
What about the problem in which suffix links are not given? ⇒ I do not have any idea.

40 Exercise?
Compute a string which realizes this tree and links.

41 Hints
These labels are determined uniquely.
[Figure: the exercise tree with the uniquely determined edge labels filled in.]

42 Exercise?
Compute a string which realizes this tree and links.
babaabaaababaa$, babaababaaabaa$, babaaababaabaa$

43 Ex
[Figures, slides 43 to 45: the exercise tree with edge labels; slide 45 adds numeric annotations on the nodes.]

46 babaabaaababaa$ babaababaaabaa$ babaaababaabaa$

47 Symbols
∩ ∪

48 Discarded slides

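49 Preliminaries
Finite alphabet: \( \Sigma \), e.g. \( \Sigma = \{a, b, c, d\} \)
String: \( w \in \Sigma^* \), e.g. \( w = abacd \)
Length of a string \( w \): \( |w| \)
When \( w = xy \), \( y \) is called a suffix of \( w \).
[Figure: the suffixes of “shizuoka” (positions 1 to 8); the suffix of length 0 is not shown.]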

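50 Preliminaries
Finite alphabet: \( \Sigma \), e.g. \( \Sigma = \{a, b, c, d\} \)
String: \( w \in \Sigma^* \), e.g. \( w = abacd \)
Length of a string \( w \): \( |w| \)
The \( i \)-th character of \( w \): \( w[i] \)
The substring of \( w \) from the \( i \)-th to the \( j \)-th character: \( w[i \ldots j] \)
When \( w = xy \), \( x \) is called a prefix of \( w \) and \( y \) a suffix of \( w \).
[Figure: the string abacd split into a prefix x and a suffix y.]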

51 Suffix Tree, Suffix Links
The suffix tree of w is the compacted trie that represents the suffixes of w.
[Figure: suffix tree of w = barbapapa$ (positions 1 to 10) with suffix links; each leaf is numbered with the start index of its suffix.]

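52 Definition of Symbols
Input of the reverse problem \( (T, f) \): a rooted ordered tree \( T \) and \( f : V_{\mathrm{in}} \setminus \{\mathrm{root}\} \to V_{\mathrm{in}} \).
\( V \): the set of nodes of \( T \)
\( E \): the set of edges of \( T \)
\( \mathrm{root} \): the root node of \( T \)
\( V_{\mathrm{in}} \): the set of inner nodes of \( T \)
\( V_{\mathrm{leaf}} \): the set of leaf nodes of \( T \)
For \( v \in V \), \( V(v) \), \( V_{\mathrm{in}}(v) \) and \( V_{\mathrm{leaf}}(v) \) denote the set of nodes, inner nodes and leaf nodes of the subtree rooted at \( v \).
\( \mathrm{children}(v) \): the set of children of \( v \)
\( \mathrm{ch}_i(v) \): the \( i \)-th child of \( v \)
\( \mathrm{par}(v) \): the parent of \( v \), with a convention fixing its value at the root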

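53 L_g and D_g
When a labeling function g satisfies Conditions 1 to 3, we define the following values for any node v.
\( L_g(v) = |\{ u \in V_{\mathrm{leaf}} \mid f(\mathrm{par}(u)) = \mathrm{par}(v),\ g((\mathrm{par}(u), u)) = g((\mathrm{par}(v), v)) \}| \)
\( D_g(v) = \sum_{y \in V(v)} L_g(y) \)
\( L_g(v) \) is the number of leaves u in the following situation: the link of par(u) points to par(v), and the edges (par(u), u) and (par(v), v) carry the same label c.
Constraints on the order of leaves: the next leaf of u is in \( V_{\mathrm{leaf}}(v) \). \( D_g(v) \) leaves of \( V_{\mathrm{leaf}}(v) \) have constraints.
[Figure: the example tree annotated with the values of \( L_g \) and \( D_g \).]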

54 On a Binary Alphabet
In the case of binary alphabets, due to Conditions 1 to 4 it suffices to consider at most five labeling functions. (See the last figure of our paper.)
Theorem: On a binary alphabet, the reverse problem on suffix trees can be solved in linear time.

55 To Find a Valid Labeling Function g
Given g, we can check whether g is valid by constructing \( \mathrm{STG}_g \) and then computing an Eulerian cycle, in time linear in the input size.
In the case of binary alphabets, due to Conditions 1 to 4 it suffices to consider at most five labeling functions. (See the last figure of our paper.)
Theorem: On a binary alphabet, the reverse problem on suffix trees can be solved in linear time.

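56 Algorithm
An algorithm that, when the conditions up to 4-3 are satisfied, checks whether 4-4 is satisfied and, if so, outputs a string.
1. For each node v, compute \( \mathrm{hole}(v) = |V_{\mathrm{LEAF}}(v)| - \sum_{i \ge 1} \mathrm{dep}_i(v) \).
Note: it might be better to prepare a special node above the root so that the root can be written in a uniform way.
[Figure: the exercise tree with edge labels and numeric annotations on the nodes.]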

