Notes on Sequence Binary Decision Diagrams: Relationship to Acyclic Automata and Complexities of Binary Set Operations. Shuhei Denzumi (1), Ryo Yoshinaka (2, 1), Shin-ichi Minato (1, 2), and Hiroki Arimura (1). (1) Hokkaido University; (2) JST ERATO Minato Discrete Structure Manipulation System Project. My name is Shuhei Denzumi.
Background. Research on string processing has become more active recently. The internet and sensing networks provide massive online data every day, and string matching and string mining are required to handle these data, for example in data mining. Efficient data structures are therefore needed to solve these problems: the input data should be represented in compact form, and computation should be carried out directly on the compressed structure. [Figure: input data is compressed into a data structure, and operations on that structure produce the result.] Notes on Sequence Binary Decision Diagrams: Relationship to Acyclic Automata and Complexities of Binary Set Operations by Shuhei Denzumi, Ryo Yoshinaka, Shin-ichi Minato, and Hiroki Arimura, 2011-08-30 (TUE), Prague Stringology Conference 2011
Manipulatable and compact data structures. Among the many types of data structures, we focus on manipulatable compact data structures. Such a structure not only represents data in compressed form but also provides efficient operations that combine several structures while they stay compressed. These structures have received much attention in recent years. Examples are the Binary Decision Diagram (BDD), which is widely used in LSI design, and the Deterministic Finite Automaton (DFA), which is applied in natural language processing. [Figure: inputs are compacted into data structures D1 and D2, and an operation on them yields D3.]
Sequence Binary Decision Diagram. What is a Sequence BDD? The Sequence Binary Decision Diagram (SeqBDD, SDD) is a new manipulatable compact data structure, proposed by Loekito, Bailey, and Pei in 2009. It is a graph structure that represents finite sets of strings of finite length. Because it is new, its basic properties (minimization, size complexity, and operation time) are still unknown. We think the SDD is applicable to data mining, graph mining, enumeration, text processing, and bioinformatics tasks such as human genome sequencing.
Family of BDDs: compact representations for discrete structures with rich algebraic operations. BDD [Bryant 1986]: Boolean functions, e.g. xy ∨ yz ∨ zx and ¬xyz ∨ x¬yz ∨ xy¬z. ZDD [Minato 1993]: sets of combinations, e.g. {{a}, {b}, {a, b}} and {{a}, {b}, {c}, {a, b, c}}. SDD [Loekito et al. 2009]: sets of strings, e.g. {a, b, ab, bab, abbab} and {abc, acb, bac, bca}. The SDD is a new member of the BDD family. Each member of the family represents a different type of discrete structure. The BDD, the original member of the family, manipulates Boolean functions. The ZDD manipulates sets of combinations, and the SDD manipulates sets of strings.
Results. Our results can be divided into two parts. First, we study the relationship between SDDs and Acyclic Deterministic Finite Automata (ADFAs): we give translations from an SDD to an ADFA and vice versa, and we prove that an SDD is never larger than an equivalent ADFA and can be |Σ| times smaller, where |Σ| is the alphabet size. Second, we analyze the computational complexity of binary set operations on SDDs: we generalize eight set operations and give a tight analysis of the time complexity of the binary set operation algorithm. We also conducted experiments, which show that SDDs can be smaller than ADFAs on real data and which measure binary operation time.
Preliminary
Definition. Let Σ be an alphabet totally ordered by ≺. An SDD is a directed acyclic graph composed of internal nodes and two terminal nodes, the 0-terminal and the 1-terminal. Each internal node has exactly two outgoing edges, a 1-edge and a 0-edge. A 1-edge is drawn as a solid line and a 0-edge as a broken line. An internal node S is represented by a triple τ(S) = 〈S.lab, S.1, S.0〉, where S.lab is the letter labeling the node, S.1 is the node pointed to by the 1-edge of S, and S.0 is the node pointed to by the 0-edge of S. We call S.1 and S.0 the 1-child and the 0-child of S, respectively. Internal nodes follow the ordering rule N.lab ≺ (N.0).lab: the letter labeling a node must be smaller than the letter labeling its 0-child (when the 0-child is an internal node). [Figure: node S with label S.lab, 1-child S.1, and 0-child S.0; an example chain of nodes labeled a, b, c linked by 0-edges; the letter order a ≺ b ≺ … ≺ z.]
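The definition above can be sketched in code. This is only an illustration under assumptions that are not in the talk: the names `Node` and `obeys_ordering` are hypothetical, the terminals are encoded as the integers 0 and 1, and the order ≺ is taken to be Python's string order.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """Internal SDD node, i.e. the triple tau(S) = <S.lab, S.1, S.0>."""
    lab: str      # letter labeling the node
    one: object   # 1-child: a Node or a terminal (0 or 1)
    zero: object  # 0-child: a Node or a terminal (0 or 1)


def obeys_ordering(n) -> bool:
    """Check N.lab < (N.0).lab for every internal node reachable from n."""
    if not isinstance(n, Node):
        return True  # terminals carry no label, so there is nothing to check
    if isinstance(n.zero, Node) and not (n.lab < n.zero.lab):
        return False
    return obeys_ordering(n.one) and obeys_ordering(n.zero)


# A chain a -> b -> c linked by 0-edges; every 1-edge goes to the 1-terminal.
c = Node("c", 1, 0)
b = Node("b", 1, c)
a = Node("a", 1, b)

# A node whose 0-child has a smaller label violates the ordering rule.
bad = Node("b", 1, Node("a", 1, 0))

print(obeys_ordering(a), obeys_ordering(bad))
```

Using an immutable tuple keeps the sketch short; a real implementation would more likely use node identifiers managed by a table.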
Semantics. L(N) denotes the set of strings that node N represents: L(1-terminal) = {ε}, L(0-terminal) = {}, and L(N) = N.lab・L(N.1) ∪ L(N.0) for an internal node N. A path from the root to the 1-terminal node represents a string. Every SDD node represents a set of strings. The 1-terminal node represents the singleton of the empty string, and the 0-terminal node represents the empty set. The set of strings of an internal node is defined recursively by the formula above. The set of strings of an SDD is the one represented by its root node. For convenience, we identify a node with the SDD consisting of all the nodes reachable from that node. [Figure: an example SDD over the letters a and b whose nodes represent {ε}, {}, {b}, {a, b}, {bb}, and {aa, ab, bb}.]
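The recursive definition of L(N) translates almost directly into code. The following is a minimal sketch, assuming a hypothetical tuple encoding of nodes with the integers 0 and 1 as terminals; it rebuilds the example from the slide, whose root represents {aa, ab, bb}.

```python
from typing import NamedTuple


class Node(NamedTuple):
    lab: str      # letter labeling the node
    one: object   # 1-child: a Node or a terminal (0 or 1)
    zero: object  # 0-child: a Node or a terminal (0 or 1)


def language(n) -> set:
    """Return L(n) following L(N) = N.lab . L(N.1) union L(N.0)."""
    if n == 1:
        return {""}   # 1-terminal: the set containing only the empty string
    if n == 0:
        return set()  # 0-terminal: the empty set
    return {n.lab + w for w in language(n.one)} | language(n.zero)


# The nodes of the slide's example, built bottom-up.
n_b = Node("b", 1, 0)         # {b}
n_ab = Node("a", 1, n_b)      # {a, b}
n_bb = Node("b", n_b, 0)      # {bb}
root = Node("a", n_ab, n_bb)  # a.{a, b} union {bb} = {aa, ab, bb}

print(sorted(language(root)))
```

Enumerating the whole set is exponential in general; it is useful here only to check small examples.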
Comparison to ADFA. Let us compare the SDD with the Acyclic Deterministic Finite Automaton (ADFA). Roughly speaking, one can think of an SDD as a binarized ADFA. A node of an ADFA with three outgoing edges labeled a, b, and c corresponds to three SDD nodes labeled a, b, and c. In an SDD, one checks whether the input symbol is a and, if so, takes the 1-edge of the node labeled a; otherwise one checks whether it is b, and so on. This sequential process is represented by a chain of SDD nodes linked by 0-edges. [Figure: an ADFA with accept and reject states beside the equivalent SDD; corresponding nodes represent {b}, {a, b}, {bb}, and {aa, ab, bb}.]
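The binarization described above can be sketched as a translation from an ADFA to an SDD. This is an illustration rather than the construction from the paper: the ADFA encoding (a dictionary `delta` from state to outgoing edges plus a set `accepting`), the function names, and the integer terminals are all assumptions made for the example.

```python
from typing import NamedTuple


class Node(NamedTuple):
    lab: str
    one: object   # 1-child: a Node or a terminal (0 or 1)
    zero: object  # 0-child: a Node or a terminal (0 or 1)


def language(n) -> set:
    """L(N) = N.lab . L(N.1) union L(N.0), with L(1) = {""} and L(0) = {}."""
    if not isinstance(n, Node):
        return {""} if n == 1 else set()
    return {n.lab + w for w in language(n.one)} | language(n.zero)


def adfa_to_sdd(delta, accepting, state, memo=None):
    """Turn each outgoing edge of `state` into one SDD node on a 0-edge chain."""
    if memo is None:
        memo = {}
    if state in memo:
        return memo[state]
    # The chain ends in the 1-terminal exactly when the state accepts epsilon.
    result = 1 if state in accepting else 0
    # Add letters from largest to smallest so the smallest label ends on top.
    for letter in sorted(delta.get(state, {}), reverse=True):
        one = adfa_to_sdd(delta, accepting, delta[state][letter], memo)
        if one != 0:  # an edge into a dead state contributes no node
            result = Node(letter, one, result)
    memo[state] = result
    return result


# Hypothetical ADFA accepting {aa, ab, bb}.
delta = {
    "q0": {"a": "q1", "b": "q2"},
    "q1": {"a": "qf", "b": "qf"},
    "q2": {"b": "qf"},
}
root = adfa_to_sdd(delta, {"qf"}, "q0")
print(sorted(language(root)))
```

Each ADFA edge yields at most one SDD node here, which is consistent with the claim that an SDD is never larger than an equivalent ADFA.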
Reduction process. An SDD can be reduced by two rules. The first is the suppression rule: an internal node whose 1-edge points to the 0-terminal node is deleted, so that every remaining node N satisfies N.1 ≠ 0-terminal. By definition, such a node P with label a and 0-child Q represents L(P) = a・{} ∪ L(Q) = L(Q), the same set of strings as its 0-child, so it is redundant. In an ADFA, this corresponds to removing edges that point to a dead state. The second is the merging rule: if two different nodes N and N' have the same triple of label, 1-child, and 0-child, they are merged, so that τ(N) = τ(N') ⇒ N = N'. In an ADFA, this corresponds to sharing all equivalent nodes. Theorem: under these rules, the reduced SDD is unique and minimal. After the reduction process an SDD is in canonical form, just as ADFAs have a canonical form.
Characteristics. As I have explained, an SDD is almost isomorphic to an ADFA. However, there are some differences. As an advantage, BDD/ZDD techniques are applicable to SDDs. The binary form allows simple recursive algorithms, which makes SDDs easy to implement. The rich collection of operations inherited from ZDDs can be applied to SDDs. Like other members of the BDD family, SDDs use hash tables, both to share equivalent nodes, which keeps SDDs always reduced, and to share intermediate computations, which makes set operations efficient. [Figure: the SDD positioned between BDD/ZDD and ADFA.]
Relationship to Acyclic Automata
Size. We define the sizes of an SDD and an ADFA. The size |N| of an SDD N is the number of its internal nodes. The size |A| of an ADFA A is the number of its edges. We choose these measures because an SDD node corresponds to an ADFA edge, and the description size of each structure is proportional to its measure. [Figure: an ADFA node with outgoing edges a, b, c beside the corresponding SDD nodes labeled a, b, c.]
Theorem: size comparison. We compare the size of an SDD with that of an ADFA representing the same set of strings, and we proved two theorems on size. First, translating an ADFA A into an SDD N, the SDD is never larger than the ADFA: |N| ≤ |A|. Second, translating an SDD N into an ADFA A, the ADFA can be |Σ| times larger than the SDD. In other words, an SDD can be |Σ| times smaller than the ADFA, and there are cases in which this bound is attained.
0-child sharing. This slide illustrates why an SDD can be smaller than an ADFA. Suppose that an ADFA has two distinct nodes that each have an edge labeled by the same letter going to the same node, like the edges labeled c, d, and e. In the ADFA those edges are not merged, but in the SDD they become nodes, and those nodes are merged. We have 8 edges in the right figure but only 5 nodes in the left. [Figure: an SDD with nodes labeled a, b, c, d, e (left) and the equivalent ADFA with edges labeled a, b, c, d, e (right).]
Example. This slide shows a typical case in which the SDD is smaller than the ADFA. Assume that we want to represent three sets of strings: {a, b, c}, {b, c}, and {c}. An SDD can represent these sets with just 3 nodes, but an ADFA needs 6 edges. In an SDD the nodes carry the labels, whereas in an ADFA the edges do. Both structures merge equivalent nodes, but labeled ADFA edges cannot be merged, while the corresponding SDD nodes can. This causes the difference in size. The figure is an example in which the SDD is asymptotically |Σ| times smaller: for the set {a^n b^i c^j : n = 0, …, 4, i, j = 0, 1}, the SDD S has |S| = 6 nodes and the ADFA A has |A| = 14 edges. [Figure: ADFA A and SDD S for this set.]
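The count of 3 nodes versus 6 edges for the sets {a, b, c}, {b, c}, and {c} can be checked with a short sketch. The node encoding, the helper `internal_nodes`, and the ADFA state names are hypothetical and chosen only for this illustration.

```python
from typing import NamedTuple


class Node(NamedTuple):
    lab: str
    one: object   # 1-child: a Node or a terminal (0 or 1)
    zero: object  # 0-child: a Node or a terminal (0 or 1)


def internal_nodes(roots) -> set:
    """Collect the distinct internal nodes reachable from the given roots."""
    seen = set()
    stack = list(roots)
    while stack:
        n = stack.pop()
        if isinstance(n, Node) and n not in seen:
            seen.add(n)
            stack.extend((n.one, n.zero))
    return seen


# SDD side: {c} is the 0-child of {b, c}, which is the 0-child of {a, b, c}.
s_c = Node("c", 1, 0)
s_bc = Node("b", 1, s_c)
s_abc = Node("a", 1, s_bc)
sdd_size = len(internal_nodes([s_abc, s_bc, s_c]))

# ADFA side: one state per set, each letter being an edge to the accept state.
delta = {
    "abc": {"a": "acc", "b": "acc", "c": "acc"},
    "bc": {"b": "acc", "c": "acc"},
    "c": {"c": "acc"},
}
adfa_size = sum(len(edges) for edges in delta.values())

print(sdd_size, adfa_size)  # 3 6
```

The three SDD roots share one chain of nodes, while each ADFA state needs its own copy of every outgoing edge.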
Experiment. Input: the Canterbury corpus. BibleAll is the set of sentences of bible.txt, BibleBi is the set of all bigrams obtained from bible.txt, and Ecoli is E.coli.txt, a single genome sequence. (Fac) means that the SDD and the ADFA store all factors of the input data. This experiment shows the ratio of the sizes of the SDD and the ADFA for different sets of strings, so we can see the effect of merging in binary form on real data. For BibleAll the sizes are almost equal, but for the other data sets the SDDs are about 10 to 20% smaller than the ADFAs.
Binary Set Operation Algorithm. The SDD has inherited from the BDD its algorithm for computing binary set operations. We define the algorithm and analyze its time complexity.
Set operation. Let ♢ ∈ {∪, ∩, \, …} be a binary set operation. Input: two SDDs P and Q. Output: the SDD R such that L(R) = L(P) ♢ L(Q). We would like to compute, for example, the union or the intersection of two sets. By P ♢ Q we denote the unique SDD R that represents the set L(R) = L(P) ♢ L(Q). A binary set operation is thus a method to create a new SDD from two given input SDDs. [Figure: SDDs P and Q enter a binary set operation that outputs P ♢ Q.]
Apply algorithm. To compute a binary set operation P ♢ Q, we use the algorithm named apply. It was proposed for BDDs by Bryant in 1986, but it can be applied to SDDs too. It computes P ♢ Q recursively, based on the definition L(N) = N.lab・L(N.1) ∪ L(N.0). When P.lab = Q.lab, L(P) ♢ L(Q) = P.lab・(L(P.1) ♢ L(Q.1)) ∪ (L(P.0) ♢ L(Q.0)). The computation of P ♢ Q is therefore decomposed into two subproblems, the 1-side recursion that computes P1 ♢ Q1 and the 0-side recursion that computes P0 ♢ Q0. This simple recursion is enough to create the SDD P ♢ Q. [Figure: nodes P = 〈a, P1, P0〉 and Q = 〈a, Q1, Q0〉 combined into a node labeled a with 1-child P1 ♢ Q1 and 0-child P0 ♢ Q0.]
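The recursion can be sketched as follows. The talk states only the case P.lab = Q.lab; the handling of different labels and of terminals below is an assumption of this sketch (a node whose label is not the smaller one is treated as having the 0-terminal as its 1-child and itself as its 0-child), as are the names `apply_op` and `getnode`, the tuple encoding, and the requirement that the operation map (False, False) to False.

```python
import operator
from typing import NamedTuple


class Node(NamedTuple):
    lab: str
    one: object   # 1-child: a Node or a terminal (0 or 1)
    zero: object  # 0-child: a Node or a terminal (0 or 1)


def language(n) -> set:
    if not isinstance(n, Node):
        return {""} if n == 1 else set()
    return {n.lab + w for w in language(n.one)} | language(n.zero)


def getnode(lab, one, zero):
    """Suppression rule: a node whose 1-child is the 0-terminal is skipped."""
    return zero if one == 0 else Node(lab, one, zero)


def apply_op(op, p, q, cache=None):
    """Compute P <> Q, where op decides membership from two booleans."""
    if cache is None:
        cache = {}
    if not isinstance(p, Node) and not isinstance(q, Node):
        return int(op(bool(p), bool(q)))  # both terminals: decide epsilon
    key = (op, p, q)                      # opcache key <operation, P, Q>
    if key in cache:
        return cache[key]
    x = min(n.lab for n in (p, q) if isinstance(n, Node))
    p1, p0 = (p.one, p.zero) if isinstance(p, Node) and p.lab == x else (0, p)
    q1, q0 = (q.one, q.zero) if isinstance(q, Node) and q.lab == x else (0, q)
    r = getnode(x, apply_op(op, p1, q1, cache), apply_op(op, p0, q0, cache))
    cache[key] = r
    return r


n_b = Node("b", 1, 0)
P = Node("a", Node("a", 1, n_b), Node("b", n_b, 0))  # {aa, ab, bb}
Q = Node("a", n_b, n_b)                              # {ab, b}

union = apply_op(operator.or_, P, Q)
inter = apply_op(operator.and_, P, Q)
diff = apply_op(lambda a, b: a and not b, P, Q)
print(sorted(language(union)), sorted(language(inter)), sorted(language(diff)))
```

Hashing nested tuples is not constant time, so this sketch does not reproduce the cost model of the analysis; an implementation that keys its tables on node identity would.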
Hash table technique. While computing set operations, we use two key-value hash tables. The first is called the uniquetable. Its key is a triple 〈x, N1, N0〉 of a letter, a 1-child, and a 0-child, and its value is the SDD node N with τ(N) = 〈x, N1, N0〉. It prevents the creation of two nodes that have the same label, 1-child, and 0-child. The second is called the opcache. Its key is also a triple, 〈♢, P, Q〉, of an operation id and two SDD nodes, and its value is the SDD node R = P ♢ Q that results from the binary operation. It avoids repeating a computation that has already been done.
Node creation process. The uniquetable is used in the process that creates nodes, and every SDD node needed during a computation is created via this process. Check the uniquetable for the key 〈x, N1, N0〉. If an entry exists, return it. If not, create a new node, register it, and return it. Once an internal node is registered in the uniquetable, no equivalent node is ever created again.
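The node creation process can be sketched with a dictionary as the uniquetable. This is a minimal illustration, not the authors' implementation: the class and function names are hypothetical, the terminals are the integers 0 and 1, and folding the suppression rule into the same function is an assumption of this sketch.

```python
class Node:
    """Internal SDD node, compared and hashed by identity."""
    __slots__ = ("lab", "one", "zero")

    def __init__(self, lab, one, zero):
        self.lab = lab
        self.one = one
        self.zero = zero


uniquetable = {}  # key <x, N1, N0> -> the unique node with that triple


def getnode(lab, one, zero):
    """Return the unique node for <lab, one, zero>, creating it if needed."""
    if one == 0:
        return zero  # suppression rule (assumed to be applied here)
    key = (lab, one, zero)
    node = uniquetable.get(key)
    if node is None:  # not registered yet: create and register a new node
        node = Node(lab, one, zero)
        uniquetable[key] = node
    return node


first = getnode("b", 1, 0)
second = getnode("b", 1, 0)      # same triple: the registered node comes back
skipped = getnode("a", 0, first)  # 1-child is the 0-terminal: no node created

print(first is second, skipped is first, len(uniquetable))
```

Because nodes are hashed by identity, each table lookup touches only the three components of the key, which matches the assumption of constant-time hash access.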
Time complexity. We show the time complexity of the apply algorithm when P ♢ Q is executed. Every operation uses the opcache: check the opcache for the key 〈♢, P, Q〉; if an entry exists, P ♢ Q has already been computed, so return it; if not, continue the computation on the 0-side and the 1-side. It is easy to see that at most |P| × |Q| different instances of recursive calls are invoked. So, assuming that the access time to hash tables is constant, we can complete the computation in O(|P| |Q|) time by storing the results of subproblems in the opcache. A naïve method would prepare a table of size |P| × |Q| for this purpose. Our method creates no useless node and no multiple equivalent nodes. Theorem: the worst-case time is O(|P| |Q|), and there exists an example that needs Ω(|P| |Q|) time, so the upper and lower bounds match. Although our method does not improve the worst-case complexity, our experiments demonstrate that the apply algorithm is much more efficient than the naïve approach.
Experiment: operation time. We prepare two SDDs that store all factors of given random texts of length n and measure the time to compute set operations on them. The lines in the plot indicate the computation time of the set operations for these SDDs. We can see that the apply algorithm runs in almost linear time in practice.
Conclusion. Relationship to acyclic automata: an SDD can be |Σ| times smaller than an ADFA, and for real data SDDs are 10 to 20% more compact than ADFAs. Computational complexity of binary set operations: the worst-case time complexity is quadratic, and this bound is tight; in our experiments the operation time is almost linear. Future work: efficient implementation of various operations, a substring index on SDDs, and a construction algorithm for factor SDDs.
Thank you!