1 Ternary Directed Acyclic Word Graphs (TDAWG). Satoru Miyamoto, Shunsuke Inenaga, Masayuki Takeda and Ayumi Shinohara. Presented by Peera Liewlom (The Last Algorithm Group)
2 CIAA 2003: Eighth International Conference on Implementation and Application of Automata, July 16-18, 2003, Santa Barbara, CA, USA. Topic / Committee / Community
3 Why did I select this paper?
DAWG started in 1985, not so long ago
Continuing development: cDAWG, ASDAWG, morphic DAWG, WDAWG, SDAWG, two-tree DAWG, DASG, CSDAWG, etc.
TST: 1997–98; TDAWG: 2003
DAWG: widely applied in bioinformatics, NLP, graph theory, string matching, automata, etc.
Speed and space: trends in huge data management
Topic for the Algorithm Group: matches the interesting topics in this seminar group
4 Content
DFA (used in string matching problems)
DAWG
Ternary Search Tree
Paper: TDAWG, Experiment & Result
Paper: Conclusion
Paper: Discussion
5 DFA Deterministic Finite Automata
6 Formalities
Deterministic Finite Accepter (DFA): M = (Q, Σ, δ, q0, F)
Q: set of states
Σ: input alphabet
δ: transition function
q0: initial state
F: set of final states
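The five components of a DFA can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides: the state names, the alphabet {a, b}, and the example language (strings starting with "ab") are all assumptions, and a missing entry in the transition map is treated as an implicit dead state.

```python
def dfa_accepts(delta, q0, finals, word):
    """Run a DFA given as (delta, q0, finals) on word.

    delta maps (state, symbol) -> next state. A missing entry is
    treated as a transition into a non-accepting dead state.
    """
    state = q0
    for symbol in word:
        key = (state, symbol)
        if key not in delta:
            return False  # fell into the implicit dead state
        state = delta[key]
    return state in finals


# Hypothetical example: accept strings over {a, b} that start with "ab".
DELTA = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q2", "a"): "q2",
    ("q2", "b"): "q2",
}
Q0 = "q0"
FINALS = {"q2"}
```

Calling `dfa_accepts(DELTA, Q0, FINALS, "abba")` follows one transition per input symbol, so the running time is proportional to the length of the word.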
7 Set of States
8 Input Alphabet
9 Initial State
10 Set of Final States
11 Transition Function
15 Transition Function
16 Another Example
17 Accepted language = { all strings with prefix … }
18 Accepted language = { all strings without substring … }
19 DAWG Directed Acyclic Word Graph
20 DAWG
21 DAWG
22 DAWG
23 cDAWG
25 TST Ternary Search Tree
26 TST History
Jon L. Bentley and Robert Sedgewick
Algorithms for Sorting and Searching Strings, Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), January
Ternary Search Trees, Dr. Dobb's Journal, April
Dictionary of Algorithms and Data Structures, National Institute of Standards and Technology
27 BST DST TST
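To make the TST idea concrete, here is a minimal sketch of insertion and lookup in a ternary search tree. It follows the general scheme described by Bentley and Sedgewick (each node holds one character and three children: lower, equal, higher), but the class and function names are illustrative assumptions, and the tree is left unbalanced.

```python
class TSTNode:
    """One TST node: a character plus lower / equal / higher children."""
    __slots__ = ("ch", "lo", "eq", "hi", "is_end")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_end = False  # True if a stored word ends at this node


def tst_insert(node, word, i=0):
    """Insert non-empty word[i:] below node; return the (possibly new) node."""
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_end = True
    return node


def tst_contains(node, word):
    """Return True if word was inserted into the tree rooted at node."""
    if not word:
        return False
    i = 0
    while node is not None:
        ch = word[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        else:
            i += 1
            if i == len(word):
                return node.is_end
            node = node.eq
    return False


def build_tst(words):
    """Build a TST from an iterable of words (empty words are skipped)."""
    root = None
    for w in words:
        if w:
            root = tst_insert(root, w)
    return root
```

The lower/higher links behave like a binary search tree over the characters at one position, while the equal link advances to the next character, which is the combination of BST and digital search tree that the slide title points at.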
29 TDAWG Ternary Directed Acyclic Word Graph
30 Introduction
DFA: how should the transitions of each state be implemented? (time and space efficiency)
TST: “implants” a BST for the transitions –good time
DAWG: smallest DFA for all suffixes –good space
TDAWG
Proof: TDAWG vs. DAWG
31 Hypothesis / Theorem (1/2)
Time = construction + search (usable online)
DFA: transition function δ; Σ = alphabet (Chinese and Japanese: ~1000 characters)
State table: O(|p|) search time, where |p| is the length of the pattern p; the table uses very large memory
Linked list: O(|Σ| × |p|) search time; if |Σ| is large, search time becomes a problem
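The trade-off between these transition stores can be illustrated by counting comparisons for a single state. The sketch below is an assumption-laden toy: the 26-letter alphabet and the edge data are hypothetical, and a binary search over a sorted list stands in for a balanced binary search tree on the edge labels.

```python
# Hypothetical 26-letter alphabet in which every symbol has an outgoing edge.
SIGMA = [chr(c) for c in range(ord("a"), ord("z") + 1)]
EDGES = [(ch, i) for i, ch in enumerate(SIGMA)]  # (label, target state id), sorted by label

# State table: one slot per alphabet symbol, so one step per lookup
# but |Sigma| slots of memory for every state.
TABLE = [target for _, target in EDGES]


def table_lookup(ch):
    """Direct indexing: returns (target, steps)."""
    return TABLE[ord(ch) - ord("a")], 1


def list_lookup(ch):
    """Linked-list style scan: up to |Sigma| comparisons."""
    steps = 0
    for label, target in EDGES:
        steps += 1
        if label == ch:
            return target, steps
    return None, steps


def tree_lookup(ch):
    """Binary search over sorted labels: about log2 |Sigma| comparisons."""
    lo, hi, steps = 0, len(EDGES) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        label, target = EDGES[mid]
        if label == ch:
            return target, steps
        if label < ch:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, steps
```

For the last symbol "z", the list scan needs 26 comparisons while the binary search needs 5; repeating such a lookup for each of the |p| pattern characters gives the per-pattern costs listed above.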
32 Hypothesis / Theorem (2/2)
For TDAWG
–O(|S|) space
–O(log|Σ| × |p|) search time
–O(|Σ| × |S|²) construction time (Bentley & Sedgewick)
–O(|Σ| × |S|) construction time (this paper; adapted from Blumer's online DAWG construction)
Comparison: TDAWG vs. DAWG (table and linked list)
–Space, search time, construction time
33 TST TDAWG
34 Online DAWG Construction
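As a rough illustration of online DAWG construction, the sketch below builds the automaton one character at a time, using the widely known suffix-link and state-cloning scheme (the suffix automaton formulation). It is not the paper's own pseudocode: the names are assumptions, and each state's transitions are kept in a Python dict as a stand-in for the table, linked-list, or tree stores discussed in the talk.

```python
class State:
    """One DAWG state: outgoing edges, suffix link, longest-string length."""
    __slots__ = ("next", "link", "length")

    def __init__(self, length=0, link=-1):
        self.next = {}        # symbol -> index of target state
        self.link = link      # suffix link (-1 for the initial state)
        self.length = length  # length of the longest string reaching this state


def build_dawg(text):
    """Build the DAWG of text online, appending one character per step."""
    states = [State()]
    last = 0  # state that represents the whole text read so far
    for ch in text:
        cur = len(states)
        states.append(State(states[last].length + 1))
        p = last
        # Add the new edge along the suffix-link chain where it is missing.
        while p != -1 and ch not in states[p].next:
            states[p].next[ch] = cur
            p = states[p].link
        if p == -1:
            states[cur].link = 0
        else:
            q = states[p].next[ch]
            if states[p].length + 1 == states[q].length:
                states[cur].link = q
            else:
                # Split q: the clone takes over the shorter strings.
                clone = len(states)
                copy = State(states[p].length + 1, states[q].link)
                copy.next = dict(states[q].next)
                states.append(copy)
                while p != -1 and states[p].next.get(ch) == q:
                    states[p].next[ch] = clone
                    p = states[p].link
                states[q].link = clone
                states[cur].link = clone
        last = cur
    return states


def is_substring(states, pattern):
    """Follow pattern from the initial state; success means it occurs in the text."""
    s = 0
    for ch in pattern:
        s = states[s].next.get(ch)
        if s is None:
            return False
    return True
```

Because every path from the initial state spells a substring of the text, `is_substring` takes one transition lookup per pattern character; the cost of that single lookup is what the choice of transition store changes.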
35 Online TDAWG Construction
36 Experiment Result
37 Conclusion
New data structure: TDAWG
Construction time (English text, alphabet size 256)
–TDAWG < linklistDAWG < tableDAWG
Space requirement
–linklistDAWG < TDAWG (~20%)
–tableDAWG cannot be compared on the same scale
Search time
–Short patterns: tableDAWG is best; TDAWG < linklistDAWG
–Log curve vs. linear curve (long patterns?)
38 Discussion & Future Work
For Asian languages (~1000s of characters), the search-time benefit should be greater than for English (256 characters), because search time is O(log|Σ| × |p|)
Apply to other DAWGs: cDAWG, minimumDAWG, etc.
More efficiency with an AVL tree (AVL balancing)
Bioinformatics has only 4 characters, but a sliding window of 12 characters gives 4^12 symbols
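The sliding-window remark can be made concrete with a short calculation. It assumes that each 12-character window over the 4-letter alphabet is treated as a single symbol and that logarithms are taken base 2; neither assumption is stated on the slide.

```latex
|\Sigma| = 4^{12} = 2^{24} = 16{,}777{,}216,
\qquad
\log_2 |\Sigma| = 24 .
```

Under these assumptions, a linked-list transition may scan up to about 16.8 million edges per pattern symbol, whereas a balanced binary tree on the edge labels needs roughly 24 comparisons.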