On the Sorting-Complexity of Suffix Tree Construction. Martin Farach-Colton, Paolo Ferragina, S. Muthukrishnan.

Fact From the Previous Talk (Harel and Tarjan 1984; Bender and Farach-Colton 2000): A tree T with m nodes can be preprocessed in O(m) time so that, for any pair of its nodes u, v, lca(u, v) can be computed in constant time.
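The constant-time lca fact can be illustrated with a simpler stand-in. The sketch below uses an Euler tour plus a sparse table, which answers queries in O(1) but needs O(m log m) preprocessing rather than the O(m) of the cited results. The name `build_lca` and the `children` mapping are assumptions made for this illustration only.

```python
def build_lca(children, root=0):
    """Return an lca(u, v) function for a rooted tree.

    children: mapping node -> list of child nodes (hypothetical input format).
    Preprocessing is O(m log m) here, not the O(m) of the cited algorithms.
    """
    euler, depth, first = [], [], {}

    def visit(node, d):
        # Record the node on entry and again after each child returns.
        first[node] = len(euler)
        euler.append(node)
        depth.append(d)
        for child in children[node]:
            visit(child, d + 1)
            euler.append(node)
            depth.append(d)

    visit(root, 0)

    # table[j][i] = index of the shallowest tour entry in euler[i : i + 2**j]
    table = [list(range(len(euler)))]
    j = 1
    while (1 << j) <= len(euler):
        prev, half, row = table[-1], 1 << (j - 1), []
        for i in range(len(euler) - (1 << j) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if depth[a] <= depth[b] else b)
        table.append(row)
        j += 1

    def lca(u, v):
        lo, hi = sorted((first[u], first[v]))
        k = (hi - lo + 1).bit_length() - 1
        a, b = table[k][lo], table[k][hi - (1 << k) + 1]
        return euler[a if depth[a] <= depth[b] else b]

    return lca


# Small tree: 0 is the root, 1 and 2 are its children, 3 and 4 hang below 1.
lca = build_lca({0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []})
```

Each query only looks up two table entries, which is what makes the later lcp computations constant time once the tree is preprocessed.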

What's in This Paper
Bounds depend on the alphabet:
– Constant-size alphabet: O(n) (Weiner 1973)
– Unbounded alphabet: Θ(n log n)
– Alphabet {1…n}: linear time
RAM algorithm
DAM algorithm (I/O optimal)
Algorithm also works for PRAM, PDAM

Talk Outline
Suffix trees
– Reminder
– Tools
RAM algorithm for suffix tree construction
Conclusion

Suffix Trees
S = 121112212221$, n = 13, e.g. S[8,13] = 12221$
[Figure: the suffix tree of S.]

Suffix Tree Representation [Figure: the suffix tree of S with its edge labels.]

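Properties of Suffix Trees
For a node v, σ(v) is the string spelled on the path from the root to v, and L is its length.
lcp(σ(v), σ(w)) = |σ(lca(v, w))|
[Figure: example nodes with σ = 1, L = 1; σ = 12, L = 2; σ = 11, L = 2; nodes v, w and lca(v, w) marked.]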

Suffix Links
Lemma [Weiner 1973]: Let a ∈ Σ and α ∈ Σ*. If there is a node v in T_S such that σ(v) = aα, then there is a node w in T_S such that σ(w) = α.
Define the suffix link as sl(v) = w.

Suffix Links [Figure: nodes with σ = 1, L = 1; σ = 12, L = 2; σ = 122, L = 3; σ = 2, L = 1, connected by suffix links.]

Suffix Links Example

Suffix Arrays
Let S_1, …, S_k be strings over Σ with |S_i| = n_i, and let T be the compacted trie of these strings.
An in-order traversal of the leaves gives the strings in lexicographical order: S_{p_1}, …, S_{p_k}
– sort array: A_T[i] = p_i
– longest common prefix array: LCP_T[i] = lcp(S_{p_i}, S_{p_{i+1}})
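The two arrays can be computed directly from their definitions. The following is a minimal sketch that sorts with Python's built-in comparison sort rather than via a trie traversal; the function names are assumptions, and the terminator `$` is assumed to sort before the digits (as it does in ASCII).

```python
def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def sort_and_lcp_arrays(strings):
    """Return (A, LCP) for a list of distinct strings.

    A[i] is the 1-based index of the i-th smallest string;
    LCP[i] is the lcp of the i-th and (i+1)-th smallest strings.
    """
    order = sorted(range(len(strings)), key=lambda i: strings[i])
    A = [i + 1 for i in order]
    LCP = [lcp(strings[order[k]], strings[order[k + 1]])
           for k in range(len(order) - 1)]
    return A, LCP


A, LCP = sort_and_lcp_arrays(["122$", "1$", "12$", "2$"])
```

For this input the sorted order is 1$, 12$, 122$, 2$, so `A` is `[2, 3, 1, 4]` and `LCP` is `[1, 2, 0]`.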

Suffix Array Example [Figure: the arrays A_T and LCP_T next to the example tree.]

RAM Algorithm
Input: string S
Output: T_S
Divide and conquer:
1. Recursively compute T_o, the compacted trie of the suffixes beginning at odd positions
2. Compute T_e, the compacted trie of the suffixes beginning at even positions, from T_o (no second recursive call is needed)
3. Merge T_e and T_o to get T_S
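As a rough check on the linear-time claim, assume (as the later slides on the even tree suggest) that only one half-size subproblem is solved recursively, and that compressing, decompressing, building the even tree and merging together cost at most cn for some constant c. Under those assumptions the running time T(n) satisfies:

```latex
\[
T(n) \le T(n/2) + cn
     \le cn\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots\right)
     \le 2cn = O(n).
\]
```

With two recursive calls of size n/2 and linear extra work, the same argument would only give O(n log n).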

Divide and Conquer Scheme [Figure: recursion diagram. Divide: A(n) splits into A(n/2), then A(n/4). Conquer. Merge: solutions S(n/2) are combined into S(n).]

RAM Algorithm Scheme [Figure: pipeline. Divide: S with |S| = n, Σ = [n] is compressed to S' with |S'| = n/2, Σ' = [n/2]. Conquer: T_{S'} is built recursively and turned into A_{T_{S'}}, LCP_{T_{S'}}, from which A_{T_o}, LCP_{T_o} and then A_{T_e}, LCP_{T_e} are derived. Merge: these give A_T, LCP_T and finally T_S.]

Switching Representations [Figure: the RAM algorithm scheme again.]

Suffix Tree → Suffix Array [Figure: A_T and LCP_T read off the example tree.]

Suffix Array → Suffix Tree [Figure: the example tree rebuilt from A_T and LCP_T.]
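The slide's figure for this conversion is not recoverable, so the following is only one standard way to do it, not necessarily the one presented: scan the sorted suffixes left to right and keep the rightmost root-to-leaf path on a stack. The dictionary node layout and both function names are assumptions; `n` is the string length including the terminator.

```python
def tree_from_arrays(A, LCP, n):
    """Build a compacted trie from a suffix array A (1-based positions)
    and its LCP array. Each node stores its string depth, its children
    in sorted order, and the suffix position if it is a leaf."""
    root = {"depth": 0, "children": [], "leaf": None}
    stack = [root]                      # rightmost path, root first
    for k, p in enumerate(A):
        h = LCP[k - 1] if k > 0 else 0  # lcp with the previous suffix
        last = None
        while stack[-1]["depth"] > h:   # climb until string depth <= h
            last = stack.pop()
        if stack[-1]["depth"] < h:
            # Split the edge above `last` with a new internal node at depth h.
            mid = {"depth": h, "children": [last], "leaf": None}
            stack[-1]["children"][-1] = mid
            stack.append(mid)
        leaf = {"depth": n - p + 1, "children": [], "leaf": p}
        stack[-1]["children"].append(leaf)
        stack.append(leaf)
    return root


def leaves_in_order(node):
    """Left-to-right leaf positions: this reads the suffix array back."""
    if node["leaf"] is not None:
        return [node["leaf"]]
    return [p for child in node["children"] for p in leaves_in_order(child)]


# Suffixes of "121$" in sorted order ($ smallest): $, 1$, 121$, 21$
root = tree_from_arrays([4, 3, 1, 2], [0, 1, 0], 4)
```

Every node is pushed and popped at most once, so the scan is linear in the number of suffixes. Reading the leaves back from left to right gives the suffix array again, which is the reverse conversion.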

Compressing S
Input: S with |S| = n, Σ = [n]
Map character pairs into single characters:
– For i = 1 to n/2, form the pairs ⟨S[2i-1], S[2i]⟩
– Sort the pairs lexicographically by radix sort: O(n)
– Remove duplicates
S'[i] = rank of ⟨S[2i-1], S[2i]⟩
Now |S'| = n/2 and Σ' = [n/2]

Example: S = 121112212221$, Σ = [13]
1. Pairs: ⟨1,2⟩ ⟨1,1⟩ ⟨1,2⟩ ⟨2,1⟩ ⟨2,2⟩ ⟨2,1⟩
2. Ordered pairs: ⟨1,1⟩ ⟨1,2⟩ ⟨1,2⟩ ⟨2,1⟩ ⟨2,1⟩ ⟨2,2⟩
3. Duplicates removed: ⟨1,1⟩ ⟨1,2⟩ ⟨2,1⟩ ⟨2,2⟩
4. S' = 212343$, Σ' = [4]
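The compression step can be sketched as follows. The function names are assumptions, the string is given as a list of integers without its terminator, and its length is assumed to be even. A dictionary is used to look up ranks, which a strict RAM implementation would replace by an array indexed by sorted position.

```python
def radix_sort_pairs(pairs, sigma):
    """Sort pairs of integers in 1..sigma with two stable bucket passes,
    least significant component first."""
    for key in (1, 0):
        buckets = [[] for _ in range(sigma + 1)]
        for pair in pairs:
            buckets[pair[key]].append(pair)
        pairs = [pair for bucket in buckets for pair in bucket]
    return pairs


def compress(S):
    """Replace each pair (S[2i-1], S[2i]) by its rank among distinct pairs."""
    pairs = [(S[2 * i], S[2 * i + 1]) for i in range(len(S) // 2)]
    ordered = radix_sort_pairs(pairs, max(S))
    rank = {}
    for pair in ordered:            # duplicates are adjacent after sorting
        if pair not in rank:
            rank[pair] = len(rank) + 1
    return [rank[pair] for pair in pairs]


# The string formed by the six pairs of the example above.
S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1]
S_prime = compress(S)
```

On the example this yields 2 1 2 3 4 3, matching step 4. Because ranks preserve the order of pairs, comparing suffixes of S' is equivalent to comparing the corresponding odd suffixes of S.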

Decompressing S
Input: A_{T_{S'}}, LCP_{T_{S'}}
Notice: S[2i-1] ⋯ S[n]$ = S'[i] ⋯ S'[n/2]$
\[ A_{T_o}[i] = 2 \cdot A_{T_{S'}}[i] - 1 \]
\[ LCP_{T_o}[i] = 2 \cdot LCP_{T_{S'}}[i] + \begin{cases} 1 & \text{if } S[A_{T_o}[i] + 2 \cdot LCP_{T_{S'}}[i]] = S[A_{T_o}[i+1] + 2 \cdot LCP_{T_{S'}}[i]] \\ 0 & \text{otherwise} \end{cases} \]
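The two formulas translate almost line by line into code. This is a sketch under stated assumptions: positions are 1-based as on the slide, the terminator is represented by 0 (so it differs from every symbol), |S| is even, and the name `decompress` is chosen for illustration.

```python
def decompress(S, A_sp, LCP_sp):
    """Turn the arrays of the compressed string S' into those of the
    odd suffixes of S. S is a list of positive integers of even length."""
    T = S + [0]                     # position n+1 holds the terminator

    def at(pos):                    # 1-based access, as on the slide
        return T[pos - 1]

    A_odd = [2 * p - 1 for p in A_sp]
    LCP_odd = []
    for k in range(len(A_odd) - 1):
        l = 2 * LCP_sp[k]           # each matching pair is two characters
        # One extra character matches if the first halves of the next pairs agree.
        extra = 1 if at(A_odd[k] + l) == at(A_odd[k + 1] + l) else 0
        LCP_odd.append(l + extra)
    return A_odd, LCP_odd


S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1]
# Arrays of S' = 2 1 2 3 4 3, computed by hand with the terminator smallest.
A_odd, LCP_odd = decompress(S, [2, 1, 3, 6, 4, 5], [0, 1, 0, 1, 0])
```

For the running example this gives the odd suffixes in the order 3, 1, 5, 11, 7, 9 with lcp values 1, 2, 0, 2, 1.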

Building the Even Tree
Input: A_{T_o}, LCP_{T_o}
Observation: if P is an even suffix of S, then P = aP' where P' is an odd suffix of S.
To get A_{T_e}, apply radix sort to the even suffixes S[2i,n] using the keys ⟨S[2i], S[2i+1,n]⟩.
\[ lcp(S[2i,n], S[2j,n]) = \begin{cases} lcp(S[2i+1,n], S[2j+1,n]) + 1 & \text{if } S[2i] = S[2j] \\ 0 & \text{otherwise} \end{cases} \]
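Since the odd suffixes are already sorted, the radix sort reduces to a single stable bucket pass on the first character. The sketch below assumes 1-based positions, even |S|, and a terminator that sorts before every symbol; the function name is hypothetical. The lcp of two non-adjacent odd suffixes is taken as a plain minimum over a slice of LCP_{T_o}, which is not constant time; a range-minimum structure would be needed for a truly linear bound.

```python
def build_even_arrays(S, A_odd, LCP_odd):
    """Derive the sorted even suffixes and their LCP array from the odd ones."""
    n = len(S)
    # Sorted odd suffixes, preceded by the terminator-only suffix at n+1.
    odd_order = [n + 1] + A_odd
    odd_lcp = [0] + LCP_odd         # odd_lcp[r] = lcp(odd_order[r], odd_order[r+1])
    rank = {p: r for r, p in enumerate(odd_order)}

    # One stable bucket pass keyed on the first character S[2i].
    buckets = [[] for _ in range(max(S) + 1)]
    for p in odd_order:
        if p >= 3:                  # the even suffix p-1 exists
            buckets[S[p - 2]].append(p - 1)
    A_even = [q for bucket in buckets for q in bucket]

    LCP_even = []
    for a, b in zip(A_even, A_even[1:]):
        if S[a - 1] != S[b - 1]:
            LCP_even.append(0)
        else:
            r1, r2 = rank[a + 1], rank[b + 1]
            # Naive range minimum, a stand-in for a constant-time query.
            LCP_even.append(1 + min(odd_lcp[r1:r2]))
    return A_even, LCP_even


S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1]
A_even, LCP_even = build_even_arrays(S, [3, 1, 5, 11, 7, 9], [1, 2, 0, 2, 1])
```

For the running example the even suffixes come out in the order 12, 4, 8, 2, 10, 6 with lcp values 1, 1, 0, 1, 3.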

Merging T_o and T_e
Input: ⟨A_{T_o}, LCP_{T_o}⟩ and ⟨A_{T_e}, LCP_{T_e}⟩
Trivial method: sort the suffixes lexicographically, Ω(n²)
What if we have an oracle for lcp(S[2i,n], S[2j-1,n])?
Merge A_{T_o} and A_{T_e} directly (like sorted lists)
Compute LCP_T from previous results:
1. lcp of adjacent odd suffixes by LCP_{T_o}
2. lcp of adjacent even suffixes by LCP_{T_e}
3. lcp of an odd suffix and an even suffix by the oracle
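Given such an oracle, the merge is an ordinary two-list merge in which each comparison inspects a single character. In this sketch the oracle is a naive character-by-character scan, used only as a placeholder for the constant-time oracle built later in the talk; the function names and the 0 terminator are assumptions.

```python
def naive_oracle(S):
    """Placeholder oracle: lcp of an odd and an even suffix by direct scan."""
    T = S + [0]

    def oracle(odd_pos, even_pos):
        l = 0
        while T[odd_pos - 1 + l] == T[even_pos - 1 + l]:
            l += 1                  # stops at the latest at a terminator
        return l

    return oracle


def merge_with_oracle(S, A_odd, LCP_odd, A_even, LCP_even, oracle):
    """Merge the two sorted suffix lists and assemble the combined LCP array."""
    T = S + [0]
    A, LCP = [], []
    i = j = 0
    while i < len(A_odd) or j < len(A_even):
        if j == len(A_even):
            take_odd = True
        elif i == len(A_odd):
            take_odd = False
        else:
            l = oracle(A_odd[i], A_even[j])
            # The first mismatching character decides the order.
            take_odd = T[A_odd[i] - 1 + l] < T[A_even[j] - 1 + l]
        pos = A_odd[i] if take_odd else A_even[j]
        if A:
            prev = A[-1]
            if prev % 2 == pos % 2:     # same parity: reuse the known value
                LCP.append(LCP_odd[i - 1] if take_odd else LCP_even[j - 1])
            elif take_odd:
                LCP.append(oracle(pos, prev))
            else:
                LCP.append(oracle(prev, pos))
        A.append(pos)
        if take_odd:
            i += 1
        else:
            j += 1
    return A, LCP


S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1]
A, LCP = merge_with_oracle(S, [3, 1, 5, 11, 7, 9], [1, 2, 0, 2, 1],
                           [12, 4, 8, 2, 10, 6], [1, 1, 0, 1, 3],
                           naive_oracle(S))
```

Two suffixes of the same parity that end up adjacent in the merged list were already adjacent in their own list, which is why cases 1 and 2 can reuse LCP_{T_o} and LCP_{T_e} directly.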

Coupled-DFS (the uncompacted case) [Figure sequence: tries T_1 and T_2, with subtrees labeled A to F, are traversed together and merged step by step into T_M.]
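For uncompacted tries the coupled traversal is short enough to write out. The sketch below represents a trie as nested dictionaries keyed by character, which is an assumption made for brevity, as are the function names. Subtrees present in only one trie are shared with the input, not copied.

```python
def make_trie(words):
    """Uncompacted trie as nested dicts: one dict level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def coupled_merge(t1, t2):
    """Walk both tries in lockstep; descend together on shared characters."""
    merged = {}
    for ch in sorted(set(t1) | set(t2)):
        if ch in t1 and ch in t2:
            merged[ch] = coupled_merge(t1[ch], t2[ch])
        elif ch in t1:
            merged[ch] = t1[ch]     # only in T_1: take the subtree as is
        else:
            merged[ch] = t2[ch]     # only in T_2: take the subtree as is
    return merged


T1 = make_trie(["ab", "ac"])
T2 = make_trie(["ad", "b"])
TM = coupled_merge(T1, T2)
```

The recursion only continues where both tries have an edge with the same character, so the work is bounded by the size of the smaller trie along each shared path.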

Coupled-DFS (the compacted case) [Figure sequence: compacted tries T_1 and T_2, with subtrees labeled A to G, merged into T_M.]

Over-Merging T_o and T_e
How do we merge compacted tries?
An over-merge is like a merge, but:
– Compare only the first characters of edges
– In case of two edges with different lengths k < l, break l into k and l-k
– Identify edges by their first letter only

Over-Merge Example [Figure: compacted tries T_1 and T_2 and their over-merge T_M.]

Over-Merge of Running Example [Figure sequence for the running example string S: the odd tree T_o, the even tree T_e, and their over-merge T_M.]

Building the lcp Oracle
Definitions:
– A node in both T_M and T_o is odd
– A node in both T_M and T_e is even
– A node with both odd and even descendants is odd/even
For every odd/even node u, find l_{2i} and l_{2j-1} such that u = lca(l_{2i}, l_{2j-1})
Compute d(u) = lca(l_{2i+1}, l_{2j})
Compute δ(u) = depth(u) in the tree of d-pointers

Over-Merge of Running Example [Figure: T_M for the running example string S.]

Main Theorem
The function d defines a tree on the odd/even nodes of T_M, and for any l_{2i} and l_{2j-1} we have
δ(lca(l_{2i}, l_{2j-1})) = lcp(S[2i,n], S[2j-1,n])

Helpful Observations
Let u be an odd/even node in T_M. u is either even or odd, and so L(u) is defined.
Let u be an even node:
1. For l_{2i} and l_{2j} below u: lcp(S[2i,n], S[2j,n]) ≥ L(u)
2. For l_{2i'-1} and l_{2j'-1} below u: lcp(S[2i'-1,n], S[2j'-1,n]) ≥ L(u)
3. For l_{2i''} and l_{2j''-1} below u: lcp(S[2i'',n], S[2j''-1,n]) ≤ L(u)
The proof is symmetric if u is an odd node.

Lemma
The lcp value of any odd and even pair of leaves whose lca is u must be the same.
Proof: Suppose lca(l_{2i'}, l_{2j'-1}) = lca(l_{2i''}, l_{2j''-1}) = u. Then
lcp(S[2i',n], S[2j'-1,n]) = k ≤ L(u)
lcp(S[2i',n], S[2i'',n]) ≥ L(u) ≥ k
⇒ lcp(S[2i'',n], S[2j'-1,n]) = k
[Figure: S[2i',n], S[2j'-1,n] and S[2i'',n] aligned, with k and L(u) marked.]

Induction on the lcp
Pick a pair of odd and even suffixes S[2i',n] and S[2j'-1,n].
Base: If S[2i'] ≠ S[2j'-1] then lca = root (recall the merge procedure) ⇒ lcp = 0.
Assumption: Suppose the theorem is true for lcp < k.
Step: Let lcp = k > 0 and u = lca(l_{2i}, l_{2j-1}) ⇒ u ≠ root. Suppose d(u) = lca(l_{2i'+1}, l_{2j'}); then:
δ(u) = 1 + δ(d(u)) = 1 + lcp(S[2i'+1,n], S[2j',n]) = lcp(S[2i',n], S[2j'-1,n]) = lcp(S[2i,n], S[2j-1,n])
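The induction rests on a one-step identity: if an even suffix and an odd suffix start with the same character, their lcp is one more than the lcp of the suffixes starting one position later, and otherwise it is 0. This brute-force check is a sanity test on a concrete string, not part of the algorithm; the function names and the 0 terminator are assumptions.

```python
def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def check_shift_identity(S):
    """Verify lcp(S[2i,n], S[2j-1,n]) = 1 + lcp(S[2i+1,n], S[2j,n])
    whenever S[2i] = S[2j-1], and 0 otherwise, for every such pair."""
    T = S + [0]                         # 0 plays the role of the terminator
    n = len(S)
    for e in range(2, n + 1, 2):        # even start positions (1-based)
        for o in range(1, n + 1, 2):    # odd start positions (1-based)
            whole = lcp(T[e - 1:], T[o - 1:])
            if T[e - 1] == T[o - 1]:
                assert whole == 1 + lcp(T[e:], T[o:])
            else:
                assert whole == 0
    return True


ok = check_shift_identity([1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1])
```

Note that shifting an even suffix by one gives an odd suffix and vice versa, so the shifted pair is again an odd/even pair; this is what lets d(u) point to another odd/even node.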

Done! [Figure: the RAM algorithm scheme, ending in T_S.]

The End