Augmenting Suffix Trees, with Applications Yossi Matias, S. Muthukrishnan, Suleyman Cenk Sahinalp, Jacob Ziv Presented by Genady Garber.



Abstract
The theory of string algorithms plays a fundamental role in:
- Information retrieval
- Data compression
This work considers one algorithmic problem from each area. The algorithms rely on augmenting the suffix tree (adding extra edges, resulting in DAGs). The algorithms construct these "suffix DAGs" and manipulate them to solve the problems.

Introduction
This paper presents two algorithmic problems:
- Data compression
- Information retrieval
All the algorithms rely on the suffix tree data structure. Suffix trees with suitably simple augmentations are very useful in string processing applications. In the work described here, the suffix tree is augmented with extra edges and additional information.

Problems and Background
- The Document Listing Problem
- The HYZ Compression Problem

The Document Listing Problem
- Given a set of documents T = {T1, ..., Tk} and a query pattern P.
- The problem:
  - Output a list of all documents containing P as a substring. (The standard problem can be solved in time proportional to the number of occurrences of P in T; the goal here is a running time depending on the number of documents containing P.)
  - Report the number of documents containing P. (An existing algorithm solves this problem in O(|P|) time and is based on data structures for computing Lowest Common Ancestors.)
- May be used in molecular biology applications (e.g., discovering gene homologies).
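The two query types can be illustrated with a deliberately naive structure: an uncompacted generalized suffix trie in which every node stores the set of documents containing its path label. This is a sketch for intuition only, nothing like the paper's O(t)-space suffix-DAG with O(p log k + out) list queries; all function names here are my own.

```python
def build_suffix_trie(docs):
    """Uncompacted generalized suffix trie over a list of documents.
    Each node records which documents contain the substring spelled by
    the path to it.  Space is O(sum t_i^2) -- far worse than the paper's
    suffix-DAG -- but count/list queries then take O(p) trie steps."""
    root = {"children": {}, "docs": set()}
    for doc_id, text in enumerate(docs):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "docs": set()})
                node["docs"].add(doc_id)
    return root

def locate(root, pattern):
    """Walk the pattern down the trie; None if P is not a substring."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return None
    return node

def count_query(root, pattern):
    node = locate(root, pattern)
    return len(node["docs"]) if node else 0

def list_query(root, pattern):
    node = locate(root, pattern)
    return sorted(node["docs"]) if node else []
```

Using the document set from the paper's example (T1 = abcb, T2 = abca, T3 = abab), the pattern "ab" occurs in all three documents, while "bc" occurs only in the first two.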

The (  )-HYZ Compression Problem Given a binary string T of length t Given a binary string T of length t Need to replace disjoint blocks of size  with desirably shorter codewords (to allow future perfect decompression) Need to replace disjoint blocks of size  with desirably shorter codewords (to allow future perfect decompression) Compression Algorithm To compute the codeword c j for block j we determine its context (the context of a block T[i : l] is the longest substring T[k : i - 1], k < i, of size at most , T[k : l] occurs earlier in T) To compute the codeword c j for block j we determine its context (the context of a block T[i : l] is the longest substring T[k : i - 1], k < i, of size at most , T[k : l] occurs earlier in T) The codeword c j is the ordered pair {  } where The codeword c j is the ordered pair {  } where  - length of the context of block j  - rank of block j with respect to the context (according to some predetermined order - lexicographic, etc.)  Intuition: similar symbols in data appear in similar contexts

The (  )-HYZ Compression Problem (previous results) Case of  = O(1) and  is unbounded Case of  = O(1) and  is unbounded Average length of a codeword is shown to approach the conditional entropy for the block, within additive term of c 1 logH(C) + c 2 for constants c 1 and c 2, provided that the input is generated by a limited order Markovian source.Average length of a codeword is shown to approach the conditional entropy for the block, within additive term of c 1 logH(C) + c 2 for constants c 1 and c 2, provided that the input is generated by a limited order Markovian source. Case of  >= loglog t and  = O(log t ) Case of  >= loglog t and  = O(log t ) This scheme also achieves the optimal compression (in terms of CO)This scheme also achieves the optimal compression (in terms of CO) Applies for all ergodic sourcesApplies for all ergodic sources

The Problem Statement
Consider a set of document strings T = {T1, T2, ..., Tk}, of sizes t1, t2, ..., tk. The goal is to build a data structure supporting the following queries on an on-line pattern P of size p:
- list query: the list of documents containing P;
- count query: the number of documents containing P.
Theorem 1. Given T and P, there is a data structure which responds to a count query in O(p) time, and to a list query in O(p log k + out) time, where out is the number of documents in T that contain P.

The Suffix-DAG Data Structure
Proof sketch:
- Build the suffix-DAG of documents T1, ..., Tk in O(t) = O(Σi ti) time, using O(t) space.
- The suffix-DAG of T, denoted SD(T), contains the generalized suffix tree GST(T) of the set T at its core.
- GST(T) is defined to be the compact trie of all the suffixes of each of the documents in T.
- Each leaf node l in GST(T) is labeled with the list of documents which have a suffix represented by the path from the root to l.
- The substring represented by the path from the root to any given node n is denoted P(n).

The Suffix-DAG Data Structure (cont.)
- The nodes of SD(T) are the nodes of GST(T).
- The edges of SD(T) are of two types:
  - The skeleton edges of SD(T) are the edges of GST(T).
  - The supportive edges of SD(T) are defined as follows: for any nodes n1 and n2 in SD(T), there is a pointer edge from n1 to n2 if and only if:
    - n1 is an ancestor of n2;
    - among the suffix trees ST(T1), ST(T2), ..., ST(Tk) there exists at least one, say ST(Ti), which has two nodes n1,i and n2,i such that
      - P(n1) = P(n1,i) and P(n2) = P(n2,i);
      - n1,i is the parent of n2,i.
    - Such an edge is labeled with i, for all relevant documents Ti.

The Suffix-DAG Data Structure (cont. 2)
- In order to respond to the count and list queries, one of the standard data structures supporting least common ancestor (LCA) queries on SD(T) in O(1) time is built.
- In addition, each internal node n of SD(T) stores:
  - an array holding its supportive edges in pre-order fashion;
  - the number of documents which include P(n) as a substring.

Example
The independent suffix trees for T1 = abcb, T2 = abca, T3 = abab.
[Figure: the suffix trees ST(T1), ST(T2), and ST(T3), drawn with end marker #.]

Example (cont.)
The generalized suffix tree of the set of documents {T1, T2, T3}.
[Figure: the generalized suffix tree, labeled SK(T).]

Example (cont. 2)
The suffix-DAG of the set of documents.
[Figure: SD(T), with supportive edges added to the generalized suffix tree.]

Example (cont. 3)
The suffix-DAG of the set of documents.
[Figure: SD(T), annotated per node with skeleton edges, the number of documents in the subtree, and the pointer (supportive-edge) array.]

Lemmas and Proofs
Lemma 1. The suffix-DAG is sufficient to respond to count queries in O(p) time, and to list queries in O(p log k + out) time.
Proof sketch:
- To respond to count queries:
  - With P, trace down GST(T) until the highest-level node n is reached for which P is a prefix of P(n).
  - Return the number of documents that contain P(n) as a substring.
- To respond to list queries:
  - Locate n in SD(T) (as defined above) and traverse SD(T) backwards from n to the root.
  - At each node u on the path, determine all supportive edges out of u that have their endpoints in the subtree rooted at n.

Lemmas and Proofs (cont.)
Complexity of list queries:
- The key observation is that all corresponding edges form a consecutive segment of u's array.
- The segment may be identified with two binary searches (each step performs an LCA query, which takes O(1) time).
- The maximum size of the array of supportive edges in any node is at most k|Σ|, where |Σ| = O(1) is the size of the alphabet; hence this procedure takes O(log k) time at each node u.
- The output over all such segments may contain duplicates, but the total size of the output is O(out |Σ|) = O(out), where out is the number of documents in T that contain P.

Lemmas and Proofs (cont. 2)
Lemma 2. The suffix-DAG of the document set T can be constructed in O(t) time and O(t) space.
Proof sketch:
- The construction of GST(T) with all suffix links, and of the LCA data structure, is standard.
- To complete SD(T) it is necessary:
  - to construct the supportive edges;
  - to build the supportive-edge arrays;
  - to explain how the number of documents that include P(n) is computed.

Lemmas and Proofs (cont. 3)
Proof sketch (cont.):
- The supportive edges with label i can be built by emulating the construction of ST(Ti): for each node in ST(Ti) there is a corresponding node in SD(T), with the appropriate suffix link.
- Building the supportive edges:
  - If there is a supportive edge between nodes n1 and n2, then either there is a supportive edge between n1' (the node reached from n1 by a suffix link) and n2', or there exists one intermediate node to which there is a supportive edge from n1' (respectively n2').
  - The time to compute such nodes is proportional to the length of the string between nodes n1 and n2.
- To compute the number of documents #(n) containing the substring of n, we need:
  - the number of supportive edges from n to its descendants, #↓(n);
  - the number of supportive edges to n from its ancestors, #↑(n).

Lemmas and Proofs (cont. 4)
Lemma 3. For any node n,
  #(n) = Σ_{n' ∈ children(n)} #(n') + #↑(n) − #↓(n),
where #↓(n) and #↑(n) denote the number of supportive edges from n to its descendants and to n from its ancestors, respectively.
Proof:
- If a document Ti includes the substrings of more than one descendant of n, then there must exist a node in ST(Ti) whose substring is identical to that of n.
- For any two supportive edges from n to n1 and n2, the path from n to n1 and the path from n to n2 do not have any common edges.

The Compression Algorithm
Terms of the compression scheme C(α, β):
- The input is a string T of size t, over a binary alphabet.
The scheme:
- Partition T into contiguous substrings (blocks) of size β.
- Replace each block by a corresponding codeword (a function of its context).
  - The context of a block T[i : l] is the longest substring T[k : i-1], k < i, for which T[k : l] occurs earlier in T.
  - If the context exceeds α characters, it is truncated to its α rightmost characters.
- The codeword c_j is the ordered pair ⟨ℓ_j, r_j⟩, where
  - ℓ_j is the context size;
  - r_j is the lexicographic order of block j among all possible substrings of size β immediately following earlier occurrences of the context of block j.

Compression Example
- T = …, α = 2, β = 1.
- The context of block 9 is T[7 : 8] = 01.
- The two substrings which follow earlier occurrences of this context are T[3] = 0 and T[6] = 1.
- The lexicographic order of block 9 among these substrings is 1.
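The codeword definition can be made concrete with a brute-force reference implementation. This is my own naive O(t²) sketch of the definition, not the paper's suffix-tree algorithm, and the rank convention (1-based, over distinct follower substrings) is an assumption:

```python
def hyz_codeword(T, i, alpha, beta):
    """Codeword <context length, rank> for the block T[i:i+beta] under the
    (alpha, beta)-HYZ scheme, computed directly from the definition."""
    block = T[i:i + beta]
    # Longest suffix of T[:i], of length at most alpha, that also occurs
    # at some strictly earlier position in T.
    context = ""
    for L in range(1, min(alpha, i) + 1):
        cand = T[i - L:i]
        if any(T.startswith(cand, k) for k in range(i - L)):
            context = cand
    # Distinct substrings of length beta that follow earlier occurrences
    # of the context, in lexicographic order.
    followers = sorted({
        T[k + len(context):k + len(context) + beta]
        for k in range(i - len(context))
        if T.startswith(context, k) and k + len(context) + beta <= len(T)
    })
    # 1-based lexicographic rank of the block among those substrings
    # (0 if the block never followed the context before -- a convention
    # I chose for the sketch).
    rank = followers.index(block) + 1 if block in followers else 0
    return len(context), rank
```

For T = "ababab" with α = 2, β = 1, the block at position 4 ('a') has context "ab" (it occurred earlier at position 0) and rank 1 among the followers of "ab", so its codeword is (2, 1).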

The Compression Algorithm (cont.)
Theorem 2. There is an algorithm to implement the compression scheme C(α, β) which runs in O(tβ) time and requires O(tβ) space, independent of α.
Proof:
- Build the suffix tree of T and augment it as follows:
  - For each node v, store an array of size β in which, for each i = 1, ..., β:
    - store the number of distinct paths rooted at v of precisely i characters, minus the number of such distinct paths of precisely i−1 characters (the number may be negative).
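The per-node augmentation can be illustrated on an uncompacted suffix trie. This is a simplification for exposition: the paper augments the compact suffix tree, and the representation and function names here are mine.

```python
def suffix_trie(T):
    """Uncompacted suffix trie of T as nested dicts (char -> child)."""
    root = {}
    for s in range(len(T)):
        node = root
        for ch in T[s:]:
            node = node.setdefault(ch, {})
    return root

def path_count_array(node, beta):
    """The augmentation array of a node: entry i (1 <= i <= beta) is the
    number of distinct paths of exactly i characters below the node, minus
    the number of distinct paths of exactly i-1 characters."""
    counts = [1] + [0] * beta          # counts[d] = # distinct paths of d chars
    frontier = [node]
    for d in range(1, beta + 1):
        frontier = [child for u in frontier for child in u.values()]
        counts[d] = len(frontier)
    return [counts[i] - counts[i - 1] for i in range(1, beta + 1)]
```

For T = "abab" the root has 2 distinct length-1 paths ("a", "b") and 2 distinct length-2 paths ("ab", "ba"), so with β = 2 its array is [2−1, 2−2] = [1, 0].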

The Compression Algorithm (cont. 2)
Lemma 4. There is an algorithm to construct the augmented suffix tree of T in O(tβ) time.
Proof:
- While inserting a new node v into the suffix tree, update the subtree information of the ancestors of v which are at most β characters higher than v:
  - the number of such ancestors is at most β;
  - at most one of the β fields of information at any ancestor of v needs to be updated.

The Compression Algorithm (cont. 3)
Lemma 5. The augmented suffix tree is sufficient to compute the codeword for each block of input T in amortized O(β²) time.
Proof sketch:
- The computation of ℓ_j can be performed by locating the node in the suffix tree which represents the longest prefix of the context (this can be achieved by using the suffix links in amortized O(β) time).
- The computation of r_j:
  - Traverse the path between v and w (v represents the longest prefix of the context of block j; w, a descendant of v, represents the longest prefix of the substring formed by concatenating block j to its context).
  - During the traversal, compute the sizes of the relevant subtrees that are lexicographically smaller/greater than the substring represented by this path.

The Compression Algorithm (cont. 4)
Theorem 3. There is an algorithm to implement the compression method C(α, β) for α = log t and β = log log t in O(t) time using O(t) space.
Proof sketch:
- For any descendant w_i of v which is β characters apart from v, the data structure enables computing the lexicographic order of the path between w_i and v, allowing easy computation of the codeword of the block of size β represented by the path between v and w_i.
- The algorithm exploits the fact that the context size is bounded, and it seeks similarities between suffixes of the input only up to a small size.

The Compression Algorithm (cont. 5)
Lemma 6. The augmented limited suffix tree of input T is sufficient to compute the codeword for any input block j in O(β) time.
Proof sketch:
- Given a block j of the input:
  - v represents its context;
  - w (a descendant of v) represents the substring context(j) + block(j).
- Since β = log log t, the maximum number of elements in the search data structure for v is 2^β = O(log t).
- There is a simple data structure that maintains k elements and computes the rank of any given element in O(log k) time.
- Hence the lexicographic order of node w may be computed in only O(log log t) = O(β) time (in future work).
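Such a rank structure can be sketched with a sorted list and binary search; this is my own minimal stand-in, and note that the per-node structure in the proof also needs cheap insertion, which a sorted list's O(k) insert does not provide.

```python
import bisect

class RankSet:
    """Maintains k distinct strings; rank(x) is the 1-based lexicographic
    position of x, answered in O(log k) time by binary search."""

    def __init__(self):
        self.items = []          # kept sorted at all times

    def insert(self, x):
        if x not in self.items:              # keep elements distinct
            bisect.insort(self.items, x)     # O(k) shifting; sketch only

    def rank(self, x):
        return bisect.bisect_left(self.items, x) + 1
```

With elements {"ab", "ba", "bb"}, the rank of "ba" is 2, as it is the second string in lexicographic order.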

The Compression Algorithm (cont. 6)
Lemma 7. The augmented limited suffix tree of input T can be built and maintained in O(t) time.
Proof sketch:
- The depth of the augmented limited suffix tree is bounded by log t, so the total number of nodes in the tree is only O(t), allowing the suffix-tree construction to be adapted to run in O(t) time, without being penalized for building a suffix trie rather than a suffix tree.
- Because each node v is inserted into the search data structure of at most one of its ancestors, it is possible to construct and maintain the search data structures of all nodes in O(t) time; the total number of elements maintained by all search data structures is O(t).
- The insertion time of an element e into a search data structure is O(1).
- As the total number of nodes to be inserted into the data structures is bounded by O(t), it may be shown that the total time for inserting nodes into the search data structures is O(t).

Results and Conclusions
Document listing problem:
- Processing of k documents in linear time O(Σi ti) and space.
- Time to answer a query with pattern P is O(|P| log k + out).
- The fastest previously known algorithm runs in time proportional to the number of occurrences of the pattern in all documents.

Results and Conclusions (cont.)
(α, β)-HYZ compression problem:
- For unbounded α: complexity O(tβ), which gives linear time for β = O(1).
  - The only previously known algorithm handles β = 1: its running time is O(tα), which for unbounded α becomes O(t²).
- For α = O(log t) and β = log log t: complexity O(t).
  - There are no previously known algorithms for this case.