Ravello, 19-21 September 2003. On some researches... Chiara Epifanio.

Outline
- Compact representation of local automata
- The multidimensional Critical Factorization Theorem

The multidimensional Critical Factorization Theorem (Chiara Epifanio, Filippo Mignosi)

A word is a sequence of characters over an alphabet A: w ∈ A^{1,2,…,n} (finite words), w ∈ A^N or w ∈ A^Z (infinite words). A finite word w = a_1 … a_n is periodic if ∃ p ∈ N s.t. w(x+p) = w(x) ∀ x, 1 ≤ x ≤ n−p; such a p is a period of w.

A word may have more than one period (e.g. abaababaabaababaaba, which has periods 8 and 13); the smallest period of w is called "the" period of w and is denoted p(w).
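A quick brute-force check of the definition above (a plain Python sketch, not part of the slides; the helper name periods is mine):

```python
def periods(w):
    """All p with 1 <= p <= |w| such that w[x] == w[x+p] wherever both sides are defined."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[x] == w[x + p] for x in range(n - p))]

w = "abaababaabaababaaba"
ps = periods(w)
print(ps)        # the list contains 8 and 13, among larger trivial periods
print(min(ps))   # "the" period p(w), here 8
```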

A factor v = w_j … w_{j+n−1} of length n of w is a repetition of order θ if there exists a natural number p, 0 < p ≤ n, such that w_i = w_{i+p} for i = j, …, j+n−1−p and such that n/p ≥ θ. The number p is called a period of the repetition; the smallest period of the repetition is called the period of the repetition.
Ex: abaaba is a repetition of period 6 and order 1, of period 5 and order 6/5, and of period 3 and order 2.
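The orders in this example can be reproduced mechanically, since the order of a repetition with period p is just its length divided by p (a small illustrative sketch, reusing the same brute-force helper as above):

```python
from fractions import Fraction

def periods(w):  # same brute-force helper as in the previous sketch
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[x] == w[x + p] for x in range(n - p))]

v = "abaaba"
for p in periods(v):
    print(p, Fraction(len(v), p))   # prints 3 2, then 5 6/5, then 6 1
```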

A word w has a central repetition of order θ in position i if there exists a factor v of w centered in i that is a repetition of order θ. In this case we denote by c_θ(w,i) the smallest period among all the central repetitions of order θ in position i, and we call it the central local period of order θ in i. We denote by P_θ(w) the maximum of the central local periods of order θ in w. A position i is critical if c_θ(w,i) = P_θ(w).

The Critical Factorization Theorem. Let w be a word of length |w| ≥ 2. In every sequence of l ≥ max{1, p(w)−1} consecutive positions there is a critical one, and P_θ(w) = p(w) for θ = 2.

In particular, the Critical Factorization Theorem states that for θ = 2 there exists at least one point such that the central local period detected at this point coincides with the (global) period of the word, i.e., there exists an integer j, 1 ≤ j ≤ |w|, such that c_θ(w,j) = p(w) for θ = 2. We have given a new proof for θ = 4.
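The one-dimensional statement is easy to check by brute force. The sketch below uses the classical formulation of the local period at a cut, in which the repetition may overhang the ends of the word (as in Lothaire, chapter 8); this is slightly more permissive than the central repetitions of order θ defined above, so it is an illustration only, not the proof technique of the slides.

```python
def periods(w):
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[x] == w[x + p] for x in range(n - p))]

def local_period(w, i):
    """Smallest p such that a repetition of period p is centered at the cut
    between w[:i] and w[i:]; the repetition may overhang the ends of w."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[x] == w[x - p] for x in range(max(i, p), min(i + p, n))):
            return p

w = "abaababaabaababaaba"
max_local = max(local_period(w, i) for i in range(1, len(w)))
print(max_local, min(periods(w)))   # the theorem says these coincide (here both are 8)
```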

Lemma 1. Let u, v, w be words such that uv and vw have period p and |v| ≥ p. Then the word uvw has period p. (cf. Lemma 8.1.2, Lothaire 2, chapter 8)

Lemma 2. Suppose that w has period q and that there exists a factor v of w with |v| ≥ q that has period r, where r divides q. Then w has period r. (cf. Lemma 8.1.3, Lothaire 2, chapter 8)

Fine and Wilf Theorem. Let w be a word having periods p and q, with q ≤ p. If |w| ≥ p + q − gcd(p,q), then w also has period gcd(p,q).
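The theorem can be sanity-checked exhaustively on short words; the sketch below (an illustration only, with an arbitrary choice p = 6, q = 4) verifies it over all binary words of length 8 to 10.

```python
from math import gcd
from itertools import product

def has_period(w, p):
    return all(w[x] == w[x + p] for x in range(len(w) - p))

p, q = 6, 4
bound = p + q - gcd(p, q)                     # = 8
for n in range(bound, bound + 3):             # lengths 8, 9, 10
    for letters in product("ab", repeat=n):
        w = "".join(letters)
        if has_period(w, p) and has_period(w, q):
            assert has_period(w, gcd(p, q))   # the theorem's conclusion
print("Fine & Wilf verified for p = 6, q = 4 on all binary words of length 8..10")
```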

Multidimensional case. (Multidimensional periodicity was introduced by Amir and Benson for the design of pattern matching algorithms (1991). Since then, many authors have worked on it, giving slightly different definitions.)

If u is a factor of w, then v is a periodicity vector for u if w((x,y)+v) = w(x,y) for all (x,y) ∈ Dom(u) such that ((x,y)+v) ∈ Dom(u). v is a periodicity vector for w itself if w((x,y)+v) = w(x,y) for all (x,y).
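A minimal two-dimensional sketch of this definition, under the simplifying assumption that the factor's domain is a full rectangle (the slides allow more general shapes); the function name is mine.

```python
def is_periodicity_vector(u, v):
    """True if shifting the rectangular block u by v = (dx, dy) preserves its
    symbols wherever both positions fall inside the block's domain."""
    rows, cols = len(u), len(u[0])
    dx, dy = v
    for x in range(rows):
        for y in range(cols):
            X, Y = x + dx, y + dy
            if 0 <= X < rows and 0 <= Y < cols and u[x][y] != u[X][Y]:
                return False
    return True

u = ["abab",
     "cdcd",
     "abab"]
print(is_periodicity_vector(u, (0, 2)))  # True: horizontal shift by 2
print(is_periodicity_vector(u, (2, 0)))  # True: vertical shift by 2
print(is_periodicity_vector(u, (1, 1)))  # False
```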

A factor u of w is lattice-periodic with respect to v_1 and v_2 if every v ∈ L is a periodicity vector for u, where L = <v_1, v_2> = {n_1 v_1 + n_2 v_2 : n_1, n_2 ∈ Z} is the lattice generated by v_1 and v_2.

Given a subgroup H of Z^d, a transversal T_H of H is a subset of Z^d such that for any element i ∈ Z^d there exists a unique element j ∈ T_H such that i−j ∈ H. An n-cubic factor v is a repetition of order θ if v is L-periodic for some lattice L and n/h_L ≥ θ, where h_L is the smallest integer such that every hypercube of side h_L contains a transversal of L. The lattice L is called a period of the θ-repetition v.

A word w has a central repetition of order θ in position j ∈ Z^d if there exists a factor v of w centered in j that is a repetition of order θ. If w has at least one central repetition of order θ in j, let H = {h_L : L is a period of a central repetition of order θ in j, h_L the smallest integer such that every hypercube of side h_L contains a transversal of L}. We denote c_θ(w,j) = min(H), and we let P_θ(w) = limsup{c_θ(w,j) : j position in w}.

Lemma 3. Let v_1 and v_2 be two factors of the same d-dimensional word w that both have period a subgroup H. If sh(v_1) ∩ sh(v_2) contains a transversal of H, then the factor v having shape sh(v) = sh(v_1) ∪ sh(v_2) also has period H.

Lemma 4. Let v_1 and v_2 be two factors of the same d-dimensional word w such that sh(v_2) ⊆ sh(v_1). Suppose that v_1 has period H_1 and that v_2 has period H_2, with H_1 a subgroup of H_2, and that sh(v_2) contains a transversal of H_1. Under these hypotheses v_1 has period H_2.

A generalization of the Fine & Wilf Theorem. If w has two periodicity vectors v_1 and v_2 and w is "big enough" with respect to v_1 and v_2, then w is lattice-periodic with respect to v_1 and v_2.

The multidimensional Critical Factorization Theorem. Informally, the C.F.T. states that the maximal central local period of order 2 is also a period of the whole word. But… there is no total order among lattices! Our solution is to order lattices by the length h_L of the side of the smallest hypercube that contains a transversal of L. We further have to prove that all the lattices with the same maximal h_L coincide over the word. To do this, for the moment, we lose the tightness of the local repetition order (4 instead of 2).

Theorem. Let w be a cubic bidimensional word and X a cube included in the shape of w. Every cube T ⊆ X of side max(1, P_4(X)−1) contains a position l such that c_4(w,l) = P_4(X). Let v be the factor of w whose shape is the intersection between sh(w) and the union X' of the shapes of the 4-repetitions centered in positions l ∈ X with c_4(w,l) = P_4(X). Then v has period L, where L is a subgroup such that every cube of side P_4(X) contains a transversal of L.

Proof of the theorem (schema): Lemma 4 + Fine & Wilf generalization + Lemma 3 ⇒ thesis.

Conclusions and open problems
- Importance of the extension to the d-dimensional case (d ≥ 2).
- Difficulties of such an extension (new definitions, extension of already known results).
- It is known that for d = 1 the tight value is θ = 2; it remains an open problem to find the tight value of θ for arbitrary dimension.
- Applications.

Compact representation of local automata (M. Crochemore, C. Epifanio, R. Grossi, F. Mignosi)

Compacting is a standard technique used for reducing the size of data structures such as factor automata, DAWGs and suffix trees; it consists in replacing paths in automata with single edges. In 2000 Crochemore, Mignosi, Restivo and Salemi gave an algorithm for "self-compressing" the trie of an antifactorial binary set of words. The aim of that algorithm was to represent in a compact way antidictionaries to be sent to the decoder of a static compression scheme. What we have worked on is an improvement of that scheme that works for sets of words over any alphabet.

The suffix trie Tr(w) of a word w is a trie whose set of leaves is the set of suffixes of w that do not appear previously as factors of w.
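A naive sketch of the construction (quadratically many nodes, plain dicts keyed by letter; the terminator '$' is an assumption made here, not something from the slides, so that every suffix ends at its own leaf):

```python
def suffix_trie(word):
    root = {}
    w = word + "$"
    for i in range(len(w)):            # insert every suffix w[i:]
        node = root
        for ch in w[i:]:
            node = node.setdefault(ch, {})
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.values())

print(count_nodes(suffix_trie("abaaba")))   # number of trie nodes, root included
```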

The suffix tree T(w) of a word w is a compressed suffix trie, where only leaves and forks are kept. Each edge is labelled with a substring of w. In this way the number of nodes and leaves of T(w) is smaller than 2|w|. But if the labels of the arcs are stored explicitly, the implementation can have quadratic size. The simple solution is to represent labels by pairs of integers (position, position) or (position, length) and to keep the text aside.
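A minimal sketch of the (position, length) trick, with names of my own choosing: an edge stores two integers, and its label is recovered on demand from the text kept aside.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    start: int    # position in the text of the label's first character
    length: int   # number of characters on the edge

    def label(self, text: str) -> str:
        return text[self.start:self.start + self.length]

text = "abaaba"
e = Edge(start=2, length=3)
print(e.label(text))   # "aab", i.e. three characters starting at position 2 (0-based)
```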

There are classical on-line linear-time implementations. All of them use the suffix link function s, defined over all the nodes of the suffix trie and of the suffix tree by s(root) = root and s(v) = v', where α_v = a·α_{v'}, α_v being the label of the path from the root to v and a being the first letter of α_v.

Our new approach is basically the same as that of the suffix tree, but we compact a bit less: we keep all nodes of the suffix tree and some more nodes of the trie, namely all nodes v of the trie such that s(v) is a node of the suffix tree. Then, for any arc (v,v') with label a in the trie, we have an arc (v,x) with the same label in our compacted trie T_2(w), where x is: v', if v' ∈ T_2(w); the first node of T_2(w) that is a descendant of v' in the original trie, if v' ∉ T_2(w). In this second case we consider that (v,x) represents the whole path from v to x in the suffix trie, and we add a sign "+" to node x in order to keep this information.

To complete the definition of T_2(w) we keep the suffix link function over these nodes. Notice that, by definition, for any node v of T_2(w), s(v) is always a node of the suffix tree T(w) and hence it also belongs to T_2(w). This new approach means that we no longer need to keep the text aside.
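As a rough illustration of which trie nodes survive, the brute-force sketch below uses strings as node names and an assumed terminator '$'; it only selects the node set of T_2(w) and is not the authors' compacting algorithm.

```python
def t2_nodes(word):
    w = word + "$"
    factors = {w[i:j] for i in range(len(w)) for j in range(i, len(w) + 1)}
    def children(u):
        return [u + a for a in set(w) if u + a in factors]
    leaves = {u for u in factors if not children(u)}
    forks = {u for u in factors if len(children(u)) > 1}
    suffix_tree = {""} | leaves | forks          # root, leaves and forks: nodes of T(w)
    # suffix link on the trie: s(v) drops the first letter; s(root) = root
    extra = {u for u in factors if u[1:] in suffix_tree}
    return suffix_tree | extra                   # nodes kept by T2(w)

kept = t2_nodes("abaaba")
print(len(kept))
print(sorted(kept, key=len)[:6])                 # a few of the shortest kept nodes
```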

State of the art
- We have given compacting and decompacting algorithms;
- we have proved that the number of nodes in our compacted suffix tree is still linear;
- we have given an algorithm that checks whether a pattern occurs in a text without "decompacting" the automaton;
- we are currently running experiments on the Calgary and Canterbury corpora.