Alternative Algorithms for Lyndon Factorization

Alternative Algorithms for Lyndon Factorization Sukhpal Singh Ghuman, Emanuele Giaquinta, and Jorma Tarhio Aalto University Finland

Lyndon Word Given two strings w and w′, w′ is a rotation of w if w = uv and w′ = vu, for some strings u and v. A string is a Lyndon word if it is lexicographically (alphabetically) strictly smaller than all of its proper rotations.

Lyndon Word w = ab, w′ = ba, where u = a and v = b. w is lexicographically smaller than its rotation w′, so w is a Lyndon word.

Examples of Lyndon words: ab, aabab. Non-Lyndon words: ba, abaac, abcaac.
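A brute-force check of this definition in Python (illustrative only; the function name and the quadratic approach are not from the presentation):

```python
def is_lyndon(w):
    """Brute-force check: w must be strictly smaller than every proper rotation."""
    return all(w < w[i:] + w[:i] for i in range(1, len(w)))

# is_lyndon("ab"), is_lyndon("aabab")    # -> True, True
# is_lyndon("ba"), is_lyndon("abaac")    # -> False, False
```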

Lyndon factorization A word w can be factorized as w = w1 w2 … wm, where each factor wi is a Lyndon word. Every string has a unique such factorization in which the sequence of factors is non-increasing with respect to lexicographical order (the Chen–Fox–Lyndon theorem). The Lyndon factorization has gained importance in a recent method for sorting the suffixes of a text.

Examples of Lyndon factorization
abcaabcaaabcaaaabc -> abc aabc aaabc aaaabc
aacaacaacaad -> aacaacaacaad
abacabab -> abac ab ab

Duval’s algorithm For the Lyndon factorization of a word w, the algorithm computes the longest prefix w1 of w = w1w′ that is a Lyndon word and then restarts the process on w′. Every non-empty prefix of a Lyndon word has the form (uv)^k u, where uv is a Lyndon word and u is a (possibly empty) proper prefix of uv. Duval’s algorithm computes the factorization with a single left-to-right parsing and runs in linear time.
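A minimal Python sketch of Duval's left-to-right parsing (the standard linear-time formulation with 0-indexed strings; the function name is just illustrative):

```python
def duval_factorization(s):
    """Lyndon factorization of s in O(n) time (Duval's algorithm)."""
    n = len(s)
    factors = []
    i = 0
    while i < n:
        j, k = i + 1, i               # s[i:j] is the parsed prefix (uv)^k u
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i                 # the prefix is a Lyndon word again
            else:
                k += 1                # the prefix stays periodic
            j += 1
        while i <= k:                 # emit copies of the Lyndon word of length j - k
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```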

Computing the Lyndon factorization for T = aabaabaaac For the string T = aabaabaaac, the parsed prefix P = T[1…i] of a Lyndon word has the form (uv)^k u for strings u, v and an integer k ≥ 1. There are then two cases, depending on the next symbol to be read.

Computing the Lyndon factorization for T = aabaabaaac For i = 3, P = aab, with u = empty string, v = aab and k = 1. The next symbol to be read is 'a', and aaba is still a prefix of a Lyndon word. The next iteration then starts with P = aaba.

Computing the Lyndon factorization for T = aabaabaaac For i = 6, P = aabaab; P can be written as (uv)^k u with u = empty string, v = aab and k = 2. The next symbol to be read is 'a', and after reading 'aaa' it turns out that aabaabaaa is not a prefix of a Lyndon word. The factor aab is output twice, and the next iteration starts on the suffix aaac of T with P = a.
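Running the sketch above (assuming the duval_factorization function defined earlier) reproduces this parsing:

```python
duval_factorization("aabaabaaac")    # -> ['aab', 'aab', 'aaac']
```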

Variations of Duval’s algorithm The first variation is the LF-skip algorithm, which is able to skip characters. The second variation is for strings compressed with run-length encoding.

LF-skip algorithm The algorithm is able to skip a significant portion of the characters of the string if the string contains runs of the smallest character. Let w be a word over an alphabet Σ with Lyndon factorization CFL(w) = w1, w2, …, wm.

LF-skip algorithm Let c be the smallest symbol in Σ. Suppose there exist k ≥ 2 and i ≥ 1 such that c^k is a prefix of wi. If the last symbol of w is not c, then c^k is a prefix of each of wi, wi+1, …, wm. This property is used to devise an algorithm for Lyndon factorization that skips symbols.

LF-skip algorithm Let us consider the alphabet {a, b, c, …} and assume that the last character of the string is not a. Let wi start with aaad. We know that the length-4 prefix of wi+1 belongs to the set P = {aaaa, aaab, aaac, aaad}. We search for the next occurrence of a pattern in P with an algorithm (e.g. SBNDM) that is sublinear on average, in order to skip characters. Illustration (probe positions marked below the text):
aaadxxxxxxxxxxxaaac
---^---^--^^+++
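A highly simplified sketch of the skipping step (this is not the authors' LF-skip implementation; it uses Python's built-in substring search as a stand-in for an SBNDM-style search over the pattern set P, and it relies on the property above, i.e. the last symbol of the text is not c):

```python
def next_candidate_boundary(text, pos, run):
    """Once a factor is known to start with run = c^k (e.g. 'aaa'), every later
    factor also starts with run, so the next factor boundary can only lie at an
    occurrence of run at or after pos; everything in between can be skipped."""
    j = text.find(run, pos)          # stand-in for an SBNDM-style search of P
    return j if j != -1 else len(text)

# next_candidate_boundary("aaadxxxxxxxxxxxaaac", 4, "aaa")   # -> 15
```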

Run-Length Encoding Run-length encoding (RLE) is a very simple form of data compression in which each run of identical symbols is stored as a single symbol–count pair.
Given string: aaaaaabbbccaaabbbccbbbbbaaa
RLE: a6b3c2a3b3c2b5a3
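A small Python sketch of RLE encoding (an illustrative helper, not part of the presentation):

```python
from itertools import groupby

def rle_encode(s):
    """Encode a string as a list of (symbol, run-length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]

# rle_encode("aaaaaabbbccaaabbbccbbbbbaaa")
# -> [('a', 6), ('b', 3), ('c', 2), ('a', 3), ('b', 3), ('c', 2), ('b', 5), ('a', 3)]
```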

Lyndon factorization of an RLE string The second variation is for strings compressed with run-length encoding and is preferable when the strings are stored or maintained in RLE form.

Lyndon factorization of an RLE string The algorithm is based on Duval’s original algorithm and on a combinatorial property relating the Lyndon factorization of a string to its RLE: a run of length t in the RLE is either contained in a single factor of the Lyndon factorization, or it corresponds to t unit-length factors. For example, in baaa = b a a a the run a^3 splits into three unit-length factors, whereas in aaab it is contained in the single factor aaab.

Computing the Lyndon factorization from RLE for T = aabaabaaac For the string T = aabaabaaac, the parsed prefix P = T[1…i] of a Lyndon word has the form (uv)^k u for strings u, v and an integer k. The RLE algorithm works in a similar way, except that whole runs are read instead of single symbols.

Computing the Lyndon factorization from RLE for T = aabaabaaac For i = 3, P = aab. The next run to be read is 'aa', and aabaa is still a prefix of a Lyndon word. The next iteration then starts with P = aabaa. For i = 6, P = aabaab. The next run to be read is 'aaa', and aabaabaaa is not a prefix of a Lyndon word. The next iteration starts on the suffix aaac of T with P = aaa.

Complexity Given a run-length encoded string R of length ρ (the number of runs), the algorithm computes the Lyndon factorization of R in O(ρ) time. It is preferable to Duval’s algorithm when the strings are stored or maintained in run-length encoded form.
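For contrast, a naive baseline (assuming the rle_encode and duval_factorization sketches above) would decode the RLE and run Duval's algorithm, which costs time proportional to the decoded length n rather than to ρ:

```python
def rle_decode(runs):
    """Expand a list of (symbol, run-length) pairs back into a plain string."""
    return "".join(ch * t for ch, t in runs)

# Decoding first makes the cost proportional to the decoded length n, which can
# be far larger than the number of runs rho; the RLE variant avoids this.
# duval_factorization(rle_decode([('a', 3), ('b', 1)]))      # -> ['aaab']
```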

Experimental results The LF-skip algorithm and Duval’s algorithm were compared on various texts. LF-skip gave a significant speed-up over Duval’s algorithm. The following table shows the speed-ups for random texts of 5 MB with various alphabet sizes.

Speed-up of LF-skip

Conclusion Two variations of Duval’s algorithm for computing the Lyndon factorization of a string were presented. The first algorithm skips a significant portion of the characters of the string. Experimental results show that it is considerably faster than Duval’s original algorithm. The second algorithm is for strings compressed with run-length encoding and computes the Lyndon factorization of a run-length encoded string of length ρ in O(ρ) time.

THANK YOU