Alternative Algorithms for Lyndon Factorization

Sukhpal Singh Ghuman, Emanuele Giaquinta, and Jorma Tarhio
Aalto University, Finland

Lyndon Word
Given two strings w and w′, w′ is a rotation of w if w = uv and w′ = vu for some strings u and v. A string is a Lyndon word if it is lexicographically (alphabetically) smaller than all of its proper rotations.

Lyndon Word
Let w = ab and w′ = ba, where u = a and v = b.
w is lexicographically smaller than its only proper rotation w′, so w is a Lyndon word.

Examples of Lyndon words
Lyndon words: ab, aabab
Non-Lyndon words: ba, abaac, abcaac
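The definition can be checked directly by comparing a word with each of its proper rotations. The following is a minimal sketch, not part of the original slides; the function name is_lyndon is an assumption, and the check takes quadratic time, so it is meant only for small examples.

```python
def is_lyndon(w: str) -> bool:
    """Return True if w is strictly smaller than every proper rotation of w."""
    if not w:
        # Assumption: the empty string is not treated as a Lyndon word.
        return False
    # The rotation starting at position i is w[i:] + w[:i].
    return all(w < w[i:] + w[:i] for i in range(1, len(w)))


# The words from the slide above.
for word in ["ab", "aabab", "ba", "abaac", "abcaac"]:
    print(word, is_lyndon(word))
```

Python's built-in string comparison is lexicographic, which matches the order used in the definition.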

Lyndon factorization
A word w can be factorized as w = w_1 w_2 … w_m such that each factor w_i is a Lyndon word. Every string has a unique factorization into Lyndon words in which the sequence of factors is non-increasing with respect to lexicographical order. The Lyndon factorization is important in a recent method for sorting the suffixes of a text.

Examples of Lyndon factorization
abcaabcaaabcaaaabc -> abc aabc aaabc aaaabc
aacaacaacaad -> aacaacaacaad
abacabab -> abac ab ab
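The three conditions in the definition (the factors concatenate to the word, each factor is a Lyndon word, and the sequence is non-increasing) can be verified mechanically. The sketch below is an illustration only; the names is_lyndon and is_lyndon_factorization are assumptions, and the Lyndon test is the naive rotation check.

```python
def is_lyndon(w: str) -> bool:
    """Naive check: w is non-empty and smaller than all its proper rotations."""
    return bool(w) and all(w < w[i:] + w[:i] for i in range(1, len(w)))


def is_lyndon_factorization(w: str, factors: list) -> bool:
    """Check that `factors` is the Lyndon factorization of w."""
    return (
        "".join(factors) == w                                   # factors spell out w
        and all(is_lyndon(f) for f in factors)                  # each factor is Lyndon
        and all(a >= b for a, b in zip(factors, factors[1:]))   # non-increasing order
    )


print(is_lyndon_factorization("abacabab", ["abac", "ab", "ab"]))
```

Because the factorization is unique, a list of factors that passes all three checks is the Lyndon factorization of w.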

Duval’s algorithm
To compute the Lyndon factorization of a word w, Duval’s algorithm finds the longest prefix w_1 of w = w_1 w′ that is a Lyndon word and then restarts the process on w′. Every non-empty prefix of a Lyndon word has the form (uv)^k u, where uv is a Lyndon word and k ≥ 1. Duval’s algorithm computes the factorization using a left-to-right parse.

Computing Lyndon factorization for T=aabaabaaac
For the string T = aabaabaaac, the parsed prefix P = T[1..i], which is a prefix of a Lyndon word, is equal to (uv)^k u for some strings u and v and an integer k. There are then two cases, depending on the next symbol to be read.

For i = 3, P = aab, with u the empty string, v = aab and k = 1. The next symbol to read is 'a', and aaba is still a prefix of a Lyndon word. The next iteration then starts with P = aaba.

For i = 6, P = aabaab, that is, P = (uv)^k u with u the empty string, v = aab and k = 2. The next symbols to read are 'aaa'; after reading the third 'a', the algorithm finds that aabaabaaa is not a prefix of a Lyndon word. It outputs the factor aab twice, and the next iteration starts on the suffix aaac of T with P = a.
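The walkthrough above can be reproduced with the standard textbook formulation of Duval's algorithm. This is a minimal sketch using 0-based indices rather than the 1-based T[1..i] of the slides; the function name duval and the variable names are assumptions.

```python
def duval(s: str) -> list:
    """Lyndon factorization of s by Duval's left-to-right scan."""
    factors = []
    n = len(s)
    i = 0                      # start of the part not yet factorized
    while i < n:
        j, k = i + 1, i        # s[j] is the next symbol, s[k] the symbol it is compared with
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i          # s[i:j+1] is itself a Lyndon word: the period grows
            else:
                k += 1         # the current period repeats
            j += 1
        # s[i:j] has the form (uv)^k u with period j - k; output the full repetitions.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print(duval("aabaabaaac"))     # ['aab', 'aab', 'aaac']
```

For T = aabaabaaac the inner loop stops when the third 'a' after aabaab is compared with 'b', the factor aab is output twice, and the scan restarts on the suffix aaac, as described above.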

Variations of Duval’s algorithm
The first variation is the LF-skip algorithm, which skips characters of the string. The second variation is for strings compressed with run-length encoding.

LF-skip algorithm
The algorithm is able to skip a significant portion of the characters of the string if the string contains runs of the smallest character. Let w be a word over an alphabet Σ with Lyndon factorization CFL(w) = w_1, w_2, ..., w_m.

LF-skip algorithm
Let c be the smallest symbol in Σ.
Suppose there exist k ≥ 2 and i ≥ 1 such that c^k is a prefix of w_i. If the last symbol of w is not c, then c^k is a prefix of each of w_i, w_{i+1}, ..., w_m. This property is used to devise an algorithm for Lyndon factorization that skips symbols.

LF-skip algorithm
Consider the alphabet {a, b, c, …} and assume that the last character of the text is not a. Let w_i start with aaad. Then w_{i+1} starts with one of the strings in the set P = {aaaa, aaab, aaac, aaad}. We search for occurrences of the patterns in P with an algorithm that is sublinear on average (e.g. SBNDM) in order to skip characters.
aaadxxxxxxxxxxxaaac
^---^--^^+++
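The pattern set P and the search step can be illustrated as follows. This is only a sketch of that one step, not the LF-skip algorithm itself: the function names are assumptions, and a naive str.find scan is swapped in for SBNDM, so it shows which positions are candidates but not the sublinear search.

```python
def candidate_prefixes(factor_prefix: str, alphabet: str) -> list:
    """Build P from a prefix such as 'aaad': the run c^k followed by any symbol <= 'd'."""
    run, last = factor_prefix[:-1], factor_prefix[-1]
    return [run + x for x in sorted(alphabet) if x <= last]


def next_candidate(text: str, start: int, patterns: list) -> int:
    """Leftmost position >= start where some pattern occurs, or -1 if none does.

    Naive stand-in for a multi-pattern matcher such as SBNDM.
    """
    hits = [pos for pos in (text.find(p, start) for p in patterns) if pos != -1]
    return min(hits) if hits else -1


P = candidate_prefixes("aaad", "abcdx")
text = "aaadxxxxxxxxxxxaaac"
print(P)
print(next_candidate(text, 1, P))
```

Since the next factor must start with a string in P, no factor can begin between the start of the current factor and the first match, which is why the characters in between need not be examined one by one.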

Run Length Encoding
Run-length encoding (RLE) is a very simple form of data compression in which each run of identical symbols is stored as a single symbol together with its length.
Given string: aaaaaabbbccaaabbbccbbbbbaaa
RLE: a6b3c2a3b3c2b5a3
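The encoding of the example string can be reproduced with a few lines of Python. This is a minimal sketch; representing the RLE as a list of (symbol, length) pairs and the function names are assumptions made for illustration.

```python
from itertools import groupby


def rle_encode(s: str) -> list:
    """Encode s as a list of (symbol, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_decode(runs: list) -> str:
    """Expand (symbol, run length) pairs back into the original string."""
    return "".join(ch * n for ch, n in runs)


def rle_to_text(runs: list) -> str:
    """Render the runs in the compact notation used on the slide, e.g. a6b3c2."""
    return "".join(f"{ch}{n}" for ch, n in runs)


runs = rle_encode("aaaaaabbbccaaabbbccbbbbbaaa")
print(rle_to_text(runs))       # a6b3c2a3b3c2b5a3
```

The length ρ of the encoded string used in the complexity bound corresponds to len(runs) in this representation.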

Lyndon factorization of RLE string
The second variation is for strings compressed with run-length encoding. It is preferable when strings are already stored in RLE.

Lyndon factorization of RLE string
The algorithm is based on Duval’s original algorithm and on a combinatorial property relating the Lyndon factorization of a string to its RLE: a run of length t in the RLE is either contained in one factor of the Lyndon factorization, or it corresponds to t unit-length factors.
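The combinatorial property can be checked on small examples. The sketch below takes a factorization as a list of factors and tests every maximal run of the concatenated string; the function name runs_respect_factors is an assumption, and the check is an illustration of the property, not a step of the RLE algorithm.

```python
from itertools import groupby


def runs_respect_factors(factors: list) -> bool:
    """True if every maximal run lies inside one factor or is split into unit-length factors."""
    # For each position of the concatenated string: (index of its factor, length of that factor).
    owner = [(idx, len(f)) for idx, f in enumerate(factors) for _ in f]
    text = "".join(factors)
    pos = 0
    for _, group in groupby(text):
        t = len(list(group))                       # length of the current run
        span = owner[pos:pos + t]
        in_one_factor = len({idx for idx, _ in span}) == 1
        all_unit = all(length == 1 for _, length in span)
        if not (in_one_factor or all_unit):
            return False
        pos += t
    return True


print(runs_respect_factors(["aab", "aab", "aaac"]))   # factorization of aabaabaaac
print(runs_respect_factors(["b", "a", "a", "a"]))     # factorization of baaa
```

In baaa the run aaa corresponds to three unit-length factors, while in aabaabaaac every run lies inside a single factor.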

Computing Lyndon factorization from RLE for T=aabaabaaac
For the string T = aabaabaaac, the parsed prefix P = T[1..i], which is a prefix of a Lyndon word, is equal to (uv)^k u for some strings u and v and an integer k. The RLE algorithm works similarly, except that whole runs are read instead of single symbols.

Computing Lyndon factorization from RLE for T=aabaabaaac
For i = 3, P = aab. The next run to be read is 'aa', and aabaa is still a prefix of a Lyndon word. The next iteration then starts with P = aabaa.

For i = 6, P = aabaab. The next run to be read is 'aaa', and aabaabaaa is not a prefix of a Lyndon word. The next iteration starts on the suffix aaac of T with P = aaa.

Complexity
Given a run-length encoded string R of length ρ, the algorithm computes the Lyndon factorization of R in O(ρ) time. It is preferable to Duval’s algorithm when the strings are stored or maintained in run-length encoding.

Experimental results
The LF-skip algorithm was compared with Duval’s algorithm on various texts. LF-skip gave a significant speed-up over Duval’s algorithm. The following table shows the speed-ups for random texts of 5 MB with various alphabet sizes.

Speed-up of LF-skip

Conclusion
Two variations of Duval’s algorithm for computing the Lyndon factorization of a string were presented. The first algorithm is designed to skip a significant portion of the characters. Experimental results show that it is considerably faster than Duval’s original algorithm. The second algorithm is for strings compressed with run-length encoding and computes the Lyndon factorization of a run-length encoded string of length ρ in O(ρ) time.

THANK YOU
