Combinatorial aspects of the Burrows-Wheeler transform

Presentation transcript:

Combinatorial aspects of the Burrows-Wheeler transform Sabrina Mantaci Antonio Restivo Marinella Sciortino University of Palermo

Burrows-Wheeler Transform
In 1994 M. Burrows and D. Wheeler introduced a new data compression method based on preprocessing the input string. This preprocessing, named the Burrows-Wheeler Transform (BWT) after them, produces a permutation of the letters of the input string such that:
- the transformed string is easier to compress than the original one;
- the original string can be recovered.
This preprocessing made it possible to define a class of lossless data compression algorithms that:
- achieve speed comparable to algorithms based on the techniques of Lempel and Ziv;
- obtain a compression ratio close to that of the best statistical modelling techniques.

How does BWT work?
INPUT: w = abraca
Lexicographically sort the cyclic rotations of w. F is the first column and L is the last column of the sorted matrix:

        F         L
    0   a a b r a c
    1   a b r a c a   <- I
    2   a c a a b r
    3   b r a c a a
    4   c a a b r a
    5   r a c a a b

The following properties hold:
- the character L[i] is followed in w (cyclically) by F[i];
- for each character ch, the i-th occurrence of ch in F corresponds to the i-th occurrence of ch in L.
OUTPUT: BWT(w) = L = caraab and the index I = 1, denoting the position of the original word w after the lexicographic ordering.
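The construction on this slide can be sketched in a few lines of Python. This is a naive illustration that sorts all rotations explicitly, not an efficient implementation; the function name `bwt` is chosen here for illustration only.

```python
def bwt(w):
    """Return (L, I): the last column of the sorted rotation matrix
    and the row index at which w itself appears."""
    n = len(w)
    # All cyclic rotations of w, sorted lexicographically.
    rotations = sorted(w[i:] + w[:i] for i in range(n))
    last_column = "".join(r[-1] for r in rotations)
    return last_column, rotations.index(w)

print(bwt("abraca"))  # ('caraab', 1)
```

Sorting whole rotations costs far more than necessary for long inputs; practical tools build a suffix array instead, but the output is the same for this small example.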

Reversibility
The Burrows-Wheeler transform is reversible: given BWT(w) and the index I, it is possible to recover w.
Given L = BWT(w) = caraab and I = 1:
1. Construct F by alphabetically sorting the letters in L:

    i   F[i]  L[i]
    0   a     c
    1   a     a
    2   a     r
    3   b     a
    4   c     a
    5   r     b

2. Define a permutation $\tau$ on $\{0,1,\ldots,n-1\}$ establishing a correspondence between the positions of the same letters in F and in L:

    i          0 1 2 3 4 5
    $\tau(i)$  1 3 4 5 0 2

3. Starting from position I, recover $w = w_0 \cdots w_{n-1}$ as $w_i = F[\tau^i(I)]$, where $\tau^0(x) = x$ and $\tau^{i+1}(x) = \tau(\tau^i(x))$.
In the example the positions visited are 1, 3, 5, 2, 4, 0, which gives w = abraca.
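A minimal sketch of this inversion procedure follows. The permutation symbol did not survive in the transcript, so it is called `tau` here; both function names are illustrative assumptions. The sketch relies on the property stated earlier that the i-th occurrence of a letter in F matches its i-th occurrence in L, which a stable sort of the positions of L reproduces.

```python
def tau_permutation(L):
    """tau[k] is the position in L of the letter occurrence found at position k of F."""
    # Python's sort is stable, so equal letters keep their order in L:
    # the i-th 'a' of F is paired with the i-th 'a' of L, and so on.
    return sorted(range(len(L)), key=lambda j: L[j])

def inverse_bwt(L, I):
    """Recover w from L = BWT(w) and the index I of w among the sorted rotations."""
    F = sorted(L)
    tau = tau_permutation(L)
    letters, x = [], I
    for _ in range(len(L)):
        letters.append(F[x])  # w_i = F[tau^i(I)]
        x = tau[x]
    return "".join(letters)

print(tau_permutation("caraab"))  # [1, 3, 4, 5, 0, 2]
print(inverse_bwt("caraab", 1))   # abraca
```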

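We can deduce that:
REMARK: Two words $x$ and $y$ are conjugate $\iff$ $\mathrm{BWT}(x) = \mathrm{BWT}(y)$.
PROPOSITION:
- If $u = v^d$ and $\mathrm{BWT}(v) = a_0 a_1 \cdots a_{n-1}$, then $\mathrm{BWT}(u) = a_0^d a_1^d \cdots a_{n-1}^d$.
- If $\mathrm{BWT}(v) = a_0 a_1 \cdots a_{n-1}$ and $\mathrm{BWT}(u) = a_0^d a_1^d \cdots a_{n-1}^d$, then there exists a conjugate $u'$ of $u$ such that $u' = v^d$.
Therefore we can study combinatorial properties of the BWT by studying the conjugacy classes of primitive words.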
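Both statements can be checked on small cases. The sketch below assumes the power form of the proposition, namely that BWT(v^d) is obtained by repeating each letter of BWT(v) d times; the helper names are hypothetical.

```python
def bwt(w):
    """Last column of the lexicographically sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)

def conjugates(w):
    """All cyclic rotations (conjugates) of w."""
    return [w[i:] + w[:i] for i in range(len(w))]

def stretch(word, d):
    """Repeat every letter of word d times: a0 a1 ... -> a0^d a1^d ..."""
    return "".join(ch * d for ch in word)

v = "abraca"
# Conjugate words share the same transform.
print(all(bwt(c) == bwt(v) for c in conjugates(v)))  # True
# The transform of a power is the letter-wise power of the transform.
print(bwt(v * 3) == stretch(bwt(v), 3))              # True
```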

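Standard Words
Let $d_1, d_2, \ldots, d_n, \ldots$ be a sequence of natural numbers with $d_1 \ge 0$ and $d_i > 0$ for $i = 2, \ldots, n, \ldots$
Consider the sequence $\{s_n\}_{n \ge 0}$ defined as: $s_0 = b$, $s_1 = a$, $s_{n+1} = s_n^{d_n} s_{n-1}$ for $n \ge 1$.
- $s = \lim_{n \to \infty} s_n$ is a characteristic Sturmian word;
- $\{s_n\}_{n \ge 0}$ is called the approximating sequence of $s$;
- $(d_1, d_2, \ldots, d_n, \ldots)$ is the directive sequence of $s$;
- each finite word $s_n$ is a standard word.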
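As an illustration, the approximating sequence can be generated from a finite prefix of the directive sequence. The defining formula is missing from the transcript, so this sketch assumes the usual recurrence $s_0 = b$, $s_1 = a$, $s_{n+1} = s_n^{d_n} s_{n-1}$; the function name is hypothetical.

```python
def standard_words(directive):
    """Return [s_0, s_1, ..., s_{k+1}] for a directive prefix (d_1, ..., d_k).

    Assumed recurrence: s_0 = 'b', s_1 = 'a', s_{n+1} = s_n^{d_n} s_{n-1}.
    """
    s_prev, s_cur = "b", "a"
    words = [s_prev, s_cur]
    for d in directive:
        s_prev, s_cur = s_cur, s_cur * d + s_prev
        words.append(s_cur)
    return words

# The all-ones directive sequence yields the finite Fibonacci words.
print(standard_words([1, 1, 1, 1]))
# ['b', 'a', 'ab', 'aba', 'abaab', 'abaababa']
```

The last word, abaababa, is the standard word of length 8 that appears as the example on the rotations slide.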

Characterization of standard words
- A word w is standard if and only if it is a letter or w = vab (or equivalently w = vba) and v has periods p, q such that gcd(p,q) = 1 and |v| = p+q-2 (the extremal case of the Fine and Wilf theorem).
- A word w is standard if and only if it is a letter or there exist palindrome words P, Q, R such that w = QR = Pxy, where {x,y} = {a,b}.
- Standard words correspond to an extremal case of the Knuth-Morris-Pratt algorithm.

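Rotations
Standard words can also be generated by rotations.
Let $p, q \ge 2$ be such that $\gcd(p,q) = 1$, and let $n = p + q$.
- $\rho_p : \{0,1,\ldots,n-1\} \to \{0,1,\ldots,n-1\}$ is defined as $\rho_p(z) = z + p \pmod{n}$.
- $I_a = \{0,1,\ldots,q-1\}$ and $I_b = \{q,q+1,\ldots,n-1\}$.
- $\lambda : \{0,1,\ldots,n-1\} \to \{a,b\}$ is defined as $\lambda(x) = a$ if $x \in I_a$, and $b$ otherwise.
Example: if n = 8, p = 3, q = 5, then w = abaababa.
THEOREM: Let $w = x_0 x_1 \cdots x_{n-1} \in \{a,b\}^*$ with $|w|_a = q$ and $|w|_b = p$.
- w is a standard word with suffix ba $\iff$ $x_i = \lambda(\rho_p^{i}(p))$ for $i = 0, \ldots, n-1$.
- w is a standard word with suffix ab $\iff$ $x_i = \lambda(\rho_p^{i}(p-1))$ for $i = 0, \ldots, n-1$.
REMARK: Let $u = u_0 u_1 \cdots u_{n-1}$ and $v = v_0 v_1 \cdots v_{n-1}$. If $u_i = \lambda(\rho_p^{i}(x))$ and $v_i = \lambda(\rho_p^{i}(y))$ for some $x, y \in \{0,1,\ldots,n-1\}$, then u and v are conjugate.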
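The rotation construction can be sketched as follows. The function name and the `start` parameter are hypothetical. The starting points p and p-1 used in the example are an assumption: they are taken from the indices I = p and I = p-1 named on the slide with the new characterization of standard words, since the formulas on this slide are missing from the transcript.

```python
from math import gcd

def rotation_word(p, q, start):
    """Read the letters met along the orbit of `start` under z -> z + p (mod n).

    Positions 0..q-1 are labelled 'a' and positions q..n-1 are labelled 'b'.
    """
    if gcd(p, q) != 1:
        raise ValueError("p and q must be coprime")
    n = p + q
    letters, z = [], start
    for _ in range(n):
        letters.append("a" if z < q else "b")
        z = (z + p) % n
    return "".join(letters)

# n = 8, p = 3, q = 5 as in the example on the slide.
print(rotation_word(3, 5, 3))  # abaababa (suffix ba)
print(rotation_word(3, 5, 2))  # abaabaab (suffix ab)
```

Because gcd(p, n) = 1, the orbit visits every position exactly once, so different starting points only produce conjugates of the same word.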

A new characterization of standard words
THEOREM: Let u be a word over the alphabet {a,b}. $\mathrm{BWT}(u) = b^p a^q$ with $\gcd(p,q) = 1$ if and only if u is a conjugate of a standard word.
In particular, when reconstructing u from BWT(u) and the index I:
- if I = p then u is a standard word with suffix ba;
- if I = p-1 then u is a standard word with suffix ab.
COROLLARY: $\mathrm{BWT}(u) = b^k a^h$ with $\gcd(k,h) = d$ if and only if $u = v^d$, where v is a conjugate of a standard word.
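The theorem and its corollary can be checked on the standard words of length 8 with five a's and three b's (so p = 3, q = 5). This is a small sanity check under that one example, not a proof; `bwt_with_index` is a name chosen here.

```python
def bwt_with_index(w):
    """Return (L, I) for w: last column of the sorted rotations and the row of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations), rotations.index(w)

# Standard word with suffix ba: the index is p = 3.
print(bwt_with_index("abaababa"))  # ('bbbaaaaa', 3)
# Standard word with suffix ab: the index is p - 1 = 2.
print(bwt_with_index("abaabaab"))  # ('bbbaaaaa', 2)
# Corollary: the square gives b^6 a^10, and gcd(6, 10) = 2.
print(bwt_with_index("abaababa" * 2)[0])  # bbbbbbaaaaaaaaaa
```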

Idea of the proof:
[Figure: the columns F and L side by side, with the permutation $\tau$ drawn between them; F begins a a a a a b and L begins b b b a a.]
The permutation $\tau$ giving the correspondence between the positions of characters in F and L is $\tau(z) = z + p \pmod{n}$.
Starting, for example, from the position I = p, we can recover the word u as $u_i = F[\tau^i(p)]$.

Further Research
- Study the extremal case of the BWT for k-letter alphabets with k > 2. For instance, for k = 3, characterize the words w such that BWT(w) belongs to c*a*b* or b*c*a*. This property holds neither for 3-standard words nor for balanced words.
- Does a relation exist between the complexity function of a word w and the structure of BWT(w)?
- Given a language L, one can define BWT(L) = {BWT(w) | w in L}. One can ask whether the BWT preserves some properties of a language L, such as belonging to a certain family of languages in the Chomsky hierarchy. We found negative results:
  - $L_1 = (ab)^*$ is regular, but $\mathrm{BWT}(L_1) = \{ b^n a^n \mid n \ge 0 \}$ is a context-free language that is not regular;
  - $L_2 = (abc)^*$ is regular, but $\mathrm{BWT}(L_2) = \{ c^n a^n b^n \mid n \ge 0 \}$ is a context-sensitive language that is not context-free.
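The two negative examples are easy to confirm for small n. The sketch below uses a naive rotation-sorting `bwt` helper written for this illustration.

```python
def bwt(w):
    """Last column of the lexicographically sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)

for n in range(1, 5):
    # BWT((ab)^n) = b^n a^n and BWT((abc)^n) = c^n a^n b^n
    print(n, bwt("ab" * n), bwt("abc" * n))
```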

Further Research
Is it possible to characterize interesting families of words in terms of their BWT?
Consider for instance the words generated by finite iterations of the Thue-Morse morphism $\mu(a) = ab$, $\mu(b) = ba$. Denote by $v^R$ the reversal of the word $v$ and by $\bar{v}$ the word obtained by interchanging a with b and vice versa. Then:
$\mathrm{BWT}(\mu^n(a)) = v\,\bar{v}^R$
where
- $v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} \cdots b^{2^0} a$ if n is even;
- $v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} \cdots a^{2^0} b$ if n is odd.
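The Thue-Morse formula can be checked for small n. This sketch assumes the reading BWT(μ^n(a)) = v followed by the reversed complement of v, which matches the hand-computed cases n = 2 (baba) and n = 3 (bbababaa); all function names are hypothetical.

```python
def bwt(w):
    """Last column of the lexicographically sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)

def thue_morse(n):
    """Apply the morphism a -> ab, b -> ba to the letter 'a' n times."""
    w = "a"
    for _ in range(n):
        w = "".join("ab" if ch == "a" else "ba" for ch in w)
    return w

def v_word(n):
    """b^(2^(n-2)) a^(2^(n-3)) ... with alternating letters, then one closing letter."""
    parts, letter = [], "b"
    for e in range(n - 2, -1, -1):
        parts.append(letter * 2**e)
        letter = "a" if letter == "b" else "b"
    parts.append(letter)  # 'a' when n is even, 'b' when n is odd
    return "".join(parts)

def complement_reverse(v):
    """Swap a and b in v, then reverse the result."""
    return "".join("b" if ch == "a" else "a" for ch in reversed(v))

for n in (2, 3, 4):
    v = v_word(n)
    print(n, bwt(thue_morse(n)) == v + complement_reverse(v))  # True each time
```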