Combinatorial aspects of the Burrows-Wheeler transform


1 Combinatorial aspects of the Burrows-Wheeler transform
Sabrina Mantaci, Antonio Restivo, Marinella Sciortino (University of Palermo)

2 Burrows-Wheeler Transform
In 1994 M. Burrows and D. Wheeler introduced a new data compression method based on preprocessing the input string. This preprocessing, named the Burrows-Wheeler Transform (BWT) after them, produces a permutation of the letters of the input string such that:
the transformed string is easier to compress than the original one;
the original string can be recovered.
This preprocessing made it possible to define a class of lossless data compression algorithms that:
achieve speed comparable to algorithms based on the techniques of Lempel and Ziv;
achieve a compression ratio close to that of the best statistical modelling techniques.

3 How does BWT work?
INPUT: w = abraca
Lexicographically sort the cyclic rotations of w. F is the first column of the sorted list and L is the last one.

    F         L
0   a a b r a c
1   a b r a c a   <- I
2   a c a a b r
3   b r a c a a
4   c a a b r a
5   r a c a a b

The following properties hold:
the character L[i] is followed in w (read cyclically) by F[i];
for each character ch, the i-th occurrence of ch in F corresponds to the i-th occurrence of ch in L.
OUTPUT: BWT(w) = L = caraab and the index I = 1, denoting the position of the original word w after the lexicographic ordering.
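The construction on this slide can be sketched in a few lines of Python. This is a naive illustration that sorts all rotations explicitly, not the method of any particular compressor; the function name `bwt` is chosen here for the example.

```python
def bwt(w):
    """Return (L, I): the last column of the sorted cyclic rotations of w
    and the row at which w itself appears."""
    n = len(w)
    rotations = sorted(w[i:] + w[:i] for i in range(n))
    last_column = "".join(r[-1] for r in rotations)
    return last_column, rotations.index(w)


L, I = bwt("abraca")
print(L, I)  # caraab 1
```

Sorting the rotations explicitly costs quadratic space, which is acceptable for slide-sized examples only.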

4 Reversibility
The Burrows-Wheeler transform is reversible: given BWT(w) and the index I, it is possible to recover w.
Given L = BWT(w) = caraab and I = 1:
Construct F by alphabetically sorting the letters of L.
Define a permutation $\tau$ on {0,1,…,n-1} that maps the position of each letter in F to the position of the same letter in L.

i           0 1 2 3 4 5
F[i]        a a a b c r
L[i]        c a r a a b
$\tau(i)$   1 3 4 5 0 2

Starting from position I, we can recover $w = w_0 \dots w_{n-1}$ as follows: $w_i = F[\tau^i(I)]$, where $\tau^0(x) = x$ and $\tau^{i+1}(x) = \tau(\tau^i(x))$. Here the orbit of I = 1 is 1, 3, 5, 2, 4, 0, which spells w = abraca.
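The recovery procedure can be sketched as follows. The names `inverse_bwt` and `tau` are chosen for this example. The sketch relies on Python's sort being stable, which is what matches the k-th occurrence of a letter in F with its k-th occurrence in L.

```python
def inverse_bwt(L, I):
    """Recover w from L = BWT(w) and the index I of w among the sorted rotations."""
    n = len(L)
    F = sorted(L)
    # Stable sort of the positions of L by letter: tau[k] is the position in L
    # of the letter occurrence that sits at position k in F.
    tau = sorted(range(n), key=lambda j: L[j])
    letters = []
    pos = I
    for _ in range(n):
        letters.append(F[pos])
        pos = tau[pos]
    return "".join(letters)


print(inverse_bwt("caraab", 1))  # abraca
```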

5 We can deduce that:
REMARK: Two words x and y are conjugate $\iff$ BWT(x) = BWT(y).
PROPOSITION:
If $u = v^d$ and $BWT(v) = a_0 a_1 \dots a_{n-1}$, then $BWT(u) = a_0^d a_1^d \dots a_{n-1}^d$.
If $BWT(v) = a_0 a_1 \dots a_{n-1}$ and $BWT(u) = a_0^d a_1^d \dots a_{n-1}^d$, then there exists a conjugate u' of u such that $u' = v^d$.
Therefore we can study combinatorial properties of the BWT by studying the conjugacy classes of primitive words.
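A quick numerical check of the remark and of the first part of the proposition, as a sketch. The helper `spread`, which repeats every letter d times, is a name introduced only for this example.

```python
def bwt(w):
    """Last column of the sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)


def spread(word, d):
    """Repeat each letter of word d times: a0 a1 ... -> a0^d a1^d ..."""
    return "".join(ch * d for ch in word)


v = "abraca"
# Conjugate words have the same transform.
assert all(bwt(v[i:] + v[:i]) == bwt(v) for i in range(len(v)))
# BWT(v^d) is BWT(v) with every letter repeated d times.
for d in (1, 2, 3):
    assert bwt(v * d) == spread(bwt(v), d)
print(bwt(v * 2))  # ccaarraaaabb
```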

6 Standard Words
Let $d_1, d_2, \dots, d_n, \dots$ be a sequence of natural numbers with $d_1 \ge 0$ and $d_i > 0$ for $i = 2, \dots, n$.
Consider the sequence $\{s_n\}_{n \ge 0}$ defined as: […]
s is a characteristic Sturmian word.
$\{s_n\}_{n \ge 0}$ is called the approximating sequence of s.
$(d_1, d_2, \dots, d_n, \dots)$ is the directive sequence of s.
Each finite word $s_n$ is a standard word.
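The recurrence defining the sequence is not legible in this transcript. The sketch below assumes the definition commonly used in the literature on Sturmian words, namely $s_0 = b$, $s_1 = a$, $s_{n+1} = s_n^{d_n} s_{n-1}$ for $n \ge 1$. That recurrence is an assumption of this example, as is the function name `standard_words`.

```python
def standard_words(directive):
    """Finite standard words for a directive sequence (d1, d2, ...).

    Assumed recurrence: s0 = 'b', s1 = 'a', s_{n+1} = s_n^{d_n} s_{n-1}.
    """
    previous, current = "b", "a"
    words = [previous, current]
    for d in directive:
        previous, current = current, current * d + previous
        words.append(current)
    return words


print(standard_words([1, 1, 1, 1]))
# ['b', 'a', 'ab', 'aba', 'abaab', 'abaababa']
```

With the directive sequence (1, 1, 1, 1) this yields the Fibonacci words, ending in abaababa, the word used as an example on the Rotations slide.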

7 Characterization of standard words
A word w is standard if and only if it is a letter or w=vab (or equivalently w=vba) and v has periods p, q such that gcd(p,q)=1 and |v|=p+q-2 (extremal case of the Fine and Wilf theorem).
A word w is standard if and only if it is a letter or there exist palindromes P, Q, R such that w = QR = Pxy, where {x,y}={a,b}.
Standard words correspond to an extremal case of the Knuth-Morris-Pratt algorithm.
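The first characterization can be tested directly on small words. The sketch below assumes the convention that an integer p counts as a period of v whenever v[i] = v[i+p] for all positions where both sides exist, so periods up to |v|+1 are considered; this convention, and the names `periods` and `looks_standard`, are assumptions of the example.

```python
from math import gcd


def periods(v):
    """All p in 1..len(v)+1 such that v[i] == v[i+p] wherever both exist."""
    return [p for p in range(1, len(v) + 2)
            if all(v[i] == v[i + p] for i in range(len(v) - p))]


def looks_standard(w):
    """Check: w is a letter, or w = v·ab / v·ba with v having coprime
    periods p, q such that |v| = p + q - 2."""
    if len(w) == 1:
        return True
    if w[-2:] not in ("ab", "ba"):
        return False
    v = w[:-2]
    ps = periods(v)
    return any(gcd(p, q) == 1 and p + q - 2 == len(v) for p in ps for q in ps)


print(periods("abaaba"))           # [3, 5, 6, 7]
print(looks_standard("abaababa"))  # True
print(looks_standard("abba"))      # False
```

For w = abaababa the prefix v = abaaba has periods 3 and 5, and |v| = 6 = 3 + 5 - 2.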

8 Rotations
Standard words can also be generated by rotations.
Let $p, q \ge 2$ be such that gcd(p,q)=1, and let n=p+q.
$\rho_p : \{0,1,\dots,n-1\} \to \{0,1,\dots,n-1\}$ is defined as $\rho_p(z) = z + p \pmod n$.
$I_a = \{0,1,\dots,q-1\}$, $I_b = \{q,q+1,\dots,n-1\}$.
$\gamma : \{0,1,\dots,n-1\} \to \{a,b\}$ is defined as $\gamma(x) = a$ if $x \in I_a$, b otherwise.
[Figure: the points 0,…,7 on a circle, each labelled a or b]
If n=8, p=3, q=5, … w=abaababa
THEOREM: Let $w = x_0 x_1 \dots x_{n-1}$ in {a,b}*, with $|w|_a = q$ and $|w|_b = p$.
w is a standard word with suffix ba $\iff$ $x_i =$ […]
w is a standard word with suffix ab $\iff$ $x_i =$ […]
REMARK: Let $u = u_0 u_1 \dots u_{n-1}$ and $v = v_0 v_1 \dots v_{n-1}$. If $u_i =$ […] and $v_i =$ […], then u and v are conjugate.
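The exact formulas of the theorem are not legible in this transcript. The sketch below assumes one reading that reproduces the slide's example: follow the orbit of the rotation z -> z + p (mod n) from a starting point and write a for points below q and b otherwise. The starting points p and p-1 are taken from the indices given on the later slide about reconstruction; the function name `rotation_word` is hypothetical.

```python
def rotation_word(p, q, start):
    """Letters read along the orbit start, start+p, start+2p, ... (mod p+q).

    A point x is labelled 'a' if x < q (the interval I_a) and 'b' otherwise.
    """
    n = p + q
    return "".join("a" if (start + i * p) % n < q else "b" for i in range(n))


print(rotation_word(3, 5, 3))  # abaababa  (suffix ba)
print(rotation_word(3, 5, 2))  # abaabaab  (suffix ab)
```

Changing the starting point only shifts the orbit, so all words obtained this way for fixed p and q are conjugate.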

9 A new characterization of standard words
THEOREM: Let u be a word over the alphabet {a,b}. $BWT(u) = b^p a^q$ with gcd(p,q)=1 if and only if u is a conjugate of a standard word.
In particular, when reconstructing u from BWT(u) and the index I:
if I=p then u is a standard word with suffix ba;
if I=p-1 then u is a standard word with suffix ab.
COROLLARY: $BWT(u) = b^k a^h$ with gcd(k,h)=d if and only if $u = v^d$, where v is a conjugate of a standard word.
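The statement can be checked on the running example with p=3 and q=5. This is a sketch using a naive `bwt` helper defined here for the purpose.

```python
def bwt(w):
    """Return (L, I) for the sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations), rotations.index(w)


p, q = 3, 5
standard = "abaababa"
# Every conjugate of the standard word has transform b^p a^q.
for i in range(len(standard)):
    conjugate = standard[i:] + standard[:i]
    assert bwt(conjugate)[0] == "b" * p + "a" * q

print(bwt("abaababa"))   # ('bbbaaaaa', 3): index p, suffix ba
print(bwt("abaabaab"))   # ('bbbaaaaa', 2): index p-1, suffix ab
print(bwt("abaaba")[0])  # bbaaaa: the corollary with v = aba, d = 2
```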

10 Idea of the proof:

    F   L
0   a   b   0
1   a   b   1
2   a   b   2
3   a   a   3
4   a   a   4
5   b   …

The permutation $\tau$ giving the correspondence between the positions of characters in F and L is $\tau(z) = z + p \pmod n$. Starting, for example, from the position I=p, we can recover the word u: $u_i = F[\tau^i(p)]$.
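For $F = a^q b^p$ and $L = b^p a^q$ the correspondence between equal letters can be computed generically and compared with the rotation by p. The sketch below does this for p=3, q=5; `position_map` is a name chosen for the example.

```python
def position_map(L):
    """Map each position k of F = sorted(L) to the position in L of the
    same letter occurrence (a stable sort keeps occurrences in order)."""
    return sorted(range(len(L)), key=lambda j: L[j])


p, q = 3, 5
n = p + q
F = "a" * q + "b" * p
L = "b" * p + "a" * q

tau = position_map(L)
assert tau == [(z + p) % n for z in range(n)]  # tau is the rotation by p

# Recover u from I = p by following tau.
pos, letters = p, []
for _ in range(n):
    letters.append(F[pos])
    pos = tau[pos]
u = "".join(letters)
print(u)  # abaababa
```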

11 Further Research
Study extremal cases of the BWT for k-letter alphabets with k>2. For instance, for k=3, characterize the words w such that BWT(w) belongs to c*a*b* or b*c*a*. This property holds neither for 3-standard words nor for balanced words.
Does a relation exist between the complexity function of a word w and the structure of BWT(w)?
Given a language L, one can define BWT(L) = {BWT(w) | w in L}. One can ask whether BWT preserves some properties of a language L, such as belonging to a certain family of languages in the Chomsky hierarchy. We found negative results:
$L_1 = (ab)^*$, $BWT(L_1) = \{b^n a^n \mid n \ge 0\}$, a context-free language;
$L_2 = (abc)^*$, $BWT(L_2) = \{c^n a^n b^n \mid n \ge 0\}$, a context-sensitive language.
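The two language examples can be confirmed for small n with a naive transform. This is a sketch and the helper `bwt` is defined here only for the check.

```python
def bwt(w):
    """Last column of the sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)


for n in range(1, 6):
    assert bwt("ab" * n) == "b" * n + "a" * n
    assert bwt("abc" * n) == "c" * n + "a" * n + "b" * n

print(bwt("ababab"))     # bbbaaa
print(bwt("abcabcabc"))  # cccaaabbb
```

Both input languages are regular, while the images are not, which is why the slide calls these negative results.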

12 Further Research
Is it possible to characterize interesting families of words in terms of their BWT? Consider for instance the words generated by finite iterations of the Thue-Morse morphism m(a)=ab, m(b)=ba. Denote by $v^R$ the reversal of v and by $\bar{v}$ the word obtained from v by interchanging a and b. Then:
$BWT(m^n(a)) = v \bar{v}^R$
where
$v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} \cdots b^{2^0} a$ if n is even,
$v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} \cdots a^{2^0} b$ if n is odd.
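The Thue-Morse formula can be checked for small n. The sketch below assumes the reading in which the transform is v followed by the reversal of the letter-swapped v, with block exponents that are decreasing powers of two; the function names are chosen for this example.

```python
def bwt(w):
    """Last column of the sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)


def thue_morse(n):
    """m^n(a) for the morphism m(a) = ab, m(b) = ba."""
    w = "a"
    for _ in range(n):
        w = "".join("ab" if ch == "a" else "ba" for ch in w)
    return w


def predicted_bwt(n):
    """v followed by the reversal of v with a and b interchanged (n >= 2)."""
    # Blocks alternate b, a, b, ... with exponents 2^(n-2), ..., 2^0,
    # followed by one letter that continues the alternation.
    v = "".join("ba"[k % 2] * 2 ** (n - 2 - k) for k in range(n - 1))
    v += "ba"[(n - 1) % 2]
    swapped = v.translate(str.maketrans("ab", "ba"))
    return v + swapped[::-1]


for n in (2, 3, 4):
    assert bwt(thue_morse(n)) == predicted_bwt(n)
print(bwt(thue_morse(3)))  # bbababaa
```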
