Burrows-Wheeler Transformation Review
Compression techniques
Lossless: Huffman coding, run-length encoding (RLE), Lempel-Ziv (LZ77), Burrows-Wheeler transform (BWT).
Lossy: used to handle audio, image, and video.
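Huffman coding
Suppose we send a telegram with the content 'a b a c c d a'. Because there are four different characters, two bits per character are enough to code them:
00:a 01:b 10:c 11:d
"abaccda" is then coded as '00010010101100' (14 bits). Huffman coding improves on this fixed-length code by giving frequent characters such as 'a' shorter codewords.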
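The comparison between the fixed-length code and a Huffman code can be sketched as follows. This is a minimal illustration, not the slide author's implementation; the function name `huffman_code` and the tie-breaking order are assumptions, so the exact codewords may differ from other Huffman implementations while the total length stays the same.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return a dict mapping each character to a Huffman codeword."""
    freq = Counter(text)
    # Heap entries: (weight, tie_breaker, {char: partial codeword}).
    # The integer tie_breaker keeps Python from comparing the dicts.
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)    # two lightest subtrees
        w1, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "abaccda"
fixed = {"a": "00", "b": "01", "c": "10", "d": "11"}   # table from the slide
fixed_bits = "".join(fixed[ch] for ch in text)
code = huffman_code(text)
huffman_bits = "".join(code[ch] for ch in text)
print(fixed_bits, len(fixed_bits))   # 14 bits with the fixed-length code
print(code, len(huffman_bits))       # 13 bits with the Huffman code
```

The saving is small here because the message is short; the gap grows as the character frequencies become more skewed.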
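Run-length encoding
To encode the data, RLE transforms a run of identical values into a (count, value) pair.
Input={ 1,1,1,1,1,1 }; Output={ 6,1 }
Input={ 6,1,0,1,1,1,1,1,1 }; Output={ 6,1,0,6,1 }
So what does "6,1" mean? In the second output the leading 6,1 is literal data while the trailing 6,1 stands for six 1s, and the decoder cannot tell them apart. So we need a control code to mark encoded runs (the least frequently occurring code can be used).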
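One way to resolve the ambiguity is sketched below. It is only an illustration under stated assumptions: the control code `esc=255` and the threshold `min_run=3` are hypothetical choices, and runs are written as (control code, count, value) while everything else is copied literally.

```python
def rle_encode(data, esc=255, min_run=3):
    """Encode runs as [esc, count, value]; copy short runs literally."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                      # extend the current run
        run = j - i
        if run >= min_run or data[i] == esc:
            # A literal esc must also be escaped so the decoder is not confused.
            out += [esc, run, data[i]]
        else:
            out += [data[i]] * run
        i = j
    return out

def rle_decode(code, esc=255):
    """Invert rle_encode."""
    out = []
    i = 0
    while i < len(code):
        if code[i] == esc:
            out += [code[i + 2]] * code[i + 1]
            i += 3
        else:
            out.append(code[i])
            i += 1
    return out

print(rle_encode([1, 1, 1, 1, 1, 1]))            # [255, 6, 1]
print(rle_encode([6, 1, 0, 1, 1, 1, 1, 1, 1]))   # [6, 1, 0, 255, 6, 1]
```

With the control code in place, the literal 6,1 and the encoded run are no longer confused.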
Lempel-Ziv (LZ77)
LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the input (uncompressed) data stream.
Input="the brown fox jumped over the brown foxy jumping frog"
The result is: (figure not recovered)
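The slide's own result is not reproduced here. As a rough illustration of the idea, the sketch below emits (offset, length, next character) triples, one common textbook form of LZ77. The function names, the triple format, and the window size of 32 characters are assumptions made for this example, and the search is deliberately naive.

```python
def lz77_encode(text, window=32):
    """Greedy LZ77: emit (offset, length, next_char) triples."""
    i, out = 0, []
    while i < len(text):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Always keep one character back to serve as the literal.
            while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(tokens):
    """Rebuild the text by copying from the already decoded output."""
    s = []
    for off, length, ch in tokens:
        for _ in range(length):
            s.append(s[-off])           # copy one character from `off` back
        s.append(ch)
    return "".join(s)

text = "the brown fox jumped over the brown foxy jumping frog"
tokens = lz77_encode(text)
print(tokens)
print(lz77_decode(tokens) == text)
```

The repeated phrase "the brown fox" falls within the assumed window, so the encoder can replace part of its second occurrence with a back-reference.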
Outline
- Burrows-Wheeler Transformation
- Procedures of BWT
- Compression based on BWT
- FM-index based on BWT
- FM-index and compression based on BWT
- Experiment results
- Conclusion
Burrows-Wheeler Transformation
The Burrows-Wheeler Transformation (BWT) was first proposed by Burrows and Wheeler in 1994.
Properties:
- BWT compression works on blocks of data, not on a data stream.
- BWT itself does not compress data; it permutes the data to make it more compressible.
- BWT compression achieves good compression results at a competitive time cost.
Procedures of BWT
Three steps, with S=abraca as the example:
① Cyclically shift the data block of length N.
② Sort the results of step ① to get the matrix M.
③ Output the last column L and the index of the original string in M.

Cyclic shifts:
index  result
0      abraca
1      bracaa
2      racaab
3      acaabr
4      caabra
5      aabrac

Sorted (matrix M):
index  result
0      aabrac
1      abraca
2      acaabr
3      bracaa
4      caabra
5      racaab

Output: L=caraab, Index=1
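The three steps can be sketched directly in code. This is a naive version that materializes all rotations, which is fine for a short example but not how production tools build the transform; the function name `bwt` is an assumption.

```python
def bwt(s):
    """Naive Burrows-Wheeler transform via sorted cyclic shifts."""
    n = len(s)
    rotations = [s[i:] + s[:i] for i in range(n)]   # step 1: cyclic shifts
    m = sorted(rotations)                           # step 2: matrix M
    last = "".join(row[-1] for row in m)            # step 3: last column L
    return last, m.index(s)                         # ... and the row of S in M

print(bwt("abraca"))   # ('caraab', 1)
```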
Reversible Procedures of BWT
It is critical to understand the following two properties of the matrix M:
1. For the i-th row of M, the last character L[i] precedes the first character F[i] in the original string S, namely …L[i]F[i]…. F is the first column of M and can be obtained by sorting L.
2. Last-to-First mapping (LF-mapping). Let L[i]=c and let r_i be the number of occurrences of c in the prefix L[0,i-1]. Let M[j] be the r_i-th row of M starting with c (counting from 0). Then the character F[j] in the first column corresponds to L[i] in the last column. Set LF[i]=j, meaning that F[j] and L[i] are the same character of the original string.
Put simply, the occurrences of any character keep the same relative order in F and in L.

How to compute the LF array?
Occ(i) is the number of occurrences of L[i] in the prefix L[0…i-1].
C[c] is the number of characters in the string that are smaller than c.
LF[i]=C[L[i]]+Occ(i)

Example (S'=abraca$):
index  sorted rows
0      $abraca
1      a$abrac
2      abraca$
3      aca$abr
4      braca$a
5      ca$abra
6      raca$ab

L[5]=a, C[a]=1, Occ(5)=2, so LF[5]=C[a]+Occ(5)=1+2=3: L[5] corresponds to F[3].
Why the order is preserved: the three rows starting with a (rows 1, 2, 3) are in sorted order. Removing the leading a leaves $abrac, braca$, ca$abr, which are still in order, and these are exactly the rows that end with a (rows 0, 4, 5).
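The LF array for the example can be computed with a short sketch. The function name `lf_mapping` is an assumption; the code relies only on the last column L and on the definitions of C and Occ given on the slide.

```python
def lf_mapping(L):
    """Return (C, LF) for a BWT last column L."""
    # C[c]: number of characters in L that are smaller than c.
    C, total = {}, 0
    for ch in sorted(set(L)):
        C[ch] = total
        total += L.count(ch)
    # Occ(i): occurrences of L[i] in L[0..i-1], tracked with a running count.
    seen, lf = {}, []
    for ch in L:
        occ = seen.get(ch, 0)
        lf.append(C[ch] + occ)
        seen[ch] = occ + 1
    return C, lf

C, LF = lf_mapping("ac$raab")   # last column of the sorted rows of abraca$
print(C)       # {'$': 0, 'a': 1, 'b': 4, 'c': 5, 'r': 6}
print(LF)      # [1, 5, 0, 6, 2, 3, 4]
print(LF[5])   # 3
```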
Reversible Procedures of BWT
Algorithm BWT_reverse(L)
1. i=0; // M[0]=$S
2. for j=N-1 down to 0
3.   S[j]=L[i];
4.   i=Occ(i)+C[L[i]]; // follow the LF-mapping

Example: S'=abraca$, L=ac$raab, starting at i=0.
step  i  L[i]  assignment  recovered suffix
1     0  a     S[5]=a      a
2     1  c     S[4]=c      ca
3     5  a     S[3]=a      aca
4     3  r     S[2]=r      raca
5     6  b     S[1]=b      braca
6     4  a     S[0]=a      abraca
Then i=2 and L[2]=$: the walk ends on meeting $.
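A runnable sketch of the reverse procedure follows. It assumes that the string was terminated with a unique end marker `$` that sorts before every other character, so that row 0 of M is `$S`; the function name `bwt_reverse` and the `end` parameter are illustrative.

```python
def bwt_reverse(L, end="$"):
    """Recover S from the last column L of the sorted rotations of S + end."""
    # C[c]: number of characters in L smaller than c.
    C, total = {}, 0
    for ch in sorted(set(L)):
        C[ch] = total
        total += L.count(ch)
    # LF[i] = C[L[i]] + Occ(i), with Occ(i) kept as a running count.
    seen, lf = {}, []
    for ch in L:
        lf.append(C[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    i, out = 0, []          # row 0 is end + S, so L[0] is the last char of S
    while L[i] != end:
        out.append(L[i])    # characters come out from last to first
        i = lf[i]
    return "".join(reversed(out))

print(bwt_reverse("ac$raab"))   # abraca
```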
Compression based on BWT
BWT itself does not compress data, so BWT-based compression combines the BWT with existing compression techniques.
Input data → BWT → GST → RLE → EC → Output data
GST: Move-to-Front (MTF)
RLE: run-length encoding
EC: entropy coding, e.g. Huffman coding
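The Move-to-Front stage can be sketched as below. This is a minimal illustration with assumed function names; the initial symbol table is passed in explicitly as `alphabet`, and real compressors typically start from a fixed table of all byte values.

```python
def mtf_encode(text, alphabet):
    """Replace each symbol by its current index, then move it to the front."""
    table = list(alphabet)
    out = []
    for ch in text:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out

def mtf_decode(codes, alphabet):
    """Invert mtf_encode using the same initial table."""
    table = list(alphabet)
    out = []
    for idx in codes:
        ch = table.pop(idx)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)

codes = mtf_encode("caraab", "abcr")   # BWT output of abraca
print(codes)                           # [2, 1, 3, 1, 0, 3]
print(mtf_decode(codes, "abcr"))       # caraab
```

Repeated symbols in the BWT output turn into zeros (the second a of "aa" above), and long runs of zeros are what the later RLE and entropy coding stages exploit.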
FM-index based on BWT
What is FM-index?
FM-index: full-text, minute-space index. It combines the BWT-based compression algorithm with the suffix array data structure, and achieves efficient random access to the compressed data without decompressing all of it at query time.

How does FM-index work?
FM-index consists of two steps: 1) counting the occurrences of the matching pattern; 2) locating the occurrences.

Step 1: Counting the occurrences
Algorithm counting(P[0,p-1])
  c=P[p-1], i=p-1;
  sp=C[c], ep=C[c+1]-1;
  while((sp≤ep)&&(i≥1)) do
    c=P[i-1];
    sp=C[c]+Occ(c,sp);
    ep=C[c]+Occ(c,ep+1)-1;
    i=i-1;
  if(ep<sp) return 0; else return ep-sp+1;
C[c] is the number of characters that are smaller than c; C[c+1] refers to the character that follows c in alphabetical order.
Occ(c,i) is the number of occurrences of c in the prefix L[0…i-1].
After each iteration, sp points to the first row of M prefixed by P[i,p-1] and ep points to the last such row.

Example: S'=abraca$, P=aca, L=ac$raab
index  row
0      $abraca
1      a$abrac
2      abraca$
3      aca$abr
4      braca$a
5      ca$abra
6      raca$ab
a:   sp=C[a]=1, ep=C[b]-1=4-1=3
ca:  sp=C[c]+Occ(c,1)=5+0=5, ep=C[c]+Occ(c,4)-1=5+1-1=5
aca: sp=C[a]+Occ(a,5)=1+2=3, ep=C[a]+Occ(a,6)-1=1+3-1=3
Result: ep-sp+1=1 occurrence, in row 3.

Step 2: Locating the occurrences
index  row      suffix array
0      $abraca  6
1      a$abrac  5
2      abraca$  0
3      aca$abr  3
4      braca$a  1
5      ca$abra  4
6      raca$ab  2
Only marked rows store their position in the text. Row 3 is not marked, so follow the LF-mapping: LF[3]=6, and M[6] is a marked row with pos(6)=2. Set pos(3)=pos(6)+1=3.
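Both steps can be sketched in a few lines. This is an uncompressed toy version under stated assumptions: Occ is computed by scanning L instead of using bucketed counts, rows are marked when their text position is a multiple of a hypothetical `sample=2`, and all function names are illustrative. With Occ(c,i) counting L[0…i-1] and 0-based rows, the upper bound is updated as C[c]+Occ(c,ep+1)-1.

```python
def build_fm_index(s, sample=2):
    """Return (L, C, marked) for s + '$'; marked maps row -> text position."""
    t = s + "$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])    # naive suffix array
    L = "".join(t[i - 1] for i in sa)                  # BWT last column
    C, total = {}, 0
    for ch in sorted(set(t)):
        C[ch] = total
        total += t.count(ch)
    marked = {row: pos for row, pos in enumerate(sa) if pos % sample == 0}
    return L, C, marked

def occ(L, c, i):
    """Occurrences of c in L[0..i-1] (a linear scan, for clarity only)."""
    return L[:i].count(c)

def count_range(P, L, C):
    """Backward search: rows sp..ep of M are prefixed by P (empty if sp > ep)."""
    sp, ep = 0, len(L) - 1
    for c in reversed(P):
        if c not in C:
            return 1, 0
        sp = C[c] + occ(L, c, sp)
        ep = C[c] + occ(L, c, ep + 1) - 1
        if sp > ep:
            break
    return sp, ep

def locate(P, L, C, marked):
    """Text positions of P, found by LF-walking to the nearest marked row."""
    sp, ep = count_range(P, L, C)
    positions = []
    for row in range(sp, ep + 1):
        steps = 0
        while row not in marked:
            c = L[row]
            row = C[c] + occ(L, c, row)    # LF-mapping: one step left in the text
            steps += 1
        positions.append(marked[row] + steps)
    return sorted(positions)

L, C, marked = build_fm_index("abraca")
print(count_range("aca", L, C))      # (3, 3)
print(locate("aca", L, C, marked))   # [3]
print(locate("a", L, C, marked))     # [0, 3, 5]
```

Marking only some rows trades space for time: fewer stored positions mean longer LF walks during locating.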
Relationship
Compression algorithm based on BWT: Input data → BWT → GST → RLE0 → EC → Output data
FM-index based on BWT: partition the BWT result into buckets + auxiliary information → FM-index
Experiment results
We compare several widely used tools: gzip (v1.2.4), szip (v1.12a), bzip2 (v1.0.6), and bicom (v1.01). bzip2 and szip are based on BWT, gzip is based on LZ77, and bicom is based on PPM. The results are shown below.

File          File size (bytes)  Bicom  Szip  Bzip2  gzip
Large.txt     4,047,392          1.69   1.63  1.67   2.35
E.Coli        4,638,690          2.12   2.02  2.16   2.31
World192.txt  2,473,400          1.44   1.60  1.58   2.34
Experiment results

Read length  Program      CPU time   Peak memory (megabytes)  Speed-up  Reads aligned (%)
36 bp        Bowtie       6m15s      1,305                    -         62.2
36 bp        Maq          3h52m26s   804                      36.7x     65.0
36 bp        Bowtie -v 2  4m55s      1,138                    -         55.0
36 bp        SOAP         16h44m3s   13,619                   216x      55.1
50 bp        Bowtie       7m11s      1,310                    -         67.5
50 bp        Maq          2h39m56s   804                      21.8x     67.9
50 bp        Bowtie -v 2  5m32s      1,138                    -         56.2
50 bp        SOAP         48h42m4s   13,619                   691x      56.2
76 bp        Bowtie       18m58s     1,323                    -         44.5
76 bp        Maq 0.7.1    4h45m7s    1,155                    14.9x     44.9
76 bp        Bowtie -v 2  7m35s      1,138                    -         31.7
76 bp        SOAP         not supported
Conclusion
BWT is a data transformation method. Both the compression techniques and the FM-index based on BWT achieve good results at low time cost. The dynamic FM-index will be an interesting topic.
The end. Thanks!