1
Burrows-Wheeler Transformation Review
2
Compression techniques
Lossless: Huffman coding, run-length encoding (RLE), Lempel-Ziv (LZ77), Burrows-Wheeler transform (BWT).
Lossy: used to handle audio, image and video.
3
Huffman coding
Suppose we send a telegram with the content ‘a b a c c d a’. Because there are four different characters, two bits are enough to code each of them:
00: a   01: b   10: c   11: d
“abaccda” can then be coded as ‘00 01 00 10 10 11 00’.
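The two-bit code above gives every character the same length, regardless of how often it occurs. As a minimal sketch (the function name `huffman_code` is an assumption made for this illustration, not something from the slides), Huffman's greedy merging of the two lightest subtrees can be written in Python as follows:

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return {symbol: bitstring} built by repeatedly merging the two lightest subtrees."""
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, n, {sym: ""}) for n, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Prefix the codes of one subtree with 0 and of the other with 1.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

code = huffman_code("abaccda")
encoded = "".join(code[ch] for ch in "abaccda")
print(code, len(encoded))
```

For "abaccda" the most frequent character a receives a one-bit code, and the whole message takes 13 bits instead of the 14 bits of the fixed two-bit code.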
4
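Run-length encoding
To encode the data, RLE transforms a run of identical values into a (count, value) pair.
Input = { 1,1,1,1,1,1 }; Output = { 6,1 }
Input = { 6,1,0,1,1,1,1,1,1 }; Output = { 6,1,0,6,1 }
In the second output, does the leading 6,1 mean six 1s or the literal values 6 and 1? The decoder cannot tell, so we need a control code that marks where a run starts (the least frequently occurring code can be used for this).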
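One way to remove the ambiguity is to emit a run as a triple (control code, count, value) and to copy everything else literally. The Python sketch below assumes a caller-supplied control value `ctrl` and a minimum run length of 3; both are illustrative choices rather than details given on the slide.

```python
def rle_encode(data, ctrl, min_run=3):
    """Encode runs as [ctrl, count, value]; short runs are copied literally."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        run = j - i
        # A literal equal to ctrl must also be escaped as a run,
        # otherwise the decoder would misread it as a run marker.
        if run >= min_run or data[i] == ctrl:
            out += [ctrl, run, data[i]]
        else:
            out += [data[i]] * run
        i = j
    return out

def rle_decode(code, ctrl):
    """Invert rle_encode for the same control value."""
    out = []
    i = 0
    while i < len(code):
        if code[i] == ctrl:
            out += [code[i + 2]] * code[i + 1]
            i += 3
        else:
            out.append(code[i])
            i += 1
    return out

print(rle_encode([6, 1, 0, 1, 1, 1, 1, 1, 1], ctrl=255))
```

With 255 as the control code, the second input from the slide becomes 6, 1, 0, 255, 6, 1, and the decoder can now tell the literal 6, 1 apart from the run of six 1s.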
5
Lempel-Ziv (LZ77)
LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the input (uncompressed) data stream.
Input = “the brown fox jumped over the brown foxy jumping frog”
The result is
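To make the idea of back-references concrete, here is a small Python sketch that encodes text as (offset, length, next character) triples. The triple format, the window size and the function names are assumptions chosen for illustration; they are not necessarily the output format the slide showed.

```python
def lz77_encode(text, window=255):
    """Greedy LZ77: emit (offset, length, next_char) triples."""
    out = []
    i = 0
    while i < len(text):
        best_len, best_off = 0, 0
        for start in range(max(0, i - window), i):
            length = 0
            # Stop one character early so a literal next_char always exists.
            while (i + length < len(text) - 1
                   and text[start + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - start
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    """Rebuild the text by copying from the already decoded output."""
    out = []
    for offset, length, next_char in triples:
        for _ in range(length):
            out.append(out[-offset])  # copy one character at a time
        out.append(next_char)
    return "".join(out)

text = "the brown fox jumped over the brown foxy jumping frog"
triples = lz77_encode(text)
print(len(text), len(triples))
```

Copying one character at a time in the decoder is what allows a reference to overlap the text it is producing.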
6
Outline
Burrows-Wheeler Transformation
Procedures of BWT
Compression based on BWT
What is FM-index
FM-index and Compression based on BWT
Experiment results
Conclusion
7
Burrows-Wheeler Transformation
Burrows-Wheeler Transformation (BWT) was first proposed by Burrows and Wheeler in 1994.
Properties:
BWT compression works on blocks of data, not on a data stream.
BWT itself does not compress data; it permutes the data so that it becomes more compressible.
BWT compression achieves good results at a competitive time cost.
9
Procedures of BWT
Three steps, with S = abraca as the example:
① Cyclically shift the block of data of length N.
② Sort the results of step ① to get the matrix M.
③ Output the last column L and the index of the original string in M.

Cyclic shifts:
index  result
0      abraca
1      bracaa
2      racaab
3      acaabr
4      caabra
5      aabrac

After sorting (matrix M):
index  result
0      aabrac
1      abraca
2      acaabr
3      bracaa
4      caabra
5      racaab

Output: L = caraab, Index = 1
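The three steps translate almost directly into code. The following Python sketch builds the full matrix M, which is fine for a small block like the example but too slow and memory-hungry for real data; the function name `bwt` is chosen here for illustration.

```python
def bwt(block):
    """Return (L, index): last column of the sorted shift matrix and the row of the input."""
    n = len(block)
    # Step 1: all cyclic shifts of the block.
    shifts = [block[i:] + block[:i] for i in range(n)]
    # Step 2: sort the shifts to obtain matrix M.
    matrix = sorted(shifts)
    # Step 3: output the last column and the position of the original string.
    last_column = "".join(row[-1] for row in matrix)
    return last_column, matrix.index(block)

print(bwt("abraca"))  # ('caraab', 1)
```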
10
Reversible Procedures of BWT
It is critical to understand the following two properties of the matrix M. The example uses S' = abraca$, where $ marks the end of the string and sorts before every other character.

index  sorted rows (M)  F  L
0      $abraca          $  a
1      a$abrac          a  c
2      abraca$          a  $
3      aca$abr          a  r
4      braca$a          b  a
5      ca$abra          c  a
6      raca$ab          r  b

Property 1: For the i-th row of M, the last character L[i] precedes the first character F[i] in the original string S, namely …L[i]F[i]…. F is the first column of M and can be obtained by sorting L.

Property 2: Last-to-First mapping (LF-mapping). Let L[i] = c and let r_i be the number of occurrences of c in the prefix L[0, i-1]. Let M[j] be the row that has exactly r_i rows starting with c above it. Then the character F[j] in the first column corresponds to L[i] in the last column. Set LF[i] = j, meaning that F[j] and L[i] are the same character of the original string.

Put simply, the occurrences of any character keep the same relative order in F and in L. For example, the three rows starting with a (rows 1, 2, 3) are in order; dropping the leading a leaves $abrac, braca$, ca$abr, which are still in order, and these are the prefixes of the three rows ending with a (rows 0, 4, 5).

How to compute the LF array?
Occ(i) is the number of occurrences of L[i] in the prefix L[0…i-1].
C[c] is the number of characters in S' that sort lower than c.
LF[i] = C[L[i]] + Occ(i)
Example: LF[5] = C[a] + Occ(5) = 1 + 2 = 3, so L[5] corresponds to F[3], the first character of M[3].
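As a small illustration of the definitions of C and Occ, the Python sketch below computes the whole LF array from the last column alone. The function name `lf_mapping` is an assumption, and the input is the last column of the sorted rotations of abraca$.

```python
def lf_mapping(last):
    """Return the LF array for a BWT last column, using LF[i] = C[L[i]] + Occ(i)."""
    # C[c]: number of characters in the text that sort lower than c.
    C, total = {}, 0
    for ch in sorted(set(last)):
        C[ch] = total
        total += last.count(ch)
    seen = {}   # occurrences of each character seen so far in L[0..i-1]
    lf = []
    for ch in last:
        occ = seen.get(ch, 0)       # Occ(i)
        lf.append(C[ch] + occ)
        seen[ch] = occ + 1
    return lf

print(lf_mapping("ac$raab"))  # [1, 5, 0, 6, 2, 3, 4]
```

Entry 5 of the result is 3, matching LF[5] = C[a] + Occ(5) = 1 + 2 = 3.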
11
Reversible Procedures of BWT
Algorithm BWT_reverse(L)
1. i = 0;                      // M[0] = $S
2. for j = N-1 down to 0
3.     S[j] = L[i];
4.     i = C[L[i]] + Occ(i);   // follow the LF-mapping

Example with S' = abraca$, L = ac$raab, starting from i = 0:
i = 0: S[5] = a
i = 1: S[4] = c
i = 5: S[3] = a
i = 3: S[2] = r
i = 6: S[1] = b
i = 4: S[0] = a
i = 2: L[2] = $, end.
The string is recovered from back to front: a, ca, aca, raca, braca, abraca. The procedure ends when it meets $.
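A runnable version of this procedure might look like the Python sketch below. It assumes that the text was terminated with a marker (here `$`) that sorts before every other character, so that row 0 of M starts with the marker; the function name `inverse_bwt` is chosen for illustration.

```python
def inverse_bwt(last, end_marker="$"):
    """Recover the original text (without the end marker) from the last column L."""
    # C[c]: number of characters that sort lower than c.
    C, total = {}, 0
    for ch in sorted(set(last)):
        C[ch] = total
        total += last.count(ch)
    # LF[i] = C[L[i]] + Occ(i), with Occ(i) counted on the fly.
    seen, lf = {}, []
    for ch in last:
        lf.append(C[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    out = []
    i = 0   # row 0 of M is end_marker + S, so L[0] is the last character of S
    while last[i] != end_marker:
        out.append(last[i])
        i = lf[i]
    # Characters were collected from the end of S to its beginning.
    return "".join(reversed(out))

print(inverse_bwt("ac$raab"))  # abraca
```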
13
Compression based on BWT
BWT itself does not compress data, so compression based on BWT combines the BWT with existing compression techniques:
Input data → BWT → GST → RLE → EC → Output data
GST: Move-to-Front (MTF)
RLE: run-length encoding
EC: Huffman coding / entropy coding
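The Move-to-Front stage is the least familiar part of this pipeline, so here is a minimal Python sketch of it. The function names and the explicit `alphabet` argument are assumptions made for the example; the input caraab is the BWT output of abraca.

```python
def mtf_encode(text, alphabet):
    """Replace each symbol by its position in a list, then move it to the front."""
    table = list(alphabet)
    out = []
    for ch in text:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out

def mtf_decode(codes, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    table = list(alphabet)
    out = []
    for idx in codes:
        ch = table.pop(idx)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)

print(mtf_encode("caraab", "abcr"))  # [2, 1, 3, 1, 0, 3]
```

Because the BWT tends to group equal characters together, recently used symbols sit near the front of the list, so longer inputs typically produce many small numbers and runs of zeros that the later RLE and entropy coding stages can exploit.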
15
FM-index based on BWT

What is FM-index?
Full-text, minute-space index. FM-index combines the BWT-based compression algorithm with the suffix array data structure, and achieves efficient random access to the compressed data without decompressing all of it at query time.

How does FM-index work?
FM-index consists of two steps:
1) Counting the occurrences of the matching pattern.
2) Locating the occurrences.

Step 1: Counting the occurrences

Algorithm counting(P[0, p-1])
    c = P[p-1], i = p-1;
    sp = C[c], ep = C[c+1] - 1;        // c+1 is the character that follows c in sorted order
    while ((sp ≤ ep) && (i ≥ 1)) do
        c = P[i-1];
        sp = C[c] + Occ(c, sp);
        ep = C[c] + Occ(c, ep+1) - 1;
        i = i - 1;
    if (ep < sp) return 0; else return ep - sp + 1;

C[c] is the number of characters in the text that sort lower than c.
Occ(c, i) is the number of occurrences of c in the prefix L[0…i-1].
In the i-th iteration, sp points to the first row of M prefixed by P[i, p-1] and ep points to the last such row.

Example: S' = abraca$, P = aca

index  M        L
0      $abraca  a
1      a$abrac  c
2      abraca$  $
3      aca$abr  r
4      braca$a  a
5      ca$abra  a
6      raca$ab  b

P[2] = a: sp = C[a] = 1, ep = C[b] - 1 = 3
P[1] = c: sp = C[c] + Occ(c, 1) = 5 + 0 = 5, ep = C[c] + Occ(c, 4) - 1 = 5 + 1 - 1 = 5
P[0] = a: sp = C[a] + Occ(a, 5) = 1 + 2 = 3, ep = C[a] + Occ(a, 6) - 1 = 1 + 3 - 1 = 3
Result: ep - sp + 1 = 1 occurrence, at row 3 (aca$abr).

Step 2: Locating the occurrences
Row 3 is not a marked row, so follow the LF-mapping: LF[3] = 6. M[6] is a marked row with pos(6) = 2. Set pos(3) = pos(6) + 1 = 3, so P occurs at position 3 of S.
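Both steps can be tried out on the abraca example with the Python sketch below. It is an uncompressed toy: Occ is computed by scanning L, the rotations are sorted directly, and rows are marked when their text position is a multiple of a hypothetical `sample_every` parameter. A real FM-index would instead store sampled Occ counts next to the compressed L. All function names are assumptions made for this illustration.

```python
def build_index(text, sample_every=2):
    """Return (L, C, marked) for text + '$'; marked maps row -> text position."""
    s = text + "$"   # assumes '$' is absent from text and sorts before its characters
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    last = "".join(s[i - 1] for i in order)   # s[-1] is '$' for the rotation at 0
    C, total = {}, 0
    for ch in sorted(set(s)):
        C[ch] = total
        total += s.count(ch)
    marked = {row: pos for row, pos in enumerate(order) if pos % sample_every == 0}
    return last, C, marked

def occ(last, ch, i):
    """Occurrences of ch in the prefix L[0..i-1]."""
    return last[:i].count(ch)

def search_range(last, C, pattern):
    """Backward search: rows sp..ep of M are prefixed by pattern (empty if sp > ep)."""
    sp, ep = 0, len(last) - 1
    for ch in reversed(pattern):
        if ch not in C:
            return 1, 0
        sp = C[ch] + occ(last, ch, sp)
        ep = C[ch] + occ(last, ch, ep + 1) - 1
        if sp > ep:
            break
    return sp, ep

def count(index, pattern):
    last, C, _ = index
    sp, ep = search_range(last, C, pattern)
    return max(0, ep - sp + 1)

def locate(index, pattern):
    last, C, marked = index
    sp, ep = search_range(last, C, pattern)
    hits = []
    for row in range(sp, ep + 1):
        steps = 0
        # Follow the LF-mapping until a marked row is reached.
        while row not in marked:
            row = C[last[row]] + occ(last, last[row], row)
            steps += 1
        hits.append(marked[row] + steps)
    return sorted(hits)

index = build_index("abraca")
print(count(index, "aca"), locate(index, "aca"))  # 1 [3]
```

Starting the search from the full row range and processing every pattern character in the loop is equivalent to the initialisation sp = C[c], ep = C[c+1] - 1 used in the pseudocode.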
17
Relationship
Compression algorithm based on BWT: Input data → BWT → GST → RLE0 → EC → Output data
FM-index based on BWT: partition the BWT result into buckets, add auxiliary information → FM-index
19
Experiment results
We compare several widely used tools: gzip (v1.2.4), szip (v1.12a), bzip2 (v1.0.6) and bicom (v1.01). bzip2 and szip are based on BWT, gzip is based on LZ77, and bicom is based on PPM. The results are shown below.

File          File size   Bicom  Szip  Bzip2  gzip
Large.txt     4,047,392   1.69   1.63  1.67   2.35
E.Coli        4,638,690   2.12   2.02  2.16   2.31
World192.txt  2,473,400   1.44   1.60  1.58   2.34
20
Experiment results
Table columns: Read length, Program, CPU time, Peak memory (megabytes), Speed-up, Reads aligned. Only the program names and partial CPU times survive:
36 bp: Bowtie m15s; Maq h52m26s; Bowtie -v 2 4m55s; SOAP h44m3s
50 bp: Bowtie m11s; Maq h39m56s; Bowtie -v 2 5m32s; SOAP h42m4s
76 bp: Bowtie m58s; Maq h45m7s; Bowtie -v m35s; SOAP not supported
22
Conclusion
BWT is a data transformation method.
Both compression techniques and the FM-index based on BWT achieve good results at low time cost.
Dynamic FM-index will be an interesting topic.
23
Thanks The end! Thanks!