
1 Parallel Data Compression Utility Jeff Gilchrist November 18, 2003 COMP 5704 Carleton University

2 Agenda: Introduction; What is data compression?; Compression Algorithms; Burrows-Wheeler Transform (BWT); Parallelizing BWT; Parallel BZIP2; Conclusion; Questions

3 Introduction A general-purpose compressor that can take advantage of parallel computing should greatly reduce the time it takes to compress and uncompress files. The goal is to modify the popular BZIP2 compression utility to support parallel processing, in the hope that this will increase its data compression performance.

4 What is data compression? Compression is used to compact files or data into a smaller form. Lossless data compression requires that when the data is uncompressed again, it is identical to the original. Original File → Compressed File → Uncompressed/Original File

5 Common compression programs Some common compression programs used in Unix and Windows are ACE, BZIP2, GZIP, RAR, and ZIP.

6 Compression Algorithms The two general types of compression algorithms are dictionary based and statistical. Dictionary algorithms (such as Lempel-Ziv) build dictionaries of strings and replace entire groups of symbols. Statistical algorithms develop models of the statistics of the input data and use those models to control the final output.

7 Dictionary Algorithms Strings of characters are replaced by “tokens” to reduce the size of the data. The dictionary contains the strings that the tokens represent. Example: “the frog jumped on the log.” becomes “# fr@ jumped on # l@.” with the dictionary # = the, @ = og.
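A minimal C sketch of this kind of token substitution follows, mirroring the slide's example. The fixed two-entry table is purely illustrative; real dictionary coders such as Lempel-Ziv build their dictionaries adaptively from the data rather than from a hard-coded list.

/* Illustrative token substitution using the slide's fixed dictionary. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *text = "the frog jumped on the log.";
    const char *pattern[] = {"the", "og"};   /* strings in the dictionary */
    const char *token[]   = {"#",   "@"};    /* tokens that replace them  */

    char out[128];
    size_t o = 0;
    for (const char *p = text; *p; ) {
        int matched = 0;
        for (int d = 0; d < 2; d++) {
            size_t len = strlen(pattern[d]);
            if (strncmp(p, pattern[d], len) == 0) {
                o += (size_t)sprintf(out + o, "%s", token[d]);
                p += len;
                matched = 1;
                break;
            }
        }
        if (!matched)
            out[o++] = *p++;   /* no dictionary match: copy the literal */
    }
    out[o] = '\0';
    printf("%s\n", out);       /* expect: # fr@ jumped on # l@. */
    return 0;
}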

8 Burrows-Wheeler Transform The BWT is a block-sorting statistical compression algorithm. BWT achieves speeds similar to dictionary-based algorithms, and compression performance within a few percent of the best statistical compressors (PPM).

9 BWT Algorithm BWT works in three stages: sorting, move-to-front, and final compression. The initial sorting stage permutes the input text so that similar contexts are grouped together. The move-to-front stage converts local symbol groups into a single global structure. The final compression stage takes advantage of the transformed data to produce efficient compressed output.

10 BWT (Sort) From a string S of N characters, form a matrix whose rows are the N cyclic shifts of S. The matrix is sorted lexicographically. A new string L is formed from the last column of the sorted matrix, and I is the index of the row that matches the original string S.
Example with S = abraca:
0 aabrac
1 abraca
2 acaabr
3 bracaa
4 caabra
5 racaab
L = caraab, I = 1
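A minimal C sketch of the sort stage, reproducing the S = abraca example above. It sorts the rotations naively with qsort for clarity; a real BWT implementation (including BZIP2) uses a much more efficient block-sorting method.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *g_text;   /* input string S */
static size_t g_len;         /* N = strlen(S) */

/* Compare two cyclic rotations of S, identified by their start offsets. */
static int cmp_rotation(const void *a, const void *b)
{
    size_t i = *(const size_t *)a, j = *(const size_t *)b;
    for (size_t k = 0; k < g_len; k++) {
        char ci = g_text[(i + k) % g_len];
        char cj = g_text[(j + k) % g_len];
        if (ci != cj)
            return (unsigned char)ci - (unsigned char)cj;
    }
    return 0;
}

int main(void)
{
    const char *S = "abraca";
    g_text = S;
    g_len = strlen(S);

    /* Each row of the matrix is represented by its starting offset in S. */
    size_t rot[64];
    for (size_t i = 0; i < g_len; i++)
        rot[i] = i;
    qsort(rot, g_len, sizeof rot[0], cmp_rotation);

    /* L = last column of the sorted matrix; I = row holding S itself. */
    char L[64];
    size_t I = 0;
    for (size_t i = 0; i < g_len; i++) {
        L[i] = S[(rot[i] + g_len - 1) % g_len];
        if (rot[i] == 0)
            I = i;
    }
    L[g_len] = '\0';
    printf("L = %s, I = %zu\n", L, I);   /* expect: L = caraab, I = 1 */
    return 0;
}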

11 BWT (Move-To-Front) The Move-To-Front step defines a vector of integers R that represents codes for the string L. A list Y is created containing the alphabet of L. R is built by setting R[i] to the number of characters preceding L[i] in Y; L[i] is then moved to the front of Y.
Example with L = caraab, I = 1, starting from Y = a, b, c, r:
L[0] = c, Y = a, b, c, r → R[0] = 2, Y becomes c, a, b, r
L[1] = a, Y = c, a, b, r → R[1] = 1, Y becomes a, c, b, r
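A minimal C sketch of the move-to-front step, continuing the example; the alphabet Y is assumed to be the distinct symbols of L in sorted order, as on the slide.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *L = "caraab";
    char Y[] = {'a', 'b', 'c', 'r'};      /* alphabet of L, sorted */
    int R[16];                            /* MTF output codes */

    for (int i = 0; L[i] != '\0'; i++) {
        /* R[i] = number of characters preceding L[i] in Y. */
        int pos = 0;
        while (Y[pos] != L[i])
            pos++;
        R[i] = pos;

        /* Move L[i] to the front of Y. */
        char c = Y[pos];
        memmove(Y + 1, Y, (size_t)pos);
        Y[0] = c;
    }

    for (int i = 0; i < (int)strlen(L); i++)
        printf("%d ", R[i]);              /* expect: 2 1 3 1 0 3 */
    printf("\n");
    return 0;
}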

12 BWT (Final Compression) The final R vector, along with I, is then compressed using Huffman coding or some other coding technique. Each element in R is treated as a separate token to be coded. Huffman-encoded: R = (2 1 3 1 0 3), I = 1

13 Parallelizing BWT There are a couple of options for parallelizing BWT. The data to be compressed is broken into blocks before BWT is run. Each block can be processed independently and therefore in parallel. The blocks are stitched back together at the end. Pipeline: Data To Compress → Data 1, Data 2, Data 3 → (BWT → R, I → Huff) per block → Compressed output.
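A minimal sketch of this block-splitting idea, assuming libbzip2's one-shot API (BZ2_bzBuffToBuffCompress from bzlib.h, linked with -lbz2) as the per-block compressor. The buffer contents, block count, and output handling are illustrative only; they are not the actual BZIP2 file format or the utility's real block management.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <bzlib.h>

#define BLOCK_SIZE (900 * 1000)   /* bzip2's default 900 kB block */

int main(void)
{
    /* Stand-in for the data to compress. */
    size_t total = 3 * BLOCK_SIZE;
    char *input = malloc(total);
    memset(input, 'a', total);

    size_t nblocks = (total + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (size_t b = 0; b < nblocks; b++) {
        size_t off = b * BLOCK_SIZE;
        unsigned int in_len = (unsigned int)(total - off < BLOCK_SIZE
                                             ? total - off : BLOCK_SIZE);

        /* Each block is compressed independently, so the iterations of
         * this loop have no data dependencies between them. */
        unsigned int out_len = in_len + in_len / 100 + 600;
        char *out = malloc(out_len);
        int rc = BZ2_bzBuffToBuffCompress(out, &out_len, input + off,
                                          in_len, 9, 0, 0);
        printf("block %zu: %u -> %u bytes (rc=%d)\n", b, in_len, out_len, rc);
        free(out);
    }
    free(input);
    return 0;
}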

14 Parallelizing BWT The matrix in the sort stage needs to be sorted lexicographically. A parallel sort algorithm could be used to achieve speedup, with the rotation matrix (e.g. the S = abraca example above) distributed across CPU 1 … CPU n.

15 BZIP2 BZIP2 is a popular compression utility on Unix, used as a replacement for GZIP. BZIP2 uses the BWT algorithm and is available free with source code. BZIP2 compresses single files, so it is often used with TAR (e.g. kernel-2.4.21.tar.bz2). BZIP2 works in a sequential manner.

16 Parallel BZIP2 Modify BZIP2 to process BWT blocks in parallel. Use the pthread model for SMP parallel computing; plenty of 2- and 4-CPU machines and P4 Hyper-Threaded machines are available. Pipeline: Data To Compress → CPU 1, CPU 2 → Data 1 … Data 4, each → R, I → Huff.
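The per-block loop in the earlier sketch maps naturally onto threads. Below is a hedged pthread sketch with one worker per block, again assuming libbzip2's BZ2_bzBuffToBuffCompress (link with -lbz2 -lpthread). The thread-per-block scheme, buffer sizes, and output handling are simplifications, not the actual design of the modified BZIP2.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <bzlib.h>

#define BLOCK_SIZE (900 * 1000)
#define NBLOCKS    4

struct job {
    char *in;                 /* start of this block in the input buffer */
    unsigned int in_len;      /* uncompressed block length */
    char *out;                /* per-block compressed output */
    unsigned int out_len;     /* set to capacity, updated to actual size */
};

static void *compress_block(void *arg)
{
    struct job *j = arg;
    /* Each thread touches only its own block, so no locking is needed. */
    BZ2_bzBuffToBuffCompress(j->out, &j->out_len, j->in, j->in_len, 9, 0, 0);
    return NULL;
}

int main(void)
{
    char *input = malloc((size_t)NBLOCKS * BLOCK_SIZE);
    memset(input, 'a', (size_t)NBLOCKS * BLOCK_SIZE);

    pthread_t tid[NBLOCKS];
    struct job jobs[NBLOCKS];

    for (int b = 0; b < NBLOCKS; b++) {
        jobs[b].in = input + (size_t)b * BLOCK_SIZE;
        jobs[b].in_len = BLOCK_SIZE;
        jobs[b].out_len = BLOCK_SIZE + BLOCK_SIZE / 100 + 600;
        jobs[b].out = malloc(jobs[b].out_len);
        pthread_create(&tid[b], NULL, compress_block, &jobs[b]);
    }

    /* Join in block order so the compressed pieces can be written out
     * (stitched back together) in the original sequence. */
    for (int b = 0; b < NBLOCKS; b++) {
        pthread_join(tid[b], NULL);
        printf("block %d: %u -> %u bytes\n", b, jobs[b].in_len, jobs[b].out_len);
        free(jobs[b].out);
    }
    free(input);
    return 0;
}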

17 Conclusion Parallelizing BZIP2 should provide speedup and increase its performance. Once the code is complete, testing will be performed to see if this is true and by how much. Compressing and uncompressing large amounts of data (e.g. the Linux kernel source) takes a lot of time, so speeding up the process for people who have SMP machines should be useful.

18 Questions What data compression algorithm does BZIP2 use? How does the algorithm’s speed and compression compare to dictionary and other statistical algorithms? What parallel computing model is being used in the modified BZIP2?

