An Overview of Different Compression Algorithms: Their Application to Compressing Inverted Files
Alternative Compression Algorithms: arithmetic coding; Huffman coding (character-based or word-based); dictionary-based coding (the Ziv-Lempel family).
Pros and Cons of Different Algorithms

                      Arithmetic   Character Huffman   Word Huffman   Ziv-Lempel
Compression ratio     very good    poor                very good      good
Compression speed     slow         fast                fast           very fast
Decompression speed   slow         fast                very fast      very fast
Memory space          low          low                 high           moderate
Pattern matching      no           yes                 yes            yes
Random access         no           yes                 yes            no
Choosing a Compression Algorithm for Inverted Files: the factors to consider are compression ratio, speed, and random access. In modern IR systems, word-based Huffman coding is commonly used. There is a lot of research on Ziv-Lempel family coding to see whether it can be applied to index compression.
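To make the word-based Huffman idea concrete, the following is a minimal sketch, assuming the text is tokenized into alternating words and separators and that the code table is kept alongside the bit string. The function names (`build_word_huffman`, `encode`, `decode`) and the regular expression used for tokenizing are illustrative assumptions, not taken from any particular IR system.

```python
import heapq
import re
from collections import Counter


def build_word_huffman(text):
    """Return ({token: bitstring}, tokens) using words and separators as symbols."""
    # Assumption: a token is either a run of word characters or a run of non-word ones.
    tokens = re.findall(r"\w+|\W+", text)
    freq = Counter(tokens)
    if not freq:
        return {}, tokens
    if len(freq) == 1:
        # A single distinct token still needs a one-bit code.
        return {tokens[0]: "0"}, tokens
    # Heap entries: (weight, unique tie-breaker, {token: partial code}).
    heap = [(w, i, {tok: ""}) for i, (tok, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {t: "0" + c for t, c in left.items()}
        merged.update({t: "1" + c for t, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2], tokens


def encode(text):
    """Return (code table, bit string) for the given text."""
    codes, tokens = build_word_huffman(text)
    return codes, "".join(codes[t] for t in tokens)


def decode(codes, bits):
    """Invert encode(); greedy matching works because the code is prefix-free."""
    inverse = {c: t for t, c in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return "".join(out)
```

Because every codeword stands for a whole word, frequent words receive short codes, and a query word can be encoded with the same table and looked for in the compressed stream. A practical system would pack the bits into bytes and store the vocabulary compactly, which this sketch leaves out.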
An Improved Sliding-Window Ziv-Lempel Algorithm: conventional LZ-family compression algorithms use a sliding-window approach based on the longest matching length (m-length). Bender and Wolf propose an improved sliding-window LZ algorithm. Instead of the m-length, the improved algorithm is based on the offset of the length (o-length) and the differential of the length ( -length).
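As a point of reference for the conventional approach only, here is a minimal sketch of a sliding-window coder that picks the longest match (the m-length) at each step and emits (offset, length, next character) triples. It does not implement the Bender and Wolf modification; the function names and the `window` and `max_len` parameters are assumptions chosen for illustration, and the brute-force match search is for clarity rather than speed.

```python
def lz77_compress(data, window=4096, max_len=18):
    """Encode the string `data` as a list of (offset, length, next_char) triples."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        # Try every start position inside the sliding window (brute force).
        for j in range(max(0, i - window), i):
            k = 0
            # The match may overlap position i; one character is always
            # left over to serve as next_char.
            while (k < max_len and i + k < len(data) - 1
                   and data[j + k] == data[i + k]):
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decompress(triples):
    """Rebuild the string by copying `length` characters from `offset` back."""
    out = []
    for off, length, ch in triples:
        for _ in range(length):
            out.append(out[-off])  # character-by-character copy handles overlap
        out.append(ch)
    return "".join(out)
```

Decoding has to replay the triples from the beginning, since each copy refers to earlier output. This illustrates why plain sliding-window schemes do not offer random access into the compressed data.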
Benefits of the Improved Algorithm: it achieved a better compression ratio in the experiment. Compression and searching remain linear, O(n). However, it does not really provide an LZ algorithm that supports random access.
Another Modified LZ Algorithm: proposed by Williams. It uses literal and copy items: at each step it transmits the original data if the item is a literal item, or a pointer if it is a copy item. It is aimed at faster compression speed and a smaller memory footprint, so it is better suited to embedded systems where real-time compression is required. It is inappropriate for index compression.
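The literal/copy idea can be sketched as follows. This is a simplified illustration, not Williams's actual algorithm or bit format: it assumes a dictionary keyed on a three-character prefix that remembers only the most recent position, and it represents items as Python tuples. The names `lc_compress`, `lc_decompress`, `min_match`, and `max_match` are hypothetical.

```python
def lc_compress(data, min_match=3, max_match=16):
    """Encode `data` as ("lit", char) and ("copy", offset, length) items."""
    table = {}  # maps a min_match-character prefix to its most recent position
    i, items = 0, []
    while i < len(data):
        key = data[i:i + min_match]
        cand = None
        if len(key) == min_match:
            cand = table.get(key)  # single lookup, no window scan
            table[key] = i
        if cand is not None:
            # The prefix already matches; extend the match as far as allowed.
            length = min_match
            while (length < max_match and i + length < len(data)
                   and data[cand + length] == data[i + length]):
                length += 1
            items.append(("copy", i - cand, length))
            i += length
        else:
            items.append(("lit", data[i]))
            i += 1
    return items


def lc_decompress(items):
    """Replay literal and copy items to rebuild the original string."""
    out = []
    for item in items:
        if item[0] == "lit":
            out.append(item[1])
        else:
            _, off, length = item
            for _ in range(length):
                out.append(out[-off])
    return "".join(out)
```

Replacing the window scan with one table lookup per step is what makes this style fast and light on memory, at the cost of missing some matches. As with other sliding-window schemes, decoding must start from the beginning of the stream.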
Conclusion: to date, the best practical compression algorithm for indexes is still word-based Huffman coding. There are theoretical studies of Ziv-Lempel family coding, but none of them is practically applicable to our problem. They can, however, be used in other areas.
References: An Improved Data Compression Algorithm Based on Ziv-Lempel Data Compression Algorithm, Paul Edward Bender and Jack Keil Wolf; An Extremely Fast Ziv-Lempel Data Compression Algorithm, Ross N. Williams; Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto.