An Overview of Different Compression Algorithms and Their Application to Compressing Inverted Files


Alternative Compression Algorithms
- Arithmetic coding
- Huffman coding
  - Character-based
  - Word-based
- Dictionary-based coding: the Ziv-Lempel family

Pros and Cons of Different Algorithms

                     Arithmetic  Character Huffman  Word Huffman  Ziv-Lempel
Compression ratio    very good   poor               very good     good
Compression speed    slow        fast               fast          very fast
Decompression speed  slow        fast               very fast     very fast
Memory space         low         low                high          moderate
Pattern matching     no          yes                yes           yes
Random access        no          yes                yes           no

Choosing a Compression Algorithm for Inverted Files
- Factors to consider:
  - Compression ratio
  - Speed
  - Random access
- In modern IR systems, word-based Huffman coding is commonly used.
- There is a lot of research on Ziv-Lempel family coding to see whether it can be applied to index compression.
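To make the word-based Huffman idea concrete, here is a minimal Python sketch. It assumes the text is split into alternating words and separators and that each token gets a code built from its frequency. The function names (`tokenize`, `build_codes`, `encode`, `decode`) are hypothetical, and codes are kept as bit strings for readability, which is not how a real IR system would store them.

```python
import heapq
import re
from collections import Counter
from itertools import count


def tokenize(text):
    # Alternating runs of word and non-word characters, so joining the
    # tokens rebuilds the original text exactly.
    return re.findall(r"\w+|\W+", text)


def build_codes(tokens):
    """Return a {token: bitstring} Huffman code built from token frequencies."""
    freq = Counter(tokens)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(n, next(tie), {tok: ""}) for tok, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing one bit to every code.
        merged = {t: "0" + c for t, c in left.items()}
        merged.update({t: "1" + c for t, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encode(tokens, codes):
    return "".join(codes[t] for t in tokens)


def decode(bits, codes):
    # Huffman codes are prefix-free, so a greedy left-to-right scan works.
    inverse = {c: t for t, c in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return "".join(out)
```

Because the symbols are whole words, frequent words receive very short codes. Decoding needs only the code table, not any earlier part of the compressed text.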

An Improved Sliding-Window Ziv-Lempel Algorithm
- Conventional LZ family compression algorithms use a sliding-window approach based on the longest matching length (m-length).
- An improved sliding-window LZ algorithm was proposed by Bender and Wolf.
- Instead of the m-length, the improved algorithm is based on the offset of the length (o-length) and the differential of the length (Δ-length).
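The conventional sliding-window scheme can be sketched as follows. This is a brute-force illustration of longest-match (m-length) coding only; it does not implement the Bender and Wolf refinement. The names `longest_match`, `lz77_encode`, and `lz77_decode`, and the `window` and `min_len` parameters, are assumptions made for the example. The naive match search is also slower than the linear-time methods used in practice.

```python
def longest_match(data, pos, window):
    """Find the longest earlier match for data[pos:] starting inside the window.

    Returns (offset, length); length 0 means no match was found.
    """
    best_off, best_len = 0, 0
    start = max(0, pos - window)
    for cand in range(start, pos):
        length = 0
        # The match may run past pos (an overlapping copy), which is allowed.
        while pos + length < len(data) and data[cand + length] == data[pos + length]:
            length += 1
        if length > best_len:
            best_off, best_len = pos - cand, length
    return best_off, best_len


def lz77_encode(data, window=4096, min_len=3):
    """Encode bytes as a list of literals (ints) and (offset, length) pairs."""
    out, pos = [], 0
    while pos < len(data):
        off, length = longest_match(data, pos, window)
        if length >= min_len:
            out.append((off, length))
            pos += length
        else:
            out.append(data[pos])  # literal byte
            pos += 1
    return out


def lz77_decode(items):
    buf = bytearray()
    for item in items:
        if isinstance(item, tuple):
            off, length = item
            for _ in range(length):
                buf.append(buf[-off])  # byte by byte, so overlapping copies work
        else:
            buf.append(item)
    return bytes(buf)
```

Each pointer refers back into previously decoded text, so the decoder has to replay the stream from the start. This is why LZ-style coding offers no random access.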

Benefits of the Improved Algorithm
- Better compression ratio in the experiments.
- Compression and searching are still linear: O(n).
- It does not, however, provide an LZ algorithm that supports random access.

Another Modified LZ Algorithm
- Proposed by Williams.
- Uses literal and copy items: at each step, it transmits the original data if the item is a literal, or a pointer if it is a copy.
- Aimed at faster compression and a smaller memory footprint.
- Better suited to embedded systems where real-time compression is required.
- Inappropriate for index compression.
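The literal/copy idea can be illustrated with a small Python sketch. It is written in the spirit of the scheme described above, not as Williams' actual algorithm or output format. It assumes a table that remembers only the most recent position of each 3-byte prefix, so each step does one lookup instead of a window search. The names `fast_lz_encode` and `fast_lz_decode` and the `min_len` and `max_len` limits are hypothetical. Unlike a real implementation, the sketch does not bound the offset.

```python
def fast_lz_encode(data, min_len=3, max_len=16):
    """Greedy literal/copy encoder with a single-candidate lookup table."""
    table = {}  # 3-byte prefix -> most recent position where it occurred
    out, pos = [], 0
    while pos < len(data):
        key = data[pos:pos + min_len]
        cand = None
        if len(key) == min_len:
            cand = table.get(key)
            table[key] = pos
        if cand is not None:
            # The first min_len bytes are known to match; extend greedily.
            length = min_len
            while (length < max_len and pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            out.append(("copy", pos - cand, length))
            pos += length
        else:
            out.append(("lit", data[pos]))
            pos += 1
    return out


def fast_lz_decode(items):
    buf = bytearray()
    for item in items:
        if item[0] == "lit":
            buf.append(item[1])
        else:
            _, off, length = item
            for _ in range(length):
                buf.append(buf[-off])  # byte by byte, so overlapping copies work
    return bytes(buf)
```

Checking one candidate per position trades some compression ratio for speed and a small, fixed-size table, which matches the design goals listed on the slide.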

Conclusion
- To date, the best practical compression algorithm for indexes is still word-based Huffman coding.
- There are theoretical studies of Ziv-Lempel family coding, but none of them is practically applicable to our problem.
- They can, however, be used in other areas.

References
- An Improved Data Compression Algorithm Based on Ziv-Lempel Data Compression Algorithm, Paul Edward Bender and Jack Keil Wolf
- An Extremely Fast Ziv-Lempel Data Compression Algorithm, Ross N. Williams
- Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto