Text Operations: Coding / Compression Methods

Text Compression
Motivation
– finding ways to represent the text in fewer bits
– reducing the costs associated with space requirements, I/O overhead, and communication delays
– obstacle: IR systems need to access the text randomly, but in some forms of compressed text the entire text must be decoded from the beginning until the desired word is reached
Two strategies
– statistical methods
– dictionary methods

Statistical Methods
Basic concepts
– modeling: a probability is estimated for each symbol
– coding: a code is assigned to each symbol based on the model
– shorter codes are assigned to the most likely symbols
Relationship between probabilities and codes
– source code theorem (Claude Shannon): a symbol that occurs with probability p should be assigned a code of length log₂(1/p) bits
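To make the theorem concrete, here is a minimal illustrative sketch (not from the original slides) that computes the ideal code length log₂(1/p) for a given symbol probability:

```python
import math

# Ideal code length (in bits) for a symbol with probability p,
# per Shannon's source code theorem: length = log2(1/p).
def optimal_length(p: float) -> float:
    return math.log2(1.0 / p)

print(optimal_length(0.5))    # 1.0 bit  (a symbol occurring half the time)
print(optimal_length(0.125))  # 3.0 bits (a symbol occurring 1/8 of the time)
```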

Statistical Methods
Compression models
– adaptive model: progressively learns the statistical distribution as the compression process goes on; decompression of a file has to start from its beginning
– static model: assumes an average distribution for all input texts; gives poor compression ratios when the data deviates from the initial distribution assumptions
– semi-static model: learns the distribution in a first pass, then compresses the data in a second pass using a fixed code derived from the learned distribution; information about the data distribution must be stored with the compressed data
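A minimal sketch of the semi-static idea, assuming character-level symbols (an illustrative assumption): the first pass counts frequencies to build the model; a second pass would then encode the same text with a fixed code derived from it, as in the Huffman sketch further below.

```python
from collections import Counter

# Pass 1 of a semi-static compressor: learn the symbol distribution.
# Pass 2 (not shown) would re-read the text and encode it with a fixed
# code derived from this model, which must be stored with the output.
def learn_model(text: str) -> dict:
    counts = Counter(text)
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}

model = learn_model("for each rose, a rose is a rose")
print(model["r"], model["o"])  # estimated probabilities of 'r' and 'o'
```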

Huffman Coding
Building a Huffman tree
– for each symbol of the alphabet, create a node containing the symbol and its probability
– the two nodes with the smallest probabilities become children of a newly created parent node
– the parent node is assigned a probability equal to the sum of the probabilities of its two children
– the operation is repeated, ignoring nodes that are already children, until only one node remains (see the sketch after the example below)

Huffman Coding
Example: “for each rose, a rose is a rose”
[Figure: Huffman tree for the example, with leaves “rose”, “is”, “, ”, “a”, “each”, “for”]
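A minimal Python sketch of the construction described on the previous slide, applied to this example. The word-level tokenization (treating the comma as its own symbol) and the tie-breaking order are assumptions here, so the resulting codes may differ from those in the original figure.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    tiebreak = count()  # unique counter so equal weights never compare symbol tuples
    heap = [(f, next(tiebreak), (sym,)) for sym, f in freqs.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in freqs}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two smallest-probability nodes...
        f2, _, right = heapq.heappop(heap)   # ...become children of a new parent
        for sym in left:
            codes[sym] = "0" + codes[sym]    # left branch contributes a 0
        for sym in right:
            codes[sym] = "1" + codes[sym]    # right branch contributes a 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), left + right))
    return codes

tokens = "for each rose , a rose is a rose".split()
freqs = {t: tokens.count(t) for t in set(tokens)}
print(huffman_codes(freqs))  # 'rose', the most frequent token, gets one of the shortest codes
```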

Canonical Huffman Coding
The height of the left subtree is never shorter than that of the right subtree.
S: ordered sequence of pairs (x_i, y_i), one per level of the tree, where
– x_i = the number of symbols at that level
– y_i = the numerical value of the first symbol at that level
[Figure: canonical Huffman tree for “for each rose, a rose is a rose”]
S = ((1, 1), (1, 1), (0, ∞), (4, 0))
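A sketch of canonical code assignment, using as input the code lengths implied by the S sequence above (one symbol of length 1, one of length 2, four of length 4). Note this sketch uses the common convention of giving the smallest code values to the shortest codes; the slide's S pairs instead record the first value at each level, so the bit patterns may differ even though the code lengths match.

```python
# Canonical assignment: symbols are sorted by code length, and within
# a length the codes are consecutive binary integers.
def canonical_codes(lengths):
    syms = sorted(lengths, key=lambda s: (lengths[s], s))
    code, prev_len = 0, lengths[syms[0]]
    codes = {}
    for s in syms:
        code <<= (lengths[s] - prev_len)  # widen the code when the level deepens
        codes[s] = format(code, "0{}b".format(lengths[s]))
        prev_len = lengths[s]
        code += 1
    return codes

print(canonical_codes({"rose": 1, "a": 2, "is": 4, ",": 4, "each": 4, "for": 4}))
# {'rose': '0', 'a': '10', ',': '1100', 'each': '1101', 'for': '1110', 'is': '1111'}
```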

Byte-Oriented Huffman Coding
– the tree has a branching factor of 256
– ensure there are no empty nodes in the higher levels of the tree
– number of bottom-level elements = 1 + ((v − 256) mod 255)
Characteristics
– decompression is faster than for plain Huffman coding
– compression ratios are better than for the Ziv-Lempel family of codings
– allows direct searching on the compressed text
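A one-line check of the bottom-level formula just given; the vocabulary size 300 is an arbitrary example value.

```python
# Number of codewords on the bottom level of a 256-ary Huffman tree
# over v symbols, per the formula above: 1 + ((v - 256) mod 255).
def bottom_level_elements(v: int) -> int:
    return 1 + ((v - 256) % 255)

print(bottom_level_elements(300))  # 45
```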

Dictionary Methods
Basic concepts
– groups of consecutive symbols are replaced with a pointer to an entry in a dictionary
– the dictionary is composed of a list of symbols (phrases) that are expected to occur frequently
– pointers to the dictionary entries are chosen so that they need less space than the phrases they replace
– there is no explicit modeling or coding step, and no explicit probabilities are associated with the phrases

Dictionary Methods
Static dictionary methods
– selected pairs of letters are replaced with codewords
– ex) digram coding: at each step the next two characters are inspected to check whether they form a digram in the dictionary; if so, they are coded together and the coding position is shifted by two characters; otherwise, the single character is represented by its normal code and the position is shifted by one character (see the sketch below)
– main problem: a dictionary that is suitable for one text may be unsuitable for another
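A minimal sketch of digram coding as just described; the digram dictionary and its codeword values are made up for illustration.

```python
# Hypothetical static dictionary mapping frequent digrams to codewords.
DIGRAMS = {"th": 128, "he": 129, "in": 130, "er": 131}

def digram_encode(text: str) -> list:
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAMS:
            out.append(DIGRAMS[pair])   # digram coded together, shift by two
            i += 2
        else:
            out.append(ord(text[i]))    # single character, normal code, shift by one
            i += 1
    return out

print(digram_encode("the winter"))  # [128, 101, 32, 119, 130, 116, 131]
```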

Dictionary Methods
Adaptive dictionary methods
– Ziv-Lempel: replaces strings of characters with a reference to a previous occurrence of the string
– if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces, compression is achieved

Ziv-Lempel Code
Characteristics
– identifies each text segment the first time it appears, then simply points back to this first occurrence rather than repeating the segment
– an adaptive model of coding, with increasingly long text segments encoded as the text is scanned
– the references require less space than the repeated text segments
– higher compression than the Huffman codes: roughly 4 bits per character

Ziv-Lempel Code
LZ77 – Gzip encoding
– the code consists of a set of triples (a, b, c):
– a: identifies how far back in the decoded text to look for the upcoming text segment
– b: tells how many characters to copy for the upcoming segment
– c: a new character to add to complete the next segment
– ex) see the decoding example on the next slide

Ziv-Lempel Code
An example (decoding):
p
pe
pet
peter
peter_
peter_pi
peter_piper
peter_piper_pic
peter_piper_pick
peter_piper_picked
…
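A minimal decoder sketch for such (a, b, c) triples. The triple values on the original slide did not survive transcription, so the ones below are a reconstruction chosen to be consistent with the example: each triple yields exactly one line of the decoding shown above.

```python
def lz77_decode(triples):
    text = ""
    for back, length, char in triples:
        start = len(text) - back
        for k in range(length):          # char-by-char copy handles overlapping matches
            text += text[start + k]
        text += char                     # the new character completing the segment
    return text

triples = [(0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"),
           (0, 0, "_"), (6, 1, "i"), (8, 2, "r"), (6, 3, "c"),
           (0, 0, "k"), (7, 1, "d")]
print(lz77_decode(triples))  # peter_piper_picked
```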

Dictionary Methods
Adaptive dictionary methods
– disadvantage over the Huffman method: no random access, i.e. decoding cannot start in the middle of a compressed file
– dictionary schemes are popular for their speed and low memory use, but statistical methods are more common in an IR environment

Inverted File Compression
An inverted file consists of
– a vector of all the words in the collection (the vocabulary)
– for each word, a list of the documents that include that word
Assuming each list stores documents in ascending order,
– it can be compressed by storing the sizes of the gaps rather than the document numbers themselves
Unary code: x is represented as (x − 1) one-bits followed by a zero bit
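A minimal sketch of gap encoding with the unary code just defined; the document numbers are an arbitrary example.

```python
# Unary code: (x - 1) one-bits followed by a zero bit.
def unary(x: int) -> str:
    return "1" * (x - 1) + "0"

# Turn an ascending list of document numbers into gaps, then code each gap.
def encode_postings(doc_ids):
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return "".join(unary(g) for g in gaps)

# documents 3, 5, 10, 11 -> gaps 3, 2, 5, 1
print(encode_postings([3, 5, 10, 11]))  # '11010111100'
```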

Golomb Number Compression
Value x is encoded as:
– q + 1 in unary, where q = floor((x − 1) / b)
– followed by r in binary, where r = (x − 1) − q·b
– b is set based on the size and distribution of the numbers being encoded
For gap compression
– b ≈ 0.69 × (N / f_t), where
– N is the total number of documents
– f_t is the number of documents containing term t
This implies that the compression code varies from term to term.
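A sketch of the encoder under the definitions above. For simplicity the remainder r is written with a fixed-width binary code (practical Golomb coders use a truncated binary code when b is not a power of two); the N and f_t values are made up for illustration.

```python
import math

def golomb(x: int, b: int) -> str:
    q, r = divmod(x - 1, b)
    unary_part = "1" * q + "0"               # q+1 in unary: q one-bits, then a zero
    width = max(1, math.ceil(math.log2(b)))  # fixed-width binary for the remainder
    return unary_part + format(r, "0{}b".format(width))

# b chosen from the collection statistics: b ≈ 0.69 * (N / f_t)
N, f_t = 1000, 87
b = max(1, round(0.69 * N / f_t))  # = 8 for these example values
print(b, golomb(13, b))            # x=13 -> q=1, r=4 -> '10' + '100'
```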

Comparing Text Compression Techniques

                              Arithmetic   Character Huffman   Word Huffman   Ziv-Lempel
Compression ratio             very good    poor                very good      good
Compression speed             slow         fast                fast           very fast
Decompression speed           slow         fast                fast           very fast
Memory space                  low          low                 high           moderate
Compressed pattern matching   no           yes                 yes            yes
Random access                 no           yes                 yes            no