Data Compression
By Keerthi Gundapaneni

Introduction
Data compression is a very effective means of saving storage space and network bandwidth. A large number of the compression schemes currently on the market are based on character encoding or on the detection of repetitive strings. Many of these schemes achieve data reduction rates to bits per character for English text.

Introduction
Database performance depends strongly on the amount of available memory, so it is important to use the available memory as efficiently as possible.

Current Schemes
Text compression schemes based on letter frequency (pioneered by Huffman). Schemes based on string matching. Schemes based on fast implementations of algorithms, parallel algorithms and VLSI implementations. Many database systems use prefix and postfix truncation to save space and increase the fan-out of nodes, e.g. Starburst.
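Prefix truncation can be illustrated with a minimal sketch. The version below is one common form, sometimes called front coding, in which each key in a sorted run stores only the length of the prefix it shares with the previous key plus the remaining suffix. The function names and sample keys are hypothetical, and the slides do not say that Starburst implements truncation in exactly this way.

```python
import os


def front_encode(sorted_keys):
    """Store each key as (shared_prefix_length, remaining_suffix)."""
    encoded = []
    previous = ""
    for key in sorted_keys:
        # os.path.commonprefix compares strings character by character.
        shared = len(os.path.commonprefix([previous, key]))
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def front_decode(encoded):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys = []
    previous = ""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


# Hypothetical sorted keys, as they might appear in one index node.
keys = ["compress", "compression", "compressor", "computer"]
encoded = front_encode(keys)
stored_chars = sum(len(suffix) for _, suffix in encoded)
original_chars = sum(len(key) for key in keys)
```

Shorter stored keys mean more entries fit in a node of fixed size, which is where the higher fan-out comes from.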

Using various schemes
The compression rate of a dataset depends on the attribute types and value distributions. It is difficult to compress binary floating point numbers but relatively easy to compress English text by a factor of 2 or 3. Optimal performance can only be obtained by judicious decisions about which attributes to compress and which compression method to use.
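The dependence on attribute type can be checked with a small experiment. The sketch below uses Python's general-purpose zlib module as a stand-in compressor, which is an assumption and not a scheme named in the slides, and compares a short English sample with an array of pseudo-random binary doubles. The sample text and the seed are arbitrary, and the ratio reached on such a short sample need not match the factor of 2 or 3 quoted above.

```python
import random
import struct
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, 9))


# Arbitrary English sample text (an assumption, chosen only for illustration).
english = (
    b"Data compression is a very effective means of saving storage space "
    b"and network bandwidth. Database performance depends strongly on the "
    b"amount of available memory, so it is important to use the available "
    b"memory as efficiently as possible. The compression rate of a dataset "
    b"depends on the attribute types and value distributions. Compressed "
    b"data can be transferred faster to and from disk, and more records "
    b"remain in the buffer, which increases the buffer hit rate."
)

# 2000 pseudo-random doubles packed in binary form (seed chosen arbitrarily).
rng = random.Random(0)
floats = struct.pack("<2000d", *(rng.random() for _ in range(2000)))

text_ratio = compression_ratio(english)
float_ratio = compression_ratio(floats)
print(f"English text: {text_ratio:.2f}x, binary floats: {float_ratio:.2f}x")
```

The mantissa bits of the doubles are close to random, so a byte-oriented compressor finds little to exploit, while English text contains repeated words and skewed letter frequencies.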

Advantages of Compression
Less disk space is required. Seek distances and seek times are reduced. More data fits into each disk page, track and cylinder, allowing more intelligent clustering of related objects into physically near locations. The freed disk space can be used for shadowing to increase reliability.

Advantages of Compression
Compressed data can be transferred faster to and from disk, so data compression effectively increases disk bandwidth. Because of the higher information density, the I/O load decreases, which eases the I/O bottleneck. Transfer rates across the network are also faster.

Advantages of Compression
Retaining data in compressed form in the I/O buffer allows more records to remain in the buffer, which increases the buffer hit rate and reduces the number of I/Os. Log records can become shorter.

Types of compression
For a given table of “parts”, the attribute “color” can be replaced by a small integer. The encoding is saved in a separate relation, and the larger table is joined with the relatively small encoding table for queries that require string-valued output of the color attribute. Since such encoding tables are typically small, e.g. a few kilobytes, efficient hash-based algorithms can be used for the join.
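The “parts” and “color” example can be sketched as follows. The rows, function names and integer codes are hypothetical; the sketch only illustrates the idea of replacing a string attribute with a small integer and recovering the strings through a hash join against the encoding table.

```python
def encode_column(rows, column_index):
    """Replace one string column by small integer codes.

    Returns the encoded rows and the encoding table as (code, value) pairs.
    """
    code_of = {}
    encoded_rows = []
    for row in rows:
        value = row[column_index]
        # A value seen for the first time receives the next unused code.
        code = code_of.setdefault(value, len(code_of))
        encoded_rows.append(row[:column_index] + (code,) + row[column_index + 1:])
    encoding_table = [(code, value) for value, code in code_of.items()]
    return encoded_rows, encoding_table


def hash_join_decode(encoded_rows, encoding_table, column_index):
    """Join the encoded table with the encoding table to restore the strings."""
    lookup = dict(encoding_table)  # build phase: hash the small relation
    return [
        row[:column_index] + (lookup[row[column_index]],) + row[column_index + 1:]
        for row in encoded_rows  # probe phase: one lookup per row of the large table
    ]


# Hypothetical "parts" table with (name, color) rows.
parts = [
    ("bolt", "red"),
    ("nut", "green"),
    ("screw", "red"),
    ("washer", "blue"),
    ("gear", "green"),
]
encoded_parts, color_codes = encode_column(parts, 1)
decoded_parts = hash_join_decode(encoded_parts, color_codes, 1)
```

Queries that only filter or group on color can work on the integer codes directly; the join is needed only when the string values must appear in the output.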

Huffman code example
Symbol: A B C D E
Frequency:
Total: 186 bits (with 3 bits per code word)

Huffman code example

Results
Symbol  Frequency  Code  Code Length  Total Length
A       24         0     1
B
C
D
E
Initial: 186 bits (3-bit code)
Final: 138 bits
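The construction behind such a result can be sketched in a few lines. Only the frequency of A (24) and the two totals (186 and 138 bits) survive in the transcript, so the frequencies for B to E below are assumed values, chosen so that both totals agree with the slide. The function name is hypothetical, and the exact bit patterns depend on how ties and left/right branches are assigned.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {symbol: bit string} for a dict of symbol frequencies."""
    tiebreak = count()  # unique counter so heap entries never compare nodes
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    # Repeatedly merge the two least frequent subtrees.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a symbol
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


# A = 24 is from the slide; B to E are assumed so that the totals match.
freqs = {"A": 24, "B": 12, "C": 10, "D": 8, "E": 8}
codes = huffman_codes(freqs)
fixed_bits = 3 * sum(freqs.values())
huffman_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(fixed_bits, huffman_bits)
```

With these frequencies A receives a 1-bit code and the other four symbols receive 3-bit codes, so the message shrinks from 186 bits to 138 bits.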

References
Seeck, Roger (2008). Binary Essence. Retrieved April 17, 2008, from About BinaryEssence Web site.
Graefe, G., & Shapiro, L. (1991). Data Compression and Database Performance. ACM/IEEE-CS Symp., 1, 1-10.