Published by Belinda Lawson. Modified over 9 years ago.
1
Data Compression
By Keerthi Gundapaneni
2
Introduction
Data compression is a very effective means of saving storage space and network bandwidth. A large number of the compression schemes currently on the market are based on character encoding or on the detection of repetitive strings. Many of these schemes achieve data reduction rates of 2.3 to 2.5 bits per character for English text.
3
Introduction
Database performance depends strongly on the amount of available memory. It is therefore important to use the available memory as efficiently as possible.
4
Current Schemes
Text compression schemes based on letter frequency (pioneered by Huffman).
Schemes based on string matching.
Schemes based on fast implementations of algorithms, parallel algorithms, and VLSI implementations.
Many database systems use prefix and postfix truncation to save space and increase the fan-out of nodes, e.g. Starburst.
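The prefix and postfix truncation mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Starburst or any particular system implements it: the function names `truncate_prefix` and `shortest_separator` are hypothetical, and keys are assumed to be plain strings compared in ordinary lexicographic order.

```python
import os

def truncate_prefix(keys):
    """Prefix truncation: store the prefix shared by all keys in a node
    once, and keep only the remaining part of each key."""
    prefix = os.path.commonprefix(keys)
    return prefix, [key[len(prefix):] for key in keys]

def shortest_separator(left, right):
    """Postfix (suffix) truncation: return the shortest prefix s of `right`
    with left < s <= right, which is enough to separate two neighbouring
    keys in an index node. Assumes left < right."""
    for i in range(1, len(right) + 1):
        candidate = right[:i]
        if candidate > left:
            return candidate
    return right

# Hypothetical keys, chosen only for illustration.
prefix, suffixes = truncate_prefix(["compress", "compression", "compressor"])
separator = shortest_separator("Smith", "Snyder")
```

Shorter keys and separators mean more entries fit into each node, which is where the increased fan-out comes from.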
5
Using various schemes
The compression rate of a dataset depends on the attribute types and value distributions. It is difficult to compress binary floating-point numbers but relatively easy to compress English text by a factor of 2 or 3. Optimal performance can only be obtained by judicious decisions about which attributes to compress and which compression method to use.
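The dependence on attribute type can be observed with a general-purpose compressor. The sketch below uses Python's standard `zlib` module, which is not one of the schemes named in the slides, and the sample data are made up for illustration. The repeated sentence is far more regular than real English text, so its compression factor will be much higher than the factor of 2 or 3 quoted above; only the contrast with the floating-point data matters here.

```python
import random
import struct
import zlib

def compression_factor(data: bytes) -> float:
    """Original size divided by compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, 9))

# Assumption: a short repeated sentence stands in for a text attribute.
text = ("data compression saves storage space and network bandwidth " * 200).encode("ascii")

# Assumption: random 64-bit doubles stand in for a floating-point attribute.
rng = random.Random(0)  # fixed seed so the run is repeatable
floats = b"".join(struct.pack("<d", rng.random()) for _ in range(1500))

text_factor = compression_factor(text)
float_factor = compression_factor(floats)
print(f"text:   {text_factor:.2f}x")
print(f"floats: {float_factor:.2f}x")
```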
6
Advantages of Compression
The required disk space is reduced.
Seek distances and seek times are reduced.
More data fits into each disk page, track, and cylinder, allowing more intelligent clustering of related objects into physically near locations.
Unused disk space can be used for shadowing to increase reliability.
7
Advantages of Compression
Compressed data can be transferred faster to and from disk, so data compression increases the effective disk bandwidth.
Because of the higher information density, the I/O load decreases, so I/O is less of a bottleneck.
Transfer rates across the network are faster.
8
Advantages of Compression
Retaining data in compressed form in the I/O buffer allows more records to remain in the buffer, thus increasing the buffer hit rate and reducing the number of I/Os.
The log records can become shorter.
9
Types of compression
For a given table of “parts”, the attribute “color” is replaced by a small integer. The encoding is saved in a separate relation, and the larger table is joined with the relatively small encoding table for queries that require string-valued output of the color attribute. Since such encoding tables are typically small, e.g. a few kilobytes, efficient hash-based algorithms can be used for the join.
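The encoding scheme described above can be sketched as follows. The rows in `parts` and the helper names `encode_column` and `hash_join_decode` are hypothetical and serve only as an illustration; a real database system would do this inside its storage and query layers.

```python
# Hypothetical "parts" table: (part name, color).
parts = [
    ("bolt", "red"),
    ("nut", "green"),
    ("screw", "red"),
    ("washer", "blue"),
]

def encode_column(rows, col):
    """Replace the values in column `col` by small integers and return the
    encoded rows together with a separate encoding table of (code, value)."""
    codes = {}  # value -> small integer code
    encoded = []
    for row in rows:
        code = codes.setdefault(row[col], len(codes))
        encoded.append(row[:col] + (code,) + row[col + 1:])
    encoding_table = [(code, value) for value, code in codes.items()]
    return encoded, encoding_table

def hash_join_decode(encoded, encoding_table, col):
    """Hash join: build a hash table on the small encoding table, then probe
    it once per row of the large table to restore the string values."""
    lookup = dict(encoding_table)
    return [row[:col] + (lookup[row[col]],) + row[col + 1:] for row in encoded]

encoded_parts, color_codes = encode_column(parts, col=1)
decoded_parts = hash_join_decode(encoded_parts, color_codes, col=1)
```

Queries that only compare colors for equality can work directly on the integer codes; the join is needed only when the string value itself has to be output.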
10
Huffman code example
Symbol:     A    B    C    D    E
Frequency:  24   12   10   8    8
Total: 186 bits (62 symbols at 3 bits per code word)
11
Huffman code example
12
Results
Symbol  Frequency  Code  Code Length  Total Length
A       24         0     1            24
B       12         100   3            36
C       10         101   3            30
D       8          110   3            24
E       8          111   3            24
Initial: 186 bits (3-bit fixed-length code)
Final: 138 bits (Huffman code)
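The figures in this example can be checked with a short Python sketch of the standard Huffman construction, which repeatedly merges the two least frequent subtrees. The function name `huffman_code_lengths` is an assumption made for this illustration. The sketch computes code lengths only; the exact bit patterns (such as 100 for B) depend on how the branches are labeled.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length} for a Huffman code over `freqs`."""
    tiebreak = count()  # unique counter so equal frequencies never compare dicts
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every symbol in them
        # moves one level deeper, i.e. its code grows by one bit.
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

freqs = {"A": 24, "B": 12, "C": 10, "D": 8, "E": 8}
lengths = huffman_code_lengths(freqs)

fixed_bits = 3 * sum(freqs.values())                        # 3-bit fixed-length code
huffman_bits = sum(freqs[s] * lengths[s] for s in freqs)    # variable-length code
print(lengths, fixed_bits, huffman_bits)
```

For the frequencies in the table this gives a length of 1 for A and 3 for B, C, D, and E, that is 186 bits for the fixed-length code and 138 bits for the Huffman code.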
13
References
Seeck, Roger (2008). Binary Essence. Retrieved April 17, 2008, from BinaryEssence Web site: http://www.binaryessence.com/dct/en000081.htm
Graefe, G., & Shapiro, L. (1991). Data Compression and Database Performance. ACM/IEEE-CS Symp., 1, 1-10.