ItCompress: An Iterative Semantic Compression Algorithm


1 ItCompress: An Iterative Semantic Compression Algorithm
H. V. Jagadish, U. Michigan, Ann Arbor
R. T. Ng, U. of British Columbia, Vancouver
B. C. Ooi and A. K. H. Tung, Nat'l U. of Singapore, Singapore

2 Motivation
[Diagram: query → Large Data Sets → results]
Ever-increasing data collection rates of modern enterprises, and the need for effective, guaranteed-quality approximate answers to queries
Concern: compress as much as possible

3 Conventional Compression Method
Try to find the optimal encoding of arbitrary strings for the input data:
  Huffman coding
  Lempel-Ziv coding (gzip)
View the whole table as a large byte string
Statistical or dictionary based
Operate at the byte level
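As a rough illustration of the byte-level view, the sketch below serializes a toy table into one byte string and compresses it with Python's standard `zlib` module (DEFLATE, the same Lempel-Ziv plus Huffman scheme that gzip uses). The table values and the helper names `table_to_bytes` and `bytes_to_table` are assumptions made for this example only.

```python
import zlib

def table_to_bytes(rows):
    """Serialize a table as one CSV-like byte string (one line per row)."""
    text = "\n".join(",".join(str(v) for v in row) for row in rows)
    return text.encode("utf-8")

def bytes_to_table(data):
    """Inverse of table_to_bytes; every field comes back as a string."""
    return [tuple(line.split(",")) for line in data.decode("utf-8").split("\n")]

# Toy table (illustrative values), repeated to give the compressor redundancy.
rows = [(20, "30k", "poor", "M"),
        (25, "76k", "good", "F"),
        (30, "90k", "good", "F")] * 200

raw = table_to_bytes(rows)
packed = zlib.compress(raw, 9)  # level 9: strongest compression

# The compressor sees only bytes: it knows nothing about columns or rows,
# so the stream is decompressed from the start before any row can be read.
restored = bytes_to_table(zlib.decompress(packed))
```

The round trip is lossless, and the compressed stream has no notion of tuples or attributes, which is the limitation the next slide raises.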

4 Why not just “syntactic”?
Does not exploit the complex dependency patterns in the table
Individual retrieval of a tuple is difficult
Does not utilize lossy compression

5 Outline
Motivation
Semantic Compression Methods
ItCompress Algorithm
Performance
Conclusion
Q&A

6 Semantic compression methods
Derive a descriptive model M
Identify which data values can be derived from M (within some error tolerance), which are essential for the derivation, and which are outliers
Derived values need not be stored; only the essential values and the outliers do

7 Advantages
More complex analysis
  Example: detect correlation among columns
Fast retrieval
  Tuple-wise access
Query enhancement
  Possible to answer queries directly from the discovered semantics
  Compress in a way that enhances the answering of some complex queries, e.g. “Go Green: Recycle and Reuse Frequent Patterns”, C. Gao, B. C. Ooi, K. L. Tan and A. K. H. Tung, ICDE 2004
Choose a combination of compression methods based on semantic and syntactic information

8 Fascicles
Key observation
  Often, numerous subsets of records in T have similar values for many attributes
Compress data by storing representative values (e.g., “centroid”) only once for each attribute cluster
[Example table: columns Protocol, Duration, Bytes, Packets; five http rows and three ftp rows; the numeric values are not preserved]
Lossy compression: information loss is controlled by the notion of “similar values” for attributes (user-defined)

9 SPARTAN: Semantic Compression with Classification and Regression Trees (CaRTs)
A compact CaRT can eliminate an entire column by prediction
[Figure: example table and two CaRTs built on it]
  Table columns as preserved: protocol (http, ftp); duration (12, 16, 15, 19, 26, 27, 32, 18); bytes (20K, 24K, 40K, 58K, 100K, 300K, 80K); packets (3, 5, 8, 11, 18, 24, 35, 15)
  Tree predicting Protocol (error = 0): tests Packets > 10 and Bytes > 60K; leaves Protocol = http and Protocol = ftp
  Tree predicting Duration (error <= 3): test Packets > 16; leaves Duration = 15 and Duration = 29
  Outlier: Packets = 11, Duration = 19

10 ItCompress: Compression Format
Representative Rows (Patterns)
  RRid  age  salary  credit  sex
  1     30   90k     good    F
  2     70   35k     poor    M
Original Table
  age  salary  credit  sex
  20   30k     poor    M
  25   76k     good    F
  30   90k     good    F
  40   100k    poor    M
  50   110k    good    F
  60   50k     good    M
  70   35k     poor    F
  75   15k     poor    M
Compressed Table
  RRid  bitmap  Outlying value
  2     0111    20
  1     1111
  1     1111
  1     0100    40, poor, M
  1     0111    50
  1     0010    60, 50k, M
  2     1110    F
  2     1111
Error Tolerance
  age  salary  credit  sex
  5    25k
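The format above can be read as follows: each row stores the id of a representative row, one bit per attribute saying whether the representative value is close enough, and the actual values for the attributes where it is not. A minimal Python sketch of that reading is given below. The function names are hypothetical, salaries are written in thousands, and a tolerance of 0 (exact match) for credit and sex is an assumption, since only the age and salary tolerances appear on the slide.

```python
def matches(value, rep_value, tol):
    """True if rep_value may stand in for value under tolerance tol.

    Numeric attributes: |value - rep_value| <= tol.
    Categorical attributes: exact equality (assumed tolerance of 0).
    """
    if isinstance(value, (int, float)):
        return abs(value - rep_value) <= tol
    return value == rep_value

def compress_row(row, rr_id, rep_rows, tols):
    """Encode a row as (RRid, bitmap, outlying values) against one representative row."""
    rep = rep_rows[rr_id]
    bits = [matches(v, r, t) for v, r, t in zip(row, rep, tols)]
    bitmap = "".join("1" if b else "0" for b in bits)
    outliers = [v for v, b in zip(row, bits) if not b]
    return rr_id, bitmap, outliers

def decompress_row(rr_id, bitmap, outliers, rep_rows):
    """Rebuild a row: representative value on 1-bits, stored outlier on 0-bits."""
    rest = iter(outliers)
    return tuple(r if bit == "1" else next(rest)
                 for r, bit in zip(rep_rows[rr_id], bitmap))

# Representative rows and numeric tolerances from the slide (salary in thousands).
REP_ROWS = {1: (30, 90, "good", "F"), 2: (70, 35, "poor", "M")}
TOLS = (5, 25, 0, 0)  # the two zeros (credit, sex) are an assumption

encoded = compress_row((20, 30, "poor", "M"), 2, REP_ROWS, TOLS)
decoded = decompress_row(*encoded, REP_ROWS)
```

Here `encoded` is `(2, "0111", [20])`, matching the first row of the compressed table, and `decoded` is `(20, 35, "poor", "M")`: the salary comes back as the representative's 35k rather than the original 30k, which is the lossy part of the scheme and stays within the 25k tolerance.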

11 Some definitions
Error tolerance
  Numeric attributes
    The upper bound on how much the compressed value x’ can differ from the actual value x: x ∈ [ x’-ei, x’+ei ]
  Categorical attributes
    The upper bound on the probability that the compressed value differs from the actual value
    Given an actual value x and its error tolerance ei, the compressed value x’ should satisfy: Prob( x=x’ ) ≥ 1 - ei

12 Some definitions
Coverage
  Let R be a row in the table T, and Pi be a pattern
  The coverage of Pi on R: [formula not preserved]
Total coverage
  Let P be a set of patterns P1,…,Pk, and let the table T contain n rows R1,…,Rn
  [formula not preserved]
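The slide's formulas are not reproduced in the transcript. One reading that is consistent with the bitmap format and with the basic algorithm, stated here as an assumption rather than as the authors' exact notation, is that coverage counts the attributes on which the pattern matches the row within the error tolerance, and that total coverage sums each row's coverage under its best pattern:

```latex
\[
\mathrm{cov}(P_i, R) = \bigl|\{\, j : R[j] \text{ is matched by } P_i[j] \text{ within } e_j \,\}\bigr|
\]
\[
P_{\max}(R) = \arg\max_{P_i \in P} \mathrm{cov}(P_i, R), \qquad
\mathrm{totalcov}(P, T) = \sum_{r=1}^{n} \mathrm{cov}\bigl(P_{\max}(R_r), R_r\bigr)
\]
```

Under this reading, the number of 1-bits in a compressed row's bitmap equals the coverage of its representative row on that row.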

13 ItCompress: basic algorithm
First, randomly choose k rows as the initial patterns
Phase 1: scan the table T. For each row R, compute the coverage of each pattern on it, find Pmax(R), and allocate R to this most-covering pattern
Phase 2: after each scan, recompute the attributes of every pattern, always using the most frequent values
Iterate until the total coverage no longer increases
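The two phases can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' implementation: coverage is taken to be the number of attributes matched within tolerance, categorical attributes use exact matching (tolerance 0), and Phase 2 only considers values that occur in the pattern's group (plus the pattern's current value) instead of searching over intervals. All function names are hypothetical.

```python
import random

def matches(value, pat_value, tol):
    """Numeric: within tol of the pattern value. Categorical: exact match."""
    if isinstance(value, (int, float)):
        return abs(value - pat_value) <= tol
    return value == pat_value

def coverage(pattern, row, tols):
    """Number of attributes of row that pattern matches within tolerance."""
    return sum(matches(v, p, t) for v, p, t in zip(row, pattern, tols))

def assign_rows(table, patterns, tols):
    """Phase 1: index of the most-covering pattern per row (ties: lower index)."""
    return [max(range(len(patterns)),
                key=lambda i: coverage(patterns[i], row, tols))
            for row in table]

def recompute_patterns(table, assignment, patterns, tols):
    """Phase 2: per attribute, keep the candidate matching most rows of the group."""
    new_patterns = []
    for i, old in enumerate(patterns):
        group = [row for row, a in zip(table, assignment) if a == i]
        if not group:                      # nothing assigned: keep the pattern
            new_patterns.append(old)
            continue
        new_patterns.append(tuple(
            # candidates: the group's values, then the pattern's current value
            max(column + (old_value,),
                key=lambda c: sum(matches(v, c, t) for v in column))
            for column, old_value, t in zip(zip(*group), old, tols)))
    return new_patterns

def itcompress(table, k, tols, initial=None, seed=0):
    """Alternate Phase 1 and Phase 2 until total coverage stops increasing."""
    patterns = (list(initial) if initial is not None
                else random.Random(seed).sample(table, k))
    best = -1
    while True:
        assignment = assign_rows(table, patterns, tols)
        total = sum(coverage(patterns[a], row, tols)
                    for row, a in zip(table, assignment))
        if total <= best:
            return patterns, assignment, total
        best = total
        patterns = recompute_patterns(table, assignment, patterns, tols)

# Example table from the slides (salary in thousands); zeros are assumed tolerances.
TABLE = [(20, 30, "poor", "M"), (25, 76, "good", "F"), (30, 90, "good", "F"),
         (40, 100, "poor", "M"), (50, 110, "good", "F"), (60, 50, "good", "M"),
         (70, 35, "poor", "F"), (75, 15, "poor", "M")]
TOLS = (5, 25, 0, 0)

patterns, assignment, total = itcompress(TABLE, k=2, tols=TOLS, initial=TABLE[:2])
```

Passing `initial=TABLE[:2]` fixes the starting patterns to the first two rows, as in the slides' worked example, so the run is deterministic. With that start, the first pass groups the rows with ages 20, 40, 60, 70, 75 under pattern 1 and those with ages 25, 30, 50 under pattern 2, and the recomputed patterns are (70, 30k, poor, M) and (25, 90k, good, F).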

14 Example: the 1st iteration begins
Initial patterns
  RRid  age  salary  credit  sex
  1     20   30k     poor    M
  2     25   76k     good    F
Table
  age  salary  credit  sex
  20   30k     poor    M
  25   76k     good    F
  30   90k     good    F
  40   100k    poor    M
  50   110k    good    F
  60   50k     good    M
  70   35k     poor    F
  75   15k     poor    M
Error Tolerance
  age  salary  credit  sex
  5    25k

15 Example: Phase 1
Patterns
  RRid  age  salary  credit  sex
  1     20   30k     poor    M
  2     25   76k     good    F
Table
  age  salary  credit  sex
  20   30k     poor    M
  25   76k     good    F
  30   90k     good    F
  40   100k    poor    M
  50   110k    good    F
  60   50k     good    M
  70   35k     poor    F
  75   15k     poor    M
Rows allocated to pattern 1
  age  salary  credit  sex
  20   30k     poor    M
  40   100k    poor    M
  60   50k     good    M
  70   35k     poor    F
  75   15k     poor    M
Rows allocated to pattern 2
  age  salary  credit  sex
  25   76k     good    F
  30   90k     good    F
  50   110k    good    F
Error Tolerance
  age  salary  credit  sex
  5    25k

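16 Example: Phase 2
Patterns before recomputation
  RRid  age  salary  credit  sex
  1     20   30k     poor    M
  2     25   76k     good    F
Table
  age  salary  credit  sex
  20   30k     poor    M
  25   76k     good    F
  30   90k     good    F
  40   100k    poor    M
  50   110k    good    F
  60   50k     good    M
  70   35k     poor    F
  75   15k     poor    M
Rows allocated to pattern 1
  age  salary  credit  sex
  20   30k     poor    M
  40   100k    poor    M
  60   50k     good    M
  70   35k     poor    F
  75   15k     poor    M
Rows allocated to pattern 2
  age  salary  credit  sex
  25   76k     good    F
  30   90k     good    F
  50   110k    good    F
Recomputed patterns
  RRid  age  salary  credit  sex
  1     70   30k     poor    M
  2     25   90k     good    F
Error Tolerance
  age  salary  credit  sex
  5    25k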

17 Convergence (I)
Phase 1: when we assign the rows to their most-covering patterns
  For each row, the coverage increases or stays the same
  So the total coverage also increases or stays the same
Phase 2: when we recompute the attribute values of the patterns
  For each pattern, the coverage increases or stays the same
  So the total coverage also increases or stays the same

18 Convergence (II)
In both Phase 1 and Phase 2, the total coverage either increases or stays the same, and it has an obvious upper bound (covering the whole table)
The algorithm will therefore converge eventually
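The argument can be written out as a bounded monotone sequence. The sketch below assumes that coverage counts matched attributes, so a row contributes at most m (the number of attributes) and a table of n rows at most n·m; the symbol C_t, the total coverage after the t-th phase, is introduced here for illustration.

```latex
\[
C_0 \le C_1 \le C_2 \le \dots \le n \cdot m
\]
```

Since every C_t is an integer, the total coverage can increase strictly at most n·m times, so an algorithm that stops as soon as an iteration brings no increase must terminate after a finite number of iterations.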

19 Complexity
Phase 1:
  In l iterations, we need to go through the n rows in the table and match each row against the k patterns (2m comparisons per match)
  The running time complexity is O(kmnl), where m is the number of attributes
Phase 2:
  Computing each new pattern Pi requires going through all the domain values/intervals of each attribute
  Assuming the total number of domain values/intervals is d, the running time complexity is O(kdl)
The total time complexity is O(kmnl+kdl)

20 Advantages of ItCompress
Simplicity and directness
  Fascicles and SPARTAN use a two-phase process:
    Find rules/patterns
    Compress the database using the discovered rules/patterns
  ItCompress optimizes the compression directly, without finding rules/patterns that may not be useful (a.k.a. the microeconomic approach)
Fewer constraints
  No need for patterns to be matched completely or for rules that apply globally
Easily tuned parameters

21 Performance Comparison
Algorithms
  ItCompress, ItCompress+gzip
  Fascicles, Fascicles+gzip
  SPARTAN+gzip
Platform
  ItCompress, Fascicles: AMD Duron 700MHz, 256MB memory
  SPARTAN: four 700MHz Pentium CPUs, 1GB memory
Datasets
  Corel: 32 numeric attributes, … rows, 10.5MB
  Census: 7 numeric, 7 categorical, … rows, 28.6MB
  Forest-cover: 10 numeric, 44 categorical, … rows, 75.2MB

22 Effectiveness (Corel)

23 Effectiveness (Census)

24 Effectiveness (Forest Cover)

25 Efficiency

26 Varying k

27 Varying Sample Ratio

28 Adding Noises (Census)

29 Effect of Corruption
[Figure: 20% corruption across attributes A1 to A10]

30 Effect of Corruption
[Figure: 20% corruption across attributes A1 to A10]

31 Findings
ItCompress is:
  More efficient than SPARTAN
  More effective than Fascicles
  Insensitive to parameter settings
  Robust to noise

32 Future work
Can we perform mining on the compressed datasets using only the patterns and the bitmap?
  Example: building a Bayesian belief network
Is ItCompress a good “bootstrap” semantic compression algorithm?
[Diagram labels: database, ItCompress, compressed database, other semantic compression algorithms]

33 Reference List
S. Babu, M. Garofalakis, and R. Rastogi. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. ACM SIGMOD, 2001.
H. V. Jagadish, J. Madar, and R. T. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB, 1999.
H. V. Jagadish, R. T. Ng, B. C. Ooi, and A. K. H. Tung. ItCompress: An Iterative Semantic Compression Algorithm. ICDE, 2004.
J. Ziv and A. Lempel. A Universal Algorithm for Sequential Data Compression. IEEE Trans. on Information Theory, 1977.

34 Thank you!

