Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps Tan Apaydin – The Ohio State University Guadalupe Canahuate – The Ohio State University Hakan Ferhatosmanoglu – The Ohio State University Ali Saman Tosun – University of Texas at San Antonio
Presentation Outline Motivation Goal Approximate Bitmaps (AB) encoding AB example Theoretical analysis Experiments and Results Conclusion
Motivation Bitmap indices Data warehouses Scientific data Visualization applications Bitwise operations Bitmap Compression Run-length encoders Word Aligned Hybrid (WAH) Byte-aligned Bitmap Code (BBC)
Motivation After compression, row numbers no longer correspond to bit positions in the bitmap. Queries over a few particular rows are as expensive as queries asking for all the rows. Commonly, users are interested in only a small subset of the dataset at a time. For example: a query over the transactions of the last 7 days; spatial queries over objects in a specific geographical area.
Motivation Visualization applications: millions of different readings ordered by their geographic location. Users ask range queries over some of the readings for a given area. The answers are highlighted on the screen. Several degrees of resolution make approximate answers acceptable.
Our Goal Enable direct access over any subset of the bitmap Achieve effective compression Maintain bitwise operations for query execution Trade-off efficiency vs. accuracy No false negatives
The approach Our solution is inspired by Bloom Filters. A $2^m$-bit array is indexed using $k$ independent hash functions. A data object is inserted by setting the $k$ positions in the array corresponding to the hash values of the object. False positives can happen, but false negatives cannot.
Approximate Bitmaps (AB) A Bloom filter-like structure. Only the set bits are inserted into the AB. Three levels of encoding: per table, per attribute, per bitmap column. Parameters: the hash string mapping function, $F$; the $k$ hash functions, $\{H_1(x), \ldots, H_k(x)\}$; the size of the AB, $n = \alpha s = 2^m$. Precision in terms of $\alpha$ and $k$: $P \approx 1 - (1 - e^{-k/\alpha})^k$.
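A minimal sketch of such a structure may help fix the ideas. The class name `ApproximateBitmap`, its methods, and the SHA-256-based hash family are illustrative assumptions, not the construction used in the talk; only the insert-set-bits and test-all-k-positions logic comes from the slide.

```python
import hashlib


class ApproximateBitmap:
    """Bloom-filter-like bit array holding only the set bits of a bitmap."""

    def __init__(self, m, k):
        self.n = 1 << m                 # AB size n = 2**m bits
        self.k = k                      # number of hash functions
        self.bits = bytearray(self.n)   # one byte per bit, for readability

    def _positions(self, x):
        # Hypothetical hash family: SHA-256 of "index:x", reduced mod n.
        for idx in range(self.k):
            digest = hashlib.sha256(f"{idx}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def insert(self, x):
        # Called only for cells whose bit is 1 in the original bitmap.
        for pos in self._positions(x):
            self.bits[pos] = 1

    def test(self, x):
        # True may be a false positive; False is never a false negative.
        return all(self.bits[pos] for pos in self._positions(x))


ab = ApproximateBitmap(m=10, k=3)
for x in range(100):
    ab.insert(x)
```

Every inserted value tests as present afterwards, which is the "no false negatives" property; values that were never inserted may occasionally test as present too.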
AB Example [Bitmap table with columns A1, A2, A3, B1, B2, B3, C1, C2, C3] A bitmap table for a dataset with 8 rows and 3 attributes. Each attribute is divided into 3 categories. Bitmap table size: 72 bits. Number of set bits: 24. $F(i,j) = \mathrm{concatenate}(i,j) = x$. $H_1(x) = x \bmod 32$. $m = 5$. AB size: $2^5 = 32$ bits.
AB Example - Insertion Initially, all bits in the AB are zero. To insert the set bit at (1,1):
AB Example - Insertion To insert the set bit at (1,1): $x = 11$, $H(11) = 11 \bmod 32 = 11$, so AB(11) = 1.
AB Example - Insertion To insert the set bit at (5,4): $x = 54$, $H(54) = 54 \bmod 32 = 22$, so AB(22) = 1.
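The two insertions shown on these slides can be replayed in a few lines. This is a sketch under the example's own parameters ($F$ as decimal concatenation, $H_1(x) = x \bmod 32$, a 32-bit AB); the function names `concatenate` and `h1` are illustrative, and only the two cells named on the slides are inserted, since the full bitmap table is not reproduced here.

```python
def concatenate(i, j):
    # F(i, j): decimal concatenation of row and column, e.g. (5, 4) -> 54.
    return int(f"{i}{j}")


def h1(x):
    # The single hash function of the example.
    return x % 32


ab = [0] * 32                        # initially all bits are zero
for row, col in [(1, 1), (5, 4)]:    # the two set bits shown on the slides
    ab[h1(concatenate(row, col))] = 1

set_positions = [pos for pos, bit in enumerate(ab) if bit]
print(set_positions)  # [11, 22]
```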
AB Example - Insertion After all insertions [Figure: bitmap table with columns A1 to C3]
AB Example - Analysis [Figure: bitmap table with columns A1 to C3] The underlined positions are false positives. Only 8 out of the 48 zeros are set in the AB. Estimated precision: $\alpha$ = AB size / set bits = 32/24 = 1.33, $k = 1$. $FP = (1 - e^{-k/\alpha})^k$, $P = 1 - FP = 1 - (1 - e^{-1/1.33}) \approx 47\%$.
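The arithmetic behind the estimate can be checked directly. The helper name `estimated_precision` is an assumption made for this sketch; the formula is the one stated on the slide.

```python
import math


def estimated_precision(ab_size, set_bits, k):
    # alpha is the ratio of AB size to the number of set bits.
    alpha = ab_size / set_bits
    false_positive_rate = (1.0 - math.exp(-k / alpha)) ** k
    return 1.0 - false_positive_rate


p = estimated_precision(ab_size=32, set_bits=24, k=1)
print(f"{p:.0%}")  # 47%
```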
AB Example - Retrieval Consider this query, asking for 4 rows. This is a range query over 4 rows, where the third attribute falls into C1 or C2. Row 4: (4,7): $H(47) = 15$, AB(15) = 0; (4,8): $H(48) = 16$, AB(16) = 1. Row 5: (5,7): $H(57) = 25$, AB(25) = 1, stop.
AB Example - Retrieval Consider this query, asking for 4 rows. Row 6: (6,7): $H(67) = 3$, AB(3) = 1, stop. Approximate query answer: {1,1,1,0}. Exact answer: {0,1,1,0}.
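The retrieval walk-through can be sketched as a loop that probes the AB for each queried row and stops at the first hit within the range. Several things here are assumptions: the function names, the choice of rows 4 to 7 as the four queried rows, columns 7 and 8 standing for C1 and C2, and an AB in which only the three positions the slides report as set (3, 16, 25) are set.

```python
def concatenate(i, j):
    # F(i, j): decimal concatenation, as in the example.
    return int(f"{i}{j}")


def h1(x):
    return x % 32


def query_rows(ab, rows, columns):
    """OR the AB probes over the queried columns, row by row."""
    answer = []
    for row in rows:
        hit = 0
        for col in columns:
            if ab[h1(concatenate(row, col))]:
                hit = 1
                break            # one set bit is enough for this row
        answer.append(hit)
    return answer


ab = [0] * 32
for pos in (3, 16, 25):          # assumed AB state, taken from the slides
    ab[pos] = 1

result = query_rows(ab, rows=[4, 5, 6, 7], columns=[7, 8])
print(result)  # [1, 1, 1, 0]
```

Only the rows in the query are touched, which is why the cost depends on the number of rows queried rather than on the size of the dataset.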
Approximate Bitmaps (AB) – Mapping Function F $F$ maps each cell in the bitmap table to a unique string (the hashing string). For one AB per table and one AB per attribute, the bit in row $i$, column $j$ is identified by F(i,j) = i << w || j, where w is large enough to accommodate all j. For one AB per column, the bit in row $i$ is identified by F(i,j) = i.
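A possible reading of the shift-and-append mapping is sketched below. The factory name `make_mapping` and the way `w` is derived from the column count are assumptions; when `j` fits in `w` bits, the bitwise OR is equivalent to appending `j` to the shifted row number.

```python
def make_mapping(num_columns):
    # w bits are enough to hold every column number from 0 to num_columns.
    w = max(1, num_columns.bit_length())

    def mapping(i, j):
        if not 0 <= j < (1 << w):
            raise ValueError("column number does not fit in w bits")
        return (i << w) | j      # row in the high bits, column in the low bits

    return mapping, w


F, w = make_mapping(num_columns=9)
keys = {F(i, j) for i in range(1, 9) for j in range(1, 10)}
```

For the 8-row, 9-column example table this yields 72 distinct keys, one per cell, so no two cells share a hashing string.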
Approximate Bitmaps (AB) – Hash Functions Single hash function: called once, and the result is divided into pieces; each piece is considered as the value of a different hash function. Secure Hash Algorithm (SHA), developed by the National Institute of Standards and Technology (NIST). Multiple hash functions: independent hash functions. For a large number of hash functions, both approaches perform similarly. [Figure: the bits of the SHA output are divided into pieces H0, H1, H2, ..., H9]
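The single-hash approach might look like the following sketch. It assumes SHA-1 (160 output bits) as the SHA variant and slices the digest into `k` pieces of `m` bits each; the function name `sha_positions` and the string encoding of `x` are illustrative choices, not details given on the slide.

```python
import hashlib


def sha_positions(x, k, m):
    """Derive k positions in [0, 2**m) from a single SHA-1 digest of x."""
    if k * m > 160:
        raise ValueError("SHA-1 provides only 160 bits")
    digest = int.from_bytes(hashlib.sha1(str(x).encode()).digest(), "big")
    mask = (1 << m) - 1
    # Piece number p is the p-th group of m bits, counted from the low end.
    return [(digest >> (piece * m)) & mask for piece in range(k)]


positions = sha_positions(54, k=10, m=16)
```

With `k=10` and `m=16` the whole 160-bit digest is consumed, which matches the ten pieces H0 to H9 in the slide's figure; larger `k * m` would need a longer digest or a second call.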
Approximate Bitmaps (AB) – FP Rate FP rate: the probability that all $k$ bits are set by other data objects, $FP \approx (1 - e^{-k/\alpha})^k$, where $n$ is the size of the AB, $s$ is the number of set bits, and $n = \alpha s$, $\alpha = n/s$.
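For reference, the usual Bloom filter argument behind this rate can be written out as follows. It assumes that each of the $k$ hash functions picks a position uniformly and independently among the $n$ bits, which is the standard idealization rather than something stated on the slide.

```latex
\begin{align*}
\Pr[\text{a given bit is still 0 after } s \text{ insertions}]
  &= \left(1 - \frac{1}{n}\right)^{ks} \approx e^{-ks/n} = e^{-k/\alpha}, \\
FP &= \left(1 - \left(1 - \frac{1}{n}\right)^{ks}\right)^{k}
  \approx \left(1 - e^{-k/\alpha}\right)^{k}, \\
P &= 1 - FP \approx 1 - \left(1 - e^{-k/\alpha}\right)^{k}.
\end{align*}
```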
Approximate Bitmaps (AB) – Size In terms of $\alpha$: $n = \alpha s$, $m = \lceil \log_2(\alpha s) \rceil$. One AB per dataset: $s = |A| \cdot N$, with $|A|$ attributes and $N$ rows. One AB per attribute: $s = N$. One AB per column: $s$ depends on the data distribution.
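The sizing rule is easy to evaluate numerically. In this sketch the helper name `ab_size` is an assumption, and the figures (3 attributes, 8 rows, $\alpha = 2$) are chosen only for illustration, reusing the dimensions of the earlier example table.

```python
import math


def ab_size(alpha, s):
    # m = ceil(log2(alpha * s)); the AB then has n = 2**m bits.
    m = math.ceil(math.log2(alpha * s))
    return m, 1 << m


num_attributes, num_rows = 3, 8
s_per_dataset = num_attributes * num_rows   # one AB per dataset: s = |A| * N
s_per_attribute = num_rows                  # one AB per attribute: s = N

m_dataset, n_dataset = ab_size(alpha=2, s=s_per_dataset)
m_attribute, n_attribute = ab_size(alpha=2, s=s_per_attribute)
```

Because `m` is rounded up, the allocated size can exceed `alpha * s`, so the effective ratio of AB bits to set bits is at least the requested `alpha`.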
Experimental Setup Three datasets (table columns: Rows, Attributes, Columns): Uniform 100,… Landsat 275,… HEP 2,173,… Query by sampling (randomly selecting the columns queried). Varying the number of rows queried from 100 to 10K.
Experimental Results - Size Always use the maximum $\alpha$ that produces an AB smaller than or comparable in size to the WAH-compressed bitmaps.
Experimental Results - Precision As $\alpha$ increases, the precision increases steadily and is very close to 1 for larger $\alpha$. Precision increases as $k$ increases, up to the optimum point. Beyond that point precision drops, because a large number of hash functions produces more collisions.
Experimental Results – Exec Time Execution time of the AB depends on the number of rows queried, not on the number of rows in the dataset. For queries over fewer than 10% to 15% of the rows, AB execution is up to 3 orders of magnitude faster than WAH.
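The rise and fall of precision with $k$ can be reproduced from the estimated precision formula alone. The sketch below sweeps $k$ for a few values of $\alpha$; the $\alpha$ values and the function names are illustrative and are not the settings of the reported experiments.

```python
import math


def precision(alpha, k):
    # Estimated precision: 1 - (1 - e^(-k/alpha))^k.
    return 1.0 - (1.0 - math.exp(-k / alpha)) ** k


def best_k(alpha, max_k=32):
    # Smallest-error k found by exhaustive search over 1..max_k.
    return max(range(1, max_k + 1), key=lambda k: precision(alpha, k))


for alpha in (2, 4, 8, 16):
    k = best_k(alpha)
    print(alpha, k, round(precision(alpha, k), 4))
```

Under this model the best $k$ lands near $\alpha \ln 2$, the familiar Bloom filter optimum, and adding hash functions beyond it lowers precision again.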
Conclusion AB encoding approximates the bitmaps by hashing the set bits with multiple hash functions. It allows efficient retrieval of any subset of rows and columns. Trade-off between bitmap size and precision. Three levels of encoding. Approximate query answers are given without database access.
Questions and Comments Thank you!