Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps Tan Apaydin – The Ohio State University Guadalupe Canahuate – The Ohio.


1 Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps Tan Apaydin – The Ohio State University Guadalupe Canahuate – The Ohio State University Hakan Ferhatosmanoglu – The Ohio State University Ali Saman Tosun – University of Texas at San Antonio

2 Presentation Outline Motivation Goal Approximate Bitmaps (AB) encoding AB example Theoretical analysis Experiments and Results Conclusion

3 Motivation Bitmap indices  Data warehouses  Scientific data  Visualization applications  Bitwise operations Bitmap Compression  Run-length encoders Word Aligned Hybrid (WAH) Byte-aligned Bitmap Code (BBC)

4 Motivation In a compressed bitmap, row numbers no longer correspond to bit positions Queries over a few particular rows  As expensive as queries asking for all the rows Commonly, users are only interested in a small subset of the dataset at a time. For example:  A query over the transactions of the last 7 days  Spatial queries over objects in a specific geographical area

5 Motivation Visualization applications  Millions of different readings ordered by their geographic location  Users ask range queries over some of the readings for a given area  The answers are highlighted on the screen  Several degrees of resolution make approximate answers acceptable

6 Our Goal Enable direct access over any subset of the bitmap Achieve effective compression Maintain bitwise operations for query execution Trade-off efficiency vs. accuracy  No false negatives

7 The approach Our solution is inspired by Bloom Filters  A 2^m-bit array indexed using k independent hash functions  A data object is inserted by setting the k positions in the array corresponding to the hash values of the object  False positives can happen, but false negatives cannot
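The Bloom filter behaviour described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `BloomFilter`, the method names, and the way k positions are derived (salting SHA-1 with the function index) are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over a 2**m-bit array with k hash functions."""

    def __init__(self, m: int, k: int):
        self.size = 1 << m            # the array has 2^m bits
        self.k = k
        self.bits = [0] * self.size

    def _positions(self, item: str):
        # Assumption: k hash functions are simulated by salting SHA-1
        # with the function index; any k independent hashes would do.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        # Insertion sets the k positions given by the hash values.
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(item))


bf = BloomFilter(m=10, k=3)
for word in ("alpha", "beta", "gamma"):
    bf.add(word)
```

Every inserted item is reported as present, which is the "no false negatives" property the AB relies on.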

8 Approximate Bitmaps (AB) A Bloom filter-like structure Only the set bits are inserted into the AB Three levels of encoding:  Per table, per attribute, per bitmap column Parameters:  The hash string mapping function, F  The k hash functions, {H1(x),…,Hk(x)}  The size of the AB, n = αs = 2^m Precision in terms of α and k, ~(1-(1-e^(-k/α))^k)
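The precision estimate on this slide is easy to evaluate numerically. The helper below is only a sketch of that formula; the function name `estimated_precision` is made up for illustration.

```python
import math


def estimated_precision(alpha: float, k: int) -> float:
    """Estimated precision 1 - (1 - e^(-k/alpha))^k of an AB.

    alpha is the ratio of AB size to number of set bits (n / s),
    k is the number of hash functions.
    """
    false_positive_rate = (1.0 - math.exp(-k / alpha)) ** k
    return 1.0 - false_positive_rate


# Larger alpha (more bits per set bit) gives higher estimated precision.
for alpha in (2, 4, 8, 16):
    print(alpha, round(estimated_precision(alpha, k=1), 3))
```

For a fixed k the estimate grows with α, which is the size versus precision trade-off the talk describes.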

9 AB Example A bitmap table for a dataset with 8 rows and 3 attributes. Each attribute is divided into 3 categories.

Column   1  2  3  4  5  6  7  8  9
         A1 A2 A3 B1 B2 B3 C1 C2 C3
Row 1    1  0  0  0  0  1  0  0  1
Row 2    0  1  0  0  1  0  0  1  0
Row 3    0  0  1  1  0  0  1  0  0
Row 4    0  0  1  0  0  1  0  0  1
Row 5    1  0  0  1  0  0  1  0  0
Row 6    1  0  0  0  1  0  1  0  0
Row 7    0  1  0  0  1  0  0  1  0
Row 8    0  0  1  0  0  1  0  0  1

Bitmap Table Size: 72 bits Number of set bits = 24. F(i,j) = concatenate(i,j) = x H1(x) = x mod 32 m = 5 AB Size: 2^5 = 32 bits

10 AB Example - Insertion Initially all bits in the AB are zero To insert set bit in (1,1)
AB (positions 0 to 31, left to right): 00000000000000000000000000000000

11 AB Example - Insertion To insert set bit in (1,1)  x = 11  H(11) = 11 mod 32 = 11  AB(11) = 1
AB (positions 0 to 31, left to right): 00000000000100000000000000000000

12 AB Example - Insertion To insert set bit in (5,4)  x = 54  H(54) = 54 mod 32 = 22  AB(22) = 1
AB (positions 0 to 31, left to right): 00000000000100000000001000000000

13 AB Example - Insertion After all insertions
AB (positions 0 to 31, left to right): 01110100100100101101001001001100
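The insertion walk-through can be reproduced with a short script. It uses the 8 x 9 bitmap table and the functions F(i,j) = concatenate(i,j) and H1(x) = x mod 32 from the example slides; the Python names `TABLE`, `F`, `H1` and `build_ab` are chosen here for illustration only.

```python
# Bitmap table from the example: one string per row, columns 1..9
# (A1 A2 A3 B1 B2 B3 C1 C2 C3).
TABLE = [
    "100001001",
    "010010010",
    "001100100",
    "001001001",
    "100100100",
    "100010100",
    "010010010",
    "001001001",
]


def F(i: int, j: int) -> int:
    """Hash string for cell (i, j): decimal concatenation of i and j."""
    return int(f"{i}{j}")


def H1(x: int, size: int = 32) -> int:
    """Single hash function of the example: x mod AB size."""
    return x % size


def build_ab(table, size: int = 32):
    ab = [0] * size                      # initially all bits are zero
    for i, row in enumerate(table, start=1):
        for j, bit in enumerate(row, start=1):
            if bit == "1":               # only set bits are inserted
                ab[H1(F(i, j), size)] = 1
    return ab


ab = build_ab(TABLE)
ab_string = "".join(map(str, ab))
print(ab_string)
```

Several set bits collide on the same position (for example cells (1,1), (4,3) and (7,5) all map to 11), which is why 24 insertions leave only 14 positions set.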

14 AB Example - Analysis False positives: only 8 of the 48 zero cells in the bitmap table map to a set bit in the AB
AB (positions 0 to 31, left to right): 01110100100100101101001001001100
Estimated Precision:  α = AB size / set bits  α = 32/24 = 1.33  k = 1  FP = (1-e^(-k/α))^k  P = 1-FP  P = 1-(1-e^(-1/1.33))  P = 47%
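The false-positive count can be checked by probing every zero cell of the example table against the finished AB. The sketch below repeats the table so that it runs on its own; the variable names are illustrative.

```python
ROWS = [
    "100001001", "010010010", "001100100", "001001001",
    "100100100", "100010100", "010010010", "001001001",
]
SIZE = 32


def pos(i: int, j: int) -> int:
    """AB position of cell (i, j): concatenate(i, j) mod 32."""
    return int(f"{i}{j}") % SIZE


cells = [(i, j, bit)
         for i, row in enumerate(ROWS, start=1)
         for j, bit in enumerate(row, start=1)]

# Positions set in the AB after inserting every set bit.
set_positions = {pos(i, j) for i, j, bit in cells if bit == "1"}

# A false positive is a zero cell whose position is set in the AB.
zeros = [(i, j) for i, j, bit in cells if bit == "0"]
false_positives = [(i, j) for i, j in zeros if pos(i, j) in set_positions]

print(len(zeros), len(false_positives))
print(false_positives)
```

On this table the measured precision, 24 true set bits out of 32 reported, is higher than the 47% estimate, which is a generic formula and does not depend on the actual data.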

15 AB Example - Retrieval Consider this query, asking for 4 rows This is a range query over 4 rows, where the third attribute falls into C1 or C2
AB (positions 0 to 31, left to right): 01110100100100101101001001001100
Row 4:  (4,7): H(47) = 15 AB(15)=0  (4,8): H(48) = 16 AB(16)=1 Row 5:  (5,7): H(57) = 25 AB(25)=1  Stop

16 AB Example - Retrieval Consider this query, asking for 4 rows
AB (positions 0 to 31, left to right): 01110100100100101101001001001100
Row 6:  (6,7): H(67) = 3 AB(3)=1 Stop Approx Query Answer:  {1,1,1,0} Exact Answer:  {0,1,1,0}
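The per-row probing shown on the retrieval slides amounts to an OR over the queried columns that stops at the first set bit. The sketch below covers only the three rows whose probes are spelled out on the slides (rows 4, 5 and 6); the helper names `probe` and `row_matches` are illustrative.

```python
# AB after all insertions, positions 0..31 left to right.
AB = [int(b) for b in "01110100100100101101001001001100"]


def probe(i: int, j: int) -> int:
    """Look up cell (i, j) directly in the AB: H(concatenate(i, j))."""
    return AB[int(f"{i}{j}") % len(AB)]


def row_matches(i: int, columns) -> int:
    # OR over the queried columns; any() stops at the first set bit,
    # which is the "Stop" step on the slides.
    return int(any(probe(i, j) for j in columns))


C1, C2 = 7, 8                     # columns of the third attribute
answers = {i: row_matches(i, (C1, C2)) for i in (4, 5, 6)}
print(answers)
```

Row 4 is reported as a match even though neither C1 nor C2 is set for it in the table: position 16 was set by cell (1,6), so the probe of (4,8) is a false positive. Each row costs at most one probe per queried column, independent of the number of rows in the dataset.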

17 Approximate Bitmaps (AB) – Mapping Function F F maps each cell in the bitmap table to a unique string (the hashing string) For one AB per table and one AB per attribute, the bit in row i column j is identified by  F(i,j) = i << w || j, where w is large enough to accommodate all j For one AB per column, the bit in row i is identified by  F(i,j) = i
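One way to read F(i,j) = i << w || j is as a shift followed by a bitwise combination, so that the row number occupies the high bits and the column number the low w bits. The sketch below follows that reading, which is an interpretation rather than something the slide states; `make_mapping` is a hypothetical helper.

```python
def make_mapping(num_columns: int):
    """Return (F, w) with F(i, j) = (i << w) | j.

    w is the number of bits needed to hold the largest column index,
    so distinct (i, j) pairs never produce the same hash string.
    """
    w = max(1, num_columns.bit_length())

    def F(i: int, j: int) -> int:
        return (i << w) | j

    return F, w


F, w = make_mapping(num_columns=9)   # 9 columns as in the example table
keys = {F(i, j) for i in range(1, 9) for j in range(1, 10)}
print(w, F(5, 4), len(keys))
```

With 9 columns, w is 4 and all 72 cells of the example table receive distinct keys. For one AB per column the key is simply the row number i.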

18 Approximate Bitmaps (AB) – Hash Functions Single Hash Function  Called once and the result is divided into pieces.  Each piece considered as the value of a different hash function.  Secure Hash Algorithm (SHA), developed by National Institute of Standards and Technology (NIST) Multiple Hash Functions  Independent hash functions  For large number, similar performance

Hash Function   Bits        SHA Output
H0              159..144    0100100010001010
H1              143..128    1000010100100001
H2              127..112    0111100011100010
...             ...         ...
H9              15..0       0000010101110011
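The single-hash-function scheme can be sketched with Python's standard `hashlib`. The split of the 160-bit SHA-1 output into ten 16-bit pieces follows the table on this slide; how the hash string is serialized (decimal ASCII here) and how each piece is reduced to an m-bit position (modulo 2^m here) are assumptions made for this example.

```python
import hashlib


def sha1_pieces(x: int, k: int, m: int):
    """Return k AB positions for hash string x from one SHA-1 call.

    H0 takes bits 159..144 of the digest, H1 bits 143..128, and so on.
    """
    if not 1 <= k <= 10:
        raise ValueError("a 160-bit digest yields at most ten 16-bit pieces")
    if not 1 <= m <= 16:
        raise ValueError("each piece carries only 16 bits")
    digest = hashlib.sha1(str(x).encode("ascii")).digest()   # 20 bytes
    value = int.from_bytes(digest, "big")
    positions = []
    for t in range(k):
        shift = 160 - 16 * (t + 1)
        piece = (value >> shift) & 0xFFFF
        positions.append(piece % (1 << m))    # assumed reduction to m bits
    return positions


print(sha1_pieces(54, k=3, m=5))
```

Calling the hash once and slicing the result avoids computing k separate digests per cell.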

19 Approximate Bitmaps (AB) – FP Rate FP Rate: probability that all k bits are set by other data objects  n is the size of the AB  s is the number of set bits  n = αs, α = n/s
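The formula itself did not survive in this transcript. The standard Bloom filter analysis, written with the symbols defined on this slide and assuming independent, uniformly distributed hash functions, gives an expression consistent with the precision estimate quoted on the AB parameters slide:

```latex
\begin{align*}
\Pr[\text{a given bit is still 0 after inserting } s \text{ set bits}]
  &= \left(1 - \frac{1}{n}\right)^{ks} \approx e^{-ks/n} = e^{-k/\alpha}, \\
\mathrm{FP}
  &= \left(1 - e^{-k/\alpha}\right)^{k}, \\
P = 1 - \mathrm{FP}
  &= 1 - \left(1 - e^{-k/\alpha}\right)^{k}.
\end{align*}
```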

20 Approximate Bitmaps (AB) – Size In terms of α:  n = αs  m = ceil(log2(αs)) One AB per dataset:  s = |A|*N One AB per attribute:  s = N One AB per column:  s depends on the data distribution
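The size rule can be turned into a small helper. The sketch reads |A| as the number of attributes and N as the number of rows; that reading is an assumption, though it agrees with the 8-row, 3-attribute example having 24 set bits. The function names are illustrative.

```python
import math


def ab_size(alpha: float, s: int):
    """Return (m, 2**m) with m = ceil(log2(alpha * s))."""
    required = math.ceil(alpha * s)
    # (required - 1).bit_length() equals ceil(log2(required)) for
    # required >= 1 and avoids floating-point error near powers of two.
    m = max(0, (required - 1).bit_length())
    return m, 1 << m


def set_bits_per_dataset(num_attributes: int, num_rows: int) -> int:
    # One AB per dataset: every row sets one bit per attribute.
    return num_attributes * num_rows


s = set_bits_per_dataset(num_attributes=3, num_rows=8)
print(s, ab_size(2, s))
```

Because m is rounded up, the AB actually allocated can be up to twice as large as αs.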

21 Experimental Setup Three datasets:

Dataset    Rows         Attributes   Columns
Uniform    100,000      2            100
Landsat    275,465      60           900
HEP        2,173,762    6            66

Query by sampling (randomly selecting the columns queried) Varying the number of rows queried from 100 to 10K

22 Experimental Results - Size Always use the largest α that produces an AB smaller than or comparable in size to the WAH-compressed bitmap

23 Experimental Results - Precision As α increases, the precision increases steadily and is very close to 1 for larger α Precision increases as k increases up to the optimum point Beyond that point, a larger number of hash functions produces more collisions

24 Experimental Results – Exec Time Execution time of the AB depends on the number of rows queried, not on the number of rows in the dataset For queries over fewer than 10% to 15% of the rows, AB execution is up to 3 orders of magnitude faster than WAH

25 Conclusion AB encoding approximates the bitmaps using multiple hashing of the set bits Allows efficient retrieval of any subset of rows and columns Trade-off between bitmap size and precision Three levels of encoding Approximate query answers are given without database access

26 Questions and Comments Thank you! Email: canahuat@cse.ohio-state.edu

