Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps Tan Apaydin – The Ohio State University Guadalupe Canahuate – The Ohio.

Presentation transcript:

Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps Tan Apaydin – The Ohio State University Guadalupe Canahuate – The Ohio State University Hakan Ferhatosmanoglu – The Ohio State University Ali Saman Tosun – University of Texas at San Antonio

Presentation Outline Motivation Goal Approximate Bitmaps (AB) encoding AB example Theoretical analysis Experiments and Results Conclusion

Motivation Bitmap indices  Data warehouses  Scientific data  Visualization applications  Bitwise operations Bitmap Compression  Run-length encoders Word Aligned Hybrid (WAH) Byte-aligned Bitmap Code (BBC)

Motivation The row numbers no longer correspond to the bit positions in the compressed bitmap Queries over a few particular rows  As expensive as queries asking for all the rows Commonly, users are only interested in a small subset of the dataset at a time. For example:  A query over the transactions of the last 7 days  Spatial queries over objects in a specific geographical area

Motivation Visualization applications  Millions of different readings ordered by their geographic location  Users ask range queries over some of the readings for a given area  The answers are highlighted on the screen  Several degrees of resolution make approximate answers acceptable

Our Goal Enable direct access over any subset of the bitmap Achieve effective compression Maintain bitwise operations for query execution Trade-off efficiency vs. accuracy  No false negatives

The approach Our solution is inspired by Bloom Filters  A 2^m-bit array indexed using k independent hash functions  A data object is inserted by setting the k positions in the array corresponding to the hash values of the object  False positives can happen, but false negatives cannot
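
The Bloom filter behaviour described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `BloomFilter`, the method names, and the use of salted SHA-256 to simulate k independent hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over a 2**m-bit array with k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.size = 1 << m                 # 2**m bit positions
        self.k = k
        self.bits = bytearray(self.size)   # one byte per bit, for readability

    def _positions(self, item: str):
        # Assumption: k hash functions are simulated by salting SHA-256
        # with the function index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.size

    def insert(self, item: str) -> None:
        # Set the k positions given by the hash values of the item.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True if all k positions are set: may be a false positive,
        # but an inserted item is never reported as absent.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=10, k=3)
for key in ("row1", "row2", "row3"):
    bf.insert(key)
print(all(bf.might_contain(key) for key in ("row1", "row2", "row3")))  # True
```

Every inserted key is reported as present, which is the "no false negatives" property the slide relies on.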

Approximate Bitmaps (AB) A Bloom filter-like structure Only the set bits are inserted into the AB Three levels of encoding:  Per table, per attribute, per bitmap column Parameters:  The hash string mapping function, F  The k hash functions, {H_1(x),…,H_k(x)}  The size of the AB, n = αs = 2^m Precision in terms of α and k: approximately 1 - (1 - e^(-k/α))^k

AB Example [Figure: bitmap table with columns A1 A2 A3 B1 B2 B3 C1 C2 C3] A bitmap table for a dataset with 8 rows and 3 attributes. Each attribute is divided into 3 categories. Bitmap Table Size: 72 bits Number of set bits = 24. F(i,j) = concatenate(i,j) = x H_1(x) = x mod 32 m = 5 AB Size: 2^5 = 32 bits

AB Example - Insertion Initially all bits in the AB are zero To insert the set bit at (1,1)

AB Example - Insertion To insert the set bit at (1,1)  x = 11  H(11) = 11 mod 32 = 11  AB(11) = 1

AB Example - Insertion To insert the set bit at (5,4)  x = 54  H(54) = 54 mod 32 = 22  AB(22) = 1

AB Example - Insertion After all insertions
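
The insertion steps of the example can be sketched in a few lines. The functions `F` and `H1` follow the slide definitions (decimal concatenation and `x mod 32`); the helper name `insert` and the list representation of the AB are assumptions made for the illustration.

```python
AB_BITS = 32  # 2**5 bits, as in the example


def F(row: int, col: int) -> int:
    """Hash string for a cell: decimal concatenation of row and column."""
    return int(f"{row}{col}")


def H1(x: int) -> int:
    """The single hash function of the example: x mod 32."""
    return x % AB_BITS


def insert(ab: list, row: int, col: int) -> int:
    """Set the AB bit for the set bit at (row, col); return its position."""
    pos = H1(F(row, col))
    ab[pos] = 1
    return pos


ab = [0] * AB_BITS            # initially all bits are zero
print(insert(ab, 1, 1))       # 11
print(insert(ab, 5, 4))       # 22
```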

AB Example - Analysis The underlined positions are false positives Only 8 of the 48 zero bits are reported as set by the AB Estimated Precision:  α = AB Size / Set Bits  α = 32/24 = 1.33  k = 1  FP = (1 - e^(-k/α))^k  P = 1 - FP  P = 1 - (1 - e^(-1/1.33))  P = 47%
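
The estimated precision on this slide can be checked numerically. The function name `estimated_precision` is an assumption for this sketch; the formula is the one stated on the slides.

```python
import math


def estimated_precision(ab_size: int, set_bits: int, k: int) -> float:
    """Estimated precision P = 1 - (1 - e^(-k/alpha))^k with alpha = n/s."""
    alpha = ab_size / set_bits
    false_positive_rate = (1.0 - math.exp(-k / alpha)) ** k
    return 1.0 - false_positive_rate


p = estimated_precision(ab_size=32, set_bits=24, k=1)
print(f"{p:.0%}")  # 47%
```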

AB Example - Retrieval Consider this query, asking for 4 rows This is a range query over 4 rows, where the third attribute falls into C1 or C2 Row 4:  (4,7): H(47) = 15 AB(15)=0  (4,8): H(48) = 16 AB(16)=1 Row 5:  (5,7): H(57) = 25 AB(25)=1  Stop

AB Example - Retrieval Consider this query, asking for 4 rows Row 6:  (6,7): H(67) = 3 AB(3)=1 Stop Approx Query Answer:  {1,1,1,0} Exact Answer:  {0,1,1,0}
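
The retrieval walk-through can be sketched as below. Two things are assumptions here: that the four queried rows are rows 4 to 7, and the AB contents, which are reduced to the three set bits named on the slides (positions 3, 16, 25) with every other probed position left at zero. The helper names `probe` and `query_rows` are hypothetical.

```python
AB_BITS = 32


def probe(ab: list, row: int, col: int) -> int:
    """Look up the AB bit for cell (row, col) using F = concatenate, H = mod 32."""
    return ab[int(f"{row}{col}") % AB_BITS]


def query_rows(ab: list, rows, cols) -> list:
    """OR the probed columns per row, stopping at the first set bit."""
    answer = []
    for row in rows:
        # any() short-circuits, which mirrors the "Stop" steps on the slides.
        answer.append(1 if any(probe(ab, row, c) for c in cols) else 0)
    return answer


ab = [0] * AB_BITS
for pos in (3, 16, 25):       # bits reported as set in the walk-through
    ab[pos] = 1

# Columns 7 and 8 correspond to C1 and C2.
print(query_rows(ab, rows=range(4, 8), cols=(7, 8)))  # [1, 1, 1, 0]
```

Only the queried rows are touched, so the cost depends on the size of the query rather than on the size of the bitmap.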

Approximate Bitmaps (AB) – Mapping Function F F maps each cell in the bitmap table to a unique string (the hashing string) For one AB per table and one AB per attribute, the bit in row i column j is identified by  F(i,j) = i << w || j, where w is large enough to accommodate all j For one AB per column, the bit in row i is identified by  F(i,j) = i
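
A small sketch of the mapping function for the per-table and per-attribute cases, reading `i << w || j` as a shift followed by placing j in the low w bits. The factory name `make_mapper` and the choice of `w` as the bit length of the largest column number are assumptions for illustration.

```python
def make_mapper(num_columns: int):
    """Return F(i, j) = (i << w) | j, with w wide enough for every column j."""
    w = max(1, num_columns.bit_length())   # bits needed for 0..num_columns

    def F(i: int, j: int) -> int:
        return (i << w) | j

    return F, w


F, w = make_mapper(num_columns=9)
print(w)         # 4
print(F(5, 4))   # 84

# Every cell of an 8-row, 9-column table gets a distinct hash string.
cells = {F(i, j) for i in range(1, 9) for j in range(1, 10)}
print(len(cells))  # 72
```

Because j always fits in the low w bits, two different cells can never produce the same hash string, unlike decimal concatenation, where (1,11) and (11,1) would both give 111.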

Approximate Bitmaps (AB) – Hash Functions Single Hash Function  Called once and the result is divided into pieces.  Each piece is considered as the value of a different hash function.  Secure Hash Algorithm (SHA), developed by the National Institute of Standards and Technology (NIST) Multiple Hash Functions  Independent hash functions  For large number, similar performance [Figure: SHA output bits divided into pieces H0, H1, H2, ..., H9]
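
The single-hash-function approach can be sketched as follows. The slides only say "SHA"; using SHA-1 (160 output bits) here is an assumption, as are the function name `sha_pieces` and the way the hash string is serialized before hashing.

```python
import hashlib


def sha_pieces(x: int, k: int, m: int) -> list:
    """Hash x once and cut the digest into k pieces of m bits each."""
    if k * m > 160:
        raise ValueError("SHA-1 yields only 160 bits; need k * m <= 160")
    digest = int.from_bytes(hashlib.sha1(str(x).encode()).digest(), "big")
    mask = (1 << m) - 1
    # Piece i is treated as the value of hash function H_i.
    return [(digest >> (i * m)) & mask for i in range(k)]


positions = sha_pieces(54, k=10, m=16)
print(len(positions))                          # 10
print(all(0 <= p < 2 ** 16 for p in positions))  # True
```

With m = 16, ten pieces use exactly the 160 digest bits; a larger AB (larger m) leaves room for fewer pieces per call.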

Approximate Bitmaps (AB) – FP Rate FP Rate: Probability that all k bits are set by other data objects FP = (1 - e^(-k/α))^k n is the size of the AB s is the number of set bits n = αs, α = n/s
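
The false positive rate follows from the standard Bloom filter argument, sketched here under the usual assumption that each hash value is uniform over the n positions and independent of the others. After s set bits have been inserted with k hash functions each, a probe is a false positive only if all k of its positions were set by other insertions.

```latex
\begin{align*}
\Pr[\text{a given bit is still } 0]
  &= \left(1 - \frac{1}{n}\right)^{ks} \approx e^{-ks/n} = e^{-k/\alpha}, \\
\mathrm{FP} &= \left(1 - e^{-k/\alpha}\right)^{k}, \\
P &= 1 - \mathrm{FP} = 1 - \left(1 - e^{-k/\alpha}\right)^{k}.
\end{align*}
```

The last line is the precision expression used in the AB example, where α = 32/24 and k = 1.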

Approximate Bitmaps (AB) – Size In terms of α:  n = αs  m = ceil(log2(αs)) One AB per dataset:  s = |A|*N (|A| attributes, N rows) One AB per attribute:  s = N One AB per column:  s depends on the data distribution
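
The sizing rules can be turned into a short calculation. The function names `set_bits` and `ab_size` are assumptions for this sketch, and α = 2 in the usage line is an arbitrary illustrative value, not one taken from the experiments.

```python
import math


def set_bits(level: str, num_rows: int, num_attributes: int) -> int:
    """Number of set bits s inserted into one AB at the given encoding level."""
    if level == "table":
        return num_attributes * num_rows   # s = |A| * N
    if level == "attribute":
        return num_rows                    # s = N
    raise ValueError("per-column s depends on the data distribution")


def ab_size(alpha: float, s: int):
    """Return (m, n) with m = ceil(log2(alpha * s)) and n = 2**m bits."""
    m = math.ceil(math.log2(alpha * s))
    return m, 1 << m


s = set_bits("table", num_rows=8, num_attributes=3)
print(s)                 # 24
print(ab_size(2, s))     # (6, 64)
```

Rounding m up means the realized AB size 2^m can be larger than αs, so the effective α is at least the requested one.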

Experimental Setup Three datasets: Uniform, Landsat, HEP [Table: Rows, Attributes, Columns per dataset; surviving values: Uniform 100,… Landsat 275,… HEP 2,173,…] Query by sampling (randomly selecting the columns queried) Varying the number of rows queried from 100 to 10K

Experimental Results - Size Always use the largest α that produces an AB smaller than or comparable in size to the WAH-compressed bitmap

Experimental Results - Precision As α increases, the precision increases steadily and is very close to 1 for larger α Precision increases as k increases up to the optimum point, then decreases because a large number of hash functions produces more collisions
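
The existence of an optimum k can be illustrated with the precision formula from the earlier slides. This is a model-based sketch, not a reproduction of the experiments; the function names and the choice α = 8 are assumptions.

```python
import math


def precision(alpha: float, k: int) -> float:
    """Model precision P = 1 - (1 - e^(-k/alpha))^k."""
    return 1.0 - (1.0 - math.exp(-k / alpha)) ** k


def best_k(alpha: float, max_k: int = 32) -> int:
    """Integer k in 1..max_k that maximizes the model precision."""
    return max(range(1, max_k + 1), key=lambda k: precision(alpha, k))


alpha = 8
print(best_k(alpha))  # 6
for k in (1, 6, 20):
    print(k, round(precision(alpha, k), 4))
```

In this model the optimum sits near k = α ln 2; beyond it, each extra hash function sets more bits than it helps to discriminate, so precision drops again.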

Experimental Results – Exec Time Execution time of the AB depends on the number of rows queried, not on the number of rows in the dataset For queries over fewer than 10% to 15% of the rows, AB execution is up to 3 orders of magnitude faster than WAH

Conclusion AB encoding approximates the bitmaps using multiple hashing of the set bits Allows efficient retrieval of any subset of rows and columns Trade-off between bitmap size and precision Three levels of encoding Approximate query answers are given without database access

Questions and Comments Thank you!