Optimizing Data Popularity Conscious Bloom Filters


Optimizing Data Popularity Conscious Bloom Filters
Ming Zhong, Pin Lu, Kai Shen, Joel Seiferas
University of Rochester
PODC 2008 (slide footer: Kai Shen, 12/5/2018)

Problem Overview
Bloom filters: a compact set representation in which each object is hashed into several bits in the filter; membership queries may return false positives; useful in distributed applications that communicate sets.
Data popularity distributions are often highly skewed.
Data popularity conscious Bloom filters: use a larger number of hashes for likely false positive candidates, that is, objects that are popular in queries and objects that are unpopular in sets.
Goal: customize the hash number for each object to minimize the false positive probability.
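To make the idea of a per-object hash number concrete, here is a minimal Python sketch of a Bloom filter whose `add` and `might_contain` take the number of hashes as an argument. The class name, the use of salted SHA-256 as the hash family, and the one-byte-per-bit layout are all assumptions made for illustration; the slides do not specify an implementation.

```python
import hashlib


class VariableHashBloomFilter:
    """Bloom filter in which each object may use its own number of hashes."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, obj, num_hashes):
        # Derive the j-th bit position by hashing the object with salt j.
        for j in range(num_hashes):
            digest = hashlib.sha256(f"{j}:{obj}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, obj, num_hashes):
        for pos in self._positions(obj, num_hashes):
            self.bits[pos] = 1

    def might_contain(self, obj, num_hashes):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(obj, num_hashes))
```

The party that builds the filter and the party that queries it must agree on the hash number of every object, for example through a shared table derived from the popularity statistics.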

Object Popularity Stability
Stable object popularity is important both for learning the object popularity and for keeping the adjustment overhead low.
[Figure: illustration of stability across month-long trace segments.]

Problem Formulation and Result
Problem formulation: in a universe of $N$ objects, an $n$-object set is represented by an $m$-bit filter; object $i$'s membership popularity is $p_i$ and its non-member query popularity is $q'_i$. Find object hash numbers $k_1, k_2, \ldots, k_N$ that minimize the false positive probability $\sum_{1 \le i \le N} q'_i \cdot B^{k_i}$. Here $B$ is the probability that an arbitrary filter bit is 1, and therefore $\sum_{1 \le i \le N} p_i \cdot k_i = K = \ln(1-B) / (n \cdot \ln(1-1/m))$.
Result (assuming the $k_i$'s are unrestricted real numbers): the Lagrangian function is $\sum_{1 \le i \le N} q'_i \cdot B^{k_i} + \lambda \cdot (\sum_{1 \le i \le N} p_i \cdot k_i - K)$; the optimum is reached when its partial derivatives with respect to the $k_i$'s and $\lambda$ are all zero. We find $k_i = C + \log_{1/B}(q'_i/p_i)$, where $C$ is a constant, and also $B = 0.5$.
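The closed form above can be evaluated numerically. The following Python sketch computes the real-valued hash numbers for given popularity vectors, choosing the constant $C$ so that the constraint $\sum p_i k_i = K$ holds with $B = 0.5$. The function name and argument names are hypothetical, `q` stands for $q'$, and all popularities are assumed to be strictly positive.

```python
import math


def optimal_real_hash_numbers(p, q, n, m):
    """Return real-valued k_i = C + log2(q_i / p_i), using B = 0.5.

    p[i] is the membership popularity and q[i] the non-member query
    popularity of object i; n is the set size and m the filter size in bits.
    """
    # Hash budget K from the constraint, evaluated at B = 0.5.
    K = math.log(0.5) / (n * math.log(1.0 - 1.0 / m))
    logs = [math.log2(qi / pi) for pi, qi in zip(p, q)]
    # Choose C so that sum_i p_i * k_i equals K.
    C = (K - sum(pi * li for pi, li in zip(p, logs))) / sum(p)
    return [C + li for li in logs]
```

Because the real-number problem is unrestricted, the returned values are generally not integers and can fall outside any practical range.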

Ranged Integer Problem
Practical constraint: object $i$'s hash number $k_i$ must be a positive integer, and it is often upper-bounded by $k_{max}$.
Rounding real-number solutions to integers may increase the false positive rate, and there is no known bound on how large the increase may be.
Overview of our approach: introduce an importance score for each object (intuitively, more important objects desire more hashes); the importance ranking helps produce fast approximation solutions.

Object Importance Score
Intuition: revisit the optimal real-number solution $k_i = C + \log_2(q'_i/p_i)$. Hint: $q'_i/p_i$ provides a ranking on object hash numbers in a "good" solution.
Results: for the ranged real-number problem, an optimal solution $k_1, k_2, \ldots, k_N$ must follow the importance ranking; $\lfloor k_1 \rfloor, \lfloor k_2 \rfloor, \ldots, \lfloor k_N \rfloor$ is a 2-approximation solution to the ranged integer problem, and it also follows the importance ranking.
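A small Python sketch of the quantities used on this slide: the importance ranking by $q'_i/p_i$, rounding down a real-valued solution, and the false positive probability of an integer assignment. All function names are hypothetical and `q` stands for $q'$. Two points are assumptions of the sketch rather than statements from the slides: the fill ratio $B$ is recomputed from the assignment itself, and values outside $[1, k_{max}]$ are clamped (the slide's result applies to a ranged real-number solution, which is already in range).

```python
import math


def false_positive_rate(k, p, q, n, m):
    """Expected false positive probability sum_i q_i * B**k_i."""
    total_hashes = n * sum(pi * ki for pi, ki in zip(p, k))
    B = 1.0 - (1.0 - 1.0 / m) ** total_hashes  # probability that a bit is 1
    return sum(qi * B ** ki for qi, ki in zip(q, k))


def importance_order(p, q):
    """Object indices sorted by importance score q_i / p_i, ascending."""
    return sorted(range(len(p)), key=lambda i: q[i] / p[i])


def floor_and_clamp(k_real, k_max):
    """Round down, then clamp into [1, k_max] (clamping is an assumption)."""
    return [min(k_max, max(1, math.floor(ki))) for ki in k_real]
```

Rounding down is monotone, so an assignment that follows the importance ranking before rounding still follows it afterwards.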

Polynomial-Time 2-Approximation
Our result indicates that at least one solution that follows the importance score ranking is provably a 2-approximation. ⇒ If we enumerate all importance-ranked solutions, the best one is a 2-approximation.
$O(N^{k_{max}})$ time 2-approximation: there are no more than $(N+1)^{k_{max}-1}$ importance-ranked solutions in total, and it takes $O(N)$ time to check the constraint and calculate the false positive rate for each solution.
Practically expensive: $N$ can be huge, and the constant $k_{max}$ may not be very small (e.g., 20).
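The enumeration can be sketched in Python for a small universe. Objects are sorted by importance score, and every non-decreasing sequence of hash numbers over $1..k_{max}$ is tried. This is only an illustration under stated assumptions: the function name is hypothetical, `q` stands for $q'$, and each candidate is scored with the fill ratio $B$ implied by the candidate itself rather than by checking a fixed constraint, which may differ from the authors' exact formulation.

```python
from itertools import combinations_with_replacement


def best_importance_ranked(p, q, n, m, k_max):
    """Brute-force search over all importance-ranked integer assignments."""
    N = len(p)
    order = sorted(range(N), key=lambda i: q[i] / p[i])  # least important first
    best_rate, best_k = float("inf"), None
    # Each non-decreasing tuple over 1..k_max is one importance-ranked solution.
    for ranked in combinations_with_replacement(range(1, k_max + 1), N):
        k = [0] * N
        for rank, i in enumerate(order):
            k[i] = ranked[rank]
        total_hashes = n * sum(pi * ki for pi, ki in zip(p, k))
        B = 1.0 - (1.0 - 1.0 / m) ** total_hashes  # implied fill ratio
        rate = sum(qi * B ** ki for qi, ki in zip(q, k))
        if rate < best_rate:
            best_rate, best_k = rate, k
    return best_rate, best_k
```

The number of candidates grows quickly with $N$ and $k_{max}$, which is why this approach is only usable for toy inputs.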

Faster Solutions
$(2+\varepsilon)$-approximation: the problem of identifying the best importance-ranked solution can be transformed into a knapsack problem; dynamic programming produces a $(2+\varepsilon)$-approximation solution in $O(N^2/\varepsilon)$ time.
Coarse-grained optimization: partition the large number of objects into a small number of groups (objects in each group have similar importance scores); optimize at the group granularity, then assign an equal hash number to all objects within one group ⇒ much smaller $N$.
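The coarse-grained step can be sketched in Python as follows. Objects are bucketed by the logarithm of their importance score, and each group carries the summed popularities of its members, so a group can be treated as a single object by any of the solvers. The equal-width bucketing rule and the function name are assumptions made for illustration; the slides only state that objects in a group have similar importance scores.

```python
import math


def group_by_importance(p, q, num_groups):
    """Partition objects into groups of similar importance score q_i / p_i.

    Groups are equal-width buckets of log2(score); this bucketing rule is
    an assumption. Returns a list of (member_indices, group_p, group_q).
    """
    scores = [math.log2(qi / pi) for pi, qi in zip(p, q)]
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_groups or 1.0  # avoid zero width if all equal
    buckets = [[] for _ in range(num_groups)]
    for i, s in enumerate(scores):
        b = min(num_groups - 1, int((s - lo) / width))
        buckets[b].append(i)
    return [
        (members, sum(p[i] for i in members), sum(q[i] for i in members))
        for members in buckets
        if members
    ]
```

Summing the popularities is exact when all members of a group share one hash number $k_g$: both $\sum p_i k_i$ and $\sum q'_i B^{k_i}$ then reduce to sums over groups.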

Evaluation on Synthetic Data
Non-member query popularity $q'_i$ follows a Zipf-like distribution; membership popularity $p_i$ follows a uniform distribution.
Our integer approximation solution significantly outperforms the real-rounding solution, particularly at high popularity skewness.
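For readers who want to reproduce a setup of this kind, the two input distributions can be generated with a few lines of Python. The parameterization (popularity proportional to $1/\text{rank}^{\alpha}$, with $\alpha$ controlling skewness) is a common way to define a Zipf-like distribution and is an assumption here; the slides do not give the exact parameters used.

```python
def zipf_like(N, alpha):
    """Popularities proportional to 1 / rank**alpha, normalized to sum to 1."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, N + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def uniform(N):
    """Equal popularity 1/N for each of the N objects."""
    return [1.0 / N] * N
```

A larger `alpha` gives a more skewed query popularity, and `alpha = 0` degenerates to the uniform case.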

Trace-driven Evaluation on Distributed Caching
Distributed caches exchange their content (the sets of cached web objects) to cooperate.
Evaluation driven by web access traces from IRCache.net.

Trace-driven Evaluation on Distributed Keyword Searching
Distributed search engines pass keyword indexes to support distributed joins. False positives are resolved by additional communication.
Evaluation driven by the web page listing at dmoz.com and keyword query traces from Ask.com.

Related Work
Compressed Bloom filters [Mitzenmacher 2002].
Bloom filters with additional functionalities: deletion [Fan et al. 2000]; frequency queries [Cohen and Matias 2003]; associating objects with values [Chazelle et al. 2004].
Alternative data structure [Pagh et al. 2005].
Weighted Bloom filters [Bruck et al. 2006]: optimal real-number solution with integer rounding; analytically, the rounding-induced error increase is unbounded; practically, the error increase can be substantial.

Conclusions
Popularity conscious Bloom filters: motivated by skewed, stable data popularity distributions; customize each object's hash number according to its popularity in sets and queries.
Unrestricted real-number problem: the solution is optimal when each object's hash number is linear in log(query-pop'/set-pop).
Ranged integer problem: query-pop'/set-pop serves as an object importance indicator; $O(N^{k_{max}})$ time 2-approximation; $O(N^2/\varepsilon)$ time $(2+\varepsilon)$-approximation.
Quantitative evaluations driven by real distributed application traces.