Optimizing Data Popularity Conscious Bloom Filters
Ming Zhong, Pin Lu, Kai Shen, Joel Seiferas
University of Rochester
PODC 2008
Kai Shen 12/5/2018
Problem Overview
Bloom filters: a compact set representation in which each object is hashed into several bits of the filter; membership queries may return false positives; useful in distributed applications that communicate sets.
Highly skewed data popularity distributions.
Data popularity conscious Bloom filters: use a large number of hashes for likely false positive candidates, that is, objects that are popular in queries and unpopular in sets.
Goal: customize the hash number for each object to minimize the false positive probability.
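To make the idea of a per-object hash number concrete, here is a minimal sketch of a Bloom filter whose `add` and `might_contain` take the hash count as an argument. The class name, the salted SHA-256 hashing scheme, and the one-byte-per-bit storage are assumptions chosen for readability; they are not taken from the slides.

```python
import hashlib


class VariableHashBloomFilter:
    """Bloom filter in which each object may use its own number of hashes."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, obj, num_hashes):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        # (Hypothetical hashing scheme; any family of hash functions works.)
        for j in range(num_hashes):
            digest = hashlib.sha256(f"{j}:{obj}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, obj, num_hashes):
        for pos in self._positions(obj, num_hashes):
            self.bits[pos] = 1

    def might_contain(self, obj, num_hashes):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(obj, num_hashes))
```

Insertion and query must use the same hash number for a given object, so both sides need to agree on the per-object assignment.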
Object Popularity Stability
Stable object popularity is important for learning the object popularity and for low adjustment overhead.
[Figure: illustration of stability across month-long trace segments]
Problem Formulation and Result
Problem formulation: in a universe of $N$ objects, an $n$-object set is represented by an $m$-bit filter; object $i$'s membership popularity is $p_i$ and its non-member query popularity is $q'_i$; find object hash numbers $k_1, k_2, \ldots, k_N$ that minimize the false positive probability $\sum_{1 \le i \le N} q'_i \cdot B^{k_i}$; $B$ is the probability for an arbitrary filter bit to be 1, therefore $\sum_{1 \le i \le N} p_i \cdot k_i = K = \frac{\ln(1-B)}{n \cdot \ln(1-1/m)}$.
Result (assuming the $k_i$'s are unrestricted real numbers): Lagrangian function: $\sum_{1 \le i \le N} q'_i \cdot B^{k_i} + \lambda \cdot \left(\sum_{1 \le i \le N} p_i \cdot k_i - K\right)$; the optimum is reached when the function's partial derivatives with respect to the $k_i$'s and $\lambda$ are all zero; we find $k_i = C + \log_{1/B}(q'_i/p_i)$, where $C$ is a constant; also $B = 0.5$.
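The closed-form result can be evaluated numerically. The sketch below computes the budget K for B = 0.5, picks the constant C so that the constraint holds with equality, and returns the real-valued hash numbers. The function names and the requirement that every p_i and q'_i be strictly positive are assumptions of this sketch.

```python
import math


def optimal_real_hash_numbers(p, q, n, m):
    """Unrestricted real-valued k_i = C + log2(q_i / p_i), with B = 0.5.

    p[i]: membership popularity of object i (must be > 0).
    q[i]: non-member query popularity of object i (must be > 0).
    n: number of objects in the set; m: filter size in bits.
    """
    B = 0.5
    # Hash budget K from the constraint sum_i p_i * k_i = K.
    K = math.log(1 - B) / (n * math.log(1 - 1 / m))
    log_ratio = [math.log2(qi / pi) for pi, qi in zip(p, q)]
    # Choose C so that the budget constraint holds with equality.
    C = (K - sum(pi * r for pi, r in zip(p, log_ratio))) / sum(p)
    return [C + r for r in log_ratio], K


def false_positive_prob(q, k, B=0.5):
    # Objective from the formulation: sum_i q_i * B ** k_i.
    return sum(qi * B ** ki for qi, ki in zip(q, k))
```

For comparison, giving every object the same hash number K corresponds to a popularity-oblivious filter; when p and q each sum to 1, the customized assignment should never yield a higher value of `false_positive_prob` than that uniform assignment.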
Ranged Integer Problem
Practical constraint: object $i$'s hash number $k_i$ must be a positive integer, and is often upper-bounded by $k_{max}$.
Rounding real-number solutions to integers: may increase the false positive rate; there is no understanding of how large the increase may be.
Overview of our approach: introduce an importance score for each object (intuitively, more important objects desire more hashes); the importance ranking helps produce fast approximation solutions.
Object Importance Score
Intuition: revisit the optimal real-number solution $k_i = C + \log_2(q'_i/p_i)$. Hint: $q'_i/p_i$ provides a ranking on object hash numbers in a "good" solution.
Results: for the ranged real-number problem, an optimal solution $k_1, k_2, \ldots, k_N$ must follow the importance ranking; $\lfloor k_1 \rfloor, \lfloor k_2 \rfloor, \ldots, \lfloor k_N \rfloor$ is a 2-approximation solution to the ranged integer problem; it also follows the importance ranking.
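The flooring step can be checked on a small example. The sketch below assumes B = 0.5 and uses hypothetical values of `k_real` that stand in for a ranged real-number solution; it is meant only to show why flooring stays within the hash budget and at most doubles the false positive probability (each term grows by a factor of less than 2 because k - floor(k) < 1).

```python
import math


def false_positive_prob(q, k):
    # Assumes the filter bit fill probability B = 0.5.
    return sum(qi * 0.5 ** ki for qi, ki in zip(q, k))


def hash_budget(p, k):
    # Left-hand side of the constraint: sum_i p_i * k_i.
    return sum(pi * ki for pi, ki in zip(p, k))


def follows_importance_ranking(p, q, k):
    # True if an object with a higher score q'_i / p_i never gets fewer hashes.
    # Ties in score are ordered by hash number so they cannot fail the check.
    order = sorted(range(len(p)), key=lambda i: (q[i] / p[i], k[i]), reverse=True)
    return all(k[a] >= k[b] for a, b in zip(order, order[1:]))


def floor_solution(k_real):
    return [math.floor(ki) for ki in k_real]


# Hypothetical inputs, chosen only for illustration.
p = [0.25, 0.25, 0.25, 0.25]
q = [0.70, 0.20, 0.05, 0.05]
k_real = [8.6, 6.8, 4.8, 4.8]
k_int = floor_solution(k_real)
```

Comparing `false_positive_prob(q, k_int)` with `2 * false_positive_prob(q, k_real)` and `hash_budget(p, k_int)` with `hash_budget(p, k_real)` shows both bounds on this example.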
Polynomial-Time 2-Approximation
Our result indicates that at least one solution that follows the importance score ranking is provably a 2-approximation. ⇒ If we enumerate all importance-ranked solutions, the best is a 2-approximation.
$O(N^{k_{max}})$-time 2-approximation: there are no more than $(N+1)^{k_{max}-1}$ importance-ranked solutions in total; it takes $O(N)$ time to check the constraint and calculate the false positive rate for each solution.
Practically expensive: $N$ can be huge; the constant $k_{max}$ may not be very small (e.g., 20).
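The enumeration described above can be sketched for tiny inputs as follows. This sketch assumes a particular reading of the ranged integer problem that the slides do not spell out: B is fixed at 0.5, the budget constraint is the inequality sum_i p_i * k_i <= K, and each k_i is an integer in [1, k_max]. The function name is hypothetical.

```python
import itertools


def best_ranked_solution(p, q, K, k_max):
    """Exhaustively search importance-ranked integer hash assignments.

    Assumed formulation: minimize sum_i q[i] * 0.5 ** k[i]
    subject to sum_i p[i] * k[i] <= K and 1 <= k[i] <= k_max.
    Returns (k, false_positive_prob), or (None, inf) if nothing is feasible.
    """
    # Objects from most to least important (score q'_i / p_i).
    order = sorted(range(len(p)), key=lambda i: q[i] / p[i], reverse=True)
    best_fp, best_k = float("inf"), None
    # Each tuple is non-increasing, so a more important object never
    # receives fewer hashes than a less important one.
    values = range(k_max, 0, -1)
    for ranked in itertools.combinations_with_replacement(values, len(p)):
        budget = sum(p[i] * ki for i, ki in zip(order, ranked))
        if budget > K:
            continue  # violates the hash budget
        fp = sum(q[i] * 0.5 ** ki for i, ki in zip(order, ranked))
        if fp < best_fp:
            best_fp = fp
            best_k = [0] * len(p)
            for i, ki in zip(order, ranked):
                best_k[i] = ki
    return best_k, best_fp
```

The loop visits every non-increasing sequence once, which is why this approach is only practical when both N and k_max are small.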
Faster Solutions
(2+ε)-approximation: the problem of identifying the best importance-ranked solution can be transformed into a knapsack problem; dynamic programming produces a (2+ε)-approximation solution in $O(N^2/\varepsilon)$ time.
Coarse-grained optimization: partition the large number of objects into a small number of groups (objects in each group have similar importance scores); optimize at the group granularity, then assign an equal hash number to all objects within one group ⇒ much smaller $N$.
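The coarse-grained step can be sketched as a grouping pass followed by an expansion pass. Because every object in a group receives the same hash number, a group behaves like a single object whose p and q' are the sums over its members. The equal-size bucketing used here is an assumption of this sketch (the slides only say that grouped objects have similar importance scores), and the function names are hypothetical.

```python
def group_by_importance(p, q, num_groups):
    """Merge objects with adjacent importance scores into groups.

    Returns (group_p, group_q, members): aggregated membership popularity,
    aggregated non-member query popularity, and member indices per group.
    Equal-size buckets are an assumption of this sketch.
    """
    order = sorted(range(len(p)), key=lambda i: q[i] / p[i], reverse=True)
    size = -(-len(order) // num_groups)  # ceiling division
    members = [order[s:s + size] for s in range(0, len(order), size)]
    group_p = [sum(p[i] for i in g) for g in members]
    group_q = [sum(q[i] for i in g) for g in members]
    return group_p, group_q, members


def expand(group_k, members, num_objects):
    """Give every object the hash number chosen for its group."""
    k = [0] * num_objects
    for kg, g in zip(group_k, members):
        for i in g:
            k[i] = kg
    return k
```

Any solver for the object-level problem can then be run on `group_p` and `group_q`, and its result mapped back with `expand`.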
Evaluation on Synthetic Data
Non-member query popularity $q'_i$ follows a Zipf-like distribution.
Membership popularity $p_i$ follows a uniform distribution.
Our integer approximation solution significantly outperforms the real-rounding solution, particularly at high popularity skewness.
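A synthetic workload of this shape can be generated as sketched below. The slides do not give the exact parameterization, so the form "popularity proportional to 1 / rank^alpha" and the parameter name `alpha` are assumptions; larger `alpha` means higher skewness.

```python
def zipf_like(num_objects, alpha):
    """Popularity proportional to 1 / rank ** alpha, normalized to sum to 1."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_objects + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def uniform(num_objects):
    """Every object equally popular."""
    return [1.0 / num_objects] * num_objects


# Hypothetical sizes and skew, chosen only for illustration.
q_prime = zipf_like(1000, alpha=1.0)  # non-member query popularity
p = uniform(1000)                     # membership popularity
```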
Trace-driven Evaluation on Distributed Caching
Distributed caches exchange their content (the set of cached web objects) to cooperate.
Evaluation driven by web access traces from IRCache.net.
Trace-driven Evaluation on Distributed Keyword Searching
Distributed search engines pass keyword indexes to support distributed joins. False positives are resolved by additional communication.
Evaluation driven by the web page listing at dmoz.com and keyword query traces at Ask.com.
Related Work
Compressed Bloom filters [Mitzenmacher 2002].
Bloom filters with additional functionalities: deletion [Fan et al. 2000]; frequency queries [Cohen and Matias 2003]; associating objects with values [Chazelle et al. 2004].
Alternative data structure [Pagh et al. 2005].
Weighted Bloom filters [Bruck et al. 2006]: optimal real-number solution with integer rounding; analytically, the rounding-induced error increase is unbounded; practically, the error increase can be substantial.
Conclusions
Popularity conscious Bloom filters: motivated by skewed, stable data popularity distributions; customize each object's hash number according to its popularity in sets and queries.
Unrestricted real-number problem: the solution is optimal when each object's hash number is linear in log(query-pop'/set-pop).
Ranged integer problem: query-pop'/set-pop serves as an object importance indicator; $O(N^{k_{max}})$-time 2-approximation; $O(N^2/\varepsilon)$-time (2+ε)-approximation.
Quantitative evaluations driven by real distributed application traces.