Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream. Michael Mitzenmacher and Salil Vadhan.


How Collaborations Arise… At a talk I was giving on Bloom filters... –Salil: Your analysis assumes perfectly random hash functions. What do you use in your experiments? –Michael: In practice, it works even with standard hash functions. –Salil: Can you prove it? –Michael: Um…

Question Why do simple hash functions work? –Simple = chosen from a pairwise (or k-wise) independent (or universal) family. Our results are actually more general. –Work = perform just like random hash functions in most real-world experiments. Motivation: Close the divide between theory and practice.

Universal Hash Families Defined by Carter and Wegman. A family of hash functions of the form H: [N] → [M] is k-wise independent if, when H is chosen at random from the family, for any distinct x_1, x_2, …, x_k and any a_1, a_2, …, a_k: Pr[H(x_1) = a_1, …, H(x_k) = a_k] = 1/M^k. The family is k-wise universal if Pr[H(x_1) = H(x_2) = … = H(x_k)] ≤ 1/M^(k-1).
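For concreteness, a minimal sketch (assumptions: Python; the classic Carter-Wegman construction h_{a,b}(x) = ((ax + b) mod p) mod M, which is 2-universal) of drawing a hash function from such a family:

    import random

    def make_pairwise_hash(p, M):
        # Carter-Wegman family: h_{a,b}(x) = ((a*x + b) mod p) mod M, p prime.
        # Choosing a != 0 and b uniformly gives a 2-universal family: for
        # distinct x1, x2 < p, Pr[h(x1) = h(x2)] <= 1/M.
        a = random.randrange(1, p)
        b = random.randrange(p)
        return lambda x: ((a * x + b) % p) % M

    h = make_pairwise_hash(2**31 - 1, 1024)   # hash 31-bit keys into 1024 buckets
    print(h(12345), h(67890))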

Applications Potentially, wherever hashing is used –Bloom Filters –Power of Two Choices –Linear Probing –Cuckoo Hashing –Many Others…

Review: Bloom Filters Given a set S = {x_1, x_2, x_3, …, x_n} from a universe U, we want to answer queries of the form: "Is y in S?" A Bloom filter provides an answer in: –"Constant" time (the time to hash). –A small amount of space. –But with some probability of being wrong.

Bloom Filters Start with an m-bit array B, filled with 0s. Hash each item x_j in S k times; if H_i(x_j) = a, set B[a] = 1. To check if y is in S, check B at H_1(y), …, H_k(y): all k values must be 1. Possible to have a false positive: all k values are 1, but y is not in S. (Parameters: n items, m = cn bits, k hash functions.)
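As a concrete illustration, a minimal Bloom filter sketch (assumptions: Python; seeded built-in hash() stands in for the k hash functions H_i; class and method names are hypothetical):

    import random

    class BloomFilter:
        def __init__(self, m, k):
            self.bits = [0] * m        # m-bit array, initially all 0s
            self.seeds = [random.random() for _ in range(k)]

        def _positions(self, x):
            # k (pseudo-)independent positions H_1(x), ..., H_k(x)
            return [hash((s, x)) % len(self.bits) for s in self.seeds]

        def add(self, x):
            for a in self._positions(x):
                self.bits[a] = 1       # if H_i(x) = a, set B[a] = 1

        def __contains__(self, y):
            # all k bits must be 1; false positives possible, false negatives not
            return all(self.bits[a] for a in self._positions(y))

    bf = BloomFilter(m=8 * 10**4, k=5)
    bf.add("flow-1")
    print("flow-1" in bf, "flow-2" in bf)   # True, (almost surely) False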

Power of Two Choices Hashing n items into n buckets: –What is the maximum number of items (the load) in any bucket? –Assume buckets are chosen uniformly at random. Well-known result: Θ(log n / log log n) maximum load w.h.p. Now suppose each ball picks two bins independently and uniformly and goes to the bin with less load: –Maximum load drops to log log n / log 2 + Θ(1) w.h.p. –With d ≥ 2 choices, the max load is log log n / log d + Θ(1) w.h.p.
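The contrast is easy to see in simulation; a sketch (assuming Python and truly random bin choices rather than real hash functions) comparing one choice against d choices:

    import random

    def max_load(n, d):
        # Throw n balls into n bins; each ball inspects d uniform bins
        # and goes to the least loaded one (d = 1 is plain hashing).
        load = [0] * n
        for _ in range(n):
            choices = [random.randrange(n) for _ in range(d)]
            best = min(choices, key=lambda b: load[b])
            load[best] += 1
        return max(load)

    # max_load(10**5, 1) is typically ~8; max_load(10**5, 2) is typically 3-4.
    print(max_load(10**5, 1), max_load(10**5, 2))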

Linear Probing Hash elements into an array. If location h(x) is already full, try h(x)+1, h(x)+2, … until an empty slot is found, and place x there. Performance metric: expected lookup time.
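A minimal sketch of the scheme (assuming Python, and a table kept well below full so insertion terminates):

    class LinearProbingTable:
        def __init__(self, m):
            self.slots = [None] * m

        def _h(self, x):
            return hash(x) % len(self.slots)   # stand-in hash function

        def insert(self, x):
            i = self._h(x)
            while self.slots[i] is not None:   # probe h(x), h(x)+1, h(x)+2, ... (mod m)
                i = (i + 1) % len(self.slots)
            self.slots[i] = x

        def lookup(self, x):
            i = self._h(x)
            while self.slots[i] is not None:   # scan until x or an empty slot
                if self.slots[i] == x:
                    return True
                i = (i + 1) % len(self.slots)
            return False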

Not Really a New Question "The Power of Two Choices" = "Balanced Allocations." It was noted as early as the 1980s that pairwise independent hash functions match the theory for random hash functions on real data, e.g., for Bloom filters. But the analysis depends on perfectly random hash functions –or on sophisticated, highly non-trivial hash function constructions.

Worst Case: Simple Hash Functions Don't Work! Lower bounds show the result cannot hold for "worst case" input. There exist pairwise independent hash families and inputs for which linear probing performance is provably worse than random [PPR 07]. There exist k-wise independent hash families and inputs for which Bloom filter performance is provably worse than random. Open for other problems. But the worst case does not match practice.

Example: Bloom Filter Analysis Standard Bloom filter argument: –Pr(specific bit of filter is 0) is (1 - 1/m)^(kn) ≈ e^(-kn/m). –If ρ is the fraction of 0 bits in the filter, then the false positive probability is (1 - ρ)^k ≈ (1 - e^(-kn/m))^k. The analysis depends on a random hash function.
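Plugging in sample numbers (assumed parameters n = 10^4 items, c = 8 bits per item, k = 5 hashes) shows the formula in action:

    import math

    n, c, k = 10**4, 8, 5
    m = c * n
    p0 = (1 - 1/m) ** (k * n)             # Pr(specific bit stays 0)
    fp = (1 - p0) ** k                    # false positive probability
    print(p0, math.exp(-k * n / m), fp)   # ~0.535, ~0.535, ~0.022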

Pairwise Independent Analysis Natural approach: use union bounds. –Pr(specific bit of filter is 0) is at least 1 - kn/m. –The false positive probability is bounded above by (kn/m)^k. –Implication: need more space for the same false positive probability. –Have lower bounds showing this is tight, and it generalizes to higher k-wise independence.
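A back-of-the-envelope comparison (assumed target false positive rate of 1%; formulas as above) of the space the union bound demands versus the fully random analysis:

    import math

    n, k, eps = 10**4, 5, 0.01

    # union-bound guarantee: (kn/m)^k <= eps  =>  m >= kn / eps**(1/k)
    m_pairwise = k * n / eps ** (1 / k)

    # fully random analysis: fp ~ (1 - exp(-kn/m))^k; find the smallest m
    m_random = next(m for m in range(n, 10**7, n)
                    if (1 - math.exp(-k * n / m)) ** k <= eps)

    print(round(m_pairwise), m_random)   # ~125600 vs 100000 bits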

Random Data? Analysis is usually trivial if the data is independently and uniformly chosen over a large universe. –Then all hashes appear "perfectly random." But this is not a good model for real data. We need an intermediate model between worst case and average case.

A Model for Data Based on models of semi-random sources [SV 84], [CG 85]. Data is a finite stream, modeled by a sequence of random variables X_1, X_2, …, X_T, each with range [N]. Each stream element has some entropy, conditioned on the values of previous elements: –Correlations are possible. –But each element has some unpredictability, even given the past.
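As a toy illustration of the model (parameters hypothetical, not from the paper): a stream whose elements are heavily correlated yet each carries about one bit of conditional min-entropy:

    import random

    def block_source(T, N):
        # Toy block source: X_i repeats X_{i-1} with probability 1/2,
        # otherwise it is fresh and uniform on [N]. Conditioned on the
        # past, no value has probability much above 1/2, so each block
        # retains ~1 bit of min-entropy despite the heavy correlation.
        xs = [random.randrange(N)]
        for _ in range(T - 1):
            xs.append(xs[-1] if random.random() < 0.5 else random.randrange(N))
        return xs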

Intuition If each element has entropy, then extract that entropy to hash each element to a near-uniform location. Hash functions acting as extractors should provide near-uniform behavior.

Notions of Entropy Max probability: mp(X) = max_x Pr[X = x]. –Min-entropy: H_∞(X) = log(1/mp(X)). –Block source with max probability p per block: conditioned on any values of the previous blocks, each block has max probability at most p. Collision probability: cp(X) = Σ_x Pr[X = x]^2. –Renyi entropy: H_2(X) = log(1/cp(X)). –Block source with collision probability p per block: defined analogously. These "entropies" are within a factor of 2 of each other: H_∞(X) ≤ H_2(X) ≤ 2·H_∞(X). We use collision probability/Renyi entropy.
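These quantities are straightforward to compute for a concrete distribution; a small sketch (assuming a Python dict of probabilities):

    import math

    def entropies(dist):
        # dist maps value -> probability
        mp = max(dist.values())                  # max probability
        cp = sum(p * p for p in dist.values())   # collision probability
        h_min = -math.log2(mp)                   # min-entropy
        h_2 = -math.log2(cp)                     # Renyi entropy
        return h_min, h_2                        # h_min <= h_2 <= 2 * h_min

    print(entropies({'a': 0.5, 'b': 0.25, 'c': 0.25}))   # (1.0, ~1.42)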

Leftover Hash Lemma A "classical" result (from 1989). Intuitive statement: if H is chosen from a pairwise independent hash family, and X is a random variable with small collision probability, then H(X) will be close to uniform.

Leftover Hash Lemma Specific statements for the current setting, for 2-universal hash families. Let H be a random hash function from a 2-universal hash family mapping to [M]. If cp(X) < 1/K, then (H, H(X)) is (1/2)√(M/K)-close to (H, U_[M]). –Equivalently, if X has Renyi entropy at least log M + 2 log(1/ε), then (H, H(X)) is ε-close to uniform. Given a block source with collision probability 1/K per block, (H, H(X_1), …, H(X_T)) is (T/2)√(M/K)-close to (H, U_[M]^T). –Equivalently, if each block has Renyi entropy at least log M + 2 log(T/ε), then (H, H(X_1), …, H(X_T)) is ε-close to uniform.

Proof of Leftover Hash Lemma Step 1: cp( (H,H(X)) ) is small. Step 2: Small cp implies close to uniform.
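On a toy scale the lemma can be verified exactly; a sketch (assuming the Carter-Wegman family from earlier with p = 101, a flat source with K = 32, and M = 8, so the LHL bound is (1/2)√(8/32) = 1/4):

    from itertools import product

    p, M = 101, 8
    X = {x: 1 / 32 for x in range(32)}   # flat source: cp(X) = 1/32, so K = 32

    # statistical distance of (H, H(X)) from (H, uniform over [M]),
    # i.e. the distance averaged over the random choice of (a, b)
    dist = 0.0
    for a, b in product(range(1, p), range(p)):
        out = [0.0] * M
        for x, px in X.items():
            out[((a * x + b) % p) % M] += px
        dist += 0.5 * sum(abs(q - 1 / M) for q in out)
    dist /= (p - 1) * p

    print(dist, 0.5 * (M / 32) ** 0.5)   # observed distance vs the LHL bound of 0.25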

Close to Reasonable in Practice Network flows classified by 5-tuples: –N = 2^104 (32 + 32 + 16 + 16 + 8 bits). Power of two choices: each flow gets 2 hash bucket values and is placed in the least loaded bucket. Number of buckets ≈ number of items: –T = 2^16 flows, two choices among ~2^16 buckets, so the combined hash output space is M = 2^32. –For K = 2^80, the hashed stream is close to uniform. How much entropy does a stream of flow-tuples actually have? Similar results hold for Bloom filters with 2 hashes [KM 05] and for linear probing.

Theoretical Questions How little entropy do we need? Tradeoff between entropy and complexity of hash functions?

Improved Analysis [MV] The Leftover Hash Lemma style analysis can be refined for this setting. Idea: think of the result as itself being a block source. Let H be a random hash function from a 2-universal hash family. Given a block source with collision probability 1/K per block, (H(X_1), …, H(X_T)) is ε-close to a block source with collision probability 1/M + T/(εK) per block.

4-Wise Independence Further improvements come from using 4-wise independent families. Let H be a random hash function from a 4-wise independent hash family. Given a block source with collision probability 1/K per block, (H(X_1), …, H(X_T)) is ε-close to a block source with collision probability 1/M + (1 + √(2T/(εM)))/K per block. –The collision probability per block is much more tightly concentrated around 1/M. 4-wise independent hashing is practical [TZ 04].
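To see what the two bounds demand numerically, a small calculator sketch (ε = 0.01 is an assumed target; T and M follow the power-of-two-choices example above; the thresholds are heuristic readings of the bounds, not statements from the paper):

    import math

    T, M, eps = 2**16, 2**32, 0.01

    # 2-universal: LHL for block sources needs per-block Renyi entropy
    # of at least log M + 2 log(T / eps) bits.
    bits_pairwise = math.log2(M) + 2 * math.log2(T / eps)

    # 4-wise independent: the extra term (1 + sqrt(2T/(eps*M)))/K stays
    # below ~1/M once K >= M * (1 + sqrt(2T/(eps*M))).
    bits_fourwise = math.log2(M * (1 + math.sqrt(2 * T / (eps * M))))

    print(bits_pairwise, bits_fourwise)   # ~77 bits vs ~32 bits per block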

Proof Technique Given a bound on cp(X), derive a bound on cp(H(X)) that holds with high probability over the random choice of H, using Markov's/Chebyshev's inequalities. A union bound/induction argument extends this to block sources. Tighter analyses?

Generality The proofs utilize universal families. Is this necessary? –It does not appear so. Key point: bound cp(H(X)). Can this be done for practical hash functions? –One must think of the hash function as randomly chosen from a certain family.

Reasonable in Practice Power of 2 choices, with T = 2^16 and M = 2^32 as above: –Still need K > 2^64 for pairwise independent hash functions, but K < 2^64 suffices with 4-wise independence.

Further Improvements Chung and Vadhan [CV 08] improved the analysis to obtain tight bounds on the entropy needed, shaving an additive log T off the previous results. The improvement comes from a sharper analysis of conditional probabilities, using Hellinger distance instead of statistical distance.

Open Problems Tightening connection to practice. –How to estimate relevant entropy of data streams? –Performance/theory of real-world hash functions? –Generalize model/analyses to additional realistic settings? Block source data model. –Other uses, implications?

[PPR 07] = Pagh, Pagh, Ruzic
[TZ 04] = Thorup, Zhang
[SV 84] = Santha, Vazirani
[CG 85] = Chor, Goldreich
[BBR 88] = Bennett, Brassard, Robert
[ILL 89] = Impagliazzo, Levin, Luby
[KM 05] = Kirsch, Mitzenmacher
[CV 08] = Chung, Vadhan