Hash Tables, Universal Families of Hash Functions, and Bloom Filters (Wednesday, July 23rd)

Outline For Today
1. Hash Tables and Universal Hashing
2. Bloom Filters

Hash Tables
- Randomized data structure
- Implements the "Dictionary" abstract data type (ADT): Insert, Delete, Lookup (we'll assume no duplicates)
- Applications:
  - Symbol tables/compilers: which variables are already declared?
  - ISPs: is an IP address spam/blacklisted?
  - Many others

Setup
- Universe U of all possible elements
  - e.g., all 2^32 possible IP addresses
  - e.g., all possible variable names that can be declared
- Maintain a possibly evolving subset S ⊆ U
  - |S| = m and |U| >> m

Naïve Dictionary Implementations (1)
- Option 1: Bit Vectors
- An array A keeping one bit (0/1) for each element of U
  - Insert element i => A[i] = 1
  - Delete element i => A[i] = 0
  - Lookup element i => return A[i] == 1
- Time complexity of every operation: O(1)
- Space: O(|U|) (e.g., 2^32 bits for IP addresses)
Quick but not scalable!
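The bit-vector option can be sketched in Python (the class and method names are illustrative, not from the slides):

```python
class BitVectorDict:
    """Option 1: one bit per element of the universe U."""

    def __init__(self, universe_size):
        self.bits = [0] * universe_size  # O(|U|) space, regardless of |S|

    def insert(self, i):
        self.bits[i] = 1                 # O(1)

    def delete(self, i):
        self.bits[i] = 0                 # O(1)

    def lookup(self, i):
        return self.bits[i] == 1         # O(1)
```

All three operations are constant time, but the space is proportional to the whole universe, which is the scalability problem the slide points out.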

Naïve Dictionary Implementations (2)
- Option 2: Linked List
- One entry for each element of S
  - Insert element i => check whether i exists; if not, append it to the list
  - Delete element i => find i in the list and remove it
  - Lookup element i => scan through the list
- Time complexity of every operation: O(|S|)
- Space: O(|S|)
Scalable but not quick!
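The list option, again as a minimal illustrative sketch (a Python list stands in for a linked list; the asymptotics are the same):

```python
class ListDict:
    """Option 2: store only the elements of S, scanning on every operation."""

    def __init__(self):
        self.items = []               # O(|S|) space

    def insert(self, x):
        if x not in self.items:       # O(|S|) scan to avoid duplicates
            self.items.append(x)

    def delete(self, x):
        if x in self.items:           # O(|S|) scan
            self.items.remove(x)

    def lookup(self, x):
        return x in self.items        # O(|S|) scan
```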

Hash Tables: Best of Both Worlds
A randomized dictionary that is:
- Quick: O(1) expected time for each operation
- Scalable: O(|S|) space

Hash Tables: High-level Idea
(Throughout: U is the universe, m = |S|, n = # buckets.)
- Buckets: distinct locations in the hash table; let n be the # of buckets
- n ≈ m (recall m = |S|), i.e., load factor m/n = O(1)
- Hash function h: U -> {0, 1, …, n-1}
- We store each element x in bucket h(x)

Hash Tables: High-level Idea
(Figure: the hash function h maps elements of the universe U into buckets 0, 1, …, n-2, n-1.)

Collisions
- A collision: multiple elements are hashed to the same bucket. Suppose we are about to insert a new element x and bucket h(x) is already occupied.
- Resolving collisions:
  - Chaining: keep a linked list per bucket; append x to the list at h(x)
  - Open addressing: if h(x) is occupied, deterministically probe for another empty bucket (saves space)

Chaining
(Figures: a chaining example built up step by step — e3 hashes to bucket 1; e7 to bucket n-2; e5 to bucket n-1; then e1 and e4 also hash to bucket 1 and are appended to its linked list.)

Operations (With Chaining)
- Insert(x): go to bucket h(x); if x is not in the list, append it.
- Delete(x): go to bucket h(x); if x is in the list, delete it.
- Lookup(x): go to bucket h(x); return true if x is in the list.
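The three operations can be sketched with chaining in Python; here `hash(x) % n` stands in for the hash function h (illustrative only, not a universal family):

```python
class ChainedHashTable:
    def __init__(self, n):
        self.n = n
        self.buckets = [[] for _ in range(n)]  # one list per bucket

    def _h(self, x):
        return hash(x) % self.n   # stand-in for h: U -> {0, ..., n-1}

    def insert(self, x):
        b = self.buckets[self._h(x)]
        if x not in b:            # avoid duplicates, as the slides assume
            b.append(x)

    def delete(self, x):
        b = self.buckets[self._h(x)]
        if x in b:
            b.remove(x)

    def lookup(self, x):
        return x in self.buckets[self._h(x)]
```

Each operation first evaluates h, then scans only the list at bucket h(x), which is exactly why the running time depends on the list lengths discussed next.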

Running Time of Operations
- Assume evaluating the hash function takes constant time (may not be true for all hash functions)
- Consider an element x:
  - Lookup: O(length of the linked list at bucket h(x))
  - Insert: O(length of the linked list at bucket h(x))
  - Delete: O(length of the linked list at bucket h(x))

Worst & Best Scenarios
- Let m be the # of elements in the hash table
- Worst case: O(m) per operation (all elements hash to one bucket)
- Best case: O(1)
- The linked list lengths depend on the quality of the hash function!
Fundamental question: how can we choose "good" hash functions?

Bad Hash Functions
- Recall the IP address example: 32-bit keys, # buckets n = 2^8
- Idea: use the most significant 8 bits
- Problem: big correlations with the geography of how IP addresses are assigned; e.g., 171 or 172 as the first 8 bits is common
- Lots of addresses would get mapped to the same bucket
In practice, be very careful when picking hash functions!

Is There A Single Good Hash Function?
- Idea: design a clever hash function h that spreads every data set evenly across the buckets.
- Problem: such a function cannot exist!
- Recall |U| >> m ≈ n. By the pigeonhole principle, there is some bucket i such that at least |U|/n elements of U hash to it. If S is drawn entirely from those elements, every operation takes O(m)!

No Single Good Hash Function!
Claim: for every single hash function h, there is a pathological data set.
Proof: by the pigeonhole principle (choose S from a bucket that receives at least |U|/n elements of U).

Solution: Pick a Hash Function Randomly
Design a set, or "family," H of hash functions such that for every data set S, if we pick h ∈ H at random, then almost always h spreads S out evenly across the buckets.
Question: why couldn't we just put randomness inside a single hash function? (Each h must be deterministic: Lookup(x) has to probe the same bucket that Insert(x) used.)

Clarification on the Proposed Analysis
Pick h at random from H, run the hash table on input S, and measure the performance.
We will analyze the expected performance on an arbitrary but fixed input S.

Clarification on the Proposed Analysis
The expectation is over repeated trials: pick h_1 at random from H and observe performance 1; pick h_2 and observe performance 2; …; pick h_t and observe performance t.

Roadmap
1. Define what it means for H to be "universal."
2. Show: if H is universal and we pick h ∈ H at random, then our hash table has O(1) expected cost per operation.
3. Show that simple and practical universal families H exist.

1. Universal Family of Hash Functions
Let H be a set of functions from U to {0, 1, …, n-1}.
Definition: H is universal if, for all x, y ∈ U with x ≠ y, when h is chosen uniformly at random from H:
Pr(h(x) = h(y)) ≤ 1/n
I.e., the fraction of hash functions in H that make x and y collide is at most 1/n.
Why 1/n? It is "as if we were mapping x and y to buckets independently and uniformly at random."

2. Universality => Operations Are O(1)
Let H be a universal family of hash functions from U to {0, 1, …, n-1}, and recall m = O(n).
Claim: if h is picked at random from H, then for any data set S, hash table operations run in O(1) expected time.

2. Universality => Operations Are O(1)
Proof: pick h at random from H and build the hash table on S. A new element x arrives, and say we want to perform Lookup(x). Its cost is O(# elements in bucket h(x)). This quantity is a random variable; call it Z.

2. Universality => Operations Are O(1)
Proof continued: Z = # elements in bucket h(x). For each element y ∈ S, let X_y = 1 if h(y) = h(x) and 0 otherwise. Then
E[Z] ≤ 1 + Σ_{y ∈ S} E[X_y] = 1 + Σ_{y ∈ S} Pr(h(y) = h(x)) ≤ 1 + m · (1/n) = O(1)
(the leading 1 covers the case that x itself is already in the table; each of the m remaining terms is at most 1/n by universality, and m = O(n)). Q.E.D.

3. Universal Families of Hash Functions Exist (1)
Let n = 2^b, |U| = 2^t, and t > b. Represent each x as a t-bit binary vector.
h(x) = Mx, where M is a random 0/1 b x t matrix and the multiplication is done mod 2; the resulting b-bit vector is the bucket index.
Example: |U| = 2^7 = 128 and the hash table has size 2^4 = 16; element 52 might map to, say, bucket 12, depending on M.

3. Universal Families of Hash Functions Exist (2)
h(x) = Mx maps 2^t -> 2^b, i.e., U -> {0, 1, …, n-1}.
H = the family of all possible 0/1 b x t matrices M; picking h at random means picking M at random.
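The construction can be sketched in Python; x is an integer whose t bits form the column vector, and picking M uniformly at random corresponds to picking h uniformly from H (function names are illustrative):

```python
import random

def random_matrix(b, t):
    # Each cell is an independent fair coin flip, so M is uniform over H.
    return [[random.randint(0, 1) for _ in range(t)] for _ in range(b)]

def mat_hash(M, x, t):
    # Interpret x's t bits (most significant first) as the vector to multiply.
    z = [(x >> (t - 1 - i)) & 1 for i in range(t)]
    bucket = 0
    for row in M:
        bit = sum(r * zi for r, zi in zip(row, z)) % 2  # mod-2 dot product
        bucket = (bucket << 1) | bit                    # collect the b output bits
    return bucket  # a bucket index in {0, ..., 2^b - 1}
```

For instance, with b = 2, t = 3, and M = [[1,0,1],[0,1,1]], the element x = 5 (bits 101) hashes to bucket 1.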

Proof that H is Universal (1)
We need to prove that for all x ≠ y, Pr(h(x) = h(y)) ≤ 1/n = 1/2^b when M is picked uniformly at random from H, i.e., when each cell of M is an independent fair coin flip.

Proof that H is Universal (2)
x and y differ in at least one bit; say, w.l.o.g., the last bit. Let z = x - y (mod 2), so z ≠ 0 and its last bit is z_t = 1. Then h(x) = h(y) iff Mz = 0.
Q: What is Pr(Mz = 0)?

Proof that H is Universal (3)
Pr(Mz = 0) = Pr(Mz[1] = 0 and Mz[2] = 0 and … and Mz[b] = 0)
The event Mz[i] = 0 is independent of the event Mz[j] = 0, since the coin flips in row i of M are independent of the coin flips in row j. Hence:
Pr(Mz = 0) = Pr(Mz[1] = 0) · Pr(Mz[2] = 0) · … · Pr(Mz[b] = 0)
Q: What is Pr(Mz[i] = 0)?

Proof that H is Universal (4)
Consider row i: Mz[i] = m_i1·z_1 + m_i2·z_2 + … + m_it·z_t (mod 2), and z_t = 1, so the last term is just m_it.
Let y be the (mod 2) sum of the first t-1 terms. Then Mz[i] = 0 iff m_it cancels y, i.e., iff the coin flip m_it takes one specific value.

Proof that H is Universal (5)
- Irrespective of the first t-1 coin flips in row i, the last coin flip m_it takes the required value with probability exactly 1/2, so Pr(Mz[i] = 0) = 1/2.
- Therefore Pr(Mx = My) = Pr(Mz = 0) = (1/2)^b = 1/n. Q.E.D.

Storing and Evaluating the Hash Function h (i.e., M)
Q: How much space do we need to store the random matrix M?
A: bt bits = O(log(n) · log|U|)
Q: How much time does it take to evaluate Mz?
A: Naïvely, the matrix-vector product takes O(bt) = O(log(n) · log|U|) bit operations.
Summary: H is a relatively fast and practical universal family of hash functions.

Another Possible Family
We're hashing from U to {0, 1, …, n-1}. Let H be the set of all such functions.
Question: Is H universal?

Another Possible Family
We're hashing from U to {0, 1, …, n-1}.
Q1: How many such functions are there? A1: n^|U|
Q2: How many functions have h(x) = h(y) = j for a fixed bucket j? A2: n^(|U|-2)
Q3: How many functions have h(x) = h(y)? A3: n · n^(|U|-2) = n^(|U|-1)
Q4: Pr(h(x) = h(y))? A4: n^(|U|-1) / n^|U| = 1/n => H is universal!
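The counting argument can be checked exhaustively for a tiny universe (here |U| = 3 elements and n = 2 buckets, both chosen for illustration); each function is represented as a tuple of bucket choices:

```python
from itertools import product

U_size, n = 3, 2
# Every function h: U -> {0, ..., n-1} is one tuple (h(0), h(1), h(2)).
funcs = list(product(range(n), repeat=U_size))
assert len(funcs) == n ** U_size             # Q1: n^|U| functions in total

x, y = 0, 1
colliding = [h for h in funcs if h[x] == h[y]]
assert len(colliding) == n ** (U_size - 1)   # Q3: n * n^(|U|-2) of them collide

print(len(colliding) / len(funcs))           # Q4: collision fraction = 1/n = 0.5
```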

Why is H Impractical?
There are n^|U| functions in H. What is the cost of storing a function h from H?
log(|H|) = O(|U| · log(n)) bits. Not practical!

Summary
1. Hash tables
2. Defined universal families of hash functions
3. Universal family => hash table operations take expected O(1) time
4. Universal families exist

Outline For Today
1. Hash Tables and Universal Hashing
2. Bloom Filters

Bloom Filters
- Randomized data structure
- Implements a limited version of the Dictionary ADT: Insert and Lookup (no Delete)
- Compared to hash tables: more space-efficient, but Lookup may return false positives
- Applications: website caches for ISPs, among others

Same Setup As Hash Tables
- Universe U of all possible elements (e.g., all 2^32 possible IP addresses)
- Maintain a subset S ⊆ U with |S| = m and |U| >> m

Bloom Filters
A Bloom filter consists of:
- A bit array of size n, initially all 0s (individual bits, not buckets)
- k hash functions h_1, …, h_k
Space cost per element: n/m bits.

Insertions
Insert(a): set bits h_1(a), …, h_k(a) to 1 => O(k) time.
Example with k = 3; insert x, y, z where:
- h_1(x)=2, h_2(x)=9, h_3(x)=0
- h_1(y)=1, h_2(y)=5, h_3(y)=9
- h_1(z)=10, h_2(z)=11, h_3(z)=5
Note that x and y both set bit 9. Do you see why there would be false positives?

Lookup
Lookup(a): return true iff all of bits h_1(a), …, h_k(a) are 1 => O(k) time.
Continuing the example (bits {0, 1, 2, 5, 9, 10, 11} are now set):
- x: h_1(x)=2, h_2(x)=9, h_3(x)=0 => Lookup(x) = true (correct: x was inserted)
- w: h_1(w)=3, h_2(w)=9, h_3(w)=4 => Lookup(w) = false (bits 3 and 4 are 0)
- t: h_1(t)=0, h_2(t)=1, h_3(t)=2 => Lookup(t) = true, a false positive: t was never inserted, but its bits were all set by x and y
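Insert and Lookup can be sketched in Python; deriving the k hash functions by salting SHA-256 is an illustrative stand-in for the h_1, …, h_k of the slides (not a universal family):

```python
import hashlib

class BloomFilter:
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n               # bit array, initially all 0

    def _hashes(self, item):
        # k bit indices derived from a salted hash (a stand-in choice)
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def insert(self, item):               # O(k): set k bits
        for j in self._hashes(item):
            self.bits[j] = 1

    def lookup(self, item):               # O(k): check k bits
        return all(self.bits[j] == 1 for j in self._hashes(item))
```

Note the asymmetry: an inserted element is always found (no false negatives), but an element that was never inserted may still pass the check if other insertions happened to set all k of its bits. There is no Delete: clearing a bit could break other elements that share it.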

Can Bloom Filters Be Useful?
Can Bloom filters be both space-efficient and have a low false positive rate?
What is the probability of a false positive, as a function of n, m, and k?

Probability of False Positive
- We have inserted m elements into the Bloom filter.
- A new element z arrives that has never been inserted.
- Q: What is Pr(false positive for z)? Suppose h_1(z) = j_1, …, h_k(z) = j_k.
Simplifying (unjustified) assumption: all hashing is totally random. For every h_i and every x, h_i(x) is uniformly random over the n bit positions and independent of h_j(y) for all other hash function/element pairs.
Warning: this is only to simplify the analysis; it won't hold in practice.

Pr(bit j is 1 after m insertions)?
Consider a particular bit j in the array.
Q1: Fix h_i and an element x. Pr(h_i(x) turns bit j to 1)? A1: 1/n
Q2: Pr(x turns bit j to 1)? (i.e., Pr(one of h_1(x), …, h_k(x) = j)) A2: 1 - Pr(x does not turn j to 1) = 1 - (1 - 1/n)^k
Q3: Pr(bit j = 1 after m insertions)? A3: 1 - Pr(no element turns j to 1) = 1 - (1 - 1/n)^(km)

Pr(false positive for x)?
Recall that for x we check k bits: h_1(x) = j_1, …, h_k(x) = j_k.
Pr(bit j_i = 1) = 1 - (1 - 1/n)^(km)
Pr(false positive) = Pr(all bits j_i = 1) = (1 - (1 - 1/n)^(km))^k
Recall the calculus fact (1 + x) ≤ e^x; moreover, around x = 0, (1 + x) ≈ e^x. Applying it with x = -1/n:
Pr(false positive) ≈ (1 - e^(-km/n))^k

How Does the Failure Rate Change With k and n?
Failure rate ≈ (1 - e^(-km/n))^k
Observation 1: as n increases, the failure rate decreases.
Observation 2: as k (the # of hash functions) increases:
- more bits to check => less likely to fail
- more bits set per inserted object => more likely to fail
- so it is unclear whether the rate increases or decreases
Question: What is the optimal k for fixed n/m?
Answer (by taking derivatives): k = ln(2) · n/m ≈ 0.69 · n/m

How Does the Failure Rate Change With k and n?
For fixed n/m, with the optimal k = ln(2) · n/m:
Failure rate ≈ (1/2)^k = (1/2)^(ln(2) · n/m) ≈ (0.6185)^(n/m)
Already at n = 8m, the rate is 1-2%; it decreases exponentially with n/m.
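The formulas above can be evaluated numerically (a sketch; the function names are mine, but the expressions are the ones derived on the slides):

```python
import math

def false_positive_rate(n, m, k):
    # The exact expression from the analysis: (1 - (1 - 1/n)^(km))^k
    return (1 - (1 - 1 / n) ** (k * m)) ** k

def optimal_k(n, m):
    # The k minimizing the rate for fixed n/m: ln(2) * n/m (about 0.69 * n/m)
    return math.log(2) * n / m

m = 1000
n = 8 * m                          # 8 bits per element, as on the slide
k = round(optimal_k(n, m))         # ln(2) * 8 is about 5.5, so k = 6
print(false_positive_rate(n, m, k))  # roughly 2%, matching the 1-2% claim
```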

Next Week: Dynamic Programming