
1 Data Structures Hash Tables

2 Hashing Tables

Motivation: symbol tables
- A compiler uses a symbol table to relate symbols to associated data
  - Symbols: variable names, procedure names, etc.
  - Associated data: memory location, call graph, etc.
- For a symbol table (also called a dictionary), we care about search, insertion, and deletion
- We typically don't care about sorted order

3 Hash Tables

More formally:
- Given a table T and a record x, with key (= symbol) and satellite data, we need to support:
  - Insert(T, x)
  - Delete(T, x)
  - Search(T, x)
- We want these to be fast, but don't care about sorting the records

The structure we will use is a hash table
- It supports all the above in O(1) expected time!

4 Hashing: Keys

- In the following discussion we will consider all keys to be (possibly large) natural numbers
- How can we convert floats to natural numbers for hashing purposes?
- How can we convert ASCII strings to natural numbers for hashing purposes?

5 Hashing a String

Let A_i represent the ASCII code of a string character. Since ASCII codes range from 0 to 127, we can perform the following calculation:

A_3*X^3 + A_2*X^2 + A_1*X + A_0 = (((A_3)X + A_2)X + A_1)X + A_0

What is this called? A: Horner's rule. Note that if X = 128 we can easily overflow.

Solution:

public static int hash(String key, int tableSize) {
    int hashVal = 0;
    for (int i = 0; i < key.length(); i++)
        hashVal = (hashVal * 128 + key.charAt(i)) % tableSize;
    return hashVal;
}

6 A Better Version

Hash functions should be fast and should distribute the keys equitably! The 37 slows down the shifting-out of the lower characters, and we allow overflow to occur (a sort of automatic mod-ing), which can produce a negative value:

public static int hash(String key, int tableSize) {
    int hashVal = 0;
    for (int i = 0; i < key.length(); i++)
        hashVal = hashVal * 37 + key.charAt(i);
    hashVal %= tableSize;
    if (hashVal < 0)          // fix the sign
        hashVal += tableSize;
    return hashVal;
}
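As a quick sanity check, the improved hash can be exercised on a few strings. The class name, the sample keys, and the table size 101 (an arbitrary prime, per the advice later in these slides) are illustrative choices, not from the original:

```java
public class StringHashDemo {
    // The polynomial hash from the slide (multiplier 37, overflow allowed).
    public static int hash(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++)
            hashVal = hashVal * 37 + key.charAt(i);
        hashVal %= tableSize;
        if (hashVal < 0)          // fix the sign after overflow
            hashVal += tableSize;
        return hashVal;
    }

    public static void main(String[] args) {
        int m = 101; // table size: a prime, per the division-method advice
        String[] keys = {"count", "counter", "retnuoc", "a-much-longer-identifier"};
        for (String k : keys)
            System.out.println(k + " -> " + hash(k, m));
    }
}
```

Note that even the reversed string "retnuoc" gets a different slot than "count": unlike a simple sum of character codes, the polynomial weights each position differently.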

7 Direct Addressing

Suppose:
- The range of keys is 0..m-1
- Keys are distinct

The idea:
- Set up an array T[0..m-1] in which
  - T[i] = x    if x ∈ T and key[x] = i
  - T[i] = NULL otherwise
- This is called a direct-address table
  - Operations take O(1) time!
  - So what's the problem?
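The idea above fits in a few lines. This is a minimal sketch assuming integer keys in 0..m-1 with String satellite data; the class and method names are illustrative:

```java
// Direct-address table: the key itself is the array index, so every
// operation is a single array access, i.e. O(1) worst case.
public class DirectAddressTable {
    private final String[] table; // satellite data, indexed directly by key

    public DirectAddressTable(int m) {
        table = new String[m];    // every slot starts out null (empty)
    }

    public void insert(int key, String data) { table[key] = data; } // O(1)
    public void delete(int key)              { table[key] = null; } // O(1)
    public String search(int key)            { return table[key]; } // O(1), null if absent
}
```

The catch, as the next slide explains, is that the array must be as large as the key range, not as large as the number of keys actually stored.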

8 The Problem With Direct Addressing

- Direct addressing works well when the range m of keys is relatively small
- But what if the keys are 32-bit integers?
  - Problem 1: the direct-address table would have 2^32 entries, more than 4 billion
  - Problem 2: even if memory is not an issue, the time to initialize the elements to NULL may be
- Solution: map keys to a smaller range 0..m-1
- This mapping is called a hash function

9 Hash Functions

Next problem: collision

[Figure: a universe of keys U containing the actual keys K = {k1, ..., k5}; the hash function h maps them into the table T[0..m-1], with h(k2) = h(k5) illustrating a collision.]

10 Resolving Collisions

- How can we solve the problem of collisions?
- Solution 1: chaining
- Solution 2: open addressing

11 Open Addressing

Basic idea:
- To insert: if the slot is full, try another slot, and another, until an open slot is found (probing)
- To search: follow the same sequence of probes as would be used when inserting the element
  - If we reach an element with the correct key, return it
  - If we reach a NULL pointer, the element is not in the table

Good for fixed sets (adding but no deletion)
- Example: spell checking

The table needn't be much bigger than n

12 Open Addressing (Linear Probing)

Let H(K, p) denote the p-th position tried for key K, where p = 0, 1, ... Thus H(K, 0) = h(K), where h is the basic hash function; H(K, p) represents the probe sequence we use to find the key. If key K hashes to index i and that slot is full, we try i+1, i+2, ... until an empty slot is found. The linear probing sequence is therefore:

H(K, 0)   = h(K)
H(K, p+1) = (H(K, p) + 1) mod m
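The probe sequence above translates directly into code. This is a sketch for integer keys (insert and search only, matching the slide's "no deletion" caveat); names and the choice of h are illustrative:

```java
// Linear probing per the slide: H(K,0) = h(K), H(K,p+1) = (H(K,p) + 1) mod m.
public class LinearProbingTable {
    private final Integer[] slots; // null means empty
    private final int m;

    public LinearProbingTable(int m) {
        this.m = m;
        this.slots = new Integer[m];
    }

    private int h(int key) { return Math.floorMod(key, m); }

    // Insert: probe h(K), h(K)+1, ... until an empty slot is found.
    public boolean insert(int key) {
        int i = h(key);
        for (int p = 0; p < m; p++) {
            int pos = (i + p) % m;
            if (slots[pos] == null) { slots[pos] = key; return true; }
        }
        return false; // table is full
    }

    // Search follows the same probe sequence; hitting null means "absent".
    public boolean search(int key) {
        int i = h(key);
        for (int p = 0; p < m; p++) {
            int pos = (i + p) % m;
            if (slots[pos] == null) return false;
            if (slots[pos] == key) return true;
        }
        return false;
    }
}
```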

13 Open Addressing (Double Hashing)

Here we use a second hash function h2 to determine the probe sequence. Note that linear probing is just the special case h2(K) = 1:

H(K, 0)   = h(K)
H(K, p+1) = (H(K, p) + h2(K)) mod m

To ensure that the probe sequence visits all positions in the table, h2(K) must be greater than zero and relatively prime to m for every K. Hence we usually let m be prime.
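To see how the step size h2(K) separates keys that collide on h, here is a small sketch printing probe sequences. The table size m = 11 and the particular h and h2 are illustrative; h2 is built to lie in 1..m-1, satisfying the "greater than zero and relatively prime to m" requirement since m is prime:

```java
// Double-hashing probe sequence per the slide:
// H(K,0) = h(K), H(K,p+1) = (H(K,p) + h2(K)) mod m.
public class DoubleHashingDemo {
    static final int M = 11; // prime table size

    static int h(int k)  { return k % M; }
    static int h2(int k) { return 1 + (k % (M - 1)); } // in 1..M-1, never 0

    // Return the first maxProbes positions visited for key k.
    static int[] probeSequence(int k, int maxProbes) {
        int[] seq = new int[maxProbes];
        seq[0] = h(k);
        for (int p = 1; p < maxProbes; p++)
            seq[p] = (seq[p - 1] + h2(k)) % M;
        return seq;
    }

    public static void main(String[] args) {
        // Keys 14 and 25 both hash to slot 3, but diverge immediately
        // because h2(14) = 5 while h2(25) = 6.
        System.out.println(java.util.Arrays.toString(probeSequence(14, 4)));
        System.out.println(java.util.Arrays.toString(probeSequence(25, 4)));
    }
}
```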

14 Mathematical Definitions

Count as one probe each access to the data structure. Let n be the size of the dictionary to be stored and m the size of the table, and let α = n/m be called the load factor. From a statistical standpoint, let S(α) be the expected number of probes to perform a LookUp on a key that is actually in the hash table, and U(α) the expected number of probes in an unsuccessful LookUp.

15 Performance of Linear Probing

 slot  key           # probes
 ----  ------------  --------
   1
   2   S. Adams         1
   3
   4   J. Adams         1
   5   W. Floyd         2
   6   T. Heyward       1
   7
   8   J. Hancock       1
   9
  10   C. Braxton       1
  11   J. Hart          1
  12   J. Hewes         3
  13
  14   C. Carroll       1
  15   A. Clark         1
  16   R. Ellery        2
  17   B. Franklin      1
  18   W. Hooper        5

The names were inserted in alphabetical order. The third column shows the number of probes required to find a key in the table; this is the same as the number of slots inspected when it was inserted. The long runs of occupied slots illustrate primary clustering.

S = 21/13 ≈ 1.615    U = 47/18 ≈ 2.6    load α = 13/18 ≈ 0.72

16 Linear Probing

Knuth analyzed sequential probing and obtained the following (at α = 0.9):

           S_n     U_n
 Linear    5.5    50.5
 Double    2.56   10.0

The small extra effort required to implement double hashing certainly pays off, it seems.

17 Performance of Double Hashing

Assumption: each probe into the hash table is independent and hits an occupied position with probability exactly equal to the load factor. Let α_i = i/m for every i ≤ n; this is then the probability of a collision on any probe after i keys have been inserted.

The expected number of probes in an unsuccessful search when n-1 items have been inserted is

U_{n-1} ≈ 1·(1-α_{n-1}) + 2·α_{n-1}(1-α_{n-1}) + 3·α_{n-1}^2·(1-α_{n-1}) + ...
        = 1 + α_{n-1} + α_{n-1}^2 + ...
        = 1/(1-α_{n-1})

(For instance, the term 3·α_{n-1}^2·(1-α_{n-1}) is the probability that the first two probes hit filled slots and the third is empty.)

18 Successful Searching (Double)

The number of probes in a successful search is the average of the number of probes it took to insert each of the n items. The expected number of probes to insert the i-th item is the expected number of probes in an unsuccessful search when i-1 items have been inserted.

19 Successful Search Continued

S_n = (1/n) Σ_{i=1}^{n} U_{i-1} = (1/n) Σ_{i=1}^{n} 1/(1 - α_{i-1})

where α_{i-1} = (i-1)/m. Approximating the sum by an integral,

S_n ≈ (1/α) ∫_0^α dx/(1-x) = (1/α) ln(1/(1-α))

20 Observation (Double Hashing)

When α_n = 20/31 we have S_n ≈ 1.606. Even when the table is 90% full, S_n is only 2.56, although U_n ≈ 1/(1-0.9) = 10.0.
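The figures quoted on this and the previous slides follow directly from the two formulas U(α) = 1/(1-α) and S(α) = (1/α)·ln(1/(1-α)); a quick numerical check (the class name is ours):

```java
// Evaluate the double-hashing cost formulas at the load factors
// used in the slides: alpha = 20/31 and alpha = 0.9.
public class DoubleHashingCost {
    static double unsuccessful(double a) { return 1.0 / (1.0 - a); }
    static double successful(double a)   { return (1.0 / a) * Math.log(1.0 / (1.0 - a)); }

    public static void main(String[] args) {
        double a1 = 20.0 / 31.0, a2 = 0.9;
        System.out.printf("alpha=%.3f  S=%.3f  U=%.3f%n", a1, successful(a1), unsuccessful(a1));
        System.out.printf("alpha=%.3f  S=%.3f  U=%.3f%n", a2, successful(a2), unsuccessful(a2));
    }
}
```

This reproduces S ≈ 1.606 at α = 20/31 and S ≈ 2.56, U = 10.0 at α = 0.9, matching the Knuth table on slide 16.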

21 Chaining

Chaining puts elements that hash to the same slot in a linked list:

[Figure: keys k1, ..., k8 drawn from universe U hash into table T; each non-empty slot points to a linked list of the keys that hashed there, e.g. one slot chains k1 → k4 while colliding keys k5, k2, k3 share another chain.]

22 Chaining

How do we insert an element? A: hash its key and add it at the head of that slot's list — O(1).

23 Chaining

How do we delete an element? A: remove it from its slot's list; with a doubly linked list this is O(1) given a pointer to the element.

24 Chaining

How do we search for an element with a given key? A: walk the list in slot h(key), comparing keys.
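The three operations above fit naturally on top of `java.util.LinkedList`. This is a sketch for integer keys; the class name and the choice of h are illustrative:

```java
import java.util.LinkedList;

// Chained hash table: each slot holds a linked list of the keys
// that hashed there. Insert prepends (O(1)); search and delete walk
// the slot's list, O(1 + alpha) expected under uniform hashing.
public class ChainedHashTable {
    private final LinkedList<Integer>[] slots;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int m) {
        slots = new LinkedList[m];
        for (int i = 0; i < m; i++) slots[i] = new LinkedList<>();
    }

    private int h(int key) { return Math.floorMod(key, slots.length); }

    public void insert(int key)    { slots[h(key)].addFirst(key); }
    public boolean search(int key) { return slots[h(key)].contains(key); }
    public void delete(int key)    { slots[h(key)].remove(Integer.valueOf(key)); }
}
```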

25-28 Analysis of Chaining

- Assume simple uniform hashing: each key in the table is equally likely to be hashed to any slot
- Given n keys and m slots in the table, the load factor α = n/m = average number of keys per slot
- What will be the average cost of an unsuccessful search for a key? A: O(1 + α)
- What will be the average cost of a successful search? A: O(1 + α/2) = O(1 + α)

29 Analysis of Chaining, Continued

- So the cost of searching is O(1 + α)
- If the number of keys n is proportional to the number of slots in the table, what is α? A: α = O(1)
  - In other words, we can make the expected cost of searching constant if we make α constant

30 Choosing A Hash Function

- Clearly, choosing the hash function well is crucial
  - What will a worst-case hash function do?
  - What will be the time to search in this case?
- What are desirable features of the hash function?
  - It should distribute keys uniformly into slots
  - It should not depend on patterns in the data

31 Hash Functions: The Division Method

h(k) = k mod m
- In words: hash k into a table with m slots using the slot given by the remainder of k divided by m
- What happens to elements with adjacent values of k?
- What happens if m is a power of 2 (say 2^p)?
- What if m is a power of 10?
- Upshot: pick the table size m = a prime number not too close to a power of 2 (or 10)
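The power-of-2 question above has a concrete answer: with m = 2^p, `k mod m` keeps only the low-order p bits of k, so keys that agree in those bits all collide. A small illustration (the specific key and modulus values are our own):

```java
// Division-method pitfall: keys 16, 48, 80, 112 all share the low
// four bits 0000, so mod 16 they collide; mod the prime 13 they spread out.
public class DivisionMethodDemo {
    static int h(int k, int m) { return k % m; }

    public static void main(String[] args) {
        int[] keys = {16, 48, 80, 112};
        for (int k : keys)
            System.out.println(k + ": mod 16 -> " + h(k, 16)
                                 + ", mod 13 -> " + h(k, 13));
    }
}
```

The same effect with m a power of 10 means the hash depends only on the last few decimal digits; a prime modulus mixes in all the digits/bits of the key.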

32-33 Hash Functions: The Multiplication Method

For a constant A, 0 < A < 1:

h(k) = ⌊m (kA - ⌊kA⌋)⌋

What does the term kA - ⌊kA⌋ represent? A: the fractional part of kA.
- Choose m = 2^p
- Choose A not too close to 0 or 1
- Knuth: a good choice is A = (√5 - 1)/2
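The formula is easy to express directly in floating point. A sketch using Knuth's A = (√5 - 1)/2 and m = 2^10 (the class name and the sample keys are ours):

```java
// Multiplication method: h(k) = floor(m * frac(k*A)),
// with A = (sqrt(5) - 1) / 2 per Knuth and m a power of 2.
public class MultiplicationMethod {
    static final double A = (Math.sqrt(5.0) - 1.0) / 2.0; // ~0.6180339887

    static int h(int k, int m) {
        double frac = k * A - Math.floor(k * A); // fractional part of kA
        return (int) (m * frac);                 // scale into 0..m-1
    }

    public static void main(String[] args) {
        int m = 1 << 10; // m = 2^10
        for (int k = 1; k <= 5; k++)
            System.out.println("h(" + k + ") = " + h(k, m));
    }
}
```

Because only the fractional part of kA matters, the choice of m is far less delicate here than in the division method, which is why a power of 2 is fine.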

34 Hash Functions: Worst-Case Scenario

Scenario:
- You are given an assignment to implement hashing
- You will self-grade in pairs, testing and grading your partner's implementation
- In a blatant violation of the honor code, your partner:
  - Analyzes your hash function
  - Picks a sequence of "worst-case" keys, causing your implementation to take O(n) time to search
- What's an honest CS student to do?

35 Hash Functions: Universal Hashing

- As before, when attempting to foil a malicious adversary: randomize the algorithm
- Universal hashing: pick a hash function randomly, in a way that is independent of the keys that are actually going to be stored
  - This guarantees good performance on average, no matter what keys the adversary chooses

36 Universal Hashing

Let H be a (finite) collection of hash functions that map a given universe U of keys into the range {0, 1, ..., m-1}. H is said to be universal if:
- for each pair of distinct keys x, y ∈ U, the number of hash functions h ∈ H for which h(x) = h(y) is |H|/m
- In other words: with a random hash function from H, the chance of a collision between x and y (x ≠ y) is exactly 1/m

37 Universal Hashing

Theorem 12.3:
- Choose h from a universal family of hash functions
- Hash n keys into a table of m slots, n ≤ m
- Then the expected number of collisions involving a particular key x is less than 1

Proof:
- For each pair of distinct keys y, z, let c_yz = 1 if y and z collide, 0 otherwise
- E[c_yz] = 1/m (by the definition of universality)
- Let C_x be the total number of collisions involving key x; then

  E[C_x] = Σ_{y ∈ T, y ≠ x} E[c_xy] = (n-1)/m

- Since n ≤ m, we have E[C_x] < 1

38 A Universal Hash Function

- Choose the table size m to be prime
- Decompose key x into r+1 "bytes", so that x = {x_0, x_1, ..., x_r}
  - The only requirement is that the max value of a byte is < m
  - Let a = {a_0, a_1, ..., a_r} denote a sequence of r+1 elements chosen randomly from {0, 1, ..., m-1}
  - Define the corresponding hash function h_a ∈ H:

    h_a(x) = ( Σ_{i=0}^{r} a_i x_i ) mod m

  - With this definition, H has m^{r+1} members
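The construction above is short in code. This sketch assumes the key has already been decomposed into r+1 pieces, each less than m; the class name and parameter choices (m = 257, a seeded `Random`) are illustrative:

```java
import java.util.Random;

// The universal family from the slide: m prime, coefficients a_0..a_r
// drawn uniformly from {0,...,m-1}, and h_a(x) = (sum a_i * x_i) mod m.
public class UniversalHash {
    private final int m;
    private final int[] a; // the randomly chosen coefficients

    public UniversalHash(int m, int r, Random rng) {
        this.m = m;
        this.a = new int[r + 1];
        for (int i = 0; i <= r; i++) a[i] = rng.nextInt(m);
    }

    // x holds the key's r+1 byte-sized pieces, each assumed < m.
    public int hash(int[] x) {
        long sum = 0;
        for (int i = 0; i < a.length; i++)
            sum = (sum + (long) a[i] * x[i]) % m; // mod each step to avoid overflow
        return (int) sum;
    }
}
```

Picking a new `Random` seed corresponds to picking a new function from H, which is exactly the randomization the adversary on slide 34 cannot anticipate.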

39 A Universal Hash Function

- H is a universal collection of hash functions (Theorem 12.4)
- How to use it:
  - Pick r based on m and the range of keys in U
  - Pick a hash function by (randomly) picking the a_i
  - Use that hash function on all keys

40 The End

