CS 3343: Analysis of Algorithms

CS 3343: Analysis of Algorithms
Hash tables

Hash Tables Motivation: symbol tables
A compiler uses a symbol table to relate symbols to associated data Symbols: variable names, procedure names, etc. Associated data: memory location, call graph, etc. For a symbol table (also called a dictionary), we care about search, insertion, and deletion We typically don’t care about sorted order

Hash Tables More formally: The structure we will use is a hash table
Given a table T and a record x, with key (= symbol) and associated satellite data, we need to support: Insert (T, x) Delete (T, x) Search(T, k) We want these to be fast, but don’t care about sorting the records The structure we will use is a hash table Supports all the above in O(1) expected time!

Hashing: Keys In the following discussions we will consider all keys to be (possibly large) natural numbers When they are not, have to interpret them as natural numbers. How can we convert ASCII strings to natural numbers for hashing purposes? Example: Interpret a character string as an integer expressed in some radix notation. Suppose the string is CLRS: ASCII values: C=67, L=76, R=82, S=83. There are 128 basic ASCII values. So, CLRS = 67· · · · = 141,764,947.

Direct Addressing Suppose: The idea: The range of keys is 0..m-1
Keys are distinct The idea: Set up an array T[0..m-1] in which T[i] = x if x T and key[x] = i T[i] = NULL otherwise This is called a direct-address table Operations take O(1) time! So what’s the problem?

The Problem With Direct Addressing
Direct addressing works well when the range m of keys is relatively small But what if the keys are 32-bit integers? Problem 1: direct-address table will have 232 entries, more than 4 billion Problem 2: even if memory is not an issue, the time to initialize the elements to NULL may be Solution: map keys to smaller range 0..m-1 This mapping is called a hash function

Hash Functions U – Universe of all possible keys.
Hash function h: Mapping from U to the slots of a hash table T[0..m–1]. h : U  {0,1,…, m–1} With direct addressing, key k maps to slot A[k]. With hash tables, key k maps or “hashes” to slot T[h[k]]. h[k] is the hash value of key k.

Hash Functions Problem: collision |U| >> K & |U| >> m
U (universe of keys) h(k1) k1 h(k4) k4 K (actual keys) k5 collision h(k2) = h(k5) k2 h(k3) k3 m - 1 Problem: collision

Resolving Collisions How can we solve the problem of collisions?
Solution 1: chaining Solution 2: open addressing

Open Addressing Basic idea (details in Section 11.4):
To insert: if slot is full, try another slot (following a systematic and consistent strategy), …, until an open slot is found (probing) To search, follow same sequence of probes as would be used when inserting the element If reach element with correct key, return it If reach a NULL pointer, element is not in table Good for fixed sets (adding but no deletion) Example: file names on a CD-ROM Table needn’t be much bigger than n

Chaining Chaining puts elements that hash to the same slot in a linked list: T —— U (universe of keys) k1 k4 —— —— k1 —— k4 K (actual keys) k5 —— k7 k5 k2 k7 —— —— k3 k2 k8 k3 —— k6 k8 k6 —— ——

Chaining How to insert an element? T —— U (universe of keys) k1 k4 ——
K (actual keys) k5 —— k7 k5 k2 k7 —— —— k3 k2 k8 k3 —— k6 k8 k6 —— ——

Chaining How to delete an element?
Use a doubly-linked list for efficient deletion T —— U (universe of keys) k1 k4 —— —— k1 —— k4 K (actual keys) k5 —— k7 k5 k2 k7 —— —— k3 k2 k8 k3 —— k6 k8 k6 —— ——

Chaining How to search for a element with a given key? T ——
U (universe of keys) k1 k4 —— —— k1 —— k4 K (actual keys) k5 —— k7 k5 k2 k7 —— —— k3 k2 k8 k3 —— k6 k8 k6 —— ——

Hashing with Chaining Chained-Hash-Insert (T, x)
Insert x at the head of list T[h(key[x])]. Worst-case complexity – O(1). Chained-Hash-Delete (T, x) Delete x from the list T[h(key[x])]. Worst-case complexity – proportional to length of list with singly-linked lists. O(1) with doubly-linked lists. Chained-Hash-Search (T, k) Search an element with key k in list T[h(k)]. Worst-case complexity – proportional to length of list.

Analysis of Chaining Assume simple uniform hashing: each key in table is equally likely to be hashed to any slot Given n keys and m slots in the table, the load factor  = n/m = average # keys per slot What will be the average cost of an unsuccessful search for a key? A: (1+) (Theorem 11.1) What will be the average cost of a successful search? A: (2 + /2) = (1 + ) (Theorem 11.2)

Analysis of Chaining Continued
So the cost of searching = O(1 + ) If the number of keys n is proportional to the number of slots in the table, what is ? A: n = O(m) =>  = n/m = O(1) In other words, we can make the expected cost of searching constant if we make  constant

Choosing A Hash Function
Clearly, choosing the hash function well is crucial What will a worst-case hash function do? What will be the time to search in this case? What are desirable features of the hash function? Should distribute keys uniformly into slots Should not depend on patterns in the data

Hash Functions: The Division Method
h(k) = k mod m In words: hash k into a table with m slots using the slot given by the remainder of k divided by m Example: m = 31 and k = 78, h(k) = 16. Advantage: fast Disadvantage: value of m is critical Bad if keys bear relation to m Or if hash does not depend on all bits of k What happens to elements with adjacent values of k? Elements with adjacent keys hashed to different slots: good What happens if m is a power of 2 (say 2P)? What if m is a power of 10? Pick m = prime number not too close to power of 2 (or 10)

Hash Functions: The Multiplication Method
For a constant A, 0 < A < 1: h(k) = m (kA mod 1) =  m (kA - kA)  What does this term represent?

Hash Functions: The Multiplication Method
For a constant A, 0 < A < 1: h(k) = m (kA mod 1) =  m (kA - kA)  Advantage: Value of m is not critical Disadvantage: relatively slower Choose m = 2P, for easier implementation What does this term represent? Fractional part of kA

How to choose A? The multiplication method works with any legal value of A. Choose A not too close to 0 or 1 Knuth: Good choice for A = (5 - 1)/2 Example: m = 1024, k = 123, A  … h(k) = 1024(123 · mod 1) = 1024 ·  = 18.

Multiplication Method - Implementation
Choose m = 2p, for some integer p. Let the word size of the machine be w bits. Assume that k fits into a single word. (k takes w bits.) Let 0 < s < 2w. (s takes w bits.) Restrict A to be of the form s/2w. Let k  s = r1 ·2w+ r0 . r1 holds the integer part of kA (kA) and r0 holds the fractional part of kA (kA mod 1 = kA – kA). We don’t care about the integer part of kA. So, just use r0, and forget about r1.

Multiplication Method – Implementation
w bits k  s = A·2w binary point r1 r0 extract p bits h(k) We want m (kA mod 1). m = 2p We could get that by shifting r0 to the left by p bits and then taking the p bits that were shifted to the left of the binary point. But, we don’t need to shift. Just take the p most significant bits of r0.

Hash Functions: Worst Case Scenario
You are given an assignment to implement hashing You will self-grade in pairs, testing and grading your partner’s implementation In a blatant violation of the honor code, your partner: Analyzes your hash function Picks a sequence of “worst-case” keys that all map to the same slot, causing your implementation to take O(n) time to search Exercise : when |U| > nm, for any fixed hashing function, can always choose n keys to be hashed into the same slot.

Universal Hashing When attempting to defeat a malicious adversary, randomize the algorithm Universal hashing: pick a hash function randomly in a way that is independent of the keys that are actually going to be stored pick a hash function randomly when the algorithm begins (not upon every insert!) Guarantees good performance on average, no matter what keys adversary chooses Need a family of hash functions to choose from

Universal Hashing Let  be a (finite) collection of hash functions
…that map a given universe U of keys… …into the range {0, 1, …, m - 1}.  is said to be universal if: for each pair of distinct keys x, y  U, the number of hash functions h   for which h(x) = h(y) is at most ||/m In other words: With a random hash function from , the chance of a collision between x and y is at most 1/m (x  y)

Universal Hashing Theorem 11.3 (modified from textbook):
Choose h from a universal family of hash functions Hash n keys into a table of m slots, n  m Then the expected number of collisions involving a particular key x is less than 1 Proof: For each pair of keys y, x, let cyx = 1 if y and x collide, 0 otherwise E[cyx] <= 1/m (by definition) Let Cx be total number of collisions involving key x Since n  m, we have E[Cx] < 1 Implication, expected running time of insertion is (1)

A Universal Hash Function
Choose a prime number p that is larger than all possible keys Choose table size m ≥ n Randomly choose two integers a, b, such that 1  a  p -1, and 0  b  p -1 ha,b(k) = ((ak+b) mod p) mod m Example: p = 17, m = 6 h3,4 (8) = ((3*8 + 4) % 17) % 6 = 11 % 6 = 5

A universal hash function
Theorem 11.5: The family of hash functions Hp,m = {ha,b} defined on the previous slide is universal Proof sketch: For any two distinct keys x, y, for a given ha,b, Let r = (ax+b) % p, s = (ay+b) % p. Can be shown that rs, and different (a,b) results in different (r,s) x and y collides only when r%m = s%m For a given r, the number of values s such that r%m = s%m and r  s is at most (p-1)/m For a given r, and any randomly chosen s, prob(r  s & r%m = s%m) = (p-1) / m / (p-1) = 1/m

CS 3343: Analysis of Algorithms

Similar presentations

Presentation on theme: "CS 3343: Analysis of Algorithms"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS 3343: Analysis of Algorithms

Similar presentations

Presentation on theme: "CS 3343: Analysis of Algorithms"— Presentation transcript:

Similar presentations

About project

Feedback