CS 3343: Analysis of Algorithms


CS 3343: Analysis of Algorithms Hash tables

Hash Tables Motivation: symbol tables. A compiler uses a symbol table to relate symbols to associated data. Symbols: variable names, procedure names, etc. Associated data: memory location, call graph, etc. For a symbol table (also called a dictionary), we care about search, insertion, and deletion; we typically don't care about sorted order.

Hash Tables More formally: given a table T and a record x, with key (= symbol) and associated satellite data, we need to support: Insert(T, x), Delete(T, x), Search(T, k). We want these to be fast, but we don't care about sorting the records. The structure we will use is a hash table, which supports all of the above in O(1) expected time!

Hashing: Keys In the following discussions we will consider all keys to be (possibly large) natural numbers. When they are not, we have to interpret them as natural numbers. How can we convert ASCII strings to natural numbers for hashing purposes? Example: interpret a character string as an integer expressed in some radix notation. Suppose the string is CLRS. ASCII values: C = 67, L = 76, R = 82, S = 83. There are 128 basic ASCII values, so CLRS = 67·128^3 + 76·128^2 + 82·128^1 + 83·128^0 = 141,764,947.
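As a concrete sketch, here is one way to compute that radix-128 value in C (the function name string_to_key is ours, not from the slides):

```c
#include <stdio.h>

/* Interpret an ASCII string as a radix-128 natural number, as on the slide.
   unsigned long long is used because the value grows quickly with length. */
unsigned long long string_to_key(const char *s) {
    unsigned long long key = 0;
    for (; *s != '\0'; s++)
        key = key * 128 + (unsigned char)*s;  /* shift one radix-128 digit, add next char */
    return key;
}

int main(void) {
    printf("%llu\n", string_to_key("CLRS"));  /* prints 141764947 */
    return 0;
}
```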

Direct Addressing Suppose the range of keys is 0..m-1 and the keys are distinct. The idea: set up an array T[0..m-1] in which T[i] = x if x ∈ T and key[x] = i, and T[i] = NULL otherwise. This is called a direct-address table. Operations take O(1) time! So what's the problem?
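A minimal C sketch of a direct-address table, assuming keys in 0..M-1 with an illustrative M = 100; the names da_insert, da_delete, and da_search are our own:

```c
#include <stddef.h>

#define M 100  /* keys are assumed to lie in 0..M-1 */

typedef struct { int key; /* satellite data would go here */ } Record;

Record *T[M];  /* T[i] == NULL means slot i is empty */

void da_insert(Record *x) { T[x->key] = x; }      /* the key IS the index */
void da_delete(Record *x) { T[x->key] = NULL; }
Record *da_search(int k)  { return T[k]; }        /* O(1), no probing needed */
```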

The Problem With Direct Addressing Direct addressing works well when the range m of keys is relatively small. But what if the keys are 32-bit integers? Problem 1: the direct-address table would have 2^32 entries, more than 4 billion. Problem 2: even if memory is not an issue, the time to initialize all the entries to NULL may be prohibitive. Solution: map keys to a smaller range 0..m-1. This mapping is called a hash function.

Hash Functions U – the universe of all possible keys. A hash function h is a mapping from U to the slots of a hash table T[0..m–1]: h: U → {0, 1, …, m–1}. With direct addressing, key k maps to slot A[k]. With hash tables, key k maps or "hashes" to slot T[h(k)]; h(k) is the hash value of key k.

Hash Functions Problem: collision. Since |U| >> m (and |U| >> K, the set of keys actually stored), two distinct keys can hash to the same slot. [Figure: universe U of keys with actual keys K = {k1, …, k5}; h maps each key to a slot of T[0..m-1], with the collision h(k2) = h(k5).]

Resolving Collisions How can we solve the problem of collisions? Solution 1: chaining. Solution 2: open addressing.

Open Addressing Basic idea (details in Section 11.4): To insert: if the slot is full, try another slot (following a systematic and consistent strategy), and so on, until an open slot is found (probing). To search: follow the same sequence of probes as would be used when inserting the element. If we reach an element with the correct key, return it; if we reach a NULL pointer, the element is not in the table. Good for fixed sets (adding but no deletion), e.g., file names on a CD-ROM. The table needn't be much bigger than n.
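The slides defer the details to Section 11.4; purely as an illustration, here is a minimal sketch of one probing strategy, linear probing, assuming positive integer keys, 0 as the empty-slot sentinel, and an illustrative table size (deletion is omitted, matching the "adding but no deletion" use case above):

```c
#define M 16                 /* table size (illustrative) */
static int table[M];         /* 0 = empty; keys are assumed positive */

/* Linear probing: the i-th probe for key k is slot (k + i) mod M. */
unsigned probe(int k, int i) { return ((unsigned)k + (unsigned)i) % M; }

/* Insert: try h(k,0), h(k,1), ... until an empty slot is found. */
int oa_insert(int k) {
    for (int i = 0; i < M; i++) {
        unsigned j = probe(k, i);
        if (table[j] == 0) { table[j] = k; return (int)j; }
    }
    return -1;               /* table full */
}

/* Search follows the same probe sequence; an empty slot means "not present". */
int oa_search(int k) {
    for (int i = 0; i < M; i++) {
        unsigned j = probe(k, i);
        if (table[j] == k) return (int)j;
        if (table[j] == 0) return -1;
    }
    return -1;
}
```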

Chaining Chaining puts elements that hash to the same slot in a linked list. [Figure: table T whose slots each point to a linked list or to NULL; one slot's list holds k1 and k4, another holds k5, k2, and k7, another holds k8 and k3, and k6 is alone in its list.]

Chaining How to insert an element? [Same hash-table figure as above.]

Chaining How to delete an element? Use a doubly-linked list for efficient deletion. [Same hash-table figure as above.]

Chaining How to search for an element with a given key? [Same hash-table figure as above.]

Hashing with Chaining Chained-Hash-Insert(T, x): insert x at the head of list T[h(key[x])]. Worst-case complexity: O(1). Chained-Hash-Delete(T, x): delete x from the list T[h(key[x])]. Worst-case complexity: proportional to the length of the list with singly-linked lists; O(1) with doubly-linked lists. Chained-Hash-Search(T, k): search for an element with key k in list T[h(k)]. Worst-case complexity: proportional to the length of the list.
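A minimal C sketch of these three operations, assuming integer keys, the division hash, and a doubly-linked list per slot; the slot count and the names ch_insert, ch_search, ch_delete are illustrative:

```c
#include <stddef.h>

#define M 8   /* number of slots (illustrative) */

typedef struct Node {
    int key;                      /* satellite data would go here */
    struct Node *prev, *next;     /* doubly linked, so delete is O(1) */
} Node;

Node *T[M];                       /* each slot heads a (possibly empty) list */

unsigned h(int k) { return (unsigned)k % M; }

/* Chained-Hash-Insert: O(1) worst case, insert at the head of the list. */
void ch_insert(Node *x) {
    unsigned i = h(x->key);
    x->prev = NULL;
    x->next = T[i];
    if (T[i]) T[i]->prev = x;
    T[i] = x;
}

/* Chained-Hash-Search: proportional to the length of list T[h(k)]. */
Node *ch_search(int k) {
    for (Node *x = T[h(k)]; x != NULL; x = x->next)
        if (x->key == k) return x;
    return NULL;
}

/* Chained-Hash-Delete: O(1) given a pointer to x, thanks to the prev links. */
void ch_delete(Node *x) {
    if (x->prev) x->prev->next = x->next;
    else         T[h(x->key)] = x->next;   /* x was the head of its list */
    if (x->next) x->next->prev = x->prev;
}
```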

Analysis of Chaining Assume simple uniform hashing: each key in the table is equally likely to be hashed to any slot. Given n keys and m slots in the table, the load factor α = n/m = average number of keys per slot. What will be the average cost of an unsuccessful search for a key? A: Θ(1 + α) (Theorem 11.1). What will be the average cost of a successful search? A: Θ(2 + α/2) = Θ(1 + α) (Theorem 11.2).

Analysis of Chaining Continued So the cost of searching is O(1 + α). If the number of keys n is proportional to the number of slots in the table, what is α? A: n = O(m) implies α = n/m = O(1). In other words, we can make the expected cost of searching constant if we make α constant.
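For example, storing n = 10,000 keys in a table with m = 5,000 slots gives α = 2, so an unsuccessful search inspects an expected 1 + α = 3 list nodes; doubling both n and m leaves that cost unchanged.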

Choosing A Hash Function Clearly, choosing the hash function well is crucial. What will a worst-case hash function do? What will be the time to search in that case? What are desirable features of the hash function? It should distribute keys uniformly into slots, and it should not depend on patterns in the data.

Hash Functions: The Division Method h(k) = k mod m. In words: hash k into a table with m slots using the slot given by the remainder of k divided by m. Example: m = 31 and k = 78 gives h(k) = 16. Advantage: fast. Disadvantage: the value of m is critical. Bad if the keys bear a relation to m, or if the hash does not depend on all bits of k. What happens to elements with adjacent values of k? They hash to different slots: good. What happens if m is a power of 2 (say 2^p)? What if m is a power of 10? Pick m = a prime number not too close to a power of 2 (or 10).
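A quick C illustration of the method, and of the power-of-2 pitfall raised above:

```c
#include <stdio.h>

/* Division-method hash: one modulo operation, hence fast. */
unsigned h_div(unsigned k, unsigned m) { return k % m; }

int main(void) {
    printf("%u\n", h_div(78, 31));   /* 16, as in the slide's example */
    /* Why m = 2^p is bad: k % 16 keeps only the low 4 bits of k, so keys
       that differ only in higher bits all collide in the same slot: */
    printf("%u %u %u\n", h_div(0x10, 16), h_div(0x20, 16), h_div(0x30, 16)); /* 0 0 0 */
    return 0;
}
```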

Hash Functions: The Multiplication Method For a constant A, 0 < A < 1: h(k) = ⌊m (kA mod 1)⌋ = ⌊m (kA − ⌊kA⌋)⌋. What does the term kA mod 1 represent?

Hash Functions: The Multiplication Method For a constant A, 0 < A < 1: h(k) = ⌊m (kA mod 1)⌋ = ⌊m (kA − ⌊kA⌋)⌋. Advantage: the value of m is not critical. Disadvantage: relatively slower. Choose m = 2^p for easier implementation. What does the term kA mod 1 represent? The fractional part of kA.

How to choose A? The multiplication method works with any legal value of A, but choose A not too close to 0 or 1. Knuth: a good choice is A = (√5 − 1)/2 ≈ 0.6180339887… Example: m = 1024, k = 123: h(k) = ⌊1024 · (123 · 0.6180339887… mod 1)⌋ = ⌊1024 · 0.018181…⌋ = ⌊18.61…⌋ = 18.

Multiplication Method - Implementation Choose m = 2^p for some integer p. Let the word size of the machine be w bits, and assume that k fits into a single word (k takes w bits). Let 0 < s < 2^w (s takes w bits), and restrict A to be of the form s/2^w. Let k · s = r1·2^w + r0. Then r1 holds the integer part of kA (⌊kA⌋) and r0 holds the fractional part of kA (kA mod 1 = kA − ⌊kA⌋). We don't care about the integer part of kA, so we just use r0 and forget about r1.

Multiplication Method – Implementation [Figure: the w-bit key k multiplied by s = A·2^w gives a 2w-bit product, split at the binary point into the high word r1 and the low word r0; h(k) is obtained by extracting the top p bits of r0.] We want ⌊m (kA mod 1)⌋, with m = 2^p. We could get that by shifting r0 to the left by p bits and then taking the p bits that land to the left of the binary point. But we don't need to shift: just take the p most significant bits of r0.
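A minimal C sketch of this fixed-point computation for w = 32 and p = 10 (so m = 1024), using s = ⌊A·2^32⌋ = 2654435769 for Knuth's A; it reproduces the worked example h(123) = 18:

```c
#include <stdio.h>
#include <stdint.h>

#define P 10   /* m = 2^10 = 1024 slots */
#define W 32   /* word size in bits */

/* s = floor(A * 2^32) for A = (sqrt(5)-1)/2. */
static const uint32_t s = 2654435769u;

uint32_t h_mult(uint32_t k) {
    uint64_t prod = (uint64_t)k * s;   /* the 2w-bit product r1*2^32 + r0 */
    uint32_t r0 = (uint32_t)prod;      /* low word: fractional part of kA, scaled by 2^32 */
    return r0 >> (W - P);              /* the p most significant bits of r0 */
}

int main(void) {
    printf("%u\n", h_mult(123));       /* 18, matching the worked example */
    return 0;
}
```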

Hash Functions: Worst Case Scenario You are given an assignment to implement hashing. You will self-grade in pairs, testing and grading your partner's implementation. In a blatant violation of the honor code, your partner analyzes your hash function and picks a sequence of "worst-case" keys that all map to the same slot, causing your implementation to take O(n) time per search. Exercise 11.2-5: when |U| > nm, for any fixed hash function one can always choose n keys that hash to the same slot.

Universal Hashing When attempting to defeat a malicious adversary, randomize the algorithm. Universal hashing: pick a hash function randomly, in a way that is independent of the keys that are actually going to be stored. The function is picked once, when the algorithm begins (not upon every insert!). This guarantees good performance on average, no matter what keys the adversary chooses. We need a family of hash functions to choose from.

Universal Hashing Let H be a (finite) collection of hash functions that map a given universe U of keys into the range {0, 1, …, m − 1}. H is said to be universal if, for each pair of distinct keys x, y ∈ U, the number of hash functions h ∈ H for which h(x) = h(y) is at most |H|/m. In other words: with a random hash function from H, the chance of a collision between x and y (x ≠ y) is at most 1/m.

Universal Hashing Theorem 11.3 (modified from the textbook): choose h from a universal family of hash functions and hash n keys into a table of m slots, with n ≤ m. Then the expected number of collisions involving a particular key x is less than 1. Proof: for each pair of keys y, x, let c_yx = 1 if y and x collide and 0 otherwise. E[c_yx] ≤ 1/m (by the definition of universality). Let C_x be the total number of collisions involving key x. Since n ≤ m, we have E[C_x] < 1, as the derivation below makes precise. Implication: the expected running time of insertion is Θ(1).
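Written out under the slide's definitions, the bound is one application of linearity of expectation:

```latex
E[C_x] = E\Big[\sum_{y \neq x} c_{yx}\Big]
       = \sum_{y \neq x} E[c_{yx}]    % linearity of expectation
       \le \frac{n-1}{m}              % universality: E[c_{yx}] \le 1/m
       < 1 \quad \text{since } n \le m .
```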

A Universal Hash Function Choose a prime number p that is larger than all possible keys, and a table size m ≥ n. Randomly choose two integers a, b such that 1 ≤ a ≤ p − 1 and 0 ≤ b ≤ p − 1. Then h_{a,b}(k) = ((ak + b) mod p) mod m. Example: p = 17, m = 6: h_{3,4}(8) = ((3·8 + 4) mod 17) mod 6 = 11 mod 6 = 5.
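A minimal C sketch, reusing the slide's small p = 17 and m = 6 (a real table would need p larger than every possible key, and wide enough arithmetic that a·k + b does not overflow):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* h_{a,b}(k) = ((a*k + b) mod p) mod m, with p prime and p > every key. */
static const unsigned p = 17, m = 6;

unsigned h_ab(unsigned a, unsigned b, unsigned k) {
    return ((a * k + b) % p) % m;
}

int main(void) {
    /* Pick (a, b) at random ONCE, before any keys are hashed. */
    srand((unsigned)time(NULL));
    unsigned a = 1 + (unsigned)rand() % (p - 1);  /* 1 <= a <= p-1 */
    unsigned b = (unsigned)rand() % p;            /* 0 <= b <= p-1 */
    printf("random h_{%u,%u}(8) = %u\n", a, b, h_ab(a, b, 8));
    printf("h_{3,4}(8) = %u\n", h_ab(3, 4, 8));   /* 5, as in the slide */
    return 0;
}
```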

A universal hash function Theorem 11.5: the family of hash functions H_{p,m} = {h_{a,b}} defined on the previous slide is universal. Proof sketch: for any two distinct keys x, y and a given h_{a,b}, let r = (ax + b) mod p and s = (ay + b) mod p. It can be shown that r ≠ s, and that different pairs (a, b) result in different pairs (r, s). x and y collide only when r mod m = s mod m. For a given r, the number of values s such that r mod m = s mod m and r ≠ s is at most (p − 1)/m. So for a given r and a randomly chosen s, Pr(r ≠ s and r mod m = s mod m) ≤ ((p − 1)/m) / (p − 1) = 1/m.