Data Structures Hash Tables

Hash Tables
- Motivation: symbol tables
  - A compiler uses a symbol table to relate symbols to associated data
    - Symbols: variable names, procedure names, etc.
    - Associated data: memory location, call graph, etc.
  - For a symbol table (also called a dictionary), we care about search, insertion, and deletion
  - We typically don't care about sorted order

Hash Tables
- More formally:
  - Given a table T and a record x, with key (= symbol) and satellite data, we need to support:
    - Insert(T, x)
    - Delete(T, x)
    - Search(T, x)
  - We want these to be fast, but don't care about sorting the records
- The structure we will use is a hash table
  - Supports all the above in O(1) expected time!

Hashing: Keys
- In the following discussions we will consider all keys to be (possibly large) natural numbers
- How can we convert floats to natural numbers for hashing purposes?
- How can we convert ASCII strings to natural numbers for hashing purposes?

Hashing a String
Let Ai represent the ASCII code of the i-th string character. Since ASCII codes go from 0 to 127, we can perform the following calculation:

A3*X^3 + A2*X^2 + A1*X^1 + A0*X^0 = ((A3*X + A2)*X + A1)*X + A0

What is this called? Note that if X = 128 we can easily overflow. Solution: take the remainder mod tableSize at every step:

public static int hash(String key, int tableSize) {
    int hashVal = 0;
    for (int i = 0; i < key.length(); i++)
        hashVal = (hashVal * 128 + key.charAt(i)) % tableSize;
    return hashVal;
}

A Better Version
Hash functions should be fast and distribute the keys equitably! The 37 slows down the shifting-out of lower characters, and we allow overflow to occur (a sort of automatic mod-ing), which can make the intermediate result negative:

public static int hash(String key, int tableSize) {
    int hashVal = 0;
    for (int i = 0; i < key.length(); i++)
        hashVal = hashVal * 37 + key.charAt(i);
    hashVal %= tableSize;
    if (hashVal < 0)
        hashVal += tableSize; // fix the sign
    return hashVal;
}

Direct Addressing
- Suppose:
  - The range of keys is 0..m-1
  - Keys are distinct
- The idea:
  - Set up an array T[0..m-1] in which
    - T[i] = x if x ∈ T and key[x] = i
    - T[i] = NULL otherwise
  - This is called a direct-address table
    - Operations take O(1) time!
    - So what's the problem?
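The direct-address idea above can be sketched in a few lines; this is a minimal illustration, and the class and method names are my own, not from the slides:

```java
// Direct-address table: the key itself is the array index, so every
// operation is a single array access, i.e. O(1).
public class DirectAddressTable {
    private final Object[] slots;

    public DirectAddressTable(int m) { slots = new Object[m]; } // keys in 0..m-1

    public void insert(int key, Object value) { slots[key] = value; }
    public Object search(int key)             { return slots[key]; } // null if absent
    public void delete(int key)               { slots[key] = null; }

    public static void main(String[] args) {
        DirectAddressTable t = new DirectAddressTable(10);
        t.insert(3, "x");
        System.out.println(t.search(3));
        t.delete(3);
        System.out.println(t.search(3));
    }
}
```

Note that the array must be as large as the key range, which is exactly the problem the next slide raises.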

The Problem With Direct Addressing
- Direct addressing works well when the range m of keys is relatively small
- But what if the keys are 32-bit integers?
  - Problem 1: the direct-address table would have 2^32 entries, more than 4 billion
  - Problem 2: even if memory is not an issue, the time to initialize all the elements to NULL may be prohibitive
- Solution: map keys to a smaller range 0..m-1
- This mapping is called a hash function

Hash Functions
- Next problem: collision
[Figure: universe U of keys with actual keys k1..k5; h maps them into table T[0..m-1], and two distinct keys collide: h(k2) = h(k5)]

Resolving Collisions
- How can we solve the problem of collisions?
- Solution 1: chaining
- Solution 2: open addressing

Open Addressing
- Basic idea:
  - To insert: if the slot is full, try another slot, ..., until an open slot is found (probing)
  - To search: follow the same sequence of probes as would be used when inserting the element
    - If we reach an element with the correct key, return it
    - If we reach a NULL pointer, the element is not in the table
- Good for fixed sets (adding but no deletion)
  - Example: spell checking
- The table needn't be much bigger than n

Open Addressing (Linear Probing)
Let H(K,p) denote the p-th position tried for key K, where p = 0, 1, .... Thus H(K,0) = h(K), where h is the basic hash function. H(K,p) represents the probe sequence we use to find the key. If key K hashes to index i and slot i is full, then try i+1, i+2, ..., until an empty slot is found. The linear probing sequence is therefore:

H(K,0) = h(K)
H(K,p+1) = (H(K,p) + 1) mod m
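The probe sequence above can be sketched as a minimal insert/search table (no deletion, per the earlier slide); the class name and integer-only keys are illustrative assumptions:

```java
// Linear probing: H(K,0) = h(K), H(K,p+1) = (H(K,p) + 1) mod m.
public class LinearProbeTable {
    private final Integer[] table; // null marks an empty slot
    private final int m;

    public LinearProbeTable(int m) { this.m = m; table = new Integer[m]; }

    private int h(int k) { return Math.floorMod(k, m); }

    public void insert(int k) {
        int i = h(k);
        for (int p = 0; p < m; p++) {        // probe i, i+1, i+2, ... mod m
            int slot = (i + p) % m;
            if (table[slot] == null) { table[slot] = k; return; }
        }
        throw new IllegalStateException("table full");
    }

    public boolean search(int k) {
        int i = h(k);
        for (int p = 0; p < m; p++) {        // same probe sequence as insert
            int slot = (i + p) % m;
            if (table[slot] == null) return false; // empty slot: key absent
            if (table[slot] == k) return true;
        }
        return false;
    }
}
```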

Open Addressing (Double Hashing)
Here we use a second hash function h2 to determine the probe sequence. Note that linear probing is just the case h2(K) = 1.

H(K,0) = h(K)
H(K,p+1) = (H(K,p) + h2(K)) mod m

To ensure that the probe sequence visits all positions in the table, h2(K) must be greater than zero and relatively prime to m for every K. Hence we usually let m be prime.
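A sketch of the double-hashing probe sequence, assuming a prime m so any nonzero step is relatively prime to it; the particular h and h2 here (k mod m, and 1 + k mod (m-1)) are common illustrative choices, not prescribed by the slides:

```java
// Double hashing: H(K,0) = h(K), H(K,p+1) = (H(K,p) + h2(K)) mod m.
public class DoubleHashTable {
    private final Integer[] table;
    private final int m; // chosen prime, so every h2 step cycles all slots

    public DoubleHashTable(int m) { this.m = m; table = new Integer[m]; }

    private int h(int k)  { return Math.floorMod(k, m); }
    private int h2(int k) { return 1 + Math.floorMod(k, m - 1); } // never zero

    public void insert(int k) {
        int pos = h(k), step = h2(k);
        for (int p = 0; p < m; p++) {
            if (table[pos] == null) { table[pos] = k; return; }
            pos = (pos + step) % m;          // keys hashing alike still step apart
        }
        throw new IllegalStateException("table full");
    }

    public boolean search(int k) {
        int pos = h(k), step = h2(k);
        for (int p = 0; p < m; p++) {
            if (table[pos] == null) return false;
            if (table[pos] == k) return true;
            pos = (pos + step) % m;
        }
        return false;
    }
}
```

Because colliding keys usually disagree on h2, they follow different probe sequences, which is what breaks up the primary clustering seen with linear probing.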

Mathematical Definitions
Let us count as one probe each access to the data structure. Let n be the size of the dictionary to be stored and m the size of the table, and let α = n/m be called the load factor. From a statistical standpoint, let S(α) be the expected number of probes to perform a LookUp on a key that is actually in the hash table, and U(α) the expected number of probes in an unsuccessful LookUp.

Performance of Linear Probing

slot  key          # probes
 1    —
 2    S. Adams     1
 3    —
 4    J. Adams     1
 5    W. Floyd     2
 6    T. Heyward   1
 7    —
 8    J. Hancock   1
 9    —
10    C. Braxton   1
11    J. Hart      1
12    J. Hewes     3
13    —
14    C. Carroll   1
15    A. Clark     1
16    R. Ellery    2
17    B. Franklin  1
18    W. Hooper    5

The names were inserted in alphabetical order. The third column shows the number of probes required to find a key in the table; this is the same as the number of slots that were inspected when it was inserted. Note the runs of occupied slots: primary clustering.

S = 21/13 ≈ 1.615   U = 47/18 ≈ 2.6   load factor = 13/18 ≈ 0.72

Linear Probing
Knuth analyzed sequential (linear) probing and obtained the following expected probe counts at load factor α = 0.9:

          Sn     Un
Linear    5.5    50.5
Double    2.56   10.0

The small extra effort required to implement double hashing certainly pays off, it seems.

Performance of Double Hashing
Assumption: each probe into the hash table is independent and has a probability of hitting an occupied position exactly equal to the load factor. Let α_i = i/m for every i ≤ n. This is then the probability of a collision on any probe after i keys have been inserted. The expected number of probes in an unsuccessful search when n-1 items have been inserted is

U_{n-1} ≈ 1·(1 - α_{n-1}) + 2·α_{n-1}(1 - α_{n-1}) + 3·α_{n-1}^2·(1 - α_{n-1}) + ...
        = 1 + α_{n-1} + α_{n-1}^2 + ...
        = 1/(1 - α_{n-1})

(The term 3·α_{n-1}^2·(1 - α_{n-1}) is the probability that the first two probes hit filled slots and the third is empty.)

Successful Searching (Double)
The number of probes in a successful search is the average of the number of probes it took to insert each of the n items. The expected number of probes to insert the i-th item is the expected number of probes in an unsuccessful search when i-1 items have been inserted.

Successful Search Continued

S_n = (1/n) Σ_{i=1}^{n} U_{i-1} = (1/n) Σ_{i=0}^{n-1} 1/(1 - α_i),  where α_i = i/m.

Approximating the sum by an integral, with α = n/m:

S_n ≈ (1/α) ∫_0^α dx/(1 - x) = (1/α) ln(1/(1 - α)).

Observation (Double Hashing)
When α_n = 20/31, the formula gives S_n ≈ 1.6. Even when the table is 90% full, S_n is only 2.56, although U_n ≈ 1/(1 - 0.9) = 10.0.

Chaining
- Chaining puts elements that hash to the same slot in a linked list:
[Figure: universe U with actual keys k1..k8 hashing into table T; keys that collide (e.g. k1 and k4) are stored together in a per-slot linked list]

Chaining
- How do we insert an element?
[Same figure: colliding keys chained in per-slot linked lists]

Chaining
- How do we delete an element?
[Same figure: colliding keys chained in per-slot linked lists]

Chaining
- How do we search for an element with a given key?
[Same figure: colliding keys chained in per-slot linked lists]
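The three operations asked about above can be sketched with one linked list per slot; this is an assumed minimal implementation with integer keys, not the slides' own code:

```java
import java.util.LinkedList;

// Chained hash table: each slot holds a linked list of the keys hashing there.
public class ChainedHashTable {
    private final LinkedList<Integer>[] slots;
    private final int m;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int m) {
        this.m = m;
        slots = new LinkedList[m];
        for (int i = 0; i < m; i++) slots[i] = new LinkedList<>();
    }

    private int h(int k) { return Math.floorMod(k, m); }

    // Insert is O(1): prepend to the chain at slot h(k).
    public void insert(int k)    { slots[h(k)].addFirst(k); }

    // Search and delete scan only the one chain the key can live in.
    public boolean search(int k) { return slots[h(k)].contains(k); }
    public void delete(int k)    { slots[h(k)].remove(Integer.valueOf(k)); }
}
```

Search and delete cost is proportional to the chain length, which is what the load factor analysis on the next slides quantifies.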

Analysis of Chaining
- Assume simple uniform hashing: each key in the table is equally likely to be hashed to any slot
- Given n keys and m slots in the table, the load factor α = n/m = average number of keys per slot
- What will be the average cost of an unsuccessful search for a key? A: O(1 + α)
- What will be the average cost of a successful search? A: O(1 + α/2) = O(1 + α)

Analysis of Chaining Continued
- So the cost of searching is O(1 + α)
- If the number of keys n is proportional to the number of slots in the table, what is α?
- A: α = O(1)
  - In other words, we can make the expected cost of searching constant if we make α constant

Choosing A Hash Function
- Clearly, choosing the hash function well is crucial
  - What will a worst-case hash function do?
  - What will be the time to search in this case?
- What are desirable features of the hash function?
  - It should distribute keys uniformly into slots
  - It should not depend on patterns in the data

Hash Functions: The Division Method
- h(k) = k mod m
  - In words: hash k into a table with m slots using the slot given by the remainder of k divided by m
- What happens to elements with adjacent values of k?
- What happens if m is a power of 2 (say 2^p)?
- What if m is a power of 10?
- Upshot: pick the table size m to be a prime number not too close to a power of 2 (or 10)
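A small experiment suggests the answer to the power-of-2 question above: k mod 2^p keeps only the low-order p bits, so keys that differ only in high bits all collide. The specific values below are illustrative:

```java
// Division method with m = 2^p vs. m prime, on keys that differ
// only above their low 4 bits.
public class DivisionMethod {
    public static void main(String[] args) {
        int mPow2 = 16, mPrime = 13;
        int[] keys = {0x10, 0x20, 0x30, 0x40};
        for (int k : keys)
            System.out.println(k + ": " + k % mPow2 + " vs " + k % mPrime);
        // With m = 16 every key lands in slot 0 (only low bits matter);
        // with m = 13 the same keys spread across different slots.
    }
}
```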

Hash Functions: The Multiplication Method
- For a constant A, 0 < A < 1:
- h(k) = ⌊m (kA - ⌊kA⌋)⌋   (the term kA - ⌊kA⌋ is the fractional part of kA)
- Choose m = 2^p
- Choose A not too close to 0 or 1
- Knuth: a good choice is A = (√5 - 1)/2
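The formula above can be sketched directly; this is an illustrative implementation using floating point (a production version would use fixed-point integer arithmetic), with class and method names of my own choosing:

```java
// Multiplication method: h(k) = floor(m * frac(k*A)), A = (sqrt(5)-1)/2.
public class MultHash {
    static final double A = (Math.sqrt(5) - 1) / 2; // Knuth's suggested constant

    static int hash(long k, int m) {    // m is typically a power of 2
        double frac = (k * A) % 1.0;    // fractional part of k*A
        return (int) (m * frac);
    }

    public static void main(String[] args) {
        for (int k = 0; k < 5; k++)
            System.out.println(k + " -> " + hash(k, 1024));
    }
}
```

Because frac(kA) is always in [0, 1), the slot index is always in 0..m-1 regardless of how large k is, and with an irrational-like A consecutive keys land far apart.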

Hash Functions: Worst Case Scenario
- Scenario:
  - You are given an assignment to implement hashing
  - You will self-grade in pairs, testing and grading your partner's implementation
  - In a blatant violation of the honor code, your partner:
    - Analyzes your hash function
    - Picks a sequence of "worst-case" keys, causing your implementation to take O(n) time to search
- What's an honest CS student to do?

Hash Functions: Universal Hashing
- As before, when attempting to foil a malicious adversary: randomize the algorithm
- Universal hashing: pick a hash function randomly, in a way that is independent of the keys that are actually going to be stored
  - This guarantees good performance on average, no matter what keys the adversary chooses

Universal Hashing
- Let H be a (finite) collection of hash functions
  - ...that map a given universe U of keys...
  - ...into the range {0, 1, ..., m-1}
- H is said to be universal if:
  - for each pair of distinct keys x, y ∈ U, the number of hash functions h ∈ H for which h(x) = h(y) is |H|/m
  - In other words:
    - With a random hash function from H, the chance of a collision between x and y (x ≠ y) is exactly 1/m

Universal Hashing
- Theorem 12.3:
  - Choose h from a universal family of hash functions
  - Hash n keys into a table of m slots, n ≤ m
  - Then the expected number of collisions involving a particular key x is less than 1
  - Proof:
    - For each pair of distinct keys y, z, let c_yz = 1 if y and z collide, 0 otherwise
    - E[c_yz] = 1/m (by the definition of a universal family)
    - Let C_x be the total number of collisions involving key x
    - E[C_x] = Σ_{y ∈ T, y ≠ x} E[c_xy] = (n - 1)/m
    - Since n ≤ m, we have E[C_x] < 1

A Universal Hash Function
- Choose the table size m to be prime
- Decompose key x into r+1 bytes, so that x = ⟨x_0, x_1, ..., x_r⟩
  - The only requirement is that the max value of a byte be < m
  - Let a = ⟨a_0, a_1, ..., a_r⟩ denote a sequence of r+1 elements chosen randomly from {0, 1, ..., m-1}
  - Define the corresponding hash function h_a ∈ H:
    h_a(x) = (Σ_{i=0}^{r} a_i x_i) mod m
  - With this definition, H has m^(r+1) members
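The construction above can be sketched as follows, with m = 257 (a prime larger than any byte value, per the requirement above); the class name and the fixed-length-key simplification are illustrative assumptions:

```java
import java.util.Random;

// Universal family sketch: h_a(x) = (sum of a_i * x_i) mod m, with m prime
// and coefficients a_i drawn uniformly at random from {0, ..., m-1}.
public class UniversalHash {
    private final int m;    // prime table size, must exceed max byte value
    private final int[] a;  // random coefficients, one per key byte

    public UniversalHash(int m, int numBytes, Random rng) {
        this.m = m;
        a = new int[numBytes];
        for (int i = 0; i < numBytes; i++) a[i] = rng.nextInt(m);
    }

    public int hash(byte[] x) {  // assumes x.length == number of coefficients
        long sum = 0;
        for (int i = 0; i < x.length; i++)
            sum += (long) a[i] * (x[i] & 0xFF); // treat bytes as 0..255
        return (int) (sum % m);
    }
}
```

Picking the a_i once at startup fixes one member of the family; an adversary who doesn't know the a_i cannot construct a worst-case key sequence in advance.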

A Universal Hash Function
- H is a universal collection of hash functions (Theorem 12.4)
- How to use it:
  - Pick r based on m and the range of keys in U
  - Pick a hash function by (randomly) picking the a_i
  - Use that hash function on all keys

The End