Implementing Dictionaries Many applications require a dynamic set that supports the dictionary operations Insert, Delete, and Search; e.g., the symbol table in a compiler, or a cache memory. With an ordinary array, we store the element with key k at position k of the array. Given a key k, we find the element by looking in the kth position of the array. This is known as direct addressing. Direct addressing is applicable when we can afford to allocate an array with one position for every possible key.

Direct-address tables Each element has a key drawn from a universe U = {0, 1, …, m−1}, and no two elements have the same key. Direct addressing works well when the universe U of keys is reasonably small.

Approach 1: Direct-Addressing Direct-Address Table: assume U = {0, 1, 2, ..., m}; the data structure is an array T[0...m].
- Insert(x): T[key[x]] := x
- Search(k): return T[k]
- Delete(x): T[key[x]] := NIL
Running times (assume n elements in the table):
- Insert(x): O(1) time
- Search(k): O(1) time
- Delete(x): O(1) time
Great running time! (can't do any better)
Space usage:
- O(m) space, always!
- good if n = Θ(m)
- bad if n << m
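
In C, a direct-address table is little more than an array indexed by key. A minimal sketch, assuming integer keys drawn from U (the names Item, da_insert, etc. are illustrative, not from the slides):

    #include <stddef.h>  /* NULL */

    #define M 1024  /* size of the key universe U = {0, ..., M-1} */

    typedef struct Item {
        int key;                /* satellite data would go alongside the key */
    } Item;

    static Item *T[M];          /* T[k] holds the item with key k, or NULL */

    void da_insert(Item *x) { T[x->key] = x; }       /* O(1) */
    Item *da_search(int k)  { return T[k]; }         /* O(1) */
    void da_delete(Item *x) { T[x->key] = NULL; }    /* O(1) */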

Hash Tables U is the "key space", the set of all possible keys. K ⊆ U is the set of keys actually in use. A hash table is an ordinary array, but it typically has a size proportional to the number of keys to be stored, not to the size of the key space. Use a hash table when K is small relative to U. IDEA: given a key k, don't just use k as the index into the array. Instead, compute a function of k, and use that value to index into the array. We call this function a hash function. Goals: a fast implementation of all operations -- O(1) time; a space-efficient data structure -- O(n) space for n elements in the dictionary.

Approach 2: Hashing
- hash table: a 1D array H[0..m−1], where m << |U| and m is prime
- hash function: a deterministic mapping h of keys to indices in H, h : U → {0, 1, ..., m−1}
Problem: there will be some collisions; that is, h will map some keys to the same position in H (i.e., h(k1) = h(k2) for k1 ≠ k2). Different methods of resolving collisions:
1. chaining: use a general hash function and put all elements that hash to the same location in a linked list at that location (as is done in bucket sort).
2. open addressing: use a general hash function as in chaining, but on a collision step onward from the original position until an empty slot (or the element you are looking for) is found. Each table slot holds at most one element, so there is one index position per element in the table.

Hash Functions The mapping of keys to indices of a hash table is called a hash function.
- An essential requirement is that the hash function map equal keys to equal indices.
- A "good" hash function minimizes the probability of collisions.
The purpose of a hash function is to translate an extremely large key space into a reasonably small range of integers, i.e., to map each key k in our dictionary to a position in the hash table.

Choosing Hash Functions Ideally, a hash function satisfies the Simple Uniform Hashing Assumption. Simple Uniform Hashing (SUH): any key is equally likely to hash to any location (index) in the hash table.

Choosing Hash Functions Realistically, it's usually not possible to satisfy the SUH requirement, because we don't know in advance the probability distribution the keys are drawn from. We often use heuristics, based on the domain of the keys, to create a hash function that performs well. Hash functions assume that keys are natural numbers; when they're not, we have to find a way to interpret them as natural numbers.

Hash-Code Maps A hash function assigns each key k in the dictionary an integer value, called the hash code or hash value. An essential feature of a hash code is consistency, i.e., it should map all items with key k to the same integer. Common hash codes: 1) Component sum: for numeric types with more than 32 bits, we can add the 32-bit components, i.e., sum the high-order bits with the low-order bits. The resulting integer is the hash code.
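
A sketch of the component-sum idea in C (the function name is illustrative): fold a 64-bit key into a 32-bit hash code by adding its two 32-bit halves.

    #include <stdint.h>

    /* Component sum: add the high-order and low-order 32-bit halves
       of a 64-bit key; the wrapped unsigned sum is the hash code. */
    uint32_t component_sum(uint64_t key) {
        uint32_t lo = (uint32_t)(key & 0xFFFFFFFFu);   /* low-order 32 bits  */
        uint32_t hi = (uint32_t)(key >> 32);           /* high-order 32 bits */
        return lo + hi;        /* unsigned overflow wraps, which is fine here */
    }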

Hash-Code Maps Common hash codes (cont.): 2) Polynomial accumulation: for strings of a natural language, combine the character values (ASCII or Unicode) a0 a1 ... an−1 by viewing them as the coefficients of a polynomial: a0 + a1·x + a2·x² + ... + an−1·x^(n−1). Example: interpret a character string as an integer expressed in some radix notation. Suppose the string is CLRS, with ASCII values C = 67, L = 76, R = 82, S = 83. There are 128 basic ASCII values, so interpret CLRS as (67 × 128³) + (76 × 128²) + (82 × 128¹) + (83 × 128⁰) = 141,764,947.
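
The same computation in C, using Horner's rule so the powers of 128 never have to be formed explicitly (a sketch; for strings longer than nine characters the 64-bit value wraps around, which is acceptable for a hash code):

    #include <stdio.h>
    #include <stdint.h>

    /* Interpret a string as an integer written in radix 128 (Horner's rule). */
    uint64_t radix128_hash_code(const char *s) {
        uint64_t h = 0;
        for (; *s != '\0'; s++)
            h = h * 128 + (uint64_t)(unsigned char)*s;
        return h;
    }

    int main(void) {
        /* prints 141764947, matching the hand computation above */
        printf("%llu\n", (unsigned long long)radix128_hash_code("CLRS"));
        return 0;
    }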

Compression Maps Normally, the range of hash codes generated for a set of keys exceeds the range of the array, so we need a way to map each hash code into the range [0, m−1]. Division method: take the integer produced as a hash code and reduce it modulo the table size: h(k) = k mod m.
- The table size m is usually chosen to be a prime number, to help "spread out" the distribution of hashed values.
- The prime should not be close to an exact power of 2.
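
In code, the division method is a one-liner (m = 13 here simply to match the probing examples later in the lecture):

    #define M 13                 /* a prime table size */

    unsigned div_hash(unsigned k) {
        return k % M;            /* h(k) = k mod m */
    }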

Compression Maps Multiplication method:
1. Choose a constant A in the range 0 < A < 1.
2. Multiply the key k by A.
3. Extract the fractional part of kA.
4. Multiply the fractional part by m.
5. Truncate (take the floor of) the result.
That is, h(k) = ⌊m · (kA mod 1)⌋. An advantage of this method is that the value of m is not critical (it need not be prime). A disadvantage is that it takes longer to compute.
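
A sketch of the multiplication method in C. The constant A = (√5 − 1)/2 ≈ 0.618 is Knuth's commonly suggested choice; the slide only requires 0 < A < 1:

    #include <math.h>

    /* Multiplication method: h(k) = floor(m * frac(k * A)), with 0 < A < 1. */
    unsigned mult_hash(unsigned k, unsigned m) {
        const double A = 0.6180339887;       /* (sqrt(5) - 1) / 2 */
        double prod = (double)k * A;         /* step 2: multiply key by A     */
        double frac = prod - floor(prod);    /* step 3: fractional part of kA */
        return (unsigned)((double)m * frac); /* steps 4-5: scale and truncate */
    }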

Collision Resolution by Chaining Running times (assume n elements in the table):
- Insert(x): O(1) time, to insert the node at the head of list T[h(key[x])]
- Search(k): proportional to the length of list T[h(k)]
- Delete(x): proportional to the length of list T[h(k)]
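
A minimal C sketch of chaining, with insert-at-head and a linear scan for search (Delete is an ordinary linked-list removal and is omitted, as is error handling; the names are illustrative):

    #include <stdlib.h>

    #define M 13
    typedef struct Node { int key; struct Node *next; } Node;
    static Node *H[M];          /* M buckets; file scope, so initially all NULL */

    static unsigned h(int k) { return (unsigned)k % M; }

    void chain_insert(int k) {  /* O(1): push a node onto the head of its list */
        Node *x = malloc(sizeof *x);
        x->key  = k;
        x->next = H[h(k)];
        H[h(k)] = x;
    }

    Node *chain_search(int k) { /* time proportional to the length of the list */
        for (Node *p = H[h(k)]; p != NULL; p = p->next)
            if (p->key == k) return p;
        return NULL;
    }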

Collision Resolution by Chaining [Figure: a table H with slots 0 through m−1; two keys x and y with h(x) = h(y) = i are stored in a linked list hanging off slot i.] The hash function is computed mod m, so every key maps into H.

Efficiency of search with chaining Depends on the lengths of the linked lists, which depend on the table size, the number of dictionary elements, and the quality of the hash function. If the hash function distributes the n keys among the m cells of the hash table about evenly, each list will have about n/m keys. The ratio α = n/m is called the load factor. The average number of pointers inspected is 1 + α in unsuccessful searches and 1 + α/2 in successful searches.

Collision Resolution by Chaining Analysis is in terms of the load factor α = n/m, where n = # elements in the table and m = # slots in the table. The load factor is the average number of items per linked list. With chaining, we can have α > 1. If n = O(m), then α = n/m = O(m)/m = O(1), which means that searching takes constant time on average.

Average-Case Analysis for Search(x) with Collision Resolution by Chaining Assuming simple uniform hashing, let H[0...m−1] be the table, let n be the number of elements currently in H, and let α = n/m (the load factor of H). Average search time: Θ(1) to compute h(x), plus the average time to examine the list H[h(x)] for x = Θ(1) + Θ(average length of a list) = Θ(1) + Θ(n/m) = Θ(1 + α).

Collision Resolution by Open Addressing An alternative to chaining for handling collisions. Main idea:
- All elements are stored in the hash table array itself. No linked lists.
- Each array position contains either a key or NIL.
- Compute a hash function as in the chaining scheme, but if a collision occurs, successively index into (i.e., probe) H until we find x, or an empty slot for x.
The sequence in which slots are probed depends on both the key of the element and the probe increment.

Collision Resolution by Open Addressing In this method, the hash function takes the probe number (i.e., how many attempts have been made to find a slot for this key) as an extra argument.
- The probe sequence for key k is h(k,0), h(k,1), ..., h(k,m−1).
- In the worst case, every slot in the table will be examined, so stop either when the item with key k is found or when a slot containing NIL is found.
What might happen if we delete elements from H? Deletions must be handled by giving the slot a different status (e.g., DELETED rather than NIL); this is not as straightforward as in chaining.
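
A C sketch of open addressing with a DELETED status: search probes past deleted slots, while insert may reuse them (the linear probe function h is just one possible choice; keys are assumed non-negative so the sentinels cannot collide with real keys):

    #define M 13
    enum { EMPTY = -1, DELETED = -2 };  /* slot statuses; keys must be >= 0 */
    static int H[M];

    void oa_init(void) { for (int j = 0; j < M; j++) H[j] = EMPTY; }

    /* one possible probe function: linear probing on h'(k) = k mod M */
    static unsigned h(int k, int i) { return (unsigned)((k % M + i) % M); }

    int oa_insert(int k) {              /* returns the slot used, or -1 if full */
        for (int i = 0; i < M; i++) {
            unsigned j = h(k, i);
            if (H[j] == EMPTY || H[j] == DELETED) { H[j] = k; return (int)j; }
        }
        return -1;
    }

    int oa_search(int k) {              /* returns the slot holding k, or -1 */
        for (int i = 0; i < M; i++) {
            unsigned j = h(k, i);
            if (H[j] == k) return (int)j;
            if (H[j] == EMPTY) return -1;   /* stop at EMPTY; skip over DELETED */
        }
        return -1;
    }

    void oa_delete(int k) {             /* mark the slot rather than emptying it */
        int j = oa_search(k);
        if (j >= 0) H[j] = DELETED;
    }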

Collision Resolution by Open Addressing Ideally, we have the Uniform Hashing assumption: each key is equally likely to have any of the m! permutations (of the indices of H) as its probe sequence. Note: this is different from simple uniform hashing. Simple uniform hashing: each key hashes to each of the m slots with probability 1/m. Uniform hashing: each key hashes to each of the m! probe sequences with probability 1/m!. Problem: uniform hashing is hard to implement, so we approximate it with techniques that at least guarantee the probe sequence is a permutation of {0, 1, …, m−1}.

Open Addressing with Linear Probing Linear Probing: the simplest rehashing function (add 1 on each probe). The ith probe is h(k, i) = (h′(k) + i) mod m, where h′(k) is an ordinary hash function that tells us where to start the search; we then search sequentially through the table (with wraparound) from that starting point. How many distinct probe sequences are there? m: each starting point gives one probe sequence, and there are m starting points. Therefore this method is not uniform (uniform hashing needs m! probe sequences).

Linear Probing Example h(k, i) = (h(k) + i) mod m (i is the probe number, initially i = 0), with h(k) = k mod 13 and m = 13. Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in that order (this key sequence is reconstructed from the worked collisions on the following slides). If a collision occurs when j = h(k), we try next at A[(j+1) mod m], then A[(j+2) mod m], and so on; when an empty position is found, the item is inserted. Linear probing is easy to implement, but it leads to clustering (long runs of occupied slots), and clusters can lead to poor performance, both for inserting and for finding keys. How many collisions occur in this case? 11 in total; the sketch below replays the insertions.
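
A short C program that replays this example under the stated assumption about the key sequence, printing the final slot assignment and the collision count:

    #include <stdio.h>

    #define M 13

    int main(void) {
        /* key sequence reconstructed from the lecture's worked collisions */
        int keys[] = { 18, 41, 22, 44, 59, 32, 31, 73 };
        int n = (int)(sizeof keys / sizeof keys[0]);
        int H[M];
        int collisions = 0;
        for (int j = 0; j < M; j++) H[j] = -1;      /* -1 marks an empty slot */
        for (int t = 0; t < n; t++) {
            for (int i = 0; i < M; i++) {           /* linear probing */
                int j = (keys[t] % M + i) % M;      /* h(k,i) = (k mod 13 + i) mod 13 */
                if (H[j] == -1) { H[j] = keys[t]; break; }
                collisions++;                       /* occupied slot: probe again */
            }
        }
        for (int j = 0; j < M; j++)
            printf("slot %2d: %d\n", j, H[j]);
        printf("total collisions: %d\n", collisions);   /* prints 11 */
        return 0;
    }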

Open Addressing with Quadratic Probing Quadratic Probing: the ith probe is h(k, i) = (h′(k) + c1·i + c2·i²) mod m, where c1 and c2 are constants and h′(k) is an ordinary hash function that tells us where to start the search; later probes are offset by an amount quadratic in i (the probe number). How many distinct probe sequences are there? m: each starting point gives one probe sequence, and there are m starting points. Therefore, not uniform (uniform hashing needs m! probe sequences).

Quadratic Probing Example h(k, i) = (h(k) + c1·i + c2·i²) mod m, with h(k) = k mod 13, m = 13, c1 = 2, c2 = 3. Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in that order (the same reconstructed key sequence as before). How many collisions occur in this case? 44 % 13 = 5 (collision); next try: (5 + 2·1 + 3·1²) % 13 = 10. 31 % 13 = 5 (collision); next try: (5 + 2·1 + 3·1²) % 13 = 10 (collision); next try: (5 + 2·2 + 3·2²) % 13 = 21 % 13 = 8. 73 % 13 = 8 (collision); next try: (8 + 2·1 + 3·1²) % 13 = 0.

Open Addressing with Double Hashing Double Hashing: the ith probe is h(k, i) = (h1(k) + i·h2(k)) mod m, where h1(k) is an ordinary hash function that tells us where to start the search, and h2(k) is a second ordinary hash function that gives the offset for subsequent probes. Note: for the probe sequence to hit all slots of H, h2(k) must be relatively prime to m. How many distinct probe sequences are there? m²: there are m starting points, and each possible (h1(k), h2(k)) pair yields a distinct probe sequence, since the starting point and the offset can vary independently. Better, but still not uniform.
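
The three probe functions of this part of the lecture, transcribed directly into C (the ordinary hash values h′(k), h1(k), h2(k) are computed separately and passed in):

    /* Probe functions; m is the table size, i the probe number (0, 1, 2, ...). */
    unsigned linear_probe(unsigned hk, unsigned i, unsigned m) {
        return (hk + i) % m;                    /* h(k,i) = (h'(k) + i) mod m */
    }

    unsigned quadratic_probe(unsigned hk, unsigned i, unsigned m,
                             unsigned c1, unsigned c2) {
        return (hk + c1 * i + c2 * i * i) % m;  /* (h'(k) + c1*i + c2*i^2) mod m */
    }

    unsigned double_hash_probe(unsigned h1k, unsigned h2k,
                               unsigned i, unsigned m) {
        return (h1k + i * h2k) % m;             /* (h1(k) + i*h2(k)) mod m */
    }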

Double Hashing Example h1(k) = k mod m and h2(k) = k mod (m − 1), with m = 13; the ith probe is h(k, i) = (h1(k) + i·h2(k)) mod m, so h2 gives the offset to add. Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in that order (the same reconstructed key sequence as before). How many collisions occur in this case? 44 % 13 = 5 (collision); next try: (5 + 1·(44 % 12)) % 13 = 13 % 13 = 0. 31 % 13 = 5 (collision); next try: (5 + 1·(31 % 12)) % 13 = 12 % 13 = 12.

Analyzing Open Addressing Question: How large can the load factor be in a table using open addressing? Explain your answer.

Analyzing Open Addressing Lemma: Pr[at least i probes access occupied slots] ≤ α^i. Proof: Pr[at least i probes access occupied slots] = Pr[1st probed slot full] × Pr[2nd probed slot full | 1st full] × ... × Pr[ith probed slot full | first i−1 full] = (n/m) × ((n−1)/(m−1)) × ... × ((n−(i−1))/(m−(i−1))) ≤ (n/m)^i = α^i, since (n−j)/(m−j) ≤ n/m whenever n ≤ m.

Analyzing Open Addressing Assume uniform hashing (all m! probe sequences equally likely), and let α = n/m (the load factor); note that we need α ≤ 1, since the table cannot be overfilled. Theorem: if α < 1, then the expected number of probes in a search is at most 1/(1−α), assuming uniform hashing. Proof: in a search when α < 1 (the table is not full), some number of probes access occupied slots, and the last probe accesses an empty slot or the key being searched for.

Analyzing Open Addressing E(# probes) = ∑_{i=1}^∞ Pr{at least i probes} ≤ ∑_{i=0}^∞ α^i = 1/(1−α). Thus, for example:
- if the hash table is half full (α = 0.5), the average number of probes in a search is at most 1/(1−0.5) = 2;
- if the hash table is 90% full (α = 0.9), the average number of probes in a search is at most 1/(1−0.9) = 10.
This theorem predicts that, if α is bounded below 1 by a constant, a search runs in O(1) time.

Approach 1: Linked Lists Linked list implementation (the same lists later used per slot in chaining):
- Insert(x): add x at the head of the list
- Search(k): start at the head and scan the list
- Delete(x): start at the head, scan the list, and delete x if found
Running times (assume n elements in the list):
- Insert(x): O(1) time
- Search(k): worst case (element at the end of the list), n operations; average case (element in the middle), n/2 operations; best case (element at the head), 1 operation
- Delete(x): same as searching
We'd like O(1) time for all operations, but here two of the three take O(n). Space usage: O(n) space -- very space efficient.

Chaining wins!! In general, a good hash function with an array of linked lists outperforms the open-addressing approach. However, you may have to tailor the hash function to the requirements of your data set by experimentation; no single hash function is good for all data sets. Chained hash tables remain effective even when the number of table entries n is much higher than the number of slots. For example, a chained hash table with 1,000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1), but still 1,000 times faster than a plain sequential list, and possibly even faster than a balanced search tree.

End of Hash Table Lecture