
Tirgul 7

Goal: find an efficient implementation of a dynamic collection of elements with unique keys. Supported operations: Insert, Search and Delete. The keys belong to a universe of keys, U = {1, ..., M}. Direct Address Tables – an array of size M; the element with key i is mapped to cell i. Time complexity: O(1) per operation. What might be the problem? If the range of the keys is much larger than the number of elements, most of the space is wasted – the array might be too large to be practical.
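A direct-address table is simple enough to sketch in full (a minimal illustration in Python; the class name and 1-based indexing are my own choices):

```python
class DirectAddressTable:
    """Direct addressing: the element with key i occupies cell i of an array of size M."""

    def __init__(self, M):
        self.table = [None] * (M + 1)  # cells 1..M, cell 0 unused

    def insert(self, key, value):      # O(1)
        self.table[key] = value

    def search(self, key):             # O(1)
        return self.table[key]

    def delete(self, key):             # O(1)
        self.table[key] = None
```

With U = {1, ..., 2000}, this allocates 2001 cells even if we only ever store a handful of elements – exactly the space waste described above.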

Hash Tables
In a hash table we allocate an array of size m, which is much smaller than |U| (the size of the universe of keys). We use a hash function h() to determine the entry of each key. What property should a good hash function have? The crucial point: the hash function should “spread” the keys of U equally among the entries of the array. Examples of hash functions are given in the following slides.

Hash choices
Assume we want to hash strings for country-club management software; each string contains a full name (e.g. “Parker Anthony”). Suggested hash function: the ASCII code of the first character of the string (e.g. h(“Parker Anthony”) = ASCII of ‘P’ = 80). Is this a good hash function? No, since it is likely that many families register together; this means we will have many collisions even with few elements. We would like something “random looking”. The exact meaning of this will be explained next week; however, it is important to keep this point in mind when choosing a hash function.
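A tiny sketch (the names are invented) of why this function fails: every member of the same family lands in the same cell.

```python
def first_char_hash(name, m):
    """The suggested (bad) hash: the ASCII code of the first character, mod m."""
    return ord(name[0]) % m

# Members of one family, registering together:
family = ["Parker Anthony", "Parker Beth", "Parker Carol"]
cells = {first_char_hash(name, 100) for name in family}
# All three names collide in the same cell (ord('P') == 80).
```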

How to choose hash functions
Can we choose a hash function that will spread the keys uniformly among the cells? Unfortunately, we don’t know in advance which keys will arrive, so there can always be a ‘bad input’. We can, however, try to find a hash function that usually maps only a small number of keys to the same cell. Two common methods:
– The division method
– The multiplication method

The division method
Denote by m the size of the table. The division method is: h(k) = k mod m. What value of m should we choose? For example, if we expect about 2000 keys and want each search to take three operations on average, we choose a number close to 2000/3 ≈ 666. For technical reasons it is preferable to choose m prime (for a prime p, Z_p is a field). In our example we choose m = 701.
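In code, the division method is a one-liner (a sketch; m = 701 from the example above):

```python
def division_hash(k, m=701):
    """Division method: h(k) = k mod m, with m a prime (701 in the example)."""
    return k % m

# division_hash(2005) == 603, division_hash(701) == 0, division_hash(702) == 1
```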

The multiplication method
The multiplication method:
– Multiply k by a constant 0 < A < 1.
– Take the fractional part of kA.
– Multiply by m and take the floor.
– Formally, h(k) = ⌊m·(kA mod 1)⌋.
The multiplication method does not depend as much on m, since A helps randomize the hash function. Which values of A are good choices? Which are bad choices?

The multiplication method
A bad choice of A, for example: if m = 100 and A = 1/3, then for k = 10, h(k) = 33; for k = 11, h(k) = 66; and for k = 12, h(k) = 0. This is not a good choice of A, since we get only three values of h(k)... The optimal choice of A depends on the keys themselves. Knuth suggests that A = (√5 − 1)/2 ≈ 0.618 is likely to be a good choice.
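A sketch of the multiplication method with Knuth’s suggested constant (the function name is mine):

```python
import math

def multiplication_hash(k, m, A=(math.sqrt(5) - 1) / 2):
    """Multiplication method: h(k) = floor(m * frac(k * A)).

    The default A = (sqrt(5) - 1) / 2 ~= 0.618 is Knuth's suggestion.
    """
    frac = (k * A) % 1.0          # fractional part of k*A
    return int(m * frac)

# With the bad choice A = 1/3 and m = 100, h(10) == 33: only a few
# distinct values ever occur, as in the example above.
```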

What if keys are not numbers?
The hash functions we showed only work for numbers. When keys are not numbers, we should first convert them to numbers. How can we convert a string into a number? A string can be treated as a number in base 256: each character is a digit between 0 and 255. The string “key” is translated to ‘k’·256² + ‘e’·256 + ‘y’ = 107·65536 + 101·256 + 121 = 7,038,329.

Translating long strings to numbers
The disadvantage of the conversion is that a long string creates a large number: strings longer than 4 characters would exceed the capacity of a 32-bit integer. How can this problem be solved when using the division method? We can write the integer value of “word” as ((w*256 + o)*256 + r)*256 + d. When using the division method, the following facts can be used:
– (a + b) mod n = ((a mod n) + b) mod n
– (a * b) mod n = ((a mod n) * b) mod n
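A quick Python check (a sketch; the function names are mine) that reducing mod m after every step yields the same result as converting the whole string first:

```python
def string_to_int(s):
    """Treat s as a number in base 256 (each character is one digit)."""
    n = 0
    for ch in s:
        n = n * 256 + ord(ch)
    return n

def horner_mod(s, m):
    """Same value mod m, but reduced at every step so intermediates stay below 256*m."""
    h = 0
    for ch in s:
        h = (h * 256 + ord(ch)) % m
    return h
```

The two agree modulo m because of the (a+b) and (a*b) identities above, while `horner_mod` never builds a huge integer.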

Translating long strings to numbers
The expression we reach is:
(((((w*256 + o) mod m)*256 + r) mod m)*256 + d) mod m
Using the properties of mod, we get this simple algorithm:

int hash(String s, int m) {
    int h = s[0]
    for (i = 1; i < s.length; i++)
        h = ((h * 256) + s[i]) mod m
    return h
}

Notice that h is always smaller than m, so the intermediate values stay small; this also improves the performance of the algorithm.

The Birthdays Paradox
There are 28 people in a room. Assuming that birthdays are distributed uniformly over the year, how many pairs of people sharing a birthday should you expect? Surprisingly, the expected number of such pairs is approximately 1. Analysis: there are n people; the probability that a given pair shares a birthday is 1/365; the number of pairs is C(n,2) = n(n−1)/2. So, by linearity of expectation, the expected number of pairs with the same birthday is n(n−1)/(2·365); for n = 28 this is 28·27/730 ≈ 1.04.
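The computation itself is one line (by linearity of expectation; a sketch):

```python
def expected_shared_pairs(n, days=365):
    """Expected number of pairs sharing a birthday: C(n, 2) * (1 / days)."""
    return n * (n - 1) / 2 / days

# For 28 people: 28 * 27 / 2 = 378 pairs, and 378 / 365 is about 1.04.
```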

Collisions
Can several keys be mapped to the same entry? Yes, since U is much larger than m. Collision – several elements are mapped into the same cell. By an analysis similar to the birthday paradox, collisions are likely even in a sparsely filled table. When are collisions more likely to happen? When the hash table is almost full. We define the “load factor” as α = n/m, where n is the number of keys in the hash table and m is the size of the table. How can we resolve collisions?

Chaining
There are two approaches to handling collisions:
– Chaining.
– Open addressing.
Chaining: each entry in the table is a linked list, which holds all the keys that are mapped to that entry. A search operation on a hash table that uses chaining takes expected O(1 + α) time.

Chaining
This complexity is calculated under the assumption of uniform hashing. Notice that in the chaining method the load factor may be greater than one. Analysis:
Insert – O(1).
Unsuccessful search – calculating the hash value: O(1); scanning the linked list: expected O(α). Total: O(1 + α).
Successful search – let x be the i-th element inserted into the hash table. When searching for x, the expected number of other elements scanned is (i−1)/m. Averaged over all elements, the expected number of scanned elements is O(α). Total: O(1 + α).
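A chained hash table can be sketched as follows (Python lists stand in for the linked lists; the class and method names are mine):

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, m=701):
        self.m = m
        self.cells = [[] for _ in range(m)]

    def _cell(self, key):
        return self.cells[key % self.m]       # division method

    def insert(self, key, value):             # O(1) plus duplicate check
        cell = self._cell(key)
        for i, (k, _) in enumerate(cell):
            if k == key:                      # keys are unique: overwrite
                cell[i] = (key, value)
                return
        cell.append((key, value))

    def search(self, key):                    # expected O(1 + alpha)
        for k, v in self._cell(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        cell = self._cell(key)
        for i, (k, _) in enumerate(cell):
            if k == key:
                del cell[i]
                return
```

With m = 11, the keys 4 and 15 collide in cell 4, yet both remain searchable – the load factor can even exceed one.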

Open addressing
In this method the table itself holds all the keys. We change the hash function to receive two parameters: the first is the key, the second is the probe number. We first try to place the key at entry h(k,0) of the table; if that fails, we try h(k,1), and so on.

Open addressing
It is required that (h(k,0), ..., h(k,m−1)) be a permutation of (0, ..., m−1), so after at most m probes we will definitely find a place for k (unless the table is full). Notice that here the load factor cannot exceed one. There is a problem with deleting keys. What is it? How can it be solved?

Open addressing
While searching for key i and reaching an empty slot, we don’t know if:
– key i does not exist in the table, or
– key i does exist in the table, but at the time it was inserted this slot was occupied, and we should continue our search.
A common solution is to mark deleted slots with a special “deleted” value: a search skips over such slots, while an insert may reuse them.
What functions can we use to implement open addressing? We will discuss two:
– linear probing
– double hashing

Open addressing
Linear probing – h(k,i) = (h(k) + i) mod m.
– The problem: primary clustering.
– If several consecutive slots are occupied, the next free slot has a high probability of being filled next.
– Search time increases as large clusters are created.
– Primary clustering stems from the fact that there are only m distinct probe sequences.
– How can we change the hash function to avoid clusters?

Open addressing
Double hashing – h(k,i) = (h1(k) + i·h2(k)) mod m.
– Better than linear probing.
– It is best if, for all k, h2(k) is relatively prime to m, so that the probe sequence visits the whole table.
– That way each pair (h1(k), h2(k)) gives a different probe sequence, for Θ(m²) different sequences instead of m.

Performance
Insertion and unsuccessful search: 1/(1−α) probes on average. Intuition: we always probe at least one cell. With probability α this cell is occupied; with probability ≈ α² (i.e., n/m · (n−1)/m) the next cell is also occupied, and so on. Expected number of probes ≈ 1 + α + α² + α³ + … = 1/(1−α).
Successful search: (1/α)·ln(1/(1−α)) probes on average. Idea: when searching for x, the number of probes is the same as when x was inserted (it depends on the number of elements inserted before x). For the derivation of the formula, see CLRS. For example:
– If the table is 50% full, a successful search takes about 1.4 probes on average.
– If the table is 90% full, it takes about 2.6 probes on average.

Example for Open Addressing
Let’s look at an example. First assume we use linear probing, and insert the numbers 4, 28, 6, 38, 26, with m = 11 and h(k) = k mod m.
Index:  0   1   2   3   4   5    6    7   8    9   10
Value: [ ] [ ] [ ] [ ] [4] [38] [28] [6] [26] [ ] [ ]
Note the clustering effect on 26: its home slot 4 is taken, and it must probe past 38, 28 and 6 before settling in slot 8.
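The example above can be replayed in code (a sketch; insert-only, no deletion):

```python
def linear_probe_insert(table, k):
    """Insert k with linear probing, h(k, i) = (k + i) mod m; return the slot used."""
    m = len(table)
    for i in range(m):
        slot = (k + i) % m
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("table is full")

table = [None] * 11
for k in (4, 28, 6, 38, 26):
    linear_probe_insert(table, k)
# 26 hashes to slot 4 but lands in slot 8, at the end of the cluster.
```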

Example for Open Addressing
Now let’s try double hashing, again with m = 11. Use h1(k) = k mod m, h2(k) = 1 + (k mod (m−1)), h(k,i) = (h1(k) + i·h2(k)) mod m, and insert the numbers 4, 28, 6, 38, 26.
Index:  0    1   2   3   4   5    6    7   8   9   10
Value: [26] [ ] [6] [ ] [4] [38] [28] [ ] [ ] [ ] [ ]
The clustering effect disappeared.
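The same keys under double hashing (a sketch; insert-only):

```python
def double_hash_insert(table, k):
    """Insert k with double hashing, h(k, i) = (h1(k) + i*h2(k)) mod m; return the slot."""
    m = len(table)
    h1 = k % m
    h2 = 1 + (k % (m - 1))
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("table is full")

table = [None] * 11
for k in (4, 28, 6, 38, 26):
    double_hash_insert(table, k)
# 6 and 26 both collide on their first probe, but each jumps by its own
# step h2(k) = 7, landing in slots 2 and 0 instead of extending a cluster.
```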

When should hash tables be used
Hash tables are very useful for implementing dictionaries when we have no order on the elements, or when we have an order but need only the standard operations. On the other hand, hash tables are less useful when we have an order and need more than the standard operations – for example last(), or an iterator over all the elements, which is problematic when the load factor is very low (most cells are empty).

When should hash tables be used
We should have a good estimate of the number of elements we need to store. For example, the Hebrew University (HUJI) has about 30,000 students each year, yet the student records still form a dynamic database. Re-hashing: if we don’t know the number of elements a priori, we may need to perform re-hashing – increasing the size of the table and re-inserting all the elements.
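Re-hashing for a chaining table can be sketched as follows (assuming the table is a list of lists of (key, value) pairs, as in the chaining slides; the function name is mine):

```python
def rehash(old_table, new_m):
    """Allocate a larger table and re-insert every element (division method)."""
    new_table = [[] for _ in range(new_m)]
    for cell in old_table:
        for key, value in cell:
            new_table[key % new_m].append((key, value))
    return new_table
```

Typically new_m is chosen as roughly twice the old size (rounded to a prime), so the O(n) cost of re-hashing amortizes to O(1) per insert.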