CPSC 252 Hashing Page 1

Hashing

We have already seen that we can search for a key item in an array using either linear or binary search. It would be better if we could use the key to index directly into the array. Let's consider some fairly simple examples.

Example: Suppose we are running a club and want to keep data (such as telephone number, address, etc.) for each member. If membership numbers are in the range 1..N, we can store data for member j in slot j-1 of an array. Hence the membership number (the key) can be used to index into the array and access the member's data (the value) in O(1) time.
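A minimal sketch of this idea follows; the MemberData fields, the table size N and the function names are illustrative, not part of the original example:

    #include <string>

    struct MemberData {
        std::string name;
        std::string telephone;
        std::string address;
    };

    const int N = 500;                 // assumed maximum membership number
    MemberData members[ N ];           // slot j-1 holds data for member j

    // Both operations are O(1): the key (membership number) tells us
    // exactly which slot to use.
    void setMember( int membershipNumber, const MemberData& data )
    {
        members[ membershipNumber - 1 ] = data;
    }

    MemberData getMember( int membershipNumber )
    {
        return members[ membershipNumber - 1 ];
    }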

CPSC 252 Hashing Page 2

Example: Suppose we want to count the frequency with which characters appear in a file. We can use the character's ASCII code (the key) to index into an array of frequencies (the values). In this case, our array of frequencies will have dimension 256, one slot for each of the different characters. It could be argued that not all files will contain occurrences of all these different characters, so we are potentially wasting memory. This is true, but in return we gain the benefit of accessing frequencies in O(1) time.

Suppose we make the call:

    int freq = getFrequency( 'A' );

Since the ASCII code for 'A' is 65, the getFrequency function will return the frequency found at array index 65.
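A small sketch of this frequency table; the file name is hypothetical and the counter array is simply indexed by the character's code:

    #include <fstream>
    #include <iostream>

    int frequency[ 256 ] = { 0 };            // one counter per possible character code

    int getFrequency( unsigned char c )
    {
        return frequency[ c ];               // direct O(1) lookup: the code is the index
    }

    int main()
    {
        std::ifstream file( "input.txt" );   // hypothetical input file
        char c;
        while( file.get( c ) )
            ++frequency[ static_cast<unsigned char>( c ) ];

        std::cout << "'A' occurs " << getFrequency( 'A' ) << " times" << std::endl;
        return 0;
    }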

CPSC 252 Hashing Page 3

In these first couple of examples, it has been easy to see how a key can be constructed that can be used to index into an array. In general the process is not so straightforward...

Example: Suppose we want to keep track of VISA accounts. VISA card numbers are 16 digits long, allowing for a total of 10^16 different numbers. If we were to dimension an array of accounts of size 10^16, we would have far more slots than there are people on the planet. (Not to mention the difficulty of finding a machine with enough memory...) It is obviously not reasonable to consider using the VISA number to index directly into an array.

CPSC 252 Hashing Page 4

Example: What if the key is a string? National Insurance numbers in the UK, for example, are alphanumeric. We cannot use a string directly to index into an array.

Hashing solves these problems. We define a function that transforms a key into an array index; such a function is called a hash function.

CPSC 252 Hashing Page 5

Hash Functions

When we design a hash function it should:

- be easy to compute. Every time we want to retrieve a value we have to invoke the hash function to map the key into a corresponding array index, so we would lose some of the benefit of O(1) access if the hash function were very inefficient.

- distribute key values evenly across the range of array indices. Mapping all key values to one particular index makes for a very easy hash function, but then every key collides with every other. We will discuss collisions a little later on.

CPSC 252 Hashing Page 6

Strategies for building hash functions

Truncation: use only part of the key as the hash index. For example, if we want to build a table of student records at UBC, rather than using the entire student number (8 digits long) as the key value, we could use only the first 5 digits. We would therefore require an array of student records dimensioned to have size 100,000. Note that some students may have the first 5 digits of their student number in common, so collisions could occur (more on this later).
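A sketch of truncation under these assumptions (8-digit student numbers, keeping the first 5 digits as the index):

    const int TABLE_SIZE = 100000;     // one slot per possible 5-digit prefix

    int hash( int studentNumber )
    {
        return studentNumber / 1000;   // drop the last 3 of the 8 digits
    }

For example, a hypothetical student number 12345678 would hash to index 12345.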

CPSC 252 Hashing Page 7

Strategies for building hash functions

Folding: partition the key into parts and combine the parts using arithmetic operations. For example, we could take a VISA number and partition it into one 6-digit number and two 5-digit numbers. We could then add these three numbers together to come up with a hash index.
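A sketch of folding along these lines; the exact partition of the 16 digits is just one possibility:

    long long hash( long long visaNumber )
    {
        long long part1 = visaNumber % 100000;               // last 5 digits
        long long part2 = ( visaNumber / 100000 ) % 100000;  // middle 5 digits
        long long part3 = visaNumber / 10000000000LL;        // leading 6 digits
        return part1 + part2 + part3;                        // combine the parts
    }

The sum is at most 99,999 + 99,999 + 999,999 = 1,199,997, so a final % tableSize would still be applied to bring the result into the required range.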

CPSC 252 Hashing Page 8

Modular arithmetic: use the modulus operator to produce a hash index in a required range. For example, suppose a company having about 150 employees wants to use an employee's social insurance number as a key value. We might dimension the array of employee data to be 200 (leaving room for expansion). We can then generate a hash index from the social insurance number as follows:

    hashIndex = SINumber % 200;

Again we would expect that this hash function could lead to collisions.

CPSC 252 Hashing Page 9

Note: in general we must be sure to use the most significant parts of a key, i.e. the parts that actually distinguish one item from another. Suppose for example that a company assigns employee ID numbers as follows: bpqrcs, where b is a branch number, c is a check digit and p, q, r, s are particular to the employee. We want to construct a hash function that extracts digits pqrs from the employee ID:

    int hash( int key )
    {
        int hashCode;
        hashCode = ( ( key / 100 ) % 1000 );       // digits pqr
        hashCode = hashCode * 10 + ( key % 10 );   // append digit s
        return hashCode;
    }
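To see what this computes, take a hypothetical ID 387265 (b = 3, pqrs = 8725, c = 6): key / 100 = 3872, 3872 % 1000 = 872, and 872 * 10 + ( 387265 % 10 ) = 8725, so the function returns pqrs as intended.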

CPSC 252 Hashing Page 10

Non-integer keys and alphanumeric keys

Suppose we have a table dimensioned to hold 5000 records and keys that consist of strings that are 6 characters long. We can apply numeric operations to the ASCII codes of the characters in the string in order to determine a hash index:

    int hash( unsigned char* key )
    {
        int hashCode = 0;
        int index = 0;
        while( key[ index ] != '\0' )
            hashCode += int( key[ index++ ] );   // sum the character codes
        return hashCode % 5000;
    }

This function returns hash codes in the range 0 to 1530 (6 * 255), so only slots 0 to 1530 of the array will ever be used!

CPSC 252 Hashing Page 11

The hash function on the previous slide fails to distribute key values uniformly across the available range of indices. We obviously want to increase the range of hash codes that are produced...

    while( key[ index ] != '\0' )
        hashCode = 2 * hashCode + int( key[ index++ ] );

Now, before the % operation, the hashCode is in the range 0 to 16,065 (63 * 255). When we apply the % operation we therefore end up with hash codes in the range 0 to 4999.
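Putting the pieces together, the complete revised function would look something like this (same signature and table size as on the previous slide):

    int hash( unsigned char* key )
    {
        int hashCode = 0;
        int index = 0;
        while( key[ index ] != '\0' )
            hashCode = 2 * hashCode + int( key[ index++ ] );   // doubling spreads the codes out
        return hashCode % 5000;                                // reduce to a valid index
    }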

CPSC 252 Hashing Page 12

Collision Resolution

Many of the hash functions that we have seen will result in more than one key mapping to the same index. This is known as a collision. A collision resolution strategy is an algorithm for specifying what to do when a collision occurs.

Linear Probing

Suppose we want to insert an item into the table but the slot at the hashed index is already occupied. The linear probing algorithm advances the index one slot at a time until an empty slot is found. If we reach the end of the array, we wrap around to the beginning and continue the search.

Example: suppose we have 10 items, each having a 2-digit key. We define the hash code as follows:

    hashCode = key % 10; // use last digit of key

CPSC 252 Hashing Page 13

Let's consider how an array of dimension 10 will be filled if the following key values are used to insert data into the table in the order presented:

    Key:       32  47  26  34  87  39  78  61  48  66
    Hash:       2   7   6   4   7   9   8   1   8   6
    # probes:   1   1   1   1   2   1   3   1   6  10

Note that as the table starts to fill, the number of probes increases. In the last case, we perform O(N) probes and so we appear to have lost the benefits of hashing.
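A sketch of the insertion logic behind this trace follows; a key value of -1 marks an empty slot, and no handling is included for a completely full table:

    #include <iostream>

    const int TABLE_SIZE = 10;
    int table[ TABLE_SIZE ];

    // Insert key using linear probing; returns the number of probes performed.
    int insert( int key )
    {
        int index = key % TABLE_SIZE;             // hash: use last digit of key
        int probes = 1;
        while( table[ index ] != -1 )             // slot occupied: move to the next one
        {
            index = ( index + 1 ) % TABLE_SIZE;   // wrap around at the end of the array
            ++probes;
        }
        table[ index ] = key;
        return probes;
    }

    int main()
    {
        for( int i = 0; i < TABLE_SIZE; i++ )
            table[ i ] = -1;

        int keys[] = { 32, 47, 26, 34, 87, 39, 78, 61, 48, 66 };
        for( int i = 0; i < 10; i++ )
            std::cout << keys[ i ] << " inserted after "
                      << insert( keys[ i ] ) << " probe(s)" << std::endl;
        return 0;
    }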

CPSC 252 Hashing Page 14

A major drawback to linear probing is clustering. This occurs when occupied slots cluster together in groups; this condition is known as primary clustering. Once clustering has started to occur, the number of probes needed to find an empty slot starts to increase. It can be determined empirically that:

- when a table is 10% full there is no clustering
- when a table is 30-50% full, clusters start to form
- when a table reaches 70% full, clusters start colliding with each other, resulting in secondary clustering

By reducing clustering, we reduce the number of probes necessary to find an empty slot when a collision occurs. The results above suggest that with linear probing we should keep the table no more than about half full.

CPSC 252 Hashing Page 15

We now see that the time efficiency gained from hashing comes at the cost of space efficiency. If we want to create a table of 10,000 employee records, we should dimension the array to have size 20,000 so that the array is always at most half full.

There are other drawbacks to linear probing:

- it is very expensive to expand the table if it starts to become full. We have to create a larger array and then re-hash and re-insert all of the items into the new table in order to re-distribute them uniformly throughout the table.

- when the table reaches about 75% full, the search time degrades to O(N), at which point we might as well revert to a linear search (array or list), which is much simpler to implement.

CPSC 252 Hashing Page 16

Deleting an item from a hash table

When linear probing is used to insert items into a hash table, we have to be careful about how we remove items from that table.

Suppose that a sequence of keys is used to insert data into a table in the order given, and that the hash code is generated as:

    hashCode = key % 10000;

so we use only the last four digits of each key. We will assume that we can mark each slot in the array as either empty or occupied.

CPSC 252 Hashing Page 17

Partial picture of hash table: slots 4473 and 4474 are occupied, and one of the inserted entries hashes to index 4473 but, because of collisions, is stored in a later slot.

To search for that entry, we start at location 4473 and probe forward until we either find the item or we reach an empty slot, whichever happens first.

Now suppose that we delete the entry stored in slot 4474 by marking that slot as empty. If we again try to search for the other entry using the strategy above, the search will end at slot 4474 (which is now empty) and we will fail to find an entry that is still in the table.

CPSC 252 Hashing Page 18

One way to address this problem is to assign to each location a flag which can have one of three values: empty, occupied or dirty. Initially, every location is marked as empty. When an item is inserted into a location, that location is marked as occupied. When an item is deleted, we mark the location as dirty.

Now, when we search for an item, we treat all the dirty locations as occupied, and hence our search continues until we find the item or reach a slot that is marked as empty. However, when we want to insert an item, we treat all the dirty locations as empty.
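A sketch of how the three-state flags might be used for search, insertion and deletion; the names, table size and integer keys are illustrative, the hash is the key % 10000 example from above, and no handling is included for a full table:

    enum SlotState { EMPTY, OCCUPIED, DIRTY };

    const int TABLE_SIZE = 10000;
    int keys[ TABLE_SIZE ];
    SlotState state[ TABLE_SIZE ];                // globals start out as EMPTY

    // Search: a DIRTY slot is treated as occupied, so we keep probing past it;
    // only an EMPTY slot (or a full scan of the table) ends an unsuccessful search.
    int find( int key )
    {
        int index = key % TABLE_SIZE;
        for( int probes = 0; state[ index ] != EMPTY && probes < TABLE_SIZE; probes++ )
        {
            if( state[ index ] == OCCUPIED && keys[ index ] == key )
                return index;                     // found
            index = ( index + 1 ) % TABLE_SIZE;
        }
        return -1;                                // not found
    }

    // Insert: a DIRTY slot is treated as empty, so it can be reused.
    void insert( int key )
    {
        int index = key % TABLE_SIZE;
        while( state[ index ] == OCCUPIED )
            index = ( index + 1 ) % TABLE_SIZE;
        keys[ index ] = key;
        state[ index ] = OCCUPIED;
    }

    // Delete: mark the slot as dirty rather than empty.
    void remove( int key )
    {
        int index = find( key );
        if( index != -1 )
            state[ index ] = DIRTY;
    }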

CPSC 252 Hashing Page 19

Partial picture of hash table: on deleting the entry in slot 4474, we mark that slot as dirty. When we search for the item that hashes to 4473 but is stored beyond slot 4474, on finding the dirty slot we keep searching. We stop searching when we find the item, when we find a slot that is marked as empty, or when we have searched the entire array. If we now try to insert a new item, we insert it in the first slot that is marked either empty or dirty.