
Hash Tables

Introduction A hash table is a data structure that stores items and allows insertions, lookups, and deletions to be performed in O(1) time. An algorithm converts an item's key, typically a string, to a number. Then the number is compressed according to the size of the table and used as an index. There is the possibility of distinct items being mapped to the same index. This is called a collision and must be resolved.

Key  Hash Code Generator  Number  Compression  Index Smith  Bob Smith 123 Main St. Orlando, FL

Introduction A hash table offers very fast insertion and searching, almost O(1). It is relatively easy to program compared to trees. However, it is based on arrays, which are difficult to expand, and there is no convenient way to visit the items in a hash table in any kind of order.

Hashing A range of key values can be transformed into a range of array index values. A simple array can be used where each record occupies one cell of the array and the index number of the cell is the key value for that record. But keys may not be so well arranged; in such situations a hash table can be used.

Employee Database Example A small company with, say, 1,000 employees gives every employee a number from 1 to 1,000. What sort of data structure should you use in this situation? Index Numbers As Keys One possibility is a simple array. Each employee record occupies one cell of the array, and the index number of the cell is the employee number for that record. It takes O(1) time to perform insert, remove, and search. Not Always So Orderly The speed and simplicity of data access make this very attractive. However, this example works only because the keys are unusually well organized. The ideal case is unrealistic!
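The index-numbers-as-keys idea can be sketched in a few lines of Python. The class name, employee numbers, and record strings below are illustrative, not from the slides; the point is that every operation is a single array access.

```python
class EmployeeTable:
    """Direct-access table: employee number n lives at index n."""

    def __init__(self, max_employees=1000):
        # Slot 0 is unused so that numbers 1..1000 map directly to indices.
        self.slots = [None] * (max_employees + 1)

    def insert(self, emp_number, record):
        self.slots[emp_number] = record      # one array access: O(1)

    def find(self, emp_number):
        return self.slots[emp_number]        # one array access: O(1)

    def remove(self, emp_number):
        self.slots[emp_number] = None        # one array access: O(1)


table = EmployeeTable()
table.insert(72, "Bob Smith, 123 Main St.")
```

This only works because the keys are dense and well organized; the rest of the chapter deals with keys that are not.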

A Dictionary Most applications have key ranges that are too large to have a 1-1 mapping between array indices and keys! The classic example where keys are not so orderly is a dictionary. To put every word of an English-language dictionary, from a to zyzzyva (yes, it's a word), into your computer's memory so they can be accessed quickly, a hash table is a good choice. A similar widely used application is in computer-language compilers, which maintain a symbol table in a hash table. The symbol table holds all the variable and function names along with their addresses in memory. The program needs to access these names very quickly, so a hash table is the preferred data structure.

The Dictionary Example Suppose we want to store a 50,000-word dictionary in main memory so that every word occupies its own cell in a 50,000-cell array and can be accessed using an index number. What we need is a system for turning a word into an appropriate index number.

Converting Words to Numbers Two approaches: 1. Adding the digits: Add the code numbers for each character (a = 1, b = 2, …, z = 26). E.g. cats: c = 3, a = 1, t = 20, s = 19, which gives 43. If we restrict ourselves to 10-letter words, the first word, a, would be coded 1, and the last potential word, zzzzzzzzzz, would be coded 26+26+26+26+26+26+26+26+26+26 = 260. Thus, the total range of word codes is from 1 to 260. But 50,000 words exist, so there are not enough index numbers to go around; each array element would need to hold about 192 words (50,000 / 260). Problem: too many words have the same index. (For example, was, tin, give, tend, moan, tick, bails, dredge, and hundreds of other words add to 43, as cats does.)
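The digit-adding scheme is a one-liner; this sketch (function name is illustrative) makes the collision problem easy to see, since unrelated words sum to the same value:

```python
def add_digits(word):
    # Code each lowercase letter a=1 .. z=26 and sum the codes.
    # "cats" -> 3 + 1 + 20 + 19 = 43.
    return sum(ord(ch) - ord('a') + 1 for ch in word)
```

Both `add_digits("was")` and `add_digits("tin")` also give 43, confirming the collision problem the slide describes.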

Converting Words to Numbers 2. Multiplying by powers: Decompose a word into its letters, convert the letters to numerical equivalents, multiply by appropriate powers of 27, and add the results. E.g. to convert the word cats to a number: 3·27³ + 1·27² + 20·27¹ + 19·27⁰ = 60,337. The largest 10-letter word, zzzzzzzzzz, converts to more than 7,000,000,000,000. An array stored in memory can't possibly have this many elements. This scheme assigns an array element to every potential word, yet only a small fraction of these cells are necessary for real words, so most array cells are empty.
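The powers-of-27 scheme is naturally computed in Horner form, which is equivalent to summing letter·27^position terms. A minimal sketch (function name is illustrative):

```python
def word_to_number(word):
    # Horner's rule: value = ((c0*27 + c1)*27 + c2)*27 + ...  which equals
    # sum of letter_code * 27^position, with a=1 .. z=26.
    value = 0
    for ch in word:
        value = value * 27 + (ord(ch) - ord('a') + 1)
    return value
```

For "cats" this yields 60,337, matching the slide's calculation, and for zzzzzzzzzz it exceeds 7 trillion, showing why the raw value cannot be used as an array index.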

Hash Function We need to compress the huge range of numbers: arrayIndex = hugeNumber % smallRange; This is a hash function. It hashes a number in a large range into a number in a smaller range, corresponding to the index numbers in an array. An array into which data is inserted using a hash function is called a hash table.

Collisions Two words can hash to the same array index, resulting in collision. Open Addressing: Search the array in some systematic way for an empty cell and insert the new item there if collision occurs. Separate chaining: Create an array of linked list of words, so that the item can be inserted into the linked list if collision occurs.

Open Addressing Three methods to find next vacant cell: Linear Probing :- Search sequentially for vacant cells, incrementing the index until an empty cell is found. This is called linear probing because it steps sequentially along the line of cells. The number of steps it takes to find an empty cell is the probe length. Clustering is a problem occurring in linear probing. As the array gets full, clusters grow larger, resulting in very long probe lengths. Performance degrades seriously as the clusters grow larger and larger. Array can be expanded if it becomes too full.
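Linear probing can be sketched in a few lines of Python. The function returns the probe length, the number of cells examined before a vacant one is found (names are illustrative; a real implementation would also guard against a completely full table):

```python
def insert_linear(table, key):
    """Insert key into a fixed-size list (None marks an empty cell).

    Returns the probe length: how many cells were examined.
    """
    n = len(table)
    index = key % n
    probes = 1
    while table[index] is not None:
        index = (index + 1) % n   # step sequentially along the line of cells
        probes += 1
    table[index] = key
    return probes


table = [None] * 10
insert_linear(table, 7)    # hashes to cell 7, probe length 1
insert_linear(table, 17)   # also hashes to cell 7; collision, lands in cell 8
```

With 7 and 17 both hashing to cell 7, the second insertion illustrates how a collision lengthens the probe and how occupied cells begin to cluster.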

Quadratic Probing The ratio of the number of items in a table to the table’s size is called the load factor. load factor = nItems / arraySize; Even when load factor isn’t high, clusters can form. Quadratic probing is an attempt to keep clusters from forming. The idea is to probe more widely separated cells, instead of those adjacent The step is the square of the step number. In a linear probe, if the primary hash index is x, subsequent probes go to x+1, x+2, x+3, and so on. In quadratic probing, probes go to x+1, x+4, x+9, x+16, x+25, and so on. Eliminates primary clustering, but all the keys that hash to a particular cell follow the same sequence in trying to find a vacant cell (secondary clustering).
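The quadratic probe sequence described above (offsets 1, 4, 9, 16, 25, …) can be generated directly; this sketch (function name is illustrative) lists the first few cells a key would visit:

```python
def probe_sequence(key, table_size, max_probes=5):
    # Step k lands at (x + k*k) % table_size, where x is the primary index.
    # Step 0 is the primary hash index itself.
    x = key % table_size
    return [(x + step * step) % table_size for step in range(max_probes + 1)]
```

For key 7 in a 100-cell table the sequence is 7, 8, 11, 16, 23, 32: the probes spread out quickly instead of scanning adjacent cells, which is what breaks up primary clusters.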

Double Hashing Secondary clustering is not a serious problem, but quadratic probing is not often used because there's a slightly better solution. The idea is to generate probe sequences that depend on the key instead of being the same for every key. That is: hash the key a second time, using a different hash function, and use the result as the step size. The step size remains constant throughout a probe, but it is different for different keys. The secondary hash function should not be the same as the primary hash function, and it must never output zero. stepSize = constant − (key % constant); where constant is a prime smaller than the array size. Double hashing requires that the size of the hash table is a prime number.
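A minimal double-hashing sketch, using the stepSize formula from the slide with an assumed constant of 5 and an 11-cell (prime-sized) table; the function names are illustrative:

```python
def step_size(key, constant=5):
    # Secondary hash: constant is a prime smaller than the table size.
    # Result is in 1..constant, so it can never be zero.
    return constant - (key % constant)


def insert_double(table, key):
    """Insert key using double hashing; table size should be prime."""
    n = len(table)
    index = key % n
    step = step_size(key)          # key-dependent step, fixed for this probe
    while table[index] is not None:
        index = (index + step) % n
    table[index] = key
    return index


table = [None] * 11
insert_double(table, 7)    # lands in cell 7
insert_double(table, 18)   # also hashes to cell 7; steps by 2 to cell 9
```

Because 7 and 18 collide at cell 7 but have different step sizes (3 and 2), their probe sequences diverge, which is exactly what defeats secondary clustering.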

In general, double hashing is the probe sequence of choice when open addressing is used.

Separate Chaining In open addressing, collisions are resolved by looking for an open cell in the hash table. A different approach is to install a linked list at each index in the hash table. A data item’s key is hashed to the index in the usual way, and the item is inserted into the linked list at that index. Other items that hash to the same index are simply added to the linked list There’s no need to search for empty cells in the primary array. Figure on next slide shows how separate chaining looks.

Example of separate chaining

Separate Chaining Simpler than the various probe schemes used in open addressing. However, the code is longer because it must include the mechanism for the linked lists. In separate chaining it's normal to put N or more items into an N-cell array; thus, the load factor (ratio of the number of items in a hash table to its size) can be 1 or greater. This is no problem; some locations will simply contain two or more items in their lists. Of course, if there are many items on the lists, access becomes slower. Finding the initial cell takes fast O(1) time, but searching through a list takes O(M) time (M = number of items in the linked list).

Separate Chaining Thus, the lists should not become too full. Deletion poses no problems. Table size need not be a prime number. Buckets Another approach similar to separate chaining is to use an array at each location in the hash table, instead of a linked list. Such arrays are sometimes called buckets. This approach is not as efficient as the linked list approach
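A separate-chaining table can be sketched compactly in Python; here an ordinary list stands in for the linked list at each cell (the class and method names are illustrative):

```python
class ChainedHashTable:
    """Hash table with a list per cell (standing in for a linked list)."""

    def __init__(self, size=10):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return key % len(self.buckets)        # hash the key to a cell

    def insert(self, key):
        # Colliding keys simply join the list at their cell; no need to
        # search the primary array for an empty cell.
        self.buckets[self._index(key)].append(key)

    def find(self, key):
        # O(1) to find the cell, then O(M) to scan its list of M items.
        return key in self.buckets[self._index(key)]

    def delete(self, key):
        bucket = self.buckets[self._index(key)]
        if key in bucket:
            bucket.remove(key)                # deletion poses no problems
```

Note how `delete` just removes the item from its list, with none of the special "deleted" markers that open addressing needs.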

Hash Functions A good hash function is simple and can be computed quickly. Speed degrades if hash function is slow. Purpose is to transform a range of key values into index values such that the key values are distributed randomly across all the indices of the hash table. Keys may be completely random or not so random.

Random Keys A perfect hash function maps every key into a different table location (as in the employee-number example). In most cases the hash function will need to compress a larger range of keys into a smaller range of index numbers. Distribution of key values in a particular database determines what the hash function needs to be. For random keys: index = key % arraySize;

Non-random Keys However, data is often distributed non-randomly. Imagine a database that uses car-part numbers as keys, where every digit of the part number serves a purpose. Don't Use Non-Data: The key fields should be squeezed down until every bit counts. For example, if the last 3 digits are for error checking, they are redundant and shouldn't be considered. Use All the Data: Every part of the key (except non-data) should contribute to the hash function. Use a prime number for the modulo base.

Folding Another reasonable hash function is Folding: Break the key into groups of digits and add the groups. This ensures that all the digits influence the hash value. The number of digits in a group should correspond to the size of the array. That is, for an array of 1,000 items, use groups of three digits each.

Folding For example, suppose you want to hash nine-digit Social Security numbers for linear probing. If the array size is 1,000, you would divide the nine-digit number into three groups of three digits. If a particular SSN was 123-45-6789, you would calculate a key value of 123 + 456 + 789 = 1368. You can use the % operator to trim such sums so the highest index is 999. In this case, 1368 % 1000 = 368. If the array size is 100, you would need to break the nine-digit key into four two-digit numbers and one one-digit number: 12 + 34 + 56 + 78 + 9 = 189, and 189 % 100 = 89.
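The folding scheme above can be sketched generically; this version (function name illustrative) picks the group width from the array size, so a 1,000-cell array folds in groups of three digits and a 100-cell array in groups of two:

```python
def fold(key_digits, array_size):
    """Fold a digit string into an index for an array of the given size.

    Group width matches the index range: 3 digits for a 1,000-cell array
    (indices 0..999), 2 digits for a 100-cell array (indices 0..99).
    """
    group = len(str(array_size - 1))
    total = sum(int(key_digits[i:i + group])
                for i in range(0, len(key_digits), group))
    return total % array_size        # trim the sum to a valid index
```

For SSN-style keys this reproduces the arithmetic in the text: groups of three sum to 1368 and fold to 368; groups of two (with a one-digit leftover) sum to 189 and fold to 89.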

Hashing Efficiency Insertion and searching can approach O(1) time. If a collision occurs, access time depends on the resulting probe length. An individual insert or search time is proportional to the length of the probe, in addition to a constant time for the hash function. Relationship between probe length (P) and load factor (L) for linear probing: P = (1 + 1 / (1 − L)) / 2 for a successful search and P = (1 + 1 / (1 − L)²) / 2 for an unsuccessful search. These formulas are from Knuth (The Art of Computer Programming by Donald E. Knuth, of Stanford University, Addison-Wesley, 1998), and their derivation is quite complicated.

Hashing Efficiency Quadratic probing and double hashing share their performance equations. For a successful search: P = −log₂(1 − loadFactor) / loadFactor. For an unsuccessful search: P = 1 / (1 − loadFactor). Separate chaining: for a successful search, P = 1 + loadFactor / 2; for an unsuccessful search, P = 1 + loadFactor; for insertion, P = 1 + loadFactor / 2 for ordered lists and 1 for unordered lists.
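These probe-length estimates are easy to evaluate numerically. The sketch below encodes the standard Knuth/Lafore formulas (function names are illustrative, L is the load factor) so you can compare the strategies at a given load:

```python
from math import log2

def linear_success(L):   return (1 + 1 / (1 - L)) / 2       # linear probing, hit
def linear_fail(L):      return (1 + 1 / (1 - L) ** 2) / 2  # linear probing, miss
def double_success(L):   return -log2(1 - L) / L            # quadratic/double, hit
def double_fail(L):      return 1 / (1 - L)                 # quadratic/double, miss
def chain_success(L):    return 1 + L / 2                   # separate chaining, hit
def chain_fail(L):       return 1 + L                       # separate chaining, miss
```

At a load factor of 0.5, for example, a successful linear-probe search averages 1.5 probes and an unsuccessful one 2.5, while separate chaining averages 1.25 and 1.5, which is why chaining tolerates high load factors so gracefully.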

Open Addressing vs. Separate Chaining If open addressing is to be used, double hashing is preferred over quadratic probing. If plenty of memory is available and the data won't expand, linear probing is simpler to implement. If the number of items to be inserted in the hash table isn't known, separate chaining is preferable to open addressing. When in doubt, use separate chaining.

External Storage A hash table can be stored in main memory. If it is too large, it can be stored externally on disk, with only part of it being read into main memory at a time. In external hashing it's important that the blocks do not become full. Even with a good hash function, a block might become full. This situation can be handled using variations of the collision-resolution schemes.

Summary A hash table is a data structure that offers very fast insertion and searching. No matter how many data items there are, insertion and searching (and sometimes deletion) can take close to constant time: O(1) in big O notation. In practice this is just a few machine instructions. For a human user of a hash table, this is essentially instantaneous. It's so fast that computer programs typically use hash tables when they need to look up tens of thousands of items in less than a second (as in spelling checkers). Hash tables are significantly faster than trees, which operate in relatively fast O(log N) time. Also, hash tables are relatively easy to program.

Summary Hash tables do have several disadvantages: They're based on arrays, and arrays are difficult to expand. For some kinds of hash tables, performance may degrade terribly when a table becomes too full, so the programmer needs a fairly accurate idea of how many data items will be stored (or must be prepared to periodically transfer data to a larger hash table, a time-consuming process). Also, there's no convenient way to visit the items in a hash table in any kind of order (such as from smallest to largest). If you need this capability, you'll need to look elsewhere. However, if you don't need to visit items in order, and you can predict the size of your database in advance, hash tables are unparalleled in speed and convenience.