Published by Jerome Shields. Modified over 9 years ago.
1
Hashing 15-211 Fundamental Data Structures and Algorithms Margaret Reid-Miller 18 January 2005
2
Plan Today Seat assignments Hash functions Reading: For today and next time: Sedgewick Chapter 14 Reminder: HW0 due on Thursday
3
Hash Tables An Alternative Representation for Dictionaries
4
Dictionary Interface An Abstract Data Type that maintains a dynamic set is a Dictionary. Crucial operations: Insert Find Remove Standard operations: create, destroy, copy,…
5
Dictionary Interface insert: may or may not allow multiple occurrences. find: membership query, often also retrieves associated information. remove: may use deferred actions for speed-up (amortized running time).
6
Small Universe Suppose we have a small universe U = {0,1,2,…,M-1} of items. We want to maintain a subset A of U. Easy: use an array of bits (booleans) of size M. Insert: A[k] = 1. Find: return A[k] != 0. Remove: A[k] = 0. All operations are constant time.
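A minimal Java sketch of the bit-array idea on this slide. The class name SmallUniverseSet and its method names are illustrative assumptions, not part of the lecture.

```java
// Sketch: a subset A of the small universe {0, ..., M-1} stored as one boolean per item.
class SmallUniverseSet {
    private final boolean[] present;   // present[k] is true iff k is in the set

    SmallUniverseSet(int universeSize) {
        present = new boolean[universeSize];   // all entries start out false
    }

    void insert(int k)  { present[k] = true; }    // A[k] = 1
    boolean find(int k) { return present[k]; }    // A[k] != 0
    void remove(int k)  { present[k] = false; }   // A[k] = 0

    public static void main(String[] args) {
        SmallUniverseSet a = new SmallUniverseSet(16);
        a.insert(5);
        System.out.println(a.find(5));   // true
        a.remove(5);
        System.out.println(a.find(5));   // false
    }
}
```

Each operation is a single array access, which is where the constant-time bound comes from.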
7
Direct Access Tables In most applications we do not store simple items but pairs (key, object). Use an array of pointers (references to objects). Insert: A[key] = object Find: return A[key] Remove: A[key] = null Again operations are constant time.
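A direct access table can be sketched the same way, with references in place of bits. The generic class DirectAccessTable below is a hypothetical illustration and assumes keys are integers in the range 0..M-1.

```java
// Sketch: one slot per possible key; the slot holds a reference to the object or null.
class DirectAccessTable<V> {
    private final Object[] slots;

    DirectAccessTable(int universeSize) {
        slots = new Object[universeSize];
    }

    void insert(int key, V object) { slots[key] = object; }   // A[key] = object

    @SuppressWarnings("unchecked")
    V find(int key) { return (V) slots[key]; }                // returns null if absent

    void remove(int key) { slots[key] = null; }               // A[key] = null

    public static void main(String[] args) {
        DirectAccessTable<String> t = new DirectAccessTable<>(100);
        t.insert(42, "answer");
        System.out.println(t.find(42));   // answer
        t.remove(42);
        System.out.println(t.find(42));   // null
    }
}
```

The table needs one slot for every key in the universe, whether or not that key is ever used.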
8
Large Universe But what if the universe U of keys is large (and the subset is small)? E.g., names, or the symbol table of a compiler. Even when identifiers are at most 16 characters long there are some 10^28 possibilities.
9
Hashing – the Idea Map keys into integers in the range 0..m-1, where m is the table size and m << M. Pick a “good” mapping from keys to integers: easy to compute, with an even distribution into the table. [Figure: table indices 0 to 10 and the keys a to z]
10
Hashing – Terminology The array in which we store the objects is the hash table. To enter an object into the table, we compute an index from the key. The map from the key to the index is a hash function h: h(key) = index
11
Space-Time Tradeoff A direct table has O(1) operations in the worst case, but its space may be prohibitive. A sequential search minimizes space but takes linear time. Hashing balances space and time (on average) by changing the size of the hash table.
12
Problem - Collisions Fundamental problem: Some keys map to the same location, a collision: h(x) = h(y). Can we prevent collisions?
13
Pigeonhole Principle There is no way to avoid collisions. Since m << M, there must be at least two keys that map to the same index. The famous Pigeonhole Principle: if you put more than k items into k bins, then at least one bin contains more than one item.
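To see the principle at work on a tiny case, the sketch below places the 26 keys a..z into a table of size 11. The hash function h(c) = (c - 'a') mod m is an assumption chosen only for illustration; any function from 26 keys to 11 slots must put at least three keys into some slot.

```java
// Sketch: count how many of the keys 'a'..'z' land in each of m slots.
class PigeonholeDemo {
    // Hypothetical hash function: position of the letter in the alphabet, reduced mod m.
    static int h(char c, int m) {
        return (c - 'a') % m;
    }

    // occupancy(m)[i] is the number of letters that hash to slot i.
    static int[] occupancy(int m) {
        int[] count = new int[m];
        for (char c = 'a'; c <= 'z'; c++) {
            count[h(c, m)]++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[] count = occupancy(11);
        for (int i = 0; i < count.length; i++) {
            System.out.println("slot " + i + ": " + count[i] + " keys");
        }
    }
}
```

With this particular function, slots 0 to 3 receive three keys each and slots 4 to 10 receive two each.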
14
Problem - Hash Function Second problem: How do we find a suitable hash function? Ideally, we want to distribute the keys uniformly over the hash table to minimize collisions. That is, we want h to appear random, as though “hashing” the keys.
15
Hash Functions
16
Hashing-Efficiency We also need to make sure h(k) is easy to compute. Note that k could be a fairly complicated data structure. How do you turn an array of integers into a single integer? Or how about a tree? Goal: All operations should be constant time. But things can go badly wrong on rare occasions.
17
Division method Assume wlog the keys are integers. A simple hash function is h(k) = k mod m, where m is the table size. The choice of m is crucial. Good choice: m prime.
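A sketch of the division method in Java. The class name DivisionHash is illustrative. Math.floorMod is used instead of the % operator so that negative keys, which the slide does not discuss, still land in the range 0..m-1.

```java
// Sketch: h(k) = k mod m, with m the table size (ideally a prime).
class DivisionHash {
    static int hash(int k, int m) {
        // floorMod always returns a value in [0, m) for m > 0, even when k is negative.
        return Math.floorMod(k, m);
    }

    public static void main(String[] args) {
        int m = 31;                          // a prime table size
        System.out.println(hash(100, m));    // 7, since 100 = 3 * 31 + 7
        System.out.println(hash(-1, m));     // 30
    }
}
```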
18
Division method Primes are fairly dense, so this is no great restriction on the table size. In fact, we can nearly double the table size at each step and still use a prime: 31, 61, 127, 251, 509, 1021, 2039, … Store these values in a table; don’t try to compute them on the fly.
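The stored table of sizes might look like the sketch below. It holds only the seven primes listed on the slide; the names PrimeSizes and nextSize, and the choice to return -1 when the table runs out, are assumptions for illustration.

```java
// Sketch: precomputed prime table sizes, each roughly double the previous one.
class PrimeSizes {
    static final int[] PRIMES = {31, 61, 127, 251, 509, 1021, 2039};

    // Returns the smallest stored prime strictly greater than current,
    // or -1 if current is already at or beyond the largest stored prime.
    static int nextSize(int current) {
        for (int p : PRIMES) {
            if (p > current) {
                return p;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(nextSize(31));     // 61
        System.out.println(nextSize(509));    // 1021
    }
}
```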
19
Multiplication Method Another hash function is h(k) = floor( m (k r mod 1) ), where 0 < r < 1 is cleverly chosen. Advantage: the choice of m is not critical. Ideally r should be irrational; then the values (i r mod 1), i = 1, 2, ..., M are very evenly distributed over [0,1]. Of course, there is a little problem here: an irrational r cannot be represented exactly on a computer.
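A floating-point sketch of the multiplication method. The slide does not name a value for r; the constant (sqrt(5) - 1) / 2, about 0.618, is a commonly suggested choice and is an assumption here. The double that stores it is only a rational approximation, and for very large keys the product loses precision, so this is an illustration rather than a production hash.

```java
// Sketch: h(k) = floor(m * (k * r mod 1)) for nonnegative integer keys.
class MultiplicationHash {
    // Assumed constant: (sqrt(5) - 1) / 2, stored as the nearest double.
    static final double R = (Math.sqrt(5.0) - 1.0) / 2.0;

    static int hash(int k, int m) {
        double product = k * R;
        double frac = product - Math.floor(product);   // k * r mod 1, in [0, 1)
        return (int) Math.floor(m * frac);             // scale up to a table index
    }

    public static void main(String[] args) {
        int m = 10;   // the table size need not be prime for this method
        for (int k = 1; k <= 5; k++) {
            System.out.println("h(" + k + ") = " + hash(k, m));
        }
    }
}
```

With m = 10 the keys 1 to 5 map to 6, 2, 8, 4, 0, which hints at how the fractional parts spread over the interval.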
20
Random Input Note that good hash functions are easy to come by if the input is random (as a bit pattern). Then we can take simply a few bits from the input (say, the first or last 16 bits). However, such a method would fail miserably if the input shows some regularity. No good for general use.
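The failure mode can be made concrete with a small sketch. The key set below (integers whose low 16 bits are all zero) is a made-up example of regular input, and the names LowBitsHash, lastBits and distinctValues are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: hashing by keeping only the last few bits of the key.
class LowBitsHash {
    // Keeps the lowest 'bits' bits of k (requires 0 < bits < 32).
    static int lastBits(int k, int bits) {
        return k & ((1 << bits) - 1);
    }

    // Number of different hash values produced for the given keys.
    static int distinctValues(int[] keys, int bits) {
        Set<Integer> seen = new HashSet<>();
        for (int k : keys) {
            seen.add(lastBits(k, bits));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int[] regular = new int[100];
        for (int i = 0; i < regular.length; i++) {
            regular[i] = i << 16;   // 100 distinct keys that differ only in their high bits
        }
        // Every key has the same last 16 bits, so all 100 keys collide.
        System.out.println(distinctValues(regular, 16));   // 1
    }
}
```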
21
Integer keys? The assumption that objects in U are integers has to be taken with a grain of salt. Often we have to massage things a bit to extract numbers. Of course, in the end everything is just one (possibly huge) number written in binary. This can be used in some languages like C to directly extract hash values from these bits.
22
Example: Strings
    public int hashCode(String key, int m) {
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = 37 * h + key.charAt(i);   // 37 is magic number
        h %= m;
        if (h < 0)                        // overflow?
            h += m;
        return h;
    }
This is really an interpretation of the string as a number in base 37 (not ordinary radix notation, though).
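A small runnable wrapper around the slide's function, to show it in use. The class name StringHashDemo, the static modifier and the sample keys are assumptions added for illustration; the arithmetic is unchanged.

```java
// Sketch: the string hash from the slide, made static so it can be called from main.
class StringHashDemo {
    static int hashCode(String key, int m) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 37 * h + key.charAt(i);   // int arithmetic may wrap around for long keys
        }
        h %= m;                           // in Java the remainder keeps the sign of h
        if (h < 0) {
            h += m;                       // shift a negative remainder into 0..m-1
        }
        return h;
    }

    public static void main(String[] args) {
        // "ab" is 97 * 37 + 98 = 3687, and 3687 mod 11 = 2.
        System.out.println(hashCode("ab", 11));   // 2
        // A long key overflows int, but the result still lies in 0..m-1.
        System.out.println(hashCode("fundamental data structures", 31));
    }
}
```

The sign check matters because overflow can leave h negative before the remainder is taken.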
23
Hash functions Desired properties: approximates a random distribution over the range of table index values; efficient calculation. Approaches: modular arithmetic; many; perfect hashing, when the full set of input keys is known in advance.
24
Next time: Collisions