Ihab Mohammed and Safaa Alwajidi
Introduction
Hash tables are dictionary structures that store objects with keys and provide very fast access. The idea is as follows:
First, we have a universe U of n objects, and each object has a key.
Second, we want to store a subset of U, named S, in a structure in the computer that has m buckets (locations).
[Figure: the universe U (Object 1, Object 2, ..., Object n) is mapped by key to the table S (Object 1, Object 2, ..., Object m).]
Introduction
Third, a hash function h maps the key u of an object in U to the location of that object in S: S[h(u)] = Object.
To see things clearly, let's look at an example. Let U be the set of integers from 1 to 100, and suppose we want to store ten of these numbers in the hash table S using the modulo 10 operation, so h(u) = u mod 10.
Now let's store the number 33: h(33) = 33 mod 10 = 3, which means that the number 33 is stored in location 3 of S.
Collision Problem
To store the number 56: h(56) = 56 mod 10 = 6, which means that the number 56 is stored in location 6 of S.
Watch what happens when we try to store the number 43: h(43) = 43 mod 10 = 3, but location 3 already holds the value 33, so we have a collision.
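The example above can be replayed in a few lines of Python. This is only a minimal sketch: the names M, h, and store are illustrative and do not come from the slides, and a collision is simply reported rather than resolved.

```python
M = 10  # number of locations in S, as in the example


def h(u):
    """Hash function from the example: h(u) = u mod 10."""
    return u % M


def store(table, u):
    """Store u at location h(u); return False (storing nothing) on a collision."""
    slot = h(u)
    if table[slot] is not None and table[slot] != u:
        return False  # the location already holds a different number
    table[slot] = u
    return True


S = [None] * M
results = [(u, h(u), store(S, u)) for u in (33, 56, 43)]
for u, slot, ok in results:
    print(f"h({u}) = {slot}: {'stored' if ok else 'collision'}")
```

The third call fails because location 3 is already taken by 33, which is exactly the collision described above.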
Collision Solution
Hash table theory is about:
Resolving collisions.
Choosing a hash function that reduces collisions.
History: hashing was started in 1953 by some groups at IBM. The early implementations were simple because the hash function was simple, with no performance guarantees.
If we map the keys of a big universe U to a small set S = {0, ..., s − 1}, it is unavoidable that many universe elements are mapped to the same element of S.
Hash Table Design
Collisions can be resolved in two ways:
Chaining: use a linked list at the location where collisions occur to store the multiple (colliding) objects.
Open Addressing: use a sequence of alternative addresses for the same key u, so if h1(u) is already in use, try h2(u), then h3(u), and so on.
Chaining
In chaining, the hash function takes you to the correct location in the primary structure (an array); you then have to search the secondary structure (a linked list) for the correct object.
A balanced search tree can be used as the secondary structure instead. Its search time is O(log n), so the total search time is O(1) for the hash function + O(log n) for the balanced search tree.
Note: with a good hash function and a not-too-small hash table, most buckets are expected to be almost empty, which shows that a linked list is enough as the secondary search structure.
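A minimal sketch of chaining, assuming integer keys and the hash function h(u) = u mod m from the introductory example. The class name ChainedHashTable is hypothetical, and a Python list stands in for the linked list in each bucket.

```python
class ChainedHashTable:
    """Hash table with chaining; each bucket is a list standing in for a linked list."""

    def __init__(self, m=10):
        self.m = m
        self.buckets = [[] for _ in range(m)]  # primary structure: array of buckets

    def _h(self, key):
        return key % self.m  # assumed hash function: h(u) = u mod m

    def insert(self, key):
        bucket = self.buckets[self._h(key)]
        if key not in bucket:  # search the secondary structure first
            bucket.append(key)

    def find(self, key):
        return key in self.buckets[self._h(key)]

    def delete(self, key):
        bucket = self.buckets[self._h(key)]
        if key in bucket:
            bucket.remove(key)
            return True
        return False


table = ChainedHashTable()
for key in (33, 56, 43):
    table.insert(key)
print(table.buckets[3])  # 33 and 43 share bucket 3
```

Deleting 33 afterwards leaves 43 findable, since each bucket is searched independently of what was removed from it.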
Open Addressing
In open addressing, no secondary structure is needed, which makes it a simple method. However, this method does not directly support deletion.
To store an object, the sequence of addresses (hash functions) is tried until an empty location is found.
[Figure, reconstructed from the slide labels:
Insert object 43: h1(43), h2(43).
Insert object 73: h1(73), h2(73), h3(73).
Object 43 is deleted.
Search for object 73: h1(73), h2(73). If h2(73) is the location that held 43, the search stops at the now-empty location and 73 is not found.]
Houston, we have a problem!
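The deletion problem can be reproduced with a small sketch. Two things here are assumptions and not taken from the slides: the address sequence is linear probing, h_i(u) = (u + i − 1) mod 10, and location 3 is already occupied by 33 from the earlier example. All function names are illustrative.

```python
M = 10
EMPTY = None


def probe(u, i):
    """i-th address for key u (i = 1, 2, ...); linear probing is an assumed choice."""
    return (u + i - 1) % M


def insert(table, u):
    """Try h1(u), h2(u), ... until an empty location is found; return that location."""
    for i in range(1, M + 1):
        slot = probe(u, i)
        if table[slot] is EMPTY or table[slot] == u:
            table[slot] = u
            return slot
    raise RuntimeError("table is full")


def search(table, u):
    """Follow the same address sequence; an empty location ends the search."""
    for i in range(1, M + 1):
        slot = probe(u, i)
        if table[slot] is EMPTY:
            return None
        if table[slot] == u:
            return slot
    return None


def naive_delete(table, u):
    """Simply empty the location, which is what breaks later searches."""
    slot = search(table, u)
    if slot is not None:
        table[slot] = EMPTY


S = [EMPTY] * M
slots = [insert(S, u) for u in (33, 43, 73)]  # 43 needs two probes, 73 needs three
found_before = search(S, 73)
naive_delete(S, 43)
found_after = search(S, 73)  # stops at the location emptied by the deletion
print(slots, found_before, found_after, S[5])
```

After the deletion, 73 is still stored in the table, but the search gives up at the emptied location before reaching it.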
Back to the Future (Chaining)
Avoid open addressing: the small space advantage of avoiding pointers never outweighs the fundamental disadvantage of losing deletions.
Chaining Variants:
Two-Way Chaining: each element of the universe is assigned two possible buckets, and an object is inserted into the bucket with fewer elements.
Sequence of Hash Tables: if the entry in the first hash table is in use, go to the second hash table, which uses a different hash function, and so on. This variant is also convenient for parallelization.
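Two-way chaining might be sketched as follows. The two hash functions (the last decimal digit and the tens digit of the key, for m = 10) are purely illustrative assumptions, as are the class and method names.

```python
class TwoWayChainedTable:
    """Chaining variant in which every key has two candidate buckets."""

    def __init__(self, m=10):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _candidates(self, key):
        # Two illustrative hash functions (assumptions, not from the slides).
        return key % self.m, (key // self.m) % self.m

    def find(self, key):
        b1, b2 = self._candidates(key)
        return key in self.buckets[b1] or key in self.buckets[b2]

    def insert(self, key):
        if self.find(key):
            return
        b1, b2 = self._candidates(key)
        # Insert into the candidate bucket with fewer elements (ties go to the first).
        target = b1 if len(self.buckets[b1]) <= len(self.buckets[b2]) else b2
        self.buckets[target].append(key)

    def delete(self, key):
        for b in self._candidates(key):
            if key in self.buckets[b]:
                self.buckets[b].remove(key)
                return True
        return False


table = TwoWayChainedTable()
for key in (33, 43, 53):  # all three share the first candidate bucket, 3
    table.insert(key)
longest = max(len(bucket) for bucket in table.buckets)
print(longest)
```

With a single hash function u mod 10, these three keys would form a chain of length three in bucket 3; here each ends up alone in a bucket. The price is that find and delete must look in both candidate buckets.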
Universal Families of Hash Functions
Uniform Hashing Model (the assumption used up to the end of the 1970s): the hash values of the elements are independent random values, uniformly distributed over the available addresses.
The reasoning was that any hash function that is complicated enough will behave like a random assignment, mixing the values of the input set sufficiently well.
This assumption is incorrect, because the input set is a concrete set of values and is not uniformly distributed in the universe U.
Universal Families of Hash Functions
Choose a hash function at random from a family of hash functions in which, for any input set, the values of the hash function are well distributed with high probability.
The random choice is needed because, for any fixed hash function, there exists a bad set of keys that all hash to the same slot.
Crucial Property: for a family of hash functions F to distribute a subset of U well over S, a function f chosen uniformly at random from F must satisfy the following:
Universal Families of Hash Functions
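A common way to state such a property is the standard definition of a universal family, given here as an assumption about the intended condition. With s the number of available addresses and c a constant (c = 1 in the classical definition), two distinct keys should rarely collide under a random choice of f:

```latex
\[
  \Pr_{f \in F}\bigl[\, f(x) = f(y) \,\bigr] \;\le\; \frac{c}{s}
  \qquad \text{for all } x, y \in U,\ x \neq y .
\]
```

The probability is taken over the random choice of f from F, not over the keys, so the bound holds for every concrete input set.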
Family Example
Properties of a universal family of hash functions:
It must be small and have a convenient parameterization, so that we can easily select a random function from the family.
It must be easy to evaluate.
Assume that U = {0, ..., p − 1} for some large prime p, chosen slightly less than the square root of the maximum integer the machine arithmetic can handle, so that the product of two such numbers does not overflow.
Assume that S = {0, ..., s − 1} with s ≤ p.
The simplest family is then F_ps = {h_a : U → S | h_a(x) = (ax mod p) mod s, 1 ≤ a ≤ p − 1}.
There are p − 1 functions in this family.
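Selecting a random member of this family takes only one random number. The sketch below assumes small illustrative values p = 101 and s = 10 (a real table would use a much larger prime); the names make_hash and random_hash are hypothetical. Python integers do not overflow, so the restriction on the size of p matters only in fixed-width machine arithmetic.

```python
import random

P = 101      # a small prime, chosen for illustration only
S_SIZE = 10  # number of addresses s, with s <= p


def make_hash(a, p=P, s=S_SIZE):
    """Return the family member h_a(x) = ((a * x) mod p) mod s."""
    if not 1 <= a <= p - 1:
        raise ValueError("a must satisfy 1 <= a <= p - 1")
    return lambda x: ((a * x) % p) % s


def random_hash(p=P, s=S_SIZE):
    """Pick one of the p - 1 functions uniformly at random."""
    a = random.randint(1, p - 1)  # randint includes both endpoints
    return a, make_hash(a, p, s)


a, h_a = random_hash()
print(a, h_a(33), h_a(43))

# Count the members of the family under which the fixed keys 33 and 43 collide.
collisions = sum(1 for b in range(1, P) if make_hash(b)(33) == make_hash(b)(43))
print(collisions, "of", P - 1, "functions")
```

Under the fixed function u mod 10 the keys 33 and 43 always collide; under a randomly chosen h_a they collide only for some values of a, which is the point of drawing the function from a family.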
Universal Families of Hash Functions
Summary:
Theorem: a hash table with chaining, using a universal family of hash functions, stores a set of n elements in a table of size s and supports the operations find, insert, and delete in expected time O(1 + n/s) per operation, using space O(n + s).