Hashing: an alternative search technique (earlier we had binary search trees, BST). Motivation: try to access each possible key directly. Suggestion: enumerate the possible keys. This implies an ordering of the keys.
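Example: key space of 3 binary digits, so N = 2³ = 8 possible keys. [Figure: a vector with one cell per possible key, cells numbered 1 to 8; a cell holds 1 if its key is present.]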
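The direct-access idea in this example can be sketched in a few lines. This is only an illustration: the function names and the set of present keys are hypothetical, and the cells are indexed 0 to 7 (the values of the 3-bit keys) rather than 1 to 8 as on the slide.

```python
N = 2 ** 3  # size of the complete key space for 3 binary digits


def make_table(keys):
    """Build a direct-access vector: cell k holds 1 if key k is present."""
    table = [0] * N
    for k in keys:
        table[k] = 1
    return table


def contains(table, key):
    """Look up a key with a single array access, no comparisons with other keys."""
    return table[key] == 1


# Hypothetical set of present keys, chosen only for illustration.
present = make_table([0b000, 0b001, 0b011, 0b110, 0b111])
print(contains(present, 0b011), contains(present, 0b010))
```

The lookup is a single index operation, which is what makes the scheme attractive as long as the vector fits in memory.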
Problem: in real life this is not possible.
1. Names: ~20 characters long, needs 26²⁰ entries (26 letters).
2. Decimal numbers: 20 digits, needs 10²⁰ entries (10 symbols: 0,..,9).
3. Binary numbers: 20 bits, needs 2²⁰ entries.
=> Too many: we cannot consider them all.
We cannot create vectors as long as 2²⁰ or … 26²⁰! But we do not need such long vectors, because not all combinations occur in practice. For example: a vocabulary contains maybe ~100,000 words. Personal names (Memphis): white pages, 300 pp * 300 entries ~ 100,000, or ~½ million, far fewer than 26²⁰.
Idea: assume that approximately N keys occur. Define a vector of length ~2N. Assign integers in [1,..,2N] to the keys. Look up each key by its assigned integer (serial number).
A Dynamical System Perspective: hashing means chopping up, granulating, coarse-graining information. This is done continuously, autonomously, and reliably in bio-systems (worms, ants, ..., humans alike).
The Major Challenge of Life: How can delicately defined living substances exist in an infinitely complex world? How can animals survive and succeed? How do they separate the important from the useless? => There are mechanisms that perform 'hashing' very efficiently and promptly. => This course is far from that, but it indicates a few main principles.
Major issues in hashing:
1. How to assign hash codes to the keys? By using a hash function.
2. Sometimes the hash function gives the same number to different keys (a collision). We have to resolve this.
[Figure: the complete key space K (e.g. 26²⁰ elements) is mapped by hashing into the hash table L (2N elements).]
Example: dates 1055, 1492, 1776, 1812, 1918, 1945.
Q: What is the complete key space?
Hash function: HashCode(x) = 5x mod 8.

Hash code   Keys
0           1776
1
2
3           1055
4           1492, 1812
5           1945
6           1918
7
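The hash codes of the example dates can be checked with a short script. This is a minimal sketch: the names `hash_code` and `table` are illustrative, and keys with equal codes are simply collected in the same Python list.

```python
def hash_code(x):
    """Hash function from the example: HashCode(x) = 5x mod 8."""
    return (5 * x) % 8


dates = [1055, 1492, 1776, 1812, 1918, 1945]

# One bucket (list) per hash code 0..7.
table = [[] for _ in range(8)]
for d in dates:
    table[hash_code(d)].append(d)

for code, bucket in enumerate(table):
    print(code, bucket)
```

Codes 1, 2 and 7 stay empty, while 1492 and 1812 both map to code 4, which is exactly the collision problem that the rest of the section addresses.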
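Evaluation of closed address (chained) hashing
Costs of a search:
1. Compute the hash code i: costs 'a'.
2. Search through the linked list H[i].
The linked lists H[0], ..., H[h-1] have lengths L_0, ..., L_{h-1}.

Average total cost of a successful search over the n stored keys (finding every key of a list of length L_i takes 1 + 2 + ... + L_i = L_i(L_i + 1)/2 comparisons in total):

$$T(n) = a + \frac{1}{n} \sum_{i=0}^{h-1} \frac{L_i (L_i + 1)}{2}$$

Worst case (bad distribution): all keys are in the same bucket. A search needs about n/2 comparisons on average, the same as searching an unordered array.
Better (good distribution): keys are equally distributed among the cells. Each list then holds about n/h keys; if the load factor n/h is constant, a search takes O(1) computations on average.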
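The two distributions can be compared numerically. The sketch below assumes that the average is taken over successful searches and that finding all keys of a list of length L costs 1 + 2 + ... + L = L(L + 1)/2 comparisons in total; the function name and the sample list lengths are made up for illustration.

```python
def average_search_cost(list_lengths, a=1.0):
    """Average cost of a successful search in a chained hash table.

    list_lengths[i] is the length of the linked list in cell i,
    a is the (assumed constant) cost of computing the hash code.
    """
    n = sum(list_lengths)
    comparisons = sum(L * (L + 1) / 2 for L in list_lengths)
    return a + comparisons / n


# 12 keys in h = 4 cells, two hypothetical distributions.
worst = average_search_cost([12, 0, 0, 0])  # all keys in one bucket
even = average_search_cost([3, 3, 3, 3])    # keys spread equally

print(worst, even)
```

With everything in one bucket the comparison term is (n + 1)/2, roughly n/2, while the even distribution gives (n/h + 1)/2, which does not grow with n as long as n/h stays constant.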
Hashing evaluation continued: with a uniform distribution, performance is very good. But this requires a hash function that gives a uniform distribution independently of the actual data. Randomization: use a computer pseudo-random generator, e.g. a multiplicative congruential generator.
HashCode(K) = (aK) mod h. Strategy: multiply the key by a constant a and take the modulus (i.e. the remainder after division by h).
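A small experiment shows why the constant a matters. The values a = 5, a = 4 and h = 8 are chosen here only for illustration; the function name is likewise an assumption.

```python
def hash_code(key, a, h):
    """Multiplicative hash: HashCode(K) = (aK) mod h."""
    return (a * key) % h


# a = 5 shares no common factor with h = 8: keys 0..7 get 8 distinct codes.
codes_good = [hash_code(k, a=5, h=8) for k in range(8)]

# a = 4 shares the factor 4 with h = 8: only two distinct codes are used.
codes_bad = [hash_code(k, a=4, h=8) for k in range(8)]

print(codes_good)
print(codes_bad)
```

When a and h have a common factor, part of the table can never be reached, so a is normally picked to have no common factor with h.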
Open Address Hashing: this is really dynamic. Collisions are not resolved with linked lists (as before in closed hashing); all keys are stored in the table itself. Load factor = n/h < 1 (if >= 0.5, array doubling). If there is a collision: rehashing.
1. Linear probing. Simple: Rehash(j) = (j+1) mod h, where j is the most recently probed location. Start with j = i and continue until an empty cell is found. Example: 6.10.
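Linear probing can be sketched as follows. The sketch reuses the hash function 5x mod 8 and the dates from the earlier example as test data; the function names are illustrative, and array doubling is deliberately left out to keep the code short, so the load factor here exceeds 0.5.

```python
EMPTY = None


def hash_code(key, h):
    """Hash function from the dates example, generalized to table size h."""
    return (5 * key) % h


def insert(table, key):
    """Insert key with linear probing: Rehash(j) = (j + 1) mod h."""
    h = len(table)
    j = hash_code(key, h)          # start with j = i, the hash code
    for _ in range(h):             # probe each cell at most once
        if table[j] is EMPTY or table[j] == key:
            table[j] = key
            return j
        j = (j + 1) % h            # collision: try the next cell
    raise RuntimeError("hash table is full")


def search(table, key):
    """Return the cell index of key, or -1 if it is not in the table."""
    h = len(table)
    j = hash_code(key, h)
    for _ in range(h):
        if table[j] is EMPTY:      # an empty cell ends the probe sequence
            return -1
        if table[j] == key:
            return j
        j = (j + 1) % h
    return -1


table = [EMPTY] * 8
for d in [1055, 1492, 1776, 1812, 1918, 1945]:
    insert(table, d)
print(table)
```

In this run 1812 collides with 1492 at cell 4 and moves to cell 5, which later pushes 1945 from cell 5 all the way to cell 7. Such clustering is one reason to keep the load factor low.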
Rehashing:
2. Double hashing: Rehash(j, d) = (j + d) mod h, where d is the rehashing increment. If d = 1, this is linear rehashing; otherwise d is determined separately (for example by a second hash function of the key).
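A possible sketch of double hashing is given below. The second hash function `increment` is purely hypothetical: it is not taken from the slides and was chosen only because it always returns an odd number, so that for a table size h that is a power of two the probe sequence visits every cell. The primary hash function 5x mod h and the dates are reused from the earlier example.

```python
def hash_code(key, h):
    """Primary hash function, as in the dates example."""
    return (5 * key) % h


def increment(key, h):
    """Hypothetical second hash: an odd increment d in 1..h-1.

    Assumes h is a power of two, so every odd d has no common factor with h.
    """
    return 2 * (key % (h // 2)) + 1


def insert(table, key):
    """Insert key with double hashing: Rehash(j, d) = (j + d) mod h."""
    h = len(table)
    j = hash_code(key, h)
    d = increment(key, h)          # the increment depends on the key
    for _ in range(h):             # probe each cell at most once
        if table[j] is None or table[j] == key:
            table[j] = key
            return j
        j = (j + d) % h            # collision: jump ahead by d cells
    raise RuntimeError("hash table is full")


table = [None] * 8
for key in [1055, 1492, 1776, 1812, 1918, 1945]:
    insert(table, key)
print(table)
```

Because d differs from key to key, two keys that collide at one cell usually follow different probe sequences afterwards, unlike in linear probing where they share the same one.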