CSE 326 Hashing David Kaplan Dept of Computer Science & Engineering Autumn 2001.

Reminder: Dictionary ADT

Dictionary operations:
 insert
 find
 delete
 create
 destroy

Stores values associated with user-specified keys:
 values may be any (homogeneous) type
 keys may be any (homogeneous) comparable type

Example data: Adrien, "Roller-blade demon"; Hannah, "C++ guru"; Dave, "Older than dirt". insert(Donald, "l33t haxtor") adds a new entry; find(Adrien) returns "Roller-blade demon".

Dictionary Implementations So Far

                 Insert     Find       Delete
Unsorted list    O(1)       O(n)       O(n)
Trees            O(log n)   O(log n)   O(log n)
Sorted array     O(n)       O(log n)   O(n)
Array, special case of
known keys  {1, …, K}      O(1)       O(1)       O(1)

ADT Legalities: A Digression on Keys

Methods are the contract between an ADT and the outside agent (client code)
 Ex: the Dictionary contract is {insert, find, delete}
 Ex: the Priority Queue contract is {insert, deleteMin}

Keys are the currency used in transactions between an outside agent and the ADT
 Ex: insert(key), find(key), delete(key)

So … how about O(1) insert/find/delete for any key type?

Hash Table Goal: Key as Index

We can access a record as a[5]. We want to access a record as a["Hannah"].

[figure: an array with integer indices (slot 2: Adrien, "roller-blade demon"; slot 5: Hannah, "C++ guru") beside the same array accessed directly by the keys "Adrien" and "Hannah"]

Hash Table Approach

Apply a function f(x) to map each key (Hannah, Dave, Adrien, Donald, Ed) to a table slot.

But … is there a problem with this pipe-dream?

Hash Table Dictionary Data Structure

Hash function: maps keys to integers
Result:
 Can quickly find the right spot for a given entry

The table is unordered and sparse
Result:
 Cannot efficiently list all entries
 Cannot efficiently find min, max, or ordered ranges

Hash Table Taxonomy

The pieces of the picture: the hash function f(x), the keys (Hannah, Dave, Adrien, Donald, Ed), collisions, and the load factor:

 λ = (# of entries in table) / tableSize

Agenda: Hash Table Design Decisions
 What should the hash function be?
 What should the table size be?
 How should we resolve collisions?

Hash Function

A hash function maps a key to a table index:

Value & find(Key & key) {
    int index = hash(key) % tableSize;
    return Table[index];
}

What Makes a Good Hash Function?

Fast runtime
 O(1), and fast in practical terms

Distributes the data evenly
 hash(a) % size ≠ hash(b) % size (for most a ≠ b)

Uses the whole hash table
 for all 0 ≤ i < size, there is some k such that hash(k) % size = i

Good Hash Function for Integer Keys

Choose:
 tableSize is prime
 hash(n) = n

Example: tableSize = 7
 insert(4)
 insert(17)
 find(12)
 insert(9)
 delete(17)

Good Hash Function for Strings?

Let s = s1 s2 s3 s4 … sn. Choose:
 hash(s) = s1 + s2*128 + s3*128^2 + … + sn*128^(n-1)
 Think of the string as a base-128 (aka radix-128) number

Problems:
 hash("really, really big") = well … something really, really big
 hash("one thing") % 128 = hash("other thing") % 128 (any two strings with the same first character collide mod 128)

String Hashing Issues and Techniques

Minimize collisions
 Make tableSize and the radix relatively prime
 Typically, make tableSize not a multiple of 128

Simplify computation
 Use Horner's Rule:

int hash(String s) {
    h = 0;
    for (i = s.length() - 1; i >= 0; i--) {
        h = (s[i] + 128*h) % tableSize;
    }
    return h;
}

Good Hashing: The Multiplication Method

The hash function is defined by the table size plus a parameter A:
 h_A(k) = ⌊size * (k*A mod 1)⌋ where 0 < A < 1

Example: size = 10, A = 0.485
 h_A(50) = ⌊10 * (50*0.485 mod 1)⌋ = ⌊10 * (24.25 mod 1)⌋ = ⌊10 * 0.25⌋ = 2

 no restriction on size!
 when building a static table, we can try several values of A
 more computationally intensive than a single mod

Hashing Dilemma

Suppose your Worst Enemy 1) knows your hash function, and 2) gets to decide which keys to send you.

Faced with this enticing possibility, Worst Enemy decides to:
a) Send you keys which maximize collisions for your hash function.
b) Take a nap.

Moral: no single hash function can protect you!

Faced with this dilemma, you:
a) Give up and use a linked list for your Dictionary.
b) Drop out of software, and choose a career in fast foods.
c) Run and hide.
d) Proceed to the next slide, in hope of a better alternative.

Universal Hashing

Suppose we have a set K of possible keys, and a finite set H of hash functions that map keys to entries in a hashtable of size m. (Motivation: see the previous slide.)

Definition: H is a universal collection of hash functions if and only if, for any two keys k1, k2 in K, there are at most |H|/m functions in H for which h(k1) = h(k2).

 So … if we randomly choose a hash function from H, our chance of collision is no more than if we got to choose hash table entries at random!

[figure: keys k1, k2 in K mapped by functions h_i, h_j in H to slots 0 … m-1]

Random Hashing – Not!

How can we "randomly choose a hash function"?
 Certainly we cannot randomly choose hash functions at runtime, interspersed amongst the inserts, finds, and deletes! (Why not?)
 We can, however, randomly choose a hash function each time we initialize a new hashtable.

Conclusions
 Worst Enemy never knows which hash function we will choose – neither do we!
 No single input (set of keys) can always evoke worst-case behavior

Good Hashing: Universal Hash Function A (UHF_a)

Parameterized by a prime table size and a vector:
 a = <a0, a1, …, ar> where 0 <= ai < size

Represent each key as r + 1 integers, each ki < size:
 e.g., size = 11: an integer key can be represented by its decimal digits
 e.g., size = 29: the key "hello world" can be represented with one ki per character

h_a(k) = (Σ ai * ki) mod size

UHF_a: Example

Context: hash strings of length 3 in a table of size 131.

let a = <35, 100, 21>
h_a("xyz") = (35*120 + 100*121 + 21*122) % 131 = 129

Thinking about UHF_a

Strengths:
 works on any key type, as long as you can form the ki's
 if we're building a static table, we can try many values of the hash vector a
 a random a has guaranteed good properties no matter what we're hashing

Weaknesses:
 must choose a prime table size larger than any ki

Good Hashing: Universal Hash Function 2 (UHF_2)

Parameterized by j, a, and b:
 j * size should fit into an int
 a and b must be less than j * size

h_{j,a,b}(k) = ((a*k + b) mod (j*size)) / j

UHF_2: Example

Context: hash integers in a table of size 16.

let j = 32, a = 100, b = 200
h_{j,a,b}(1000) = ((100*1000 + 200) % (32*16)) / 32
               = (100200 % 512) / 32
               = 360 / 32
               = 11

Thinking about UHF_2

Strengths:
 if we're building a static table, we can try many parameter values
 random a, b have guaranteed good properties no matter what we're hashing
 can choose any size table
 very efficient if j and size are powers of 2 (why?)

Weaknesses:
 need to turn non-integer keys into integers

Hash Function Summary

Goals of a hash function:
 reproducible mapping from key to table index
 evenly distribute keys across the table
 separate commonly occurring keys (neighboring keys?)
 fast runtime

Some hash function candidates:
 h(n) = n % size
 h(s) = string as base-128 number % size
 Multiplication hash: compute percentage through the table
 Universal hash function A: dot product with a random vector
 Universal hash function 2: next pseudo-random number

Hash Function Design Considerations
 Know what your keys are
 Study how your keys are distributed
 Try to include all important information in a key in the construction of its hash
 Try to make "neighboring" keys hash to very different places
 Prune the features used to create the hash until it runs "fast enough" (very application dependent)

Handling Collisions

The pigeonhole principle says we can't avoid all collisions:
 try to hash n keys into m slots without collision, with n > m
 try to put 6 pigeons into 5 holes

What do we do when two keys hash to the same entry?
 Separate Chaining: put a little dictionary in each entry
 Open Addressing: pick a next entry to try within the hashtable

Terminology madness :-(
 Separate Chaining is sometimes called Open Hashing
 Open Addressing is sometimes called Closed Hashing

Separate Chaining

Put a little dictionary at each entry
 Commonly, an unordered linked list (chain)
 Or, choose another Dictionary type as appropriate (search tree, hashtable, etc.)

Properties
 λ can be greater than 1
 performance degrades with the length of the chains
 an alternate Dictionary type (e.g. search tree, hashtable) can speed up the secondary search

[figure: a table where h(a) = h(d) and h(e) = h(b), so a and d share one chain, e and b another, and c sits alone]

Separate Chaining Code

[private] Dictionary & findBucket(const Key & k) {
    return table[hash(k) % table.size];
}

void insert(const Key & k, const Value & v) {
    findBucket(k).insert(k, v);
}

Value & find(const Key & k) {
    return findBucket(k).find(k);
}

void delete(const Key & k) {
    findBucket(k).delete(k);
}

Load Factor in Separate Chaining

Search cost
 unsuccessful search: examines a whole chain, λ nodes on average
 successful search: examines about 1 + λ/2 nodes on average

Desired load factor: λ ≈ 1

Open Addressing

Allow one key at each table entry
 two objects that hash to the same spot can't both go there
 first one there gets the spot
 the next one must go in another spot

Properties
 λ ≤ 1
 performance degrades with the difficulty of finding the right spot

[figure: h(a) = h(d) and h(e) = h(b); d and b are bumped to other slots near a, c, and e]

Probing

Requires a collision resolution function f(i)

Probing how-to:
 First probe: given a key k, hash to h(k)
 Second probe: if h(k) is occupied, try h(k) + f(1)
 Third probe: if h(k) + f(1) is occupied, try h(k) + f(2)
 And so forth

Probing properties:
 we force f(0) = 0
 the i-th probe is to (h(k) + f(i)) mod size
 if i reaches size - 1, the probe has failed
 depending on f(), the probe may fail sooner
 long sequences of probes are costly!

Linear Probing

f(i) = i

Probe sequence:
 h(k) mod size
 (h(k) + 1) mod size
 (h(k) + 2) mod size
 …

bool findEntry(const Key & k, Entry *& entry) {
    int probePoint = hash(k);
    do {
        entry = &table[probePoint];
        probePoint = (probePoint + 1) % size;
    } while (!entry->isEmpty() && entry->key != k);
    return !entry->isEmpty();
}

Linear Probing Example (table size 7)

insert(55): 55 % 7 = 6, 1 probe
insert(76): 76 % 7 = 6, 2 probes (slot 6 taken, wraps to slot 0)
insert(93): 93 % 7 = 2, 1 probe
insert(40): 40 % 7 = 5, 1 probe
insert(47): 47 % 7 = 5, 4 probes (slots 5, 6, 0 taken, lands in slot 1)
insert(10): 10 % 7 = 3, 1 probe

Load Factor in Linear Probing

For any λ < 1, linear probing will find an empty slot

Search cost (for large table sizes)
 successful search: ½(1 + 1/(1 - λ))
 unsuccessful search: ½(1 + 1/(1 - λ)²)

Linear probing suffers from primary clustering
Performance quickly degrades for λ > ½

Quadratic Probing

f(i) = i²

Probe sequence:
 h(k) mod size
 (h(k) + 1) mod size
 (h(k) + 4) mod size
 (h(k) + 9) mod size
 …

bool findEntry(const Key & k, Entry *& entry) {
    int probePoint = hash(k), i = 0;
    do {
        entry = &table[probePoint];
        i++;
        probePoint = (probePoint + (2*i - 1)) % size;
    } while (!entry->isEmpty() && entry->key != k);
    return !entry->isEmpty();
}

Good Quadratic Probing Example (table size 7)

insert(76): 76 % 7 = 6, 1 probe
insert(40): 40 % 7 = 5, 1 probe
insert(48): 48 % 7 = 6, 2 probes (slot 6 taken; (6+1) % 7 = 0 is free)
insert(5): 5 % 7 = 5, 3 probes (slots 5, 6 taken; (5+4) % 7 = 2 is free)
insert(55): 55 % 7 = 6, 3 probes (slots 6, 0 taken; (6+4) % 7 = 3 is free)

Bad Quadratic Probing Example (table size 7)

insert(76): 76 % 7 = 6, 1 probe
insert(93): 93 % 7 = 2, 1 probe
insert(40): 40 % 7 = 5, 1 probe
insert(35): 35 % 7 = 0, 1 probe
insert(47): 47 % 7 = 5, and the probe sequence cycles among the occupied slots 5, 6, 2, 0, 0, 2, 6 … the insert fails!

Quadratic Probing Succeeds for λ ≤ ½

If size is prime and λ ≤ ½, then quadratic probing will find an empty slot in size/2 probes or fewer.

 Show: for all 0 ≤ i, j ≤ size/2 with i ≠ j,
   (h(x) + i²) mod size ≠ (h(x) + j²) mod size
 By contradiction: suppose that for some such i, j:
   (h(x) + i²) mod size = (h(x) + j²) mod size
   ⇒ i² mod size = j² mod size
   ⇒ (i² - j²) mod size = 0
   ⇒ [(i + j)(i - j)] mod size = 0
 Since size is prime, it must divide i + j or i - j. But how can i + j = 0 or i + j = size when i ≠ j and i, j ≤ size/2?
 The same argument rules out (i - j) mod size = 0.

Quadratic Probing May Fail for λ > ½

 For any i larger than size/2, there is some j smaller than i that adds with i to equal size (or a multiple of size). D'oh!

Load Factor in Quadratic Probing

 For any λ ≤ ½, quadratic probing will find an empty slot
 For λ > ½, quadratic probing may find a slot
 Quadratic probing does not suffer from primary clustering
 Quadratic probing does suffer from secondary clustering
 How could we possibly solve this?

Double Hashing

f(i) = i * hash2(k)

Probe sequence:
 h1(k) mod size
 (h1(k) + 1*hash2(k)) mod size
 (h1(k) + 2*hash2(k)) mod size
 …

bool findEntry(const Key & k, Entry *& entry) {
    int probePoint = hash1(k), delta = hash2(k);
    do {
        entry = &table[probePoint];
        probePoint = (probePoint + delta) % size;
    } while (!entry->isEmpty() && entry->key != k);
    return !entry->isEmpty();
}

A Good Double Hash Function …

… is quick to evaluate.
… differs from the original hash function.
… never evaluates to 0 (mod size).

One good choice: choose a prime p < size and let
 hash2(k) = p - (k mod p)

Double Hashing Example (table size 7, p = 5, hash2(k) = 5 - (k mod 5))

insert(55): 55 % 7 = 6, 1 probe
insert(76): 76 % 7 = 6; taken, so step by 5 - (76 % 5) = 4 to slot 3: 2 probes
insert(93): 93 % 7 = 2, 1 probe
insert(40): 40 % 7 = 5, 1 probe
insert(47): 47 % 7 = 5; taken, so step by 5 - (47 % 5) = 3 to slot 1: 2 probes
insert(10): 10 % 7 = 3; taken, so step by 5 - (10 % 5) = 5 through slots 1 and 6 to slot 4: 4 probes

Load Factor in Double Hashing

For any λ < 1, double hashing will find an empty slot (given an appropriate table size and hash2)

Search cost appears to approach the optimal (random hash):
 successful search: (1/λ) ln(1/(1 - λ))
 unsuccessful search: 1/(1 - λ)

No primary clustering and no secondary clustering
One extra hash calculation

Deletion in Open Addressing

 Must use lazy deletion!
 On insertion, treat a (lazily) deleted item as an empty slot

Example: delete(2), then find(7). If slot 2 were truly emptied, a later probe sequence passing through it would stop early: where is it?!

The Squished Pigeon Principle

 Insert using Open Addressing cannot work with λ ≥ 1.
 Insert using Open Addressing with quadratic probing may not work with λ > ½.
 With Separate Chaining or Open Addressing, large load factors lead to poor performance!

How can we relieve the pressure on the pigeons?
 Hint: what happens when we overrun array storage in a {queue, stack, heap}?
 What else must happen with a hashtable?

Rehashing

When the load factor λ gets "too large" (over some constant threshold), rehash all elements into a new, larger table:
 takes O(n), but amortized O(1) as long as we (just about) double the table size on each resize
 spreads keys back out, and may drastically improve performance
 gives us a chance to retune parameterized hash functions
 avoids failure for Open Addressing techniques
 allows arbitrarily large tables starting from a small table
 clears out lazily deleted items

Case Study

Spelling dictionary
 30,000 words
 static
 arbitrary(ish) preprocessing time

Goals
 fast spell checking
 minimal storage

Practical notes
 almost all searches are successful. Why?
 words average about 8 characters in length
 30,000 words at 8 bytes/word is about 0.25 MB
 pointers are 4 bytes
 there are many regularities in the structure of English words

Case Study: Design Considerations

Possible solutions
 sorted array + binary search
 Separate Chaining
 Open Addressing + linear probing

Issues
 Which data structure should we use?
 Which type of hash function should we use?

Case Study: Storage

Assume words are strings and table entries are pointers to strings.

[figure: the three layouts side by side: array + binary search, Separate Chaining, open addressing]

How many pointers does each use?

Case Study: Analysis

                   storage                                   time
Binary search      n pointers + words = 360KB                log2 n ≈ 15 probes per access, worst case
Separate Chaining  2n + n/λ pointers + words (λ = 1 ⇒ 600KB) 1 + λ/2 probes per access on average (λ = 1 ⇒ 1.5 probes)
Open Addressing    n/λ pointers + words (λ = 0.5 ⇒ 480KB)    (1 + 1/(1 - λ))/2 probes per access on average (λ = 0.5 ⇒ 1.5 probes)

What to do, what to do? …

Perfect Hashing

When we know the entire key set in advance …
 Examples: programming language keywords, CD-ROM file list, spelling dictionary, etc.

… then perfect hashing lets us achieve:
 Worst-case O(1) time complexity!
 Worst-case O(n) space complexity!

Perfect Hashing Technique

 Static set of n known keys
 Separate chaining, two-level hash
 Primary hash table of size n
 j-th secondary hash table of size nj² (where nj keys hash to slot j in the primary hash table)
 Universal hash functions in all hash tables
 Conduct (a few!) random trials, until we get collision-free hash functions

[figure: a primary hash table whose slots point to secondary hash tables]

Perfect Hashing Theorems¹

Theorem: If we store n keys in a hash table of size n² using a randomly chosen universal hash function, then the probability of any collision is < ½.

Theorem: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function, then E[Σj nj²] < 2n, where nj is the number of keys hashing to slot j.

Corollary: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function, and we set the size of each secondary hash table to mj = nj², then:
a) The expected amount of storage required for all secondary hash tables is less than 2n.
b) The probability that the total storage used for all secondary hash tables exceeds 4n is less than ½.

¹ Intro to Algorithms, 2nd ed. Cormen, Leiserson, Rivest, Stein

Perfect Hashing Conclusions

The perfect hashing theorems set tight expected bounds on the sizes and collision behavior of all the hash tables (the primary and all the secondaries).

 Conduct a few random trials of universal hash functions, simply varying the UHF parameters, until we get a set of UHFs and associated table sizes which deliver …
 Worst-case O(1) time complexity!
 Worst-case O(n) space complexity!

Extendible Hashing: Cost of a Database Query

The I/O-to-CPU cost ratio is 300-to-1!

Extendible Hashing

A hashing technique for huge data sets
 optimizes to reduce disk accesses
 each hash bucket fits on one disk block
 better than B-Trees if order is not important. Why?

The table contains
 buckets, each fitting in one disk block, with the data
 a directory that fits in one disk block, used to hash to the correct bucket

Extendible Hash Table

 Directory entry: a key prefix (the first k bits) and a pointer to the bucket holding all keys starting with that prefix
 Each block contains keys matching on their first j ≤ k bits, plus the data associated with each key

[figure: a directory for k = 3 pointing to buckets with local depths 2, 2, 3, 3, 2]

Inserting (easy case)

insert(11011): the directory entry for the key's prefix points to a bucket with free space, so the key is simply added to that bucket.

[figure: before and after, buckets with local depths (2) (2) (3) (3) (2)]

Splitting a Leaf

insert(11000): the target bucket is full, so it splits into two buckets distinguished by one more prefix bit (local depth 2 → 3), and the directory pointers are updated.

[figure: before, buckets with local depths (2) (2) (3) (3) (2); after, (2) (2) (3) (3) (3) (3)]

Splitting the Directory

1. insert(10010). But there is no room to insert, and no adoption!
2. Solution: expand the directory.
3. Then, it's just a normal split.

[figure: the directory doubles; buckets with local depths (2) (2) (2)]

If Extendible Hashing Doesn't Cut It

Store only pointers to the items
 + (potentially) much smaller M
 + fewer items in the directory
 – one extra disk access!

Rehash
 + potentially better distribution over the buckets
 + fewer unnecessary items in the directory
 – can't solve the problem if there's simply too much data

What if these don't work?
 use a B-Tree to store the directory!

Hash Wrap

Collision resolution
 Separate Chaining
  expands beyond the hashtable via secondary Dictionaries
  allows λ > 1
 Open Addressing
  expands within the hashtable
  secondary probing: {linear, quadratic, double hash}
  λ ≤ 1 (by definition!)
  λ ≤ ½ (by preference!)
 Rehashing
  tunes up the hashtable when λ crosses the line

Hash functions
 Simple integer hash: prime table size
 Multiplication method
 Universal hashing guarantees no (always-)bad input

Perfect hashing
 Requires a known, fixed keyset
 Achieves O(1) time, O(n) space, guaranteed!

Extendible hashing
 For disk-based data
 Combine with a B-Tree directory if needed