CSE 326: Data Structures Lecture #13 Bart Niswonger Summer Quarter 2001.



Today’s Outline
– Hashing strings
– Universal hash functions
– Collisions
– Probing
– Rehashing
– …

Good Hash Function for Strings?
I want to be able to:
– insert(“kale”)
– insert(“Krispy Kreme”)
– insert(“kim chi”)

Good Hash Function for Strings?
Sum the ASCII values of the characters? Consider only the first 3 characters?
– Uses only 2,871 out of 17,576 entries in the table on English words.
Better: think of the string as a base-128 number. Let s = s1s2s3…sn and choose
– hash(s) = s1 + s2·128 + s3·128^2 + … + sn·128^(n-1)
Problems:
– hash(“really, really big”) = well… something really, really big
– hash(“one thing”) % 128 = hash(“other thing”) % 128
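The base-128 scheme above is usually evaluated with Horner’s rule, reducing modulo the table size at every step so the intermediate value never overflows. A minimal sketch (the name `stringHash` is ours, not the lecture’s; note that Horner’s rule makes the *first* character the most significant digit):

```cpp
#include <cassert>
#include <string>

// Base-128 string hash via Horner's rule. Taking the remainder at each
// step keeps the running value small; the first character ends up as the
// most significant "digit" of the base-128 number.
unsigned stringHash(const std::string& s, unsigned tableSize) {
    unsigned h = 0;
    for (char c : s)
        h = (h * 128 + static_cast<unsigned char>(c)) % tableSize;
    return h;
}
```

For example, `stringHash("a", 1000)` is just the ASCII value 97, and `stringHash("ab", 1000)` is (97·128 + 98) mod 1000 = 514.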

Universal Hashing
For any fixed hash function, there will be some pathological sets of inputs
– everything hashes to the same cell!
Solution: Universal Hashing
– Start with a large (parameterized) class of hash functions
   No sequence of inputs is bad for all of them!
– When your program starts up, pick one of the hash functions to use at random (for the entire time)
– Now: no bad inputs, only unlucky choices!
   If the universal class is large, the odds of making a bad choice are very low
   If you do find you are in trouble, just pick a different hash function and re-hash the previous inputs

“Random” Vector Universal Hash
Parameterized by a prime table size and a vector:
– a = <a0, a1, …, ar> where 0 <= ai < size
Represent each key as r + 1 integers k0, k1, …, kr where each ki < size
– size = 11, key = … ==> …
– size = 29, key = “hello world” ==> …
ha(k) = (a0·k0 + a1·k1 + … + ar·kr) mod size, i.e. the dot product of the key with a “random” vector a
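A sketch of this scheme (the struct name `VectorHash` and the use of `std::rand` to pick the vector at startup are our choices; a real implementation would use a better random source):

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Dot-product universal hash: h_a(k) = (sum of a_i * k_i) mod size.
// The random vector a is chosen once, when the table is created; each key
// must already be broken into digits k_i with k_i < size.
struct VectorHash {
    unsigned size;              // prime table size
    std::vector<unsigned> a;    // "random" vector, 0 <= a_i < size

    VectorHash(unsigned primeSize, unsigned digits)
        : size(primeSize), a(digits) {
        for (unsigned& ai : a) ai = std::rand() % size;
    }

    unsigned operator()(const std::vector<unsigned>& k) const {
        unsigned long long h = 0;   // 64 bits: the dot product can be large
        for (std::size_t i = 0; i < a.size() && i < k.size(); ++i)
            h += static_cast<unsigned long long>(a[i]) * k[i];
        return static_cast<unsigned>(h % size);
    }
};
```

Whatever vector gets picked, the result always lands in [0, size).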

“Random” Vector Universal Hash
Strengths:
– works on any type as long as you can form the ki’s
– if we’re building a static table, we can try many a’s
– a random a has guaranteed good properties no matter what we’re hashing
Weaknesses:
– must choose a prime table size larger than any ki

Alternate Universal Hash Function
Parameterized by k, a, and b:
– k * size should fit into an int
– a and b must be less than size
hk,a,b(x) = ((a·x + b) mod (k·size)) / k

Alternate Universal Hash: Example
Context: hash integers in a table of size 16
let k = 32, a = 100, b = 200
hk,a,b(1000) = ((100*1000 + 200) % (32*16)) / 32 = (100200 % 512) / 32 = 360 / 32 = 11
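The worked example can be checked by coding the formula directly (a sketch; the name `altHash` is ours, and the 64-bit intermediate guards against overflow for large a·x):

```cpp
#include <cassert>

// h_{k,a,b}(x) = ((a*x + b) mod (k*size)) / k, using integer division.
unsigned altHash(unsigned long long x, unsigned k, unsigned a, unsigned b,
                 unsigned size) {
    unsigned long long m = static_cast<unsigned long long>(k) * size;
    return static_cast<unsigned>(((a * x + b) % m) / k);
}
```

With k = 32, a = 100, b = 200, size = 16, `altHash(1000, 32, 100, 200, 16)` reproduces the slide’s value of 11.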

Alternate Universal Hash Function
Strengths:
– if we’re building a static table, we can try many parameter values
– random a, b has guaranteed good properties no matter what we’re hashing
– can choose any size table
– very efficient if k and size are powers of 2
Weaknesses:
– still need to turn non-integer keys into integers

Hash Function Summary
Goals of a hash function:
– reproducible mapping from key to table entry
– evenly distribute keys across the table
– separate commonly occurring keys
– complete quickly
Hash functions:
– h(n) = n % size
– h(n) = string as base-128 number % size
– one universal hash function: dot product with a random vector
– other universal hash functions…

How to Design a Hash Function
– Know what your keys are
– Study how your keys are distributed
– Try to include all important information in a key in the construction of its hash
– Try to make “neighboring” keys hash to very different places
– Prune the features used to create the hash until it runs “fast enough” (very application dependent)

Collisions
The pigeonhole principle says we can’t avoid all collisions
– try to hash m keys into n slots with m > n without collision
– try to put 6 pigeons into 5 holes
What do we do when two keys hash to the same entry?
– open hashing: put little dictionaries in each entry
– closed hashing: pick a next entry to try (shove extra pigeons in one hole!)

Open Hashing or Hashing with Chaining
Put a little dictionary at each entry
– choose its type as appropriate
– the common case is an unordered linked list (chain)
Properties
– the load factor λ can be greater than 1
– performance degrades with the length of the chains
Example collisions: h(a) = h(d), h(e) = h(b)

Open Hashing Code
Dictionary & findBucket(const Key & k) {
  return table[hash(k) % table.size];
}
void insert(const Key & k, const Value & v) {
  findBucket(k).insert(k, v);
}
void delete(const Key & k) {
  findBucket(k).delete(k);
}
Value & find(const Key & k) {
  return findBucket(k).find(k);
}
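A compilable version of the pseudocode above might look like the following sketch (`ChainedTable` and the choice of int keys and string values are ours; `remove` stands in for the slide’s `delete`, which is a reserved word in C++):

```cpp
#include <cassert>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hashing with chaining: each table entry holds a little unordered
// linked list (the "chain") of key/value pairs.
class ChainedTable {
    std::vector<std::list<std::pair<int, std::string>>> table;
public:
    explicit ChainedTable(std::size_t n) : table(n) {}

    std::list<std::pair<int, std::string>>& findBucket(int k) {
        return table[static_cast<std::size_t>(k) % table.size()];
    }
    void insert(int k, const std::string& v) {
        findBucket(k).push_back({k, v});
    }
    std::string* find(int k) {            // nullptr if the key is absent
        for (auto& kv : findBucket(k))
            if (kv.first == k) return &kv.second;
        return nullptr;
    }
    void remove(int k) {                  // "delete" is reserved in C++
        findBucket(k).remove_if([k](const std::pair<int, std::string>& kv) {
            return kv.first == k;
        });
    }
}; 
```

With a size-7 table, keys 55 and 62 both hash to bucket 6 and simply share a chain.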

Load Factor in Open Hashing
Search cost:
– unsuccessful search: examines λ nodes on average (the whole chain)
– successful search: examines about 1 + λ/2 nodes on average (half the chain)
Desired load factor: λ ≈ 1 (a small constant)

Closed Hashing / Open Addressing
What if we only allow one key at each entry?
– two objects that hash to the same spot can’t both go there
– the first one there gets the spot
– the next one must go in another spot
Properties
– λ ≤ 1
– performance degrades with the difficulty of finding the right spot
Example collisions: h(a) = h(d), h(e) = h(b)

Probing
How to probe:
– First probe: given a key k, hash to h(k)
– Second probe: if h(k) is occupied, try h(k) + f(1)
– Third probe: if h(k) + f(1) is occupied, try h(k) + f(2)
– And so forth
Probing properties:
– we force f(0) = 0
– the ith probe is to (h(k) + f(i)) mod size
– if i reaches size - 1, the probe has failed
– depending on f(), the probe may fail sooner
– long sequences of probes are costly!

Linear Probing
f(i) = i, so the probe sequence is
– h(k) mod size
– (h(k) + 1) mod size
– (h(k) + 2) mod size
– …
findEntry using linear probing:
bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash1(k);
  do {
    entry = &table[probePoint];
    probePoint = (probePoint + 1) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}

Linear Probing Example (table size 7)
– insert(55): 55 % 7 = 6, slot empty (1 probe)
– insert(76): 76 % 7 = 6, taken; slot 0 empty (2 probes)
– insert(93): 93 % 7 = 2, slot empty (1 probe)
– insert(40): 40 % 7 = 5, slot empty (1 probe)
– insert(47): 47 % 7 = 5, taken; slots 6 and 0 also taken; slot 1 empty (4 probes)
– insert(10): 10 % 7 = 3, slot empty (1 probe)
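A trace like the one above can be reproduced with a small simulation (a sketch, assuming a table of ints with -1 marking an empty slot; `linearInsert` and its probe-counting convention are ours):

```cpp
#include <cassert>
#include <vector>

// Linear-probing insert with h(k) = k % size and f(i) = i.
// Returns the number of slots examined (the probe count).
int linearInsert(std::vector<int>& table, int key) {
    int size = static_cast<int>(table.size());
    int probePoint = key % size;
    int probes = 1;
    while (table[probePoint] != -1) {        // slot taken: try the next one
        probePoint = (probePoint + 1) % size;
        ++probes;
    }
    table[probePoint] = key;
    return probes;
}
```

Inserting 55, 76, 93, 40, 47, 10 into a size-7 table leaves 76 in slot 0, 47 in slot 1, 93 in slot 2, 10 in slot 3, 40 in slot 5, and 55 in slot 6.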

Load Factor in Linear Probing
For any λ < 1, linear probing will find an empty slot
Search cost (for large table sizes):
– successful search: ½(1 + 1/(1 - λ))
– unsuccessful search: ½(1 + 1/(1 - λ)²)
Linear probing suffers from primary clustering
Performance quickly degrades for λ > ½

Quadratic Probing
f(i) = i², so the probe sequence is
– h(k) mod size
– (h(k) + 1) mod size
– (h(k) + 4) mod size
– (h(k) + 9) mod size
– …
findEntry using quadratic probing:
bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash1(k), numProbes = 0;
  do {
    entry = &table[probePoint];
    numProbes++;
    probePoint = (probePoint + 2*numProbes - 1) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}

Quadratic Probing Example (table size 7)
– insert(76): 76 % 7 = 6, slot empty (1 probe)
– insert(40): 40 % 7 = 5, slot empty (1 probe)
– insert(48): 48 % 7 = 6, taken; 6 + 1 = 0 mod 7 empty (2 probes)
– insert(5): 5 % 7 = 5, taken; 5 + 1 = 6 taken; 5 + 4 = 2 mod 7 empty (3 probes)
– insert(55): 55 % 7 = 6, taken; 6 + 1 = 0 taken; 6 + 4 = 3 mod 7 empty (3 probes)
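The same trace can be simulated with a sketch that uses the incremental trick from the code above: i² = (i-1)² + 2i - 1, so each step just adds 2·numProbes - 1. (Our assumptions: a table of ints with -1 marking empty, and a bail-out after `size` probes so a cycling sequence reports failure instead of looping forever.)

```cpp
#include <cassert>
#include <vector>

// Quadratic-probing insert (f(i) = i^2) with h(k) = k % size.
// Returns the probe count, or -1 if the probe sequence cycles.
int quadraticInsert(std::vector<int>& table, int key) {
    int size = static_cast<int>(table.size());
    int probePoint = key % size, numProbes = 1;
    while (table[probePoint] != -1) {
        if (numProbes > size) return -1;   // sequence is cycling: give up
        probePoint = (probePoint + 2 * numProbes - 1) % size;
        ++numProbes;
    }
    table[probePoint] = key;
    return numProbes;
}
```

Run on the failing example from the next slide (76, 93, 40, 35 already placed in a size-7 table), `quadraticInsert` for 47 cycles through slots 5, 6, 2, 0 forever and returns -1 even though slots 1, 3, and 4 are empty.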

Quadratic Probing Example (table size 7): failure 
– insert(76): 76 % 7 = 6, slot empty
– insert(93): 93 % 7 = 2, slot empty
– insert(40): 40 % 7 = 5, slot empty
– insert(35): 35 % 7 = 0, slot empty
– insert(47): 47 % 7 = 5, taken; the probe sequence 5, 6, 2, 0, 0, 2, 6, 5, … cycles through occupied slots forever, even though slots 1, 3, and 4 are empty

Quadratic Probing Succeeds (for λ ≤ ½)
If size is prime and λ ≤ ½, then quadratic probing will find an empty slot in size/2 probes or fewer.
– show that for all 0 ≤ i, j ≤ size/2 with i ≠ j:
   (h(x) + i²) mod size ≠ (h(x) + j²) mod size
– by contradiction: suppose that for some such i, j:
   (h(x) + i²) mod size = (h(x) + j²) mod size
   then i² mod size = j² mod size
   so (i² - j²) mod size = 0
   so [(i + j)(i - j)] mod size = 0
– since size is prime, size must divide (i + j) or (i - j)
– but how can i + j = 0 or i + j = size when i ≠ j and i, j ≤ size/2?
– same for (i - j) mod size = 0

Quadratic Probing May Fail (for > ½) For any i larger than size/2, there is some j smaller than i that adds with i to equal size (or a multiple of size). D’oh!

Load Factor in Quadratic Probing
For any λ ≤ ½, quadratic probing will find an empty slot; for greater λ, quadratic probing may fail to find one
Quadratic probing does not suffer from primary clustering
Quadratic probing does suffer from secondary clustering
– How could we possibly solve this?

Double Hashing
f(i) = i · hash2(k), so the probe sequence is
– h1(k) mod size
– (h1(k) + 1 · hash2(k)) mod size
– (h1(k) + 2 · hash2(k)) mod size
– …
Code for finding the next probe:
bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash1(k), hashIncr = hash2(k);
  do {
    entry = &table[probePoint];
    probePoint = (probePoint + hashIncr) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}

A Good Double Hash Function…
…is quick to evaluate.
…differs from the original hash function.
…never evaluates to 0 (mod size).
One good choice is to choose a prime R < size and:
hash2(x) = R - (x mod R)

Double Hashing Example (table size 7, hash2(x) = 5 - (x mod 5))
– insert(55): 55 % 7 = 6, slot empty (1 probe)
– insert(76): 76 % 7 = 6, taken; hash2(76) = 4; 6 + 4 = 3 mod 7 empty (2 probes)
– insert(93): 93 % 7 = 2, slot empty (1 probe)
– insert(40): 40 % 7 = 5, slot empty (1 probe)
– insert(47): 47 % 7 = 5, taken; hash2(47) = 3; 5 + 3 = 1 mod 7 empty (2 probes)
– insert(10): 10 % 7 = 3, taken; hash2(10) = 5; slots 1 and 6 also taken; slot 4 empty (4 probes)
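A sketch that reproduces this trace (our assumptions again: a table of ints with -1 marking empty, R = 5 hard-coded for the size-7 example, and `doubleHashInsert` as our name):

```cpp
#include <cassert>
#include <vector>

// Double-hashing insert: h1(k) = k % size, hash2(k) = R - (k % R).
// With prime R < size, hash2 never evaluates to 0, so the probe
// sequence always advances. Returns the probe count.
int doubleHashInsert(std::vector<int>& table, int key) {
    int size = static_cast<int>(table.size());
    const int R = 5;                       // prime smaller than the table size
    int probePoint = key % size;
    int hashIncr = R - (key % R);          // the second hash: never 0
    int probes = 1;
    while (table[probePoint] != -1) {
        probePoint = (probePoint + hashIncr) % size;
        ++probes;
    }
    table[probePoint] = key;
    return probes;
}
```

Note how insert(47) takes only 2 probes here, versus 4 with linear probing on the same keys: different keys that collide at slot 5 get different probe increments, which is what eliminates secondary clustering.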

Load Factor in Double Hashing
For any λ < 1, double hashing will find an empty slot (given an appropriate table size and hash2)
Search cost appears to approach that of an optimal (random) hash:
– successful search: (1/λ) ln(1/(1 - λ))
– unsuccessful search: 1/(1 - λ)
No primary clustering and no secondary clustering
One extra hash calculation

Deletion in Closed Hashing
delete(2), then find(7): where is it?!
Must use lazy deletion!
On insertion, treat a deleted item as an empty slot
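A sketch of lazy deletion with linear probing (the three-state `Slot` representation and all names here are our illustration, not the lecture’s code): find keeps probing past DELETED slots instead of stopping, while insert is free to reuse them.

```cpp
#include <cassert>
#include <vector>

// Each slot is EMPTY (never used), FULL, or DELETED (lazily removed).
enum State { EMPTY, FULL, DELETED };
struct Slot { State state = EMPTY; int key = 0; };

bool lazyFind(const std::vector<Slot>& t, int k) {
    int size = static_cast<int>(t.size()), p = k % size;
    for (int i = 0; i < size; ++i, p = (p + 1) % size) {
        if (t[p].state == EMPTY) return false;            // truly empty: stop
        if (t[p].state == FULL && t[p].key == k) return true;
    }                                                     // DELETED: keep going
    return false;
}

void lazyInsert(std::vector<Slot>& t, int k) {
    int size = static_cast<int>(t.size()), p = k % size;
    while (t[p].state == FULL) p = (p + 1) % size;        // reuses DELETED slots
    t[p].state = FULL;
    t[p].key = k;
}

void lazyRemove(std::vector<Slot>& t, int k) {
    int size = static_cast<int>(t.size()), p = k % size;
    for (int i = 0; i < size; ++i, p = (p + 1) % size)
        if (t[p].state == FULL && t[p].key == k) { t[p].state = DELETED; return; }
}
```

If the slot were marked truly EMPTY instead of DELETED, a later find for a key that had probed past it would stop early and wrongly report "not found".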

The Squished Pigeon Principle An insert using closed hashing cannot work with a load factor of 1 or more. An insert using closed hashing with quadratic probing may not work with a load factor of ½ or more. Whether you use open or closed hashing, large load factors lead to poor performance! How can we relieve the pressure on the pigeons? Hint: remember what happened when we overran a d-Heap’s array!

Rehashing
When the load factor gets “too large” (over a constant threshold on λ), rehash all the elements into a new, larger table:
– takes O(n), but amortized O(1) as long as we (just about) double the table size on the resize
– spreads keys back out, and may drastically improve performance
– gives us a chance to retune parameterized hash functions
– avoids failure for closed hashing techniques
– allows arbitrarily large tables starting from a small table
– clears out lazily deleted items
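The resize step can be sketched as follows (assumptions: a linear-probing table of ints with -1 marking empty, and roughly doubling to an odd size for brevity; a production table would grow to the next prime instead):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Rehash: allocate a roughly double-sized table and re-insert every key.
// Each insert is O(n) in the worst case here, but the whole pass is O(n)
// amortized over the inserts that filled the old table.
void rehash(std::vector<int>& table) {
    std::vector<int> old = std::move(table);
    table.assign(old.size() * 2 + 1, -1);      // roughly double, odd size
    int size = static_cast<int>(table.size());
    for (int k : old) {
        if (k == -1) continue;                 // skip empty (and lazily
        int p = k % size;                      // deleted) slots
        while (table[p] != -1) p = (p + 1) % size;
        table[p] = k;                          // keys land at fresh positions
    }
}
```

Note that every key must be re-inserted with the new table size: k % size changes when size changes, so the old positions are meaningless in the new table.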

To Do
– Finish Project II
– Read chapter 5 in the book

Coming Up
– Extendible hashing (hashing for HUGE data sets)
– Disjoint-set union-find ADT
– Project II due (Wednesday)
– Project III handout (Wednesday)
– Quiz (Thursday)