CSE332: Data Abstractions Lecture 11: Hash Tables Dan Grossman Spring 2010.

Slides:

Advertisements

Similar presentations

Advertisements

Lecture 6 Hashing. Motivating Example Want to store a list whose elements are integers between 1 and 5 Will define an array of size 5, and if the list.

Data Structures Using C++ 2E

Hashing as a Dictionary Implementation

Hashing21 Hashing II: The leftovers. hashing22 Hash functions Choice of hash function can be important factor in reducing the likelihood of collisions.

© 2006 Pearson Addison-Wesley. All rights reserved13 A-1 Chapter 13 Hash Tables.

1 CSE 326: Data Structures Hash Tables Autumn 2007 Lecture 14.

Hashing Text Read Weiss, §5.1 – 5.5 Goal Perform inserts, deletes, and finds in constant average time Topics Hash table, hash function, collisions Collision.

Introduction to Hashing CS 311 Winter, Dictionary Structure A dictionary structure has the form: (Key, Data) Dictionary structures are organized.

Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.

Tirgul 8 Hash Tables (continued) Reminder Examples.

Lecture 11 oct 7 Goals: hashing hash functions chaining closed hashing application of hashing.

Hashing General idea: Get a large array

Data Structures Using C++ 2E Chapter 9 Searching and Hashing Algorithms.

Introducing Hashing Chapter 21 Copyright ©2012 by Pearson Education, Inc. All rights reserved.

CS 206 Introduction to Computer Science II 04 / 06 / 2009 Instructor: Michael Eckmann.

§3 Separate Chaining ---- keep a list of all keys that hash to the same value struct ListNode; typedef struct ListNode *Position; struct HashTbl; typedef.

1. 2 Problem RT&T is a large phone company, and they want to provide enhanced caller ID capability: –given a phone number, return the caller’s name –phone.

MA/CSSE 473 Day 28 Hashing review B-tree overview Dynamic Programming.

ICS220 – Data Structures and Algorithms Lecture 10 Dr. Ken Cosh.

Symbol Tables Symbol tables are used by compilers to keep track of information about variables functions class names type names temporary variables etc.

Data Structures and Algorithm Analysis Hashing Lecturer: Jing Liu Homepage:

IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture8.

Hashing Chapter 20. Hash Table A hash table is a data structure that allows fast find, insert, and delete operations (most of the time). The simplest.

Algorithm Course Dr. Aref Rashad February Algorithms Course..... Dr. Aref Rashad Part: 4 Search Algorithms.

Comp 335 File Structures Hashing.

1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.

CSE 326: Data Structures Lecture #13 Bart Niswonger Summer Quarter 2001.

Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.

1 HASHING Course teacher: Moona Kanwal. 2 Hashing Mathematical concept –To define any number as set of numbers in given interval –To cut down part of.

CSE373: Data Structures & Algorithms Lecture 11: Hash Tables Dan Grossman Fall 2013.

CSE 332 Data Abstractions: B Trees and Hash Tables Make a Complete Breakfast Kate Deibel Summer 2012 July 9, 2012CSE 332 Data Abstractions, Summer

Hashing as a Dictionary Implementation Chapter 19.

CSE332: Data Abstractions Lecture 11: Hash Tables Dan Grossman Spring 2012.

CSE373: Data Structures & Algorithms Lecture 17: Hash Collisions Kevin Quinn Fall 2015.

Hash Tables - Motivation

Searching Given distinct keys k 1, k 2, …, k n and a collection of n records of the form »(k 1,I 1 ), (k 2,I 2 ), …, (k n, I n ) Search Problem - For key.

1 Hashing - Introduction Dictionary = a dynamic set that supports the operations INSERT, DELETE, SEARCH Dictionary = a dynamic set that supports the operations.

Chapter 5: Hashing Part I - Hash Tables. Hashing  What is Hashing?  Direct Access Tables  Hash Tables 2.

Hashing Basis Ideas A data structure that allows insertion, deletion and search in O(1) in average. A data structure that allows insertion, deletion and.

Hashing Chapter 7 Section 3. What is hashing? Hashing is using a 1-D array to implement a dictionary o This implementation is called a "hash table" Items.

COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.

Hashing Fundamental Data Structures and Algorithms Margaret Reid-Miller 18 January 2005.

Tirgul 11 Notes Hash tables –reminder –examples –some new material.

Chapter 5: Hashing Collision Resolution: Open Addressing Extendible Hashing Mark Allen Weiss: Data Structures and Algorithm Analysis in Java Lydia Sinapova,

CS261 Data Structures Hash Tables Open Address Hashing.

A Introduction to Computing II Lecture 11: Hashtables Fall Session 2000.

Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.

COSC 1030 Lecture 10 Hash Table. Topics Table Hash Concept Hash Function Resolve collision Complexity Analysis.

Chapter 13 C Advanced Implementations of Tables – Hash Tables.

CSE373: Data Structures & Algorithms Lecture 13: Hash Collisions Lauren Milne Summer 2015.

1 Hashing by Adlane Habed School of Computer Science University of Windsor May 6, 2005.

Hashing COMP171. Hashing 2 Hashing … * Again, a (dynamic) set of elements in which we do ‘search’, ‘insert’, and ‘delete’ n Linear ones: lists, stacks,

Slide 1 Hashing Slides adapted from various sources.

CSE373: Data Structures & Algorithms Lecture 6: Hash Tables Kevin Quinn Fall 2015.

CMSC 341 Hashing Readings: Chapter 5. Announcements Midterm II on Nov 7 Review out Oct 29 HW 5 due Thursday CMSC 341 Hashing 2.

Hash Tables Ellen Walker CPSC 201 Data Structures Hiram College.

Fundamental Structures of Computer Science II

Instructor: Lilian de Greef Quarter: Summer 2017

CE 221 Data Structures and Algorithms

CSE373: Data Structures & Algorithms Lecture 6: Hash Tables

CSE 373: Hash Tables Hunter Zahn Summer 2016 UW CSE 373, Summer 2016.

Instructor: Lilian de Greef Quarter: Summer 2017

CSE373: Data Structures & Algorithms Lecture 14: Hash Collisions

CSE 2331/5331 Topic 8: Hash Tables CSE 2331/5331.

CSE373: Data Structures & Algorithms Lecture 14: Hash Collisions

CSE332: Data Abstractions Lecture 11: Hash Tables

CSE373: Data Structures & Algorithms Lecture 13: Hash Tables

CSE373: Data Structures & Algorithms Lecture 13: Hash Tables

CSE 373: Data Structures and Algorithms

Presentation transcript:

CSE332: Data Abstractions Lecture 11: Hash Tables Dan Grossman Spring 2010

Hash Tables: Review Aim for constant-time (i.e., O(1)) find, insert, and delete –“On average” under some reasonable assumptions A hash table is an array of some fixed size –But growable as we’ll see Spring 20102CSE332: Data Abstractions E inttable-index collision? collision resolution client hash table library 0 … TableSize –1 hash table

Hash Tables: A Different ADT? In terms of a Dictionary ADT for just insert, find, delete, hash tables and balanced trees are just different data structures –Hash tables O(1) on average (assuming few collisions) –Balanced trees O( log n) worst-case Constant-time is better, right? –Yes, but you need “hashing to behave” (collisions) –Yes, but findMin, findMax, predecessor, and successor go from O( log n) to O(n) Why your textbook considers this to be a different ADT Not so important to argue over the definitions Spring 20103CSE332: Data Abstractions

Collision resolution Collision: When two keys map to the same location in the hash table We try to avoid it, but number-of-keys exceeds table size So hash tables should support collision resolution –Ideas? Spring 20104CSE332: Data Abstractions

Separate Chaining Chaining: All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”) As easy as it sounds Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10 Spring 20105CSE332: Data Abstractions 0/ 1/ 2/ 3/ 4/ 5/ 6/ 7/ 8/ 9/

Separate Chaining Spring 20106CSE332: Data Abstractions 0 1/ 2/ 3/ 4/ 5/ 6/ 7/ 8/ 9/ 10/ Chaining: All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”) As easy as it sounds Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10

Separate Chaining Spring 20107CSE332: Data Abstractions 0 1/ 2 3/ 4/ 5/ 6/ 7/ 8/ 9/ 10/ 22/ Chaining: All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”) As easy as it sounds Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10

Separate Chaining Spring 20108CSE332: Data Abstractions 0 1/ 2 3/ 4/ 5/ 6/ 7 8/ 9/ 10/ 22/ 107/ Chaining: All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”) As easy as it sounds Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10

Separate Chaining Spring 20109CSE332: Data Abstractions 0 1/ 2 3/ 4/ 5/ 6/ 7 8/ 9/ 10/ / 22/ Chaining: All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”) As easy as it sounds Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10

Separate Chaining Spring CSE332: Data Abstractions 0 1/ 2 3/ 4/ 5/ 6/ 7 8/ 9/ 10/ / / Chaining: All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”) As easy as it sounds Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10

Thoughts on chaining Worst-case time for find : linear –But only with really bad luck or bad hash function –So not worth avoiding (e.g., with balanced trees at each bucket) Beyond asymptotic complexity, some “data-structure engineering” may be warranted –Linked list vs. array vs. chunked list (lists should be short!) –Move-to-front (cf. Project 2) –Better idea: Leave room for 1 element (or 2?) in the table itself, to optimize constant factors for the common case A time-space trade-off… Spring CSE332: Data Abstractions

Time vs. space (constant factors only here) Spring CSE332: Data Abstractions 0 1/ 2 3/ 4/ 5/ 6/ 7 8/ 9/ 10/ / / 010/ 1/X 242 3/X 4/X 5/X 6/X 7107/ 8/X 9/X /

More rigorous chaining analysis Definition: The load factor,, of a hash table is Spring CSE332: Data Abstractions  number of elements Under chaining, the average number of elements per bucket is So if some inserts are followed by random finds, then on average: Each unsuccessful find compares against ____ items Each successful find compares against _____ items

More rigorous chaining analysis Definition: The load factor,, of a hash table is Spring CSE332: Data Abstractions  number of elements Under chaining, the average number of elements per bucket is So if some inserts are followed by random finds, then on average: Each unsuccessful find compares against items Each successful find compares against / 2 items

Alternative: Use empty space in the table Another simple idea: If h(key) is already full, –try (h(key) + 1) % TableSize. If full, –try (h(key) + 2) % TableSize. If full, –try (h(key) + 3) % TableSize. If full… Example: insert 38, 19, 8, 109, 10 Spring CSE332: Data Abstractions 0/ 1/ 2/ 3/ 4/ 5/ 6/ 7/ 838 9/

Alternative: Use empty space in the table Spring CSE332: Data Abstractions 0/ 1/ 2/ 3/ 4/ 5/ 6/ 7/ Another simple idea: If h(key) is already full, –try (h(key) + 1) % TableSize. If full, –try (h(key) + 2) % TableSize. If full, –try (h(key) + 3) % TableSize. If full… Example: insert 38, 19, 8, 109, 10

Alternative: Use empty space in the table Spring CSE332: Data Abstractions 08 1/ 2/ 3/ 4/ 5/ 6/ 7/ Another simple idea: If h(key) is already full, –try (h(key) + 1) % TableSize. If full, –try (h(key) + 2) % TableSize. If full, –try (h(key) + 3) % TableSize. If full… Example: insert 38, 19, 8, 109, 10

Alternative: Use empty space in the table Spring CSE332: Data Abstractions / 3/ 4/ 5/ 6/ 7/ Another simple idea: If h(key) is already full, –try (h(key) + 1) % TableSize. If full, –try (h(key) + 2) % TableSize. If full, –try (h(key) + 3) % TableSize. If full… Example: insert 38, 19, 8, 109, 10

Alternative: Use empty space in the table Spring CSE332: Data Abstractions / 4/ 5/ 6/ 7/ Another simple idea: If h(key) is already full, –try (h(key) + 1) % TableSize. If full, –try (h(key) + 2) % TableSize. If full, –try (h(key) + 3) % TableSize. If full… Example: insert 38, 19, 8, 109, 10

Open addressing This is one example of open addressing In general, open addressing means resolving collisions by trying a sequence of other positions in the table. Trying the next spot is called probing –Our i th probe was (h(key) + i) % TableSize This is called linear probing –In general have some probe function f and use h(key) + f(i) % TableSize Open addressing does poorly with high load factor –So want larger tables –Too many probes means no more O(1) Spring CSE332: Data Abstractions

Terminology We and the book use the terms –“chaining” or “separate chaining” –“open addressing” Very confusingly, –“open hashing” is a synonym for “chaining” –“closed hashing” is a synonym for “open addressing” (If it makes you feel any better, most trees in CS grow upside-down ) Spring CSE332: Data Abstractions

Other operations Okay, so insert finds an open table position using a probe function What about find ? –Must use same probe function to “retrace the trail” and find the data –Unsuccessful search when reach empty position What about delete ? –Must use “lazy” deletion. Why? –But here just means “no data here, but don’t stop probing” –Note: delete with chaining is plain-old list-remove Spring CSE332: Data Abstractions

(Primary) Clustering It turns out linear probing is a bad idea, even though the probe function is quick to compute (a good thing) Spring CSE332: Data Abstractions [R. Sedgewick] Tends to produce clusters, which lead to long probing sequences Called primary clustering Saw this starting in our example

Analysis of Linear Probing Trivial fact: For any < 1, linear probing will find an empty slot –It is “safe” in this sense: no infinite loop unless table is full Non-trivial facts we won’t prove: Average # of probes given (in the limit as TableSize →  ) –Unsuccessful search: –Successful search: This is pretty bad: need to leave sufficient empty space in the table to get decent performance (see chart) Spring CSE332: Data Abstractions

In a chart Linear-probing performance degrades rapidly as table gets full –(Formula assumes “large table” but point remains) By comparison, chaining performance is linear in and has no trouble with >1 Spring CSE332: Data Abstractions

Quadratic probing We can avoid primary clustering by changing the probe function A common technique is quadratic probing: –f(i) = i 2 –So probe sequence is: 0 th probe: h(key) % TableSize 1 st probe: (h(key) + 1) % TableSize 2 nd probe: (h(key) + 4) % TableSize 3 rd probe: (h(key) + 9) % TableSize … i th probe: (h(key) + i 2 ) % TableSize Intuition: Probes quickly “leave the neighborhood” Spring CSE332: Data Abstractions

Quadratic Probing Example Spring CSE332: Data Abstractions TableSize=10 Insert:

Quadratic Probing Example Spring CSE332: Data Abstractions TableSize=10 Insert:

Quadratic Probing Example Spring CSE332: Data Abstractions TableSize=10 Insert:

Quadratic Probing Example Spring CSE332: Data Abstractions TableSize=10 Insert:

Quadratic Probing Example Spring CSE332: Data Abstractions TableSize=10 Insert:

Quadratic Probing Example Spring CSE332: Data Abstractions TableSize=10 Insert:

Another Quadratic Probing Example Spring CSE332: Data Abstractions TableSize = 7 Insert: 76 (76 % 7 = 6) 40 (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5)

Another Quadratic Probing Example Spring CSE332: Data Abstractions TableSize = 7 Insert: 76 (76 % 7 = 6) 40 (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5)

Another Quadratic Probing Example Spring CSE332: Data Abstractions TableSize = 7 Insert: 76 (76 % 7 = 6) 40 (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5)

Another Quadratic Probing Example Spring CSE332: Data Abstractions TableSize = 7 Insert: 76 (76 % 7 = 6) 40 (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5)

Another Quadratic Probing Example Spring CSE332: Data Abstractions TableSize = 7 Insert: 76 (76 % 7 = 6) 40 (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5)

Another Quadratic Probing Example Spring CSE332: Data Abstractions TableSize = 7 Insert: 76 (76 % 7 = 6) 40 (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5)

Another Quadratic Probing Example Spring CSE332: Data Abstractions TableSize = 7 Insert: 76 (76 % 7 = 6) 40 (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5) Uh-oh: For all n, ((n*n) +5) % 7 is 0, 2, 5, or 6 Excel shows takes “at least” 50 probes and a pattern Proof uses induction and (n 2 +5) % 7 = ((n-7) 2 +5) % 7 In fact, for all c and k, (n 2 +c) % k = ((n-k) 2 +c) % k

From bad news to good news The bad news is: After TableSize quadratic probes, we will just cycle through the same indices The good news: –Assertion #1: If T = TableSize is prime and < ½, then quadratic probing will find an empty slot in at most T/2 probes –Assertion #2: For prime T and 0  i,j  T/2 where i  j, (h(key) + i 2 ) % T  (h(key) + j 2 ) % T –Assertion #3: Assertion #2 is the “key fact” for proving Assertion #1 So: If you keep < ½, no need to detect cycles Spring CSE332: Data Abstractions

Clustering reconsidered Quadratic probing does not suffer from primary clustering: no problem with keys initially hashing to the same neighborhood But it’s no help if keys initially hash to the same index –Called secondary clustering Can avoid secondary clustering with a probe function that depends on the key: double hashing… Spring CSE332: Data Abstractions

Double hashing Idea: –Given two good hash functions h and g, it is very unlikely that for some key, h(key) == g(key) –So make the probe function f(i) = i*g(key) Probe sequence: 0 th probe: h(key) % TableSize 1 st probe: (h(key) + g(key)) % TableSize 2 nd probe: (h(key) + 2*g(key)) % TableSize 3 rd probe: (h(key) + 3*g(key)) % TableSize … i th probe: (h(key) + i*g(key)) % TableSize Detail: Make sure g(key) can’t be 0 Spring CSE332: Data Abstractions

Double-hashing analysis Intuition: Since each probe is “jumping” by g(key) each time, we “leave the neighborhood” and “go different places from other initial collisions” But we could still have a problem like in quadratic probing where we are not “safe” (infinite loop despite room in table) –It is known that this cannot happen in at least one case: h(key) = key % p g(key) = q – (key % q) 2 < q < p p and q are prime Spring CSE332: Data Abstractions

More double-hashing facts Assume “uniform hashing” –Means probability of g(key1) % p == g(key2) % p is 1/p Non-trivial facts we won’t prove: Average # of probes given (in the limit as TableSize →  ) –Unsuccessful search (intuitive): –Successful search (less intuitive): Bottom line: unsuccessful bad (but not as bad as linear probing), but successful is not nearly as bad Spring CSE332: Data Abstractions

Charts Spring CSE332: Data Abstractions

Where are we? Chaining is easy – insert, find, delete proportion to load factor on average Open addressing uses probe functions, has clustering issues as table gets full –Why use it: Less memory allocation? Easier data representation? Now: –Growing the table when it gets too full –Relation between hashing/comparing and connection to Java Spring CSE332: Data Abstractions

Rehashing Like with array-based stacks/queues/lists, if table gets too full, create a bigger table and copy everything over Especially with chaining, we get to decide what “too full” means –Keep load factor reasonable (e.g., < 1)? –Consider average or max size of non-empty chains? –For open addressing, half-full is a good rule of thumb New table size –Twice-as-big is a good idea, except, uhm, that won’t be prime! –So go about twice-as-big –Can have a list of prime numbers in your code since you won’t grow more than times Spring CSE332: Data Abstractions

More on rehashing We double the size (rather than “add 1000”) to get good amortized guarantees (still promising to prove that later ) But one resize is an O(n) operation, involving n calls to the hash function (1 for each insert in the new table) Space/time tradeoff: Could store h(key) with each data item, but since rehashing is rare, this is probably a poor use of space –And growing the table is still O(n) Spring CSE332: Data Abstractions

Hashing and comparing Haven’t emphasized enough for a find or a delete of an item of type E, we hash E, but then as we go through the chain or keep probing, we have to compare each item we see to E. So a hash table needs a hash function and a comparator –In Project 2, you’ll use two function objects –The Java standard library uses a more OO approach where each object has an equals method and a hashCode method: Spring CSE332: Data Abstractions class Object { boolean equals(Object o) {…} int hashCode() {…} … }

Equal objects must hash the same The Java library (and your project hash table) make a very important assumption that clients must satisfy… OO way of saying it: If a.equals(b), then we must require a.hashCode()==b.hashCode() Function object way of saying i: If c.compare(a,b) == 0, then we must require h.hash(a) == h.hash(b) Why is this essential? Spring CSE332: Data Abstractions

Java bottom line Lots of Java libraries use hash tables, perhaps without your knowledge So: If you ever override equals, you need to override hashCode also in a consistent way –See CoreJava book, Chapter 5 for other “gotchas” with equals Spring CSE332: Data Abstractions

Bad Example Spring CSE332: Data Abstractions class PolarPoint { double r = 0.0; double theta = 0.0; void addToAngle(double theta2) { theta+=theta2; } … boolean equals(Object otherObject) { if(this==otherObject) return true; if(otherObject==null) return false; if(getClass()!=other.getClass()) return false; PolarPoint other = (PolarPoint)otherObject; double angleDiff = (theta – other.theta) % (2*Math.PI); double rDiff = r – other.r; return Math.abs(angleDiff) < && Math.abs(rDiff) < ; } // wrong: must override hashCode! } Think about using a hash table holding points

By the way: comparison has rules too We didn’t emphasize some important “rules” about comparison functions for: –all our dictionaries –sorting (next major topic) In short, comparison must impose a consistent, total ordering: For all a, b, and c, –If compare(a,b) 0 –If compare(a,b) == 0, then compare(b,a) == 0 –If compare(a,b) < 0 and compare(b,c) < 0, then compare(a,c) < 0 Spring CSE332: Data Abstractions

Final word on hashing The hash table is one of the most important data structures –Supports only find, insert, and delete efficiently Important to use a good hash function Important to keep hash table at a good size Side-comment: hash functions have uses beyond hash tables –Examples: Cryptography, check-sums Spring CSE332: Data Abstractions