CSE373: Data Structures & Algorithms Lecture 12: Hash Tables

Slides:



Advertisements
Similar presentations
Hashing as a Dictionary Implementation
Advertisements

CSE332: Data Abstractions Lecture 10: More B Trees; Hashing Dan Grossman Spring 2010.
Dictionaries and Hash Tables1  
Lecture 11 March 5 Goals: hashing dictionary operations general idea of hashing hash functions chaining closed hashing.
1 CSE 326: Data Structures Hash Tables Autumn 2007 Lecture 14.
Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.
Hash Tables1 Part E Hash Tables  
Hash Tables1 Part E Hash Tables  
Hash Tables1 Part E Hash Tables  
Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.
CS 206 Introduction to Computer Science II 11 / 12 / 2008 Instructor: Michael Eckmann.
Hashing General idea: Get a large array
Introducing Hashing Chapter 21 Copyright ©2012 by Pearson Education, Inc. All rights reserved.
1. 2 Problem RT&T is a large phone company, and they want to provide enhanced caller ID capability: –given a phone number, return the caller’s name –phone.
(c) University of Washingtonhashing-1 CSC 143 Java Hashing Set Implementation via Hashing.
Symbol Tables Symbol tables are used by compilers to keep track of information about variables functions class names type names temporary variables etc.
CS261 Data Structures Hash Tables Concepts. Goals Hash Functions Dealing with Collisions.
Hash Tables1   © 2010 Goodrich, Tamassia.
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
CSE373: Data Structures & Algorithms Lecture 11: Hash Tables Dan Grossman Fall 2013.
1 Hashing - Introduction Dictionary = a dynamic set that supports the operations INSERT, DELETE, SEARCH Dictionary = a dynamic set that supports the operations.
Hashing Fundamental Data Structures and Algorithms Margaret Reid-Miller 18 January 2005.
Tirgul 11 Notes Hash tables –reminder –examples –some new material.
Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.
1 Hashing by Adlane Habed School of Computer Science University of Windsor May 6, 2005.
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
CSE373: Data Structures & Algorithms Lecture 6: Hash Tables Kevin Quinn Fall 2015.
CSC 143T 1 CSC 143 Highlights of Tables and Hashing [Chapter 11 p (Tables)] [Chapter 12 p (Hashing)]
CSE373: Data Structures & Algorithms Lecture 13: Hash Tables
Hash Tables 1/28/2018 Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and M.
CSE373: Data Structures & Algorithms Priority Queues
Design & Analysis of Algorithm Hashing
CSE373: Data Structures & Algorithms Lecture 6: Hash Tables
CSCI 210 Data Structures and Algorithms
School of Computer Science and Engineering
CS 332: Algorithms Hash Tables David Luebke /19/2018.
Hashing Alexandra Stefan.
CSE 373: Hash Tables Hunter Zahn Summer 2016 UW CSE 373, Summer 2016.
Instructor: Lilian de Greef Quarter: Summer 2017
Dictionaries 9/14/ :35 AM Hash Tables   4
Hash Tables 3/25/15 Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and M.
Searching.
CSE332: Data Abstractions Lecture 10: More B Trees; Hashing
Hash Table.
Instructor: Lilian de Greef Quarter: Summer 2017
CSE373: Data Structures & Algorithms Lecture 14: Hash Collisions
Dictionaries and Hash Tables
Introduction to Algorithms 6.046J/18.401J
CSE373: Data Structures & Algorithms Lecture 14: Hash Collisions
CSE332: Data Abstractions Lecture 11: Hash Tables
CSE373: Data Structures & Algorithms Lecture 13: Hash Tables
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Hash Tables and Associative Containers
Dictionaries and Hash Tables
CSE 373 Data Structures and Algorithms
CSE 373: Data Structures and Algorithms
Dictionaries 1/17/2019 7:55 AM Hash Tables   4
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
CS202 - Fundamental Structures of Computer Science II
Introduction to Algorithms
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Hashing.
EE 312 Software Design and Implementation I
Data Structures – Week #7
CSE373: Data Structures & Algorithms Lecture 13: Hash Tables
CS210- Lecture 16 July 11, 2005 Agenda Maps and Dictionaries Map ADT
Data Structures and Algorithm Analysis Hashing
EE 312 Software Design and Implementation I
Dictionaries and Hash Tables
Presentation transcript:

CSE373: Data Structures & Algorithms Lecture 12: Hash Tables Lauren Milne Summer 2015

Announcements Homework 3 due today Homework 4 (hash tables) out today, partner programming assignment, find a partner by Monday

Motivating Hash Tables 9/11/2018 Motivating Hash Tables For a dictionary with n key, value pairs insert find delete Unsorted linked-list O(1) O(n) O(n) Unsorted array O(1) O(n) O(n) Sorted linked list O(n) O(n) O(n) Sorted array O(n) O(log n) O(n) Balanced tree O(log n) O(log n) O(log n) Magic array O(1) O(1) O(1) Sufficient “magic”: Use key to compute array index for an item in O(1) time [doable] Have a different index for every item [magic] balanced tree is almost there infinite possibilities for key values, so hard to have an different index for each to map to

Hash Tables hash table hash function: index = h(key) Aim for constant-time (i.e., O(1)) find, insert, and delete “On average” under some often-reasonable assumptions A hash table is an array of some fixed size Basic idea: TableSize –1 hash function: index = h(key) hash table key space (e.g., integers, strings) …

Hash Tables vs. Balanced Trees In terms of a Dictionary ADT for just insert, find, delete, hash tables and balanced trees are just different data structures Hash tables O(1) on average (assuming few collisions) Balanced trees O(log n) worst-case Constant-time is better, right? Yes, but you need “hashing to behave” (must avoid collisions) Yes, but findMin, findMax, predecessor, and successor go from O(log n) to O(n), printSorted from O(n) to O(n log n) Why your textbook considers this to be a different ADT

Hash Tables There are m possible keys (m typically large, even infinite) We expect our table to have only n items n is much less than m (often written n << m) Many dictionaries have this property Compiler: All possible identifiers allowed by the language vs. those used in some file of one program Database: All possible student names vs. students enrolled AI: All possible chess-board configurations vs. those considered by the current player …

Hash functions hash table hash function: index = h(key) TableSize –1 An ideal hash function: Fast to compute “Rarely” hashes two “used” keys to the same index Often impossible in theory but easy in practice Will handle collisions in next lecture hash table … hash function: index = h(key) TableSize –1 key space (e.g., integers, strings)

Who hashes what? Hash tables can be generic To store elements of type E, we just need E to be: Hashable: convert any E to an int Comparable: order any two E (only when dictionary) When hash tables are a reusable library, the division of responsibility generally breaks down into two roles: E int table-index collision? collision resolution client hash table library We will learn both roles, but most programmers “in the real world” spend more time as clients while understanding the library

More on roles Some ambiguity in terminology on which parts are “hashing” E int table-index collision? collision resolution client hash table library “hashing”? “hashing”? Two roles must both contribute to minimizing collisions (heuristically) Client should aim for different ints for expected items Avoid “wasting” any part of E or the 32 bits of the int Library should aim for putting “similar” ints in different indices Conversion to index is almost always “mod table-size” Using prime numbers for table-size is common

What to hash? We will focus on the two most common things to hash: ints and strings For objects with several fields, usually best to have most of the “identifying fields” contribute to the hash to avoid collisions Example: class Person { String first; String middle; String last; Date birthdate; } An inherent trade-off: hashing-time vs. collision-avoidance Bad/Good idea(?): Use only first name Use only middle initial Admittedly, what-to-hash-with is often unprincipled 

Hashing integers 1 key space = integers 2 Simple hash function: h(key) = key % TableSize Client: f(x) = x Library g(x) = x % TableSize Fairly fast and natural Example: TableSize = 10 Insert 7, 18, 41, 34, 10 (As usual, ignoring data “along for the ride”) 1 2 3 4 5 6 7 8 9

Hashing integers 1 key space = integers 2 Simple hash function: h(key) = key % TableSize Client: f(x) = x Library g(x) = x % TableSize Fairly fast and natural Example: TableSize = 10 Insert 7, 18, 41, 34, 10 (As usual, ignoring data “along for the ride”) 1 2 3 4 5 6 7 8 9

Hashing integers key space = integers Simple hash function: h(key) = key % TableSize Client: f(x) = x Library g(x) = x % TableSize Fairly fast and natural Example: TableSize = 10 Insert 7, 18, 41, 34, 10 (As usual, ignoring data “along for the ride”) 1 2 3 4 5 6 7 8 18 9

Hashing integers key space = integers Simple hash function: h(key) = key % TableSize Client: f(x) = x Library g(x) = x % TableSize Fairly fast and natural Example: TableSize = 10 Insert 7, 18, 41, 34, 10 (As usual, ignoring data “along for the ride”) 1 41 2 3 4 5 6 7 8 18 9

Hashing integers key space = integers Simple hash function: h(key) = key % TableSize Client: f(x) = x Library g(x) = x % TableSize Fairly fast and natural Example: TableSize = 10 Insert 7, 18, 41, 34, 10 (As usual, ignoring data “along for the ride”) 1 41 2 3 4 34 5 6 7 8 18 9

Hashing integers key space = integers Simple hash function: h(key) = key % TableSize Client: f(x) = x Library g(x) = x % TableSize Fairly fast and natural Example: TableSize = 10 Insert 7, 18, 41, 34, 10 (As usual, ignoring data “along for the ride”) 10 1 41 2 3 4 34 5 6 7 8 18 9

Collision-avoidance With “x % TableSize” the number of collisions depends on the ints inserted (obviously) TableSize Larger table-size tends to help, but not always Example: 70, 24, 56, 43, 10 with TableSize = 10 and TableSize = 60 Technique: Pick table size to be prime. Why? Real-life data tends to have a pattern “Multiples of 61” are probably less likely than “multiples of 60” Next lecture shows one collision-handling strategy does provably well with prime table size

More on prime table size If TableSize is 60 and… Lots of data items are multiples of 2, wasting 50% of table Lots of data items are multiples of 5, wasting 80% of table Lots of data items are multiples of 10, wasting 90% of table If TableSize is 61… Collisions can still happen but 2, 4, 6, 8, … will fill table Collisions can still happen, but 5, 10, 15, 20, … will fill table Collisions can still happen but 10, 20, 30, 40, … will fill table This “table-filling” property happens whenever the multiple and the table-size have a greatest-common-divisor of 1

Okay, back to the client If keys aren’t ints, client must convert to an int Trade-off: speed vs distinct keys hashing to distinct ints Strings Key space K = s0s1s2…sm-1 (where si are chars: si  [0,52] or si  [0,256] or si  [0,216]) Some choices: Which avoid collisions best? h(K) = s0 % TableSize h(K) = ( ) % TableSize h(K) = ( ) % TableSize alphabetic (upper and low case) mapping char = 1 byte (8 bits) represent values from 0-255 unicode (UCS-2) 2 bytes per character = values range from 0 to 2^16 odd prime, okay to get multiplication overload, don’t want to lose uniqueness 31 multiplication replaced by a shift and a subtraction 31*i == (i<<5)-1 multiplying by prime more likely to be unique old problem if you have a divisor in common with your table size, you are wasting part of your table

Specializing hash functions How might you hash differently if all your strings were web addresses (URLs)? ignore the http://www

Combining hash functions A few rules of thumb / tricks: Use all 32 bits (careful, that includes negative numbers) Use different overlapping bits for different parts of the hash This is why a factor of 31i works better than 256i Example: “abcde” and “ebcda” When smashing two hashes into one hash, use bitwise-xor bitwise-and produces too many 0 bits bitwise-or produces too many 1 bits Rely on expertise of others; consult books and other resources If keys are known ahead of time, choose a perfect hash mulitply by 2 = shift over by one bit (256=by 5 bits)

One expert suggestion int result = 17; foreach field f int fieldHashcode = boolean: (f ? 1: 0) byte, char, short, int: (int) f long: (int) (f ^ (f >>> 32)) float: Float.floatToIntBits(f) double: Double.doubleToLongBits(f), then above Object: object.hashCode( ) result = 31 * result + fieldHashcode

Hashing and comparing Need to emphasize a critical detail: We initially hash key E to get a table index To check an item is what we are looking for, compareTo E Does it have an equal key? So a hash table needs a hash function and a comparator The Java library uses a more object-oriented approach: each object has methods equals and hashCode class Object { boolean equals(Object o) {…} int hashCode() {…} … }

Equal Objects Must Hash the Same The Java library make a crucial assumption clients must satisfy And all hash tables make analogous assumptions Object-oriented way of saying it: If a.equals(b), then a.hashCode()==b.hashCode() Why is this essential? Why is this up to the client? So always override hashCode correctly if you override equals Many libraries use hash tables on your objects

Example class MyDate { int month; int year; int day; boolean equals(Object otherObject) { if(this==otherObject) return true; // common? if(otherObject==null) return false; if(getClass()!=other.getClass()) return false; return month = otherObject.month && year = otherObject.year && day = otherObject.day; }

Example class MyDate { int month; int year; int day; boolean equals(Object otherObject) { if(this==otherObject) return true; // common? if(otherObject==null) return false; if(getClass()!=other.getClass()) return false; return month = otherObject.month && year = otherObject.year && day = otherObject.day; } // wrong: must also override hashCode!

Tougher example Suppose you had a Fraction class where equals returned true for 1/2 and 3/6, etc. Then must override hashCode and cannot hash just based on the numerator and denominator Need 1/2 and 3/6 to hash to the same int If you write software for a living, you are less likely to implement hash tables from scratch than you are likely to encounter this issue

Conclusions and notes on hashing The hash table is one of the most important data structures Supports only find, insert, and delete efficiently Have to search entire table for other operations Important to use a good hash function Important to keep hash table at a good size Side-comment: hash functions have uses beyond hash tables Examples: Cryptography, check-sums Big remaining topic: Handling collisions