CSE373: Data Structures & Algorithms Lecture 6: Hash Tables Kevin Quinn Fall 2015.

Slides:



Advertisements
Similar presentations
Hashing as a Dictionary Implementation
Advertisements

AVL Deletion: Case #1: Left-left due to right deletion Spring 2010CSE332: Data Abstractions1 h a Z Y b X h+1 h h+2 h+3 b ZY a h+1 h h+2 X h h+1 Same single.
CSE332: Data Abstractions Lecture 10: More B Trees; Hashing Dan Grossman Spring 2010.
CSE332: Data Abstractions Lecture 11: Hash Tables Dan Grossman Spring 2010.
Hashing CS 3358 Data Structures.
Dictionaries and Hash Tables1  
Lecture 10 Sept 29 Goals: hashing dictionary operations general idea of hashing hash functions chaining closed hashing.
1 Chapter 9 Maps and Dictionaries. 2 A basic problem We have to store some records and perform the following: add new record add new record delete record.
Lecture 11 March 5 Goals: hashing dictionary operations general idea of hashing hash functions chaining closed hashing.
Hash Tables and Associative Containers CS-212 Dick Steflik.
1 CSE 326: Data Structures Hash Tables Autumn 2007 Lecture 14.
Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.
CS 206 Introduction to Computer Science II 11 / 17 / 2008 Instructor: Michael Eckmann.
Hash Tables1 Part E Hash Tables  
Hash Tables1 Part E Hash Tables  
Hash Tables1 Part E Hash Tables  
Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.
CS 206 Introduction to Computer Science II 11 / 12 / 2008 Instructor: Michael Eckmann.
Introducing Hashing Chapter 21 Copyright ©2012 by Pearson Education, Inc. All rights reserved.
CS 206 Introduction to Computer Science II 04 / 06 / 2009 Instructor: Michael Eckmann.
Data Structures Hashing Uri Zwick January 2014.
1. 2 Problem RT&T is a large phone company, and they want to provide enhanced caller ID capability: –given a phone number, return the caller’s name –phone.
CS2110 Recitation Week 8. Hashing Hashing: An implementation of a set. It provides O(1) expected time for set operations Set operations Make the set empty.
(c) University of Washingtonhashing-1 CSC 143 Java Hashing Set Implementation via Hashing.
Symbol Tables Symbol tables are used by compilers to keep track of information about variables functions class names type names temporary variables etc.
Data Structures and Algorithm Analysis Hashing Lecturer: Jing Liu Homepage:
CS212: DATA STRUCTURES Lecture 10:Hashing 1. Outline 2  Map Abstract Data type  Map Abstract Data type methods  What is hash  Hash tables  Bucket.
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
Hashing Table Professor Sin-Min Lee Department of Computer Science.
Hashing Chapter 20. Hash Table A hash table is a data structure that allows fast find, insert, and delete operations (most of the time). The simplest.
Hash Tables1   © 2010 Goodrich, Tamassia.
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
CSE373: Data Structures & Algorithms Lecture 11: Hash Tables Dan Grossman Fall 2013.
Can’t provide fast insertion/removal and fast lookup at the same time Vectors, Linked Lists, Stack, Queues, Deques 4 Data Structures - CSCI 102 Copyright.
CSE 332 Data Abstractions: B Trees and Hash Tables Make a Complete Breakfast Kate Deibel Summer 2012 July 9, 2012CSE 332 Data Abstractions, Summer
© 2004 Goodrich, Tamassia Hash Tables1  
CSE332: Data Abstractions Lecture 11: Hash Tables Dan Grossman Spring 2012.
CSE373: Data Structures & Algorithms Lecture 17: Hash Collisions Kevin Quinn Fall 2015.
Hash Tables - Motivation
1 Hashing - Introduction Dictionary = a dynamic set that supports the operations INSERT, DELETE, SEARCH Dictionary = a dynamic set that supports the operations.
Hashing Basis Ideas A data structure that allows insertion, deletion and search in O(1) in average. A data structure that allows insertion, deletion and.
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
Hashing Fundamental Data Structures and Algorithms Margaret Reid-Miller 18 January 2005.
Tirgul 11 Notes Hash tables –reminder –examples –some new material.
Hashing Suppose we want to search for a data item in a huge data record tables How long will it take? – It depends on the data structure – (unsorted) linked.
Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.
Data Structure & Algorithm Lecture 8 – Hashing JJCAO Most materials are stolen from Prof. Yoram Moses’s course.
1 Hashing by Adlane Habed School of Computer Science University of Windsor May 6, 2005.
Hash Tables From “Algorithms” (4 th Ed.) by R. Sedgewick and K. Wayne.
Hashing COMP171. Hashing 2 Hashing … * Again, a (dynamic) set of elements in which we do ‘search’, ‘insert’, and ‘delete’ n Linear ones: lists, stacks,
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
Week 9 - Monday.  What did we talk about last time?  Practiced with red-black trees  AVL trees  Balanced add.
Hashing O(1) data access (almost) -access, insertion, deletion, updating in constant time (on average) but at a price… references: Weiss, Goodrich & Tamassia,
CSC 413/513: Intro to Algorithms Hash Tables. ● Hash table: ■ Given a table T and a record x, with key (= symbol) and satellite data, we need to support:
CSC 143T 1 CSC 143 Highlights of Tables and Hashing [Chapter 11 p (Tables)] [Chapter 12 p (Hashing)]
CSE373: Data Structures & Algorithms Lecture 13: Hash Tables
CSE373: Data Structures & Algorithms Lecture 6: Hash Tables
CSE373: Data Structures & Algorithms Lecture 12: Hash Tables
Instructor: Lilian de Greef Quarter: Summer 2017
Searching.
CSE332: Data Abstractions Lecture 10: More B Trees; Hashing
Hash Table.
CSE373: Data Structures & Algorithms Lecture 13: Hash Tables
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
CSE373: Data Structures & Algorithms Lecture 13: Hash Tables
CS210- Lecture 16 July 11, 2005 Agenda Maps and Dictionaries Map ADT
Dictionaries and Hash Tables
Presentation transcript:

CSE373: Data Structures & Algorithms Lecture 6: Hash Tables Kevin Quinn Fall 2015

Motivating Hash Tables For a dictionary with n key, value pairs insert find delete Unsorted linked-list O(1) O(n) O(n) Unsorted array O(1) O(n) O(n) Sorted linked list O(n) O(n) O(n) Sorted array O(n) O( log n) O(n) Balanced tree O( log n) O( log n) O( log n) Magic array O(1) O(1) O(1) Sufficient “magic”: –Use key to compute array index for an item in O(1) time [doable] –Have a different index for every item [magic] Fall 20152CSE373: Data Structures & Algorithms

Motivating Hash Tables Let’s say you are tasked with counting the frequency of integers in a text file. You are guaranteed that only the integers 0 through 100 will occur: For example: 5, 7, 8, 9, 9, 5, 0, 0, 1, 12 Result: 0  2 1  1 5  2 7  1 8  1 9  2 What structure is appropriate? Tree? List? Array? Fall 20153CSE373: Data Structures & Algorithms

Hash Tables Aim for constant-time (i.e., O(1)) find, insert, and delete –“On average” under some often-reasonable assumptions A hash table is an array of some fixed size Basic idea: Fall 20154CSE373: Data Structures & Algorithms 0 … TableSize –1 hash function: index = h(key) hash table key space (e.g., integers, strings)

Hash Tables vs. Balanced Trees In terms of a Dictionary ADT for just insert, find, delete, hash tables and balanced trees are just different data structures –Hash tables O(1) on average (assuming we follow good practices) –Balanced trees O( log n) worst-case Constant-time is better, right? –Yes, but you need “hashing to behave” (must avoid collisions) –Yes, but findMin, findMax, predecessor, and successor go from O( log n) to O(n), printSorted from O(n) to O(n log n) Why your textbook considers this to be a different ADT Fall 20155CSE373: Data Structures & Algorithms

Hash Tables There are m possible keys (m typically large, even infinite) We expect our table to have only n items n is much less than m (often written n << m) Many dictionaries have this property –Compiler: All possible identifiers allowed by the language vs. those used in some file of one program –Database: All possible student names vs. students enrolled –AI: All possible chess-board configurations vs. those considered by the current player –… Fall 20156CSE373: Data Structures & Algorithms

Hash functions An ideal hash function: Fast to compute “Rarely” hashes two “used” keys to the same index –Often impossible in theory but easy in practice –Will handle collisions in next lecture Fall 20157CSE373: Data Structures & Algorithms 0 … TableSize –1 hash function: index = h(key) hash table key space (e.g., integers, strings)

Who hashes what? Hash tables can be generic –To store elements of type E, we just need E to be: 1.Comparable: order any two E (as with all dictionaries) 2.Hashable: convert any E to an int When hash tables are a reusable library, the division of responsibility generally breaks down into two roles: Fall 20158CSE373: Data Structures & Algorithms We will learn both roles, but most programmers “in the real world” spend more time as clients while understanding the library E inttable-index collision? collision resolution client hash table library

More on roles Fall 20159CSE373: Data Structures & Algorithms Two roles must both contribute to minimizing collisions (heuristically) Client should aim for different ints for expected items –Avoid “wasting” any part of E or the 32 bits of the int Library should aim for putting “similar” int s in different indices –Conversion to index is almost always “mod table-size” –Using prime numbers for table-size is common E inttable-index collision? collision resolution client hash table library Some ambiguity in terminology on which parts are “hashing” “hashing”?

What to hash? We will focus on the two most common things to hash: ints and strings –For objects with several fields, usually best to have most of the “identifying fields” contribute to the hash to avoid collisions –Example: class Person { String first; String middle; String last; Date birthdate; } –An inherent trade-off: hashing-time vs. collision-avoidance Bad idea(?): Use only first name Good idea(?): Use only middle initial Admittedly, what-to-hash-with is often unprincipled  Fall CSE373: Data Structures & Algorithms

Hashing integers Fall CSE373: Data Structures & Algorithms key space = integers Simple hash function: –Client: g(x) = x –Library: f(x) = g(x) % TableSize –Fairly fast and natural Example: –TableSize = 10 –Insert 7, 18, 41, 34, 10 –Insert 44? –(As usual, only looking at keys, not values)

Collision-avoidance With “ x % TableSize ” the number of collisions depends on –the ints inserted (obviously) –TableSize Larger table-size tends to help, but not always –Example: 70, 17, 14, 9, 10 –What’s a table size that would work well? Poorly? TableSize = 9 and TableSize = 60 Technique: Pick table size to be prime. Why? –Real-life data tends to have a pattern –“Multiples of 61” are probably less likely than “multiples of 60” –Next lecture shows one collision-handling strategy does provably well with prime table size 12

Okay, back to the client If keys aren’t int s, the client must convert to an int –Why can’t the library do this for us? –Trade-off: speed versus distinct keys hashing to distinct int s Very important example: Strings –Key space K = s 0 s 1 s 2 …s m-1 (where s i are chars: s i  [0,52] or s i  [0,256] or s i  [0,2 16 ]) –Some choices: Which avoid collisions best? 1.h(K) = s 0 % TableSize 2.h(K) = % TableSize 3.h(K) = % TableSize 13

Specializing hash functions How might you hash differently if all your strings were web addresses (URLs)? Fall CSE373: Data Structures & Algorithms

Combining hash functions A few rules of thumb / tricks: 1.Use all 32 bits (careful, that includes negative numbers) 2.Use different overlapping bits for different parts of the hash –This is why a factor of 37 i works better than 256 i –Example: “abcde” and “ebcda” 3.When smashing two hashes into one hash, use bitwise-xor –bitwise-and produces too many 0 bits –bitwise-or produces too many 1 bits 4.Rely on expertise of others; consult books and other resources 5.If keys are known ahead of time, choose a perfect hash Fall CSE373: Data Structures & Algorithms

Combining Hashes Fall CSE373: Data Structures & Algorithms h1 = : (unicode for the int “3”) h2 = : (unicode for the char “e”) h1 AND h2h1 OR h2 h1 XOR h2

One expert suggestion int result = 17; foreach field f int fieldHashcode = boolean: (f ? 1: 0) byte, char, short, int: (int) f long: (int) (f ^ (f >>> 32)) float: Float.floatToIntBits(f) double: Double.doubleToLongBits(f), then above Object: object.hashCode( ) result = 31 * result + fieldHashcode Fall 2015CSE373: Data Structures & Algorithms17

Hashing and comparing Need to emphasize a critical detail: –We initially hash key E to get a table index –To check an item is what we are looking for, compareTo E Does it have an equal key? So a hash table needs a hash function and a comparator –The Java library uses a more object-oriented approach: each object has methods equals and hashCode Fall CSE373: Data Structures & Algorithms class Object { boolean equals(Object o) {…} int hashCode() {…} … }

Equal Objects Must Hash the Same The Java library make a crucial assumption clients must satisfy –And all hash tables make analogous assumptions Object-oriented way of saying it: If a.equals(b), then a.hashCode()==b.hashCode() Why is this essential? Why is this up to the client? So always override hashCode correctly if you override equals –Many libraries use hash tables on your objects Fall CSE373: Data Structures & Algorithms

By the way: comparison has rules too We have not emphasized important “rules” about comparison for: –Dictionaries –Sorting (future major topic) Comparison must impose a consistent, total ordering: For all a, b, and c, (reflexivity) : a.compareTo(a) == 0 (transitivity): If a.compareTo(b) < 0 and b.compareTo(c)<0, then a.compareTo(c) < 0 (symmertry): If a.compareTo(b) 0 If a.compareTo(b)== 0, then b.compareTo(a)==0 This is surprisingly awkward because of subclassing… 20

Example Fall CSE373: Data Structures & Algorithms class MyDate { int month; int year; int day; boolean equals(Object otherObject) { if(this==otherObject) return true; // common? if(otherObject==null) return false; if(getClass()!=other.getClass()) return false; return month = otherObject.month && year = otherObject.year && day = otherObject.day; } // wrong: must also override hashCode! }

Tougher example Suppose you had a Fraction class where equals returned true for 1/2 and 3/6, etc. Then must override hashCode and cannot hash just based on the numerator and denominator –Need 1/2 and 3/6 to hash to the same int If you write software for a living, you are less likely to implement hash tables from scratch than you are likely to encounter this issue Fall CSE373: Data Structures & Algorithms

Conclusions and notes on hashing The hash table is one of the most important data structures –Supports only find, insert, and delete efficiently –Have to search entire table for other operations Important to use a good hash function Important to keep hash table at a good size Side-comment: hash functions have uses beyond hash tables –Examples: Cryptography, check-sums Big remaining topic: Handling collisions Fall CSE373: Data Structures & Algorithms