
CS5112: Algorithms and Data Structures for Applications
Lecture 3: Hashing
Ramin Zabih
Some figures from Wikipedia/Google image search

Administrivia
- Web site: https://github.com/cornelltech/CS5112-F18 (as usual, this is pretty much all you need to know)
- Quiz 1 out today, due Friday 11:59PM. Very high tech!
- Coverage through Greg's lecture on Tuesday
- TAs and consultants coming shortly
- We have a Slack channel

Today
- Clinic this evening (here): Greg on hashing
- Associative arrays
- Efficiency: asymptotic analysis, effects of locality
- Hashing
- Additional requirements for cryptographic hashing
- Fun applications of hashing! Lots of billion-dollar ideas

Associative arrays
- Fundamental data structure in CS
- Holds (key, value) pairs; a given key appears at most once
- API for associative arrays (very informal!):
  - Insert(k, v, A) -> A', where A' holds the new pair (k, v) along with everything in A
  - Lookup(k, A) -> v, where v comes from the pair (k, v) in A
- Lots of details we will ignore today: avoiding duplication, raising exceptions, etc.
- "Key" question: how to do this fast
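To make the API concrete, here is a minimal Python sketch written in the functional style of the informal API above (each Insert returns a new array); the dict-based representation is just for illustration.

```python
def insert(k, v, a):
    """Return a new associative array holding (k, v) along with everything in a."""
    a2 = dict(a)  # copy, so the original array is left untouched
    a2[k] = v     # a given key appears at most once, so any old value is replaced
    return a2

def lookup(k, a):
    """Return the value v from the pair (k, v) in a (raises KeyError if k is absent)."""
    return a[k]

grades = insert("alice", 95, {})
grades = insert("bob", 88, grades)
print(lookup("alice", grades))  # 95
```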

How computer scientists think about efficiency
- Two views: asymptotic and "practical"
- They generally give the same result, but it's math vs. engineering
- Asymptotic analysis, a.k.a. "big O":
  - Mathematical treatment of algorithms
  - Worst-case performance
  - Consider the limit as the input gets larger and larger

Big O notation: main ideas
Two big ideas:
- [#1] Think about the worst case (don't assume luck)
- [#2] Think about all hardware (don't assume Intel/AMD)
Example: find duplicates in an array of n numbers, dumbly
- For each element, scan the rest of the array
- We scan n−1, then n−2, ... elements [#2]
- So we examine (n−1) + (n−2) + ... + 1 = n(n−1)/2 elements
- Which is ugly... but what happens as n → ∞? [#1]
- Write this as O(n²)
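A minimal sketch of the dumb duplicate finder described above, to make the counting concrete:

```python
def has_duplicates(a):
    """The dumb O(n^2) duplicate finder: for each element, scan the rest of the array."""
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):  # scans n-1, then n-2, ... elements: n(n-1)/2 total
            if a[i] == a[j]:
                return True
    return False

print(has_duplicates([8, 4, 1, 3]))  # False
print(has_duplicates([8, 4, 1, 8]))  # True
```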

Some canonical complexity classes
- Constant time, O(1): running time doesn't depend on the input
  - Example: find the first element in an array
- Linear time, O(n): constant work per input item, in the worst case
  - Example: find a particular item in the array
- Quadratic time, O(n²): linear work per input item
  - Example: finding duplicates, as above
- Clever variants of quadratic-time algorithms run in O(n log n)
  - A few will be discussed in the clinic tonight

Big O notation and its limitations
- Overall, practical performance correlates very strongly with asymptotic complexity (= big O)
- The exceptions to this are actually famous
- Warning: this does not mean that on a specific input an O(1) algorithm will be faster than an O(n²) one!

Linked lists
[Figure: a linked list holding 8 → 4 → 1 → 3]

Linked lists as memory arrays
- We'll implement linked lists using a memory array; this is very close to what the hardware does
- A linked list contains "cells": a value, plus where the next cell is
- We will represent cells by a pair of adjacent array entries (see the sketch below)
[Figure: a memory array with slots numbered 1 through 9]
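A minimal sketch of this representation; the 10-slot array size, the particular cell addresses, and the example list 8 → 4 → 1 → 3 are illustrative choices, not taken from the slides' figure:

```python
# Each cell occupies two adjacent slots: [value, address-of-next-cell].
# Address 0 serves as the "null" pointer, so real cells start at address 1.
memory = [None] * 10
NIL = 0

def make_cell(addr, value, next_addr):
    memory[addr] = value          # the cell's value
    memory[addr + 1] = next_addr  # where the next cell lives

# Build 8 -> 4 -> 1 -> 3, scattering the cells around the array:
make_cell(1, 8, 5)  # head at address 1, next cell at address 5
make_cell(5, 4, 3)
make_cell(3, 1, 7)
make_cell(7, 3, NIL)

def traverse(head):
    addr = head
    while addr != NIL:
        yield memory[addr]
        addr = memory[addr + 1]  # follow the "next" pointer

print(list(traverse(1)))  # [8, 4, 1, 3]
```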

Example
[Figure: the list 8 → 4 → 1 → 3 laid out in the memory array, each cell stored as a (value, next-address) pair]

Locality and efficiency
- Locality matters for computation because of physics: there is a limit to the amount of information you can pack into a given area
- The hardware is faster when your memory references are local in time and space
  - Time locality: access the same data repeatedly, then move on
  - Space locality: access nearby data (in memory)

Memory hierarchy in a modern CPU
[Figure: the memory hierarchy of a modern CPU, from registers and caches out to main memory]

Complexity of associative array algorithms

  Data structure             Insert     Lookup
  Linked list                O(n)       O(n)
  Binary search tree (BST)   O(n)       O(n)      (worst case)
  Balanced BST               O(log n)   O(log n)
  Hash table                 O(1)       O(1)      (expected)

So, why use anything other than a hash table?

Hashing in one diagram
[Figure: keys fed through a hash function into the slots of a bucket array]

What makes a good hash function?
- Almost nobody writes their own hash function
  - Like a random number generator, it is very easy to get this wrong!
- Deterministic
- Uniform, with respect to your input!
  - The technical term for this is entropy
- (Sometimes) invariant to certain changes, as sketched below
  - Example: punctuation, capitalization, spaces
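A small sketch of the last point: normalizing the input before hashing makes the hash invariant to punctuation, capitalization, and spaces. The normalization rule here is one plausible choice for illustration, not anything specific to the lecture:

```python
import hashlib
import string

def normalized_hash(s):
    """Hash that ignores punctuation, capitalization, and spaces."""
    cleaned = "".join(c.lower() for c in s
                      if c not in string.punctuation and not c.isspace())
    return hashlib.sha256(cleaned.encode()).hexdigest()

# Differently punctuated/capitalized forms hash identically:
assert normalized_hash("Hello, World!") == normalized_hash("hello world")
```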

Examples of good and bad hash functions
- Suppose we want to build a hash table for CS5112 students
- Is area code a good hash function? How about zip code? Social security numbers? https://www.ssa.gov/history/ssn/geocard.html
- What are the best and worst hash functions you can think of?

Cryptographic hashing
- Sample application: bragging that you've solved HW1
  - How do you show this without sharing your solution?
- Given the output, it is hard to compute an input with that output
  - Given m = hash(s), it is hard to find s' with m = hash(s')
  - Sometimes called a one-way function
- Given the input, it is hard to find a different input with the same output
  - Given s, it is hard to find s' ≠ s with hash(s) = hash(s')
- It is hard to find any two inputs with the same output, hash(s) = hash(s')
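One way to make the HW1 example concrete is a hash commitment, sketched here with SHA-256. This is an illustration of the idea only, not a vetted cryptographic protocol; the salt guards against someone brute-forcing short solutions:

```python
import hashlib
import os

solution = b"my HW1 answer: 42"
salt = os.urandom(16)

# Publish this commitment now, without revealing the solution:
commitment = hashlib.sha256(salt + solution).hexdigest()

# Later, reveal (salt, solution); anyone can verify the brag.
# Finding a different solution with the same commitment would require
# breaking the hash function's second-preimage resistance.
assert hashlib.sha256(salt + solution).hexdigest() == commitment
```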

What does "hard to find" mean?
- A major topic, at the center of computational complexity
- Loosely speaking, we can't absolutely prove this
- But we can show that if we could solve one problem, we could solve another problem that is widely believed to be hard
  - Believed hard because lots of people have tried to solve it and failed!
- This proves that one problem is at least as hard as another: "problem reduction"

Handling collisions
- Collisions are more common than you think! (The birthday paradox)
  - Example: with 1M buckets and 2,450 keys uniformly distributed, there is a 95% chance of a collision
- The easiest solution is chaining, e.g. with linked lists (sketched below)
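A minimal sketch of chaining; Python lists stand in for the linked-list chains, and the class and method names are made up for illustration:

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _chain(self, key):
        """The chain (bucket) that this key hashes to."""
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))       # collision or empty bucket: extend the chain

    def lookup(self, key):
        for k, v in self._chain(key):    # scan the chain; short chains keep this fast
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.insert("alice", 95)
t.insert("bob", 88)
print(t.lookup("alice"))  # 95
```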

What cool stuff can we do with hashing?
Now for the fun part…

Rabin-Karp string search
- Find one string (the "pattern") in another
- Naively, we repeatedly shift the pattern
  - Example: to find "greg" in "richardandgreg" we compare "greg" against "rich", "icha", "char", etc. ("shingles" at the character level)
- Instead, let's use a hash function h
  - First compare h("greg") with h("rich"), then with h("icha"), etc.
  - Only if the hash values are equal do we look at the strings themselves (see the sketch below)
  - This works because x = y ⇒ h(x) = h(y) (but not ⇐, of course!)
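A sketch of the compare-hashes-first idea, using Python's built-in hash as a stand-in. Note that recomputing the hash of every window from scratch is still slow; the rolling hash on the next slide is what makes Rabin-Karp fast:

```python
def find_with_hash(pattern, text, h=hash):
    """Return the first index of pattern in text, comparing hashes before strings."""
    m = len(pattern)
    hp = h(pattern)
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        # x == y is checked only when the hashes agree (hash equality
        # does not imply string equality, so we must confirm):
        if h(window) == hp and window == pattern:
            return i
    return -1

print(find_with_hash("greg", "richardandgreg"))  # 10
```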

Rolling hash functions
- To make this computationally efficient we need a special kind of hash function h
- As we go through "richardandgreg" looking for "greg", we compute h on consecutive substrings of the same length
- There are clever ways to do this, but to get the flavor of them, here is a naive scheme that mostly works (sketched below):
  - Take the ASCII values of all the characters and multiply them
  - Reduce this modulo something reasonable
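A sketch of the naive multiplicative scheme just described. The modulus is an arbitrary illustrative choice; it must be prime so that each outgoing character can be "divided out" via a modular inverse when the window slides:

```python
P = 1_000_003  # a prime modulus, chosen for illustration

def window_hash(s):
    """Product of the ASCII values of s, reduced mod P."""
    h = 1
    for c in s:
        h = (h * ord(c)) % P
    return h

def roll(h, outgoing, incoming):
    """Slide the window: divide out `outgoing` (via its modular inverse), multiply in `incoming`."""
    return (h * pow(ord(outgoing), -1, P) * ord(incoming)) % P

text, m = "richardandgreg", 4
h = window_hash(text[:m])
for i in range(1, len(text) - m + 1):
    h = roll(h, text[i - 1], text[i + m - 1])      # O(1) work per shift
    assert h == window_hash(text[i:i + m])         # matches a from-scratch hash
```

One reason this only "mostly works": multiplication is commutative, so any two windows that are anagrams of each other collide.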

Large backups
- How do we back up all the world's information?
- Tape robots! VERY SLOW access

Bloom filters
- Suppose you are processing items; most of them are cheap, but a few are very expensive
  - Can we quickly figure out whether an item is expensive?
  - We could store the expensive items in an associative array, or use a binary-valued hash table
- A Bloom filter is an efficient way to find out whether an item might be expensive
  - We query set membership but allow false positives
  - I.e., the answer to "is s ∈ S?" is either "possibly" or "definitely not"
- Use a few hash functions h_i and a bit array A
  - To insert s, we set A[h_i(s)] = 1 for all i (see the sketch below)
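A minimal sketch, sized to match the example on the next slide (an 18-bit array and 3 hash functions). Deriving the hashes by slicing one SHA-256 digest is a common trick (see the "cool facts" slide), not something specific to this lecture:

```python
import hashlib

class BloomFilter:
    """Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m=18, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indexes(self, item):
        """k bit positions for item, carved out of a single SHA-256 digest."""
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def insert(self, item):
        for i in self._indexes(item):  # set A[h_i(s)] = 1 for all i
            self.bits[i] = 1

    def might_contain(self, item):
        """True means "possibly in the set"; False means "definitely not"."""
        return all(self.bits[i] for i in self._indexes(item))

bf = BloomFilter()
for member in ("x", "y", "z"):
    bf.insert(member)
print(bf.might_contain("x"))  # True (possibly)
print(bf.might_contain("w"))  # usually False (definitely not); false positives can occur
```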

Bloom filter example
- The example has 3 hash functions and an 18-bit array
- {x, y, z} are in the set; w is not
[Figure by David Eppstein, https://commons.wikimedia.org/w/index.php?curid=2609777]

Application: web caching
- CDNs, like Akamai, make the web work (~70% of traffic)
- About 75% of URLs are "one-hit wonders": never looked at again by anyone
  - Let's not do the work to put these in the disk cache!
- Cache on second hit: use a Bloom filter to record URLs that have been accessed
  - A one-hit wonder will not be in the Bloom filter (sketched below)
- See: Maggs, Bruce M.; Sitaraman, Ramesh K. (July 2015), "Algorithmic nuggets in content delivery" (PDF), SIGCOMM Computer Communication Review, 45 (3): 52–66
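A sketch of cache-on-second-hit, reusing the BloomFilter class sketched above. The filter size, the dict-based disk cache, and fetch_from_origin are all hypothetical stand-ins, not Akamai's actual machinery:

```python
seen_once = BloomFilter(m=1 << 20, k=4)  # illustrative size
disk_cache = {}

def fetch_from_origin(url):
    """Hypothetical stand-in for fetching the object over the network."""
    return f"<body of {url}>"

def serve(url):
    if url in disk_cache:              # already cached: cheap
        return disk_cache[url]
    body = fetch_from_origin(url)
    if seen_once.might_contain(url):   # second (or later) hit: now worth caching
        disk_cache[url] = body
    else:
        seen_once.insert(url)          # first hit: just remember that we saw it
    return body

serve("https://example.com/a")  # first hit: served but not cached
serve("https://example.com/a")  # second hit: cached from now on
```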

Bloom filters really work!
[Figures from: Maggs, Bruce M.; Sitaraman, Ramesh K. (July 2015), "Algorithmic nuggets in content delivery" (PDF), SIGCOMM Computer Communication Review, 45 (3): 52–66]

Cool facts about Bloom filters
- You don't need to build different hash functions; you can (usually) use a single one and divide its output into fields
- You can calculate the probability of false positives and keep it low
- The time to add an element, or to check whether an element is in the filter, is independent of the size of the element (!)
- You can estimate the size of the union of two sets from the bitwise OR of their Bloom filters (sketched below)
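A small sketch of the last fact, again reusing the BloomFilter class from above. The size estimator is the standard one based on the count of set bits; it breaks down when the filter is saturated:

```python
import math

def union_filter(bf1, bf2):
    """Bitwise OR of two same-shape Bloom filters: a filter for the union of the sets."""
    assert (bf1.m, bf1.k) == (bf2.m, bf2.k)
    out = BloomFilter(bf1.m, bf1.k)
    out.bits = [a | b for a, b in zip(bf1.bits, bf2.bits)]
    return out

def estimate_size(bf):
    """Estimate how many elements a filter holds from X, its number of set bits."""
    x = sum(bf.bits)
    return -(bf.m / bf.k) * math.log(1 - x / bf.m)
```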

MinHash
- Suppose you want to figure out how similar two sets are
- The Jaccard similarity measure is J(A, B) = |A ∩ B| / |A ∪ B|
  - This is 0 when the sets are disjoint and 1 when they are identical
- Define h_min(S) to be the element of S with the smallest value of the hash function h, i.e. h_min(S) = argmin_{s ∈ S} h(s)
  - This uses hashing to compute a set's "signature"
- The probability that h_min(A) = h_min(B) is J(A, B)
  - So do this with a bunch of different hash functions and count agreements (sketched below)
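A sketch of the estimator; seeding a single hash function is a stand-in for having "a bunch of different hash functions":

```python
import hashlib

def h_min(s, seed):
    """Element of set s minimizing a seeded hash: one entry of the MinHash signature."""
    return min(s, key=lambda x: hashlib.sha256(f"{seed}:{x}".encode()).digest())

def estimate_jaccard(a, b, num_hashes=200):
    """Fraction of hash functions on which the two signatures agree."""
    agree = sum(h_min(a, i) == h_min(b, i) for i in range(num_hashes))
    return agree / num_hashes

a = {"the", "quick", "brown", "fox"}
b = {"the", "quick", "brown", "dog"}
print(estimate_jaccard(a, b))  # close to the true Jaccard similarity, 3/5
```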

MinHash applications
- Plagiarism detection in articles
- Collaborative filtering! Amazon, Netflix, etc.

Distributed hash tables (DHTs)
- Used by BitTorrent, etc.
- Given a file name and its data, store (and later retrieve) it in a network
- Compute the hash of the file name; this maps to a particular processor, which holds the file (sketched below)
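A minimal sketch of the idea; the node list and storage dict are hypothetical stand-ins. Real DHTs (BitTorrent's Mainline DHT uses Kademlia) rely on consistent hashing so that nodes can join and leave without remapping everything:

```python
import hashlib

nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical processors
store = {}  # stand-in for the network: node name -> {filename: data}

def node_for(filename):
    """Hash the file name to pick the node responsible for it."""
    digest = hashlib.sha1(filename.encode()).digest()
    return nodes[int.from_bytes(digest, "big") % len(nodes)]

def put(filename, data):
    store.setdefault(node_for(filename), {})[filename] = data

def get(filename):
    # Any participant can recompute the hash and ask the right node directly:
    return store[node_for(filename)][filename]

put("lecture3.pdf", b"...")
print(node_for("lecture3.pdf"), get("lecture3.pdf"))
```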