CS5112: Algorithms and Data Structures for Applications

Slides:



Advertisements
Similar presentations
Data Structures and Functional Programming Computability Ramin Zabih Cornell University Fall 2012.
Advertisements

Hash-based Indexes CS 186, Spring 2006 Lecture 7 R &G Chapter 11 HASH, x. There is no definition for this word -- nobody knows what hash is. Ambrose Bierce,
Hit or Miss ? !!!.  Cache RAM is high-speed memory (usually SRAM).  The Cache stores frequently requested data.  If the CPU needs data, it will check.
CSE332: Data Abstractions Lecture 27: A Few Words on NP Dan Grossman Spring 2010.
Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.
CS 206 Introduction to Computer Science II 11 / 12 / 2008 Instructor: Michael Eckmann.
Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.
Finding Similar Items.
1Bloom Filters Lookup questions: Does item “ x ” exist in a set or multiset? Data set may be very big or expensive to access. Filter lookup questions with.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Clustering and greedy algorithms Prof. Noah Snavely CS1114
CS2110 Recitation Week 8. Hashing Hashing: An implementation of a set. It provides O(1) expected time for set operations Set operations Make the set empty.
(c) University of Washingtonhashing-1 CSC 143 Java Hashing Set Implementation via Hashing.
Bloom Filters An Introduction and Really Most Of It CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.
Voronoi diagrams and applications Prof. Ramin Zabih
CS212: DATA STRUCTURES Lecture 10:Hashing 1. Outline 2  Map Abstract Data type  Map Abstract Data type methods  What is hash  Hash tables  Bucket.
1 Lecture 11: Bloom Filters, Final Review December 7, 2011 Dan Suciu -- CSEP544 Fall 2011.
External data structures
CSC 413/513: Intro to Algorithms NP Completeness.
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
Image segmentation Prof. Noah Snavely CS1114
1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.
Hashing and Hash-Based Index. Selection Queries Yes! Hashing  static hashing  dynamic hashing B+-tree is perfect, but.... to answer a selection query.
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
Spatial Issues in DBGlobe Dieter Pfoser. Location Parameter in Services Entering the harbor (x,y position)… …triggers information request.
The Bloom Paradox Ori Rottenstreich Joint work with Yossi Kanizo and Isaac Keslassy Technion, Israel.
Clustering Prof. Ramin Zabih
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
SNU OOPSLA Lab. 1 Great Ideas of CS with Java Part 1 WWW & Computer programming in the language Java Ch 1: The World Wide Web Ch 2: Watch out: Here comes.
The Bloom Paradox Ori Rottenstreich Joint work with Isaac Keslassy Technion, Israel.
Bloom Filters. Lecture on Bloom Filters Not described in the textbook ! Lecture based in part on: Broder, Andrei; Mitzenmacher, Michael (2005), "Network.
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
CSC 143T 1 CSC 143 Highlights of Tables and Hashing [Chapter 11 p (Tables)] [Chapter 12 p (Hashing)]
Mohammed I DAABO COURSE CODE: CSC 355 COURSE TITLE: Data Structures.
Bloom Filters An Introduction and Really Most Of It CMSC 491
The Stream Model Sliding Windows Counting 1’s
15-121: Introduction to Data Structures
Cryptographic Hash Function
CS 315 Data Structures B. Ravikumar Office: 116 I Darwin Hall Phone:
IGCSE 6 Cambridge Effectiveness of algorithms Computer Science
CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis Catie Baker Spring 2015.
LEARNING OBJECTIVES O(1), O(N) and O(LogN) access times. Hashing:
How will execution time grow with SIZE?
CS 332: Algorithms Hash Tables David Luebke /19/2018.
CS 213: Data Structures and Algorithms
COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results Left to the title, a presenter can insert his/her own image.
Randomized Algorithms CS648
Hashing CS2110.
Edge computing (1) Content Distribution Networks
Preprocessing to accelerate 1-NN
CMSC201 Computer Science I for Majors Lecture 19 – Recursion
Data Structures: Introductory lecture
Hashing.
CSE 373 Data Structures and Algorithms
Quickselect Prof. Noah Snavely CS1114
CS5112: Algorithms and Data Structures for Applications
Database Systems (資料庫系統)
Searching, Sorting, and Asymptotic Complexity
CS5112: Algorithms and Data Structures for Applications
CS5112: Algorithms and Data Structures for Applications
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Sorting, selecting and complexity
CS5112: Algorithms and Data Structures for Applications
Spreadsheets, Modelling & Databases
CS5112: Algorithms and Data Structures for Applications
CSE 326: Data Structures Lecture #12 Hashing II
COP3530- Data Structures Introduction
CS 150: Computing - From Ada to the Web
CSE 326: Data Structures Lecture #14
Lecture-Hashing.
Presentation transcript:

CS5112: Algorithms and Data Structures for Applications Lecture 4.1: Applications of hashing Ramin Zabih Some figures from Wikipedia/Google image search

Administrivia Web site is: https://github.com/cornelltech/CS5112-F18 As usual, this is pretty much all you need to know HW 1 at Thursday 11:59PM Also very high tech!

Quiz 1 comments Overall people did pretty well There was a Dijkstra question that required some thought Might be on prelim/final in some form? High tech solution seemed to work well Reminder: we will drop your lowest quiz

Homework comments Getting the course staff is slower than we hoped Slack should be your primary contact Each student has 1 slip day over the semester For a pair, you can use a single day (not 2 days) You need to tell us which student to charge it to If you don’t, we will ask you, and eventually charge it to both of you

Today Fun applications of hashing! Greg on cryptocurrency Lots of billion-dollar ideas Greg on cryptocurrency

Collisions are an issue See HW 1

Can you determine if there are collisions? Given a hashing function ℎ from bit strings to bit strings No limits on the length of input or output Recall: cryptographic hash functions shouldn’t have collisions Two inputs with same output: ℎ 𝑠 =ℎ( 𝑠 ′ ) Can we tell this by inspecting ℎ?

Different excuses for failure Garey & Johnson, Computers and Intractability

Uncomputable vs intractable Uncomputable: proven to be this is impossible Determine if ℎ has any collisions Almost any question about a program Some very subtle problems where the input size is unbounded Intractable: proven at least as hard as famous open problems Technically “NP-hard” Almost any question about a graph, such as coloring A tractable graph problem is pure gold! Many problems in cryptography

Uses in CS of hardness results Very important in many applications Use case 1: hard problems can help you Want to show that if you could break a code you could also solve a famous open problem (e.g. factoring efficiently) Mostly shows up in adversarial situations Use case 2: hard problems avoid wasting time Showing that a problem is hard will keep people from working on it Amusingly enough, sometimes it shows publications are wrong

What cool stuff can we do with hashing? Back to the fun part… What cool stuff can we do with hashing?

Bloom filters Suppose you are processing items, most of them are cheap but a few of them are very expensive. Can we quickly figure out if an item is expensive? Could store the expensive items in an associative array Or use a binary valued hash table? Efficient way to find out if an item might be expensive We will query set membership but allow false positives I.e. the answer to 𝑠∈𝑆 is either ‘possibly’ or ‘definitely not’ Use a few hash functions ℎ 𝑖 and bit array 𝐴 To insert 𝑠 we set 𝐴 ℎ 𝑖 (𝑠) = 1 ∀ 𝑖

Bloom filter example Example has 3 hash functions and 18 bit array {𝑥,𝑦,𝑧} are in the set, 𝑤 is not Bits are (sort of) signature Figure by David Eppstein, https://commons.wikimedia.org/w/index.php?curid=2609777

Application: web caching CDN’s, like Akamai, make the web work (~70% of traffic) About 75% of URL’s are ‘one hit wonders’ Never looked at again by anyone Let’s not do the work to put these in the disk cache! Cache on second hit Use a Bloom filter to record URL’s that have been accessed A one hit wonder will not be in the Bloom filter See: Maggs, Bruce M.; Sitaraman, Ramesh K. (July 2015), "Algorithmic nuggets in content delivery" (PDF), SIGCOMM Computer Communication Review, New York, NY, USA,45 (3): 52–66

Bloom filters really work! Figures from: Maggs, Bruce M.; Sitaraman, Ramesh K. (July 2015), "Algorithmic nuggets in content delivery" (PDF), SIGCOMM Computer Communication Review, New York, NY, USA,45 (3): 52–66

Cool facts about Bloom filters You don’t need to build different hash functions, you can use a single one and divide its output into fields (usually) Can calculate probability of false positives and keep it low Time to add an element to the filter, or check if an element is in the filter, is independent of the size of the element (!) You can estimate the size of the union of two sets from the bitwise OR of their Bloom filters

MinHash Suppose you want to figure out how similar two sets are Jacard similarity measure is 𝐽 𝐴,𝐵 = 𝐴∩𝐵 |𝐴∪𝐵| This is 0 when disjoint and 1 when identical Define ℎ 𝑚𝑖𝑛 𝑆 to be the element of 𝑆 with the smallest value of the hash function ℎ, i.e. ℎ 𝑚𝑖𝑛 𝑆 = arg min s∈𝑆 ℎ(𝑠) This uses hashing to compute a set’s “signature” Probability that ℎ 𝑚𝑖𝑛 𝐴 = ℎ 𝑚𝑖𝑛 𝐵 is 𝐽(𝐴,𝐵) Do this with a bunch of different hash functions

MinHash applications Plagiarism detection in articles Collaborative filtering! Amazon, NetFlix, etc.

Distributed hash tables (DHT) BitTorrent, etc. Given a file name and its data, store/retrieve it in a network Compute the hash of the file name This maps to a particular processor, which holds the file