CS5112: Algorithms and Data Structures for Applications


CS5112: Algorithms and Data Structures for Applications Lecture 10: From hashing to machine learning Ramin Zabih Some figures from Wikipedia/Google image search

Administrivia HW comments and minor corrections on Slack HW3 coming soon Non-anonymous survey coming re: speed of course, etc.

Office hours Prof Zabih: after lecture or by appointment Tuesdays 11:30-12:30 in Bloomberg 277 with @Julia Wednesdays 2:30-3:30 in Bloomberg 277 with @irisz Wednesdays 3:30-4:30 in Bloomberg 277 with @Ishan Thursdays 10-12 in Bloomberg 267 with @Fei Li

Today Recap: randomized quicksort Universal hashing From universal to perfect Learning to hash Nearest neighbor search

Quick impossibility proof Generally, proofs that there is no algorithm that can do X are quite complicated There is a simple one that follows naturally from Richard’s lecture last Thursday on compression Does a (lossless) compression algorithm always work? I.e., can it always reduce the size of a file?
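
A hedged sketch of the counting argument the question is pointing at (the slide leaves it as a question for the audience):

```latex
% Pigeonhole sketch: no lossless compressor can shrink every file.
% There are $2^n$ files of exactly $n$ bits, but only
% $\sum_{k=0}^{n-1} 2^k = 2^n - 1$ files with fewer than $n$ bits.
\[
  2^n \;>\; 2^n - 1 \;=\; \sum_{k=0}^{n-1} 2^k
\]
% Losslessness forces distinct inputs to distinct outputs, so at least one
% $n$-bit file must map to an output that is no shorter than the input.
```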

Perfect & minimal hashing Choice of hash functions is data-dependent! Let’s try to hash 4 English words into the buckets 0,1,2,3 E.g., to efficiently compress a sentence Words: {“banana”, “glib”, “epic”, “food”} Then we can efficiently encode a sentence like “epic glib banana food” as 3,2,1,0 Can you construct a minimal perfect hash function that maps each of these to a different bucket? Needs to be efficient, not (e.g.) a list of cases
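
As a hedged illustration (one of many possible answers, and not necessarily the one intended in lecture), a single arithmetic operation on the last letter happens to separate these four words. The bucket assignment differs from the 3,2,1,0 example above, but any mapping to distinct buckets is a minimal perfect hash for this set:

```python
WORDS = ["banana", "glib", "epic", "food"]

def h(word):
    # The last letters are 'a', 'b', 'c', 'd', so subtracting 'a'
    # from the final character maps the four words to 0, 1, 2, 3.
    return ord(word[-1]) - ord("a")

assert sorted(h(w) for w in WORDS) == [0, 1, 2, 3]

def encode(sentence):
    return [h(w) for w in sentence.split()]

print(encode("epic glib banana food"))  # [2, 1, 0, 3]
```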

Recall: Quicksort Basic idea: pick a pivot element x of the array, use it to divide the array into the elements ≤ x or > x, then recurse Worst case: pivot is min or max element Complexity is O(n²) This often happens in practice (why?) What is the average case? In general, arguments about average case are much harder
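
A minimal sketch (the pivot rule here, first element, is my own choice for illustration; the slide does not fix one), which also shows why already-sorted input is a worst case:

```python
# Quicksort with the first element as pivot (out-of-place, for clarity).
def quicksort(a):
    if len(a) <= 1:
        return a
    pivot, rest = a[0], a[1:]
    left = [x for x in rest if x <= pivot]
    right = [x for x in rest if x > pivot]
    return quicksort(left) + [pivot] + quicksort(right)

# On already-sorted input the pivot is always the minimum, every split is
# maximally unbalanced, and the total work is ~n^2/2 comparisons -- one
# reason the worst case "often happens in practice".
print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```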

Expectation of random variables Randomized quicksort: pick a random pivot Equivalent to randomizing the input then running (usual) quicksort Recall: a (discrete) random variable has different possible values with different probabilities Think of a coin, a die, or a roulette wheel Visualize as a histogram The expectation of a random variable is the weighted sum, where each value is weighted by its probability Example: expectation of a 6 sided die is 3.5 Image from: https://www.omtexclasses.com/2015/02/the-distribution-of-random-variable.html
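
As a quick check of the weighted-sum definition, the die example works out in a couple of lines (exact arithmetic via fractions, just to avoid float noise):

```python
from fractions import Fraction

# Expectation of a fair six-sided die: each value weighted by its probability.
values = [1, 2, 3, 4, 5, 6]
expectation = sum(v * Fraction(1, 6) for v in values)
print(expectation)         # 7/2
print(float(expectation))  # 3.5
```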

Average case quicksort The number of comparisons performed by randomized quicksort on an input of size n is a random variable Small chance of close to the minimum, about n log n, comparisons (great luck with pivots!) or O(n²) comparisons (terrible luck!) With some effort you can show that the expectation of this random variable is O(n log n)
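
The randomized version is the same code with the pivot drawn uniformly at random (again a sketch, using the simple out-of-place formulation from above):

```python
import random

# Randomized quicksort: pick the pivot uniformly at random. The expected
# number of comparisons is O(n log n) regardless of the input order.
def randomized_quicksort(a):
    if len(a) <= 1:
        return a
    pivot = a[random.randrange(len(a))]
    left = [x for x in a if x < pivot]
    mid = [x for x in a if x == pivot]
    right = [x for x in a if x > pivot]
    return randomized_quicksort(left) + mid + randomized_quicksort(right)

print(randomized_quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```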

Universal hashing We can randomly generate a hash function h This is NOT the same as the hash function being random Hash function is deterministic! Can re-do this if it turns out to have lots of collisions Assume input keys of fixed size (e.g., 32-bit numbers) Ideally h will spread out the keys uniformly: P(h(x) = h(y) | x ≠ y) ≤ 1/2^32 Think of this as fixing x ≠ y and then picking h randomly If we had such an h, the expected number of collisions when we hash N numbers is about N²/2^32

Universal hashing by matrix multiplication This would be of merely theoretical interest if we could not generate such an h There’s a simple technique, not efficient enough to be practical More practical versions follow the same idea Now assume the inputs are 4-bit numbers (0–15) and the outputs are 3-bit numbers (0–7) We will randomly generate a 3×4 array of bits, and hash by ‘multiplying’ the input by this array

Universal hashing example We multiply using AND, and we add using parity Technically this is mod 2 arithmetic

[ 1 0 1 1 ]   [ 1 ]   [ 0 ]
[ 0 1 1 0 ] × [ 0 ] = [ 1 ]
[ 1 0 1 1 ]   [ 1 ]   [ 0 ]
              [ 0 ]
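
A small Python sketch of this scheme (sizes and the example matrix are taken from the slide; the most-significant-bit-first ordering is my assumption):

```python
import random

# Universal hashing by binary matrix multiplication (mod-2 arithmetic):
# 4-bit keys hashed to 3-bit buckets via a random 3x4 bit matrix.
IN_BITS, OUT_BITS = 4, 3

def random_matrix():
    return [[random.randint(0, 1) for _ in range(IN_BITS)]
            for _ in range(OUT_BITS)]

def to_bits(x, width):
    # Most significant bit first.
    return [(x >> (width - 1 - i)) & 1 for i in range(width)]

def h(matrix, key):
    bits = to_bits(key, IN_BITS)
    out = 0
    for row in matrix:
        # Multiply with AND, add with parity (XOR).
        parity = 0
        for m, b in zip(row, bits):
            parity ^= m & b
        out = (out << 1) | parity
    return out  # a bucket in 0..7

M = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 0, 1, 1]]  # the slide's example matrix
print(h(M, 0b1010))  # 2, i.e. 0b010, matching the slide's result (0 1 0)
```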

Balls into bins There is an important underlying idea here Shows up surprisingly often Suppose we throw m balls into n bins Where for each ball we pick a bin at random How big do we need to make n so that with probability > 1/2 there are no collisions? This is the opposite of the birthday paradox Answer: need n ≈ m² So to avoid collisions with probability ½ we need our hash table to be about the square of the number of elements
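
A quick Monte Carlo check of the m² rule (my own sketch, not from the slides):

```python
import random

# Throw m balls into n = m*m bins and estimate the chance of no collision.
def no_collision_prob(m, n, trials=10_000):
    ok = 0
    for _ in range(trials):
        bins = [random.randrange(n) for _ in range(m)]
        ok += len(set(bins)) == m  # True iff every ball got its own bin
    return ok / trials

print(no_collision_prob(m=30, n=900))  # roughly 0.6, i.e. above 1/2
```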

Perfect hashing from universal hashing We can use this to create a perfect hash function Generate a random hash function h Technically, from a universal family (like binary matrices) Use a “big enough” hash table from before, i.e., size is the square of the number of elements Then the chance of a collision is < ½
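
A sketch of the retry loop this implies. For brevity it draws from a standard (a·x + b mod p) universal family rather than the binary-matrix one; that substitution is my choice, not the slide's:

```python
import random

# Build a perfect (not minimal) hash for a fixed key set by drawing hash
# functions from a universal family until none of the keys collide. With a
# table of size len(keys)**2, each draw succeeds with probability > 1/2,
# so we expect only a couple of retries.
def make_perfect_hash(keys):
    n = len(keys) ** 2
    p = 2**61 - 1  # a Mersenne prime, assumed larger than every key
    while True:
        a = random.randrange(1, p)
        b = random.randrange(0, p)
        h = lambda x, a=a, b=b: ((a * x + b) % p) % n
        if len({h(k) for k in keys}) == len(keys):
            return h, n

keys = [10, 42, 97, 123, 4096]
h, n = make_perfect_hash(keys)
print(sorted(h(k) for k in keys))  # five distinct buckets in 0..n-1
```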

Nearest neighbor search Fundamental problem in machine learning/AI: find something in the data similar to a query Selected examples: predicting purchases (Amazon/Netflix), avoiding fraudulent credit card transactions, finding undervalued stocks, etc. Easiest way to do classification (map items to labels) Important application: density estimation Exact versus approximate algorithms Techniques are often classical CS, including hashing

Problem definitions Input: data items X = {x_1, …, x_N}, query q, distance function dist(q, x) Typically in a vector space under Euclidean (ℓ2) norm Nearest neighbor (NN) search: argmin_{x ∈ X} dist(q, x) K-nearest neighbor (KNN): find the closest K data items R-near neighbor search: {x | dist(q, x) ≤ R} Approximate NN: if x* is the NN, find {x | dist(q, x) ≤ (1 + ε)·dist(q, x*)}
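
A brute-force reference implementation of these definitions (exact, O(N) per query; a sketch, not the course's code):

```python
import math

def dist(q, x):
    return math.dist(q, x)  # Euclidean (l2) distance

def nearest_neighbor(X, q):
    return min(X, key=lambda x: dist(q, x))

def k_nearest(X, q, k):
    return sorted(X, key=lambda x: dist(q, x))[:k]

def r_near(X, q, R):
    return [x for x in X if dist(q, x) <= R]

X = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (5.0, 5.0)]
print(nearest_neighbor(X, (1.2, 0.9)))  # (1.0, 1.0)
print(k_nearest(X, (1.2, 0.9), k=2))    # [(1.0, 1.0), (2.0, 0.5)]
```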

Approximate NN via hashing Normally collisions make a hash function bad In this application, certain collisions are good! Main idea: hash the data points so that nearby items end up in the same bucket At query time, hash the query and rerank the bucket elements Most famous technique is Locality Sensitive Hashing (LSH)
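
A minimal sketch of the idea using one standard LSH family (random hyperplanes, i.e. SimHash for cosine similarity); the specific family used in lecture may differ, and a real system would use several hash tables to reduce misses:

```python
import math
import random

DIM, NUM_BITS = 2, 8
# One random hyperplane per output bit.
PLANES = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_BITS)]

def lsh_key(v):
    # One bit per hyperplane: which side of the plane is v on?
    return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0)
                 for p in PLANES)

def build_index(X):
    buckets = {}
    for x in X:
        buckets.setdefault(lsh_key(x), []).append(x)
    return buckets

def query(buckets, q):
    # Hash the query, then rerank only the items that landed in its bucket.
    candidates = buckets.get(lsh_key(q), [])
    return min(candidates, key=lambda x: math.dist(q, x), default=None)

X = [(1.0, 0.1), (0.9, 0.2), (-1.0, 0.0), (0.1, 1.0)]
buckets = build_index(X)
print(query(buckets, (1.0, 0.0)))  # usually (1.0, 0.1); LSH can miss
```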

NN Density estimation Lots of practical questions boil down to density estimation Even if you don’t explicitly say you’re doing it! “What do typical currency fluctuations look like?” Your classes generally have some density in feature space Hopefully they are compact and well-separated Given a new data point, which class does it belong to? One hypothesis per class, maximize P(data | class); this requires the density
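
A toy sketch tying the two ideas together (the k nearest labeled neighbors act as a crude local density estimate per class; my own illustration, not from the slides):

```python
import math
from collections import Counter

def knn_classify(points, labels, q, k=3):
    # Sort training points by distance to the query and let the k closest vote.
    order = sorted(range(len(points)), key=lambda i: math.dist(q, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(points, labels, (0.5, 0.5)))  # "a"
print(knn_classify(points, labels, (5.5, 5.5)))  # "b"
```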