Randomized Algorithms Part 3
William Cohen

Outline
Randomized methods - so far:
– SGD with the hash trick
– Bloom filters
– count-min sketches
Today:
– Review and discussion
– More on count-min
– Morris counters
– locality sensitive hashing

Locality Sensitive Hashing (LSH)
Bloom filter:
– a set of objects is mapped to a bit vector
– allows: add to the set, check containment
Count-min sketch:
– a sparse vector x is mapped to a small dense matrix
– allows: recover an approximate value of x[i]; especially useful for the largest values
Locality sensitive hash:
– a feature vector x is mapped to a bit vector bx
– allows: compute the approximate similarity of bx and by

LSH: key ideas
Goal:
– map a feature vector x to a bit vector bx
– ensure that bx preserves "similarity"

Random Projections

Random projections
[figure: two points u and -u, separated by a margin of 2γ]

[figure: the same two points u and -u, with margin 2γ]
To make those points "close" we need to project onto a direction orthogonal to the line between them.

Random projections
[figure: the same two points u and -u, with margin 2γ]
Any other direction will keep the distant points distant. So if I pick a random r, and r.x and r.x' are closer than γ, then probably x and x' were close to start with.

LSH: key ideas
Goal:
– map a feature vector x to a bit vector bx
– ensure that bx preserves "similarity"
Basic idea: use random projections of x
– Repeat many times:
  - Pick a random hyperplane r by picking a random weight for each feature (say from a Gaussian)
  - Compute the inner product of r with x
  - Record whether x is "close to" r (r.x >= 0) – this is the next bit in bx
Theory says that if x and x' have small cosine distance, then bx and bx' will have small Hamming distance.
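A quick way to see this claim is to simulate it. The sketch below (not from the slides; all names are illustrative) draws random Gaussian hyperplanes and checks that the fraction of disagreeing bits between two vectors tracks the angle between them divided by π, which is the standard random-hyperplane guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_bits(x, R):
    """One LSH bit per random hyperplane: 1 if r.x >= 0, else 0."""
    return (R @ x >= 0).astype(np.uint8)

d, n_bits = 100, 10000                 # feature dimension, number of random hyperplanes
R = rng.normal(size=(n_bits, d))       # rows are the random hyperplanes r

x = rng.normal(size=d)
xp = x + 0.3 * rng.normal(size=d)      # a nearby vector

bx, bxp = lsh_bits(x, R), lsh_bits(xp, R)
hamming = float(np.mean(bx != bxp))    # fraction of disagreeing bits
cos = np.dot(x, xp) / (np.linalg.norm(x) * np.linalg.norm(xp))
angle = np.arccos(np.clip(cos, -1.0, 1.0))

print(f"angle/pi = {angle / np.pi:.3f}   disagreeing bits = {hamming:.3f}")
```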

LSH applications
Compact storage of data
– and we can still compute similarities
LSH also gives very fast…:
– approximate nearest neighbor methods
  - just look at other items with bx' = bx
  - also very fast nearest-neighbor methods for Hamming distance
– approximate clustering/blocking
  - a cluster = all things with the same bx vector
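A minimal sketch (not from the slides) of those two uses, assuming signatures come from the random-hyperplane scheme above: blocking groups items by their exact signature, and an approximate nearest-neighbor query keeps items within a small Hamming distance of the query's signature.

```python
from collections import defaultdict
import numpy as np

def lsh_bits(x, R):
    return (R @ x >= 0).astype(np.uint8)      # one bit per random hyperplane

def block_by_signature(vectors, R):
    """Approximate clustering / blocking: group items with identical bit vectors."""
    blocks = defaultdict(list)
    for name, x in vectors.items():
        blocks[lsh_bits(x, R).tobytes()].append(name)
    return blocks

def approx_neighbors(query, vectors, R, max_hamming=2):
    """Approximate nearest neighbors: items whose signature differs from the
    query's signature in at most max_hamming bits."""
    bq = lsh_bits(query, R)
    return [name for name, x in vectors.items()
            if int(np.sum(bq != lsh_bits(x, R))) <= max_hamming]

rng = np.random.default_rng(0)
R = rng.normal(size=(16, 5))                   # 16-bit signatures over 5 features
vecs = {"a": rng.normal(size=5), "b": rng.normal(size=5)}
print(block_by_signature(vecs, R))
print(approx_neighbors(vecs["a"], vecs, R, max_hamming=3))
```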

Locality Sensitive Hashing (LSH) and Pooling Random Values

LSH algorithm
Naïve algorithm:
– Initialization:
  For i = 1 to outputBits:
    For each feature f:
      Draw r(i,f) ~ Normal(0,1)
– Given an instance x:
  For i = 1 to outputBits:
    LSH[i] = sum(x[f] * r(i,f) for f with a non-zero weight in x) > 0 ? 1 : 0
  Return the bit-vector LSH
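A runnable version of the naïve algorithm above, under the assumption that a sparse instance x is a Python dict mapping feature names to values (the dict representation and the helper names are illustrative, not from the slides).

```python
import numpy as np

def init_random_planes(vocabulary, output_bits, seed=0):
    """Initialization: r(i,f) ~ Normal(0,1), one weight per (output bit, feature)."""
    rng = np.random.default_rng(seed)
    return [{f: rng.normal() for f in vocabulary} for _ in range(output_bits)]

def lsh_signature(x, planes):
    """x is a sparse instance {feature: value}; returns the bit vector LSH."""
    bits = []
    for r_i in planes:
        dot = sum(v * r_i[f] for f, v in x.items())   # only x's non-zero features
        bits.append(1 if dot > 0 else 0)
    return bits

vocab = ["cat", "dog", "fish", "bird"]
planes = init_random_planes(vocab, output_bits=8)
print(lsh_signature({"cat": 2.0, "dog": 1.0}, planes))
```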

LSH algorithm
But: storing the k classifiers (one random hyperplane per output bit) is expensive in high dimensions
– for each of, say, 256 bits, a dense vector of weights for every feature in the vocabulary
Storing seeds and random number generators instead:
– possible, but somewhat fragile

LSH: "pooling" (van Durme)
Better algorithm:
– Initialization:
  Create a pool:
    Pick a random seed s
    For i = 1 to poolSize:
      Draw pool[i] ~ Normal(0,1)
  For i = 1 to outputBits:
    Devise a random hash function hash(i,f):
      e.g.: hash(i,f) = hashcode(f) XOR randomBitString[i]
– Given an instance x:
  For i = 1 to outputBits:
    LSH[i] = sum(x[f] * pool[hash(i,f) % poolSize] for f in x) > 0 ? 1 : 0
  Return the bit-vector LSH
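A sketch of the pooling trick, with two stated assumptions: Python's built-in hash() stands in for hashcode(f) (a stable hash, e.g. from hashlib, would be used in practice, since Python randomizes string hashes across runs), and a random integer per output bit plays the role of randomBitString[i].

```python
import numpy as np

def init_pool(pool_size, output_bits, seed=0):
    rng = np.random.default_rng(seed)
    pool = rng.normal(size=pool_size)                      # the shared pool of Gaussians
    masks = rng.integers(0, 2**31, size=output_bits)       # one random bit string per output bit
    return pool, masks

def pooled_lsh_signature(x, pool, masks):
    """x is a sparse instance {feature: value}; hash(f) XOR masks[i] picks a pool entry."""
    bits = []
    for mask in masks:
        dot = sum(v * pool[(hash(f) ^ int(mask)) % len(pool)] for f, v in x.items())
        bits.append(1 if dot > 0 else 0)
    return bits

pool, masks = init_pool(pool_size=10_000, output_bits=8)
print(pooled_lsh_signature({"cat": 2.0, "dog": 1.0}, pool, masks))
```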

LSH: key ideas: pooling
Advantages:
– with pooling, this is a compact re-encoding of the data
  - you don't need to store the r's, just the pool

Locality Sensitive Hashing (LSH) in an On-line Setting

LSH: key ideas: online computation
Common task: distributional clustering
– for a word w, x(w) is a sparse vector of the words that co-occur with w
– cluster the w's
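Since each LSH bit is just the sign of a running sum, the signature can be maintained online as co-occurrence pairs stream by, without ever materializing the sparse vectors x(w). A minimal sketch combining this with the pooling idea above (helper names are illustrative, not from the slides):

```python
from collections import defaultdict
import numpy as np

N_BITS, POOL_SIZE = 64, 10_000
rng = np.random.default_rng(0)
POOL = rng.normal(size=POOL_SIZE)                      # shared pool, as on the previous slides
MASKS = rng.integers(0, 2**31, size=N_BITS)            # one random bit string per output bit

# running_dot[w][i] holds the current value of r_i . x(w)
running_dot = defaultdict(lambda: np.zeros(N_BITS))

def observe(w, context_word, count=1.0):
    """Stream update: w was seen co-occurring with context_word."""
    sums = running_dot[w]
    for i, mask in enumerate(MASKS):
        sums[i] += count * POOL[(hash(context_word) ^ int(mask)) % POOL_SIZE]

def signature(w):
    """Current LSH bit vector for w, read off the signs of the running sums."""
    return (running_dot[w] >= 0).astype(np.uint8)

observe("bank", "river"); observe("bank", "money")
print(signature("bank"))
```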

Experiment
– Corpus: 700M+ tokens, 1.1M distinct bigrams
– For each bigram, build a feature vector of the words that co-occur near it, using on-line LSH
– Check the results against 50,000 actual vectors
(similar to the problem we looked at Tuesday using sketches)

Experiment