SIMILARITY ESTIMATION TECHNIQUES FROM ROUNDING ALGORITHMS
Paper Review (Advanced Database), 2015.04.15, presented by Jieun Lee
Paper author: Moses S. Charikar, Princeton University

LOCALITY SENSITIVE HASH SCHEME
A Locality Sensitive Hash (LSH) scheme is a distribution on a family F of hash functions operating on a collection of objects, such that for objects x and y,
Pr_{h in F}[h(x) = h(y)] = sim(x, y).
Applications: efficient algorithms for approximate nearest neighbor search and clustering, and compact representation schemes for estimating similarity.
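A minimal sketch of the defining property, assuming bit vectors with sim(x, y) equal to the fraction of agreeing coordinates (a hypothetical similarity chosen only for illustration): the family F that samples a uniformly random coordinate satisfies the LSH condition exactly.

```python
import random

def sample_hash(d):
    """Draw h from the family F = {h_i(x) = x[i]}: pick a random coordinate."""
    i = random.randrange(d)
    return lambda x: x[i]

def estimate_sim(x, y, trials=20000):
    """Empirical estimate of Pr_{h in F}[h(x) = h(y)]."""
    hits = 0
    for _ in range(trials):
        h = sample_hash(len(x))
        if h(x) == h(y):
            hits += 1
    return hits / trials

x = [0, 1, 1, 0, 1]
y = [0, 1, 0, 0, 1]
# sim(x, y) = 4/5, so the estimate concentrates around 0.8
```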

COMPACT SKETCHES FOR ESTIMATING SIMILARITY
Given a collection of objects (documents, images) with an implicit similarity/distance function, we want to estimate similarity without examining entire objects: compute compact sketches of the objects so that similarity/distance can be estimated from the sketches alone.

NEW LSH SCHEMES
Rounding algorithms for LPs and SDPs, used in the context of approximation algorithms, can be viewed as LSH schemes for several collections of objects. New LSH schemes:
- A collection of vectors, with the distance between vectors u and v measured by the angle between them. This gives a simple alternative to min-wise independent permutations for estimating set similarity.
- A collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD). The new scheme maps distributions to points in the metric space such that, for distributions P and Q,
EMD(P, Q) <= E[d(h(P), h(Q))] <= O(log n log log n) * EMD(P, Q).

INTRODUCTION
Consider a collection of subsets of a universe, with the similarity of sets S1 and S2 defined as
sim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|.
This is exactly the Jaccard coefficient of similarity used in information retrieval; the similarity value lies in [0, 1].

INTRODUCTION
Broder et al. [8, 5, 7, 6] introduced the notion of min-wise independent permutations: for a random permutation σ of the universe,
Pr[min{σ(S1)} = min{σ(S2)}] = |S1 ∩ S2| / |S1 ∪ S2|,
so the minimum element under σ serves as a one-coordinate sketch of each set.
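A sketch of the min-wise scheme in Python (the universe size, the two sets, and the number of permutations are illustrative choices, and explicit random permutations are used rather than the practical hash-based approximations):

```python
import random

def minhash_sketch(s, perms):
    """Sketch of set s: the minimum rank of its elements under each permutation."""
    return [min(p[x] for x in s) for p in perms]

def estimate_jaccard(sk1, sk2):
    """Fraction of coordinates where the two sketches agree."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

universe = list(range(100))
random.seed(1)
perms = []
for _ in range(500):
    order = universe[:]
    random.shuffle(order)
    perms.append({x: rank for rank, x in enumerate(order)})

S1 = set(range(0, 60))    # |S1 ∩ S2| = 30, |S1 ∪ S2| = 90
S2 = set(range(30, 90))   # true Jaccard coefficient = 1/3
```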

OUR RESULTS
- Necessary conditions for the existence of similarity preserving hashing (SPH) schemes.
- SPH schemes from rounding algorithms:
  - a hash function for vectors based on random hyperplane rounding;
  - a hash function for estimating the Earth Mover Distance, based on rounding schemes for classification with pairwise relationships.

EXISTENCE OF LSH FUNCTIONS
We discuss necessary properties for the existence of LSH function families. sim(x, y) admits an SPH scheme if there exists a family of hash functions F such that
Pr_{h in F}[h(x) = h(y)] = sim(x, y).

Lemma 1: If sim(x, y) admits an SPH scheme, then 1 - sim(x, y) satisfies the triangle inequality.
Proof sketch: Let Δ_h(x, y) be the indicator variable for the event h(x) ≠ h(y). For each fixed h, Δ_h satisfies the triangle inequality: if h(x) ≠ h(z), then h must differ on at least one of the pairs (x, y) and (y, z). Since E_h[Δ_h(x, y)] = 1 - sim(x, y) and the triangle inequality is preserved under expectation, the claim follows.

Lemma 2: If sim(x, y) admits an SPH scheme, then (1 + sim(x, y))/2 has an SPH scheme with hash functions mapping objects to {0, 1}.
Lemma 3: If sim(x, y) admits an SPH scheme, then 1 - sim(x, y) is isometrically embeddable in the Hamming cube.

RANDOM HYPERPLANE BASED HASH FUNCTIONS FOR VECTORS
Given a collection of vectors in R^d, choose a random vector r from the d-dimensional Gaussian distribution. Corresponding to this vector, define a hash function as follows:
h_r(u) = 1 if r · u ≥ 0, and h_r(u) = 0 if r · u < 0.
Then for vectors u and v,
Pr[h_r(u) = h_r(v)] = 1 - θ(u, v)/π,
where θ(u, v) is the angle between u and v.
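The hyperplane scheme can be sketched directly: sampling many random Gaussian vectors and counting sign disagreements recovers the angle. The example vectors and trial count below are illustrative.

```python
import math
import random

def random_hyperplane_hash(dim, rng):
    """h_r(u) = 1 iff r . u >= 0, for a random Gaussian vector r."""
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda u: 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0

def estimate_angle(u, v, trials, rng):
    """Pr[h_r(u) != h_r(v)] = theta(u, v) / pi, so the angle is
    pi times the observed disagreement rate."""
    disagreements = 0
    for _ in range(trials):
        h = random_hyperplane_hash(len(u), rng)
        if h(u) != h(v):
            disagreements += 1
    return math.pi * disagreements / trials

u = [1.0, 0.0]
v = [0.0, 1.0]   # orthogonal vectors: true angle is pi/2
```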

EARTH MOVER DISTANCE (EMD)
Given a set of points L = {l_1, ..., l_n} with a distance function d(i, j), a distribution P(L) is a collection of non-negative weights (p_1, ..., p_n) such that Σ_i p_i = 1. The Earth Mover Distance between distributions P(L) and Q(L) is the minimum cost of a flow transforming P into Q, where moving weight f_{i,j} from point i to point j costs f_{i,j} · d(i, j):
EMD(P, Q) = min Σ_{i,j} f_{i,j} · d(i, j), subject to Σ_j f_{i,j} = p_i, Σ_i f_{i,j} = q_j, f_{i,j} ≥ 0.
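In general EMD is a transportation LP, but in the special case where the points are 0, 1, ..., n-1 on a line (an assumption made here purely to make the definition concrete), it reduces to the total absolute difference between the two cumulative distributions, since each unit of mass moved one step costs 1:

```python
def emd_on_line(p, q):
    """EMD between two distributions on the points 0, 1, ..., n-1 of the
    real line: sum over prefixes of |cumulative surplus|, i.e. the mass
    that must cross each unit gap."""
    assert len(p) == len(q)
    surplus, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        surplus += pi - qi   # net mass that must move rightward past this point
        total += abs(surplus)
    return total

# Moving all mass from point 0 to point 2 costs 2.
```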

RELAXATION OF SPH
In designing hash functions to estimate the EMD, the definition of LSH is relaxed in three ways:
1) Estimate a distance measure, not a similarity measure in [0, 1].
2) Allow the hash functions to map objects to points in a metric space and measure E[d(h(x), h(y))]. (In an SPH scheme, d(x, y) = 1 whenever x ≠ y.)
3) The estimator only approximates the EMD, up to a multiplicative factor.

CLASSIFICATION WITH PAIRWISE RELATIONSHIPS
Given a collection of objects V and a set of labels L, an assignment of labels is a map h : V → L. Assigning label h(u) to object u costs c(u, h(u)). Related objects form a graph; for each edge e = (u, v), a separation cost w_e · d(h(u), h(v)) is paid, where d is a metric on the labels. Goal: find an assignment of labels that minimizes the total cost.
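The objective just described can be written down directly; the function name and the toy two-object instance below are illustrative, not from the paper.

```python
def labeling_cost(h, assign_cost, edges, label_dist):
    """Objective for classification with pairwise relationships:
    per-object assignment costs c(u, h(u)) plus, for each edge (u, v)
    with weight w, the separation cost w * d(h(u), h(v))."""
    vertex = sum(assign_cost[u][h[u]] for u in h)
    pairwise = sum(w * label_dist[h[u]][h[v]] for (u, v, w) in edges)
    return vertex + pairwise

# Toy instance: two objects, two labels, uniform metric on the labels.
assign_cost = {"u": [0.0, 2.0], "v": [2.0, 0.0]}
edges = [("u", "v", 1.0)]
uniform = [[0.0, 1.0], [1.0, 0.0]]
# Labeling each object with its cheap label pays only the edge cost;
# agreeing on one label pays one object's assignment cost instead.
```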

LP RELAXATION AND ROUNDING
The LP relaxation assigns each object a distribution over the labels; the separation cost of an edge whose endpoints receive distributions P and Q is measured by EMD(P, Q). The rounding algorithm guarantees
E[d(h(P), h(Q))] ≤ O(log n log log n) · EMD(P, Q).

ROUNDING DETAILS
Probabilistically approximate the metric on L by a tree metric (HST), with expected distortion O(log n log log n). EMD on a tree metric has a nice closed form: for a subtree T, let P(T) be the sum of the probabilities of the leaves in T, and l_T the length of the edge leading up from T; then
EMD(P, Q) = Σ_T l_T · |P(T) - Q(T)|.
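The closed form on trees is easy to evaluate with one traversal; the tree representation below (child lists plus edge lengths to parents) is an implementation choice for illustration.

```python
def tree_emd(children, length, p, q, root):
    """EMD under a tree metric: sum over subtrees T of l_T * |P(T) - Q(T)|,
    where l_T is the length of the edge above T.
    children[v]: list of children of v; length[v]: edge length from v to
    its parent; p, q: maps from leaf to probability."""
    total = 0.0

    def visit(v):
        nonlocal total
        p_mass, q_mass = p.get(v, 0.0), q.get(v, 0.0)
        for c in children.get(v, []):
            cp, cq = visit(c)
            p_mass += cp
            q_mass += cq
        if v != root:  # the root has no edge above it
            total += length[v] * abs(p_mass - q_mass)
        return p_mass, q_mass

    visit(root)
    return total

# Two leaves hanging off a root by unit edges: moving all mass from
# leaf "a" to leaf "b" costs the path length 2.
children = {"r": ["a", "b"]}
length = {"a": 1.0, "b": 1.0}
```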

Theorem: The rounding scheme gives a hashing scheme such that
EMD(P, Q) ≤ E[d(h(P), h(Q))] ≤ O(log n log log n) · EMD(P, Q).
Proof (of the lower bound): Let y_{i,j} be the probability that h(P) = l_i and h(Q) = l_j. The y_{i,j} give a feasible solution to the LP for EMD(P, Q), and the cost of this solution equals E[d(h(P), h(Q))]. Hence EMD(P, Q) ≤ E[d(h(P), h(Q))].

WEIGHTED SETS
A weighted set is a vector (p_1, p_2, ..., p_n) with weights in [0, 1]. The Kleinberg-Tardos rounding scheme for the uniform metric can be thought of as a hashing scheme for weighted sets with
Pr[h(P) = h(Q)] = Σ_i min(p_i, q_i) / Σ_i max(p_i, q_i),
a generalization of min-wise independent permutations.
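The quantity this scheme estimates (not the rounding itself) can be computed exactly, which also shows it specializes to the Jaccard coefficient when all weights are 0/1:

```python
def minmax_similarity(p, q):
    """Weighted-set similarity: sum_i min(p_i, q_i) / sum_i max(p_i, q_i).
    For 0/1 weight vectors this is exactly the Jaccard coefficient of the
    corresponding sets."""
    numerator = sum(min(a, b) for a, b in zip(p, q))
    denominator = sum(max(a, b) for a, b in zip(p, q))
    return numerator / denominator if denominator else 1.0
```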

CONCLUSION
- An interesting connection between rounding procedures for approximation algorithms and hash functions for estimating similarity.
- Better estimators for the Earth Mover Distance.
- Open directions: the variance of the estimators was ignored (it is related to dimensionality reduction in L_1); study compact representation schemes in general.