Download presentation
Presentation is loading. Please wait.
Published byThomas James Modified over 8 years ago
1
S IMILARITY E STIMATION T ECHNIQUES FROM R OUNDING A LGORITHMS Paper Review 2015.04.15 Jieun Lee Moses S. Charikar Princeton University Advanced Database
2
L OCALITY S ENSITIVE H ASH S CHEME Locality Sensitive Hash(LSH) Scheme distribution on a family F of hash functions operating on a collection of objects such that Efficient algorithms for approximate nearest neighbor search and clustering Compact representation scheme for estimating similarity
3
C OMPACT SKETCHES FOR ESTIMATIN SIMILARITY Collection of objects(documents, images) Implicit similarity/distance function. Estimate similarity without looking at entire objects. Compute compact sketches of objects so that similarity/distance can be estimated from them.
4
N EW LSH S CHEMS Rounding algorithms for LPs and SDPs used in the context of approximation algorithms It can be viewed as LSH schems for several collections of objects. New LSH Schems A collection of vectors with the distance Simple alternative to minwise independent permutations for estimating set similarity. A collection of distributions on n points New Schems map distributions to points in the metric space such that, for distributions P and Q.
5
I NTRODUCTION This is exactly The Jaccard coefficient of similarity used in information retrieval. Collection of subsets Similarity number is in [0,1]. S1S2 similarity
6
I NTRODUCTION Broder et al [8, 5, 7, 6] introduced the notion of min-wise independent permutations S1 S2 σ σ S1S2
7
O UR R ESULTS Necessary conditions for existence of similarity preserving hashing(SPH) SPH schemes from rounding algorithms Hash function for vectors based on random hyperplane rounding. Hash function for estimating Earth Mover Distance based on rounding schemes for classification with pairwise relationships.
8
E XISTENCE OF LSH F UNCTIONS Discuss necessary properties for the existence of LSH function families. sim(x, y) admits an SPH scheme if ∃ family of hash functions F such that
9
Lemma 1: If sim(x,y) admits an SPH scheme then 1-sim(x,y) satisfies triangle inequality. Proof: : indicator(dummy) variable for the event
10
Lemma 2: If sim(x,y) admits an SPH scheme then (1+sim(x,y))/2 has an SPH scheme with hash functions mapping objects to {0,1}. Lemma 3: If sim(x,y) admits an SPH scheme then 1-sim(x,y) is isometrically embeddable in the Hamming cube.
11
R ANDOM HYPERPLANE BASED HASH FUNCTIONS FOR VECTORS Collection of vectors in Choose a random vector Corresponding to this vector, defined hash function as follows: Then for vectors and,
12
E ARTH M OVER D ISTANCE (EMD) Set of points Distance function Distribution P(L) is a collection of non-negative weights such that Earth Mover Distance between distributions P(L) and Q(L)
13
R ELAXATION OF SHP In designing hash functions to estimate the EMD, relax the definition of lsh in three ways. 1) Estimate distance measure, not a similarity measure in [0,1]. 2) Allow the hash functions to map obejects to points in metric space and measure ( SPH: d(x,y) = 1 if x ≠ y ) 3) Estimator will approximate EMD.
14
C LASSIFICATION WITH PAIRWISE RELATIONSHIPS Collection of objects V Labels Assignment of labels h : V→L Cost of assigning label to u: c(u,h(u)) Graph of related objects; for edge e=(u,v), cost paid: Find assignment of labels to minimize cost.
15
LP R ELAXATION AND R OUNDING Separation cost measured by EMD(P,Q) Rounding algorithm guarnatees P Q
16
R OUNDING DETAILS Probabilistically approximate metric on L by tree metric (HST) Expected distortion O(log n log log n) EMD on tree metric has nice form: T: subtree P(T): sum of probabilities for leaves in T l T: length of edge leading up from T EMD(P,Q) =
17
Theorem The rounding scheme gives a hashing scheme shch that Proof: y i,j : Probability that h(p)=l i, h(Q)=l j y i,j give feasible solution to LP for EMD Cost of this solution = E[d(h(P), h(Q)] Hence EMD(P,Q) ≤ E[d(h(P), h(Q)]
18
W EIGHTED SETS Weighted set: (p 1, p 1,...p n ), weights in [0,1] Kleinberg Kleinberg-Tardos rounding scheme for uniform metric can be thought of as a hashing scheme for weighted sets with Generalization of minwise independent permutations
19
C ONCLUSION Interesting connection between rounding procedures for approximation algorithms and hash functions for estimating similarity. Better estimators for Earth Mover Distance Ignored variance of estimators: related to dimensionality reduction in L 1 Study compact representation schemes in general
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.