1 Efficient classification for metric data
Lee-Ad Gottlieb (Hebrew U.), Aryeh Kontorovich (Ben-Gurion U.), Robert Krauthgamer (Weizmann Institute)

2-4 Classification problem
A fundamental problem in learning:
- Point space X; probability distribution P on X × {-1, 1}.
- The learner observes a sample S of n points (x, y) drawn i.i.d. from P.
- It wants to predict the labels of other points in X, so it produces a hypothesis h: X → {-1, 1} with empirical error err_S(h) = (1/n)·|{(x, y) in S : h(x) ≠ y}| and true error err(h) = P{(x, y) : h(x) ≠ y}.
- Goal: bound the true error in terms of the empirical error, uniformly over h, with high probability.

5 Generalization bounds
How do we upper bound the true error? Use a generalization bound.
- Roughly speaking (and with high probability): true error ≤ empirical error + (complexity of h)/n.
- A more complex classifier is "easier" to fit to arbitrary data.
- VC-dimension: the size of the largest point set that can be shattered by the hypothesis class.

6 Popular approach for classification
Assume the points are in Euclidean space!
Pros:
- Existence of an inner product
- Efficient algorithms (SVM)
- Good generalization bounds (max margin)
Cons:
- Many natural settings are non-Euclidean
- Euclidean structure is a strong assumption
Recent popular focus: metric space data.

7 Metric space
(X, d) is a metric space if:
- X is a set of points
- d(·,·) is a distance function: nonnegative, symmetric, and satisfying the triangle inequality.
An inner product induces a norm, and a norm induces a metric, but the converse implications do not hold.
(Figure: Haifa, Tel Aviv and Be'er Sheva with pairwise distances of 95 km, 113 km and 208 km, illustrating the triangle inequality.)

8 Classification for metric data?
Advantage: often much more natural
- A much weaker assumption
- Strings
- Images (earthmover distance)
Problem: no vector representation
- No notion of dot product (and no kernel). What to do?
- Invent a kernel (e.g. embed into Euclidean space)? Possibly high distortion!
- Use some nearest-neighbor heuristic? The NN classifier has infinite VC-dimension!

9 Preliminaries: Lipschitz constant
The Lipschitz constant of a function f: X → R measures its smoothness:
- It is the smallest value L satisfying |f(x_i) - f(x_j)| ≤ L·d(x_i, x_j) for all points x_i, x_j in X.
- It is denoted ||f||_Lip.
Suppose a hypothesis h: S → {-1, 1} is consistent with the sample S:
- The Lipschitz constant of h is determined by the closest pair of differently labeled points,
- or equivalently, ||h||_Lip ≥ 2/d(S+, S-).
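To make this concrete, here is a minimal brute-force sketch (Python; the function name, the toy data, and the plug-in metric are all hypothetical, not from the paper) that computes d(S+, S-) for a ±1-labeled sample and the implied smallest possible Lipschitz constant 2/d(S+, S-) of any consistent classifier.

```python
import numpy as np

def lipschitz_constant_of_labels(points, labels, metric):
    """Smallest Lipschitz constant of any function consistent with the
    +/-1 labels: 2 / d(S+, S-), where d(S+, S-) is the distance between
    the two label classes. Brute-force O(n^2) sketch; `metric` is any
    user-supplied distance function."""
    pos = [p for p, y in zip(points, labels) if y == +1]
    neg = [p for p, y in zip(points, labels) if y == -1]
    d_plus_minus = min(metric(p, q) for p in pos for q in neg)
    return 2.0 / d_plus_minus, d_plus_minus

# Toy usage with Euclidean distance (illustrative data only).
euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (1.2, 0.9)]
y = [+1, +1, -1, -1]
L, gap = lipschitz_constant_of_labels(X, y, euclid)
```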

10 Preliminaries: Lipschitz extension
Lipschitz extension is a classic problem in analysis:
- Given a function f: S → R for S ⊆ X, extend f to all of X without increasing the Lipschitz constant.
- Example: points on the real line with f(1) = 1 and f(-1) = -1.
(Figure credit: A. Oberman)

11 Classification for metric data
A powerful framework for metric classification was introduced by von Luxburg & Bousquet (vLB, JMLR '04):
- Construction of h on S: the natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions.
- Estimation of h on X: evaluating h at new points of X reduces to finding a Lipschitz function consistent with h, i.e. the Lipschitz extension problem. For example, f(x) = min_i [f(x_i) + 2·d(x, x_i)/d(S+, S-)] over all (x_i, y_i) in S.
- Evaluation of h therefore reduces to exact nearest neighbor search, a strong theoretical motivation for the NNS classification heuristic.
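Continuing the toy sketch above (again with hypothetical names and a user-supplied metric), the extension formula can be evaluated by a direct linear scan; in the vLB framework this scan is exactly what exact (and later approximate) nearest neighbor search replaces.

```python
def lipschitz_extension_classifier(sample, labels, metric):
    """Brute-force sketch of the extension formula quoted above:
    f(x) = min_i [y_i + L*d(x, x_i)] with L = 2/d(S+, S-).
    The resulting f is consistent with the sample and L-Lipschitz."""
    pos = [p for p, y in zip(sample, labels) if y == +1]
    neg = [p for p, y in zip(sample, labels) if y == -1]
    L = 2.0 / min(metric(p, q) for p in pos for q in neg)

    def f(x):
        # O(n) scan; this is the step that nearest neighbor search speeds up.
        return min(y + L * metric(x, xi) for xi, y in zip(sample, labels))

    def predict(x):
        return 1 if f(x) >= 0 else -1

    return f, predict

# e.g. f, predict = lipschitz_extension_classifier(X, y, euclid)  # toy data above
```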

12 More on von Luxburg & Bousquet
(*modulo some cheating)

13 von Luxburg & Bousquet, cont'd

14 Two new directions
The framework of [vLB '04] leaves open two further questions:
- Constructing h, handling noise: a bias-variance tradeoff. Which sample points in S should h ignore?
- Evaluating h on X: in an arbitrary metric space, exact NNS requires Θ(n) time. Can we do better?

15 Doubling dimension
Definition: the ball B(x, r) is the set of all points within distance r of x. The doubling constant λ (of a metric M) is the minimum value such that every ball can be covered by λ balls of half the radius.
- First used by [Assouad '83], algorithmically by [Clarkson '97].
- The doubling dimension is ddim(M) = log_2 λ(M).
- A metric is doubling if its doubling dimension is constant.
- Euclidean: ddim(R^d) = O(d).
Packing property of doubling spaces:
- A set with diameter diam and minimum inter-point distance a contains at most (diam/a)^O(ddim) points.
(Figure: a ball covered by half-radius balls; here λ ≥ 7.)
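For intuition only, the following brute-force sketch (hypothetical interface, not from the paper) estimates the doubling constant of a finite point set by greedily covering every ball B(x, r) with half-radius balls centered at data points; greedy covering only upper-bounds the optimal cover, so this is an estimate rather than the exact constant.

```python
import numpy as np
from itertools import combinations

def doubling_constant_estimate(points, metric):
    """Greedy upper estimate of the doubling constant of a finite set:
    for every data point x and every pairwise distance r, greedily cover
    B(x, r) with balls of radius r/2 centered at data points, and report
    the largest cover size encountered."""
    n = len(points)
    D = np.array([[metric(p, q) for q in points] for p in points])
    radii = sorted({D[i, j] for i, j in combinations(range(n), 2) if D[i, j] > 0})
    best = 1
    for x in range(n):
        for r in radii:
            uncovered = {i for i in range(n) if D[x, i] <= r}
            covers = 0
            while uncovered:
                # greedily pick the data point whose r/2-ball covers the most
                c = max(range(n), key=lambda j: sum(D[j, i] <= r / 2 for i in uncovered))
                uncovered -= {i for i in uncovered if D[c, i] <= r / 2}
                covers += 1
            best = max(best, covers)
    return best  # ddim estimate: log2(best)

# Toy usage with Euclidean distance:
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
euclid = lambda a, b: float(np.hypot(a[0] - b[0], a[1] - b[1]))
ddim_est = np.log2(doubling_constant_estimate(pts, euclid))
```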

16 Applications of doubling dimension
Major application to databases:
- Recall that exact NNS requires Θ(n) time in an arbitrary metric space.
- There exists a linear-size structure that supports approximate nearest neighbor search in 2^O(ddim)·log n time.
Database/network structures and tasks analyzed via the doubling dimension:
- Nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]
- Image recognition (vision) [KG --]
- Spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]
- Distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]
- Clustering [Tal '04, ABS '08, FM '10]
- Routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08]
Further applications:
- Travelling Salesperson [Tal '04]
- Embeddings [Ass '84, ABN '08, BRS '07, GK '11]
- Machine learning [BLL '09, KKL '10, KKL --]
Note: the above algorithms can be extended to nearly-doubling spaces [GK '10].
Message: this is an active line of research.

17 Fat-shattering dimension
[Alon, Ben-David, Cesa-Bianchi, Haussler '97]: an analogue of the VC-dimension for real-valued function classes. Roughly, a set of points is γ-shattered by a class F if, for some choice of thresholds (one per point), every ±1 labeling of the points can be realized by some f in F with margin at least γ around its threshold; the γ-fat-shattering dimension is the size of the largest such set.

18 Fat-shattering and generalization

19 Our dual use of doubling dimension
Interestingly, the doubling dimension contributes in two different areas:
- Statistical (function complexity): we bound the complexity of the hypothesis in terms of the doubling dimension of X and the Lipschitz constant of the classifier h.
- Computational: efficient approximate NNS.

20 Statistical contribution
We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension.
- vLB provided similar bounds using covering numbers and Rademacher averages.
Fat-shattering analysis:
- If L-Lipschitz functions shatter a set, then its inter-point distance is at least 2/L.
- By the packing property, such a set has at most (diam·L)^O(ddim) points.
- This is the fat-shattering dimension of the classifier class on the space, and it is a good measure of its complexity.

21 Statistical contribution
[BST '99]:
- For any f that classifies a sample of size n correctly, we have with probability at least 1 - δ:
  P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n)·(d·log(34en/d)·log_2(578n) + log(4/δ)).
- Likewise, if f is correct on all but k examples, we have with probability at least 1 - δ:
  P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n)·(d·ln(34en/d)·log_2(578n) + ln(4/δ))]^(1/2).
- In both cases, d is bounded by the fat-shattering dimension, d ≤ (diam·L)^ddim + 1.
Done with the statistical contribution; on to the computational contribution.
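As a purely numeric illustration of the second bound (the version allowing k sample errors), the sketch below plugs the formula into Python; the sample size, diameter, Lipschitz constant, doubling dimension, and confidence parameter are made-up values, and the helper name is mine.

```python
import math

def generalization_bound(n, k, diam, L, ddim, delta):
    """Numeric evaluation of the [BST '99]-style bound quoted above:
    err <= k/n + sqrt((2/n) * (d*ln(34en/d)*log2(578n) + ln(4/delta))),
    with d bounded by the fat-shattering dimension (diam*L)^ddim + 1."""
    d = (diam * L) ** ddim + 1
    slack = (2.0 / n) * (d * math.log(34 * math.e * n / d) * math.log2(578 * n)
                         + math.log(4.0 / delta))
    return k / n + math.sqrt(slack)

# Illustration only: n = 1,000,000 samples, 500 sample errors,
# normalized diameter 1, Lipschitz constant 4, doubling dimension 3.
print(generalization_bound(n=1_000_000, k=500, diam=1.0, L=4.0, ddim=3, delta=0.05))
```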

22 Computational contribution
Evaluation of h at new points of X:
- Lipschitz extension function f(x) = min_i [y_i + 2·d(x, x_i)/d(S+, S-)].
- This requires exact nearest neighbor search, which can be expensive!
New tool: (1+ε)-approximate nearest neighbor search
- Query time 2^O(ddim)·log n + ε^-O(ddim) [KL '04, HM '06, BKL '06, CG '06].
If we evaluate f(x) using approximate NNS, we can show that the result agrees with the sign of at least one of:
- g(x) = (1+ε)·f(x) + ε
- e(x) = (1+ε)·f(x) - ε
Note that g(x) ≥ f(x) ≥ e(x). Both g(x) and e(x) have Lipschitz constant (1+ε)L, so they, and hence the approximately evaluated function, generalize well.

23 Final problem: bias-variance tradeoff
Which sample points in S should h ignore?
- If f is correct on all but k examples, we have with probability at least 1 - δ:
  P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n)·(d·ln(34en/d)·log_2(578n) + ln(4/δ))]^(1/2),
  where d ≤ (diam·L)^ddim + 1.
- Ignoring more points (larger k) increases the k/n term, but allows a smaller target Lipschitz constant L and hence a smaller complexity term d.

24-26 Structural Risk Minimization
Algorithm:
- Fix a target Lipschitz constant L (O(n^2) possibilities).
- Locate all pairs of points from S+ and S- whose distance is less than 2/L; at least one point of each such pair has to be counted as an error.
- Goal: remove as few points as possible.
This is minimum vertex cover:
- NP-complete in general,
- but it admits a 2-approximation in O(E) time.
Minimum vertex cover on a bipartite graph:
- is equivalent to maximum matching (König's theorem),
- and admits an exact solution in O(n^2.376) randomized time [MS '04].
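A sketch of the per-L subroutine, under the same hypothetical interface as the earlier snippets: build the bipartite conflict graph between S+ and S- and compute a maximum matching with a simple augmenting-path routine; by König's theorem its size equals the minimum vertex cover, i.e. the minimum number of sample points that must be declared errors for the target L.

```python
def min_errors_for_lipschitz(points, labels, metric, L):
    """Minimum number of sample points that must be treated as errors if
    the classifier is to have Lipschitz constant at most L: the minimum
    vertex cover of the bipartite conflict graph (edges between oppositely
    labeled points at distance < 2/L), computed as a maximum matching via
    plain augmenting paths (O(V*E) sketch)."""
    pos = [p for p, y in zip(points, labels) if y == +1]
    neg = [q for q, y in zip(points, labels) if y == -1]
    adj = [[j for j, q in enumerate(neg) if metric(p, q) < 2.0 / L] for p in pos]

    match_neg = [-1] * len(neg)  # match_neg[j] = index of pos vertex matched to neg j

    def try_augment(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_neg[j] == -1 or try_augment(match_neg[j], seen):
                match_neg[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(len(pos)))
```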

27 Efficient SRM
Algorithm:
- For each of the O(n^2) candidate values of L: run the matching algorithm to find the minimum error, and evaluate the generalization bound for this value of L.
- Total: O(n^4.376) randomized time.
Better algorithm:
- Binary search over the O(n^2) candidate values of L.
- For each value: run the greedy 2-approximation to approximate the minimum error in O(n^2 log n) time, and evaluate the approximate generalization bound for this value of L.
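The overall model-selection loop might then look like the sketch below, which scans all candidate values of L with the greedy 2-approximate vertex cover (a plain loop rather than the binary search described above) and reuses the hypothetical generalization_bound() helper from the statistical sketch; this is illustrative, not the authors' implementation.

```python
import math

def greedy_vertex_cover_size(edges):
    """2-approximate minimum vertex cover: take both endpoints of every
    edge whose endpoints are still uncovered (i.e. a maximal matching)."""
    covered, size = set(), 0
    for u, v in edges:
        if u not in covered and v not in covered:
            covered.update([u, v])
            size += 2
    return size

def srm_select_lipschitz(points, labels, metric, diam, ddim, delta):
    """For each candidate L = 2/d(x_i, x_j) over oppositely labeled pairs,
    approximate the minimum error count k via greedy vertex cover on the
    conflict graph, evaluate the generalization bound, and keep the best
    (bound, L, k) triple."""
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if labels[i] != labels[j]]
    candidates = sorted({2.0 / metric(points[i], points[j]) for i, j in pairs})
    best = (math.inf, None, None)
    for L in candidates:
        conflicts = [(i, j) for i, j in pairs
                     if metric(points[i], points[j]) < 2.0 / L]
        k = greedy_vertex_cover_size(conflicts)
        try:
            bound = generalization_bound(n, k, diam, L, ddim, delta)
        except ValueError:
            continue  # bound vacuous for this L (complexity term too large)
        if bound < best[0]:
            best = (bound, L, k)
    return best  # (bound value, chosen L, approximate error count)
```

Presumably the binary search on the slide exploits the fact that the conflict graph, and hence the minimum error count, can only shrink as L grows.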

28 Conclusion
Results:
- Generalization bounds for Lipschitz classifiers in doubling spaces.
- Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS.
- Efficient Structural Risk Minimization.
Continuing research: continuous labels.
- Risk bound via the doubling dimension.
- Classifier h determined via an LP.
- Faster LP: low-hop, low-stretch spanners [GR '08a, GR '08b] yield fewer constraints, with each variable appearing in a bounded number of constraints.

29 Application: earthmover distance
(Figure: two point sets, S and T.)

