1 Efficient classification for metric data
Lee-Ad Gottlieb (Hebrew U.), Aryeh Kontorovich (Ben-Gurion U.), Robert Krauthgamer (Weizmann Institute)

2-4 Classification problem
A fundamental problem in learning:
- Point space X; probability distribution P on X × {-1, 1}.
- The learner observes a sample S of n points (x, y) drawn i.i.d. from P.
- It wants to predict the labels of other points in X, so it produces a hypothesis h: X → {-1, 1} with empirical error err_S(h) = (1/n)·|{(x, y) in S : h(x) ≠ y}| and true error err(h) = P{(x, y) : h(x) ≠ y}.
- Goal: bound the true error in terms of the empirical error, uniformly over h, with high probability.

5 Generalization bounds
How do we upper bound the true error? Use a generalization bound.
- Roughly speaking (and with high probability): true error ≤ empirical error + (complexity of h)/n.
- A more complex classifier is "easier" to fit to arbitrary data.
- VC-dimension: the size of the largest point set that can be shattered by the hypothesis class.

6 Popular approach for classification
Assume the points are in Euclidean space!
Pros:
- Existence of an inner product
- Efficient algorithms (SVM)
- Good generalization bounds (max margin)
Cons:
- Many natural settings are non-Euclidean
- Euclidean structure is a strong assumption
Recent popular focus: metric space data.

7 Metric space
(X, d) is a metric space if:
- X is a set of points
- d(·,·) is a distance function: nonnegative, symmetric, and satisfying the triangle inequality.
An inner product induces a norm, and a norm induces a metric, but the converse implications do not hold.
(Figure: Haifa, Tel Aviv and Be'er Sheva with pairwise distances of 95 km, 113 km and 208 km, illustrating the triangle inequality.)

8 Classification for metric data?
Advantage: often much more natural
- A much weaker assumption
- Strings
- Images (earthmover distance)
Problem: no vector representation
- No notion of dot product (and no kernel). What to do?
- Invent a kernel (e.g. embed into Euclidean space)? Possibly high distortion!
- Use some nearest-neighbor heuristic? The NN classifier has infinite VC-dimension!

9 Preliminaries: Lipschitz constant
The Lipschitz constant of a function f: X → R measures its smoothness:
- It is the smallest value L satisfying |f(x_i) - f(x_j)| ≤ L·d(x_i, x_j) for all points x_i, x_j in X.
- It is denoted ||f||_Lip.
Suppose a hypothesis h: S → {-1, 1} is consistent with the sample S:
- The Lipschitz constant of h is determined by the closest pair of differently labeled points,
- or equivalently, ||h||_Lip ≥ 2/d(S+, S-).
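To make this concrete, here is a minimal brute-force sketch (Python; the function name, the toy data, and the plug-in metric are all hypothetical, not from the paper) that computes d(S+, S-) for a ±1-labeled sample and the implied smallest possible Lipschitz constant 2/d(S+, S-) of any consistent classifier.

```python
import numpy as np

def lipschitz_constant_of_labels(points, labels, metric):
    """Smallest Lipschitz constant of any function consistent with the
    +/-1 labels: 2 / d(S+, S-), where d(S+, S-) is the distance between
    the two label classes. Brute-force O(n^2) sketch; `metric` is any
    user-supplied distance function."""
    pos = [p for p, y in zip(points, labels) if y == +1]
    neg = [p for p, y in zip(points, labels) if y == -1]
    d_plus_minus = min(metric(p, q) for p in pos for q in neg)
    return 2.0 / d_plus_minus, d_plus_minus

# Toy usage with Euclidean distance (illustrative data only).
euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (1.2, 0.9)]
y = [+1, +1, -1, -1]
L, gap = lipschitz_constant_of_labels(X, y, euclid)
```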

10 Preliminaries: Lipschitz extension
Lipschitz extension is a classic problem in analysis:
- Given a function f: S → R for S ⊆ X, extend f to all of X without increasing the Lipschitz constant.
- Example: points on the real line with f(1) = 1 and f(-1) = -1.
(Figure credit: A. Oberman)

11 Classification for metric data
A powerful framework for metric classification was introduced by von Luxburg & Bousquet (vLB, JMLR '04):
- Construction of h on S: the natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions.
- Estimation of h on X: evaluating h at new points of X reduces to finding a Lipschitz function consistent with h, i.e. the Lipschitz extension problem. For example, f(x) = min_i [f(x_i) + 2·d(x, x_i)/d(S+, S-)] over all (x_i, y_i) in S.
- Evaluation of h therefore reduces to exact nearest neighbor search, a strong theoretical motivation for the NNS classification heuristic.
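Continuing the toy sketch above (again with hypothetical names and a user-supplied metric), the extension formula can be evaluated by a direct linear scan; in the vLB framework this scan is exactly what exact (and later approximate) nearest neighbor search replaces.

```python
def lipschitz_extension_classifier(sample, labels, metric):
    """Brute-force sketch of the extension formula quoted above:
    f(x) = min_i [y_i + L*d(x, x_i)] with L = 2/d(S+, S-).
    The resulting f is consistent with the sample and L-Lipschitz."""
    pos = [p for p, y in zip(sample, labels) if y == +1]
    neg = [p for p, y in zip(sample, labels) if y == -1]
    L = 2.0 / min(metric(p, q) for p in pos for q in neg)

    def f(x):
        # O(n) scan; this is the step that nearest neighbor search speeds up.
        return min(y + L * metric(x, xi) for xi, y in zip(sample, labels))

    def predict(x):
        return 1 if f(x) >= 0 else -1

    return f, predict

# e.g. f, predict = lipschitz_extension_classifier(X, y, euclid)  # toy data above
```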

12 More on von Luxburg & Bousquet
(*modulo some cheating)

13 von Luxburg & Bousquet, cont'd

14 Two new directions
The framework of [vLB '04] leaves open two further questions:
- Constructing h, handling noise: a bias-variance tradeoff. Which sample points in S should h ignore?
- Evaluating h on X: in an arbitrary metric space, exact NNS requires Θ(n) time. Can we do better?

15 Doubling dimension
Definition: the ball B(x, r) is the set of all points within distance r of x. The doubling constant λ (of a metric M) is the minimum value such that every ball can be covered by λ balls of half the radius.
- First used by [Assouad '83], algorithmically by [Clarkson '97].
- The doubling dimension is ddim(M) = log_2 λ(M).
- A metric is doubling if its doubling dimension is constant.
- Euclidean: ddim(R^d) = O(d).
Packing property of doubling spaces:
- A set with diameter diam and minimum inter-point distance a contains at most (diam/a)^O(ddim) points.
(Figure: a ball covered by half-radius balls; here λ ≥ 7.)
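For intuition only, the following brute-force sketch (hypothetical interface, not from the paper) estimates the doubling constant of a finite point set by greedily covering every ball B(x, r) with half-radius balls centered at data points; greedy covering only upper-bounds the optimal cover, so this is an estimate rather than the exact constant.

```python
import numpy as np
from itertools import combinations

def doubling_constant_estimate(points, metric):
    """Greedy upper estimate of the doubling constant of a finite set:
    for every data point x and every pairwise distance r, greedily cover
    B(x, r) with balls of radius r/2 centered at data points, and report
    the largest cover size encountered."""
    n = len(points)
    D = np.array([[metric(p, q) for q in points] for p in points])
    radii = sorted({D[i, j] for i, j in combinations(range(n), 2) if D[i, j] > 0})
    best = 1
    for x in range(n):
        for r in radii:
            uncovered = {i for i in range(n) if D[x, i] <= r}
            covers = 0
            while uncovered:
                # greedily pick the data point whose r/2-ball covers the most
                c = max(range(n), key=lambda j: sum(D[j, i] <= r / 2 for i in uncovered))
                uncovered -= {i for i in uncovered if D[c, i] <= r / 2}
                covers += 1
            best = max(best, covers)
    return best  # ddim estimate: log2(best)

# Toy usage with Euclidean distance:
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
euclid = lambda a, b: float(np.hypot(a[0] - b[0], a[1] - b[1]))
ddim_est = np.log2(doubling_constant_estimate(pts, euclid))
```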

16 Applications of doubling dimension
Major application to databases:
- Recall that exact NNS requires Θ(n) time in an arbitrary metric space.
- There exists a linear-size structure that supports approximate nearest neighbor search in 2^O(ddim)·log n time.
Database/network structures and tasks analyzed via the doubling dimension:
- Nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]
- Image recognition (vision) [KG --]
- Spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]
- Distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]
- Clustering [Tal '04, ABS '08, FM '10]
- Routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08]
Further applications:
- Travelling Salesperson [Tal '04]
- Embeddings [Ass '84, ABN '08, BRS '07, GK '11]
- Machine learning [BLL '09, KKL '10, KKL --]
Note: the above algorithms can be extended to nearly-doubling spaces [GK '10].
Message: this is an active line of research.

17 Fat-shattering dimension
[Alon, Ben-David, Cesa-Bianchi, Haussler '97]: an analogue of the VC-dimension for real-valued function classes. Roughly, a set of points is γ-shattered by a class F if, for some choice of thresholds (one per point), every ±1 labeling of the points can be realized by some f in F with margin at least γ around its threshold; the γ-fat-shattering dimension is the size of the largest such set.

18 Fat-shattering and generalization

19 Our dual use of doubling dimension
Interestingly, the doubling dimension contributes in two different areas:
- Statistical (function complexity): we bound the complexity of the hypothesis in terms of the doubling dimension of X and the Lipschitz constant of the classifier h.
- Computational: efficient approximate NNS.

20 Statistical contribution
We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension.
- vLB provided similar bounds using covering numbers and Rademacher averages.
Fat-shattering analysis:
- If L-Lipschitz functions shatter a set, then its inter-point distance is at least 2/L.
- By the packing property, such a set has at most (diam·L)^O(ddim) points.
- This is the fat-shattering dimension of the classifier class on the space, and it is a good measure of its complexity.

21 Statistical contribution
[BST '99]:
- For any f that classifies a sample of size n correctly, we have with probability at least 1 - δ:
  P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n)·(d·log(34en/d)·log_2(578n) + log(4/δ)).
- Likewise, if f is correct on all but k examples, we have with probability at least 1 - δ:
  P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n)·(d·ln(34en/d)·log_2(578n) + ln(4/δ))]^(1/2).
- In both cases, d is bounded by the fat-shattering dimension, d ≤ (diam·L)^ddim + 1.
Done with the statistical contribution; on to the computational contribution.
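As a purely numeric illustration of the second bound (the version allowing k sample errors), the sketch below plugs the formula into Python; the sample size, diameter, Lipschitz constant, doubling dimension, and confidence parameter are made-up values, and the helper name is mine.

```python
import math

def generalization_bound(n, k, diam, L, ddim, delta):
    """Numeric evaluation of the [BST '99]-style bound quoted above:
    err <= k/n + sqrt((2/n) * (d*ln(34en/d)*log2(578n) + ln(4/delta))),
    with d bounded by the fat-shattering dimension (diam*L)^ddim + 1."""
    d = (diam * L) ** ddim + 1
    slack = (2.0 / n) * (d * math.log(34 * math.e * n / d) * math.log2(578 * n)
                         + math.log(4.0 / delta))
    return k / n + math.sqrt(slack)

# Illustration only: n = 1,000,000 samples, 500 sample errors,
# normalized diameter 1, Lipschitz constant 4, doubling dimension 3.
print(generalization_bound(n=1_000_000, k=500, diam=1.0, L=4.0, ddim=3, delta=0.05))
```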

22 Computational contribution
Evaluation of h at new points of X:
- Lipschitz extension function f(x) = min_i [y_i + 2·d(x, x_i)/d(S+, S-)].
- This requires exact nearest neighbor search, which can be expensive!
New tool: (1+ε)-approximate nearest neighbor search
- Query time 2^O(ddim)·log n + ε^-O(ddim) [KL '04, HM '06, BKL '06, CG '06].
If we evaluate f(x) using approximate NNS, we can show that the result agrees with the sign of at least one of:
- g(x) = (1+ε)·f(x) + ε
- e(x) = (1+ε)·f(x) - ε
Note that g(x) ≥ f(x) ≥ e(x). Both g(x) and e(x) have Lipschitz constant (1+ε)L, so they, and hence the approximately evaluated function, generalize well.

23 Final problem: bias-variance tradeoff
Which sample points in S should h ignore?
- If f is correct on all but k examples, we have with probability at least 1 - δ:
  P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n)·(d·ln(34en/d)·log_2(578n) + ln(4/δ))]^(1/2),
  where d ≤ (diam·L)^ddim + 1.
- Ignoring more points (larger k) increases the k/n term, but allows a smaller target Lipschitz constant L and hence a smaller complexity term d.

24-26 Structural Risk Minimization
Algorithm:
- Fix a target Lipschitz constant L (O(n^2) possibilities).
- Locate all pairs of points from S+ and S- whose distance is less than 2/L; at least one point of each such pair has to be counted as an error.
- Goal: remove as few points as possible.
This is minimum vertex cover:
- NP-complete in general,
- but it admits a 2-approximation in O(E) time.
Minimum vertex cover on a bipartite graph:
- is equivalent to maximum matching (König's theorem),
- and admits an exact solution in O(n^2.376) randomized time [MS '04].
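A sketch of the per-L subroutine, under the same hypothetical interface as the earlier snippets: build the bipartite conflict graph between S+ and S- and compute a maximum matching with a simple augmenting-path routine; by König's theorem its size equals the minimum vertex cover, i.e. the minimum number of sample points that must be declared errors for the target L.

```python
def min_errors_for_lipschitz(points, labels, metric, L):
    """Minimum number of sample points that must be treated as errors if
    the classifier is to have Lipschitz constant at most L: the minimum
    vertex cover of the bipartite conflict graph (edges between oppositely
    labeled points at distance < 2/L), computed as a maximum matching via
    plain augmenting paths (O(V*E) sketch)."""
    pos = [p for p, y in zip(points, labels) if y == +1]
    neg = [q for q, y in zip(points, labels) if y == -1]
    adj = [[j for j, q in enumerate(neg) if metric(p, q) < 2.0 / L] for p in pos]

    match_neg = [-1] * len(neg)  # match_neg[j] = index of pos vertex matched to neg j

    def try_augment(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_neg[j] == -1 or try_augment(match_neg[j], seen):
                match_neg[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(len(pos)))
```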

27 Efficient SRM
Algorithm:
- For each of the O(n^2) candidate values of L: run the matching algorithm to find the minimum error, and evaluate the generalization bound for this value of L.
- Total: O(n^4.376) randomized time.
Better algorithm:
- Binary search over the O(n^2) candidate values of L.
- For each value: run the greedy 2-approximation to approximate the minimum error in O(n^2 log n) time, and evaluate the approximate generalization bound for this value of L.
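The overall model-selection loop might then look like the sketch below, which scans all candidate values of L with the greedy 2-approximate vertex cover (a plain loop rather than the binary search described above) and reuses the hypothetical generalization_bound() helper from the statistical sketch; this is illustrative, not the authors' implementation.

```python
import math

def greedy_vertex_cover_size(edges):
    """2-approximate minimum vertex cover: take both endpoints of every
    edge whose endpoints are still uncovered (i.e. a maximal matching)."""
    covered, size = set(), 0
    for u, v in edges:
        if u not in covered and v not in covered:
            covered.update([u, v])
            size += 2
    return size

def srm_select_lipschitz(points, labels, metric, diam, ddim, delta):
    """For each candidate L = 2/d(x_i, x_j) over oppositely labeled pairs,
    approximate the minimum error count k via greedy vertex cover on the
    conflict graph, evaluate the generalization bound, and keep the best
    (bound, L, k) triple."""
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if labels[i] != labels[j]]
    candidates = sorted({2.0 / metric(points[i], points[j]) for i, j in pairs})
    best = (math.inf, None, None)
    for L in candidates:
        conflicts = [(i, j) for i, j in pairs
                     if metric(points[i], points[j]) < 2.0 / L]
        k = greedy_vertex_cover_size(conflicts)
        try:
            bound = generalization_bound(n, k, diam, L, ddim, delta)
        except ValueError:
            continue  # bound vacuous for this L (complexity term too large)
        if bound < best[0]:
            best = (bound, L, k)
    return best  # (bound value, chosen L, approximate error count)
```

Presumably the binary search on the slide exploits the fact that the conflict graph, and hence the minimum error count, can only shrink as L grows.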

28 Conclusion
Results:
- Generalization bounds for Lipschitz classifiers in doubling spaces.
- Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS.
- Efficient Structural Risk Minimization.
Continuing research: continuous labels.
- Risk bound via the doubling dimension.
- Classifier h determined via an LP.
- Faster LP: low-hop, low-stretch spanners [GR '08a, GR '08b] yield fewer constraints, with each variable appearing in a bounded number of constraints.

29 Application: earthmover distance
(Figure: two point sets, S and T.)

