1
Efficient Regression in Metric Spaces via Approximate Lipschitz Extension. Lee-Ad Gottlieb (Ariel University), Aryeh Kontorovich (Ben-Gurion University), Robert Krauthgamer (Weizmann Institute).
2
Regression. A fundamental problem in Machine Learning. Given: a metric space (X, d); a probability distribution P on X × [-1,1]; a sample S of n points (X_i, Y_i) drawn i.i.d. from P.
3
Regression. A fundamental problem in Machine Learning. Given: a metric space (X, d); a probability distribution P on X × [-1,1]; a sample S of n points (X_i, Y_i) drawn i.i.d. from P. Produce: a hypothesis h: X → [-1,1], judged by its empirical risk R_n(h) and its expected risk R(h), for q ∈ {1,2} (definitions below). Goal: R_n(h) → R(h) uniformly over h, in probability, while keeping R_n(h) small; moreover, h should be evaluated efficiently on new points.
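For reference, the standard definitions of these two risks (the usual q ∈ {1,2} convention, consistent with the setup above):

```latex
% Empirical risk on the sample and expected risk under P, for q in {1,2}.
\[
  R_n(h) \;=\; \frac{1}{n}\sum_{i=1}^{n}\,\lvert h(X_i) - Y_i\rvert^{q},
  \qquad
  R(h) \;=\; \mathbb{E}_{(X,Y)\sim P}\,\lvert h(X) - Y\rvert^{q}.
\]
% Generalization goal: uniform convergence over the hypothesis class H,
% i.e. sup_{h in H} ( R(h) - R_n(h) ) -> 0 in probability.
```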
4
A popular solution. For Euclidean space: kernel regression (Nadaraya-Watson). For a vector v, let K_n(v) = e^{-(||v||/σ_n)²}, where σ_n is a bandwidth parameter. The hypothesis is evaluated on a new point x as a kernel-weighted average of the sample labels (sketched below).
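A minimal sketch of the Nadaraya-Watson estimator just described; the Gaussian bandwidth sigma here is an illustrative free parameter, not a value prescribed by the talk:

```python
import numpy as np

def nadaraya_watson(X_train, Y_train, x_new, sigma=1.0):
    """Kernel regression estimate at x_new.

    X_train: (n, d) array of sample points, Y_train: (n,) labels in [-1, 1].
    sigma is an illustrative bandwidth; in practice it is tuned.
    """
    # Gaussian kernel weights K(v) = exp(-(||v|| / sigma)^2)
    dists = np.linalg.norm(X_train - x_new, axis=1)
    weights = np.exp(-(dists / sigma) ** 2)
    # Weighted average of the labels; note the cost is linear in the sample size.
    return np.dot(weights, Y_train) / np.sum(weights)

# Example usage
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
Y = np.clip(X[:, 0] + 0.1 * rng.normal(size=100), -1, 1)
print(nadaraya_watson(X, Y, np.array([0.3, -0.2])))
```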
5
Kernel regression. Pros: achieves the minimax rate (for Euclidean space with Gaussian noise); other such algorithms: SVR, spline regression. Cons: evaluation on a new point is linear in the sample size; assumes a Euclidean space. What about metric spaces?
6
Metric space. (X, d) is a metric space if X is a set of points and d is a distance function that is nonnegative: d(x,y) ≥ 0; symmetric: d(x,y) = d(y,x); and satisfies the triangle inequality: d(x,y) ≤ d(x,z) + d(z,y). Inner product ⇒ norm; norm ⇒ metric via d(x,y) := ||x - y||. The other direction does not hold.
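To make the axioms concrete, here is a small brute-force checker for a candidate distance function on a finite point set (an illustrative helper, not part of the talk):

```python
from itertools import combinations, permutations

def is_metric(points, d, tol=1e-9):
    """Brute-force check of the metric axioms on a finite set of points."""
    for x, y in combinations(points, 2):
        if d(x, y) < -tol:                         # nonnegativity
            return False
        if abs(d(x, y) - d(y, x)) > tol:           # symmetry
            return False
    for x, y, z in permutations(points, 3):
        if d(x, y) > d(x, z) + d(z, y) + tol:      # triangle inequality
            return False
    return True
```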
7
Regression for metric data? Advantage: often much more natural, and a much weaker assumption. Examples: strings under edit distance (DNA sequences such as AACGTA and AGTT), images under earthmover distance. Problem: no vector representation, hence no notion of dot product (and no kernel). Invent a kernel? Possible, but with √(log n) distortion.
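For concreteness, the textbook edit-distance (Levenshtein) dynamic program, applied to the two example DNA strings above; this is a standard implementation, not code from the paper:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions turning string a into string b. It is a metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete from a
                            curr[j - 1] + 1,     # insert into a
                            prev[j - 1] + cost)) # substitute
        prev = curr
    return prev[-1]

print(edit_distance("AACGTA", "AGTT"))  # distance between the slide's example strings
```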
8
Metric regression. Goal: give a class of hypotheses which generalize well, i.e. perform well on new points. Generalization: want h with R_n(h) (empirical error) close to R(h) (expected error). What types of hypotheses generalize well? This is measured by complexity: VC and fat-shattering dimensions.
9
VC dimension. Generalization: want R_n(h) (empirical error) close to R(h) (expected error). How do we upper bound the expected error? Use a generalization bound. Roughly speaking (and with high probability): expected error ≤ empirical error + (complexity of h)/n. A more complex classifier is "easier" to fit to arbitrary {-1,+1} data. Example 1: the VC dimension, a complexity measure of the hypothesis class: the size of the largest point set that can be shattered by the class.
10
Fat-shattering dimension. Generalization: want R_n(h) (empirical error) close to R(h) (expected error). How do we upper bound the expected error? Use a generalization bound. Roughly speaking (and with high probability): expected error ≤ empirical error + (complexity of h)/n. A more complex classifier is "easier" to fit to arbitrary {-1,+1} data. Example 2: the fat-shattering dimension of the hypothesis class: the size of the largest point set that can be shattered with some prescribed margin by the hypotheses.
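Schematically, a fat-shattering-based uniform bound has the following shape (C₁, c₂ and the complexity measure d are placeholders; the precise statement appears later in the talk):

```latex
% Schematic uniform-convergence bound in terms of a complexity measure d
% (e.g. the fat-shattering dimension at the relevant scale).
\[
  \Pr\Bigl[\, R(h) > R_n(h) + \varepsilon \ \text{for some } h \in \mathcal{H} \,\Bigr]
  \;\le\; C_1 \, n^{\,O(d)} \, e^{-c_2\, \varepsilon^{2} n},
\]
% which, inverted, gives: with high probability,
% expected error <= empirical error + (complexity-dependent term that shrinks with n).
```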
11
Generalization. Conclusion: simple hypotheses generalize well, in particular those with low fat-shattering dimension. Can we find a hypothesis class for metric spaces with low fat-shattering dimension? Preliminaries: Lipschitz constant and extension; doubling dimension; efficient classification for metric data.
12
Preliminaries: Lipschitz constant. The Lipschitz constant of a function f: X → ℝ is the smallest value L satisfying |f(x_i) - f(x_j)| ≤ L · d(x_i, x_j) for all x_i, x_j in X. It is denoted ‖f‖_Lip (small constant ↔ smooth function). In particular, if an L-Lipschitz function assigns values +1 and -1 to two points, they must be at distance ≥ 2/L.
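Restricted to a finite labeled sample, the smallest Lipschitz constant consistent with the labels can be read off the pairwise ratios; a short illustrative sketch (the talk's algorithm does not actually enumerate all pairs):

```python
from itertools import combinations

def sample_lipschitz_constant(points, labels, d):
    """Smallest Lipschitz constant of any function that exactly interpolates
    the given labels on the sample (illustrative sketch).
    points: sample points, labels: their real values, d: the metric d(x, y)."""
    return max(
        abs(yi - yj) / d(xi, xj)
        for (xi, yi), (xj, yj) in combinations(zip(points, labels), 2)
        if d(xi, xj) > 0
    )
```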
13
Preliminaries: Lipschitz extension. Given a function f: S → ℝ for S ⊂ X with Lipschitz constant L, extend f to all of X without increasing the Lipschitz constant. This is a classic problem in analysis; one possible solution is sketched below. Example: points on the real line with f(1) = 1 and f(-1) = -1. (picture credit: A. Oberman)
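One classical construction (McShane-Whitney) achieves this; it may or may not be the exact formula the original slide showed:

```latex
% McShane-Whitney extension: each formula below is L-Lipschitz on all of X
% and agrees with f on S; their average (or a truncation to the range of f)
% is also a valid extension.
\[
  \tilde{f}(x) \;=\; \min_{s \in S}\bigl( f(s) + L\, d(x, s) \bigr),
  \qquad
  \underline{f}(x) \;=\; \max_{s \in S}\bigl( f(s) - L\, d(x, s) \bigr).
\]
```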
14
Doubling dimension. Definition: the ball B(x,r) is the set of all points within distance r > 0 from x. The doubling constant λ (of X) is the minimum value such that every ball can be covered by λ balls of half the radius. First used by [Ass-83], algorithmically by [Cla-97]. The doubling dimension is ddim(X) = log₂ λ(X) [GKL-03]. Euclidean: ddim(ℝⁿ) = O(n). Packing property of doubling spaces: a set with diameter D > 0 and minimum inter-point distance a > 0 contains at most (D/a)^{O(ddim)} points. (In the slide's illustration, the doubling constant is at least 7.)
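As a rough illustration (not an algorithm from the talk), one can probe the doubling constant of a finite sample by greedily covering each sample ball with half-radius balls:

```python
def doubling_constant_estimate(points, d, radii):
    """Rough empirical estimate of the doubling constant of a finite metric
    (points, d): for every center x and radius r, greedily cover B(x, r)
    with balls of radius r/2 centered at sample points.
    Illustrative only; restricting centers to sample points and using a
    greedy net can inflate the true constant by a bounded factor."""
    best = 1
    for x in points:
        for r in radii:
            ball = [p for p in points if d(x, p) <= r]
            centers = []
            for p in ball:                          # greedy r/2-net of the ball
                if all(d(p, c) > r / 2 for c in centers):
                    centers.append(p)
            # every point of the ball is now within r/2 of some center
            best = max(best, len(centers))
    return best
```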
15
Applications of doubling dimension. Major application: approximate nearest neighbor search in time 2^{O(ddim)} log n.
Database/network structures and tasks analyzed via the doubling dimension: nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]; spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]; distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]; clustering [Tal '04, ABS '08, FM '10]; routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08].
Further applications: Travelling Salesperson [Tal '04, BGK '12]; embeddings [Ass '84, ABN '08, BRS '07, GK '11]; machine learning [BLL '09, GKK '10, '13a, '13b].
Message: this is an active line of research. Note: the above algorithms can be extended to nearly-doubling spaces [GK '10].
16
Generalization bounds. We provide generalization bounds for Lipschitz (smooth) functions on spaces with low doubling dimension. [vLB '04] provided similar bounds using covering numbers and Rademacher averages. Fat-shattering analysis: L-Lipschitz functions shatter a set → the inter-point distance is at least 2/L; the packing property → the set has at most (diam · L)^{O(ddim)} points. Done! This is the fat-shattering dimension of the smooth classifier on doubling spaces.
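Spelling out the two-step argument slightly (the shattering margin is left implicit, as on the slide, and the constants are schematic):

```latex
% If a set T is shattered by L-Lipschitz functions taking values in [-1,1],
% then two points of T receiving opposite labels satisfy
%   2 \le |f(x) - f(y)| \le L\, d(x, y)  \implies  d(x, y) \ge 2/L .
% The packing property of doubling spaces then bounds |T|:
\[
  |T| \;\le\; \left( \frac{\operatorname{diam}(X)}{2/L} \right)^{O(\mathrm{ddim}(X))}
       \;=\; \bigl( \operatorname{diam}(X) \cdot L \bigr)^{O(\mathrm{ddim}(X))},
\]
% which is the claimed fat-shattering dimension bound for Lipschitz
% hypotheses on a doubling space.
```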
17
Generalization bounds. Plugging the fat-shattering dimension into known bounds, we derive the key result.
Theorem: Fix ε > 0 and q ∈ {1,2}, and let h be an L-Lipschitz hypothesis. Then
P[R(h) > R_n(h) + ε] ≤ 24n (288n/ε²)^{d log(24en/ε)} e^{-ε²n/36},
where d ≈ (1 + 1/(ε/24)^{(q+1)/2}) · (L/(ε/24)^{(q+1)/2})^{ddim}.
Upshot: the smooth classifier is provably good for doubling spaces.
18
Generalization bounds. Alternate formulation: with probability at least 1 - δ,
R(h) ≤ R_n(h) + λ(n, L, δ),
where λ(n, L, δ) is the complexity term derived from the bound above. Trade-off: the bias term R_n(h) is decreasing in L, while the variance term λ(n, L, δ) is increasing in L. Goal: find the L which minimizes the right-hand side.
19
Generalization bounds. The previous discussion motivates the following hypothesis on the sample: a linear (q=1) or quadratic (q=2) program computes the minimizer f* of R_n(h) over L-Lipschitz functions. Optimize L for the best bias-variance tradeoff: binary search gives logarithmically many "guesses" for L. For new points, we want f* to stay smooth: apply Lipschitz extension.
20
Generalization bounds. To calculate the hypothesis, we can solve a convex (or linear) program over the sample values (a sketch of such a program follows below). Final problem: how to solve this program quickly.
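A minimal sketch of the kind of program meant here, under my own illustrative formulation for q = 1 (absolute-loss objective, Lipschitz constraints over all pairs), using scipy's LP solver rather than the solver used in the paper:

```python
import numpy as np
from scipy.optimize import linprog

def lipschitz_regression_lp(D, y, L):
    """Fit values f_i at the sample points minimizing (1/n) * sum |f_i - y_i|
    subject to |f_i - f_j| <= L * D[i, j]  (the q = 1 case).
    D: (n, n) pairwise distance matrix, y: labels in [-1, 1], L: Lipschitz bound.
    This naive version keeps all O(n^2) pair constraints; the talk replaces
    them with a (1+eps)-spanner to make the program sparse."""
    n = len(y)
    # variables: [f_1..f_n, e_1..e_n] with e_i >= |f_i - y_i|
    c = np.concatenate([np.zeros(n), np.ones(n) / n])
    A, b = [], []
    for i in range(n):
        row = np.zeros(2 * n); row[i] = 1; row[n + i] = -1
        A.append(row); b.append(y[i])            #  f_i - e_i <= y_i
        row = np.zeros(2 * n); row[i] = -1; row[n + i] = -1
        A.append(row); b.append(-y[i])           # -f_i - e_i <= -y_i
    for i in range(n):
        for j in range(i + 1, n):
            row = np.zeros(2 * n); row[i] = 1; row[j] = -1
            A.append(row); b.append(L * D[i, j])   # f_i - f_j <= L d(i,j)
            A.append(-row); b.append(L * D[i, j])  # f_j - f_i <= L d(i,j)
    bounds = [(-1, 1)] * n + [(0, None)] * n
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=bounds)
    return res.x[:n]   # fitted values f*(x_i) on the sample
```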
21
Generalization bounds. To calculate the hypothesis, we can solve a convex (or linear) program. Problem: O(n²) constraints! An exact solution is costly. Solution: a (1+ε)-stretch spanner. Replace the full constraint graph by a sparse graph of degree ε^{-O(ddim)}; the solution f* is perturbed by only a small additive error (controlled by ε). Size: the number of constraints is reduced to ε^{-O(ddim)} n. Sparsity: each variable appears in ε^{-O(ddim)} constraints.
22
Generalization bounds. To calculate the hypothesis, we can solve a convex (or linear) program. Efficient approximate LP solution: Young [FOCS '01] approximately solves LPs with sparse constraints; our total runtime is O(ε^{-O(ddim)} n log³ n). Reduce QP to LP: the solution suffers an additional ε² perturbation, with O(1/ε) new constraints.
23
Thank you! Questions?