
1 Efficient Regression in Metric Spaces via Approximate Lipschitz Extension. Lee-Ad Gottlieb (Ariel University), Aryeh Kontorovich (Ben-Gurion University), Robert Krauthgamer (Weizmann Institute).

2 Regression. A fundamental problem in Machine Learning. Given: a metric space (X,d); a probability distribution P on X × [-1,1]; a sample S of n points (X_i, Y_i) drawn i.i.d. from P.

3 Regression. A fundamental problem in Machine Learning. Given: a metric space (X,d); a probability distribution P on X × [-1,1]; a sample S of n points (X_i, Y_i) drawn i.i.d. from P. Produce a hypothesis h: X → [-1,1], with empirical risk R_n(h) = (1/n) Σ_i |h(X_i) - Y_i|^q and expected risk R(h) = E|h(X) - Y|^q, for q ∈ {1,2}. Goal: R(h) ≈ R_n(h) uniformly over h, in probability; R_n(h) should be small; and h should be efficiently evaluable on new points.
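To make the two risk quantities concrete, here is a minimal computational sketch (the function and variable names are illustrative, not from the talk):

```python
import numpy as np

def empirical_risk(h, X, Y, q=2):
    """R_n(h) = (1/n) * sum_i |h(X_i) - Y_i|^q on the sample (X_i, Y_i)."""
    preds = np.array([h(x) for x in X])
    return np.mean(np.abs(preds - np.asarray(Y)) ** q)

# Toy example: a constant hypothesis evaluated on three labeled points
print(empirical_risk(lambda x: 0.0, X=[0.1, 0.5, 0.9], Y=[-1.0, 0.0, 1.0], q=2))
```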

4 A popular solution. For Euclidean space: kernel regression (Nadaraya-Watson). For a vector v, let K_σ(v) = e^(-(||v||/σ)²). Hypothesis evaluation on a new point x is the kernel-weighted average of the sample labels, h(x) = Σ_i Y_i K_σ(x - X_i) / Σ_i K_σ(x - X_i).
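A minimal sketch of this estimator (assuming a Gaussian kernel with a bandwidth parameter sigma; the names are illustrative). Note that evaluating a new point touches every sample point, which is the linear-time cost discussed on the next slide:

```python
import numpy as np

def nadaraya_watson(X_train, Y_train, x_new, sigma=1.0):
    """Nadaraya-Watson estimate at x_new with kernel K(v) = exp(-(||v||/sigma)^2):
    the prediction is the kernel-weighted average of the training labels."""
    dists = np.linalg.norm(X_train - x_new, axis=1)    # distance to every sample point
    weights = np.exp(-(dists / sigma) ** 2)
    return np.dot(weights, Y_train) / np.sum(weights)  # touches all n points: O(n) per query

# Toy example in the Euclidean plane
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
Y = np.clip(X[:, 0] + 0.1 * rng.normal(size=100), -1, 1)
print(nadaraya_watson(X, Y, np.array([0.2, -0.3]), sigma=0.5))
```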

5 Kernel regression. Pros: achieves the minimax rate (for Euclidean space with Gaussian noise); other algorithms in this family: SVR, spline regression. Cons: evaluation on a new point is linear in the sample size, and it assumes a Euclidean space. What about metric spaces?

6 Metric space. (X,d) is a metric space if X is a set of points and d is a distance function that is nonnegative (d(x,y) ≥ 0), symmetric (d(x,y) = d(y,x)), and satisfies the triangle inequality (d(x,y) ≤ d(x,z) + d(z,y)). Inner product ⇒ norm; norm ⇒ metric via d(x,y) := ||x-y||. The other direction does not hold.

7 Regression for metric data? Advantage: often much more natural, and a much weaker assumption. Strings with edit distance (e.g. DNA sequences such as AACGTA and AGTT); images with earthmover distance. Problem: no vector representation, hence no notion of dot product (and no kernel). Invent a kernel? Possible √(log n) distortion.
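As an illustration that such metrics are computable even without any vector representation, here is a minimal sketch of the edit (Levenshtein) distance on strings; it satisfies the metric axioms of the previous slide:

```python
def edit_distance(s, t):
    """Levenshtein (edit) distance between two strings: a metric on strings."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))                 # distances from the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution (or match)
        prev = curr
    return prev[n]

print(edit_distance("AACGTA", "AGTT"))  # the two example strings from the slide
```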

8 Metric regression. Goal: give a class of hypotheses which generalizes well, i.e., performs well on new points. Generalization: want h whose expected error R(h) is close to its empirical error R_n(h). What types of hypotheses generalize well? Those of low complexity, as measured by the VC and fat-shattering dimensions.

9 VC dimension. Generalization: want h whose expected error R(h) is close to its empirical error R_n(h). How do we upper bound the expected error? Use a generalization bound. Roughly speaking (and with high probability): expected error ≤ empirical error + (complexity of h)/n. A more complex classifier is "easier" to fit to arbitrary {-1,+1} data. Example 1: the VC dimension as the complexity of the hypothesis class. VC dimension: the size of the largest point set that can be shattered by the class.

10 Fat-shattering dimension. Generalization: want h whose expected error R(h) is close to its empirical error R_n(h). How do we upper bound the expected error? Use a generalization bound. Roughly speaking (and with high probability): expected error ≤ empirical error + (complexity of h)/n. A more complex classifier is "easier" to fit to arbitrary {-1,+1} data. Example 2: the fat-shattering dimension of the hypothesis class: the size of the largest point set that can be shattered with margin at least γ by functions in the class.

11 Generalization. Conclusion: simple hypotheses generalize well; in particular, those with low fat-shattering dimension. Can we find a hypothesis class for metric spaces with low fat-shattering dimension? Preliminaries: Lipschitz constant and extension; doubling dimension. Then: efficient regression for metric data.

12 Preliminaries: Lipschitz constant. The Lipschitz constant of a function f: X → ℝ is the smallest value L satisfying |f(x_i) - f(x_j)| ≤ L·d(x_i,x_j) for all x_i, x_j in X. It is denoted ||f||_Lip (small constant ⇒ smooth function). In particular, two points assigned the values +1 and -1 by an L-Lipschitz function must be at distance at least 2/L.
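A small sketch of this definition restricted to a finite sample (computing the best Lipschitz constant of the observed values under an arbitrary metric); the names are illustrative:

```python
def empirical_lipschitz_constant(points, values, metric):
    """Smallest L with |f(x_i) - f(x_j)| <= L * d(x_i, x_j) over all sample pairs."""
    L = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = metric(points[i], points[j])
            if d > 0:
                L = max(L, abs(values[i] - values[j]) / d)
    return L

# Toy example on the real line with the usual metric
print(empirical_lipschitz_constant([-1.0, 0.0, 1.0], [-1.0, 0.5, 1.0],
                                   metric=lambda a, b: abs(a - b)))  # 1.5
```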

13 Preliminaries: Lipschitz extension. Given a function f: S → ℝ for S ⊂ X with Lipschitz constant L, extend f to all of X without increasing the Lipschitz constant. This is a classic problem in analysis, and explicit solutions exist (one possible solution, in the McShane-Whitney style, is sketched below). Example: points on the real line with f(1) = 1 and f(-1) = -1. (Picture credit: A. Oberman.)
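A hedged sketch of that classical construction, assuming the McShane-Whitney midpoint extension is the intended solution: the extension agrees with f on S and remains L-Lipschitz on all of X.

```python
def lipschitz_extension(sample_points, sample_values, L, metric):
    """McShane-Whitney style extension: the midpoint of the smallest L-Lipschitz
    majorant and the largest L-Lipschitz minorant agreeing with f on the sample."""
    def extended(x):
        upper = min(v + L * metric(x, s) for s, v in zip(sample_points, sample_values))
        lower = max(v - L * metric(x, s) for s, v in zip(sample_points, sample_values))
        return 0.5 * (upper + lower)
    return extended

# The slide's example: f(1) = 1, f(-1) = -1 on the real line, L = 1
f = lipschitz_extension([1.0, -1.0], [1.0, -1.0], L=1.0, metric=lambda a, b: abs(a - b))
print(f(0.0), f(2.0))  # interpolates to 0.0 in between; stays at 1.0 beyond x = 1
```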

14 Doubling dimension. Definition: the ball B(x,r) is the set of all points within distance r > 0 from x. The doubling constant λ(X) is the minimum value λ such that every ball in X can be covered by λ balls of half the radius (in the figure on the slide, at least 7 smaller balls are needed). First used by [Ass-83], algorithmically by [Cla-97]. The doubling dimension is ddim(X) = log₂ λ(X) [GKL-03]. Euclidean: ddim(ℝⁿ) = O(n). Packing property of doubling spaces: a set with diameter D > 0 and minimum inter-point distance a > 0 contains at most (D/a)^O(ddim) points.
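The packing property is typically exercised through r-nets. Here is a minimal sketch (a greedy net construction, not taken from the talk) of extracting a subset with pairwise distances greater than r, whose size the packing property bounds by (D/r)^O(ddim):

```python
def greedy_r_net(points, r, metric):
    """Greedily extract a maximal subset whose points are pairwise more than r
    apart; every input point then lies within r of some net point."""
    net = []
    for p in points:
        if all(metric(p, q) > r for q in net):
            net.append(p)
    return net

# Toy example on the real line
print(greedy_r_net([0.0, 0.3, 0.9, 1.0, 2.5], r=0.5,
                   metric=lambda a, b: abs(a - b)))  # [0.0, 0.9, 2.5]
```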

15 Applications of doubling dimension. Major application: approximate nearest neighbor search in time 2^O(ddim) log n. Database/network structures and tasks analyzed via the doubling dimension: nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]; spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]; distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]; clustering [Tal '04, ABS '08, FM '10]; routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08]. Further applications: travelling salesperson [Tal '04, BGK '12]; embeddings [Ass '84, ABN '08, BRS '07, GK '11]; machine learning [BLL '09, GKK '10, '13a, '13b]. Message: this is an active line of research. Note: the above algorithms can be extended to nearly-doubling spaces [GK '10].

16 Generalization bounds. We provide generalization bounds for Lipschitz (smooth) functions on spaces with low doubling dimension; [vLB '04] provided similar bounds using covering numbers and Rademacher averages. Fat-shattering analysis: if L-Lipschitz functions shatter a set, then its inter-point distances are at least 2/L; by the packing property, such a set has at most (diam·L)^O(ddim) points. Done! This is the fat-shattering dimension of the smooth (Lipschitz) classifier on doubling spaces.

17 Generalization bounds. Plugging the fat-shattering dimension into known bounds, we derive the key result. Theorem: Fix ε > 0 and q ∈ {1,2}, and let h be an L-Lipschitz hypothesis. Then P[R(h) > R_n(h) + ε] ≤ 24n (288n/ε²)^(d·log(24en/ε)) e^(-ε²n/36), where d ≈ (1 + 1/(ε/24)^((q+1)/2)) · (L/(ε/24)^((q+1)/2))^ddim. Upshot: the smooth classifier is provably good for doubling spaces.

18 Generalization bounds. Alternate formulation: with probability at least 1-δ, R(h) ≤ R_n(h) + Λ(n,L,δ). Trade-off: the bias term R_n(h) is decreasing in L, while the variance term Λ(n,L,δ) is increasing in L. Goal: find the L which minimizes the right-hand side.

19 Generalization bounds. The previous discussion motivates the following hypothesis on the sample: the L-Lipschitz function f* of minimum empirical risk, which a linear (q=1) or quadratic (q=2) program computes together with R_n(h). Optimize L for the best bias-variance tradeoff: binary search gives log(n/δ) "guesses" for L (see the sketch below). For new points, we want f* to stay smooth: Lipschitz extension.
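A hedged sketch of that model-selection step. The helpers fit_lipschitz(L) (returning R_n of the best L-Lipschitz fit, e.g. via the program on the next slide) and penalty(n, L, delta) (the variance term Λ) are hypothetical placeholders, and a geometric grid of about log(n/δ) guesses stands in for the binary search:

```python
import numpy as np

def select_lipschitz_constant(fit_lipschitz, penalty, n, delta, L_min=1e-3, L_max=1e3):
    """Try geometrically spaced guesses for L and keep the one minimizing
    empirical risk + variance term (the right-hand side of the bound)."""
    num_guesses = max(2, int(np.ceil(np.log2(n / delta))))
    best_L, best_bound = None, np.inf
    for L in np.geomspace(L_min, L_max, num=num_guesses):
        bound = fit_lipschitz(L) + penalty(n, L, delta)
        if bound < best_bound:
            best_L, best_bound = L, bound
    return best_L, best_bound

# Toy usage with placeholder stand-ins for the two terms
L, bound = select_lipschitz_constant(
    fit_lipschitz=lambda L: 1.0 / (1.0 + L),  # stand-in for R_n(h_L), decreasing in L
    penalty=lambda n, L, d: L / np.sqrt(n),   # stand-in for Lambda(n, L, delta), increasing in L
    n=1000, delta=0.05)
print(L, bound)
```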

20 Generalization bounds. To calculate the hypothesis, we can solve a convex (or linear) program: minimize Σ_i |f(X_i) - Y_i|^q over values f(X_i) ∈ [-1,1], subject to the Lipschitz constraints |f(X_i) - f(X_j)| ≤ L·d(X_i,X_j) (a sketch follows). Final problem: how to solve this program quickly.
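The program is presumably of this form; below is a hedged sketch for the q = 1 case using an off-the-shelf LP solver (not the fast sparse solver mentioned two slides later), with the absolute loss linearized via slack variables t_i. All names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def fit_lipschitz_l1(X, Y, L, metric):
    """Minimize (1/n) * sum_i t_i over f in [-1,1]^n and t >= 0, subject to
    |f_i - Y_i| <= t_i and |f_i - f_j| <= L * d(X_i, X_j).
    Variables are (f_1..f_n, t_1..t_n). Note the O(n^2) Lipschitz constraints:
    exactly the bottleneck the spanner of the next slide removes."""
    n = len(X)
    c = np.concatenate([np.zeros(n), np.ones(n) / n])
    rows, rhs = [], []
    for i in range(n):
        r = np.zeros(2 * n); r[i], r[n + i] = 1.0, -1.0    #  f_i - t_i <=  Y_i
        rows.append(r); rhs.append(Y[i])
        r = np.zeros(2 * n); r[i], r[n + i] = -1.0, -1.0   # -f_i - t_i <= -Y_i
        rows.append(r); rhs.append(-Y[i])
    for i in range(n):
        for j in range(i + 1, n):
            bound = L * metric(X[i], X[j])
            r = np.zeros(2 * n); r[i], r[j] = 1.0, -1.0    # f_i - f_j <= L*d
            rows.append(r); rhs.append(bound)
            rows.append(-r); rhs.append(bound)             # f_j - f_i <= L*d
    bounds = [(-1.0, 1.0)] * n + [(0.0, None)] * n
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs), bounds=bounds)
    return res.x[:n], res.fun  # fitted values f* on the sample and its empirical risk

# Toy example on the real line: labels change too fast for a 1-Lipschitz function
f_star, risk = fit_lipschitz_l1([0.0, 0.5, 1.0], [-1.0, 0.0, 1.0], L=1.0,
                                metric=lambda a, b: abs(a - b))
print(f_star, risk)
```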

21 Generalization bounds. To calculate the hypothesis, we can solve a convex (or linear) program. Problem: O(n²) constraints! An exact solution is costly. Solution: a (1+γ)-stretch spanner. Replace the full constraint graph by a sparse graph of degree γ^-O(ddim); the solution f* is perturbed by an additive error γ. Size: the number of constraints is reduced to γ^-O(ddim)·n. Sparsity: each variable appears in γ^-O(ddim) constraints.
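To illustrate the object, here is a hedged sketch of the classic greedy spanner (not the hierarchical, net-based construction one would actually use for doubling spaces): process pairs by increasing distance and keep an edge only when the current graph does not already connect its endpoints within the stretch bound. Enforcing the Lipschitz constraints only on the kept edges then enforces them on every pair up to the stretch factor.

```python
import heapq
import math

def greedy_spanner(points, stretch, metric):
    """Greedy (1 + gamma)-stretch spanner (stretch = 1 + gamma): keep an edge only
    if the current spanner does not already connect its endpoints within stretch * d."""
    n = len(points)
    pairs = sorted((metric(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    adj = [dict() for _ in range(n)]

    def spanner_dist(s, t):
        # Dijkstra over the edges kept so far
        dist = {s: 0.0}
        heap = [(0.0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == t:
                return d
            if d > dist.get(u, math.inf):
                continue
            for v, w in adj[u].items():
                if d + w < dist.get(v, math.inf):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return math.inf

    edges = []
    for d, i, j in pairs:
        if spanner_dist(i, j) > stretch * d:
            adj[i][j] = adj[j][i] = d
            edges.append((i, j, d))
    return edges

# Toy example on the line: the long pair (0, 2) is already covered and gets dropped
print(greedy_spanner([0.0, 1.0, 2.0], stretch=1.5, metric=lambda a, b: abs(a - b)))
```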

22 Generalization bounds. To calculate the hypothesis, we can solve a convex (or linear) program. Efficient approximate LP solution: Young [FOCS '01] approximately solves LPs with sparse constraints; our total runtime is O(γ^-O(ddim) n log³ n). Reduce the QP to an LP: the solution suffers an additional 2γ perturbation, with O(1/γ) new constraints (see the sketch below).
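The QP-to-LP reduction is presumably of this flavor: replace the quadratic loss z² by the maximum of O(1/γ) supporting tangent lines, which costs only a small additive perturbation of the objective. A hedged sketch (the precise constants on the slide may differ):

```python
import numpy as np

def quadratic_tangent_cuts(gamma, lo=-2.0, hi=2.0):
    """Tangent lines (a, b) to z^2 with a*z + b <= z^2 everywhere and
    max_k (a_k*z + b_k) >= z^2 - gamma^2 on [lo, hi]. Replacing each quadratic
    loss term t_i >= (f_i - Y_i)^2 by the O(1/gamma) linear constraints
    t_i >= a_k*(f_i - Y_i) + b_k turns the QP into an LP with a small perturbation."""
    # The tangent at z0 is 2*z0*z - z0^2; it undershoots z^2 by exactly (z - z0)^2,
    # so tangent points spaced 2*gamma apart keep the gap at most gamma^2.
    return [(2.0 * z0, -z0 ** 2) for z0 in np.arange(lo, hi + gamma, 2.0 * gamma)]

cuts = quadratic_tangent_cuts(gamma=0.05)   # O(1/gamma) cuts on [-2, 2]
z = 0.37
print(len(cuts), z ** 2, max(a * z + b for a, b in cuts))  # close approximation of z^2
```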

23 Thank you! Questions?

