Efficient Regression in Metric Spaces via Approximate Lipschitz Extension
Lee-Ad Gottlieb (Ariel University), Aryeh Kontorovich (Ben-Gurion University), Robert Krauthgamer (Weizmann Institute)

Regression
A fundamental problem in machine learning. Given:
- a metric space (X, d)
- a probability distribution P on X × [-1, 1]
- a sample S of n points (X_i, Y_i) drawn i.i.d. from P

Regression
A fundamental problem in machine learning. Given:
- a metric space (X, d)
- a probability distribution P on X × [-1, 1]
- a sample S of n points (X_i, Y_i) drawn i.i.d. from P
Produce a hypothesis h: X → [-1, 1] with
- empirical risk R_n(h) = (1/n) Σ_i |h(X_i) - Y_i|^q
- expected risk R(h) = E_(X,Y)~P |h(X) - Y|^q, for q ∈ {1, 2}
Goal:
- R(h) close to R_n(h), uniformly over h, in probability
- small R_n(h)
- h can be evaluated efficiently on new points

A popular solution
For Euclidean space: kernel regression (Nadaraya-Watson).
- For a vector v, let K_n(v) = e^(-(||v||/σ)^2), where σ is the kernel bandwidth.
- Hypothesis evaluation on a new x: a kernel-weighted average of the sample labels, h(x) = Σ_i Y_i K_n(x - X_i) / Σ_i K_n(x - X_i).
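
A minimal Python sketch of the Nadaraya-Watson estimator described above, assuming the Gaussian kernel K(v) = e^(-(||v||/σ)^2); the bandwidth name sigma and the toy usage are my own illustration, not taken from the talk.

```python
import numpy as np

def nadaraya_watson(x_new, X, Y, sigma=1.0):
    """Kernel-regression prediction at x_new from the sample (X, Y).

    X: (n, d) array of sample points; Y: (n,) array of labels in [-1, 1].
    Uses the Gaussian kernel K(v) = exp(-(||v|| / sigma)^2).
    """
    dists = np.linalg.norm(X - x_new, axis=1)          # ||x_new - X_i||
    weights = np.exp(-(dists / sigma) ** 2)            # kernel weights
    return float(np.dot(weights, Y) / weights.sum())   # weighted average of labels

# Toy usage on a 1-D sample (note: evaluation is linear in the sample size):
# X = np.array([[0.0], [1.0], [2.0]]); Y = np.array([-1.0, 0.0, 1.0])
# nadaraya_watson(np.array([1.5]), X, Y, sigma=0.5)
```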

Kernel regression
Pros:
- achieves the minimax rate (for Euclidean space with Gaussian noise)
- other algorithms: SVR, spline regression
Cons:
- evaluation on a new point is linear in the sample size
- assumes Euclidean space: what about general metric spaces?

Metric space
(X, d) is a metric space if X is a set of points and d is a distance function satisfying:
- nonnegativity: d(x,y) ≥ 0
- symmetry: d(x,y) = d(y,x)
- triangle inequality: d(x,y) ≤ d(x,z) + d(z,y)
Inner product ⇒ norm; norm ⇒ metric via d(x,y) := ||x-y||. The other direction does not hold.

Regression for metric data?
Advantage: often much more natural, a much weaker assumption.
- Strings with edit distance (e.g., DNA sequences such as AACGTA and AGTT)
- Images with earthmover distance
Problem: no vector representation.
- No notion of dot product (and hence no kernel).
- Invent a kernel? Possible √(log n) distortion.
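
As a concrete example of a metric with no vector representation, here is a standard Levenshtein (edit) distance on strings; it is nonnegative, symmetric, and satisfies the triangle inequality, but comes with no dot product. The implementation is my own sketch; the two strings are the ones shown on the slide.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (a metric on strings)."""
    n, m = len(s), len(t)
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j] + 1,                          # deletion
                         cur[j - 1] + 1,                       # insertion
                         prev[j - 1] + (s[i - 1] != t[j - 1])) # substitution
        prev = cur
    return prev[m]

# edit_distance("AACGTA", "AGTT")  # -> 3
```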

Metric regression
Goal: give a class of hypotheses that generalizes well and performs well on new points.
Generalization: want h whose empirical error R_n(h) is close to its expected error R(h).
What types of hypotheses generalize well?
- Complexity: VC and fat-shattering dimensions.

VC dimension
Generalization: want the empirical error R_n(h) to be close to the expected error R(h).
How do we upper bound the expected error? Use a generalization bound. Roughly speaking (and with high probability):
  expected error ≤ empirical error + (complexity of h)/n
A more complex classifier is "easier" to fit to arbitrary {-1,1} data.
Example 1: VC dimension as the complexity of the hypothesis class.
- VC dimension: the size of the largest point set that can be shattered by the hypothesis class.

Fat-shattering dimension
Generalization: want the empirical error R_n(h) to be close to the expected error R(h).
How do we upper bound the expected error? Use a generalization bound. Roughly speaking (and with high probability):
  expected error ≤ empirical error + (complexity of h)/n
A more complex classifier is "easier" to fit to arbitrary {-1,1} data.
Example 2: fat-shattering dimension of the hypothesis class.
- The size of the largest point set that can be shattered by the class with some minimum margin γ.

Generalization
Conclusion: simple hypotheses generalize well, in particular those with low fat-shattering dimension.
Can we find a hypothesis class for metric spaces with low fat-shattering dimension?
Preliminaries:
- Lipschitz constant, Lipschitz extension
- doubling dimension

Preliminaries: Lipschitz constant
The Lipschitz constant of a function f: X → R is the smallest value L satisfying
  |f(x_i) - f(x_j)| ≤ L · d(x_i, x_j) for all x_i, x_j in X.
A small Lipschitz constant corresponds to a smooth function.
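
A small sketch (my own illustration, not from the talk) of the Lipschitz constant of a function given on a finite sample: the largest slope |f(x_i) - f(x_j)| / d(x_i, x_j) over all pairs of sample points.

```python
from itertools import combinations

def lipschitz_constant(points, values, dist):
    """Smallest L with |f(x_i) - f(x_j)| <= L * d(x_i, x_j) for all pairs."""
    L = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        if d > 0:
            L = max(L, abs(values[i] - values[j]) / d)
    return L

# On the real line: lipschitz_constant([-1.0, 1.0], [-1.0, 1.0], lambda a, b: abs(a - b)) -> 1.0
```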

Preliminaries: Lipschitz extension
Lipschitz extension:
- given a function f: S → R for S ⊂ X with Lipschitz constant L,
- extend f to all of X without increasing the Lipschitz constant.
This is a classic problem in analysis; a possible solution is given by classical extension formulas (see the sketch below).
Example: points on the real line with f(1) = 1 and f(-1) = -1. (Picture credit: A. Oberman.)
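
The sketch below uses the classical McShane-style formula f(x) = min_i (f(x_i) + L * d(x, x_i)), clipped to [-1, 1]; this extends f to any point without increasing the Lipschitz constant. Treat it as an illustration of the idea under that assumption, not necessarily the exact construction used in the paper.

```python
def lipschitz_extend(x_new, points, values, L, dist):
    """Extend f from the sample to x_new while keeping the Lipschitz constant <= L.

    Uses f(x) = min_i ( f(x_i) + L * d(x, x_i) ), then clips to [-1, 1]
    (clipping does not increase the Lipschitz constant).
    """
    val = min(v + L * dist(x_new, p) for p, v in zip(points, values))
    return max(-1.0, min(1.0, val))

# Slide example on the real line with L = 1: f(-1) = -1, f(1) = 1 gives f(0) = 0.
# lipschitz_extend(0.0, [-1.0, 1.0], [-1.0, 1.0], 1.0, lambda a, b: abs(a - b))
```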

Doubling dimension
Definition: the ball B(x, r) consists of all points within distance r > 0 from x.
The doubling constant λ (of X) is the minimum value such that every ball can be covered by λ balls of half the radius.
- First used by [Ass-83], algorithmically by [Cla-97].
- The doubling dimension is ddim(X) = log_2 λ(X) [GKL-03].
- Euclidean: ddim(R^n) = O(n).
Packing property of doubling spaces:
- a set with diameter D > 0 and minimum inter-point distance a > 0 contains at most (D/a)^O(ddim) points.
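
To make the packing property concrete, here is a greedy r-net construction (my own sketch with a user-supplied distance function): the net points are pairwise more than r apart and cover every input point within radius r, so in a doubling space their number is at most (D/r)^O(ddim).

```python
def greedy_net(points, r, dist):
    """Greedy r-net: net points are pairwise > r apart, and every input point
    lies within distance r of some net point."""
    net = []
    for p in points:
        if all(dist(p, q) > r for q in net):
            net.append(p)
    return net

# Points on a segment of diameter 10: a 1-net has O(10) points.
# greedy_net([i * 0.25 for i in range(41)], 1.0, lambda a, b: abs(a - b))
```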

Applications of doubling dimension
Major application: approximate nearest neighbor search in time 2^O(ddim) log n.
Database/network structures and tasks analyzed via the doubling dimension:
- nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]
- spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]
- distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]
- clustering [Tal '04, ABS '08, FM '10]
- routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08]
Further applications:
- traveling salesperson [Tal '04, BGK '12]
- embeddings [Ass '84, ABN '08, BRS '07, GK '11]
- machine learning [BLL '09, GKK '10, '13a, '13b]
Message: this is an active line of research.
Note: the above algorithms can be extended to nearly-doubling spaces [GK '10].

Generalization bounds
We provide generalization bounds for Lipschitz (smooth) functions on spaces with low doubling dimension.
- [vLB '04] provided similar bounds using covering numbers and Rademacher averages.
Fat-shattering analysis:
- if L-Lipschitz functions shatter a set, then its inter-point distances are at least 2/L
- by the packing property, such a set has at most (diam · L)^O(ddim) points
- done! This is the fat-shattering dimension of the smooth classifier on doubling spaces.

Generalization bounds
Plugging the fat-shattering dimension into known bounds, we derive the key result.
Theorem: Fix ε > 0 and q ∈ {1, 2}, and let h be an L-Lipschitz hypothesis. Then
  P[R(h) > R_n(h) + ε] ≤ 24n (288n/ε^2)^(d log(24en/ε)) e^(-ε^2 n / 36),
where d ≈ (1 + 1/(ε/24)^((q+1)/2)) (L/(ε/24)^((q+1)/2))^ddim.
Upshot: the smooth classifier is provably good for doubling spaces.

Generalization bounds
Alternate formulation: with probability at least 1 - δ,
  R(h) ≤ R_n(h) + (complexity term depending on n, L, and δ).
Trade-off:
- the bias term R_n(h) is decreasing in L
- the variance (complexity) term is increasing in L
Goal: find the L that minimizes the right-hand side.

Generalization bounds
The previous discussion motivates the following hypothesis on the sample:
- a linear (q = 1) or quadratic (q = 2) program computes R_n(h)
- optimize L for the best bias-variance trade-off: binary search gives logarithmically many "guesses" for L (a simplified selection loop is sketched below)
For new points, we want f* to stay smooth: Lipschitz extension.
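
A simplified selection loop in the spirit of this slide: instead of the binary search, it scans geometrically spaced candidate values of L and keeps the one minimizing empirical risk plus the complexity penalty. The interface (a `fit` routine returning the hypothesis and its empirical risk, and a `penalty` function for the variance term) is an assumption for illustration.

```python
def select_lipschitz_constant(candidates, fit, penalty):
    """Pick the Lipschitz constant minimizing empirical risk + complexity penalty.

    fit(L) -> (hypothesis, empirical_risk); penalty(L) -> complexity term.
    """
    best = None
    for L in candidates:
        h, risk = fit(L)
        score = risk + penalty(L)
        if best is None or score < best[0]:
            best = (score, L, h)
    return best[1], best[2]

# Geometrically spaced guesses, mirroring the logarithmically many guesses above:
# candidates = [2.0 ** k for k in range(-5, 10)]
```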

Generalization bounds
To calculate the hypothesis, we can solve a convex (or linear) program that minimizes the empirical risk over all assignments of values to the sample points satisfying the Lipschitz constraints.
Final problem: how to solve this program quickly?
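
The transcript does not show the program itself; a natural q = 1 formulation is to minimize the average slack on the labels subject to the pairwise Lipschitz constraints. The sketch below uses SciPy's generic LP solver and dense constraints, purely as an illustration (it is not the sparse solver discussed on the next slides).

```python
import numpy as np
from scipy.optimize import linprog

def fit_lipschitz_lp(X, Y, L, dist):
    """Fit an L-Lipschitz hypothesis on the sample for q = 1:
        minimize (1/n) * sum_i xi_i
        s.t.     |f_i - Y_i| <= xi_i             (label slack)
                 |f_i - f_j| <= L * d(X_i, X_j)  (Lipschitz constraints)
                 f_i in [-1, 1], xi_i >= 0.
    Variables are z = (f_1..f_n, xi_1..xi_n)."""
    n = len(X)
    c = np.concatenate([np.zeros(n), np.full(n, 1.0 / n)])
    rows, rhs = [], []
    for i in range(n):
        r = np.zeros(2 * n); r[i] = 1.0; r[n + i] = -1.0    #  f_i - xi_i <=  Y_i
        rows.append(r); rhs.append(Y[i])
        r = np.zeros(2 * n); r[i] = -1.0; r[n + i] = -1.0   # -f_i - xi_i <= -Y_i
        rows.append(r); rhs.append(-Y[i])
    for i in range(n):
        for j in range(i + 1, n):
            bound = L * dist(X[i], X[j])
            r = np.zeros(2 * n); r[i] = 1.0; r[j] = -1.0    # f_i - f_j <= L d_ij
            rows.append(r); rhs.append(bound)
            rows.append(-r); rhs.append(bound)              # f_j - f_i <= L d_ij
    bounds = [(-1.0, 1.0)] * n + [(0.0, None)] * n
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs), bounds=bounds)
    return res.x[:n], res.fun   # fitted values f_i and empirical risk R_n
```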

Generalization bounds
To calculate the hypothesis, we can solve a convex (or linear) program.
Problem: O(n^2) constraints! An exact solution is costly.
Solution: a (1+ε)-stretch spanner.
- Replace the full constraint graph by a sparse graph of degree ε^(-O(ddim)); the solution f* is perturbed by an additive error of ε.
- Size: the number of constraints is reduced to ε^(-O(ddim)) n.
- Sparsity: each variable appears in ε^(-O(ddim)) constraints.
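
For intuition, here is a simple greedy (1+ε)-spanner sketch: process pairs by increasing distance and add an edge only if the current spanner distance exceeds (1+ε) times the true distance. This illustrates the stretch guarantee only; it is not the degree-bounded doubling-space construction referenced on the slide.

```python
import heapq
from itertools import combinations

def greedy_spanner(points, eps, dist):
    """Greedy (1+eps)-stretch spanner, returned as an adjacency map {u: {v: weight}}."""
    n = len(points)
    adj = {i: {} for i in range(n)}

    def spanner_dist(src, dst):
        # Dijkstra over the current spanner edges.
        best = {src: 0.0}
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return d
            if d > best.get(u, float("inf")):
                continue
            for v, w in adj[u].items():
                nd = d + w
                if nd < best.get(v, float("inf")):
                    best[v] = nd
                    heapq.heappush(heap, (nd, v))
        return float("inf")

    pairs = sorted(combinations(range(n), 2),
                   key=lambda p: dist(points[p[0]], points[p[1]]))
    for i, j in pairs:
        d = dist(points[i], points[j])
        if spanner_dist(i, j) > (1 + eps) * d:
            adj[i][j] = d
            adj[j][i] = d
    return adj  # keep Lipschitz constraints only along spanner edges
```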

Generalization bounds
Efficient approximate LP solution:
- Young [FOCS '01] approximately solves LPs with sparse constraints
- our total runtime: O(ε^(-O(ddim)) n log^3 n)
Reducing the QP to an LP:
- the solution suffers an additional 2ε perturbation
- O(1/ε) new constraints

Thank you! Questions?