Efficient classification for metric data
Lee-Ad Gottlieb (Hebrew U.), Aryeh Kontorovich (Ben-Gurion U.), Robert Krauthgamer (Weizmann Institute)

Classification problem
A fundamental problem in learning:
- Point space X, with a probability distribution P on X × {-1,+1}
- The learner observes a sample S of n points (x,y) drawn i.i.d. from P
- It wants to predict the labels of other points in X
- It produces a hypothesis h: X → {-1,+1} with empirical error err̂(h) = (1/n) |{(x,y) in S : h(x) ≠ y}| and true error err(h) = P{(x,y) : h(x) ≠ y}
Goal: err(h) − err̂(h) → 0 uniformly over h, in probability.

Generalization bounds
How do we upper bound the true error? Use a generalization bound. Roughly speaking (and with high probability):
true error ≤ empirical error + (complexity of h)/n
The more complex the classifier, the easier it is to fit to arbitrary data.
- VC-dimension: the size of the largest point set that can be shattered by the hypothesis class.

Popular approach for classification
Assume the points are in Euclidean space!
Pros:
- Existence of an inner product
- Efficient algorithms (SVM)
- Good generalization bounds (max margin)
Cons:
- Many natural settings are non-Euclidean
- Euclidean structure is a strong assumption
Recent popular focus: metric space data.

Metric space
(X,d) is a metric space if:
- X is a set of points
- d(·,·) is a distance function: nonnegative, symmetric, and satisfying the triangle inequality
An inner product induces a norm, and a norm induces a metric, but the converse implications do not hold.
(Map illustration: road distances between Haifa, Tel Aviv and Be'er Sheva — 95 km, 113 km and 208 km — satisfy the triangle inequality.)

Classification for metric data?
Advantage: often much more natural, and a much weaker assumption
- strings (edit distance)
- images (earthmover distance)
Problem: no vector representation
- No notion of dot product (and hence no kernel). What to do?
- Invent a kernel (e.g. embed into Euclidean space)? Possibly high distortion!
- Use some nearest-neighbor heuristic? The NN classifier has infinite VC-dimension!

Preliminaries: Lipschitz constant
The Lipschitz constant L of a function f: X → R measures its smoothness:
- It is the smallest value L satisfying |f(x_i) − f(x_j)| ≤ L · d(x_i, x_j) for all points x_i, x_j in X
- Denoted by ‖f‖_Lip
Suppose a hypothesis h: S → {-1,+1} is consistent with the sample S:
- Its Lipschitz constant is determined by the closest pair of differently labeled points
- Equivalently, ‖h‖_Lip ≥ 2/d(S+, S−), where d(S+, S−) is the distance between the closest pair of oppositely labeled sample points.
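As a quick illustration (not from the slides), a minimal Python sketch of this fact: the function and point set below are my own example, and the brute-force distance scan stands in for whatever search structure one would really use.

```python
import numpy as np

def min_lipschitz_constant(X, y, dist):
    """Smallest Lipschitz constant of any function consistent with the
    labeled sample: labels differ by 2 across classes, so
    L = 2 / d(S+, S-), the distance between the closest opposite pair."""
    pos = [x for x, label in zip(X, y) if label == +1]
    neg = [x for x, label in zip(X, y) if label == -1]
    d_min = min(dist(p, q) for p in pos for q in neg)   # d(S+, S-)
    return 2.0 / d_min

# Example in the Euclidean plane (any metric would do in place of 'euclid').
euclid = lambda p, q: float(np.linalg.norm(np.asarray(p) - np.asarray(q)))
X = [(0, 0), (0, 1), (3, 0), (3, 1)]
y = [+1, +1, -1, -1]
print(min_lipschitz_constant(X, y, euclid))   # 2/3
```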

Preliminaries: Lipschitz extension
Lipschitz extension, a classic problem in analysis: given a function f: S → R for S ⊆ X, extend f to all of X without increasing the Lipschitz constant.
Example on the real line: f(1) = 1, f(-1) = -1 (Lipschitz constant 1), extended to all of R.
(credit: A. Oberman)
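For concreteness, the classical McShane upper-envelope formula (a standard fact, spelled out here since the slide's figure is not reproduced) gives one such extension:

```latex
% McShane extension: any L-Lipschitz f : S -> R extends to all of X
% with the same Lipschitz constant, e.g. via the upper envelope
\tilde{f}(x) \;=\; \min_{s \in S} \bigl[\, f(s) + L\, d(x,s) \,\bigr],
\qquad \tilde{f}\big|_{S} = f, \qquad \|\tilde{f}\|_{\mathrm{Lip}} \le L .
```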

Classification for metric data
A powerful framework for metric classification was introduced by von Luxburg & Bousquet (vLB, JMLR '04):
- Construction of h on S: the natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions
- Extension of h to X: evaluating h at new points of X reduces to finding a Lipschitz function consistent with h, i.e. a Lipschitz extension problem. For example, f(x) = min_i [y_i + 2 d(x, x_i)/d(S+, S−)], the minimum taken over all (x_i, y_i) in S
- Evaluation of h then reduces to exact nearest neighbor search
This gives strong theoretical motivation for the NNS classification heuristic.
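A minimal sketch of the resulting classifier (the class name and the brute-force evaluation are my own; the talk later replaces the exact search with approximate NNS):

```python
import numpy as np

class LipschitzExtensionClassifier:
    """Sign of the Lipschitz extension f(x) = min_i [ y_i + 2 d(x, x_i) / d(S+, S-) ],
    evaluated here by brute-force (exact) nearest-neighbor search."""

    def __init__(self, dist):
        self.dist = dist

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        pos = [x for x, l in zip(X, y) if l == +1]
        neg = [x for x, l in zip(X, y) if l == -1]
        self.margin = min(self.dist(p, q) for p in pos for q in neg)  # d(S+, S-)
        return self

    def decision_function(self, x):
        L = 2.0 / self.margin
        return min(yi + L * self.dist(x, xi) for xi, yi in zip(self.X, self.y))

    def predict(self, x):
        return 1 if self.decision_function(x) >= 0 else -1

# Usage with Euclidean distance standing in for a general metric:
euclid = lambda p, q: float(np.linalg.norm(np.asarray(p) - np.asarray(q)))
clf = LipschitzExtensionClassifier(euclid).fit([(0, 0), (4, 0)], [+1, -1])
print(clf.predict((1, 0)), clf.predict((3, 0)))   # +1 -1
```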

More on von Luxburg & Bousquet (*modulo some cheating)

von Luxburg & Bousquet, cont'd

Two new directions
The framework of [vLB '04] leaves open two further questions:
- Constructing h: handling noise (bias-variance tradeoff) — which sample points in S should h ignore?
- Evaluating h on X: in an arbitrary metric space, exact NNS requires Θ(n) time — can we do better?

Doubling dimension
Definition: the ball B(x,r) is the set of all points within distance r of x. The doubling constant λ (of a metric M) is the minimum value λ such that every ball can be covered by λ balls of half the radius.
- First used by [Assouad '83], algorithmically by [Clarkson '97]
- The doubling dimension is ddim(M) = log₂ λ(M)
- A metric is doubling if its doubling dimension is constant
- Euclidean: ddim(R^d) = O(d)
Packing property of doubling spaces: a set with diameter diam and minimum inter-point distance a contains at most (diam/a)^O(ddim) points.
(Illustration: in the plane, at least 7 half-radius balls are needed to cover a ball, so λ ≥ 7.)
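As a rough illustration (not part of the talk), a brute-force sketch that upper-bounds the doubling constant of a finite sample by greedily covering each ball with half-radius balls; the greedy cover only gives an upper bound on λ, and the point set is my own example.

```python
import itertools
import numpy as np

def doubling_constant_upper_bound(X, dist):
    """Greedy upper bound on the doubling constant of a finite point set:
    for every ball B(x, r) (r ranging over pairwise distances), greedily
    pick centers so that balls of radius r/2 cover B(x, r) within X."""
    lam = 1
    radii = {dist(p, q) for p, q in itertools.combinations(X, 2)}
    for x in X:
        for r in radii:
            ball = [p for p in X if dist(x, p) <= r]
            uncovered, centers = list(ball), 0
            while uncovered:
                c = uncovered[0]                       # greedy: any uncovered point
                uncovered = [p for p in uncovered if dist(c, p) > r / 2]
                centers += 1
            lam = max(lam, centers)
    return lam   # doubling dimension is then at most log2(lam)

euclid = lambda p, q: float(np.linalg.norm(np.asarray(p) - np.asarray(q)))
grid = [(i, j) for i in range(4) for j in range(4)]
print(doubling_constant_upper_bound(grid, euclid))
```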

Applications of doubling dimension
Major application to databases:
- Recall that exact NNS requires Θ(n) time in an arbitrary metric space
- There exists a linear-size structure that supports approximate nearest neighbor search in time 2^O(ddim) log n
Database/network structures and tasks analyzed via the doubling dimension:
- Nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]
- Image recognition (vision) [KG --]
- Spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]
- Distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]
- Clustering [Tal '04, ABS '08, FM '10]
- Routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08]
Further applications:
- Traveling salesperson [Tal '04]
- Embeddings [Ass '84, ABN '08, BRS '07, GK '11]
- Machine learning [BLL '09, KKL '10, KKL --]
Note: the above algorithms can be extended to nearly-doubling spaces [GK '10].
Message: this is an active line of research.

Fat-shattering dimension
[Alon, Ben-David, Cesa-Bianchi, Haussler '97]: an analogue of the VC-dimension for real-valued functions. A set is γ-fat-shattered by a function class F if every ±1 labeling of the set can be realized by some f in F with margin at least γ; the γ-fat-shattering dimension is the size of the largest such set.

Fat-shattering and generalization
Generalization bounds for real-valued classifiers can be stated in terms of the fat-shattering dimension; the [BST '99] bound used below is of this form.

Our dual use of doubling dimension
Interestingly, the doubling dimension contributes in two different areas:
- Statistical (function complexity): we bound the complexity of the hypothesis in terms of the doubling dimension of X and the Lipschitz constant of the classifier h
- Computational: efficient approximate NNS

Statistical contribution
We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension.
- vLB provided similar bounds using covering numbers and Rademacher averages
Fat-shattering analysis:
- If L-Lipschitz functions shatter a set, its inter-point distances are at least 2/L
- By the packing property, such a set has at most (diam · L)^O(ddim) points
- This bounds the fat-shattering dimension of the classifier class on the space, and is a good measure of its complexity.
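A compressed version of this packing argument, written out as a math sketch (taking the fat-shattering scale to be the natural margin 1 for ±1 labels is my assumption about the slide's convention):

```latex
% If an L-Lipschitz f realizes opposite labels on x_i and x_j with margin 1, then
%   2 \le |f(x_i) - f(x_j)| \le L\, d(x_i, x_j), \quad\text{so}\quad d(x_i, x_j) \ge 2/L .
% A fat-shattered set is therefore a (2/L)-packing, and the packing property gives
\operatorname{fat}_1(\mathcal{F}_L)
  \;\le\; \left(\frac{\operatorname{diam}(X)}{2/L}\right)^{O(\operatorname{ddim}(X))}
  \;=\; \bigl(\operatorname{diam}(X)\cdot L\bigr)^{O(\operatorname{ddim}(X))}.
```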

Statistical contribution
[BST '99]:
- For any f that classifies a sample of size n correctly, with probability at least 1−δ,
  P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (d log(34en/d) log₂(578n) + log(4/δ)).
- Likewise, if f is correct on all but k examples, with probability at least 1−δ,
  P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^{1/2}.
- In both cases, d is bounded by the fat-shattering dimension, d ≤ (diam · L)^ddim + 1.
That completes the statistical contribution; on to the computational contribution.
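A small sketch that plugs numbers into the bound as displayed above (the function name and the example parameters are mine; natural logarithms are used where the slide writes "log", which is a detail of the [BST '99] statement):

```python
import math

def bst99_bound(n, ddim, diam, L, delta, k=0):
    """Evaluate the generalization bound above: d is the fat-shattering
    bound (diam*L)^ddim + 1, and k is the number of sample errors."""
    d = (diam * L) ** ddim + 1
    slack = (2.0 / n) * (d * math.log(34 * math.e * n / d) * math.log2(578 * n)
                         + math.log(4.0 / delta))
    return slack if k == 0 else k / n + math.sqrt(slack)

# e.g. n = 1,000,000 samples, ddim = 3, diam * L = 4, confidence 1 - 0.05:
print(bst99_bound(n=1_000_000, ddim=3, diam=4.0, L=1.0, delta=0.05))
```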

Computational contribution
Evaluation of h at new points in X:
- The Lipschitz extension function f(x) = min_i [y_i + 2 d(x, x_i)/d(S+, S−)] requires exact nearest neighbor search, which can be expensive!
New tool: (1+ε)-approximate nearest neighbor search, in 2^O(ddim) log n + ε^−O(ddim) time [KL '04, HM '06, BKL '06, CG '06].
If we evaluate f(x) using approximate NNS, we can show that the result agrees in sign with at least one of
- g(x) = (1+ε) f(x) + ε
- e(x) = (1+ε) f(x) − ε
Note that g(x) ≥ f(x) ≥ e(x). Both g(x) and e(x) have Lipschitz constant (1+ε)L, so they, and hence the approximately evaluated hypothesis, generalize well.
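A toy sketch of the sandwich argument: distances inflated by a random factor of at most (1+ε) simulate what an approximate NNS might return (this is not the actual data structure), and the computed value stays between e(x) and g(x) for these queries.

```python
import random

def f_exact(x, sample, margin, dist):
    # Lipschitz extension f(x) = min_i [ y_i + 2 d(x, x_i) / d(S+, S-) ]
    L = 2.0 / margin
    return min(y + L * dist(x, xi) for xi, y in sample)

def f_approx(x, sample, margin, dist, eps):
    # Simulate a (1+eps)-approximate NNS: every distance it returns may be
    # overestimated by a factor of at most (1+eps).
    L = 2.0 / margin
    return min(y + L * dist(x, xi) * random.uniform(1.0, 1.0 + eps)
               for xi, y in sample)

dist = lambda a, b: abs(a - b)
sample = [(0.0, +1), (4.0, -1)]                 # d(S+, S-) = 4
margin, eps = 4.0, 0.1
for x in [0.5, 1.7, 2.9, 3.6]:
    f = f_exact(x, sample, margin, dist)
    fa = f_approx(x, sample, margin, dist, eps)
    g, e = (1 + eps) * f + eps, (1 + eps) * f - eps
    # Here |f(x)| <= 1 for the chosen queries, so e(x) <= fa <= g(x),
    # and hence sgn(fa) agrees with sgn(g) or sgn(e).
    assert e - 1e-9 <= fa <= g + 1e-9, (x, f, fa)
print("approximate evaluation stays within the [e(x), g(x)] sandwich")
```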

Final problem: bias-variance tradeoff
Which sample points in S should h ignore?
If f is correct on all but k examples, then with probability at least 1−δ,
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^{1/2},
where d ≤ (diam · L)^ddim.

Structural Risk Minimization
Algorithm:
- Fix a target Lipschitz constant L (O(n²) possibilities)
- Locate all pairs of points from S+ and S− whose distance is less than 2/L; at least one point of each such pair has to be counted as an error
- Goal: remove as few points as possible
This is minimum vertex cover:
- NP-complete in general, but it admits a 2-approximation in O(E) time
Here the conflict graph is bipartite, so minimum vertex cover is equivalent to maximum matching (Kőnig's theorem) and admits an exact solution in O(n^ω) randomized time, where ω < 2.38 is the matrix multiplication exponent [MS '04].
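A sketch of this step using networkx (a library choice of mine, not the paper's; function and node names are illustrative): build the bipartite conflict graph on oppositely labeled pairs at distance below 2/L, then obtain a minimum vertex cover from a maximum matching via Kőnig's theorem.

```python
import networkx as nx

def min_points_to_remove(S_pos, S_neg, dist, L):
    """Minimum number of sample points to discard so that the remaining
    sample admits an L-Lipschitz consistent classifier (a conflict is a
    pair of oppositely labeled points at distance < 2/L)."""
    G = nx.Graph()
    top = [("+", i) for i in range(len(S_pos))]
    G.add_nodes_from(top, bipartite=0)
    G.add_nodes_from((("-", j) for j in range(len(S_neg))), bipartite=1)
    for i, p in enumerate(S_pos):
        for j, q in enumerate(S_neg):
            if dist(p, q) < 2.0 / L:
                G.add_edge(("+", i), ("-", j))
    matching = nx.bipartite.maximum_matching(G, top_nodes=top)
    cover = nx.bipartite.to_vertex_cover(G, matching, top_nodes=top)  # Kőnig
    return len(cover)

dist = lambda a, b: abs(a - b)
print(min_points_to_remove([0.0, 1.0, 5.0], [1.5, 6.0, 7.0], dist, L=1.0))  # 2
```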

Efficient SRM
Algorithm:
- For each of the O(n²) candidate values of L: run the matching algorithm to find the minimum error, and evaluate the generalization bound for this value of L
- Total: about O(n^{ω+2}) randomized time
Better algorithm:
- Binary search over the O(n²) candidate values of L
- For each value: run the greedy 2-approximation, approximating the minimum error in O(n² log n) time, and evaluate the approximate generalization bound for this value of L.
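A sketch of the approximate variant (for simplicity it enumerates all candidate values of L rather than binary-searching, and the stand-in bound function is my own, not the real generalization bound):

```python
def greedy_vertex_cover_size(edges):
    """Classic 2-approximation for vertex cover: repeatedly take both
    endpoints of an uncovered edge."""
    covered, size = set(), 0
    for u, v in edges:
        if u not in covered and v not in covered:
            covered.update((u, v))
            size += 2
    return size

def approx_srm(S_pos, S_neg, dist, bound, n):
    """Enumerate the candidate Lipschitz constants (one per inter-class
    pair), approximate the minimum error k greedily, and keep the value of
    L whose (approximate) generalization bound is best."""
    best = None
    for d in sorted({dist(p, q) for p in S_pos for q in S_neg}):
        L = 2.0 / d                      # pairs closer than 2/L are conflicts
        edges = [(("+", i), ("-", j))
                 for i, p in enumerate(S_pos)
                 for j, q in enumerate(S_neg) if dist(p, q) < 2.0 / L]
        k = greedy_vertex_cover_size(edges)      # at most twice the optimum
        value = bound(L=L, k=k, n=n)
        if best is None or value < best[2]:
            best = (L, k, value)
    return best

# Usage with a toy stand-in for the generalization bound (not the real bound):
dist = lambda a, b: abs(a - b)
toy_bound = lambda L, k, n: k / n + ((10.0 * L) ** 2 / n) ** 0.5
print(approx_srm([0.0, 1.0, 5.0], [1.5, 6.0, 7.0], dist, toy_bound, n=6))
```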

Conclusion
Results:
- Generalization bounds for Lipschitz classifiers in doubling spaces
- Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS
- Efficient structural risk minimization
Continuing research: continuous labels (regression)
- Risk bound via the doubling dimension
- Classifier h determined via an LP
- Faster LP via low-hop, low-stretch spanners [GR '08a, GR '08b]: fewer constraints, and each variable appears in a bounded number of constraints.

Application: earthmover distance
(Illustration: the earthmover distance between two point sets S and T.)