Nearly optimal classification for semimetrics
Lee-Ad Gottlieb (Ariel U.), Aryeh Kontorovich (Ben Gurion U.), Pinhas Nisnevitch (Tel Aviv U.)
Classification problem
A fundamental problem in learning:
- Point space X
- Probability distribution P on X × {-1,+1}
- The learner observes a sample S of n points (x,y) drawn i.i.d. from P
- Wants to predict the labels of other points in X
- Produces a hypothesis h: X → {-1,+1} with empirical error êrr(h) = (1/n) Σ_i 1[h(x_i) ≠ y_i] and true error err(h) = Pr_(x,y)~P[h(x) ≠ y]
Goal: êrr(h) → err(h), uniformly over h, in probability
Generalization bounds
How do we upper bound the true error? Use a generalization bound.
Roughly speaking (and with high probability):
  true error ≤ empirical error + √(complexity of h / n)
More complex classifier ↔ "easier" to fit arbitrary data
VC-dimension: size of the largest point set that can be shattered by the hypothesis class
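For concreteness, one standard form of such a bound (not taken from this talk; constants and log factors vary across sources), with d the VC-dimension of the hypothesis class H and δ the confidence parameter:

```latex
\[
  \operatorname{err}(h) \;\le\; \widehat{\operatorname{err}}_n(h)
  \;+\; O\!\left(\sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}}\right)
  \qquad \text{for all } h \in H,\ \text{with probability} \ge 1-\delta.
\]
```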
Popular approach for classification
Assume the points are in Euclidean space!
Pros:
- Existence of an inner product
- Efficient algorithms (SVM)
- Good generalization bounds (max margin)
Cons:
- Many natural settings are non-Euclidean
- Euclidean structure is a strong assumption
Recent popular focus:
- Metric space data
- Semimetric space data
Semimetric space
First, recall: (X, ρ) is a metric space if:
- X = a set of points, ρ = a distance function ρ: X×X → ℝ
- Nonnegative: ρ(x,x′) ≥ 0, and ρ(x,x′) = 0 ⇔ x = x′
- Symmetric: ρ(x,x′) = ρ(x′,x)
- Triangle inequality: ρ(x,x′) ≤ ρ(x,x′′) + ρ(x′,x′′)
Semimetric space
(X, ρ) is a semimetric space if:
- X = a set of points, ρ = a distance function ρ: X×X → ℝ
- Nonnegative: ρ(x,x′) ≥ 0, and ρ(x,x′) = 0 ⇔ x = x′
- Symmetric: ρ(x,x′) = ρ(x′,x)
- Triangle inequality: not required
Hierarchy: inner product ⊂ norm ⊂ metric ⊂ semimetric
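As a side illustration (not from the talk), a minimal sketch that checks these axioms on a finite distance matrix and also reports whether the triangle inequality happens to hold:

```python
import numpy as np

def check_semimetric(D, tol=1e-9):
    """Check the semimetric axioms on a finite n x n distance matrix D.

    Returns booleans for nonnegativity, identity (zero exactly on the
    diagonal), symmetry, and -- not required for a semimetric -- whether
    the triangle inequality happens to hold.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    off_diag = D[~np.eye(n, dtype=bool)]
    nonnegative = bool((D >= -tol).all())
    identity = bool(np.allclose(np.diag(D), 0.0) and (off_diag > tol).all())
    symmetric = bool(np.allclose(D, D.T, atol=tol))
    # D[i, j] <= D[i, k] + D[k, j] for all i, k, j (checked by broadcasting).
    triangle = bool((D[:, None, :] <= D[:, :, None] + D[None, :, :] + tol).all())
    return {"nonnegative": nonnegative, "identity": identity,
            "symmetric": symmetric, "triangle_inequality": triangle}
```

Running it on the pairwise-distance matrices of the examples on the next slides would flag their triangle-inequality violations.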
Semimetric examples
- Shannon-Jensen divergence
- Euclidean-squared (ℓ₂²)
- Fractional ℓp spaces (p < 1), e.g. p = ½: ||a−b||_p = (Σ_i |a_i − b_i|^p)^(1/p)
[Figures: three-point examples violating the triangle inequality — under ℓ₂², the points (0,0), (0,1), (2,2) have pairwise distances 1, 5, 8; under ℓ_{1/2}, the points (0,0), (0,2), (2,0) have pairwise distances 2, 2, 8.]
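A minimal sketch (my own illustration using the standard definitions) that reproduces the pairwise distances in the two figures and exhibits the triangle-inequality violations:

```python
import numpy as np

def lp_dist(a, b, p):
    """Fractional l_p "distance" (p < 1): (sum_i |a_i - b_i|^p)^(1/p)."""
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p))

def sq_euclidean(a, b):
    """Euclidean-squared (l_2^2) distance."""
    return float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

# l_2^2 example: pairwise distances 1, 5, 8 -- triangle inequality fails (8 > 1 + 5).
pts = [(0, 0), (0, 1), (2, 2)]
print([sq_euclidean(pts[i], pts[j]) for i, j in [(0, 1), (1, 2), (0, 2)]])

# l_{1/2} example: pairwise distances 2, 2, 8 -- triangle inequality fails (8 > 2 + 2).
pts = [(0, 0), (0, 2), (2, 0)]
print([lp_dist(pts[i], pts[j], 0.5) for i, j in [(0, 1), (1, 2), (0, 2)]])
```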
Semimetric examples
Hausdorff-type distances between point sets A and B:
- Hausdorff distance: determined by the point in A farthest from B
- 1-rank Hausdorff distance: determined by the point in A closest to B
- k-rank Hausdorff distance: determined by the point in A k-th closest to B
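A sketch under my reading of the slide (an assumption: the distance from a point of A to the set B is the smallest base distance to any point of B; the paper's precise definition, e.g. how the directions A→B and B→A are combined, may differ):

```python
import numpy as np

def point_to_set(a, B, d):
    """Distance from point a to set B: the smallest d(a, b) over b in B."""
    return min(d(a, b) for b in B)

def k_rank_hausdorff(A, B, d, k=1):
    """k-rank (directed) Hausdorff-style distance from A to B, read off the slide:
    the distance achieved by the point of A that is k-th closest to B.
    k = 1 gives the closest point; k = len(A) gives the classic directed
    Hausdorff distance (the point of A farthest from B)."""
    dists = sorted(point_to_set(a, B, d) for a in A)
    return dists[k - 1]

# Example with squared Euclidean distance as the base semimetric (an assumption).
d = lambda a, b: float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))
A = [(0, 0), (1, 0), (5, 5)]
B = [(0, 1), (1, 1)]
print(k_rank_hausdorff(A, B, d, k=1))        # closest point of A to B
print(k_rank_hausdorff(A, B, d, k=len(A)))   # classic directed Hausdorff
```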
Semimetric examples
Note: semimetrics are often unintuitive.
Example: diameter > 2·radius is possible.
[Figure: under ℓ_{1/2}, the points (0,0), (0,2), (2,0) have pairwise distances 2, 2, 8, so the radius about (0,0) is 2 while the diameter is 8.]
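A short check of this example (my own illustration, assuming the fractional ℓ_{1/2} distance from the previous slide and taking the radius over centers chosen among the points):

```python
import itertools
import numpy as np

def lp_half(a, b):
    """Fractional l_{1/2} distance: (sum_i |a_i - b_i|^{1/2})^2."""
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** 0.5) ** 2)

pts = [(0, 0), (0, 2), (2, 0)]
# Diameter: the largest pairwise distance.
diameter = max(lp_half(a, b) for a, b in itertools.combinations(pts, 2))
# Radius: the smallest, over centers chosen among the points, of the
# distance to the farthest point.
radius = min(max(lp_half(c, p) for p in pts) for c in pts)
print(diameter, radius)  # ~8.0 and ~2.0, so diameter > 2 * radius
```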
Classification for semimetrics?
Problem: no vector representation, hence no notion of dot-product (and no kernel).
What to do?
- Invent a kernel (e.g. embed into Euclidean space)? Provably high distortion!
- Use some NN heuristic? The NN classifier has infinite VC-dimension!
Our approach: sample compression + NN classification.
Result: strong generalization bounds.
Classification for semimetrics?
Central contribution: we show that the complexity of the classifier is controlled by
- the margin γ of the sample, and
- the density dimension of the space,
with bounds close to optimal.
Density dimension
Definition: the ball B(x,r) is the set of all points within distance r of x. The density constant of a (semi)metric space is the minimum value c bounding the number of points in any ball B(x,r) that are at mutual distance r/2; the density dimension is dens = log₂ c.
Captures the number of distinct directions / volume growth.
Examples, for a set of n d-dimensional vectors:
- SJ divergence: O(d)
- ℓp (p < 1): O(d/p)
- k-rank Hausdorff: O(k(d + log n))
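A brute-force sketch (my own illustration) of estimating the density constant of a finite sample: for each candidate center and each radius given by a pairwise distance, greedily pack points of the ball at mutual distance at least r/2. A greedy packing is maximal but not necessarily maximum, so this only lower-bounds the true constant; the ≥ vs > thresholds are my assumption.

```python
import math
import numpy as np

def density_constant_lower_bound(points, d):
    """Greedy lower bound on the density constant of a finite sample.

    For every center x and every radius r taken from the pairwise distances,
    greedily pack points of the ball B(x, r) at mutual distance >= r/2 and
    record the largest packing found.
    """
    n = len(points)
    D = np.array([[d(points[i], points[j]) for j in range(n)] for i in range(n)])
    radii = sorted(set(D[D > 0].tolist()))
    best = 1
    for x in range(n):
        for r in radii:
            ball = [i for i in range(n) if D[x, i] <= r]
            packing = []
            for i in ball:  # greedy r/2-packing inside the ball
                if all(D[i, j] >= r / 2 for j in packing):
                    packing.append(i)
            best = max(best, len(packing))
    return best

# Toy usage with squared Euclidean distance as the base semimetric (an assumption).
rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))
d = lambda a, b: float(np.sum((a - b) ** 2))
c = density_constant_lower_bound(pts, d)
print("density constant >=", c, " density dimension >= about", math.log2(c))
```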
Classifier construction
Recall: compress the sample, then classify by NN.
Initial approach: a classifier consistent with the sample, i.e. NN reconstructs the sample labels exactly.
Solution: a γ-net C of the sample.
Classifier construction
Solution: a γ-net C of the sample S.
- Must be consistent with the sample labels
- Brute-force construction: O(n²)
Crucial question: how many points are in the net?
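A minimal sketch of the brute-force greedy construction (my own illustration, not the authors' code): keep a sample point only if it is at distance at least γ from every point kept so far; this takes O(n²) distance computations, and the kept points with their labels form the compressed NN classifier.

```python
def gamma_net(sample, d, gamma):
    """Greedy gamma-net of a labeled sample: brute force, O(n^2) distance calls.

    sample: list of (point, label) pairs; d: distance function; gamma: net scale.
    Every sample point ends up within gamma of some kept net point, and the
    kept net points are at mutual distance >= gamma.
    """
    net = []
    for x, y in sample:
        if all(d(x, c) >= gamma for c, _ in net):
            net.append((x, y))
    return net

def nn_classify(x, net, d):
    """Predict the label of x by its nearest neighbor in the compressed net."""
    return min(net, key=lambda point_label: d(x, point_label[0]))[1]
```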
Classifier construction
Theorem: if C is an optimally small γ-net of the sample S, then |C| ≤ (radius(S)/γ)^dens(S).
Proof sketch (and construction):
- Take at most 2^dens(S) points at mutual distance radius(S)/2
- Associate each remaining point with its NN among them
- Repeat, halving the scale, log(radius(S)/γ) times
Runtime: n·log(radius(S)/γ)
Optimality: the constructed net may not be optimally small, but it is NP-hard to do much better.
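Filling in the arithmetic behind the size bound: each of the log₂(radius(S)/γ) halving steps multiplies the number of net points by at most 2^dens(S), so

```latex
\[
  |C| \;\le\; \Bigl(2^{\mathrm{dens}(S)}\Bigr)^{\log_2\bigl(\mathrm{radius}(S)/\gamma\bigr)}
       \;=\; \Bigl(\tfrac{\mathrm{radius}(S)}{\gamma}\Bigr)^{\mathrm{dens}(S)}.
\]
```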
Generalization bounds
Upshot: the net C acts as a sample compression scheme, so its size directly bounds the true error of the NN classifier.
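For context, a standard compression-style bound of the kind this argument relies on (a generic form, not copied from the talk; constants and log factors differ across statements): for a compression set of size k that is consistent with the rest of the sample, with probability at least 1−δ,

```latex
\[
  \operatorname{err}(h) \;\le\; \frac{k \ln n + \ln(1/\delta)}{n - k},
  \qquad\text{and here}\quad
  k = |C| \;\le\; \Bigl(\tfrac{\mathrm{radius}(S)}{\gamma}\Bigr)^{\mathrm{dens}(S)}.
\]
```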
Generalization bounds
What if we allow the classifier ε·n errors on the sample?
Better bound: the complexity term then depends on the achieved compression size k rather than on the worst-case net size.
But it is NP-hard to optimize this bound, so the consistent net is the best we can hope to compute.
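Again for context only, a representative shape of a compression bound that tolerates sample errors (my paraphrase of standard results; the bound actually used in the paper may differ): with ε·n sample errors and achieved compression size k, with probability at least 1−δ,

```latex
\[
  \operatorname{err}(h) \;\lesssim\; \frac{\varepsilon n}{n - k}
  \;+\; \sqrt{\frac{k \log n + \log(1/\delta)}{n - k}}.
\]
```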
Generalization bounds
Upshot: even under margin assumptions, a sample of size exponential in the density dimension is required for some distributions.