Nearly optimal classification for semimetrics

Presentation transcript:

Nearly optimal classification for semimetrics. Lee-Ad Gottlieb (Ariel U.), Aryeh Kontorovich (Ben Gurion U.), Pinhas Nisnevitch (Tel Aviv U.)

Classification problem. A fundamental problem in learning: a point space X and a probability distribution P on X × {-1,+1}. The learner observes a sample S of n points (x,y) drawn i.i.d. from P and wants to predict the labels of other points in X. It produces a hypothesis h: X → {-1,+1} with an empirical error (on S) and a true error (under P). Goal: bound the true error by the empirical error plus a small term, uniformly over h, with high probability.
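For concreteness, the empirical and true error can be written in the usual textbook form (presumably equivalent to the formulas on the original slide, which are not reproduced here):

\widehat{\mathrm{err}}_S(h) = \frac{1}{n}\sum_{(x,y)\in S} \mathbf{1}[h(x)\neq y],
\qquad
\mathrm{err}(h) = \Pr_{(x,y)\sim P}[h(x)\neq y],

and the goal is a guarantee of the form

\Pr\Big(\sup_h \big(\mathrm{err}(h) - \widehat{\mathrm{err}}_S(h)\big) > \varepsilon(n,\delta)\Big) \le \delta.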

Generalization bounds. How do we upper-bound the true error? Use a generalization bound. Roughly speaking (and with high probability): true error ≤ empirical error + √[(complexity of h)/n]. A more complex classifier is "easier" to fit to arbitrary data. VC-dimension: the size of the largest point set that can be shattered by the hypothesis class.
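One standard instance of such a bound is the classical VC bound (constants vary across textbooks, and the talk's exact statement may differ): with probability at least 1 − δ, simultaneously for every h in a class of VC-dimension d,

\mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}_S(h) + \sqrt{\frac{d\big(\ln(2n/d)+1\big) + \ln(4/\delta)}{n}}.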

Popular approach for classification: assume the points are in Euclidean space! Pros: existence of an inner product; efficient algorithms (SVM); good generalization bounds (max margin). Cons: many natural settings are non-Euclidean, and Euclidean structure is a strong assumption. Recent popular focus: metric space data and semimetric space data.

Semimetric space. (X, ρ) is a metric space if X is a set of points and ρ is a distance function ρ: X×X → ℝ satisfying: nonnegativity, ρ(x,x′) ≥ 0 with ρ(x,x′) = 0 ⇔ x = x′; symmetry, ρ(x,x′) = ρ(x′,x); and the triangle inequality, ρ(x,x′) ≤ ρ(x,x′′) + ρ(x′′,x′).

Semimetric space. (X, ρ) is a semimetric space if ρ satisfies nonnegativity (ρ(x,x′) ≥ 0, with ρ(x,x′) = 0 ⇔ x = x′) and symmetry (ρ(x,x′) = ρ(x′,x)), but the triangle inequality is not required to hold. Hierarchy: inner product ⊂ norm ⊂ metric ⊂ semimetric.

Semimetric examples: the Jensen-Shannon divergence; squared Euclidean distance (ℓ2²); fractional ℓp spaces (p < 1), for example p = ½, with ||a − b||_p = (Σ_i |a_i − b_i|^p)^(1/p). [Figure: example point sets with their pairwise distances, e.g. (0,0), (0,1), (2,2) under ℓ2² at distances 1, 5, 8.]
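To make the "semimetric but not metric" point concrete, here is a small Python check (an illustration added to the transcript, not code from the talk) showing that squared Euclidean distance violates the triangle inequality on three collinear points:

import numpy as np

def sq_euclidean(a, b):
    # Squared Euclidean distance: a semimetric, but not a metric.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum((a - b) ** 2))

x, y, z = (0.0, 0.0), (0.0, 1.0), (0.0, 2.0)
direct = sq_euclidean(x, z)                       # 4.0
detour = sq_euclidean(x, y) + sq_euclidean(y, z)  # 1.0 + 1.0 = 2.0
print(direct, detour)    # 4.0 2.0
print(direct <= detour)  # False: the triangle inequality fails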

Semimetric examples: Hausdorff-type distances between point sets A and B. Hausdorff distance: determined by the point in A farthest from B. 1-rank Hausdorff distance: determined by the point in A closest to B. k-rank Hausdorff distance: determined by the point in A that is k-th closest to B. [Figure: two point sets A and B.]
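One plausible reading of the k-rank construction above, as an illustrative Python sketch (hypothetical; the paper's precise definition, in particular how the two directions are combined, may differ):

def directed_k_rank_hausdorff(A, B, k, dist):
    # Distance from each point of A to the set B, sorted ascending;
    # the k-th smallest corresponds to "the point in A k-th closest to B".
    # k = len(A) recovers the usual directed Hausdorff distance (farthest point of A).
    d_to_B = sorted(min(dist(a, b) for b in B) for a in A)
    return d_to_B[k - 1]

def k_rank_hausdorff(A, B, k, dist):
    # Symmetrize by taking the larger of the two directed values (one common choice).
    return max(directed_k_rank_hausdorff(A, B, k, dist),
               directed_k_rank_hausdorff(B, A, k, dist))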

Semimetric examples. Note: semimetrics are often unintuitive. Example: the diameter of a set can exceed twice its radius, which is impossible in a metric space. [Figure: three points (0,0), (0,2), (2,0) at pairwise distances 2, 2, and 8 (as under the ℓ_{1/2} distance from the previous slide): the diameter is 8 while the radius is only 2.]

Classification for semimetrics? Problem: there is no vector representation, hence no notion of dot product (and no kernel). What to do? Invent a kernel (e.g., embed into Euclidean space)? Provably high distortion. Use some nearest-neighbor heuristic? The NN classifier has infinite VC-dimension. Our approach: sample compression followed by NN classification. Result: strong generalization bounds.

Classification for semimetrics? Central contribution: we show that the complexity of the classifier is controlled by the margin γ of the sample and by the density dimension of the space, with bounds close to optimal.

Density dimension. Definition: the ball B(x,r) is the set of all points within distance r of x. The density constant of a (semi)metric space is the smallest c such that every ball B(x,r) contains at most c points at mutual distance at least r/2; the density dimension dens is log₂ c. Examples, for a set of n d-dimensional vectors: JS divergence, O(d); ℓp (p < 1), O(d/p); k-rank Hausdorff, O(k(d + log n)). The density dimension captures the number of distinct directions and the volume growth of the space. [Figure: a ball of radius r containing points at mutual distance r/2.]
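As an illustration (a sketch added to the transcript, not code from the talk), the density constant of a small finite sample can be estimated by brute force: for every center x and every radius r given by a pairwise distance, greedily pick points of B(x,r) that stay at mutual distance at least r/2 and record the largest count found. The greedy packing only lower-bounds the true maximum, which is fine for a rough estimate.

import itertools

def density_constant_estimate(points, dist):
    # Greedy lower-bound estimate of the density constant of a finite point set.
    best = 1
    radii = {dist(p, q) for p, q in itertools.combinations(points, 2) if dist(p, q) > 0}
    for x in points:
        for r in radii:
            ball = [p for p in points if dist(x, p) <= r]
            packing = []  # points kept at mutual distance >= r/2
            for p in ball:
                if all(dist(p, q) >= r / 2 for q in packing):
                    packing.append(p)
            best = max(best, len(packing))
    return best

For example, it can be run with the sq_euclidean function from the earlier snippet as the dist argument.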

Classifier construction. Recall the approach: compress the sample, then classify with a nearest-neighbor rule. Initial goal: a classifier consistent with the sample, i.e., NN applied to the compressed set reconstructs the sample labels exactly. Solution: a γ-net of the sample.

Classifier construction. Solution: a γ-net C of the sample S. It must be consistent. Brute-force construction takes O(n²) time. Crucial question: how many points are in the net?
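A minimal sketch of the brute-force route just mentioned, assuming the simplest greedy net construction (the talk's faster hierarchical construction appears on the next slides; names and details here are illustrative, not taken from the paper): greedily keep sample points that are more than γ away from all previously kept net points, then classify a query by the label of its nearest net point.

def greedy_gamma_net(sample, gamma, dist):
    # sample: list of (point, label) pairs; returns a gamma-net of the labeled points.
    # Every sample point is within distance gamma of some net point, and net
    # points are pairwise more than gamma apart. O(n^2) time.
    net = []
    for x, y in sample:
        if all(dist(x, c) > gamma for c, _ in net):
            net.append((x, y))
    return net

def nn_classify(query, net, dist):
    # Label of the nearest net point (1-NN over the compressed set).
    _, label = min(net, key=lambda cy: dist(query, cy[0]))
    return label

Under a margin assumption (oppositely labeled sample points at distance greater than gamma), the nearest net point of every sample point shares its label, so this compressed classifier is consistent with the sample.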

Classifier construction. Theorem: an optimally small γ-net C of the sample S satisfies |C| ≤ (radius(S)/γ)^dens(S). Proof sketch, via a hierarchical construction: start with at most 2^dens(S) points at mutual distance radius(S)/2; associate each remaining point with its nearest neighbor in the current net; repeat, halving the scale, for log(radius(S)/γ) levels. Runtime: n·log(radius(S)/γ). Optimality: the constructed net may not be optimally small, but it is NP-hard to do much better.
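Filling in the arithmetic behind the stated size bound (a short calculation consistent with the sketch above): each of the log₂(radius(S)/γ) levels grows the net by a factor of at most 2^{dens(S)}, so

|C| \;\le\; \Big(2^{\mathrm{dens}(S)}\Big)^{\log_2(\mathrm{radius}(S)/\gamma)} \;=\; \Big(\frac{\mathrm{radius}(S)}{\gamma}\Big)^{\mathrm{dens}(S)}.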

Generalization bounds. Upshot: the consistent γ-net C defines a sample compression scheme of size |C| ≤ (radius(S)/γ)^dens(S), which yields a generalization bound in terms of this size and n.
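A standard bound for consistent sample compression schemes (Littlestone–Warmuth style; the exact statement in the talk may differ in constants) reads: with probability at least 1 − δ, a hypothesis h reconstructed from a compression set of size k = |C| and consistent with the sample satisfies

\mathrm{err}(h) \;\le\; \frac{1}{\,n-k\,}\Big(k\ln n + \ln\frac{1}{\delta}\Big).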

Generalization bounds. What if we allow the classifier εn errors on the sample? A better bound is possible, in which k is the achieved compression size. But it is NP-hard to optimize this bound, so the consistent net is the best possible.
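For reference, the agnostic analogue of the compression bound has the familiar form (up to constants and lower-order terms, and not necessarily the exact bound shown in the talk): with probability at least 1 − δ,

\mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h) + \sqrt{\frac{k\ln n + \ln(1/\delta)}{2(n-k)}},

where the empirical error is measured on the points outside the compression set.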

Generalization bounds. Upshot: even under margin assumptions, a sample of size exponential in the density dimension is required for some distributions.