Nearly optimal classification for semimetrics
Lee-Ad Gottlieb (Ariel University), Aryeh Kontorovich (Ben-Gurion University), Pinhas Nisnevitch (Tel Aviv University)
Classification problem

A fundamental problem in learning:
- Point space X
- Probability distribution P on X × {-1,+1}
- The learner observes a sample S of n points (x,y) drawn i.i.d. from P, and wants to predict the labels of other points in X
- It produces a hypothesis h: X → {-1,+1} with empirical error err̂(h) = (1/n)·|{i : h(xᵢ) ≠ yᵢ}| and true error err(h) = P(h(x) ≠ y)
- Goal: err(h) ≤ err̂(h) + ε uniformly over h, with high probability

[Figure: sample points labeled +1 and -1]
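To make these quantities concrete, here is a minimal Python sketch (illustrative only, not from the talk) that draws a labeled sample from a toy distribution and measures the empirical error of a fixed hypothesis; the distribution, noise rate, and hypothesis are all placeholder choices.

```python
import random

def draw_sample(n, noise=0.1):
    """n labeled points (x, y) i.i.d. from a toy P on [0,1] x {-1,+1}:
    the true label is +1 iff x > 0.5, flipped with probability `noise`."""
    S = []
    for _ in range(n):
        x = random.random()
        y = 1 if x > 0.5 else -1
        if random.random() < noise:
            y = -y
        S.append((x, y))
    return S

def empirical_error(h, S):
    """Fraction of sample points that the hypothesis h mislabels."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

h = lambda x: 1 if x > 0.5 else -1  # one candidate hypothesis
S = draw_sample(1000)
print(empirical_error(h, S))        # ~0.1, the label-noise level
```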
Generalization bounds

How do we upper bound the true error? Use a generalization bound. Roughly speaking (and with high probability):

  true error ≤ empirical error + √[(complexity of h)/n]

A more complex classifier is "easier" to fit to arbitrary data. One complexity measure is the VC-dimension: the size of the largest point set that can be shattered by the hypothesis class.

[Figure: a point set shattered by labelings +1/-1]
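As a toy numeric illustration of this bound (my own numbers, not from the talk; constants and log factors omitted):

```python
import math

def rough_bound(emp_err, complexity, n):
    """true_err <= emp_err + sqrt(complexity / n), with the complexity
    term played here by a VC-dimension d (constants omitted)."""
    return emp_err + math.sqrt(complexity / n)

for n in (100, 1000, 10000):
    print(n, round(rough_bound(0.05, 10, n), 3))
# The bound tightens as the sample grows: ~0.37, ~0.15, ~0.08
```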
Popular approach for classification

Assume the points are in Euclidean space!
Pros:
- Existence of an inner product
- Efficient algorithms (SVM)
- Good generalization bounds (max margin)
Cons:
- Many natural settings are non-Euclidean
- Euclidean structure is a strong assumption
Recent popular focus:
- Metric space data
- Semimetric space data
Metric space

(X, ρ) is a metric space if:
- X = set of points
- ρ = distance function ρ: X×X → ℝ
- Nonnegative: ρ(x,x′) ≥ 0, and ρ(x,x′) = 0 ⇔ x = x′
- Symmetric: ρ(x,x′) = ρ(x′,x)
- Triangle inequality: ρ(x,x′) ≤ ρ(x,x′′) + ρ(x′′,x′)
Semimetric space

(X, ρ) is a semimetric space if:
- X = set of points
- ρ = distance function ρ: X×X → ℝ
- Nonnegative: ρ(x,x′) ≥ 0, and ρ(x,x′) = 0 ⇔ x = x′
- Symmetric: ρ(x,x′) = ρ(x′,x)
- Triangle inequality: not required

inner product ⊂ norm ⊂ metric ⊂ semimetric
Semimetric examples

- Jensen-Shannon divergence
- Euclidean-squared (ℓ2²). Example: the points (0,0), (0,1), (2,2) have pairwise distances 1, 5, 8, and 8 > 1 + 5 violates the triangle inequality.
- Fractional ℓp spaces (p < 1), with ||a-b||p = (∑|ai-bi|^p)^(1/p). Example: for p = ½, the points (0,0), (0,2), (2,0) have pairwise distances 2, 2, 8, and 8 > 2 + 2.
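A quick Python check of the two triangle-inequality violations above, using exactly the points from the slide:

```python
def l2_squared(a, b):
    """Euclidean-squared semimetric."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lp(a, b, p):
    """Fractional l_p distance, a semimetric for p < 1."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

# Euclidean-squared on (0,0), (0,1), (2,2): distances 1, 5, 8
x, y, z = (0, 0), (0, 1), (2, 2)
print(l2_squared(x, z), ">", l2_squared(x, y) + l2_squared(y, z))  # 8 > 6

# l_{1/2} on (0,2), (0,0), (2,0): distances 2, 2, 8
x, y, z = (0, 2), (0, 0), (2, 0)
print(lp(x, z, 0.5), ">", lp(x, y, 0.5) + lp(y, z, 0.5))           # ~8 > 4
```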
Semimetric examples: Hausdorff distances

- Hausdorff distance: determined by the point in A farthest from B
- 1-rank Hausdorff distance: determined by the point in A closest to B
- k-rank Hausdorff distance: determined by the point in A k-th closest to B

[Figure: two point sets A and B]
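One plausible reading of these definitions in Python (my sketch under that assumption, not the paper's code): take each point of A's distance to its nearest point of B, and report the k-th smallest; k = |A| recovers the usual directed Hausdorff distance.

```python
def krank_directed(A, B, k, dist):
    """k-th smallest of {min_b dist(a, b) : a in A}.
    k = len(A) gives the usual directed Hausdorff distance."""
    nn_dists = sorted(min(dist(a, b) for b in B) for a in A)
    return nn_dists[k - 1]

def krank_hausdorff(A, B, k, dist):
    """Symmetrized k-rank Hausdorff semimetric."""
    return max(krank_directed(A, B, k, dist),
               krank_directed(B, A, k, dist))

euclid = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
A = [(0, 0), (1, 0), (5, 0)]
B = [(0, 1), (1, 1)]
print(krank_directed(A, B, 1, euclid))  # 1.0: point of A closest to B
print(krank_directed(A, B, 3, euclid))  # ~4.12: directed Hausdorff
```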
Semimetric examples

Note: semimetrics are often unintuitive. Example: diameter > 2·radius. For the ℓ½ points above, (0,0), (0,2), (2,0), the ball of radius 2 around (0,0) contains all three points, yet the diameter is ρ((0,2),(2,0)) = 8 > 2·2.
Classification for semimetrics?

Problem: no vector representation, hence no notion of dot product (and no kernel). What to do?
- Invent a kernel (e.g. embed into Euclidean space)? Provably high distortion!
- Use some NN heuristic? The NN classifier has infinite VC-dimension!
Our approach: sample compression + NN classification.
Result: strong generalization bounds.
Classification for semimetrics?

Central contribution: we discover that the complexity of the classifier is controlled by
- the margin γ of the sample, and
- the density dimension of the space,
with bounds close to optimal.
Density dimension

Definition: the ball B(x,r) consists of all points within distance r of x. The density constant of a semimetric space is the smallest value c such that every ball B(x,r) contains at most c points at mutual distance r/2; the density dimension is dens = log₂ c. It captures the number of distinct directions and the volume growth of the space.

Examples of the density dimension for a set of n d-dimensional vectors:
- Jensen-Shannon divergence: O(d)
- ℓp (p < 1): O(d/p)
- k-rank Hausdorff: O(k(d + log n))
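A brute-force Python estimate of the density constant on a finite point set (illustrative; the greedy packing only lower-bounds the largest packing, and a real implementation would be smarter):

```python
def packing_in_ball(points, center, r, dist):
    """Greedily collect points of B(center, r) at mutual distance >= r/2;
    the packing size lower-bounds the density constant."""
    ball = [p for p in points if dist(p, center) <= r]
    packing = []
    for p in ball:
        if all(dist(p, q) >= r / 2 for q in packing):
            packing.append(p)
    return packing

def density_constant_estimate(points, radii, dist):
    """Largest packing found over all centers and a grid of radii."""
    return max(len(packing_in_ball(points, c, r, dist))
               for c in points for r in radii)

euclid = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
pts = [(0, 0), (1, 0), (0, 1), (3, 3)]
print(density_constant_estimate(pts, radii=[1, 2, 4], dist=euclid))  # 3
```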
Classifier construction

Recall: compress the sample, then classify by NN. Initial approach: a classifier consistent with respect to the sample, that is, NN reconstructs the sample labels exactly. Solution: a γ-net N of the sample.
Classifier construction

Solution: a γ-net C of the sample S. It must be consistent. Brute-force construction: O(n²). Crucial question: how many points are in the net?
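A minimal greedy γ-net in Python (a sketch of the brute-force construction mentioned above, not the paper's optimized algorithm). Each kept point carries its own label, which is consistent whenever the sample has margin γ:

```python
def greedy_gamma_net(sample, gamma, dist):
    """sample: list of (point, label) pairs. Keeps a point only if it is
    more than gamma from every net point so far, so net points are
    pairwise > gamma apart and every sample point lies within gamma of
    the net. Runtime O(n * |net|) <= O(n^2)."""
    net = []
    for x, y in sample:
        if all(dist(x, c) > gamma for c, _ in net):
            net.append((x, y))
    return net

def nn_classify(x, net, dist):
    """Predict the label of the nearest net point."""
    return min(net, key=lambda cy: dist(x, cy[0]))[1]

euclid = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
S = [((0.0, 0.0), -1), ((0.1, 0.0), -1), ((1.0, 1.0), +1)]
net = greedy_gamma_net(S, gamma=0.5, dist=euclid)
print(len(net), nn_classify((0.9, 0.9), net, euclid))  # 2 1
```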
Classifier construction

Theorem: if C is an optimally small γ-net of the sample S, then |C| ≤ (radius(S)/γ)^dens(S).

Proof sketch (hierarchical construction):
- C ← at most 2^dens(S) points at mutual distance radius(S)/2
- Associate each point not in C with its NN in C
- Repeat, halving the scale each time, for log(radius(S)/γ) rounds
- Runtime: n·log(radius(S)/γ)

Optimality: the constructed net may not be optimally small, but it is NP-hard to do much better.
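The size bound follows from the halving construction by a direct count (my arithmetic, spelling out the theorem's statement): each of the $\log_2(\mathrm{radius}(S)/\gamma)$ rounds multiplies the net size by at most $2^{\mathrm{dens}(S)}$, so

$$|C| \;\le\; \left(2^{\mathrm{dens}(S)}\right)^{\log_2(\mathrm{radius}(S)/\gamma)} \;=\; \left(\frac{\mathrm{radius}(S)}{\gamma}\right)^{\mathrm{dens}(S)}.$$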
Generalization bounds

Upshot: the consistent γ-net classifier is a sample compression scheme whose size is at most (radius(S)/γ)^dens(S), and the generalization bound is stated in terms of this compression size.
Generalization bounds

Upshot: what if we allow the classifier εn errors on the sample? This yields a better bound, in terms of the achieved compression size k. But it is NP-hard to optimize that bound, so the consistent net is the best possible.
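For orientation only: this is the generic flavor of a sample compression bound with errors (e.g. in the style of Graepel-Herbrich-Shawe-Taylor), not necessarily the talk's exact statement. For compression size k and empirical error ε̂ on n points, with probability 1−δ,

$$\mathrm{err}(h) \;\lesssim\; \frac{n}{n-k}\,\hat{\varepsilon} \;+\; \sqrt{\frac{k\log n + \log(1/\delta)}{n-k}}.$$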
Generalization bounds

Lower bound: even under margin assumptions, a sample of size exponential in the density dimension is required for some distributions; hence the bounds above are nearly optimal.