Nearly optimal classification for semimetrics
Lee-Ad Gottlieb (Ariel U.), Aryeh Kontorovich (Ben Gurion U.), Pinhas Nisnevitch (Tel Aviv U.)
Classification problem
A fundamental problem in learning:
- Point space X
- Probability distribution P on X × {-1,+1}
- The learner observes a sample S of n points (x,y) drawn i.i.d. from P
- Wants to predict the labels of other points in X
- Produces a hypothesis h: X → {-1,+1} with empirical error êrr(h) = (1/n) Σ_i 1[h(x_i) ≠ y_i] and true error err(h) = Pr_(x,y)~P[h(x) ≠ y]
Goal: êrr(h) → err(h), uniformly over h, in probability
Generalization bounds
How do we upper bound the true error? Use a generalization bound.
Roughly speaking (and with high probability):
  true error ≤ empirical error + √(complexity of h / n)
More complex classifier ↔ "easier" to fit arbitrary data
VC-dimension: size of the largest point set that can be shattered by the hypothesis class
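For concreteness, one standard form of such a bound (not taken from this talk; constants and log factors vary across sources), with d the VC-dimension of the hypothesis class H and δ the confidence parameter:

```latex
\[
  \operatorname{err}(h) \;\le\; \widehat{\operatorname{err}}_n(h)
  \;+\; O\!\left(\sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}}\right)
  \qquad \text{for all } h \in H,\ \text{with probability} \ge 1-\delta.
\]
```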
Popular approach for classification
Assume the points are in Euclidean space!
Pros:
- Existence of an inner product
- Efficient algorithms (SVM)
- Good generalization bounds (max margin)
Cons:
- Many natural settings are non-Euclidean
- Euclidean structure is a strong assumption
Recent popular focus:
- Metric space data
- Semimetric space data
Semimetric space
First, recall: (X, ρ) is a metric space if:
- X = a set of points, ρ = a distance function ρ: X×X → ℝ
- Nonnegative: ρ(x,x′) ≥ 0, and ρ(x,x′) = 0 ⇔ x = x′
- Symmetric: ρ(x,x′) = ρ(x′,x)
- Triangle inequality: ρ(x,x′) ≤ ρ(x,x′′) + ρ(x′,x′′)
Semimetric space
(X, ρ) is a semimetric space if:
- X = a set of points, ρ = a distance function ρ: X×X → ℝ
- Nonnegative: ρ(x,x′) ≥ 0, and ρ(x,x′) = 0 ⇔ x = x′
- Symmetric: ρ(x,x′) = ρ(x′,x)
- Triangle inequality: not required
Hierarchy: inner product ⊂ norm ⊂ metric ⊂ semimetric
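As a side illustration (not from the talk), a minimal sketch that checks these axioms on a finite distance matrix and also reports whether the triangle inequality happens to hold:

```python
import numpy as np

def check_semimetric(D, tol=1e-9):
    """Check the semimetric axioms on a finite n x n distance matrix D.

    Returns booleans for nonnegativity, identity (zero exactly on the
    diagonal), symmetry, and -- not required for a semimetric -- whether
    the triangle inequality happens to hold.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    off_diag = D[~np.eye(n, dtype=bool)]
    nonnegative = bool((D >= -tol).all())
    identity = bool(np.allclose(np.diag(D), 0.0) and (off_diag > tol).all())
    symmetric = bool(np.allclose(D, D.T, atol=tol))
    # D[i, j] <= D[i, k] + D[k, j] for all i, k, j (checked by broadcasting).
    triangle = bool((D[:, None, :] <= D[:, :, None] + D[None, :, :] + tol).all())
    return {"nonnegative": nonnegative, "identity": identity,
            "symmetric": symmetric, "triangle_inequality": triangle}
```

Running it on the pairwise-distance matrices of the examples on the next slides would flag their triangle-inequality violations.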
Semimetric examples
- Shannon-Jensen divergence
- Euclidean-squared (ℓ₂²)
- Fractional ℓp spaces (p < 1), e.g. p = ½: ||a−b||_p = (Σ_i |a_i − b_i|^p)^(1/p)
[Figures: three-point examples violating the triangle inequality — under ℓ₂², the points (0,0), (0,1), (2,2) have pairwise distances 1, 5, 8; under ℓ_{1/2}, the points (0,0), (0,2), (2,0) have pairwise distances 2, 2, 8.]
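A minimal sketch (my own illustration using the standard definitions) that reproduces the pairwise distances in the two figures and exhibits the triangle-inequality violations:

```python
import numpy as np

def lp_dist(a, b, p):
    """Fractional l_p "distance" (p < 1): (sum_i |a_i - b_i|^p)^(1/p)."""
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p))

def sq_euclidean(a, b):
    """Euclidean-squared (l_2^2) distance."""
    return float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

# l_2^2 example: pairwise distances 1, 5, 8 -- triangle inequality fails (8 > 1 + 5).
pts = [(0, 0), (0, 1), (2, 2)]
print([sq_euclidean(pts[i], pts[j]) for i, j in [(0, 1), (1, 2), (0, 2)]])

# l_{1/2} example: pairwise distances 2, 2, 8 -- triangle inequality fails (8 > 2 + 2).
pts = [(0, 0), (0, 2), (2, 0)]
print([lp_dist(pts[i], pts[j], 0.5) for i, j in [(0, 1), (1, 2), (0, 2)]])
```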
Semimetric examples
Hausdorff-type distances between point sets A and B:
- Hausdorff distance: determined by the point in A farthest from B
- 1-rank Hausdorff distance: determined by the point in A closest to B
- k-rank Hausdorff distance: determined by the point in A k-th closest to B
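A sketch under my reading of the slide (an assumption: the distance from a point of A to the set B is the smallest base distance to any point of B; the paper's precise definition, e.g. how the directions A→B and B→A are combined, may differ):

```python
import numpy as np

def point_to_set(a, B, d):
    """Distance from point a to set B: the smallest d(a, b) over b in B."""
    return min(d(a, b) for b in B)

def k_rank_hausdorff(A, B, d, k=1):
    """k-rank (directed) Hausdorff-style distance from A to B, read off the slide:
    the distance achieved by the point of A that is k-th closest to B.
    k = 1 gives the closest point; k = len(A) gives the classic directed
    Hausdorff distance (the point of A farthest from B)."""
    dists = sorted(point_to_set(a, B, d) for a in A)
    return dists[k - 1]

# Example with squared Euclidean distance as the base semimetric (an assumption).
d = lambda a, b: float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))
A = [(0, 0), (1, 0), (5, 5)]
B = [(0, 1), (1, 1)]
print(k_rank_hausdorff(A, B, d, k=1))        # closest point of A to B
print(k_rank_hausdorff(A, B, d, k=len(A)))   # classic directed Hausdorff
```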
Semimetric examples
Note: semimetrics are often unintuitive.
Example: diameter > 2·radius is possible.
[Figure: under ℓ_{1/2}, the points (0,0), (0,2), (2,0) have pairwise distances 2, 2, 8, so the radius about (0,0) is 2 while the diameter is 8.]
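A short check of this example (my own illustration, assuming the fractional ℓ_{1/2} distance from the previous slide and taking the radius over centers chosen among the points):

```python
import itertools
import numpy as np

def lp_half(a, b):
    """Fractional l_{1/2} distance: (sum_i |a_i - b_i|^{1/2})^2."""
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** 0.5) ** 2)

pts = [(0, 0), (0, 2), (2, 0)]
# Diameter: the largest pairwise distance.
diameter = max(lp_half(a, b) for a, b in itertools.combinations(pts, 2))
# Radius: the smallest, over centers chosen among the points, of the
# distance to the farthest point.
radius = min(max(lp_half(c, p) for p in pts) for c in pts)
print(diameter, radius)  # ~8.0 and ~2.0, so diameter > 2 * radius
```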
Classification for semimetrics?
Problem: no vector representation, hence no notion of dot-product (and no kernel).
What to do?
- Invent a kernel (e.g. embed into Euclidean space)? Provably high distortion!
- Use some NN heuristic? The NN classifier has infinite VC-dimension!
Our approach: sample compression + NN classification.
Result: strong generalization bounds.
Classification for semimetrics?
Central contribution: we show that the complexity of the classifier is controlled by
- the margin γ of the sample, and
- the density dimension of the space,
with bounds close to optimal.
Density dimension
Definition: the ball B(x,r) is the set of all points within distance r of x. The density constant of a (semi)metric space is the minimum value c bounding the number of points in any ball B(x,r) that are at mutual distance r/2; the density dimension is dens = log₂ c.
Captures the number of distinct directions / volume growth.
Examples, for a set of n d-dimensional vectors:
- SJ divergence: O(d)
- ℓp (p < 1): O(d/p)
- k-rank Hausdorff: O(k(d + log n))
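A brute-force sketch (my own illustration) of estimating the density constant of a finite sample: for each candidate center and each radius given by a pairwise distance, greedily pack points of the ball at mutual distance at least r/2. A greedy packing is maximal but not necessarily maximum, so this only lower-bounds the true constant; the ≥ vs > thresholds are my assumption.

```python
import math
import numpy as np

def density_constant_lower_bound(points, d):
    """Greedy lower bound on the density constant of a finite sample.

    For every center x and every radius r taken from the pairwise distances,
    greedily pack points of the ball B(x, r) at mutual distance >= r/2 and
    record the largest packing found.
    """
    n = len(points)
    D = np.array([[d(points[i], points[j]) for j in range(n)] for i in range(n)])
    radii = sorted(set(D[D > 0].tolist()))
    best = 1
    for x in range(n):
        for r in radii:
            ball = [i for i in range(n) if D[x, i] <= r]
            packing = []
            for i in ball:  # greedy r/2-packing inside the ball
                if all(D[i, j] >= r / 2 for j in packing):
                    packing.append(i)
            best = max(best, len(packing))
    return best

# Toy usage with squared Euclidean distance as the base semimetric (an assumption).
rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))
d = lambda a, b: float(np.sum((a - b) ** 2))
c = density_constant_lower_bound(pts, d)
print("density constant >=", c, " density dimension >= about", math.log2(c))
```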
Classifier construction
Recall: compress the sample, then classify by NN.
Initial approach: a classifier consistent with the sample, i.e. NN reconstructs the sample labels exactly.
Solution: a γ-net C of the sample.
Classifier construction
Solution: a γ-net C of the sample S.
- Must be consistent with the sample labels
- Brute-force construction: O(n²)
Crucial question: how many points are in the net?
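A minimal sketch of the brute-force greedy construction (my own illustration, not the authors' code): keep a sample point only if it is at distance at least γ from every point kept so far; this takes O(n²) distance computations, and the kept points with their labels form the compressed NN classifier.

```python
def gamma_net(sample, d, gamma):
    """Greedy gamma-net of a labeled sample: brute force, O(n^2) distance calls.

    sample: list of (point, label) pairs; d: distance function; gamma: net scale.
    Every sample point ends up within gamma of some kept net point, and the
    kept net points are at mutual distance >= gamma.
    """
    net = []
    for x, y in sample:
        if all(d(x, c) >= gamma for c, _ in net):
            net.append((x, y))
    return net

def nn_classify(x, net, d):
    """Predict the label of x by its nearest neighbor in the compressed net."""
    return min(net, key=lambda point_label: d(x, point_label[0]))[1]
```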
Classifier construction
Theorem: if C is an optimally small γ-net of the sample S, then |C| ≤ (radius(S)/γ)^dens(S).
Proof sketch (and construction):
- Take at most 2^dens(S) points at mutual distance radius(S)/2
- Associate each remaining point with its NN among them
- Repeat, halving the scale, log(radius(S)/γ) times
Runtime: n·log(radius(S)/γ)
Optimality: the constructed net may not be optimally small, but it is NP-hard to do much better.
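Filling in the arithmetic behind the size bound: each of the log₂(radius(S)/γ) halving steps multiplies the number of net points by at most 2^dens(S), so

```latex
\[
  |C| \;\le\; \Bigl(2^{\mathrm{dens}(S)}\Bigr)^{\log_2\bigl(\mathrm{radius}(S)/\gamma\bigr)}
       \;=\; \Bigl(\tfrac{\mathrm{radius}(S)}{\gamma}\Bigr)^{\mathrm{dens}(S)}.
\]
```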
Generalization bounds
Upshot: the net C acts as a sample compression scheme, so its size directly bounds the true error of the NN classifier.
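For context, a standard compression-style bound of the kind this argument relies on (a generic form, not copied from the talk; constants and log factors differ across statements): for a compression set of size k that is consistent with the rest of the sample, with probability at least 1−δ,

```latex
\[
  \operatorname{err}(h) \;\le\; \frac{k \ln n + \ln(1/\delta)}{n - k},
  \qquad\text{and here}\quad
  k = |C| \;\le\; \Bigl(\tfrac{\mathrm{radius}(S)}{\gamma}\Bigr)^{\mathrm{dens}(S)}.
\]
```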
Generalization bounds
What if we allow the classifier ε·n errors on the sample?
Better bound: the complexity term then depends on the achieved compression size k rather than on the worst-case net size.
But it is NP-hard to optimize this bound, so the consistent net is the best we can hope to compute.
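Again for context only, a representative shape of a compression bound that tolerates sample errors (my paraphrase of standard results; the bound actually used in the paper may differ): with ε·n sample errors and achieved compression size k, with probability at least 1−δ,

```latex
\[
  \operatorname{err}(h) \;\lesssim\; \frac{\varepsilon n}{n - k}
  \;+\; \sqrt{\frac{k \log n + \log(1/\delta)}{n - k}}.
\]
```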
Generalization bounds
Upshot: even under margin assumptions, a sample of size exponential in the density dimension is required for some distributions.