Adaptive Metric Dimensionality Reduction
Aryeh Kontorovich (Ben-Gurion U.)
Joint work with: Lee-Ad Gottlieb (Ariel U.) and Robert Krauthgamer (Weizmann Institute)
Setting: supervised binary classification in a metric space
- Instance (metric) space (𝒳, d)
- Probability distribution P on 𝒳 × {-1,+1} (agnostic PAC -- think noisy concept)
- Learner observes a sample S of n points (x, y) drawn i.i.d. from P and produces a hypothesis h: 𝒳 → {-1,+1}
- Generalization error: P[h(X) ≠ Y]
Metric space
(𝒳, d) is a metric space if
- 𝒳 = set of points
- d = distance function d: 𝒳 × 𝒳 → ℝ, which is
  - nonnegative, with d(x,x′) = 0 ⇔ x = x′
  - symmetric: d(x,x′) = d(x′,x)
  - and satisfies the triangle inequality: d(x,x′) ≤ d(x,z) + d(z,x′)
"No coordinates -- just distances"
- inner product ⇒ norm: ||x||² = ⟨x,x⟩
- norm ⇒ metric: d(x,x′) = ||x − x′||
- the converse implications do NOT hold
Take-away: the metric assumption is far less restrictive.
How to classify in a metric space? Nearest neighbors! (some variant of)
[Figure: distances between Tel Aviv, Singapore, London]
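To make the "no coordinates -- just distances" point concrete, here is a minimal sketch (my illustration, not from the talk) of a metric with no vector representation at all: edit distance on strings. The function name and example strings are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: a metric on strings, defined with no coordinates at all."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution (free if chars match)
        prev = cur
    return prev[-1]

# Distances are all we have -- and that is enough for nearest-neighbor classification.
print(edit_distance("kitten", "sitting"))   # 3
```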
Curse of dimensionality
Learning in high dimensions is hard:
- Statistically: many examples are needed
- Computationally: building a classifier is expensive
"Real" data tends to have
- high ambient dimension
- low intrinsic dimension
Challenge: exploit low intrinsic dimensionality, both statistically and computationally.
This talk: some of the first such results in supervised learning.
Nearest-Neighbor Classifier
h_NN(x) = label of the sample point closest to x
…is terrific!
- One of the oldest classification algorithms
- "Simple"
- Requires minimal geometric structure (a metric)
- Suitable for multi-class problems
- Asymptotically consistent (expected error at most twice the optimal Bayes rate)
…but has statistical and computational drawbacks:
- Infinite VC-dimension
- A distribution-free rate is impossible
- Exact computation requires Ω(n) time
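A minimal 1-NN sketch over an arbitrary metric (my own illustration; any distance function works). The brute-force scan is exactly the Ω(n)-per-query cost mentioned above; names are illustrative.

```python
from typing import Callable, List, Tuple, TypeVar

X = TypeVar("X")

def nn_classify(sample: List[Tuple[X, int]],
                d: Callable[[X, X], float],
                x: X) -> int:
    """1-NN in a generic metric space: return the label of the closest sample point.
    Brute force: one distance evaluation per training point (Omega(n) per query)."""
    _, label = min(((d(x, xi), yi) for xi, yi in sample), key=lambda t: t[0])
    return label

# Toy usage on the real line with |a - b| as the metric (labels are made up):
sample = [(0.1, +1), (0.4, +1), (2.0, -1), (2.5, -1)]
print(nn_classify(sample, lambda a, b: abs(a - b), 0.7))   # nearest point is 0.4 -> +1
```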
[Cover & Hart 1967; Ben-David & Shalev-Shwartz]
- η(x) = P[Y=1 | X=x], assumed Lipschitz continuous: |η(x) − η(x′)| ≤ L·d(x,x′)
- Bayes-optimal classifier: threshold η at 1/2, i.e. h*(x) = sign(η(x) − 1/2)
- Bayes error: err(h*) = E[min{η(X), 1−η(X)}]
Theorem: for the metric space ([0,1]^k, ‖·‖₂),
  E[err(h_NN)] ≤ 2·err(h*) + O(L·k^{1/2} / n^{1/(k+1)})
Tightness [curse of dimensionality]: need n = Ω((L+1)^k).
There exists a distribution with err(h*) = 0 such that n ≤ (L+1)^k / 2 implies E[err(h_NN)] > 1/4.
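To see the scale of the lower bound, a quick back-of-the-envelope calculation (the numbers are chosen for illustration, not taken from the talk):

  L = 3, k = 20:   n = Ω((L+1)^k) = Ω(4^{20}) ≈ Ω(10^{12}),

so even a modest Lipschitz constant in a 20-dimensional cube demands an astronomical sample before the nearest-neighbor guarantee kicks in.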
A Newer Look
Let's take a richer hypothesis class: f maps to [-1,+1] instead of {-1,+1}; classify by thresholding at zero.
[von Luxburg & Bousquet, JMLR '04]
- View the sample S = {(X_i, Y_i)} as evaluations of a [-1,+1]-valued function
- Lipschitz-extend the {-1,+1} data to a [-1,+1] function on the whole space
- Margin d(S+, S−) = inverse Lipschitz constant
- Algorithmically realized by nearest neighbor
Left open:
- Efficient NN search
- Smoothing / denoising / Structural Risk Minimization / Regularization
GKK'2010 addressed these issues.
[Figure: margin d(S+, S−) between the positive and negative sample points]
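A minimal sketch of the Lipschitz-extension idea (my own illustration of the construction, not code from the talk): extend the ±1 labels with one convenient extension (McShane's) at the smallest feasible Lipschitz constant L = 2/d(S+, S−), then threshold at zero. The talk notes the resulting classifier can be realized by nearest neighbor; all names below are made up.

```python
from typing import Callable, List, Tuple, TypeVar

X = TypeVar("X")

def lipschitz_extension_classifier(sample: List[Tuple[X, int]],
                                   d: Callable[[X, X], float]):
    pos = [x for x, y in sample if y == +1]
    neg = [x for x, y in sample if y == -1]
    # Margin = distance between the two label classes; 2/margin is the smallest
    # Lipschitz constant consistent with the +/-1 labels on the sample.
    margin = min(d(p, q) for p in pos for q in neg)
    L = 2.0 / margin

    def f(x: X) -> float:
        # McShane extension: L-Lipschitz and agrees with the labels on the sample.
        return min(y + L * d(x, xi) for xi, y in sample)

    return lambda x: +1 if f(x) >= 0 else -1

# Toy usage on the real line with |a - b| as the metric:
clf = lipschitz_extension_classifier([(0.0, +1), (1.0, +1), (3.0, -1)],
                                     lambda a, b: abs(a - b))
print(clf(1.5), clf(2.6))   # +1 -1
```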
Doubling Dimension
Definition: the ball B(x,r) = all points within distance r of x.
- The doubling constant λ of a metric space 𝒳 is the minimum value such that every ball can be covered by λ balls of half the radius. First used by [Ass-83], algorithmically by [Cla-97].
- The doubling dimension is ddim(𝒳) = log₂(λ) [GKL-03].
- A metric is doubling if its doubling dimension is finite.
- Euclidean: ddim(ℝⁿ) = O(n).
Summary:
- Intimately connected to covering numbers
- Analogue of Euclidean dimension
- In geometry, one of many metric dimensions; in CS, basically just ddim
[Figure: an example ball requiring λ ≥ 7 half-radius balls to cover it]
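A rough empirical sketch of the definition (my illustration, assuming a finite sample of points): for a ball B(x, r), greedily build an r/2-net of the points inside it; the net's size upper-bounds the number of half-radius balls needed for that ball, and the maximum over sampled balls gives a crude estimate of λ (and its log₂ estimates ddim).

```python
import math
import random
from typing import Callable, List, TypeVar

X = TypeVar("X")

def greedy_half_radius_cover(points: List[X], center: X, r: float,
                             d: Callable[[X, X], float]) -> int:
    """Number of r/2-balls a greedy cover uses for the points inside B(center, r)."""
    ball = [p for p in points if d(p, center) <= r]
    centers: List[X] = []
    for p in ball:
        if all(d(p, c) > r / 2 for c in centers):   # p not yet covered: open a new ball at p
            centers.append(p)
    return len(centers)

def estimate_ddim(points: List[X], d: Callable[[X, X], float],
                  radii=(0.25, 0.5, 1.0), trials: int = 50) -> float:
    lam = 1
    for _ in range(trials):
        x = random.choice(points)
        r = random.choice(radii)
        lam = max(lam, greedy_half_radius_cover(points, x, r, d))
    return math.log2(lam)

# Toy usage: points on a line have low doubling dimension, and the estimate reflects that.
pts = [i / 100 for i in range(100)]
print(estimate_ddim(pts, lambda a, b: abs(a - b)))
```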
NN excess risk
- Previous: O(L·k^{1/2} / n^{1/(ddim+1)})
- GKK'10: O([L^{ddim} log(n) / n]^{1/2})
Metric Dimensionality Reduction
Runtime and sample complexity are exponential in ddim:
- (1+ε)-approximate nearest-neighbor search in time 2^{ddim} log n + ε^{-ddim}
- Generalization bounds decay as min{n^{-1/ddim}, L^{ddim} n^{-1/2}}
All existing bounds work with the ambient dimension -- they are insensitive to the intrinsic data dimension.
- What if the intrinsic data dimension is much lower than the ambient one?
- What if the data is only close to being low-dimensional?
Principal Components Analysis (PCA)
- Data {X_i} with X_i ∈ ℝ^N
- A k-dimensional subspace T ⊂ ℝ^N induces distortion ε := Σ_i ||X_i − P_T(X_i)||²  [P_T(·) = orthogonal projection onto T]
- Dimension k and the (optimal) distortion have a simple relationship: ε = Σ_{j=k+1}^{N} σ_j², where σ_1 ≥ σ_2 ≥ … ≥ σ_N are the singular values of the data matrix X
Uses of PCA:
- Denoising
- Discovering the "inherent" dimension of the data
How to choose the cutoff k?
- Heuristics (such as looking for a "jump" σ_j ≫ σ_{j+1})
- To our knowledge, no principled guidelines
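A small numpy sketch (mine, not from the talk) verifying the distortion/singular-value relationship: project onto the top-k right-singular subspace of the (uncentered) data matrix and compare the residual Σ_i ||X_i − P_T(X_i)||² with Σ_{j>k} σ_j². The synthetic data and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, k = 200, 50, 5
# Data that is approximately k-dimensional, plus a little ambient noise.
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, N)) + 0.01 * rng.normal(size=(n, N))

# SVD of the data matrix; rows of Vt span the optimal projection subspaces.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

V_k = Vt[:k]                      # top-k right singular vectors
X_proj = X @ V_k.T @ V_k          # P_T(X_i) for the optimal k-dimensional subspace T
distortion = np.sum((X - X_proj) ** 2)

print(distortion, np.sum(s[k:] ** 2))   # the two quantities agree (Eckart-Young)
```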
PCA and supervised classification
- Labeled sample (X_i, Y_i) with X_i ∈ ℝ^N and Y_i ∈ {-1,+1}
- Ambient space is high-dimensional: N ≫ 1
- Common heuristic: run PCA prior to SVM
Benefits:
- Computational: everything is faster in lower dimensions!
- Statistical (less well understood): denoising (heuristic); better generalization guarantees?
Drawbacks:
- Theoretically unmotivated
- What's the "right" dimension / singular-value cutoff?
- Won't this mess up the margin?
Principal Components Analysis
- Labeled sample (X_i, Y_i): n sample points with ||X_i|| ≤ 1 in ℝ^N, Y_i ∈ {-1,+1}
Thm [GKK'2013]: For all δ > 0, with probability ≥ 1−δ: for all separating hyperplanes ||w|| ≤ 1 in ℝ^N and all subspaces T ⊂ ℝ^N with dim(T) = k incurring distortion ε = Σ_i ||X_i − P_T(X_i)||², we have
  E[L_hinge(w·X, Y)] ≤ (1/n) Σ_i L_hinge(w·X_i, Y_i) + 34(k/n)^{1/2} + 2(ε/n)^{1/2} + 3[log(2/δ)/(2n)]^{1/2}
- The distortion plays the role of an inverse margin
- To our knowledge, the first rigorous (principled) guide to the PCA cutoff
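A sketch (my own, under simplifying assumptions) of how the theorem suggests choosing the PCA cutoff: compute the distortion ε_k for every k from the singular values and pick the k minimizing the dimension-plus-distortion penalty 34(k/n)^{1/2} + 2(ε_k/n)^{1/2}. I treat the empirical hinge loss as roughly constant across cutoffs, which is a simplification of mine, not part of the theorem; the data is synthetic.

```python
import numpy as np

def pca_cutoff_from_bound(X: np.ndarray) -> int:
    """Pick the PCA dimension k minimizing 34*sqrt(k/n) + 2*sqrt(eps_k/n),
    where eps_k is the sum of squared singular values beyond the k-th."""
    n = X.shape[0]
    s = np.linalg.svd(X, compute_uv=False)
    # eps_k for k = 0..len(s): tail sums of the squared singular values.
    tail = np.concatenate([np.cumsum((s ** 2)[::-1])[::-1], [0.0]])
    ks = np.arange(len(tail))
    penalty = 34 * np.sqrt(ks / n) + 2 * np.sqrt(tail / n)
    return int(np.argmin(penalty))

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3)) @ rng.normal(size=(3, 40)) + 0.05 * rng.normal(size=(5000, 40))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||X_i|| <= 1 as in the theorem
print(pca_cutoff_from_bound(X))                 # should recover the planted dimension, 3
```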
General Metric Spaces
- Metric space (𝒳, d), labeled sample (X_i, Y_i)
- Distortion ε = Σ_i d(X_i, Ẋ_i), where {Ẋ_i} is a "perturbed set"
- Dimensionality reduction: ddim({Ẋ_i}) < ddim({X_i}) ≤ log₂ n
- Optimal tradeoff dictated by the data via Rademacher analysis:
  R_n = O(L(1+ε) n^{-1/D})   [Dudley's chaining technique]
  where L = Lipschitz constant of the induced hypothesis, ε = distortion, D = ddim({Ẋ_i}) = intrinsic data dimension
- Generalization performance does not depend on the ambient dimension ddim({X_i})
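Plugging this Rademacher estimate into the standard generalization bound (a standard inequality, stated schematically here rather than quoted from the talk): with probability ≥ 1−δ,

  err(h) ≤ êrr_n(h) + 2R_n + [ln(1/δ)/(2n)]^{1/2} = êrr_n(h) + O(L(1+ε) n^{-1/D}) + [ln(1/δ)/(2n)]^{1/2},

so the rate in n is governed by the intrinsic dimension D of the perturbed set, while the Lipschitz constant L and the distortion ε enter only as multiplicative factors.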
Dimensionality reduction algorithm
- Given a point set and a target dimension d: what is the smallest distortion possible?
- An exact solution seems hard
- We give an (O(1), O(1))-bicriteria approximation
Hierarchies
Every discrete space admits a point hierarchy:
- Level 0: each point is the center of a 1-radius ball
- Level 1: all 1-radius balls are covered by 2-radius balls (covering)
- Balls have a minimum interpoint distance (packing)
- Level 2: one big ball of radius 4
- Properties: covering, packing, nested
Key property: in a doubling hierarchy, each ball neighbors only a small number of other balls (at each level).
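A minimal sketch (my illustration, not the talk's implementation) of building such a hierarchy greedily: level i keeps a 2^i-net of the level below it, so the levels are nested, each level is a packing at its scale, and each level covers the one below.

```python
from typing import Callable, Dict, List, TypeVar

X = TypeVar("X")

def build_hierarchy(points: List[X], d: Callable[[X, X], float]) -> Dict[int, List[X]]:
    """Greedy net hierarchy: level i+1 is a 2^(i+1)-net (packing + covering) of level i."""
    levels = {0: list(points)}            # level 0: every point (radius-1 balls)
    i = 0
    while len(levels[i]) > 1:
        r = 2.0 ** (i + 1)
        net: List[X] = []
        for p in levels[i]:               # greedy: keep p unless it is already covered
            if all(d(p, c) > r for c in net):
                net.append(p)
        i += 1
        levels[i] = net                   # nested: the net is a subset of the level below
    return levels

# Toy usage on the real line:
pts = [0.0, 0.7, 1.5, 3.0, 6.5, 7.0, 13.0]
for lvl, centers in build_hierarchy(pts, lambda a, b: abs(a - b)).items():
    print(lvl, centers)
```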
Integer program
- Consider a hierarchy over the training sample
- An integer program extracts a sub-hierarchy with small doubling dimension
- Indicator variable z_j^i represents point x_j at level i
- Let N_j^i be the set of i-level points close to point x_j
Minimize the cost Σ_j c_j, subject to the constraints:
- z_j^i ∈ {0,1}   (is x_j present in level i?)
- z_j^i ≤ z_j^{i−1}   (nested)
- z_j^i ≤ |N_j^{i+1}|   (covering)
- |N_j^i| ≤ 2^d   (small target doubling dimension)
- c_j ≥ 2^i [(1 − z_j^0) − |N_j^{i+1}|]   (c_j is a proxy for the cost of deleting x_j)
Linear program
Solving the integer program is somewhat involved, so we use a bicriteria algorithm (approximate cost, approximate dimension):
- Linear relaxation: z_j^i ∈ [0,1]
- Rounding scheme
- Runtime: 2^{O(ddim)} + O(n log⁴ n)
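A schematic of the relax-and-round pattern on a made-up covering-style toy instance (this is a generic illustration of the pattern, not the exact program or rounding scheme from the talk): solve the relaxed LP, then round the fractional indicators.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance (made up): z_j = 1 means "keep point j", cost_j = price of deleting it.
cost = np.array([3.0, 1.0, 2.0, 1.0, 5.0])
neighborhoods = [[0, 1, 2], [0, 1, 2, 3], [1, 2, 3], [2, 3, 4], [3, 4]]
cap = 2          # stand-in for the 2^d bound on kept neighbors

# Constraint: for every point j, the number of kept neighbors is at most cap.
A_ub = np.zeros((len(neighborhoods), len(cost)))
for j, nbrs in enumerate(neighborhoods):
    A_ub[j, nbrs] = 1.0
b_ub = np.full(len(neighborhoods), cap)

# Deleting point j costs cost_j * (1 - z_j); dropping the constant, minimize -cost . z
# over the linear relaxation 0 <= z_j <= 1.
res = linprog(c=-cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(cost), method="highs")
z_frac = res.x
z_rounded = (z_frac >= 0.5).astype(int)   # naive rounding; the cap may be exceeded by an
                                          # O(1) factor -- hence the "bicriteria" guarantee
print(z_frac, z_rounded)
```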
Thank you! Questions?