
Adaptive Metric Dimensionality Reduction. Aryeh Kontorovich (Ben Gurion U.), joint work with Lee-Ad Gottlieb (Ariel U.) and Robert Krauthgamer (Weizmann Institute).


2 Adaptive Metric Dimensionality Reduction
Aryeh Kontorovich (Ben Gurion U.)
Joint work with: Lee-Ad Gottlieb (Ariel U.), Robert Krauthgamer (Weizmann Institute)

3 Setting: supervised binary classification in a metric space
• Instance (metric) space $(\mathcal{X}, d)$
• Probability distribution $P$ on $\mathcal{X} \times \{-1, 1\}$ (agnostic PAC: think noisy concept)
• Learner observes a sample $S$ of $n$ points $(x, y)$ drawn iid from $P$ and produces a hypothesis $h: \mathcal{X} \to \{-1, 1\}$
• Generalization error: $P[h(X) \ne Y]$

4 Metric space
$(\mathcal{X}, d)$ is a metric space if
• $\mathcal{X}$ = set of points
• $d$ = distance function $d: \mathcal{X}^2 \to \mathbb{R}$
  • Nonnegative: $d(x, x') \ge 0$, with $d(x, x') = 0 \iff x = x'$
  • Symmetric: $d(x, x') = d(x', x)$
  • Triangle inequality: $d(x, x') \le d(x, z) + d(z, x')$
"No coordinates, just distances"
• inner product ⇒ norm: $\|x\|^2 = \langle x, x \rangle$
• norm ⇒ metric: $d(x, x') = \|x - x'\|$
• The converse implications do NOT hold
Take-away: the metric assumption is far less restrictive
How to classify in a metric space? (Some variant of) nearest neighbors!
[Figure: pairwise distances between Tel Aviv, Singapore, and London: 7955, 3553, 10847]

5 Curse of dimensionality
Learning in high dimensions is hard
• Statistically: many examples are needed
• Computationally: building a classifier is expensive
"Real" data tends to have
• High ambient dimension
• Low intrinsic dimension
Challenge: exploit low intrinsic dimensionality statistically and computationally
This talk: some of the first such results in supervised learning

6 Nearest-Neighbor Classifier
$h_{NN}(x)$ = label of the sample point closest to $x$
...is terrific!
• One of the oldest classification algorithms
• "Simple"
• Requires minimal geometric structure (a metric)
• Suitable for multi-class
• Asymptotically near-optimal (expected error at most twice the Bayes-optimal rate)
...but has statistical and computational drawbacks
• Infinite VC-dimension
• Distribution-free rate is impossible
• Exact computation requires time $\Omega(n)$
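
A minimal Python sketch of the nearest-neighbor rule on this slide. It assumes the sample is a list of (point, label) pairs and that the metric is passed in as a function; the names `nn_classify` and `abs_dist` and the toy data are illustrative, not from the talk. The brute-force scan evaluates one distance per sample point, which is the linear cost the slide refers to.

```python
def nn_classify(x, sample, dist):
    """Return the label of the sample point closest to x under the metric dist."""
    if not sample:
        raise ValueError("sample must be non-empty")
    # Only distances are used: no coordinates, norms, or inner products.
    _, nearest_label = min(sample, key=lambda pair: dist(x, pair[0]))
    return nearest_label


def abs_dist(a, b):
    """Toy metric on the real line (illustrative assumption)."""
    return abs(a - b)


# Toy labeled sample: two negative points, two positive points.
sample = [(0.0, -1), (1.0, -1), (4.0, +1), (5.0, +1)]

if __name__ == "__main__":
    for query in (0.7, 3.2):
        print(query, nn_classify(query, sample, abs_dist))
```

Any function satisfying the metric axioms (for example an edit distance on strings) can be substituted for `abs_dist` without changing the classifier.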

7 Cover, Hart, 1967; Ben-David, Shalev-Shwartz, 2014+
• $\eta(x) = P[Y = 1 \mid X = x]$, assumed Lipschitz continuous: $|\eta(x) - \eta(x')| \le L\, d(x, x')$
• Bayes optimal classifier: threshold $\eta(x)$ at $1/2$, i.e. $h^*(x) = \mathrm{sign}(\eta(x) - 1/2)$
• Bayes error: $\mathrm{err}(h^*) = \mathbb{E}[\min\{\eta(X), 1 - \eta(X)\}]$
Theorem: for the metric space $([0,1]^k, \|\cdot\|_2)$,
$\mathbb{E}[\mathrm{err}(h_{NN})] \le 2\,\mathrm{err}(h^*) + O(L k^{1/2} / n^{1/(k+1)})$
Tightness [curse of dimensionality]: need $n = \Omega((L+1)^k)$
• There exists a distribution for which $\mathrm{err}(h^*) = 0$ but $n \le (L+1)^k / 2 \Rightarrow \mathbb{E}[\mathrm{err}(h_{NN})] > 1/4$

8 A Newer Look
Let's take a richer hypothesis class
• $f$ maps to $[-1, +1]$ instead of $\{-1, +1\}$
• Classify by thresholding at zero
[von Luxburg & Bousquet JMLR '04]
• View the sample $S = \{(X_i, Y_i)\}$ as evaluations of a $[-1, +1]$-valued function
• Lipschitz-extend the $\{-1, +1\}$ data to a $[-1, +1]$-valued function on the whole space
• The margin $d(S_+, S_-)$ is inversely proportional to the Lipschitz constant
• Algorithmically realized by nearest neighbor
Left open
• Efficient NN search
• Smoothing / denoising / Structural Risk Minimization / regularization
GKK'2010 addressed these issues
[Figure: positive and negative sample points separated by the margin $d(S_+, S_-)$]
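
A small Python sketch of the Lipschitz-extension view on this slide. It uses the classical McShane-Whitney construction (the average of the largest and smallest $L$-Lipschitz extensions) with $L = 2 / d(S_+, S_-)$; that specific construction, the function names, and the toy data are assumptions made for illustration and are not taken from the talk. Thresholding the extension at zero agrees with the nearest-neighbor label whenever the query is strictly closer to one class.

```python
def abs_dist(a, b):
    """Toy metric on the real line (illustrative assumption)."""
    return abs(a - b)


def lipschitz_extension(x, sample, dist):
    """Evaluate a Lipschitz extension of the +/-1 labels at x.

    Assumes both classes are present and the margin d(S+, S-) is positive.
    """
    pos = [p for p, y in sample if y == +1]
    neg = [p for p, y in sample if y == -1]
    margin = min(dist(p, q) for p in pos for q in neg)
    lip = 2.0 / margin  # labels differ by 2 across the margin
    upper = min(y + lip * dist(x, p) for p, y in sample)  # largest extension
    lower = max(y - lip * dist(x, p) for p, y in sample)  # smallest extension
    return 0.5 * (upper + lower)


sample = [(0.0, -1), (1.0, -1), (4.0, +1), (5.0, +1)]

if __name__ == "__main__":
    for query in (0.7, 2.5, 3.2, 4.0):
        print(query, lipschitz_extension(query, sample, abs_dist))
```

On this toy sample the extension equals the labels at the sample points, vanishes at the midpoint of the margin, and takes the sign of the nearer class elsewhere.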

9 Doubling Dimension
Definition: the ball $B(x, r)$ is the set of all points within distance $r$ from $x$.
The doubling constant $\lambda$ (of a metric $\mathcal{X}$) is the minimum value $\lambda$ such that every ball can be covered by $\lambda$ balls of half the radius
• First used by [Ass-83], algorithmically by [Cla-97]
• The doubling dimension is $\mathrm{ddim}(\mathcal{X}) = \log_2 \lambda(\mathcal{X})$ [GKL-03]
• A metric is doubling if its doubling dimension is finite
• Euclidean: $\mathrm{ddim}(\mathbb{R}^n) = O(n)$
Summary
• Intimately connected to covering numbers
• Analogue of Euclidean dimension
  • In geometry, one of many metric dimensions
  • In CS, basically just ddim
[Figure: a ball covered by balls of half the radius; here $\lambda \ge 7$]
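
A rough Python sketch of how the doubling constant of a finite point set might be estimated. For each examined ball it greedily picks centers inside the ball until every point lies within half the radius of some center, and it reports the largest count seen. This is a heuristic under stated assumptions (greedy covers need not be minimal, and only the supplied radii are examined), and the function names are hypothetical rather than from the talk.

```python
import math


def greedy_half_radius_cover(ball, r, dist):
    """Number of greedily chosen centers whose (r/2)-balls cover the given ball."""
    centers = []
    for p in ball:
        # p becomes a new center only if no existing center covers it.
        if all(dist(p, c) > r / 2 for c in centers):
            centers.append(p)
    return len(centers)


def greedy_doubling_estimate(points, radii, dist):
    """Largest greedy cover size over all balls B(x, r), x in points, r in radii."""
    worst = 1
    for x in points:
        for r in radii:
            ball = [p for p in points if dist(x, p) <= r]
            worst = max(worst, greedy_half_radius_cover(ball, r, dist))
    return worst


def abs_dist(a, b):
    """Toy metric on the real line (illustrative assumption)."""
    return abs(a - b)


if __name__ == "__main__":
    points = list(range(8))  # eight evenly spaced points on a line
    lam = greedy_doubling_estimate(points, radii=[1, 2, 4], dist=abs_dist)
    print("estimated doubling constant:", lam)
    print("estimated doubling dimension:", math.log2(lam))
```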

10 NN excess risk
• Previous: $O(L k^{1/2} / n^{1/(\mathrm{ddim}+1)})$
• GKK'10: $O([L^{\mathrm{ddim}} \log(n) / n]^{1/2})$

11 Metric Dimensionality Reduction
Runtime and sample complexity are exponential in ddim
• $(1+\varepsilon)$-approximate nearest neighbor search in time $2^{\mathrm{ddim}} \log n + \varepsilon^{-\mathrm{ddim}}$
• Generalization bounds decay as $\min\{n^{-1/\mathrm{ddim}}, L^{\mathrm{ddim}} n^{-1/2}\}$
All existing bounds work with the ambient dimension
• Insensitive to the intrinsic data dimension
What if the intrinsic data dimension is much lower than the ambient one?
What if the data is close to being low-dimensional?

12 Principal Components Analysis (PCA)
• Data $\{X_i\}$ with $X_i \in \mathbb{R}^N$
• A $k$-dimensional subspace $T \subset \mathbb{R}^N$ induces distortion $\eta$:
  $\eta = \sum_i \|X_i - P_T(X_i)\|^2$   [$P_T(\cdot)$ = orthogonal projection onto $T$]
• The dimension $k$ and the (optimal) distortion $\eta$ have a simple relationship:
  $\eta = \sum_{j=k+1}^{N} \sigma_j^2$, where $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_N$ are the singular values of the data matrix $X$
Uses of PCA:
• Denoising
• Discovering the "inherent" dimension of the data
How to choose the cutoff $k$?
• Heuristics (such as looking for a "jump" $\sigma_j \gg \sigma_{j+1}$)
• To our knowledge, no principled guidelines

13 PCA and supervised classification
• Labeled sample $(X_i, Y_i)$ with $X_i \in \mathbb{R}^N$ and $Y_i \in \{-1, +1\}$
• Ambient space is high-dimensional: $N \gg 1$
Common heuristic: run PCA prior to SVM.
Benefits
• Computational: everything is faster in lower dimensions!
• Statistical (less well-understood)
  • Denoising (heuristic)
  • Better generalization guarantees?
Drawbacks
• Theoretically unmotivated
• What's the "right" dimension / singular value cutoff?
• Won't this mess up the margin?

14 Principal Components Analysis
Labeled sample $(X_i, Y_i)$
• $n$ sample points
• $\|X_i\| \le 1$ in $\mathbb{R}^N$
• $Y_i \in \{-1, +1\}$
Thm [GKK'2013]: for all $\delta > 0$, with probability $\ge 1 - \delta$:
• for all separating hyperplanes $\|w\| \le 1$ in $\mathbb{R}^N$
• for all subspaces $T \subset \mathbb{R}^N$ with $\dim(T) = k$
• which incur distortion $\eta = \sum_i \|X_i - P_T(X_i)\|^2$
we have
$\mathbb{E}[L_{\mathrm{hinge}}(w \cdot X, Y)] \le \frac{1}{n} \sum_i L_{\mathrm{hinge}}(w \cdot X_i, Y_i) + 34 (k/n)^{1/2} + 2 (\eta/n)^{1/2} + 3 [\log(2/\delta) / (2n)]^{1/2}$
• The distortion $\eta$ plays the role of inverse margin
• To our knowledge, the first rigorous guide to the PCA cutoff
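
A hedged Python sketch of how the bound on this slide could guide the PCA cutoff. The empirical hinge-loss term does not depend on the subspace, so the sketch minimizes only the remaining terms over $k$, using the optimal distortion $\eta = \sum_{j>k} \sigma_j^2$ for each $k$. The constants 34, 2 and 3 are copied from the slide as transcribed; the function names and the toy singular values are illustrative assumptions, not the authors' procedure.

```python
import math


def bound_penalty(singular_values, k, n, delta):
    """Terms of the slide's bound other than the empirical hinge loss."""
    tail = sorted(singular_values, reverse=True)[k:]
    eta = sum(s * s for s in tail)  # optimal distortion for a k-dim subspace
    return (34.0 * math.sqrt(k / n)
            + 2.0 * math.sqrt(eta / n)
            + 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * n)))


def choose_cutoff(singular_values, n, delta=0.05):
    """Return the cutoff k in {0, ..., N} that minimizes the penalty terms."""
    candidates = range(len(singular_values) + 1)
    return min(candidates,
               key=lambda k: bound_penalty(singular_values, k, n, delta))


if __name__ == "__main__":
    # Toy spectrum: squared singular values 9000, 500, 10, 5 for n = 10000
    # points, consistent with the slide's assumption ||X_i|| <= 1.
    sigmas = [math.sqrt(v) for v in (9000.0, 500.0, 10.0, 5.0)]
    print("chosen cutoff k =", choose_cutoff(sigmas, n=10000))
```

Increasing $k$ raises the $(k/n)^{1/2}$ term and lowers the $(\eta/n)^{1/2}$ term, so the minimizer balances dimension against distortion.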

15 General Metric Spaces
• Metric space $(\mathcal{X}, d)$
• Labeled sample $(X_i, Y_i)$
• Distortion $\eta = \sum_i d(X_i, \dot{X}_i)$, where $\{\dot{X}_i\}$ is a "perturbed set"
• Dimensionality reduction: $\mathrm{ddim}(\{\dot{X}_i\}) < \mathrm{ddim}(\{X_i\}) \le \log_2 n$
Optimal tradeoff dictated by the data via Rademacher analysis:
$R_n = O(L (1 + \eta) n^{-1/D})$   [Dudley's chaining technique]
• $L$ = Lipschitz constant of the induced hypothesis
• $\eta$ = distortion
• $D = \mathrm{ddim}(\{\dot{X}_i\})$, the intrinsic data dimension
Generalization performance does not depend on the ambient dimension $\mathrm{ddim}(\{X_i\})$

16 Dimensionality reduction algorithm
• Given a point-set and a target dimension $d$, what is the smallest distortion possible?
• An exact solution seems hard
• We give an $(O(1), O(1))$-bicriteria approximation

17 Hierarchies Every discrete space admits a point hierarchy:

18 Hierarchies Every discrete space admits a point hierarchy: Level 0: Each point is center of 1-radius ball

19 Hierarchies Every discrete space admits a point hierarchy: Level 1: All 1-radius balls are covered by 2-radius balls (covering)

20 Hierarchies Every discrete space admits a point hierarchy: Ball centers have a minimum interpoint distance (packing)

21 Hierarchies Every discrete space admits a point hierarchy: Level 2: Big ball of radius 4

22 Hierarchies Every discrete space admits a point hierarchy:

23 Hierarchies Every discrete space admits a point hierarchy:
• Covering
• Packing
• Nested
Key property: in a doubling hierarchy, each ball neighbors only a small number of other balls (at each level)
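
The covering, packing, and nested properties can be sketched with a greedy construction in Python. The sketch assumes a finite point set whose minimum interpoint distance is on the order of 1 (so that level 0 is the whole set) and builds level $i$ as a greedy subset of level $i-1$ at scale $2^i$; the function name and the toy data are hypothetical, and the talk may construct its hierarchy differently.

```python
def build_net_hierarchy(points, dist):
    """Return nested levels; level i is a greedy net of level i-1 at scale 2**i."""
    levels = [list(points)]  # level 0: every point is a center
    scale = 1.0
    while len(levels[-1]) > 1:
        scale *= 2.0
        net = []
        for p in levels[-1]:
            # Packing: keep p only if it is farther than `scale` from every
            # chosen center. Covering follows, because each rejected point
            # lies within `scale` of some chosen center.
            if all(dist(p, c) > scale for c in net):
                net.append(p)
        levels.append(net)
    return levels


def abs_dist(a, b):
    """Toy metric on the real line (illustrative assumption)."""
    return abs(a - b)


if __name__ == "__main__":
    for i, level in enumerate(build_net_hierarchy(list(range(8)), abs_dist)):
        print("level", i, "radius", 2 ** i, "centers", level)
```

Each level is a subset of the previous one (nested), and the loop ends once a single center covers everything.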

24 Integer program
Consider a hierarchy over the training sample. An integer program extracts a sub-hierarchy with small doubling dimension.
• Indicator variable $z_j^i$ represents point $x_j$ in level $i$
• Let $E_j^i$ be all $i$-level points close to point $x_j$
Minimize the cost $\sum_j c_j$, subject to the constraints:
• $z_j^i \in \{0, 1\}$ (is $x_j$ present in level $i$?)
• $z_j^i \le z_j^{i-1}$ (nested)
• $z_j^i \le |N_j^{i+1}|$ (covering)
• $|N_j^i| \le 2^d$ (small target doubling dimension)
• $c_j \ge 2^i [(1 - z_j^0) - |N_j^{i+1}|]$ ($c_j$ is a proxy for the cost of deleting $x_j$)

25 Linear program
Solving the integer program is somewhat involved:
• Bicriteria algorithm: approximate cost, dimension
• Linear relaxation: $z_j^i \in [0, 1]$
• Rounding scheme
• Runtime $2^{O(\mathrm{ddim})} + O(n \log^4 n)$

26 Thank you
Questions?

