Adaptive Metric Dimensionality Reduction
Aryeh Kontorovich (Ben-Gurion U.)
Joint work with: Lee-Ad Gottlieb (Ariel U.) and Robert Krauthgamer (Weizmann Institute)
Setting: supervised binary classification in a metric space
- Instance (metric) space (𝒳, d)
- Probability distribution P on 𝒳 × {-1,+1} (agnostic PAC -- think noisy concept)
- Learner observes a sample S of n points (x, y) drawn i.i.d. from P and produces a hypothesis h: 𝒳 → {-1,+1}
- Generalization error: P[h(X) ≠ Y]
Metric space
(𝒳, d) is a metric space if
- 𝒳 = set of points
- d = distance function d: 𝒳 × 𝒳 → ℝ, which is
  - nonnegative, with d(x,x′) = 0 ⇔ x = x′
  - symmetric: d(x,x′) = d(x′,x)
  - and satisfies the triangle inequality: d(x,x′) ≤ d(x,z) + d(z,x′)
"No coordinates -- just distances"
- inner product ⇒ norm: ||x||² = ⟨x,x⟩
- norm ⇒ metric: d(x,x′) = ||x − x′||
- the converse implications do NOT hold
Take-away: the metric assumption is far less restrictive.
How to classify in a metric space? Nearest neighbors! (some variant of)
[Figure: distances between Tel Aviv, Singapore, London]
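To make the "no coordinates -- just distances" point concrete, here is a minimal sketch (my illustration, not from the talk) of a metric with no vector representation at all: edit distance on strings. The function name and example strings are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: a metric on strings, defined with no coordinates at all."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution (free if chars match)
        prev = cur
    return prev[-1]

# Distances are all we have -- and that is enough for nearest-neighbor classification.
print(edit_distance("kitten", "sitting"))   # 3
```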
Curse of dimensionality
Learning in high dimensions is hard:
- Statistically: many examples are needed
- Computationally: building a classifier is expensive
"Real" data tends to have
- high ambient dimension
- low intrinsic dimension
Challenge: exploit low intrinsic dimensionality, both statistically and computationally.
This talk: some of the first such results in supervised learning.
Nearest-Neighbor Classifier
h_NN(x) = label of the sample point closest to x
…is terrific!
- One of the oldest classification algorithms
- "Simple"
- Requires minimal geometric structure (a metric)
- Suitable for multi-class problems
- Asymptotically consistent (expected error at most twice the optimal Bayes rate)
…but has statistical and computational drawbacks:
- Infinite VC-dimension
- A distribution-free rate is impossible
- Exact computation requires Ω(n) time
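A minimal 1-NN sketch over an arbitrary metric (my own illustration; any distance function works). The brute-force scan is exactly the Ω(n)-per-query cost mentioned above; names are illustrative.

```python
from typing import Callable, List, Tuple, TypeVar

X = TypeVar("X")

def nn_classify(sample: List[Tuple[X, int]],
                d: Callable[[X, X], float],
                x: X) -> int:
    """1-NN in a generic metric space: return the label of the closest sample point.
    Brute force: one distance evaluation per training point (Omega(n) per query)."""
    _, label = min(((d(x, xi), yi) for xi, yi in sample), key=lambda t: t[0])
    return label

# Toy usage on the real line with |a - b| as the metric (labels are made up):
sample = [(0.1, +1), (0.4, +1), (2.0, -1), (2.5, -1)]
print(nn_classify(sample, lambda a, b: abs(a - b), 0.7))   # nearest point is 0.4 -> +1
```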
[Cover & Hart 1967; Ben-David & Shalev-Shwartz]
- η(x) = P[Y=1 | X=x], assumed Lipschitz continuous: |η(x) − η(x′)| ≤ L·d(x,x′)
- Bayes-optimal classifier: threshold η at 1/2, i.e. h*(x) = sign(η(x) − 1/2)
- Bayes error: err(h*) = E[min{η(X), 1−η(X)}]
Theorem: for the metric space ([0,1]^k, ‖·‖₂),
  E[err(h_NN)] ≤ 2·err(h*) + O(L·k^{1/2} / n^{1/(k+1)})
Tightness [curse of dimensionality]: need n = Ω((L+1)^k).
There exists a distribution with err(h*) = 0 such that n ≤ (L+1)^k / 2 implies E[err(h_NN)] > 1/4.
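To see the scale of the lower bound, a quick back-of-the-envelope calculation (the numbers are chosen for illustration, not taken from the talk):

  L = 3, k = 20:   n = Ω((L+1)^k) = Ω(4^{20}) ≈ Ω(10^{12}),

so even a modest Lipschitz constant in a 20-dimensional cube demands an astronomical sample before the nearest-neighbor guarantee kicks in.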
A Newer Look
Let's take a richer hypothesis class: f maps to [-1,+1] instead of {-1,+1}; classify by thresholding at zero.
[von Luxburg & Bousquet, JMLR '04]
- View the sample S = {(X_i, Y_i)} as evaluations of a [-1,+1]-valued function
- Lipschitz-extend the {-1,+1} data to a [-1,+1] function on the whole space
- Margin d(S+, S−) = inverse Lipschitz constant
- Algorithmically realized by nearest neighbor
Left open:
- Efficient NN search
- Smoothing / denoising / Structural Risk Minimization / Regularization
GKK'2010 addressed these issues.
[Figure: margin d(S+, S−) between the positive and negative sample points]
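A minimal sketch of the Lipschitz-extension idea (my own illustration of the construction, not code from the talk): extend the ±1 labels with one convenient extension (McShane's) at the smallest feasible Lipschitz constant L = 2/d(S+, S−), then threshold at zero. The talk notes the resulting classifier can be realized by nearest neighbor; all names below are made up.

```python
from typing import Callable, List, Tuple, TypeVar

X = TypeVar("X")

def lipschitz_extension_classifier(sample: List[Tuple[X, int]],
                                   d: Callable[[X, X], float]):
    pos = [x for x, y in sample if y == +1]
    neg = [x for x, y in sample if y == -1]
    # Margin = distance between the two label classes; 2/margin is the smallest
    # Lipschitz constant consistent with the +/-1 labels on the sample.
    margin = min(d(p, q) for p in pos for q in neg)
    L = 2.0 / margin

    def f(x: X) -> float:
        # McShane extension: L-Lipschitz and agrees with the labels on the sample.
        return min(y + L * d(x, xi) for xi, y in sample)

    return lambda x: +1 if f(x) >= 0 else -1

# Toy usage on the real line with |a - b| as the metric:
clf = lipschitz_extension_classifier([(0.0, +1), (1.0, +1), (3.0, -1)],
                                     lambda a, b: abs(a - b))
print(clf(1.5), clf(2.6))   # +1 -1
```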
Doubling Dimension
Definition: the ball B(x,r) = all points within distance r of x.
- The doubling constant λ of a metric space 𝒳 is the minimum value such that every ball can be covered by λ balls of half the radius. First used by [Ass-83], algorithmically by [Cla-97].
- The doubling dimension is ddim(𝒳) = log₂(λ) [GKL-03].
- A metric is doubling if its doubling dimension is finite.
- Euclidean: ddim(ℝⁿ) = O(n).
Summary:
- Intimately connected to covering numbers
- Analogue of Euclidean dimension
- In geometry, one of many metric dimensions; in CS, basically just ddim
[Figure: an example ball requiring λ ≥ 7 half-radius balls to cover it]
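A rough empirical sketch of the definition (my illustration, assuming a finite sample of points): for a ball B(x, r), greedily build an r/2-net of the points inside it; the net's size upper-bounds the number of half-radius balls needed for that ball, and the maximum over sampled balls gives a crude estimate of λ (and its log₂ estimates ddim).

```python
import math
import random
from typing import Callable, List, TypeVar

X = TypeVar("X")

def greedy_half_radius_cover(points: List[X], center: X, r: float,
                             d: Callable[[X, X], float]) -> int:
    """Number of r/2-balls a greedy cover uses for the points inside B(center, r)."""
    ball = [p for p in points if d(p, center) <= r]
    centers: List[X] = []
    for p in ball:
        if all(d(p, c) > r / 2 for c in centers):   # p not yet covered: open a new ball at p
            centers.append(p)
    return len(centers)

def estimate_ddim(points: List[X], d: Callable[[X, X], float],
                  radii=(0.25, 0.5, 1.0), trials: int = 50) -> float:
    lam = 1
    for _ in range(trials):
        x = random.choice(points)
        r = random.choice(radii)
        lam = max(lam, greedy_half_radius_cover(points, x, r, d))
    return math.log2(lam)

# Toy usage: points on a line have low doubling dimension, and the estimate reflects that.
pts = [i / 100 for i in range(100)]
print(estimate_ddim(pts, lambda a, b: abs(a - b)))
```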
NN excess risk
- Previous: O(L·k^{1/2} / n^{1/(ddim+1)})
- GKK'10: O([L^{ddim} log(n) / n]^{1/2})
Metric Dimensionality Reduction
Runtime and sample complexity are exponential in ddim:
- (1+ε)-approximate nearest-neighbor search in time 2^{ddim} log n + ε^{-ddim}
- Generalization bounds decay as min{n^{-1/ddim}, L^{ddim} n^{-1/2}}
All existing bounds work with the ambient dimension -- they are insensitive to the intrinsic data dimension.
- What if the intrinsic data dimension is much lower than the ambient one?
- What if the data is only close to being low-dimensional?
Principal Components Analysis (PCA)
- Data {X_i} with X_i ∈ ℝ^N
- A k-dimensional subspace T ⊂ ℝ^N induces distortion ε := Σ_i ||X_i − P_T(X_i)||²  [P_T(·) = orthogonal projection onto T]
- Dimension k and the (optimal) distortion have a simple relationship: ε = Σ_{j=k+1}^{N} σ_j², where σ_1 ≥ σ_2 ≥ … ≥ σ_N are the singular values of the data matrix X
Uses of PCA:
- Denoising
- Discovering the "inherent" dimension of the data
How to choose the cutoff k?
- Heuristics (such as looking for a "jump" σ_j ≫ σ_{j+1})
- To our knowledge, no principled guidelines
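A small numpy sketch (mine, not from the talk) verifying the distortion/singular-value relationship: project onto the top-k right-singular subspace of the (uncentered) data matrix and compare the residual Σ_i ||X_i − P_T(X_i)||² with Σ_{j>k} σ_j². The synthetic data and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, k = 200, 50, 5
# Data that is approximately k-dimensional, plus a little ambient noise.
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, N)) + 0.01 * rng.normal(size=(n, N))

# SVD of the data matrix; rows of Vt span the optimal projection subspaces.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

V_k = Vt[:k]                      # top-k right singular vectors
X_proj = X @ V_k.T @ V_k          # P_T(X_i) for the optimal k-dimensional subspace T
distortion = np.sum((X - X_proj) ** 2)

print(distortion, np.sum(s[k:] ** 2))   # the two quantities agree (Eckart-Young)
```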
PCA and supervised classification
- Labeled sample (X_i, Y_i) with X_i ∈ ℝ^N and Y_i ∈ {-1,+1}
- Ambient space is high-dimensional: N ≫ 1
- Common heuristic: run PCA prior to SVM
Benefits:
- Computational: everything is faster in lower dimensions!
- Statistical (less well understood): denoising (heuristic); better generalization guarantees?
Drawbacks:
- Theoretically unmotivated
- What's the "right" dimension / singular-value cutoff?
- Won't this mess up the margin?
Principal Components Analysis
- Labeled sample (X_i, Y_i): n sample points with ||X_i|| ≤ 1 in ℝ^N, Y_i ∈ {-1,+1}
Thm [GKK'2013]: For all δ > 0, with probability ≥ 1−δ: for all separating hyperplanes ||w|| ≤ 1 in ℝ^N and all subspaces T ⊂ ℝ^N with dim(T) = k incurring distortion ε = Σ_i ||X_i − P_T(X_i)||², we have
  E[L_hinge(w·X, Y)] ≤ (1/n) Σ_i L_hinge(w·X_i, Y_i) + 34(k/n)^{1/2} + 2(ε/n)^{1/2} + 3[log(2/δ)/(2n)]^{1/2}
- The distortion plays the role of an inverse margin
- To our knowledge, the first rigorous (principled) guide to the PCA cutoff
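A sketch (my own, under simplifying assumptions) of how the theorem suggests choosing the PCA cutoff: compute the distortion ε_k for every k from the singular values and pick the k minimizing the dimension-plus-distortion penalty 34(k/n)^{1/2} + 2(ε_k/n)^{1/2}. I treat the empirical hinge loss as roughly constant across cutoffs, which is a simplification of mine, not part of the theorem; the data is synthetic.

```python
import numpy as np

def pca_cutoff_from_bound(X: np.ndarray) -> int:
    """Pick the PCA dimension k minimizing 34*sqrt(k/n) + 2*sqrt(eps_k/n),
    where eps_k is the sum of squared singular values beyond the k-th."""
    n = X.shape[0]
    s = np.linalg.svd(X, compute_uv=False)
    # eps_k for k = 0..len(s): tail sums of the squared singular values.
    tail = np.concatenate([np.cumsum((s ** 2)[::-1])[::-1], [0.0]])
    ks = np.arange(len(tail))
    penalty = 34 * np.sqrt(ks / n) + 2 * np.sqrt(tail / n)
    return int(np.argmin(penalty))

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3)) @ rng.normal(size=(3, 40)) + 0.05 * rng.normal(size=(5000, 40))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||X_i|| <= 1 as in the theorem
print(pca_cutoff_from_bound(X))                 # should recover the planted dimension, 3
```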
General Metric Spaces
- Metric space (𝒳, d), labeled sample (X_i, Y_i)
- Distortion ε = Σ_i d(X_i, Ẋ_i), where {Ẋ_i} is a "perturbed set"
- Dimensionality reduction: ddim({Ẋ_i}) < ddim({X_i}) ≤ log₂ n
- Optimal tradeoff dictated by the data via Rademacher analysis:
  R_n = O(L(1+ε) n^{-1/D})   [Dudley's chaining technique]
  where L = Lipschitz constant of the induced hypothesis, ε = distortion, D = ddim({Ẋ_i}) = intrinsic data dimension
- Generalization performance does not depend on the ambient dimension ddim({X_i})
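Plugging this Rademacher estimate into the standard generalization bound (a standard inequality, stated schematically here rather than quoted from the talk): with probability ≥ 1−δ,

  err(h) ≤ êrr_n(h) + 2R_n + [ln(1/δ)/(2n)]^{1/2} = êrr_n(h) + O(L(1+ε) n^{-1/D}) + [ln(1/δ)/(2n)]^{1/2},

so the rate in n is governed by the intrinsic dimension D of the perturbed set, while the Lipschitz constant L and the distortion ε enter only as multiplicative factors.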
Dimensionality reduction algorithm
- Given a point set and a target dimension d: what is the smallest distortion possible?
- An exact solution seems hard
- We give an (O(1), O(1))-bicriteria approximation
Hierarchies
Every discrete space admits a point hierarchy:
- Level 0: each point is the center of a 1-radius ball
- Level 1: all 1-radius balls are covered by 2-radius balls (covering)
- Balls have a minimum interpoint distance (packing)
- Level 2: one big ball of radius 4
- Properties: covering, packing, nested
Key property: in a doubling hierarchy, each ball neighbors only a small number of other balls (at each level).
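A minimal sketch (my illustration, not the talk's implementation) of building such a hierarchy greedily: level i keeps a 2^i-net of the level below it, so the levels are nested, each level is a packing at its scale, and each level covers the one below.

```python
from typing import Callable, Dict, List, TypeVar

X = TypeVar("X")

def build_hierarchy(points: List[X], d: Callable[[X, X], float]) -> Dict[int, List[X]]:
    """Greedy net hierarchy: level i+1 is a 2^(i+1)-net (packing + covering) of level i."""
    levels = {0: list(points)}            # level 0: every point (radius-1 balls)
    i = 0
    while len(levels[i]) > 1:
        r = 2.0 ** (i + 1)
        net: List[X] = []
        for p in levels[i]:               # greedy: keep p unless it is already covered
            if all(d(p, c) > r for c in net):
                net.append(p)
        i += 1
        levels[i] = net                   # nested: the net is a subset of the level below
    return levels

# Toy usage on the real line:
pts = [0.0, 0.7, 1.5, 3.0, 6.5, 7.0, 13.0]
for lvl, centers in build_hierarchy(pts, lambda a, b: abs(a - b)).items():
    print(lvl, centers)
```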
Integer program
- Consider a hierarchy over the training sample
- An integer program extracts a sub-hierarchy with small doubling dimension
- Indicator variable z_j^i represents point x_j at level i
- Let N_j^i be the set of i-level points close to point x_j
Minimize the cost Σ_j c_j, subject to the constraints:
- z_j^i ∈ {0,1}   (is x_j present in level i?)
- z_j^i ≤ z_j^{i−1}   (nested)
- z_j^i ≤ |N_j^{i+1}|   (covering)
- |N_j^i| ≤ 2^d   (small target doubling dimension)
- c_j ≥ 2^i [(1 − z_j^0) − |N_j^{i+1}|]   (c_j is a proxy for the cost of deleting x_j)
Linear program
Solving the integer program is somewhat involved, so we use a bicriteria algorithm (approximate cost, approximate dimension):
- Linear relaxation: z_j^i ∈ [0,1]
- Rounding scheme
- Runtime: 2^{O(ddim)} + O(n log⁴ n)
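A schematic of the relax-and-round pattern on a made-up covering-style toy instance (this is a generic illustration of the pattern, not the exact program or rounding scheme from the talk): solve the relaxed LP, then round the fractional indicators.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance (made up): z_j = 1 means "keep point j", cost_j = price of deleting it.
cost = np.array([3.0, 1.0, 2.0, 1.0, 5.0])
neighborhoods = [[0, 1, 2], [0, 1, 2, 3], [1, 2, 3], [2, 3, 4], [3, 4]]
cap = 2          # stand-in for the 2^d bound on kept neighbors

# Constraint: for every point j, the number of kept neighbors is at most cap.
A_ub = np.zeros((len(neighborhoods), len(cost)))
for j, nbrs in enumerate(neighborhoods):
    A_ub[j, nbrs] = 1.0
b_ub = np.full(len(neighborhoods), cap)

# Deleting point j costs cost_j * (1 - z_j); dropping the constant, minimize -cost . z
# over the linear relaxation 0 <= z_j <= 1.
res = linprog(c=-cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(cost), method="highs")
z_frac = res.x
z_rounded = (z_frac >= 0.5).astype(int)   # naive rounding; the cap may be exceeded by an
                                          # O(1) factor -- hence the "bicriteria" guarantee
print(z_frac, z_rounded)
```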
Thank you! Questions?