On Kernels, Margins, and Low-dimensional Mappings, or: Kernels versus features. Nina Balcan (CMU), Avrim Blum (CMU), Santosh Vempala (MIT).


Generic problem
- Given a set of images, want to learn a linear separator to distinguish men from women.
- Problem: pixel representation is no good.
Old-style advice:
- Pick a better set of features!
- But this seems ad hoc. Not scientific.
New-style advice:
- Use a kernel! $K(x, y) = \phi(x) \cdot \phi(y)$, where $\phi$ is an implicit, high-dimensional mapping.
- Sounds more scientific. Many algorithms can be "kernelized".
- Use the "magic" of the implicit high-dimensional space. Don't pay for it if there exists a large-margin separator.
- E.g., $K(x, y) = (x \cdot y + 1)^m$, with $\phi$: ($n$-dimensional space) $\to$ ($n^m$-dimensional space).
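To make the implicit mapping concrete, here is a minimal sketch for the case $m = 2$. The function names and the particular explicit map `phi_degree2` are illustrative assumptions (other explicit maps give the same kernel); the point is only that the kernel value equals a dot product in a larger space whose dimension grows on the order of $n^m$.

```python
import math
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y, m=2):
    """K(x, y) = (x . y + 1)^m, computed without ever building phi."""
    return (dot(x, y) + 1) ** m

def phi_degree2(x):
    """One explicit feature map for m = 2 (hypothetical helper, for illustration)."""
    root2 = math.sqrt(2)
    feats = [1.0]                                             # constant term
    feats += [root2 * a for a in x]                           # linear terms
    feats += [a * a for a in x]                               # squared terms
    feats += [root2 * a * b for a, b in combinations(x, 2)]   # cross terms
    return feats

x = (1.0, 2.0, 3.0)
y = (0.5, -1.0, 2.0)
implicit = poly_kernel(x, y)                      # one dot product in 3 dimensions
explicit = dot(phi_degree2(x), phi_degree2(y))    # one dot product in 10 dimensions
```

Both quantities agree up to floating-point rounding, but the explicit route already needs 10 coordinates for $n = 3$, $m = 2$.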

Main point of this work: can view the new method as a way of conducting the old method.
- Given a kernel [as a black-box program $K(x,y)$] and access to typical inputs [samples from $D$],
- Claim: can run $K$ and reverse-engineer an explicit (small) set of features, such that if $K$ is good [$\exists$ a large-margin separator in $\phi$-space for $D, c$], then this is a good feature set [$\exists$ an almost-as-good separator].
- "You give me a kernel, I give you a set of features."
- E.g., sample $z_1, \ldots, z_d$ from $D$. Given $x$, define $x_i = K(x, z_i)$.
Implications:
- Practical: an alternative to kernelizing the algorithm.
- Conceptual: view the kernel as a (principled) way of doing feature generation. View it as a similarity function, rather than as the "magic power of an implicit high-dimensional space".

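Basic setup, definitions
- Instance space $X$.
- Distribution $D$, target $c$. Use $P = (D, c)$.
- $K(x,y) = \phi(x) \cdot \phi(y)$.
- $P$ is separable with margin $\gamma$ in $\phi$-space if $\exists\, w$ s.t. $\Pr_{(x,\ell) \in P}[\ell\,(w \cdot \phi(x)) < \gamma] = 0$ (normalizing $|w| = 1$, $|\phi(x)| = 1$).
- Error $\epsilon$ at margin $\gamma$: replace "0" with "$\epsilon$".
- Goal is to use $K$ to get a mapping to a low-dimensional space.
[Figure: + and - examples of $P = (D, c)$ in $X$, mapped by $\phi$ to a space with separator $w$.]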
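As a small illustration of the definition, the empirical version of "error at margin $\gamma$" on a finite sample can be computed as below. This is a sketch under the slide's normalization (unit-length $w$ and $\phi(x)$); the function name and the toy data are made up for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def error_at_margin(w, examples, gamma):
    """Fraction of (phi_x, label) pairs with label * (w . phi_x) < gamma.

    gamma = 0 gives the ordinary classification error of the separator w.
    """
    bad = sum(1 for phi_x, label in examples if label * dot(w, phi_x) < gamma)
    return bad / len(examples)

w = (0.0, 1.0)                      # unit-length separator
examples = [                        # unit-length phi(x) with labels in {+1, -1}
    ((0.6, 0.8), +1),
    ((0.8, -0.6), -1),
    ((0.0, 1.0), +1),
    ((0.96, 0.28), +1),
]

err_plain = error_at_margin(w, examples, 0.0)    # all four points are on the right side
err_margin = error_at_margin(w, examples, 0.5)   # the last point has margin only 0.28
```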

Idea: Johnson-Lindenstrauss lemma
- If $P$ is separable with margin $\gamma$ in $\phi$-space, then with probability $1 - \delta$, a random linear projection down to a space of dimension $d = O\big((1/\gamma^2) \log[1/(\epsilon\delta)]\big)$ will have a linear separator of error $< \epsilon$. [AV]
- If the projection vectors are $r_1, r_2, \ldots, r_d$, then can view them as features $x_i = \phi(x) \cdot r_i$.
- Problem: this uses $\phi$. Can we do it directly, using $K$ as a black box, without computing $\phi$?
[Figure: + and - examples of $P = (D, c)$ in $X$, mapped by $\phi$.]
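A minimal sketch of the random-projection features $x_i = \phi(x) \cdot r_i$, assuming $\phi(x)$ is available as an explicit vector (which is exactly what the slide wants to avoid). Gaussian entries scaled by $1/\sqrt{d}$ are one standard choice of random projection, not necessarily the one used in [AV]; `make_projection` and `project` are hypothetical names.

```python
import math
import random

def make_projection(input_dim, d, seed=0):
    """Draw d random vectors r_1..r_d in R^input_dim (Gaussian, scaled by 1/sqrt(d))."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(input_dim)]
            for _ in range(d)]

def project(R, phi_x):
    """Feature i of the output is phi(x) . r_i."""
    return [sum(r_k * v_k for r_k, v_k in zip(row, phi_x)) for row in R]

R = make_projection(input_dim=6, d=2000, seed=1)
phi_x = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0]      # a unit vector standing in for phi(x)
projected = project(R, phi_x)
squared_norm = sum(v * v for v in projected)  # concentrates around |phi(x)|^2 = 1
```

The same matrix `R` must be reused for every point, since it is the approximate preservation of dot products between projected points that keeps the margin roughly intact.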

3 methods (from simplest to best)
1. Draw $d$ examples $z_1, \ldots, z_d$ from $D$. Use $F(x) = (K(x,z_1), \ldots, K(x,z_d))$. [So, "$x_i$" $= K(x, z_i)$.] For $d = (8/\epsilon)[1/\gamma^2 + \ln 1/\delta]$, if $P$ was separable with margin $\gamma$ in $\phi$-space, then whp this will be separable with error $\epsilon$. (But this method doesn't preserve the margin.)
2. Same $d$, but a little more complicated. Separable with error $\epsilon$ at margin $\gamma/2$.
3. Combine (2) with a further projection as in the JL lemma. Get $d$ with log dependence on $1/\epsilon$, rather than linear. So, can set $\epsilon \ll 1/d$.
All these methods need access to $D$, unlike JL. Can this be removed? We show NO for generic $K$, but it may be possible for natural $K$.
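For a feel of the numbers, the sample size used by methods 1 and 2 can be evaluated directly from the formula $d = (8/\epsilon)[1/\gamma^2 + \ln 1/\delta]$. The helper name and the parameter values below are illustrative assumptions; rounding up to an integer is also a choice made here.

```python
import math

def landmarks_needed(eps, gamma, delta):
    """d = (8/eps) * (1/gamma^2 + ln(1/delta)), rounded up to an integer."""
    return math.ceil((8.0 / eps) * (1.0 / gamma ** 2 + math.log(1.0 / delta)))

d_loose = landmarks_needed(eps=0.1, gamma=0.5, delta=0.05)
d_tight = landmarks_needed(eps=0.01, gamma=0.5, delta=0.05)
```

Shrinking $\epsilon$ by a factor of 10 grows $d$ by roughly the same factor, which is the linear dependence on $1/\epsilon$ that method 3 replaces with a logarithmic one.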

Actually, the argument is pretty easy... (though we did try a lot of things first that didn’t work...)

Key fact
Claim: If $\exists$ a perfect $w$ of margin $\gamma$ in $\phi$-space, then if we draw $z_1, \ldots, z_d \in D$ for $d \ge (8/\epsilon)[1/\gamma^2 + \ln 1/\delta]$, whp $(1 - \delta)$ there exists $w'$ in $\mathrm{span}(\phi(z_1), \ldots, \phi(z_d))$ of error $\le \epsilon$ at margin $\gamma/2$.
Proof: Let $S$ = examples drawn so far. Assume $|w| = 1$, $|\phi(z)| = 1\ \forall z$.
- $w_{in} = \mathrm{proj}(w, \mathrm{span}(S))$, $w_{out} = w - w_{in}$.
- Say $w_{out}$ is large if $\Pr_z(|w_{out} \cdot \phi(z)| \ge \gamma/2) \ge \epsilon$; else small.
- If small, then done: $w' = w_{in}$.
- Else, the next $z$ has at least $\epsilon$ probability of improving $S$: $|w_{out}|^2 \leftarrow |w_{out}|^2 - (\gamma/2)^2$.
- This can happen at most $4/\gamma^2$ times.
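The improvement step in the proof can be spelled out as follows. This is one way to fill in the detail, not taken from the slides: $u$ (the part of $\phi(z)$ orthogonal to $\mathrm{span}(S)$) and $\hat{u} = u/|u|$ are notation introduced here.

```latex
% Adding z to S when |w_out . phi(z)| >= gamma/2.
% u = component of phi(z) orthogonal to span(S), so |u| <= |phi(z)| = 1.
\begin{align*}
w_{out} \cdot \phi(z) &= w_{out} \cdot u
   && \text{since } w_{out} \perp \mathrm{span}(S), \\
w_{out}^{new} &= w_{out} - (w_{out} \cdot \hat{u})\,\hat{u}
   && \text{the new span is } \mathrm{span}(S) \oplus u, \\
|w_{out}^{new}|^2 &= |w_{out}|^2 - \frac{(w_{out} \cdot u)^2}{|u|^2}
   \;\le\; |w_{out}|^2 - \left(\frac{\gamma}{2}\right)^2 .
\end{align*}
% Initially |w_out|^2 <= |w|^2 = 1 and it can never go below 0,
% so at most 1 / (gamma/2)^2 = 4 / gamma^2 such steps are possible.
```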

So...
- If we draw $z_1, \ldots, z_d \in D$ for $d = (8/\epsilon)[1/\gamma^2 + \ln 1/\delta]$, then whp there exists $w'$ in $\mathrm{span}(\phi(z_1), \ldots, \phi(z_d))$ of error $\le \epsilon$ at margin $\gamma/2$.
- So, for some $w' = \alpha_1 \phi(z_1) + \cdots + \alpha_d \phi(z_d)$, $\Pr_{(x,\ell) \in P}[\mathrm{sign}(w' \cdot \phi(x)) \ne \ell] \le \epsilon$.
- But notice that $w' \cdot \phi(x) = \alpha_1 K(x,z_1) + \cdots + \alpha_d K(x,z_d)$. $\Rightarrow$ The vector $(\alpha_1, \ldots, \alpha_d)$ is an $\epsilon$-good separator in the feature space $x_i = K(x, z_i)$.
- But the margin is not preserved, because of the length of the target and of the examples.
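The identity $w' \cdot \phi(x) = \sum_i \alpha_i K(x, z_i)$ can be checked numerically on a toy kernel whose $\phi$ is small enough to write down. The kernel $K(x,y) = (x \cdot y)^2$ on the plane, the landmarks, and the coefficients are all assumptions chosen for the example (and the vectors are not normalized to unit length, which does not affect the identity).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def K(x, y):
    """Toy kernel on R^2: K(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit map for the toy kernel, so that phi(x) . phi(y) = K(x, y)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

landmarks = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # z_1, z_2, z_3
alphas = [0.5, -1.0, 0.25]                         # arbitrary coefficients

# w' = sum_i alpha_i * phi(z_i), built explicitly in phi-space
w_prime = [sum(a * phi(z)[k] for a, z in zip(alphas, landmarks)) for k in range(3)]

x = (0.3, -0.7)
score_explicit = dot(w_prime, phi(x))                             # uses phi
score_kernel = sum(a * K(x, z) for a, z in zip(alphas, landmarks))  # uses only K
```

The second expression is a linear function of the features $K(x, z_i)$, which is why the coefficient vector acts as a separator in that feature space.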

How to preserve margin? (mapping #2)
- We know $\exists\, w'$ in $\mathrm{span}(\phi(z_1), \ldots, \phi(z_d))$ of error $\le \epsilon$ at margin $\gamma/2$.
- So, given a new $x$, just want to do an orthogonal projection of $\phi(x)$ into that span. (This preserves the dot product with $w'$ and decreases $|x|$, so it only increases the margin.)
- Run $K(z_i, z_j)$ for all $i, j = 1, \ldots, d$. Get matrix $M$.
- Decompose $M = U^T U$.
- (Mapping #2) = (mapping #1) $U^{-1}$.
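A minimal pure-Python sketch of mapping #2, assuming the kernel matrix $M$ is positive definite (i.e. the $\phi(z_i)$ are linearly independent). It uses a Cholesky factorization $M = L L^T$, so $U = L^T$, and applies $U^{-1}$ by forward substitution instead of forming the inverse; that choice of decomposition, the function names, and the toy kernel are assumptions, not something the slides specify.

```python
import math

def cholesky_lower(M):
    """Return lower-triangular L with M = L L^T (so U = L^T gives M = U^T U)."""
    d = len(M)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(s)   # raises ValueError if M is not positive definite
            else:
                L[i][j] = s / L[j][j]
    return L

def mapping2(K, landmarks):
    """Build F2(x) = F1(x) U^{-1}, where F1(x) = (K(x, z_1), ..., K(x, z_d))."""
    M = [[K(zi, zj) for zj in landmarks] for zi in landmarks]
    L = cholesky_lower(M)
    def F2(x):
        f1 = [K(x, z) for z in landmarks]          # mapping #1
        y = []
        for i in range(len(f1)):                   # forward substitution: solve L y = f1
            y.append((f1[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
        return y
    return F2

def K(x, y):
    """Toy kernel on R^2: K(x, y) = (1 + x . y)^2."""
    return (1.0 + x[0] * y[0] + x[1] * y[1]) ** 2

landmarks = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
F2 = mapping2(K, landmarks)
```

With this construction, dot products between mapped landmarks reproduce the kernel values, and for a new point the squared length of `F2(x)` is at most `K(x, x)`, as expected of an orthogonal projection.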

How to improve dimension?
- Current mapping gives $d = (8/\epsilon)[1/\gamma^2 + \ln 1/\delta]$.
- Johnson-Lindenstrauss gives $d = O\big((1/\gamma^2) \log 1/(\epsilon\delta)\big)$.
- JL is nice because we can have $\epsilon \ll 1/d$. Good if the algorithm wants the data to be perfectly separable. (Learning a separator of margin $\gamma$ can be done in time $\mathrm{poly}(1/\gamma)$, but if no perfect separator exists, minimizing error is NP-hard.)
- Answer: just combine the two...

[Figure: clusters of X and O points shown in $R^N$, $R^{d_1}$, and $R^d$, with arrows labeled $F_1$, JL, and $F$.]

Mapping #3
- Do JL(mapping2($x$)).
- JL says: fix $y, w$. A random projection $M$ down to a space of dimension $O\big((1/\gamma^2) \log 1/\delta'\big)$ will with probability $(1 - \delta')$ preserve the margin of $y$ up to $\pm\gamma/4$.
- Use $\delta' = \epsilon\delta$.
  $\Rightarrow$ For all $y$, $\Pr_M[\text{failure on } y] < \epsilon\delta$,
  $\Rightarrow$ $\Pr_{D,M}[\text{failure on } y] < \epsilon\delta$,
  $\Rightarrow$ $\Pr_M[\text{fail on prob mass} \ge \epsilon] < \delta$.
- So, we get the desired dimension (# features), though the sample complexity remains as in mapping #2.
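The last implication in the chain is an application of Markov's inequality. Written out, with the conditioning on the projection $M$ made explicit (notation added here for clarity):

```latex
\begin{align*}
\Pr_{y \sim D,\, M}[\text{failure on } y]
  &= \mathbb{E}_M\Big[\Pr_{y \sim D}[\text{failure on } y \mid M]\Big] < \epsilon\delta, \\
\Pr_M\Big[\Pr_{y \sim D}[\text{failure on } y \mid M] \ge \epsilon\Big]
  &\le \frac{\mathbb{E}_M\big[\Pr_{y \sim D}[\text{failure on } y \mid M]\big]}{\epsilon}
  < \frac{\epsilon\delta}{\epsilon} = \delta .
\end{align*}
% On points where JL does not fail, a margin of gamma/2 from mapping #2
% can shrink by at most gamma/4, leaving margin at least gamma/4.
```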

Lower bound (on necessity of access to D)
For an arbitrary black-box kernel $K$, can't hope to convert to a small feature space without access to $D$.
- Consider $X = \{0,1\}^n$, a random $X' \subset X$ of size $2^{n/2}$, $D$ = uniform over $X'$.
- $c$ = arbitrary function (so learning is hopeless).
- But we have this magic kernel $K(x,y) = \phi(x) \cdot \phi(y)$:
  $\phi(x) = (1, 0)$ if $x \notin X'$.
  $\phi(x) = (-\tfrac{1}{2}, \sqrt{3}/2)$ if $x \in X'$, $c(x)$ = pos.
  $\phi(x) = (-\tfrac{1}{2}, -\sqrt{3}/2)$ if $x \in X'$, $c(x)$ = neg.
- $P$ is separable with margin $\sqrt{3}/2$ in $\phi$-space.
- But, without access to $D$, all attempts at running $K(x,y)$ will give an answer of 1.
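The construction is small enough to write out. The sketch below uses a tiny hypothetical instance ($n = 4$, a hand-picked $X'$, and an arbitrary labeling `c`), so here $X'$ is a quarter of the cube; in the slide's setting $X'$ is a $2^{-n/2}$ fraction of $X$, so queries made without samples from $D$ essentially never land in it.

```python
import math

def make_phi(X_prime, c):
    """phi from the lower-bound construction; X_prime is a set, c maps x to +1 or -1."""
    h = math.sqrt(3) / 2
    def phi(x):
        if x not in X_prime:
            return (1.0, 0.0)
        return (-0.5, h) if c(x) > 0 else (-0.5, -h)
    return phi

def make_kernel(phi):
    def K(x, y):
        px, py = phi(x), phi(y)
        return px[0] * py[0] + px[1] * py[1]
    return K

# Hypothetical tiny instance: n = 4, |X'| = 2^(n/2) = 4.
X_prime = {(0, 0, 1, 1), (0, 1, 0, 1), (1, 0, 1, 0), (1, 1, 1, 1)}

def c(x):
    """Arbitrary target chosen for illustration: label by the first bit."""
    return 1 if x[0] == 0 else -1

phi = make_phi(X_prime, c)
K = make_kernel(phi)

outside_value = K((0, 0, 0, 0), (1, 0, 0, 0))   # both points outside X'
w = (0.0, 1.0)                                   # separator in phi-space
margins = [c(x) * (w[0] * phi(x)[0] + w[1] * phi(x)[1]) for x in X_prime]
```

Every point of $X'$ has margin $\sqrt{3}/2$ under $w = (0, 1)$, while any pair of points outside $X'$ returns kernel value 1 and so reveals nothing about $c$.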

Open Problems
- For specific, natural kernels, like $K(x,y) = (1 + x \cdot y)^m$, is there an efficient (probability distribution over) mappings that is good for any $P = (c, D)$ for which the kernel is good?
- I.e., an efficient analog to JL for these kernels.
- Or, at least, can these mappings be constructed using less sample complexity (fewer accesses to $D$)?