
Features, Kernels, and Similarity functions Avrim Blum Machine learning lunch 03/05/07

Suppose you want to… use learning to solve some classification problem. E.g., given a set of images, learn a rule to distinguish men from women.
- The first thing you need to do is decide what you want as features.
- Or, for algorithms like SVM and Perceptron, you can use a kernel function, which provides an implicit feature space. But then which kernel should you use?
- Can theory provide any help or guidance?

Plan for this talk. Discuss a few ways theory might be of help:
- Algorithms designed to do well in large feature spaces when only a small number of features are actually useful, so you can pile a lot on when you don't know much about the domain.
- Kernel functions: the standard theoretical view, plus a new one that may provide more guidance. Bridge between the "implicit mapping" and "similarity function" views; talk about the quality of a kernel in terms of more tangible properties. [work with Nina Balcan]
- Combining the above: using kernels to generate explicit features.

A classic conceptual question  How is it possible to learn anything quickly when there is so much irrelevant information around?  Must there be some hard-coded focusing mechanism, or can learning handle it?

A classic conceptual question. Let's try a very simple theoretical model:
- Have n boolean features. Labels are + or −.
- Assume the distinction is based on just one feature.
- How many prediction mistakes do you need to make before you've figured out which one it is? Take a majority vote over all possibilities consistent with the data so far. Each mistake crosses off at least half, so O(log n) mistakes total.
- log(n) is good: doubling n only adds one more mistake. Can't do better (consider log(n) random strings with random labels; whp there is a consistent feature in hindsight).
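Here is a minimal sketch (my illustration, not code from the talk) of this majority-vote / halving idea for the single-relevant-feature case; the tiny data stream is made up for the example.

```python
def halving_predict(candidates, x):
    """Predict by majority vote over the candidate features still consistent
    with all data seen so far; each candidate i predicts x[i]."""
    votes_for_one = sum(x[i] for i in candidates)
    return 1 if 2 * votes_for_one >= len(candidates) else 0

def halving_update(candidates, x, y):
    """Keep only the candidate features that agree with the observed label y.
    After a prediction mistake this removes at least half of the candidates,
    giving O(log n) mistakes in total."""
    return [i for i in candidates if x[i] == y]

# Usage on a tiny stream where the true label is feature 0 (illustrative data):
candidates = list(range(4))
stream = [([1, 0, 1, 1], 1), ([0, 1, 1, 0], 0), ([1, 1, 0, 0], 1)]
mistakes = 0
for x, y in stream:
    if halving_predict(candidates, x) != y:
        mistakes += 1
    candidates = halving_update(candidates, x, y)
```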

A classic conceptual question. What about more interesting classes of functions (not just a target that is a single feature)?

Littlestone’s Winnow algorithm [MLJ 1988]  Motivated by the question: what if target is an OR of r  n features? Majority vote scheme over all n r possibilities would make O(r log n) mistakes but totally impractical. Can you do this efficiently?  Winnow is simple efficient algorithm that meets this bound.  More generally, if exists LTF such that positives satisfy w 1 x 1 +w 2 x 2 +…+w n x n  c, negatives satisfy w 1 x 1 +w 2 x 2 +…+w n x n  c - , (W=  i |w i |)  Then # mistakes = O((W/  ) 2 log n). E.g., if target is “k of r” function, get O(r 2 log n). Key point: still only log dependence on n x 4  x 7  x 10

Littlestone's Winnow algorithm [MLJ 1988]. How does it work? Balanced version:
- Maintain weight vectors w^+ and w^-. Initialize all weights to 1.
- Classify based on whether w^+ · x or w^- · x is larger (with a constant bias coordinate x_0).
- If you make a mistake on a positive x, then for each x_i = 1, set w_i^+ ← (1+ε) w_i^+ and w_i^- ← (1−ε) w_i^-.
- And vice versa for a mistake on a negative x.
Other properties:
- Can show this approximates maxent constraints.
- In the other direction, [Ng '04] shows that maxent with L1 regularization gets Winnow-like bounds.
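A minimal sketch of the balanced update just described; the data layout (0/1 feature matrix, ±1 labels), the learning rate eps, and the number of passes are illustrative choices, not from the talk.

```python
import numpy as np

def balanced_winnow(examples, labels, eps=0.1, passes=1):
    """Balanced Winnow sketch: examples is an (m, n) array of 0/1 features,
    labels are +1 / -1. Returns the two weight vectors."""
    m, n = examples.shape
    w_pos = np.ones(n)   # weights voting for the positive class
    w_neg = np.ones(n)   # weights voting for the negative class
    for _ in range(passes):
        for x, y in zip(examples, labels):
            pred = 1 if w_pos @ x >= w_neg @ x else -1
            if pred != y:                      # mistake-driven multiplicative update
                if y == 1:                     # mistake on a positive example
                    w_pos[x == 1] *= (1 + eps)
                    w_neg[x == 1] *= (1 - eps)
                else:                          # mistake on a negative example
                    w_pos[x == 1] *= (1 - eps)
                    w_neg[x == 1] *= (1 + eps)
    return w_pos, w_neg
```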

Practical issues:
- On a batch problem, you may want to cycle through the data, each time with a smaller ε.
- Can also do a margin version: update if the prediction is only just barely correct.
- If you want to output a likelihood, a natural choice is e^{w^+·x} / (e^{w^+·x} + e^{w^-·x}). Can extend to multiclass too.
- William & Vitor have a paper with some other nice practical adjustments.

Winnow versus Perceptron/SVM. Winnow is similar at a high level to Perceptron updates. What's the difference?
- Suppose the data is linearly separable by w · x = 0 with |w · x| ≥ γ.
- For Perceptron, mistakes/samples are bounded by O((L2(w) L2(x) / γ)²).
- For Winnow, mistakes/samples are bounded by O((L1(w) L∞(x) / γ)² log n).
For boolean features, L∞(x) = 1, while L2(x) can be sqrt(n). If the target is sparse and the examples are dense, Winnow is better; e.g., for x random in {0,1}^n and f(x) = x_1, Perceptron makes O(n) mistakes. If the target is dense (most features are relevant) and the examples are sparse, then Perceptron wins.

OK, now on to kernels…

Generic problem. Given a set of images, we want to learn a linear separator to distinguish men from women. Problem: the pixel representation is no good.
- One approach: pick a better set of features! But this seems ad hoc.
- Instead: use a kernel! K(x, y) = Φ(x) · Φ(y), where Φ is an implicit, high-dimensional mapping. Perceptron/SVM only interact with the data through dot products, so they can be "kernelized". If the data is separable in Φ-space by a large L2 margin, you don't have to pay for the dimensionality.

Kernels. E.g., the kernel K(x,y) = (1 + x·y)^d for the case of n=2, d=2 corresponds to an implicit mapping into a (z_1, z_2, z_3) space. [Figure: data that is not linearly separable in the original (x_1, x_2) coordinates becomes linearly separable after the mapping.]

Kernels  Perceptron/SVM only interact with data through dot- products, so can be “kernelized”. If data is separable in  -space by large L 2 margin, don’t have to pay for it.  E.g., K(x,y) = (1 + x  y) d  :(n-diml space) ! (n d -diml space).  E.g., K(x,y) = e -(x-y) 2  Conceptual warning: You’re not really “getting all the power of the high dimensional space without paying for it”. The margin matters. E.g., K(x,y)=1 if x=y, K(x,y)=0 otherwise. Corresponds to mapping where every example gets its own coordinate. Everything is linearly separable but no generalization.

Question: do we need the notion of an implicit space to understand what makes a kernel helpful for learning?

Focus on the batch setting:
- Assume examples are drawn from some probability distribution: a distribution D over x, labeled by a target function c; or a distribution P over pairs (x, ℓ). We will call P (or (c, D)) our "learning problem".
- Given labeled training data, we want the algorithm to do well on new data.

Something funny about the theory of kernels:
- On the one hand, operationally a kernel is just a similarity function: K(x,y) ∈ [-1,1], with some extra requirements. [Here I'm scaling to |Φ(x)| = 1.]
- And in practice, people think of a good kernel as a good measure of similarity between data points for the task at hand.
- But theory talks about margins in an implicit high-dimensional Φ-space: K(x,y) = Φ(x) · Φ(y).

[Cartoon dialogue] Practitioner: "I want to use ML to classify protein structures and I'm trying to decide on a similarity function to use. Any help?" Theory's answer: "It should be positive semidefinite, and should result in your data having a large-margin separator in an implicit high-dimensional space you probably can't even calculate."

[Cartoon dialogue, continued] Practitioner: "Umm… thanks, I guess."

Something funny about the theory of kernels:
- Theory talks about margins in an implicit high-dimensional Φ-space: K(x,y) = Φ(x) · Φ(y). Not great for intuition (do I expect this kernel or that one to work better for me?).
- Can we connect better with the idea of a good kernel being one that is a good notion of similarity for the problem at hand?
- Motivation [BBV]: if there is margin γ in Φ-space, then one can pick Õ(1/γ²) random examples y_1, …, y_n ("landmarks") and do the mapping x → [K(x,y_1), …, K(x,y_n)]. Whp the data in this space will be approximately linearly separable.

Goal: a notion of "good similarity function" that…
1. Talks in terms of more intuitive properties (no implicit high-dimensional spaces, no requirement of positive semidefiniteness, etc).
2. If K satisfies these properties for our given problem, then it has implications for learning.
3. Is broad: includes the usual notion of a "good kernel" (one that induces a large-margin separator in Φ-space).
If so, then this can help with designing the K. [Recent work with Nina, with extensions by Nati Srebro]

Proposal satisfying (1) and (2):
- Say we have a learning problem P (a distribution D over examples labeled by an unknown target f).
- A similarity function K: (x,y) → [-1,1] is (ε,γ)-good for P if at least a 1−ε fraction of examples x satisfy: E_{y~D}[K(x,y) | ℓ(y) = ℓ(x)] ≥ E_{y~D}[K(x,y) | ℓ(y) ≠ ℓ(x)] + γ.
- Q: how could you use this to learn?
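As a sanity check on a given K, one can estimate these two conditional expectations on a labeled sample. A small sketch (my illustration; K, the data, and gamma are inputs you would supply):

```python
import numpy as np

def empirical_goodness(X, labels, K, gamma):
    """Estimate the fraction of sample points x that violate the condition
    above: average similarity to same-label points should exceed average
    similarity to different-label points by at least gamma.
    Returns the empirical epsilon for the given gamma."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    violations = 0
    for i, x in enumerate(X):
        same = [K(x, X[j]) for j in range(len(X)) if labels[j] == labels[i] and j != i]
        diff = [K(x, X[j]) for j in range(len(X)) if labels[j] != labels[i]]
        if not same or not diff:
            continue  # skip points with no comparison set in this small sample
        if np.mean(same) < np.mean(diff) + gamma:
            violations += 1
    return violations / len(X)
```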

How to use it. At least a 1−ε probability mass of x satisfy: E_{y~D}[K(x,y) | ℓ(y) = ℓ(x)] ≥ E_{y~D}[K(x,y) | ℓ(y) ≠ ℓ(x)] + γ.
- Draw S⁺ of O((1/γ²) ln(1/δ²)) positive examples.
- Draw S⁻ of O((1/γ²) ln(1/δ²)) negative examples.
- Classify x based on which set gives the better average score.
- Hoeffding: for any given "good x", the probability of error over the draw of S⁺, S⁻ is at most δ². So there is at most a δ chance that our draw is bad on more than a δ fraction of the "good x".
- With probability ≥ 1−δ, the error rate is ≤ ε + δ.
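A minimal sketch of this rule (illustrative only; the similarity K and the two landmark sets are placeholders you would draw from the data):

```python
import numpy as np

def similarity_classifier(S_plus, S_minus, K):
    """Classify x by whichever landmark set gives the better average
    similarity score, as in the scheme described above."""
    def predict(x):
        avg_pos = np.mean([K(x, yp) for yp in S_plus])
        avg_neg = np.mean([K(x, yn) for yn in S_minus])
        return +1 if avg_pos >= avg_neg else -1
    return predict

# Example of a placeholder similarity (a normalized dot product);
# S_plus / S_minus would be the drawn positive / negative examples.
K = lambda a, b: (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
```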

But not broad enough:
- K(x,y) = x · y can have a good separator yet not satisfy the definition (half of the positives are more similar to negatives than to a typical positive). [Figure: positives (+), negatives (−), 30° angle.]

But not broad enough:
- Idea: it would work if we didn't pick the y's from the top-left.
- Broaden to say: OK if there exists a large region R such that most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label (even if we don't know R in advance). [Figure: same picture as above.]

Broader defn…  Say K:(x,y) ! [-1,1] is an ( ,  )-good similarity function for P if exists a weighting function w(y) 2 [0,1] s.t. at least 1-  frac. of x satisfy:  Can still use for learning: Draw S + = {y 1,…,y n }, S - = {z 1,…,z n }. n=Õ(1/  2 ) Use to “triangulate” data: x  [K(x,y 1 ), …,K(x,y n ), K(x,z 1 ),…,K(x,z n )]. w = [w(y 1 ),…,w(y n ),-w(z 1 ),…,-w(z n )] Whp, exists good separator in this space: w = [w(y 1 ),…,w(y n ),-w(z 1 ),…,-w(z n )] E y~D [w(y)K(x,y)| l (y)= l (x)] ¸ E y~D [w(y)K(x,y)| l (y)  l (x)]+ 

Broader defn…  Say K:(x,y) ! [-1,1] is an ( ,  )-good similarity function for P if exists a weighting function w(y) 2 [0,1] s.t. at least 1-  frac. of x satisfy: E y~D [w(y)K(x,y)| l (y)= l (x)] ¸ E y~D [w(y)K(x,y)| l (y)  l (x)]+  w = [w(y 1 ),…,w(y n ),-w(z 1 ),…,-w(z n )] Whp, exists good separator in this space: w = [w(y 1 ),…,w(y n ),-w(z 1 ),…,-w(z n )] *Technically bounds are better if adjust definition to penalize examples more that fail the inequality badly… So, take new set of examples, project to this space, and run your favorite linear separator learning algorithm.* So, take new set of examples, project to this space, and run your favorite linear separator learning algorithm.*

Algorithm  Draw S + ={y 1, , y d }, S - ={z 1, , z d }, d=O((1/  2 ) ln(1/  2 )). Think of these as “landmarks”.  Use to “triangulate” data: X  [K(x,y 1 ), …,K(x,y d ), K(x,z d ),…,K(x,z d )]. Guarantee: with prob. ¸ 1- , exists linear separator of error ·  +  at margin  /4. Broader defn…  Actually, margin is good in both L 1 and L 2 senses.  This particular approach requires wasting examples for use as the “landmarks”. But could use unlabeled data for this part.

Interesting property of the definition. An (ε,γ)-good kernel [at least a 1−ε fraction of x have margin ≥ γ] is an (ε′,γ′)-good similarity function under this definition. But our current proofs suffer a penalty: ε′ = ε + ε_extra, γ′ = γ³ ε_extra. So, at a qualitative level, we can have a theory of similarity functions that doesn't require implicit spaces. Nati Srebro has improved this to γ², which is tight, and extended it to hinge loss.

Approach we're investigating (with Nina & Mugizi): Take a problem where the original features are already pretty good, plus you have a couple of reasonable similarity functions K_1, K_2, …. Take some unlabeled data as landmarks and use it to enlarge the feature space with K_1(x,y_1), K_2(x,y_1), K_1(x,y_2), …. Run Winnow on the result. Can prove guarantees if some convex combination of the K_i is good.
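A sketch of the feature-enlargement step under stated assumptions: sims is a list of similarity functions and landmarks a set of unlabeled points, both placeholders, and the combined representation would then be handed to a Winnow-style learner.

```python
import numpy as np

def multi_similarity_features(X, landmarks, sims):
    """Augment each example's original features with K_j(x, y_i) for every
    similarity function K_j and unlabeled landmark y_i."""
    def phi(x):
        x = np.asarray(x, dtype=float)
        sim_feats = [K(x, y) for y in landmarks for K in sims]
        return np.concatenate([x, sim_feats])
    return np.array([phi(x) for x in X])

# The enlarged representation can then be fed to Winnow (e.g., the
# balanced_winnow sketch above), which tolerates many irrelevant coordinates;
# real-valued similarities would need rescaling for a boolean-feature Winnow.
```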

Open questions. This view gives some sufficient conditions for a similarity function to be useful for learning, but it doesn't have direct implications for plugging the similarity function directly into an SVM, say. Can one define other interesting, reasonably intuitive, sufficient conditions for a similarity function to be useful for learning?