Learning with Similarity Functions
Maria-Florina Balcan & Avrim Blum, CMU, CSD

Kernels and Similarity Functions
Kernels have become a powerful tool in machine learning: they are useful in practice for dealing with many different kinds of data, and there is an elegant theory about what makes a given kernel good for a given learning problem.
Our goal: analyze more general similarity functions. In the process we describe ways of constructing good data-dependent kernels.

Kernels
A kernel K is a pairwise similarity function such that there exists an implicit mapping Φ with K(x,y) = Φ(x)·Φ(y).
The point is that many learning algorithms can be written so that they interact with the data only via dot products. If we replace x·y with K(x,y), the algorithm acts implicitly as if the data were in the higher-dimensional Φ-space.
If the data is linearly separable by a large margin in Φ-space, we do not have to pay for that space in terms of data or computation time: if the margin is γ in Φ-space (for data in the unit ball), only about 1/γ² examples are needed to learn well.
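As a supplement (not in the original slides), here is a minimal Python sketch of the implicit-mapping view, using a quadratic kernel whose explicit feature map is small enough to write down; the kernel choice is purely illustrative.

import numpy as np

def quadratic_kernel(x, y):
    # K(x, y) = (x . y)^2, computed without ever building the feature space
    return np.dot(x, y) ** 2

def phi(x):
    # Explicit feature map for this kernel: all degree-2 monomials x_i * x_j
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, -1.0, 3.0])
# The kernel value equals the dot product in the implicit Phi-space.
assert np.isclose(quadratic_kernel(x, y), np.dot(phi(x), phi(y)))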

General Similarity Functions
Goal: a definition of a good similarity function for a learning problem that:
1) talks in terms of natural, direct properties: no implicit high-dimensional spaces, no requirement of positive-semidefiniteness;
2) has implications for learning whenever K satisfies these properties for our given problem;
3) is broad: includes the usual notion of a "good kernel" (one that induces a large-margin separator in Φ-space).

A First Attempt: a Definition Satisfying Properties (1) and (2)
Let P be a distribution over labeled examples (x, l(x)).
K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y) = l(x)] ≥ E_{y~P}[K(x,y) | l(y) ≠ l(x)] + γ.
Note: such a K might not be a legal kernel. For example, suppose pairs of positives have K(x,y) ≥ 0.2 and pairs of negatives have K(x,y) ≥ 0.2, but for a positive and a negative K(x,y) is uniform random in [-1,1].
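As a supplement (not in the original slides), a quick numeric check of the remark above: a similarity matrix built this way can have negative eigenvalues and hence is not a legal kernel. The specific values below are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
labels = np.array([+1, +1, +1, -1, -1, -1])
n = len(labels)

# Similarity per the slide's construction: same-label pairs get 0.2 (1.0 on the diagonal),
# cross-label pairs are uniform random in [-1, 1].
K = np.empty((n, n))
for i in range(n):
    for j in range(i, n):
        if labels[i] == labels[j]:
            K[i, j] = K[j, i] = 1.0 if i == j else 0.2
        else:
            K[i, j] = K[j, i] = rng.uniform(-1, 1)

# A legal kernel matrix must be positive semidefinite; this one typically is not.
print(np.linalg.eigvalsh(K).min())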

A First Attempt: How to Use It?
K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y) = l(x)] ≥ E_{y~P}[K(x,y) | l(y) ≠ l(x)] + γ.
Algorithm:
Draw a set S+ of O((1/γ²) ln(1/δ²)) positive examples and a set S- of O((1/γ²) ln(1/δ²)) negative examples.
Classify x based on which set gives the better average similarity score.
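As a supplement (not in the original slides), a minimal Python sketch of this first-attempt rule: compare the average similarity of x to the positive and negative reference samples. The cosine similarity and the data points are placeholders.

import numpy as np

def classify(x, S_plus, S_minus, K):
    # Predict the label whose reference sample gives the better average similarity score.
    pos_score = np.mean([K(x, y) for y in S_plus])
    neg_score = np.mean([K(x, z) for z in S_minus])
    return +1 if pos_score >= neg_score else -1

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

S_plus = [np.array([1.0, 0.1]), np.array([0.9, 0.2])]
S_minus = [np.array([-1.0, 0.0]), np.array([-0.8, 0.3])]
print(classify(np.array([0.7, 0.05]), S_plus, S_minus, cosine_sim))  # prints 1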

A First Attempt: Guarantee and Proof
K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y) = l(x)] ≥ E_{y~P}[K(x,y) | l(y) ≠ l(x)] + γ.
Algorithm: draw S+ of O((1/γ²) ln(1/δ²)) positive examples and S- of O((1/γ²) ln(1/δ²)) negative examples, and classify x based on which set gives the better average score.
Guarantee: with probability ≥ 1-δ, the error is at most ε + δ.
Proof: by Hoeffding, for any given "good" x, the probability of error with respect to x (over the draw of S+ and S-) is at most δ². By Markov's inequality, there is at most a δ chance that the error rate over the good x's exceeds δ. So the overall error rate is at most ε + δ.
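As a supplement (not in the original slides), the Hoeffding step written out, assuming |S+| = |S-| = d, K(x,y) ∈ [-1,1], and illustrative constants:

\Pr\Big[\Big|\tfrac{1}{d}\textstyle\sum_{y \in S^{+}} K(x,y) - \mathbb{E}_{y \sim P}\big[K(x,y)\,\big|\,\ell(y)=+1\big]\Big| \ \ge\ \gamma/2\Big] \ \le\ 2e^{-d\gamma^{2}/8},

and similarly for S-. Taking d = (8/γ²) ln(4/δ²) makes each deviation probability at most δ²/2, so a union bound gives failure probability at most δ² for any fixed good x; since its two conditional means differ by at least γ, such an x is then classified correctly.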

A First Attempt: Not Broad Enough
K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y) = l(x)] ≥ E_{y~P}[K(x,y) | l(y) ≠ l(x)] + γ.
Problem: there are distributions for which K(x,y) = x·y has a good (large-margin) separator but does not satisfy this definition, because some points are more similar to the negatives than to a typical positive.

A First Attempt: Not Broad Enough
K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y) = l(x)] ≥ E_{y~P}[K(x,y) | l(y) ≠ l(x)] + γ.
Idea: the definition would work if we did not pick the y's from the problematic region (the top-left in the example). Broaden it to say: K is OK if there exists a large region R such that most x are on average more similar to the points y ∈ R of the same label than to the points y ∈ R of the other label.

Broader/Main Definition
K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
E_{y~P}[w(y)K(x,y) | l(y) = l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y) ≠ l(x)] + γ.

Main Definition: How to Use It
K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
E_{y~P}[w(y)K(x,y) | l(y) = l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y) ≠ l(x)] + γ.
Algorithm:
Draw S+ = {y_1, …, y_d} and S- = {z_1, …, z_d}, with d = O((1/γ²) ln(1/δ²)).
Use them to "triangulate" the data: F(x) = [K(x,y_1), …, K(x,y_d), K(x,z_1), …, K(x,z_d)].
Take a new set of labeled examples, project them to this space, and run your favorite algorithm for learning linear separators.
The point: with probability ≥ 1-δ, there exists a linear separator of error at most ε + δ at margin γ/4, namely w = [w(y_1), …, w(y_d), -w(z_1), …, -w(z_d)].
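As a supplement (not in the original slides), a minimal Python sketch of the map-and-learn step, using scikit-learn's LinearSVC as a stand-in for "your favorite algorithm for learning linear separators"; the similarity function and data are illustrative.

import numpy as np
from sklearn.svm import LinearSVC

def triangulate(X, landmarks_pos, landmarks_neg, K):
    # F(x) = [K(x, y_1), ..., K(x, y_d), K(x, z_1), ..., K(x, z_d)]
    feats = [[K(x, y) for y in landmarks_pos] + [K(x, z) for z in landmarks_neg] for x in X]
    return np.array(feats)

# Hypothetical bounded similarity; note it need not be PSD for this approach.
def sim(a, b):
    return float(np.tanh(np.dot(a, b)))

rng = np.random.default_rng(1)
X_pos = rng.normal(+1.0, 0.5, size=(30, 5))
X_neg = rng.normal(-1.0, 0.5, size=(30, 5))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 30 + [-1] * 30)

landmarks_pos, landmarks_neg = X_pos[:5], X_neg[:5]    # S+ and S-
F = triangulate(X, landmarks_pos, landmarks_neg, sim)  # map data to the landmark space
clf = LinearSVC(C=1.0).fit(F, y)                       # learn a linear separator there
print(clf.score(F, y))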

Main Definition: Implications
Algorithm: draw S+ = {y_1, …, y_d} and S- = {z_1, …, z_d}, with d = O((1/γ²) ln(1/δ²)), and use them to "triangulate" the data: F(x) = [K(x,y_1), …, K(x,y_d), K(x,z_1), …, K(x,z_d)].
Guarantee: with probability ≥ 1-δ, there exists a linear separator of error at most ε + δ at margin γ/4.
Implication: an (ε,γ)-good similarity function, whether a legal kernel or an arbitrary similarity function, yields via this mapping an (ε+δ, γ/4)-good kernel function.

Good Kernels are Good Similarity Functions
Main definition (recalled): K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
E_{y~P}[w(y)K(x,y) | l(y) = l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y) ≠ l(x)] + γ.
Theorem: an (ε,γ)-good kernel is an (ε',γ')-good similarity function under the main definition. Our current proofs incur some penalty: ε' = ε + ε_extra, γ' = γ³ε_extra.

Good Kernels are Good Similarity Functions (Proof Sketch)
Theorem: an (ε,γ)-good kernel is an (ε',γ')-good similarity function under the main definition, where ε' = ε + ε_extra and γ' = γ³ε_extra.
Proof sketch: suppose K is a good kernel in the usual sense. Then standard margin bounds imply that if S is a random sample of size Õ(1/(εγ²)), then with high probability we can give weights w_S(y) to all examples y ∈ S so that the weighted sum of these examples defines a good linear threshold function. But we want sample-independent weights, and bounded ones. Boundedness is not too hard (imagine a margin-perceptron run over just the good y's); sample-independence follows from an averaging argument.

Learning with Multiple Similarity Functions
Let K_1, …, K_r be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good.
Algorithm: draw S+ = {y_1, …, y_d} and S- = {z_1, …, z_d}, with d = O((1/γ²) ln(1/δ²)), and use them to "triangulate" the data:
F(x) = [K_1(x,y_1), …, K_r(x,y_d), K_1(x,z_1), …, K_r(x,z_d)].
Guarantee: the induced distribution F(P) in R^{2dr} has a separator of error at most ε + δ at margin at least γ/(4√r); by standard margin bounds, the sample complexity is therefore roughly a factor r larger than with a single similarity function.
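As a supplement (not in the original slides), a minimal Python sketch of the concatenated mapping over several similarity functions; the two similarity functions are illustrative placeholders.

import numpy as np

def triangulate_multi(X, landmarks_pos, landmarks_neg, sims):
    # F(x) concatenates K_j(x, y_i) and K_j(x, z_i) over all r similarity functions
    # and all d landmarks of each label, giving a 2*d*r-dimensional representation.
    rows = []
    for x in X:
        row = [K(x, y) for K in sims for y in landmarks_pos] + \
              [K(x, z) for K in sims for z in landmarks_neg]
        rows.append(row)
    return np.array(rows)

# Two illustrative bounded similarity functions.
sims = [lambda a, b: float(np.tanh(np.dot(a, b))),
        lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))]

X = np.random.default_rng(2).normal(size=(4, 3))
F = triangulate_multi(X, landmarks_pos=X[:1], landmarks_neg=X[1:2], sims=sims)
print(F.shape)  # (4, 4): 2 * d * r columns with d = 1 landmark per label and r = 2 similarities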

Implications & Conclusions
We develop a theory that provides a formal way of understanding kernels as similarity functions. Our algorithms work for similarity functions that are not necessarily PSD (or even symmetric).
Open problems: better results for learning with multiple similarity functions; extending [SB'06]; improving existing bounds.