
A Theory of Learning and Clustering via Similarity Functions Maria-Florina Balcan 09/17/2007 Joint work with Avrim Blum and Santosh Vempala Carnegie Mellon University

2-Minute Version. Generic classification problem: learn to distinguish men from women. Problem: pixel representation not so good. Powerful technique: use a kernel, a special kind of similarity function K(·,·). Nice SLT theory, but it is stated in terms of implicit mappings. Can we develop a theory that views K as a measure of similarity? What are general sufficient conditions for K to be useful for learning?

2-Minute Version. Generic classification problem: learn to distinguish men from women. Problem: pixel representation not so good. Powerful technique: use a kernel, a special kind of similarity function K(·,·). What if we don't have any labeled data (i.e., clustering)? Can we develop a theory of conditions sufficient for K to be useful now?

Part I: On Similarity Functions for Classification

Kernel Functions and Learning. E.g., given images labeled by gender, learn a rule to distinguish men from women. [Goal: do well on new data.] Problem: our best algorithms learn linear separators, which are not good for data in its natural (pixel) representation. Old approach: learn a more complex class of functions. New approach: use a kernel.

Kernels, Kernelizable Algorithms. K is a kernel if there exists an implicit mapping Φ s.t. K(x,y) = Φ(x)·Φ(y). Point: many algorithms interact with the data only via dot products, so if we replace x·y with K(x,y), they act implicitly as if the data lived in the higher-dimensional Φ-space. If the data is linearly separable by a large margin in Φ-space, we don't have to pay for that dimension in sample complexity or computation time: with margin γ in Φ-space, only about 1/γ² examples are needed to learn well.
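
As a concrete illustration (not from the slides), here is a minimal Python check that the quadratic kernel K(x,y) = (x·y)² equals a dot product under an explicit feature map Φ, which is exactly the implicit-mapping property above:

```python
import numpy as np

def poly_kernel(x, y):
    # K(x, y) = (x . y)^2, computed without ever constructing phi(x)
    return np.dot(x, y) ** 2

def phi(x):
    # explicit degree-2 feature map: phi(x)_{ij} = x_i * x_j
    return np.array([x[i] * x[j] for i in range(len(x)) for j in range(len(x))])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))  # same value either way
```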

Kernels and Similarity Functions. Kernels: useful for many kinds of data, elegant SLT. Our work: analyze more general similarity functions. Characterization of good similarity functions: 1) In terms of natural, direct properties: no implicit high-dimensional spaces, no requirement of positive semi-definiteness. 2) If K satisfies these properties, it can be used for learning. 3) Broad: includes the usual notion of a "good kernel" (one with a large-margin separator in Φ-space).

A First Attempt: Definition Satisfying (1) and (2). K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1−ε prob. mass of x satisfy: E_{y∼P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y∼P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ. Here P is the distribution over labeled examples (x, ℓ(x)). Note: such a K might not be a legal kernel. E.g., K(x,y) ≥ 0.2 whenever ℓ(x) = ℓ(y), and K(x,y) random in [-1,1] whenever ℓ(x) ≠ ℓ(y).
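
A hedged sketch of how one might check this condition empirically on a finite labeled sample (my own illustration; the names below are hypothetical, and sample averages stand in for the population expectations E_{y∼P}):

```python
import numpy as np

def fraction_failing(X, labels, K, gamma):
    """Estimate eps: the fraction of points x whose average similarity to
    same-label points fails to beat that to different-label points by gamma.
    Assumes both labels occur in the sample."""
    X, labels = list(X), np.asarray(labels)
    bad = 0
    for i, x in enumerate(X):
        same = [K(x, X[j]) for j in range(len(X)) if j != i and labels[j] == labels[i]]
        diff = [K(x, X[j]) for j in range(len(X)) if labels[j] != labels[i]]
        if np.mean(same) < np.mean(diff) + gamma:
            bad += 1
    return bad / len(X)
```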

A First Attempt: Definition Satisfying (1) and (2). How to use it? Recall: K is an (ε,γ)-good similarity for P if a 1−ε prob. mass of x satisfy E_{y∼P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y∼P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ. Algorithm: Draw a set S+ of O((1/γ²) ln(1/δ²)) positive examples and a set S− of O((1/γ²) ln(1/δ²)) negative examples. Classify x based on which set gives the better average similarity score. Guarantee: with probability ≥ 1−δ, the error is ≤ ε + δ.
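
A minimal sketch of this classification rule (assuming only that K is a bounded pairwise similarity; function and variable names are mine):

```python
import numpy as np

def classify(x, S_pos, S_neg, K):
    """Label x by whichever of the drawn samples S+ / S- it is more similar
    to on average."""
    score_pos = np.mean([K(x, y) for y in S_pos])
    score_neg = np.mean([K(x, y) for y in S_neg])
    return +1 if score_pos >= score_neg else -1
```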

A First Attempt: Definition Satisfying (1) and (2). Why does it work? Hoeffding: for any given "good" x, the probability of error w.r.t. x (over the draw of S+, S−) is ≤ δ². By Markov, there is at most a δ chance that the error rate over the GOOD points is ≥ δ. Adding the ε probability mass of bad points gives an overall error rate ≤ ε + δ, i.e., the guarantee: with probability ≥ 1−δ, error ≤ ε + δ.
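
One way to fill in the Hoeffding step (a sketch; the exact constants are mine, not the slides'). Since K takes values in [-1,1], for any fixed good x,

$$\Pr_{S^{+}}\!\Big[\Big|\tfrac{1}{d}\textstyle\sum_{y\in S^{+}}K(x,y)-\mathbb{E}_{y\sim P}[K(x,y)\mid \ell(y)=\ell(x)]\Big|>\gamma/2\Big]\le 2e^{-d\gamma^{2}/8},$$

and similarly for S−. A union bound gives a misclassification probability for x of at most 4e^{−dγ²/8} ≤ δ² once d = O((1/γ²) ln(1/δ²)). Markov's inequality then says that, with probability at least 1−δ over the draw of S±, at most a δ fraction of the good points are misclassified; adding the ε mass of bad points gives the ε + δ bound.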

A First Attempt: Not Broad Enough. Problem: K(x,y) = x·y can have a large-margin separator and yet fail to satisfy our definition. [Figure: an example distribution, with a point annotated "more similar to + than to typical −".]

A First Attempt: Not Broad Enough. Broaden the definition: it is OK if there exists a non-negligible region R s.t. most x are on average more similar to the points y ∈ R of their own label than to the points y ∈ R of the other label.

Broader/Main Definition. K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] s.t. a 1−ε prob. mass of x satisfy: E_{y∼P}[w(y)K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y∼P}[w(y)K(x,y) | ℓ(y)≠ℓ(x)] + γ. Algorithm: Draw S+ = {y_1,…,y_d} and S− = {z_1,…,z_d}, with d = O((1/γ²) ln(1/δ²)). "Triangulate" the data: F(x) = [K(x,y_1),…,K(x,y_d), K(x,z_1),…,K(x,z_d)]. Take a new set of labeled examples, project them into this space, and run any algorithm for learning linear separators. Theorem: with probability ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
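
A sketch of the triangulation step in Python (my illustration; scikit-learn's LinearSVC is used as a stand-in for "any algorithm for learning linear separators", but any margin-based linear learner would do):

```python
import numpy as np
from sklearn.svm import LinearSVC  # stand-in for "any linear-separator learner"

def triangulate(X, landmarks, K):
    """F(x) = [K(x, y_1), ..., K(x, y_d), K(x, z_1), ..., K(x, z_d)]."""
    return np.array([[K(x, y) for y in landmarks] for x in X])

# landmarks = list(S_pos) + list(S_neg)         # the 2d points drawn above
# F_train = triangulate(X_train, landmarks, K)  # project labeled data into F-space
# clf = LinearSVC().fit(F_train, y_train)       # learn a linear separator there
# y_hat = clf.predict(triangulate(X_test, landmarks, K))
```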

Main Definition & Algorithm: Implications. Recall: S+ = {y_1,…,y_d}, S− = {z_1,…,z_d}, d = O((1/γ²) ln(1/δ²)); "triangulate" the data via F(x) = [K(x,y_1),…,K(x,y_d), K(x,z_1),…,K(x,z_d)]. Theorem: with prob. ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4. [Diagram relating the classes: legal kernel K, arbitrary similarity function, (ε,γ)-good similarity function, (ε+δ, γ/4)-good kernel function.] Theorem: any (ε,γ)-good kernel is an (ε',γ')-good similarity function, with some penalty: ε' = ε + ε_extra, γ' = γ² ε_extra.

Similarity Functions for Classification, Summary Formal way of understanding kernels as similarity functions. Algorithms and guarantees for general similarity functions that aren’t necessarily PSD.

Part II: Can we use this angle to help think about Clustering?

What if only unlabeled examples are available? [documents, images] Problem: we only have unlabeled data! S is a set of n objects; each object has a true label ℓ(x) in {1,…,t} [the topic, e.g., sports or fashion], and there is some (unknown) "ground truth" clustering. Goal: a hypothesis h of low error up to isomorphism of the label names: Err(h) = min_σ Pr_{x∼S}[σ(h(x)) ≠ ℓ(x)], where σ ranges over permutations of {1,…,t}. But we have a similarity function!
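
For concreteness, Err(h) can be computed on a finite sample by matching cluster names to label names with the Hungarian algorithm (a sketch assuming scipy is available; when the number of clusters equals the number of labels this is exactly the minimum over permutations σ):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(h_labels, true_labels):
    """Err(h) = min over label matchings sigma of Pr_{x in S}[sigma(h(x)) != l(x)]."""
    h, l = np.asarray(h_labels), np.asarray(true_labels)
    clusters, classes = np.unique(h), np.unique(l)
    # confusion[i, j] = number of points put in cluster i whose true label is j
    confusion = np.array([[np.sum((h == c) & (l == t)) for t in classes] for c in clusters])
    row, col = linear_sum_assignment(-confusion)   # matching that maximizes agreement
    return 1.0 - confusion[row, col].sum() / len(h)
```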

Contrast with the "Standard" Approach. Traditional approach: the input is a graph or an embedding of points into R^d — closer to learning mixtures of Gaussians; analyze algorithms that optimize various criteria; ask which criterion produces "better-looking" results. We flip this perspective around: discriminative, not generative. This is more natural, since the input graph/similarity is merely based on some heuristic.

What conditions on a similarity function would be enough to allow one to cluster well? A condition that trivially works: K(x,y) > 0 for all x,y with ℓ(x) = ℓ(y), and K(x,y) < 0 for all x,y with ℓ(x) ≠ ℓ(y).

Strict Ordering Property: K is s.t. all x are more similar to points y in their own cluster than to any y' in other clusters. This is still strong. Problem: the same K can satisfy it for two very different clusterings of the same data! And unlike learning, you can't even test your hypotheses. [Figure: two different clusterings of {soccer, tennis, Lacoste, Coco Chanel} into sports/fashion groups, both consistent with the same K.]

Relax Our Goals. 1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it. [Figure: a hierarchy over {soccer, tennis, Lacoste, Coco Chanel} — "All topics" splits into sports (soccer, tennis) and fashion (Lacoste, Coco Chanel); successive slides highlight different prunings of this tree, one of which matches the target clustering.]

Relax Our Goals. 1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it. 2. Produce a list of clusterings s.t. at least one has low error. Tradeoff: the strength of the assumption vs. the size of the list.

Start Getting Nice Algorithms/Properties. Strict Ordering Property: K is s.t. all x are more similar to points y in their own cluster than to any y' in other clusters. Sufficient for hierarchical clustering. Weak Stability Property: for all clusters C, C' and for all A ⊂ C, A' ⊂ C', at least one of A, A' is more attracted to its own cluster than to the other. Also sufficient for hierarchical clustering.
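
A small sketch of what the Strict Ordering Property asks for, written as a checker over a finite similarity matrix (my own illustration, hypothetical names):

```python
import numpy as np

def satisfies_strict_ordering(sim, clusters):
    """sim[i, j] = K(x_i, x_j); clusters[i] = target cluster of x_i.
    True iff every point is more similar to every point in its own cluster
    than to any point in a different cluster."""
    sim, clusters = np.asarray(sim), np.asarray(clusters)
    n = len(clusters)
    for i in range(n):
        same = (clusters == clusters[i]) & (np.arange(n) != i)
        other = clusters != clusters[i]
        if same.any() and other.any() and sim[i, same].min() <= sim[i, other].max():
            return False
    return True
```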

Example Analysis for the Strong Stability Property. Strong Stability: K is s.t. for all clusters C, C', all A ⊂ C, A' ⊂ C': K(A, C−A) > K(A, A'), where K(A, A') denotes the average attraction (average pairwise similarity) between A and A'. Algorithm: average single-linkage — repeatedly merge the two current "parts" whose average similarity is highest. Analysis: claim that all "parts" ever made are laminar w.r.t. the target clustering. Failure would mean merging P1, P2 with P1 ⊂ C and P2 ∩ C = ∅. But C−P1 is a union of current parts, so some part P3 ⊂ C−P1 must satisfy K(P1, P3) ≥ K(P1, C−P1); and by strong stability K(P1, C−P1) > K(P1, P2). So the algorithm would have preferred merging P1 with P3 over P2 — contradiction.
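
A naive Python sketch of average single-linkage as described above (quadratic-or-worse time, purely to make the merge rule concrete; names are mine):

```python
import numpy as np

def average_linkage(S, K):
    """Repeatedly merge the two current 'parts' with the highest average
    pairwise similarity; return the sequence of merges (the hierarchy)."""
    parts = [[i] for i in range(len(S))]          # start from singletons
    merges = []

    def avg_sim(A, B):
        return np.mean([K(S[i], S[j]) for i in A for j in B])

    while len(parts) > 1:
        a, b = max(((i, j) for i in range(len(parts)) for j in range(i + 1, len(parts))),
                   key=lambda ij: avg_sim(parts[ij[0]], parts[ij[1]]))
        merges.append((parts[a], parts[b]))
        parts = [p for k, p in enumerate(parts) if k not in (a, b)] + [parts[a] + parts[b]]
    return merges
```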

Strong Stability Property, Inductive Setting. Inductive setting: draw a sample S, hierarchically partition S, then insert new points as they arrive. Assume for all C, C', all A ⊂ C, A' ⊆ C': K(A, C−A) > K(A, A') + γ. Need to argue that sampling preserves stability: a sample-complexity-type argument using regularity-type results of [AFKK].

Weaker Conditions. Average Attraction Property: E_{x'∈C(x)}[K(x,x')] > E_{x'∈C'}[K(x,x')] + γ for all C' ≠ C(x). Not sufficient for a hierarchy, and it might cause bottom-up algorithms to fail; but one can produce a small list of clusterings: upper bound t^{O(t/γ²)} clusterings [doesn't depend on n], lower bound roughly t^{Ω(1/γ)}. Stability of Large Subsets Property: sufficient for a hierarchy — find the hierarchy using a learning-based algorithm (running time t^{O(t/γ²)}).
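
A rough sketch of the list-producing idea under average attraction (my illustration of the flavor of the approach, not the paper's exact algorithm): guess every labeling of a small random sample and extend each guess by average attraction; if K satisfies the property, at least one of the t^m candidates should have low error.

```python
import itertools
import numpy as np

def list_clusterings(S, K, t, m, rng=np.random.default_rng(0)):
    """Return t**m candidate clusterings of S, one per labeling of an m-point sample."""
    sample = rng.choice(len(S), size=m, replace=False)
    candidates = []
    for labeling in itertools.product(range(t), repeat=m):
        # group the sampled points according to the guessed labeling
        groups = {c: [S[i] for i, lab in zip(sample, labeling) if lab == c] for c in range(t)}
        def score(x, c):
            return np.mean([K(x, y) for y in groups[c]]) if groups[c] else -np.inf
        # assign every point to the guessed label it is most attracted to on average
        candidates.append([max(range(t), key=lambda c: score(x, c)) for x in S])
    return candidates
```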

Similarity Functions for Clustering, Summary. Minimal conditions on K to be useful for clustering: list clustering and hierarchical clustering. A discriminative/SLT-style model for clustering with non-interactive feedback. Our notion of a property is the analogue of a data-dependent concept class in classification.