On a Theory of Similarity Functions for Learning and Clustering Avrim Blum Carnegie Mellon University [Includes work joint with Nina Balcan, Nati Srebro, and Santosh Vempala]

2-minute version Suppose we are given a set of images, and want to learn a rule to distinguish men from women. Problem: pixel representation not so good. A powerful technique for such settings is to use a kernel: a special kind of pairwise function K(·,·). In practice, choose K to be a good measure of similarity, but the theory is in terms of implicit mappings. Q: Can we bridge the gap? A theory that just views K as a measure of similarity? Ideally, make it easier to design good functions, and be more general too.

2-minute version Suppose we are given a set of images, and want to learn a rule to distinguish men from women. Problem: pixel representation not so good. A powerful technique for such settings is to use a kernel: a special kind of pairwise function K(·,·). In practice, choose K to be a good measure of similarity, but the theory is in terms of implicit mappings. Q: What if we only have unlabeled data (i.e., clustering)? Can we develop a theory of properties that are sufficient to be able to cluster well?

2-minute version Suppose we are given a set of images, and want to learn a rule to distinguish men from women. Problem: pixel representation not so good. A powerful technique for such settings is to use a kernel: a special kind of pairwise function K(·,·). In practice, choose K to be a good measure of similarity, but the theory is in terms of implicit mappings. Develop a kind of PAC model for clustering.

Part 1: On similarity functions for learning

Kernel functions and Learning Back to our generic classification problem. E.g., given a set of images, labeled by gender, learn a rule to distinguish men from women. [Goal: do well on new data] Problem: our best algorithms learn linear separators, but these might not be good for data in its natural representation. –Old approach: use a more complex class of functions. –New approach: use a kernel.

What’s a kernel? A kernel K is a legal def of dot-product: a fn s.t. there exists an implicit mapping φ_K such that K(x,y) = φ_K(x)·φ_K(y). E.g., K(x,y) = (x·y + 1)^d. –φ_K: (n-diml space) → (n^d-diml space). Point is: many learning algs can be written so they only interact with data via dot-products. –If we replace x·y with K(x,y), the alg acts implicitly as if the data were in the higher-dimensional φ-space. Kernel should be pos. semi-definite (PSD).

Example E.g., for the case of n=2, d=2, the kernel K(x,y) = (1 + x·y)^d corresponds to the mapping: [figure: X’s and O’s not linearly separable in the original (x1,x2) space become linearly separable in the (z1,z2,z3) φ-space]
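As a quick sanity check on the implicit-mapping picture (not part of the original slides), the following Python sketch verifies that for n=2, d=2 the kernel K(x,y) = (1 + x·y)² equals an explicit 6-dimensional dot product φ(x)·φ(y); the particular feature map φ below is the standard one for this kernel.

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Degree-d polynomial kernel K(x,y) = (1 + x.y)^d."""
    return (1.0 + np.dot(x, y)) ** d

def phi(x):
    """Explicit degree-2 feature map for 2-d inputs, chosen so that
    phi(x).phi(y) == (1 + x.y)^2."""
    x1, x2 = x
    return np.array([1.0,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))
print("K(x,y) = (1 + x.y)^2 matches the explicit 6-dim feature map.")
```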

Moreover, generalize well if good margin Kernels found to be useful in practice for dealing with many, many different kinds of data. If data is lin. separable by margin γ in φ-space, then need sample size only Õ(1/γ²) to get confidence in generalization. Assume |φ(x)| ≤ 1.
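For context, here is one standard form of the margin bound being invoked, stated loosely (constants and logarithmic factors omitted, and the exact statement varies across sources); this is background, not a claim from the talk itself.

```latex
% Loose form of a standard margin bound (hedged; not verbatim from the talk).
% Suppose \|\phi(x)\| \le 1 for all x and the sample of size m is separable in
% \phi-space by a linear separator with margin \gamma. Then, with probability
% at least 1-\delta, a consistent large-margin separator h satisfies
\[
  \operatorname{err}(h) \;\le\; \tilde{O}\!\left(\frac{1/\gamma^{2} + \log(1/\delta)}{m}\right),
\]
% so m = \tilde{O}\!\big((1/\epsilon)\,(1/\gamma^{2} + \log(1/\delta))\big)
% samples suffice for error at most \epsilon.
```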

…but there’s something a little funny: On the one hand, operationally a kernel is just a similarity measure: K(x,y) ∈ [-1,1], with some extra reqts. But theory talks about margins in the implicit high-dimensional φ-space, K(x,y) = φ(x)·φ(y).

I want to use ML to classify protein structures and I’m trying to decide on a similarity fn to use. Any help? It should be pos. semidefinite, and should result in your data having a large margin separator in an implicit high-diml space you probably can’t even calculate.

Umm… thanks, I guess. It should be pos. semidefinite, and should result in your data having a large margin separator in an implicit high-diml space you probably can’t even calculate.

Moreover, generalize well if good margin …but there’s something a little funny: On the one hand, operationally a kernel is just a similarity function: K(x,y) ∈ [-1,1], with some extra reqts. But theory talks about margins in the implicit high-dimensional φ-space, K(x,y) = φ(x)·φ(y). –Can we bridge the gap? –Standard theory has a something-for-nothing feel to it. “All the power of the high-dim’l implicit space without having to pay for it”. More prosaic explanation?

Question: do we need the notion of an implicit space to understand what makes a kernel helpful for learning?

Goal: notion of “good similarity function” for a learning problem that… 1. Talks in terms of more intuitive properties (no implicit high-diml spaces, no requirement of positive-semidefiniteness, etc). E.g., natural properties of the weighted graph induced by K. 2. If K satisfies these properties for our given problem, then it has implications for learning. 3. Is broad: includes the usual notion of “good kernel” (one that induces a large margin separator in φ-space).

Defn satisfying (1) and (2): Say we have a learning problem P (distribution D over examples labeled by unknown target f). Sim fn K: (x,y) → [-1,1] is (ε,γ)-good for P if at least a 1-ε fraction of examples x satisfy: E_{y~D}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~D}[K(x,y) | ℓ(y)≠ℓ(x)] + γ. “most x are on average more similar to points y of their own type than to points y of the other type”

Defn satisfying (1) and (2): Say we have a learning problem P (distribution D over examples labeled by unknown target f). Sim fn K: (x,y) → [-1,1] is (ε,γ)-good for P if at least a 1-ε fraction of examples x satisfy: E_{y~D}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~D}[K(x,y) | ℓ(y)≠ℓ(x)] + γ. Note: it’s possible to satisfy this and not even be a valid kernel. E.g., K(x,y) = 0.2 within each class, uniform random in {-1,1} between classes.

Defn satisfying (1) and (2): Say we have a learning problem P (distribution D over examples labeled by unknown target f). Sim fn K: (x,y) → [-1,1] is (ε,γ)-good for P if at least a 1-ε fraction of examples x satisfy: E_{y~D}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~D}[K(x,y) | ℓ(y)≠ℓ(x)] + γ. How can we use it?

How to use it At least a 1-ε prob mass of x satisfy: E_{y~D}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~D}[K(x,y) | ℓ(y)≠ℓ(x)] + γ. Draw S+ of O((1/γ²) ln(1/δ²)) positive examples. Draw S- of O((1/γ²) ln(1/δ²)) negative examples. Classify x based on which gives the better score. –Hoeffding: for any given “good x”, prob of error over the draw of S+, S- is at most δ². –So, at most δ chance our draw is bad on more than a δ fraction of “good x”. With prob ≥ 1-δ, error rate ≤ ε + δ.
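A minimal Python sketch of the scheme on this slide (illustrative, not from the talk): draw a positive set S+ and a negative set S-, and classify each new x by whichever set it is more similar to on average. The similarity function and toy data below are stand-ins.

```python
import numpy as np

def average_score_classifier(K, S_plus, S_minus):
    """Return a classifier that labels x by whichever of S_plus / S_minus
    it is more similar to on average, as in the slide above."""
    def classify(x):
        avg_pos = np.mean([K(x, y) for y in S_plus])
        avg_neg = np.mean([K(x, y) for y in S_minus])
        return +1 if avg_pos >= avg_neg else -1
    return classify

# Illustrative usage with a toy similarity (cosine similarity) and toy data.
def cosine_sim(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

rng = np.random.default_rng(1)
S_plus  = rng.normal(loc=+1.0, size=(50, 5))   # stand-in positive sample
S_minus = rng.normal(loc=-1.0, size=(50, 5))   # stand-in negative sample
h = average_score_classifier(cosine_sim, S_plus, S_minus)
print(h(rng.normal(loc=+1.0, size=5)), h(rng.normal(loc=-1.0, size=5)))
```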

But not broad enough K(x,y) = x·y has a good separator but doesn’t satisfy the defn. (half of the positives are more similar to negs than to a typical pos) [figure annotation: avg simil to negs is 0.5, but to pos is only 0.25]

But not broad enough Idea: would work if we didn’t pick y’s from the top-left. Broaden to say: OK if ∃ a large region R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label. (even if we don’t know R in advance)

Broader defn… Ask that there exists a set R of “reasonable” y (allow probabilistic) s.t. almost all x satisfy E_y[K(x,y) | ℓ(y)=ℓ(x), R(y)] ≥ E_y[K(x,y) | ℓ(y)≠ℓ(x), R(y)] + γ, and at least a τ probability mass of reasonable positives/negatives. But now, how can we use it for learning??

Broader defn… Ask that there exists a set R of “reasonable” y (allow probabilistic) s.t. almost all x satisfy E_y[K(x,y) | ℓ(y)=ℓ(x), R(y)] ≥ E_y[K(x,y) | ℓ(y)≠ℓ(x), R(y)] + γ. –Draw S = {y_1,…,y_n}, n ≈ 1/(γ²τ). (could be unlabeled) –View these as “landmarks”, and use them to map new data: F(x) = [K(x,y_1), …, K(x,y_n)]. –Whp, there exists a separator of good L_1 margin in this space: e.g., w = [0,0,1/n_+,1/n_+,0,0,0,-1/n_-,0,0]. –So, take a new set of examples, project to this space, and run a good L_1 alg (Winnow).
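Here is a hedged Python sketch of the landmark construction just described: map each example x to F(x) = [K(x,y_1),…,K(x,y_n)] over a (possibly unlabeled) landmark sample, then train a linear learner in that space. An L1-penalized logistic regression is used as a stand-in for “a good L1 algorithm” such as Winnow; the similarity function and data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def landmark_features(K, X, landmarks):
    """Map each example x to F(x) = [K(x, y_1), ..., K(x, y_n)]."""
    return np.array([[K(x, y) for y in landmarks] for x in X])

def rbf_sim(x, y, sigma=1.0):
    """Illustrative similarity function (any K(x,y) in [-1,1] would do)."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)))

rng = np.random.default_rng(0)
# Toy data: two Gaussian blobs.
X_pos = rng.normal(+1.0, 1.0, size=(100, 3))
X_neg = rng.normal(-1.0, 1.0, size=(100, 3))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 100 + [-1] * 100)

landmarks = X[rng.choice(len(X), size=20, replace=False)]  # can be unlabeled
F = landmark_features(rbf_sim, X, landmarks)

# L1-penalized linear learner in landmark space (stand-in for Winnow).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(F, y)
print("training accuracy:", clf.score(F, y))
```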

And furthermore Now, the defn is broad enough to include all large margin kernels (with some loss in parameters): –γ-good margin ⇒ apx (ε, γ², τ)-good here. But now, we don’t need to think about implicit spaces or require the kernel to even have the implicit-space interpretation. If PSD, can also show the reverse: –γ-good here & PSD ⇒ γ-good margin.

And furthermore In fact, can even show a separation. Consider a class C of n pairwise uncorrelated functions over n examples (unif distrib). Can show that for any kernel K, the expected margin for a random f in C would be O(1/n^{1/2}). But, can define a similarity function with γ=1, P(R)=1/n. [K(x_i,x_j) = f_j(x_i) f_j(x_j)] (technically, easier using a slight variant on the def: E_y[K(x,y) ℓ(x) ℓ(y) | R(y)] ≥ γ)

Summary: part 1 Can develop sufficient conditions for a similarity fn to be useful for learning that don’t require implicit spaces. Property includes the usual notion of “good kernels” modulo the loss in some parameters. –Can apply to similarity fns that aren’t positive-semidefinite (or even symmetric). [figure: diagram relating kernels, defn 1, and defn 2]

Summary: part 1 Potentially other interesting sufficient conditions too. E.g., [WangYangFeng07] motivated by boosting. Ideally, these more intuitive conditions can help guide the design of similarity fns for a given application.

Part 2: Can we use this angle to help think about clustering?

Consider the following setting: Given a data set S of n objects. [documents, web pages] There is some (unknown) “ground truth” clustering. Each x has true label ℓ(x) in {1,…,t}. [topic] Goal: produce a hypothesis h of low error up to isomorphism of label names. But, we are given a pairwise similarity fn K. Problem: only have unlabeled data!

Consider the following setting: Given a data set S of n objects. [documents, web pages] There is some (unknown) “ground truth” clustering. Each x has true label ℓ(x) in {1,…,t}. [topic] Goal: produce a hypothesis h of low error up to isomorphism of label names. But, we are given a pairwise similarity fn K. Problem: only have unlabeled data! What conditions on a similarity function would be enough to allow one to cluster well?

What conditions on a similarity function would be enough to allow one to cluster well? Will lead to something like a PAC model for clustering. Note: the more common algorithmic approach is to view the weighted graph induced by K as ground truth and try to optimize various objectives. Here, we view the target as ground truth. Ask: how should K be related to it to let us get at it?

What conditions on a similarity function would be enough to allow one to cluster well? Will lead to something like a PAC model for clustering. E.g., say you want an alg to cluster docs the way *you* would. How closely related does K have to be to what’s in your head? Or, given a property you think K has, what algs does that suggest?

What conditions on a similarity function would be enough to allow one to cluster well? Here is a condition that trivially works: Suppose K has the property that: K(x,y) > 0 for all x,y such that ℓ(x) = ℓ(y). K(x,y) < 0 for all x,y such that ℓ(x) ≠ ℓ(y). If we have such a K, then clustering is easy. Now, let’s try to make this condition a little weaker….
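To make “clustering is easy” concrete, a small sketch (my own illustration, assuming we are handed the full similarity matrix with the sign pattern above): link every pair with K(x,y) > 0 and output the connected components, which then coincide with the target clusters.

```python
import numpy as np

def cluster_by_sign(K_matrix):
    """Given an n x n similarity matrix satisfying K>0 within clusters and
    K<0 across clusters, recover the clusters as connected components of
    the graph whose edges are the strictly positive entries."""
    n = len(K_matrix)
    labels = [-1] * n
    cur = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]          # simple DFS over positive-similarity edges
        labels[start] = cur
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and K_matrix[i][j] > 0:
                    labels[j] = cur
                    stack.append(j)
        cur += 1
    return labels

# Toy check: block-structured similarity with the required sign pattern.
K = np.full((6, 6), -0.5)
K[:3, :3] = 0.7
K[3:, 3:] = 0.7
print(cluster_by_sign(K))   # -> [0, 0, 0, 1, 1, 1]
```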

What conditions on a similarity function would be enough to allow one to cluster well? Suppose K has the property that all x are more similar to points y in their own cluster than to any y’ in other clusters. Still a very strong condition. Problem: the same K can satisfy this for two very different clusterings of the same data! [figure: document clusters labeled baseball, basketball, math, physics]

What conditions on a similarity function would be enough to allow one to cluster well? Suppose K has the property that all x are more similar to points y in their own cluster than to any y’ in other clusters. Still a very strong condition. Problem: the same K can satisfy this for two very different clusterings of the same data! Unlike learning, you can’t even test your hypotheses! [figure: document clusters labeled baseball, basketball, math, physics]

Let’s weaken our goals a bit… 1. OK to produce a hierarchical clustering (tree) such that the correct answer is apx some pruning of it. –E.g., in the case from the last slide: [figure: tree with root “all documents”, children “sports” (baseball, basketball) and “science” (math, physics)] 2. OK to output a small # of clusterings such that at least one has low error. (won’t talk about this one here)

Then you can start getting somewhere…. 1. The condition “all x more similar to all y in their own cluster than to any y’ from any other cluster” is sufficient to get a hierarchical clustering such that the target is some pruning of the tree. (Kruskal’s / single-linkage works) 2. Weaker condition: ground truth is “stable”: For all clusters C, C’, for all A ⊂ C, A’ ⊂ C’: A and A’ are not both more similar to each other than to the rest of their own clusters. [K(x,y) is the attraction between x and y]

Example analysis for simpler version Assume for all C, C’, all A ⊂ C, A’ ⊆ C’, we have K(A,C-A) > K(A,A’), where K(A,B) denotes Avg_{x∈A, y∈B}[K(x,y)], and say K is symmetric. Algorithm: average single-linkage. Like Kruskal, but at each step merge the pair of clusters whose average similarity is highest. Analysis: (all clusters made are laminar wrt the target) Failure iff we merge C_1, C_2 s.t. C_1 ⊂ C and C_2 ∩ C = ∅. But then there must exist C_3 ⊂ C s.t. K(C_1,C_3) ≥ K(C_1,C-C_1), and K(C_1,C-C_1) > K(C_1,C_2). Contradiction. (It’s getting late, let’s skip the proof.)
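A compact Python sketch of the average single-linkage procedure analyzed above: repeatedly merge the pair of current clusters with the highest average pairwise similarity, recording the merge sequence (which defines the tree). The similarity matrix is a toy illustration.

```python
import numpy as np

def average_linkage_tree(K):
    """Greedily merge the pair of current clusters with the highest average
    pairwise similarity, as in 'average single-linkage' above.
    Returns the list of merges; the sequence implicitly defines the tree."""
    clusters = [[i] for i in range(len(K))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                avg = np.mean([K[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        merges.append((clusters[a], clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] \
                   + [clusters[a] + clusters[b]]
    return merges

# Toy similarity: two tight groups {0,1,2} and {3,4,5}.
K = np.full((6, 6), 0.1)
K[:3, :3] = 0.9
K[3:, 3:] = 0.9
for left, right in average_linkage_tree(K):
    print(left, "+", right)
```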

Example analysis for simpler version Assume for all C, C’, all A ⊂ C, A’ ⊆ C’, we have K(A,C-A) > K(A,A’), where K(A,B) denotes Avg_{x∈A, y∈B}[K(x,y)]. The algorithm breaks down if K is not symmetric: Instead, run a “Boruvka-inspired” algorithm: –Each current cluster C_i points to argmax_{C_j} K(C_i,C_j). –Merge directed cycles. (not all components) [Think of K as “attraction”]
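A sketch of one round of the Boruvka-inspired step for asymmetric K (the details beyond the slide are my own illustration): each current cluster points to the cluster it finds most attractive on average, and only the directed cycles of this points-to graph are merged, not whole components.

```python
import numpy as np

def boruvka_round(K, clusters):
    """One round: each cluster points to the cluster of highest average
    (possibly asymmetric) similarity; merge the clusters on each directed
    cycle of this 'points-to' graph (not whole weakly connected components)."""
    def avg_sim(A, B):
        return np.mean([K[i][j] for i in A for j in B])

    # Each cluster has out-degree 1: it points to its most attractive neighbor.
    points_to = []
    for a, A in enumerate(clusters):
        best = max((b for b in range(len(clusters)) if b != a),
                   key=lambda b: avg_sim(A, clusters[b]))
        points_to.append(best)

    # Find directed cycles of this functional graph and merge each one.
    merged, used = [], [False] * len(clusters)
    for start in range(len(clusters)):
        path, seen = start, {}
        while path not in seen and not used[path]:
            seen[path] = len(seen)
            path = points_to[path]
        if path in seen:                      # found a new cycle
            cycle = [c for c, pos in seen.items() if pos >= seen[path]]
            new_cluster = []
            for c in cycle:
                new_cluster += clusters[c]
                used[c] = True
            merged.append(new_cluster)
    leftovers = [clusters[c] for c in range(len(clusters)) if not used[c]]
    return merged + leftovers

# Toy asymmetric similarity over 4 singleton clusters.
K = np.array([[0, .9, .1, .1],
              [.8, 0, .1, .1],
              [.1, .1, 0, .7],
              [.1, .1, .9, 0]])
print(boruvka_round(K, [[0], [1], [2], [3]]))   # -> [[0, 1], [2, 3]]
```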

More general conditions What if we only require stability for large sets? (Assume all true clusters are large.) –E.g., take an example satisfying stability for all sets, but add noise. –Might cause bottom-up algorithms to fail. Instead, can pick some points at random, guess their labels, and use them to cluster the rest. Produces a big list of candidates. Then a 2nd testing step to hook the clusters up into a tree. Running time not great though. (exponential in # topics)

Other properties Can also relate to implicit assumptions made by approx algorithms for standard objectives like k-median. –E.g., if you assume that any apx k-median solution must be close to the target, this implies that most points satisfy a simple ordering condition.

Like a PAC model for clustering PAC learning model: basic object of study is the concept class (a set of functions). Look at which are learnable and by what algs. In our case, the basic object of study is a property: like a collection of (target, similarity function) pairs. Want to know which allow clustering and by what algs.

Conclusions What properties of a similarity function are sufficient for it to be useful for clustering? –View as unlabeled-data multiclass learning prob. (Target fn as ground truth rather than graph) –To get interesting theory, need to relax what we mean by “useful”. –Can view as a kind of PAC model for clustering. –A lot of interesting directions to explore.

Conclusions –Natural properties (relations between sim fn and target) that motivate spectral methods? –Efficient algorithms for other properties? E.g., “stability of large subsets” –Other notions of “useful”. Produce a small DAG instead of a tree? Others based on different kinds of feedback?

Overall Conclusions Theoretical approach to the question: what are the minimal conditions that allow a similarity fn to be useful for learning/clustering? For learning, a formal way of analyzing kernels as similarity functions. –Doesn’t require reference to implicit spaces or PSD properties. For clustering, “reverses” the usual view. Can think of it as a PAC model for clustering. [property ↔ concept class] A lot of interesting directions to explore.