Semi-Supervised Learning and Learning via Similarity Functions: Two key settings for Data-Dependent Concept Spaces. Avrim Blum [NIPS 2008 Workshop on Data Dependent Concept Spaces]

Theme of the workshop: Data-Dependent Concept Spaces. Using the data to determine the set of functions you are viewing as most natural.
–E.g., large-margin separators.
–Pre-bias + data → bias.
–Much of this can be done even without considering labels.

Theme of the workshop: Data-Dependent Concept Spaces. This talk: two key settings where this is especially helpful.
–Semi-supervised learning
–Learning from similarity functions

Semi-Supervised Learning. Can we use unlabeled data to augment a small labeled sample to improve learning? Unlabeled data is missing the most important information (the labels!), but maybe it can help us instantiate our biases.

Semi-Supervised Learning. First half of this talk: some examples of SSL algorithms using different kinds of biases, and a unified framework for analyzing them in terms of distribution-dependent concept spaces [joint work with Nina Balcan].
–Understand when unlabeled data can help.
–Help make sense of what's going on.

Ex 1: Co-training. Many problems have two different sources of information you can use to determine the label.
–Example x is a pair ⟨x1, x2⟩.
–E.g., classifying webpages: can use words on the page or words on links pointing to the page.
[Figure: faculty webpage example; x1 = link info ("my advisor", "Prof. Avrim Blum"), x2 = text info, x = link info & text info.]

Ex 1: Co-training. Idea: use the small labeled sample to learn initial rules.
–E.g., "my advisor" pointing to a page is a good indicator it is a faculty home page.
–E.g., "I am teaching" on a page is a good indicator it is a faculty home page.

Ex 1: Co-training. Idea: use the small labeled sample to learn initial rules.
–E.g., "my advisor" pointing to a page is a good indicator it is a faculty home page.
–E.g., "I am teaching" on a page is a good indicator it is a faculty home page.
Then look for unlabeled examples ⟨x1, x2⟩ where one rule is confident and the other is not, and have the confident rule label the example for the other. We are training two classifiers, one on each type of information, and using each to help train the other.
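
To make the loop above concrete, here is a minimal sketch of this style of co-training in Python. It is not the talk's exact algorithm: the two scikit-learn-style classifiers, the confidence threshold, and the `predict_proba` interface are illustrative assumptions.

```python
import numpy as np

def cotrain(clf1, clf2, X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
            rounds=10, conf_thresh=0.9):
    """Minimal co-training loop: when one view's classifier is confident
    about an unlabeled example and the other is not, the confident view
    labels it for the other (illustrative sketch only)."""
    X1, X2, y = X1_lab, X2_lab, y_lab
    U1, U2 = X1_unlab, X2_unlab
    for _ in range(rounds):
        clf1.fit(X1, y)
        clf2.fit(X2, y)
        if len(U1) == 0:
            break
        p1 = clf1.predict_proba(U1).max(axis=1)   # view-1 confidence
        p2 = clf2.predict_proba(U2).max(axis=1)   # view-2 confidence
        # examples where exactly one view is confident
        move = ((p1 >= conf_thresh) & (p2 < conf_thresh)) | \
               ((p2 >= conf_thresh) & (p1 < conf_thresh))
        if not move.any():
            break
        # the more confident view supplies the label
        labels = np.where(p1 >= p2, clf1.predict(U1), clf2.predict(U2))
        X1 = np.vstack([X1, U1[move]])
        X2 = np.vstack([X2, U2[move]])
        y = np.concatenate([y, labels[move]])
        U1, U2 = U1[~move], U2[~move]
    return clf1, clf2
```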

Simple example: intervals. As a simple example, suppose x1, x2 ∈ ℝ and the target function on each view is some interval [a,b]. [Figure: intervals [a1,b1] and [a2,b2] on the two views.]

Ex 1: Co-training Or can directly optimize joint objective: –Accuracy on labeled sample –Agreement on unlabeled sample

Ex 1: Co-training Turns out a number of problems can be set up this way. E.g., [Levin-Viola-Freund03] identifying objects in images. Two different kinds of preprocessing. E.g., [Collins&Singer99] named-entity extraction. –“I arrived in London yesterday” …

Ex 2: S³VM [Joachims98]. Suppose we believe the target separator goes through low-density regions of the space / has large margin. Aim for a separator with large margin w.r.t. labeled and unlabeled data (L+U). [Figure: SVM using labeled data only vs. S³VM using labeled + unlabeled data.]

Ex 2: S³VM [Joachims98]. Suppose we believe the target separator goes through low-density regions of the space / has large margin. Aim for a separator with large margin w.r.t. labeled and unlabeled data (L+U). Unfortunately, the optimization problem is now NP-hard.
–So do local optimization or branch-and-bound, etc.
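
As a rough illustration of the "local optimization" route, one can alternate between guessing labels for the unlabeled points and refitting with a gradually increasing weight on them. This is a crude self-labeling stand-in, not Joachims' actual S³VM/TSVM procedure.

```python
import numpy as np
from sklearn.svm import LinearSVC

def s3vm_local_search(X_lab, y_lab, X_unlab, outer_iters=10):
    """Crude stand-in for S^3VM local optimization: self-label the
    unlabeled points with the current separator, then retrain with a
    gradually increasing weight on them (sketch, not Joachims' algorithm)."""
    clf = LinearSVC().fit(X_lab, y_lab)
    X_all = np.vstack([X_lab, X_unlab])
    for t in range(1, outer_iters + 1):
        y_guess = clf.predict(X_unlab)                 # current guess for unlabeled
        y_all = np.concatenate([y_lab, y_guess])
        weights = np.concatenate([np.ones(len(y_lab)),
                                  (t / outer_iters) * np.ones(len(X_unlab))])
        clf = LinearSVC().fit(X_all, y_all, sample_weight=weights)
    return clf
```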

Ex 3: Graph-based methods. Suppose we believe that very similar examples probably have the same label. If you have a lot of labeled data, this suggests a nearest-neighbor type of algorithm. If you have a lot of unlabeled data, perhaps it can be used as "stepping stones", e.g., for handwritten digits [Zhu07].

Ex 3: Graph-based methods Idea: construct a graph with edges between very similar examples. Unlabeled data can help “glue” the objects of the same class together.

Ex 3: Graph-based methods. Idea: construct a graph with edges between very similar examples. Unlabeled data can help "glue" the objects of the same class together. Solve for:
–Minimum cut [BC,BLRR]
–Minimum "soft-cut" [ZGL]: minimize Σ_{e=(u,v)} (f(u)-f(v))²
–Spectral partitioning [J]
–…
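
A minimal sketch of the "soft-cut" option [ZGL]: solve for the harmonic function minimizing Σ_{e=(u,v)} W_uv (f(u)-f(v))² subject to the labeled values. The dense-matrix formulation and the precomputed similarity matrix W are assumptions of this sketch.

```python
import numpy as np

def harmonic_soft_cut(W, labeled_idx, y_labeled):
    """Gaussian-fields / harmonic solution: minimize
    sum_{(u,v)} W_uv (f(u) - f(v))^2 with f fixed to the labels on the
    labeled nodes, by solving f_U = L_UU^{-1} W_UL f_L.
    W: symmetric similarity matrix; y_labeled: labels in {0, 1}."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                    # graph Laplacian
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L_UU = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_UL = W[np.ix_(unlabeled_idx, labeled_idx)]
    f = np.zeros(n)
    f[labeled_idx] = y_labeled
    f[unlabeled_idx] = np.linalg.solve(L_UU, W_UL @ f[labeled_idx])
    return f                                          # threshold at 0.5 to classify
```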

Several different approaches. Is there some unifying principle? What should be true about the world for unlabeled data to help?

Semi-supervised model [BB05], high-level idea: unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.

Semi-supervised model [BB05], high-level idea: in standard PAC, the bias is a set of (or a score over) possible targets, i.e., a concept class. In SSL, replace this with a score over (f,D) pairs (a function from D to a score over f's) that can be estimated from a finite sample: a distribution-dependent concept class.

Semi-supervised model [BB05]. Augment the notion of a concept class C with a compatibility score χ(h,D) between a concept and the data distribution. "Learn C" becomes "learn (C,χ)" (learn class C under χ). (C,χ) express beliefs/hopes about the world.
–Idea I: use unlabeled data and the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.
–Idea II: require that the degree of compatibility can be estimated from a finite sample.
[Figure: an abstract prior χ plus unlabeled data take the class of functions C (e.g., linear separators) down to the compatible functions in C (e.g., large-margin linear separators).]

Formally: require compatibility χ(h,D) to be an expectation over individual examples. (Don't need to be so strict, but this is cleanest.)
–View incompatibility as an unlabeled loss function ℓ_unl(h,x) ∈ [0,1]. Unlabeled error rate Err_unl(h) = E_{x~D}[ℓ_unl(h,x)], and χ(h,D) = 1 - Err_unl(h).
–Co-training: Err_unl(h) = probability mass on points ⟨x1,x2⟩ where h1(x1) ≠ h2(x2).

Formally: require compatibility χ(h,D) to be an expectation over individual examples. (Don't need to be so strict, but this is cleanest.)
–View incompatibility as an unlabeled loss function ℓ_unl(h,x) ∈ [0,1]. Unlabeled error rate Err_unl(h) = E_{x~D}[ℓ_unl(h,x)], and χ(h,D) = 1 - Err_unl(h).
–Co-training: Err_unl(h) = probability mass on points ⟨x1,x2⟩ where h1(x1) ≠ h2(x2).
–S³VM: Err_unl(h) = probability mass near to the separator h.
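
Since each compatibility notion is an expectation of a per-example unlabeled loss, it can be estimated directly from an unlabeled sample. A small sketch for the two cases above; the classifier and separator interfaces are assumptions of the sketch, not prescribed by the talk.

```python
import numpy as np

def err_unl_cotrain(h1, h2, X1_unlab, X2_unlab):
    """Empirical unlabeled error for co-training: fraction of unlabeled
    pairs <x1, x2> on which the two per-view hypotheses disagree."""
    return np.mean(h1.predict(X1_unlab) != h2.predict(X2_unlab))

def err_unl_margin(w, b, X_unlab, gamma=1.0):
    """Empirical unlabeled error for the S^3VM-style belief: fraction of
    unlabeled points within distance gamma of the separator w.x + b = 0."""
    dist = np.abs(X_unlab @ w + b) / np.linalg.norm(w)
    return np.mean(dist < gamma)
```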

Can use this to prove sample bounds. Simple example theorem (believe the target is fully compatible): define C_D(ε) = {h ∈ C : err_unl(h) ≤ ε}, and bound the number of labeled examples needed in terms of the size of C_D(ε), which acts as a measure of the helpfulness of D w.r.t. our compatibility score.
–A helpful distribution is one in which C_D(ε) is small.
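
For concreteness, the simplest finite-class version of such a bound has roughly the following shape. This is a hedged reconstruction in the spirit of [BB05]; the constants and exact form are indicative only.

```latex
% Target fully compatible and consistent; C finite. With probability >= 1 - \delta,
m_u = O\!\left(\tfrac{1}{\epsilon}\left(\ln|C| + \ln\tfrac{1}{\delta}\right)\right)
\text{ unlabeled and }
m_l = O\!\left(\tfrac{1}{\epsilon}\left(\ln|C_D(\epsilon)| + \ln\tfrac{1}{\delta}\right)\right)
\text{ labeled examples suffice for every compatible, consistent } h \in C
\text{ to have } \mathrm{err}(h) \le \epsilon.
```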

Can use this to prove sample bounds. Simple example theorem (believe the target is fully compatible): define C_D(ε) = {h ∈ C : err_unl(h) ≤ ε}. E.g., co-training: if all x1 = x2 then the target is compatible, but so is every other h ∈ C. Not so helpful…

Can use this to prove sample bounds. Simple example theorem (believe the target is fully compatible): define C_D(ε) = {h ∈ C : err_unl(h) ≤ ε}. Extend to the case where the target is not fully compatible: then we care about {h ∈ C : err_unl(h) ≤ err_unl(f*) + ε}. D is helpful if this set is small.

Can use this to prove sample bounds. Agnostic version (not sure how compatible the target is). Notes: the unlabeled set needs to be large compared to max[dim(C), dim(χ(C))], where χ(C) = the class induced by χ and C: χ_h(x) = χ(h,x).

Examples. Say the target is a monotone disjunction, e.g., x1 ∨ x3 ∨ x5. Example x is compatible if vars(x) ⊆ vars(f) or vars(x) ∩ vars(f) = ∅.
–Err_unl(h) = Pr_{x~D}[x contains both relevant and irrelevant variables].
Can use unlabeled data to group variables into buckets.
–Data-dependent concept space: disjunctions consistent with the buckets.
Good and bad distributions:
–Bad distribution: uniform over singleton examples.
–Good distribution: D+ and D- are product distributions (or more generally have no small cuts).
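
A small sketch of the bucketing step. The union-find construction is an illustrative implementation choice, not spelled out in the talk: any two variables that co-occur in some unlabeled example must, under the compatibility belief, be either both relevant or both irrelevant, so they are merged into one bucket.

```python
def group_variables(unlabeled_examples, n_vars):
    """Group variables into buckets using unlabeled data: variables that
    co-occur in an example are forced into the same bucket, since a
    compatible target must treat them identically. The data-dependent
    concept space is then: disjunctions that are unions of whole buckets."""
    parent = list(range(n_vars))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path compression
            i = parent[i]
        return i

    for x in unlabeled_examples:             # x: iterable of variable indices set to 1
        vars_on = list(x)
        for v in vars_on[1:]:
            parent[find(v)] = find(vars_on[0])
    buckets = {}
    for v in range(n_vars):
        buckets.setdefault(find(v), []).append(v)
    return list(buckets.values())
```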

Examples. Target is a large-margin separator, and the distribution is uniform over a small number k of well-shaped blobs. If the number of unlabeled examples is sufficiently large, need only O(k/ε) labeled examples.

Interesting feature: ε-cover bounds often give a substantial gain compared to uniform-convergence bounds. Nice example:
–Examples are pairs of points in {0,1}^n.
–C = linear separators.
–An example is compatible if both parts are positive or both parts are negative (like a simple form of co-training).
–D = independent given the label.
If poly(n) unlabeled and o(log n) labeled examples, then there are still many bad compatible, consistent hypotheses. But the ε-cover size is constant: all compatible functions are close to (a) the target, (b) the negation of the target, (c) the all-ones function, or (d) the all-zeros function.

Conclusions, part 1. Data-dependent concept spaces can give a unifying view of semi-supervised learning: beliefs about how the target should relate to the distribution.
–View these beliefs as a notion of unlabeled error rate.
–Unlabeled data is used to estimate/instantiate your bias.
–Can do data-dependent structural risk minimization (DDSRM) to trade off.

Topic 2: Learning with similarity functions [joint work with Nina Balcan and Nati Srebro]

Quick reminder on kernels. A kernel K is a legal definition of dot-product: a function such that there exists an implicit mapping φ_K with K(x,y) = φ_K(x)·φ_K(y). The kernel should be positive semi-definite (PSD).
–E.g., K(x,y) = (x·y + 1)^d.
–φ_K: (n-dimensional space) → (n^d-dimensional space).
The point is: many learning algorithms can be written so they only interact with the data via dot-products.
–If we replace x·y with K(x,y), the algorithm acts implicitly as if the data were in the higher-dimensional φ-space.
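
A tiny numeric check of the implicit-mapping claim for the degree-2 polynomial kernel in two dimensions; the explicit feature map written out here is the standard one, and the snippet is just an illustration.

```python
import numpy as np

def poly_kernel(x, y, d=2):
    return (x @ y + 1) ** d

def phi_d2(x):
    """Explicit feature map for K(x,y) = (x.y + 1)^2 when x is 2-dimensional:
    phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(poly_kernel(x, y), phi_d2(x) @ phi_d2(y))   # both equal 4.0
```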

Moreover, we generalize well if there is a good margin. Kernels have been found useful in practice for dealing with many, many different kinds of data. If the data is linearly separable by margin γ in φ-space (assuming |φ(x)| ≤ 1), then a sample of size only Õ(1/γ²) suffices for confidence in generalization.

…but there's a bit of a disconnect: intuitively we think of a good kernel for some problem as one that makes sense as a good notion of similarity. But the theory talks about margins in an implicit high-dimensional φ-space.

Can we define a notion of a good measure of similarity that…
1. Talks in terms of more intuitive quantities (no implicit high-dimensional spaces, no PSD requirement, etc.)?
2. Is broad: includes the usual notion of a "good kernel" (one that induces a large-margin separator in φ-space)?
3. Maybe even allows one to learn classes that have no large-margin kernels?
Approach: use the notion of data-dependent concept spaces.

Warmup property. Say we have a learning problem P (a distribution D over examples labeled by an unknown target f) and a similarity function K: (x,y) → [-1,1]. Consider asking for the property that most examples x satisfy: E_{y~D}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~D}[K(x,y) | ℓ(y)≠ℓ(x)] + γ, i.e., "on average, x is more similar to points y of its own type than to points y of the other type, by some gap γ." Then this would be great for learning.

Just do average-nearest-neighbor. Suppose at least a 1-ε' probability mass of x satisfy: E_{y~D}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~D}[K(x,y) | ℓ(y)≠ℓ(x)] + γ.
–Draw a set S+ of O((1/γ²)ln(1/δ)) positive examples and a set S- of O((1/γ²)ln(1/δ)) negative examples.
–Classify new examples based on which set gives the better average similarity score.
With probability ≥ 1-δ, S+ and S- give you error rate ≤ δ + ε'.
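
A minimal sketch of this average-similarity rule; the function-valued K and the list-of-examples interface are illustrative assumptions.

```python
import numpy as np

def avg_similarity_classifier(K, S_plus, S_minus):
    """'Average nearest neighbor' rule from the warmup property: predict +1
    iff the average similarity of x to the positive sample S_plus exceeds
    its average similarity to the negative sample S_minus.
    K(x, y) is any similarity function into [-1, 1]."""
    def predict(x):
        score_plus = np.mean([K(x, y) for y in S_plus])
        score_minus = np.mean([K(x, y) for y in S_minus])
        return +1 if score_plus >= score_minus else -1
    return predict
```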

But this is not broad enough: K(x,y) = x·y has a good separator but doesn't satisfy the condition (half of the positives are more similar to the negatives than to a typical positive). [Figure: average similarity to the negatives is 0.5, but to the positives only 0.25.]

But not broad enough. Idea: this would work if we didn't pick the y's from the top-left region. Broaden to say: it is OK if there exists a region R such that most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label (even if we don't know R in advance).

Broader definition… Ask that there exists a set R of "reasonable" y (allowed to be probabilistic) such that almost all x satisfy E_y[ℓ(x)ℓ(y)K(x,y) | y ∈ R] ≥ γ. Formally, say K is (ε',γ,τ)-good if the hinge loss of this condition is at most ε' and Pr(R) ≥ τ. Thm 1: this is a legitimate way to think about good kernels.
–If a kernel has margin γ in the implicit space, then for any ε it is (ε, γ², ε)-good in this sense.
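
Written out, the hinge-loss form of the definition is roughly the following; this is a hedged reconstruction consistent with the slide, with g(x) introduced here only for brevity.

```latex
% K is (\epsilon', \gamma, \tau)-good for P if there is a (possibly probabilistic)
% set R of "reasonable" points with \Pr_{y \sim D}[y \in R] \ge \tau such that
g(x) := \mathbb{E}_{y \sim D}\!\left[\ell(y)\,K(x,y) \mid y \in R\right],
\qquad
\mathbb{E}_{x \sim D}\!\left[\bigl[\,1 - \ell(x)\,g(x)/\gamma\,\bigr]_{+}\right] \le \epsilon'.
```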

Broader definition… Ask that there exists a set R of "reasonable" y (allowed to be probabilistic) such that almost all x satisfy E_y[ℓ(x)ℓ(y)K(x,y) | y ∈ R] ≥ γ. Formally, say K is (ε',γ,τ)-good if the hinge loss of this condition is at most ε' and Pr(R) ≥ τ. Thm 2: this is sufficient for learning.
–If K is (ε',γ,τ)-good with ε' sufficiently small relative to ε, we can learn to error ε with O((1/γ²)log(1/ε)) labeled examples [and Õ(1/(γ²τ)) unlabeled examples].

How to use such a similarity function? Assume there exists R such that Pr_y[R] ≥ τ and almost all x satisfy E_y[ℓ(x)ℓ(y)K(x,y) | y ∈ R] ≥ γ.
–Draw S = {y1,…,yn}, n ≈ 1/(γ²τ) (these could be unlabeled).
–View them as "landmarks" and use them to map new data: F(x) = [K(x,y1), …, K(x,yn)].
–Whp, there exists a separator of good L1 margin in this space: w = [0,0,1/n_R,1/n_R,0,0,0,-1/n_R,0] (n_R = # of y_i ∈ R).
[Figure: the map F takes the original space P to the landmark space F(P).]

How to use such a similarity function? Assume there exists R such that Pr_y[R] ≥ τ and almost all x satisfy E_y[ℓ(x)ℓ(y)K(x,y) | y ∈ R] ≥ γ.
–Draw S = {y1,…,yn}, n ≈ 1/(γ²τ) (these could be unlabeled).
–View them as "landmarks" and use them to map new data: F(x) = [K(x,y1), …, K(x,yn)].
–Whp, there exists a separator of good L1 margin in this space: w = [0,0,1/n_R,1/n_R,0,0,0,-1/n_R,0] (n_R = # of y_i ∈ R).
–So, take a new set of labeled examples, project them to this space, and run a good L1 algorithm (Winnow).
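
A minimal sketch of this two-stage procedure. The talk's analysis uses Winnow for its L1-margin guarantee; below an L1-regularized logistic regression is used as a practical stand-in, and the function-valued K and the landmark list are assumptions of the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def landmark_map(K, landmarks):
    """F(x) = [K(x, y_1), ..., K(x, y_n)] for a sample of landmarks
    y_1, ..., y_n (which may be unlabeled)."""
    return lambda X: np.array([[K(x, y) for y in landmarks] for x in X])

def learn_with_similarity(K, landmarks, X_lab, y_lab):
    """Project labeled data through the landmarks, then fit a sparse
    linear separator in that space (stand-in for the Winnow step)."""
    F = landmark_map(K, landmarks)
    clf = LogisticRegression(penalty="l1", solver="liblinear")
    clf.fit(F(X_lab), y_lab)
    return lambda X: clf.predict(F(X))
```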

Interesting property: notice that we are using the landmarks to define an explicit data-dependent concept space:
–F(x) = [K(x,y1), …, K(x,yn)].
–We then consider linear separators of large L1 margin in this space (which, again, is data-dependent).

Relation between "large-margin kernel" and "good similarity function". Any large-margin kernel is also a good similarity function (though this can potentially incur a quadratic penalty in sample complexity). But one can also get an exponential improvement: can define C, D with
–the similarity function (0, 1, 1/|C|)-good (hinge loss, margin, Pr(R_i)) for all c_i ∈ C, giving a labeled-sample bound of O(ε⁻¹ log|C|),
–but any kernel has margin O(1/|C|^{1/2}) for some c ∈ C.

Summary. Defined a property of a similarity function that is sufficient to be able to learn well.
–Can apply to similarity functions that aren't positive semi-definite (or even symmetric).
–Includes the usual notion of "good kernels" modulo some loss in parameters, but with a potential gain as well.
–Uses data-dependent concept spaces.
Open questions: are there properties suggesting other algorithms? E.g., [WangYangFeng07]. [Figure: Venn diagram relating the large-margin kernel definition and the main similarity property.]