Semi-Supervised Learning

Presentation transcript:

Semi-Supervised Learning Maria-Florina Balcan

Supervised Learning: Formalization (PAC). X - instance space. Sl = {(xi, yi)} - labeled examples drawn i.i.d. from some distribution D over X and labeled by some target concept c*. Labels are in {-1, 1} (binary classification). Algorithm A PAC-learns concept class C if for any target c* in C, any distribution D over X, and any ε, δ > 0: A uses at most poly(n, 1/ε, 1/δ, size(c*)) examples and running time, and with probability 1-δ, A produces h in C of error at most ε.

Supervised Learning, Big Questions. Algorithm design: how might we automatically generate rules that do well on observed data? Sample complexity/confidence bounds: what kind of confidence do we have that they will do well in the future?

Sample Complexity: Uniform Convergence, Finite Hypothesis Spaces. Realizable case; agnostic case (what if there is no perfect h?).
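The bound formulas on this slide are images and do not survive in this transcript. For reference, the standard finite-class uniform-convergence bounds the slide refers to have the following form (standard results stated from memory, not copied from the slide):

$$m \ge \frac{1}{\epsilon}\left(\ln|C| + \ln\frac{1}{\delta}\right) \;\Rightarrow\; \Pr\Big[\exists\, h \in C:\ \widehat{err}(h)=0,\ err(h) > \epsilon\Big] \le \delta \quad \text{(realizable case)}$$

$$m \ge \frac{1}{2\epsilon^2}\left(\ln|C| + \ln\frac{2}{\delta}\right) \;\Rightarrow\; \Pr\Big[\exists\, h \in C:\ |err(h)-\widehat{err}(h)| > \epsilon\Big] \le \delta \quad \text{(agnostic case)}$$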

Sample Complexity: Uniform Convergence, Infinite Hypothesis Spaces. C[S] - the set of splittings of dataset S using concepts from C. C[m] - the maximum number of ways to split m points using concepts in C, i.e. C[m] = max_{|S|=m} |C[S]|. C[m, D] - the expected number of splits of m points drawn from D with concepts in C. Fact #1: the previous results still hold if we replace |C| with C[2m]. Fact #2: we can even replace it with C[2m, D].

Sample Complexity: Uniform Convergence, Infinite Hypothesis Spaces. For instance, Sauer's Lemma gives C[m] = O(m^{VCdim(C)}), which implies:
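The bound that follows is an image on the original slide; the standard consequence of Sauer's Lemma (stated from memory, not copied from the slide) is that in the realizable case

$$m = O\!\left(\frac{1}{\epsilon}\left(VCdim(C)\,\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right)\right)$$

labeled examples suffice for uniform convergence.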

Sample Complexity: -Cover Bounds C is an -cover for C w.r.t. D if for every h 2 C there is a h’ 2 C which is -close to h. To learn, it’s enough to find an -cover and then do empirical risk minimization w.r.t. the functions in this cover. In principle, in the realizable case, the number of labeled examples we need is Usually, for fixed distributions. Maria-Florina Balcan

Sample Complexity: -Cover Bounds Can be much better than Uniform-Convergence bounds! Simple Example (Realizable case) X={1, 2, …,n}, C =C1 [ C2, D= uniform over X. C1 - the class of all functions that predict positive on at most ¢ n/4 examples. C2 - the class of all functions that predict negative on at most  ¢ n/4 examples. If the number of labeled examples ml <  ¢ n/4, don’t have uniform convergence yet. The size of the smallest /4-cover is 2, so we can learn with only O(1/) labeled examples. In fact, since the elements of this cover are far apart, much fewer examples are sufficient. Maria-Florina Balcan

Classic Paradigm Insufficient Nowadays. Modern applications involve massive amounts of raw data (protein sequences, billions of webpages, images), and only a tiny fraction can be annotated by human experts. The classic paradigm: take a sample S of data, labeled according to whether they were/weren't spam.

Semi-Supervised Learning. [Pipeline figure: raw unlabeled data, plus a small amount of data labeled by an expert (e.g., face vs. not face), are fed to a learning algorithm, which outputs a classifier.]

Semi-Supervised Learning. A hot topic in recent years in machine learning. Many applications have lots of unlabeled data, but labeled data is rare or expensive: web page and document classification, OCR, image classification. Workshops [ICML '03, ICML '05]. Book: Semi-Supervised Learning, MIT Press 2006, O. Chapelle, B. Scholkopf and A. Zien (eds).

Combining Labeled and Unlabeled Data. Several methods have been developed to try to use unlabeled data to improve performance, e.g.: Transductive SVM [Joachims '98]; Co-training [Blum & Mitchell '98], [BBY04]; Graph-based methods [Blum & Chawla '01], [ZGL03]; Augmented PAC model for SSL [Balcan & Blum '05]. Here Su = {xi} - unlabeled examples drawn i.i.d. from D, and Sl = {(xi, yi)} - labeled examples drawn i.i.d. from D and labeled by some target concept c*. A different model, in which the learner gets to pick the examples to be labeled, is Active Learning.

Can we extend the PAC/SLT models to deal with unlabeled data? The PAC/SLT models are nice, standard models for learning from labeled data. Goal: extend them naturally to the case of learning from both labeled and unlabeled data. Different algorithms are based on different assumptions about how the data should behave. Question: how can we capture many of the assumptions typically used?

Example of a “typical” assumption: Margins. The separator goes through low-density regions of the space (large margin). Assume we are looking for a linear separator; the belief is that there should exist one with large separation. [Figure: the labeled data alone, with the SVM and Transductive SVM separators.]

Another Example: Self-consistency. Agreement between two parts: co-training. Examples contain two sufficient sets of features, i.e. an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x). For example, if we want to classify web pages: x = ⟨x1, x2⟩, where x1 is the link info and x2 is the text info. [Figure: a webpage "My Advisor Prof. Avrim Blum" split into link info and text info.]

Iterative Co-Training. Works by using unlabeled data to propagate learned information. Have learning algorithms A1, A2 on each of the two views X1, X2. Use the labeled data to learn two initial hypotheses h1, h2. Look through the unlabeled data to find examples where one of the hi is confident but the other is not. Have the confident hi label it for algorithm A(3-i). Repeat. (A rough sketch in code follows.)
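A minimal sketch of iterative co-training, under assumptions not in the slides: the base classifier (logistic regression), the confidence threshold, and the shared labeled pool are illustrative choices, not the algorithm as specified in [Blum & Mitchell '98]. The labeled seed is assumed to contain both classes.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def co_train(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, conf=0.95):
        """X1_*, X2_* are numpy arrays holding the two views; *_l labeled, *_u unlabeled."""
        h1, h2 = LogisticRegression(), LogisticRegression()
        unlabeled = list(range(len(X1_u)))          # indices still unlabeled
        for _ in range(rounds):
            h1.fit(X1_l, y_l)                       # learn a hypothesis on each view
            h2.fit(X2_l, y_l)
            if not unlabeled:
                break
            newly = []
            for h, Xu in ((h1, X1_u), (h2, X2_u)):
                proba = h.predict_proba(Xu[unlabeled])
                for pos, p in enumerate(proba):
                    if p.max() >= conf:             # this view is confident on this example
                        j = unlabeled[pos]
                        label = h.classes_[p.argmax()]
                        # the confident view labels the example; here it is added to the
                        # shared labeled pool (a common simplification)
                        X1_l = np.vstack([X1_l, X1_u[j:j + 1]])
                        X2_l = np.vstack([X2_l, X2_u[j:j + 1]])
                        y_l = np.append(y_l, label)
                        newly.append(j)
            if not newly:
                break                               # no information propagated; stop
            unlabeled = [j for j in unlabeled if j not in set(newly)]
        return h1, h2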

Iterative Co-Training, A Simple Example: Learning Intervals. [Figure: labeled and unlabeled examples on a line, with target intervals c1 and c2 on the two views.] Use the labeled data to learn initial hypotheses h11 and h21; then use the unlabeled data to bootstrap the refined hypotheses h12 and h22.

Co-training: Theoretical Guarantees. What properties do we need for co-training to work well? We need assumptions about the underlying data distribution and about the learning algorithms on the two sides. [Blum & Mitchell, COLT '98]: independence given the label; an algorithm for learning from random noise. [Balcan, Blum, Yang, NIPS 2004]: distributional expansion; an algorithm for learning from positive data only.

Problems thinking about SSL in the PAC model. The PAC model talks of learning a class C under a (known or unknown) distribution D. It is not clear what unlabeled data can do for you: it doesn't give you any info about which c ∈ C is the target function. Can we extend the PAC model to capture these (and more) uses of unlabeled data, and give a unified framework for understanding when and why unlabeled data can help?

New discriminative model for SSL. Su = {xi} - xi i.i.d. from D, and Sl = {(xi, yi)} - xi i.i.d. from D, yi = c*(xi). Problems with thinking about SSL in standard worst-case models (PAC or SLT): we learn a class C under a (known or unknown) distribution D, so there is a complete disconnect between the target and D; unlabeled data doesn't give any info about which c ∈ C is the target. Questions: under what conditions will unlabeled data help, and by how much? How much data should I expect to need in order to perform well? Key insight: unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.

New model for SSL, Main Ideas. Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution; "learn C" becomes "learn (C, χ)" (learn class C under compatibility notion χ). This expresses relationships that the target and the underlying distribution possess. Idea I: use unlabeled data and the belief that the target is compatible to reduce C down to just the highly compatible functions in C. [Figure: χ as an abstract prior; unlabeled data cuts the class of functions C, e.g. linear separators, down to the compatible functions in C, e.g. large-margin linear separators.] Idea II: the degree of compatibility is estimated from a finite sample.

Formally. Idea II: the degree of compatibility is estimated from a finite sample. Require the compatibility χ(h, D) to be an expectation over individual examples (we don't need to be so strict, but this is cleanest): χ(h, D) = E_{x∼D}[χ(h, x)] is the compatibility of h with D, with χ(h, x) ∈ [0, 1]. View incompatibility as an unlabeled error rate: errunl(h) = 1 - χ(h, D) is the incompatibility of h with D.
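A minimal sketch of Idea II: estimating the unlabeled error rate from a finite unlabeled sample. The callable chi and the representation of h are assumptions for illustration.

    import numpy as np

    def estimated_unlabeled_error(chi, h, unlabeled_sample):
        # Empirical incompatibility: one minus the average per-example compatibility chi(h, x) in [0, 1].
        return 1.0 - float(np.mean([chi(h, x) for x in unlabeled_sample]))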

Margins, Compatibility. Margins: the belief is that there should exist a large-margin separator. The incompatibility of h and D (the unlabeled error rate of h) is the probability mass within distance γ of h. It can be written as an expectation over individual examples, χ(h, D) = E_{x∼D}[χ(h, x)], where χ(h, x) = 0 if dist(x, h) < γ and χ(h, x) = 1 if dist(x, h) ≥ γ. [Figure: a highly compatible large-margin separator.]
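A minimal sketch of this margin notion of compatibility for a linear separator h = (w, b); the parameter gamma and the normalization by the weight norm are illustrative assumptions.

    import numpy as np

    def margin_chi(w, b, x, gamma):
        # Per-example compatibility: 1 if x lies at distance >= gamma from the hyperplane, else 0.
        dist = abs(np.dot(w, x) + b) / np.linalg.norm(w)
        return 1.0 if dist >= gamma else 0.0

    def margin_unlabeled_error(w, b, X_unlabeled, gamma):
        # Estimated incompatibility: the fraction of unlabeled points inside the margin band.
        return 1.0 - float(np.mean([margin_chi(w, b, x, gamma) for x in X_unlabeled]))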

Margins, Compatibility. Margins: the belief is that there should exist a large-margin separator. If we do not want to commit to γ in advance, define χ(h, x) to be a smooth function of dist(x, h). An illegal notion of compatibility: the largest γ such that D has probability mass exactly zero within distance γ of h (illegal because it cannot be written as an expectation over individual examples).
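One illustrative smooth choice (an assumption, not the slide's formula): let compatibility grow linearly with distance from the separator, capped at 1.

    import numpy as np

    def smooth_margin_chi(w, b, x, gamma_max):
        # Compatibility increases linearly from 0 at the hyperplane to 1 at distance gamma_max.
        dist = abs(np.dot(w, x) + b) / np.linalg.norm(w)
        return min(1.0, dist / gamma_max)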

Co-Training, Compatibility. Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩. The hope is that the two parts of the example are consistent. A legal (and natural) notion of compatibility: the compatibility of ⟨h1, h2⟩ and D, which can be written as an expectation over examples.
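The formula the slide refers to is an image and is missing here. The natural agreement-based notion it presumably shows (a reconstruction consistent with [Balcan & Blum '05], not a quote of the slide) is

$$\chi(\langle h_1, h_2\rangle, D) = \Pr_{\langle x_1, x_2\rangle \sim D}\big[h_1(x_1) = h_2(x_2)\big], \qquad \chi(\langle h_1, h_2\rangle, \langle x_1, x_2\rangle) = \mathbf{1}\{h_1(x_1) = h_2(x_2)\}.$$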

Types of Results in the [BB05] Model. As in the usual PAC model, we can discuss algorithmic and sample complexity issues. Sample complexity issues that we can address: how much unlabeled data we need (depends both on the complexity of C and on the complexity of our notion of compatibility), and the ability of unlabeled data to reduce the number of labeled examples needed (depends on the compatibility of the target and on various measures of the helpfulness of the distribution). We give both uniform convergence bounds and epsilon-cover based bounds.

Examples of results: Sample Complexity - Uniform Convergence Bound. Finite Hypothesis Spaces, Doubly Realizable Case. Define C_{D,χ}(ε) = {h ∈ C : errunl(h) ≤ ε}. The theorem bounds the number of labeled examples via a measure of the helpfulness of D with respect to χ: a helpful distribution is one in which C_{D,χ}(ε) is small.
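The theorem statement itself is an image on the original slide. Reconstructed from the proof sketch two slides below (a hedged reconstruction, not a quote), it has roughly the following form:

$$m_u \ge \frac{1}{\epsilon}\left(\ln|C| + \ln\frac{2}{\delta}\right), \qquad m_l \ge \frac{1}{\epsilon}\left(\ln|C_{D,\chi}(\epsilon)| + \ln\frac{2}{\delta}\right)$$

unlabeled and labeled examples suffice so that, with probability at least 1-δ, every h ∈ C that is fully compatible with Su and consistent with Sl has err(h) ≤ ε.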

Examples of results: Sample Complexity - Uniform Convergence Bound. Simple algorithm: pick a compatible concept that agrees with the labeled sample. [Figure: a highly compatible separator consistent with the labeled data.]

Sample Complexity, Uniform Convergence Bounds Compatible fns in C CD,() = {h 2 C :errunl(h) ·} Proof Probability that h with errunl(h)> ² is compatible with Su is (1-²)mu · ±/(2|C|) Finite Hypothesis Spaces, Doubly Realizable Case a helpful distribution is one in which CD, () is small By union bound, prob. 1-±/2 only hyp in CD,() are compatible with Su ml large enough to ensure that none of fns in CD,() with err(h) ¸ ² have an empirical error rate of 0.

Sample Complexity, Uniform Convergence Bounds Compatible fns in C CD,() = {h 2 C :errunl(h) ·} Bound # of labeled examples as a measure of the helpfulness of D wrt  Finite Hypothesis Spaces, Doubly Realizable Case a helpful distribution is one in which CD, () is small helpful D is one in which CD, () is small

Sample Complexity, Uniform Convergence Bounds. Finite Hypothesis Spaces, Doubly Realizable Case: a helpful distribution is one in which C_{D,χ}(ε) is small. [Figure: a helpful distribution, in which only a few highly compatible separators exist, vs. a non-helpful distribution with 1/γ² clusters in which all partitions are separable by large margin.]

Examples of results: Sample Complexity - Uniform Convergence Bounds. Finite Hypothesis Spaces, c* not fully compatible: Theorem.

Examples of results: Sample Complexity - Uniform Convergence Bounds. Infinite Hypothesis Spaces. Assume χ(h, x) ∈ {0, 1} and let χ(C) = {χ_h : h ∈ C}, where χ_h(x) = χ(h, x). C[m, D] - the expected number of splits of m points from D with concepts in C.

Examples of results: Sample Complexity - Uniform Convergence Bounds. For S ⊆ X, denote by US the uniform distribution over S, and by C[m, US] the expected number of splits of m points from US with concepts in C. Assume err(c*) = 0 and errunl(c*) = 0. Theorem: the number of labeled examples depends on the unlabeled sample. This is useful since we can imagine the learning algorithm performing some calculations over the unlabeled data and then deciding how many labeled examples to purchase.

Sample Complexity Subtleties. Uniform convergence bounds depend both on the complexity of C and on the complexity of χ (a distribution-dependent measure of complexity); furthermore, for every h ∈ C the empirical unlabeled error rate is within ε of errunl(h). ε-cover bounds can be much better than uniform convergence bounds, for algorithms that behave in a specific way: first use the unlabeled data to choose a representative set of compatible hypotheses, then use the labeled sample to choose among these.

Examples of results: Sample Complexity, -Cover-based bounds For algorithms that behave in a specific way: first use the unlabeled data to choose a representative set of compatible hypotheses then use the labeled sample to choose among these Theorem Can result in much better bound than uniform convergence! Maria-Florina Balcan

Implications of the [BB05] analysis - ways in which unlabeled data can help: (1) If c* is highly compatible with D and we have enough unlabeled data to estimate χ over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low). (2) By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (e.g., annealed VC-entropy or the size of the smallest ε-cover). (3) If D is nice, so that the set of compatible h ∈ C has a small ε-cover and the elements of the cover are far apart, then we can learn from even fewer labeled examples than the 1/ε needed just to verify a good hypothesis.