Semi-Supervised Learning William Cohen

Outline
The general idea and an example (NELL)
Some types of SSL
– Margin-based: transductive SVM; logistic regression with entropic regularization
– Generative: seeded k-means
– Nearest-neighbor-like: graph-based SSL

INTRO TO SEMI-SUPERVISED LEARNING (SSL)

Semi-supervised learning
Given:
– a pool of labeled examples L
– a (usually larger) pool of unlabeled examples U
Option 1 for using L and U: ignore U and use supervised learning on L.
Option 2: ignore the labels in L+U and use k-means, etc., to find clusters; then label each cluster using L.
Question: can you use both L and U to do better?
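As a concrete, purely illustrative sketch of the two baseline options, assuming labeled arrays X_L, y_L and an unlabeled array X_U (names and synthetic data are made up here), using scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Illustrative synthetic data: X_L/y_L play the role of L, X_U the role of U.
rng = np.random.RandomState(0)
X_L = rng.randn(20, 2) + np.repeat([[0.0, 0.0], [3.0, 3.0]], 10, axis=0)
y_L = np.repeat([0, 1], 10)
X_U = rng.randn(200, 2) + np.repeat([[0.0, 0.0], [3.0, 3.0]], 100, axis=0)

# Option 1: ignore U, train a supervised learner on L alone.
clf = LogisticRegression().fit(X_L, y_L)

# Option 2: ignore the labels, cluster L+U with k-means, then name each
# cluster by majority vote over the labeled examples that land in it
# (assumes every cluster contains at least one labeled example).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(np.vstack([X_L, X_U]))
cluster_of_L = km.labels_[: len(X_L)]
cluster_name = {c: np.bincount(y_L[cluster_of_L == c]).argmax() for c in range(2)}
```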

SSL is Somewhere Between Clustering and Supervised Learning

What is a natural grouping among these objects? (slides: Bhavana Dalvi)

SSL is Between Clustering and SL. Clustering is unconstrained and may not give you what you want: maybe this clustering is as good as the other.

SSL is Between Clustering and SL. Supervised learning with few labels is also unconstrained and may not give you what you want.

SSL is Between Clustering and SL. This clustering isn’t consistent with the labels.

SSL is Between Clustering and SL. |Predicted Green| / |U| ≈ 50%.

SSL in Action: The NELL System

Outline
The general idea and an example (NELL)
Some types of SSL
– Margin-based: transductive SVM; logistic regression with entropic regularization
– Generative: seeded k-means
– Nearest-neighbor-like: graph-based SSL

TRANSDUCTIVE SVM

Two Kinds of Learning
Inductive SSL:
– Input: training set $(x_1,y_1),\dots,(x_n,y_n)$ plus unlabeled examples $x_{n+1}, x_{n+2},\dots,x_{n+m}$
– Output: a classifier $f(x) = y$
– The classifier can be run on any test example $x$
Transductive SSL:
– Input: training set $(x_1,y_1),\dots,(x_n,y_n)$ plus unlabeled examples $x_{n+1}, x_{n+2},\dots,x_{n+m}$
– Output: labels $f(x_i) = y_i$
– The classifier is only defined for the $x_i$’s seen at training time

Standard SVM

Transductive SVM

Not a convex problem – need to do some sort of search to guess the labels for the unlabeled examples
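One crude way to approximate that search, sketched below under simplifying assumptions: guess labels for U with the current SVM, retrain with the guesses down-weighted, and repeat. Real transductive SVMs (e.g. Joachims’ SVMlight implementation) also anneal the weight on the unlabeled examples and enforce a class-balance constraint; the names tsvm_sketch and C_U are made up here.

```python
import numpy as np
from sklearn.svm import LinearSVC

def tsvm_sketch(X_L, y_L, X_U, n_rounds=10, C_U=0.1):
    """Illustrative stand-in for the transductive-SVM label search."""
    clf = LinearSVC().fit(X_L, y_L)              # start from the supervised SVM
    for _ in range(n_rounds):
        y_U = clf.predict(X_U)                   # current guess at U's labels
        X = np.vstack([X_L, X_U])
        y = np.concatenate([y_L, y_U])
        w = np.concatenate([np.ones(len(y_L)),   # full weight on true labels,
                            C_U * np.ones(len(y_U))])  # low weight on guesses
        clf = LinearSVC().fit(X, y, sample_weight=w)
    return clf
```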

SSL using regularized SGD for logistic regression
1. $P(y \mid x) = \mathrm{logistic}(x \cdot w)$
2. Define a loss function: the usual log-loss on the labeled examples, plus the entropy of the predictions on the unlabeled examples
3. Differentiate the loss function and use gradient descent to learn $w$

Logistic regression with entropic regularization. Again, not a convex problem, so gradient descent may only find a local optimum. A low-entropy example is one where the model is very confident in one class; a high-entropy example has high probability of being in either class.
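A minimal NumPy sketch of this objective, using full-batch gradient descent rather than SGD for brevity (lam, lr and the function name are illustrative): log-loss on the labeled pool plus lam times the total entropy of the predictions on the unlabeled pool.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def entropy_reg_logreg(X_L, y_L, X_U, lam=0.1, lr=0.1, epochs=200):
    """Minimize  -sum_L log P(y|x)  +  lam * sum_U H(P(.|x))  by gradient
    descent, where H is the binary entropy of the model's prediction on an
    unlabeled point; the entropy term pushes the boundary away from U."""
    w = np.zeros(X_L.shape[1])
    for _ in range(epochs):
        p_L = sigmoid(X_L @ w)
        grad = X_L.T @ (p_L - y_L)               # gradient of the labeled log-loss
        p_U = np.clip(sigmoid(X_U @ w), 1e-8, 1 - 1e-8)
        # d/dw of H(p) is log((1-p)/p) * p * (1-p) * x for each unlabeled x:
        grad += lam * (X_U.T @ (np.log((1 - p_U) / p_U) * p_U * (1 - p_U)))
        w -= lr * grad
    return w
```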

SEMI-SUPERVISED K-MEANS AND MIXTURE MODELS

k-means

K-means Clustering: Steps 1–5 (figures: pick initial centroids, assign each point to its nearest centroid, recompute each centroid as the mean of its assigned points, and repeat until the assignments stop changing)

Seeded k-means: run k-means where k is the number of classes, initializing the centroids using the labeled “seed” data, except keep the seeds in the class they are known to belong to.
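A minimal sketch of the idea, assuming classes are numbered 0..k-1 and every class has at least one seed (this is the variant where seeds stay clamped; all names are illustrative):

```python
import numpy as np

def seeded_kmeans(X_L, y_L, X_U, n_iters=20):
    """Seeded k-means: k = number of classes, centroids initialized from the
    labeled seeds, and seeds never leave their known class."""
    k = len(np.unique(y_L))
    X = np.vstack([X_L, X_U])
    n_seeds = len(y_L)
    centers = np.array([X_L[y_L == c].mean(axis=0) for c in range(k)])
    for _ in range(n_iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)            # usual nearest-centroid step
        labels[:n_seeds] = y_L                   # ...except clamp the seeds
        # recompute centroids (assumes no cluster ever goes empty)
        centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return centers, labels
```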

Basu and Mooney, ICML 2002: results on the 20 Newsgroups dataset comparing unsupervised k-means, seeded k-means (just use the labeled data to initialize the clusters), constrained k-means, and some old baseline I’m not going to even talk about.

Outline
The general idea and an example (NELL)
Some types of SSL
– Margin-based: transductive SVM; logistic regression with entropic regularization
– Generative: seeded k-means, and some recent extensions…
– Nearest-neighbor-like: graph-based SSL

Seeded k-means for hierarchical classification tasks
Simple extension:
1. Don’t assign each example to one of K classes; instead make a decision about every class in the ontology: instead of example → {1,…,K}, map each example to a bit vector with one bit for each category.
2. Pick the “closest” bit vector consistent with the ontological constraints. This is an (ontology-sized) optimization problem that you solve independently for each example.
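As one illustrative reading of that per-example optimization (not the paper’s actual formulation), suppose the ontology is a tree and a bit vector is “consistent” exactly when its on-bits form a root-to-node path, i.e. every selected category’s ancestors are also selected. Then the closest consistent bit vector can be found by brute-force enumeration:

```python
def closest_consistent_bitvector(scores, parent):
    """scores[c]: affinity of this example for category c (higher = closer).
    parent[c]: parent of category c in the ontology tree (None for a root).
    Returns the set of 'on' categories: the root-to-node path with the
    highest total affinity."""
    best_path, best_total = set(), float("-inf")
    for c in range(len(scores)):
        path, node = set(), c
        while node is not None:                  # walk from c up to its root
            path.add(node)
            node = parent[node]
        total = sum(scores[j] for j in path)
        if total > best_total:
            best_path, best_total = path, total
    return best_path
```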

Seeded k-means, hierarchical version: k is the number of classes, the centroids are initialized using the labeled “seed” data, the seeds are kept in the classes they are known to belong to, and each example is assigned to the best consistent set of categories from the ontology.

Automatic Gloss Finding for a Knowledge Base
Glosses: natural-language definitions of named entities. E.g. “Microsoft” is an American multinational corporation headquartered in Redmond that develops, manufactures, licenses, supports and sells computer software, consumer electronics and personal computers and services…
– Input: a knowledge base, i.e. a set of concepts (e.g. company) and entities belonging to those concepts (e.g. Microsoft), and a set of potential glosses.
– Output: candidate glosses matched to the relevant entities in the KB, e.g. “Microsoft is an American multinational corporation headquartered in Redmond …” is mapped to the entity “Microsoft” of type “Company”.
[Automatic Gloss Finding for a Knowledge Base using Ontological Constraints, Bhavana Dalvi Mishra, Einat Minkov, Partha Pratim Talukdar, and William W. Cohen, 2014, under submission]

Example: Gloss finding

Training a clustering model. Labeled glosses: the unambiguous ones. Unlabeled glosses: the ambiguous ones (figure shows a Fruit cluster and a Company cluster).

GLOFIN: Clustering glosses

GLOFIN on the NELL dataset: categories, 247K candidate glosses, #train = 20K, #test = 227K.

Outline
The general idea and an example (NELL)
Some types of SSL
– Margin-based: transductive SVM; logistic regression with entropic regularization
– Generative: seeded k-means
– Nearest-neighbor-like: graph-based SSL

Harmonic fields – Ghahramani, Lafferty and Zhu
Idea: construct a graph connecting the most similar examples (a k-NN graph), with some nodes carrying an observed label.
Intuition: nearby points should have similar labels – labels should “propagate” through the graph.
Formalization: try to minimize the “energy” defined as
$E(y) = \sum_{i,j} w_{ij}\,(y_i - y_j)^2$
where $w_{ij}$ is the similarity weight on the edge between examples $i$ and $j$. (In this example y is a length-10 vector.)

Harmonic fields – Ghahramani, Lafferty and Zhu
Result 1: at the minimal-energy state, each unlabeled node’s value is a weighted average of its neighbors’ values:
$y_i = \frac{\sum_j w_{ij}\, y_j}{\sum_j w_{ij}}$

“Harmonic field” LP algorithm
Result 2: you can reach the minimal-energy state with a simple iterative algorithm:
– Step 1: for each seed example $(x_i, y_i)$, let $V_0(i,c) = [\![\, y_i = c \,]\!]$
– Step 2: for $t = 1,\dots,T$ (T is about 5), let $V_{t+1}(i,c)$ be the weighted average of $V_t(j,c)$ over all $j$ linked to $i$, then renormalize; for seeds, reset $V_{t+1}(i,c) = [\![\, y_i = c \,]\!]$
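The two steps transcribe almost directly into NumPy. In this sketch, W is a symmetric, nonnegative affinity matrix (e.g. from a k-NN graph) over all n examples, and seed_idx/seed_labels are illustrative names for the seeds:

```python
import numpy as np

def harmonic_lp(W, seed_idx, seed_labels, n_classes, T=5):
    """Iterative harmonic-field label propagation: average neighbors'
    scores, renormalize, and clamp the seeds after every sweep."""
    n = W.shape[0]
    V = np.zeros((n, n_classes))
    V[seed_idx, seed_labels] = 1.0               # V_0(i,c) = [| y_i = c |]
    for _ in range(T):
        V = W @ V                                # weighted sum over neighbors
        V /= V.sum(axis=1, keepdims=True) + 1e-12  # renormalize each row
        V[seed_idx] = 0.0                        # reset the seeds...
        V[seed_idx, seed_labels] = 1.0           # ...to their observed labels
    return V.argmax(axis=1)                      # predicted class per node
```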

Harmonic fields – Ghahramani, Lafferty and Zhu. This family of techniques is called “label propagation”.

Harmonic fields – Ghahramani, Lafferty and Zhu. This experiment points out some of the issues with LP:
1. What distance metric do you use?
2. What energy function do you minimize?
3. What is the right value for K in your K-NN graph? Is a K-NN graph right at all?
4. If you have lots of data, how expensive is it to build the graph?

NELL: uses Co-EM ~= HF. Extract cities: a bipartite graph links candidate examples (Paris, Pittsburgh, Seattle, Cupertino, San Francisco, Austin, Berlin, denial, anxiety, selfishness) to the context features they occur with (“mayor of arg1”, “live in arg1”, “arg1 is home of”, “traits such as arg1”).

Semi-Supervised Bootstrapped Learning via Label Propagation (figure: the same bipartite graph of candidate examples and context features).

Semi-Supervised Bootstrapped Learning via Label Propagation. Nodes “near” the seeds vs. nodes “far from” the seeds: information from other categories (e.g. arrogance, denial, selfishness linked by “traits such as arg1”) tells you “how far” is too far – that is, when to stop propagating.

Difference from plain LP: the graph construction is not instance-to-instance but instance-to-feature.
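A sketch of the same iterative scheme adapted to an instance-to-feature graph: label scores bounce from instances to features and back each round, with the seed instances clamped. This is a simplification in the spirit of co-EM, not NELL’s actual implementation; M and the function name are made up:

```python
import numpy as np

def bipartite_propagation(M, seed_idx, seed_labels, n_classes, T=5):
    """M: n_instances x n_features co-occurrence matrix (e.g. noun phrase
    x context-pattern counts). Scores propagate instance -> feature ->
    instance, with seed instances clamped to their labels each round."""
    inst = np.zeros((M.shape[0], n_classes))
    inst[seed_idx, seed_labels] = 1.0
    for _ in range(T):
        feat = M.T @ inst                        # features average their instances
        feat /= feat.sum(axis=1, keepdims=True) + 1e-12
        inst = M @ feat                          # instances average their features
        inst /= inst.sum(axis=1, keepdims=True) + 1e-12
        inst[seed_idx] = 0.0
        inst[seed_idx, seed_labels] = 1.0        # clamp the seeds
    return inst                                  # per-instance class scores
```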

Some other general issues with SSL
How much unlabeled data do you want?
– Suppose you’re optimizing $J = J_L(L) + J_U(U)$.
– If $|U| \gg |L|$, does $J_U$ dominate $J$? If so, you’re basically just clustering.
– Often we need to balance $J_L$ and $J_U$.
Besides L, what other information about the task is useful (or necessary)?
– A common choice: the relative frequency of the classes.
– There are various ways of incorporating this into the optimization problem.

Key and not-so-key points
The general idea: what is SSL, and when do you want to use it? NELL as an example of SSL.
Different SSL methods:
– Margin-based (start with a supervised learner): the transductive SVM – what’s optimized and why; logistic regression with entropic regularization.
– k-means versus seeded k-means (start with clustering): the core algorithm and what it optimizes; the extension to the hierarchical case and GLOFIN.
– Nearest-neighbor-like: graph-based SSL and LP; the HF algorithm and the energy function being minimized.