Semi-Supervised Learning Jia-Bin Huang Virginia Tech ECE-5424G / CS-5824 Spring 2019
Administrative HW 4 due April 10
Recommender Systems: Motivation, Problem formulation, Content-based recommendations, Collaborative filtering, Mean normalization
Problem motivation

Movie                 | Alice (1) | Bob (2) | Carol (3) | Dave (4) | x1 (romance) | x2 (action)
----------------------|-----------|---------|-----------|----------|--------------|------------
Love at last          |     5     |    5    |     0     |    0     |     0.9      |    0
Romance forever       |     5     |    ?    |     ?     |    0     |     1.0      |    0.01
Cute puppies of love  |     ?     |    4    |     0     |    ?     |     0.99     |    0
Nonstop car chases    |     0     |    0    |     5     |    4     |     0.1      |    1.0
Swords vs. karate     |     0     |    0    |     5     |    ?     |     0        |    0.9
Problem motivation (learning the features)

Now suppose the feature values are unknown (the same ratings table, with the feature columns replaced by "?"), but each user's parameter vector is given:

$$\theta^{(1)} = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}, \quad \theta^{(2)} = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}, \quad \theta^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix}, \quad \theta^{(4)} = \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix}$$

What feature vector $x^{(1)} = \begin{bmatrix} ? \\ ? \\ ? \end{bmatrix}$ for "Love at last" is consistent with the observed ratings?
Optimization algorithm

Given $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n_u)}$, to learn $x^{(i)}$:

$$\min_{x^{(i)}} \; \frac{1}{2} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Given $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n_u)}$, to learn $x^{(1)}, x^{(2)}, \ldots, x^{(n_m)}$:

$$\min_{x^{(1)}, \ldots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering

Given $x^{(1)}, \ldots, x^{(n_m)}$ (and movie ratings), we can estimate $\theta^{(1)}, \ldots, \theta^{(n_u)}$.
Given $\theta^{(1)}, \ldots, \theta^{(n_u)}$, we can estimate $x^{(1)}, \ldots, x^{(n_m)}$.
This suggests iterating: guess $\theta$, estimate $x$, re-estimate $\theta$, and so on.
Collaborative filtering optimization objective

Given $x^{(1)}, \ldots, x^{(n_m)}$, estimate $\theta^{(1)}, \ldots, \theta^{(n_u)}$:

$$\min_{\theta^{(1)}, \ldots, \theta^{(n_u)}} \; \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$

Given $\theta^{(1)}, \ldots, \theta^{(n_u)}$, estimate $x^{(1)}, \ldots, x^{(n_m)}$:

$$\min_{x^{(1)}, \ldots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Instead of alternating, minimize over $x^{(1)}, \ldots, x^{(n_m)}$ and $\theta^{(1)}, \ldots, \theta^{(n_u)}$ simultaneously:

$$J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering optimization objective

$$J(x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)}) = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
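For concreteness, here is a minimal NumPy sketch of this cost; the names X, Theta, Y, R, and lam are mine, not from the slides, and this is an illustration rather than the course's reference implementation.

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Collaborative filtering cost J.

    X     : (n_m, n) movie feature matrix, row i is x^(i)
    Theta : (n_u, n) user parameter matrix, row j is theta^(j)
    Y     : (n_m, n_u) ratings, Y[i, j] = y^(i,j)
    R     : (n_m, n_u) binary mask, R[i, j] = 1 if movie i was rated by user j
    lam   : regularization strength lambda
    """
    E = (X @ Theta.T - Y) * R          # errors only on observed ratings
    J = 0.5 * np.sum(E ** 2)
    J += 0.5 * lam * (np.sum(Theta ** 2) + np.sum(X ** 2))
    return J
```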
Collaborative filtering algorithm

1. Initialize $x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)}$ to small random values.
2. Minimize $J(x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). For every $i = 1, \ldots, n_m$ and $j = 1, \ldots, n_u$:

$$x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)$$

$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)$$

3. For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^\top x$.
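A correspondingly vectorized sketch of one gradient-descent step, reusing the hypothetical names from the cost sketch above (both gradients follow directly from J):

```python
def cofi_gradient_step(X, Theta, Y, R, lam, alpha):
    """One gradient-descent step on J, updating X and Theta simultaneously."""
    E = (X @ Theta.T - Y) * R           # (n_m, n_u) errors on observed ratings
    X_grad = E @ Theta + lam * X        # dJ/dX
    Theta_grad = E.T @ X + lam * Theta  # dJ/dTheta
    X -= alpha * X_grad
    Theta -= alpha * Theta_grad
    return X, Theta
```

In practice one would initialize X and Theta to small random values (e.g., 0.01 * rng.standard_normal(...)) and repeat this step until J stops decreasing.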
Collaborative filtering

[Ratings table as in the problem-motivation slide, now with no feature columns at all: both the features $x^{(i)}$ and the parameters $\theta^{(j)}$ must be learned from the ratings alone.]
Collaborative filtering

Stack the learned vectors into matrices:

$$X = \begin{bmatrix} - \, (x^{(1)})^\top \, - \\ - \, (x^{(2)})^\top \, - \\ \vdots \\ - \, (x^{(n_m)})^\top \, - \end{bmatrix}, \qquad \Theta = \begin{bmatrix} - \, (\theta^{(1)})^\top \, - \\ - \, (\theta^{(2)})^\top \, - \\ \vdots \\ - \, (\theta^{(n_u)})^\top \, - \end{bmatrix}$$

Predicted ratings: $Y = X \Theta^\top$. This is known as low-rank matrix factorization.
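In code, the whole table of predicted ratings is a single matrix product (a sketch, with the rows of the hypothetical X and Theta holding the learned vectors):

```python
import numpy as np

def predict_ratings(X: np.ndarray, Theta: np.ndarray) -> np.ndarray:
    """Predicted rating matrix Y = X Theta^T (movies x users)."""
    return X @ Theta.T   # entry [i, j] = predicted rating of movie i by user j
```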
Finding related movies/products

For each product $i$, we learn a feature vector $x^{(i)} \in \mathbb{R}^n$ ($x_1$: romance, $x_2$: action, $x_3$: comedy, ...). How do we find movies $j$ related to movie $i$? If $\| x^{(i)} - x^{(j)} \|$ is small, movies $i$ and $j$ are "similar"; a similarity search is sketched below.
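A minimal sketch of that similarity search, assuming X holds the learned feature vectors as rows (names are mine):

```python
import numpy as np

def most_similar(X: np.ndarray, i: int, k: int = 5) -> np.ndarray:
    """Indices of the k movies whose learned feature vectors are closest
    (in Euclidean distance) to movie i."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                        # exclude the query movie itself
    return np.argsort(d)[:k]
```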
Recommender Systems: Motivation, Problem formulation, Content-based recommendations, Collaborative filtering, Mean normalization
Users who have not rated any movies

[Ratings table as before, extended with a fifth user, Eve (5), whose column is entirely unrated.]

In the objective

$$\frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2,$$

no data-fitting term involves Eve, so only the regularizer acts on her parameters and the minimizer is $\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$. Every predicted rating $(\theta^{(5)})^\top x^{(i)}$ is then 0, which is useless for making recommendations.
Mean normalization

Subtract each movie's mean observed rating $\mu_i$ from its ratings, then learn $\theta^{(j)}, x^{(i)}$ on the normalized data. For user $j$, predict a rating on movie $i$ of $(\theta^{(j)})^\top x^{(i)} + \mu_i$. For user 5 (Eve), $\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, so the prediction is $(\theta^{(5)})^\top x^{(i)} + \mu_i = \mu_i$: a brand-new user is predicted to give each movie its average rating.
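A small sketch of mean normalization under the same hypothetical ratings/mask representation as above:

```python
import numpy as np

def mean_normalize(Y: np.ndarray, R: np.ndarray):
    """Subtract each movie's mean observed rating.

    Y : (n_m, n_u) ratings; R : (n_m, n_u) binary mask of observed entries.
    Returns the normalized ratings and the per-movie means mu.
    """
    counts = np.maximum(R.sum(axis=1), 1)   # avoid divide-by-zero for unrated movies
    mu = (Y * R).sum(axis=1) / counts
    Y_norm = (Y - mu[:, None]) * R          # normalize only the observed entries
    return Y_norm, mu

# After training on Y_norm, predict user j's rating of movie i as
# X[i] @ Theta[j] + mu[i]; for a brand-new user this is just mu[i].
```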
Recommender Systems: Motivation, Problem formulation, Content-based recommendations, Collaborative filtering, Mean normalization
Review of supervised learning: k-nearest neighbors, linear regression, naïve Bayes, logistic regression, support vector machines, neural networks
Review of unsupervised learning: clustering (k-means), expectation maximization, dimensionality reduction, anomaly detection, recommender systems
Advanced topics: semi-supervised learning, probabilistic graphical models, generative models, sequence prediction models, deep reinforcement learning
Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling
The classic paradigm is insufficient nowadays. Modern applications involve massive amounts of raw data (protein sequences, billions of webpages, images), of which only a tiny fraction can be annotated by human experts.
Semi-supervised Learning
Active Learning
Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling
Semi-supervised Learning: Problem Formulation

Labeled data: $S_l = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m_l)}, y^{(m_l)}) \}$
Unlabeled data: $S_u = \{ x^{(1)}, x^{(2)}, \ldots, x^{(m_u)} \}$ (inputs only, no labels)
Goal: learn a hypothesis $h_\theta$ (e.g., a classifier) that has small error.
Combining labeled and unlabeled data: classical methods

Transductive SVM [Joachims '99]
Co-training [Blum and Mitchell '98]
Graph-based methods [Blum and Chawla '01] [Zhu, Ghahramani, Lafferty '03]
Transductive SVM

The separator should pass through low-density regions of the input space, i.e., maintain a large margin with respect to the unlabeled points as well as the labeled ones.
SVM vs. Transductive SVM

SVM. Inputs: $(x_l^{(i)}, y_l^{(i)})$

$$\min_{\theta} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad y_l^{(i)} \, \theta^\top x_l^{(i)} \geq 1 \;\; \forall i$$

Transductive SVM. Inputs: $(x_l^{(i)}, y_l^{(i)})$ and unlabeled points $x_u^{(i)}$; their labels $y_u^{(i)}$ become optimization variables:

$$\min_{\theta, \, y_u} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad y_l^{(i)} \, \theta^\top x_l^{(i)} \geq 1, \quad y_u^{(i)} \, \theta^\top x_u^{(i)} \geq 1, \quad y_u^{(i)} \in \{-1, 1\}$$
Transductive SVMs

First maximize the margin over the labeled points. Use the resulting separator to assign initial labels to the unlabeled points. Then try flipping the labels of unlabeled points to see whether doing so can increase the margin. A simplified sketch of this loop follows.
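A heavily simplified sketch of that loop using scikit-learn's LinearSVC; this illustrates the structure only and omits key pieces of Joachims' actual TSVM (the annealed unlabeled-loss weight and balanced pairwise label swaps):

```python
import numpy as np
from sklearn.svm import LinearSVC

def simple_tsvm(X_l, y_l, X_u, n_iter=10):
    """Crude transductive-SVM-style training: alternate between fitting an
    SVM on labeled + pseudo-labeled data and re-assigning unlabeled labels."""
    clf = LinearSVC().fit(X_l, y_l)        # margin over labeled points only
    y_u = clf.predict(X_u)                 # initial labels for unlabeled points
    for _ in range(n_iter):
        X_all = np.vstack([X_l, X_u])
        y_all = np.concatenate([y_l, y_u])
        clf = LinearSVC().fit(X_all, y_all)
        y_new = clf.predict(X_u)           # flip labels the new separator disagrees with
        if np.array_equal(y_new, y_u):     # stop once the assignments stabilize
            break
        y_u = y_new
    return clf, y_u
```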
Deep Semi-supervised Learning
Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling
Stochastic Perturbations / Π-Model

Realistic perturbations $x \to \tilde{x}$ of data points $x \in D_{UL}$ should not significantly change the output of $h_\theta(x)$. A sketch of the resulting consistency loss follows.
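A minimal PyTorch-style sketch of a Π-model-style loss; model, augment, and w_cons are placeholder names I am introducing, and implementations differ in details such as whether both perturbed branches receive gradients:

```python
import torch
import torch.nn.functional as F

def pi_model_loss(model, augment, x_l, y_l, x_u, w_cons=1.0):
    """Supervised loss on labeled data plus a consistency term that
    penalizes disagreement between two stochastic perturbations of the
    same unlabeled inputs."""
    sup = F.cross_entropy(model(augment(x_l)), y_l)
    p1 = F.softmax(model(augment(x_u)), dim=1)   # first perturbed pass
    p2 = F.softmax(model(augment(x_u)), dim=1)   # second perturbed pass
    cons = F.mse_loss(p1, p2.detach())           # treat one branch as the target
    return sup + w_cons * cons
```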
Temporal Ensembling [Laine and Aila '17]: maintain an exponential moving average of each example's past predictions and use it as the consistency target.
Mean Teacher [Tarvainen and Valpola '17]: instead of averaging predictions, maintain an exponential moving average of the model weights (the "teacher") and penalize disagreement between student and teacher predictions. An EMA sketch follows.
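A minimal sketch of the Mean Teacher weight update, assuming the teacher network was created as a deep copy of the student; names are mine:

```python
import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module,
                   ema_decay: float = 0.99):
    """Teacher weights track an exponential moving average of the student
    weights; call this after every optimizer step on the student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(ema_decay).add_(p_s, alpha=1.0 - ema_decay)
```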
Virtual Adversarial Training [Miyato et al. '18]: find the small perturbation of each input that most changes the model's prediction (measured by KL divergence) and train the prediction to be robust to it.
Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling
Entropy Minimization (EntMin) [Grandvalet and Bengio '05]: add the average prediction entropy on unlabeled data to the training loss, encouraging more confident predictions on unlabeled data. A sketch is below.
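A minimal sketch of the entropy penalty on an unlabeled batch (logits_u is a placeholder name):

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits_u: torch.Tensor) -> torch.Tensor:
    """Average prediction entropy over an unlabeled batch. Adding this
    (weighted) term to the training loss pushes the model toward
    confident, low-entropy predictions on unlabeled data."""
    log_p = F.log_softmax(logits_u, dim=1)
    p = log_p.exp()
    return -(p * log_p).sum(dim=1).mean()
```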
Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling
Comparison
Varying number of labels
Class mismatch between the labeled and unlabeled datasets hurts performance.
Lessons

- Compare methods with a standardized architecture and an equal budget for tuning hyperparameters.
- Unlabeled data from a different class distribution is not that useful.
- Most methods don't work well in the very low labeled-data regime.
- Transferring a model pre-trained on ImageNet produces a lower error rate.
- These conclusions are based on small datasets, though.