Semi-Supervised Learning Jia-Bin Huang Virginia Tech ECE-5424G / CS-5824 Spring 2019
Administrative HW 4 due April 10
Recommender Systems: Motivation, Problem formulation, Content-based recommendations, Collaborative filtering, Mean normalization
Problem motivation

Movie                 | Alice (1) | Bob (2) | Carol (3) | Dave (4) | x1 (romance) | x2 (action)
----------------------|-----------|---------|-----------|----------|--------------|------------
Love at last          |     5     |    5    |     0     |    0     |     0.9      |    0
Romance forever       |     5     |    ?    |     ?     |    0     |     1.0      |    0.01
Cute puppies of love  |     ?     |    4    |     0     |    ?     |     0.99     |    0
Nonstop car chases    |     0     |    0    |     5     |    4     |     0.1      |    1.0
Swords vs. karate     |     0     |    0    |     5     |    ?     |     0        |    0.9
Problem motivation (learning the features)

Now suppose the feature values are unknown (the same ratings table, with the feature columns replaced by "?"), but each user's parameter vector is given:

$$\theta^{(1)} = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}, \quad \theta^{(2)} = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}, \quad \theta^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix}, \quad \theta^{(4)} = \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix}$$

What feature vector $x^{(1)} = \begin{bmatrix} ? \\ ? \\ ? \end{bmatrix}$ for "Love at last" is consistent with the observed ratings?
Optimization algorithm

Given $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n_u)}$, to learn $x^{(i)}$:

$$\min_{x^{(i)}} \; \frac{1}{2} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Given $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n_u)}$, to learn $x^{(1)}, x^{(2)}, \ldots, x^{(n_m)}$:

$$\min_{x^{(1)}, \ldots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering

Given $x^{(1)}, \ldots, x^{(n_m)}$ (and movie ratings), we can estimate $\theta^{(1)}, \ldots, \theta^{(n_u)}$.
Given $\theta^{(1)}, \ldots, \theta^{(n_u)}$, we can estimate $x^{(1)}, \ldots, x^{(n_m)}$.
This suggests iterating: guess $\theta$, estimate $x$, re-estimate $\theta$, and so on.
Collaborative filtering optimization objective

Given $x^{(1)}, \ldots, x^{(n_m)}$, estimate $\theta^{(1)}, \ldots, \theta^{(n_u)}$:

$$\min_{\theta^{(1)}, \ldots, \theta^{(n_u)}} \; \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$

Given $\theta^{(1)}, \ldots, \theta^{(n_u)}$, estimate $x^{(1)}, \ldots, x^{(n_m)}$:

$$\min_{x^{(1)}, \ldots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Instead of alternating, minimize over $x^{(1)}, \ldots, x^{(n_m)}$ and $\theta^{(1)}, \ldots, \theta^{(n_u)}$ simultaneously:

$$J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering optimization objective

$$J(x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)}) = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
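For concreteness, here is a minimal NumPy sketch of this cost; the names X, Theta, Y, R, and lam are mine, not from the slides, and this is an illustration rather than the course's reference implementation.

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Collaborative filtering cost J.

    X     : (n_m, n) movie feature matrix, row i is x^(i)
    Theta : (n_u, n) user parameter matrix, row j is theta^(j)
    Y     : (n_m, n_u) ratings, Y[i, j] = y^(i,j)
    R     : (n_m, n_u) binary mask, R[i, j] = 1 if movie i was rated by user j
    lam   : regularization strength lambda
    """
    E = (X @ Theta.T - Y) * R          # errors only on observed ratings
    J = 0.5 * np.sum(E ** 2)
    J += 0.5 * lam * (np.sum(Theta ** 2) + np.sum(X ** 2))
    return J
```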
Collaborative filtering algorithm

1. Initialize $x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)}$ to small random values.
2. Minimize $J(x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). For every $i = 1, \ldots, n_m$ and $j = 1, \ldots, n_u$:

$$x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)$$

$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)$$

3. For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^\top x$.
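A correspondingly vectorized sketch of one gradient-descent step, reusing the hypothetical names from the cost sketch above (both gradients follow directly from J):

```python
def cofi_gradient_step(X, Theta, Y, R, lam, alpha):
    """One gradient-descent step on J, updating X and Theta simultaneously."""
    E = (X @ Theta.T - Y) * R           # (n_m, n_u) errors on observed ratings
    X_grad = E @ Theta + lam * X        # dJ/dX
    Theta_grad = E.T @ X + lam * Theta  # dJ/dTheta
    X -= alpha * X_grad
    Theta -= alpha * Theta_grad
    return X, Theta
```

In practice one would initialize X and Theta to small random values (e.g., 0.01 * rng.standard_normal(...)) and repeat this step until J stops decreasing.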
Collaborative filtering

[Ratings table as in the problem-motivation slide, now with no feature columns at all: both the features $x^{(i)}$ and the parameters $\theta^{(j)}$ must be learned from the ratings alone.]
Collaborative filtering

Stack the learned vectors into matrices:

$$X = \begin{bmatrix} - \, (x^{(1)})^\top \, - \\ - \, (x^{(2)})^\top \, - \\ \vdots \\ - \, (x^{(n_m)})^\top \, - \end{bmatrix}, \qquad \Theta = \begin{bmatrix} - \, (\theta^{(1)})^\top \, - \\ - \, (\theta^{(2)})^\top \, - \\ \vdots \\ - \, (\theta^{(n_u)})^\top \, - \end{bmatrix}$$

Predicted ratings: $Y = X \Theta^\top$. This is known as low-rank matrix factorization.
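In code, the whole table of predicted ratings is a single matrix product (a sketch, with the rows of the hypothetical X and Theta holding the learned vectors):

```python
import numpy as np

def predict_ratings(X: np.ndarray, Theta: np.ndarray) -> np.ndarray:
    """Predicted rating matrix Y = X Theta^T (movies x users)."""
    return X @ Theta.T   # entry [i, j] = predicted rating of movie i by user j
```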
Finding related movies/products

For each product $i$, we learn a feature vector $x^{(i)} \in \mathbb{R}^n$ ($x_1$: romance, $x_2$: action, $x_3$: comedy, ...). How do we find movies $j$ related to movie $i$? If $\| x^{(i)} - x^{(j)} \|$ is small, movies $i$ and $j$ are "similar"; a similarity search is sketched below.
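A minimal sketch of that similarity search, assuming X holds the learned feature vectors as rows (names are mine):

```python
import numpy as np

def most_similar(X: np.ndarray, i: int, k: int = 5) -> np.ndarray:
    """Indices of the k movies whose learned feature vectors are closest
    (in Euclidean distance) to movie i."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                        # exclude the query movie itself
    return np.argsort(d)[:k]
```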
Recommender Systems: Motivation, Problem formulation, Content-based recommendations, Collaborative filtering, Mean normalization
Users who have not rated any movies

[Ratings table as before, extended with a fifth user, Eve (5), whose column is entirely unrated.]

In the objective

$$\frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2,$$

no data-fitting term involves Eve, so only the regularizer acts on her parameters and the minimizer is $\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$. Every predicted rating $(\theta^{(5)})^\top x^{(i)}$ is then 0, which is useless for making recommendations.
Mean normalization

Subtract each movie's mean observed rating $\mu_i$ from its ratings, then learn $\theta^{(j)}, x^{(i)}$ on the normalized data. For user $j$, predict a rating on movie $i$ of $(\theta^{(j)})^\top x^{(i)} + \mu_i$. For user 5 (Eve), $\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, so the prediction is $(\theta^{(5)})^\top x^{(i)} + \mu_i = \mu_i$: a brand-new user is predicted to give each movie its average rating.
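A small sketch of mean normalization under the same hypothetical ratings/mask representation as above:

```python
import numpy as np

def mean_normalize(Y: np.ndarray, R: np.ndarray):
    """Subtract each movie's mean observed rating.

    Y : (n_m, n_u) ratings; R : (n_m, n_u) binary mask of observed entries.
    Returns the normalized ratings and the per-movie means mu.
    """
    counts = np.maximum(R.sum(axis=1), 1)   # avoid divide-by-zero for unrated movies
    mu = (Y * R).sum(axis=1) / counts
    Y_norm = (Y - mu[:, None]) * R          # normalize only the observed entries
    return Y_norm, mu

# After training on Y_norm, predict user j's rating of movie i as
# X[i] @ Theta[j] + mu[i]; for a brand-new user this is just mu[i].
```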
Recommender Systems: Motivation, Problem formulation, Content-based recommendations, Collaborative filtering, Mean normalization
Review of supervised learning: k-nearest neighbors, linear regression, naïve Bayes, logistic regression, support vector machines, neural networks
Review of unsupervised learning: clustering (k-means), expectation maximization, dimensionality reduction, anomaly detection, recommender systems
Advanced topics: semi-supervised learning, probabilistic graphical models, generative models, sequence prediction models, deep reinforcement learning
Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling
The classic paradigm is insufficient nowadays. Modern applications involve massive amounts of raw data (protein sequences, billions of webpages, images), of which only a tiny fraction can be annotated by human experts.
Semi-supervised Learning
Active Learning
Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling
Semi-supervised Learning: Problem Formulation

Labeled data: $S_l = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m_l)}, y^{(m_l)}) \}$
Unlabeled data: $S_u = \{ x^{(1)}, x^{(2)}, \ldots, x^{(m_u)} \}$ (inputs only, no labels)
Goal: learn a hypothesis $h_\theta$ (e.g., a classifier) that has small error.
Combining labeled and unlabeled data: classical methods

Transductive SVM [Joachims '99]
Co-training [Blum and Mitchell '98]
Graph-based methods [Blum and Chawla '01] [Zhu, Ghahramani, Lafferty '03]
Transductive SVM

The separator should pass through low-density regions of the input space, i.e., maintain a large margin with respect to the unlabeled points as well as the labeled ones.
SVM vs. Transductive SVM

SVM. Inputs: $(x_l^{(i)}, y_l^{(i)})$

$$\min_{\theta} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad y_l^{(i)} \, \theta^\top x_l^{(i)} \geq 1 \;\; \forall i$$

Transductive SVM. Inputs: $(x_l^{(i)}, y_l^{(i)})$ and unlabeled points $x_u^{(i)}$; their labels $y_u^{(i)}$ become optimization variables:

$$\min_{\theta, \, y_u} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad y_l^{(i)} \, \theta^\top x_l^{(i)} \geq 1, \quad y_u^{(i)} \, \theta^\top x_u^{(i)} \geq 1, \quad y_u^{(i)} \in \{-1, 1\}$$
Transductive SVMs

First maximize the margin over the labeled points. Use the resulting separator to assign initial labels to the unlabeled points. Then try flipping the labels of unlabeled points to see whether doing so can increase the margin. A simplified sketch of this loop follows.
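A heavily simplified sketch of that loop using scikit-learn's LinearSVC; this illustrates the structure only and omits key pieces of Joachims' actual TSVM (the annealed unlabeled-loss weight and balanced pairwise label swaps):

```python
import numpy as np
from sklearn.svm import LinearSVC

def simple_tsvm(X_l, y_l, X_u, n_iter=10):
    """Crude transductive-SVM-style training: alternate between fitting an
    SVM on labeled + pseudo-labeled data and re-assigning unlabeled labels."""
    clf = LinearSVC().fit(X_l, y_l)        # margin over labeled points only
    y_u = clf.predict(X_u)                 # initial labels for unlabeled points
    for _ in range(n_iter):
        X_all = np.vstack([X_l, X_u])
        y_all = np.concatenate([y_l, y_u])
        clf = LinearSVC().fit(X_all, y_all)
        y_new = clf.predict(X_u)           # flip labels the new separator disagrees with
        if np.array_equal(y_new, y_u):     # stop once the assignments stabilize
            break
        y_u = y_new
    return clf, y_u
```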
Deep Semi-supervised Learning
Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling
Stochastic Perturbations / Π-Model

Realistic perturbations $x \to \tilde{x}$ of data points $x \in D_{UL}$ should not significantly change the output of $h_\theta(x)$. A sketch of the resulting consistency loss follows.
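A minimal PyTorch-style sketch of a Π-model-style loss; model, augment, and w_cons are placeholder names I am introducing, and implementations differ in details such as whether both perturbed branches receive gradients:

```python
import torch
import torch.nn.functional as F

def pi_model_loss(model, augment, x_l, y_l, x_u, w_cons=1.0):
    """Supervised loss on labeled data plus a consistency term that
    penalizes disagreement between two stochastic perturbations of the
    same unlabeled inputs."""
    sup = F.cross_entropy(model(augment(x_l)), y_l)
    p1 = F.softmax(model(augment(x_u)), dim=1)   # first perturbed pass
    p2 = F.softmax(model(augment(x_u)), dim=1)   # second perturbed pass
    cons = F.mse_loss(p1, p2.detach())           # treat one branch as the target
    return sup + w_cons * cons
```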
Temporal Ensembling [Laine and Aila '17]: maintain an exponential moving average of each example's past predictions and use it as the consistency target.
Mean Teacher [Tarvainen and Valpola '17]: instead of averaging predictions, maintain an exponential moving average of the model weights (the "teacher") and penalize disagreement between student and teacher predictions. An EMA sketch follows.
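A minimal sketch of the Mean Teacher weight update, assuming the teacher network was created as a deep copy of the student; names are mine:

```python
import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module,
                   ema_decay: float = 0.99):
    """Teacher weights track an exponential moving average of the student
    weights; call this after every optimizer step on the student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(ema_decay).add_(p_s, alpha=1.0 - ema_decay)
```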
Virtual Adversarial Training [Miyato et al. '18]: find the small perturbation of each input that most changes the model's prediction (measured by KL divergence) and train the prediction to be robust to it.
Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling
Entropy Minimization (EntMin) [Grandvalet and Bengio '05]: add the average prediction entropy on unlabeled data to the training loss, encouraging more confident predictions on unlabeled data. A sketch is below.
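A minimal sketch of the entropy penalty on an unlabeled batch (logits_u is a placeholder name):

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits_u: torch.Tensor) -> torch.Tensor:
    """Average prediction entropy over an unlabeled batch. Adding this
    (weighted) term to the training loss pushes the model toward
    confident, low-entropy predictions on unlabeled data."""
    log_p = F.log_softmax(logits_u, dim=1)
    p = log_p.exp()
    return -(p * log_p).sum(dim=1).mean()
```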
Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling
Comparison
Varying number of labels
Class mismatch between the labeled and unlabeled datasets hurts performance.
Lessons

- Compare methods with a standardized architecture and an equal budget for tuning hyperparameters.
- Unlabeled data from a different class distribution is not that useful.
- Most methods don't work well in the very low labeled-data regime.
- Transferring a model pre-trained on ImageNet produces a lower error rate.
- These conclusions are based on small datasets, though.