Modeling Latent Variable Uncertainty for Loss-based Learning
Daphne Koller (Stanford University), Ben Packer (Stanford University), M. Pawan Kumar (École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France)

Aim: Accurate learning with weakly supervised data.
Train: inputs x_i with outputs y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino).
Object Detection: input x, output y = "Deer", latent variable h.

Aim: Accurate learning with weakly supervised data.
Feature Ψ(x,y,h) (e.g. HOG). Input x, output y = "Deer", latent variable h.
Prediction function f : Ψ(x,y,h) → (-∞, +∞).
Prediction: (y(f), h(f)) = argmax_{y,h} f(Ψ(x,y,h)).

Aim: Accurate learning with weakly supervised data.
Feature Ψ(x,y,h) (e.g. HOG). Function f : Ψ(x,y,h) → (-∞, +∞).
Learning: f* = argmin_f Objective(f).

Aim: Find a suitable objective function to learn f*.
Feature Ψ(x,y,h) (e.g. HOG). Function f : Ψ(x,y,h) → (-∞, +∞).
Learning: f* = argmin_f Objective(f), where the objective encourages accurate prediction according to a user-specified criterion for accuracy.
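As an illustration of this prediction rule (not from the slides), here is a minimal Python sketch. It assumes a linear score f(Ψ) = wᵀΨ, a hypothetical joint feature function `joint_feature(x, y, h)` returning a NumPy vector, and small enumerable output and latent spaces:

```python
import numpy as np

def predict(w, x, labels, latent_vals, joint_feature):
    """Return the (y, h) pair maximizing the linear score w . Psi(x, y, h).

    `joint_feature(x, y, h)` is a hypothetical stand-in for the joint
    feature map Psi (e.g. HOG features extracted at latent location h).
    """
    best, best_score = None, -np.inf
    for y in labels:
        for h in latent_vals:
            score = w @ joint_feature(x, y, h)
            if score > best_score:
                best, best_score = (y, h), score
    return best
```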

Outline: Previous Methods, Our Framework, Optimization, Results, Ongoing and Future Work.

Latent SVM
Linear function parameterized by w.
Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)
Learning: min_w Σ_i Δ(y_i, y_i(w), h_i(w)), where Δ is a user-defined loss.
✔ Loss-based learning
✖ Loss is independent of the true (unknown) latent variable
✖ Does not model uncertainty in latent variables
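A sketch of how the latent SVM training criterion on this slide could be evaluated for a candidate w, reusing the hypothetical `predict` and `joint_feature` helpers from the sketch above; `delta` is a placeholder for the user-defined loss Δ(y_i, y_i(w), h_i(w)):

```python
def latent_svm_risk(w, data, labels, latent_vals, joint_feature, delta):
    """Empirical risk sum_i Delta(y_i, y_i(w), h_i(w)) for a latent SVM.

    Note: the loss depends only on the *predicted* latent variable h_i(w);
    the true latent variable never appears (one of the criticisms above).
    """
    risk = 0.0
    for x_i, y_i in data:
        y_pred, h_pred = predict(w, x_i, labels, latent_vals, joint_feature)
        risk += delta(y_i, y_pred, h_pred)
    return risk
```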

Expectation Maximization
Joint probability: P_θ(y,h|x) = exp(θ^T Ψ(x,y,h)) / Z
Prediction: (y(θ), h(θ)) = argmax_{y,h} P_θ(y,h|x)

Expectation Maximization
Joint probability: P_θ(y,h|x) = exp(θ^T Ψ(x,y,h)) / Z
Prediction: (y(θ), h(θ)) = argmax_{y,h} θ^T Ψ(x,y,h)
Learning: max_θ Σ_i log P_θ(y_i | x_i)

Expectation Maximization
Joint probability: P_θ(y,h|x) = exp(θ^T Ψ(x,y,h)) / Z
Prediction: (y(θ), h(θ)) = argmax_{y,h} θ^T Ψ(x,y,h)
Learning: max_θ Σ_i log Σ_{h_i} P_θ(y_i, h_i | x_i)
✔ Models uncertainty in latent variables
✖ Does not model accuracy of the latent variable prediction
✖ No user-defined loss function
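A sketch of the EM learning objective on this slide (the marginal conditional log-likelihood), again under the assumption of small enumerable spaces so the partition function Z(x) can be computed by brute force; `joint_feature` is the same hypothetical helper as above:

```python
from scipy.special import logsumexp

def log_likelihood(theta, data, labels, latent_vals, joint_feature):
    """sum_i log sum_h P_theta(y_i, h | x_i), where
    P_theta(y, h | x) = exp(theta . Psi(x, y, h)) / Z(x)."""
    total = 0.0
    for x_i, y_i in data:
        all_scores = [theta @ joint_feature(x_i, y, h)
                      for y in labels for h in latent_vals]
        y_scores = [theta @ joint_feature(x_i, y_i, h) for h in latent_vals]
        # log sum_h exp(score(y_i, h)) - log Z(x_i)
        total += logsumexp(y_scores) - logsumexp(all_scores)
    return total
```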

Outline: Previous Methods, Our Framework, Optimization, Results, Ongoing and Future Work.

Problem
Model uncertainty in latent variables.
Model accuracy of latent variable predictions.

Solution
Model uncertainty in latent variables. Model accuracy of latent variable predictions.
Use two different distributions for the two different tasks.

Solution
Model accuracy of latent variable predictions.
Use two different distributions for the two different tasks.
[Figure: the conditional distribution P_θ(h_i | y_i, x_i) over the latent variable h_i.]

Solution
Use two different distributions for the two different tasks.
[Figure: P_θ(h_i | y_i, x_i), a distribution over the latent variable h_i, and P_w(y_i, h_i | x_i), a distribution over (y_i, h_i) with its mass at the prediction (y_i(w), h_i(w)).]
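To make the two distributions concrete, here is a hedged sketch using the hypothetical helpers above: P_θ(h | y_i, x_i) is a softmax over the latent space, while P_w(y, h | x) concentrates on the single prediction (y(w), h(w)):

```python
import numpy as np

def conditional_latent_dist(theta, x, y, latent_vals, joint_feature):
    """P_theta(h | y, x): a softmax over the latent space."""
    scores = np.array([theta @ joint_feature(x, y, h) for h in latent_vals])
    scores -= scores.max()            # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def delta_prediction_dist(w, x, labels, latent_vals, joint_feature):
    """P_w(y, h | x): a point mass; return its single support point (y(w), h(w))."""
    return predict(w, x, labels, latent_vals, joint_feature)
```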

The Ideal Case: no latent variable uncertainty, correct prediction.
[Figure: P_θ(h_i | y_i, x_i) places all its mass on a single latent value h_i(w), and P_w(y_i, h_i | x_i) places all its mass on (y_i, h_i(w)), i.e. the predicted output matches the true y_i.]

In Practice: restrictions in the representation power of the models.
[Figure: P_θ(h_i | y_i, x_i) over h_i, and P_w(y_i, h_i | x_i) with its mass at the prediction (y_i(w), h_i(w)).]

Our Framework: minimize the dissimilarity between the two distributions, using a user-defined dissimilarity measure.
[Figure: P_θ(h_i | y_i, x_i) over h_i, and P_w(y_i, h_i | x_i) with its mass at the prediction (y_i(w), h_i(w)).]

Our Framework: minimize Rao's Dissimilarity Coefficient between P_w and P_θ:

min_{w,θ} Σ_i H_i(w,θ) - β H_i(θ,θ) - (1-β) H_i(w,w)

where
H_i(w,θ) = Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i)
H_i(θ,θ) = Σ_{h,h'} Δ(y_i, h, y_i, h') P_θ(h | y_i, x_i) P_θ(h' | y_i, x_i)
H_i(w,w) = Δ(y_i(w), h_i(w), y_i(w), h_i(w)) = 0 (a prediction incurs no loss against itself)

Since the last term vanishes, the learning problem is min_{w,θ} Σ_i H_i(w,θ) - β H_i(θ,θ).
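A sketch of how the per-example term H_i(w,θ) - β H_i(θ,θ) could be evaluated for a small enumerable latent space. This is an illustration under the assumptions above (the hypothetical `predict` and `conditional_latent_dist` helpers, and a four-argument user loss `delta(y, h, y', h')`), not the authors' code:

```python
def example_objective(w, theta, x_i, y_i, labels, latent_vals,
                      joint_feature, delta, beta):
    """Per-example term H_i(w, theta) - beta * H_i(theta, theta)."""
    y_pred, h_pred = predict(w, x_i, labels, latent_vals, joint_feature)
    p = conditional_latent_dist(theta, x_i, y_i, latent_vals, joint_feature)

    # H_i(w, theta): expected loss of the prediction under P_theta
    H_w_theta = sum(p[j] * delta(y_i, h, y_pred, h_pred)
                    for j, h in enumerate(latent_vals))

    # H_i(theta, theta): expected loss between two independent draws from P_theta
    H_theta_theta = sum(p[j] * p[k] * delta(y_i, h, y_i, h2)
                        for j, h in enumerate(latent_vals)
                        for k, h2 in enumerate(latent_vals))

    return H_w_theta - beta * H_theta_theta
```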

Outline: Previous Methods, Our Framework, Optimization, Results, Ongoing and Future Work.

Optimization
min_{w,θ} Σ_i H_i(w,θ) - β H_i(θ,θ)
Initialize the parameters to w_0 and θ_0.
Repeat until convergence:
  Fix w and optimize θ.
  Fix θ and optimize w.
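A sketch of this alternating (block-coordinate) scheme; `optimize_theta` and `optimize_w` are hypothetical sub-solvers standing in for the two steps described on the following slides:

```python
def alternating_minimization(w0, theta0, optimize_theta, optimize_w,
                             max_iters=50, tol=1e-4, objective=None):
    """Block-coordinate descent on sum_i H_i(w, theta) - beta * H_i(theta, theta)."""
    w, theta = w0, theta0
    prev = float("inf")
    for _ in range(max_iters):
        theta = optimize_theta(w, theta)   # fix w, optimize theta
        w = optimize_w(w, theta)           # fix theta, optimize w
        if objective is not None:
            cur = objective(w, theta)
            if prev - cur < tol:           # stop once the improvement is tiny
                break
            prev = cur
    return w, theta
```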

Optimization of θ
min_θ Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i) - β H_i(θ,θ)
Case I: y_i(w) = y_i
[Figure: P_θ(h_i | y_i, x_i) over h_i, with the predicted latent variable h_i(w) marked.]

Optimization of θ
min_θ Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i) - β H_i(θ,θ)
Case II: y_i(w) ≠ y_i
[Figure: P_θ(h_i | y_i, x_i) over h_i.]
Optimized using stochastic subgradient descent.
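The slides state that the θ-step is solved with stochastic subgradient descent. As a rough illustration only (not the paper's exact update), one gradient step on the per-example θ-objective can be written using the softmax identity ∇_θ P_θ(h) = P_θ(h)(Ψ(x,y,h) - E_{P_θ}[Ψ]), again assuming a small enumerable latent space and the hypothetical helpers above:

```python
import numpy as np

def theta_gradient_step(theta, w, x_i, y_i, labels, latent_vals,
                        joint_feature, delta, beta, lr=0.1):
    """One descent step on
    sum_h Delta(y_i, h, y_i(w), h_i(w)) P_theta(h|y_i,x_i) - beta * H_i(theta, theta)."""
    y_pred, h_pred = predict(w, x_i, labels, latent_vals, joint_feature)
    feats = np.array([joint_feature(x_i, y_i, h) for h in latent_vals])  # |H| x D
    p = conditional_latent_dist(theta, x_i, y_i, latent_vals, joint_feature)
    mean_feat = p @ feats
    # d P_theta(h) / d theta = P_theta(h) * (Psi(h) - E_P[Psi])
    dp = p[:, None] * (feats - mean_feat)                                # |H| x D
    L1 = np.array([delta(y_i, h, y_pred, h_pred) for h in latent_vals])
    L2 = np.array([[delta(y_i, h, y_i, h2) for h2 in latent_vals]
                   for h in latent_vals])
    grad = L1 @ dp                                   # gradient of the expected loss term
    grad -= beta * (((L2 + L2.T) @ p) @ dp)          # gradient of -beta * H_i(theta, theta)
    return theta - lr * grad
```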

Optimization of w
min_w Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i)
This is an expected loss, which models the uncertainty. The form of the optimization is similar to Latent SVM and is solved with the Concave-Convex Procedure (CCCP).
Observation: when Δ is independent of the true h, our framework is equivalent to Latent SVM.
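Not the CCCP solver itself, but a sketch of why the w-step has the same form as Latent SVM: with P_θ fixed, the objective is simply the latent-SVM risk under an effective loss obtained by averaging Δ over P_θ (hypothetical helpers as above):

```python
def expected_loss_risk(w, theta, data, labels, latent_vals,
                       joint_feature, delta):
    """sum_i sum_h Delta(y_i, h, y_i(w), h_i(w)) P_theta(h | y_i, x_i).

    With P_theta fixed, this equals the latent-SVM risk under the effective
    loss Delta~(y_i, y, h') = E_{h ~ P_theta}[Delta(y_i, h, y, h')], which is
    why a latent-SVM-style (CCCP) solver applies.
    """
    risk = 0.0
    for x_i, y_i in data:
        y_pred, h_pred = predict(w, x_i, labels, latent_vals, joint_feature)
        p = conditional_latent_dist(theta, x_i, y_i, latent_vals, joint_feature)
        risk += sum(p[j] * delta(y_i, h, y_pred, h_pred)
                    for j, h in enumerate(latent_vals))
    return risk
```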

Outline: Previous Methods, Our Framework, Optimization, Results, Ongoing and Future Work.

Object Detection
Mammals dataset (Bison, Deer, Elephant, Giraffe, Llama, Rhino); 60/40 train/test split, 5 folds.
Train: inputs x_i with outputs y_i.
Input x, output y = "Deer", latent variable h.

Results – 0/1 Loss: improvement is statistically significant.

Results – Overlap Loss

Action Detection
PASCAL VOC dataset (actions: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking); 60/40 train/test split, 5 folds.
Train: inputs x_i with outputs y_i.
Input x, output y = "Using Computer", latent variable h.

Results – 0/1 Loss: improvement is statistically significant.

Results – Overlap Loss: improvement is statistically significant.

Outline: Previous Methods, Our Framework, Optimization, Results, Ongoing and Future Work.

Slides Deleted !!!