Modeling Latent Variable Uncertainty for Loss-based Learning
Daphne Koller, Stanford University
Ben Packer, Stanford University
M. Pawan Kumar, École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France

Aim: accurate learning with weakly supervised data.
Running example, object detection: input x is an image, output y = "Deer", latent variable h (not annotated in the training data).
Training set: inputs x_i with outputs y_i drawn from {Bison, Deer, Elephant, Giraffe, Llama, Rhino}.

Aim: accurate learning with weakly supervised data.
Input x, output y = "Deer", latent variable h, joint feature vector Ψ(x,y,h) (e.g. HOG).
Prediction function f : Ψ(x,y,h) → (-∞, +∞).
Prediction: (y(f), h(f)) = argmax_{y,h} f(Ψ(x,y,h)).
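To make the prediction rule concrete, the following minimal Python sketch computes (y(f), h(f)) by exhaustive enumeration; it assumes a small finite set of outputs and latent values, and `f`, `psi`, `labels`, and `latent_values` are hypothetical stand-ins rather than anything from the original slides.

```python
import itertools

def predict(x, f, psi, labels, latent_values):
    """Return (y(f), h(f)) = argmax over (y, h) of f(psi(x, y, h)).
    Sketch only: assumes labels and latent_values can be enumerated exhaustively."""
    candidates = itertools.product(labels, latent_values)
    return max(candidates, key=lambda yh: f(psi(x, *yh)))
```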

Aim: accurate learning with weakly supervised data.
Learning: find the best prediction function, f* = argmin_f Objective(f).

Aim: find a suitable objective function for learning, f* = argmin_f Objective(f). The objective should encourage accurate prediction according to a user-specified criterion for accuracy (a loss function).

Outline: Previous Methods, Our Framework, Optimization, Results

Latent SVM
Linear function parameterized by w.
Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)
Learning: min_w Σ_i Δ(y_i, y_i(w), h_i(w))
✔ Loss-based learning
✖ Loss function has a restricted form
✖ Doesn't model uncertainty in latent variables
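As a hedged illustration of the learning criterion, the sketch below evaluates Σ_i Δ(y_i, y_i(w), h_i(w)) for a candidate w; `psi`, `delta`, `labels`, and `latent_values` are hypothetical, w and the features are assumed to be NumPy arrays, and the actual Latent SVM minimizes a max-margin upper bound on this loss rather than the raw loss itself.

```python
import itertools

def latent_svm_training_loss(w, data, psi, delta, labels, latent_values):
    """sum_i Delta(y_i, y_i(w), h_i(w)), where (y_i(w), h_i(w)) maximizes w^T Psi."""
    total = 0.0
    for x_i, y_i in data:
        # (y_i(w), h_i(w)) = argmax_{y,h} w^T Psi(x_i, y, h)
        y_w, h_w = max(itertools.product(labels, latent_values),
                       key=lambda yh: w @ psi(x_i, *yh))
        total += delta(y_i, y_w, h_w)
    return total
```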

Expectation Maximization
Joint probability: P_θ(y,h|x) = exp(θ^T Ψ(x,y,h)) / Z
Prediction: (y(θ), h(θ)) = argmax_{y,h} θ^T Ψ(x,y,h)
Learning: max_θ Σ_i log Σ_{h_i} P_θ(y_i, h_i | x_i)
✔ Models uncertainty in latent variables
✖ Doesn't model accuracy of latent variable prediction
✖ No user-defined loss function
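A small sketch of the EM learning criterion under this log-linear model: the marginal log-likelihood of the observed outputs, with the latent variable summed out. The helper names (`psi`, `labels`, `latent_values`) are assumptions for illustration only.

```python
import numpy as np
from scipy.special import logsumexp

def em_objective(theta, data, psi, labels, latent_values):
    """sum_i log sum_h P_theta(y_i, h | x_i), with P_theta(y,h|x) proportional to exp(theta^T Psi)."""
    total = 0.0
    for x_i, y_i in data:
        # scores[j, k] = theta^T Psi(x_i, labels[j], latent_values[k])
        scores = np.array([[theta @ psi(x_i, y, h) for h in latent_values]
                           for y in labels])
        log_Z = logsumexp(scores)                              # normalizer over all (y, h)
        total += logsumexp(scores[labels.index(y_i)]) - log_Z  # labels assumed to be a list
    return total
```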

Problem: model the uncertainty in the latent variables and model the accuracy of the latent variable predictions.

Outline: Previous Methods, Our Framework, Optimization, Results

Solution: use two different distributions for the two different tasks, one to model uncertainty in the latent variables and one to model the accuracy of the latent variable predictions.

Solution (continued): the first distribution, P_θ(h_i | y_i, x_i), is a conditional distribution over the latent variable h_i; it models the uncertainty in the latent variables. [Figure: P_θ(h_i | y_i, x_i) plotted over h_i.]

Solution (continued): the second distribution, P_w(y_i, h_i | x_i), is a delta distribution concentrated on the prediction (y_i(w), h_i(w)); it models the accuracy of the prediction. [Figure: P_θ(h_i | y_i, x_i) and the delta P_w(y_i, h_i | x_i) plotted over (y_i, h_i).]
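The following sketch makes the two distributions concrete for a finite latent space: P_θ(h|y,x) is taken to be a softmax of θ^T Ψ(x,y,h) over latent values, consistent with the log-linear model on the EM slide, and P_w is represented by its single support point, the maximizer of w^T Ψ. All helper names are hypothetical.

```python
import itertools
import numpy as np

def p_theta_h_given_yx(theta, x, y, psi, latent_values):
    """P_theta(h | y, x): softmax over latent values of theta^T Psi(x, y, h).
    Models the uncertainty in the latent variable."""
    scores = np.array([theta @ psi(x, y, h) for h in latent_values])
    p = np.exp(scores - scores.max())
    return p / p.sum()

def p_w_support(w, x, psi, labels, latent_values):
    """Support point of the delta distribution P_w(y, h | x): the prediction
    (y(w), h(w)) = argmax_{y,h} w^T Psi(x, y, h). Models prediction accuracy."""
    return max(itertools.product(labels, latent_values),
               key=lambda yh: w @ psi(x, *yh))
```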

The Ideal Case: no latent variable uncertainty and a correct prediction. P_θ(h_i | y_i, x_i) places all of its mass on a single latent value h_i(w), and the delta distribution P_w(y_i, h_i | x_i) sits at (y_i, h_i(w)), i.e. the predicted output matches the ground-truth y_i.

In Practice: restrictions in the representation power of the models mean that P_θ(h_i | y_i, x_i) retains uncertainty over h_i and the prediction (y_i(w), h_i(w)) need not match the ground truth.

Our Framework: minimize the dissimilarity between the two distributions, where the dissimilarity measure is user-defined.

Our Framework Minimize Rao’s Dissimilarity Coefficient hihi Pw(yi,hi|xi)Pw(yi,hi|xi) (yi,hi)(yi,hi) (y i (w),h i (w)) Pθ(hi|yi,xi)Pθ(hi|yi,xi) Σ h Δ(y i,h,y i (w),h i (w))P θ (h|y i,x i )

Our Framework Minimize Rao’s Dissimilarity Coefficient hihi Pw(yi,hi|xi)Pw(yi,hi|xi) (yi,hi)(yi,hi) (y i (w),h i (w)) Pθ(hi|yi,xi)Pθ(hi|yi,xi) - β Σ h,h’ Δ(y i,h,y i,h’)P θ (h|y i,x i )P θ (h’|y i,x i ) Hi(w,θ)Hi(w,θ)

Our Framework Minimize Rao’s Dissimilarity Coefficient hihi Pw(yi,hi|xi)Pw(yi,hi|xi) (yi,hi)(yi,hi) (y i (w),h i (w)) Pθ(hi|yi,xi)Pθ(hi|yi,xi) - (1-β) Δ(y i (w),h i (w),y i (w),h i (w)) - β H i (θ,θ)Hi(w,θ)Hi(w,θ)

Our Framework Minimize Rao’s Dissimilarity Coefficient hihi Pw(yi,hi|xi)Pw(yi,hi|xi) (yi,hi)(yi,hi) (y i (w),h i (w)) Pθ(hi|yi,xi)Pθ(hi|yi,xi) - β H i (θ,θ)Hi(w,θ)Hi(w,θ) min w,θ ΣiΣi

Outline: Previous Methods, Our Framework, Optimization, Results

Optimization: min_{w,θ} Σ_i H_i(w,θ) - β H_i(θ,θ)
Initialize the parameters to w_0 and θ_0.
Repeat until convergence:
  Fix w and optimize θ.
  Fix θ and optimize w.
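A skeleton of this block-coordinate scheme is sketched below; `init_w`, `init_theta`, `optimize_theta`, and `optimize_w` are hypothetical placeholders for the initializers and the two subproblem solvers described on the following slides, and a fixed iteration count stands in for the convergence test.

```python
def learn(data, psi, delta4, labels, latent_values, beta,
          init_w, init_theta, optimize_theta, optimize_w, n_iters=20):
    """Block-coordinate minimization of sum_i H_i(w, theta) - beta * H_i(theta, theta)."""
    w, theta = init_w(), init_theta()       # w_0 and theta_0
    for _ in range(n_iters):                # stands in for "repeat until convergence"
        theta = optimize_theta(w, theta, data, psi, delta4, labels, latent_values, beta)
        w = optimize_w(w, theta, data, psi, delta4, labels, latent_values, beta)
    return w, theta
```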

Optimization of θ: min_θ Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i) - β H_i(θ,θ)
Case I: y_i(w) = y_i. [Figure: P_θ(h_i | y_i, x_i) plotted over h_i, with the predicted latent value h_i(w) marked.]

Optimization of θ (continued): min_θ Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i) - β H_i(θ,θ)
Case II: y_i(w) ≠ y_i. [Figure: P_θ(h_i | y_i, x_i) plotted over h_i.]

In both cases, the θ subproblem is solved using stochastic subgradient descent.
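For a finite latent space, the per-example θ objective is differentiable and one (sub)gradient step can be written as below. This is a sketch under the softmax form of P_θ assumed earlier, and it ignores the case analysis on y_i(w) from the previous slides; the array names are assumptions: `psi_h` stacks Ψ(x_i, y_i, h) for every latent value, `loss_to_pred[k] = Δ(y_i, h_k, y_i(w), h_i(w))`, and `loss_pair[k, l] = Δ(y_i, h_k, y_i, h_l)`.

```python
import numpy as np

def theta_subgradient_step(theta, psi_h, loss_to_pred, loss_pair, beta, lr):
    """One gradient step on H_i(w, theta) - beta * H_i(theta, theta) w.r.t. theta.
      psi_h:         (K, D) features Psi(x_i, y_i, h) for the K latent values
      loss_to_pred:  (K,)   Delta(y_i, h, y_i(w), h_i(w)) for each h
      loss_pair:     (K, K) Delta(y_i, h, y_i, h') for each pair (h, h')"""
    scores = psi_h @ theta
    p = np.exp(scores - scores.max())
    p /= p.sum()                                 # P_theta(h | y_i, x_i)
    psi_bar = p @ psi_h                          # expected feature under P_theta
    dp = p[:, None] * (psi_h - psi_bar)          # dP_theta(h)/dtheta, shape (K, D)
    grad_expected = loss_to_pred @ dp            # gradient of H_i(w, theta)
    grad_self = (loss_pair @ p) @ dp + (loss_pair.T @ p) @ dp  # gradient of H_i(theta, theta)
    return theta - lr * (grad_expected - beta * grad_self)
```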

Optimization of w: min_w Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i)
This is an expected loss, so it models the uncertainty in the latent variables. The form of the optimization is similar to Latent SVM, and it is solved with the Concave-Convex Procedure (CCCP).
Observation: when Δ is independent of the true h, our framework is equivalent to Latent SVM.

Outline: Previous Methods, Our Framework, Optimization, Results

Object Detection (Mammals dataset)
Input x: image; output y = "Deer"; latent variable h.
Training data: inputs x_i with outputs y_i from {Bison, Deer, Elephant, Giraffe, Llama, Rhino}.
Setup: 60/40 train/test split, 5 folds.

Results – 0/1 Loss. [Results figure omitted.] The improvement is statistically significant.

Results – Overlap Loss. [Results figure omitted.]

Action Detection (PASCAL VOC)
Input x: image; output y = "Using Computer"; latent variable h.
Classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking.
Training data: inputs x_i with outputs y_i.
Setup: 60/40 train/test split, 5 folds.

Results – 0/1 Loss. [Results figure omitted.] The improvement is statistically significant.

Results – Overlap Loss. [Results figure omitted.] The improvement is statistically significant.

Conclusions
Two separate distributions: a conditional probability over the latent variables, and a delta distribution for prediction.
The framework generalizes Latent SVM.
Future work: large-scale efficient optimization; a distribution over w; new applications.

Code available at