Loss-based Learning with Weak Supervision M. Pawan Kumar
About the Talk Methods built on the latent structured SVM A little math-heavy Work still in its initial stages
Outline Latent SSVM Ranking Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI [Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009]
Weakly Supervised Data Input x Output y ∈ {-1,+1} Hidden h [figure: an image x labelled y = +1 with a latent bounding box h]
Weakly Supervised Classification Feature Φ(x,h) Joint feature vector Ψ(x,y,h)
Weakly Supervised Classification Feature Φ(x,h) Joint feature vector Ψ(x,+1,h) = [Φ(x,h); 0]
Weakly Supervised Classification Feature Φ(x,h) Joint feature vector Ψ(x,-1,h) = [0; Φ(x,h)]
Weakly Supervised Classification Feature Φ(x,h) Joint feature vector Ψ(x,y,h) Score f: Ψ(x,y,h) → (-∞, +∞) Optimize the score over all possible y and h
Latent SSVM Scoring function w^T Ψ(x,y,h) Prediction (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)
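A minimal sketch of this prediction rule, assuming a user-supplied joint feature map `psi` and finite candidate sets `labels` and `hidden` (all names here are hypothetical):

```python
import numpy as np

def predict(w, psi, x, labels, hidden):
    """Latent SSVM prediction: maximize the score w^T Psi(x, y, h)
    jointly over the label y and the hidden variable h by enumeration."""
    best_score, best_y, best_h = -np.inf, None, None
    for y in labels:        # e.g. {-1, +1}
        for h in hidden:    # e.g. candidate bounding boxes
            score = w @ psi(x, y, h)
            if score > best_score:
                best_score, best_y, best_h = score, y, h
    return best_y, best_h   # (y(w), h(w))
```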
Learning Latent SSVM Training data {(x_i, y_i), i = 1, 2, …, n} Minimize the empirical risk specified by the loss function: w* = argmin_w Σ_i Δ(y_i, y_i(w)) Highly non-convex in w; cannot regularize w to prevent overfitting
Learning Latent SSVM Training data {(x_i, y_i), i = 1, 2, …, n}
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w))
≤ Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i)
≤ max_{y,h} {Δ(y_i, y) + w^T Ψ(x_i, y, h)} - max_{h_i} w^T Ψ(x_i, y_i, h_i)
Learning Latent SSVM Training data {(x_i, y_i), i = 1, 2, …, n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i, for all y, h
Difference-of-convex program in w; CCCP gives a local minimum or saddle point solution
CCCP Start with an initial estimate of w
Impute hidden variables (loss independent): h_i* = argmax_h w^T Ψ(x_i, y_i, h)
Update w (loss dependent): min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - w^T Ψ(x_i, y_i, h_i*) ≤ ξ_i, for all y, h
Repeat until convergence
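A sketch of this alternating scheme; `impute` and `fit_ssvm` are hypothetical callbacks standing in for the two steps above:

```python
import numpy as np

def cccp(data, w0, impute, fit_ssvm, tol=1e-4, max_iter=50):
    """CCCP outer loop for latent SSVM learning: alternate between
    imputing the hidden variables (loss independent) and updating w
    by solving the resulting convex SSVM problem (loss dependent)."""
    w = w0
    for _ in range(max_iter):
        # h_i* = argmax_h w^T Psi(x_i, y_i, h)
        h_star = [impute(w, x, y) for x, y in data]
        # min_w ||w||^2 + C sum_i xi_i, with h_i fixed to h_i*
        w_new = fit_ssvm(data, h_star)
        if np.linalg.norm(w_new - w) < tol:  # simple convergence test
            break
        w = w_new
    return w
```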
Recap Scoring function w^T Ψ(x,y,h) Prediction (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h) Learning min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i, for all y, h
Outline Latent SSVM Ranking Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI [Joint work with Aseem Behl and C. V. Jawahar]
Ranking [figure: six retrieved images at ranks 1-6] Average Precision = 1
Ranking [figure: three rankings of the same six images] Average Precision = 1, Accuracy = 1; Average Precision = 0.92, Accuracy = 0.67; Average Precision = 0.81
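AP is the mean, over the positive samples, of the precision at each positive's rank. A small sketch reproducing the slide's values (the exact orderings below are assumptions consistent with those numbers):

```python
import numpy as np

def average_precision(ranked_labels):
    """AP = mean over positives of the precision at that positive's rank."""
    labels = np.asarray(ranked_labels)
    hits = np.cumsum(labels)                 # positives seen so far
    ranks = np.arange(1, len(labels) + 1)
    return (hits / ranks)[labels == 1].mean()

# Rankings of 3 positives (1) and 3 negatives (0):
print(average_precision([1, 1, 1, 0, 0, 0]))  # 1.0
print(average_precision([1, 1, 0, 1, 0, 0]))  # 0.9167 (slide's 0.92)
print(average_precision([1, 0, 1, 1, 0, 0]))  # 0.8056 (slide's 0.81)
```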
Ranking During testing, AP is frequently used During training, a surrogate loss is used This contradicts loss-based learning: optimize AP directly
Outline Latent SSVM Ranking – Supervised Learning – Weakly Supervised Learning – Latent AP-SVM – Experiments Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI [Yue, Finley, Radlinski and Joachims, 2007]
Supervised Learning - Input Training images X Bounding boxes H = {H_P, H_N} for the positive set P and the negative set N
Supervised Learning - Output Ranking matrix Y Y_ik = +1 if i is better ranked than k; -1 if k is better ranked than i; 0 if i and k are ranked equally Optimal ranking Y*
SSVM Formulation Joint feature vector Ψ(X,Y,{H_P,H_N}) = (1 / (|P||N|)) Σ_{i∈P} Σ_{k∈N} Y_ik (Φ(x_i,h_i) - Φ(x_k,h_k)) Scoring function w^T Ψ(X,Y,{H_P,H_N})
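A sketch of this joint feature map in numpy, with the per-sample features stacked into matrices (the names are mine):

```python
import numpy as np

def joint_feature(Y, Phi_P, Phi_N):
    """Psi(X, Y, {H_P, H_N}): the average of Y_ik * (Phi(x_i,h_i) - Phi(x_k,h_k))
    over all positive-negative pairs (i, k).
    Y: (|P|, |N|) matrix with entries in {-1, 0, +1};
    Phi_P: (|P|, d) positive features; Phi_N: (|N|, d) negative features."""
    P, N = Phi_P.shape[0], Phi_N.shape[0]
    diff = Phi_P[:, None, :] - Phi_N[None, :, :]      # (|P|, |N|, d) pairwise
    return (Y[:, :, None] * diff).sum(axis=(0, 1)) / (P * N)
```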
Prediction using SSVM Y(w) = argmax_Y w^T Ψ(X,Y,{H_P,H_N}) Sort by the value of the sample score w^T Φ(x_i,h_i) Same as a standard binary SVM
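Since the maximizing ranking just sorts samples by score, prediction reduces to a few lines (a sketch; breaking ties in favour of positives is my assumption):

```python
import numpy as np

def predict_ranking(w, Phi_P, Phi_N):
    """SSVM prediction: Y(w) sorts samples by w^T Phi(x_i, h_i), exactly as a
    binary SVM would. Returns the (|P|, |N|) ranking matrix: Y_ik = +1 iff
    positive i scores above negative k."""
    s_P = Phi_P @ w                  # positive sample scores
    s_N = Phi_N @ w                  # negative sample scores
    return np.where(s_P[:, None] >= s_N[None, :], 1, -1)
```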
Learning SSVM min_w Δ(Y*, Y(w)) Loss = 1 - AP of the prediction
Learning SSVM Δ(Y*, Y(w)) = Δ(Y*, Y(w)) + w^T Ψ(X, Y(w), {H_P,H_N}) - w^T Ψ(X, Y(w), {H_P,H_N})
Learning SSVM Δ(Y*, Y(w)) ≤ Δ(Y*, Y(w)) + w^T Ψ(X, Y(w), {H_P,H_N}) - w^T Ψ(X, Y*, {H_P,H_N})
Learning SSVM Δ(Y*, Y(w)) ≤ max_Y {Δ(Y*, Y) + w^T Ψ(X, Y, {H_P,H_N})} - w^T Ψ(X, Y*, {H_P,H_N}) ≤ ξ min_w ||w||^2 + C ξ
Learning SSVM The maximization max_Y {Δ(Y*, Y) + w^T Ψ(X, Y, {H_P,H_N})} is the loss augmented inference problem
Loss Augmented Inference [figure: positives at ranks 1-3] Rank the positives according to their sample scores
Loss Augmented Inference [figure: positives and negatives at ranks 1-6] Rank the negatives according to their sample scores
Loss Augmented Inference [figure: positives and negatives at ranks 1-6] Slide the best negative to a higher rank; continue until the score stops increasing. Slide the next negative to a higher rank; continue until the score stops increasing. Terminate after considering the last negative. This greedy procedure is optimal loss augmented inference.
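A runnable sketch of this greedy procedure. `sp` and `sn` hold the sample scores w^T Φ(x_i, h_i) of the positives and negatives; the interleaving is encoded by `above[j]`, the number of positives ranked above negative j (the encoding and helper names are mine):

```python
import numpy as np

def objective(above, sp, sn):
    """Loss-augmented objective Delta(Y*, Y) + w^T Psi = (1 - AP) + pairwise
    score term, for the interleaving encoded by `above`."""
    P, N = len(sp), len(sn)
    pos = np.arange(1, P + 1)
    m = np.array([(above < i).sum() for i in pos])   # negatives above positive i
    ap = np.mean(pos / (pos + m))
    sign = np.where(above[None, :] >= pos[:, None], 1.0, -1.0)
    feat = (sign * (sp[:, None] - sn[None, :])).sum() / (P * N)
    return (1.0 - ap) + feat

def loss_augmented_inference(sp, sn):
    """Greedy procedure from the slide: each negative, in descending score
    order, slides up past positives while the objective keeps increasing."""
    sp, sn = np.sort(sp)[::-1], np.sort(sn)[::-1]
    P, N = len(sp), len(sn)
    above = np.full(N, P)             # start: all negatives below all positives
    for j in range(N):
        floor = above[j - 1] if j > 0 else 0   # keep negatives in score order
        while above[j] > floor:
            trial = above.copy()
            trial[j] -= 1             # slide negative j up past one positive
            if objective(trial, sp, sn) > objective(above, sp, sn):
                above = trial
            else:
                break
    return above
```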
Recap Scoring function w^T Ψ(X,Y,{H_P,H_N}) Prediction Y(w) = argmax_Y w^T Ψ(X,Y,{H_P,H_N}) Learning using optimal loss augmented inference
Outline Latent SSVM Ranking – Supervised Learning – Weakly Supervised Learning – Latent AP-SVM – Experiments Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI
Weakly Supervised Learning - Input Training images X
Weakly Supervised Learning - Latent Training images X Bounding boxes H_P are latent; all bounding boxes in negative images are negative
Intuitive Prediction Procedure Select the best bounding box in every image
Intuitive Prediction Procedure [figure: best boxes at ranks 1-6] Rank them according to their sample scores
Weakly Supervised Learning - Output Ranking matrix Y Y_ik = +1 if i is better ranked than k; -1 if k is better ranked than i; 0 if i and k are ranked equally Optimal ranking Y*
Latent SSVM Formulation Joint feature vector Ψ(X,Y,{H_P,H_N}) = (1 / (|P||N|)) Σ_{i∈P} Σ_{k∈N} Y_ik (Φ(x_i,h_i) - Φ(x_k,h_k)) Scoring function w^T Ψ(X,Y,{H_P,H_N})
Prediction using Latent SSVM max_{Y,H} w^T Ψ(X,Y,{H_P,H_N})
Prediction using Latent SSVM max_{Y,H} w^T Σ_{i∈P} Σ_{k∈N} Y_ik (Φ(x_i,h_i) - Φ(x_k,h_k)) Chooses the best bounding box for positives and the worst bounding box for negatives Not what we wanted
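To see why, note that for the maximizing ranking (all Y_ik = +1) the inner maximization over H decouples per sample. A sketch (`phi` and `boxes` are hypothetical callables returning a feature vector and an image's candidate boxes):

```python
def joint_max_over_hidden(w, phi, positives, negatives, boxes):
    """With Y_ik = +1 for all pairs, each positive contributes
    +w^T Phi(x_i, h_i) (so its BEST box wins) while each negative
    contributes -w^T Phi(x_k, h_k) (so its WORST box wins)."""
    h_P = [max(boxes(x), key=lambda h: w @ phi(x, h)) for x in positives]
    h_N = [min(boxes(x), key=lambda h: w @ phi(x, h)) for x in negatives]
    return h_P, h_N
```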
Learning Latent SSVM min_w Δ(Y*, Y(w)) Loss = 1 - AP of the prediction
Learning Latent SSVM Δ(Y*,Y(w)) = Δ(Y*,Y(w)) + w^T Ψ(X,Y(w),{H_P(w),H_N(w)}) - w^T Ψ(X,Y(w),{H_P(w),H_N(w)})
Learning Latent SSVM Δ(Y*,Y(w)) ≤ Δ(Y*,Y(w)) + w^T Ψ(X,Y(w),{H_P(w),H_N(w)}) - max_H w^T Ψ(X,Y*,{H_P,H_N})
Learning Latent SSVM Δ(Y*,Y(w)) ≤ max_{Y,H} {Δ(Y*,Y) + w^T Ψ(X,Y,{H_P,H_N})} - max_H w^T Ψ(X,Y*,{H_P,H_N}) ≤ ξ min_w ||w||^2 + C ξ
Learning Latent SSVM The loss augmented inference max_{Y,H} {Δ(Y*,Y) + w^T Ψ(X,Y,{H_P,H_N})} cannot be solved optimally
Recap Unintuitive prediction Unintuitive objective function Non-optimal loss augmented inference Can we do better?
Outline Latent SSVM Ranking – Supervised Learning – Weakly Supervised Learning – Latent AP-SVM – Experiments Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI
Latent AP-SVM Formulation Joint feature vector Ψ(X,Y,{H_P,H_N}) = (1 / (|P||N|)) Σ_{i∈P} Σ_{k∈N} Y_ik (Φ(x_i,h_i) - Φ(x_k,h_k)) Scoring function w^T Ψ(X,Y,{H_P,H_N})
Prediction using Latent AP-SVM Choose the best bounding box for all samples: h_i(w) = argmax_h w^T Φ(x_i,h) Optimize over the ranking: Y(w) = argmax_Y w^T Ψ(X,Y,{H_P(w),H_N(w)}) Sort by sample scores
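A sketch of this two-step prediction (`phi` and `boxes` are again hypothetical):

```python
import numpy as np

def predict_latent_ap_svm(w, phi, images, boxes):
    """Latent AP-SVM prediction: pick the best box for EVERY image,
    h_i(w) = argmax_h w^T Phi(x_i, h), then rank images by that score."""
    best = np.array([max(w @ phi(x, h) for h in boxes(x)) for x in images])
    return np.argsort(-best)    # Y(w): sort by descending best-box score
```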
Learning Latent AP-SVM min_w Δ(Y*, Y(w)) Loss = 1 - AP of the prediction
Learning Latent AP-SVM Δ(Y*,Y(w)) = Δ(Y*,Y(w)) + w^T Ψ(X,Y(w),{H_P(w),H_N(w)}) - w^T Ψ(X,Y(w),{H_P(w),H_N(w)})
Learning Latent AP-SVM Δ(Y*,Y(w)) ≤ Δ(Y*,Y(w)) + w^T Ψ(X,Y(w),{H_P(w),H_N(w)}) - w^T Ψ(X,Y*,{H_P(w),H_N(w)})
Learning Latent AP-SVM Δ(Y*,Y(w)) ≤ max_{Y,H_N} {Δ(Y*,Y) + w^T Ψ(X,Y,{H_P(w),H_N}) - w^T Ψ(X,Y*,{H_P(w),H_N})}
Learning Latent AP-SVM min_{H_P} max_{Y,H_N} {Δ(Y*,Y) + w^T Ψ(X,Y,{H_P,H_N}) - w^T Ψ(X,Y*,{H_P,H_N})} ≤ ξ min_w ||w||^2 + C ξ H_P(w) minimizes the above upper bound
CCCP Start with an initial estimate of w Impute hidden variables Update w Repeat until convergence
Imputing Hidden Variables Choose the best bounding boxes according to sample score The above algorithm is optimal
Loss Augmented Inference Choose the best bounding boxes according to sample score
Loss Augmented Inference [figure: positives and negatives at ranks 1-6] Slide the best negative to a higher rank; continue until the score stops increasing. Slide the next negative to a higher rank; continue until the score stops increasing. Terminate after considering the last negative. This gives optimal loss augmented inference, as in the supervised case.
Recap Intuitive prediction Intuitive objective function Optimal loss augmented inference Performance in practice?
Outline Latent SSVM Ranking – Supervised Learning – Weakly Supervised Learning – Latent AP-SVM – Experiments Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI
Dataset VOC 2011 action classification 10 action classes + other 2424 ‘trainval’ images, 2424 ‘test’ images – Hidden annotations – Evaluated using a remote server – Only AP values are computed
Baselines Latent SSVM with 0/1 loss (latent SVM) – Relative loss weight C – Relative positive sample weight J – Robustness threshold K Latent SSVM with AP loss (latent SSVM) – Relative loss weight C – Approximate greedy inference algorithm 5 random initializations 5-fold cross-validation (80-20 split)
Cross-Validation Statistically significant improvement
Test
Outline Latent SSVM Ranking Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI [Joint work with Wojciech Zaremba, Alexander Gramfort and Matthew Blaschko; IPMI 2013]
M/EEG Data
M/EEG Data Faster activation (familiar with the task)
M/EEG Data Slower activation (bored with task)
Classifying M/EEG Data Statistically significant improvement
Functional Connectivity Visual cortex → deep subcortical source Visual cortex → higher-level cognitive processing Connected components have similar delays
Outline Latent SSVM Ranking Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI [Joint work with Pierre-Yves Baudin, Danny Goodman, Puneet Kumar, Nikos Paragios, Noura Azzabou and Pierre Carlier; MICCAI 2013]
Training Data Annotators provide ‘hard’ segmentation
Training Data Annotators provide a ‘hard’ segmentation Random Walks provides a ‘soft’ segmentation Which ‘soft’ segmentation is best?
Segmentation Statistically significant improvement
To Conclude … The choice of loss function matters during training Many interesting latent variables – Computer Vision (onerous annotations) – Medical Imaging (impossible annotations) Large-scale experiments – Other problems – General losses – Efficient optimization
Questions?
SPLENDID: Self-Paced Learning for Exploiting Noisy, Diverse or Incomplete Data Nikos Paragios (Equipe Galen, INRIA Saclay) and Daphne Koller (DAGS, Stanford) Machine learning with weak and noisy annotations; applications in computer vision and medical imaging Visits between INRIA Saclay and Stanford University