Optimizing Average Precision using Weakly Supervised Data. Aseem Behl, IIIT Hyderabad. Under the supervision of: Dr. M. Pawan Kumar (INRIA Paris) and Prof. C.V. Jawahar (IIIT Hyderabad).
Action Classification (running example). Input x: an image; output y, e.g., "Using Computer"; latent variable h. The ten action classes are Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, and Walking; the training data consists of inputs x_i with outputs y_i. Aim: to estimate accurate model parameters by optimizing average precision with weakly supervised data.
Outline: Preliminaries, Previous Work, Our Framework, Results, Conclusion.
Binary Classification. Several problems in computer vision can be formulated as binary classification tasks. Running example: action classification, i.e., automatically determining whether an image contains a person performing an action of interest (such as 'jumping' or 'walking'). A binary classifier widely employed in computer vision is the support vector machine (SVM).
Conventional SVMs. Input examples x_i (vectors); output labels y_i ∈ {+1, -1}. The SVM learns a hyperplane w, and predictions are sign(w^T Φ(x_i)). Training involves solving: min_w ½||w||² + C Σ_i ξ_i s.t. ∀i: y_i (w^T Φ(x_i)) ≥ 1 - ξ_i. The sum of slacks Σ_i ξ_i upper bounds the 0/1 loss.
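To make the hinge-loss objective above concrete, the following is a minimal sketch of a linear SVM trained by subgradient descent; the learning rate, epoch count, and the use of raw feature vectors in place of Φ are illustrative assumptions rather than the exact setup used in the slides.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, epochs=50, lr=0.01):
    """X: (n, d) feature matrix, y: (n,) labels in {+1, -1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                               # samples with non-zero hinge loss
        grad = w - C * (y[active, None] * X[active]).sum(axis=0)
        w -= lr * grad
    return w

def predict(w, X):
    return np.sign(X @ w)                                  # sign(w^T Phi(x))
```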
Structural SVM (SSVM). A generalization of the SVM to structured output spaces. Learning: min_w ½||w||² + C Σ_i ξ_i s.t. ∀ŷ_i: w^T Ψ(x_i, y_i) - w^T Ψ(x_i, ŷ_i) ≥ Δ(y_i, ŷ_i) - ξ_i. The joint score of the correct label must be at least as large as that of any incorrect label plus the loss. The number of constraints per sample is |dom(y)|, which can be intractably large; at the optimum, at least one constraint, the "most violated constraint", is tight. Prediction: y_pred = argmax_y w^T Ψ(x, y), i.e., maximize the joint score over all possible outputs.
Structural SVM Learning. The original SVM problem has exponentially many constraints, but most are dominated by a small set of "important" constraints. The structural SVM approach repeatedly finds the next most violated constraint until the working set of constraints is a good approximation of the full set. Slide taken from Yue et al. (2007).
Structural SVM Learning. 1: Solve the SVM objective using only the current working set of constraints. 2: Using the model learned in step 1, find the most violated constraint among the exponential set of constraints. 3: If the constraint returned in step 2 is violated by more than the most violated constraint in the working set plus some small constant, add it to the working set. Repeat steps 1-3 until no additional constraints are added, and return the most recent model trained in step 1. Steps 1-3 are guaranteed to loop for at most a polynomial number of iterations [Tsochantaridis et al., 2005].
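A structural sketch of this working-set loop is shown below. The callables solve_restricted_qp, find_most_violated, and violation are assumptions standing in for the problem-specific pieces (the restricted QP solver, the loss-augmented inference routine, and the measure of how strongly a constraint is violated); they are not defined in the slides.

```python
def cutting_plane(solve_restricted_qp, find_most_violated, violation,
                  eps=1e-3, max_iter=100):
    """Generic working-set (cutting-plane) loop for structural SVM learning."""
    working_set = []
    w, xi = solve_restricted_qp(working_set)        # step 1: solve on current working set
    for _ in range(max_iter):
        c = find_most_violated(w)                   # step 2: loss-augmented inference
        if violation(w, xi, c) <= eps:              # step 3: stop if not violated by more than eps
            break
        working_set.append(c)
        w, xi = solve_restricted_qp(working_set)    # re-solve with the new constraint added
    return w
```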
Weak Supervision. Supervised learning involves the onerous task of collecting detailed annotations for each training sample, which becomes financially infeasible as datasets grow. Under weak supervision, the additional annotations h_i are unknown, resulting in a more complex machine learning problem.
Weak Supervision – Challenges. Find the best additional annotation for the positive examples, e.g., identify the bounding box of the 'jumping' person in positive images. Treat all possible values of the annotations for negative samples as negative examples, i.e., ensure that the scores of 'jumping' person bounding boxes are higher than the scores of all possible bounding boxes in the negative images.
Latent SVM (LSVM). Extends the SSVM to incorporate hidden information, which is treated as part of the label but is not observed during training. min_w ½||w||² + C Σ_i ξ_i s.t. ∀ŷ_i, ĥ_i: max_{h_i} w^T Ψ(x_i, y_i, h_i) - w^T Ψ(x_i, ŷ_i, ĥ_i) ≥ Δ(y_i, ŷ_i) - ξ_i. The objective is non-convex but can be written as a difference of convex functions, so the CCCP algorithm converges to a local minimum.
Concave-Convex Procedure (CCCP). 1: Repeat until convergence: 2: Approximate the concave portion of the objective by imputing the hidden variables. 3: Update the parameters using the imputed hidden variables by solving the resulting convex SSVM problem. Steps 2-3 are guaranteed to converge to a local minimum in a polynomial number of iterations [Yuille and Rangarajan, 2003].
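A schematic sketch of this alternation is given below; impute_hidden (the argmax over hidden variables under the current model) and solve_convex_ssvm (a standard structural SVM solve with the hidden variables fixed) are assumed callables, not part of the original slides.

```python
import numpy as np

def cccp(w_init, impute_hidden, solve_convex_ssvm, tol=1e-4, max_iter=50):
    """Alternate between imputing hidden variables and a convex parameter update."""
    w, prev_obj = w_init, np.inf
    for _ in range(max_iter):
        hidden = impute_hidden(w)               # step 2: linearize the concave part
        w, obj = solve_convex_ssvm(hidden)      # step 3: convex SSVM update
        if prev_obj - obj < tol:                # stop when the objective stops improving
            break
        prev_obj = obj
    return w
```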
Average Precision (AP). AP is the most commonly used accuracy measure for binary classification: it is the average of the precision scores at the rank locations of each positive sample. The AP-loss therefore depends on the ranking of the samples, whereas the 0/1 loss depends only on the number of incorrectly classified samples; for example, two rankings can have the same 0/1 loss (0.40) but different AP-losses (0.24 vs. 0.36). Prediction: Y_opt = argmax_Y w^T Ψ(X, Y). A machine learning algorithm optimizing the 0/1 loss may therefore learn a very different model from one optimizing AP.
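The snippet below is a small, self-contained illustration of this distinction: AP is computed as the mean of precision@k evaluated at the rank of each positive sample, and two toy rankings with identical label counts yield different AP values (the toy numbers are illustrative, not the 0.24/0.36 example from the slide).

```python
import numpy as np

def average_precision(labels_in_ranked_order):
    """labels_in_ranked_order: 0/1 labels sorted by descending classifier score."""
    labels = np.asarray(labels_in_ranked_order, dtype=float)
    cum_pos = np.cumsum(labels)                       # number of positives up to each rank
    ranks = np.arange(1, len(labels) + 1)
    precision_at_pos = cum_pos[labels == 1] / ranks[labels == 1]
    return precision_at_pos.mean()                    # average over positive samples

print(average_precision([1, 0, 1, 0, 1]))   # positives nearer the top -> ~0.756
print(average_precision([0, 1, 0, 1, 1]))   # same label counts, lower AP -> ~0.533
```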
Notation. X: input samples {x_i, i = 1,…,n}. Y: ranking matrix with Y_ij = 1 if x_i is ranked higher than x_j, 0 if x_i and x_j are ranked the same, and -1 if x_i is ranked lower than x_j. H_P: additional annotations for the positives {h_i, i ∈ P}; H_N: additional annotations for the negatives {h_j, j ∈ N}. AP(Y, Y*): the AP of ranking Y; Δ(Y, Y*): the AP-loss, 1 - AP(Y, Y*). Joint feature vector: Ψ(X, Y, {H_P, H_N}) = (1 / (|P|·|N|)) Σ_{i∈P} Σ_{j∈N} Y_ij (Φ_i(h_i) - Φ_j(h_j)).
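As a concrete reading of the joint feature vector above, the sketch below evaluates Ψ as the normalized sum of pairwise feature differences; phi_pos and phi_neg are assumed to be per-sample feature vectors already computed for the chosen additional annotations.

```python
import numpy as np

def joint_feature(phi_pos, phi_neg, Y):
    """Psi(X, Y, {H_P, H_N}) for phi_pos: (|P|, d), phi_neg: (|N|, d),
    and Y: (|P|, |N|) ranking matrix with entries in {-1, 0, +1}."""
    n_pos, n_neg, d = phi_pos.shape[0], phi_neg.shape[0], phi_pos.shape[1]
    psi = np.zeros(d)
    for i in range(n_pos):
        for j in range(n_neg):
            psi += Y[i, j] * (phi_pos[i] - phi_neg[j])   # pairwise feature difference
    return psi / (n_pos * n_neg)                         # normalize by |P|.|N|
```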
AP-SVM. AP-SVM optimizes the correct AP-loss function as opposed to the 0/1 loss. Learning: min_w ½||w||² + Cξ s.t. ∀Y: w^T Ψ(X, Y*, H) - w^T Ψ(X, Y, H) ≥ Δ(Y, Y*) - ξ. Constraints are defined for each incorrect labeling Y: the joint discriminant score for the correct labeling must be at least as large as that of the incorrect labeling plus the loss. Prediction: Y_opt = argmax_Y w^T Ψ(X, Y, H). After learning w, a prediction is made by sorting the samples (x_k, h_k) in descending order of w^T Φ_k(h_k).
AP-SVM – Exponential Constraints. For average precision, the correct labeling is a ranking in which all positive examples are ranked at the front; an incorrect labeling is any other ranking. There is an exponential number of incorrect rankings, and thus an exponential number of constraints.
Finding the Most Violated Constraint. The structural SVM requires a subroutine that finds the most violated constraint. This subroutine depends on the formulation of the loss function and the joint feature representation; Yue et al. devised an efficient algorithm for the case of optimizing AP.
Finding the Most Violated Constraint. AP is invariant to the order of examples within the positive set and within the negative set, and the joint SVM score is maximized by sorting examples in descending order of their individual scores. The problem therefore reduces to finding an interleaving between two sorted lists of examples.
Finding the Most Violated Constraint. Start with the perfect ranking and consider swapping adjacent positive/negative examples: find the best feasible position for the highest-scoring negative example, then repeat for the next negative example. A negative example never needs to swap past a previously placed negative example. Repeat until all negative examples have been considered. Slide taken from Yue et al. (2007).
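The sketch below follows this greedy interleaving (after Yue et al., 2007): each negative example, taken in descending score order, is placed above the suffix of positives that maximizes the gain in AP-loss minus the corresponding score penalty. The returned matrix is indexed by positives and negatives in descending score order; this is a simplified illustration of the idea, not the authors' exact implementation.

```python
import numpy as np

def most_violated_ap_ranking(pos_scores, neg_scores):
    """Greedy loss-augmented inference for AP. Returns Y of shape (|P|, |N|),
    with Y[i, j] = +1 if the i-th highest-scoring positive is ranked above the
    j-th highest-scoring negative in the most violated ranking, -1 otherwise."""
    p = np.sort(np.asarray(pos_scores, float))[::-1]     # positives, descending score
    q = np.sort(np.asarray(neg_scores, float))[::-1]     # negatives, descending score
    m, n = len(p), len(q)
    Y = np.ones((m, n))
    i = np.arange(1, m + 1)
    for j in range(1, n + 1):                            # j-th highest-scoring negative
        # gain in AP-loss from placing negative j above positive i,
        # minus the resulting change in the joint score
        delta = (1.0 / m) * (i / (i + j - 1.0) - i / (i + j)) \
                - 2.0 * (p - q[j - 1]) / (m * n)
        suffix = np.cumsum(delta[::-1])[::-1]            # total gain of going above positives i..m
        if suffix.max() > 0:
            cut = int(np.argmax(suffix))                 # first positive ranked below negative j
            Y[cut:, j - 1] = -1
    return Y
```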
Hypothesis: optimizing the correct loss function is important for weakly supervised learning.
Latent Structural SVM (LSSVM). Introduces a margin between the maximum score for the ground-truth output and the score of every other pair of output and additional annotation; it therefore compares scores between two different sets of annotations. Learning: min_w ½||w||² + Cξ s.t. ∀Y, H: max_Ĥ {w^T Ψ(X, Y*, Ĥ)} - w^T Ψ(X, Y, H) ≥ Δ(Y, Y*) - ξ.
Latent Structural SVM (LSSVM). Prediction: (Y_opt, H_opt) = argmax_{Y,H} w^T Ψ(X, Y, H).
Latent Structural SVM (LSSVM) – Disadvantages. Prediction: LSSVM uses an unintuitive prediction rule. Learning: LSSVM optimizes a loose upper bound on the AP-loss. Optimization: exact loss-augmented inference is computationally inefficient.
Latent AP-SVM - Prediction. Step 1: find the best additional annotation h_i for each sample: H_opt = argmax_H w^T Ψ(X, Y, H). Step 2: sort the samples according to their best scores: Y_opt = argmax_Y w^T Ψ(X, Y, H_opt).
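A minimal sketch of this two-step prediction rule is given below; candidate_feats is an assumed list holding, for each sample, one feature vector per candidate value of the additional annotation h.

```python
import numpy as np

def latent_ap_svm_predict(w, candidate_feats):
    """candidate_feats: list of (num_candidates_k, d) arrays, one per sample."""
    # Step 1: pick the best additional annotation for each sample (keep its score)
    best_scores = np.array([(np.asarray(feats) @ w).max() for feats in candidate_feats])
    # Step 2: rank samples by their best scores, highest first
    ranking = np.argsort(-best_scores)
    return ranking, best_scores
```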
Latent AP-SVM - Learning. Finds the best assignment of values H_P such that the score of the correct ranking is higher than the score of any incorrect ranking, regardless of the choice of H_N; thus it compares scores between the same set of additional annotations. min_w ½||w||² + Cξ s.t. ∀Y, H_N: max_{H_P} {w^T Ψ(X, Y*, {H_P, H_N}) - w^T Ψ(X, Y, {H_P, H_N})} ≥ Δ(Y, Y*) - ξ.
Latent AP-SVM - Learning. The constraints of latent AP-SVM are a subset of the LSSVM constraints, so the optimal solution of latent AP-SVM has a lower objective value than the LSSVM solution. Latent AP-SVM therefore provides a valid, and tighter, upper bound on the AP-loss.
Latent AP-SVM - Optimization. 1: Initialize the parameters w_0. 2: Repeat until convergence: 3: Impute the additional annotations for the positives, choosing the annotation of each positive independently; complexity O(n_P·|H|). 4: Update the parameters using the cutting-plane algorithm, where the most violated constraint is found by maximizing over H_N and Y independently; complexity O(n_P·n_N).
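The sketch below outlines one outer iteration of this procedure. The per-positive candidate features and the cutting_plane_update callable are assumptions standing in for the feature extractor and the convex sub-problem solver; only the structure of the imputation step (an independent argmax per positive) is taken from the slide.

```python
import numpy as np

def impute_positive_annotations(w, pos_candidate_feats):
    """Independently choose h_i for each positive sample: O(n_P * |H|)."""
    return [int(np.argmax(np.asarray(feats) @ w)) for feats in pos_candidate_feats]

def latent_ap_svm_iteration(w, pos_candidate_feats, cutting_plane_update):
    h_pos = impute_positive_annotations(w, pos_candidate_feats)   # step 3: imputation
    return cutting_plane_update(w, h_pos)                         # step 4: parameter update
```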
Action Classification. Input x: an image; output y, e.g., "Using Computer"; latent variable h. Dataset: PASCAL VOC 2011 action classification, 4846 images over 10 action classes (Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking), with 2424 trainval and 2422 test images. Features: activation scores of action-specific poselets and 4 object activation scores.
Action Classification. 5-fold cross-validation on the 'trainval' dataset. Statistically significant increase in performance: 6/10 classes over LSVM and 7/10 classes over LSSVM. Overall improvement: 5% over LSVM and 4% over LSSVM.
Action Classification. The x-axis corresponds to the amount of supervision provided and the y-axis to the mean average precision. As the amount of supervision decreases, the gap in performance between latent AP-SVM and the baseline methods increases.
Action Classification. Performance on the PASCAL test set. Increase in performance: all classes over LSVM and 8/10 classes over LSSVM. Overall improvement: 5.1% over LSVM and 3.7% over LSSVM.
Object Detection. Input x: an image; output y, e.g., "Aeroplane"; latent variable h: a candidate window (an average of 2000 candidate windows per image, obtained using the selective-search algorithm). Dataset: PASCAL VOC 2007 object detection, with images over 20 object categories and a trainval set plus 4952 test images. Features: the 4096-dimensional activation vector of the penultimate layer of a trained Convolutional Neural Network (CNN).
Object Detection. 5-fold cross-validation on the 'trainval' dataset. Statistically significant increase in performance for 15/20 classes over LSVM. The superior performance is partially attributed to better localization of objects by LAP-SVM during training.
Object Detection. Performance on the PASCAL test set: increase in performance for 19/20 classes over LSVM, with an overall improvement of 7% over LSVM. We also obtain improved results on the IIIT 5K-WORD dataset.
Conclusion. Proposed a novel formulation that obtains an accurate ranking by minimizing a carefully designed upper bound on the AP-loss. Showed the theoretical benefits of our method. Demonstrated the advantage of our approach on challenging machine learning problems.
Thank you