Optimizing Average Precision using Weakly Supervised Data. Aseem Behl, IIIT Hyderabad. Under the supervision of: Dr. M. Pawan Kumar (INRIA Paris) and Prof. C.V. Jawahar (IIIT Hyderabad).
Action Classification (running example). Input x: an image; output y, e.g., "Using Computer"; latent variable h. The ten action classes are Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, and Walking; the training data consists of inputs x_i with outputs y_i. Aim: to estimate accurate model parameters by optimizing average precision with weakly supervised data.
Outline: Preliminaries, Previous Work, Our Framework, Results, Conclusion.
Binary Classification. Several problems in computer vision can be formulated as binary classification tasks. Running example: action classification, i.e., automatically determining whether an image contains a person performing an action of interest (such as 'jumping' or 'walking'). A binary classifier widely employed in computer vision is the support vector machine (SVM).
Conventional SVMs. Input examples x_i (vectors); output labels y_i ∈ {+1, -1}. The SVM learns a hyperplane w, and predictions are sign(w^T Φ(x_i)). Training involves solving: min_w ½||w||² + C Σ_i ξ_i s.t. ∀i: y_i (w^T Φ(x_i)) ≥ 1 - ξ_i. The sum of slacks Σ_i ξ_i upper bounds the 0/1 loss.
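To make the hinge-loss objective above concrete, the following is a minimal sketch of a linear SVM trained by subgradient descent; the learning rate, epoch count, and the use of raw feature vectors in place of Φ are illustrative assumptions rather than the exact setup used in the slides.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, epochs=50, lr=0.01):
    """X: (n, d) feature matrix, y: (n,) labels in {+1, -1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                               # samples with non-zero hinge loss
        grad = w - C * (y[active, None] * X[active]).sum(axis=0)
        w -= lr * grad
    return w

def predict(w, X):
    return np.sign(X @ w)                                  # sign(w^T Phi(x))
```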
Structural SVM (SSVM). A generalization of the SVM to structured output spaces. Learning: min_w ½||w||² + C Σ_i ξ_i s.t. ∀ŷ_i: w^T Ψ(x_i, y_i) - w^T Ψ(x_i, ŷ_i) ≥ Δ(y_i, ŷ_i) - ξ_i. The joint score of the correct label must be at least as large as that of any incorrect label plus the loss. The number of constraints per sample is |dom(y)|, which can be intractably large; at the optimum, at least one constraint, the "most violated constraint", is tight. Prediction: y_pred = argmax_y w^T Ψ(x, y), i.e., maximize the joint score over all possible outputs.
Structural SVM Learning. The original SVM problem has exponentially many constraints, but most are dominated by a small set of "important" constraints. The structural SVM approach repeatedly finds the next most violated constraint until the working set of constraints is a good approximation of the full set. Slide taken from Yue et al. (2007).
Structural SVM Learning. 1: Solve the SVM objective using only the current working set of constraints. 2: Using the model learned in step 1, find the most violated constraint among the exponential set of constraints. 3: If the constraint returned in step 2 is violated by more than the most violated constraint in the working set plus some small constant, add it to the working set. Repeat steps 1-3 until no additional constraints are added, and return the most recent model trained in step 1. Steps 1-3 are guaranteed to loop for at most a polynomial number of iterations [Tsochantaridis et al., 2005].
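A structural sketch of this working-set loop is shown below. The callables solve_restricted_qp, find_most_violated, and violation are assumptions standing in for the problem-specific pieces (the restricted QP solver, the loss-augmented inference routine, and the measure of how strongly a constraint is violated); they are not defined in the slides.

```python
def cutting_plane(solve_restricted_qp, find_most_violated, violation,
                  eps=1e-3, max_iter=100):
    """Generic working-set (cutting-plane) loop for structural SVM learning."""
    working_set = []
    w, xi = solve_restricted_qp(working_set)        # step 1: solve on current working set
    for _ in range(max_iter):
        c = find_most_violated(w)                   # step 2: loss-augmented inference
        if violation(w, xi, c) <= eps:              # step 3: stop if not violated by more than eps
            break
        working_set.append(c)
        w, xi = solve_restricted_qp(working_set)    # re-solve with the new constraint added
    return w
```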
Weak Supervision. Supervised learning involves the onerous task of collecting detailed annotations for each training sample, which becomes financially infeasible as datasets grow. Under weak supervision, the additional annotations h_i are unknown, resulting in a more complex machine learning problem.
Weak Supervision – Challenges. Find the best additional annotation for the positive examples, e.g., identify the bounding box of the 'jumping' person in positive images. Treat all possible values of the annotations for negative samples as negative examples, i.e., ensure that the scores of 'jumping' person bounding boxes are higher than the scores of all possible bounding boxes in the negative images.
Latent SVM (LSVM). Extends the SSVM to incorporate hidden information, which is treated as part of the label but is not observed during training. min_w ½||w||² + C Σ_i ξ_i s.t. ∀ŷ_i, ĥ_i: max_{h_i} w^T Ψ(x_i, y_i, h_i) - w^T Ψ(x_i, ŷ_i, ĥ_i) ≥ Δ(y_i, ŷ_i) - ξ_i. The objective is non-convex but can be written as a difference of convex functions, so the CCCP algorithm converges to a local minimum.
Concave-Convex Procedure (CCCP). 1: Repeat until convergence: 2: Approximate the concave portion of the objective by imputing the hidden variables. 3: Update the parameters using the imputed hidden variables by solving the resulting convex SSVM problem. Steps 2-3 are guaranteed to converge to a local minimum in a polynomial number of iterations [Yuille and Rangarajan, 2003].
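A schematic sketch of this alternation is given below; impute_hidden (the argmax over hidden variables under the current model) and solve_convex_ssvm (a standard structural SVM solve with the hidden variables fixed) are assumed callables, not part of the original slides.

```python
import numpy as np

def cccp(w_init, impute_hidden, solve_convex_ssvm, tol=1e-4, max_iter=50):
    """Alternate between imputing hidden variables and a convex parameter update."""
    w, prev_obj = w_init, np.inf
    for _ in range(max_iter):
        hidden = impute_hidden(w)               # step 2: linearize the concave part
        w, obj = solve_convex_ssvm(hidden)      # step 3: convex SSVM update
        if prev_obj - obj < tol:                # stop when the objective stops improving
            break
        prev_obj = obj
    return w
```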
Average Precision (AP). AP is the most commonly used accuracy measure for binary classification: it is the average of the precision scores at the rank locations of each positive sample. The AP-loss therefore depends on the ranking of the samples, whereas the 0/1 loss depends only on the number of incorrectly classified samples; for example, two rankings can have the same 0/1 loss (0.40) but different AP-losses (0.24 vs. 0.36). Prediction: Y_opt = argmax_Y w^T Ψ(X, Y). A machine learning algorithm optimizing the 0/1 loss may therefore learn a very different model from one optimizing AP.
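The snippet below is a small, self-contained illustration of this distinction: AP is computed as the mean of precision@k evaluated at the rank of each positive sample, and two toy rankings with identical label counts yield different AP values (the toy numbers are illustrative, not the 0.24/0.36 example from the slide).

```python
import numpy as np

def average_precision(labels_in_ranked_order):
    """labels_in_ranked_order: 0/1 labels sorted by descending classifier score."""
    labels = np.asarray(labels_in_ranked_order, dtype=float)
    cum_pos = np.cumsum(labels)                       # number of positives up to each rank
    ranks = np.arange(1, len(labels) + 1)
    precision_at_pos = cum_pos[labels == 1] / ranks[labels == 1]
    return precision_at_pos.mean()                    # average over positive samples

print(average_precision([1, 0, 1, 0, 1]))   # positives nearer the top -> ~0.756
print(average_precision([0, 1, 0, 1, 1]))   # same label counts, lower AP -> ~0.533
```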
Notation. X: input samples {x_i, i = 1,…,n}. Y: ranking matrix with Y_ij = 1 if x_i is ranked higher than x_j, 0 if x_i and x_j are ranked the same, and -1 if x_i is ranked lower than x_j. H_P: additional annotations for the positives {h_i, i ∈ P}; H_N: additional annotations for the negatives {h_j, j ∈ N}. AP(Y, Y*): the AP of ranking Y; Δ(Y, Y*): the AP-loss, 1 - AP(Y, Y*). Joint feature vector: Ψ(X, Y, {H_P, H_N}) = (1 / (|P|·|N|)) Σ_{i∈P} Σ_{j∈N} Y_ij (Φ_i(h_i) - Φ_j(h_j)).
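As a concrete reading of the joint feature vector above, the sketch below evaluates Ψ as the normalized sum of pairwise feature differences; phi_pos and phi_neg are assumed to be per-sample feature vectors already computed for the chosen additional annotations.

```python
import numpy as np

def joint_feature(phi_pos, phi_neg, Y):
    """Psi(X, Y, {H_P, H_N}) for phi_pos: (|P|, d), phi_neg: (|N|, d),
    and Y: (|P|, |N|) ranking matrix with entries in {-1, 0, +1}."""
    n_pos, n_neg, d = phi_pos.shape[0], phi_neg.shape[0], phi_pos.shape[1]
    psi = np.zeros(d)
    for i in range(n_pos):
        for j in range(n_neg):
            psi += Y[i, j] * (phi_pos[i] - phi_neg[j])   # pairwise feature difference
    return psi / (n_pos * n_neg)                         # normalize by |P|.|N|
```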
AP-SVM. AP-SVM optimizes the correct AP-loss function as opposed to the 0/1 loss. Learning: min_w ½||w||² + Cξ s.t. ∀Y: w^T Ψ(X, Y*, H) - w^T Ψ(X, Y, H) ≥ Δ(Y, Y*) - ξ. Constraints are defined for each incorrect labeling Y: the joint discriminant score for the correct labeling must be at least as large as that of the incorrect labeling plus the loss. Prediction: Y_opt = argmax_Y w^T Ψ(X, Y, H). After learning w, a prediction is made by sorting the samples (x_k, h_k) in descending order of w^T Φ_k(h_k).
AP-SVM – Exponential Constraints. For average precision, the correct labeling is a ranking in which all positive examples are ranked at the front; an incorrect labeling is any other ranking. There is an exponential number of incorrect rankings, and thus an exponential number of constraints.
Finding the Most Violated Constraint. The structural SVM requires a subroutine that finds the most violated constraint. This subroutine depends on the formulation of the loss function and the joint feature representation; Yue et al. devised an efficient algorithm for the case of optimizing AP.
Finding the Most Violated Constraint. AP is invariant to the order of examples within the positive set and within the negative set, and the joint SVM score is maximized by sorting examples in descending order of their individual scores. The problem therefore reduces to finding an interleaving between two sorted lists of examples.
Finding the Most Violated Constraint. Start with the perfect ranking and consider swapping adjacent positive/negative examples: find the best feasible position for the highest-scoring negative example, then repeat for the next negative example. A negative example never needs to swap past a previously placed negative example. Repeat until all negative examples have been considered. Slide taken from Yue et al. (2007).
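The sketch below follows this greedy interleaving (after Yue et al., 2007): each negative example, taken in descending score order, is placed above the suffix of positives that maximizes the gain in AP-loss minus the corresponding score penalty. The returned matrix is indexed by positives and negatives in descending score order; this is a simplified illustration of the idea, not the authors' exact implementation.

```python
import numpy as np

def most_violated_ap_ranking(pos_scores, neg_scores):
    """Greedy loss-augmented inference for AP. Returns Y of shape (|P|, |N|),
    with Y[i, j] = +1 if the i-th highest-scoring positive is ranked above the
    j-th highest-scoring negative in the most violated ranking, -1 otherwise."""
    p = np.sort(np.asarray(pos_scores, float))[::-1]     # positives, descending score
    q = np.sort(np.asarray(neg_scores, float))[::-1]     # negatives, descending score
    m, n = len(p), len(q)
    Y = np.ones((m, n))
    i = np.arange(1, m + 1)
    for j in range(1, n + 1):                            # j-th highest-scoring negative
        # gain in AP-loss from placing negative j above positive i,
        # minus the resulting change in the joint score
        delta = (1.0 / m) * (i / (i + j - 1.0) - i / (i + j)) \
                - 2.0 * (p - q[j - 1]) / (m * n)
        suffix = np.cumsum(delta[::-1])[::-1]            # total gain of going above positives i..m
        if suffix.max() > 0:
            cut = int(np.argmax(suffix))                 # first positive ranked below negative j
            Y[cut:, j - 1] = -1
    return Y
```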
Hypothesis: optimizing the correct loss function is important for weakly supervised learning.
Latent Structural SVM (LSSVM). Introduces a margin between the maximum score for the ground-truth output and the score of every other pair of output and additional annotation; it therefore compares scores between two different sets of annotations. Learning: min_w ½||w||² + Cξ s.t. ∀Y, H: max_Ĥ {w^T Ψ(X, Y*, Ĥ)} - w^T Ψ(X, Y, H) ≥ Δ(Y, Y*) - ξ.
Latent Structural SVM (LSSVM). Prediction: (Y_opt, H_opt) = argmax_{Y,H} w^T Ψ(X, Y, H).
Latent Structural SVM (LSSVM) – Disadvantages. Prediction: LSSVM uses an unintuitive prediction rule. Learning: LSSVM optimizes a loose upper bound on the AP-loss. Optimization: exact loss-augmented inference is computationally inefficient.
Latent AP-SVM - Prediction. Step 1: find the best additional annotation h_i for each sample: H_opt = argmax_H w^T Ψ(X, Y, H). Step 2: sort the samples according to their best scores: Y_opt = argmax_Y w^T Ψ(X, Y, H_opt).
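A minimal sketch of this two-step prediction rule is given below; candidate_feats is an assumed list holding, for each sample, one feature vector per candidate value of the additional annotation h.

```python
import numpy as np

def latent_ap_svm_predict(w, candidate_feats):
    """candidate_feats: list of (num_candidates_k, d) arrays, one per sample."""
    # Step 1: pick the best additional annotation for each sample (keep its score)
    best_scores = np.array([(np.asarray(feats) @ w).max() for feats in candidate_feats])
    # Step 2: rank samples by their best scores, highest first
    ranking = np.argsort(-best_scores)
    return ranking, best_scores
```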
Latent AP-SVM - Learning. Finds the best assignment of values H_P such that the score of the correct ranking is higher than the score of any incorrect ranking, regardless of the choice of H_N; thus it compares scores between the same set of additional annotations. min_w ½||w||² + Cξ s.t. ∀Y, H_N: max_{H_P} {w^T Ψ(X, Y*, {H_P, H_N}) - w^T Ψ(X, Y, {H_P, H_N})} ≥ Δ(Y, Y*) - ξ.
Latent AP-SVM - Learning. The constraints of latent AP-SVM are a subset of the LSSVM constraints, so the optimal solution of latent AP-SVM has a lower objective value than the LSSVM solution. Latent AP-SVM therefore provides a valid, and tighter, upper bound on the AP-loss.
Latent AP-SVM - Optimization. 1: Initialize the parameters w_0. 2: Repeat until convergence: 3: Impute the additional annotations for the positives, choosing the annotation of each positive independently; complexity O(n_P·|H|). 4: Update the parameters using the cutting-plane algorithm, where the most violated constraint is found by maximizing over H_N and Y independently; complexity O(n_P·n_N).
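The sketch below outlines one outer iteration of this procedure. The per-positive candidate features and the cutting_plane_update callable are assumptions standing in for the feature extractor and the convex sub-problem solver; only the structure of the imputation step (an independent argmax per positive) is taken from the slide.

```python
import numpy as np

def impute_positive_annotations(w, pos_candidate_feats):
    """Independently choose h_i for each positive sample: O(n_P * |H|)."""
    return [int(np.argmax(np.asarray(feats) @ w)) for feats in pos_candidate_feats]

def latent_ap_svm_iteration(w, pos_candidate_feats, cutting_plane_update):
    h_pos = impute_positive_annotations(w, pos_candidate_feats)   # step 3: imputation
    return cutting_plane_update(w, h_pos)                         # step 4: parameter update
```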
Action Classification. Input x: an image; output y, e.g., "Using Computer"; latent variable h. Dataset: PASCAL VOC 2011 action classification, 4846 images over 10 action classes (Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking), with 2424 trainval and 2422 test images. Features: activation scores of action-specific poselets and 4 object activation scores.
Action Classification. 5-fold cross-validation on the 'trainval' dataset. Statistically significant increase in performance: 6/10 classes over LSVM and 7/10 classes over LSSVM. Overall improvement: 5% over LSVM and 4% over LSSVM.
Action Classification. The x-axis corresponds to the amount of supervision provided and the y-axis to the mean average precision. As the amount of supervision decreases, the gap in performance between latent AP-SVM and the baseline methods increases.
Action Classification. Performance on the PASCAL test set. Increase in performance: all classes over LSVM and 8/10 classes over LSSVM. Overall improvement: 5.1% over LSVM and 3.7% over LSSVM.
Object Detection. Input x: an image; output y, e.g., "Aeroplane"; latent variable h: a candidate window (an average of 2000 candidate windows per image, obtained using the selective-search algorithm). Dataset: PASCAL VOC 2007 object detection, with images over 20 object categories and a trainval set plus 4952 test images. Features: the 4096-dimensional activation vector of the penultimate layer of a trained Convolutional Neural Network (CNN).
Object Detection. 5-fold cross-validation on the 'trainval' dataset. Statistically significant increase in performance for 15/20 classes over LSVM. The superior performance is partially attributed to better localization of objects by LAP-SVM during training.
Object Detection. Performance on the PASCAL test set: increase in performance for 19/20 classes over LSVM, with an overall improvement of 7% over LSVM. We also obtain improved results on the IIIT 5K-WORD dataset.
Conclusion. Proposed a novel formulation that obtains an accurate ranking by minimizing a carefully designed upper bound on the AP-loss. Showed the theoretical benefits of our method. Demonstrated the advantage of our approach on challenging machine learning problems.
Thank you