Ranking with High-Order and Missing Information. M. Pawan Kumar (Ecole Centrale Paris), Aseem Behl, Puneet Kumar, Pritish Mohapatra, C. V. Jawahar
PASCAL VOC “Jumping” Classification. Pipeline: Features → Processing → Training → Classifier
PASCAL VOC “Jumping” Classification. Features → Processing → Training → Classifier. Think of a classifier !!! ✗
PASCAL VOC “Jumping” Ranking. Features → Processing → Training → Classifier. Think of a classifier !!! ✗
Ranking vs. Classification. [figure: a perfect ranking of six images at ranks 1–6, all positives above all negatives: Average Precision = 1]
Ranking vs. Classification. [figure: three rankings of six images with Average Precision 1, 0.92 and 0.81; the perfect ranking has accuracy 1, while the two imperfect rankings share accuracy 0.67]
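The gap between the two metrics is easy to verify with a short computation. A minimal sketch (the six-image rankings below are an illustrative choice that reproduces the values on the slide: AP ≈ 0.92 and AP ≈ 0.81 for two rankings that both have accuracy ≈ 0.67 when the top three items are labelled positive):

```python
def average_precision(labels_in_rank_order):
    """AP of a ranked list; labels are 1 (positive) or 0 (negative)."""
    hits, precisions = 0, []
    for rank, label in enumerate(labels_in_rank_order, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at each positive
    return sum(precisions) / max(hits, 1)

def accuracy(labels_in_rank_order, num_predicted_positive):
    """Accuracy when the top `num_predicted_positive` items are called positive."""
    n = len(labels_in_rank_order)
    correct = sum(1 for i, y in enumerate(labels_in_rank_order)
                  if (y == 1) == (i < num_predicted_positive))
    return correct / n

perfect   = [1, 1, 1, 0, 0, 0]  # AP = 1, accuracy = 1
ranking_a = [1, 1, 0, 1, 0, 0]  # AP ≈ 0.92, accuracy ≈ 0.67
ranking_b = [1, 0, 1, 1, 0, 0]  # AP ≈ 0.81, accuracy ≈ 0.67
```

The two imperfect rankings are indistinguishable to a 0-1 loss (same accuracy) yet clearly different under AP, which is the point of the slide.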
Ranking vs. Classification. Ranking is not the same as classification, and average precision is not the same as accuracy. Should we use 0-1 loss based classifiers? No: optimize the measure you will be evaluated on (a basic machine learning principle)!
Outline. Structured Output SVM · Optimizing Average Precision · High-Order Information · Missing Information · Related Work. [Taskar, Guestrin and Koller, NIPS 2003; Tsochantaridis, Hofmann, Joachims and Altun, ICML 2004]
Structured Output SVM. Input x, output y, joint feature vector Ψ(x,y). Scoring function: s(x,y;w) = w^T Ψ(x,y). Prediction: y(w) = argmax_y s(x,y;w).
Parameter Estimation. Training data {(x_i, y_i), i = 1,2,…,m}; loss function for the i-th sample: Δ(y_i, y_i(w)). Minimize the regularized sum of losses over the training data. This objective is highly non-convex in w, and regularization plays no role (overfitting may occur).
Parameter Estimation. Training data {(x_i, y_i), i = 1,2,…,m}. Upper-bound the loss: Δ(y_i, y_i(w)) + w^T Ψ(x, y_i(w)) - w^T Ψ(x, y_i(w)) ≤ Δ(y_i, y_i(w)) + w^T Ψ(x, y_i(w)) - w^T Ψ(x, y_i) ≤ max_y { w^T Ψ(x,y) + Δ(y_i, y) } - w^T Ψ(x, y_i), where the first inequality holds since y_i(w) maximizes the score, so w^T Ψ(x, y_i(w)) ≥ w^T Ψ(x, y_i). This bound is convex in w and sensitive to the regularization of w.
Parameter Estimation. Training data {(x_i, y_i), i = 1,2,…,m}. min_w ||w||² + C Σ_i ξ_i, s.t. w^T Ψ(x,y) + Δ(y_i, y) - w^T Ψ(x, y_i) ≤ ξ_i for all y. This is a quadratic program whose cutting planes only require computing max_y { w^T Ψ(x,y) + Δ(y_i, y) }.
Parameter Estimation. Training data {(x_i, y_i), i = 1,2,…,m}. min_w ||w||² + C Σ_i ξ_i, s.t. s(x,y;w) + Δ(y_i, y) - s(x,y_i;w) ≤ ξ_i for all y. This is a quadratic program whose cutting planes only require computing max_y { s(x,y;w) + Δ(y_i, y) }.
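The learning problem above can be sketched in code. This is not the solver used in the talk: for brevity the cutting-plane QP is replaced by plain subgradient descent on the same regularized hinge upper bound, and loss-augmented inference searches the label set exhaustively, which is only feasible for tiny problems.

```python
import numpy as np

def loss_augmented_inference(w, x, y_true, labels, psi, delta):
    """argmax_y { w·Ψ(x,y) + Δ(y_true, y) }, by exhaustive search over a small label set."""
    return max(labels, key=lambda y: w @ psi(x, y) + delta(y_true, y))

def train_ssvm(samples, labels, psi, delta, dim, C=1.0, lr=0.01, epochs=200):
    """Minimize ||w||² + C Σ_i max_y { w·Ψ(x_i,y) + Δ(y_i,y) - w·Ψ(x_i,y_i) }.
    Subgradient descent here stands in for the cutting-plane QP of the slides."""
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = 2.0 * w  # gradient of the ||w||² term
        for x, y_i in samples:
            y_hat = loss_augmented_inference(w, x, y_i, labels, psi, delta)
            # Subgradient of the hinge term is Ψ(x,ŷ) - Ψ(x,y_i) when it is positive.
            if w @ psi(x, y_hat) + delta(y_i, y_hat) - w @ psi(x, y_i) > 0:
                grad += C * (psi(x, y_hat) - psi(x, y_i))
        w -= lr * grad
    return w

# Toy binary problem: Ψ(x,y) = y·x, Δ = 0-1 loss (reduces to a linear SVM).
psi = lambda x, y: y * np.asarray(x, dtype=float)
delta = lambda y_true, y: float(y != y_true)
samples = [([1.0, 0.0], +1), ([-1.0, 0.0], -1), ([2.0, 1.0], +1), ([-2.0, -1.0], -1)]
w = train_ssvm(samples, labels=[-1, +1], psi=psi, delta=delta, dim=2)
```

With this choice of Ψ and Δ, prediction argmax_y w·Ψ(x,y) is simply sign(w·x), so the learned w should separate the toy data.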
Recap. Problem formulation: input, output, joint feature vector or scoring function. Learning formulation: loss function (the ‘test’ evaluation criterion). Optimization for learning: cutting plane (loss-augmented inference). Prediction: inference.
Outline. Structured Output SVM · Optimizing Average Precision (AP-SVM) · High-Order Information · Missing Information · Related Work. [Yue, Finley, Radlinski and Joachims, SIGIR 2007]
Problem Formulation. Single input X: Φ(x_i) for all i ∈ P (positives) and Φ(x_k) for all k ∈ N (negatives).
Problem Formulation. Single output R: R_ik = +1 if i is ranked better than k, and -1 if k is ranked better than i.
Problem Formulation. Scoring function: s_i(w) = w^T Φ(x_i) for all i ∈ P, s_k(w) = w^T Φ(x_k) for all k ∈ N, and S(X,R;w) = Σ_{i∈P} Σ_{k∈N} R_ik (s_i(w) - s_k(w)).
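The joint score S(X,R;w) is a simple double sum, which is worth seeing concretely. A minimal sketch (helper names are mine; the individual scores s_i(w) = w^T Φ(x_i) are assumed to be precomputed):

```python
def joint_score(scores_pos, scores_neg, R):
    """S(X,R;w) = Σ_{i∈P} Σ_{k∈N} R_ik (s_i(w) - s_k(w)),
    with R[i][k] = +1 if positive i is ranked above negative k, else -1."""
    total = 0.0
    for i, s_i in enumerate(scores_pos):
        for k, s_k in enumerate(scores_neg):
            total += R[i][k] * (s_i - s_k)
    return total

def ranking_matrix(scores_pos, scores_neg):
    """The R induced by sorting all samples by score (ties broken toward negatives)."""
    return [[+1 if s_i > s_k else -1 for s_k in scores_neg] for s_i in scores_pos]
```

Note that S rewards every correctly ordered positive/negative pair by its score gap, so the score-sorted R maximizes S over all rankings.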
Learning Formulation. Loss function: Δ(R*, R) = 1 - AP of the ranking R.
Optimization for Learning. Cutting plane computation: an optimal greedy algorithm runs in O(|P||N|) time. [Yue, Finley, Radlinski and Joachims, SIGIR 2007]
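The greedy algorithm of Yue et al. is more involved; as a sketch of what this cutting-plane step must produce, the most-violated ranking argmax_R { Δ(R*,R) + S(X,R;w) } can be found by brute force on tiny problems. This uses the standard observation that within each class the optimal ordering follows the scores, so only the interleaving of positives among negatives needs to be searched (function names and the exhaustive search are mine, not the paper's method):

```python
from itertools import combinations

def ap_from_positions(pos_positions):
    """AP of a ranking given the sorted 1-indexed ranks occupied by positives."""
    precisions = [(h + 1) / rank for h, rank in enumerate(sorted(pos_positions))]
    return sum(precisions) / len(pos_positions)

def most_violated_ranking(scores_pos, scores_neg):
    """argmax_R { Δ(R*,R) + S(X,R;w) } by exhaustive search over interleavings."""
    P = sorted(scores_pos, reverse=True)   # positives stay in score order
    N = sorted(scores_neg, reverse=True)   # negatives stay in score order
    n = len(P) + len(N)
    best, best_val = None, -float('inf')
    for pos_slots in combinations(range(1, n + 1), len(P)):
        neg_slots = [r for r in range(1, n + 1) if r not in pos_slots]
        # R_ik = +1 iff positive i sits above negative k in this interleaving.
        S = sum((+1 if pi < nk else -1) * (p - q)
                for pi, p in zip(pos_slots, P)
                for nk, q in zip(neg_slots, N))
        val = (1.0 - ap_from_positions(pos_slots)) + S
        if val > best_val:
            best_val, best = val, pos_slots
    return best, best_val
```

For one positive (score 0.2) and one negative (score 0.1), the most violated ranking puts the positive last: the small score gap is outweighed by the AP loss of 0.5.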
Ranking. Sort in decreasing order of the individual scores s_i(w). [Yue, Finley, Radlinski and Joachims, SIGIR 2007]
Experiments. PASCAL VOC 2011 images, 10 action classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking. 10 ranking tasks, cross-validation, Poselets features.
AP-SVM vs. SVM. PASCAL VOC ‘test’ dataset, difference in AP: AP-SVM better in 8 classes, tied in 2 classes.
AP-SVM vs. SVM. Folds of PASCAL VOC ‘trainval’ dataset, difference in AP: AP-SVM is statistically better in 3 classes; SVM is statistically better in 0 classes.
Outline. Structured Output SVM · Optimizing Average Precision · High-Order Information (M4-AP-SVM) · Missing Information · Related Work. [Kumar, Behl, Jawahar and Kumar, Submitted]
High-Order Information. People perform similar actions, people strike similar poses, objects are of the same or similar sizes, “friends” have similar habits. Such cues are commonly exploited for classification; how can we use them for ranking?
Problem Formulation. Input x = {x_1, x_2, x_3}, output y ∈ {-1,+1}³. Joint features Ψ(x,y) = [Ψ₁(x,y); Ψ₂(x,y)], where Ψ₁ are unary features and Ψ₂ are pairwise features.
Learning Formulation. Input x = {x_1, x_2, x_3}, output y ∈ {-1,+1}³. Δ(y*, y) = fraction of incorrectly classified persons.
Optimization for Learning. Input x = {x_1, x_2, x_3}, output y ∈ {-1,+1}³. Compute max_y { w^T Ψ(x,y) + Δ(y*, y) } via graph cuts (if supermodular), LP relaxation, or exhaustive search.
Classification. Input x = {x_1, x_2, x_3}, output y ∈ {-1,+1}³. Compute max_y w^T Ψ(x,y) via graph cuts (if supermodular), LP relaxation, or exhaustive search.
Ranking? Input x = {x_1, x_2, x_3}, output y ∈ {-1,+1}³. Use the difference of max-marginals.
Max-Marginal for Positive Class. mm+(i;w) = max_{y: y_i = +1} w^T Ψ(x,y): the best possible score when person i is positive. Convex in w.
Max-Marginal for Negative Class. mm-(i;w) = max_{y: y_i = -1} w^T Ψ(x,y): the best possible score when person i is negative. Convex in w.
Ranking. s_i(w) = mm+(i;w) - mm-(i;w), a difference-of-convex function of w. Ranking by this difference of max-marginals gives HOB-SVM.
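The max-marginal scores can be computed by exhaustive enumeration for a toy pairwise model. The model form below (a unary term u_i·y_i per person plus a fixed reward whenever two labels agree) is an illustrative assumption, not the paper's exact joint feature vector:

```python
from itertools import product

def score(y, unary, pairwise):
    """w·Ψ(x,y) for a toy pairwise model:
    Σ_i unary[i]·y_i + Σ_{i<j} pairwise[i][j]·[y_i == y_j]."""
    s = sum(u * yi for u, yi in zip(unary, y))
    n = len(y)
    s += sum(pairwise[i][j] for i in range(n) for j in range(i + 1, n)
             if y[i] == y[j])
    return s

def max_marginal(i, label, unary, pairwise):
    """mm(i;w): best score over all y with y_i fixed (exhaustive, fine for small n)."""
    n = len(unary)
    return max(score(y, unary, pairwise)
               for y in product([-1, +1], repeat=n) if y[i] == label)

def ranking_score(i, unary, pairwise):
    """s_i(w) = mm+(i;w) - mm-(i;w): how strongly the model prefers person i positive."""
    return (max_marginal(i, +1, unary, pairwise)
            - max_marginal(i, -1, unary, pairwise))

# Toy instance: three persons, agreement reward 0.3 between every pair.
unary = [1.0, 0.5, -0.2]
pairwise = [[0.0, 0.3, 0.3], [0.0, 0.0, 0.3], [0.0, 0.0, 0.0]]
scores = [ranking_score(i, unary, pairwise) for i in range(3)]
```

In this instance the pairwise terms shift the individual scores without reordering them, but the resulting s_i(w) already reflect how the neighbours' labels interact with each person's evidence.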
Ranking. s_i(w) = mm+(i;w) - mm-(i;w). Why not optimize AP directly? Combining max-marginals with AP optimization yields the Max-Margin Max-Marginal AP-SVM (M4-AP-SVM).
Problem Formulation. Single input X: Φ(x_i) for all i ∈ P (positives) and Φ(x_k) for all k ∈ N (negatives).
Problem Formulation. Single output R: R_ik = +1 if i is ranked better than k, and -1 if k is ranked better than i.
Problem Formulation. Scoring function: s_i(w) = mm+(i;w) - mm-(i;w) for all i ∈ P, s_k(w) = mm+(k;w) - mm-(k;w) for all k ∈ N, and S(X,R;w) = Σ_{i∈P} Σ_{k∈N} R_ik (s_i(w) - s_k(w)).
Learning Formulation. Loss function: Δ(R*, R) = 1 - AP of the ranking R.
Optimization for Learning. A difference-of-convex program, solved with a very efficient CCCP: the linearization step uses dynamic graph cuts [Kohli and Torr, ECCV 2006], and the update step is equivalent to AP-SVM. [Kumar, Behl, Jawahar and Kumar, Submitted]
Ranking. Sort in decreasing order of the individual scores s_i(w).
Experiments. PASCAL VOC 2011 images, 10 action classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking. 10 ranking tasks, cross-validation, Poselets features.
HOB-SVM vs. AP-SVM. PASCAL VOC ‘test’ dataset, difference in AP: HOB-SVM better in 4 classes, worse in 3, tied in 3.
HOB-SVM vs. AP-SVM. Folds of PASCAL VOC ‘trainval’ dataset, difference in AP: HOB-SVM is statistically better in 0 classes; AP-SVM is statistically better in 0 classes.
M4-AP-SVM vs. AP-SVM. PASCAL VOC ‘test’ dataset, difference in AP: M4-AP-SVM better in 7 classes, worse in 2, tied in 1 class.
M4-AP-SVM vs. AP-SVM. Folds of PASCAL VOC ‘trainval’ dataset, difference in AP: M4-AP-SVM is statistically better in 4 classes; AP-SVM is statistically better in 0 classes.
Outline. Structured Output SVM · Optimizing Average Precision · High-Order Information · Missing Information (Latent-AP-SVM) · Related Work. [Behl, Jawahar and Kumar, CVPR 2014]
Fully Supervised Learning
Weakly Supervised Learning Rank images by relevance to ‘jumping’
Two Approaches. (1) Use the Latent Structured SVM with the AP loss: unintuitive prediction, loose upper bound on the loss, NP-hard optimization for cutting planes. (2) Carefully design Latent-AP-SVM: intuitive prediction, tight upper bound on the loss, optimal and efficient cutting plane computation.
Results
Outline. Structured Output SVM · Optimizing Average Precision · High-Order Information · Missing Information (Latent-AP-SVM) · Related Work. [Mohapatra, Jawahar and Kumar, In Preparation]
AP-CNN. [network diagram: conv1–conv5, fc6, fc7, then fc8 with a softmax + cross-entropy loss, versus fcA, fcB with the AP loss] Small but statistically significant improvements.
Questions? Code + Data Available