Efficient Large-Scale Structured Learning


Efficient Large-Scale Structured Learning
Steve Branson (Caltech), Oscar Beijbom (UC San Diego), Serge Belongie (UC San Diego)
CVPR 2013, Portland, Oregon

Overview
Structured prediction; learning from larger datasets.
Applications: deformable part models, object detection, cost-sensitive learning.
[Figures: TINY IMAGES large-dataset example; taxonomy tree: Mammal → Primate (Gorilla, Orangutan), Hoofed Mammal (Odd-toed, Even-toed)]

Overview
Available tools for structured learning are not as refined as tools for binary classification.
Two sources of speed improvement:
- Faster stochastic dual optimization algorithms
- An application-specific importance sampling routine

Summary
Usually, train time = 1-10 times test time.
Publicly available software package:
- Fast algorithms for multiclass SVMs, DPMs
- API to adapt to new applications
- Supports datasets too large to fit in memory
- Network interface for online & active learning

Summary
Deformable part models: 50-1000x faster than SVMstruct, mining hard negatives, and SGD (PEGASOS).
Cost-sensitive multiclass SVM: 10-50x faster than SVMstruct; as fast as 1-vs-all binary SVM.

Binary vs. Structured
Typical pipeline: Structured Dataset → Binary Dataset → Binary Learner (SVM, Boosting, Logistic Regression, etc.) → Binary Output (Y = -1 or Y = +1) → Structured Output, e.g., Y = (x, y, w, h) for Object Detection, Pose Registration, Attribute Prediction, etc.

Binary vs. Structured
Pros: the binary classifier is application independent.
Cons: what is lost in terms of:
- Accuracy at convergence?
- Computational efficiency?

Binary vs. Structured
Binary loss Δ01(X) ≈ convex upper bound (e.g., hinge, exponential loss): the source of computational speed.
Structured prediction loss: Δ(g(X), Y_gt).

Binary vs. Structured
Binary loss Δ01(X) ≈ convex upper bound (e.g., hinge, exponential loss).
Structured prediction loss Δ(g(X), Y_gt) ≈ ℓ(X; w), a convex upper bound on the structured prediction loss.
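The structured bound can be made concrete with a small sketch (all names here are illustrative; `label_space` enumerates outputs, which only works for tiny problems, and real SSVM implementations replace the enumeration with loss-augmented inference):

```python
import numpy as np

def structured_hinge(w, psi, delta, X, Y_gt, label_space):
    """Convex upper bound on the structured prediction loss:
    l(X; w) = max_Y [ w . psi(X, Y) + delta(Y, Y_gt) ] - w . psi(X, Y_gt)."""
    augmented = [w @ psi(X, Y) + delta(Y, Y_gt) for Y in label_space]
    return max(augmented) - w @ psi(X, Y_gt)
```

Because Y_gt itself is inside the max, the bound is nonnegative, and it dominates Δ(g(X), Y_gt) whenever g(X) predicts the highest-scoring output.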

Binary vs. Structured
Application-specific optimization algorithms that:
- Converge to lower test error than binary solutions
- Achieve lower test error for all amounts of train time

Structured SVM
SVMs w/ structured output [Tsochantaridis et al. ICML'04]
Max-margin MRF [Taskar et al. NIPS'03]

Binary SVM Solvers
Runtime (slowest to fastest): Non-Linear Kernel (LIBSVM, SVMlight) ≫ Cutting Plane (SVMperf; SVMstruct, O(Tn/(λε))) > SGD (PEGASOS) ≥ Sequential or Stochastic Dual (LIBLINEAR)
Cutting plane: quadratic to linear in trainset size.

Binary SVM Solvers
Runtime (slowest to fastest): Non-Linear Kernel (LIBSVM, SVMlight) ≫ Cutting Plane (SVMperf; SVMstruct, O(Tn/(λε))) > SGD (PEGASOS) ≥ Sequential or Stochastic Dual (LIBLINEAR)
Cutting plane: quadratic to linear in trainset size.
SGD: linear to independent in trainset size.

Binary SVM Solvers
Runtime (slowest to fastest): Non-Linear Kernel (LIBSVM, SVMlight) ≫ Cutting Plane (SVMperf; SVMstruct, O(Tn/(λε))) > SGD (PEGASOS) ≥ Sequential or Stochastic Dual (LIBLINEAR)
Sequential/stochastic dual methods: faster on multiple passes, can detect convergence, less sensitive to regularization/learning rate.

Structured SVM Solvers
Applied to SSVMs: Cutting Plane (SVMstruct) > SGD [Ratliff et al. AIStats'07] ≥ Sequential or Stochastic Dual [Shalev-Shwartz et al. JMLR'13]

Structured SVM Solvers
Notation: regularization λ, approximation factor ε, trainset size n, prediction time T.
Applied to SSVMs: Cutting Plane (SVMstruct), O(Tn/(λε)) > SGD [Ratliff et al. AIStats'07], O(T/(λε)) ≥ Sequential or Stochastic Dual [Shalev-Shwartz et al. JMLR'13], O(T/(λε)) or O(T(n + 1/λ)·log(1/ε))

Our Approach
- Use faster stochastic dual algorithms
- Incorporate an application-specific importance sampling routine
- Reduce train times when prediction time T is large
- Incorporate tricks people use for binary methods
[Diagram: random example → importance sample → maximize dual SSVM objective w.r.t. samples]

Our Approach
For t = 1, ... do
  Choose a random training example (X_i, Y_i)
  (Ȳ_1, Ȳ_2, ..., Ȳ_K) ← ImportanceSample(X_i, Y_i; w_{t-1})
  Approximately maximize the dual SSVM objective w.r.t. example i, evaluating 1 dot product per sample Ȳ_k
end
(Provably fast convergence for a simple approximate solver)
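In plain Python the loop might look as follows. This is a minimal sketch, not the paper's implementation: a single subgradient-style step stands in for the approximate dual maximization, and `importance_sample`, `psi`, and `delta` are hypothetical callables supplied per application.

```python
import numpy as np

def train_ssvm(examples, importance_sample, psi, delta,
               lr=0.1, lam=0.01, epochs=20, seed=0):
    """Stochastic structured-SVM training driven by importance sampling."""
    rng = np.random.default_rng(seed)
    w = np.zeros_like(psi(*examples[0]))
    for _ in range(epochs):
        for i in rng.permutation(len(examples)):
            X_i, Y_i = examples[i]
            # Application-specific routine proposes K candidate outputs.
            candidates = importance_sample(X_i, Y_i, w)
            # Scoring each candidate costs one dot product.
            Y_hat = max(candidates, key=lambda Y: w @ psi(X_i, Y) + delta(Y, Y_i))
            w *= 1.0 - lr * lam  # shrink from the L2 regularizer
            if w @ psi(X_i, Y_hat) + delta(Y_hat, Y_i) > w @ psi(X_i, Y_i):
                # Step toward satisfying the most violated margin constraint.
                w -= lr * (psi(X_i, Y_hat) - psi(X_i, Y_i))
    return w
```

For the cost-sensitive multiclass case, `importance_sample` can simply return all classes.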

Recent Papers w/ Similar Ideas
- Augmenting cutting plane SSVMs w/ m-best solutions: A. Guzman-Rivera, P. Kohli, D. Batra. "DivMCuts..." AISTATS'13.
- Applying stochastic dual methods to SSVMs: S. Lacoste-Julien, et al. "Block-Coordinate Frank-Wolfe..." JMLR'13.

Applying to New Problems
1. Define a loss function Δ(Y, Y_i)
2. Implement a feature extraction routine ψ(X, Y)
3. Implement an importance sampling routine

Applying to New Problems
3. Implement an importance sampling routine that:
- Is fast
- Favors samples w/ high loss-augmented score w^T ψ(X_i, Ȳ_k) + Δ(Ȳ_k, Y_i)
- Favors uncorrelated features: small ψ(X_i, Ȳ_j) · ψ(X_i, Ȳ_k)

Example: Object Detection
1. Loss function: Δ(Y, Y_gt) = 1 - area(Y ∩ Y_gt) / area(Y ∪ Y_gt)
2. Features: ψ(X, Y)
3. Importance sampling routine: add sliding-window scores & loss into a dense score map; greedy NMS
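The overlap term can be computed directly from (x, y, w, h) boxes. A small helper (hypothetical names), with the loss taken as 1 - IoU so that a perfect detection costs 0:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    x1 = max(a[0], b[0])
    y1 = max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def detection_loss(Y, Y_gt):
    """Delta(Y, Y_gt) = 1 - area(Y ∩ Y_gt) / area(Y ∪ Y_gt)."""
    return 1.0 - iou(Y, Y_gt)
```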

Example: Deformable Part Models
1. Loss function: Δ(Y, Y_i) = sum of part losses
2. Features: ψ(X, Y)
3. Importance sampling routine: dynamic programming; modified NMS to return a diverse set of poses

Cost-Sensitive Multiclass SVM
1. Loss function: class confusion cost, e.g., Δ(cat, ant) = 4
2. Features: e.g., bag-of-words
3. Importance sampling routine: return all classes; exact solution using 1 dot product per class
[Figure: confusion-cost matrix over classes cat, dog, ant, fly, car, bus]
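In this case loss-augmented inference is exact and cheap. A sketch, with `W` a class-by-feature weight matrix and `Delta` a confusion-cost matrix (both names illustrative):

```python
import numpy as np

def loss_augmented_inference(x, y_gt, W, Delta):
    """Return the most violating class: argmax_c [ w_c . x + Delta(c, y_gt) ].
    One dot product per class; exact, so no sampling is needed."""
    scores = W @ x + Delta[:, y_gt]
    return int(np.argmax(scores))
```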

Results: CUB-200-2011
- Pose mixture model, 312 part/pose detectors
- Occlusion/visibility model
- Tree-structured DPM w/ exact inference

Results: CUB-200-2011
[Plots: test error vs. train time for 5794 and 400 training examples]
- ~100x faster than mining hard negatives and SVMstruct
- 10-50x faster than stochastic sub-gradient methods
- Close to convergence after 1 pass through the training set

Results: ImageNet
- Comparison to other fast linear SVM solvers: faster than LIBLINEAR, PEGASOS
- Comparison to other methods for cost-sensitive SVMs: 50x faster than SVMstruct

Conclusion
Orders of magnitude faster than SVMstruct.
Publicly available software package:
- Fast algorithms for multiclass SVMs, DPMs
- API to adapt to new applications
- Supports datasets too large to fit in memory
- Network interface for online & active learning

Thanks!

Weaknesses
- Less easily parallelizable than methods based on 1-vs-all (although we do offer a multithreaded version)
- Focused on SVM-based learning algorithms