BSP: An iterated local search heuristic for the hyperplane with minimum number of misclassifications Usman Roshan

Linear classifiers Popular methods such as the support vector machine, perceptron, and logistic regression are linear classifiers. Known to be fast and accurate (in combination with feature selection and feature learning). Empirical comparison of linear SVM: Caruana and Niculescu-Mizil, ICML 2006. Fast SVM implementation liblinear: Hsieh et al., ICML 2008.

Support vector machine Regularized hinge loss minimizer. Optimization objective: minimize over (w, w_0) the quantity (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(w^T x_i + w_0)). Convex and solved with a quadratic program.
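As a concrete reference, here is a minimal numpy sketch of this regularized hinge loss objective (the function and variable names are illustrative, not from the talk); labels are assumed to be in {-1, +1}:

```python
import numpy as np

def svm_objective(w, w0, X, y, C=1.0):
    """Regularized hinge loss: 0.5 * ||w||^2 + C * sum of hinge losses.
    X is an (n, d) data matrix, y holds labels in {-1, +1}."""
    margins = y * (X @ w + w0)
    hinge = np.maximum(0.0, 1.0 - margins).sum()
    return 0.5 * np.dot(w, w) + C * hinge
```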

Support vector machine (figure: separating hyperplane with weight vector w in the x-y plane)

NP-hard optimization Determining the plane with the minimum number of misclassifications is NP-hard (Ben-David, Eiron, and Long, JCSS, 2003). Nothing is known about error bounds. Objective (also known as 0/1 loss): minimize over (w, w_0) the number of misclassified points, i.e., sum_i 1[y_i(w^T x_i + w_0) < 0].

Modified objective The number of misclassifications is difficult to optimize directly with local search: the 0/1 loss is piecewise constant in (w, w_0), so most small moves do not change the objective value.

Modified objective Thus we consider a modified objective obtained by adding the hinge loss. This guides our search and helps it avoid getting trapped in local optima.
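A hedged sketch of this modified objective (0/1 error plus hinge loss), reusing the conventions of the SVM sketch above; the trade-off constant C is an illustrative choice, not a value from the talk:

```python
import numpy as np

def bsp_objective(w, w0, X, y, C=1.0):
    """Modified objective: misclassification count plus C times the hinge loss.
    The hinge term gives local search a gradient-like signal on the otherwise
    flat 0/1 landscape."""
    margins = y * (X @ w + w0)
    zero_one = int(np.sum(margins < 0))            # 0/1 loss: misclassified points
    hinge = np.maximum(0.0, 1.0 - margins).sum()   # hinge loss
    return zero_one + C * hinge
```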

Motivation for our objective Consider the objective value of error + hinge loss for the solid and dashed lines (in the figure). Now consider the objective value of the hinge loss alone for both lines.

Motivation for 0/1 loss From Nguyen and Sanner, ICML 2013

Our coordinate descent (local search)
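The slide itself does not spell out the update rule, so the following is only a sketch of one plausible coordinate-wise local search over (w, w_0) for the objective above; the step size and stopping rule are assumptions:

```python
import numpy as np

def coordinate_descent(w, w0, X, y, objective, step=0.1, max_sweeps=50):
    """Greedy coordinate-wise local search: perturb one coordinate of (w, w0)
    at a time and keep any move that lowers the objective. Step size and
    sweep limit are illustrative choices."""
    best = objective(w, w0, X, y)
    for _ in range(max_sweeps):
        improved = False
        for j in range(len(w) + 1):          # last index stands for the offset w0
            for delta in (step, -step):
                w_try, w0_try = w.copy(), w0
                if j < len(w):
                    w_try[j] += delta
                else:
                    w0_try += delta
                val = objective(w_try, w0_try, X, y)
                if val < best:
                    w, w0, best = w_try, w0_try, val
                    improved = True
        if not improved:                     # no coordinate move helped: local optimum
            break
    return w, w0, best
```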

Local search

Local minima A major problem when optimizing NP-hard problems. How to overcome them? – Random restarts – Simulated annealing – MCMC – Iterated local search

Iterated local search
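Iterated local search alternates local search with random perturbations of the incumbent solution, keeping a perturbed solution only if it improves the objective. A minimal sketch building on the coordinate_descent routine above; the perturbation scale and iteration count are assumptions:

```python
import numpy as np

def iterated_local_search(X, y, objective, n_iters=100, noise=0.5, seed=0):
    """Iterated local search: local search, then repeated perturb-and-search,
    accepting a new solution only when it beats the best found so far."""
    rng = np.random.default_rng(seed)
    w, w0 = rng.standard_normal(X.shape[1]), 0.0
    w, w0, best = coordinate_descent(w, w0, X, y, objective)
    for _ in range(n_iters):
        # Perturb the incumbent to escape its local optimum, then re-search.
        w_p = w + noise * rng.standard_normal(len(w))
        w0_p = w0 + noise * rng.standard_normal()
        w_p, w0_p, val = coordinate_descent(w_p, w0_p, X, y, objective)
        if val < best:
            w, w0, best = w_p, w0_p, val
    return w, w0, best
```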

Results Data: 33 randomly selected datasets from the UCI repository. Some were multi-class and were converted to binary (largest class separated from the others). We created ten random 90:10 splits for each dataset. All model parameters were learnt on the training data.
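One way to reproduce the ten random 90:10 splits with scikit-learn (a sketch; X and y are assumed to be the loaded feature matrix and labels for one dataset):

```python
from sklearn.model_selection import ShuffleSplit

# Ten random 90:10 train/test splits, matching the evaluation protocol.
splitter = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
for train_idx, test_idx in splitter.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit the classifier on the training split, evaluate on the test split
```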

Effect of iterated local search

Variants of BSP BSP: optimize our objective of error + hinge loss. BSPH: optimize hinge loss alone. BSPE: optimize error alone. BootstrapBSP: 10 bootstraps, 100 iterated local search iterations.

Effect of search parameters on BootstrapBSP (100 bootstraps)

Train and test error (breast cancer, 569 rows, 30 columns, PCC=0.82)

Train and test error (musk, 476 rows, 166 columns, PCC=0.29)

Train and test error (climate simulation, 540 rows, 18 columns, PCC=0.38)

Results on 33 datasets

Comparison to liblinear Average error across 33 datasets: – liblinear = 12.4% – BSP = 12.4% Average runtime across 33 datasets: – liblinear (including time for cross-validation) = 2 mins (median = 0.5 mins) – BSP (including time for selecting C and 100 bootstraps) = 11 mins (median = 5 mins) Bootstrapped libsvm has 12.3% error but takes considerably longer.
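For reference, the liblinear baseline with a cross-validated C can be run through scikit-learn's LinearSVC, which wraps liblinear; the C grid below is an assumption, not the grid used in the talk:

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# LinearSVC uses the liblinear solver; C is chosen by 5-fold cross-validation.
grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_train, y_train)
test_error = 1.0 - grid.score(X_test, y_test)
```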

Conclusion Nguyen and Sanner (ICML 2013) propose a branch-and-bound search for 0/1 loss. BSP: a local search heuristic to optimize 0/1 loss (the number of misclassifications). Our program has accuracy and runtime comparable to liblinear, one of the fastest and most accurate linear classifiers. Open questions: – Are there datasets where BSP would perform considerably better? – Can we add a second offset (w_0) parameter to account for hard datasets? – Is our search deep enough?