Feature/Model Selection by Linear Programming SVM, Combined with State-of-Art Classifiers: What Can We Learn About the Data. Erinija Pranckeviciene, Ray Somorjai, Institute for Biodiagnostics, NRC Canada.

Outline of the presentation
- Description of the algorithm
- Results on the Agnostic Learning vs. Prior Knowledge (AL vs. PK) challenge datasets
- Conclusions

Motivation to enter the Challenge For small-sample-size, high-dimensional datasets, a feature selection procedure tends to adapt to the peculiarities of the training set (sample bias). An ideal model selection procedure would produce stable estimates of the classification error rate, and the identities of the discovered features would not vary much across different random splits. Our experiments with the Linear Programming SVM (LP-SVM) on biomedical datasets produced results robust to sample bias, exhibiting exactly this property. We decided to test LP-SVM's robustness in a controlled experiment: the independent platform of the AL vs. PK challenge.

Classification with LP-SVM The formulation of LP-SVM (known as Liknon; Bhattacharyya et al.) is very similar to that of the conventional linear SVM, except for the objective function, which is linear because the regularization term uses the L1 norm. The solution of the LP-SVM is a linear discriminant whose weight magnitudes identify the original features that are important for class discrimination. Different values of the regularization parameter C in the optimization problem produce different discriminants.
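For reference, the usual Liknon/LP-SVM primal has the following form (a standard reconstruction, since the slide itself carries no formula); w is the weight vector, b the bias, and the xi_i are slack variables:

\min_{w,\,b,\,\xi} \; \|w\|_1 + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, N.

Writing w = u - v with u, v >= 0 turns this into a linear program; the L1 penalty drives most weights to exactly zero, so the surviving nonzero entries of w identify the discriminative features.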

Outline of the algorithm 1) The available training data are processed in 10-fold stratified cross-validation (preserving the existing class proportions): 9/10 of the data are used for training, 1/10 for independent testing. 2) The training portion is split randomly into a balanced training set and an unbalanced monitoring set. 3) We perform 31 such random splits; a sketch of the protocol follows.
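A minimal sketch of this splitting protocol, using scikit-learn and NumPy. The exact size of the balanced training set is not stated on the slides, so the half-of-the-minority-class rule below is an assumption:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def balanced_split(X, y, rng):
    """Draw a class-balanced training set; the remaining samples
    form the (unbalanced) monitoring set."""
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min() // 2  # assumed sizing rule
    train_idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), n_per_class, replace=False)
        for c in classes
    ])
    monitor_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, monitor_idx

# X, y: data matrix and +/-1 labels, assumed already loaded
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
rng = np.random.default_rng(0)
for fold, (cv_train, cv_test) in enumerate(skf.split(X, y)):
    for split in range(31):  # 31 random splits per fold
        tr, mon = balanced_split(X[cv_train], y[cv_train], rng)
        # ...train LP-SVM models on tr and monitor their BER on mon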

Evolution of the models 4) In every split, the training set is used to find several LP-SVM discriminants, determined by a sequence of values of the regularization parameter C; increasing C increases the number of selected features. 5) The balanced error rate (BER) of every discriminant is estimated on the monitoring set. 6) The discriminant (model) with the smallest monitoring BER is retained, as in the sketch below.
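The slides give no code, so the following sketch implements the LP-SVM directly as a linear program with scipy.optimize.linprog (splitting w into nonnegative parts u - v) and retains the model with the lowest monitoring BER; the function names and the choice of C grid are ours:

import numpy as np
from scipy.optimize import linprog

def fit_lpsvm(X, y, C):
    """L1-norm SVM (Liknon-style) as a linear program.
    Variables z = [u (d), v (d), b+, b-, xi (n)], all nonnegative; w = u - v."""
    n, d = X.shape
    Yx = y[:, None] * X  # rows y_i * x_i
    c = np.r_[np.ones(2 * d), 0.0, 0.0, C * np.ones(n)]  # ||w||_1 + C*sum(xi)
    # y_i (w.x_i + b) >= 1 - xi_i  rewritten as  A_ub @ z <= -1
    A_ub = np.hstack([-Yx, Yx, -y[:, None], y[:, None], -np.eye(n)])
    z = linprog(c, A_ub=A_ub, b_ub=-np.ones(n), method="highs").x
    w = z[:d] - z[d:2 * d]
    b = z[2 * d] - z[2 * d + 1]
    return w, b

def ber(y_true, y_pred):
    """Balanced error rate: average of the per-class error rates."""
    return np.mean([np.mean(y_pred[y_true == c] != c) for c in np.unique(y_true)])

def best_model(X_tr, y_tr, X_mon, y_mon, C_grid):
    models = [(C, *fit_lpsvm(X_tr, y_tr, C)) for C in C_grid]
    scores = [ber(y_mon, np.sign(X_mon @ w + b)) for _, w, b in models]
    i = int(np.argmin(scores))
    return models[i], scores[i]

Larger C penalizes the slacks more heavily relative to ||w||_1, so more weights become nonzero, reproducing the "increasing C increases the number of features" behaviour described above.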

Example of the evolution of the models on synthetic data

Feature profiles and other classifiers 7) In a single fold, a feature profile is derived by counting how frequently each feature is included in the best-BER discriminants (there are 31 of them, one per split). 8) The result is an ensemble of linear discriminants operating on the selected features; the feature profile is also tested with other classifiers. Several thresholds on the frequency of inclusion were examined for the different datasets, to test state-of-the-art classification rules such as k-NN, the Fisher discriminant, etc. A sketch follows.
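A sketch of the profile construction, assuming best_weights holds the 31 best-BER weight vectors produced per fold by the previous sketch (variable names and the 50% threshold are illustrative; the slides report dataset-specific thresholds):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def feature_profile(weight_vectors, eps=1e-8):
    """Fraction of the best-BER discriminants in which each feature
    carries a nonzero weight."""
    W = np.abs(np.asarray(weight_vectors)) > eps
    return W.mean(axis=0)

profile = feature_profile(best_weights)    # best_weights: 31 x d array
selected = np.flatnonzero(profile >= 0.5)  # e.g. Th = 50%

# test the selected features with a state-of-the-art rule, here k-NN
knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr[:, selected], y_tr)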

Final model selection 9) The performance of all competing models derived in a single fold is estimated by BER on the independent test set, giving 10 estimates. 10) The final model is selected out of the 10 estimated models. 11) The identities of the features occurring in all profiles can also be examined separately.

Experimental setup: algorithmic parameters for the AL vs. PK datasets
T1+T2 - size of the training set
M1+M2 - size of the monitoring set
V1+V2 - size of the validation set
Dim - dimensionality of the data
Models - number of models tested
Th - threshold on the frequency of inclusion of a feature in the feature profile

ADA results Features occurring in 100% of the profiles: 2, 8, 9, 18, 20, 24, 30. Test errors of the last three submissions (Th 55%): numeric values lost from the slide.

GINA results Features occurring in more than 85% of the profiles: 367, 815, 510, 648, 424. Last three submissions: knn1 0.060, knn3 0.058, ensemble value lost from the slide; remaining test errors (Th 50%) also lost.

HIVA results Features occurring in 90% of the profiles: list lost from the slide. Test errors of the last three submissions (Th 20%) and the best-entry figures: values lost from the slide.

NOVA results Features occurring in 100% of the profiles: list lost from the slide. Test errors of the last three submissions (Th 80%) and the best-entry figures: values lost from the slide.

SYLVA results Features occurring in 100% of the profiles: 202, 55. Test errors of the last three submissions (Th 20%): values lost from the slide.

Determination of C values Given N1 and N2 measurements x_ik of an individual feature k in the two classes, the C value at which the feature becomes active can be computed directly (the equation is reconstructed below). Sort the C values corresponding to the d features in ascending order and solve a model for each. The idea comes from analysis of the dual problem. With many features, many models would have to be solved; this is computationally infeasible, so the C values have to be condensed.
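The equation itself did not survive the transcript. From the dual analysis the slide refers to, the per-feature activation value is usually written as the reciprocal of the absolute difference of the class means of the feature (a reconstruction, possibly up to a normalization constant):

C_k \;=\; \frac{1}{\left|\,\frac{1}{N_1}\sum_{i \in \omega_1} x_{ik} \;-\; \frac{1}{N_2}\sum_{i \in \omega_2} x_{ik}\,\right|}

Features whose class means differ strongly thus become active at small C, which is what makes sorting the C_k values meaningful.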

Different ways of condensing C The challenge submissions differed in how the C values were chosen. Initially, a histogram of the per-feature C values was used; based on the final ranking, this method worked better for HIVA and NOVA. In the last submissions, the rate of change of a slightly modified primal objective function was used instead; this worked better for ADA, GINA and SYLVA. We are still looking for a less heuristic, more precise method. A sketch of the histogram variant follows.
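A sketch of the histogram-based condensing mentioned above; binning on a log scale and the bin count are our assumptions:

import numpy as np

def condense_C(C_values, n_bins=20):
    """Reduce the d per-feature C values to a few representatives by
    histogramming them on a log scale and keeping the centres of the
    nonempty bins."""
    counts, edges = np.histogram(np.log10(C_values), bins=n_bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return 10.0 ** centres[counts > 0]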

Conclusions The main advantages of our method are its simplicity and the interpretability of its results; the disadvantage is a high computational burden. Ensembles tend to perform better than individual rules, except on GINA. The same feature identities were consistently discovered across all splits and folds. The derived feature identities still have to be compared with the ground truth in the Prior Knowledge track. Some arbitrariness, unavoidable in this experiment, will be addressed in future work: the threshold on the feature profile, the numbers of samples for training and monitoring, the number of splits, and the number of models.

Many thank’s To Muoi Tran for discussions and support, For your attention!