Stable Feature Selection: Theory and Algorithms


Stable Feature Selection: Theory and Algorithms
Presenter: Yue Han. Advisor: Lei Yu.
Ph.D. Dissertation Defense, 4/26/2012
Hello, everyone! Welcome to my dissertation defense. Let's get started! The topic of my dissertation is about …

Outline
Introduction and Motivation
Background and Related Work
Major Contributions
Publications
Theoretical Framework for Stable Feature Selection
Empirical Framework: Margin Based Instance Weighting
Empirical Study
  General Experimental Setup
  Experiments on Synthetic Data
  Experiments on Real-World Data
Conclusion and Future Work

Feature Selection Applications
Examples: pixel selection, gene selection, word selection. Feature selection is not only a preprocessing step that prepares data for mining tasks, but also a knowledge discovery tool for extracting valuable information from data. Biologists are interested in a subset of genes that explains an observed phenomenon (disease types or symptoms). Researchers in computer graphics are interested in a set of expressive pixels that captures human facial expressions. Natural language processing engineers are interested in a set of representative terms or words that supports a better understanding of a document (for instance, category words such as Sports, Travel, Politics, Tech, Artist, Life, Science, Internet, Business, Health, Elections).

Feature Selection from High-dimensional Data
Feature selection algorithms reduce high-dimensional data to low-dimensional data before learning models are built (p: number of features; n: number of samples; high-dimensional data has p >> n). The curse of dimensionality has effects on distance functions, on optimization and learning, and in Bayesian statistics. In the earlier examples the number of features is huge compared to the number of samples/instances, and conventional learning approaches lose effectiveness when applied directly to such high-dimensional data. Feature selection is therefore used to reduce the data dimensionality. Its benefits: alleviating the effect of the curse of dimensionality; enhancing generalization capability; speeding up the learning process; improving model interpretability. In short, it enables knowledge discovery on high-dimensional data.

Stability of Feature Selection
Applying a feature selection method to a training set yields a feature subset. If the training data varies (with all variations drawn from the same data space), are the resulting feature subsets consistent or not? The stability of feature selection is defined as the insensitivity of the result of a feature selection algorithm to variations in the training set. The stability of learning algorithms was first examined by Turney in 1995; the stability of feature selection was relatively neglected and has only recently attracted interest from researchers in data mining.
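A minimal sketch of how this notion can be made concrete, assuming a toy correlation-based scorer (select_top_k is an illustrative stand-in for any feature selection algorithm, not code from the dissertation): select a subset on several bootstrap resamples of the training set and measure how consistent the subsets are.

```python
import numpy as np

def select_top_k(X, y, k):
    """Toy score-based selector: rank features by absolute correlation with the label."""
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return set(np.argsort(scores)[::-1][:k])

def subset_consistency(X, y, k=10, n_draws=20, seed=0):
    """Average pairwise Jaccard similarity of the subsets selected on bootstrap
    resamples of the training set: a rough proxy for feature selection stability."""
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(n_draws):
        idx = rng.choice(len(y), size=len(y), replace=True)
        subsets.append(select_top_k(X[idx], y[idx], k))
    sims = [len(a & b) / len(a | b)
            for i, a in enumerate(subsets) for b in subsets[i + 1:]]
    return float(np.mean(sims))
```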

Motivation for Stable Feature Selection
Consider a sample space D and two training sets D1 and D2 drawn from it. Given unlimited sample size, the feature selection results from D1 and D2 are the same. Given limited sample size (n << p for high-dimensional data), the results from D1 and D2 differ: training data is usually a small subset of the sample space and cannot capture the underlying distribution, and increasing the sample size is neither practical nor feasible in application domains such as biology. Consistent results under limited sample size are therefore more convincing than unstable ones. One may argue: should we care about the stability of feature selection at all, as long as the predictive performance of the learning models built on the selected features is good enough, even if the features are completely different? Biologists do care: they want prediction accuracy and consistency of the feature subsets, confidence for biological validation, and biomarkers that explain the observed phenomena.

Feature Selection Methods
Traditionally, a feature selection method involves four steps: subset generation from the original feature set, evaluation of the goodness of the subset, a stopping criterion, and result validation. Subset generation is guided by different search strategies: complete search, sequential search, or random search. According to the evaluation criterion, feature selection approaches fall into three categories: the filter model, the wrapper model, and the embedded model. A great variety of feature selection algorithms have been proposed; representative algorithms include Relief, SFS, and MDLM (filter), FSBC, ELSA, and LVW (wrapper), and BBHFS and Dash-Liu's (embedded), among others.

Stable Feature Selection: Related Work
Comparison of feature selection algorithms w.r.t. stability (Davis et al., Bioinformatics, vol. 22, 2006; Kalousis et al., KAIS, vol. 12, 2007): quantify stability in terms of the consistency of selected subsets or feature weights; algorithms vary in stability while performing equally well for classification; choose the algorithm that is best in both stability and accuracy. Bagging-based ensemble feature selection (Saeys et al., ECML 2007): draw different bootstrapped samples of the same training set, apply a conventional feature selection algorithm to each, and aggregate the feature selection results (a sketch follows below). Group-based stable feature selection (Yu et al., KDD 2008; Loscalzo et al., KDD 2009): explore the intrinsic feature correlations, identify groups of correlated features, and select relevant feature groups. There exist very few studies on the stability issue of feature selection: earlier research focused on comparing the stability of different feature selection algorithms and on proposing new stability measures; in recent years the two latter approaches were proposed, exploring the sample space and the feature space respectively.
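A minimal sketch of the bagging-based ensemble idea, assuming a generic base_ranker(X, y) that returns one relevance score per feature; aggregation by average rank is one common choice, not necessarily the exact scheme of Saeys et al.

```python
import numpy as np

def ensemble_rank(X, y, base_ranker, n_bags=20, seed=0):
    """Bagging-based ensemble feature selection sketch: run a conventional ranker
    on bootstrapped samples and aggregate the per-bag rankings into a consensus
    ranking by average rank (lower aggregated rank = more relevant)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rank_sum = np.zeros(p)
    for _ in range(n_bags):
        idx = rng.choice(n, size=n, replace=True)
        scores = base_ranker(X[idx], y[idx])            # higher score = more relevant
        ranks = np.empty(p)
        ranks[np.argsort(scores)[::-1]] = np.arange(p)  # rank 0 = most relevant
        rank_sum += ranks
    return np.argsort(rank_sum)                         # consensus order, best first
```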

Margin Based Feature Selection
Another related line of research is margin based feature selection, which uses two types of margin. Sample margin: how far an instance can travel before it hits the decision boundary. Hypothesis margin: how far the hypothesis can travel before it hits an instance (the distance between the hypothesis and the opposite hypothesis of an instance). Representative algorithms based on the hypothesis margin include Relief-F, G-flip, and Simba. In these algorithms the margin is used for feature weighting or feature selection, which is a totally different use from the one in our study.
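For concreteness, a small sketch of the 1-NN hypothesis margin under Euclidean distance; the halving convention follows the usual Relief-style definition, and hypothesis_margin_1nn is an illustrative helper, not code from the dissertation.

```python
import numpy as np

def hypothesis_margin_1nn(X, y, i):
    """1-NN hypothesis margin of instance i: half the difference between the
    distance to its nearest miss (closest instance of the opposite class) and
    its nearest hit (closest instance of the same class)."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                                  # exclude the instance itself
    nearest_hit = d[y == y[i]].min()
    nearest_miss = d[y != y[i]].min()
    return 0.5 * (nearest_miss - nearest_hit)
```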

Publications
Yue Han and Lei Yu. Margin Based Sample Weighting for Stable Feature Selection. In Proceedings of the 11th International Conference on Web-Age Information Management (WAIM 2010), pages 680-691, Jiuzhaigou, China, July 15-17, 2010.
Yue Han and Lei Yu. A Variance Reduction Framework for Stable Feature Selection. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pages 205-215, Sydney, Australia, December 14-17, 2010.
Lei Yu, Yue Han and Michael E. Berens. Stable Gene Selection from Microarray Data via Sample Weighting. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 9, no. 1, pages 262-272, 2012.
Yue Han and Lei Yu. A Variance Reduction Framework for Stable Feature Selection. Statistical Analysis and Data Mining (SADM), accepted, 2012.

Bias-variance Decomposition of Feature Selection Error
Setting: training data D drawn from a data space, a feature selection result r(D), and a true feature selection result r*. As explained in the introduction, stability is the insensitivity of the feature selection result to variations in the training data, and the consistency of feature selection results across training sets is a straightforward way to measure it; these properties can be formulated within the decomposition. The bias-variance decomposition of feature selection error captures the relationship between accuracy (the opposite of loss) and stability (the opposite of variance), and suggests seeking a better trade-off between the bias and the variance of feature selection.

Bias, Variance and Error of a Monte Carlo Estimator
Feature selection (or feature weighting) can be viewed as a Monte Carlo estimator. If we randomly draw n training samples X, the relevance score of a feature is obtained by averaging its relevance scores over the different samples; this is exactly a Monte Carlo estimate. Most feature selection algorithms simply aggregate relevance scores across all samples, so feature selection can be treated as a Monte Carlo estimator, and error, bias and variance extend to it: writing r* for the true relevance score and r_hat for its estimate, the error E[(r_hat - r*)^2] decomposes into the squared bias (E[r_hat] - r*)^2 plus the variance E[(r_hat - E[r_hat])^2]. The impact factors are the feature selection algorithm and the sample size; increasing the sample size would reduce variance, but it is impractical and costly.
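A minimal sketch of how this decomposition can be checked empirically, assuming access to a sampler draw_training_set(rng) and a per-feature scorer relevance_score(X, y) (both hypothetical helpers), plus a true_score computed on a very large sample.

```python
import numpy as np

def mc_bias_variance(draw_training_set, relevance_score, true_score,
                     n_repeats=200, seed=0):
    """Empirical bias/variance/error of a feature's relevance score viewed as a
    Monte Carlo estimator: re-estimate the score on many independent training
    draws and compare against the known population-level score."""
    rng = np.random.default_rng(seed)
    est = np.array([relevance_score(*draw_training_set(rng)) for _ in range(n_repeats)])
    bias_sq = (est.mean() - true_score) ** 2
    variance = est.var()
    error = np.mean((est - true_score) ** 2)       # approx. bias_sq + variance
    return bias_sq, variance, error
```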

Variance Reduction via Sample Weighting
Fortunately, there are various variance reduction techniques, and importance sampling is one of them. Instead of drawing instances directly from the original probability density function, importance sampling draws more instances from important regions and fewer from other regions, according to an importance sampling function h(x). A good h(x) is hard to estimate directly, but its effect can be mimicked by instance weighting: increase the weights of instances from important regions and decrease the weights of instances from other regions.
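The importance sampling identity behind this argument, written out as a reminder; p(x) is the original density and h(x) the importance sampling function from the slide, while f denotes the quantity being averaged (a symbol added for this sketch).

```latex
\mu \;=\; \int f(x)\,p(x)\,dx
    \;=\; \int f(x)\,\frac{p(x)}{h(x)}\,h(x)\,dx
    \;\approx\; \frac{1}{n}\sum_{i=1}^{n} \frac{p(x_i)}{h(x_i)}\, f(x_i),
    \qquad x_i \sim h(x).
```

The ratio p(x_i)/h(x_i) plays exactly the role of an instance weight, which is why re-weighting instances can stand in for actually re-sampling them.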

Margin Based Instance Weighting for Stable Feature Selection: Overall Framework
Challenges: how to produce instance weights from the point of view of feature selection stability, and how to present the weighted instances to conventional feature selection algorithms.

Margin Vector Feature Space
Each instance in the original feature space is mapped into the margin vector feature space. For each instance, find its nearest hit (nearest neighbor of the same class) and nearest miss (nearest neighbor of the opposite class); the resulting hypothesis margin components capture the local profile of feature relevance over all features at that instance. Instances exhibit different profiles of feature relevance and therefore influence feature selection results differently.
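A sketch of the construction, assuming Euclidean distance and the per-feature margin component |x_j - nearmiss_j| - |x_j - nearhit_j| (the standard Relief-style decomposition; the dissertation's exact normalization may differ).

```python
import numpy as np

def margin_vector_space(X, y):
    """Map every instance into the margin vector feature space: component j of
    instance i is |x_ij - nearmiss(x_i)_j| - |x_ij - nearhit(x_i)_j|, i.e. the
    per-feature contribution to the 1-NN hypothesis margin."""
    M = np.zeros_like(X, dtype=float)
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                                   # skip the instance itself
        hit = np.argmin(np.where(y == y[i], d, np.inf))
        miss = np.argmin(np.where(y != y[i], d, np.inf))
        M[i] = np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return M
```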

An Illustrative Example
(a) Original Feature Space; (b) Margin Vector Feature Space.

Extension of the 1NN Hypothesis Margin
To reduce the effect of noise or outliers, the 1NN hypothesis margin is extended to the hypothesis margin of kNN (k > 1) and of weighted kNN (k > 1).
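A minimal sketch of the kNN variant, averaging over the k nearest hits and k nearest misses; the specific averaging is an assumption of this sketch, and a weighted kNN version would additionally weight the neighbours.

```python
import numpy as np

def hypothesis_margin_knn(X, y, i, k=5):
    """kNN extension of the hypothesis margin: average the distances to the k
    nearest hits and the k nearest misses, which damps the effect of noisy
    instances and outliers."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                                   # ignore the instance itself
    hits = np.sort(d[y == y[i]])[:k]
    misses = np.sort(d[y != y[i]])[:k]
    return 0.5 * (misses.mean() - hits.mean())
```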

Margin Based Instance Weighting Algorithm
Each instance exhibits a different profile of feature relevance and influences feature selection results differently. Recall that variance reduction via importance sampling draws more instances from important regions and fewer from other regions; instance weighting mimics this by assigning a lower weight to instances with a higher outlying degree and a higher weight to instances with a lower outlying degree (one plausible weighting scheme is sketched below).
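A rough sketch of one such weighting scheme in the margin vector feature space; the exponential mapping from outlying degree to weight is an assumption of this sketch, not necessarily the exact formula used in the dissertation.

```python
import numpy as np

def margin_based_instance_weights(M):
    """The outlying degree of an instance is its average distance to all other
    instances in the margin vector feature space M; instances with a high
    outlying degree (atypical relevance profiles) receive low weights."""
    n = len(M)
    pairwise = np.linalg.norm(M[:, None, :] - M[None, :, :], axis=2)
    outlying_degree = pairwise.sum(axis=1) / (n - 1)
    w = np.exp(-outlying_degree / outlying_degree.mean())   # assumed mapping
    return w * n / w.sum()                                  # mean weight = 1
```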

Iterative Margin Based Instance Weighting
Assumption: instances are equally important in the original feature space. Starting from the original feature space, compute the margin vector feature space and an initial set of instance weights; form the weighted feature space, recompute the margin vector feature space, and update the instance weights, repeating until the final instance weights are obtained. The iterative procedure converges fast, the learned weights change little between iterations, and overall it is a stable procedure.

Algorithm Illustration
Time complexity analysis: the cost is dominated by the instance weighting step, and the algorithm is efficient for high-dimensional data with a small sample size (n << d).

Objective of Empirical Study
To demonstrate the bias-variance decomposition of the theoretical framework; to verify the effectiveness of the proposed instance weighting framework for variance reduction; and to study the impact of variance reduction on the stability and predictive performance of the selected subsets. Stability of feature selection is again assessed by asking whether the feature subsets produced by a feature selection method under variations of the training data are consistent or not.

Algorithms in Comparison
Baseline SVM-RFE: recursively eliminate 10 percent of the remaining features at each iteration; linear kernel and default C parameter.
Instance Weighting SVM-RFE (IW SVM-RFE): the instance weight affects the error penalty and thus the choice of hyperplane.
Ensemble SVM-RFE (En SVM-RFE): 20 bootstrapped training sets; aggregate the different rankings into a final consensus ranking.
Baseline Relief-F: based on the hypothesis margin (produces feature weights); simply aggregates the kNN-based margin along each feature dimension.
Instance Weighting Relief-F (IW Relief-F): the instance weight affects the aggregated feature weights.
Ensemble Relief-F (En Relief-F): 20 bootstrapped training sets; aggregate the different rankings into a final consensus ranking.
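A sketch of the instance-weighted SVM-RFE variant using scikit-learn's SVC, where sample_weight scales the per-instance error penalty; the elimination schedule and stopping size follow the slide, while the rest is an illustrative implementation rather than the dissertation's code.

```python
import numpy as np
from sklearn.svm import SVC

def iw_svm_rfe(X, y, instance_weights, n_select=10, drop_frac=0.10, C=1.0):
    """Instance-weighted SVM-RFE sketch: at each iteration fit a linear SVM with
    per-instance weights (which shift the hyperplane), rank the remaining
    features by squared coefficient, and drop the lowest-ranked 10 percent.
    Binary classification is assumed."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_select:
        clf = SVC(kernel="linear", C=C)
        clf.fit(X[:, remaining], y, sample_weight=instance_weights)
        scores = clf.coef_.ravel() ** 2
        n_drop = min(max(1, int(len(remaining) * drop_frac)),
                     len(remaining) - n_select)
        for j in np.argsort(scores)[:n_drop]:
            remaining[j] = None                      # mark lowest-ranked features
        remaining = [f for f in remaining if f is not None]
    return remaining
```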

Stability Measures and Predictive Accuracy Measures
Stability of feature subsets: Jaccard Index, nPOGR, SIMv, Kuncheva Index. Stability of feature rankings: Spearman rank correlation coefficient. Stability of feature weightings: Pearson correlation coefficient. Predictive accuracy: CV accuracy (prediction accuracy based on cross-validation) and AUC accuracy (the area under the receiver operating characteristic (ROC) curve).
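For reference, a small sketch of the Kuncheva index, whose chance-correction term k^2/p is what makes the stability scores at large subset sizes interpretable later in the experiments; subsets are assumed to be equal-size collections of feature indices.

```python
import numpy as np
from itertools import combinations

def kuncheva_index(subsets, p):
    """Kuncheva's stability index: average pairwise consistency of equal-size
    feature subsets, corrected for the overlap expected by chance (k^2 / p),
    where k is the subset size and p the total number of features."""
    k = len(subsets[0])
    expected = k * k / p
    pairs = combinations(subsets, 2)
    vals = [(len(set(a) & set(b)) - expected) / (k - expected) for a, b in pairs]
    return float(np.mean(vals))
```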

Experiments on Synthetic Data
Synthetic data generation: feature values are drawn from two multivariate normal distributions; the covariance matrix is built from 10*10 blocks with elements 1 along the diagonal and 0.8 off the diagonal, giving 100 groups of 10 features each; the class label is a weighted sum of all feature values under an optimal feature weight vector. Training data: 100 instances, 50 drawn from each of the two distributions; evaluation uses leave-one-out and a test set of 5000 instances. Method in comparison: SVM-RFE, recursively eliminating 10% of the features of the previous iteration until 10 features remain. Measures: variance, bias and error; subset stability (Kuncheva Index); CV accuracy (SVM).
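A rough sketch of how such data might be generated; the slide gives the block covariance and the weighted-sum labelling rule, while the choice of which groups carry nonzero weight and the use of a sign threshold are assumptions of this sketch.

```python
import numpy as np
from scipy.linalg import block_diag

def make_synthetic(n_samples=100, n_groups=100, group_size=10, seed=0):
    """100 groups of 10 correlated features (within a group the covariance is 1
    on the diagonal and 0.8 off-diagonal) and a label given by the sign of a
    weighted sum of the feature values."""
    rng = np.random.default_rng(seed)
    p = n_groups * group_size
    block = np.full((group_size, group_size), 0.8) + 0.2 * np.eye(group_size)
    cov = block_diag(*[block] * n_groups)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n_samples)
    w_true = np.zeros(p)
    w_true[:5 * group_size] = 1.0        # assumed: the first 5 groups are relevant
    y = (X @ w_true > 0).astype(int)
    return X, y, w_true
```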

Experiments on Synthetic Data Observations: Error is equal to the sum of bias and variance for both versions of SVM-RFE; Error is dominated by bias during early iterations and is dominated by variance during later iterations; IW SVM-RFE exhibits significantly lower bias, variance and error than SVM-RFE when the number of remaining features approaches 50.

Experiments on Synthetic Data
Conclusion: variance reduction via margin based instance weighting yields a better bias-variance tradeoff, increased subset stability, and improved classification accuracy.

Experiments on Synthetic Data
Observations: the performance of SVM-RFE depends on the sample size, and instance weighting is effective at alleviating this dependency.

Experiments on Real-world Data
Data: microarray data sets. Methods in comparison: SVM-RFE, Ensemble SVM-RFE, and Instance Weighting SVM-RFE. Two evaluation settings are used. Setting 1: 10-time 10-fold cross-validation; measures are subset stability (Kuncheva Index) and CV accuracies (KNN, SVM). Setting 2: 100 bootstrapped training sets (random repetition), each split into 2/3 training data and 1/3 test data; measures are subset stability (nPOGR) and AUC accuracies (KNN, SVM).
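A minimal sketch of the first protocol, 10-time 10-fold cross-validation with a linear SVM as the classifier; in the real pipeline the feature selection method is re-run inside each training fold, a step omitted here for brevity.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def repeated_cv_accuracy(X, y, n_times=10, n_folds=10, seed=0):
    """Repeat stratified 10-fold cross-validation with different shuffles and
    average the accuracy of a linear SVM over all folds and repetitions."""
    accs = []
    for t in range(n_times):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + t)
        accs.extend(cross_val_score(SVC(kernel="linear"), X, y, cv=cv))
    return float(np.mean(accs))
```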

Experiments on Real-world Data
Observations: the methods are non-discriminative during the early iterations; for SVM-RFE the values rise sharply as the number of features approaches 10, while IW SVM-RFE shows a significantly slower rate of increase. Note: 40 iterations, starting from about 1000 features until 10 features remain.

Experiments on Real-world Data
Observations: both the ensemble and the instance weighting approaches improve stability consistently; the improvement from the ensemble approach is not as significant as that from instance weighting; as the number of selected features increases, the stability score decreases because of the larger chance-correction factor. Consistent results were obtained under the random repetition setting and are not included here.

Experiments on Real-world Data
Observations: instance weighting enables the selection of more genes with higher frequency, and it produces much bigger consensus gene signatures.

Experiments on Real-world Data
Conclusions: instance weighting improves the stability of feature selection without sacrificing prediction accuracy; it performs much better than the ensemble approach while being more efficient; and it achieves the significantly increased stability at only a slight extra cost in time. Consistent results were obtained under the random repetition setting (also with Relief-F); they can be found in the dissertation but are omitted here for conciseness.

Conclusion and Future Work
Contributions: a theoretical framework for stable feature selection; an empirical weighting framework for stable feature selection; effective and efficient margin based instance weighting approaches; an extensive study of the proposed theoretical and empirical frameworks; an extensive study of the proposed weighting approaches; and an extensive study of the effect of sample size on feature selection stability. Future work: explore other weighting approaches; study the relationship between feature selection and classification w.r.t. bias-variance properties.

Thank you and Questions?