Stable Feature Selection: Theory and Algorithms

Presenter: Yue Han. Advisor: Lei Yu. Ph.D. Dissertation, 4/26/2012.
Hello, everyone! Welcome to my dissertation defense. Let's get started! The topic of my dissertation is …

Outline
Introduction and Motivation; Background and Related Work; Major Contributions; Publications; Theoretical Framework for Stable Feature Selection; Empirical Framework: Margin Based Instance Weighting; Empirical Study (General Experimental Setup, Experiments on Synthetic Data, Experiments on Real-World Data); Conclusion and Future Work.

Feature Selection Applications
Examples: pixel selection, gene selection, word selection.
Feature selection is not only a preprocessing step that prepares data for mining tasks, but also a knowledge discovery tool that extracts valuable information from data. Biologists are interested in a subset of genes that explains the observed phenomenon (disease types or symptoms). Researchers in computer graphics are interested in a set of expressive pixels that captures human facial expressions. Natural language processing engineers are interested in a set of representative terms or words that gives a better understanding of a document.
Topic labels from the word selection figure: Sports Travel Politics Tech Artist Life Science Internet Business Health Elections.

Feature Selection from High-dimensional Data
Diagram: Feature Selection Algorithms; Low-Dimensional Data; Learning Models; Knowledge Discovery on High-dimensional Data.
Notation: p is the number of features and n is the number of samples. High-dimensional data: p >> n.
Curse of dimensionality: effects on distance functions, in optimization and learning, and in Bayesian statistics.
From the examples, we can observe that the number of features is huge compared to the number of samples (instances). Conventional learning approaches lose their effectiveness when applied directly to high-dimensional data because of the curse of dimensionality. Instead of …, feature selection is used to reduce the data dimensionality; here is the … Feature selection can …
Feature selection: alleviates the effect of the curse of dimensionality; enhances generalization capability; speeds up the learning process; improves model interpretability.

Stability of Feature Selection
Diagram: a feature selection method is applied to several training data sets, each yielding a feature subset. Are the feature subsets consistent or not?
As we introduced, applying a feature selection method to the training data gives a feature subset. If the training data varies (with all variations drawn from the same data space), are the feature selection results consistent or not? Stability of feature selection, defined as … Stability of learning algorithm is studied … Unfortunately, stability of feature selection is under-addressed.
Stability of feature selection: the insensitivity of the result of a feature selection algorithm to variations in the training set.
Diagram: Training Data; Learning Algorithm; Learning Model. The stability of learning algorithms was first examined by Turney in 1995. The stability of feature selection was relatively neglected before and has only recently attracted interest from researchers in data mining.

Motivation for Stable Feature Selection
Diagram: Sample Space D; Training Data D1; Training Data D2.
Given unlimited sample size: feature selection results from D1 and D2 are the same.
Given limited sample size (n << p for high-dimensional data): feature selection results from D1 and D2 are different.
Training data is usually a subset of the sample space and cannot capture the underlying distribution. Increasing the sample size is neither practical nor feasible in application domains like biology. Consistent results on a limited sample size are therefore more convincing than unstable results. People may argue: should we care about the stability of feature selection, as long as the predictive performance of learning models built on the selected features is good enough, even if the features are completely different? Biologist cares about …
Biologists care about: prediction accuracy and consistency of feature subsets; confidence for biological validation; biomarkers to explain the observed phenomena.

Feature Selection Methods
Flowchart: Original set; Subset Generation; Subset; Subset Evaluation; Goodness of subset; Stopping Criterion (no: back to Subset Generation; yes: Result Validation).
Traditionally, a feature selection method involves four steps. The subset generation is guided by different search strategies. Feature selection approaches are divided into three categories according to the evaluation criterion. A great variety of feature selection algorithms have been proposed, such as …
Search strategies: complete search; sequential search; random search.
Evaluation criteria: filter model; wrapper model; embedded model.
Representative algorithms (listed in the same order as the models): Relief, SFS, MDLM, etc.; FSBC, ELSA, LVW, etc.; BBHFS, Dash-Liu's, etc.

Stable Feature Selection
Comparison of feature selection algorithms w.r.t. stability (Davis et al., Bioinformatics, vol. 22, 2006; Kalousis et al., KAIS, vol. 12, 2007): quantify stability in terms of the consistency of the selected subsets or of the feature weights; algorithms vary in stability while performing equally well for classification; choose the best algorithm in terms of both stability and accuracy.
Bagging-based ensemble feature selection (Saeys et al., ECML07): draw different bootstrapped samples of the same training set; apply a conventional feature selection algorithm to each; aggregate the feature selection results.
Group-based stable feature selection (Yu et al., KDD08; Loscalzo et al., KDD09): explore the intrinsic feature correlations; identify groups of correlated features; select relevant feature groups.
There are very few studies on the stability of feature selection. Earlier research focused on comparing the stability of different feature selection algorithms and on proposing new stability measures. In recent years, two approaches have been proposed that explore the sample space and the feature space, respectively.

Margin based Feature Selection
Another related line of research is margin based feature selection. Two types of margin …
Sample margin: how far an instance can travel before it hits the decision boundary.
Hypothesis margin: how far the hypothesis can travel before it hits an instance (the distance between the hypothesis and the opposite hypothesis of an instance).
Representative algorithms based on the hypothesis margin: Relief-F, G-flip, Simba, etc. In these algorithms the margin is used for feature weighting or feature selection, which is totally different from its use in our study.

Publications
Yue Han and Lei Yu. Margin Based Sample Weighting for Stable Feature Selection. In Proceedings of the 11th International Conference on Web-Age Information Management (WAIM2010), Jiuzhaigou, China, July 15-17, 2010.
Yue Han and Lei Yu. A Variance Reduction Framework for Stable Feature Selection. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM2010), Sydney, Australia, December 14-17, 2010.
Lei Yu, Yue Han and Michael E. Berens. Stable Gene Selection from Microarray Data via Sample Weighting. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 9, no. 1, 2012.
Yue Han and Lei Yu. A Variance Reduction Framework for Stable Feature Selection. Statistical Analysis and Data Mining (SADM), accepted, 2012.

Bias-variance Decomposition of Feature Selection Error
Notation: training data D; data space; feature selection result r(D); true feature selection result r*.
As I explained in the introduction, the stability of feature selection is defined as … How, then, do we measure the stability of a feature selection algorithm? The consistency of feature selection results is a straightforward way. These three properties can be formulated as the …
Bias-variance decomposition of feature selection error: shows the relationship between accuracy (the opposite of loss) and stability (the opposite of variance); suggests a better trade-off between the bias and variance of feature selection.
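The slide's own equation is not preserved in this transcript. As a worked form of the decomposition it names, the following is a sketch under the assumption of squared loss, with the expectation taken over training sets D drawn from the data space. Only D, r(D), and r* come from the slide; the rest of the notation is assumed.

```latex
\mathbb{E}_{D}\big[(r(D) - r^{*})^{2}\big]
  = \underbrace{\big(\mathbb{E}_{D}[r(D)] - r^{*}\big)^{2}}_{\text{bias}}
  + \underbrace{\mathbb{E}_{D}\big[(r(D) - \mathbb{E}_{D}[r(D)])^{2}\big]}_{\text{variance}}
```

The cross term vanishes because the expected deviation of r(D) from its own mean is zero. Under this form, lowering the variance term without raising the bias term lowers the total error, which is the trade-off the slide refers to.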

Bias, Variance and Error of Monte Carlo Estimator
Feature selection (weighting) can be viewed as a Monte Carlo estimator. Quantities defined on the slide: relevance score; Monte Carlo estimator; error; bias; variance.
If we randomly pick n samples of training data X, the relevance score is the average of the relevance scores from the different samples. This is a Monte Carlo estimator. Most feature selection algorithms simply aggregate the relevance scores across all samples, so feature selection can be considered a Monte Carlo estimator. Extend to bias, variance, …
Impact factors: the feature selection algorithm and the sample size. Increasing the sample size is impractical and costly.
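To make the dependence on sample size concrete, here is a minimal simulation sketch. It assumes a toy per-sample relevance score drawn from a normal distribution; `mc_estimate`, `estimator_variance`, and the distribution parameters are hypothetical and not taken from the dissertation.

```python
import random
import statistics

def mc_estimate(n, rng):
    """Average n per-sample scores (toy scores: normal, mean 1.0, std 2.0)."""
    return sum(rng.gauss(1.0, 2.0) for _ in range(n)) / n

def estimator_variance(n, repeats, seed=0):
    """Empirical variance of the Monte Carlo estimate over repeated draws."""
    rng = random.Random(seed)
    estimates = [mc_estimate(n, rng) for _ in range(repeats)]
    return statistics.pvariance(estimates)

# The estimator's variance shrinks roughly like 1/n as the sample size grows.
var_small = estimator_variance(n=20, repeats=500)
var_large = estimator_variance(n=200, repeats=500)
print(var_small, var_large)
```

When n cannot be increased, as in the microarray setting, the variance has to be reduced by other means, which motivates the sample weighting discussed next.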

Variance Reduction via Sample Weighting
Probability density function; importance sampling; a good importance sampling function h(x).
Fortunately, there are different variance reduction techniques, and importance sampling is one of them. h(x) is hard to estimate but …
Importance sampling: more instances are drawn from important regions; fewer instances are drawn from other regions.
Instance weighting: increase the weights of instances from important regions; decrease the weights of instances from other regions.
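The variance reduction idea can be illustrated with a textbook example that is unrelated to feature selection itself: estimating the tail probability P(X > 3) for a standard normal X. The sketch assumes a proposal h(x) that is a unit-variance normal centered at the threshold; all names and parameter values are illustrative assumptions.

```python
import math
import random
import statistics

THRESHOLD = 3.0

def naive_estimate(n, rng):
    """Draw from p = N(0, 1) and count how often x exceeds the threshold."""
    return sum(1.0 for _ in range(n) if rng.gauss(0.0, 1.0) > THRESHOLD) / n

def importance_estimate(n, rng):
    """Draw from h = N(THRESHOLD, 1) and reweight each draw by p(x) / h(x)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(THRESHOLD, 1.0)
        if x > THRESHOLD:
            # Density ratio of N(0, 1) to N(THRESHOLD, 1) evaluated at x.
            total += math.exp(-THRESHOLD * x + THRESHOLD ** 2 / 2.0)
    return total / n

rng = random.Random(0)
naive_runs = [naive_estimate(1000, rng) for _ in range(50)]
weighted_runs = [importance_estimate(1000, rng) for _ in range(50)]
print(statistics.pvariance(naive_runs), statistics.pvariance(weighted_runs))
```

Both estimators target the same quantity, but the reweighted one places its draws in the region that matters and so varies far less between runs. Instance weighting applies the same principle when new draws are not available: the existing instances are reweighted instead of resampled.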

Margin Based Instance Weighting for Stable Feature Selection
Overall framework.
Challenges: how to produce weights for instances from the point of view of feature selection stability; how to present weighted instances to conventional feature selection algorithms.

Margin Vector Feature Space
Diagram: each instance in the original feature space, together with its nearest hit and nearest miss, is mapped to a point in the margin vector feature space.
Hypothesis margin: for each instance, the margin vector captures the local profile of feature relevance for all features at that instance.
Instances exhibit different profiles of feature relevance; instances influence feature selection results differently.
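The slide's formula is not preserved in this transcript. The sketch below assumes the common per-feature form of the 1NN hypothesis margin, |x_j - miss_j| - |x_j - hit_j|, with nearest neighbors found by L1 distance; the function names and the toy data are hypothetical.

```python
def manhattan(a, b):
    """L1 distance between two feature vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))

def margin_vectors(X, y):
    """Map each instance to its margin vector (assumed per-feature form)."""
    vectors = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Nearest hit: closest other instance with the same label.
        hits = [x for j, (x, lab) in enumerate(zip(X, y)) if j != i and lab == yi]
        # Nearest miss: closest instance with a different label.
        misses = [x for x, lab in zip(X, y) if lab != yi]
        hit = min(hits, key=lambda x: manhattan(xi, x))
        miss = min(misses, key=lambda x: manhattan(xi, x))
        vectors.append([abs(a - m) - abs(a - h) for a, h, m in zip(xi, hit, miss)])
    return vectors

# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[0.0, 0.0], [0.1, 1.0], [1.0, 0.2], [1.1, 0.9]]
y = [0, 0, 1, 1]
M = margin_vectors(X, y)
print(M[0])
```

For the first instance the margin is positive along feature 0 and negative along feature 1, which is the kind of local relevance profile the slide describes.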

An Illustrative Example
(a) Original Feature Space (b) Margin Vector Feature Space.

Extension for Hypothesis Margin of 1NN
To reduce the effect of noise or outliers, the hypothesis margin of 1NN is extended to: the hypothesis margin of kNN (k > 1); weighted kNN (k > 1).

Margin Based Instance Weighting Algorithm
Each instance exhibits a different profile of feature relevance and influences feature selection results differently.
Instance weighting: higher outlying degree, lower weight; lower outlying degree, higher weight.
Review, variance reduction via importance sampling: more instances are drawn from important regions; fewer instances are drawn from other regions.
Quantities defined on the slide: weighting; outlying degree.
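The exact weighting and outlying degree formulas are not preserved in this transcript. The following sketch only illustrates the stated rule (higher outlying degree, lower weight) under two assumptions: the outlying degree of an instance is its average Euclidean distance to all other margin vectors, and weights are normalized inverse outlying degrees. Function names and data are hypothetical.

```python
import math

def outlying_degrees(margin_vecs):
    """Average distance from each margin vector to all the others (assumed form)."""
    n = len(margin_vecs)
    return [
        sum(math.dist(v, u) for j, u in enumerate(margin_vecs) if j != i) / (n - 1)
        for i, v in enumerate(margin_vecs)
    ]

def instance_weights(margin_vecs, eps=1e-12):
    """Normalized inverse outlying degree; eps guards against division by zero."""
    inverse = [1.0 / (d + eps) for d in outlying_degrees(margin_vecs)]
    total = sum(inverse)
    return [v / total for v in inverse]

# Three similar margin vectors and one outlier (toy values).
vecs = [[0.9, -0.8], [1.0, -0.7], [0.8, -0.9], [-2.0, 3.0]]
weights = instance_weights(vecs)
print(weights)
```

The outlying instance receives the smallest weight, so it contributes least when relevance scores are aggregated.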

Iterative Margin Based Instance Weighting
Assumption: instances are equally important in the original feature space.
Procedure: original feature space, then margin vector feature space, then instance weights, then weighted feature space; the margin vector feature space and the updated instance weights are recomputed in each round until the final instance weights are obtained.
The iterative procedure always converges fast; there is little difference in terms of the learned weights; overall, it is a stable procedure.

Algorithm Illustration
Time complexity analysis: dominated by instance weighting; efficient for high-dimensional data with small sample size (n << d).

Objective of Empirical Study
Objectives: to demonstrate the bias-variance decomposition in the theoretical framework; to verify the effectiveness of the proposed instance weighting framework for variance reduction; to study the impact of variance reduction on the stability and predictive performance of the selected subsets.
Diagram (stability of feature selection): training data; feature selection method; feature subset; consistent or not?

Algorithms in Comparison
SVM-RFE (baseline algorithm): recursively eliminates 10 percent of the remaining features at each iteration; linear kernel and default C parameter.
IW SVM-RFE (instance weighting SVM-RFE): the instance weight affects the error penalty and thus the choice of hyperplane.
En SVM-RFE (ensemble SVM-RFE): 20 bootstrapped training sets; aggregates the different rankings into a final consensus ranking.
Relief-F (baseline algorithm): hypothesis margin (produces feature weights); simply aggregates the kNN-based margin along each feature dimension.
IW Relief-F (instance weighting Relief-F): the instance weight affects the aggregated feature weight.
En Relief-F (ensemble Relief-F): 20 bootstrapped training sets; aggregates the different rankings into a final consensus ranking.
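As an illustration of how an instance weight can affect the aggregated feature weight in a Relief-style method, the sketch below sums per-instance margins along each feature dimension, once uniformly and once with instance weights. It is a simplified stand-in, not the dissertation's IW Relief-F; the function name and the toy values are hypothetical.

```python
def aggregate_feature_weights(margin_vecs, instance_weights=None):
    """Weighted sum of per-instance margins along each feature dimension."""
    n = len(margin_vecs)
    if instance_weights is None:
        # Baseline behaviour: every instance counts equally.
        instance_weights = [1.0 / n] * n
    n_features = len(margin_vecs[0])
    return [
        sum(w * v[j] for w, v in zip(instance_weights, margin_vecs))
        for j in range(n_features)
    ]

# Two typical instances and one outlying instance (toy margin vectors).
margins = [[1.0, 0.0], [1.0, 0.0], [-3.0, 4.0]]
uniform = aggregate_feature_weights(margins)
weighted = aggregate_feature_weights(margins, [0.45, 0.45, 0.10])
print(uniform, weighted)
```

With uniform weights the single outlying instance makes feature 1 look more relevant than feature 0; down-weighting it reverses the ranking, which shows how sensitive the result can be to one training instance.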

Stability Measures, Predictive Accuracy Measures
Stability measures. Feature subset: Jaccard index; nPOGR; SIMv; Kuncheva index. Feature ranking: Spearman coefficient. Feature weighting: Pearson correlation coefficient.
Predictive accuracy measures. CV accuracy: prediction accuracy based on cross-validation. AUC accuracy: the area under the receiver operating characteristic (ROC) curve.
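Two of the subset measures listed above are easy to state in code. The sketch uses the standard definitions: the Jaccard index is the intersection size over the union size, and the Kuncheva index for two subsets of equal size k out of p features is (r p - k^2) / (k (p - k)), where r is the size of the intersection. Function names are hypothetical, and the slide's own formulas are not preserved in this transcript.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def kuncheva_index(a, b, n_features):
    """Consistency of two equal-size subsets, corrected for chance overlap."""
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k or not 0 < k < n_features:
        raise ValueError("subsets must share a size k with 0 < k < n_features")
    r = len(a & b)
    return (r * n_features - k * k) / (k * (n_features - k))

subset_1 = {1, 2, 3}
subset_2 = {2, 3, 4}
print(jaccard_index(subset_1, subset_2))
print(kuncheva_index(subset_1, subset_2, n_features=10))
```

The k^2 term in the Kuncheva index is the correction for overlap expected by chance, which grows with the subset size.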

Experiments on Synthetic Data
Synthetic data generation. Feature values: drawn from two multivariate normal distributions; the covariance matrix is a 10 x 10 square matrix with 1 along the diagonal and 0.8 off the diagonal; 100 groups with 10 features each. Class label: a weighted sum of all feature values with the optimal feature weight vector.
500 training data sets: 100 instances each, with 50 from each of the two distributions. Leave-one-out. Test data: 5000 instances.
Method in comparison: SVM-RFE, which recursively eliminates 10% of the features remaining from the previous iteration until 10 features remain.
Measures: variance, bias, error; subset stability (Kuncheva index); CV accuracy (SVM).
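The elimination schedule described for SVM-RFE can be sketched as follows. The rounding rule (round down, but always drop at least one feature) is an assumption, as is the function name; the number of iterations depends on that rounding choice.

```python
def rfe_schedule(n_features, n_final=10, drop_fraction=0.10):
    """Feature counts after each elimination round, down to n_final."""
    sizes = [n_features]
    while sizes[-1] > n_final:
        current = sizes[-1]
        # Assumed rounding: floor of 10%, but remove at least one feature.
        n_drop = max(1, int(current * drop_fraction))
        sizes.append(max(n_final, current - n_drop))
    return sizes

# 100 groups of 10 features each gives 1000 features in the synthetic data.
schedule = rfe_schedule(1000)
print(schedule[:5], schedule[-1])
```

Each entry of the schedule is a subset size at which bias, variance, and stability can be measured.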

Observations: Error is equal to the sum of bias and variance for both versions of SVM-RFE; Error is dominated by bias during early iterations and is dominated by variance during later iterations; IW SVM-RFE exhibits significantly lower bias, variance and error than SVM-RFE when the number of remaining features approaches 50.

Conclusion: variance reduction via margin based instance weighting leads to a better bias-variance tradeoff, increased subset stability, and improved classification accuracy.

Observations: the performance of SVM-RFE depends on the sample size; instance weighting is effective in alleviating this dependency.

Experiments on Real-world Data
Microarray data.
Methods in comparison: SVM-RFE; Ensemble SVM-RFE; Instance Weighting SVM-RFE.
Setting 1: 10-time 10-fold cross-validation, with training data and test data taken from each fold. Measures: subset stability (Kuncheva index); CV accuracies (KNN, SVM).
Setting 2: 100 bootstrapped (random repetition) splits, with 2/3 training data and 1/3 test data. Measures: subset stability (nPOGR); AUC accuracies (KNN, SVM).

Observations: non-discriminative during early iterations; SVM-RFE increases sharply as the number of features approaches 10; IW SVM-RFE shows a significantly slower rate of increase. Note: 40 iterations, starting from about 1000 features until 10 features remain.

Observations: both the ensemble and the instance weighting approaches improve stability consistently; the improvement from the ensemble approach is not as significant as that from instance weighting; as the number of features increases, the stability score decreases because of the larger correction factor. Consistent results were observed under the random repetition setting and are not included here.

Observations: instance weighting enables the selection of more genes with higher frequency; instance weighting produces much bigger consensus gene signatures.

Conclusions: instance weighting improves the stability of feature selection without sacrificing prediction accuracy; it performs much better than the ensemble approach and is more efficient; it leads to significantly increased stability at a slight extra cost in time. Consistent results were observed under the random repetition setting (and also for ReliefF); they can be found in the dissertation but are not included here for conciseness.

Conclusion and Future Work
Theoretical Framework for Stable Feature Selection; Empirical Weighting Framework for Stable Feature Selection; Effective and Efficient Margin Based Instance Weighting Approaches; Extensive Study on Proposed Theoretical And Empirical Frameworks; Extensive Study on Proposed Weighting Approaches; Extensive Study on Sample Size Effect on Feature Selection Stability. Future Work: Explore Other Weighting Approaches; Study the Relationship Between Feature Selection and Classification w.r.t. Bias-Variance Properties.

Thank you and Questions?
