Variance Reduction for Stable Feature Selection
Yue Han and Lei Yu, Binghamton University

- Introduction, Motivation and Related Work
- Theoretical Framework
- Empirical Framework: Margin Based Instance Weighting
- Empirical Study
  - Synthetic Data
  - Real-world Data
- Conclusion and Future Work

[Figure: examples of high-dimensional data. A document-term matrix with documents D1...DM as samples, terms T1...TN as features, and class labels such as Sports, Travel, Jobs; similarly, samples vs. features (genes or proteins) in microarray data, and pixels vs. features in image data.]

Stability of feature selection: the insensitivity of the result of a feature selection algorithm to variations in the training set. It is analogous to the stability of a learning algorithm, i.e., how much the learned model changes as the training data changes. Stability of feature selection was relatively neglected in the past and has recently attracted interest from researchers in data mining.
[Figure: the stability issue of feature selection. Training sets D1, D2, ..., Dn drawn from the same data space are each fed to a feature selection method, producing feature subsets R1, R2, ..., Rn; the question is whether these subsets are consistent.]

Under data variations, a stable feature selection method produces a stable feature subset that is closer to the characteristic features (biomarkers) and leads to better learning performance, whereas an unstable feature selection method produces largely different feature subsets that may nevertheless yield similarly good learning performance. Domain experts in biomedicine and biology are also interested in biomarkers that are stable and insensitive to data variations: an unstable feature selection method dampens confidence in validation and increases the cost of experiments.

Challenge: increasing the training sample size could be very costly or impractical, so how can we represent the underlying data distribution without increasing the sample size?
[Figure: training sets D1, D2, ..., Dn drawn from the data space each yield a feature weight vector, which is compared against the true feature weight vector.]
For each feature:
- Variance: fluctuation of the n weight values around their central tendency;
- Bias: deviation of the central tendency (average) from the true weight value;
- Error: average deviation of the n weight values from the true weight value.

Bias-variance decomposition of feature selection error. Notation: training data D drawn from the data space; feature selection result r(D); true feature selection result r*. For each individual feature, a weight value is used instead of a 0/1 selection, and bias, variance, and error are averaged over all features. The decomposition:
- Reveals the relationship between accuracy (the opposite of error) and stability (the opposite of variance);
- Suggests a better trade-off between the bias and variance of feature selection.
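The formulas on this slide appear to have been rendered as images and are missing from the transcript. A standard per-feature squared-loss decomposition consistent with the definitions above (expectation taken over training sets D; r_f(D) the selected weight of feature f; r_f* its true weight; Bias_f here denotes the squared deviation, so that Err = Bias + Var as stated later) would be:

```latex
% Per-feature decomposition, with the expectation over training sets D:
\mathrm{Err}_f \;=\; \mathbb{E}_D\!\big[(r_f(D) - r_f^{*})^2\big]
\;=\; \underbrace{\big(\mathbb{E}_D[r_f(D)] - r_f^{*}\big)^2}_{\mathrm{Bias}_f}
\;+\; \underbrace{\mathbb{E}_D\!\big[(r_f(D) - \mathbb{E}_D[r_f(D)])^2\big]}_{\mathrm{Var}_f}

% Overall quantities average over all d features:
\mathrm{Err} = \frac{1}{d}\sum_{f=1}^{d}\mathrm{Err}_f,\qquad
\mathrm{Bias} = \frac{1}{d}\sum_{f=1}^{d}\mathrm{Bias}_f,\qquad
\mathrm{Var} = \frac{1}{d}\sum_{f=1}^{d}\mathrm{Var}_f
```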

Feature selection (weighting) can be viewed as a Monte Carlo estimator of feature relevance. Increasing the sample size to reduce its variance is impractical and costly; a classical alternative for reducing the variance of a Monte Carlo estimator is importance sampling, which translates here into instance weighting.
Intuition behind importance sampling: draw more instances from important regions and fewer instances from other regions.
Intuition behind instance weighting: increase the weights of instances from important regions and decrease the weights of instances from other regions.
Open questions: how to weight the instances, and how important is each instance?
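As a self-contained illustration of the variance-reduction idea invoked here (not from the slides), the following sketch estimates a tail probability with plain Monte Carlo and with importance sampling from a shifted proposal; the re-weighted estimator has markedly lower variance:

```python
# Illustrative sketch: importance sampling reduces the variance of a Monte Carlo
# estimator by drawing from a proposal concentrated on the important region and
# re-weighting the samples by p(x)/q(x).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 1000

def plain_mc():
    # Estimate P(X > 3) for X ~ N(0, 1) by direct sampling.
    x = rng.standard_normal(n)
    return np.mean(x > 3)

def importance_sampling():
    # Sample from a proposal centered in the tail, then re-weight.
    x = rng.normal(loc=3.5, scale=1.0, size=n)
    w = norm.pdf(x, 0, 1) / norm.pdf(x, 3.5, 1.0)
    return np.mean((x > 3) * w)

plain = [plain_mc() for _ in range(200)]
weighted = [importance_sampling() for _ in range(200)]
print("true value  ~", 1 - norm.cdf(3))
print("plain MC     : mean %.5f, std %.5f" % (np.mean(plain), np.std(plain)))
print("importance IS: mean %.5f, std %.5f" % (np.mean(weighted), np.std(weighted)))
```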

Margin based instance weighting for stable feature selection. Challenges:
- How to produce weights for instances from the point of view of feature selection stability;
- How to present weighted instances to conventional feature selection algorithms.

Hypothesis margin: for each instance x in the original space, find its nearest hit (nearest neighbor of the same class) and nearest miss (nearest neighbor of a different class). Along each dimension, the margin is the difference between the distance to the nearest miss and the distance to the nearest hit; the resulting margin vector maps x into the margin vector feature space and captures the local profile of feature relevance for all features at x.
- Instances exhibit different profiles of feature relevance;
- Instances influence feature selection results differently.

[Figure: hypothesis-margin based feature space transformation. (a) Original feature space; (b) margin vector feature space.]
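A minimal sketch of this transformation (not the authors' code), assuming Euclidean distance for choosing the nearest hit and miss and using |x_i - miss_i| - |x_i - hit_i| as the margin along dimension i:

```python
# Sketch of the hypothesis-margin transformation: each instance is mapped to a
# margin vector whose i-th entry is |x_i - nearmiss_i| - |x_i - nearhit_i|.
import numpy as np

def margin_vectors(X, y):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(X)
    M = np.zeros_like(X)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                                   # exclude the instance itself
        same, diff = (y == y[i]), (y != y[i])
        hit = X[np.where(same, d, np.inf).argmin()]     # nearest neighbor of same class
        miss = X[np.where(diff, d, np.inf).argmin()]    # nearest neighbor of other class
        M[i] = np.abs(X[i] - miss) - np.abs(X[i] - hit)
    return M

# Toy usage
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y = np.array([0, 0, 1, 1])
print(margin_vectors(X, y))
```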

Since instances exhibit different profiles of feature relevance, they influence feature selection results differently. Review: variance reduction via importance sampling draws more instances from important regions and fewer from other regions. Instance weighting follows the same idea: each instance is assigned an outlying degree reflecting how far its local profile of feature relevance deviates from those of the other instances, and a higher outlying degree leads to a lower weight while a lower outlying degree leads to a higher weight.
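The exact outlying-degree and weighting formulas are not in the transcript; one plausible instantiation (an assumption for illustration, not necessarily the paper's formula) defines the outlying degree as the average distance between an instance's margin vector and the margin vectors of the other instances, and sets the weight to decrease with that degree:

```python
# Hypothetical sketch of margin-based instance weighting: outlying degree is the
# average distance of an instance's margin vector from the others; the weight
# decreases as the outlying degree increases.
import numpy as np

def instance_weights(margin_vecs):
    M = np.asarray(margin_vecs, dtype=float)
    n = len(M)
    # Pairwise distances between margin vectors.
    dists = np.linalg.norm(M[:, None, :] - M[None, :, :], axis=2)
    outlying = dists.sum(axis=1) / (n - 1)      # average distance to the other instances
    w = 1.0 / (outlying + 1e-12)                # higher outlying degree -> lower weight
    return w * n / w.sum()                      # normalize so the weights average to 1

# Usage with the margin_vectors() sketch above:
# W = instance_weights(margin_vectors(X, y))
```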

Time complexity analysis:
- Dominated by the instance weighting step, which involves pairwise comparisons among the n instances over all d dimensions;
- Efficient for high-dimensional data with small sample size (n << d).

Stability measure: the average pair-wise similarity of the feature subsets selected from the different training sets, with subset similarity measured by the Kuncheva index.
[Figure (repeating the earlier stability diagram): training sets D1, D2, ..., Dn from the same data space each yield a feature subset R1, R2, ..., Rn; stability asks how consistent these subsets are.]
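A short sketch of this stability score (assuming equal-size subsets, as the Kuncheva index requires); the observed-minus-expected-overlap form below is the standard consistency index:

```python
# Subset-stability scoring: Kuncheva consistency index between two equal-size
# feature subsets, averaged over all pairs of subsets.
from itertools import combinations

def kuncheva_index(A, B, n_features):
    A, B = set(A), set(B)
    k = len(A)
    assert len(B) == k and 0 < k < n_features
    r = len(A & B)                              # observed overlap
    expected = k * k / n_features               # overlap expected by chance
    return (r - expected) / (k - expected)      # correction for chance

def average_pairwise_stability(subsets, n_features):
    pairs = list(combinations(subsets, 2))
    return sum(kuncheva_index(a, b, n_features) for a, b in pairs) / len(pairs)

# Toy usage: three subsets of size 3 selected from 10 features.
subsets = [{0, 1, 2}, {0, 1, 3}, {1, 2, 3}]
print(average_pairwise_stability(subsets, n_features=10))
```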

Synthetic data generation:
- Feature values: drawn from two multivariate normal distributions (one per class); 100 groups of 10 features each, where each group's covariance matrix is a 10x10 matrix with 1 along the diagonal and 0.8 off the diagonal;
- Class label: a weighted sum of all feature values with an optimal (true) feature weight vector;
- Training data: 100 instances, 50 drawn from each of the two distributions; leave-one-out evaluation;
- Test data: 5000 instances.
Method in comparison, SVM-RFE: recursively eliminate 10% of the features remaining from the previous iteration until 10 features remain.
Measures: variance, bias, error; subset stability (Kuncheva index); accuracy (SVM).
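A sketch of data with the stated covariance structure; the class means and separation used below are illustrative assumptions, not the values from the study:

```python
# Sketch of block-correlated synthetic data: 100 groups of 10 features, each group
# with unit variances and 0.8 within-group correlation; two classes, 50 instances each.
import numpy as np

rng = np.random.default_rng(0)
n_groups, group_size = 100, 10

# Block covariance: 1 on the diagonal, 0.8 off the diagonal within each group.
block = np.full((group_size, group_size), 0.8) + 0.2 * np.eye(group_size)

def sample_class(n, mean_value):
    mean = np.full(group_size, mean_value)
    blocks = [rng.multivariate_normal(mean, block, size=n) for _ in range(n_groups)]
    return np.hstack(blocks)

X = np.vstack([sample_class(50, 0.0), sample_class(50, 0.5)])   # 100 training instances
y = np.array([0] * 50 + [1] * 50)
print(X.shape, y.shape)   # (100, 1000) (100,)
```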

Observations:
- Error is equal to the sum of bias and variance for both versions of SVM-RFE;
- Error is dominated by bias during early iterations and by variance during later iterations;
- IW SVM-RFE exhibits significantly lower bias, variance, and error than SVM-RFE as the number of remaining features approaches 50.

Conclusion: variance reduction via margin based instance weighting leads to
- a better bias-variance tradeoff,
- increased subset stability, and
- improved classification accuracy.

Real-world study on microarray data.
Methods in comparison: SVM-RFE, Ensemble SVM-RFE, and Instance Weighting SVM-RFE.
Measures: variance; subset stability; accuracies (KNN, SVM).
[Figure: 20-Ensemble SVM-RFE. Twenty bootstrapped versions of the training data each produce a feature subset, which are then aggregated into a single feature subset.]
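A minimal sketch of bootstrap-based ensemble SVM-RFE (the estimator settings and rank aggregation are assumed for illustration, not taken from the authors' setup):

```python
# Sketch of ensemble SVM-RFE: run RFE with a linear SVM on several bootstrap
# samples of the training data and aggregate the runs by average feature rank.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

def ensemble_svm_rfe(X, y, n_select=10, n_runs=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ranks = np.zeros((n_runs, d))
    for r in range(n_runs):
        idx = rng.integers(0, n, size=n)                      # bootstrap sample
        rfe = RFE(LinearSVC(max_iter=5000),
                  n_features_to_select=n_select, step=0.1)    # drop 10% per iteration
        rfe.fit(X[idx], y[idx])
        ranks[r] = rfe.ranking_                               # 1 = kept until the end
    avg_rank = ranks.mean(axis=0)
    return np.argsort(avg_rank)[:n_select]                    # best average rank

# Usage, e.g. with the synthetic X, y generated above:
# selected = ensemble_svm_rfe(X, y)
```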

Note: 40 iterations, starting from about 1000 features until 10 features remain.
Observations:
- The methods are non-discriminative during early iterations;
- The variance of SVM-RFE increases sharply as the number of features approaches 10;
- IW SVM-RFE shows a significantly slower rate of increase.

Observations:
- Both the ensemble and instance weighting approaches improve stability consistently;
- The improvement from the ensemble approach is not as significant as that from instance weighting;
- As the number of selected features increases, the stability score decreases because of the larger correction factor in the Kuncheva index.

Prediction accuracy (via both KNN and SVM) is non-discriminative among the three approaches for all data sets.
Conclusions: instance weighting
- improves the stability of feature selection without sacrificing prediction accuracy;
- performs much better than the ensemble approach and is more efficient;
- leads to significantly increased stability at a slight extra cost in time.

Accomplishments:
- Establish a bias-variance decomposition framework for feature selection;
- Propose an empirical framework for stable feature selection;
- Develop an efficient margin-based instance weighting algorithm;
- Conduct a comprehensive study on synthetic and real-world data.
Future work:
- Extend the current framework to other state-of-the-art feature selection algorithms;
- Explore the relationship between stable feature selection and classification performance.

Comparison of feature selection algorithms w.r.t. stability (Davis et al., Bioinformatics, vol. 22, 2006; Kalousis et al., KAIS, vol. 12, 2007):
- Quantify stability in terms of consistency of the selected subsets or weights;
- Algorithms vary in stability while performing equally well for classification;
- Choose the algorithm that is best in both stability and accuracy.
Bagging-based ensemble feature selection (Saeys et al., ECML 2007):
- Draw different bootstrapped samples of the same training set;
- Apply a conventional feature selection algorithm to each;
- Aggregate the feature selection results.
Group-based stable feature selection (Yu et al., KDD 2008; Loscalzo et al., KDD 2009):
- Explore the intrinsic feature correlations;
- Identify groups of correlated features;
- Select relevant feature groups.

Thank you and Questions?