Minimum Redundancy and Maximum Relevance Feature Selection

Presentation on theme: "Minimum Redundancy and Maximum Relevance Feature Selection"— Presentation transcript:

1 Minimum Redundancy and Maximum Relevance Feature Selection
Hang Xiao

2 Background Feature: a feature is an individual measurable heuristic property of a phenomenon being observed; features are the inputs used in machine learning and pattern recognition. In character recognition: horizontal and vertical profiles, number of internal holes, stroke detection. In speech recognition: noise ratios, length of sounds, relative power, filter matches. In microarray analysis: gene expression levels.

3 Background Measures of relevance between features: correlation, F-statistic, mutual information.
Mutual information is a widely used measure of the dependency between variables. The individual entropies H(X), H(Y), the joint entropy H(X,Y), and the conditional entropies of a pair of correlated subsystems X, Y determine the mutual information I(X;Y) = ∑_{x,y} p(x,y) log[ p(x,y) / (p(x)p(y)) ], where p(x,y) is the joint distribution function of X and Y and p(x), p(y) are the marginal probability distribution functions. If X and Y are independent, p(x,y) = p(x)p(y), so I(X;Y) = 0.
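As an illustration (not part of the original slides), a minimal Python sketch that estimates I(X;Y) for two discrete variables from their empirical joint distribution might look like this; the function and variable names are made up for the example.

    import numpy as np

    def mutual_information(x, y):
        """Estimate I(X;Y) in bits for two discrete 1-D arrays of equal length."""
        x = np.asarray(x)
        y = np.asarray(y)
        # Empirical joint distribution p(x, y) from co-occurrence counts.
        xs, xi = np.unique(x, return_inverse=True)
        ys, yi = np.unique(y, return_inverse=True)
        joint = np.zeros((xs.size, ys.size))
        np.add.at(joint, (xi, yi), 1)
        joint /= joint.sum()
        # Marginals p(x) and p(y).
        px = joint.sum(axis=1, keepdims=True)
        py = joint.sum(axis=0, keepdims=True)
        # Sum p(x,y) * log2( p(x,y) / (p(x) p(y)) ) over non-zero cells.
        nz = joint > 0
        return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

    # Independent variables give I close to 0; identical variables give I = H(X).
    rng = np.random.default_rng(0)
    a = rng.integers(0, 3, size=10000)
    b = rng.integers(0, 3, size=10000)
    print(mutual_information(a, b))   # ~0
    print(mutual_information(a, a))   # ~log2(3) ≈ 1.58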

4 Feature Selection Problem
The goal is to find a subspace R^m of m features that "optimally" characterizes the target class c, i.e., gives minimal classification error and maximal dependency. By removing the most irrelevant and redundant features from the data, feature selection helps improve the performance of learning models. Maximal relevance selects the features with the highest relevance to c, based on mutual information, the F-test, etc., without considering relationships among features; the selected features can be highly correlated and cover only narrow regions of the feature space, which motivates the minimal-redundancy condition.

5 mRMR: Discrete Variables
Maximal relevance: max V_I, with V_I = (1/|S|) ∑_{i∈S} I(i, h), where S is the set of selected features, h is the target class, and I(i,j) is the mutual information between features i and j. Minimal redundancy: min W_I, with W_I = (1/|S|²) ∑_{i,j∈S} I(i,j).
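A minimal sketch of these two quantities in Python, reusing the mutual_information helper above; the function and argument names are illustrative, not from the paper.

    def relevance_redundancy(features, target, mutual_information):
        """V_I = mean MI(feature, target); W_I = mean pairwise MI among features.

        features: list of discrete 1-D arrays (the selected set S)
        target:   discrete 1-D array of class labels (h)
        """
        m = len(features)
        v = sum(mutual_information(f, target) for f in features) / m
        w = sum(mutual_information(fi, fj) for fi in features for fj in features) / (m * m)
        return v, w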

6 mRMR: Continuous Variables
Maximal relevance: max V_F, with V_F = (1/|S|) ∑_{i∈S} F(i, h), where F(i,h) is the F-statistic of feature i with respect to the class h. Minimal redundancy: min W_c, with W_c = (1/|S|²) ∑_{i,j∈S} |cor(i,j)|, where cor(i,j) is the correlation between features i and j.
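A rough sketch of the continuous-variable scores, assuming numpy and scipy: the one-way ANOVA F-statistic stands in for F(i,h) and the absolute Pearson correlation for cor(i,j). Names are illustrative only.

    import numpy as np
    from scipy.stats import f_oneway

    def f_statistic(feature, labels):
        """One-way ANOVA F-statistic of a continuous feature across the class groups."""
        groups = [feature[labels == c] for c in np.unique(labels)]
        return f_oneway(*groups).statistic

    def continuous_scores(X, labels, selected):
        """V_F and W_c for the columns of X indexed by `selected`."""
        m = len(selected)
        v = np.mean([f_statistic(X[:, i], labels) for i in selected])
        corr = np.abs(np.corrcoef(X[:, selected], rowvar=False))
        w = corr.sum() / (m * m)
        return v, w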

7 Combine Relevance and Redundancy
Additive combination (MID): max (V − W). Multiplicative combination (MIQ): max (V / W).
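Combining the two scores is then a one-liner; this illustrative snippet assumes the relevance_redundancy helper sketched earlier.

    def combined_score(v, w, mode="additive"):
        """Additive (MID) or multiplicative (MIQ) combination of relevance v and redundancy w."""
        return v - w if mode == "additive" else v / w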

8 Most Related Methods The most commonly used feature selection methods take the top-ranking features without considering relationships among features. Yu & Liu (2003/2004) use information gain, an essentially similar approach. Wrappers are not a filter approach: the classifier is involved in the selection, so the features do not generalize well. PCA and ICA: features are orthogonal or independent, but not in the original feature space.

9 Class Prediction Methods
Naive Bayes (NB) classifier: {g1, g2, …, gm} are the gene expression levels and p(gi|hk) is a conditional probability table (density). NB has been shown to have good classification performance for many real data sets, especially for documents. Support Vector Machine (SVM): draws an optimal hyperplane in the feature vector space and uses kernels to create non-linear decision boundaries. For multiple classes, either one-against-others or all-against-all is used; LIBSVM follows the one-against-others approach here.
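As a hedged illustration (not from the slides), scikit-learn versions of these two classifiers on a toy expression matrix could look like this; the data here is random placeholder data.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 20))          # 60 samples x 20 selected "genes" (placeholder data)
    y = rng.integers(0, 3, size=60)        # 3 classes

    nb = GaussianNB().fit(X, y)            # models class-conditional densities p(g_i | h_k)
    svm = SVC(kernel="rbf", decision_function_shape="ovr").fit(X, y)  # one-vs-rest multi-class

    print(nb.score(X, y), svm.score(X, y))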

10 Class Prediction Methods
Linear Discriminant Analysis (LDA): finds a linear combination of features that separates the classes; closely related to ANOVA and regression analysis. Logistic Regression (LR): a linear combination of the feature variables is transformed into class probabilities by a logistic function.
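The same toy data X, y from the previous sketch could be fed to scikit-learn's LDA and logistic regression, as in this illustrative snippet.

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression

    lda = LinearDiscriminantAnalysis().fit(X, y)      # linear combination of features per class
    lr = LogisticRegression(max_iter=1000).fit(X, y)  # logistic link turns the linear score into probabilities

    print(lda.score(X, y), lr.score(X, y))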

11 Microarray Gene Expression Data Sets for Cancer Classification

12 LOOCV: Leave-One-Out Cross Validation
CV accuracy provides a more realistic assessment of how well classifiers generalize to unseen data. For presentation clarity, the number of LOOCV errors is given in Tables 4-8. Baseline features: selected based solely on maximum relevance.
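A minimal LOOCV loop with scikit-learn, shown only as an illustration of the protocol behind the error counts in the tables; classifier and data are placeholders.

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(2)
    X = rng.normal(size=(40, 10))
    y = rng.integers(0, 2, size=40)

    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = GaussianNB().fit(X[train_idx], y[train_idx])
        errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])

    print("LOOCV errors:", errors)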

13 The role of redundancy reduction
Relevance (V_I) and redundancy of the mRMR features on the discretized NCI dataset, together with the respective LOOCV errors obtained using the Naive Bayes classifier.

14 Do mRMR Features Generalize Well on Unseen Data?
Testing errors on the Child Leukemia data (7 classes, 215 training samples, 112 testing samples). M is the number of features used in classification.

15 What is the Relationship of mRMR Features and Various Data Discretization Schemes?
LOOCV testing results, reported as classifier (#errors), for the binarized NCI and Lymphoma data using the SVM classifier.

16 Comparison with other work

17 Theoretical basis of mRMR
Maximum Dependency criterion: the statistical association between the selected features and the target, defined by the mutual information I(Sm, h). For two variables x and y, I(x; y) = ∫∫ p(x,y) log[ p(x,y) / (p(x)p(y)) ] dx dy. For the multivariate variable Sm = {x1, …, xm} and the target h, the dependency is I(Sm; h).

18 High-Dimensional Mutual Information
Estimating the high-dimensional I(Sm, h) for the multivariate variable Sm and the target h is difficult: it is an ill-posed problem to invert the large covariance matrix, the number of samples is often insufficient, and the search has combinatorial time complexity O(C(|Ω|, |S|)).

19 Factorize the Mutual Information
The mutual information for the multivariate variable Sm and the target h can be factorized. Define the multivariate mutual information J(x1, …, xk) = ∫ p(x1, …, xk) log[ p(x1, …, xk) / (p(x1)⋯p(xk)) ] dx1⋯dxk. It can be proved that I(Sm; h) = J(Sm, h) − J(Sm).
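A short derivation of that identity in terms of entropies, filling in the "it can be proved" step (my own addition, not from the slides):

    \begin{align*}
    J(x_1,\dots,x_k) &= \sum_{i=1}^{k} H(x_i) - H(x_1,\dots,x_k) \\
    J(S_m,h) - J(S_m) &= \Big[\sum_i H(x_i) + H(h) - H(S_m,h)\Big] - \Big[\sum_i H(x_i) - H(S_m)\Big] \\
                      &= H(h) + H(S_m) - H(S_m,h) = I(S_m;h).
    \end{align*}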

20 Factorize I(Sm,h) Relevance of S = {x1, x2, …} and h, or RL(S,h).
Redundancy among the variables {x1, x2, …}, or RD(S). For incremental search, maximizing I(S,h) is "equivalent" to maximizing [RL(S,h) − RD(S)], hence the name min-Redundancy-Max-Relevance (mRMR).

21 Advantages of mRMR Both relevance and redundancy estimation are low-dimensional problems (i.e., involving only two variables). This is much easier than directly estimating the multivariate density or mutual information in the high-dimensional space, giving faster speed and more reliable estimation. mRMR is an optimal first-order approximation of maximizing I(.), whereas relevance-only ranking only maximizes J(.)!

22 Search Algorithm of mRMR
Greedy search algorithm: in the pool Ω, find the variable x1 that has the largest I(x1, h) and exclude it from Ω. Then search for x2 that maximizes I(x2, h) − (1/|S|) ∑_{xi∈S} I(xi, x2), where S is the set of already-selected variables. Iterate this process until the expected number of variables has been obtained, or other constraints are satisfied. Complexity: O(|S|·|Ω|).
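The greedy loop in Python might look like the following sketch, using the additive (MID) criterion and the mutual_information helper from earlier; the names are illustrative.

    import numpy as np

    def mrmr_select(X, y, n_features, mutual_information):
        """Greedy mRMR selection over the columns of a discretized matrix X.

        X: (n_samples, n_vars) array of discrete features, y: class labels.
        Returns indices of the selected columns, in selection order.
        """
        n_vars = X.shape[1]
        relevance = np.array([mutual_information(X[:, j], y) for j in range(n_vars)])
        selected = [int(np.argmax(relevance))]          # start with the most relevant variable
        candidates = set(range(n_vars)) - set(selected)

        while len(selected) < n_features and candidates:
            best, best_score = None, -np.inf
            for j in candidates:
                # redundancy: mean MI between candidate j and the already-selected set S
                redundancy = np.mean([mutual_information(X[:, j], X[:, i]) for i in selected])
                score = relevance[j] - redundancy        # additive (MID) criterion
                if score > best_score:
                    best, best_score = j, score
            selected.append(best)
            candidates.remove(best)
        return selected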

23 Comparing Max-Dep and mRMR: Complexity of Feature Selection

24 Comparing Max-Dep and mRMR: Accuracy of Feature Selected in Classification
Leave-One-Out cross-validation classification accuracies of the features selected by mRMR and MaxDep.

25 Use Wrappers to Refine Features
mRMR is a filter approach: fast; the features might still be redundant; independent of the classifier. Wrappers seek to minimize the number of errors directly: slow; the features are less robust; dependent on the classifier; better prediction accuracy. Therefore, use mRMR first to generate a short candidate feature pool, then use wrappers to obtain a least-redundant feature set with better accuracy.
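An illustrative sketch of that second stage: a forward (incremental) wrapper that refines an mRMR-selected pool by adding, at each step, the feature giving the best LOOCV accuracy. The classifier and names are placeholders, not the authors' implementation.

    import numpy as np
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    def forward_wrapper(X, y, pool, max_features):
        """Incrementally grow a feature subset from `pool` (e.g. mRMR output),
        keeping at each step the feature whose addition gives the best LOOCV accuracy."""
        chosen = []
        while len(chosen) < max_features and pool:
            scores = []
            for j in pool:
                acc = cross_val_score(GaussianNB(), X[:, chosen + [j]], y,
                                      cv=LeaveOneOut()).mean()
                scores.append((acc, j))
            best_acc, best_j = max(scores)
            chosen.append(best_j)
            pool = [j for j in pool if j != best_j]
        return chosen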

26 Use Wrappers to Refine Features
Forward wrappers (incremental selection) and backward wrappers (decremental selection), evaluated on the NCI data.

27 Conclusions Max-Dependency feature selection can be efficiently implemented as the mRMR algorithm. It significantly outperforms the widely used max-relevance selection method: mRMR features cover a broader region of the feature space with fewer features. mRMR is very efficient and useful for gene selection and many other applications. The programs are ready!

