Feature Selection Methods

Presentation on theme: "Feature Selection Methods"— Presentation transcript:

1 Feature Selection Methods
An overview. Thanks to Qiang Yang; modified by Charles Ling. Based on Data Mining: Concepts and Techniques.

2 What is Feature Selection?
Feature selection: the problem of selecting some subset of a learning algorithm's input variables upon which it should focus attention, while ignoring the rest (dimensionality reduction). Humans and animals do this constantly!

3 Motivational example from Biology
Monkeys performing a classification task. [1] Natasha Sigala & Nikos Logothetis: Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415 (2002).

4 Motivational example from Biology
Monkeys performing a classification task. Diagnostic features: eye separation, eye height. Non-diagnostic features: mouth height, nose length.

5 Motivational example from Biology
Monkeys performing a classification task. Results: the activity of a population of 150 neurons in the anterior inferior temporal cortex was measured; 44 neurons responded significantly differently to at least one feature. After training, 72% (32/44) were selective for one or both of the diagnostic features (and not for the non-diagnostic features).

6 Motivational example from Biology
Monkeys performing a classification task. Results (single neurons): "The data from the present study indicate that neuronal selectivity was shaped by the most relevant subset of features during the categorization training."

7 Feature Selection
Reducing the feature space by throwing out some of the features (covariates); also called variable selection. Motivating idea: try to find a simple, "parsimonious" model. Occam's razor: the simplest explanation that accounts for the data is best.

8 Feature Extraction
Feature extraction is a process that extracts a new set of features from the original data through a numerical functional mapping. Idea: given data points in d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible. E.g., find the best planar approximation to 3-D data, or to 10^4-D data.

9 Feature Selection vs Feature Extraction
The two differ in what they produce: feature selection chooses a subset F' of the original features F, while feature extraction creates new features (dimensions) in F' defined as functions over all the original features.
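A minimal sketch of the contrast, assuming scikit-learn; the data, feature count, and target below are synthetic and only illustrative. Selection returns a subset of the original columns, while extraction (here PCA) builds new columns that combine all of them.

# Sketch: feature selection keeps original columns, feature extraction builds new ones.
# Assumes scikit-learn; the data and target are synthetic, for illustration only.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                   # 100 samples, 6 original features
y = (X[:, 0] + X[:, 3] > 0).astype(int)         # label depends only on features 0 and 3

# Feature selection: F' is a subset of F
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("selected columns:", selector.get_support(indices=True))   # typically [0 3]

# Feature extraction: each new feature is a linear combination of all of F
pca = PCA(n_components=2).fit(X)
print("extracted shape:", pca.transform(X).shape)                # (100, 2)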

10 Outline
What is feature reduction? (feature selection, feature extraction). Why do we need feature reduction? Feature selection methods: filter, wrapper. Feature extraction methods: linear, nonlinear.

11 Motivation
The objective of feature reduction is three-fold: improving the prediction performance of the predictors (accuracy); providing faster and more cost-effective predictors (CPU time); and providing a better understanding of the underlying process that generated the data (understanding).

12 Feature Reduction: Examples
Task 1: classify whether a document is about cats. Data: word counts in the document, e.g. X = (cat 2, and 35, it 20, kitten 8, electric, trouble 4, then 5, several 9, feline, while, lemon, ...); reduced X = (cat 2, kitten 8, feline). Task 2: predict chances of lung disease. Data: medical history survey, e.g. X = (Vegetarian: No, Plays video games: Yes, Family history, Athletic, Smoker, Sex: Male, Lung capacity: 5.8 L, Hair color: Red, Car: Audi, Weight: 185 lbs); reduced X = (Family history: No, Smoker: Yes).

13 Feature reduction in task 1
Task 1: we're interested in prediction; the features are not interesting in themselves, we just want to build a good classifier (or other kind of predictor). Text classification: features for all 10^5 English words, and maybe all word pairs. Common practice: throw in every feature you can think of and let feature selection get rid of the useless ones. Training is too expensive with all features, and the presence of irrelevant features hurts generalization.

14 Feature reduction in task 2
Task 2: we're interested in the features themselves; we want to know which are relevant, and if we fit a model, it should be interpretable. What causes lung cancer? Features are aspects of a patient's medical history; the binary response variable is whether the patient developed lung cancer. Which features best predict whether lung cancer will develop? We might want to legislate against these features.

15 Getting at Task 2 through Task 1
Even if we just want to identify features, it can be useful to pretend we want to do prediction: relevant features are (typically) exactly those that most aid prediction. But not always: highly correlated features may be redundant for prediction yet both interesting as "causes", e.g. smoking in the morning and smoking at night.

16 Outline
What is feature reduction? (feature selection, feature extraction). Why do we need feature reduction? Feature selection methods: filtering, wrapper. Feature extraction methods: linear, nonlinear.

17 Filtering Methods
Basic idea: assign a score to each feature f indicating how "related" x_f and y are. Intuition: if x_{i,f} = y_i for all i, then f is good no matter what our model is, since it contains all the information about y. Many popular scores exist [see Yang and Pedersen '97]. Classification with categorical data: chi-squared, information gain (binning can be used to make continuous data categorical). Regression: correlation, mutual information. Markov blanket [Koller and Sahami '96]. Then somehow pick how many of the highest-scoring features to keep (nested models).
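A hedged sketch of filter-style scoring, assuming scikit-learn; the data below is synthetic, and the choice of chi-squared for the classification case and mutual information for the regression case simply mirrors the scores listed above.

# Filter scores: one number per feature, computed without any target learner.
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_regression, SelectKBest

rng = np.random.default_rng(1)

# Classification with categorical (here, non-negative count) data: chi-squared score
X_counts = rng.integers(0, 10, size=(200, 5))
y_class = (X_counts[:, 2] > 4).astype(int)
scores, pvalues = chi2(X_counts, y_class)     # one score per feature, higher = more related
print("chi2 scores:", np.round(scores, 1))

# Keep the k highest-scoring features (nested models: k = 1, 2, ...)
X_top = SelectKBest(chi2, k=2).fit_transform(X_counts, y_class)
print("top-2 shape:", X_top.shape)

# Regression: mutual information between each feature and a continuous target
X_cont = rng.normal(size=(200, 5))
y_reg = 3 * X_cont[:, 0] + rng.normal(scale=0.1, size=200)
print("MI scores:", np.round(mutual_info_regression(X_cont, y_reg), 2))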

18 Filtering Methods
Advantages: very fast and simple to apply. Disadvantages: it does not take into account which learning algorithm will be used, and it does not take into account correlations between features. Ignoring correlations can be an advantage if we're only interested in ranking the relevance of individual features rather than performing prediction, but it is also a significant disadvantage (see homework). Suggestion: use light filtering as an efficient initial step if there are many obviously irrelevant features. Caveat here too: apparently useless features can become useful when grouped with others.

19 Wrapper Methods
The learner is considered a black box; its interface is used to score subsets of variables according to the predictive power of the learner when using those subsets. Results vary for different learners. One needs to define: how to search the space of all possible variable subsets, and how to assess the prediction performance of a learner.

20 Wrapper Methods
The problem of finding the optimal subset is NP-hard! A wide range of heuristic search strategies can be used, in two different classes: forward selection (start with an empty feature set and add features at each step) and backward elimination (start with the full feature set and discard features at each step). Predictive power is usually measured on a validation set or by cross-validation. By using the learner as a black box, wrappers are universal and simple! Criticism: a large amount of computation is required.
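A minimal wrapper sketch, assuming scikit-learn (0.24 or later for SequentialFeatureSelector) and a synthetic dataset; the learner is treated as a black box and each candidate subset is scored by cross-validation.

# Wrapper sketch: forward selection and backward elimination around a black-box learner.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)
learner = LogisticRegression(max_iter=1000)

# Forward selection: start empty, greedily add the feature that most improves CV accuracy
forward = SequentialFeatureSelector(learner, n_features_to_select=3,
                                    direction="forward", cv=5).fit(X, y)
print("forward picks :", forward.get_support(indices=True))

# Backward elimination: start with all features, greedily discard
backward = SequentialFeatureSelector(learner, n_features_to_select=3,
                                     direction="backward", cv=5).fit(X, y)
print("backward picks:", backward.get_support(indices=True))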

21 Wrapper Methods

22 Feature selection – search strategy
Method / Property / Comments:
Exhaustive search: evaluate all possible subsets of m out of d features. Guaranteed to find the optimal subset; not feasible for even moderately large values of m and d.
Sequential Forward Selection (SFS): select the best single feature, then add one feature at a time which, in combination with the already selected features, maximizes the criterion function. Once a feature is retained it cannot be discarded; computationally attractive since, to select a subset of size 2, it examines only (d-1) possible subsets.
Sequential Backward Selection (SBS): start with all d features and successively delete one feature at a time. Once a feature is deleted it cannot be brought back into the subset; requires more computation than sequential forward selection.
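To make the SFS row concrete, here is a minimal hand-rolled greedy forward search. The learner and the criterion (cross-validated accuracy) are placeholder assumptions; any scoring function could be substituted.

# Minimal Sequential Forward Selection, assuming scikit-learn only for the scoring step.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def sfs(X, y, m, learner=None):
    """Greedily grow a feature subset to size m; a retained feature is never discarded."""
    learner = learner or DecisionTreeClassifier(random_state=0)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < m:
        scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=3).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)        # feature that maximizes the criterion
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
print(sfs(X, y, m=3))                             # indices of the three chosen features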

23 Comparison of Filter and Wrapper
The wrapper method is tied to a particular learning algorithm, hence the criterion can be optimized for it; but it is potentially very time consuming, since it typically needs to evaluate a cross-validation scheme at every iteration. The filtering method is much faster, but it does not incorporate learning.

24 Multivariate FS is complex
Multivariate FS implies a search in the space of all possible combinations of features: for n features there are 2^n possible subsets, which leads to both high computational and high statistical complexity. Wrappers use the performance of a learning machine to evaluate each subset [Kohavi and John, 1997]; training 2^n learning machines is infeasible for large n, so most wrapper algorithms resort to greedy or heuristic search. The statistical complexity of the learning problem is governed by the ratio n/m, where m is the number of examples, but it can be reduced to (log n)/m for regularized algorithms and forward or backward selection procedures. Filters work analogously to wrappers, but in the evaluation function they use something cheaper to compute than the performance of the target learning machine (e.g. a correlation coefficient or the performance of a simpler learning machine); embedded methods instead perform the selection inside the training of the learning machine itself.

25 In Practice...
Univariate feature selection often yields better accuracy results than multivariate feature selection. No feature selection at all sometimes gives the best accuracy results, even in the presence of known distracters; multivariate methods usually claim only better "parsimony". How can we make multivariate FS work better? Constantin's comments: there is a tradeoff between feature-set compactness (parsimony) and classification performance that makes a difference in practical settings. For example, in text categorization, univariate filtering is best with respect to classification performance but horrible in parsimony: you cannot use univariate FS to build, say, boolean queries for PubMed, because they would not fit the interface's input size restrictions. But you can use decision-tree FS or Markov blankets and get workable boolean queries with almost the same classification performance as the best univariate set. See also the NIPS 2003 and WCCI 2006 feature selection challenges.

26 Feature Extraction: Definition
Given a set of features F, the feature extraction ("construction") problem is to map F to some new feature set F' that maximizes the learner's ability to classify patterns, where F' ranges over the set of all possible feature sets. This general definition subsumes feature selection: a feature selection algorithm also performs such a mapping, but can only map to subsets of the input variables.

27 Linear, Unsupervised Feature Selection
Question: are attributes A1 and A2 independent? If they are highly dependent, we can remove either A1 or A2. If A1 is independent of a class attribute A2, we can remove A1 from our training data.

28 Chi-Squared Test (cont.)
Question: are attributes A1 and A2 independent? These features are nominal-valued (discrete). Null hypothesis: we expect independence. Example data (Outlook, Temperature): two (Sunny, High) records and one (Cloudy, Low) record.

29 The Weather example: Observed Count
temperature Outlook High Low Outlook Subtotal Sunny 2 Cloudy 1 Temperature Subtotal: Total count in table =3 Outlook Temperature Sunny High Cloudy Low Data Mining: Concepts and Techniques

30 The Weather example: Expected Count
If the attributes were independent, then the cell counts would follow from the subtotals (this table is also known as the contingency table of expected counts): Sunny & High = 3*(2/3)*(2/3) = 4/3 ≈ 1.33; Sunny & Low = 3*(2/3)*(1/3) = 2/3 ≈ 0.67 (Sunny subtotal 2, prob = 2/3); Cloudy & High = 3*(1/3)*(2/3) ≈ 0.67; Cloudy & Low = 3*(1/3)*(1/3) = 1/3 ≈ 0.33 (Cloudy subtotal 1, prob = 1/3). Temperature subtotals: High 2 (prob = 2/3), Low 1 (prob = 1/3). Total count in table = 3.

31 Question: How Different Are the Observed and Expected Counts?
The chi-squared statistic sums (observed - expected)^2 / expected over all cells. If the chi-squared value is very large, then A1 and A2 are not independent, that is, they are dependent! Degrees of freedom: if the table has n x m cells, then degrees of freedom = (n-1)*(m-1); in our example the number of degrees of freedom is 1. Chi-squared = ?

32 Chi-Squared Table: what does it mean?
If the calculated value is much greater than the tabulated value, then you have reason to reject the independence assumption. When your calculated chi-squared value is greater than the value shown in the 0.05 column of the table (3.84 for one degree of freedom), you can be 95% confident that the attributes are actually dependent; i.e. there is only a 5% probability that a chi-squared value this large would occur by chance under independence.
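The toy weather table above can be checked in code. This is a hedged sketch assuming SciPy; correction=False turns off Yates' continuity correction so the statistic matches the plain (O-E)^2/E hand calculation.

# Chi-squared independence test for the toy weather table (observed counts from the slides).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[2, 0],    # Sunny:  High, Low
                     [0, 1]])   # Cloudy: High, Low
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(np.round(expected, 2))    # [[1.33 0.67] [0.67 0.33]], the expected counts above
print(dof)                      # (2-1)*(2-1) = 1 degree of freedom
print(chi2_stat > 3.84)         # compare the statistic against the 0.05 critical value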

33 Example Revisited
We don't have to have a two-dimensional count table (also known as a contingency table). Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1, but in the Honours class over the past ten years there have been 80 females and 40 males. Question: is this a significant departure from the 1:1 expectation? Observed (Honours): Male 40, Female 80, Total 120.

34 Expected (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1, but in the Honours class over the past ten years there have been 80 females and 40 males. Question: is this a significant departure from the 1:1 expectation? Note: the expected counts are filled in from the 1:1 expectation rather than calculated from marginals. Expected (Honours): Male 60, Female 60, Total 120.

35 Chi-Squared Calculation
               Female   Male   Total
Observed (O)       80     40     120
Expected (E)       60     60     120
O - E              20    -20
(O - E)^2         400    400
(O - E)^2 / E    6.67   6.67   Sum = 13.34 = X^2

36 Chi-Squared Test (Cont.)
Then check the chi-squared table for significance: compare our X^2 value with a chi-squared value from a table with n-1 degrees of freedom, where n is the number of categories (2 in our case: males and females), so we have only one degree of freedom. From the chi-squared table, we find a critical value of 3.84 for p = 0.05. Since 13.34 > 3.84, the expectation (that the male-to-female ratio in the Honours class is 1:1) is rejected!
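The same calculation as a one-degree-of-freedom goodness-of-fit test; a hedged sketch assuming SciPy.

# Goodness-of-fit version of the Honours-class example, assuming SciPy.
from scipy.stats import chisquare, chi2

observed = [80, 40]          # females, males over the past ten years
expected = [60, 60]          # under the 1:1 assumption
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(round(stat, 2))                  # 13.33, i.e. 2 * 400/60, matching the hand calculation
print(round(chi2.ppf(0.95, df=1), 2))  # 3.84, the critical value for p = 0.05 and 1 d.o.f.
print(stat > chi2.ppf(0.95, df=1))     # True: reject the 1:1 expectation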

37 Chi-Squared Test in Weka: weather.nominal.arff

38 Chi-Squared Test in Weka

39 Chi-Squared Test in Weka

40 Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}. The induced tree tests A4 at the root and A1 and A6 at internal nodes, with leaves labeled Class 1 and Class 2. Reduced attribute set: {A1, A4, A6}.
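A hedged sketch of the same idea with scikit-learn: fit a decision tree and keep only the attributes it actually uses (nonzero importance). The data, target, and attribute names A1..A6 are synthetic placeholders mirroring the slide.

# Decision-tree-based attribute reduction, assuming scikit-learn; the data is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                               # columns play the role of A1..A6
y = ((X[:, 3] > 0) & (X[:, 0] + X[:, 5] > 0)).astype(int)   # depends only on A4, A1, A6

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
kept = [f"A{i + 1}" for i, imp in enumerate(tree.feature_importances_) if imp > 0]
print("Reduced attribute set:", kept)                       # typically ['A1', 'A4', 'A6']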

41 Unsupervised Feature Extraction: PCA
Given N data vectors (samples) from k dimensions (features), find c <= k orthogonal dimensions that can best be used to represent the data; the feature set is reduced from k to c. Example: the data is a collection of documents, k = 100 word counts, c = 10 new features. The original data set is reduced by projecting the N data vectors onto the c principal components (reduced dimensions). Each (old) data vector Xj is a linear combination of the c principal component vectors Y1, Y2, ..., Yc through weights Wi: Xj = m + W1*Y1 + W2*Y2 + ... + Wc*Yc, for j = 1, 2, ..., N, where m is the mean of the data set, the Wi are the component weights, and the Yi are the eigenvectors. PCA works for numeric data only and is used when the number of dimensions is large.

42 Principal Component Analysis
See online tutorials for details. (Figure: data points plotted in the original axes X1 and X2, with the new axes Y1 and Y2 overlaid.) Note: Y1 is the first eigenvector, Y2 is the second; Y2 is ignorable. Key observation: the variance along Y1 is the largest!

43 Principal Component Analysis (PCA)
Principal component analysis projects the data onto the subspace with the most variance (unsupervised; it does not take y into account).

44 Principal Component Analysis: one attribute first
Temperature: 42, 40, 24, 30, 15, 18, 35. Question: how much spread is in the data along the axis (distance to the mean)? Variance = sum over i of (x_i - mean)^2 / (n - 1) = (standard deviation)^2.

45 Now consider two dimensions
X = Temperature, Y = Humidity (sample values: 40, 90, 30, 15, 70). Covariance measures how X and Y vary together: cov(X, Y) = sum over i of (x_i - mean(X)) * (y_i - mean(Y)) / (n - 1). cov(X,Y) = 0: uncorrelated (no linear relationship); cov(X,Y) > 0: they move in the same direction; cov(X,Y) < 0: they move in opposite directions.

46 More than two attributes: covariance matrix
The covariance matrix contains the covariance values between all possible pairs of dimensions (attributes). Example for three attributes (x, y, z): C = [[cov(x,x), cov(x,y), cov(x,z)], [cov(y,x), cov(y,y), cov(y,z)], [cov(z,x), cov(z,y), cov(z,z)]]; the diagonal entries are the variances.
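A small NumPy sketch of the same matrix; the three attributes below are synthetic, chosen only so that two of them are correlated.

# Covariance matrix for three attributes (x, y, z) with NumPy; the values are synthetic.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.3, size=50)    # correlated with x
z = rng.normal(size=50)                         # roughly independent of both

C = np.cov(np.stack([x, y, z]))   # 3x3: C[i, j] = covariance of attribute i with attribute j
print(np.round(C, 2))             # the diagonal entries are the variances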

47 Background: Eigenvalues and Eigenvectors
Eigenvectors e and eigenvalues λ of C satisfy C e = λ e. How to calculate e and λ: compute det(C - λI), which yields a polynomial of degree n; the roots of det(C - λI) = 0 are the eigenvalues λ. Check out any math book, such as Elementary Linear Algebra by Howard Anton (John Wiley & Sons), or any math package such as MATLAB.
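In practice a numerical routine does this directly; a minimal NumPy sketch on an illustrative symmetric 2x2 matrix (the matrix values are arbitrary).

# Eigen-decomposition C e = lambda * e with NumPy, instead of MATLAB or hand computation.
import numpy as np

C = np.array([[4.0, 2.0],       # an arbitrary symmetric (covariance-like) 2x2 matrix
              [2.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(C)     # columns of `eigenvectors` are the e's
print(np.round(eigenvalues, 3))
print(np.allclose(C @ eigenvectors[:, 0], eigenvalues[0] * eigenvectors[:, 0]))  # True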

48 Steps of PCA
Calculate the eigenvalues λ and eigenvectors e of the covariance matrix C. Eigenvalue λ_j corresponds to the variance along component j, so sort the components by λ_j. Take the first n eigenvectors e_i, where n is the number of top eigenvalues kept; these are the directions with the largest variances.
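The steps above as a short NumPy sketch; the input matrix is synthetic, and np.linalg.eigh is used because the covariance matrix is symmetric.

# The PCA steps above in NumPy; the data matrix here is synthetic.
import numpy as np

def pca(X, n):
    """Return the top-n principal directions and the data projected onto them."""
    Xc = X - X.mean(axis=0)                 # adjust the data by the mean
    C = np.cov(Xc, rowvar=False)            # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalue j = variance along component j
    order = np.argsort(eigvals)[::-1]       # sort by eigenvalue, largest variance first
    top = eigvecs[:, order[:n]]             # first n eigenvectors = top-n directions
    return top, Xc @ top

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
directions, X_reduced = pca(X, n=2)
print(X_reduced.shape)                      # (100, 2)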

49 An Example
Mean1 = 24.1, Mean2 = 53.8 (X1' = X1 - Mean1, X2' = X2 - Mean2):
X1   X2    X1'     X2'
19   63    -5.1     9.25
39   74    14.9    20.25
30   87     5.9    33.25
     23            -30.75
15   35    -9.1    -18.75
     43            -10.75
     32            -21.75
     73             19.25

50 Covariance Matrix
C = [[75, 106], [106, 482]]. Using MATLAB, we find the eigenvectors and eigenvalues: e1 = (-0.98, -0.21) with λ1 = 51.8, and e2 = (0.21, -0.98) with λ2 = 560.2. Thus the second eigenvector is more important!

51 If we only keep one dimension: e2
We keep the dimension of e2 = (0.21, -0.98) and obtain the final one-dimensional data as the projection y_i = (X1'_i, X2'_i) · e2: y = (-10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63).
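The projection can be reproduced for the rows whose centered values appear in the example table above, using e2 from the previous slide.

# Reproducing y = X' . e2 for the rows of the example whose centered values are complete.
import numpy as np

e2 = np.array([0.21, -0.98])                    # second eigenvector from the slides
X_centered = np.array([[-5.1,  9.25],           # (X1', X2') rows from the example table
                       [14.9, 20.25],
                       [ 5.9, 33.25],
                       [-9.1, -18.75]])
print(np.round(X_centered @ e2, 2))             # [-10.14 -16.72 -31.35  16.46]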

52 Using Matlab to figure it out

53 PCA in Weka

54 Weather Data from the UCI Dataset (comes with the Weka package)

56 Summary of PCA
PCA is used for reducing the number of numerical attributes. The key is the data transformation: adjust the data by the mean, find the eigenvectors of the covariance matrix, and transform the data. Note: PCA produces only linear combinations of the data (weighted sums of the original attributes).

57 Summary
Data preparation is a big issue for data mining. Data preparation includes transformations such as: data sampling and feature selection, discretization, missing-value handling, incorrect-value handling, and feature selection and feature extraction.

