Feature Selection Methods


Feature Selection Methods: An Overview. Thanks to Qiang Yang; modified by Charles Ling.

What is feature selection? Feature selection is the problem of selecting some subset of a learning algorithm's input variables upon which it should focus attention, while ignoring the rest (a form of dimensionality reduction). Humans and animals do this constantly.

Motivational example from biology [1]: monkeys performing a classification task. [1] Natasha Sigala, Nikos Logothetis: Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415 (2002).

Motivational example from biology (cont.): monkeys performing a classification task.
- Diagnostic features: eye separation, eye height
- Non-diagnostic features: mouth height, nose length

Motivational example from biology (cont.). Results: the activity of a population of 150 neurons in the anterior inferior temporal cortex was measured; 44 neurons responded significantly differently to at least one feature. After training, 72% (32/44) were selective for one or both of the diagnostic features, and not for the non-diagnostic features.

Motivational example from biology (cont.). Results (single neurons): "The data from the present study indicate that neuronal selectivity was shaped by the most relevant subset of features during the categorization training."

Feature selection: reducing the feature space by throwing out some of the features (covariates); also called variable selection. Motivating idea: try to find a simple, "parsimonious" model. Occam's razor: the simplest explanation that accounts for the data is best.

Feature extraction: a process that extracts a new set of features from the original data through a numerical functional mapping. Idea: given data points in d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible. E.g., find the best planar approximation to 3-D data, or to 10^4-D data.

Feature selection vs. feature extraction: the two differ in what they produce. Feature selection chooses a subset of the original features; feature extraction creates new features (dimensions) defined as functions over all of the original features.

Outline:
- What is feature reduction? (feature selection, feature extraction)
- Why do we need feature reduction?
- Feature selection methods: filter, wrapper
- Feature extraction methods: linear, nonlinear

Motivation. The objective of feature reduction is three-fold:
- Improving the prediction performance of the predictors (accuracy)
- Providing faster and more cost-effective predictors (CPU time)
- Providing a better understanding of the underlying process that generated the data (interpretability)

Feature reduction: examples.
Task 1: classify whether a document is about cats. Data: word counts in the document, e.g. cat 2, and 35, it 20, kitten 8, trouble 4, then 5, several 9, feline ..., electric ..., while ..., lemon ... The reduced X keeps only the informative counts: cat 2, kitten 8, feline ...
Task 2: predict chances of lung disease. Data: a medical history survey, e.g. Vegetarian: No, Plays video games: Yes, Sex: Male, Lung capacity: 5.8 L, Hair color: Red, Car: Audi, Weight: 185 lbs, plus Family history, Athletic, Smoker, ... The reduced X keeps only the relevant answers: Family history: No, Smoker: Yes.

Feature reduction in task 1: we are interested in prediction; the features are not interesting in themselves, we just want to build a good classifier (or other kind of predictor). Text classification: features for all ~10^5 English words, and maybe all word pairs. Common practice: throw in every feature you can think of and let feature selection get rid of the useless ones. Training is too expensive with all features, and the presence of irrelevant features hurts generalization.

Feature reduction in task 2: we are interested in the features themselves; we want to know which are relevant, and if we fit a model, it should be interpretable. What causes lung cancer? Features are aspects of a patient's medical history, and the binary response variable is whether the patient developed lung cancer. Which features best predict whether lung cancer will develop? We might want to legislate against these features.

Getting at task 2 through task 1: even if we just want to identify relevant features, it can be useful to pretend we want to do prediction. Relevant features are (typically) exactly those that most aid prediction, but not always: highly correlated features may be redundant for prediction yet both be interesting as "causes", e.g. smoking in the morning and smoking at night.

Outline (recap):
- What is feature reduction? (feature selection, feature extraction)
- Why do we need feature reduction?
- Feature selection methods: filtering, wrapper
- Feature extraction methods: linear, nonlinear

Filtering methods. Basic idea: assign a score to each feature f indicating how "related" x_f and y are. Intuition: if x_{i,f} = y_i for all i, then f is good no matter what our model is, since it contains all the information about y. Many popular scores exist [see Yang and Pedersen '97]:
- Classification with categorical data: chi-squared, information gain (binning can be used to make continuous data categorical)
- Regression: correlation, mutual information
- Markov blanket [Koller and Sahami '96]
Then pick how many of the highest-scoring features to keep (nested models), as in the sketch below.
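
As a minimal, hedged illustration of the filtering idea (not from the original slides): the toy data and feature count below are made up, and scikit-learn is assumed to be available.

```python
# Minimal sketch of filter-based (univariate) feature scoring.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 6))     # e.g. counts for 6 candidate features
y = (X[:, 0] + X[:, 2] > 9).astype(int)    # label depends only on features 0 and 2

# Score every feature independently (chi-squared is valid for non-negative
# counts), then keep the k highest-scoring features.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("chi2 scores:", np.round(selector.scores_, 2))
print("selected feature indices:", selector.get_support(indices=True))

# Mutual information is an alternative univariate score for the same idea.
print("MI scores:", np.round(mutual_info_classif(X, y, random_state=0), 3))
```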

Filtering methods.
Advantages: very fast; simple to apply.
Disadvantages: they do not take into account which learning algorithm will be used, and they do not take into account correlations between features. The latter can be an advantage if we are only interested in ranking the relevance of features rather than performing prediction, but it is also a significant disadvantage (see homework).
Suggestion: use light filtering as an efficient initial step if there are many obviously irrelevant features. Caveat here too: apparently useless features can be useful when grouped with others.

Wrapper methods. The learner is considered a black box; its interface is used to score subsets of variables according to the predictive power of the learner when using those subsets. Results vary for different learners. One needs to define: how to search the space of all possible variable subsets, and how to assess the prediction performance of the learner.

Wrapper methods. The problem of finding the optimal subset is NP-hard, so a wide range of heuristic search strategies is used. Two different classes: forward selection (start with an empty feature set and add features at each step) and backward elimination (start with the full feature set and discard features at each step). Predictive power is usually measured on a validation set or by cross-validation, as in the sketch below. By using the learner as a black box, wrappers are universal and simple. Criticism: a large amount of computation is required.
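
For concreteness, a hedged sketch of the wrapper idea (not from the slides): a candidate feature subset is scored by the cross-validated accuracy of a black-box learner. The dataset and classifier below are arbitrary choices for illustration.

```python
# Wrapper idea: score a candidate feature subset with a black-box learner
# evaluated by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def score_subset(feature_idx):
    """Predictive power of the learner restricted to the given feature indices."""
    model = LogisticRegression(max_iter=5000)
    return cross_val_score(model, X[:, feature_idx], y, cv=5).mean()

print("subset {0, 1, 2}:", round(score_subset([0, 1, 2]), 3))
print("all features:   ", round(score_subset(list(range(X.shape[1]))), 3))
```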

Feature selection: search strategies.
- Exhaustive search: evaluate all C(d, m) possible subsets of size m out of d features. Guaranteed to find the optimal subset, but not feasible for even moderately large values of m and d.
- Sequential Forward Selection (SFS): select the best single feature, then add one feature at a time which, in combination with the already selected features, maximizes the criterion function. Once a feature is retained it cannot be discarded; computationally attractive since, to grow the subset to size 2, it examines only (d-1) candidate subsets.
- Sequential Backward Selection (SBS): start with all d features and successively delete one feature at a time. Once a feature is deleted it cannot be brought back into the subset; requires more computation than sequential forward selection.
A minimal sketch of SFS follows.
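
A hedged sketch of sequential forward selection using scikit-learn's built-in wrapper (assuming scikit-learn >= 0.24); the target subset size of 3 is an arbitrary choice for the example.

```python
# Sequential Forward Selection via scikit-learn's SequentialFeatureSelector.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=3,      # arbitrary target size for the illustration
    direction="forward",         # "backward" gives sequential backward selection
    cv=5,
)
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))
```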

Comparison of filter and wrapper methods: the wrapper method is tied to a specific classification algorithm, so the criterion being optimized matches the learner; but it is potentially very time-consuming, since it typically needs to evaluate a cross-validation scheme at every iteration. Filtering methods are much faster, but they do not incorporate learning.

Multivariate FS is complex. Multivariate feature selection implies a search in the space of all possible combinations of features: for n features there are 2^n possible subsets (Kohavi and John, 1997). This leads to both high computational and high statistical complexity. Wrappers use the performance of a learning machine to evaluate each subset; training 2^n learning machines is infeasible for large n, so most wrapper algorithms resort to greedy or heuristic search. The statistical complexity of the learning problem is governed by the ratio n/m, where m is the number of examples, but it can be reduced to (log n)/m for regularized algorithms and forward or backward selection procedures. Filters work analogously to wrappers, but the evaluation function uses something cheaper to compute than the performance of the target learning machine (e.g. a correlation coefficient or the performance of a simpler learning machine). Embedded methods perform the selection as part of training the learner itself.

In practice. Univariate feature selection often yields better accuracy results than multivariate feature selection. No feature selection at all sometimes gives the best accuracy results, even in the presence of known distracters. Multivariate methods usually claim only better "parsimony". How can we make multivariate FS work better? Constantin's comments: there is a tradeoff between feature-set compactness (parsimony) and classification performance that makes a difference in practical settings. For example, in text categorization univariate filtering is best with respect to classification performance but horrible in parsimony: you cannot use univariate FS to build, for example, boolean queries for PubMed, because they would not fit the interface's input-size restrictions. But you can use decision-tree FS or Markov blankets and get workable boolean queries with almost the same classification performance as the best univariate set. See the NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges

Feature extraction: definition. Given a set of features F, the feature extraction ("construction") problem is to map F to some new feature set F' that maximizes the learner's ability to classify patterns, where F' ranges over all possible derived feature sets. This general definition subsumes feature selection: a feature selection algorithm also performs a mapping, but it can only map to subsets of the input variables.

Linear, unsupervised feature selection. Question: are attributes A1 and A2 independent? If they are very dependent, we can remove either A1 or A2. If A1 is independent of the class attribute A2, we can remove A1 from our training data.

Chi-squared test (cont.). Question: are attributes A1 and A2 independent? These features are nominal-valued (discrete). Null hypothesis: we expect independence. Example records (Outlook, Temperature): (Sunny, High), (Cloudy, Low), ...

The weather example: observed counts (contingency table of Outlook vs. Temperature, 3 records in total):
                High   Low   Outlook subtotal
  Sunny          2      0          2
  Cloudy         0      1          1
  Subtotal       2      1      Total = 3

The weather example: expected counts. If the attributes were independent, the cell counts would follow from the marginal probabilities (this is the expected contingency table):
                High                         Low                          Subtotal
  Sunny    3*(2/3)*(2/3) = 4/3 ~ 1.33   3*(2/3)*(1/3) = 2/3 ~ 0.67    2  (prob = 2/3)
  Cloudy   3*(1/3)*(2/3) = 2/3 ~ 0.67   3*(1/3)*(1/3) = 1/3 ~ 0.33    1  (prob = 1/3)
  Subtotal       2 (prob = 2/3)               1 (prob = 1/3)          Total = 3

Question: how different are the observed and expected counts? The chi-squared statistic measures this: X^2 = sum over all cells of (O - E)^2 / E. If the chi-squared value is very large, then A1 and A2 are not independent, that is, they are dependent. Degrees of freedom: if the table has n*m cells, the degrees of freedom are (n-1)*(m-1); in our example this is 1. Chi-squared = ?

Chi-squared table: what does it mean? If the calculated value is much greater than the value in the table, you have reason to reject the independence assumption. When your calculated chi-squared value is greater than the chi-squared value shown in the 0.05 column (3.84) of the table, you are 95% certain that the attributes are actually dependent; i.e. there is only a 5% probability that your calculated X^2 value would occur by chance. A scripted check of the weather table follows.
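
As a hedged illustration, the 2x2 weather table (using the reconstructed cell counts from the observed-count slide) can be handed to scipy; scipy is assumed to be available.

```python
# Chi-squared test of independence for the small weather table (illustrative).
import numpy as np
from scipy.stats import chi2_contingency

#                 High  Low
observed = np.array([[2, 0],    # Sunny
                     [0, 1]])   # Cloudy

# correction=False disables the Yates continuity correction so the statistic
# matches the plain sum of (O - E)^2 / E used on the slides.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print("chi-squared =", round(chi2, 2))   # 3.0
print("degrees of freedom =", dof)       # 1
print("expected counts:\n", np.round(expected, 2))
# 3.0 < 3.84 (the 5% critical value), so with only 3 records independence
# cannot be rejected at the 0.05 level.
```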

Example revisited (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html). We do not have to have a two-dimensional count table (contingency table). Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1, but in the Honours class over the past ten years there have been 80 females and 40 males. Question: is this a significant departure from the 1:1 expectation? Observed (Honours): Male 40, Female 80, Total 120.

Expected (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html). Under the 1:1 expectation, the expected counts are filled in directly from that assumption rather than calculated from marginals: Male 60, Female 60, Total 120.

Chi-squared calculation:
                         Female   Male    Total
  Observed numbers (O)     80      40      120
  Expected numbers (E)     60      60      120
  O - E                    20     -20
  (O - E)^2               400     400
  (O - E)^2 / E           6.67    6.67    Sum = 13.34 = X^2

Chi-squared test (cont.). Then check the chi-squared table for significance: http://helios.bto.ed.ac.uk/bto/statistics/table2.html#Chi%20squared%20test. Compare our X^2 value with the chi-squared value in a table with n-1 degrees of freedom, where n is the number of categories (2 in our case, males and females), so we have one degree of freedom. From the chi-squared table, the critical value is 3.84 for p = 0.05. Since 13.34 > 3.84, the expectation that the Male:Female ratio in the Honours class is 1:1 is rejected. A scripted check follows.
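
A minimal, hedged check of the same goodness-of-fit calculation with scipy (the counts are the ones on the slide; scipy is assumed to be available).

```python
# Chi-squared goodness-of-fit: are the Honours counts consistent with 1:1?
from scipy.stats import chisquare

observed = [80, 40]        # females, males
expected = [60, 60]        # 1:1 expectation over 120 students
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"X^2 = {stat:.2f}, p = {p_value:.4f}")   # X^2 = 13.33, p well below 0.05
# 13.33 > 3.84 (critical value at p = 0.05 with 1 degree of freedom),
# so the 1:1 expectation is rejected.
```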

Chi-squared test in Weka, using weather.nominal.arff (screenshots in the original slides).

Example of decision tree induction for attribute selection. Initial attribute set: {A1, A2, A3, A4, A5, A6}. The induced tree splits on A4 at the root, then on A1 and A6, with leaves labelled Class 1 and Class 2. Reduced attribute set (the attributes actually used by the tree): {A1, A4, A6}. A sketch of the same idea with a learned tree follows.
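
As a hedged sketch of this tree-based selection idea (the dataset and tree depth are arbitrary choices, not from the slides):

```python
# Tree-based attribute selection: keep only the features the tree actually uses.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Features with non-zero importance are the ones appearing in the tree's splits.
used = np.flatnonzero(tree.feature_importances_ > 0)
print("initial attribute count:", X.shape[1])
print("reduced attribute set (indices used by the tree):", used)
```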

Unsupervised feature extraction: PCA. Given N data vectors (samples) in k dimensions (features), find c <= k orthogonal dimensions that can best be used to represent the data; the feature set is reduced from k to c. Example: the data are a collection of emails, k = 100 word counts, c = 10 new features. The original data set is reduced by projecting the N data vectors onto the c principal components (reduced dimensions). Each (old) data vector X_j is approximated as a linear combination of the c principal component vectors Y_1, Y_2, ..., Y_c through weights w_{j1}, ..., w_{jc}: X_j = m + w_{j1}*Y_1 + w_{j2}*Y_2 + ... + w_{jc}*Y_c, for j = 1, 2, ..., N, where m is the mean of the data set and Y_1, Y_2, ... are the eigenvectors. PCA works for numeric data only and is used when the number of dimensions is large. A short sketch follows.
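
A hedged sketch matching the email example above: random counts stand in for real word counts, 100 features are reduced to 10 components, and scikit-learn is assumed to be available.

```python
# PCA as unsupervised feature extraction: 100 word-count features -> 10 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.poisson(lam=2.0, size=(500, 100))   # N=500 "emails", k=100 word counts

pca = PCA(n_components=10)
Z = pca.fit_transform(X)                    # projections onto the 10 components
print("reduced shape:", Z.shape)            # (500, 10)

# Each original vector is approximately mean + weighted sum of components:
X_approx = pca.inverse_transform(Z)         # = pca.mean_ + Z @ pca.components_
print("reconstruction error:", np.mean((X - X_approx) ** 2).round(3))
```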

Principal component analysis. See online tutorials such as http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf. (The slide shows a 2-D scatter plot with axes X1, X2 and principal directions Y1, Y2.) Note: Y1 is the first eigenvector and Y2 the second; Y2 is ignorable. Key observation: the first component is the direction of largest variance.

Principal component analysis (PCA): project onto the subspace with the most variance (unsupervised; it does not take y into account).

Principal component analysis: one attribute first. Temperature values: 42, 40, 24, 30, 15, 18, 35. Question: how much spread is in the data along the axis (distance to the mean)? Variance = (standard deviation)^2. A two-line computation follows.
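
For the temperature list on the slide, a minimal computation with numpy (note that np.var divides by n by default; ddof=1 gives the sample variance):

```python
# Variance of the single temperature attribute from the slide.
import numpy as np

temperature = np.array([42, 40, 24, 30, 15, 18, 35])
print("mean:", temperature.mean().round(2))
print("variance (ddof=1):", temperature.var(ddof=1).round(2))
print("std^2 check:", (temperature.std(ddof=1) ** 2).round(2))
```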

Now consider two dimensions: X = Temperature and Y = Humidity, measured together on each record (the slide lists sample values 40, 90, 30, 15, 70). Covariance measures the linear relationship between X and Y: cov(X,Y) = 0 means they are uncorrelated (no linear relationship); cov(X,Y) > 0 means they tend to move in the same direction; cov(X,Y) < 0 means they tend to move in opposite directions.

More than two attributes: the covariance matrix. It contains the covariance values between all possible pairs of dimensions (attributes). Example for three attributes (x, y, z):
  C = | cov(x,x)  cov(x,y)  cov(x,z) |
      | cov(y,x)  cov(y,y)  cov(y,z) |
      | cov(z,x)  cov(z,y)  cov(z,z) |
The sketch below computes such a matrix with numpy.
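
A minimal, hedged numpy illustration with three made-up attributes (the numbers are hypothetical):

```python
# Covariance matrix for three attributes (rows = records, columns = x, y, z).
import numpy as np

data = np.array([[40, 90, 1000],
                 [30, 70, 1005],
                 [15, 60,  998],
                 [24, 65, 1002]], dtype=float)   # hypothetical (temp, humidity, pressure)

C = np.cov(data, rowvar=False)   # 3x3 matrix of pairwise covariances
print(np.round(C, 2))            # diagonal entries are the variances
```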

Background: eigenvalues and eigenvectors. An eigenvector e of C with eigenvalue λ satisfies C e = λ e. How to calculate e and λ: compute det(C - λI), which yields a polynomial of degree n; the roots of det(C - λI) = 0 are the eigenvalues λ. Check out any linear algebra book, such as Elementary Linear Algebra by Howard Anton (John Wiley & Sons), or any math package such as MATLAB.

Steps of PCA: calculate the eigenvalues λ and eigenvectors e of the covariance matrix C. Each eigenvalue λ_j corresponds to the variance along component j, so sort the components by λ_j and take the first n eigenvectors e_i, where n is the number of top eigenvalues kept. These are the directions with the largest variances. A from-scratch sketch follows.
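
A hedged from-scratch sketch of these steps on arbitrary synthetic 2-D data (not the worked example below, whose numbers come from the original slides):

```python
# PCA steps by hand: center, covariance, eigendecomposition, sort, project.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[24, 54], cov=[[80, 100], [100, 480]], size=200)

X_centered = X - X.mean(axis=0)              # 1) subtract the mean
C = np.cov(X_centered, rowvar=False)         # 2) covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)         # 3) eigenvalues/eigenvectors (symmetric C)
order = np.argsort(eigvals)[::-1]            # 4) sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print("variance along each component:", np.round(eigvals, 1))

Y = X_centered @ eigvecs[:, :1]              # 5) keep only the top component
print("projected shape:", Y.shape)           # (200, 1)
```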

An example. Mean1 = 24.1, Mean2 = 53.8. X1' and X2' are the mean-centered values:
  X1   X2    X1'     X2'
  19   63   -5.1     9.25
  39   74   14.9    20.25
  30   87    5.9    33.25
  30   23    5.9   -30.75
  15   35   -9.1   -18.75
  15   43   -9.1   -10.75
  15   32   -9.1   -21.75
  30   73    5.9    19.25

Covariance matrix. For these data the covariance matrix is
  C = |  75  106 |
      | 106  482 |
Using MATLAB, we find the eigenvectors and eigenvalues:
  e1 = (-0.98, -0.21), λ1 = 51.8
  e2 = (0.21, -0.98),  λ2 = 560.2
Thus the second eigenvector (the one with the larger eigenvalue) is more important.

If we only keep one dimension, we keep e2 = (0.21, -0.98) and obtain the final data as y_i = 0.21*X1'_i - 0.98*X2'_i:
  y_i: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63

Using MATLAB to figure it out (screenshot of the MATLAB session in the original slides).

PCA in Weka (screenshot slide).

Weather data from the UCI dataset collection (comes with the Weka package); screenshot slide.

Summary of PCA. PCA is used for reducing the number of numerical attributes. The key is in the data transformation: adjust the data by subtracting the mean, find the eigenvectors of the covariance matrix, and transform the data. Note: the new features are only linear combinations (weighted sums) of the original data.

Summary. Data preparation is a big issue for data mining. Data preparation includes transformations such as: data sampling and feature selection, discretization, missing-value handling, incorrect-value handling, and feature selection / feature extraction.