Feature Selection: To Avoid the “Curse of Dimensionality”

Presentation transcript:

Feature Selection
In many applications we encounter a very large number of potential features that could be used. Which subset of features should be used for the best classification? We need a small number of discriminative features:
- To avoid the “curse of dimensionality”
- To reduce feature measurement cost
- To reduce computational burden
Given an n x d pattern matrix (n patterns in d-dimensional feature space), generate an n x m pattern matrix, where m << d.
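A minimal sketch of this reduction in NumPy (the data and the chosen feature indices below are illustrative placeholders, not outputs of any selection method):

```python
import numpy as np

n, d, m = 100, 24, 12                    # n patterns, d original features, m selected features
patterns = np.random.randn(n, d)         # n x d pattern matrix (synthetic data for illustration)

selected = [0, 2, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23]   # hypothetical subset of m feature indices
reduced = patterns[:, selected]          # n x m pattern matrix

print(patterns.shape, "->", reduced.shape)   # (100, 24) -> (100, 12)
```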

Feature Selection vs. Extraction
Both are collectively known as dimensionality reduction.
- Selection: choose the best subset of size m from the available d features.
- Extraction: given d features (set Y), extract m new features (set X) by a linear or non-linear combination of all d features.
  - Linear feature extraction: X = TY, where T is an m x d matrix.
  - Non-linear feature extraction: X = f(Y).
  - Features obtained by extraction may not have a physical interpretation/meaning.
- Examples of linear feature extraction: unsupervised (PCA); supervised (LDA/MDA).
- Criterion for selection/extraction: improve or maintain classification accuracy while simplifying classifier complexity.
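A hedged NumPy sketch of the contrast: selection keeps m original columns of Y, while linear extraction applies an m x d projection T (here a PCA-style projection estimated from the data covariance); the data and the chosen columns are illustrative.

```python
import numpy as np

n, d, m = 200, 10, 3
Y = np.random.randn(n, d)                     # n patterns (rows), d original features

# Selection: keep m of the original columns; the features retain their meaning
X_sel = Y[:, [1, 4, 7]]                       # hypothetical subset of size m

# Linear extraction (unsupervised, PCA-style): with patterns as rows, X = TY becomes X = Yc @ T.T
Yc = Y - Y.mean(axis=0)                       # center the data
cov = np.cov(Yc, rowvar=False)                # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
T = eigvecs[:, ::-1][:, :m].T                 # m x d projection onto the top-m principal directions
X_ext = Yc @ T.T                              # n x m; each new feature mixes all d originals

print(X_sel.shape, X_ext.shape)               # (200, 3) (200, 3)
```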

Feature Selection
How do we find the best subset of size m? Recall that “best” means the classifier built on these m features has the lowest probability of error of all such classifiers.
- The simplest approach is an exhaustive search, which is computationally prohibitive: for d=24 and m=12 there are about 2.7 million possible feature subsets!
- Cover & Van Campenhout (IEEE SMC, 1977) showed that to guarantee the best subset of size m from an available set of size d, one must examine all possible subsets of size m.
- Heuristics have been used to avoid exhaustive search.
How do we evaluate the subsets?
- Error rate (but then which classifier should be used?)
- Distance measures: Mahalanobis, divergence, …
Feature selection is an optimization problem.
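For reference, the subset count quoted above and a sketch of exhaustive search over all m-subsets, assuming a criterion function J supplied by the caller (larger J is better):

```python
import math
from itertools import combinations

d, m = 24, 12
print(math.comb(d, m))                   # 2704156 -- the "about 2.7 million" subsets quoted above

def exhaustive_search(features, m, J):
    """Guaranteed-optimal but computationally prohibitive: examines all C(d, m) subsets."""
    return max(combinations(features, m), key=lambda subset: J(list(subset)))
```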

Feature Selection: Evaluation, Application, and Small Sample Performance (Jain & Zongker, IEEE Trans. PAMI, Feb 1997)
- Value of feature selection in combining features from different data models.
- Potential difficulties feature selection faces in small-sample-size situations.
- Let Y be the original set of features and X be the selected subset. The feature selection criterion function for the set X is J(X); larger values of J indicate a better feature subset. The problem is to find a subset X of Y such that J(X) is the maximum over all subsets of the given size.
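Written out, the objective implied above (with m denoting the desired subset size) is:

```latex
X^{*} \;=\; \arg\max_{X \subseteq Y,\; |X| = m} J(X)
```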

Taxonomy of Feature Selection Algorithms

Deterministic Single-Solution Methods
- Begin with a single solution (feature subset) and iteratively add or remove features until some termination criterion is met.
- Also known as sequential methods; the most popular category.
- Bottom-up/forward methods begin with an empty set and add features; top-down/backward methods begin with the full set and delete features.
- Since they do not examine all possible subsets, there is no guarantee of finding the optimal subset.
- Pudil et al. introduced two floating selection methods: SFFS and SFBS.
- The 15 feature selection methods listed in Table 1 were evaluated.

Sequential Forward Selection (SFS)
- Start with the empty set, X = ∅.
- Repeatedly add the most significant feature with respect to X.
- Disadvantage: once a feature is retained it cannot be discarded (the nesting problem).
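A minimal SFS sketch, assuming a criterion function J that scores a candidate subset (the function and variable names are illustrative):

```python
def sfs(features, m, J):
    """Sequential Forward Selection: greedily grow a subset up to size m."""
    X, remaining = [], list(features)
    while len(X) < m:
        # The most significant feature is the one whose addition maximizes J
        best = max(remaining, key=lambda f: J(X + [f]))
        X.append(best)
        remaining.remove(best)
    return X
```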

Sequential Backward Selection (SBS)
- Start with the full set, X = Y.
- Repeatedly delete the least significant feature in X.
- Disadvantages: SBS requires more computation than SFS (the criterion is evaluated on larger subsets); nesting problem.
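The backward counterpart, under the same assumptions about J:

```python
def sbs(features, m, J):
    """Sequential Backward Selection: greedily shrink the full set down to size m."""
    X = list(features)
    while len(X) > m:
        # The least significant feature is the one whose removal hurts J the least
        worst = max(X, key=lambda f: J([g for g in X if g != f]))
        X.remove(worst)
    return X
```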

Generalized Sequential Forward Selection (GSFS(m))
- Start with the empty set, X = ∅.
- Repeatedly add the most significant m-subset of (Y - X), found through exhaustive search over all m-subsets.
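A sketch of a single GSFS step (the group size is written r here to keep it distinct from the target subset size m used elsewhere on these slides); the exhaustive search over r-subsets is what makes each step expensive:

```python
from itertools import combinations

def gsfs_step(X, remaining, r, J):
    """One GSFS(r) step: add the most significant r-subset of the remaining features to X."""
    best_group = max(combinations(remaining, r), key=lambda grp: J(list(X) + list(grp)))
    new_X = list(X) + list(best_group)
    new_remaining = [f for f in remaining if f not in best_group]
    return new_X, new_remaining
```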

Generalized Sequential Backward Selection (GSBS(m))
- Start with the full set, X = Y.
- Repeatedly delete the least significant m-subset of X, found through exhaustive search over all m-subsets.

Sequential Forward Floating Selection (SFFS)
Step 1: Inclusion. Select the most significant feature with respect to X and add it to X. Continue to step 2.
Step 2: Conditional exclusion. Find the least significant feature k in X. If it is the feature just added, keep it and return to step 1. Otherwise, exclude feature k; note that X is now better than it was before step 1. Continue to step 3.
Step 3: Continuation of conditional exclusion. Again find the least significant feature in X. If its removal will (a) leave X with at least 2 features, and (b) make the value of J(X) greater than the criterion value of the best feature subset of that size found so far, then remove it and repeat step 3. When these two conditions cease to be satisfied, return to step 1.
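A hedged sketch of SFFS following the three steps above, assuming a criterion J and a target size m. The bookkeeping of the best subset found for each size, and the guard that only excludes when the reduced subset beats the best of its size, follow the usual description of the algorithm; details may differ from the original pseudocode.

```python
def sffs(features, m, J):
    """Sequential Forward Floating Selection (sketch): grow to size m with conditional backtracking."""
    X, remaining = [], list(features)
    best_value, best_subset = {}, {}          # best criterion value / subset seen for each size

    def record(subset):
        k, v = len(subset), J(subset)
        if v > best_value.get(k, float("-inf")):
            best_value[k], best_subset[k] = v, list(subset)

    def least_significant(subset):
        # The feature whose removal hurts J the least
        return max(subset, key=lambda f: J([g for g in subset if g != f]))

    while len(X) < m:
        # Step 1: inclusion of the most significant feature with respect to X
        f = max(remaining, key=lambda g: J(X + [g]))
        X.append(f)
        remaining.remove(f)
        record(X)

        # Steps 2-3: conditional exclusion -- keep the feature just added, and only backtrack
        # while removal leaves at least 2 features and beats the best subset of that size so far
        while len(X) > 2:
            worst = least_significant(X)
            if worst == f:
                break
            reduced = [g for g in X if g != worst]
            if J(reduced) <= best_value.get(len(reduced), float("-inf")):
                break
            X = reduced
            remaining.append(worst)
            record(X)

    return best_subset.get(m, X)
```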

Experimental Results
- 20-dimensional, 2-class Gaussian data with a common covariance matrix.
- Goodness of features is measured by the Mahalanobis distance.
- Forward search methods are faster than their backward counterparts.
- The floating methods perform comparably to branch & bound, but are faster.
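A sketch of such a Mahalanobis-distance criterion for a two-class problem with a common covariance matrix, evaluated on a candidate feature subset (sample means and a pooled covariance estimate stand in for the true parameters):

```python
import numpy as np

def mahalanobis_criterion(X1, X2, subset):
    """J(subset): squared Mahalanobis distance between two classes on the chosen features.
    X1 and X2 are (n1 x d) and (n2 x d) sample matrices; subset is a list of feature indices."""
    A, B = X1[:, subset], X2[:, subset]
    diff = A.mean(axis=0) - B.mean(axis=0)
    # Pooled covariance estimate under the common-covariance assumption
    pooled = ((len(A) - 1) * np.cov(A, rowvar=False) +
              (len(B) - 1) * np.cov(B, rowvar=False)) / (len(A) + len(B) - 2)
    pooled = np.atleast_2d(pooled)            # handles single-feature subsets
    return float(diff @ np.linalg.solve(pooled, diff))
```

Such a criterion can be passed directly to the SFS/SBS/SFFS sketches above, e.g. J = lambda S: mahalanobis_criterion(X1, X2, S).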

Selection of Texture Features
- Selection of texture features for classifying Synthetic Aperture Radar (SAR) images.
- A total of 18 different features were extracted from 4 different models.
- Can classification error be reduced by feature selection?
- 22,000 samples (pixels) from 5 classes, equally split into training and test sets.

Performance of SFFS on Texture Features
- The best individual texture model for this data is the MAR model.
- Pooling features from different models and then applying feature selection yields an accuracy of 89.3% with the 1NN classifier.
- The selected subset has representative features from every model.

Effect of Training Set Size on Feature Selection
- Suppose the criterion function is the Mahalanobis distance; how does the error in estimating the covariance matrix under small sample sizes affect feature selection performance?
- Run feature selection on the Trunk data with varying sample size: 20-dimensional data from the distributions in (2) and (3), with n varied from 10 to 5,000.
- Feature selection quality: the number of common features in the subset selected by SFFS and by the optimal method.
- For n=20, B&B selected the subset {1,2,4,7,9,12,13,14,15,18}; the optimal subset is {1,2,3,4,5,6,7,8,9,10}.
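The quality measure on the n=20 example above is simply the size of the intersection of the two subsets:

```python
selected = {1, 2, 4, 7, 9, 12, 13, 14, 15, 18}   # subset selected by B&B at n=20 (from the slide)
optimal  = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}       # optimal subset quoted on the slide

print(len(selected & optimal))                   # 5 common features: {1, 2, 4, 7, 9}
```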