Feature selection based on information theory, consistency and separability indices
Włodzisław Duch, Tomasz Winiarski, Krzysztof Grąbczewski, Jacek Biesiada, Adam Kachel
Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland
ICONIP Singapore,

What am I going to say
Selection of information
Information theory - filters
Information theory - selection
Consistency indices
Separability indices
Empirical comparison: artificial data
Empirical comparison: real data
Conclusions, or what have we learned?

Selection of information
Attention: basic cognitive skill.
Find relevant information:
– discard attributes that do not contain information,
– use weights to express the relative importance,
– create new, more informative attributes,
– reduce dimensionality by aggregating information.
Ranking: treat each feature as independent. Selection: search for subsets, remove redundant features.
Filters: universal, model-independent criteria. Wrappers: criteria specific to the data model that will be used.
Here: filters for ranking and selection.

Information theory - filters
X – vectors, X_j – attributes, X_j = f – attribute values, C_i – classes, i = 1..K; joint probability distribution p(C, X_j).
The amount of information contained in this joint distribution, summed over all classes, gives an estimation of feature importance; see the formula below.
For continuous attribute values the integrals are approximated by sums. This implies discretization into r_k(f) regions, an issue in itself.
Alternative: fitting the p(C_i, f) density using Gaussian or other kernels.
Which method is more accurate and what are the expected errors?
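A standard way to write this index, assuming the usual Shannon entropy of the discretized joint distribution, is:

```latex
% Information contained in the joint class/attribute distribution,
% with attribute values f discretized into regions r_k(f), k = 1..N_f:
I(C, X_j) \;=\; -\sum_{i=1}^{K} \sum_{k=1}^{N_f}
    p\bigl(C_i, r_k(f)\bigr)\, \log_2 p\bigl(C_i, r_k(f)\bigr)
```

The marginal informations I(C) and I(X_j) used on the next slides are defined in the same way from p(C_i) and p(r_k(f)).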

Information gain
Information gained by considering the joint probability distribution p(C, f) is the difference between the information in the marginal distributions and the information in the joint distribution (see below).
A feature is more important if its information gain is larger.
Modifications of the information gain, frequently used as criteria in decision trees, include:
IGR(C,X_j) = IG(C,X_j)/I(X_j) – the gain ratio
IGn(C,X_j) = IG(C,X_j)/I(C) – an asymmetric dependency coefficient
D_M(C,X_j) = 1 − IG(C,X_j)/I(C,X_j) – the normalized Mantaras distance
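In the same notation (this reconstruction assumes the standard definitions, including the "1 −" in the Mantaras distance), the gain and its normalized variants read:

```latex
IG(C, X_j) \;=\; I(C) + I(X_j) - I(C, X_j) \;=\; I(C) - I(C \mid X_j)

IGR(C, X_j) \;=\; \frac{IG(C, X_j)}{I(X_j)}, \qquad
IG_n(C, X_j) \;=\; \frac{IG(C, X_j)}{I(C)}, \qquad
D_M(C, X_j)  \;=\; 1 - \frac{IG(C, X_j)}{I(C, X_j)}
```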

Information indices
Information gained by considering attribute X_j and the classes C together is also known as "mutual information", equal to the Kullback-Leibler divergence between the joint and product probability distributions.
The entropy distance measure is a sum of conditional informations.
The symmetrical uncertainty coefficient is obtained from the entropy distance (all three are written out below).
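In standard form, assuming the usual conventions for these three quantities and writing I(C|X_j) for the conditional information:

```latex
MI(C, X_j) \;=\; \sum_{i=1}^{K} \sum_{k=1}^{N_f}
    p\bigl(C_i, r_k(f)\bigr)\,
    \log_2 \frac{p\bigl(C_i, r_k(f)\bigr)}{p(C_i)\, p\bigl(r_k(f)\bigr)}
  \;=\; D_{KL}\bigl(p(C, X_j)\,\|\,p(C)\,p(X_j)\bigr)

D(C, X_j) \;=\; I(C \mid X_j) + I(X_j \mid C)

U(C, X_j) \;=\; 1 - \frac{D(C, X_j)}{I(C) + I(X_j)}
           \;=\; \frac{2\, MI(C, X_j)}{I(C) + I(X_j)}
```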

Weighted I(C,X)
The joint information should be weighted by p(r_k(f)), the probability of each discretization region.
For continuous attribute values the integrals are approximated by sums. This implies discretization into r_k(f) regions, an issue in itself.
Alternative: fitting the p(C_i, f) density using Gaussian or other kernels.
Which method is more accurate and how large are the expected errors?
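A minimal numerical sketch of the two routes mentioned above (the function names, bin counts and toy data are illustrative, not taken from the paper): MI(C; f) estimated from an equal-width discretization versus from class-conditional Gaussian kernel densities.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_histogram(f, y, n_bins=8):
    """MI(C; f) from an equal-width discretization of a continuous feature f."""
    edges = np.linspace(f.min(), f.max(), n_bins + 1)
    classes = np.unique(y)
    # joint counts -> p(C_i, r_k(f))
    joint = np.array([np.histogram(f[y == c], bins=edges)[0] for c in classes],
                     dtype=float)
    joint /= joint.sum()
    pc = joint.sum(axis=1, keepdims=True)        # p(C_i)
    pf = joint.sum(axis=0, keepdims=True)        # p(r_k(f))
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pc @ pf)[nz])))

def mi_kde(f, y, grid_size=512):
    """MI(C; f) with class-conditional densities p(f | C_i) fitted by Gaussian kernels."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / counts.sum()
    grid = np.linspace(f.min(), f.max(), grid_size)
    cond = np.array([gaussian_kde(f[y == c])(grid) for c in classes])   # p(f | C_i)
    marginal = priors @ cond                                            # p(f)
    integrand = sum(prior * p * np.log2(np.maximum(p, 1e-300) / np.maximum(marginal, 1e-300))
                    for prior, p in zip(priors, cond))
    return float(np.sum(integrand) * (grid[1] - grid[0]))               # Riemann sum

# Toy check: two well-separated classes should give MI(C; f) close to 1 bit.
rng = np.random.default_rng(0)
f = np.concatenate([rng.normal(0, 1, 1000), rng.normal(4, 1, 1000)])
y = np.repeat([0, 1], 1000)
print(mi_histogram(f, y), mi_kde(f, y))
```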

Purity indices
Many information-based quantities may be used to evaluate attributes. Consistency or purity-based indices are one alternative.
For selection of a subset of attributes F = {X_i} the sum runs over all Cartesian products, i.e. the multidimensional partitions r_k(F).
Advantages: the simplest approach, usable both for ranking and for selection.
Hashing techniques are used to calculate the p(r_k(F)) probabilities.
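A minimal sketch of such an index, assuming a purity definition of the form sum over cells of p(r_k(F)) times max_i p(C_i | r_k(F)) (the paper's exact index may differ). A dictionary keyed by tuples of per-feature bin indices serves as the hash table, so only the occupied cells of the Cartesian product are ever stored.

```python
import numpy as np
from collections import defaultdict

def purity_index(X, y, feature_subset, n_bins=8):
    """Purity of a feature subset F = {X_i}: sum over occupied cells r_k(F)
    of p(r_k(F)) * max_i p(C_i | r_k(F)).  Cells of the multidimensional
    partition are hashed by their tuple of per-feature bin indices."""
    # equal-width bin codes for every selected feature
    codes = []
    for j in feature_subset:
        col = X[:, j]
        inner_edges = np.linspace(col.min(), col.max(), n_bins + 1)[1:-1]
        codes.append(np.digitize(col, inner_edges))
    codes = np.stack(codes, axis=1)
    # hash table: cell key -> class label -> count
    cells = defaultdict(lambda: defaultdict(int))
    for key, label in zip(map(tuple, codes), y):
        cells[key][label] += 1
    majority = sum(max(class_counts.values()) for class_counts in cells.values())
    return majority / len(y)

# Example: purity of the subset {X_1, X_2} (columns 0 and 1) of a data matrix X.
# purity_index(X, y, [0, 1])
```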

4 Gaussians in 8D
Artificial data: a set of 4 Gaussians in 8D, 1000 points per Gaussian, each forming a separate class.
Dimensions 1-4 are independent, with the Gaussians centered at: (0,0,0,0), (2,1,0.5,0.25), (4,2,1,0.5), (6,3,1.5,0.75).
Ranking and overlapping strength are inversely related; ranking: X_1 > X_2 > X_3 > X_4.
Attributes X_{i+4} = 2 X_i + uniform noise ±1.5.
Best ranking: X_1 > X_5 > X_2 > X_6 > X_3 > X_7 > X_4 > X_8.
Best selection: X_1 > X_2 > X_3 > X_4 > X_5 > X_6 > X_7 > X_8.
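A sketch generating this data set as described above; unit variance for dimensions 1-4 is an assumption, since the slide fixes only the class centres and the noise on the copies.

```python
import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0, 0.0],
                    [2.0, 1.0, 0.5, 0.25],
                    [4.0, 2.0, 1.0, 0.5],
                    [6.0, 3.0, 1.5, 0.75]])
n_per_class = 1000

# Dimensions 1-4: independent Gaussians around each class centre
# (scale=1.0 is an assumed standard deviation, not stated on the slide).
X4 = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, 4)) for c in centers])
y = np.repeat(np.arange(4), n_per_class)

# Dimensions 5-8: noisy linear copies X_{i+4} = 2*X_i + uniform noise in [-1.5, 1.5].
X = np.hstack([X4, 2 * X4 + rng.uniform(-1.5, 1.5, size=X4.shape)])
```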

Dim X_1 vs. X_2

Dim X_1 vs. X_5

Ranking algorithms
WI(C,f): information from the weighted p(r(f))p(C,r(f)) distribution
MI(C,f): mutual information (information gain)
IGR(C,f): information gain ratio
IC(C|f): information from the max_C posterior distribution
GD(C,f): transinformation matrix with Mahalanobis distance
+ 7 other methods based on IC and correlation-based distances, Markov blanket and ReliefF selection methods.
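The ranking procedure itself is the same for all of these indices: evaluate each feature independently and sort. A minimal sketch (scikit-learn's k-NN based MI estimator stands in here for the indices above; the toy data are illustrative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 500)
X = np.column_stack([y + rng.normal(0, 0.5, 1000),   # strongly informative
                     y + rng.normal(0, 2.0, 1000),   # weakly informative
                     rng.normal(size=1000),          # pure noise
                     rng.normal(size=1000)])         # pure noise

# Ranking: one score per feature, largest first.
scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]
print(ranking, scores[ranking])    # expected: features 0, 1 ahead of the noise
```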

Selection algorithms
Maximize the evaluation criterion for single features and remove redundant ones.
1. MI(C;f) − MI(f;g) algorithm (Battiti 1994)
2. IC(C,f) − IC(f,g), the same algorithm but with the IC criterion
3. Max IC(C;F): adding the single attribute that maximizes IC
4. Max MI(C;F): adding the single attribute that maximizes MI
5. SSV decision tree, based on the separability criterion.
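A sketch of the first, Battiti-style greedy scheme (the equal-width discretization and beta = 0.5 are illustrative choices; class labels are assumed to be small non-negative integers):

```python
import numpy as np

def mi_from_counts(a, b):
    """Mutual information (bits) between two discrete/discretized variables."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])))

def discretize(X, n_bins=8):
    """Equal-width bin codes for every column of X."""
    codes = np.empty_like(X, dtype=int)
    for j in range(X.shape[1]):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1)[1:-1]
        codes[:, j] = np.digitize(X[:, j], edges)
    return codes

def mifs(X, y, n_select, beta=0.5, n_bins=8):
    """Greedy MIFS-style selection (after Battiti 1994): at each step add the
    feature maximizing MI(C; f) - beta * sum_{g selected} MI(f; g)."""
    Xd, y = discretize(X, n_bins), np.asarray(y)   # y: integer class labels
    relevance = [mi_from_counts(y, Xd[:, j]) for j in range(Xd.shape[1])]
    selected, remaining = [], list(range(Xd.shape[1]))
    while remaining and len(selected) < n_select:
        def score(j):
            redundancy = sum(mi_from_counts(Xd[:, j], Xd[:, g]) for g in selected)
            return relevance[j] - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```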

Ranking for 8D Gaussians
Partitions of each attribute into 4, 8, 16, 24 or 32 equal-width parts.
Methods that found the perfect ranking: MI(C;f), IGR(C;f), WI(C,f), GD transinformation distance.
IC(f): correct, except for P8, where features 2 and 6 are reversed (6 is the noisy version of 2).
Other, more sophisticated algorithms made more errors.
Selection for Gaussian distributions is rather easy using any evaluation measure; simpler algorithms work better.

Selection for 8D Gaussians
Partitions of each attribute into 4, 8, 16, 24 or 32 equal-width parts.
Ideal selection: subsets with {1}, {1+2}, {1+2+3} or {1+2+3+4} attributes.
1. MI(C;f) − MI(f;g) algorithm: P24 no errors; for P8, P16, P32 a small error (4 ↔ 8).
2. Max MI(C;F): P8-P24 no errors, P32 (3,4 ↔ 7,8).
3. Max IC(C;F): P24 no errors, P8 (2 ↔ 6), P16 (3 ↔ 7), P32 (3,4 ↔ 7,8).
4. SSV decision tree based on the separability criterion: creates its own discretization. Selects 1, 2, 6, 3, 7; the others are not important. Univariate trees have a bias for slanted distributions.
Selection should take into account the type of classification system that will be used.

Hypothyroid: equal bins
Mutual information for different numbers of equal-width partitions, ordered from largest to smallest, for the hypothyroid data: 6 continuous and 15 binary attributes.

Hypothyroid: SSV bins
Mutual information for different numbers of SSV decision tree partitions, ordered from largest to smallest, for the hypothyroid data. Values are twice as large, since the bins are more pure.

Hypothyroid: ranking
Best ranking: largest area under the accuracy(best n features) curve.
SBL: evaluating and adding one attribute at a time (costly).
Best 2: SBL; best 3: SSV BFS; best 4: SSV beam; BA – failure.
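A sketch of this comparison criterion (the classifier and cross-validation protocol are illustrative stand-ins, not the ones behind the slide's results):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def ranking_area(X, y, ranking, clf=None, cv=5):
    """Area under the accuracy(best n features) curve for a given feature
    ranking: train on the top-1, top-2, ... feature subsets and accumulate
    the cross-validated accuracies (unit spacing on the n axis, so the area
    is simply their sum).  A larger area means a better ranking."""
    clf = clf if clf is not None else GaussianNB()
    accs = [cross_val_score(clf, X[:, ranking[:n]], y, cv=cv).mean()
            for n in range(1, len(ranking) + 1)]
    return float(np.sum(accs)), accs
```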

Hypothyroid: ranking
Results from the FSM neurofuzzy system.
Best 2: SBL; best 3: SSV BFS; best 4: SSV beam; BA – failure.
Global correlation misses local usefulness...

Hypothyroid: SSV ranking
More results using FSM and selection based on SSV. SSV with beam search at P24 finds the best small subsets, depending on the search depth; here the best results are achieved with 5 attributes.

Conclusions
About 20 ranking and selection methods have been checked.
The actual feature evaluation index (information, consistency, correlation) is not so important.
Discretization is very important: naive equi-width or equidistance discretization may give unpredictable results; entropy-based discretization is fine, but separability-based discretization is less expensive.
Continuous kernel-based approximations to the calculation of feature evaluation indices are a useful alternative.
Ranking is easy if a global evaluation is sufficient, but different sets of features may be important for the separation of different classes, and some are important only in small regions – cf. decision trees.
Selection requires the calculation of multidimensional evaluation indices, done efficiently using hashing techniques.
Local selection and ranking is the most promising technique.

Open questions
Discretization or kernel estimation? Best discretization: V-optimal histograms, entropy, separability? How useful is fuzzy partitioning?
Use of feature weighting from ranking/selection to scale the input data.
How to construct an evaluation index that includes local information?
How to use selection methods to find combinations of attributes?
These and other ranking/selection methods will be integrated into the GhostMiner data mining package (Google: GhostMiner).
Is a best filter-based selection method possible? Perhaps it depends on the ability of different methods to use the information contained in the selected attributes.