
1 Vincent Granville, Ph.D. Co-Founder, DSC

2 Feature Selection For Unsupervised Learning
Abstract: After reviewing popular techniques used in supervised, unsupervised, and semi-supervised machine learning, we focus on feature selection methods in these different contexts, especially the metrics used to assess the value of a feature or set of features, whether binary, continuous, or categorical. We then review modern feature selection techniques for unsupervised learning in more detail; these typically rely on entropy-like criteria. While such criteria are usually model-dependent or scale-dependent, we introduce a new model-free, data-driven methodology in this context, with an application to an interesting number theory problem (a simulated data set) in which each feature has a known theoretical entropy. We also briefly discuss high precision computing, as it is relevant to this peculiar data set, as well as units of information smaller than the bit.

3 Content
1. Review of supervised and unsupervised learning
2. Feature selection
   * Supervised (goodness of fit)
   * Unsupervised (entropy)
3. New approach (unsupervised)

4 Supervised Learning
Based on training sets, cross-validation, and goodness-of-fit. Popular techniques:
1. Linear or logistic regression, predictive modeling
2. Neural nets, deep learning
3. Supervised classification
4. Regression and decision trees

5 Unsupervised Learning
No training set. Popular techniques:
1. Pattern recognition, association rules
2. Unsupervised clustering, taxonomy creation
3. Data reduction or compression
4. Graph-based methods
5. NLP, image processing
6. Semi-supervised learning
Note: there is overlap between supervised and unsupervised learning. Example: neural nets can be either supervised or unsupervised.

6 Unsupervised Clustering: Example
Star cluster classification based on age and metal content.
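As an illustration of the kind of clustering referred to on this slide, here is a minimal sketch (not code from the talk) that groups simulated star clusters by age and metal content using k-means; the two assumed populations, the numeric values, and the choice of two groups exist only for the example.

```python
# Minimal k-means sketch: group simulated star clusters by age and metallicity.
# The values and the number of groups are illustrative assumptions, not the talk's data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two hypothetical populations: old, metal-poor clusters and young, metal-rich clusters
old = np.column_stack([rng.normal(12.0, 1.0, 50),    # age in Gyr
                       rng.normal(-1.5, 0.3, 50)])   # metallicity [Fe/H]
young = np.column_stack([rng.normal(3.0, 1.0, 50),
                         rng.normal(0.0, 0.3, 50)])
X = np.vstack([old, young])

# Standardize so age and metallicity contribute on comparable scales
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
print(np.bincount(labels))   # size of each discovered group
```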

7 The Concept of Feature (1/2)
Two types of variables:
1. Dependent variable (or response) – supervised learning
2. Independent variables (also called predictors or features) – usually cross-correlated

Y = a1 X1 + a2 X2 + ...

Here:
* Y is the dependent variable,
* the X's are the features,
* the a's are the model parameters (to be estimated).
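To make the notation concrete, the short sketch below (an illustration, not code from the talk) simulates two features, builds Y from known coefficients a1 and a2, and recovers the a's by least squares; all numeric values are made up.

```python
# Illustration of Y = a1*X1 + a2*X2: estimate the model parameters by least squares.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                      # two features X1, X2 (columns)
a_true = np.array([2.0, -0.5])                     # hypothetical true parameters
y = X @ a_true + rng.normal(scale=0.1, size=200)   # response with a little noise

a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)      # estimated parameters
print(a_hat)                                       # close to [2.0, -0.5]
```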

8 The Concept of Feature (2/2)
Variables (features) can be:
* Discrete or continuous
* Qualitative (gender, country, cluster label)
* Mixed (unstructured, text, data)
* Binary (dummy variable to represent gender)
* Raw, binned, summarized, or compound variables
Potential issues:
* Duplicated or missing data
* Fuzzy data (when merging man-made databases, or due to typos)
Example: MIT = M.I.T. = Massachusetts Institute of Technology
Solution: use a lookup table, e.g. matching MIT with its variations
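The lookup-table fix for the MIT example can be as simple as a dictionary mapping known variants to one canonical value; the sketch below is a minimal illustration, and the lowercase variant is an assumed addition to the variants listed on the slide.

```python
# Normalize fuzzy institution names with a lookup table (variants are illustrative).
LOOKUP = {
    "MIT": "Massachusetts Institute of Technology",
    "M.I.T": "Massachusetts Institute of Technology",
    "M.I.T.": "Massachusetts Institute of Technology",
    "mit": "Massachusetts Institute of Technology",
}

def normalize(name: str) -> str:
    key = name.strip()
    # fall back to the raw value when no variant is known
    return LOOKUP.get(key, LOOKUP.get(key.lower(), key))

print(normalize("M.I.T"))   # Massachusetts Institute of Technology
```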

9 Feature Selection: Supervised Learning
Select features with the best predictive power.
Algorithms:
* Step-wise (add one feature at a time)
* Backward (start with all features, remove one at a time)
* Mixed
Criteria for feature selection:
* R-squared (or a robust version of this metric)
* Goodness-of-fit metric, confusion matrix (clustering)
Notes:
* Unlike PCA, it leaves features unchanged (easier to interpret)
* Combinatorial problem; a local optimum is acceptable, and a stopping rule is needed
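A minimal sketch of the step-wise (forward) variant using R-squared as the criterion, on simulated data; this is a generic illustration rather than the speaker's implementation, and the stopping rule based on a minimum R-squared gain is an assumption.

```python
# Forward step-wise feature selection driven by R-squared (generic sketch).
import numpy as np

def r_squared(X, y):
    # R-squared of an ordinary least squares fit of y on X (with intercept)
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_select(X, y, min_gain=0.01):
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # try adding each remaining feature, keep the one with the best improvement
        r2, j = max((r_squared(X[:, selected + [j]], y), j) for j in remaining)
        if r2 - best_r2 < min_gain:          # stopping rule (assumed threshold)
            break
        selected.append(j); remaining.remove(j); best_r2 = r2
    return selected, best_r2

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + X[:, 3] + rng.normal(scale=0.5, size=100)   # depends on features 0 and 3
print(forward_select(X, y))   # expected: features [0, 3] selected
```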

10 Feature Selection: Unsupervised Learning
Select features with the highest entropy.
Criteria for feature selection:
* Shannon entropy (categorical features)
* Joint entropy (measured on a set of features)
* Differential entropy (for continuous features)
* Akaike information (related to maximum likelihood)
* Kullback-Leibler divergence (closely related to Akaike information)
Drawbacks:
* Model-dependent (not data-driven)
* Scale-dependent
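For reference, a short sketch of how empirical Shannon entropy and joint entropy are computed from observed frequencies for categorical features; the example values are arbitrary.

```python
# Empirical Shannon entropy and joint entropy for categorical features (in bits).
from collections import Counter
from math import log2

def shannon_entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def joint_entropy(*features):
    # entropy of the tuple formed by observing several features jointly
    return shannon_entropy(list(zip(*features)))

x1 = [0, 1, 1, 0, 1, 0, 1, 1]                   # binary feature
x2 = ["a", "a", "b", "b", "a", "b", "a", "a"]   # categorical feature
print(shannon_entropy(x1))     # about 0.95 bits (nearly balanced)
print(joint_entropy(x1, x2))   # never exceeds H(x1) + H(x2)
```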

11 New Methodology for Unsupervised Learning (1/3)
Create an artificial response Y, as if dealing with a linear regression problem. Set all regression coefficients to 1. Compute R-squared (or a similar metric) on subsets of features, to identify the best subsets for a fixed number of features. Add one (or two) features at a time. Proceed as in the supervised learning framework.
Benefits:
* Scale-invariant
* Model-free, data-driven
* Simple, easy to understand
* Can handle categorical features, as dummy variables
More advanced version:
* Test various sets of regression coefficients (Monte Carlo simulations)
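The sketch below is one possible reading of this recipe, not the speaker's code: the artificial response is the plain sum of the features (all coefficients equal to 1), and subsets of a fixed size are ranked by the R-squared of regressing Y on the subset. The toy data, independent Gaussian features with different variances, is chosen only so that the ranking has an obvious interpretation (the features carrying the most variability come out on top).

```python
# Sketch of the proposed unsupervised criterion: build an artificial response
# Y = X1 + X2 + ... (all coefficients set to 1), then rank feature subsets of a
# fixed size by the R-squared obtained when regressing Y on each subset.
import numpy as np
from itertools import combinations

def r_squared(X, y):
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def best_subsets(X, size):
    y = X.sum(axis=1)                          # artificial response, coefficients = 1
    scored = [(r_squared(X[:, list(s)], y), s)
              for s in combinations(range(X.shape[1]), size)]
    return sorted(scored, reverse=True)        # best subsets first

rng = np.random.default_rng(3)
# four independent features with decreasing variance (toy data, not the talk's data set)
X = rng.normal(size=(500, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
for score, subset in best_subsets(X, size=2)[:3]:
    print(subset, round(score, 3))   # subsets containing features 0 and 1 score highest
```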

12 New Methodology for Unsupervised Learning (2/3)
Example: simulated data sets with two binary features X1 and X2, an artificial response Y = X1 + X2, and a known theoretical entropy for each feature. The features are correlated and also exhibit auto-correlation, as in real data sets. The data sets have 47 observations. Instead of entropy, we use the correlation between the response and a feature to measure the amount of information attached to that feature.
Results: the computed Shannon entropy, the theoretical entropy, and the correlation-based information metric introduced here are almost equivalent: the higher the correlation, the higher the entropy. Divergence occurs only when both features have almost the same entropy, so this is not an issue. The best feature is the one maximizing the selection criterion.
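The actual 47-digit sequences live in the speaker's spreadsheet and are not reproduced here, so the sketch below uses placeholder binary features (generated independently, unlike the correlated features of the talk) simply to show the two quantities being compared: the empirical Shannon entropy of each feature and its correlation with the artificial response Y = X1 + X2.

```python
# Compare the correlation-based criterion with empirical Shannon entropy on two
# binary features of length 47. Placeholder data: the real digit sequences are in
# the speaker's spreadsheet and are not reproduced here.
import numpy as np
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

rng = np.random.default_rng(4)
x1 = (rng.random(47) < 0.5).astype(int)   # roughly balanced digits: entropy near 1 bit
x2 = (rng.random(47) < 0.9).astype(int)   # skewed digits: lower entropy
y = x1 + x2                               # artificial response Y = X1 + X2

for name, x in [("X1", x1), ("X2", x2)]:
    corr = np.corrcoef(x, y)[0, 1]
    print(name, "entropy:", round(entropy(x.tolist()), 3), "corr with Y:", round(corr, 3))
# In this setup the higher-entropy feature also has the higher correlation with Y.
```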

13 New Methodology for Unsupervised Learning (3/3)
Details about the example: the data set in the spreadsheet consists of the first 47 digits of two numbers:
* Feature #1: digits of log(3/2) in base b =
* Feature #2: digits of SQRT(1/2) in base b = 2
Interesting facts:
* For more than 47 observations, HPC is needed due to machine precision
* A digit in base 1 < b < 2 carries less than one bit of information
* The theoretical entropy is proportional to the logarithm of the base b
* The methodology was also successfully tested on continuous features
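The slide does not spell out the numeration system; a common choice for a non-integer base 1 < b <= 2 is the greedy (beta) expansion sketched below, under which each digit carries at most log2(b) bits, consistent with the "less than one bit per digit" remark. Only the use of SQRT(1/2) and base b = 2 for feature #2 comes from the slide; the rest is an assumption, and the exact construction used in the talk is in the linked document.

```python
# Greedy (beta) expansion: digits of x in a base 1 < b <= 2 (an assumed numeration
# system; the talk's exact construction is described in the linked document).
from math import sqrt, log2

def digits(x, b, n):
    # first n digits of x in base b, for 0 <= x < 1, using d = floor(b * x)
    out = []
    for _ in range(n):
        x *= b
        d = int(x)      # digit is 0 or 1 when b <= 2
        out.append(d)
        x -= d
    return out

# Feature #2 from the slide: the first 47 digits of SQRT(1/2) in base b = 2.
print(digits(sqrt(0.5), 2, 47))

# A digit in base b carries at most log2(b) bits, i.e. less than one bit when b < 2.
print(log2(1.5))   # about 0.585 bits per digit in base 1.5

# In double precision the rounding error roughly grows by a factor b at each step,
# so only a few dozen digits are reliable without high precision arithmetic.
```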

14 Resources
The following web document (see URL below) contains:
* References to various entropy criteria and feature selection techniques
* Details about the new methodology
* Access to the spreadsheet with data and detailed computations
* A reference to the underlying number theory context (numeration systems)
* A reference to high precision computing
Link:

15 Thank You
Vincent Granville, Ph.D. Co-Founder, DSC
DataScienceCentral.com
IBM Community Day: Data Science / July 24, 2018 / © 2018 IBM Corporation


