Feature Selection Topics


Feature Selection Topics
Sergei V. Gleyzer
Data Science at the LHC Workshop, Nov. 9, 2015

Outline

- Motivation
- What is feature selection
- Feature selection methods
- Recent work and ideas
- Caveats

Motivation

Common Machine Learning (ML) problems in HEP:
- Classification, or class discrimination: Higgs event or background?
- Regression, or function estimation: how best to model particle energy based on detector measurements

Motivation (continued)

While performing data analysis, one of the most crucial decisions is which features to use: Garbage In = Garbage Out.

Ingredients:
- Relevance to the problem
- Level of understanding of the feature
- Power of the feature and its relationship with others

Goal

How to:
- Select
- Assess
- Improve

the feature set used to solve the problem.

Example

Build a classifier to discriminate events of different classes based on event kinematics.

Typical initial feature set:
- Functions of object four-vectors in the event
- Basic kinematics: transverse momenta, invariant masses, angular separations
- More complex features relating objects in the event topology, using physics knowledge to help discriminate among classes (thrusts, helicity, etc.)

Initial Selection

Features are initially chosen for their individual performance: how well does X discriminate between signal and background?

Vetoes: is X well understood?
- Theoretical and other uncertainties
- Monte Carlo and data agreement

This typically yields on the order of 10-30 features (95% of use cases).

Feature Engineering

By combining features with each other and boosting into other frames of reference*, this set can grow quickly from tens to hundreds of features.
- That is fine if you have enough computational power
- Still small compared to the ~100k features of cancer or image-recognition datasets
- Balance Occam's razor against the need for additional performance/power

* JHEP 1104:069, 2011, K. Black et al.

Practicum

- Training set → feature selection → model building f() ✓
- Training/test set → feature selection* → model building f() ✗
- Test set → f() → estimate performance ✓

* Feature selection bias
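Why the middle path is wrong can be demonstrated numerically. A minimal sketch with purely synthetic data (all names and numbers are illustrative): with many noise features and random labels, the feature that looks best on the sample used for selection carries no signal on an independent sample.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 500

# Pure noise: the labels carry no information about any feature.
X_sel = rng.normal(size=(n, p))
y_sel = rng.integers(0, 2, n)

# Select the feature most correlated with the label on the selection sample...
sel_scores = np.array([abs(np.corrcoef(X_sel[:, j], y_sel)[0, 1]) for j in range(p)])
best = int(np.argmax(sel_scores))

# ...then score that same feature on an independent sample.
X_ind = rng.normal(size=(n, p))
y_ind = rng.integers(0, 2, n)
ind_score = abs(np.corrcoef(X_ind[:, best], y_ind)[0, 1])

print(f"selection-sample score: {sel_scores[best]:.2f}, independent score: {ind_score:.2f}")
```

The selection-sample score is inflated simply because it is a maximum over many noise features; the independent sample reveals the feature is useless. Performing selection inside each cross-validation fold avoids quoting the inflated number.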

Methods

- Filters: feature selection, then model building
- Wrappers: model building, then feature selection
- Embedded/hybrid: feature selection during model building

Filter Methods

Feature selection → model building

Filters are usually fast:
- No feedback from the classifier
- Use correlations / mutual information gain
- "Quick and dirty" and less accurate
- Useful in pre-processing

Example algorithms: information gain, Relief, rank-sum test, etc.
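A minimal filter sketch on toy data (the data and score choice are illustrative assumptions): rank features by a model-free score, here the absolute correlation with the class label, with no feedback from any classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Toy two-class data: feature 0 discriminates, features 1-2 are noise.
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 3))
X[:, 0] += 1.5 * y  # the class shifts feature 0

# Filter score: absolute correlation with the label (no classifier involved).
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]  # best feature first
print("ranking:", ranking)
```

Mutual information gain or the rank-sum test named on the slide would slot in as alternative scores in place of the correlation.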

Wrapper Methods

Model building → feature selection

Wrappers are typically slower but relatively more accurate (due to model building). They are tied to a chosen model and use it to:
- Evaluate features
- Assess feature interactions
- Search for an optimal subset of features

Different types:
- Methodical
- Probabilistic (random hill-climbing)
- Heuristic (forward/backward elimination)
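A hedged sketch of a heuristic wrapper (forward selection) on toy data: the "chosen model" here is a simple nearest-class-mean classifier standing in for whatever classifier the analysis actually uses; the data and feature strengths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, 5))
    X[:, 0] += 1.0 * y  # strongly informative
    X[:, 1] += 0.7 * y  # weakly informative; features 2-4 are noise
    return X, y

X_tr, y_tr = make_data(2000)
X_va, y_va = make_data(2000)

def accuracy(feats):
    """Validation accuracy of a nearest-class-mean classifier on a feature subset."""
    f = list(feats)
    mu0 = X_tr[y_tr == 0][:, f].mean(axis=0)
    mu1 = X_tr[y_tr == 1][:, f].mean(axis=0)
    d0 = ((X_va[:, f] - mu0) ** 2).sum(axis=1)
    d1 = ((X_va[:, f] - mu1) ** 2).sum(axis=1)
    return np.mean((d1 < d0) == (y_va == 1))

# Forward selection: greedily add the feature that most improves accuracy.
selected, remaining = [], list(range(5))
while remaining:
    best_acc, best_j = max((accuracy(selected + [j]), j) for j in remaining)
    if selected and best_acc <= accuracy(selected):
        break  # no remaining feature improves the model
    selected.append(best_j)
    remaining.remove(best_j)
print("selected features:", selected)
```

Backward elimination runs the same loop in reverse, starting from the full set and dropping the least useful feature each round.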

Example: Feature Importance

Feature importance is proportional to the performance of the classifiers in which the feature participates:
- Full feature set {V}
- Feature subsets {S}
- Classifier performance F(S)

A fast stochastic version uses random subset seeds.
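The subset-based importance above can be sketched as follows; this is a toy illustration under assumed data and a stand-in nearest-class-mean classifier, not the talk's exact algorithm. Random subsets S are drawn, F(S) is evaluated, and each participating feature is credited with the performance of the subsets it appears in.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3000
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 4))
X[:, 0] += 1.2 * y  # only feature 0 carries signal
tr, va = slice(0, n // 2), slice(n // 2, n)

def F(S):
    """Classifier performance F(S): nearest-class-mean accuracy on subset S."""
    f = list(S)
    mu0 = X[tr][y[tr] == 0][:, f].mean(axis=0)
    mu1 = X[tr][y[tr] == 1][:, f].mean(axis=0)
    d0 = ((X[va][:, f] - mu0) ** 2).sum(axis=1)
    d1 = ((X[va][:, f] - mu1) ** 2).sum(axis=1)
    return np.mean((d1 < d0) == (y[va] == 1))

# Fast stochastic version: random subset "seeds" instead of all 2^|V| subsets.
perf = {j: [] for j in range(4)}
for _ in range(200):
    S = [j for j in range(4) if rng.random() < 0.5]
    if not S:
        continue
    for j in S:  # each participating feature is credited with F(S)
        perf[j].append(F(S))
importance = {j: float(np.mean(v)) for j, v in perf.items()}
print(importance)
```

The signal-carrying feature ends up with the highest average, because the subsets it joins perform better than those it is absent from.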

Example: RuleFit

RuleFit: rule-based binary classification and regression (J. Friedman).
- Transforms decision trees into rule ensembles
- A powerful classifier even if some rules are poor

Feature importance: proportional to the performance of the classifiers in which the feature participates (similar to the above), with one difference: there is no Wi(S); each classifier's performance is divided evenly among its participating features.

Selection Caveats

Feature selection bias: a common mistake that leads to over-optimistic evaluation of classifiers through "usage" of the testing dataset.

Solutions:
- M-fold cross-validation / bootstrap
- A second testing sample for evaluation

Feature Interactions

Features like to interact too.

Feature Interactions

Features often interact strongly in the classification process:
- Removing a feature affects the performance of its remaining interacting partners
- The strength of interaction is quantified by some wrapper methods
- In some classifiers, features can be overlooked (or shadowed) by their interacting partners

Beware of hidden reefs.

Selection Caveats

[Figure: feature importances before and after removing a feature]

After the removal, the importance landscape has changed. This holds for any criterion that does not incorporate interactions.

Global Loss Function

The GLoss function is a global measure of loss:
- Selects feature subsets for global removal
- Shows the amount of predictive-power loss relative to the upper bound of performance of the remaining classifiers
- S' is the subset to be removed
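The slide does not give the formula, so the following is only one plausible reading (an assumption, not the talk's exact definition): the global loss of removing a subset S' is the drop from the best achievable performance over all subsets of {V} to the best achievable performance over subsets of the remaining features. Data and the nearest-class-mean classifier are illustrative.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n = 3000
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 4))
X[:, 0] += 1.2 * y
X[:, 1] += 0.8 * y  # features 2-3 are noise
tr, va = slice(0, n // 2), slice(n // 2, n)

def F(S):
    """Classifier performance F(S): nearest-class-mean accuracy on subset S."""
    f = list(S)
    mu0 = X[tr][y[tr] == 0][:, f].mean(axis=0)
    mu1 = X[tr][y[tr] == 1][:, f].mean(axis=0)
    d0 = ((X[va][:, f] - mu0) ** 2).sum(axis=1)
    d1 = ((X[va][:, f] - mu1) ** 2).sum(axis=1)
    return np.mean((d1 < d0) == (y[va] == 1))

def best(features):
    """Upper bound of performance over all non-empty subsets of `features`."""
    subs = [c for r in range(1, len(features) + 1)
            for c in combinations(features, r)]
    return max(F(s) for s in subs)

V = list(range(4))
f_best = best(V)
# Assumed reading of GLoss: performance lost by removing S' globally,
# relative to the best classifier still available afterwards.
gloss = {Sp: f_best - best([j for j in V if j not in Sp])
         for Sp in [(2,), (3,), (0,), (2, 3)]}
print(gloss)
```

Under this reading, removing a noise feature costs essentially nothing, while removing the signal feature costs a large fraction of the achievable performance.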

Global Loss and Classifier Performance

Minimizing the GLoss function is NOT equivalent to maximizing F(S), i.e. finding the highest-performing classifier and its constituent features.

Recent Work

- Probabilistic wrapper methods: stochastic approach
- Hybrid algorithms: combine filters and wrappers
- Embedded methods: feature selection during model building

Embedded Methods

Assess feature importance at the model-building stage and incorporate it into the process:
- A way to penalize or remove features during classification or regression
- Regularization

Examples: LASSO, RegTrees
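As an example of the regularization idea, here is a minimal LASSO sketch in plain numpy, solved by proximal gradient descent (ISTA); the data, penalty strength, and step size are all illustrative assumptions. The L1 penalty drives the coefficients of irrelevant features to exactly zero, so feature selection happens during model building.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=n)  # features 2-4 irrelevant

alpha, eta = 0.2, 0.01  # L1 strength and step size (illustrative values)
w = np.zeros(p)
for _ in range(2000):
    grad = X.T @ (X @ w - y) / n                               # gradient of squared loss
    w = w - eta * grad
    w = np.sign(w) * np.maximum(np.abs(w) - eta * alpha, 0.0)  # soft-threshold (L1 prox)
print("coefficients:", np.round(w, 3))
```

The irrelevant coefficients are thresholded away while the informative ones survive (slightly shrunk, the usual LASSO trade-off).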

Regularized Trees

Inspired by the rule regularization in Friedman and Popescu (2008).

Decision tree reminder: "votes" are taken at decision junctions over possible splits among the features. Regularized trees penalize, during voting, features similar to those used in previous decisions. The result is a high-quality feature set and classifier.

Feature Amplification

Another example: feed feature importance back into classifier building, weighting votes by log(feature importance).

Decade Ahead

[Slide credit: H. Prosper]

Minimal SUSY

In HEP

In HEP one often searches for new phenomena, applying classifiers trained on Monte Carlo for at least one of the classes (signal), and sometimes both, to real data.
- Flexibility is KEY to any search
- It is more beneficial to choose a reduced parameter space that consistently produces strong-performing classifiers at actual analysis time
- Useful for general SUSY and other new-phenomena searches

Feature Selection Tools

- R (CRAN): Boruta, RFE, CFS, FSelector, caret
- TMVA: FAST algorithm (stochastic wrapper), Global Loss function
- Scikit-Learn
- Bioconductor

Summary

- Feature selection is an important part of robust HEP machine learning applications
- Many methods are available
- Watch out for caveats

Happy analyzing!