Robust Feature Selection by Mutual Information Distributions
Marco Zaffalon & Marcus Hutter
IDSIA, Galleria 2, 6928 Manno (Lugano), Switzerland

Mutual Information (MI)
– Consider two discrete random variables
– (In)Dependence often measured by MI
  – Also known as cross-entropy or information gain
  – Examples: inference of Bayesian nets and classification trees; selection of relevant variables for the task at hand
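For reference, with notation assumed here (not shown on the slide): if the two variables take r and s values with joint chances $\pi_{ij}$, MI is

$$ I(\pi) \;=\; \sum_{i=1}^{r}\sum_{j=1}^{s} \pi_{ij}\,\log\frac{\pi_{ij}}{\pi_{i+}\,\pi_{+j}}, \qquad \pi_{i+}=\sum_{j}\pi_{ij},\quad \pi_{+j}=\sum_{i}\pi_{ij}, $$

which is zero exactly when the two variables are independent.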

MI-Based Feature-Selection Filter (F) (Lewis, 1992)
– Classification: predicting the class value given the values of the features
  – Features (or attributes) and class = random variables
  – Learning the rule features → class from data
– Filter's goal: removing irrelevant features
  – More accurate predictions, easier models
– MI-based approach
  – Remove a feature if the class does not depend on it, i.e. if their mutual information is zero
  – Or: remove a feature if its MI with the class falls below an arbitrary threshold of relevance
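A minimal sketch of such a filter under these assumptions: categorical features, empirical MI computed with scikit-learn's mutual_info_score, and an illustrative threshold value; the name filter_F is not from the talk.

```python
import numpy as np
from sklearn.metrics import mutual_info_score  # empirical MI (in nats) between two label vectors

def filter_F(X: np.ndarray, y: np.ndarray, eps: float = 0.01) -> list[int]:
    """Traditional filter F: keep the features whose empirical MI with the class exceeds eps."""
    return [f for f in range(X.shape[1]) if mutual_info_score(y, X[:, f]) > eps]
```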

Empirical Mutual Information: a common way to use MI in practice
– Data are summarized by an s × r contingency table of counts, with n the total count:

    j \ i |  1     2    …   r
    ------+--------------------
      1   | n_11  n_12  …  n_1r
      2   | n_21  n_22  …  n_2r
      …   |  …     …        …
      s   | n_s1  n_s2  …  n_sr

– Empirical (sample) probability: each cell count divided by n
– Empirical mutual information: MI computed from these empirical probabilities
– Problems of the empirical approach
  – Is a positive value just due to random fluctuations? (finite sample)
  – How to know whether it is reliable?
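A sketch of the empirical computation (names are illustrative, counts given as a NumPy array):

```python
import numpy as np

def empirical_mi(counts: np.ndarray) -> float:
    """Empirical MI (in nats) from a contingency table of counts; O(rs) time."""
    pi = counts / counts.sum()              # empirical joint probabilities
    row = pi.sum(axis=1, keepdims=True)     # marginal of the row variable
    col = pi.sum(axis=0, keepdims=True)     # marginal of the column variable
    nz = pi > 0                             # treat 0 * log 0 as 0
    return float(np.sum(pi[nz] * np.log(pi[nz] / (row @ col)[nz])))

# Example: a feature with 3 values (rows) against a class with 2 values (columns)
print(empirical_mi(np.array([[12, 3], [5, 9], [2, 14]])))
```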

We Need the Distribution of MI
– Bayesian approach
  – Prior distribution for the unknown chances (e.g., Dirichlet)
  – Posterior: the prior updated with the observed counts (for a Dirichlet prior, again a Dirichlet)
– Posterior probability density of MI
– How to compute it?
  – Fitting a curve by the exact mean and an approximate variance
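An alternative, brute-force way to approximate the same density (not the curve fit used in the talk) is Monte Carlo: sample chance matrices from the Dirichlet posterior and evaluate MI on each sample. A sketch, assuming a uniform prior weight of 1 per cell; the function name is illustrative.

```python
import numpy as np

def mi_posterior_samples(counts: np.ndarray, prior: float = 1.0,
                         n_samples: int = 10_000, seed: int = 0) -> np.ndarray:
    """Monte Carlo draws of MI under a Dirichlet posterior with parameters counts + prior."""
    rng = np.random.default_rng(seed)
    alpha = counts.astype(float).ravel() + prior      # posterior Dirichlet parameters
    draws = rng.dirichlet(alpha, size=n_samples)      # each row is one sampled joint distribution
    mis = np.empty(n_samples)
    for k, pi in enumerate(draws.reshape(n_samples, *counts.shape)):
        row, col = pi.sum(1, keepdims=True), pi.sum(0, keepdims=True)
        nz = pi > 0
        mis[k] = np.sum(pi[nz] * np.log(pi[nz] / (row @ col)[nz]))
    return mis  # e.g. np.quantile(mis, [0.05, 0.95]) gives a credible interval for MI
```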

Mean and Variance of MI (Hutter, 2001; Wolpert & Wolf, 1995)
– Exact mean
– Leading and next-to-leading-order (NLO) terms for the variance
– Computational complexity O(rs)
  – As fast as empirical MI
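The formulas on this slide appear to have been images; the reconstruction below, which should be checked against Hutter (2001) and Wolpert & Wolf (1995), gives the exact mean and the leading-order variance term under a Dirichlet posterior with parameters $n_{ij}$ (counts plus prior weights), row and column sums $n_{i+}$, $n_{+j}$, total $n$, and digamma function $\psi$:

$$ \mathrm{E}[I] \;=\; \frac{1}{n}\sum_{i,j} n_{ij}\,\bigl[\psi(n_{ij}+1)-\psi(n_{i+}+1)-\psi(n_{+j}+1)+\psi(n+1)\bigr], $$

$$ \mathrm{Var}[I] \;\approx\; \frac{K-J^{2}}{n+1}, \qquad J=\sum_{i,j}\frac{n_{ij}}{n}\log\frac{n_{ij}\,n}{n_{i+}\,n_{+j}}, \qquad K=\sum_{i,j}\frac{n_{ij}}{n}\Bigl(\log\frac{n_{ij}\,n}{n_{i+}\,n_{+j}}\Bigr)^{2}, $$

with the NLO correction omitted here; both expressions cost O(rs) operations, like empirical MI.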

MI Density Example Graphs

Robust Feature Selection
– Filters: two new proposals
  – FF: include a feature iff its MI is credibly above the threshold of relevance (include iff proven relevant)
  – BF: exclude a feature iff its MI is credibly below the threshold (exclude iff proven irrelevant)
– Examples (figure of three MI densities against the threshold, on the I axis)
  – FF includes, BF includes
  – FF excludes, BF includes
  – FF excludes, BF excludes
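A Monte Carlo sketch of the two decisions for a single feature, reusing mi_posterior_samples (and NumPy) from the sketch above; the threshold and the credibility level are illustrative, and posterior sampling stands in here for the Gaussian curve fit used in the talk.

```python
def robust_filters(counts, eps: float = 0.01, level: float = 0.95):
    """FF/BF decisions for one feature from its feature-by-class contingency table."""
    mis = mi_posterior_samples(counts)
    ff_include = np.mean(mis > eps) >= level   # FF: include only if MI is credibly above eps
    bf_exclude = np.mean(mis < eps) >= level   # BF: exclude only if MI is credibly below eps
    return ff_include, bf_exclude              # BF keeps the feature whenever bf_exclude is False
```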

Comparing the Filters
– Experimental set-up
  – Filter (F, FF, or BF) + Naive Bayes classifier
  – Sequential learning and testing: the instances seen so far form the learning data, the filter selects the features, Naive Bayes classifies the next instance, and the classified instance is then stored in the learning data (see the sketch below)
– Collected measures for each filter
  – Average # of correct predictions (prediction accuracy)
  – Average # of features used
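A sketch of that sequential protocol under stated assumptions: integer-coded categorical features, scikit-learn's CategoricalNB as the Naive Bayes classifier, a small warm-up before the first prediction, and select standing for any of the filters sketched above (e.g. filter_F); none of these names come from the talk.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

def prequential_eval(X: np.ndarray, y: np.ndarray, select, warmup: int = 20):
    """Classify each instance from all previous ones, then store it (sequential learn-and-test)."""
    correct, used = [], []
    for t in range(warmup, len(y)):
        kept = select(X[:t], y[:t]) or list(range(X.shape[1]))  # fall back to all features if none kept
        clf = CategoricalNB(min_categories=(X.max(axis=0) + 1)[kept])
        clf.fit(X[:t][:, kept], y[:t])
        correct.append(clf.predict(X[t:t + 1, kept])[0] == y[t])
        used.append(len(kept))
    return float(np.mean(correct)), float(np.mean(used))  # accuracy, average number of features used
```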

Results on 10 Complete Datasets
– # of used features
– Accuracies NOT significantly different
  – Except Chess & Spam with FF

Results on 10 Complete Datasets (continued)

FF: Significantly Better Accuracies
– Chess
– Spam

Extension to Incomplete Samples
– MAR assumption
– General case (missing features and class): EM + closed-form expressions
– Missing features only: closed-form approximate expressions for mean and variance
  – Complexity still O(rs)
– New experiments
  – 5 data sets
  – Similar behavior

Conclusions
– Expressions for several moments of the MI distribution are available
  – The distribution can be approximated well
  – Safer inferences, with the same computational complexity as empirical MI
  – Why not use it?
– Robust feature selection shows the power of the MI distribution
  – FF outperforms the traditional filter F
– Many useful applications possible
  – Inference of Bayesian nets
  – Inference of classification trees
  – …