What is The Optimal Number of Features

Presentation transcript:

What is the Optimal Number of Features? A Learning-Theoretic Perspective
Amir Navot. Joint work with Ran Gilad-Bachrach, Yiftah Navot and Naftali Tishby.

What is Feature Selection?
[Figure: a data matrix with one row per instance and one column per feature.]
- Feature selection: select a "good" small subset of features (out of the given set).
- A "good" subset is one that enables us to build good classifiers.
- Feature selection is a special form of dimensionality reduction.
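
As a concrete illustration of selecting a column subset, here is a minimal Python sketch; the data matrix, the scoring rule, and k are arbitrary assumptions, not taken from the talk.

```python
# Minimal sketch: feature selection as choosing a subset of columns.
# X, y, the scoring rule, and k are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))        # 100 instances, 20 features
y = rng.choice([-1, 1], size=100)     # binary labels

# Score each feature by the absolute difference of class-conditional means,
# then keep the k highest-scoring columns.
scores = np.abs(X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0))
k = 5
selected = np.argsort(scores)[::-1][:k]   # indices of the k "best" features
X_reduced = X[:, selected]                # reduced representation of the data
print("selected features:", selected, "reduced shape:", X_reduced.shape)
```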

Reasons to do Feature Selection
- Reduces computational complexity
- Saves the cost of measuring extra features
- The selected features can provide insights about the nature of the problem
- Improves accuracy

The Questions
- Under which conditions can feature selection improve classification accuracy?
- What is the optimal number of features?
- How does this number depend on the training set size?
We discuss these questions by analyzing one simple setting.

Two Gaussians – Problem Setting
- Binary classification task: the two classes are Gaussians with means μ and −μ (identity covariance), i.e. x ~ N(yμ, I) for label y ∈ {−1, +1}.
- The coordinates of x are the features.
- Optimal classifier, given a feature subset F: h_F(x) = sign(⟨μ_F, x_F⟩).
- If μ is known, using all the features is optimal.
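
A minimal simulation sketch of this setting, assuming unit-variance Gaussians x ~ N(yμ, I) and an arbitrary made-up μ; it compares the known-μ classifier using all features with one restricted to a few features.

```python
# Sketch of the two-Gaussian setting, assuming x | y ~ N(y*mu, I) with
# y uniform on {-1, +1}; the particular mu below is made up.
import numpy as np

rng = np.random.default_rng(0)
N = 10
mu = 1.0 / np.arange(1, N + 1)        # assumed mu, coordinates ordered by size

def sample(n):
    y = rng.choice([-1, 1], size=n)
    x = y[:, None] * mu + rng.normal(size=(n, N))
    return x, y

def bayes_predict(x, feats):
    # Optimal rule when mu is known: sign of <mu_F, x_F>
    return np.sign(x[:, feats] @ mu[feats])

x, y = sample(100_000)
feats_all = np.arange(N)
print("error, all features :", np.mean(bayes_predict(x, feats_all) != y))
print("error, first 3 only :", np.mean(bayes_predict(x, feats_all[:3]) != y))
```

With μ known, the error can only improve as features are added, which is the point of the last bullet above.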

Problem Setting – Cont.
- Assume that μ is estimated from a sample of size m.
- Given an estimator μ̂ and a feature subset F, we consider the classifier h_{μ̂,F}(x) = sign(⟨μ̂_F, x_F⟩).
- We want to find the subset of features that minimizes the average generalization error (averaged over training samples of size m).
- We assume the features are ordered by decreasing |μ_i|, so we only need to find the optimal number of features n (the subset is then the first n features).
- We consider the optimal estimator μ̂ for this setting.
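
A sketch of the estimated-μ version. The transcript does not spell out the estimator, so this assumes the natural empirical-mean estimator μ̂ = (1/m) Σ_j y_j x_j; μ, m, and the sample sizes are likewise assumptions.

```python
# Sketch with mu replaced by an estimate from m training examples.
# The estimator below (empirical mean of y_j * x_j) is an assumption;
# the slides only say "the optimal estimator".
import numpy as np

rng = np.random.default_rng(0)
N, m = 10, 20
mu = 1.0 / np.arange(1, N + 1)        # assumed true mu, ordered by size

def sample(n):
    y = rng.choice([-1, 1], size=n)
    return y[:, None] * mu + rng.normal(size=(n, N)), y

x_tr, y_tr = sample(m)
mu_hat = (y_tr[:, None] * x_tr).mean(axis=0)   # estimate of mu from the sample

def predict(x, n_feats):
    # Classifier built from the estimate, restricted to the first n features
    return np.sign(x[:, :n_feats] @ mu_hat[:n_feats])

x_te, y_te = sample(100_000)
for n in (1, 3, N):
    print(f"n = {n:2d}: test error = {np.mean(predict(x_te, n) != y_te):.3f}")
```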

Illustration
[Figure omitted.]

Result
The number of features that minimizes the average error for a training set of size m is: [formula omitted].
Observations:
- Under some conditions there is a non-trivial optimal n.
- Under others, using all the features is the optimal choice.
- The decision on adding a feature depends on the other features.

Solving for a Specific μ

Solving for a Specific μ – Cont.
[Figure; x-axis: number of features.]
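
A rough simulation sketch of this kind of curve (average test error versus the number of ordered features used), under the same assumed model with a made-up μ and small m; for small training sets the minimum often falls at n < N.

```python
# Rough simulation of an "error vs. number of features" curve: average the
# test error over many small training sets. mu, m, and sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
N, m, repeats = 30, 10, 200
mu = 1.0 / np.arange(1, N + 1)        # assumed mu, ordered by size

def sample(n):
    y = rng.choice([-1, 1], size=n)
    return y[:, None] * mu + rng.normal(size=(n, N)), y

x_te, y_te = sample(20_000)
avg_err = np.zeros(N)
for _ in range(repeats):
    x_tr, y_tr = sample(m)
    mu_hat = (y_tr[:, None] * x_tr).mean(axis=0)
    for n in range(1, N + 1):
        pred = np.sign(x_te[:, :n] @ mu_hat[:n])
        avg_err[n - 1] += np.mean(pred != y_te)
avg_err /= repeats

print("average error per n:", np.round(avg_err, 3))
print("best n:", int(np.argmin(avg_err)) + 1)
```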

Proof
For a given estimator μ̂ of μ, the generalization error of the classifier h_{μ̂,F} is
Φ(−⟨μ̂_F, μ_F⟩ / ‖μ̂_F‖),
where Φ is the CDF of a standard Gaussian.
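
A quick Monte Carlo check of this error formula for one fixed, made-up μ, estimate μ̂, and subset F (all assumptions for illustration).

```python
# Monte Carlo check of the stated error formula
#   err = Phi( -<mu_hat_F, mu_F> / ||mu_hat_F|| )
# for one fixed, made-up mu, mu_hat, and subset F.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 10
mu = 1.0 / np.arange(1, N + 1)
mu_hat = mu + 0.3 * rng.normal(size=N)    # some fixed noisy estimate
F = np.arange(5)                          # use the first 5 features

analytic = norm.cdf(-(mu_hat[F] @ mu[F]) / np.linalg.norm(mu_hat[F]))

y = rng.choice([-1, 1], size=500_000)
x = y[:, None] * mu + rng.normal(size=(500_000, N))
empirical = np.mean(np.sign(x[:, F] @ mu_hat[F]) != y)

print(f"formula: {analytic:.4f}   Monte Carlo: {empirical:.4f}")
```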

Proof – Cont.
We want to find the number of features n that minimizes the average error E[Φ(−⟨μ̂_F, μ_F⟩ / ‖μ̂_F‖)], where the expectation is over the training sample.
Lemma: Φ(−E[⟨μ̂_F, μ_F⟩ / ‖μ̂_F‖]) is a good approximation of this average error when m is large enough.
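
A small numerical look at the approximation in the lemma as reconstructed above, comparing E[Φ(−T)] with Φ(−E[T]) for T = ⟨μ̂_F, μ_F⟩ / ‖μ̂_F‖ at several sample sizes m; the model, μ, F, and constants are assumptions.

```python
# Numerical look at the lemma (as reconstructed above): compare
# E[Phi(-T)] with Phi(-E[T]) for T = <mu_hat_F, mu_F>/||mu_hat_F||,
# where mu_hat is the empirical-mean estimate from m samples.
# mu, F, and the sample sizes are assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 10
mu = 1.0 / np.arange(1, N + 1)
F = slice(0, 5)

for m in (5, 20, 100):
    T = np.empty(5000)
    for r in range(T.size):
        y = rng.choice([-1, 1], size=m)
        x = y[:, None] * mu + rng.normal(size=(m, N))
        mu_hat = (y[:, None] * x).mean(axis=0)
        T[r] = (mu_hat[F] @ mu[F]) / np.linalg.norm(mu_hat[F])
    print(f"m = {m:3d}: E[Phi(-T)] = {norm.cdf(-T).mean():.4f}   "
          f"Phi(-E[T]) = {norm.cdf(-T.mean()):.4f}")
```

The two quantities move closer together as m grows, which is what lets the next step replace the average error by the simpler expression.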

Proof – Cont.
Since Φ is monotone, minimizing Φ(−E[⟨μ̂_F, μ_F⟩ / ‖μ̂_F‖]) is equivalent to maximizing E[⟨μ̂_F, μ_F⟩ / ‖μ̂_F‖].
For the optimal estimator μ̂, this expectation can be computed as a function of μ, m and the number of features n; maximizing it over n yields the stated result.

"Empirical Proof" of the Lemma
[Figure; x-axis: number of features.]

Conclusions
- Even when all the features carry information and are independent, using all the features may be suboptimal.
- The decision to add a feature depends on the others.
- The optimal number of features depends critically on the sample size (i.e., on the quality of the model estimate).
