Feature Selection Analysis

An attempt at a generalized relationship between sample size and dimensionality
Project for 9.520, Nathan Eagle

Motivation
Taking and labeling additional sample data is expensive, so how much training data is really necessary?

Empirical Evidence – "s-curve"
[Figure: "s-curve" results for the SVM(fu) classifier on Sayan's feature selection technique.]
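A minimal sketch (not the original experiment; the data model, classifier settings, and sample sizes are all assumptions) of how an error-versus-training-size curve like this could be reproduced, using a linear SVM on synthetic data with a few relevant and many irrelevant features:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def make_data(n, n_relevant=2, n_irrelevant=100):
    # Two balanced Gaussian classes, separated only along the relevant dims.
    y = rng.permutation(np.repeat([0, 1], n // 2))
    X = rng.normal(size=(n, n_relevant + n_irrelevant))
    X[:, :n_relevant] += 2.0 * y[:, None] - 1.0  # shift relevant dims by +/-1 per class
    return X, y

X_test, y_test = make_data(2000)
for n_train in [10, 20, 40, 80, 160, 320, 640]:
    # Average the test error over several random training draws.
    errs = [1.0 - LinearSVC(C=1.0).fit(*make_data(n_train)).score(X_test, y_test)
            for _ in range(20)]
    print(f"n_train={n_train:4d}  test error={np.mean(errs):.3f}")

Plotting test error against n_train (log scale) should trace out a falling curve of the kind the slide refers to.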

Empirical Evidence – linearity
Linear relationship between samples and dimensions.
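A companion sketch, with the same caveats (illustrative data model, arbitrary target error), that estimates the training size needed to reach a fixed test error as the number of irrelevant features grows; the printed pairs, plotted, should look roughly linear:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

def mean_error(n_train, n_irrelevant, n_relevant=2, reps=10, n_test=1000):
    # Average linear-SVM test error at a given training size and dimensionality.
    def draw(n):
        y = rng.permutation(np.repeat([0, 1], n // 2))
        X = rng.normal(size=(n, n_relevant + n_irrelevant))
        X[:, :n_relevant] += 2.0 * y[:, None] - 1.0
        return X, y
    errs = []
    for _ in range(reps):
        X_tr, y_tr = draw(n_train)
        X_te, y_te = draw(n_test)
        errs.append(1.0 - LinearSVC(C=1.0).fit(X_tr, y_tr).score(X_te, y_te))
    return float(np.mean(errs))

target = 0.10  # arbitrary target test error
for n_irrelevant in [25, 50, 100, 200, 400]:
    n = 10
    while mean_error(n, n_irrelevant) > target and n < 20000:
        n *= 2  # doubling search for the first training size that reaches the target
    print(f"irrelevant dims={n_irrelevant:4d}  training size needed ~ {n}")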

Proof I – Hypothesis Testing
(1), (2)
But what are the priors, p_feat? And what if there is more than one relevant feature?
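Equations (1) and (2) were rendered as images and are missing from the transcript. Given the slide's reference to a prior p_feat, a plausible (unverified) reading is a Bayesian hypothesis test, per feature, of "relevant" against "irrelevant", whose posterior odds are

\frac{P(\mathrm{rel} \mid x)}{P(\mathrm{irr} \mid x)} = \frac{p(x \mid \mathrm{rel})}{p(x \mid \mathrm{irr})} \cdot \frac{p_{feat}}{1 - p_{feat}},

which makes the slide's objections concrete: the test cannot be run without committing to the prior p_feat, and it considers one relevant feature at a time.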

Proof II – Chebyshev and the Weak Law of Large Numbers
(3) From the W.L.L.N.: (4) From Chebyshev's inequality: (5)
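Equations (3)–(5) are also missing from the transcript. Assuming the standard statements the headings point to, for the sample mean \bar{x}_n of n i.i.d. draws of a feature with mean \mu and variance \sigma^2:

\text{W.L.L.N.:}\quad \bar{x}_n \xrightarrow{P} \mu \text{ as } n \to \infty,

\text{Chebyshev:}\quad P\left(|\bar{x}_n - \mu| \ge \varepsilon\right) \le \frac{\sigma^2}{n \varepsilon^2}.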

Proof II (cont.)
From before: (6) Inverting the probability: (7) For all features: (8)
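Continuing the same hedged reconstruction: fixing an acceptable failure probability \delta for a single feature and inverting the Chebyshev bound gives a per-feature sample-size requirement, and a union bound extends it to all d irrelevant features:

n \ge \frac{\sigma^2}{\varepsilon^2 \delta} \quad \text{(one feature)}, \qquad P\left(\exists\, j \le d : |\bar{x}_n^{(j)} - \mu_j| \ge \varepsilon\right) \le \frac{d \sigma^2}{n \varepsilon^2} \quad \text{(all features)}.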

Proof II (cont.)
From before: (9), setting the bound to the desired confidence.
[Plot: Sample Size vs. Dimensions; x-axis: Irrelevant Dimensions/Features, y-axis: Training Sample Size.]
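Setting the union bound equal to \delta and solving for n recovers the linear relationship the plot shows (again assuming the standard argument):

\frac{d \sigma^2}{n \varepsilon^2} = \delta \quad \Longrightarrow \quad n = \frac{d \sigma^2}{\varepsilon^2 \delta},

so for a fixed tolerance \varepsilon and confidence \delta, the required training-sample size grows linearly in the number of irrelevant dimensions d.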

Proof III – Sayan's Generalization Error Algorithm
Generalization error for two classes drawn from Gaussian distributions:* (10)
where the separating hyperplane is defined as (11), the Fisher linear discriminant.
* As proved in Sayan Mukherjee's PhD thesis.
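Equations (10) and (11) are missing from the transcript. The standard textbook forms, which may differ in detail from the exact expressions in the thesis: for two classes N(\mu_1, \Sigma) and N(\mu_2, \Sigma) with equal priors, the Fisher linear discriminant hyperplane has normal

w = \Sigma^{-1} (\mu_1 - \mu_2),

and the generalization error of the resulting linear rule is

\epsilon = \Phi\!\left( -\tfrac{1}{2} \sqrt{ (\mu_1 - \mu_2)^{\top} \Sigma^{-1} (\mu_1 - \mu_2) } \right),

where \Phi is the standard normal CDF and the square root is the Mahalanobis distance between the class means.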

Results
[Figure: theoretical vs. empirical curves, shown for 1 iteration and for 50 iterations.]

Conclusions
Sample size appears to scale linearly with the number of irrelevant features, both empirically and theoretically, regardless of the classifier.
The "s-curve" does not seem to be a generalized property of all feature selection methods.