Randomized Variable Elimination. David J. Stracuzzi and Paul E. Utgoff.

Randomized Variable Elimination. David J. Stracuzzi and Paul E. Utgoff

Agenda
- Background
- Filter and wrapper methods
- Randomized Variable Elimination
- Cost function
- RVE algorithm when r is known (RVE)
- RVE algorithm when r is not known (RVErS)
- Results
- Questions

Variable Selection Problem
- Choose the relevant attributes from a (possibly large) set of input variables: produce the subset of variables that best predicts the target function.
- Forward selection starts with an empty set and searches for variables to add.
- Backward selection starts with the entire set of variables and successively removes irrelevant variables.
- In some cases, forward selection also removes variables in order to recover from earlier poor selections.
- Caruana and Freitag (1994) experimented with greedy search methods and found that allowing the search to both add and remove variables outperforms simple forward and backward searches.
- Variable selection methods fall into two broad families: filter methods and wrapper methods.
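A minimal sketch of greedy forward selection, assuming a caller-supplied score(subset) function (for example, held-out accuracy of the learner trained on that subset); the function names and stopping rule are illustrative, not taken from the paper:

    def forward_selection(all_vars, score, min_gain=0.0):
        """Greedily add the variable that most improves score(subset).
        score([]) should return the baseline, e.g. majority-class accuracy."""
        selected = []
        best = score(selected)
        while True:
            gains = {v: score(selected + [v]) - best
                     for v in all_vars if v not in selected}
            if not gains:
                break
            v, gain = max(gains.items(), key=lambda item: item[1])
            if gain <= min_gain:
                break                      # no candidate improves the score enough
            selected.append(v)
            best += gain
        return selected

Backward selection is the mirror image: start from all_vars and repeatedly drop the variable whose removal hurts the score least.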

Filter methods
- Use statistical measures to evaluate the quality of variable subsets; each candidate subset is scored against a fixed quality measure.
- Statistical evaluation of variables requires very little computation compared to running the learning algorithm.
- FOCUS (Almuallim and Dietterich, 1991) searches for the smallest subset that completely discriminates between the target classes.
- Relief (Kira and Rendell, 1992) ranks variables by a distance-based relevance measure.
- In filter methods, variables are evaluated independently of the learner, not in the context of the learning problem.
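As a concrete (if simplistic) filter criterion, the sketch below ranks variables by absolute correlation with the target; it merely stands in for measures such as Relief's distance-based score, which is more involved:

    import numpy as np

    def correlation_filter(X, y, top_k):
        """Rank the columns of X by |Pearson correlation| with y, keep the top_k."""
        scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
        ranked = np.argsort(scores)[::-1]          # best-scoring columns first
        return ranked[:top_k]

Note that no learner is ever run here, which is exactly why filter methods are cheap.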

Wrapper methods
- Use the performance of the learning algorithm itself to evaluate the quality of a subset of input variables: the learner is run on the candidate variable set and the accuracy of the resulting hypothesis is tested.
- Advantage: because wrapper methods evaluate variables in the context of the learning problem, they tend to outperform filter methods.
- Disadvantage: the cost of repeatedly executing the learning algorithm can become prohibitive.
- John, Kohavi, and Pfleger (1994) coined the term "wrapper", but the technique had been used earlier (Devijver and Kittler, 1982).
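A minimal wrapper-style evaluator, using scikit-learn's cross_val_score for brevity; the library choice and the function below are assumptions for illustration, not part of the paper:

    from sklearn.model_selection import cross_val_score

    def wrapper_score(estimator, X, y, subset, cv=5):
        """Evaluate a candidate variable subset by the learner's own
        cross-validated accuracy on just those columns."""
        return cross_val_score(estimator, X[:, list(subset)], y, cv=cv).mean()

Every call retrains the learner cv times, which is the cost that RVE's analysis below tries to keep under control.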

Randomized Variable Elimination
- Falls under the category of wrapper methods.
- First, a hypothesis is produced for the entire set of n variables.
- A subset is formed by randomly selecting k variables for removal; a hypothesis is then produced for the remaining n - k variables.
- The accuracies of the two hypotheses are compared: removal of any relevant variable should cause an immediate decline in performance.
- Uses a cost function to balance the cost of successive failures against the cost of running the learning algorithm many times.

The Cost Function

Probability of selecting k variables
The probability of successfully selecting k irrelevant variables at random from the n remaining variables, of which r are relevant, is

p(n, r, k) = \prod_{i=0}^{k-1} \frac{n - r - i}{n - i}

where
n ... number of remaining variables
r ... number of relevant variables
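A direct translation of the formula above as a small sketch (the function name is mine):

    def p_success(n, r, k):
        """Probability that k variables drawn without replacement from the
        n remaining are all irrelevant, given that r of the n are relevant."""
        prob = 1.0
        for i in range(k):
            prob *= (n - r - i) / (n - i)
        return prob

For example, p_success(100, 10, 5) is roughly 0.58: when most variables are irrelevant, removing five at once still succeeds more often than not.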

Expected number of failures
The expected number of consecutive failures before a success at selecting k irrelevant variables is

E^{-}(n, r, k) = \frac{1 - p(n, r, k)}{p(n, r, k)} = \frac{1}{p(n, r, k)} - 1

i.e. the expected number of consecutive trials in which at least one of the r relevant variables is randomly selected along with the irrelevant ones.
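Continuing the sketch, reusing p_success from above:

    def expected_failures(n, r, k):
        """Expected number of consecutive failed selections before k
        irrelevant variables are drawn in a single successful trial."""
        p = p_success(n, r, k)
        return (1.0 - p) / p

With n = 100 remaining variables of which r = 10 are relevant, expected_failures(100, 10, 5) is roughly 0.71, i.e. on average less than one failed attempt before five variables are removed at once.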

Cost of removing k variables
The expected cost of successfully removing k variables from the n remaining, given r relevant variables, is

I(n, r, k) = \left( E^{-}(n, r, k) + 1 \right) M(L, n - k)

where M(L, n) represents an upper bound on the cost of running algorithm L on n inputs; every attempt, failed or successful, requires running L on the n - k candidate inputs.
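In code, with the bound M passed in as a callable (its concrete form, e.g. a perceptron update bound as in the example later in the slides, is up to the caller):

    def removal_cost(n, r, k, M):
        """Expected cost of removing k variables from n remaining (r relevant).
        M(m) should return an upper bound on the cost of running the learner
        on m inputs; each attempt, failed or not, trains on n - k inputs."""
        return (expected_failures(n, r, k) + 1.0) * M(n - k)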

Optimal cost of removing irrelevant variables
The optimal expected cost of removing all irrelevant variables, starting from n remaining of which r are relevant, is given by the recurrence

I_{sum}(n, r) = \min_{1 \le k \le n - r} \left[ I(n, r, k) + I_{sum}(n - k, r) \right], \qquad I_{sum}(r, r) = 0

Optimal value for k
The optimal step size is the value of k that achieves the minimum above, i.e. the k for which the cost of removing variables is optimal:

k_{opt}(n, r) = \operatorname*{arg\,min}_{1 \le k \le n - r} \left[ I(n, r, k) + I_{sum}(n - k, r) \right]

Algorithms

Algorithm for computing k and cost values
Given: L, N, r
  I_sum[r+1 ... N] ← 0
  k_opt[r+1 ... N] ← 0
  for i ← r+1 to N do
    bestCost ← ∞
    for k ← 1 to i - r do
      temp ← I(i, r, k) + I_sum[i - k]
      if temp < bestCost then
        bestCost ← temp
        bestK ← k
    I_sum[i] ← bestCost
    k_opt[i] ← bestK
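The same dynamic program as a runnable sketch, plugging in removal_cost from above for I(i, r, k); the dictionary-based tables and the explicit I_sum[r] = 0 base case are my implementation choices:

    def compute_tables(N, r, M):
        """Fill I_sum[i] (optimal expected cost of reducing i variables to r)
        and k_opt[i] (the step size achieving it) for i = r+1 ... N."""
        I_sum = {r: 0.0}                      # base case: nothing left to remove
        k_opt = {}
        for i in range(r + 1, N + 1):
            best_cost, best_k = float("inf"), 1
            for k in range(1, i - r + 1):
                cost = removal_cost(i, r, k, M) + I_sum[i - k]
                if cost < best_cost:
                    best_cost, best_k = cost, k
            I_sum[i], k_opt[i] = best_cost, best_k
        return I_sum, k_opt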

Randomized Variable Elimination (RVE) when r is known
Given: L, n, r, tolerance
  compute tables I_sum(i, r) and k_opt(i, r)
  h ← hypothesis produced by L on n inputs
  while n > r do
    k ← k_opt(n, r)
    select k variables at random and remove them
    h' ← hypothesis produced by L on n - k inputs
    if e(h') - e(h) ≤ tolerance then
      n ← n - k
      h ← h'
    else
      replace the selected k variables
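A sketch of this loop in Python, assuming caller-supplied train(vars) and error(h) functions and the k_opt table from compute_tables above; those interfaces are assumptions for illustration:

    import random

    def rve(train, error, variables, r, k_opt, tolerance):
        """Randomized variable elimination when the number of relevant
        variables r is known in advance."""
        variables = list(variables)
        h = train(variables)
        while len(variables) > r:
            k = k_opt[len(variables)]
            removed = random.sample(variables, k)
            kept = [v for v in variables if v not in removed]
            h_new = train(kept)
            if error(h_new) - error(h) <= tolerance:
                variables, h = kept, h_new     # removal accepted
            # otherwise the selected variables are simply put back
        return variables, h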

RVE example
Plot (from the paper): the expected cost of running RVE, I_sum(N, r = 10), along with the cost of removing inputs individually and the estimated number of updates M(L, n), where L is a learner that learns a boolean function using a perceptron unit.

Randomized Variable Elimination including a search for r (RVErS)
Given: L, c1, c2, n, r_max, r_min, tolerance
  compute tables I_sum(i, r) and k_opt(i, r) for r_min ≤ r ≤ r_max
  r ← (r_max + r_min) / 2
  success, fail ← 0
  h ← hypothesis produced by L on n inputs
  repeat
    k ← k_opt(n, r)
    select k variables at random and remove them
    h' ← hypothesis produced by L on n - k inputs
    if e(h') - e(h) ≤ tolerance then
      n ← n - k
      h ← h'
      success ← success + 1
      fail ← 0
    else
      replace the selected k variables
      fail ← fail + 1
      success ← 0

RVErS (contd.)
    if n ≤ r_min then
      r, r_max, r_min ← n
    else if fail ≥ c1 E⁻(n, r, k) then
      r_min ← r
      r ← (r_max + r_min) / 2
      success, fail ← 0
    else if success ≥ c2 (r - E⁻(n, r, k)) then
      r_max ← r
      r ← (r_max + r_min) / 2
      success, fail ← 0
  until r_min ≥ r_max and fail ≥ c1 E⁻(n, r, k)
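A condensed Python sketch of RVErS, reusing the train/error interface from the RVE sketch; here k_opt is assumed to be a function of (n, r) since tables exist for a range of r, e_minus is the expected_failures function from earlier, and the guards for the degenerate case where the estimate r reaches n are my additions, not spelled out on the slides:

    import random

    def rvers(train, error, variables, r_min, r_max, k_opt, e_minus,
              c1, c2, tolerance):
        """Randomized variable elimination with a binary search for r."""
        variables = list(variables)
        r = (r_max + r_min) // 2
        success = fail = 0
        h = train(variables)
        while True:
            n = len(variables)
            k = k_opt(n, r) if n > r else 1                 # guard: r collapsed onto n
            fail_limit = c1 * e_minus(n, r, k) if n > r else c1
            removed = random.sample(variables, k)
            kept = [v for v in variables if v not in removed]
            h_new = train(kept)
            if error(h_new) - error(h) <= tolerance:
                variables, h = kept, h_new
                success, fail = success + 1, 0
            else:                                           # put the variables back
                fail, success = fail + 1, 0
            n = len(variables)
            if n <= r_min:                                  # r can be no smaller than n
                r = r_max = r_min = n
            elif fail >= fail_limit:                        # persistent failure: raise r_min
                r_min = r
                r = (r_max + r_min) // 2
                success = fail = 0
            elif n > r and success >= c2 * (r - e_minus(n, r, k)):
                r_max = r                                   # long success run: lower r_max
                r = (r_max + r_min) // 2
                success = fail = 0
            if r_min >= r_max and fail >= fail_limit:
                return variables, h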

Comparison of RVE and RVErS

Results

Variable Selection results using naïve Bayes and C4.5 algorithms

My implementation
- Integrate with Weka
- Extend the NaiveBayes and J48 algorithms
- Obtain results for some of the UCI datasets used in the paper
- Compare results with those reported by the authors
- Work in progress

RECAP

Questions

References
- H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, CA, 1991. MIT Press.
- R. Caruana and D. Freitag. Greedy attribute selection. In Machine Learning: Proceedings of the Eleventh International Conference, Amherst, MA, 1994. Morgan Kaufmann.
- K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Machine Learning: Proceedings of the Ninth International Conference, San Mateo, CA, 1992. Morgan Kaufmann.

References (contd.)
- G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference, New Brunswick, NJ, 1994. Morgan Kaufmann.
- P. A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice Hall International, 1982.