Presentation transcript:

Kansas State University, Department of Computing and Information Sciences
CIS 830: Advanced Topics in Artificial Intelligence
Friday, February 2, 2001
Presenter: Ajay Gavade
Paper #2: Liu and Motoda, Chapter 3
Aspects of Feature Selection for KDD

Outline
Categories of Feature Selection Algorithms
– Feature Ranking Algorithms
– Minimum Subset Algorithms
Basic Feature Generation Schemes & Algorithms
– How do we generate subsets?
– Forward, backward, bidirectional, random
Search Strategies & Algorithms
– How do we systematically search for a good subset?
– Informed & uninformed search
– Complete search
– Heuristic search
– Nondeterministic search
Evaluation Measures
– How do we tell how good a candidate subset is?
– Information gain, entropy

The Major Aspects of Feature Selection
– Search directions (feature subset generation)
– Search strategies
– Evaluation measures
A particular feature selection method is a combination of one choice along each of these aspects; hence each method can be represented as a point in this 3-D structure.

Major Categories of Feature Selection Algorithms (by the method's output)
Feature Ranking Algorithms
These algorithms return a list of features ranked according to some evaluation measure; the ranking tells how important (relevant) each feature is compared to the others.

Major Categories of Feature Selection Algorithms (by the method's output)
Minimum Subset Algorithms
These algorithms return a minimum feature subset and make no distinction among the features within that subset. They are used when we do not know the number of relevant features.

Basic Feature Generation Schemes: Sequential Forward Generation (SFG)
Starts with the empty set and adds features from the original set sequentially, in an N-step look-ahead form; features are added according to their relevance. The one-step look-ahead form is the most commonly used scheme because of its efficiency. Either a minimum feature subset or a ranked list can be obtained, and the scheme can deal with noise in the data.
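For concreteness, a minimal Python sketch of one-step sequential forward generation, assuming a hypothetical, user-supplied evaluate(subset) measure where higher is better; neither the function name nor the scoring is from Liu and Motoda.

```python
def sequential_forward_generation(features, evaluate):
    """Greedy one-step look-ahead: repeatedly add the single best feature."""
    selected, remaining, ranked = [], list(features), []
    while remaining:
        # Pick the feature whose addition gives the best-scoring subset.
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        ranked.append(best)
    return ranked  # a ranked list; truncating it yields a feature subset
```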

Basic Feature Generation Schemes: Sequential Backward Generation (SBG)
Starts with the full set and removes one feature at a time, always dropping the least relevant feature. This says nothing about the ranking of the relevant features that remain, and it does not guarantee an absolutely minimal subset.
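A matching sketch of sequential backward generation under the same assumed evaluate(subset) measure; the stopping rule shown (stop once every removal hurts the score) is an illustrative choice.

```python
def sequential_backward_generation(features, evaluate):
    """Shrink the full set by repeatedly dropping the least relevant feature."""
    selected = list(features)
    while len(selected) > 1:
        current = evaluate(selected)
        # Candidate to drop: the feature whose removal leaves the best subset.
        worst = max(selected, key=lambda f: evaluate([g for g in selected if g != f]))
        if evaluate([g for g in selected if g != worst]) < current:
            break  # every removal makes the subset worse, so keep what we have
        selected.remove(worst)
    return selected
```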

Basic Feature Generation Schemes: Bidirectional Generation
Runs SFG and SBG in parallel and stops when either one finds a satisfactory subset. This optimizes speed when the number of relevant features is unknown.

Basic Feature Generation Schemes: Random Generation
Sequential generation algorithms are fast on average, but they cannot guarantee an absolutely minimum valid set, i.e., an optimal feature subset: once they hit a local minimum (the best subset found so far), they have no way to get out of it. The random generation scheme instead produces subsets at random; a good random number generator is required so that ideally every combination of features has a chance to occur, and occurs just once.
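A minimal sketch of random generation, again assuming a hypothetical evaluate(subset) measure; the trial budget, seed handling, and tie-breaking toward smaller subsets are illustrative choices, not from the chapter.

```python
import random

def random_generation(features, evaluate, max_trials=1000, seed=0):
    """Draw candidate subsets at random and remember the best one seen."""
    rng = random.Random(seed)  # a good random number generator matters here
    best_subset, best_score = list(features), evaluate(list(features))
    for _ in range(max_trials):
        subset = [f for f in features if rng.random() < 0.5]
        if not subset:
            continue
        score = evaluate(subset)
        # Prefer better scores; on ties, prefer the smaller subset.
        if score > best_score or (score == best_score and len(subset) < len(best_subset)):
            best_subset, best_score = subset, score
    return best_subset
```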

Search Strategies: Exhaustive Search, Depth-First Search
Exhaustive search is complete since it covers all combinations of features, but a complete search need not be exhaustive. Depth-first search goes down one branch entirely and then backtracks to another branch; it uses a stack data structure (explicit or implicit).

Depth-First Search (diagram): the subset lattice for 3 features {a, b, c}, with singletons {a}, {b}, {c}, pairs {a,b}, {a,c}, {b,c}, and the full set {a,b,c}, explored one branch at a time.
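A minimal sketch of depth-first enumeration of feature subsets with an explicit stack, matching the description above (illustrative only).

```python
def depth_first_subsets(features):
    """Enumerate all non-empty feature subsets depth-first with an explicit stack."""
    stack = [([], 0)]  # (subset built so far, index of the next feature to consider)
    while stack:
        subset, i = stack.pop()
        if subset:
            yield subset
        # Children: extend the current subset with each later feature.
        for j in range(i, len(features)):
            stack.append((subset + [features[j]], j + 1))

# list(depth_first_subsets(['a', 'b', 'c'])) yields the 7 non-empty subsets
# from the diagram, following each branch down before backtracking.
```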

Search Strategies: Breadth-First Search
This search moves down layer by layer, checking all subsets with one feature, then all subsets with two features, and so on; it uses a queue data structure. Its space complexity makes it impractical in most cases.

Breadth-First Search (diagram): the same subset lattice for 3 features {a, b, c}, explored layer by layer: {a}, {b}, {c}, then {a,b}, {a,c}, {b,c}, then {a,b,c}.
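The breadth-first counterpart of the sketch above, using a queue instead of a stack.

```python
from collections import deque

def breadth_first_subsets(features):
    """Enumerate feature subsets layer by layer: singletons, then pairs, and so on."""
    queue = deque([([], 0)])  # (subset built so far, index of the next feature to consider)
    while queue:
        subset, i = queue.popleft()
        if subset:
            yield subset
        for j in range(i, len(features)):
            queue.append((subset + [features[j]], j + 1))

# The queue holds an entire layer at a time, which is what makes the space
# requirement impractical when there are many features.
```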

Search Strategies: Complete Search, Branch & Bound Search
Branch & bound is a variation of depth-first search that prunes branches against a bound. If the evaluation measure is monotonic, this search is complete and guarantees an optimal subset even though it is not exhaustive.

Branch & Bound Search (diagram): 3 features {a, b, c} with bound beta = 12; starting from the full set {a,b,c}, the subsets {a,b}, {a,c}, {b,c} and then {a}, {b}, {c} are explored depth-first, pruning any branch whose evaluation falls below the bound.
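A minimal sketch of branch & bound over the subset lattice, assuming a monotonic evaluate(subset) measure (removing features never increases its value) and a given bound beta; it searches depth-first for the smallest subset whose value stays at or above the bound. The exact bound handling is illustrative, not the book's procedure.

```python
def branch_and_bound(features, evaluate, beta):
    """Depth-first removal of features, pruning branches that fall below the bound."""
    best = list(features)

    def expand(subset, start):
        nonlocal best
        for i in range(start, len(subset)):
            child = subset[:i] + subset[i + 1:]  # drop one feature
            if not child or evaluate(child) < beta:
                continue  # monotonicity: no descendant of child can recover, so prune
            if len(child) < len(best):
                best = child
            expand(child, i)  # go deeper before trying the next removal

    expand(list(features), 0)
    return best
```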

Heuristic Search: Best-First Search and Beam Search
Best-first search is quick to find a solution (a subset of features). It is derived from breadth-first search: it expands its search space layer by layer but chooses one best subset at each layer to expand, and so finds a near-optimal solution. Beam search gives more speed with little loss of optimality by keeping only a few of the best subsets at each layer.

Best-First Search (diagram): the subset lattice for 3 features {a, b, c}; at each layer only the best-scoring subset is expanded further toward {a,b,c}.
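A minimal sketch of best-first expansion over the subset lattice, again with an assumed evaluate(subset) measure (higher is better); capping the frontier to the k best nodes per layer would turn it into beam search. The expansion budget and bookkeeping are illustrative.

```python
import heapq

def best_first_search(features, evaluate, max_expansions=100):
    """Always expand the most promising subset found so far."""
    frontier = [(0.0, ())]  # (negated score, subset as a sorted tuple)
    seen = {()}
    best_subset, best_score = None, float("-inf")
    while frontier and max_expansions > 0:
        max_expansions -= 1
        _, subset = heapq.heappop(frontier)  # best-scoring open node
        for f in features:
            if f in subset:
                continue
            child = tuple(sorted(subset + (f,)))
            if child in seen:
                continue
            seen.add(child)
            score = evaluate(list(child))
            if score > best_score:
                best_subset, best_score = list(child), score
            heapq.heappush(frontier, (-score, child))
    return best_subset
```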

Search Strategies: Approximate Branch & Bound Search
This is an extension of branch & bound in which the bound is relaxed by some amount δ; this allows the algorithm to continue past near-misses and still reach an optimal subset. By changing δ, the monotonicity of the measure can be observed.

Approximate Branch & Bound Search (diagram): the same lattice for 3 features {a, b, c}, explored from {a,b,c} through {a,b}, {a,c}, {b,c} down to {a}, {b}, {c}, this time pruning against the relaxed bound.
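A tiny illustration of the relaxed pruning test: a branch is kept as long as it stays within δ of the bound beta (names follow the branch & bound sketch above and are illustrative).

```python
def keep_branch(value, beta, delta):
    """Relaxed bound test for approximate branch & bound."""
    return value >= beta - delta  # delta = 0 recovers ordinary branch & bound
```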

Nondeterministic Search: RAND
Nondeterministic search avoids getting stuck in local minima and can capture the interdependence of features. RAND keeps only the current best subset; if a sufficiently long running period is allowed and a good random function is used, it can find an optimal subset. The problem with this algorithm is that we do not know when the optimal subset has been reached, so the stopping condition is a maximum number of loops.

Evaluation Measures: What Is Entropy?
A measure of uncertainty.
– The quantity: purity, how close a set of instances is to having just one label; impurity (disorder), how close it is to total uncertainty over labels.
– The measure: entropy, directly proportional to impurity, uncertainty, irregularity, surprise; inversely proportional to purity, certainty, regularity, redundancy.
Example
– For simplicity, assume H = {0, 1}, distributed according to Pr(y); there can be more than 2 discrete class labels, and continuous random variables have differential entropy.
– Optimal purity for y: either Pr(y = 0) = 1 and Pr(y = 1) = 0, or Pr(y = 1) = 1 and Pr(y = 0) = 0; entropy is 0 if all members of S belong to the same class.
[plot: H(p) = Entropy(p) against p+ = Pr(y = +), with a maximum of 1.0]

Entropy: Information-Theoretic Definition
Components
– D: a set of examples {<x_1, c(x_1)>, <x_2, c(x_2)>, …, <x_m, c(x_m)>}
– p+ = Pr(c(x) = +), p- = Pr(c(x) = -)
Definition
– H is defined over a probability density function p.
– D contains examples whose frequency of + and - labels indicates p+ and p- for the observed data.
– The entropy of D relative to c is: H(D) = -p+ log_b(p+) - p- log_b(p-)
If a target attribute can take on c different values, the entropy of S relative to this c-wise classification is Entropy(S) = -Σ_{i=1}^{c} p_i log_2(p_i), where p_i is the proportion of S belonging to class i.

Entropy
Entropy is 1 when S contains an equal number of positive and negative examples. Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S.
– What is the least pure probability distribution? Pr(y = 0) = 0.5, Pr(y = 1) = 0.5, which corresponds to maximum impurity, uncertainty, irregularity, and surprise. A property of entropy: it is a concave ("concave downward") function.
What units is H measured in?
– Depends on the base b of the log (bits for b = 2, nats for b = e, etc.).
– A single bit is required to encode each example in the worst case (p+ = 0.5).
– If there is less uncertainty (e.g., p+ = 0.8), we can use less than 1 bit each.
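A minimal sketch of the two-class entropy just described, in bits; the p+ = 0.8 case reproduces the "less than 1 bit" remark above.

```python
import math

def entropy(p_pos):
    """Two-class entropy in bits (log base 2) for a positive-class probability."""
    if p_pos in (0.0, 1.0):
        return 0.0  # a pure sample carries no uncertainty
    p_neg = 1.0 - p_pos
    return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)

print(entropy(0.5))  # 1.0 bit: the least pure distribution
print(entropy(0.8))  # about 0.72 bits: less uncertainty, so fewer bits per example
```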

Information Gain
Information gain is a measure of the effectiveness of an attribute in classifying the training data: it measures the expected reduction in entropy caused by partitioning the examples according to the attribute, i.e., the uncertainty removed by splitting on the value of attribute A.
The information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is
Gain(S, A) = Entropy(S) - Σ_{v ∈ values(A)} (|S_v| / |S|) Entropy(S_v)
where values(A) is the set of all possible values of A and S_v is the subset of S for which A has value v.
Gain(S, A) is the information provided about the target function value, given the value of attribute A; equivalently, it is the number of bits saved when encoding the target value of an arbitrary member of S by knowing the value of attribute A.
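A minimal sketch of Gain(S, A) for examples stored as (attribute-dict, label) pairs; the data layout and helper names are illustrative assumptions.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Entropy of a sequence of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Gain(S, A): entropy of S minus the weighted entropy of each partition S_v."""
    labels = [label for _, label in examples]
    gain = class_entropy(labels)
    for value in {x[attribute] for x, _ in examples}:
        subset = [label for x, label in examples if x[attribute] == value]
        gain -= (len(subset) / len(examples)) * class_entropy(subset)
    return gain
```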

An Illustrative Example
Prior (unconditioned) distribution: 9+, 5-
– H(D) = -(9/14) lg(9/14) - (5/14) lg(5/14) = 0.94 bits
– H(D, Humidity = High) = -(3/7) lg(3/7) - (4/7) lg(4/7) = 0.985 bits
– H(D, Humidity = Normal) = -(6/7) lg(6/7) - (1/7) lg(1/7) = 0.592 bits
– Gain(D, Humidity) = 0.94 - (7/14) * 0.985 - (7/14) * 0.592 = 0.151 bits
– Similarly, Gain(D, Wind) = 0.94 - (8/14) * 0.811 - (6/14) * 1.0 = 0.048 bits
[diagram: splitting [9+, 5-] on Humidity gives High [3+, 4-] and Normal [6+, 1-]; splitting [9+, 5-] on Wind gives Light [6+, 2-] and Strong [3+, 3-]]
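A quick numerical check of this slide, recomputing the gains directly from the class counts in the split diagrams (a sketch; the helper name is illustrative).

```python
import math

def H(pos, neg):
    """Two-class entropy in bits from raw class counts."""
    total = pos + neg
    return sum(-(n / total) * math.log2(n / total) for n in (pos, neg) if n)

prior = H(9, 5)                                                 # about 0.940 bits
gain_humidity = prior - (7 / 14) * H(3, 4) - (7 / 14) * H(6, 1)
gain_wind = prior - (8 / 14) * H(6, 2) - (6 / 14) * H(3, 3)
print(round(gain_humidity, 3), round(gain_wind, 3))
# Prints 0.152 0.048, matching the slide's figures up to rounding of the
# intermediate entropies.
```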

Attributes with Many Values
Problem
– If an attribute has many values, Gain() will tend to select it (why?).
– Imagine using Date = 06/03/1996 as an attribute!
One approach: use GainRatio instead of Gain.
– GainRatio(D, A) = Gain(D, A) / SplitInformation(D, A), where SplitInformation(D, A) = -Σ_{i=1}^{c} (|D_i| / |D|) lg(|D_i| / |D|) over the c = |values(A)| partitions of D induced by A.
– SplitInformation grows with c = |values(A)|, i.e., it penalizes attributes with more values. For example, suppose c_1 = c_Date = n and c_2 = 2, with the values splitting the examples evenly: then SplitInformation(A_1) = lg(n) and SplitInformation(A_2) = 1, so if Gain(D, A_1) = Gain(D, A_2), then GainRatio(D, A_1) << GainRatio(D, A_2).
– Thus a preference bias (for a lower branch factor) is expressed via GainRatio().
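A minimal sketch of SplitInformation and GainRatio in the same illustrative data layout as before; it reuses the information_gain() helper from the earlier sketch, and the zero-split fallback is an assumption.

```python
import math
from collections import Counter

def split_information(examples, attribute):
    """SplitInformation(D, A): entropy of D with respect to the values of A, in bits."""
    total = len(examples)
    counts = Counter(x[attribute] for x, _ in examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gain_ratio(examples, attribute):
    """GainRatio(D, A) = Gain(D, A) / SplitInformation(D, A)."""
    si = split_information(examples, attribute)
    return information_gain(examples, attribute) / si if si else 0.0
```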

Summary Points
Heuristic : Search :: Inductive Bias : Inductive Generalization
Entropy and Information Gain
– Goal: to measure the uncertainty removed by splitting on a candidate attribute A.
– Calculating information gain (change in entropy).
– Using information gain in the construction of a decision tree.
Search & Measure
– Search and measure play the dominant roles in feature selection.
– Stopping criteria are usually determined by a particular combination of search and measure.
– Different feature selection methods arise from different combinations of search strategies and evaluation measures.