
Classification I

2 The Task
Input: Collection of instances with a set of attributes x and a special nominal attribute Y called the class attribute
Output: A model that accurately predicts Y from x on previously unseen instances
–Previously unseen instances are called the test set
Usually the input collection is divided into
–Training set for building the required model
–Test set for evaluating the model built

3 Example Data
Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no
Play is the class attribute.
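For use with the illustrative code sketches later in this section, the same data can be expressed as a Python list of dicts. This is not from the slides; the variable name weather is my own.

    # Weather data as a list of dicts, one per instance
    weather = [dict(zip(["Outlook", "Temperature", "Humidity", "Windy", "Play"], row))
               for row in [
        ("sunny", "hot", "high", "false", "no"),
        ("sunny", "hot", "high", "true", "no"),
        ("overcast", "hot", "high", "false", "yes"),
        ("rainy", "mild", "high", "false", "yes"),
        ("rainy", "cool", "normal", "false", "yes"),
        ("rainy", "cool", "normal", "true", "no"),
        ("overcast", "cool", "normal", "true", "yes"),
        ("sunny", "mild", "high", "false", "no"),
        ("sunny", "cool", "normal", "false", "yes"),
        ("rainy", "mild", "normal", "false", "yes"),
        ("sunny", "mild", "normal", "true", "yes"),
        ("overcast", "mild", "high", "true", "yes"),
        ("overcast", "hot", "normal", "false", "yes"),
        ("rainy", "mild", "high", "true", "no"),
    ]]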

4 Several Techniques
Decision tree based methods
–Studied in this lecture
Rule based methods
Nearest neighbour methods
Naïve Bayes and Bayesian networks

5 Decision Tree Construction
Recursive procedure:
–Select an attribute to place at the root node
–Make one branch for each possible value of the selected attribute
–For each branch, repeat the above two steps recursively, using only those instances that actually reach the branch
–Stop developing a branch once all the instances reaching it belong to the same class
Several decision trees are possible, depending on the order in which the attributes are selected (a minimal sketch of the procedure follows).
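To make the recursion concrete, here is a minimal Python sketch of the procedure above. It is not from the slides: instances are assumed to be dicts mapping attribute names to values (as in the weather list above), the class attribute defaults to "Play", and choose_attribute is a placeholder that simply takes the first remaining attribute; a purity-based version is sketched later in this section.

    from collections import Counter

    def choose_attribute(instances, attributes, class_attr):
        # Placeholder selection: take the first remaining attribute.
        # An information-gain version is sketched later in this section.
        return attributes[0]

    def build_tree(instances, attributes, class_attr="Play"):
        labels = [inst[class_attr] for inst in instances]
        # Stop developing a branch once all its instances share one class
        if len(set(labels)) == 1:
            return labels[0]
        # No attributes left to test: predict the majority class
        if not attributes:
            return Counter(labels).most_common(1)[0][0]
        # Select an attribute to place at this node
        best = choose_attribute(instances, attributes, class_attr)
        tree = {best: {}}
        # Make one branch per value, recursing on the instances that reach it
        for value in set(inst[best] for inst in instances):
            subset = [inst for inst in instances if inst[best] == value]
            rest = [a for a in attributes if a != best]
            tree[best][value] = build_tree(subset, rest, class_attr)
        return tree

    # Example call:
    # build_tree(weather, ["Outlook", "Temperature", "Humidity", "Windy"])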

6 Example Decision Tree 1
[Figure: a decision tree that tests humidity at the root, with outlook and windy tested further down and yes/no counts at the leaves]
Observations:
–Outlook and windy are repeated in this tree
–Windy is tested only when outlook is rainy
Actions:
–Start the tree with outlook
–Test windy when outlook is rainy

7 Example Decision Tree 2
[Figure: a decision tree with outlook at the root; the sunny branch tests humidity (high -> no, normal -> yes), the overcast branch is a yes leaf, and the rainy branch tests windy (false -> yes, true -> no)]
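For reference, this tree can be written down as the kind of nested-dict structure the build_tree sketch above would produce. This is purely illustrative; the nesting convention and string values are assumptions, not part of the slides.

    # Example decision tree 2 as a nested dict (leaves are class labels)
    tree2 = {"Outlook": {
        "sunny":    {"Humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy":    {"Windy": {"false": "yes", "true": "no"}},
    }}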

8 Occam's Razor
Principle stated by William of Ockham: "Other things being equal, simple theories are preferable to complex ones"
Informally: "Keep it simple, stupid!"
This has been the guiding principle for developing scientific theories
Applied to our two decision trees describing the weather data
–Decision tree 2 is preferable to decision tree 1
Small decision trees are better
–Attribute ordering makes the difference

9 Attribute Selection for Splitting
In our case, outlook is a better attribute at the root than the others
–Because when outlook is at the root, one of its branches (overcast) immediately leads to a 'pure' daughter node, which terminates further tree growth
–Pure nodes have the same class label for all the instances reaching that node
We need a generic function that helps to rank-order attributes
–The function should reward attributes that produce 'pure' daughter nodes

10 Entropy
Entropy is a measure of purity expressed in bits
For a node in the tree, it represents the expected amount of information that would be needed to specify the class of a new instance reaching the node
Entropy(p1,p2,…,pn) = -p1*log p1 - p2*log p2 - … - pn*log pn
–Logarithms are computed to base 2 because we want the entropy measure in bits
–p1, p2, …, pn are fractions that sum to 1
–Because logarithms of fractions are negative, minus signs are used in the formula above to keep the entropy measure positive
For the weather data:
–p1 = fraction of instances with play = yes = 9/14
–p2 = fraction of instances with play = no = 5/14
–entropy(9/14,5/14) = -9/14*log 9/14 - 5/14*log 5/14
  = -9/14*(log 9 - log 14) - 5/14*(log 5 - log 14)
  = -9/14*log 9 + 9/14*log 14 - 5/14*log 5 + 5/14*log 14
  = -9/14*log 9 - 5/14*log 5 + 14/14*log 14
  = (-9*log 9 - 5*log 5 + 14*log 14)/14
  = 0.940 bits (fractions of bits are allowed!)
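As a quick check of this arithmetic, a small Python helper (illustrative only, not from the slides) reproduces the 0.940-bit figure:

    import math

    def entropy(*fractions):
        # Entropy in bits of a class distribution given as fractions summing to 1
        return -sum(p * math.log2(p) for p in fractions if p > 0)

    # 9 of the 14 weather instances are 'yes', 5 are 'no'
    print(round(entropy(9/14, 5/14), 3))   # prints 0.94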

11 Tree stumps for the weather data
[Figure: four one-level trees (stumps), one per attribute: outlook (sunny/overcast/rainy), temperature (hot/mild/cool), humidity (high/normal) and windy (true/false), each leaf labelled with the yes/no counts of the instances reaching it]

12 Entropy for the outlook stump
Count the numbers of yes and no classes at the leaf nodes
–[2,3], [4,0] and [3,2]
Compute the entropy for each branch of the stump
–Entropy(2/5,3/5) = 0.971 bits
–Entropy(4/4,0/4) = 0.0 bits
–Entropy(3/5,2/5) = 0.971 bits
Compute the entropy for the whole stump as the weighted average over its branches
–Entropy([2,3],[4,0],[3,2]) = 5/14*Entropy(2/5,3/5) + 4/14*Entropy(4/4,0/4) + 5/14*Entropy(3/5,2/5)
  = 5/14*0.971 + 4/14*0 + 5/14*0.971
  = 0.693 bits
This represents the information needed, in bits, to specify the class of a new instance using this stump
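Reusing the entropy helper from the previous slide, the weighted stump entropy can be reproduced as follows (illustrative sketch only):

    # Leaf counts for the outlook stump: sunny [2,3], overcast [4,0], rainy [3,2]
    stump_entropy = (5/14 * entropy(2/5, 3/5)
                     + 4/14 * entropy(4/4, 0/4)
                     + 5/14 * entropy(3/5, 2/5))
    print(round(stump_entropy, 3))   # prints 0.694 (the slide's 0.693 comes from rounding 0.971 first)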

13 Information Gain
When the attribute outlook is not considered, the weather data set has an entropy of 0.940 bits (as computed on slide 10)
The entropy for the outlook stump is 0.693 bits (as computed on the previous slide)
We therefore save (0.940 - 0.693) bits of the information needed to specify the class of a new instance by using this stump
–This is the informational value of creating the outlook node
–Gain(outlook) = (0.940 - 0.693) bits = 0.247 bits
Similar computations for the other stumps give us
–Gain(temperature) = 0.029 bits
–Gain(humidity) = 0.152 bits
–Gain(windy) = 0.048 bits
Because the information gain is largest for outlook, it should be selected for the root node
Continuing with the above procedure builds example decision tree 2
The ID3 algorithm uses information gain with the recursive procedure described earlier; an illustrative computation is sketched below
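Putting the pieces together, the sketch below (again illustrative, assuming the entropy helper and the weather list of dicts defined earlier) computes the information gain of an attribute, and its choose_attribute replaces the placeholder used in the earlier build_tree sketch, giving an ID3-style tree builder.

    from collections import Counter

    def info_gain(instances, attribute, class_attr="Play"):
        def class_entropy(insts):
            # Entropy of the class distribution among the given instances
            counts = Counter(inst[class_attr] for inst in insts)
            total = sum(counts.values())
            return entropy(*(c / total for c in counts.values()))

        before = class_entropy(instances)
        # Weighted entropy of the stump that splits on the attribute
        after = 0.0
        for value in set(inst[attribute] for inst in instances):
            subset = [inst for inst in instances if inst[attribute] == value]
            after += len(subset) / len(instances) * class_entropy(subset)
        return before - after

    def choose_attribute(instances, attributes, class_attr="Play"):
        # ID3-style selection: the attribute with the largest information gain
        return max(attributes, key=lambda a: info_gain(instances, a, class_attr))

    # e.g. info_gain(weather, "Outlook") ≈ 0.247, info_gain(weather, "Windy") ≈ 0.048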

14 Weather Data with ID
ID  Outlook   Temperature  Humidity  Windy  Play
a   sunny     hot          high      false  no
b   sunny     hot          high      true   no
c   overcast  hot          high      false  yes
d   rainy     mild         high      false  yes
e   rainy     cool         normal    false  yes
f   rainy     cool         normal    true   no
g   overcast  cool         normal    true   yes
h   sunny     mild         high      false  no
i   sunny     cool         normal    false  yes
j   rainy     mild         normal    false  yes
k   sunny     mild         normal    true   yes
l   overcast  mild         high      true   yes
m   overcast  hot          normal    false  yes
n   rainy     mild         high      true   no

15 Tree Stump for the ID attribute
[Figure: a stump on the ID attribute with one branch per id value (a, b, c, …, n); each leaf holds a single instance, so its class counts are [1,0] or [0,1]]
Entropy([0,1],[0,1],[1,0],[1,0],…,[1,0],[0,1])
  = 1/14*Entropy(0/1,1/1) + 1/14*Entropy(0/1,1/1) + 1/14*Entropy(1/1,0/1) + 1/14*Entropy(1/1,0/1) + … + 1/14*Entropy(1/1,0/1) + 1/14*Entropy(0/1,1/1)
  = 1/14*0 + 1/14*0 + … + 1/14*0
  = 0
Gain(ID) = 0.940 - 0 = 0.940 bits

16 Highly branching attributes
The weather data with ID has a different id value for each instance
–Knowing the id value is enough to predict the class
–Therefore the entropy for the stump with this attribute is zero
The gain is high (0.940 bits), and therefore this attribute would be selected first for tree construction
–The tree constructed would be useless
  Cannot predict new instances
  Tells nothing about the structure of the decision
This situation arises whenever an attribute leads to high branching
A split always results in an entropy reduction
–Because a split reduces the mixture of classes within each daughter node
Information gain does not account for this intrinsic information of a split

17 Gain Ratio
Gain ratio is a measure that adjusts for the intrinsic information of a split
The intrinsic information of a split is computed from the number and size of the daughter nodes produced by the split
–Without taking the class information into account
Intrinsic information for the ID stump
–Entropy([1,1,1,…,1]) = 14*(-1/14*log 1/14) = 3.807 bits
Gain Ratio = Information Gain / Intrinsic Information
Gain ratio for the ID stump = 0.940/3.807 = 0.247
Gain ratios for the other stumps are
–GainRatio(outlook) = 0.157, GainRatio(temperature) = 0.019
–GainRatio(humidity) = 0.152, GainRatio(windy) = 0.049
In this case, ID still wins, but not by a huge margin
Additional measures are used to guard against such useless attributes
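A gain-ratio variant of the earlier sketch (illustrative only, building on the entropy and info_gain helpers above) divides the gain by the split's intrinsic information:

    def gain_ratio(instances, attribute, class_attr="Play"):
        # Intrinsic information: entropy of the split sizes, ignoring the classes
        # (assumes the attribute takes more than one value, so this is non-zero)
        sizes = Counter(inst[attribute] for inst in instances)
        total = sum(sizes.values())
        intrinsic = entropy(*(n / total for n in sizes.values()))
        return info_gain(instances, attribute, class_attr) / intrinsic

    # e.g. gain_ratio(weather, "Outlook") ≈ 0.157; with a unique ID column added,
    # the ratio would be 0.940 / 3.807 ≈ 0.247, as on the slide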

18 Gain Ratio 2
Gain ratio may overcompensate for intrinsic information
–Because humidity splits the data into only two branches, its gain ratio is nearly equal to that of outlook
A possible fix is to choose the attribute
–with the highest gain ratio
–among attributes whose information gain is at least the average information gain over all the attributes
C4.5 (J48 in Weka) uses gain ratio and is an improvement over ID3
Attributes for splitting can also be selected using other measures, such as the Gini index

19 Summary so far
Attribute selection for splitting is achieved using measures of 'purity'
–Information Gain
–Gain Ratio
–etc.
Issues related to decision tree construction, studied next:
–Numeric attributes
–Missing values
–Overfitting and pruning