Data Mining using Decision Trees - Professor J. F. Baldwin

Decision Trees from Data Base

Ex Num  Size    Colour  Shape   Concept Satisfied
1       med     blue    brick   yes
2       small   red     wedge   no
3       small   red     sphere  yes
4       large   red     wedge   no
5       large   green   pillar  yes
6       large   red     pillar  no
7       large   green   sphere  yes

Choose target: Concept Satisfied. Use all attributes except Ex Num.
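
For the sketches that follow, the database above can be written as a small Python list of dictionaries. The attribute names and the yes/no values are the slides' own; the variable name EXAMPLES is just a convenient label introduced here.

```python
# The seven training examples above, one dict per row (Ex Num omitted, since
# the slide says it is not used as an attribute).
EXAMPLES = [
    {"Size": "med",   "Colour": "blue",  "Shape": "brick",  "Satisfied": "yes"},
    {"Size": "small", "Colour": "red",   "Shape": "wedge",  "Satisfied": "no"},
    {"Size": "small", "Colour": "red",   "Shape": "sphere", "Satisfied": "yes"},
    {"Size": "large", "Colour": "red",   "Shape": "wedge",  "Satisfied": "no"},
    {"Size": "large", "Colour": "green", "Shape": "pillar", "Satisfied": "yes"},
    {"Size": "large", "Colour": "red",   "Shape": "pillar", "Satisfied": "no"},
    {"Size": "large", "Colour": "green", "Shape": "sphere", "Satisfied": "yes"},
]
```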

CLS - Concept Learning System - Hunt et al.

[Tree structure diagram: a parent node tests an attribute V with values v1, v2, v3; each branch leads to a child node containing a mixture of +ve and -ve examples.]

CLS ALGORITHM

1. Initialise the tree T by setting it to consist of one node containing all the examples, both +ve and -ve, in the training set.
2. If all the examples in T are +ve, create a YES node and HALT.
3. If all the examples in T are -ve, create a NO node and HALT.
4. Otherwise, select an attribute F with values v1, ..., vn. Partition T into subsets T1, ..., Tn according to the values of F. Create branches with F as parent and T1, ..., Tn as child nodes.
5. Apply the procedure recursively to each child node.
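
A minimal Python sketch of the five steps above, assuming examples are dictionaries like the EXAMPLES list written out after the database slide; the function name and tree representation are illustrative, not from the slides.

```python
def cls(examples, attributes):
    """Build a CLS tree; attributes are taken in whatever order they are given."""
    targets = {e["Satisfied"] for e in examples}
    if targets == {"yes"}:                       # step 2: all +ve -> YES leaf
        return "YES"
    if targets == {"no"}:                        # step 3: all -ve -> NO leaf
        return "NO"
    if not attributes:                           # mixed but nothing left to split on:
        yes = sum(e["Satisfied"] == "yes" for e in examples)
        return "YES" if yes >= len(examples) - yes else "NO"
    attr, rest = attributes[0], attributes[1:]   # step 4: select an attribute F
    branches = {}
    for value in {e[attr] for e in examples}:    # partition on the values of F
        subset = [e for e in examples if e[attr] == value]
        branches[value] = cls(subset, rest)      # step 5: recurse on each child
    return (attr, branches)

# e.g. cls(EXAMPLES, ["Size", "Colour", "Shape"]) splits on Size at the root, as in
# the Data Base Example slide; the lower splits depend on the attribute order chosen.
```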

Data Base Example

Using attribute SIZE, the root node {1, 2, 3, 4, 5, 6, 7} splits into:
  med   -> {1}           all +ve, so YES
  small -> {2, 3}        mixed, expand
  large -> {4, 5, 6, 7}  mixed, expand

Expanding

  med   -> {1}  YES
  small -> {2, 3}, split on SHAPE:
             wedge  -> {2}  no
             sphere -> {3}  yes
  large -> {4, 5, 6, 7}, split on SHAPE:
             wedge  -> {4}  No
             sphere -> {7}  Yes
             pillar -> {5, 6}, split on COLOUR:
                        red   -> {6}  No
                        green -> {5}  Yes

Rules from Tree

IF (SIZE = large AND ((SHAPE = wedge) OR (SHAPE = pillar AND COLOUR = red)))
   OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR (SHAPE = sphere)))
   OR (SIZE = small AND SHAPE = sphere)
   OR (SIZE = medium)
THEN YES

Disjunctive Normal Form - DNF

IF (SIZE = medium)
   OR (SIZE = small AND SHAPE = sphere)
   OR (SIZE = large AND SHAPE = sphere)
   OR (SIZE = large AND SHAPE = pillar AND COLOUR = green)
THEN CONCEPT = satisfied
ELSE CONCEPT = not satisfied

ID3 - Quinlan

ID3 = CLS + efficient ordering of attributes. Entropy is used to order the attributes. In the CLS algorithm attributes are chosen in an arbitrary order, which can result in large decision trees if the ordering is poor. An optimal ordering would give the smallest decision tree, but no method is known to determine the optimal ordering. We use a heuristic that provides an efficient ordering and usually yields a near-optimal tree.

Entropy

For a random variable V which can take values {v1, v2, ..., vn} with Pr(vi) = pi for all i, the entropy of V is

  S(V) = - SUM_i pi log2(pi)

Entropy for a fair die = - SUM over {1..6} of (1/6) log2(1/6) = log2(6) = 2.585

Entropy for a fair die known to show an even score = - SUM over {2, 4, 6} of (1/3) log2(1/3) = log2(3) = 1.585

Information gain = 2.585 - 1.585 = 1 bit, i.e. the difference between the two entropies.
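
A quick numeric check of the die example in Python; the helper name entropy is mine, not the slide's.

```python
from math import log2

def entropy(probs):
    """S(V) = -sum p*log2(p) over a discrete distribution, skipping zero entries."""
    return -sum(p * log2(p) for p in probs if p > 0)

fair_die = [1/6] * 6                 # six equally likely scores
even_die = [1/3] * 3                 # scores 2, 4, 6 equally likely
print(entropy(fair_die))             # ~2.585 bits
print(entropy(even_die))             # ~1.585 bits
print(entropy(fair_die) - entropy(even_die))   # information gain ~1 bit
```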

Attribute Expansion

To expand attribute Ai with values ai1, ..., aim, start from the joint probability Pr(A1, ..., Ai, ..., An, T) over all attributes and the target T, taking the training examples as equally likely unless otherwise specified. For each value ai1, pass down the probabilities of the rows with Ai = ai1 and re-normalise, giving Pr(A1, ..., Ai-1, Ai+1, ..., An, T | Ai = ai1) over the other attributes. If the original distribution was equally likely, the conditional distribution is again equally likely over the remaining rows.

Expected Entropy for an Attribute

For attribute Ai with values ai1, ..., aim and target T: for each value aij, pass down the probabilities of the target values tk for the rows with Ai = aij and re-normalise to obtain Pr(T | Ai = aij), and let S(aij) be the entropy of this conditional distribution. The expected entropy for Ai is

  S(Ai) = SUM_j Pr(Ai = aij) S(aij)

How to Choose the Attribute, and Information Gain

Determine the expected entropy S(Ai) for each attribute. Choose s such that S(As) = min_i S(Ai) and expand attribute As. By choosing attribute As the information gain is S - S(As), where S is the entropy of the target before the split. Minimising expected entropy is equivalent to maximising information gain.
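
A hedged sketch of this selection rule in Python, using the dictionary-style examples from earlier; the function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy_of(labels):
    """Entropy of a list of target values, each row equally likely."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def expected_entropy(examples, attr, target="Satisfied"):
    """S(A) = sum over values a of Pr(A = a) * S(T | A = a)."""
    total = len(examples)
    s = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == value]
        s += (len(subset) / total) * entropy_of(subset)
    return s

def best_attribute(examples, attributes, target="Satisfied"):
    """Choose the attribute with minimum expected entropy (maximum information gain)."""
    return min(attributes, key=lambda a: expected_entropy(examples, a, target))

# e.g. best_attribute(EXAMPLES, ["Size", "Colour", "Shape"]) returns "Shape",
# matching the first expansion chosen on the slides.
```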

Previous Example

Ex Num  Size    Colour  Shape   Concept Satisfied   Pr
1       med     blue    brick   yes                 1/7
2       small   red     wedge   no                  1/7
3       small   red     sphere  yes                 1/7
4       large   red     wedge   no                  1/7
5       large   green   pillar  yes                 1/7
6       large   red     pillar  no                  1/7
7       large   green   sphere  yes                 1/7

Pr(Concept Satisfied = yes) = 4/7, Pr(Concept Satisfied = no) = 3/7

S = -(4/7) log2(4/7) - (3/7) log2(3/7) = 0.99

Entropy for Attribute Size

Collapsing the table onto Size and Concept Satisfied:

Size    Satisfied   Pr
med     yes         1/7
small   no          1/7
small   yes         1/7
large   no          2/7
large   yes         2/7

Conditional distributions of Concept Satisfied:
  Size = med:    yes 1              S(med)   = 0
  Size = small:  no 1/2, yes 1/2    S(small) = 1
  Size = large:  no 1/2, yes 1/2    S(large) = 1

Pr(med) = 1/7, Pr(small) = 2/7, Pr(large) = 4/7

S(Size) = (2/7)(1) + (1/7)(0) + (4/7)(1) = 6/7 = 0.86

Information gain for Size = 0.99 - 0.86 = 0.13

First Expansion

Attribute   Information Gain
SIZE        0.13
COLOUR      0.52
SHAPE       0.7

Choose the maximum: SHAPE. Splitting {1, 2, 3, 4, 5, 6, 7} on SHAPE:
  wedge  -> {2, 4}  NO
  brick  -> {1}     YES
  sphere -> {3, 7}  YES
  pillar -> {5, 6}  expand
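
A self-contained check of this table, recomputing the three gains from the seven examples (variable names are mine):

```python
from collections import Counter
from math import log2

rows = [("med",   "blue",  "brick",  "yes"),
        ("small", "red",   "wedge",  "no"),
        ("small", "red",   "sphere", "yes"),
        ("large", "red",   "wedge",  "no"),
        ("large", "green", "pillar", "yes"),
        ("large", "red",   "pillar", "no"),
        ("large", "green", "sphere", "yes")]

def H(labels):
    """Entropy of a list of class labels, each row equally likely."""
    return -sum((c / len(labels)) * log2(c / len(labels))
                for c in Counter(labels).values())

S = H([r[3] for r in rows])                       # 0.99, as on the slides
for i, name in enumerate(["SIZE", "COLOUR", "SHAPE"]):
    expected = 0.0
    for value in {r[i] for r in rows}:
        subset = [r[3] for r in rows if r[i] == value]
        expected += (len(subset) / len(rows)) * H(subset)
    print(name, round(S - expected, 2))           # SIZE 0.13, COLOUR 0.52, SHAPE 0.7
```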

Complete Decision Tree

Splitting {1, 2, 3, 4, 5, 6, 7} on SHAPE:
  wedge  -> {2, 4}  NO
  brick  -> {1}     YES
  sphere -> {3, 7}  YES
  pillar -> {5, 6}, split on COLOUR:
             red   -> {6}  NO
             green -> {5}  YES

Rule: IF Shape is wedge OR (Shape is pillar AND Colour is red) THEN NO ELSE YES

A New Case

Size = med, Colour = red, Shape = pillar, Concept Satisfied = ?

Follow the tree: SHAPE = pillar, then COLOUR = red, so ? = NO
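
A small helper, assuming the (attribute, {value: subtree}) tree representation used in the CLS sketch earlier, that classifies a new case by walking the tree:

```python
def classify(tree, case):
    """Follow the branch for each tested attribute until a YES/NO leaf is reached."""
    while isinstance(tree, tuple):           # internal node: (attribute, branches)
        attribute, branches = tree
        tree = branches[case[attribute]]     # take the branch matching this case
    return tree                              # "YES" or "NO"

# Walking the SHAPE-then-COLOUR tree above, the new case
# {"Size": "med", "Colour": "red", "Shape": "pillar"} comes out as "NO".
```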

Post Pruning

Consider any node S containing N examples, of which n are of class C, where C is the class from {YES, NO} with the most examples at S (the majority class). Suppose we terminate this node and make it a leaf with classification C. What will be the expected error, E(S), if we use the tree for new cases and a new case reaches this node?

  E(S) = Pr(class of a new case reaching S is not class C)

Bayes Updating for Post Pruning

Let p denote the probability of class C for a new case arriving at S. We do not know p, so let f(p) be a prior probability distribution for p on [0, 1]. We can update this prior by Bayes updating with the information at node S, namely that n of the examples at S are of class C:

  f(p | n C in S) = Pr(n C in S | p) f(p) / INTEGRAL_0^1 Pr(n C in S | p) f(p) dp

Mathematics of Post Pruning

Assume f(p) to be uniform over [0, 1]. Then

  f(p | n C in S) = p^n (1-p)^(N-n) / INTEGRAL_0^1 p^n (1-p)^(N-n) dp

  E(S) = E[(1 - p) | n C in S]
       = INTEGRAL_0^1 (1 - p) f(p | n C in S) dp
       = INTEGRAL_0^1 p^n (1-p)^(N-n+1) dp / INTEGRAL_0^1 p^n (1-p)^(N-n) dp
       = (N - n + 1) / (N + 2)

The integrals are evaluated using Beta functions:

  INTEGRAL_0^1 x^a (1-x)^b dx = a! b! / (a + b + 1)!

so the numerator is n! (N - n + 1)! / (N + 2)! and the denominator is n! (N - n)! / (N + 1)!.
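
A small numerical sanity check of this result under the same uniform-prior assumption (pure-Python midpoint integration; the function name is mine):

```python
def beta_expected_error(n, N, steps=100_000):
    """Numerically evaluate E[(1 - p) | n C in S] for a uniform prior on p."""
    numerator = denominator = 0.0
    for i in range(steps):
        p = (i + 0.5) / steps                    # midpoint rule on [0, 1]
        weight = p**n * (1 - p)**(N - n)         # likelihood of n of class C out of N
        numerator += (1 - p) * weight
        denominator += weight
    return numerator / denominator

print(beta_expected_error(3, 5), (5 - 3 + 1) / (5 + 2))      # both ~0.4286
print(beta_expected_error(6, 10), (10 - 6 + 1) / (10 + 2))   # both ~0.4167
```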

Post Pruning for the Binary Case

For any node S which is not a leaf node, with children S1, S2, ..., Sm:

  BackUpError(S) = SUM_i Pi Error(Si),  where Pi = (number of examples in Si) / (number of examples in S)

  Error(S) = MIN { E(S), BackUpError(S) }

For leaf nodes Si, Error(Si) = E(Si).

Decision: prune at S if BackUpError(S) >= E(S).
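
A hedged sketch of this bottom-up rule. The tuple representation of the tree, ("leaf", yes, no) or ("node", yes, no, children), is an assumption made for illustration; it is not from the slides.

```python
def expected_error(yes, no):
    """E(S) = (N - n + 1) / (N + 2) with n the majority count out of N examples."""
    n, N = max(yes, no), yes + no
    return (N - n + 1) / (N + 2)

def prune(tree):
    """Return (pruned tree, Error(S)), pruning wherever BackUpError(S) >= E(S)."""
    if tree[0] == "leaf":
        _, yes, no = tree
        return tree, expected_error(yes, no)
    _, yes, no, children = tree
    total = yes + no
    pruned_children, backup_error = [], 0.0
    for child in children:
        pruned_child, child_error = prune(child)
        weight = (pruned_child[1] + pruned_child[2]) / total   # Pi = |Si| / |S|
        backup_error += weight * child_error
        pruned_children.append(pruned_child)
    static_error = expected_error(yes, no)
    if backup_error >= static_error:            # prune: make S a leaf labelled with C
        return ("leaf", yes, no), static_error
    return ("node", yes, no, pruned_children), backup_error
```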

Example of Post Pruning

Before pruning ([x, y] means x YES cases and y NO cases):

  a [6, 4]
   +- b [4, 2]
   |   +- leaf [3, 2]
   |   +- leaf [1, 0]
   +- c [2, 2]
       +- d [1, 2]
       |   +- leaf [1, 1]
       |   +- leaf [0, 1]
       +- leaf [1, 0]

At d: E(d) = 0.4, BackUpError(d) = (2/3)(0.5) + (1/3)(1/3) = 0.444, so PRUNE at d.
At b: E(b) = 0.375, BackUpError(b) = (5/6)(3/7) + (1/6)(1/3) = 0.413, so PRUNE at b.
At c: E(c) = 0.5, BackUpError(c) = (3/4)(0.4) + (1/4)(1/3) = 0.383, so do not prune at c.
At a: E(a) = 0.417, BackUpError(a) = (6/10)(0.375) + (4/10)(0.383) = 0.378, so do not prune at a.

PRUNE means cut the sub-tree below that node and make it a leaf.
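
The same figures, checked with a few lines of arithmetic; the [yes, no] counts are those of the tree reconstructed above, so treat them as illustrative.

```python
def E(yes, no):
    """E(S) = (N - n + 1) / (N + 2), n = majority count, N = yes + no."""
    n, N = max(yes, no), yes + no
    return (N - n + 1) / (N + 2)

backup_d = (2/3) * E(1, 1) + (1/3) * E(0, 1)     # 0.444 >= E(1, 2) = 0.400 -> prune d
backup_b = (5/6) * E(3, 2) + (1/6) * E(1, 0)     # 0.413 >= E(4, 2) = 0.375 -> prune b
error_d, error_b = min(E(1, 2), backup_d), min(E(4, 2), backup_b)
backup_c = (3/4) * error_d + (1/4) * E(1, 0)     # 0.383 <  E(2, 2) = 0.500 -> keep c
error_c = min(E(2, 2), backup_c)
backup_a = (6/10) * error_b + (4/10) * error_c   # 0.378 <  E(6, 4) = 0.417 -> keep a
print(backup_d, backup_b, backup_c, backup_a)
```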

Result of Pruning

After pruning:

  a [6, 4]
   +- leaf [4, 2]       (b pruned)
   +- c [2, 2]
       +- leaf [1, 2]   (d pruned)
       +- leaf [1, 0]

Generalisation

For the case in which we have k classes, the generalisation of E(S) is

  E(S) = (N - n + k - 1) / (N + k)

Otherwise, the pruning method is the same.
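
The generalised estimate as a one-line helper (for k = 2 it reduces to the binary formula used above):

```python
def expected_error(n, N, k):
    """E(S) = (N - n + k - 1) / (N + k) for k classes, n majority examples out of N."""
    return (N - n + k - 1) / (N + k)

print(expected_error(3, 5, 2))   # binary leaf [3, 2]: 3/7, matching the earlier value
```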

Testing

Split the database into a Training Set and a Test Set.

1. Learn the rules using the Training Set and prune.
2. Test the rules on the Training Set and record the % correct.
3. Test the rules on the Test Set and record the % correct.

The % accuracy on the Test Set should be close to that on the Training Set; this indicates good generalisation. Over-fitting can occur if noisy data is used or attributes are too specific. Pruning will overcome noise to some extent but not completely; too-specific attributes must be dropped.
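
A hedged sketch of this procedure; learn_and_prune and classify stand in for the tree-building and pruning sketches above, and the split fraction is an arbitrary choice.

```python
import random

def train_test_accuracy(examples, learn_and_prune, classify,
                        test_fraction=0.3, target="Satisfied", seed=0):
    """Hold out a test set, learn on the rest, and report both accuracies."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    model = learn_and_prune(train)               # learn rules and prune on training set

    def accuracy(dataset):
        hits = sum(classify(model, case) == case[target] for case in dataset)
        return hits / len(dataset)

    # For good generalisation the two figures should be close.
    return accuracy(train), accuracy(test)
```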