

Decision Trees

2 Outline
- What is a decision tree?
- How to construct a decision tree?
  - What are the major steps in decision tree induction?
  - How to select the attribute to split the node?
- What are the other issues?

3 Classification by Decision Tree Induction
- Decision tree
  - A flow-chart-like tree structure
  - Internal node denotes a test on an attribute
  - Branch represents an outcome of the test
  - Leaf nodes represent class labels or class distribution
- Example tree:
  Age?
    <=30: Student?
      no: NO
      yes: YES
    31…40: YES
    >40: Credit?
      excellent: NO
      fair: YES

4 Training Dataset
ID   Age    Income  Student  Credit     Buys_computer
P1   <=30   high    no       fair       no
P2   <=30   high    no       excellent  no
P3   31…40  high    no       fair       yes
P4   >40    medium  no       fair       yes
P5   >40    low     yes      fair       yes
P6   >40    low     yes      excellent  no
P7   31…40  low     yes      excellent  yes
P8   <=30   medium  no       fair       no
P9   <=30   low     yes      fair       yes
P10  >40    medium  yes      fair       yes
P11  <=30   medium  yes      excellent  yes
P12  31…40  medium  no       excellent  yes
P13  31…40  high    yes      fair       yes
P14  >40    medium  no       excellent  no

5 Output: A Decision Tree for “buys_computer”
Age?
  <=30: Student?
    no: NO
    yes: YES
  31…40: YES
  >40: Credit?
    excellent: NO
    fair: YES


7 Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At start, all training examples are at the root
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
  - Examples are partitioned recursively based on the selected attributes
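The greedy, top-down procedure described on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the presenter's code: the names `ATTRS`, `ROWS`, `entropy`, `partition`, `gain`, and `build` are made up here, the rows are copied from the Training Dataset slide (with `31..40` written in ASCII), and information gain is assumed as the selection measure.

```python
import math
from collections import Counter

# Rows copied from the "Training Dataset" slide (P1..P14):
# (age, income, student, credit, buys_computer). "31..40" is an ASCII spelling.
ATTRS = ["age", "income", "student", "credit"]
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(rows):
    """Information of the class distribution (class label is the last field)."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return sum(-c / n * math.log2(c / n) for c in counts.values())

def partition(rows, i):
    """Group rows by the value of the attribute at index i."""
    parts = {}
    for r in rows:
        parts.setdefault(r[i], []).append(r)
    return parts

def gain(rows, i):
    """Information gain obtained by splitting rows on attribute index i."""
    parts = partition(rows, i)
    remainder = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return entropy(rows) - remainder

def build(rows, attr_idx):
    """Top-down recursive divide-and-conquer construction (greedy)."""
    classes = {r[-1] for r in rows}
    if len(classes) == 1:                      # pure node becomes a leaf
        return classes.pop()
    if not attr_idx:                           # no attribute left: majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = max(attr_idx, key=lambda i: gain(rows, i))
    rest = [i for i in attr_idx if i != best]
    return {ATTRS[best]: {v: build(p, rest)
                          for v, p in partition(rows, best).items()}}

tree = build(ROWS, list(range(len(ATTRS))))
print(tree)
```

On this dataset the sketch selects age at the root, then student and credit in the two impure branches, which matches the tree shown on the construction slides.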


9 Construction of a Decision Tree for “buys_computer”
? [P1,…,P14] Yes: 9, No: 5


11 Construction of a Decision Tree for “buys_computer”
Age? [P1,…,P14] Yes: 9, No: 5
  <=30
  31…40
  >40



14 Construction of a Decision Tree for “buys_computer”
Age? [P1,…,P14] Yes: 9, No: 5
  <=30: ? [P1,P2,P8,P9,P11] Yes: 2, No: 3
  31…40: YES [P3,P7,P12,P13] Yes: 4, No: 0
  >40: ? [P4,P5,P6,P10,P14] Yes: 3, No: 2

15 Construction of a Decision Tree for “buys_computer”
Age? [P1,…,P14] Yes: 9, No: 5
  <=30: Student? [P1,P2,P8,P9,P11] Yes: 2, No: 3
    no: NO [P1,P2,P8] Yes: 0, No: 3
    yes: YES [P9,P11] Yes: 2, No: 0
  31…40: YES [P3,P7,P12,P13] Yes: 4, No: 0
  >40: ? [P4,P5,P6,P10,P14] Yes: 3, No: 2

16 Construction of a Decision Tree for “buys_computer”
Age? [P1,…,P14] Yes: 9, No: 5
  <=30: Student? [P1,P2,P8,P9,P11] Yes: 2, No: 3
    no: NO [P1,P2,P8] Yes: 0, No: 3
    yes: YES [P9,P11] Yes: 2, No: 0
  31…40: YES [P3,P7,P12,P13] Yes: 4, No: 0
  >40: Credit? [P4,P5,P6,P10,P14] Yes: 3, No: 2
    excellent: NO [P6,P14] Yes: 0, No: 2
    fair: YES [P4,P5,P10] Yes: 3, No: 0



19 Which Attribute is the Best?
- The attribute most useful for classifying examples
- Information gain
  - An information-theoretic approach
  - Measures how well an attribute separates the training examples
  - Use the attribute with the highest information gain to split
  - Minimizes the expected number of tests needed to classify a new tuple
- How useful? How well separated? How pure is the splitting result? These questions are answered by information gain.

20 Choosing an attribute
- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
- Patrons? is a better choice than Type? (compare the gains computed on slide 31)

21 Information theory
- If there are n equally probable possible messages, then the probability p of each is 1/n
- Information conveyed by a message is $-\log_2 p = \log_2 n$
- E.g., if there are 16 messages, then $\log_2 16 = 4$ and we need 4 bits to identify/send each message
- In general, suppose we are given a probability distribution $P = (p_1, p_2, \dots, p_n)$
- Then the information conveyed by the distribution (a.k.a. the entropy of P) is:
  $I(P) = -(p_1 \log_2 p_1 + p_2 \log_2 p_2 + \dots + p_n \log_2 p_n)$

22 Information theory II
- Information conveyed by a distribution (a.k.a. the entropy of P):
  $I(P) = -(p_1 \log_2 p_1 + p_2 \log_2 p_2 + \dots + p_n \log_2 p_n)$
- Examples:
  - If P is (0.5, 0.5) then I(P) is 1
  - If P is (2/3, 1/3), roughly (0.67, 0.33), then I(P) is about 0.92
  - If P is (1, 0) then I(P) is 0
- The more uniform the probability distribution, the greater its information: more information is conveyed by a message telling you which event actually occurred
- Entropy is the average number of bits per message needed to represent a stream of messages
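The three examples on this slide can be checked with a short Python sketch. The function name `information` is an assumption made for illustration; logarithms are taken base 2 (as implied by the 16-message example), and terms with p_i = 0 are treated as contributing 0, which is the usual convention.

```python
import math

def information(p):
    """Entropy I(P) in bits; terms with p_i == 0 contribute 0 by convention."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

print(information([0.5, 0.5]))              # uniform over two outcomes
print(round(information([2/3, 1/3]), 2))    # skewed distribution
print(information([1.0, 0.0]))              # certain outcome
print(information([1/16] * 16))             # 16 equally likely messages
```

The uniform distributions give the largest values (1 bit for two outcomes, 4 bits for sixteen), and the certain outcome gives 0.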

23 Information for classification
- If a set S of records is partitioned into disjoint exhaustive classes $(C_1, C_2, \dots, C_k)$ on the basis of the value of the class attribute, then the information needed to identify the class of an element of S is $\mathrm{Info}(S) = I(P)$, where P is the probability distribution of the partition $(C_1, C_2, \dots, C_k)$:
  $P = \left(|C_1|/|S|,\ |C_2|/|S|,\ \dots,\ |C_k|/|S|\right)$
[Figure: two partitions of a set into classes C1, C2, C3, one labeled “High information” and one labeled “Low information”]

24 Information, Entropy, and Information Gain
- S contains $s_i$ tuples of class $C_i$ for $i = 1, \dots, m$
- Information measures “the amount of info” required to classify any arbitrary tuple:
  $I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i$
  where $p_i = s_i / |S|$ is the probability that an arbitrary tuple belongs to $C_i$
- Example: S contains 100 tuples, 25 belong to class $C_1$ and 75 belong to class $C_2$
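The formula image for this slide did not survive in the transcript. Assuming the standard entropy definition $I(s_1, \dots, s_m) = -\sum_i p_i \log_2 p_i$ with $p_i = s_i/|S|$, which is consistent with $I(P)$ on the earlier slides, the 25/75 example works out as follows:

```latex
\begin{align*}
I(25, 75) &= -\left(\tfrac{25}{100}\log_2\tfrac{25}{100} + \tfrac{75}{100}\log_2\tfrac{75}{100}\right) \\
          &= 0.25 \times 2 + 0.75 \times 0.415 \\
          &\approx 0.500 + 0.311 = 0.811
\end{align*}
```

The value lies between 0 (a pure set) and 1 (an even two-class split), as expected for a moderately skewed distribution.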

25 Information, Entropy, and Information Gain
- Information reflects the “purity” of the data set
  - A low information value indicates high purity
  - A high information value indicates high diversity
- Example: S contains 100 tuples
  - 0 belong to class $C_1$ and 100 belong to class $C_2$: $I(0, 100) = 0$ (pure)
  - 50 belong to class $C_1$ and 50 belong to class $C_2$: $I(50, 50) = 1$ (maximally diverse)

26 Information for classification II
- If we partition S w.r.t. attribute X into sets $\{T_1, T_2, \dots, T_n\}$, then the information needed to identify the class of an element of S becomes the weighted average of the information needed to identify the class of an element of $T_i$, i.e. the weighted average of $\mathrm{Info}(T_i)$:
  $\mathrm{Info}(X, S) = \sum_{i=1}^{n} \frac{|T_i|}{|S|} \, \mathrm{Info}(T_i)$
[Figure: two partitions of a set into classes C1, C2, C3, one labeled “High information” and one labeled “Low information”]

27 Information gain
- Consider the quantity Gain(X,S) defined as
  $\mathrm{Gain}(X, S) = \mathrm{Info}(S) - \mathrm{Info}(X, S)$
- This represents the difference between the information needed to identify an element of S and the information needed to identify an element of S after the value of attribute X has been obtained
  - That is, this is the gain in information due to attribute X
- We can use this to rank attributes and to build decision trees where each node holds the attribute with the greatest gain among the attributes not yet considered on the path from the root
- The intent of this ordering is:
  - To create small decision trees so that records can be identified after only a few questions
  - To match a hoped-for minimality of the process represented by the records being considered (Occam’s Razor)
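As a rough illustration of ranking attributes by gain, the sketch below computes Gain(X,S) from class counts alone. The helper names `info` and `gain` are assumptions; the age counts come from the construction slides ([P1,…,P14] Yes: 9, No: 5 and its three branches), and the student counts were tallied by hand from the Training Dataset slide.

```python
import math

def info(counts):
    """Info(S) for a list of class counts, e.g. [9, 5] for 9 yes and 5 no."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, partition_counts):
    """Gain(X,S) = Info(S) - Info(X,S); Info(X,S) is the weighted average over the T_i."""
    total = sum(parent_counts)
    info_x = sum(sum(t) / total * info(t) for t in partition_counts)
    return info(parent_counts) - info_x

# [yes, no] counts per branch.
age_gain = gain([9, 5], [[2, 3], [4, 0], [3, 2]])      # <=30, 31…40, >40
student_gain = gain([9, 5], [[6, 1], [3, 4]])          # student = yes, student = no
print(round(age_gain, 3), round(student_gain, 3))
```

Age yields the larger gain of the two, which is why it ends up at the root of the example tree.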

28 Information, Entropy, and Information Gain
- S contains $s_i$ tuples of class $C_i$ for $i = 1, \dots, m$
- Attribute A has values $\{a_1, a_2, \dots, a_v\}$
- Let $s_{ij}$ be the number of tuples which belong to class $C_i$ and have value $a_j$ in attribute A
- Entropy of attribute A is
  $E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{|S|} \, I(s_{1j}, \dots, s_{mj})$
- Information gained by branching on attribute A is
  $\mathrm{Gain}(A) = I(s_1, s_2, \dots, s_m) - E(A)$

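29 Information, Entropy, and Information Gain
- Let $T_j$ be the set of tuples having value $a_j$ in attribute A
  - $s_{1j} + \dots + s_{mj} = |T_j|$
  - $I(s_{1j}, \dots, s_{mj}) = I(T_j)$
- Entropy of attribute A is
  $E(A) = \sum_{j=1}^{v} \frac{|T_j|}{|S|} \, I(T_j)$
  - The first factor is the proportion of $|T_j|$ over $|S|$
  - The second factor is the information of $T_j$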

30 Information, Entropy, and Information Gain
- S contains 100 tuples: 40 belong to class C1 (red) and 60 belong to class C2 (blue), so I(40,60) = 0.971
- Splitting on attribute A:
  - A = a1: 20 tuples, I(10,10) = 1
  - A = a2: 30 tuples, I(10,20) = 0.918
  - A = a3: 50 tuples, I(20,30) = 0.971
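The slide stops at the per-branch information values. Assuming the entropy of an attribute is the size-weighted average of the branch values, as defined on the preceding slides, the numbers combine as follows (a worked sketch, not taken from the slide itself):

```latex
\begin{align*}
E(A) &= \tfrac{20}{100}\, I(10,10) + \tfrac{30}{100}\, I(10,20) + \tfrac{50}{100}\, I(20,30) \\
     &= 0.2 \times 1 + 0.3 \times 0.918 + 0.5 \times 0.971 \approx 0.961 \\
\mathrm{Gain}(A) &= I(40,60) - E(A) \approx 0.971 - 0.961 = 0.010
\end{align*}
```

The gain is close to zero because every branch has nearly the same class mix as S, so splitting on A tells us very little.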

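31 Computing information gain
[Figure: 12 examples (6 Y, 6 N) split by Type (French, Italian, Thai, Burger) and by Patrons (Empty, Some, Full)]
- $I(S) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 0.5 + 0.5 = 1$
- $I(\mathrm{Pat}, S) = \tfrac{1}{6}(0) + \tfrac{1}{3}(0) + \tfrac{1}{2}\left(-\left(\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\right)\right) \approx \tfrac{1}{2}\left(\tfrac{2}{3} \times 0.6 + \tfrac{1}{3} \times 1.6\right) \approx 0.47$ (using the rounded values $\log_2\tfrac{2}{3} \approx -0.6$ and $\log_2\tfrac{1}{3} \approx -1.6$)
- $I(\mathrm{Type}, S) = \tfrac{1}{6}(1) + \tfrac{1}{6}(1) + \tfrac{1}{3}(1) + \tfrac{1}{3}(1) = 1$
- $\mathrm{Gain}(\mathrm{Pat}, S) = 1 - 0.47 = 0.53$
- $\mathrm{Gain}(\mathrm{Type}, S) = 1 - 1 = 0$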

32 Regarding the Definition of Entropy…
- In the textbook, page 134 (Eq. 3.6)
- In the textbook, page 287 (Eq. 7.2)
- Polymorphism
  - When entropy is defined on tuples, use Eq. 3.6
  - When entropy is defined on an attribute, use Eq. 7.2

33 How well does it work?
- Many case studies have shown that decision trees are at least as accurate as human experts
  - A study for diagnosing breast cancer had humans correctly classifying the examples 65% of the time; the decision tree classified 72% correctly
  - British Petroleum designed a decision tree for gas-oil separation for offshore oil platforms that replaced an earlier rule-based expert system
  - Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example


35 Extracting Classification Rules from Trees
- Represent knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand

36 Examples of Classification Rules
Age?
  <=30: Student?
    no: NO
    yes: YES
  31…40: YES
  >40: Credit?
    excellent: NO
    fair: YES
- Classification rules:
  1. IF age = “<=30” AND student = “no” THEN buys_computer = “no”
  2. IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
  3. IF age = “31…40” THEN buys_computer = “yes”
  4. IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
  5. IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
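The one-rule-per-path idea can be sketched as a small recursive walk over a tree stored as nested dictionaries. The representation, the function name `extract_rules`, and the ASCII spelling `31..40` are assumptions made for illustration; the leaf labels follow the tree built on the construction slides, where the `>40` tuples with excellent credit (P6, P14) are all "no" and those with fair credit (P4, P5, P10) are all "yes".

```python
def extract_rules(tree, target="buys_computer", conditions=()):
    """Yield one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):             # leaf: emit the accumulated conjunction
        lhs = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        yield f'IF {lhs} THEN {target} = "{tree}"'
        return
    (attribute, branches), = tree.items()       # each internal node tests one attribute
    for value, subtree in branches.items():
        yield from extract_rules(subtree, target, conditions + ((attribute, value),))

TREE = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

rules = list(extract_rules(TREE))
for rule in rules:
    print(rule)
```

Each attribute-value pair met on the way down becomes one conjunct, so a tree with five leaves yields exactly five rules.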

37 Avoid Over-fitting in Classification
- The generated tree may over-fit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - The result is poor accuracy for unseen samples
- Two approaches to avoiding over-fitting
  - Pre-pruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - It is difficult to choose an appropriate threshold
  - Post-pruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the “best pruned tree”
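A pre-pruning check might look like the sketch below, assuming information gain is the goodness measure. The function name `should_split` and the threshold value 0.05 are hypothetical, chosen only to show the mechanism; the slide itself notes that picking a good threshold is hard. The class counts are tallied from the Training Dataset slide.

```python
import math

def info(counts):
    """Information of a list of class counts."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

def should_split(parent_counts, partition_counts, min_gain=0.05):
    """Pre-pruning test: split only if the gain reaches min_gain (hypothetical threshold)."""
    total = sum(parent_counts)
    remainder = sum(sum(t) / total * info(t) for t in partition_counts)
    return info(parent_counts) - remainder >= min_gain

# [yes, no] counts at the root and in each branch.
split_on_age = should_split([9, 5], [[2, 3], [4, 0], [3, 2]])      # <=30, 31…40, >40
split_on_income = should_split([9, 5], [[2, 2], [4, 2], [3, 1]])   # high, medium, low
print(split_on_age, split_on_income)
```

With this threshold the age split is accepted and the income split is rejected; a node whose best split is rejected would become a leaf labeled with its majority class.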

38 Enhancements to basic decision tree induction
- Dynamic discretization for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation (the number of samples at a branch becomes too small to be statistically significant), repetition (an attribute is repeatedly tested along a branch), and replication (duplicate subtrees)
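One common way to discretize a continuous attribute on the fly is to try a cut point between each pair of adjacent sorted values and keep the one with the highest information gain. The slide does not say which method it has in mind, so the sketch below is an assumption in that spirit; the function name `best_threshold` and the age values are invented for illustration and do not come from the slides.

```python
import math

def info(labels):
    """Information of a list of class labels."""
    n = len(labels)
    return sum(-labels.count(c) / n * math.log2(labels.count(c) / n)
               for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, gain) of the best binary split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    base = info(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:                       # no cut point between equal values
            continue
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        remainder = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if base - remainder > best[1]:
            best = ((lo + hi) / 2, base - remainder)
    return best

# Hypothetical ages and labels, invented for this example only.
ages = [22, 25, 28, 33, 36, 39, 45, 52]
buys = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"]
threshold, g = best_threshold(ages, buys)
print(threshold, round(g, 3))
```

Here the midpoint between 28 and 33 separates the two classes perfectly, so it is chosen as the interval boundary; applying the search again inside each interval would produce more than two intervals.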

39 Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - Relatively fast learning speed (compared with other classification methods)
  - Convertible to simple and easy-to-understand classification rules
  - Can use SQL queries for accessing databases
  - Classification accuracy comparable with other methods

40 Scalable Decision Tree Induction Methods
- SLIQ (EDBT’96, Mehta et al.)
  - Builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB’96, J. Shafer et al.)
  - Constructs an attribute list data structure
- PUBLIC (VLDB’98, Rastogi & Shim)
  - Integrates tree splitting and tree pruning: stop growing the tree earlier
- RainForest (VLDB’98, Gehrke, Ramakrishnan & Ganti)
  - Separates the scalability aspects from the criteria that determine the quality of the tree
  - Builds an AVC-list (attribute, value, class label)
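The idea of summarizing data as (attribute, value, class label) counts can be illustrated with a toy tally. This is only a sketch of the counting idea, not RainForest's actual data structure or algorithm; the function name `avc_counts` and the dictionary row format are assumptions. The four rows are P1 to P4 from the Training Dataset slide, restricted to two attributes.

```python
from collections import Counter

def avc_counts(rows, attributes, class_attr):
    """Tally (attribute, value, class label) triples in a single pass over the rows."""
    counts = Counter()
    for row in rows:
        for a in attributes:
            counts[(a, row[a], row[class_attr])] += 1
    return counts

# P1..P4 from the training dataset, two attributes only ("31..40" in ASCII).
rows = [
    {"age": "<=30", "student": "no", "buys_computer": "no"},
    {"age": "<=30", "student": "no", "buys_computer": "no"},
    {"age": "31..40", "student": "no", "buys_computer": "yes"},
    {"age": ">40", "student": "no", "buys_computer": "yes"},
]
avc = avc_counts(rows, ["age", "student"], "buys_computer")
print(avc[("age", "<=30", "no")], avc[("student", "no", "yes")])
```

Because information gain depends only on such class counts per attribute value, a table like this is enough to evaluate candidate splits at a node without rereading the individual tuples.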

41 Summary
- What is a decision tree?
  - A flow-chart-like tree: internal nodes, branches, and leaf nodes
- How to construct a decision tree?
  - What are the major steps in decision tree induction?
    - Test attribute selection
    - Sample partition
  - How to select the attribute to split the node?
    - Select the attribute with the highest information gain
      - Calculate the information of the node
      - Calculate the entropy of the attribute
      - Calculate the difference between the information and the entropy
- What are the other issues?