Decision Trees in the Big Picture
Classification (vs. Rule Pattern Discovery)
Supervised Learning (vs. Unsupervised)
Inductive Generation (vs. Discrimination)

Example
[Table: the 14-tuple training set, with attributes age (youth / middle_aged / senior), income (low / medium / high), veteran (yes / no), college_educated (yes / no), and the class label support_hillary (yes / no).]

Example
[The same 14-tuple training-set table, with the support_hillary column highlighted as the class labels.]

Example
[New, unlabeled tuple to classify: age = middle_aged, income = medium, veteran = no, support_hillary = ?????]
[Diagram: a decision tree with age at the root; its branches lead to further tests on income and college_educated, and the leaves carry the class labels yes / no.]
Inner nodes are ATTRIBUTES. Branches are attribute VALUES. Leaves are class-label VALUES.

Example
[The tree classifies the new tuple; following the branches gives the ANSWER: support_hillary = yes (predicted).]
Inner nodes are ATTRIBUTES. Branches are attribute VALUES. Leaves are class-label VALUES.

Example
[Diagram: the same decision tree.]
Induced Rules:
The youth do not support Hillary.
All who are middle-aged and low-income support Hillary.
Seniors support Hillary.
Etc. A rule is generated for each leaf.

Example
Induced Rules:
The youth do not support Hillary.
All who are middle-aged and low-income support Hillary.
Seniors support Hillary.
Nested IF-THEN:
IF age == youth THEN support_hillary = no
ELSE IF age == middle_aged & income == low THEN support_hillary = yes
ELSE IF age == senior THEN support_hillary = yes
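The same induced rules translate directly into code. A minimal sketch in Python (the function name support_hillary and the None fall-through for the leaves the slide abbreviates as "Etc." are my own):

```python
def support_hillary(age, income):
    """Classify one tuple with the induced rules from the example tree."""
    if age == "youth":
        return "no"
    elif age == "middle_aged" and income == "low":
        return "yes"
    elif age == "senior":
        return "yes"
    return None  # leaves the slide only hints at with "Etc."

print(support_hillary("middle_aged", "low"))   # yes
```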

How do you construct one?
1. Select an attribute to place at the root node and make one branch for each possible value.
[Diagram: root node age over the entire 14-tuple training set; the youth branch receives 5 tuples, middle_aged 4 tuples, and senior 5 tuples.]

How do you construct one?
2. For each branch, recursively process the remaining training examples by choosing an attribute to split them on. The chosen attribute cannot be one already used in an ancestor node. If at any time all the training examples have the same class, stop processing that part of the tree.
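A compact sketch of that recursion in Python. This is an assumed, generic top-down skeleton in the spirit of Hunt's algorithm, not code taken from the slides; the splitting heuristic is passed in as a parameter because the heuristics are only introduced later.

```python
from collections import Counter

def build_tree(tuples, attributes, target, choose_attribute):
    """Top-down decision-tree construction.

    tuples           -- list of dicts, e.g. {"age": "youth", ..., "support_hillary": "no"}
    attributes       -- attribute names still available for splitting
    target           -- name of the class-label column
    choose_attribute -- heuristic that picks the attribute to split on
    """
    classes = [t[target] for t in tuples]
    if len(set(classes)) == 1:          # all tuples share one class: stop
        return classes[0]
    if not attributes:                  # nothing left to split on: majority vote
        return Counter(classes).most_common(1)[0][0]

    best = choose_attribute(tuples, attributes, target)
    remaining = [a for a in attributes if a != best]   # never reuse an ancestor's attribute
    node = {"split_on": best, "branches": {}}
    for value in sorted({t[best] for t in tuples}):
        subset = [t for t in tuples if t[best] == value]
        node["branches"][value] = build_tree(subset, remaining, target, choose_attribute)
    return node
```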

How do you construct one?
[Table: the 5 training tuples with age = youth; every one has class label support_hillary = no.]
[Diagram: root node age with branches youth / middle_aged / senior.]

How do you construct one?
[Table: the 4 training tuples with age = middle_aged (2 yes, 2 no).]
[Diagram: the middle_aged branch is split on veteran.]

[Same middle_aged subset table.]
[Diagram: the veteran node under middle_aged, now showing its yes and no branches.]

[Same middle_aged subset table.]
[Diagram: under middle_aged, the veteran = yes branch ends in a leaf and the veteran = no branch is split further on college_educated.]

[Table: the 3 middle_aged tuples with veteran = no.]
[Diagram: the veteran = no branch under middle_aged is split on college_educated.]

[Same table.]
[Diagram: the college_educated split under middle_aged / veteran = no now shows its two class-label leaves.]

[Diagram: the tree so far; the youth and middle_aged branches are finished.]
[Table: the 5 training tuples with age = senior (3 yes, 2 no).]

[Same senior subset table.]
[Diagram: the senior branch is split on college_educated.]

[Table: the 3 senior tuples with college_educated = yes.]
[Diagram: the college_educated = yes branch under senior is split on income, with branches low / medium / high.]

[Same table and diagram.]
No low-income, college-educated seniors…

[Same table and diagram.]
No low-income, college-educated seniors… so the empty income = low branch gets the class label no by "Majority Vote" over its parent's tuples.

[Table: the single senior tuple with college_educated = yes and income = medium (class no).]
[Diagram: the tree so far.]

[Same table.]
[Diagram: the income = medium branch under college-educated seniors ends in a no leaf.]

[Table: the 2 senior tuples with college_educated = yes and income = high (one yes, one no).]
[Diagram: the tree so far.]

[Same table.]
[Diagram: the income = high branch under college-educated seniors is split on veteran.]

[Same table and diagram.]
"Majority Vote" split… No Veterans: the veteran = yes branch receives no training tuples, and with one yes and one no tuple left there is no majority, so the leaf is marked ???

[Table: the 2 senior tuples with college_educated = no (both class yes).]
[Diagram: the tree so far, including the ??? leaf.]

[Same table; the tree diagram is repeated.]

[Diagram: the finished tree. Root: age. The youth branch is a no leaf. The middle_aged branch splits on veteran, and its veteran = no branch splits again on college_educated. The senior branch splits on college_educated; its no branch is a yes leaf, and its yes branch splits on income (low: no by majority vote, medium: no, high: a further veteran split that contains the unresolved ??? leaf).]

Cost to grow?
n = number of attributes
D = training set of tuples
O( n * |D| * log|D| )

Cost to grow?
n = number of attributes
D = training set of tuples
O( n * |D| * log|D| )
n * |D| is the amount of work at each tree level; log|D| is the max height of the tree.

How do we minimize the cost?
Finding an optimal decision tree is NP-complete (shown by Hyafil and Rivest).

How do we minimize the cost?
Finding an optimal decision tree is NP-complete (shown by Hyafil and Rivest).
We need a heuristic to pick the "best" attribute to split on.

[Diagram: the finished tree from the running example, shown again.]

How do we minimize the cost?
Finding an optimal decision tree is NP-complete (shown by Hyafil and Rivest).
The most common approach is "greedy": we need a heuristic to pick the "best" attribute to split on.
The "best" attribute is the one that results in the "purest" split; pure means all tuples belong to the same class.

…A good split increases the purity of all child nodes.

Three Heuristics
1. Information Gain
2. Gain Ratio
3. Gini Index

Information Gain
Ross Quinlan's ID3 (Iterative Dichotomiser 3) uses information gain as its heuristic.
The heuristic is based on Claude Shannon's information theory.

[Illustration: a thoroughly mixed collection of class labels (HIGH ENTROPY) next to a nearly pure one (LOW ENTROPY).]

Calculate Entropy for D
D = training set; |D| = 14
m = number of classes; m = 2
i = 1, …, m
C_i = distinct class; C_1 = yes, C_2 = no
C_i,D = set of tuples in D of class C_i; C_1,D = the yes tuples, C_2,D = the no tuples
p_i = probability that a random tuple in D belongs to class C_i = |C_i,D| / |D|; p_1 = 5/14, p_2 = 9/14
Entropy(D) = - Σ_i p_i * log(p_i), summed over the m classes (logs are base 2, so entropy is measured in bits)

Entropy(D) = -[ 5/14 * log(5/14) + 9/14 * log(9/14) ]
= -[ .3571 * -1.4854 + .6429 * -.6374 ]
= .5305 + .4098
= .9403 ≈ .9400 bits

Extremes:
-[ 7/14 * log(7/14) + 7/14 * log(7/14) ] = 1 bit
-[ 1/14 * log(1/14) + 13/14 * log(13/14) ] = .3712 bits
-[ 0/14 * log(0/14) + 14/14 * log(14/14) ] = 0 bits
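The same calculation in a few lines of Python (the helper name entropy is mine; class distributions are given as raw (yes, no) counts):

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)

print(round(entropy([5, 9]), 4))    # 0.9403  (the 14-tuple training set)
print(round(entropy([7, 7]), 4))    # 1.0     (maximally impure)
print(round(entropy([1, 13]), 4))   # 0.3712
print(round(entropy([0, 14]), 4))   # 0.0     (pure)
```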

Entropy for D split by A
A = attribute to split D on (e.g. age)
v = number of distinct values of A (e.g. youth, middle_aged, senior)
j = 1, …, v
D_j = subset of D where A = j (e.g. all tuples where age = youth)
Entropy_A(D) = Σ_j (|D_j| / |D|) * Entropy(D_j), summed over the v values of A

Entropy_age(D) = 5/14 * -[0/5*log(0/5) + 5/5*log(5/5)]
  + 4/14 * -[2/4*log(2/4) + 2/4*log(2/4)]
  + 5/14 * -[3/5*log(3/5) + 2/5*log(2/5)]
  = .6324 bits

Entropy_income(D) = 7/14 * -[2/7*log(2/7) + 5/7*log(5/7)]
  + 4/14 * -[2/4*log(2/4) + 2/4*log(2/4)]
  + 3/14 * -[1/3*log(1/3) + 2/3*log(2/3)]
  = .9140 bits

Entropy_veteran(D) = 3/14 * -[2/3*log(2/3) + 1/3*log(1/3)]
  + 11/14 * -[3/11*log(3/11) + 8/11*log(8/11)]
  = .8609 bits

Entropy_college_educated(D) = 8/14 * -[6/8*log(6/8) + 2/8*log(2/8)]
  + 6/14 * -[3/6*log(3/6) + 3/6*log(3/6)]
  = .8921 bits

Information Gain
Gain(A) = Entropy(D) - Entropy_A(D)
Entropy(D) is the entropy of the full set of tuples D; Entropy_A(D) is the entropy after D is split on attribute A.
Choose the A with the highest Gain, i.e. the split that decreases Entropy the most.

Gain(A) = Entropy(D) - Entropy_A(D)
Gain(age) = Entropy(D) - Entropy_age(D) = .9400 - .6324 = .3076 bits
Gain(income) = .0259 bits
Gain(veteran) = .0790 bits
Gain(college_educated) = .0479 bits
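A sketch of the whole gain computation in Python. The (yes, no) counts per attribute value are read off the entropy expressions above; the helper names are mine, and the printed gains differ from the slide's figures only in the last decimal place because of rounding:

```python
import math

def entropy(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)

def split_entropy(partition):
    """Entropy_A(D): weighted entropy over the per-value class counts."""
    n = sum(sum(counts) for counts in partition)
    return sum(sum(counts) / n * entropy(counts) for counts in partition)

# (yes, no) class counts per attribute value in the 14-tuple training set
splits = {
    "age":              [(0, 5), (2, 2), (3, 2)],   # youth, middle_aged, senior
    "income":           [(2, 5), (2, 2), (1, 2)],   # low, medium, high
    "veteran":          [(2, 1), (3, 8)],           # yes, no
    "college_educated": [(2, 6), (3, 3)],           # yes, no
}
base = entropy([5, 9])
for attr, partition in splits.items():
    print(attr, round(base - split_entropy(partition), 4))
# age 0.3078, income 0.0262, veteran 0.0792, college_educated 0.0481
```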

Entropy with values > 2
With more than two classes the entropy can exceed 1 bit.
Entropy = -[7/13*log(7/13) + 2/13*log(2/13) + 2/13*log(2/13) + 2/13*log(2/13)] ≈ 1.7272 bits
Entropy = -[5/13*log(5/13) + 1/13*log(1/13) + 6/13*log(6/13) + 1/13*log(1/13)] ≈ 1.6143 bits

[Table: the same 14-tuple training set with a new first column, ss (social security number), holding a distinct value for every tuple.]
Added social security number attribute

[Diagram: splitting on ss produces one branch per tuple, each ending in a single-tuple leaf (no / yes / no / yes / …).]
Will Information Gain split on ss?

[Diagram: the same one-branch-per-tuple split on ss.]
Will Information Gain split on ss?
Yes, because Entropy_ss(D) = 0:
Entropy_ss(D) = Σ_j 1/14 * -[1/1*log(1/1) + 0/1*log(0/1)] = 14 * 1/14 * 0 = 0

Gain ratio
C4.5, a successor of ID3, uses this heuristic.
It attempts to overcome Information Gain's bias in favor of attributes with a large number of values.

Gain ratio
SplitInfo_A(D) = - Σ_j (|D_j| / |D|) * log(|D_j| / |D|), summed over the v values of A
GainRatio(A) = Gain(A) / SplitInfo_A(D)

Gain(ss) = .9400; SplitInfo_ss(D) = log(14) ≈ 3.807; GainRatio(ss) ≈ .247
Gain(age) = .3076; SplitInfo_age(D) ≈ 1.577; GainRatio(age) ≈ .195
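The same arithmetic sketched in Python (SplitInfo is just the entropy of the branch sizes; the gain values are carried over from the earlier slides, and the helper name is mine):

```python
import math

def entropy(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)

# SplitInfo_A(D) is the entropy of how D is carved up by A, ignoring the classes.
split_info_age = entropy([5, 4, 5])     # youth / middle_aged / senior branch sizes
split_info_ss  = entropy([1] * 14)      # one single-tuple branch per ss value

print(round(0.3076 / split_info_age, 4))   # GainRatio(age) ~ 0.195
print(round(0.9400 / split_info_ss, 4))    # GainRatio(ss)  ~ 0.2469
```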

Gini Index
CART uses this heuristic. It makes binary splits, and it is not biased toward multi-valued attributes the way Information Gain is.
[Diagram: a multiway split on age (youth / middle_aged / senior) contrasted with a binary split on age (senior vs. youth, middle_aged).]

Gini Index
For the attribute age the possible subsets are: {youth, middle_aged, senior}, {youth, middle_aged}, {youth, senior}, {middle_aged, senior}, {youth}, {middle_aged}, {senior} and {}.
We exclude the full set and the empty set, so we have to examine 2^v - 2 subsets.

Gini Index
For the attribute age the possible subsets are: {youth, middle_aged, senior}, {youth, middle_aged}, {youth, senior}, {middle_aged, senior}, {youth}, {middle_aged}, {senior} and {}.
We exclude the full set and the empty set, so we have to examine 2^v - 2 subsets.
CALCULATE THE GINI INDEX ON EACH SUBSET (each subset vs. its complement defines a candidate binary split).

Gini Index
Gini(D) = 1 - Σ_i p_i^2, summed over the m classes
For a binary split of D into D1 and D2 on attribute A: Gini_A(D) = (|D1|/|D|) * Gini(D1) + (|D2|/|D|) * Gini(D2)
Pick the split with the smallest Gini_A(D).
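A sketch of a CART-style Gini evaluation for the age attribute, enumerating the candidate binary splits described above. The helper names and the dictionary of per-value class counts are my own; the counts come from the running example:

```python
from itertools import combinations

def gini(counts):
    """Gini impurity of a class distribution given as raw counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# (yes, no) class counts per age value in the 14-tuple training set
per_value = {"youth": (0, 5), "middle_aged": (2, 2), "senior": (3, 2)}
values = list(per_value)
n = 14

def counts_for(subset):
    return [sum(per_value[v][k] for v in subset) for k in (0, 1)]

# Non-empty proper subsets (2^v - 2 of them, as on the slide); each binary split
# shows up twice (subset and complement), which does not change the minimum.
best = None
for r in range(1, len(values)):
    for left in combinations(values, r):
        right = tuple(v for v in values if v not in left)
        lc, rc = counts_for(left), counts_for(right)
        weighted = sum(lc) / n * gini(lc) + sum(rc) / n * gini(rc)
        if best is None or weighted < best[0]:
            best = (weighted, left, right)

print(round(best[0], 4), best[1], best[2])   # 0.3175 ('youth',) ('middle_aged', 'senior')
```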

Miscellaneous thoughts
Widely applicable to data exploration, classification, and scoring tasks.
Decision trees generate understandable rules.
Better at predicting discrete outcomes than continuous ones (predictions for continuous targets come out "lumpy").
Error-prone when the number of training examples for a class is small.
Most business cases try to predict a few broad categories.