© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model.
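To make the train/test workflow concrete, here is a minimal sketch using scikit-learn (an assumed dependency, not something the slides name). The toy records loosely mirror the tax-cheat table used in the following slides, with marital status collapsed to a binary flag for brevity.

```python
# Minimal sketch of the classification workflow: learn a model of the
# class attribute from a training set, then estimate accuracy on a
# held-out test set of previously unseen records.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative records: (Refund, Married, TaxableIncome) -> Cheat.
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 0, 95],
     [0, 1, 60],  [1, 0, 220], [0, 0, 85], [0, 1, 75],  [0, 0, 90]]
y = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]  # 1 = cheat

# Hold out part of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```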
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2 Illustrating Classification Task
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 3 Examples of Classification Task
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 4 Classification Techniques
- Decision Tree-based Methods
- Rule-based Methods
- Memory-based Reasoning
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 5 Example of a Decision Tree
[Figure: a training-data table (categorical attributes Refund and MarSt, continuous attribute TaxInc, class label Cheat) next to the induced model, a decision tree. Root: Refund (Yes -> leaf NO; No -> MarSt). MarSt: Married -> leaf NO; Single or Divorced -> TaxInc (< 80K -> leaf NO; > 80K -> leaf YES). The attributes tested at internal nodes are the splitting attributes.]
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 6 Another Example of Decision Tree
[Figure: a different tree over the same training data. Root: MarSt (Married -> leaf NO; Single or Divorced -> Refund). Refund: Yes -> leaf NO; No -> TaxInc (< 80K -> leaf NO; > 80K -> leaf YES).]
There could be more than one tree that fits the same data!
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 7 Decision Tree Classification Task
[Figure: the classification workflow, with the induced model shown as a decision tree.]
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 8-13 Apply Model to Test Data
[Figure sequence: starting from the root of the tree, a test record (Refund = No, MarSt = Married, TaxInc = 80K, Cheat = ?) is routed down the tree of slide 5 one test at a time: Refund = No leads to the MarSt node, and MarSt = Married leads to the leaf NO. Assign Cheat to "No".]
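The slide sequence above traces the routing by hand. The sketch below encodes the same tree, as reconstructed from the figure on slide 5, as plain Python and classifies the test record:

```python
def classify(record):
    """Route one record down the decision tree of slide 5.

    Structure reconstructed from the figure: Refund at the root,
    then MarSt, then TaxInc; leaves carry the Cheat label.
    """
    if record["Refund"] == "Yes":
        return "No"                       # leaf: NO
    if record["MarSt"] == "Married":
        return "No"                       # leaf: NO
    # Single or Divorced: split on taxable income
    return "Yes" if record["TaxInc"] > 80 else "No"

# The test record walked through on the slides above.
test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))  # -> "No" (assign Cheat to "No")
```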
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 14 Decision Tree Classification Task
[Figure: the same workflow diagram as slide 7, recapped before turning to how the decision tree is induced.]
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 15 Decision Tree Induction
- Many algorithms:
  - Hunt's Algorithm (one of the earliest)
  - CART
  - ID3, C4.5
  - SLIQ, SPRINT
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 16 General Structure of Hunt's Algorithm
- Let D_t be the set of training records that reach node t.
- General procedure:
  - If D_t contains only records of the same class y_t, then t is a leaf node labeled y_t.
  - Else: use an attribute test to split the data into smaller subsets.
- Recursively apply the procedure to each subset.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 17 Hunt's Algorithm
[Figure: the tree grown in stages on the tax-cheat training data. Start with a single leaf, Don't Cheat. Split on Refund (Yes -> Don't Cheat; No -> Don't Cheat). Refine the No branch by Marital Status (Married -> Don't Cheat; Single, Divorced -> Cheat). Finally refine the Single, Divorced branch by Taxable Income (< 80K -> Don't Cheat; >= 80K -> Cheat).]
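A compact recursive sketch of Hunt's procedure as just described. The split-selection helper choose_test is a hypothetical stand-in (the slides leave it abstract); any of the impurity-based criteria introduced later could back it.

```python
from collections import Counter

def hunt(records, labels, choose_test):
    """Grow a decision tree by Hunt's procedure (a sketch).

    records: list of attribute dicts; labels: parallel list of classes.
    choose_test: hypothetical helper returning a function
    record -> branch key, or None when no useful test exists.
    """
    # If D_t contains only records of class y_t, t is a leaf labeled y_t.
    if len(set(labels)) == 1:
        return labels[0]
    test = choose_test(records, labels)
    if test is None:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    # Use the attribute test to split D_t into smaller subsets...
    parts = {}
    for rec, lab in zip(records, labels):
        recs, labs = parts.setdefault(test(rec), ([], []))
        recs.append(rec)
        labs.append(lab)
    if len(parts) < 2:  # degenerate split: stop rather than recurse forever
        return Counter(labels).most_common(1)[0][0]
    # ...and recursively apply the procedure to each subset.
    return {branch: hunt(recs, labs, choose_test)
            for branch, (recs, labs) in parts.items()}
```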
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 18 Tree Induction
- Greedy strategy:
  - Split the records based on an attribute test that optimizes a local criterion.
- Issues:
  - Determine how to split the records
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 20 How to Specify Test Condition?
- Depends on attribute type:
  - Nominal (no order; e.g., country)
  - Ordinal (discrete, ordered; e.g., S, M, L, XL)
  - Continuous (ordered, continuous; e.g., temperature)
- Depends on number of ways to split:
  - 2-way split
  - Multi-way split
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 21 Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}.
- Binary split: divide the values into two subsets, e.g. {Family, Luxury} vs. {Sports} or {Sports, Luxury} vs. {Family}; need to find the optimal partitioning.
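A k-valued nominal attribute admits 2^(k-1) - 1 distinct binary partitions; the sketch below enumerates them for the CarType example (plain Python, no assumptions beyond the attribute values shown above).

```python
from itertools import combinations

def binary_partitions(values):
    """Yield each unordered two-subset split of a nominal domain once.

    Fixing the first value on the left side avoids counting
    {A}|{B,C} and {B,C}|{A} twice, so a k-valued attribute
    yields 2**(k-1) - 1 splits.
    """
    first, rest = values[0], values[1:]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(values) - left
            if right:  # skip the trivial split with an empty side
                yield left, right

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(left, "|", right)
# {Family}|{Sports,Luxury}, {Family,Sports}|{Luxury}, {Family,Luxury}|{Sports}
```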
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 22 Splitting Based on Continuous Attributes
- Different ways of handling:
  - Discretization to form an ordinal categorical attribute
    - Static: discretize once at the beginning
    - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  - Binary decision: (A < v) or (A >= v)
    - Consider all possible splits and find the best cut
    - Can be more compute-intensive
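For the binary-decision approach, the usual candidate cuts v are the midpoints between adjacent distinct sorted values of A; a minimal sketch (the income values are illustrative):

```python
def candidate_cuts(values):
    """Midpoints between adjacent distinct sorted values of attribute A."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# Candidate thresholds for a continuous attribute such as taxable income.
print(candidate_cuts([60, 70, 75, 85, 90, 95, 100, 120, 125, 220]))
```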
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 23 Splitting Based on Continuous Attributes
[Figure: a binary split (e.g. TaxInc > 80K?) versus a multi-way split of the same attribute into ordered ranges.]
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 24 Tree Induction
- Greedy strategy:
  - Split the records based on an attribute test that optimizes a local criterion.
- Issues:
  - Determine how to split the records
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 25 How to Determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1.
[Figure: several candidate test conditions (a binary attribute, a multi-valued nominal attribute, and an ID-like attribute) with the class distributions each induces.] Which test condition is the best?
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 26 How to Determine the Best Split
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- Need a measure of node impurity:
  - Non-homogeneous distribution: high degree of impurity
  - Homogeneous distribution: low degree of impurity
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 27 Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error
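All three measures are simple functions of the class proportions at a node. The sketch below computes them side by side using the standard textbook formulas (Gini = 1 - sum p_j^2, entropy = -sum p_j log2 p_j, error = 1 - max p_j):

```python
from math import log2

def impurities(counts):
    """Gini, entropy, and misclassification error for class counts at a node."""
    n = sum(counts)
    ps = [c / n for c in counts]
    gini = 1 - sum(p * p for p in ps)
    entropy = -sum(p * log2(p) for p in ps if p > 0)
    error = 1 - max(ps)
    return gini, entropy, error

# The node distributions used in the GINI examples a few slides below:
for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, "-> Gini=%.3f entropy=%.3f error=%.3f" % impurities(counts))
```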
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 28 How to Find the Best Split
[Figure: two candidate splits of the same node. Before splitting, the node has impurity M0. Split A? (Yes/No) produces children N1 and N2 with impurities M1 and M2, combining to weighted impurity M12; split B? (Yes/No) produces N3 and N4 with impurities M3 and M4, combining to M34. Compare Gain = M0 - M12 vs. M0 - M34 and choose the larger.]
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 29 Measure of Impurity: GINI
- Gini index for a given node t:
  GINI(t) = 1 - sum_j [p(j|t)]^2
  (NOTE: p(j|t) is the relative frequency of class j at node t.)
- Maximum (1 - 1/n_c, with n_c the number of classes) when records are equally distributed among all classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 30 Examples for Computing GINI
- C1: 0, C2: 6 -> P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
- C1: 1, C2: 5 -> P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- C1: 2, C2: 4 -> P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 31 Splitting Based on GINI
- Used in CART, SLIQ, SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as
  GINI_split = sum_{i=1..k} (n_i / n) * GINI(i),
  where n_i = number of records at child i and n = number of records at node p.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 32 Binary Attributes: Computing GINI Index
- Splits into two partitions.
- Effect of weighting partitions: larger and purer partitions are sought.
[Example: split B? sends 7 records (5 of C1, 2 of C2) to node N1 and 5 records (1 of C1, 4 of C2) to node N2.]
Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(Children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
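A quick numeric check of the computation above, with the class counts of the two children taken from the example:

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for the class counts at a node."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini over the k child partitions of a split."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

n1, n2 = (5, 2), (1, 4)                # class counts in the two children of B?
print(round(gini(n1), 3))              # 0.408
print(round(gini(n2), 3))              # 0.32
print(round(gini_split([n1, n2]), 3))  # 7/12*0.408 + 5/12*0.320 = 0.371
```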
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 33 Tree Induction
- Greedy strategy:
  - Split the records based on an attribute test that optimizes a local criterion.
- Issues:
  - Determine how to split the records
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 34 Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have similar attribute values.
- Early termination (to be discussed later).
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 35 Decision Tree Based Classification
- Advantages:
  - Inexpensive to construct
  - Extremely fast at classifying unknown records
  - Easy to interpret for small-sized trees
  - Accuracy comparable to other classification techniques for many simple data sets
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 36 Practical Issues of Classification
- Underfitting and overfitting
- Missing values
- Costs of classification
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 37 Underfitting and Overfitting
[Figure: training and test error plotted against model complexity. On the left, both errors are high (underfitting); past a certain complexity, training error keeps falling while test error rises (overfitting).]
Underfitting: when the model is too simple, both training and test errors are large.
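One way to reproduce such a curve, sketched with scikit-learn on synthetic data (both are assumptions; the slide shows a generic plot): grow trees of increasing depth and compare training and test error.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data so a fully grown tree can overfit.
X, y = make_classification(n_samples=600, n_features=10,
                           flip_y=0.2, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

# Depth acts as the complexity knob: training error keeps falling,
# while test error eventually turns back up.
for depth in (1, 2, 4, 8, 16, None):  # None = fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    print(depth,
          "train err %.3f" % (1 - tree.score(X_tr, y_tr)),
          "test err %.3f" % (1 - tree.score(X_te, y_te)))
```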
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 38 Overfitting due to Noise
[Figure: a two-class scatter plot.] The decision boundary is distorted by a noise point.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 39 Overfitting due to Insufficient Examples
- The lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region.
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 40 Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary.
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- Need new ways of estimating errors.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 41 How to Address Overfitting
- Pre-pruning (early stopping rule):
  - Stop the algorithm before it becomes a fully-grown tree.
  - Stop if the number of instances is less than some user-specified threshold.
  - Stop if the class distribution of instances is independent of the available features (e.g., using the χ² test).
  - Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 42 How to Address Overfitting…
- Post-pruning:
  - Grow the decision tree in its entirety.
  - Trim the nodes of the decision tree in a bottom-up fashion.
  - If generalization error improves after trimming, replace the sub-tree by a leaf node.
  - The class label of the leaf node is determined from the majority class of instances in the sub-tree.
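As one concrete realization of both strategies (scikit-learn's API, not the deck's own algorithm), the sketch below uses pre-pruning thresholds and the cost-complexity path for post-pruning; the dataset and threshold values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, flip_y=0.15, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning (early stopping): user-specified thresholds halt tree growth.
pre = DecisionTreeClassifier(min_samples_split=20,        # instance-count threshold
                             min_impurity_decrease=1e-3)  # required impurity improvement
pre.fit(X_tr, y_tr)

# Post-pruning: grow the tree in its entirety, then trim bottom-up along the
# cost-complexity path, keeping the pruned tree that generalizes best on a
# validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
post = max((DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
            for a in path.ccp_alphas),
           key=lambda t: t.score(X_val, y_val))

print(pre.get_n_leaves(), post.get_n_leaves())
```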
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 43 Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 44 Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
- Confusion matrix:

                       PREDICTED CLASS
                       Class=Yes   Class=No
  ACTUAL   Class=Yes   a (TP)      b (FN)
  CLASS    Class=No    c (FP)      d (TN)

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 45 Metrics for Performance Evaluation…
- Most widely used metric, in terms of the confusion-matrix counts of the previous slide:
  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 46 Limitation of Accuracy
- Consider a 2-class problem:
  - Number of Class 0 examples = 9990
  - Number of Class 1 examples = 10
- If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%.
  - Accuracy is misleading because the model does not detect any Class 1 example.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 47 Cost-Sensitive Measures
  Precision (p) = a / (a + c)
  Recall (r) = a / (a + b)
  F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
- Precision is biased towards C(Yes|Yes) & C(Yes|No).
- Recall is biased towards C(Yes|Yes) & C(No|Yes).
- F-measure is biased towards all except C(No|No).
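Putting the last few slides together, all four metrics are simple functions of the confusion-matrix counts a, b, c, d. The sketch below evaluates the trivial all-Class-0 model from the 9990-vs-10 example, which scores 99.9% accuracy but zero recall:

```python
def metrics(a, b, c, d):
    """a=TP, b=FN, c=FP, d=TN, following the slides' confusion matrix."""
    accuracy  = (a + d) / (a + b + c + d)
    precision = a / (a + c) if a + c else 0.0
    recall    = a / (a + b) if a + b else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure

# Predict everything as Class 0 on the 9990-vs-10 problem:
# TP=0 (no Class 1 example caught), FN=10, FP=0, TN=9990.
print(metrics(0, 10, 0, 9990))  # accuracy 0.999, but recall 0.0
```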