Data Mining: Classification
Classification: What is Classification?

Classifying tuples in a database:
- In a training set E, each tuple consists of the same set of attributes as the tuples in the large database W; in addition, each tuple has a known class identity.
- Derive the classification mechanism from the training set E, and then use this mechanism to classify general data (in W).
Learning Phase

In the learning phase, the training data are analyzed by a classification algorithm. The class label attribute (credit_rating in the running example) identifies the class of each training tuple. The resulting classifier is represented in the form of classification rules.
Testing Phase (Classification)

Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
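The accuracy estimate described above can be sketched as follows; `classify` stands in for whatever rule set the learning phase produced, and the attribute names are hypothetical, not from the slides:

```python
def accuracy(classify, test_set):
    """Fraction of test tuples whose predicted class matches the known label."""
    correct = sum(1 for tup, label in test_set if classify(tup) == label)
    return correct / len(test_set)

# A toy stand-in for learned classification rules (hypothetical attributes):
def rule(tup):
    return "good" if tup["income"] == "high" else "fair"

test = [({"income": "high"}, "good"),
        ({"income": "low"},  "fair"),
        ({"income": "low"},  "good")]
print(accuracy(rule, test))  # 2 of 3 correct
```

If this fraction is acceptable, the same `classify` function would then be applied to new, unlabeled tuples.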
Classification by Decision Tree

A top-down decision tree generation algorithm: ID3 and its extended version C4.5 (Quinlan '93). Reference: J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
Decision Tree Generation

- At the start, all the training examples are at the root.
- Partition the examples recursively based on selected attributes.
- Attribute selection: favor the partitioning that makes the majority of the examples in each partition belong to a single class.
- Tree pruning (the overfitting problem): aims at removing tree branches that may lead to errors when classifying test data, since the training data may contain noise, etc.
Another Example

A training set of 11 tuples with attributes Eye (Black, Brown, Blue), Hair (Black, White, Gold), Height (Short, Tall), and class label Oriental (Yes/No). (Full table omitted.)
After the analysis, can you classify the following patterns?
- (Black, Gold, Tall)
- (Blue, White, Short)
(Example-distribution diagram omitted.)
Decision Tree
Decision Tree Generation

Attribute selection (split criterion):
- Information gain (ID3 / C4.5 / See5)
- Gini index (CART / IBM Intelligent Miner)
- Inference power
These measures are also called goodness functions; they are used to select the attribute to split on at a tree node during the tree-generation phase.
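As a sketch of the information-gain criterion listed above (the attribute name and data are illustrative, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    counts = Counter(labels)
    if len(counts) <= 1:          # a pure node has zero entropy
        return 0.0
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(rows, labels, attr):
    """Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

# Illustrative data (the attribute name "outlook" is hypothetical):
rows = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}]
labels = ["no", "no", "yes"]
print(information_gain(rows, labels, "outlook"))  # ~0.918: each branch is pure
```

The attribute with the largest gain is chosen for the split; the Gini index is used analogously, except that the smallest weighted impurity wins.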
Decision Tree Generation

- Branching scheme: determines the tree branch to which a sample belongs (binary vs. k-ary splitting).
- Stopping criterion: when to stop the further splitting of a node, typically based on an impurity measure.
- Labeling rule: a node is labeled as the class to which most samples at the node belong.
Decision Tree Generation Algorithm: ID3

ID3 stands for Iterative Dichotomiser 3. ID3 selects split attributes using entropy:

Entropy(S) = − Σᵢ pᵢ log₂ pᵢ

where pᵢ is the fraction of examples in S that belong to class i.
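A minimal sketch of the entropy measure defined above:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    counts = Counter(labels)
    if len(counts) <= 1:          # a pure node has zero entropy
        return 0.0
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(entropy(["yes", "yes", "no", "no"]))  # 1.0 (maximally impure 2-class node)
print(entropy(["yes", "yes", "yes"]))       # 0.0 (pure node)
```

Lower entropy means a purer node, so ID3 prefers the attribute whose split reduces entropy the most.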
Decision Tree Algorithm: ID3 (worked example)

(Step-by-step tree-construction figures omitted.)
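The ID3 loop can be condensed into a short recursive sketch; the attribute and value names below are illustrative, not the slides' data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in counts.values())

def best_attribute(rows, labels, attrs):
    """Pick the attribute with the highest information gain."""
    def gain(attr):
        g = entropy(labels)
        for value in {row[attr] for row in rows}:
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            g -= len(subset) / len(labels) * entropy(subset)
        return g
    return max(attrs, key=gain)

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:
        return labels[0]                             # pure node: stop splitting
    if not attrs:
        return Counter(labels).most_common(1)[0][0]  # labeling rule: majority class
    attr = best_attribute(rows, labels, attrs)
    branches = {}
    for value in {row[attr] for row in rows}:        # k-ary split, one branch per value
        keep = [i for i, row in enumerate(rows) if row[attr] == value]
        branches[value] = id3([rows[i] for i in keep],
                              [labels[i] for i in keep],
                              [a for a in attrs if a != attr])
    return {attr: branches}

rows = [{"eye": "black", "height": "short"},
        {"eye": "black", "height": "tall"},
        {"eye": "blue",  "height": "short"}]
labels = ["yes", "yes", "no"]
tree = id3(rows, labels, ["eye", "height"])
print(tree)  # e.g. {'eye': {'black': 'yes', 'blue': 'no'}} (branch order may vary)
```

Here "eye" is chosen over "height" because splitting on it yields two pure branches, i.e. the larger information gain.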
Another Example
Gini Index

If a data set T contains examples from n classes, the Gini index gini(T) is defined as

gini(T) = 1 − Σⱼ pⱼ²

where pⱼ is the relative frequency of class j in T. If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the Gini index of the split data is the size-weighted sum

gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2), where N = N1 + N2.
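A direct transcription of the two formulas above, assuming a binary split:

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the classes present in T."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini of a binary split: (N1/N)*gini(T1) + (N2/N)*gini(T2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["a", "a", "b", "b"]
print(gini(labels))                        # 0.5 (even 2-class mix)
print(gini_split(["a", "a"], ["b", "b"]))  # 0.0 (a perfect split)
```

CART-style tree builders pick the split that minimizes gini_split, i.e. the one that leaves the purest subsets.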
Inference Power of an Attribute

A feature that is useful in inferring the group identity of a data tuple is said to have good inference power for that group identity. In Table 1, given the attributes (features) Gender, Beverage, and State, try to find their inference power with respect to Group id.
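One simple way to operationalize this notion: an attribute has perfect inference power for the group identity if each of its values maps to exactly one group. Table 1 is not reproduced on these slides, so the rows below are hypothetical stand-ins:

```python
def has_perfect_inference_power(rows, attr, target):
    """True if every value of `attr` determines a single `target` value."""
    seen = {}
    for row in rows:
        seen.setdefault(row[attr], set()).add(row[target])
    return all(len(groups) == 1 for groups in seen.values())

# Hypothetical data in the spirit of Table 1:
rows = [
    {"Gender": "M", "Beverage": "tea",    "Group id": 1},
    {"Gender": "F", "Beverage": "tea",    "Group id": 1},
    {"Gender": "M", "Beverage": "coffee", "Group id": 2},
]
print(has_perfect_inference_power(rows, "Beverage", "Group id"))  # True
print(has_perfect_inference_power(rows, "Gender", "Group id"))    # False
```

In practice inference power is graded rather than all-or-nothing, which is exactly what measures like information gain quantify.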
Generating Classification Rules