DATA MINING: Introductory and Advanced Topics
Part II - Classification
Margaret H. Dunham, Department of Computer Science and Engineering, Southern Methodist University
Companion slides for the text by Dr. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.
Classification Outline
–Classification Problem Overview
–Regression
–Similarity Measures
–Bayesian Classification
–Decision Trees
–Rules
–Neural Networks
Goal: Provide an overview of the classification problem and introduce some of the basic algorithms.
Classification Problem
–Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.
–The mapping actually divides D into equivalence classes.
–Prediction is similar, but may be viewed as having an infinite number of classes.
Classification Examples
–Teachers classify students' grades as A, B, C, D, or F.
–Identify mushrooms as poisonous or edible.
–Predict when a river will flood.
–Identify individuals who are credit risks.
–Speech recognition
–Pattern recognition
Classification Ex: Grading
–If x >= 90 then grade = A.
–If 80 <= x < 90 then grade = B.
–If 70 <= x < 80 then grade = C.
–If 60 <= x < 70 then grade = D.
–If x < 60 then grade = F.
(Figure: decision tree that tests x against 90, 80, 70, and 60 in turn, assigning A, B, C, D, or F.)
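As a minimal illustration of this threshold-based mapping, the sketch below (not from the slides) implements the grading rules above as a Python function.

```python
def grade(x):
    """Map a numeric score x to a letter grade using the thresholds above."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print([grade(s) for s in (95, 83, 71, 65, 40)])  # ['A', 'B', 'C', 'D', 'F']
```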
Classification Ex: Letter Recognition
View letters as constructed from 5 components.
(Figure: letters A through F, each built from the five components.)
Classification Techniques
Approach:
1. Create a specific model by evaluating training data (or using domain experts' knowledge).
2. Apply the model developed to new data.
–Classes must be predefined.
–Most common techniques use decision trees (DTs), neural networks (NNs), or are based on distances or statistical methods.
Defining Classes
(Figure: two ways of defining classes, partitioning based and distance based.)
Issues in Classification
Missing Data
–Ignore
–Replace with assumed value
Measuring Performance
–Classification accuracy on test data
–Confusion matrix
–Operating characteristic (OC) curve
Height Example Data
(Table: 15 tuples with gender and height attributes and two candidate class assignments, Output1 and Output2.)
Classification Performance
(Figure: the four possible outcomes: true positive, false negative, false positive, true negative.)
Confusion Matrix Example
Using the height data example, with Output1 as the correct assignment and Output2 as the actual (predicted) assignment.
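The matrix itself appears only as a figure in the original, but a confusion matrix is easy to build by counting (correct class, predicted class) pairs. The sketch below uses hypothetical label lists standing in for Output1 and Output2, not the book's actual height data.

```python
from collections import Counter

def confusion_matrix(correct, predicted, classes):
    """Count how often each correct class was predicted as each class."""
    counts = Counter(zip(correct, predicted))
    return [[counts[(c, p)] for p in classes] for c in classes]

# Hypothetical assignments standing in for Output1 (correct) and Output2 (actual).
output1 = ["Short", "Tall", "Medium", "Medium", "Tall"]
output2 = ["Short", "Medium", "Medium", "Tall", "Tall"]
for row in confusion_matrix(output1, output2, ["Short", "Medium", "Tall"]):
    print(row)
```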
Operating Characteristic Curve
Regression
–Assume the data fits a predefined function.
–Determine the best values for the regression coefficients c0, c1, …, cn.
–Assume an error term ε: y = c0 + c1 x1 + … + cn xn + ε
–Estimate the error using mean squared error on the training set.
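A minimal sketch (not part of the slides) of fitting the regression coefficients by least squares with NumPy, which minimizes the mean squared error on the training set; the data values are illustrative only.

```python
import numpy as np

# Toy training data: y is roughly 2*x1 + 3 plus noise (illustrative values only).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([5.1, 6.9, 9.2, 10.8, 13.1])

# Add a column of ones so c0 (the intercept) is estimated along with c1..cn.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

predictions = A @ coeffs
mse = np.mean((y - predictions) ** 2)
print("coefficients:", coeffs, "MSE:", mse)
```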
Linear Regression
(Figure: a regression line that is a poor fit to the data.)
Classification Using Regression
–Division: use the regression function to divide the area into regions.
–Prediction: use the regression function to predict a class membership function. Input includes the desired class.
Division
Prediction
Classification Using Distance
–Place items in the class they are "closest" to.
–Must determine the distance between an item and a class.
–Classes may be represented by:
  –Centroid: central value
  –Medoid: representative point
  –Individual points
–Algorithm: KNN
K Nearest Neighbor (KNN)
–Training set includes classes.
–Examine the K items nearest to the item being classified.
–The new item is placed in the class that contains the most of these K neighbors.
–O(n) work for each tuple to be classified, where n is the size of the training set.
KNN Algorithm
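The algorithm listing on this slide is an image in the original; the following is a small sketch of the KNN idea under the assumption of numeric attributes and Euclidean distance. The training tuples shown are hypothetical, not the book's height data.

```python
import math
from collections import Counter

def knn_classify(training, new_item, k=3):
    """training: list of (attribute_vector, class_label); return majority class of k nearest."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(training, key=lambda t: dist(t[0], new_item))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training tuples: (height,) -> class.
train = [((1.6,), "Short"), ((1.9,), "Tall"), ((1.7,), "Medium"),
         ((2.0,), "Tall"), ((1.75,), "Medium")]
print(knn_classify(train, (1.85,), k=3))  # majority of the 3 nearest neighbors
```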
DT Classification
–Partitioning based: divide the search space into rectangular regions.
–A tuple is placed into a class based on the region within which it falls.
–DT approaches differ in how the tree is built: DT Induction.
–Internal nodes are associated with attributes and arcs with values for those attributes.
–Algorithms: ID3, C4.5, CART
Decision Tree
Given:
–D = {t1, …, tn} where ti = ⟨ti1, …, tih⟩
–Database schema contains {A1, A2, …, Ah}
–Classes C = {C1, …, Cm}
A Decision or Classification Tree is a tree associated with D such that:
–Each internal node is labeled with an attribute, Ai
–Each arc is labeled with a predicate which can be applied to the attribute at its parent
–Each leaf node is labeled with a class, Cj
DT Induction
DT Splits Area
(Figure: how splits on Gender (M/F) and Height partition the search space into rectangular regions.)
Comparing DTs
(Figure: two trees for the same data, one balanced and one deep.)
DT Issues
–Choosing splitting attributes
–Ordering of splitting attributes
–Splits
–Tree structure
–Stopping criteria
–Training data
–Pruning
Decision Tree Induction is often based on Information Theory.
Information
DT Induction
–When all the marbles in the bowl are mixed up, little information is given.
–When the marbles in the bowl are all from one class and those in the other two classes are set to either side, more information is given.
–Use this approach with DT induction!
Information/Entropy
–Given probabilities p1, p2, …, ps whose sum is 1, entropy is defined as:
  H(p1, p2, …, ps) = Σi pi log(1/pi)
–Entropy measures the amount of randomness, surprise, or uncertainty.
–Information is maximized when entropy is minimized.
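A small sketch of the entropy calculation (not from the slides). The worked ID3 example a few slides below reproduces its printed values only if base-10 logarithms are used, so that base is used here.

```python
import math

def entropy(probabilities, base=10):
    """H = sum of p * log(1/p); terms with p == 0 contribute nothing."""
    return sum(p * math.log(1 / p, base) for p in probabilities if p > 0)

print(entropy([4/15, 8/15, 3/15]))   # ~0.4384, the starting entropy of the height example
print(entropy([0.5, 0.5]))           # maximum uncertainty for two classes (base 10: ~0.301)
```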
ID3
–Creates the tree using information theory concepts and tries to reduce the expected number of comparisons.
–ID3 chooses the split attribute with the highest information gain:
  Gain(D, S) = H(D) − Σi P(Di) H(Di)
  where splitting D on attribute S produces the subsets {D1, …, Ds}.
Entropy
(Figure: plots of log(1/p), p log(1/p), and H(p, 1−p) as functions of p.)
ID3 Example (Output1)
–Starting state entropy:
  4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
–Gain using gender:
  –Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
  –Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
  –Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
  –Gain: 0.4384 − 0.34152 = 0.09688
–Gain using height: 0.4384 − (2/15)(1) = 0.3051
–Choose height as the first splitting attribute.
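The gender-gain arithmetic above can be reproduced with a short script. This is a sketch, not the book's code; it assumes the class counts implied by the fractions above (4/8/3 overall, 3/6 for females, 1/2/3 for males) and uses base-10 logarithms, which is what the printed values correspond to.

```python
import math

def entropy(counts, base=10):
    """Entropy of a class distribution given raw counts."""
    total = sum(counts)
    return sum(c / total * math.log(total / c, base) for c in counts if c > 0)

start = entropy([4, 8, 3])                           # 0.4384
female, male = entropy([3, 6]), entropy([1, 2, 3])   # 0.2764, 0.4392
weighted = 9 / 15 * female + 6 / 15 * male           # 0.34152
print("gain using gender:", start - weighted)        # ~0.09688
```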
C4.5
–ID3 favors attributes with a large number of divisions.
–C4.5 is an improved version of ID3 that adds:
  –Missing data handling
  –Continuous data
  –Pruning
  –Rules
  –GainRatio: the gain normalized by the entropy of the split itself,
    GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, …, |Ds|/|D|)
CART
–Creates a binary tree.
–Uses entropy.
–Formula to choose the split point, s, for node t:
  Φ(s|t) = 2 PL PR Σj |P(Cj | tL) − P(Cj | tR)|
–PL, PR: probability that a tuple in the training set will be on the left or right side of the tree.
CART Example
–At the start, there are six choices for the split point:
  –Φ(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.224
  –Φ(1.6) = 0
  –Φ(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
  –Φ(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
  –Φ(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
  –Φ(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
–Split at 1.8.
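A sketch following the arithmetic pattern of the example above: two times the left and right proportions times the summed per-class count differences over n. The per-class counts on each side of the 1.8 split (4/1/0 versus 0/7/3) are my reading of the height data and should be treated as assumed, not quoted from the book.

```python
def cart_phi(left_counts, right_counts):
    """Goodness of a candidate binary split, following the arithmetic of the example above:
    2 * P(left) * P(right) * sum over classes of |left_count - right_count| / n."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    spread = sum(abs(l - r) for l, r in zip(left_counts, right_counts)) / n
    return 2 * (n_left / n) * (n_right / n) * spread

# Hypothetical per-class counts (Short, Medium, Tall) on each side of the 1.8 split.
print(cart_phi([4, 1, 0], [0, 7, 3]))   # ~0.385, matching the slide's value for the 1.8 split
```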
DT Advantages/Disadvantages
Advantages:
–Easy to understand.
–Easy to generate rules.
Disadvantages:
–May suffer from overfitting.
–Classifies by rectangular partitioning.
–Does not easily handle nonnumeric data.
–Can be quite large; pruning is necessary.
Rules
–Perform classification using If-Then rules.
–Classification Rule: r = ⟨a, c⟩, where a is the antecedent and c is the consequent.
–May be generated from other techniques (DT, NN) or generated directly.
–Direct algorithms: 1R, PRISM
Generating Rules from DTs
Generating Rules Example
1R Algorithm
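The 1R listing is an image in the original. The sketch below captures the usual 1R idea: for each attribute, map each of its values to the majority class, then keep the attribute whose one-attribute rule set makes the fewest errors on the training data. The records and attribute names here are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(records, attributes, class_key):
    """Return (best attribute, its value-to-class rules, training errors)."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for rec in records:
            by_value[rec[attr]][rec[class_key]] += 1
        rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Hypothetical records.
data = [{"gender": "F", "height_band": "low", "class": "Short"},
        {"gender": "F", "height_band": "mid", "class": "Medium"},
        {"gender": "M", "height_band": "high", "class": "Tall"},
        {"gender": "M", "height_band": "mid", "class": "Medium"}]
print(one_r(data, ["gender", "height_band"], "class"))
```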
1R Example
PRISM Algorithm
PRISM Example
Decision Tree vs. Rules
–A tree has an implied order in which splitting is performed.
–A tree is created based on looking at all classes.
–Rules have no ordering of predicates.
–Only need to look at one class to generate its rules.
NN
–Typical NN structure for classification:
  –One output node per class
  –Output value is the class membership function value
–Supervised learning.
–For each tuple in the training set, propagate it through the NN and adjust the weights on edges to improve future classification.
–Algorithms: Propagation, Backpropagation, Gradient Descent
NN Issues
–Number of source nodes
–Number of hidden layers
–Training data
–Number of sinks
–Interconnections
–Weights
–Activation functions
–Learning technique
–When to stop learning
Decision Tree vs. Neural Network
Propagation
(Figure: a tuple's input values are propagated through the network to produce the output.)
NN Propagation Algorithm
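The propagation algorithm is shown only as a figure in the original; here is a small sketch of forward propagation through one hidden layer, assuming a sigmoid activation and illustrative weights (none of these values come from the slides).

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def propagate(inputs, layer_weights):
    """Feed the input tuple forward; each layer is a list of per-node weight lists
    whose last entry is a bias term."""
    values = list(inputs)
    for weights in layer_weights:
        values = [sigmoid(sum(w * v for w, v in zip(node_w[:-1], values)) + node_w[-1])
                  for node_w in weights]
    return values

# One hidden layer with two nodes, then one output node (hypothetical weights).
hidden = [[0.5, -0.4, 0.1], [0.3, 0.8, -0.2]]
output = [[1.0, -1.0, 0.05]]
print(propagate([0.9, 0.2], [hidden, output]))
```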
Example Propagation
NN Learning
–Adjust weights to perform better with the associated test data.
–Supervised: use feedback from knowledge of the correct classification.
–Unsupervised: no knowledge of the correct classification is needed.
NN Supervised Learning
Supervised Learning
–Possible error values, assuming the output from node i is yi but should be di.
–Change the weights on arcs based on the estimated error.
NN Backpropagation
–Propagate changes to the weights backward from the output layer to the input layer.
–Delta Rule: Δwij = c · xij · (dj − yj)
–Gradient descent: technique used to modify the weights in the graph.
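A minimal sketch of one application of the delta rule above, with learning rate c, inputs xij, desired output dj, and actual output yj; the weight and input values are hypothetical.

```python
def delta_rule_update(weights, inputs, desired, actual, c=0.1):
    """w_ij <- w_ij + c * x_ij * (d_j - y_j) for a single output node j."""
    return [w + c * x * (desired - actual) for w, x in zip(weights, inputs)]

w = [0.2, -0.5, 0.7]          # hypothetical current weights into node j
x = [1.0, 0.4, -0.3]          # inputs along those arcs
print(delta_rule_update(w, x, desired=1.0, actual=0.6))
```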
Backpropagation Error
Backpropagation Algorithm
Gradient Descent
Gradient Descent Algorithm
Output Layer Learning
Hidden Layer Learning
Types of NNs
–Different NN structures are used for different problems.
–Perceptron
–Self Organizing Feature Map
–Radial Basis Function Network
Perceptron
–The perceptron is one of the simplest NNs.
–No hidden layers.
Perceptron Example
Suppose:
–Summation: S = 3x1 + 2x2 − 6
–Activation: if S > 0 then output 1, else output 0
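The example above translates directly into code; this small sketch just evaluates the given summation and threshold activation for a couple of illustrative inputs.

```python
def perceptron(x1, x2):
    """Summation S = 3*x1 + 2*x2 - 6; activation: 1 if S > 0 else 0."""
    s = 3 * x1 + 2 * x2 - 6
    return 1 if s > 0 else 0

print(perceptron(1, 1))   # S = -1 -> 0
print(perceptron(2, 1))   # S =  2 -> 1
```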
Self Organizing Feature Map (SOFM)
–Competitive unsupervised learning.
–Observe how neurons work in the brain:
  –Firing impacts the firing of those nearby.
  –Neurons far apart inhibit each other.
  –Neurons have specific nonoverlapping tasks.
–Ex: Kohonen Network
Kohonen Network
Kohonen Network
–Competitive layer, viewed as a 2D grid.
–Similarity between competitive nodes and input nodes:
  –Input: X = ⟨x1, …, xh⟩
  –Weights: ⟨w1i, …, whi⟩
  –Similarity defined based on the dot product
–The competitive node most similar to the input "wins".
–The winning node's weights (as well as surrounding nodes' weights) are increased.
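A sketch of one competitive-learning step in the spirit of the description above (dot-product similarity, winner and its neighbors nudged toward the input); the grid size, learning rate, and 4-neighbor rule are assumptions, not the book's.

```python
import random

def som_step(grid_weights, x, rate=0.2):
    """grid_weights: dict mapping (row, col) -> weight vector. Pick the node whose
    weights have the largest dot product with x, then move it and its 4-neighbors
    a little toward x."""
    winner = max(grid_weights,
                 key=lambda pos: sum(w * xi for w, xi in zip(grid_weights[pos], x)))
    r, c = winner
    for pos in [(r, c), (r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]:
        if pos in grid_weights:
            grid_weights[pos] = [w + rate * (xi - w) for w, xi in zip(grid_weights[pos], x)]
    return winner

random.seed(0)
grid = {(i, j): [random.random(), random.random()] for i in range(3) for j in range(3)}
print(som_step(grid, [1.0, 0.0]))   # coordinates of the winning node
```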
Radial Basis Function Network
–An RBF function has a Gaussian shape.
–RBF Networks:
  –Three layers
  –Hidden layer: Gaussian activation function
  –Output layer: linear activation function
Radial Basis Function Network
(Figure: RBF network architecture.)
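A sketch of the three-layer structure described above: a hidden layer of Gaussian (RBF) units followed by a linear output layer. The centers, width, and output weights are hypothetical.

```python
import math

def rbf_network(x, centers, width, out_weights):
    """Hidden layer: Gaussian activations exp(-||x - center||^2 / (2*width^2)).
    Output layer: linear combination of the hidden activations plus a bias."""
    hidden = [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2 * width ** 2))
              for c in centers]
    *w, bias = out_weights
    return sum(wi * hi for wi, hi in zip(w, hidden)) + bias

centers = [(0.0, 0.0), (1.0, 1.0)]            # hypothetical RBF centers
print(rbf_network((0.9, 0.8), centers, width=0.5, out_weights=[0.7, 1.3, -0.1]))
```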
NN Advantages/Disadvantages
Advantages:
–Can continue the learning process even after the training set has been applied.
–Can easily be parallelized.
Disadvantages:
–Difficult to understand.
–May suffer from overfitting.
–The structure of the graph must be determined a priori.
–Input attribute values must be numeric.
–Verification of the correct functioning of the NN may be difficult.