
1 Classification and Prediction by Yen-Hsien Lee Department of Information Management College of Management National Sun Yat-Sen University March 4, 2003

2 Outline
- Introduction to Classification
- Decision Tree – ID3
- Neural Network – Backpropagation
- Bayesian Network

3 Classification
Purpose: Classification is the process of establishing classes, characterized by attributes, from a set of instances in a database. The class of an object must be one of a finite set of possible, pre-determined class values, while the attributes of the object are descriptions of the object that potentially affect its class.
Techniques: ID3 and its descendants, backpropagation neural networks, Bayesian networks, CN2, the AQ family, etc.

4 ID3 Approach ID3 uses an iterative method to build up decision trees, preferring simple trees over complex ones, on the theory that simple trees are more accurate classifiers of future inputs. ID3 develops a small tree by using an information-theoretic approach: it determines the amount of information gained by testing each candidate attribute, and selects the attribute that yields the largest gain at each node.

5 Sample Training Set

6 Example: Complex Decision Tree (diagram: a larger tree for the same data that tests Temperature at the root, with branches Cool, Mild and Hot, and tests Outlook, Humidity and Windy again in its subtrees before reaching the P/N leaves)

7 Example: Simple Decision Tree
Outlook
  Sunny: test Humidity (High: N, Normal: P)
  Overcast: P
  Rain: test Windy (True: N, False: P)

8 Entropy Function
Entropy of a set C of objects (examples), where C contains n = n1 + n2 + n3 + n4 objects distributed over Class 1 (n1), Class 2 (n2), Class 3 (n3) and Class 4 (n4):
E(C) = -(n1/n) log2(n1/n) - (n2/n) log2(n2/n) - (n3/n) log2(n3/n) - (n4/n) log2(n4/n)
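The entropy formula above translates directly into a few lines of Python. This is a minimal sketch: the helper name entropy() and the list-of-class-labels input are illustrative choices, not from the slides.

from collections import Counter
from math import log2

def entropy(labels):
    """E(C) = -sum over the classes k present in C of (n_k/n) * log2(n_k/n)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Example: a set of 14 objects, 9 in class P and 5 in class N.
print(round(entropy(["P"] * 9 + ["N"] * 5), 3))  # 0.94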

9 Entropy Function (Cont’d)
Entropy of the partial tree of C if a particular attribute Ai is chosen for partitioning C into subsets C1, C2, ...:
E(Ai) = (|C1|/|C|) E(C1) + (|C2|/|C|) E(C2) + ...

10 Entropy Function (Cont’d)
Set C (total objects n = n1 + n2 + n3 + n4, with Class 1 (n1), Class 2 (n2), Class 3 (n3), Class 4 (n4)):
E(C) = -(n1/n) log2(n1/n) - (n2/n) log2(n2/n) - (n3/n) log2(n3/n) - (n4/n) log2(n4/n)
Set C is partitioned into subsets C1, C2, ... by attribute Ai.
Subset C1 (m = m1 + m2 + m3 + m4 objects, with Class 1 (m1), Class 2 (m2), Class 3 (m3), Class 4 (m4)):
E(C1) = -(m1/m) log2(m1/m) - (m2/m) log2(m2/m) - (m3/m) log2(m3/m) - (m4/m) log2(m4/m)
Subset C2 (p = p1 + p2 + p3 + p4 objects, with Class 1 (p1), Class 2 (p2), Class 3 (p3), Class 4 (p4)):
E(C2) = -(p1/p) log2(p1/p) - (p2/p) log2(p2/p) - (p3/p) log2(p3/p) - (p4/p) log2(p4/p)
...
E(Ai) = (m/n) E(C1) + (p/n) E(C2) + ...

11 Information Gain Due to Attribute Partition
Entropy of the set C (n = n1 + n2 + n3 + n4 objects over Classes 1 to 4) = E(C).
Set C is partitioned into subsets C1 (m objects), C2 (p objects), ... by attribute Ai; the entropy of the resulting partial tree is E(Ai).
Thus, the information gain due to the partition by attribute Ai is Gi = E(C) - E(Ai).
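The weighted entropy E(Ai) and the gain Gi = E(C) - E(Ai) can be sketched the same way, reusing the entropy() helper above. The record format, a list of (attribute-value dictionary, class label) pairs, is an assumption made for illustration.

def partition_entropy(records, attribute):
    """Weighted entropy of the subsets produced by splitting on `attribute`."""
    n = len(records)
    subsets = {}
    for values, label in records:
        subsets.setdefault(values[attribute], []).append(label)
    return sum(len(labels) / n * entropy(labels) for labels in subsets.values())

def information_gain(records, attribute):
    """G_i = E(C) - E(A_i) for the given attribute."""
    return entropy([label for _, label in records]) - partition_entropy(records, attribute)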

12 ID3 Algorithm
1. Start from the root node and assign the root node as the current node C.
2. If all objects in the current node C belong to the same class, then stop (the termination condition for the current node C); else go to step 3.
3. Calculate the entropy E(C) for the node C.
4. Calculate the entropy E(Ai) of the partial tree partitioned by an attribute Ai that has not yet been used as a classifying attribute of the node C.
5. Compute the information gain Gi for the partial tree (i.e., Gi = E(C) - E(Ai)).

13 ID3 Algorithm (Cont’d)
6. Repeat steps 4 and 5 for each attribute that has not yet been used as a classifying attribute of the node C.
7. Select the attribute with the maximum information gain (max Gi) as the classifying attribute for the node C.
8. Create child nodes C1, C2, ..., Cn (assume the selected attribute has n values) for the node C, and assign the objects in the node C to the appropriate child nodes according to their values of the classifying attribute.
9. Mark the selected attribute as a classifying attribute of each child node Ci. For each child node Ci, assign it as the current node and go to step 2.
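The nine steps condense into a short recursive sketch, reusing the helpers above and the same assumed record format; it is a minimal illustration rather than the original ID3 implementation (for instance, it falls back to the majority class when no attributes remain).

from collections import Counter

def id3(records, attributes):
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                  # step 2: all objects in one class
        return labels[0]
    if not attributes:                         # no attribute left: take the majority class
        return Counter(labels).most_common(1)[0][0]
    # steps 3-7: choose the attribute with the maximum information gain
    best = max(attributes, key=lambda a: information_gain(records, a))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in {vals[best] for vals, _ in records}:          # step 8: one child per value
        subset = [(vals, lbl) for vals, lbl in records if vals[best] == value]
        tree[best][value] = id3(subset, remaining)             # step 9: recurse on each child
    return tree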

14 Example (See Slide 5)
Current node C = root node of the tree.
Class P: Objects 3, 4, 5, 7, 9, 10, 11, 12, 13
Class N: Objects 1, 2, 6, 8, 14
Entropy of the node C = E(C) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

15 Example (Cont’d)
Entropy of the partial tree based on the Outlook attribute:
E(Outlook = Sunny) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971
E(Outlook = Overcast) = -(0/4) log2(0/4) - (4/4) log2(4/4) = 0
E(Outlook = Rain) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
E(Outlook) = (5/14) E(Outlook = Sunny) + (4/14) E(Outlook = Overcast) + (5/14) E(Outlook = Rain) = 0.694

16 Example (Cont’d)
Information gain due to the partition by the Outlook attribute: G(Outlook) = E(C) - E(Outlook) = 0.246
Similarly, the information gains due to the partition by the Temperature, Humidity and Windy attributes are:
G(Temperature) = 0.029
G(Humidity) = 0.151
G(Windy) = 0.048
Thus, the Outlook attribute is selected as the classifying attribute for the current node C, since its information gain is the largest among all of the attributes.
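As a quick numerical check, the snippet below encodes a training set consistent with all of the counts quoted on these slides (the classic 14-object weather data usually attributed to Quinlan, assumed here to be the table on slide 5) and recomputes the four gains with the helpers sketched earlier.

data = [  # (attribute values, class)
    ({"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Windy": False}, "N"),
    ({"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Windy": True},  "N"),
    ({"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Windy": False}, "P"),
    ({"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Windy": False}, "P"),
    ({"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Windy": False}, "P"),
    ({"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Windy": True},  "N"),
    ({"Outlook": "Overcast", "Temperature": "Cool", "Humidity": "Normal", "Windy": True},  "P"),
    ({"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "High",   "Windy": False}, "N"),
    ({"Outlook": "Sunny",    "Temperature": "Cool", "Humidity": "Normal", "Windy": False}, "P"),
    ({"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "Normal", "Windy": False}, "P"),
    ({"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "Normal", "Windy": True},  "P"),
    ({"Outlook": "Overcast", "Temperature": "Mild", "Humidity": "High",   "Windy": True},  "P"),
    ({"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "Normal", "Windy": False}, "P"),
    ({"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Windy": True},  "N"),
]
for attribute in ["Outlook", "Temperature", "Humidity", "Windy"]:
    print(attribute, round(information_gain(data, attribute), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048
# (the slide's 0.246 and 0.151 come from using the rounded E(C) = 0.940)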

17 Example (Cont’d)
The resulting partial decision tree is:
Outlook
  Sunny: objects 1, 2, 8, 9, 11 (node C1)
  Overcast: P (objects 3, 7, 12, 13)
  Rain: objects 4, 5, 6, 10, 14 (node C2)
The analysis continues for the nodes C1 and C2 until all of the leaf nodes are associated with objects of the same class.

18 Example (Cont’d)
The resulting final decision tree is:
Outlook
  Sunny: Humidity
    High: N (objects 1, 2, 8)
    Normal: P (objects 9, 11)
  Overcast: P (objects 3, 7, 12, 13)
  Rain: Windy
    True: N (objects 6, 14)
    False: P (objects 4, 5, 10)
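Running the id3() sketch from slide 13 on the same assumed data reproduces this tree; the dictionary key order may vary from run to run.

print(id3(data, ["Outlook", "Temperature", "Humidity", "Windy"]))
# {'Outlook': {'Sunny': {'Humidity': {'High': 'N', 'Normal': 'P'}},
#              'Overcast': 'P',
#              'Rain': {'Windy': {True: 'N', False: 'P'}}}}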

19 Issues of Decision Trees
How to deal with continuous attributes.
Pruning the tree so that it does not overfit individual training cases.
Finding a better metric than information gain to evaluate tree expansion: information gain tends to favor attributes with many values.

20 Characteristics of Neural Network (“Connectionist”) Architecture A neural network consists of many simple, interconnected processing elements. The processing elements are often grouped together into linear arrays called “layers”. A neural network always has an input layer and an output layer, and may or may not have “hidden” layers. Each processing element has a number of inputs xi, each carrying a weight wji. The processing element sums the weighted inputs wji·xi and computes a single output signal yj that is a function f of that weighted sum.

21 Characteristics of Neural Network (“Connectionist”) Architecture (Cont’d) The function f, called the transfer function, is fixed for the life of the processing element. A typical transfer function is the sigmoid function. The function f is the object of a design decision and cannot be changed dynamically. On the other hand, the weights wji are variables and can be adjusted dynamically to produce a given output. This dynamic modification of weights is what allows a neural network to memorize information, to adapt, and to learn.

22 Neural Network Processing Element (diagram: inputs x1, x2, ..., xi, weighted by wj1, wj2, ..., feed a processing element with transfer function f that produces the output yj)

23 Sigmoid Function (plot of the sigmoid transfer function)
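A minimal sketch of a single processing element with a sigmoid transfer function: it sums the weighted inputs wji·xi and squashes the sum through f. The example input values, weights, and the optional bias term are illustrative assumptions.

import math

def sigmoid(net):
    """Typical transfer function f(net) = 1 / (1 + exp(-net))."""
    return 1.0 / (1.0 + math.exp(-net))

def processing_element(inputs, weights, bias=0.0):
    """Compute the single output signal y_j = f(sum_i w_ji * x_i + bias)."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(net)

print(processing_element([1.0, 0.5], [0.2, -0.4]))  # one output signal y_j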

24 Architecture of Three-Layer Neural Network (diagram: an input layer, a hidden layer, and an output layer, with connections running from each layer to the next)

25 Backpropagation Network A fully connected, layered, feedforward neural network trained by propagating errors backward. Each unit (processing element) in one layer is connected in the forward direction to every unit in the next layer. A backpropagation network typically starts out with a random set of weights. The network adjusts its weights each time it sees an input-output pair. Each pair requires two stages: a forward pass and a backward pass. The forward pass involves presenting a sample input to the network and letting activations flow until they reach the output layer.

26 Backpropagation Network (Cont’d) During the backward pass, the network’s actual output (from the forward pass) is compared with the target output and error estimates are computed for the output units. The weights connected to the output units can be adjusted in order to reduce those errors. We can then use the error estimates of the output units to derive error estimates for the units in the hidden layers. Finally, errors are propagated back to the connections stemming from the input units.
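A compact sketch of one forward pass and one backward pass for a small three-layer network, written with NumPy. The layer sizes, training pair, and learning rate are arbitrary assumptions, and the update rule is the standard squared-error/sigmoid gradient step rather than any specific implementation from these slides.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 4))   # input -> hidden weights (random start)
W2 = rng.normal(scale=0.5, size=(4, 1))   # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_pair(x, target, lr=0.5):
    """One input-output pair: forward pass, then backward pass with weight updates."""
    global W1, W2
    # forward pass: let activations flow to the output layer
    hidden = sigmoid(x @ W1)
    output = sigmoid(hidden @ W2)
    # backward pass: output errors first, then derived hidden errors
    delta_out = (output - target) * output * (1 - output)
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)
    W2 -= lr * np.outer(hidden, delta_out)      # adjust weights into the output units
    W1 -= lr * np.outer(x, delta_hidden)        # adjust weights stemming from the inputs
    return float(((output - target) ** 2).sum())

x, target = np.array([1.0, 0.0, 1.0]), np.array([1.0])
for _ in range(100):
    error = train_pair(x, target)
print(round(error, 4))   # the squared error shrinks as the weights are adjusted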

27 Issues of Backpropagation Networks
How to represent the input data.
How to decide the number of layers.
Which learning strategy to use.

28 Bayesian Classification Bayesian classification is based on Bayes theorem. Bayesian classifiers predict class membership probabilities, such as the probability that a given sample belongs to a particular class. Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. Bayesian belief networks are graphical models which, unlike naïve Bayesian classifiers, allow the representation of dependencies among subsets of attributes.

29 Bayes Theorem
Let H be a hypothesis and X be a data sample.
P(H | X) is the posterior probability of H given X.
P(X | H) is the likelihood: the conditional probability of X given H.
P(H) is the prior probability of H.
Bayes theorem: P(H | X) = P(X | H) P(H) / P(X).
P(X), P(H), and P(X | H) may be estimated from the given data.

30 Naïve Bayesian Classification
Assume a data sample X = (x1, x2, ..., xn) with n attributes and an unknown class. The naïve Bayesian classifier predicts the class of X, among the classes C1, C2, ..., Cm, as follows:
1. Compute the posterior probability, conditioned on X, for each class.
2. Assign X to the class Ci that has the highest posterior probability, i.e., P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i.

31 Naïve Bayesian Classification (Cont’d)
Since P(Ci | X) = P(X | Ci) P(Ci) / P(X), and P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized. In addition, the naïve Bayesian classifier assumes that there are no dependence relationships among the attributes. Thus, P(X | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci).
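A minimal sketch of this procedure, reusing the (attribute dictionary, class) record format assumed earlier: P(Ci) and P(xk | Ci) are estimated by simple counting (no smoothing), and the class maximizing P(X | Ci) P(Ci) is returned.

from collections import Counter

def naive_bayes_classify(records, x):
    """Return (class, score), where score is proportional to P(X | Ci) * P(Ci)."""
    class_counts = Counter(label for _, label in records)
    n = len(records)
    best_class, best_score = None, -1.0
    for ci, count in class_counts.items():
        score = count / n                                    # P(Ci)
        for attribute, value in x.items():                   # independence assumption
            matches = sum(1 for vals, lbl in records if lbl == ci and vals[attribute] == value)
            score *= matches / count                         # P(xk | Ci)
        if score > best_score:
            best_class, best_score = ci, score
    return best_class, best_score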

32 Example
To classify the data sample X = (Outlook = Sunny, Temperature = Hot, Humidity = Normal, Windy = False), we need to maximize P(X | Ci) P(Ci).
Compute P(Ci):
P(Class = P) = 9/14 = 0.643
P(Class = N) = 5/14 = 0.357
Compute P(Xk | Ci):
P(Outlook = Sunny | Class = P) = 2/9 = 0.222
P(Outlook = Sunny | Class = N) = 3/5 = 0.600
P(Temperature = Hot | Class = P) = 2/9 = 0.222
P(Temperature = Hot | Class = N) = 2/5 = 0.400
P(Humidity = Normal | Class = P) = 6/9 = 0.667
P(Humidity = Normal | Class = N) = 1/5 = 0.200
P(Windy = False | Class = P) = 6/9 = 0.667
P(Windy = False | Class = N) = 2/5 = 0.400

33 Example (Cont’d)
Compute P(X | Ci):
P(X | Class = P) = 0.222 × 0.222 × 0.667 × 0.667 = 0.022
P(X | Class = N) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
Compute P(X | Ci) P(Ci):
P(X | Class = P) P(Class = P) = 0.022 × 0.643 = 0.014
P(X | Class = N) P(Class = N) = 0.019 × 0.357 = 0.007
Conclusion: X belongs to Class P.
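Reusing the naive_bayes_classify() sketch and the assumed data list from earlier, the same conclusion can be checked:

x = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "Normal", "Windy": False}
print(naive_bayes_classify(data, x))   # ('P', 0.0141...): X is assigned to class P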

