1
Lecture 3 Classification and Prediction
K417-Part II School of Knowledge Science TuBao Ho
2
Outline of lecture 3 What is Classification? What is Prediction?
Issues Regarding Classification and Prediction Classification by Decision Tree Induction Bayesian Classification Classification by Backpropagation Other Classification Methods K417-Part 2
3
Classification vs. Prediction
Classification: predicts categorical class labels; classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.
Prediction: models continuous-valued functions, i.e., predicts unknown or missing values.
Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis.
4
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes. Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute. The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae (classifiers).
Model usage: classifying future or unknown objects. Estimate the accuracy of the model: the known label of each test sample is compared with the class predicted by the model; the accuracy rate is the percentage of test-set samples that are correctly classified by the model. The test set is independent of the training set; otherwise over-fitting will occur.
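As a concrete illustration of the two steps, the following sketch builds a classifier on a training set and then estimates its accuracy on an independent test set. It assumes scikit-learn is available and uses the bundled iris data purely as a stand-in labelled dataset.

```python
# Two-step process: (1) construct a model from a training set,
# (2) use it on an independent test set and measure the accuracy rate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # stand-in labelled data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)    # step 1: model construction
y_pred = model.predict(X_test)                             # step 2: model usage
print("accuracy rate:", accuracy_score(y_test, y_pred))    # fraction of test samples classified correctly
```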
5
Classification—A Two-Step Process
(Figure: a classification algorithm is applied to training data (cells labelled H1, H2, H3, H4 and C1, C2) to construct a classifier (model), e.g., "If color = dark and # tails = 2 then cancerous cell"; in the model-usage step the classifier is applied to an unknown case to answer "Cancerous?".)
6
Supervised vs. unsupervised learning
Supervised learning (classification): supervision means the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations; new data are classified based on the training set.
Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data. When the class attribute is not available, the data are considered unsupervised (data without labels, unsupervised learning).
(Figure: a table of animals such as herring, cat, pigeon, flying fish, otter, cod, and whale described by yes/no attributes: swims, has fins, flies, has lungs, is a fish.)
7
Outline of lecture 3 What is Classification? What is Prediction?
Issues Regarding Classification and Prediction Classification by Decision Tree Induction Bayesian Classification Classification by Backpropagation Other Classification Methods K417-Part 2
8
Preparing the Data
Data cleaning: noise, missing values.
Relevance analysis (feature selection): irrelevant attributes, redundant attributes.
Data transformation.
9
Comparing Classification Methods
Predictive accuracy: the ability of the classifier to correctly predict unseen data. Speed: refers to the computation cost. Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values. Scalability: the ability to construct the classifier efficiently given large amounts of data. Interpretability: the level of understanding and insight that is provided by the classifier.
10
Outline of lecture 3 What is Classification? What is Prediction?
Issues Regarding Classification and Prediction Classification by Decision Tree Induction Bayesian Classification Classification by Backpropagation Other Classification Methods K417-Part 2
11
Decision tree induction (DTI)
Decision to make: will the London stock market rise today? A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The top-most node in a tree is the root node.
(Example tree built from days in the past: the root tests "Is unemployment high?". YES: the London market will rise today {day2, day3}. NO: test "Is the New York market rising today?"; YES: the London market will rise today {day1}; NO: the London market will not rise today {day4, day5, day6}. The terminal nodes are leaf nodes.)
12
Decision tree induction (DTI)
Decision tree generation consists of two phases. Tree construction: partition the examples recursively based on selected attributes; at the start, all the training examples are at the root. Tree pruning: identify and remove branches that reflect noise or outliers. Use of a decision tree: to classify an unknown sample, test the attribute values of the sample against the decision tree.
13
General algorithm for tree construction
At each node of the tree:
1. Choose the "best" attribute by a given selection measure.
2. Extend the tree by adding a new branch for each value of the attribute.
3. Sort the training examples to the leaf nodes.
4. If the examples in a node belong to one class, then stop; else repeat steps 1-4 for the leaf nodes.
(Example: starting from {day1, day2, day3, day4, day5, day6}, the root tests "Is unemployment high?": YES gives {day2, day3}, the London market will rise today; NO gives {day1, day4, day5, day6} and the test "Is the New York market rising today?": YES gives {day1}, the London market will rise today; NO gives {day4, day5, day6}, the London market will not rise today.) A Python sketch of this recursion follows below.
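The recursion in steps 1-4 can be sketched as follows. The names build_tree and select_attribute are hypothetical, the examples are assumed to be (attribute-dictionary, label) pairs, and the selection measure (e.g. information gain, introduced later) is passed in as a function.

```python
from collections import Counter

def build_tree(examples, attributes, select_attribute):
    """Recursive form of steps 1-4. `examples` are (attribute_dict, label) pairs;
    `select_attribute` is the selection measure (e.g. information gain)."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:            # stop: node is pure, or nothing left to test
        return Counter(labels).most_common(1)[0][0]        # leaf labelled with the majority class
    best = select_attribute(examples, attributes)          # 1. choose the "best" attribute
    tree = {best: {}}
    for value in {x[best] for x, _ in examples}:           # 2. one new branch per attribute value
        subset = [(x, y) for x, y in examples if x[best] == value]   # 3. sort examples into the branch
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, select_attribute)   # 4. repeat for leaf nodes
    return tree
```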
14
Training data for the target concept Play-Tennis
(Figure: the Play-Tennis training table of 14 examples, days D1-D14, described by the attributes outlook, temperature, humidity, and wind, each labelled with the target concept Play-Tennis.)
X = <rain, hot, high, false>
15
A decision tree for playing tennis
(Figure: a large decision tree for playing tennis obtained when "temperature" is chosen at the root: the branches cool {D5, D6, D7, D9}, hot {D1, D2, D3, D13}, and mild {D4, D8, D10, D11, D12, D14} each need further tests on outlook, wind, and humidity before reaching yes/no leaves.)
16
A simple decision tree for playing tennis
(Figure: the simpler tree obtained when "outlook" is selected at the root: sunny {D1, D2, D8, D9, D11} splits on humidity, high {D1, D2, D8}: no, normal {D9, D11}: yes; o'cast {D3, D7, D12, D13}: yes; rain {D4, D5, D6, D10, D14} splits on wind, true {D6, D14}: no, false {D4, D5, D10}: yes.)
This tree is much simpler because "outlook" is selected at the root. How do we select a good attribute to split a decision node?
17
Which attribute is the best?
A decision node S contains 19 positive examples (+) and 35 negative examples (-), denoted [19+, 35-]. If attributes A1 and A2 (each with two values) split S into sub-nodes with the proportions of positive and negative examples below, which attribute is better?
A1 = ? splits [19+, 35-] into [11+, 5-] and [8+, 30-]
A2 = ? splits [19+, 35-] into [8+, 33-] and [11+, 2-]
18
Entropy
Entropy characterizes the impurity (purity) of an arbitrary collection of examples. S is the collection of positive and negative examples; p+ is the proportion of positive examples in S; p- is the proportion of negative examples in S.
Entropy(S) = -p+ log2(p+) - p- log2(p-)
19
Entropy
(Figure: the entropy function relative to a boolean classification, as the proportion p+ of positive examples varies between 0 and 1.)
20
Example
From the 14 examples of Play-Tennis, there are 9 positive and 5 negative examples (denoted [9+, 5-]):
Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Notice: 1. Entropy is 0 if all members of S belong to the same class. For example, if all members are positive (p+ = 1), then p- is 0, and Entropy(S) = -1·log2(1) - 0·log2(0) = 0 (taking 0·log2(0) to be 0). 2. Entropy is 1 if the collection contains an equal number of positive and negative examples. If these numbers are unequal, the entropy is between 0 and 1.
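A small sketch that reproduces these entropy values; the helper name entropy and the plain-Python style are illustrative choices.

```python
import math

def entropy(pos, neg):
    """Entropy(S) = -p+ log2(p+) - p- log2(p-), taking 0*log2(0) = 0."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

print(entropy(9, 5))    # ~0.940  -- the [9+, 5-] Play-Tennis collection
print(entropy(14, 0))   # 0.0     -- all members in one class
print(entropy(7, 7))    # 1.0     -- equal numbers of positive and negative examples
```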
21
Information Gain Measures the Expected Reduction in Entropy
We define a measure, called information gain, of the effectiveness of an attribute in classifying data. It is the expected reduction in entropy caused by partitioning the examples according to this attribute:
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which A has value v.
22
Information Gain Measures the Expected Reduction in Entropy
Values(Wind) = {Weak, Strong}, S = [9+, 5-]. Sweak, the subnode with value "weak", is [6+, 2-]; Sstrong, the subnode with value "strong", is [3+, 3-].
Gain(S, Wind) = Entropy(S) - (8/14)·Entropy(Sweak) - (6/14)·Entropy(Sstrong) = 0.940 - (8/14)·0.811 - (6/14)·1.00 = 0.048
23
Which attribute is the best classifier?
Humidity: High gives [3+, 4-], E = 0.985; Normal gives [6+, 1-], E = 0.592. Wind: Weak gives [6+, 2-], E = 0.811; Strong gives [3+, 3-], E = 1.00.
Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.00 = 0.048
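These numbers can be checked with a short information-gain sketch; the helper names entropy and gain are illustrative, and the arguments are the [positive, negative] counts quoted above.

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

def gain(parent, subsets):
    """Gain(S, A) = Entropy(S) - sum over values v of (|Sv|/|S|) * Entropy(Sv)."""
    total = sum(parent)
    return entropy(*parent) - sum((p + n) / total * entropy(p, n) for p, n in subsets)

# Humidity splits S = [9+, 5-] into High = [3+, 4-] and Normal = [6+, 1-]
print(gain((9, 5), [(3, 4), (6, 1)]))   # ~0.15, i.e. Gain(S, Humidity)
# Wind splits S = [9+, 5-] into Weak = [6+, 2-] and Strong = [3+, 3-]
print(gain((9, 5), [(6, 2), (3, 3)]))   # ~0.048, i.e. Gain(S, Wind)
```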
24
Information gain of all attributes
Gain (S, Outlook) = 0.246 Gain (S, Humidity) = 0.151 Gain (S, Wind) = 0.048 Gain (S, Temperature) = 0.029 K417-Part 2
25
Next step in growing the decision tree
(Figure: the root {D1, D2, ..., D14} [9+, 5-] is split on Outlook: Sunny gives {D1, D2, D8, D9, D11} [2+, 3-] (?); Overcast gives {D3, D7, D12, D13} [4+, 0-] (Yes); Rain gives {D4, D5, D6, D10, D14} [3+, 2-] (?).)
Which attribute should be tested at the Sunny node? Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, Humidity) = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970
Gain(Ssunny, Temperature) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019
26
Stopping Conditions
1. Every attribute has already been included along this path through the tree.
2. The training examples associated with each leaf node all have the same target attribute value (i.e., their entropy is zero).
Notice: the ID3 algorithm uses Information Gain; C4.5, its successor, uses Gain Ratio (a variant of Information Gain).
27
Over-fitting in Decision Trees
The generated tree may overfit the training data: too many branches, some of which may reflect anomalies due to noise or outliers; the result is poor accuracy for unseen samples.
Two approaches to avoid overfitting:
Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".
28
Converting a Tree to Rules
(Figure: the play-tennis tree with outlook at the root: sunny leads to a humidity test (high: no, normal: yes), o'cast leads to yes, rain leads to a wind test (true: no, false: yes).)
Each path from the root to a leaf becomes one rule, e.g.:
IF (Outlook = Sunny) and (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) and (Humidity = Normal) THEN PlayTennis = Yes
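A possible sketch of this conversion, assuming the tree is stored as a nested dictionary like the one produced by the earlier build_tree sketch; tree_to_rules and the hard-coded play-tennis tree are illustrative names and data.

```python
def tree_to_rules(tree, conditions=()):
    """Turn each root-to-leaf path of a nested-dict tree into one IF ... THEN rule."""
    if not isinstance(tree, dict):                       # reached a leaf: emit one rule
        antecedent = " and ".join(f"({a} = {v})" for a, v in conditions)
        return [f"IF {antecedent} THEN PlayTennis = {tree}"]
    rules = []
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"True": "No", "False": "Yes"}}}}
for rule in tree_to_rules(tree):
    print(rule)
```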
29
Attributes with Many Values
If an attribute has many values (e.g., days of the month), information gain is biased toward it and ID3 will tend to select it. C4.5 uses GainRatio instead.
30
R-measure and others K417-Part 2
31
CABRO with R-measure in D2MS
T2.5D provides views in between 2D and 3D. The pruned tree learned from the stomach cancer data by CABRO has 1,795 nodes; the active wide path to the node in focus displays all of its direct relatives in a 2D view. Top-left window: the plan tree. Bottom-left window: the summary table of performance metrics of the discovered models. Bottom-right window: detailed information on the model currently activated (highlighted). Top-right window: tightly-coupled views of the model, where the detail view in the right part corresponds to the field-of-view specified in the left part.
32
Mining with decision trees
Attribute selection. Pruning trees. From trees to rules (high cost of pruning). Visualization. Data access: recent developments handle very large training sets in a fast, efficient and scalable way (in-memory and on secondary storage). Well-known systems: C4.5 and CART.
33
Scalable Decision Tree Induction Methods in Data Mining Studies
SLIQ (Mehta et al., 1996): builds an index for each attribute; only the class list and the current attribute list reside in memory.
SPRINT (J. Shafer et al., 1996): constructs an attribute-list data structure.
PUBLIC (Rastogi & Shim, 1998): integrates tree splitting and tree pruning; stops growing the tree earlier.
RainForest (Gehrke, Ramakrishnan & Ganti, 1998): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label).
34
Outline of lecture 3 What is Classification? What is Prediction?
Issues Regarding Classification and Prediction Classification by Decision Tree Induction Bayesian Classification Classification by Backpropagation Other Classification Methods K417-Part 2
35
Why Bayesian Classification?
Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
36
Bayesian classification
The classification problem may be formalized using a-posteriori probabilities: P(C|X) = probability that the sample tuple X=<x1,…,xk> is of class C. Example: P(class=N | outlook=sunny, temperature=hot, humidity=high, wind=true) Idea: assign to sample X the class label C such that P(C|X) is maximal K417-Part 2
37
Estimating a-posteriori probabilities
Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X)
P(X) is constant for all classes; P(C) = relative frequency of class C samples.
The C such that P(C|X) is maximum is the C such that P(X|C)·P(C) is maximum.
Problem: computing P(X|C) is infeasible!
38
Naïve Bayesian Classification
Naïve assumption: attribute independence, P(x1,…,xk|C) = P(x1|C) × … × P(xk|C).
If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C.
If the i-th attribute is continuous: P(xi|C) is estimated by a Gaussian density function.
Computationally easy in both cases.
39
Play-tennis example: estimating P(xi|C)
outlook: P(sunny|p) = 2/9, P(sunny|n) = 3/5; P(overcast|p) = 4/9, P(overcast|n) = 0; P(rain|p) = 3/9, P(rain|n) = 2/5
temperature: P(hot|p) = 2/9, P(hot|n) = 2/5; P(mild|p) = 4/9, P(mild|n) = 2/5; P(cool|p) = 3/9, P(cool|n) = 1/5
humidity: P(high|p) = 3/9, P(high|n) = 4/5; P(normal|p) = 6/9, P(normal|n) = 2/5
windy: P(true|p) = 3/9, P(true|n) = 3/5; P(false|p) = 6/9, P(false|n) = 2/5
P(p) = 9/14, P(n) = 5/14
40
Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, false>.
P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = (3/9)·(2/9)·(3/9)·(6/9)·(9/14) = 0.010582
P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = (2/5)·(2/5)·(4/5)·(2/5)·(5/14) = 0.018286
Sample X is classified in class n (don't play).
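The arithmetic can be checked with a few lines of Python using the probability table above; the variable names are illustrative.

```python
p_score = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)   # P(rain|p) P(hot|p) P(high|p) P(false|p) P(p)
n_score = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)   # P(rain|n) P(hot|n) P(high|n) P(false|n) P(n)
print(p_score, n_score)                             # ~0.0106 vs ~0.0183
print("class:", "p (play)" if p_score > n_score else "n (don't play)")
```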
41
The independence hypothesis …
It makes computation possible. It yields optimal classifiers when satisfied, but it is seldom satisfied in practice, as attributes (variables) are often correlated. Attempts to overcome this limitation: Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes; decision trees, which reason on one attribute at a time, considering the most important attributes first.
42
Bayesian Belief Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent. It is a graphical model of causal relationships. Several cases of learning Bayesian belief networks: given both the network structure and all the variables (easy); given the network structure but only some of the variables; when the network structure is not known in advance.
43
Bayesian Belief Networks
Bayesian belief networks allow class conditional independencies to be defined between subsets of variables. First component: a directed acyclic graph where each node represents a random variable and each arc represents a probabilistic dependence. Second component: one conditional probability table (CPT) for each variable. Bayesian networks (belief networks, probabilistic networks) provide a model of causal relationships and can be learned.
44
Outline of lecture 3 What is Classification? What is Prediction?
Issues Regarding Classification and Prediction Classification by Decision Tree Induction Bayesian Classification Classification by Backpropagation Other Classification Methods K417-Part 2
45
Neural Networks
Advantages: prediction accuracy is generally high; robust, works when training examples contain errors; output may be discrete, real-valued, or a vector of several discrete or real-valued attributes; fast evaluation of the learned target function.
Criticism: long training time; difficult to understand the learned function (weights); not easy to incorporate domain knowledge.
46
A Multilayer Feed-Forward Neural Network
The backpropagation algorithm performs learning on a multilayer feed-forward neural network. The input units correspond to the attributes measured for the training samples. The second layer consists of hidden units. The output layer emits the network's prediction for a given sample. Weights are associated with the arcs.
47
A Neuron
(Figure: a single neuron: the input vector x = (x0, x1, ..., xn) is combined with the weight vector w = (w0, w1, ..., wn) by a weighted sum Σi wi·xi, offset by μk, and passed through an activation function f to produce the output y.)
The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping: y = f(Σi wi·xi - μk).
48
Defining a Network Topology
Before training, the user must decide on the network topology: number of layers, number of units in each layer Typically, input values are normalized so as to fall between 0.0 and 1.0. Discrete valued attributes may be encoded such that there is one input unit per domain value One output unit may be used to represent two classes. If more than two classes, then one output unit per class is used No clear rule as to the “best” number of hidden layer units. Network design is a trial-and-error process K417-Part 2
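A small sketch of the two encoding conventions mentioned above (one input unit per domain value, and normalization to fall between 0.0 and 1.0); one_hot and min_max are hypothetical helper names and the outlook/temperature values are just examples.

```python
def one_hot(value, domain):
    """One input unit per domain value of a discrete attribute."""
    return [1.0 if value == v else 0.0 for v in domain]

def min_max(value, lo, hi):
    """Normalize a numeric input so that it falls between 0.0 and 1.0."""
    return (value - lo) / (hi - lo)

# e.g. encode outlook = "rain" plus a temperature of 70 on an assumed 60-90 scale
x = one_hot("rain", ["sunny", "overcast", "rain"]) + [min_max(70, 60, 90)]
print(x)   # [0.0, 0.0, 1.0, 0.333...]
```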
49
Network Training
The ultimate objective of training: obtain a set of weights that makes almost all the tuples in the training data classified correctly.
Steps:
Initialize the weights with random values.
Feed the input tuples into the network one by one.
For each unit: compute the net input to the unit as a linear combination of all the inputs to the unit; compute the output value using the activation function; compute the error.
50
Backpropagation: Idea
Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with its known class label (the error). Weights are modified to minimize the mean squared error; the modifications are made in the "backward" direction: from the output layer, through each hidden layer, down to the first hidden layer.
Input: the training samples, the learning rate l, a multilayer feed-forward network.
Output: a neural network trained to classify the samples.
51
Back-propagation: Idea
(Figure: (1) the input vector xi enters the (2) input nodes; weights wij connect them to the (3) hidden nodes and then to the (4) output nodes, which emit the (5) output vector. Inputs are propagated forward through the layers, while the errors are propagated backward from the output nodes through the hidden nodes.)
52
Back-propagation algorithm
Initialize all weights and biases in the network;
while the stopping condition is not satisfied {
  for each training sample X in samples {
    /* propagate the inputs forward */
    for each hidden or output layer unit j {
      Ij = Σi wij·Oi + θj;            /* net input of unit j w.r.t. the previous layer */
      Oj = 1 / (1 + e^(-Ij));
    }
    /* backpropagate the errors */
    for each unit j in the output layer
      Errj = Oj·(1 - Oj)·(Tj - Oj);   /* compute the error */
    for each unit j in the hidden layers, from the last to the first hidden layer
      Errj = Oj·(1 - Oj)·Σk Errk·wjk; /* error w.r.t. the next higher layer k */
    for each weight wij in the network {
      Δwij = (l)·Errj·Oi;             /* weight increment */
      wij = wij + Δwij;               /* weight update */
    }
    for each bias θj in the network {
      Δθj = (l)·Errj;                 /* bias increment */
      θj = θj + Δθj;                  /* bias update */
    }
  }
}
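For concreteness, here is a minimal runnable Python sketch of the update rules above for a network with one hidden layer and a single sigmoid output unit. The names train_backprop and predict, the XOR toy data, and the chosen learning rate, hidden-layer size, and epoch count are illustrative assumptions.

```python
import math, random

def train_backprop(samples, n_hidden=4, l=0.5, epochs=10000, seed=1):
    """One-hidden-layer network: sigmoid outputs Oj = 1/(1+e^-Ij), errors Errj,
    and increments delta_w = l * Errj * Oi (each bias is stored as the last weight)."""
    random.seed(seed)
    n_in = len(samples[0][0])
    sigmoid = lambda I: 1.0 / (1.0 + math.exp(-I))
    w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]   # single output unit

    for _ in range(epochs):
        for x, target in samples:
            # propagate the inputs forward
            o_hidden = [sigmoid(sum(w * xi for w, xi in zip(wj, x)) + wj[-1]) for wj in w_hidden]
            o_out = sigmoid(sum(w * o for w, o in zip(w_out, o_hidden)) + w_out[-1])
            # backpropagate the errors
            err_out = o_out * (1 - o_out) * (target - o_out)
            err_hidden = [o * (1 - o) * err_out * w_out[j] for j, o in enumerate(o_hidden)]
            # update weights and biases
            for j, o in enumerate(o_hidden):
                w_out[j] += l * err_out * o
            w_out[-1] += l * err_out
            for j, wj in enumerate(w_hidden):
                for i, xi in enumerate(x):
                    wj[i] += l * err_hidden[j] * xi
                wj[-1] += l * err_hidden[j]
    return w_hidden, w_out

def predict(x, w_hidden, w_out):
    sigmoid = lambda I: 1.0 / (1.0 + math.exp(-I))
    o_hidden = [sigmoid(sum(w * xi for w, xi in zip(wj, x)) + wj[-1]) for wj in w_hidden]
    return sigmoid(sum(w * o for w, o in zip(w_out, o_hidden)) + w_out[-1])

# XOR as a toy training set; the learned outputs should move toward 0, 1, 1, 0
xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
wh, wo = train_backprop(xor)
for x, t in xor:
    print(x, round(predict(x, wh, wo), 2), "target", t)
```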
53
Outline of lecture 3 What is Classification? What is Prediction?
Issues Regarding Classification and Prediction Classification by Decision Tree Induction Bayesian Classification Classification by Backpropagation Other Classification Methods K417-Part 2
54
Other Classification Methods
k-Nearest Neighbor Classifiers Case-based Reasoning Genetic Algorithms Rough Set Approach Fuzzy Set Approaches K417-Part 2
55
Instance-based Classification
Uses the most similar individual instances seen in the past to classify a new instance. Typical approaches: k-nearest neighbor (instances represented as points in a Euclidean space); locally weighted regression (constructs a local approximation); case-based reasoning (uses symbolic representations and knowledge-based inference).
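A minimal k-nearest-neighbor sketch, assuming instances are points in a Euclidean space paired with class labels; knn_classify and the toy training points are illustrative.

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """Majority vote among the k training instances nearest to `query`;
    `training` holds (point, label) pairs in a Euclidean space."""
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]
print(knn_classify((1.1, 1.0), training))   # "A"
```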
56
Genetic algorithms (GA)
GA: based on an analogy to biological evolution. Each rule is represented by a string of bits. An initial population is created consisting of randomly generated rules, e.g., IF A1 and Not A2 THEN C2 can be encoded as 100. Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring. The fitness of a rule is represented by its classification accuracy on a set of training examples. Offspring are generated by crossover and mutation.
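A toy sketch of this loop over bit-string rules, with survival of the fittest, single-point crossover, and bit-flip mutation; evolve, the tiny population, and the match-the-target fitness function are illustrative assumptions rather than a full rule-learning GA.

```python
import random

def evolve(population, fitness, generations=50, p_mutate=0.01, seed=0):
    """Keep the fitter half of the population, then refill it with children
    produced by single-point crossover plus bit-flip mutation."""
    random.seed(seed)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: len(ranked) // 2]             # survival of the fittest
        offspring = []
        while len(survivors) + len(offspring) < len(population):
            a, b = random.sample(survivors, 2)
            point = random.randrange(1, len(a))            # single-point crossover
            child = a[:point] + b[point:]
            child = "".join(bit if random.random() > p_mutate else str(1 - int(bit))
                            for bit in child)               # bit-flip mutation
            offspring.append(child)
        population = survivors + offspring
    return max(population, key=fitness)

# toy fitness: how closely a 3-bit rule string matches the target encoding "100"
best = evolve(["000", "011", "111", "101"],
              fitness=lambda r: sum(a == b for a, b in zip(r, "100")))
print(best)   # likely "100"
```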
57
Rough set approach
Rough sets are used to approximately or "roughly" define equivalence classes. A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C). Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix can be used to reduce the computational intensity.
58
Fuzzy set approach
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as a fuzzy membership graph). Attribute values are converted to fuzzy values, e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated. For a given new sample, more than one fuzzy value may apply. Each applicable rule contributes a vote for membership in the categories.
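As an illustration of mapping an attribute value to fuzzy categories, here is a sketch using triangular membership functions; the helper triangular and the income break-points are hypothetical.

```python
def triangular(x, a, b, c):
    """Degree of membership in a triangular fuzzy set that peaks at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# hypothetical fuzzy categories for income (in $1000s); a single value can
# belong to more than one category with different degrees of membership
income = 42
memberships = {
    "low": triangular(income, 0, 20, 45),
    "medium": triangular(income, 30, 50, 70),
    "high": triangular(income, 60, 90, 120),
}
print(memberships)   # e.g. low ~0.12, medium ~0.6, high 0.0
```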