Chapter 7 Classification and Prediction
Classification vs. Prediction. Classification predicts categorical class labels (discrete or nominal); it constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data. Prediction models continuous-valued functions, i.e., it predicts unknown or missing values. Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis.
Introduction Classification and Prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends
Introduction. Classification predicts categorical labels; prediction models continuous-valued functions.
Classification Techniques: Decision tree induction, Bayesian classification, Bayesian belief networks, Neural networks, k-nearest neighbor classifiers, Case-based reasoning, Genetic algorithms, Rough sets, Fuzzy logic.
Prediction Techniques: Neural networks, Linear regression, Non-linear regression, Generalized linear regression.
Classification is a two-step process: first, a model is built describing a predetermined set of data classes or concepts; second, the model is used for classification.
Preparing data for classification and prediction. Data cleaning: remove or reduce noise and treat missing values. Relevance analysis: remove irrelevant or redundant attributes from the learning process. Data transformation: data may be generalized to higher-level concepts; normalization may be involved.
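To make the normalization step concrete, here is a minimal sketch of min-max normalization in Python; the attribute name and values are made up for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale numeric values into [new_min, new_max] (min-max normalization)."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:                      # constant attribute: avoid division by zero
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]

# Example: normalize a hypothetical 'income' attribute before model construction
incomes = [12000, 35000, 47000, 98000]
print(min_max_normalize(incomes))               # values mapped into [0, 1]
```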
Classification Process (1): Model Construction. Figure: the training data are fed to a classification algorithm, which produces a classifier (model); in the example, the learned rule is IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
Classification Process (2): Use the Model. Figure: the classifier is applied to testing data to estimate accuracy, and then to unseen data; e.g., for the unseen tuple (Jeff, Professor, 4), the model predicts whether tenured = 'yes'.
Supervised vs. Unsupervised Learning. Supervised learning (classification): supervision means the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations; new data are classified based on the training set. Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Accuracy of Model: the percentage of test-set samples that are correctly classified. In the k-fold cross-validation method (e.g., 8-fold or 10-fold), the dataset is divided into k subsets (eight or ten). Model accuracy can be estimated on the training set or, more reliably, on an independent test set.
Accuracy of Model. Figure: the data are split into a training set, used to derive the classifier, and a test set, used to estimate its accuracy.
Comparing classification methods. Predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data. Speed: the computation cost involved in generating and using the model.
Comparing classification methods. Robustness: the ability of the model to make correct predictions given noisy data or data with missing values. Scalability: the ability to construct the model efficiently given large amounts of data.
Comparing classification methods. Interpretability: the level of understanding and insight provided by the model. Compactness of the model: the size of the tree, or the number of rules.
How is prediction different from classification? Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample.
Applications Credit Approval Finance Marketing Medical Diagnosis Telecommunications
Classification methods: Decision Tree Induction. A flowchart-like tree structure: internal nodes denote a test on an attribute, branches represent outcomes of the test, and leaf nodes represent classes or class distributions.
Training Dataset This follows an example from Quinlan’s ID3
Output: A Decision Tree for "buys_computer". Root: age? If age <= 30, test student? (no → buys_computer = no; yes → buys_computer = yes). If age is 30..40, buys_computer = yes. If age > 40, test credit_rating? (excellent → no; fair → yes).
Inducing a decision tree. There are many possible trees consistent with the data (try it on a credit dataset). How do we find the most compact one that is consistent with the data?
Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm) Tree is constructed in a top-down recursive divide-and-conquer manner At start, all the training examples are at the root Attributes are categorical (if continuous-valued, they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning All samples for a given node belong to the same class There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf There are no samples left
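As a concrete illustration of the greedy, top-down procedure described above, here is a small Python sketch of ID3-style tree induction using information gain. The row format (categorical attribute values followed by the class label in the last position) and the function names are assumptions made for the example, not part of the original slides.

```python
import math
from collections import Counter

def entropy(rows):
    """Expected information I(s1, ..., sm) of the class distribution in `rows`."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def build_tree(rows, attributes):
    """Greedy top-down induction: split on the attribute with the highest gain."""
    classes = [row[-1] for row in rows]
    if len(set(classes)) == 1:                  # all samples in one class -> leaf
        return classes[0]
    if not attributes:                          # no attributes left -> majority vote leaf
        return Counter(classes).most_common(1)[0][0]
    base = entropy(rows)

    def gain(attr_idx):
        e = 0.0
        for value in set(row[attr_idx] for row in rows):
            subset = [row for row in rows if row[attr_idx] == value]
            e += len(subset) / len(rows) * entropy(subset)
        return base - e

    best = max(attributes, key=gain)            # attribute with maximum information gain
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = build_tree(subset, remaining)
    return tree

# Usage (hypothetical rows such as ['<=30', 'yes', 'fair', 'yes']):
# tree = build_tree(rows, attributes=[0, 1, 2])
```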
Building a compact tree. The key to building a decision tree is which attribute to choose in order to branch. The heuristic is to choose the attribute with the maximum information gain, based on information theory. Another way to see it: we want to reduce uncertainty as much as possible at each split.
Attribute Selection Measure: Information Gain (ID3/C4.5). Select the attribute with the highest information gain. Let S be a set consisting of s data samples, and let si be the number of samples in class Ci, for i = 1, …, m. The expected information needed to classify a given sample is I(s1, …, sm) = -Σ_{i=1..m} pi log2(pi), where pi = si / s. If attribute A has v distinct values {a1, a2, …, av}, the entropy (expected information) based on partitioning S into subsets by A is E(A) = Σ_{j=1..v} (s1j + … + smj)/s · I(s1j, …, smj), where sij is the number of samples of class Ci in subset Sj. The information gained by branching on attribute A is Gain(A) = I(s1, …, sm) - E(A). In short, I is the expected information needed to classify a given sample and E is the expected information after partitioning by A; at each node we select, as the test attribute, the one with the highest gain.
Attribute Selection by Information Gain Computation. Class P: buys_computer = "yes" (9 samples); class N: buys_computer = "no" (5 samples). I(p, n) = I(9, 5) = 0.940. Compute the entropy for age: "age <= 30" has 5 of the 14 samples, with 2 yes's and 3 no's; "30..40" has 4 samples (4 yes, 0 no); "> 40" has 5 samples (3 yes, 2 no). Hence E(age) = 5/14 · I(2, 3) + 4/14 · I(4, 0) + 5/14 · I(3, 2) = 0.694, so Gain(age) = I(9, 5) - E(age) = 0.246. Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age is selected as the splitting attribute.
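The numbers above can be verified with a few lines of Python; the helper function name is arbitrary.

```python
import math

def info(*counts):
    """Expected information I(s1, ..., sm) for the given class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# I(9, 5): the full training set has 9 'yes' and 5 'no' samples
print(f"I(9,5)    = {info(9, 5):.3f}")           # 0.940

# Entropy of 'age': <=30 has (2 yes, 3 no), 30..40 has (4 yes, 0 no), >40 has (3 yes, 2 no)
e_age = 5/14 * info(2, 3) + 4/14 * info(4, 0) + 5/14 * info(3, 2)
print(f"E(age)    = {e_age:.3f}")                # 0.694
print(f"Gain(age) = {info(9, 5) - e_age:.3f}")   # 0.246
```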
Classification methods: Decision Tree Induction. Classifying an unknown sample: a path is traced from the root to a leaf node, which holds the class prediction for that sample. Rule generation: each attribute-value pair along a given path forms a conjunct in the rule antecedent, and the leaf node forms the consequent. Examples: Slide 21; Example 7.3, pg 271.
Avoid Overfitting in Classification. A tree may overfit the training data: good accuracy on the training data but poor accuracy on test examples, and too many branches, some of which may reflect anomalies due to noise or outliers.
Tree pruning removes the least reliable branches, which results in faster classification and improved accuracy on test data. There are two approaches to avoid overfitting: prepruning and postpruning.
Tree pruning: prepruning. Halt tree construction early by not further splitting or partitioning a node; the node becomes a leaf holding the most frequent class among the subset of samples. Measures such as chi-square (χ²) or information gain can be used to assess the goodness of a split, but it is difficult to determine an appropriate threshold: a high threshold yields oversimplified trees, while a low threshold yields very little simplification.
Tree pruning: postpruning. Remove branches from a "fully grown" tree. A cost-complexity algorithm can be used: it calculates the expected error rate on each subtree; if pruning a node leads to a greater expected error rate, the subtree is kept, otherwise it is pruned.
The pruned tree that minimizes the expected error rate is preferred. The two approaches can be combined (hybrid). Postpruning requires more computation than prepruning but generally leads to a more reliable tree.
Enhancements to basic decision tree induction Allow for continuous-valued attributes Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals Handle missing attribute values Assign the most common value of the attribute Assign probability to each of the possible values Attribute construction Create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication
Bayesian Classification. Bayesian classifiers are statistical classifiers that can predict class-membership probabilities; they are based on Bayes' theorem.
Bayesian Theorem: Basics. Let X be a data sample whose class label is unknown, and let H be the hypothesis that X belongs to class C. For classification problems we want to determine P(H|X): the probability that the hypothesis holds given the observed data sample X. P(H) is the prior probability of hypothesis H (the initial probability before we observe any data, reflecting background knowledge); P(X) is the probability that the sample data is observed; P(X|H) is the probability of observing sample X given that the hypothesis holds.
Bayesian Theorem. Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X). Informally, this can be written as posterior = likelihood × prior / evidence. Practical difficulty: it requires initial knowledge of many probabilities and has significant computational cost.
Naïve Bayesian Classification. There are m classes C1, C2, …, Cm. Given an unknown data sample X (i.e., a test sample with no class label), the classifier predicts that X belongs to the class with the highest posterior probability conditioned on X, i.e., it assigns X to class Ci iff P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Naïve Bayes Classifier. A simplifying assumption: attributes are conditionally independent given the class (no dependence relation between attributes), so P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci). Example: we wish to predict the class label of an unknown sample using naïve Bayes, given the following table. The unknown sample is X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair").
Training dataset. Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'. Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair).
Naïve Bayesian Classifier: Example. Class priors: P(buys_computer="yes") = 9/14 = 0.643, P(buys_computer="no") = 5/14 = 0.357. Compute P(X|Ci) for each class: P(age="<=30" | buys_computer="yes") = 2/9 = 0.222; P(age="<=30" | buys_computer="no") = 3/5 = 0.6; P(income="medium" | buys_computer="yes") = 4/9 = 0.444; P(income="medium" | buys_computer="no") = 2/5 = 0.4; P(student="yes" | buys_computer="yes") = 6/9 = 0.667; P(student="yes" | buys_computer="no") = 1/5 = 0.2; P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667; P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4. For X = (age<=30, income=medium, student=yes, credit_rating=fair): P(X | buys_computer="yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044; P(X | buys_computer="no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019. Then P(X | buys_computer="yes") × P(buys_computer="yes") = 0.028 and P(X | buys_computer="no") × P(buys_computer="no") = 0.007, so X belongs to class buys_computer = "yes".
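A minimal sketch of the same calculation in Python, assuming the conditional probabilities have already been estimated from the 14-sample training set; the dictionary and function names are arbitrary.

```python
# Class priors and conditional probabilities from the 14-sample training set
p_yes, p_no = 9/14, 5/14
cond_yes = {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9}
cond_no  = {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5}

def naive_bayes_score(cond, prior):
    """P(X|Ci) * P(Ci) under the conditional-independence assumption."""
    score = prior
    for p in cond.values():
        score *= p
    return score

score_yes = naive_bayes_score(cond_yes, p_yes)
score_no  = naive_bayes_score(cond_no,  p_no)
print(f"P(X|yes)P(yes) = {score_yes:.3f}, P(X|no)P(no) = {score_no:.3f}")   # 0.028 vs 0.007
print("Predicted class: buys_computer =", "yes" if score_yes > score_no else "no")
```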
Naïve Bayesian Classifier: Comments. Advantages: easy to implement; good results obtained in most cases. Disadvantages: the class conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables. E.g., in hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.); dependencies among these cannot be modeled by the naïve Bayesian classifier. How to deal with these dependencies? Bayesian belief networks.
Bayesian Belief Networks. A Bayesian belief network allows a subset of the variables to be conditionally independent. It provides a graphical model of causal relationships on which learning can be performed; it represents dependencies among the variables and gives a specification of the joint probability distribution. Nodes: random variables. Links: dependencies. In the example graph, X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P. The graph has no loops or cycles.
Bayesian Belief Network: An Example. The network has nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer and Emphysema, and LungCancer is a parent of PositiveXRay and Dyspnea. The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:
        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC       0.8      0.5       0.7       0.1
~LC      0.2      0.5       0.3       0.9
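A small sketch of how the CPT is used: the joint probability of an assignment factorizes into the product of each node's conditional probability given its parents. The priors for FamilyHistory and Smoker below are not given on the slide and are assumed purely for illustration.

```python
# CPT for LungCancer given its parents (FamilyHistory, Smoker), from the slide
cpt_lc = {(True, True): 0.8, (True, False): 0.5, (False, True): 0.7, (False, False): 0.1}

# Hypothetical priors for the parent nodes (NOT given on the slide; assumed for illustration)
p_fh, p_smoker = 0.2, 0.3

# P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S): the joint distribution factorizes over the
# network as the product of each node's conditional probability given its parents
p_joint = p_fh * p_smoker * cpt_lc[(True, True)]
print(f"P(FH=yes, S=yes, LC=yes) = {p_joint:.3f}")   # 0.2 * 0.3 * 0.8 = 0.048
```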
Learning Bayesian Networks Several cases Given both the network structure and all variables observable: learn only the CPTs Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning Network structure unknown, all variables observable: search through the model space to reconstruct graph topology Unknown structure, all hidden variables: no good algorithms known for this purpose D. Heckerman, Bayesian networks for data mining
Linear Classification. A binary classification problem. Figure: a scatter of points separated by a red line (a linear decision boundary); the data above the line belong to class 'x' and the data below the line belong to class 'o'. Examples: SVM, perceptron, probabilistic classifiers.
Use of Association Rules: Classification. Classification: mine a small set of rules existing in the data to form a classifier or predictor; it has a fixed target attribute (on the right-hand side), the class attribute. Association: has no fixed target, but we can fix one.
Class Association Rules (CARs). Mining rules with a fixed target: the right-hand side of the rules is fixed to a single attribute, which can have a number of values. E.g., X = a, Y = d → Class = yes; X = b → Class = no. We call such rules class association rules.
Mining Class Association Rules Itemset in class association rules: <condset, class_value> condset: a set of items item: attribute value pair, e.g., attribute1 = a class_value: a value in class attribute
Classification Based on Associations (CBA). Two steps: (1) find all class association rules, using a modified Apriori algorithm; (2) build a classifier, for which there are many possible strategies, e.g., choosing a small set of rules to cover the data. Numeric attributes need to be discretized.
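A toy sketch of the rule-finding step, restricted to 1-condition rules (attribute = value → class), to show how support and confidence of class association rules are counted; the data layout, thresholds, and function name are assumptions for illustration.

```python
from collections import defaultdict

def mine_cars(rows, min_support=0.15, min_confidence=0.6):
    """Tiny sketch: 1-condset class association rules (attr = value -> class)."""
    n = len(rows)
    condset_count, rule_count = defaultdict(int), defaultdict(int)
    for attrs, cls in rows:                      # each row: ({attribute: value}, class_label)
        for item in attrs.items():
            condset_count[item] += 1            # support count of the condset
            rule_count[(item, cls)] += 1        # support count of <condset, class_value>
    rules = []
    for (item, cls), cnt in rule_count.items():
        support, confidence = cnt / n, cnt / condset_count[item]
        if support >= min_support and confidence >= min_confidence:
            rules.append((item, cls, support, confidence))
    return rules

# Hypothetical transactions
data = [({"age": "<=30", "student": "yes"}, "yes"),
        ({"age": ">40", "student": "no"}, "no"),
        ({"age": "<=30", "student": "no"}, "no")]
print(mine_cars(data))
```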
Advantages of the CBA Model Existing classification systems use Table data. CBA can build classifiers using either Table form data or Transaction form data (sparse data) CBA is able to find rules that existing classification systems cannot.
Assoc. Rules can be Used in Many Ways for Prediction We have so many rules: Select a subset of rules Using Bayesian Probability together with the rules Using rule combinations … A number of systems have been designed and implemented.
Neural Networks Analogy to Biological Systems (great example of a good learning system) Massive Parallelism allowing for computational efficiency The first learning algorithm came in 1959 (Rosenblatt) who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change weights to learn to produce these outputs using the perceptron learning rule
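Since the slide refers to Rosenblatt's perceptron learning rule, here is a minimal sketch of that rule in Python, trained on a made-up linearly separable problem (logical AND); the learning rate, epoch count, and function name are assumptions.

```python
def perceptron_train(samples, labels, lr=0.1, epochs=20):
    """Perceptron learning rule: nudge weights toward the target whenever the
    neuron's output disagrees with the desired output (labels are 0 or 1)."""
    w = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            output = 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0
            error = target - output                       # 0 when the prediction is correct
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            bias += lr * error
    return w, bias

# Toy linearly separable data (hypothetical): learn logical AND
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
print(perceptron_train(X, y))
```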
A Neuron (refer pg 309++). Figure: the n-dimensional input vector x = (x0, x1, …, xn) is combined with the weight vector w = (w0, w1, …, wn) in a weighted sum Σ, a bias μk is applied, and an activation function f produces the output y. In other words, the input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping: y = f(Σi wi·xi + μk).
Multi-Layer Perceptron. Figure: the input vector xi feeds the input nodes, which connect through weights wij to hidden nodes, which in turn connect to the output nodes producing the output vector.
NN network structure. Neuron Y accepts input from neurons X1, X2, and X3; the output signals of X1, X2, and X3 are x1, x2, and x3, and the weights on the connections from X1, X2, and X3 to Y are w1, w2, and w3. (Figure: a simple artificial neuron with inputs, weights, and output.)
Network structure. The net input y_in to neuron Y is the summation of the weighted signals from X1, X2, and X3: y_in = w1·x1 + w2·x2 + w3·x3. The activation y of neuron Y is obtained by applying an activation function, y = f(y_in), for example the logistic sigmoid function.
Activation Functions: identity function, binary step function, binary sigmoid f(x) = 1 / (1 + e^(-x)), and bipolar sigmoid f(x) = 2 / (1 + e^(-x)) - 1.
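A small sketch computing the net input y_in from the previous slide and applying the binary and bipolar sigmoid activations; the weights and input signals are made up for illustration.

```python
import math

def binary_sigmoid(x):
    """f(x) = 1 / (1 + e^(-x)), output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    """f(x) = 2 / (1 + e^(-x)) - 1, output in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

# Net input to neuron Y from X1, X2, X3 (hypothetical weights and signals)
w = [0.5, -0.3, 0.8]
x = [1.0, 2.0, 0.5]
y_in = sum(wi * xi for wi, xi in zip(w, x))      # y_in = w1*x1 + w2*x2 + w3*x3
print(binary_sigmoid(y_in), bipolar_sigmoid(y_in))
```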
Architecture of NN: single layer. Figure: a single-layer network with one layer of weights; input nodes x1, x2, x3, x4 (input layer) connect directly through weights (e.g., w11) to output nodes y1, y2 (output layer).
Architecture of NN: multilayer. Figure: a multilayer network with two layers of weights; input nodes x1, x2, x3, x4 (input layer) connect through weights (e.g., v11) to hidden nodes z1, z2 (hidden layer), which connect to the output layer.
Network Pruning. Network pruning simplifies the network structure. A fully connected network is hard to articulate: N input nodes, h hidden nodes, and m output nodes lead to h(m + N) weights. Pruning removes some of the links without affecting the classification accuracy of the network.
Network Rule Extraction. 1. Discretize activation values, replacing individual activation values by cluster averages while maintaining the network accuracy. 2. Enumerate the outputs from the discretized activation values to find rules between activation values and outputs. 3. Find the relationship between the inputs and activation values. 4. Combine the above two to obtain rules relating the outputs to the inputs.
Other Classification Methods k-nearest neighbor classifier case-based reasoning Genetic algorithm Rough set approach Fuzzy set approaches
How to Estimate Classification Accuracy or Error Rates. Partition (training-and-testing): use two independent data sets, e.g., a training set (2/3) and a test set (1/3); used for data sets with a large number of examples. Cross-validation: divide the data set into k subsamples and use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation); used for data sets of moderate size. Leave-one-out: for small data sets.
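The k-fold procedure can be sketched in a few lines; train_fn and predict_fn are hypothetical placeholders for whatever classifier is being evaluated, and the row format (features, label) is an assumption.

```python
def k_fold_accuracy(rows, k, train_fn, predict_fn):
    """k-fold cross-validation: each subset is held out once as the test set
    while the remaining k-1 subsets are used for training."""
    folds = [rows[i::k] for i in range(k)]            # k roughly equal subsets
    correct = total = 0
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(train)                       # build the classifier on k-1 folds
        for features, label in test:
            correct += (predict_fn(model, features) == label)
            total += 1
    return correct / total                            # overall accuracy across all folds
```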
Scoring the data. Scoring is related to classification. Normally we are interested in only a single class (called the positive class), e.g., the buyers class in a marketing database. Instead of assigning each test example a definite class, scoring assigns a probability estimate (PE) to indicate the likelihood that the example belongs to the positive class.
Ranking and lift analysis. After each example is given a score, we can rank all examples according to their PEs. We then divide the data into n (say 10) bins. A lift curve can be drawn according to how many positive examples fall in each bin; this is called lift analysis. Classification systems can be used for scoring if they produce a probability estimate.
Lift curve. Figure: a lift curve drawn over bins 1-10 (the x-axis is the bin number).
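A minimal sketch of the ranking-and-binning step behind the lift curve, assuming each example is a (probability estimate, is_positive) pair; the function name and bin handling are assumptions.

```python
def lift_by_bin(scored_examples, n_bins=10):
    """Rank examples by their probability estimate (PE), split them into n_bins,
    and report the share of all positive examples captured by each bin."""
    ranked = sorted(scored_examples, key=lambda e: e[0], reverse=True)  # (PE, is_positive)
    total_pos = sum(1 for _, pos in ranked if pos)
    bin_size = len(ranked) // n_bins                  # any remainder is left out of the sketch
    lifts = []
    for b in range(n_bins):
        bin_slice = ranked[b * bin_size:(b + 1) * bin_size]
        pos_in_bin = sum(1 for _, pos in bin_slice if pos)
        lifts.append(pos_in_bin / total_pos if total_pos else 0.0)
    return lifts
```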
What Is Prediction? Prediction is similar to classification First, construct a model Second, use model to predict unknown value Major method for prediction is regression Linear and multiple regression Non-linear regression Prediction is different from classification Classification refers to predict categorical class label Prediction models continuous-valued functions
Predictive Modeling in Databases. Predictive modeling: predict data values or construct generalized linear models based on the database data; one can only predict value ranges or category distributions. Method outline: minimal generalization, attribute relevance analysis, generalized linear model construction, prediction.
Regression Analysis and Log-Linear Models in Prediction. Linear regression: Y = α + βX; the two parameters α and β specify the line and are estimated from the data at hand, using the least-squares criterion on the known values of Y1, Y2, …, X1, X2, …. Multiple regression: Y = b0 + b1·X1 + b2·X2; many nonlinear functions can be transformed into this form. Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = αab · βac · χad · δbcd.
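A small sketch of estimating α and β by least squares on made-up (years of experience, salary) data; the data values and function name are for illustration only.

```python
def fit_linear(xs, ys):
    """Least-squares estimates of alpha and beta in Y = alpha + beta * X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
           sum((x - mean_x) ** 2 for x in xs)
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical data: years of experience vs. salary (in $1000s)
years  = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
alpha, beta = fit_linear(years, salary)
print(f"Y = {alpha:.1f} + {beta:.1f} X")   # the fitted regression line
```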