Chapter 7 Classification and Prediction

Classification vs. Prediction
- Classification: predicts categorical class labels (discrete or nominal); constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.
- Prediction: models continuous-valued functions, i.e., predicts unknown or missing values.
- Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis.

Introduction Classification and Prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends

Introduction
- Classification: predicts categorical labels.
- Prediction: models continuous-valued functions.

Classification Techniques
- Decision tree induction
- Bayesian classification
- Bayesian belief networks
- Neural networks
- k-nearest neighbor classifiers
- Case-based reasoning
- Genetic algorithms
- Rough sets
- Fuzzy logic

Prediction Techniques
- Neural networks
- Linear regression
- Non-linear regression
- Generalized linear regression

Classification is a two-step process:
1. A model is built describing a predetermined set of data classes or concepts.
2. The model is used for classification.

Preparing Data for Classification and Prediction
- Data cleaning: remove or reduce noise; treat missing values.
- Relevance analysis: remove irrelevant or redundant attributes from the learning process.
- Data transformation: data can be generalized to higher-level concepts; normalization may be involved.

Classification Process (1): Model Construction
[Figure: a classification algorithm is applied to the training data to produce the classifier (model), e.g. the rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes']

Classification Process (2): Use the Model
[Figure: the classifier is first evaluated on the testing data, then applied to unseen data, e.g. (Jeff, Professor, 4) -> Tenured?]

Supervised vs. Unsupervised Learning
- Supervised learning (classification): the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation; new data is classified based on the training set.
- Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Accuracy of Model
- Accuracy is the percentage of test-set samples that are correctly classified.
- k-fold cross-validation (e.g. 8-fold, 10-fold): the dataset is divided into k subsets (eight subsets, ten subsets); see the sketch below.
- The accuracy of the model is estimated using a training set and a separate test set.
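A minimal sketch of k-fold cross-validation. The `train` and `evaluate` callables are hypothetical placeholders for whatever classifier is being assessed, and X is assumed to be a NumPy array of features:

```python
import numpy as np

def k_fold_accuracy(X, y, k, train, evaluate):
    """Estimate model accuracy with k-fold cross-validation.

    train(X_train, y_train) -> model and evaluate(model, X_test, y_test) -> accuracy
    are assumed, user-supplied callables (hypothetical names, not from the slides).
    """
    n = len(y)
    indices = np.random.permutation(n)          # shuffle before splitting
    folds = np.array_split(indices, k)          # k roughly equal subsets
    accuracies = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])
        accuracies.append(evaluate(model, X[test_idx], y[test_idx]))
    return np.mean(accuracies)                  # average accuracy over the k test folds
```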

Accuracy of Model
[Figure: the data is split into a training set, used to derive the classifier, and a test set, used to estimate its accuracy]

Comparing Classification Methods
- Predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data.
- Speed: the computational cost involved in generating and using the model.

Comparing Classification Methods
- Robustness: the ability of the model to make correct predictions given noisy data or data with missing values.
- Scalability: the ability to construct the model efficiently given large amounts of data.

Comparing Classification Methods
- Interpretability: the level of understanding and insight provided by the model.
- Compactness of the model: the size of the tree, or the number of rules.

How is prediction different from classification? Prediction can be viewed as the construction and use of a model to assess the value, or value range, of an attribute that an unlabeled sample is likely to have, rather than its categorical class.

Applications Credit Approval Finance Marketing Medical Diagnosis Telecommunications

Classification Methods: Decision Tree Induction
- A flow-chart-like tree structure.
- Internal node: a test on an attribute.
- Branch: represents an outcome of the test.
- Leaf node: represents a class or class distribution.

Training Dataset This follows an example from Quinlan's ID3. [Table: the 14-sample buys_computer training data with attributes age, income, student and credit_rating]

Output: A Decision Tree for "buys_computer"
age?
  <=30   -> student?        (no -> no, yes -> yes)
  31..40 -> yes
  >40    -> credit_rating?  (excellent -> no, fair -> yes)
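The induced tree can be read directly as nested tests. A sketch of the tree above as Python code, assuming age is numeric and the other attribute values are the strings shown on the slide:

```python
def buys_computer(age, student, credit_rating):
    """Classify a customer by tracing the buys_computer decision tree above."""
    if age <= 30:
        return "yes" if student == "yes" else "no"          # youth: decided by student
    elif age <= 40:
        return "yes"                                         # 31..40: always yes
    else:
        return "yes" if credit_rating == "fair" else "no"   # >40: decided by credit rating

print(buys_computer(age=28, student="yes", credit_rating="fair"))   # -> "yes"
```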

Inducing a Decision Tree
- There are many possible trees; let's try it on credit data.
- How do we find the most compact tree that is consistent with the data?

Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning (see the sketch below):
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed to classify the leaf.
- There are no samples left.
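A compact sketch of this greedy, divide-and-conquer loop. The `select_attribute` callable is a hypothetical placeholder for the heuristic measure (e.g. information gain, shown a few slides below); samples are assumed to be dicts of categorical attribute values:

```python
from collections import Counter

def induce_tree(samples, labels, attributes, select_attribute):
    """Top-down, recursive decision tree induction (ID3-style sketch)."""
    if not samples:                         # no samples left in this partition
        return None                         # caller may substitute the parent's majority class
    if len(set(labels)) == 1:               # all samples belong to the same class -> leaf
        return labels[0]
    if not attributes:                      # no attributes left -> leaf by majority vote
        return Counter(labels).most_common(1)[0][0]
    best = select_attribute(samples, labels, attributes)    # heuristic choice, e.g. information gain
    node = {best: {}}
    for value in set(s[best] for s in samples):             # partition on each value of `best`
        idx = [i for i, s in enumerate(samples) if s[best] == value]
        node[best][value] = induce_tree(
            [samples[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attributes if a != best],
            select_attribute,
        )
    return node
```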

Building a Compact Tree The key decision in building a decision tree is which attribute to branch on. The heuristic is to choose the attribute with the maximum information gain, a measure based on information theory. Another way to see it: we want to reduce uncertainty as much as possible.

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Let S be a set of s data samples and si the number of samples in class Ci, for i = 1, ..., m.
- The expected information needed to classify a given sample is I(s1, ..., sm) = -sum_i (si/s) log2(si/s).
- The entropy of attribute A with distinct values {a1, a2, ..., av} is E(A) = sum_j ((s1j + ... + smj)/s) * I(s1j, ..., smj), where sij is the number of samples of class Ci in subset Sj.
- The information gained by branching on attribute A is Gain(A) = I(s1, ..., sm) - E(A).
(Speaker notes: on each node we need to decide a test attribute; one popular method is the information gain measure. The basic idea is to select the attribute with the highest information gain, computed from the expected information I needed to classify a given sample and the entropy E based on partitioning the samples into subsets by A.)

Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes" (9 samples); Class N: buys_computer = "no" (5 samples).
- I(p, n) = I(9, 5) = 0.940.
- Compute the entropy for age: "age <= 30" has 5 of the 14 samples, with 2 yes's and 3 no's; "31..40" has 4 samples, all yes; ">40" has 5 samples, with 3 yes's and 2 no's.
- Hence E(age) = 5/14 I(2,3) + 4/14 I(4,0) + 5/14 I(3,2) = 0.694, and Gain(age) = I(9,5) - E(age) = 0.246.
- Similarly, the gains of income, student and credit_rating are computed; they are all smaller, so age is chosen as the test attribute at the root.
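The numbers above can be reproduced with a short calculation; a sketch using the class counts of the 14-sample buys_computer data and the age partition counts from this slide:

```python
import math

def info(counts):
    """Expected information I(s1, ..., sm) = -sum(p_i * log2(p_i))."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Class distribution over the whole training set: 9 "yes", 5 "no"
print(round(info([9, 5]), 3))          # 0.94

# Entropy of age: partitions <=30 (2 yes, 3 no), 31..40 (4 yes, 0 no), >40 (3 yes, 2 no)
e_age = 5/14 * info([2, 3]) + 4/14 * info([4, 0]) + 5/14 * info([3, 2])
print(round(e_age, 3))                 # 0.694

# Information gain of branching on age
print(round(info([9, 5]) - e_age, 3))  # ~0.247 (the slide's 0.246 comes from rounded intermediates)
```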

Classification Methods: Decision Tree Induction
- Classifying an unknown sample: a path is traced from the root to a leaf node, and the leaf holds the class prediction for that sample.
- Rule generation: each attribute-value pair along a given path forms a conjunct in the rule antecedent, and the leaf node's class is the consequent (see Example 7.3, p. 271 of the text).

Avoid Overfitting in Classification
A tree may overfit the training data:
- Good accuracy on training data but poor accuracy on test examples.
- Too many branches, some of which may reflect anomalies due to noise or outliers.

Tree Pruning
Pruning removes the least reliable branches, which results in:
- Faster classification.
- Better accuracy in correctly classifying test data.
Two approaches to avoid overfitting: prepruning and postpruning.

Tree Pruning: Prepruning
- Halt tree construction early: decide not to further split or partition.
- The node becomes a leaf holding the most frequent class among the subset samples.
- Measures such as chi-square and information gain can be used to assess the goodness of a split.
- It is difficult to determine an appropriate threshold: a high threshold gives oversimplified trees, a low threshold gives very little simplification.

Tree Pruning: Postpruning
- Remove branches from a "fully grown" tree.
- The cost-complexity algorithm can be used: it calculates the expected error rate of each subtree. If pruning a node leads to a greater expected error rate, the subtree is kept; otherwise, it is pruned.

- The pruning that minimizes the error rate is preferred.
- The two approaches can be combined (hybrid).
- Postpruning requires more computation than prepruning but generally leads to a more reliable tree.

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication.

Bayesian Classification
- Bayesian classifiers are statistical classifiers.
- They can predict class membership probabilities.
- They are based on Bayes' theorem.

Bayesian Theorem: Basics
- Let X be a data sample whose class label is unknown.
- Let H be the hypothesis that X belongs to class C.
- For classification problems, we want P(H|X): the probability that the hypothesis holds given the observed data sample X.
- P(H): the prior probability of hypothesis H (the initial probability before we observe any data; reflects background knowledge).
- P(X): the probability that the sample data is observed.
- P(X|H): the probability of observing the sample X given that the hypothesis holds.

Bayesian Theorem
- Given data X, the posterior probability of a hypothesis H follows Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X).
- Informally: posterior = likelihood x prior / evidence.
- Practical difficulty: it requires initial knowledge of many probabilities, at significant computational cost.

Naïve Bayesian Classification
- There are m classes C1, C2, ..., Cm.
- Given an unknown data sample X (i.e., having no class label), the classifier predicts that X belongs to the class having the highest posterior probability conditioned on X, i.e., the class Ci for which P(Ci|X) > P(Cj|X) for all 1 <= j <= m, j != i.

Naïve Bayes Classifier
- Simplifying assumption: attributes are conditionally independent given the class, i.e., there are no dependence relations between attributes.
- Example: we wish to predict the class label of an unknown sample using the naïve Bayesian classifier, given the following table. The unknown sample is X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair").

Training Dataset
- Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'.
- Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair).
[Table: the 14-sample buys_computer training data]

Naïve Bayesian Classifier: Example
Compute P(X|Ci) for each class:
- P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
- P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
- P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
- P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
- P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
- P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
- P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
- P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
- P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
- P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
Multiplying by the priors P(Ci):
- P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
- P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007
Therefore X belongs to class "buys_computer = yes".
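The arithmetic on this slide can be checked with a few lines; a sketch using the conditional probabilities read off the 14-sample table (priors 9/14 and 5/14):

```python
# Prior probabilities from the training data (9 "yes", 5 "no" out of 14 samples)
prior = {"yes": 9/14, "no": 5/14}

# P(attribute value | class) for the unknown sample
# X = (age<=30, income=medium, student=yes, credit_rating=fair)
likelihood = {
    "yes": [2/9, 4/9, 6/9, 6/9],   # age, income, student, credit given buys_computer=yes
    "no":  [3/5, 2/5, 1/5, 2/5],   # same conditionals given buys_computer=no
}

for c in ("yes", "no"):
    p_x_given_c = 1.0
    for p in likelihood[c]:
        p_x_given_c *= p                       # naive independence: multiply the conditionals
    score = p_x_given_c * prior[c]             # P(X|Ci) * P(Ci)
    print(c, round(p_x_given_c, 3), round(score, 3))
# yes 0.044 0.028  /  no 0.019 0.007  -> predict buys_computer = "yes"
```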

Naïve Bayesian Classifier: Comments
Advantages:
- Easy to implement.
- Good results obtained in most cases.
Disadvantages:
- The class-conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables.
- E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.); dependencies among these cannot be modeled by the naïve Bayesian classifier.
How to deal with these dependencies? Bayesian belief networks.

Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent.
- It provides a graphical model of causal relationships on which learning can be performed.
- It represents dependencies among the variables and gives a specification of the joint probability distribution.
- Nodes: random variables. Links: dependencies.
- Example: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P.
- The graph has no loops or cycles.
[Figure: a four-node network with arcs X -> Z, Y -> Z, Y -> P]

Bayesian Belief Network: An Example
[Figure: a belief network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer]
The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:

         (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC       0.8       0.5        0.7        0.1
  ~LC      0.2       0.5        0.3        0.9
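A minimal sketch of how the CPT is used: look up P(LungCancer | FamilyHistory, Smoker) for one combination of the parents (values follow the table above; the function name is illustrative):

```python
# CPT for LungCancer, indexed by (FamilyHistory, Smoker); values are P(LC = yes | parents)
cpt_lc = {
    (True,  True):  0.8,
    (True,  False): 0.5,
    (False, True):  0.7,
    (False, False): 0.1,
}

def p_lung_cancer(lc, family_history, smoker):
    """Conditional probability of LungCancer given its two parent variables."""
    p_yes = cpt_lc[(family_history, smoker)]
    return p_yes if lc else 1.0 - p_yes

# e.g. P(LC = yes | FH = yes, S = no) = 0.5
print(p_lung_cancer(True, True, False))
```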

Learning Bayesian Networks Several cases:
- Network structure known and all variables observable: learn only the CPTs.
- Network structure known, some variables hidden: use gradient descent, analogous to neural network learning.
- Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology.
- Structure unknown, all variables hidden: no good algorithms are known for this case.
Reference: D. Heckerman, Bayesian Networks for Data Mining.

Linear Classification
- A binary classification problem.
- The data above the red line belongs to class 'x'; the data below the red line belongs to class 'o'.
- Examples: SVM, perceptron, probabilistic classifiers.
[Figure: a scatter plot of 'x' and 'o' points separated by a straight line]

Use of Association Rules: Classification
- Classification: mine a small set of rules existing in the data to form a classifier or predictor; there is a target attribute (on the right-hand side of each rule), the class attribute.
- Association: has no fixed target, but we can fix one.

Class Association Rules (CARs)
- Mining rules with a fixed target: the right-hand side of each rule is fixed to a single attribute, which can have a number of values.
- E.g., X = a, Y = d -> Class = yes; X = b -> Class = no.
- Such rules are called class association rules.

Mining Class Association Rules
An itemset in class association rules has the form <condset, class_value>, where:
- condset: a set of items;
- item: an attribute-value pair, e.g., attribute1 = a;
- class_value: a value of the class attribute.

Classification Based on Associations (CBA)
Two steps (see the sketch below):
1. Find all class association rules, using a modified Apriori algorithm.
2. Build a classifier. There can be many ways, e.g., choose a small set of rules that cover the data.
Numeric attributes need to be discretized.
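A toy sketch of step 1 for one-item condsets only; the Apriori-style extension to larger condsets and the rule-selection step of step 2 are omitted. The data, thresholds, and function name are illustrative, not from the slides:

```python
from collections import defaultdict

def single_item_cars(rows, class_attr, min_sup=0.3, min_conf=0.7):
    """Mine class association rules with one-item condsets: (attr = value) -> class."""
    n = len(rows)
    cond_count = defaultdict(int)      # support count of <condset>
    rule_count = defaultdict(int)      # support count of <condset, class_value>
    for row in rows:
        cls = row[class_attr]
        for attr, value in row.items():
            if attr == class_attr:
                continue
            cond_count[(attr, value)] += 1
            rule_count[(attr, value, cls)] += 1
    rules = []
    for (attr, value, cls), cnt in rule_count.items():
        support = cnt / n
        confidence = cnt / cond_count[(attr, value)]
        if support >= min_sup and confidence >= min_conf:
            rules.append((f"{attr}={value} -> {class_attr}={cls}", support, confidence))
    return rules

rows = [
    {"X": "a", "Y": "d", "Class": "yes"},
    {"X": "a", "Y": "e", "Class": "yes"},
    {"X": "b", "Y": "d", "Class": "no"},
]
print(single_item_cars(rows, "Class"))
```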

Advantages of the CBA Model
- Existing classification systems use table data; CBA can build classifiers from either table-form data or transaction-form data (sparse data).
- CBA is able to find rules that existing classification systems cannot.

Association Rules Can Be Used in Many Ways for Prediction
We have many rules; options include:
- Select a subset of the rules.
- Use Bayesian probability together with the rules.
- Use rule combinations.
A number of such systems have been designed and implemented.

Neural Networks
- Analogy to biological systems (a great example of a good learning system).
- Massive parallelism, allowing for computational efficiency.
- The first learning algorithm came in 1959 from Rosenblatt, who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change the weights to learn to produce that output, using the perceptron learning rule.

A Neuron (refer to pp. 309+)
[Figure: a neuron computing the weighted sum of the input vector x = (x0, x1, ..., xn) with weight vector w = (w0, w1, ..., wn), followed by an activation function f that produces the output y]
The n-dimensional input vector x is mapped onto the output y by means of the scalar product and a nonlinear activation function: y = f(sum_i wi*xi).

Multi-Layer Perceptron
[Figure: a feed-forward network with an input layer (input vector xi), a layer of hidden nodes, and a layer of output nodes (output vector), with weights wij on the connections]

NN Network Structure
- Neuron Y accepts input from neurons X1, X2 and X3.
- The output signals of neurons X1, X2 and X3 are x1, x2 and x3.
- The weights on the connections from X1, X2 and X3 to Y are w1, w2 and w3.
[Figure: a simple artificial neuron Y with inputs X1, X2, X3 and weights w1, w2, w3]

Network Structure
- The net input y_in to neuron Y is the sum of the weighted signals from X1, X2 and X3: y_in = w1*x1 + w2*x2 + w3*x3.
- The activation y of neuron Y is obtained through a function y = f(y_in), for example the logistic sigmoid function.
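A sketch of this computation for the three-input neuron, using the logistic sigmoid as the activation function (the input and weight values are illustrative):

```python
import math

def neuron_output(x, w):
    """Single neuron: weighted-sum net input followed by the logistic sigmoid."""
    y_in = sum(wi * xi for wi, xi in zip(w, x))   # y_in = w1*x1 + w2*x2 + w3*x3
    return 1.0 / (1.0 + math.exp(-y_in))          # y = f(y_in), logistic sigmoid

# Example: signals from X1, X2, X3 and their weights (illustrative values)
print(neuron_output(x=[1.0, 0.5, -1.0], w=[0.4, 0.3, 0.6]))
```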

Activation Functions
- Identity function
- Binary step function
- Binary sigmoid
- Bipolar sigmoid

Architecture of NN: Single Layer
One layer of weights.
[Figure: inputs x1..x4 connected directly to outputs y1, y2 through weights such as w11; an input layer and an output layer]

Architecture of NN: Multi Layer
Two layers of weights.
[Figure: inputs x1..x4 connected to hidden units z1, z2 through weights such as v11, which in turn connect to the outputs; an input layer, a hidden layer and an output layer]

Network Pruning
- Network pruning simplifies the network structure.
- A fully connected network is hard to articulate: N input nodes, h hidden nodes and m output nodes lead to h(m+N) weights.
- Pruning: remove some of the links without affecting the classification accuracy of the network.

Network Rule Extraction
1. Discretize activation values: replace individual activation values by the cluster average while maintaining the network's accuracy.
2. Enumerate the outputs from the discretized activation values to find rules between activation values and outputs.
3. Find the relationship between the inputs and the activation values.
4. Combine the above two to obtain rules relating the outputs to the inputs.

Other Classification Methods
- k-nearest neighbor classifier
- Case-based reasoning
- Genetic algorithms
- Rough set approach
- Fuzzy set approaches

How to Estimate Classification Accuracy or Error Rates
- Partition (training-and-testing): use two independent data sets, e.g., a training set (2/3) and a test set (1/3); used for data sets with a large number of examples.
- Cross-validation: divide the data set into k subsamples, then use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation); for data sets of moderate size.
- Leave-one-out: for small data sets.

Scoring the Data
- Scoring is related to classification.
- Normally, we are interested in only a single class (the positive class), e.g., the buyers class in a marketing database.
- Instead of assigning each test example a definite class, scoring assigns a probability estimate (PE) to indicate the likelihood that the example belongs to the positive class.

Ranking and Lift Analysis
- After each example is given a score, we can rank all examples according to their PEs.
- We then divide the data into n (say 10) bins; a lift curve can be drawn according to how many positive examples fall in each bin. This is called lift analysis (see the sketch below).
- Classification systems can be used for scoring if they produce a probability estimate.
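A sketch of the binning step: rank examples by their probability estimates, split them into 10 bins, and compare each bin's positive rate to the overall rate. The scores and labels below are synthetic, purely for illustration:

```python
import numpy as np

def lift_table(scores, labels, n_bins=10):
    """Lift per bin after ranking examples by score (highest PE first)."""
    order = np.argsort(scores)[::-1]               # rank examples by descending PE
    ranked_labels = np.asarray(labels)[order]
    bins = np.array_split(ranked_labels, n_bins)   # 10 roughly equal bins
    overall_rate = np.mean(labels)                 # baseline positive rate
    return [np.mean(b) / overall_rate for b in bins]   # >1 means the bin is enriched in positives

rng = np.random.default_rng(0)
scores = rng.random(1000)
labels = (rng.random(1000) < scores).astype(int)   # positives are more likely at high scores
print([round(v, 2) for v in lift_table(scores, labels)])
```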

Lift Curve
[Figure: a lift curve plotted over bins 1 through 10]

What Is Prediction?
- Prediction is similar to classification: first construct a model, then use the model to predict unknown values.
- The major method for prediction is regression: linear and multiple regression, and non-linear regression.
- Prediction differs from classification: classification predicts a categorical class label, while prediction models continuous-valued functions.

Predictive Modeling in Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data; one can only predict value ranges or category distributions.
- Method outline: minimal generalization, attribute relevance analysis, generalized linear model construction, prediction.

Regression Analysis and Log-Linear Models in Prediction
- Linear regression: Y = alpha + beta * X. The two parameters alpha and beta specify the line and are estimated from the data at hand, using the least-squares criterion applied to the known values Y1, Y2, ... and X1, X2, ....
- Multiple regression: Y = b0 + b1*X1 + b2*X2. Many nonlinear functions can be transformed into this form.
- Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g. p(a, b, c, d) = alpha_ab * beta_ac * chi_ad * delta_bcd.
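A sketch of estimating alpha and beta by least squares; the experience/salary numbers are illustrative values, not from the slides:

```python
import numpy as np

def fit_line(x, y):
    """Least-squares estimates of alpha (intercept) and beta (slope) for Y = alpha + beta*X."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    alpha = y.mean() - beta * x.mean()
    return alpha, beta

# Example: years of experience vs. salary in $1000s (illustrative values)
x = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
print(fit_line(x, y))   # roughly alpha ~ 23.2, beta ~ 3.5
```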