1
Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Classification Peixiang Zhao
2
Definition
Given a collection of records (the training set)
– Each record contains a set of attributes, one of which is the class
Find a model for the class attribute as a function of the values of the other attributes
Goal: previously unseen records should be assigned a class as accurately as possible
– A test set is used to determine the accuracy of the model
– Usually, the given data set is divided into a training set and a test set, with the training set used to build the model and the test set used to validate it
3
Example: Distinguish Dogs from Cats 2 Deep Blue beat Kasparov at chess in 1997. Watson beat the brightest trivia minds at Jeopardy in 2011. Can you tell Fido from Mittens in 2013? https://www.kaggle.com/c/dogs-vs-cats
4
Example: Distinguish Dogs from Cats What about this? 3 Cat Dog
5
Example: AlphaGo 4
6
Illustration 5
7
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
– Estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction; the accuracy rate is the percentage of test-set samples correctly classified by the model; the test set is independent of the training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify data whose class labels are not known
8
Step 1: Model Construction 7 Training Data Classification Algorithms IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ Classifier (Model)
9
Step 2: Model Usage 8 Classifier Testing Data Unseen Data (Jeff, Professor, 4) Tenured?
10
Supervised vs. Unsupervised Learning
Supervised learning (classification)
– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
11
Prediction
Classification
– predicts categorical class labels (discrete or nominal)
– constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Numeric prediction
– models continuous-valued functions, i.e., predicts unknown or missing values
Applications
– Credit/loan approval
– Medical diagnosis: whether a tumor is cancerous or benign
– Fraud detection: whether a transaction is fraudulent
– Web page categorization: which category a page belongs to
12
Classification Techniques Decision tree based methods Naïve Bayes model Support Vector Machines (SVM) Ensemble methods Model evaluation 11 To be, or not to be, that is the question. --- William Shakespeare
13
Decision Tree Induction
Training data with categorical attributes (Refund, Marital Status), a continuous attribute (Taxable Income), and a class attribute. Model: a decision tree whose splitting attributes are Refund (Yes / No), MarSt (Married / Single, Divorced), and TaxInc (< 80K / > 80K), with YES / NO class labels at the leaves. There could be more than one tree that fits the same data!
14
Decision Tree Classification Task 13
15
Apply Model to Test Data
Start from the root of the tree and follow the branches (Refund, then MarSt, then TaxInc) that match the test record's attribute values.
Apply Model to Test Data
Traversing the tree on the test record ends at a leaf: assign Cheat = “No”.
21
Decision Tree Classification Task 20 Decision Tree
22
A General Algorithm
Basic algorithm (a greedy algorithm)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure
Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
– There are no samples left
23
Decision Tree Induction
Let D_t be the set of training records that reach a node t. General procedure (a sketch follows below):
– If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t
– If D_t is an empty set, then t is a leaf node labeled by the default class y_d
– If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
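The recursive procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the course's reference implementation: records are assumed to be (attribute-dict, label) pairs, attributes are treated as categorical, and the attribute chooser uses a simple weighted Gini (introduced on the following slides).

```python
from collections import Counter

def gini(labels):
    """Node impurity: 1 - sum_j p(j|t)^2 over the class labels reaching the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_attribute(records, attributes):
    """Pick the attribute whose multi-way split gives the lowest weighted Gini."""
    def split_score(attr):
        by_value = {}
        for x, y in records:
            by_value.setdefault(x[attr], []).append(y)
        n = len(records)
        return sum(len(ys) / n * gini(ys) for ys in by_value.values())
    return min(attributes, key=split_score)

def hunt(records, attributes, default_class=None):
    """records: list of (attribute_dict, class_label) pairs reaching this node (D_t)."""
    if not records:                               # D_t is empty -> default class y_d
        return default_class
    labels = [y for _, y in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:   # pure node, or no attributes left
        return majority
    attr = best_attribute(records, attributes)
    children = {}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        children[value] = hunt(subset, [a for a in attributes if a != attr], majority)
    return {"split_on": attr, "children": children}
```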
24
Example
Applying the procedure to the training data grows the tree in stages: first split on Refund (Yes → Don’t Cheat, No → ???), then split the No branch on Marital Status (Married → Don’t Cheat, Single/Divorced → ???), and finally split the Single/Divorced branch on Taxable Income (< 80K → Don’t Cheat, >= 80K otherwise).
25
Decision Tree Induction
Greedy strategy
– Split the training data based on an attribute test that optimizes a certain criterion
Issues
– How to split the training records? How to specify the attribute test condition? This depends on the attribute type (nominal, ordinal, continuous) and on the number of ways to split (e.g., a multi-way split on CarType {Family, Sports, Luxury} or a binary split on Size into {Small, Medium} vs. {Large})
– How to determine the best split? Nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity
– When to stop splitting?
26
How to Determine the Best Split 25 Before Splitting: 10 records of class 0, 10 records of class 1 Which test condition is the best? Non-homogeneous, High degree of impurity Homogeneous, Low degree of impurity
27
How to Determine the Best Split
Measure the impurity M0 before splitting. Splitting on A? (Yes/No) produces nodes N1 and N2 with impurities M1 and M2, combined as M12; splitting on B? produces N3 and N4 with impurities M3 and M4, combined as M34. Compare Gain = M0 – M12 vs. M0 – M34 and choose the split with the larger gain.
28
Measure of Impurity: GINI
Gini index for a given node t: GINI(t) = 1 − Σ_j [p(j | t)]²  (NOTE: p(j | t) is the relative frequency of class j at node t)
– Maximum (1 − 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying the least interesting information
– Minimum (0) when all records belong to one class, implying the most interesting information
29
Example
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0
P(C1) = 1/6, P(C2) = 5/6: Gini = 1 − (1/6)² − (5/6)² = 0.278
P(C1) = 2/6, P(C2) = 4/6: Gini = 1 − (2/6)² − (4/6)² = 0.444
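A small helper, assuming the class counts at a node are given as a list, reproduces the three Gini values above:

```python
def gini(class_counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, computed from the class counts at node t."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

print(gini([0, 6]))   # 0.0
print(gini([1, 5]))   # ~0.278
print(gini([2, 4]))   # ~0.444
```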
30
Splitting Based on GINI
Used in CART, SLIQ, SPRINT
– Classification And Regression Trees (CART) by Leo Breiman
– When a node p is split into k partitions (children), the quality of the split is computed as GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i), where n_i = number of records at child i and n = number of records at node p
31
Example: Binary Attributes
Splits into two partitions. Effect of weighting partitions: larger and purer partitions are sought.
Splitting on B? (Yes / No) gives node N1 (7 records: 5 of one class, 2 of the other) and node N2 (5 records: 1 and 4):
Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
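A sketch of the split-quality computation, with node class counts read off this example:

```python
def gini(class_counts):
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts) if n else 0.0

def gini_split(children_counts):
    """Quality of a split: sum over children of (n_i / n) * GINI(child i)."""
    n = sum(sum(counts) for counts in children_counts)
    return sum(sum(counts) / n * gini(counts) for counts in children_counts)

print(gini_split([(5, 2), (1, 4)]))   # ~0.371, matching Gini(Children) above
```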
32
Example: Categorical Attributes
For each distinct value, gather counts for each class in the dataset and use the count matrix to make decisions. Either a multi-way split (one branch per value) or a two-way split (find the best partition of values).
33
Example: Continuous Attributes
Choices for the splitting value
– Each splitting value v has a count matrix associated with it: class counts in each of the two partitions, A < v and A ≥ v
Algorithm (see the sketch below)
– Sort the attribute values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index
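A hedged sketch of this sort-and-scan procedure; the numbers in the usage line are illustrative, not the deck's training data:

```python
def best_numeric_split(values, labels):
    """Sort a continuous attribute, scan candidate split points while updating
    class counts incrementally, and keep the threshold with the lowest weighted Gini."""
    def gini(counts):
        n = sum(counts.values())
        return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

    pairs = sorted(zip(values, labels))
    left = {c: 0 for c in set(labels)}
    right = dict(left)
    for _, y in pairs:
        right[y] += 1
    n = len(pairs)
    best_score, best_threshold = float("inf"), None
    for i in range(1, n):                        # candidate split between record i-1 and i
        y_prev = pairs[i - 1][1]
        left[y_prev] += 1
        right[y_prev] -= 1
        if pairs[i - 1][0] == pairs[i][0]:
            continue                             # no valid split between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        score = i / n * gini(left) + (n - i) / n * gini(right)
        if score < best_score:
            best_score, best_threshold = score, threshold
    return best_score, best_threshold            # partitions: A < v and A >= v

# Illustrative call (not the deck's data): best split at 80.0 with weighted Gini 0.0
print(best_numeric_split([60, 70, 75, 85, 90, 95], ["No", "No", "No", "Yes", "Yes", "Yes"]))
```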
34
Alternative Splitting Criteria: Information Gain
Entropy at a given node t: Entropy(t) = − Σ_j p(j | t) log₂ p(j | t)  (NOTE: p(j | t) is the relative frequency of class j at node t)
– Measures the homogeneity of a node
  Maximum (log₂ n_c) when records are equally distributed among all classes, implying the least information
  Minimum (0) when all records belong to one class, implying the most information
– Entropy-based computation is similar to the GINI index computation
35
Example
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Entropy = −0 log₂ 0 − 1 log₂ 1 = −0 − 0 = 0
P(C1) = 1/6, P(C2) = 5/6: Entropy = −(1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65
P(C1) = 2/6, P(C2) = 4/6: Entropy = −(2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92
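The same kind of helper for entropy reproduces the three values above:

```python
import math

def entropy(class_counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0*log(0) taken as 0."""
    n = sum(class_counts)
    result = 0.0
    for c in class_counts:
        if c > 0:
            p = c / n
            result -= p * math.log2(p)
    return result

print(entropy([0, 6]))   # 0.0
print(entropy([1, 5]))   # ~0.65
print(entropy([2, 4]))   # ~0.92
```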
36
Information Gain
Definition: GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) Entropy(i), where parent node p is split into k partitions and n_i is the number of records in partition i
– Measures the reduction in entropy achieved because of the split
– Choose the split that achieves the most reduction (maximizes information gain)
– Used in ID3 and C4.5 by Ross Quinlan
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure
37
Gain Ratio
GainRATIO_split = GAIN_split / SplitINFO, with SplitINFO = − Σ_{i=1..k} (n_i / n) log₂(n_i / n), where parent node p is split into k partitions and n_i is the number of records in partition i
– Adjusts information gain by the entropy of the partitioning (SplitINFO); higher-entropy partitioning (a large number of small partitions) is penalized!
– Used in C4.5
– Designed to overcome the disadvantage of information gain
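A sketch of information gain and gain ratio computed from class counts; the (10, 10) parent and its two children below are illustrative numbers, not from the slides:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c) if n else 0.0

def info_gain(parent_counts, children_counts):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child i)."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

def gain_ratio(parent_counts, children_counts):
    """GainRATIO = GAIN_split / SplitINFO, SplitINFO = -sum_i (n_i/n) log2(n_i/n)."""
    n = sum(parent_counts)
    split_info = -sum(sum(c) / n * math.log2(sum(c) / n) for c in children_counts)
    return info_gain(parent_counts, children_counts) / split_info if split_info else 0.0

print(info_gain([10, 10], [(7, 3), (3, 7)]))
print(gain_ratio([10, 10], [(7, 3), (3, 7)]))
```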
38
Splitting Based on Classification Error
Classification error at a node t: Error(t) = 1 − max_i p(i | t)  (NOTE: p(i | t) is the relative frequency of class i at node t)
– Measures the misclassification error made by a node
– Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information
– Minimum (0) when all records belong to one class, implying the most interesting information
39
Example
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Error = 1 − max(0, 1) = 1 − 1 = 0
P(C1) = 1/6, P(C2) = 5/6: Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6
P(C1) = 2/6, P(C2) = 4/6: Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
40
Comparison of Splitting Criteria For a 2-class problem 39
41
When to Stop? Stop expanding a node when – all the records belong to the same class – all the records have similar attribute values Early termination – Related to overfitting 40
42
Underfitting and Overfitting 41 Underfitting: when model is too simple, both training and test errors are large
43
Overfitting due to Noise 42 Decision boundary is distorted by noise point
44
Overfitting due to Insufficient Examples 43 Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
45
Overfitting
Overfitting results in decision trees that are more complex than necessary
– Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
Occam’s Razor
– Given two models with similar errors, one should prefer the simpler model over the more complex one
– For a complex model, there is a greater chance that it was fitted accidentally to errors (noise) in the data
46
How to Address Overfitting
Pre-pruning (early stopping rule)
– Stop the algorithm before it grows a fully-grown tree
– Typical stopping conditions for a node: stop if all instances belong to the same class; stop if all the attribute values are the same
– More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of instances is independent of the available features (e.g., using the χ² test); stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
47
How to Address Overfitting Post-pruning – Grow decision tree to its entirety – Trim the nodes of the decision tree in a bottom-up fashion – If generalization error improves after trimming, replace sub-tree by a leaf node. – Class label of leaf node is determined from majority class of instances in the sub-tree 46
48
Decision Tree Based Classification
Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for many simple data sets
Example: C4.5
– Simple depth-first construction using information gain
– Sorts continuous attributes at each node
– Needs the entire data set to fit in memory, so unsuitable for large datasets
– You can download the software: http://www.rulequest.com/Personal/c4.5r8.tar.gz
49
Issues Data Fragmentation – Number of instances gets smaller as you traverse down the tree – Number of instances at the leaf nodes could be too small to make any statistically significant decision Search Strategy – Finding an optimal decision tree is NP-hard – The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution – Other strategies? Bottom-up Bi-directional 48
50
Issues
Expressiveness
– Decision trees provide an expressive representation for learning discrete-valued functions
– But they do not generalize well to certain types of Boolean functions, e.g., the parity function: Class = 1 if an even number of Boolean attributes are True, Class = 0 if an odd number are True; accurate modeling requires a complete tree
– Not expressive enough for modeling continuous variables, particularly when the test condition involves only a single attribute at a time
51
Decision Boundary
The border line between two neighboring regions of different classes is known as the decision boundary. The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
52
Oblique Decision Trees
Example test: x + y < 1 → Class = +, otherwise Class = −
– The test condition may involve multiple attributes
– More expressive representation
– Finding the optimal test condition is computationally expensive
53
Issues Tree Replication – Same subtree appears in multiple branches 52
54
Nearest Neighbor Classification
Basic idea: if it walks like a duck and quacks like a duck, then it’s probably a duck
Lazy learner – it does not build a model explicitly
To classify a test record: compute its distance to the training records and choose k of the “nearest” records
55
Nearest Neighbor Classification
Requires three things
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
– Compute its distance to the training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
56
Definition of Nearest Neighbors 55 K-nearest neighbors of a record x are data points that have the k smallest distances to x
57
Nearest Neighbor Classification
Compute the distance between two points:
– Euclidean distance d(p, q) = √( Σ_i (p_i − q_i)² )
Determine the class from the nearest-neighbor list (see the sketch below)
– Take the majority vote of class labels among the k nearest neighbors
– Optionally weigh each vote according to distance, with weight factor w = 1/d²
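A minimal k-nearest-neighbor classifier following the recipe above (Euclidean distance, majority vote, optional 1/d² weighting); the training points and the choice k = 3 are illustrative:

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(training, x, k=3, weighted=False):
    """training: list of (feature_vector, label) pairs. Classify x by majority vote
    among the k nearest records; optionally weight each vote by 1/d^2."""
    neighbors = sorted(training, key=lambda record: euclidean(record[0], x))[:k]
    votes = Counter()
    for features, label in neighbors:
        d = euclidean(features, x)
        votes[label] += 1.0 / (d ** 2) if weighted and d > 0 else 1.0
    return votes.most_common(1)[0][0]

# Illustrative data: two numeric attributes (already on comparable scales), two classes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.2), "B"), ((3.1, 2.9), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))   # "A"
```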
58
Nearest Neighbor Classification Choosing the value of k: – If k is too small, sensitive to noise points – If k is too large, neighborhood may include points from other classes Scaling issues – Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes – Example: height of a person may vary from 1.5m to 1.8m weight of a person may vary from 90lb to 300lb income of a person may vary from $10K to $1M 57
59
Bayes Classification: Probability A probabilistic framework for solving classification problems 58 Marginal Probability Conditional Probability Joint Probability
60
Bayes Classification: Probability
Sum rule: P(X) = Σ_Y P(X, Y)
Product rule: P(X, Y) = P(Y | X) P(X)
61
Bayes Theorem
P(C | A) = P(A | C) P(C) / P(A), i.e., posterior ∝ likelihood × prior
Given:
– A doctor knows that meningitis causes stiff neck 50% of the time
– The prior probability of any patient having meningitis is 1/50,000
– The prior probability of any patient having a stiff neck is 1/20
If a patient has a stiff neck, what’s the probability he/she has meningitis? (worked out below)
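The slide leaves the computation as a question; plugging the stated numbers into Bayes' theorem gives:

```latex
P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)}
            = \frac{0.5 \times (1/50{,}000)}{1/20}
            = 0.0002
```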
62
Naïve Bayes Classification Consider each attribute and class label as random variables – Given a record with attributes (A 1, A 2,…,A n ), – Goal is to predict class C Specifically, we want to find the value of C that maximizes P(C| A 1, A 2,…,A n ) Can we estimate P(C| A 1, A 2,…,A n ) directly from data? Approach – compute the posterior probability P(C | A 1, A 2, …, A n ) for all values of C using the Bayes theorem 61
63
Naïve Bayes Classification
Choose the value of C that maximizes P(C | A_1, A_2, …, A_n)
– Equivalent to choosing the value of C that maximizes P(A_1, A_2, …, A_n | C) · P(C)
How to estimate P(A_1, A_2, …, A_n | C)?
– Assume independence among the attributes A_i when the class is given: P(A_1, A_2, …, A_n | C_j) = P(A_1 | C_j) P(A_2 | C_j) … P(A_n | C_j)
– Can estimate P(A_i | C_j) for all A_i and C_j; a new point is classified to C_j if P(C_j) ∏_i P(A_i | C_j) is maximal
64
How to Estimate Probabilities from Data?
Class prior: P(C) = N_c / N – e.g., P(No) = 7/10, P(Yes) = 3/10
For discrete attributes: P(A_i | C_k) = |A_ik| / N_c
– where |A_ik| is the number of instances having attribute value A_i and belonging to class C_k
– Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0
65
How to Estimate Probabilities from Data?
For continuous attributes
– Discretize the range into bins
– Two-way split: (A < v) or (A ≥ v); choose only one of the two splits as the new attribute
– Probability density estimation: assume the attribute follows a normal distribution; use the data to estimate the parameters of the distribution (e.g., mean and standard deviation); once the probability distribution is known, it can be used to estimate the conditional probability P(A_i | c)
66
How to Estimate Probabilities from Data?
Normal distribution: P(A_i | c_j) = (1 / √(2π σ_ij²)) · exp( −(A_i − μ_ij)² / (2 σ_ij²) )
– One for each (A_i, c_j) pair
For (Income, Class=No):
– If Class = No, sample mean = 110 and sample variance = 2975, so P(Income=120K | No) ≈ 0.0072
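Two small helpers sketch how these estimates are computed. The Gaussian call reproduces the 0.0072 value used on the next slide; the discrete helper assumes records are (attribute-dict, label) pairs and does no smoothing (see the later Smoothing slide):

```python
import math

def gaussian_density(x, mean, var):
    """Class-conditional density for a continuous attribute, assuming a normal
    distribution with parameters estimated from that class's training records."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(gaussian_density(120, 110, 2975))   # ~0.0072 = P(Income=120K | Class=No)

def discrete_estimate(records, attribute, value, cls):
    """P(A_i = value | cls) = |A_ik| / N_c, from simple counts (no smoothing)."""
    in_class = [x for x, y in records if y == cls]
    return sum(1 for x in in_class if x[attribute] == value) / len(in_class)
```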
67
Example
Given a test record X = (Refund=No, Married, Income=120K):
P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No) = 4/7 × 4/7 × 0.0072 = 0.0024
P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes) = 1 × 0 × 1.2×10⁻⁹ = 0
Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X) => Class = No
68
Smoothing
If one of the conditional probabilities is zero, then the entire expression becomes zero
Probability estimation (c: number of classes, p: prior probability, m: parameter):
– Original: P(A_i | C) = N_ic / N_c
– Laplace: P(A_i | C) = (N_ic + 1) / (N_c + c)
– m-estimate: P(A_i | C) = (N_ic + m·p) / (N_c + m)
69
Another Example 68 A: attributes M: mammals N: non-mammals P(A|M)P(M) > P(A|N)P(N) => Mammals
70
Naïve Bayes: Summary Robust to isolated noise points Handle missing values by ignoring the instance during probability estimate calculations Robust to irrelevant attributes Independence assumption may not hold for some attributes – Use other techniques such as Bayesian Belief Networks (BBN) 69
71
Model Evaluation Metrics for Performance Evaluation – How to evaluate the performance of a model? Methods for Performance Evaluation – How to obtain reliable estimates? Methods for Model Comparison – How to compare the relative performance among competing models? 70
72
Metrics for Performance Evaluation
Focus on the predictive capability of a model
– Rather than how fast it classifies or builds models, scalability, etc.
Confusion matrix (rows: actual class, columns: predicted class):
– Actual=Yes, Predicted=Yes: a = TP (true positive)
– Actual=Yes, Predicted=No: b = FN (false negative)
– Actual=No, Predicted=Yes: c = FP (false positive)
– Actual=No, Predicted=No: d = TN (true negative)
73
Limitation of Accuracy
Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– Accuracy is misleading because the model does not detect any class 1 example
Cost matrix (rows: actual class, columns: predicted class), with C(i | j) the cost of misclassifying a class j example as class i:
– Actual=Yes: C(Yes | Yes), C(No | Yes)
– Actual=No: C(Yes | No), C(No | No)
74
Computing Cost of Classification
Cost matrix C(i | j) (rows: actual, columns: predicted): C(+ | +) = −1, C(− | +) = 100, C(+ | −) = 1, C(− | −) = 0
Model M1 confusion matrix: actual + predicted + = 150, actual + predicted − = 40, actual − predicted + = 60, actual − predicted − = 250 → Accuracy = 80%, Cost = 3910
Model M2 confusion matrix: actual + predicted + = 250, actual + predicted − = 45, actual − predicted + = 5, actual − predicted − = 200 → Accuracy = 90%, Cost = 4255
75
Cost vs. Accuracy
Count matrix (actual × predicted): a, b in the Yes row and c, d in the No row. Cost matrix: p, q in the Yes row and q, p in the No row.
N = a + b + c + d
Accuracy = (a + d) / N
Cost = p(a + d) + q(b + c) = p(a + d) + q(N − a − d) = qN − (q − p)(a + d) = N [q − (q − p) · Accuracy]
Accuracy is proportional to cost if 1. C(Yes|No) = C(No|Yes) = q and 2. C(Yes|Yes) = C(No|No) = p
76
Cost-Sensitive Measures
Precision p = a / (a + c), biased towards C(Yes|Yes) & C(Yes|No)
Recall r = a / (a + b), biased towards C(Yes|Yes) & C(No|Yes)
F-measure F = 2rp / (r + p) = 2a / (2a + b + c), biased towards all except C(No|No)
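A small helper, assuming the a/b/c/d counts of the confusion matrix above; the usage line plugs in model M1 from the earlier cost example (TP=150, FN=40, FP=60, TN=250):

```python
def precision_recall_f(tp, fn, fp, tn):
    """Cost-sensitive measures from confusion-matrix counts a=TP, b=FN, c=FP, d=TN."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

print(precision_recall_f(150, 40, 60, 250))
```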
77
Methods for Performance Evaluation
Holdout
– Reserve 2/3 for training and 1/3 for testing
Random subsampling
– Repeated holdout k times; accuracy = average of the accuracies
Cross validation (see the sketch below)
– Partition the data into k disjoint subsets
– k-fold: train on k−1 partitions, test on the remaining one
– Leave-one-out: k = n
Stratified sampling
Bootstrap
– Sampling with replacement
– Works well for small datasets
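A sketch of k-fold cross-validation as described above; train_and_score is a hypothetical stand-in for training any classifier and returning its accuracy on the held-out fold:

```python
import random

def k_fold_accuracy(records, k, train_and_score):
    """Partition the data into k disjoint folds, train on k-1 of them, test on the
    remaining one, and average the k accuracies. Leave-one-out is k = len(records)."""
    data = records[:]
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / k
```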
78
Methods for Model Comparison
ROC (Receiver Operating Characteristic)
– The ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis)
– The performance of each classifier is represented as a point on the ROC curve; changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point
(TP, FP) points: (0, 0): declare everything to be the negative class; (1, 1): declare everything to be the positive class; (1, 0): ideal
Diagonal line:
– Random guessing
– Below the diagonal line: the prediction is the opposite of the true class
79
Using ROC for Model Comparison
– No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR
Area Under the ROC Curve (AUC)
– Ideal: area = 1
– Random guess: area = 0.5
80
Test of Significance Given two models: – Model M1: accuracy = 85%, tested on 30 instances – Model M2: accuracy = 75%, tested on 5000 instances Can we say M1 is better than M2? – How much confidence can we place on accuracy of M1 and M2? – Can the difference in performance measure be explained as a result of random fluctuations in the test set? Given x (# of correct predictions), acc=x/N, and N (# of test instances), can we predict p (true accuracy of model)? 79
81
Confidence Interval for Accuracy
For large test sets (N > 30),
– acc has a normal distribution with mean p and variance p(1 − p)/N, so P( Z_{α/2} < (acc − p)/√(p(1 − p)/N) < Z_{1−α/2} ) = 1 − α
Confidence interval for p:
p ∈ [ 2·N·acc + Z²_{α/2} ± Z_{α/2} √( Z²_{α/2} + 4·N·acc − 4·N·acc² ) ] / [ 2 (N + Z²_{α/2}) ]
82
Confidence Interval for Accuracy
Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
– N = 100, acc = 0.8
– Let 1 − α = 0.95 (95% confidence)
– From the probability table, Z_{α/2} = 1.96
Z values: 1 − α = 0.99 → Z = 2.58; 0.98 → 2.33; 0.95 → 1.96; 0.90 → 1.65
Interval at acc = 0.8 for different test-set sizes:
N: 50, 100, 500, 1000, 5000
p(lower): 0.670, 0.711, 0.763, 0.774, 0.789
p(upper): 0.888, 0.866, 0.833, 0.824, 0.811
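A short script, assuming the normal-approximation interval stated on the previous slide, reproduces the table above (z = 1.96 for 95% confidence):

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p given observed acc on n test records."""
    center = 2 * n * acc + z ** 2
    spread = z * math.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom

for n in (50, 100, 500, 1000, 5000):
    lower, upper = accuracy_interval(0.8, n)
    print(n, round(lower, 3), round(upper, 3))
```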
83
Support Vector Machine (SVM) Problem: find a linear hyperplane (decision boundary) that will separate the data 82
84
SVM One possible solution 83
85
SVM Another possible solution 84
86
SVM Other possible solutions 85
87
SVM Which one is better? B 1 or B 2 ? How do you define better? 86
88
SVM Find the hyperplane that maximizes the margin – B1 is better than B2
89
SVM 88
90
SVM
We want to maximize the margin 2 / ||w||
– Which is equivalent to minimizing L(w) = ||w||² / 2
– But subject to the following constraints: y_i (w · x_i + b) ≥ 1 for every training record (x_i, y_i)
– This is a constrained optimization problem; numerical approaches solve it (e.g., quadratic programming)
91
Nonlinear SVM
What if the decision boundary is not linear?
– Transform the data into a higher-dimensional space, in which the data points become linearly separable
LibSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvm/
92
Ensemble Methods
So far – we have leveraged ONE classifier for classification
Can we use a collection (ensemble) of weak classifiers to produce a very accurate classifier? Or, can we make dumb learners smart?
– e.g., generate 100 different decision trees from the same or different training sets and have them vote on the best classification for a new example
Key motivation
– Reduce the error rate; the hope is that it becomes much less likely that the ensemble will misclassify an example
93
Ensemble Methods Construct a set of classifiers from the training data Predict class label of previously unseen records by aggregating predictions made by multiple classifiers 92
94
Value of Ensembles
“No Free Lunch” theorem
– No single algorithm wins all the time!
When combining multiple independent and diverse classifiers, each of which is more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced
Representative methods
– Bagging: resampling the training data
– Boosting: reweighting the training data
– Random forest: random subsets of features
95
Example
96
Ensemble Methods
Suppose there are 25 base classifiers
– Each classifier has error rate ε = 0.35
– Assume the classifiers are independent
– Probability that the (majority-vote) ensemble classifier makes a wrong prediction: P = Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^{25−i} ≈ 0.06 (computed below)
How to generate an ensemble of classifiers?
– Bagging
– Boosting
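The missing expression is a binomial tail: the majority vote is wrong only if at least 13 of the 25 independent classifiers err. A few lines verify the ≈ 0.06 figure:

```python
from math import comb

def ensemble_error(n_classifiers, eps):
    """P(majority vote is wrong) = sum over i >= n//2 + 1 of C(n, i) eps^i (1-eps)^(n-i)."""
    k = n_classifiers // 2 + 1
    return sum(comb(n_classifiers, i) * eps ** i * (1 - eps) ** (n_classifiers - i)
               for i in range(k, n_classifiers + 1))

print(ensemble_error(25, 0.35))   # ~0.06, versus 0.35 for a single base classifier
```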
97
Bagging
Create an ensemble by “bootstrap aggregation”
– Repeatedly, randomly resample the training data with replacement
– Each sample has probability (1 − 1/n)^n of not being selected: about 36.8% of the samples will not be selected for training and thereby end up in the test set, while about 63.2% will end up in the training set
– Build a classifier on each bootstrap sample
– Combine the outputs by voting (e.g., majority voting)
98
Boosting Key insights – Instead of sampling (as in bagging), re-weigh examples! – Examples are given weights. At each iteration, a new classifier is learned and the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong – Final classification based on weighted vote of weak classifiers Construction – Start with uniform weighting – During each step of learning Increase weights of the examples which are not correctly learned by the weak learner Decrease weights of the examples which are correctly learned by the weak learner 97
99
Example 98 Example 4 is hard to classify Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
100
AdaBoost
Base classifiers: C_1, C_2, …, C_T
Error rate of classifier C_i: its weighted misclassification rate on the training records, ε_i = Σ_j w_j · δ(C_i(x_j) ≠ y_j), where the example weights w_j sum to 1
Importance of a classifier: α_i = (1/2) ln( (1 − ε_i) / ε_i )
101
AdaBoost
Weight update: w_j^(i+1) = (w_j^(i) / Z_i) · exp(−α_i) if C_i(x_j) = y_j, or (w_j^(i) / Z_i) · exp(+α_i) if C_i(x_j) ≠ y_j, where Z_i normalizes the weights to sum to 1
– If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated
Classification: C*(x) = argmax_y Σ_i α_i δ(C_i(x) = y)
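A sketch of one boosting round implementing the update above; the four-example call at the end is illustrative:

```python
import math

def adaboost_round(weights, correct, eps_floor=1e-10):
    """Given current example weights and a boolean list `correct[j]` (did classifier C_i
    get example j right?), return the classifier importance alpha_i and the new weights."""
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error > 0.5:                            # revert to uniform weights, as on the slide
        n = len(weights)
        return 0.0, [1.0 / n] * n
    error = max(error, eps_floor)              # guard against log(0) when the error is 0
    alpha = 0.5 * math.log((1 - error) / error)
    new_w = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    z = sum(new_w)                             # normalization factor Z_i
    return alpha, [w / z for w in new_w]

# Four equally weighted examples; the classifier misclassifies example 4:
alpha, w = adaboost_round([0.25] * 4, [True, True, True, False])
print(alpha, w)                                # example 4's weight increases
```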
102
Example 101 Data points for training Initial weights for each data point
103
Example 102