1
KDD: Classification UCLA CS240A Notes*
_______________________________ * Notes based on an EDBT 2000 tutorial by Fosca Giannotti and Dino Pedreschi, Pisa KDD Lab
2
Module outline The classification task Main classification techniques
Bayesian classifiers Decision trees Hints to other methods Discussion
3
The classification task
Input: a training set of tuples, each labelled with one class label. Output: a model (classifier) which assigns a class label to each tuple based on the other attributes. The model can be used to predict the class of new tuples, for which the class label is missing or unknown. Some natural applications: credit approval, medical diagnosis, treatment effectiveness analysis.
4
Train & Test The tuples (observations, samples) are partitioned in training set + test set (72%--82% of data typically used for training and the rest for testing) Classification is performed in two steps: training-build the model from training set testing-check accuracy of model on test set
5
Models and Accuracy
Kinds of models: IF-THEN rules, other logical formulae, decision trees, probabilistic models. Accuracy of models: the known class of each test sample is matched against the class predicted by the model; accuracy rate = % of test-set samples correctly classified by the model.
6
Training step: the classification algorithm is run on the training data to produce a classifier (model), e.g. a rule such as IF age = … OR income = high THEN credit = good.
7
Test step: the classifier (model) is applied to the test data to check its accuracy.
8
Prediction: the classifier (model) is applied to unseen data to predict the missing class label.
9
Machine learning terminology
Classification = supervised learning: use training samples with known classes to classify new data. Clustering = unsupervised learning: training samples have no class information; guess classes or clusters in the data.
10
Comparing classifiers
Accuracy; speed; robustness w.r.t. noise and missing values; scalability (efficiency in large databases); interpretability of the model; simplicity (decision tree size, rule compactness); domain-dependent quality indicators.
11
Module outline The classification task Main classification techniques
Bayesian classifiers Decision trees Hints to other methods
12
Classical example: play tennis?
Training set from Quinlan’s book
13
The class labels P and N can mean many things. E.g.,
Forecasting the occurrence of a problem: a traffic problem at a particular intersection, or the failure of a particular piece of equipment because of overheating. Predicting whether a weather-dependent activity will take place: piscatorial activity tonight? Play tennis or play golf this afternoon? Real-life examples can be more complex: many columns, several class values, and continuous attribute values. Once built, the classifier takes tuples that specify weather conditions and decides whether each is a P or an N. Two main kinds of classifiers: Bayesian classifiers and decision tree classifiers.
14
Algorithms for Classification:
Notes by Gregory Piatetsky
15
Classification Task: Given a set of pre-classified examples, build a model or classifier to classify new cases. Supervised learning: classes are known for the examples used to build the classifier. A classifier can be a set of rules, a decision tree, a neural network, etc. Typical applications: credit approval, direct marketing, fraud detection, medical diagnosis, …..
16
Simplicity first Simple algorithms often work very well!
There are many kinds of simple structure, e.g.: one attribute does all the work; all attributes contribute equally and independently; a weighted linear combination might do; instance-based: use a few prototypes; use simple logical rules. The success of a method depends on the domain. (Witten & Eibe)
17
Inferring rudimentary rules
1R learns a 1-level decision tree, i.e., a set of rules that all test one particular attribute. Basic version: one branch for each value; each branch assigns the most frequent class; the error rate is the proportion of instances that don't belong to the majority class of their branch; choose the attribute with the lowest error rate (assumes nominal attributes). (Witten & Eibe)
18
Pseudo-code for 1R
For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value pair
  Calculate the error rate of the rules
Choose the rules with the smallest error rate
Note: “missing” is treated as a separate attribute value. (Witten & Eibe)
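Below is a minimal Python sketch of 1R (not from the original notes): it assumes the dataset is a list of dicts with nominal attribute values, breaks ties arbitrarily, and the function name one_r and the toy rows are purely illustrative.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, class_attr):
    """1R sketch: for each attribute, build one rule per value (predict the
    majority class for that value) and keep the attribute with fewest errors."""
    best_attr, best_rules, best_errors = None, None, None
    for attr in attributes:
        # Count class frequencies for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[class_attr]] += 1
        rules, errors = {}, 0
        for value, class_counts in counts.items():
            majority, majority_n = class_counts.most_common(1)[0]
            rules[value] = majority
            errors += sum(class_counts.values()) - majority_n
        if best_errors is None or errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules, best_errors

# Toy usage (abbreviated, hypothetical rows in the play-tennis style):
rows = [
    {"Outlook": "Sunny", "Windy": "False", "Play": "No"},
    {"Outlook": "Overcast", "Windy": "False", "Play": "Yes"},
    {"Outlook": "Rainy", "Windy": "True", "Play": "No"},
]
print(one_r(rows, ["Outlook", "Windy"], "Play"))
```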
19
Evaluating the weather attributes
(* indicates a tie. The slide also shows the play-tennis training table with attributes Outlook, Temp, Humidity, Windy and class Play.)

Attribute   Rule              Errors   Total errors
Outlook     Sunny → No        2/5      4/14
            Overcast → Yes    0/4
            Rainy → Yes       2/5
Temp        Hot → No*         2/4      5/14
            Mild → Yes        2/6
            Cool → Yes        1/4
Humidity    High → No         3/7      4/14
            Normal → Yes      1/7
Windy       False → Yes       2/8      5/14
            True → No*        3/6
(Witten & Eibe)
20
Dealing with Numeric attributes
A way to discretize numeric attributes is as follows: divide each attribute's range into intervals by sorting the instances according to the attribute's values and placing breakpoints where the (majority) class changes, so as to minimize the total error. Example: temperature from the weather data (the slide shows the weather table with numeric Temperature and Humidity values); sorted by temperature, the classes are: Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No, where the vertical bars mark the breakpoints. (Witten & Eibe)
21
The problem of overfitting
This procedure is very sensitive to noise: one instance with an incorrect class label will probably produce a separate interval. Also, a time-stamp attribute would have zero errors. Simple solution: enforce a minimum number of instances of the majority class per interval. (Witten & Eibe)
22
Discretization example
Example (with min = 3). Starting partition: Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No. Final result for the temperature attribute: Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No. (Witten & Eibe)
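The following Python sketch shows one plausible reading of this discretization procedure (it is not code from the original notes): an interval is closed only when the class changes and its current majority class already has at least min_majority instances, and adjacent intervals that end up predicting the same class are merged, as happens when the final rules are formed. The name discretize_1r and the placeholder positions in the usage example are illustrative.

```python
from collections import Counter

def discretize_1r(values, classes, min_majority=3):
    """Sort by value, start a new interval when the class changes and the
    current interval's majority class has enough support, then merge
    neighbouring intervals that predict the same class."""
    pairs = sorted(zip(values, classes))
    intervals, current = [], [pairs[0]]
    for pair in pairs[1:]:
        majority, majority_n = Counter(c for _, c in current).most_common(1)[0]
        if pair[1] != current[-1][1] and majority_n >= min_majority:
            intervals.append(current)
            current = [pair]
        else:
            current.append(pair)
    intervals.append(current)
    # Merge adjacent intervals whose majority class is the same.
    merged = [intervals[0]]
    for interval in intervals[1:]:
        prev_major = Counter(c for _, c in merged[-1]).most_common(1)[0][0]
        this_major = Counter(c for _, c in interval).most_common(1)[0][0]
        if prev_major == this_major:
            merged[-1].extend(interval)
        else:
            merged.append(interval)
    # Breakpoints halfway between neighbouring intervals.
    return [(a[-1][0] + b[0][0]) / 2 for a, b in zip(merged, merged[1:])]

# Usage with the class sequence above and placeholder positions 0..13
# standing in for the actual sorted temperature values:
classes = "Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No".split()
print(discretize_1r(list(range(len(classes))), classes, min_majority=3))
```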
23
With overfitting avoidance
Resulting rule set:

Attribute     Rule                      Errors   Total errors
Outlook       Sunny → No                2/5      4/14
              Overcast → Yes            0/4
              Rainy → Yes               2/5
Temperature   ≤ 77.5 → Yes              3/10     5/14
              > 77.5 → No*              2/4
Humidity      ≤ 82.5 → Yes              1/7      3/14
              > 82.5 and ≤ 95.5 → No    2/6
              > 95.5 → Yes              0/1
Windy         False → Yes               2/8      5/14
              True → No*                3/6
(Witten & Eibe)
24
Probability-Based Classifiers
Overall P(p) = 9/14 P(n) = 5/14
26
Bayesian Probability Classifiers
Single-attribute conditional probabilities:

Outlook        P(sunny|p)    = 2/9    P(sunny|n)    = 3/5
               P(overcast|p) = 4/9    P(overcast|n) = 0
               P(rain|p)     = 3/9    P(rain|n)     = 2/5
Temperature    P(hot|p)      = 2/9    P(hot|n)      = 2/5
               P(mild|p)     = 4/9    P(mild|n)     = 2/5
               P(cool|p)     = 3/9    P(cool|n)     = 1/5
Humidity       P(high|p)     = 3/9    P(high|n)     = 4/5
               P(normal|p)   = 6/9    P(normal|n)   = 2/5
Wind           P(strong|p)   = 3/9    P(strong|n)   = 3/5
               P(weak|p)     = 6/9    P(weak|n)     = 2/5
Overall        P(p)          = 9/14   P(n)          = 5/14
27
How do we combine these Probabilities?
The classification problem may be formalized using a-posteriori probabilities: P(C|X) = probability that the sample tuple X = <x1,…,xk> is of class C. E.g., P(class = N | outlook = sunny, wind = weak, …). Sample X is assigned to the class C for which P(C|X) is maximal.
28
Estimating a-posteriori probabilities
Bayes' theorem: P(C|X) = P(X|C)·P(C) / P(X). P(X) is constant for all classes, so it suffices to find the class C for which P(X|C)·P(C) is maximal. P(C) = relative frequency of class-C samples, trivially estimated by counting. Problem: computing P(X|C) directly is infeasible!
29
Naïve Bayesian Classification
Naïve assumption: attribute independence, i.e. P(x1,…,xk|C) = P(x1|C)·…·P(xk|C). If the i-th attribute is categorical, P(xi|C) is estimated as the relative frequency of samples having value xi for the i-th attribute within class C. If the i-th attribute is continuous, P(xi|C) is estimated by a Gaussian density function. Computationally easy in both cases. Thomas Bayes: born 1702 in London, England; died 1761 in Tunbridge Wells, Kent, England.
30
Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, weak>. P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(weak|p)·P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 ≈ 0.0106. P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(weak|n)·P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 ≈ 0.0183. Since 0.0183 > 0.0106, sample X is classified in class n (don't play).
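A minimal Python sketch of this naïve Bayesian classifier (illustrative, not from the original notes), assuming categorical attributes, relative-frequency estimates, and no smoothing; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, class_attr):
    """Estimate P(C) and P(x_i | C) by relative frequencies in the training set."""
    class_counts = Counter(row[class_attr] for row in rows)
    cond_counts = defaultdict(Counter)   # (attribute, class) -> value counts
    for row in rows:
        for attr, value in row.items():
            if attr != class_attr:
                cond_counts[(attr, row[class_attr])][value] += 1
    priors = {c: n / len(rows) for c, n in class_counts.items()}
    return priors, cond_counts, class_counts

def classify(x, priors, cond_counts, class_counts):
    """Return the class C maximizing P(C) * prod_i P(x_i | C), with its score."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            score *= cond_counts[(attr, c)][value] / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score
```

Trained on the 14 play-tennis tuples and applied to X = <rain, hot, high, weak>, the two scores are exactly the products computed above (≈ 0.0106 for p and ≈ 0.0183 for n), so X is again classified as n.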
31
The independence hypothesis…
… makes computation possible … yields optimal classifiers when it holds … but is seldom satisfied in practice, as attributes (variables) are often correlated. Attempts to overcome this limitation: Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes; decision trees, which reason on one attribute at a time, considering the most important attributes first.
32
Module outline The classification task Main classification techniques
Bayesian classifiers Decision trees Hints to other methods
33
Decision trees A tree where internal node = test on a single attribute
branch = an outcome of the test; leaf node = a class or class distribution. (The slide shows a small example tree with internal tests A?, B?, C?, D? and a Yes leaf.)
34
Decision Tree Classifiers Generated by ID3--Quinlan 86
The tree generated by ID3 (Quinlan 86) for the play-tennis data: the root tests outlook; outlook = overcast → P; outlook = sunny → test humidity (high → N, normal → P); outlook = rain → test windy (true → N, false → P).
36
From decision trees to classification rules
One rule is generated for each path in the tree from the root to a leaf. Rules are generally simpler to understand than trees. From the play-tennis tree above, for example: IF outlook = sunny AND humidity = normal THEN play tennis.
37
Decision tree Induction
Basic algorithm: top-down, recursive, divide & conquer, greedy (may get trapped in local maxima). Many variants: divide (split) criterion; multiway split vs. dichotomy; when to stop.
38
Decision Tree Algorithm
For each node N:
  If the samples at N are all of class C, then label N with C and exit;
  If attribute_list is empty, then label N with majority_class(N) and exit;
  Otherwise:
    greedily select the best_split attribute from attribute_list;
    for each value V of that attribute, generate a branch to a new node containing the training tuples that have value V for that attribute;
    recursively apply this procedure to the newly generated nodes.
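A compact Python sketch of this generic top-down induction scheme (an illustration, not the notes' own code): choose_split stands for whichever split criterion is used (information gain, Gini, χ²) and is assumed to return the selected attribute.

```python
from collections import Counter

def majority_class(rows, class_attr):
    return Counter(r[class_attr] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, class_attr, choose_split):
    """Stop when the node is pure or no attributes remain; otherwise split on
    the attribute chosen by `choose_split` and recurse on each partition."""
    classes = {r[class_attr] for r in rows}
    if len(classes) == 1:
        return classes.pop()                     # leaf: all samples same class
    if not attributes:
        return majority_class(rows, class_attr)  # leaf: majority class
    best = choose_split(rows, attributes, class_attr)
    remaining = [a for a in attributes if a != best]
    children = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        children[value] = build_tree(subset, remaining, class_attr, choose_split)
    return (best, children)                      # internal node: test on `best`
```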
39
Classical example: play tennis?
Training set from Quinlan’s book
40
Example: first split on Outlook
The root node holds all tuples; splitting on outlook produces three children: all tuples where outlook = sunny, all tuples where outlook = overcast, and all tuples where outlook = rain.
41
Outlook=‘overcast’
42
Example, continued: after the split on outlook, the overcast partition contains only P tuples, so it is labelled P and is done; the sunny and rain partitions are split further in the same way.
43
Outlook = Sunny: remove the tuples that are not sunny, and drop the Outlook column.
44
Decision tree Induction
Basic algorithm: top-down, recursive, divide & conquer, greedy (may get trapped in local maxima). Many variants: divide (split) criterion; multiway split vs. dichotomy; when to stop.
45
Criteria for finding the best split
Information gain (ID3, C4.5): entropy, an information-theoretic concept, measures the impurity of a split; select the attribute that maximizes the entropy reduction. Gini index (CART): another measure of the impurity of a split; select the attribute that minimizes the impurity. χ² contingency-table statistic (CHAID): measures the correlation between each attribute and the class label; select the attribute with maximal correlation.
46
Gini index E.g., two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements. Let f_p = p/(p+n) and f_n = n/(p+n). Then gini(S) = 1 − f_p² − f_n². If dataset S is split into S1 and S2, then gini_split(S1, S2) = gini(S1)·(p1+n1)/(p+n) + gini(S2)·(p2+n2)/(p+n).
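A short Python sketch of these Gini computations (illustrative, not from the notes); the two print lines evaluate the two candidate root splits discussed on the next slide, using the class counts given there (overcast 4 Pos / 0 Neg vs. sunny+rain 5 Pos / 5 Neg, and normal 6 Pos / 1 Neg vs. high 3 Pos / 4 Neg).

```python
def gini(pos, neg):
    """Gini impurity of a set with `pos` positive and `neg` negative samples."""
    total = pos + neg
    if total == 0:
        return 0.0
    f_pos, f_neg = pos / total, neg / total
    return 1.0 - f_pos ** 2 - f_neg ** 2

def gini_split(parts):
    """Weighted Gini impurity of a split; `parts` is a list of (pos, neg) pairs."""
    total = sum(p + n for p, n in parts)
    return sum((p + n) / total * gini(p, n) for p, n in parts)

print(gini_split([(4, 0), (5, 5)]))   # split on outlook: {overcast} vs {sunny, rain}
print(gini_split([(6, 1), (3, 4)]))   # split on humidity: {normal} vs {high}
```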
47
Gini index - play tennis example
Two top best splits at the root node: split on outlook: S1 = {overcast} (4 Pos, 0 Neg, i.e. 100% P), S2 = {sunny, rain}; split on humidity: S1 = {normal} (6 Pos, 1 Neg, i.e. 86% P), S2 = {high}.
48
Information Gain Information gain increases with the average purity of the subsets that an attribute produces. Information is measured in bits: given a probability distribution, the information required to predict an event is the distribution's entropy, which can involve fractions of a bit. Formula for computing the entropy: entropy(p1, …, pn) = −p1·log2(p1) − … − pn·log2(pn). (Witten & Eibe)
49
*Claude Shannon “Father of information theory” Born: 30 April 1916
Died: 23 February 2001 Claude Shannon, who has died aged 84, perhaps more than anyone laid the groundwork for today’s digital revolution. His exposition of information theory, stating that all information could be represented mathematically as a succession of noughts and ones, facilitated the digital manipulation of data without which today’s information society would be unthinkable. Shannon’s master’s thesis, obtained in 1940 at MIT, demonstrated that problem solving could be achieved by manipulating the symbols 0 and 1 in a process that could be carried out automatically with electrical circuitry. That dissertation has been hailed as one of the most significant master’s theses of the 20th century. Eight years later, Shannon published another landmark paper, A Mathematical Theory of Communication, generally taken as his most important scientific contribution. Shannon applied the same radical approach to cryptography research, in which he later became a consultant to the US government. Many of Shannon’s pioneering insights were developed before they could be applied in practical form. He was truly a remarkable man, yet unknown to most of the world. witten&eibe
50
Example: attribute “Outlook”, 1
(The slide shows the play-tennis training table grouped by Outlook: Sunny has 2 Yes / 3 No, Overcast has 4 Yes / 0 No, Rainy has 3 Yes / 2 No.) (Witten & Eibe)
51
Example: attribute “Humidity”
“Humidity” = “High”: info([3,4]) = 0.985 bits. “Humidity” = “Normal”: info([6,1]) = 0.592 bits. Expected information for the attribute: (7/14)·0.985 + (7/14)·0.592 = 0.788 bits. Information gain: info([9,5]) − 0.788 = 0.940 − 0.788 ≈ 0.15 bits.
52
Computing the information gain
Information gain = (information before the split) − (information after the split). Information gain for the attributes from the weather data (see the Python sketch below): gain(Outlook) = 0.247 bits, gain(Temperature) = 0.029 bits, gain(Humidity) = 0.151 bits, gain(Windy) = 0.048 bits. (Witten & Eibe)
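A small Python sketch of these computations (illustrative, not from the notes): entropy takes a list of class counts, and information_gain assumes the dataset is a list of dicts.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Entropy, in bits, of a class-count distribution such as [9, 5]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(rows, attr, class_attr):
    """Information gain of splitting `rows` on attribute `attr`."""
    before = entropy(list(Counter(r[class_attr] for r in rows).values()))
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        weight = len(subset) / len(rows)
        after += weight * entropy(list(Counter(r[class_attr] for r in subset).values()))
    return before - after

print(entropy([9, 5]))   # ~0.940 bits: the information before any split
```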
53
Splitting using Information gain
Choosing the best split at the root node: gain(outlook) = 0.247, gain(temperature) = 0.029, gain(humidity) = 0.151, gain(windy) = 0.048, so outlook is selected, yielding the tree shown earlier (root outlook; sunny → humidity; rain → windy; overcast → P). The criterion is biased towards attributes with many values; corrections have been proposed (gain ratio).
54
Other Important Decisions in Decision Tree Construction
Branching scheme: binary vs. k-ary splits; categorical vs. continuous attributes. Stop rule, i.e. how to decide that a node is a leaf: all samples belong to the same class; the impurity measure is below a given threshold; there are no more attributes to split on; there are no samples in the partition. Labeling rule: a leaf node is labeled with the class to which most samples at the node belong.
55
The overfitting problem
Ideal goal of classification: find the simplest decision tree that fits the data and generalizes to unseen data; this is intractable in general. A decision tree may become too complex if it overfits the training samples, due to noise and outliers, too little training data, or local maxima in the greedy search. Two heuristics to avoid overfitting: stop earlier (stop growing the tree earlier) or post-prune (allow overfitting, then simplify the tree).
56
Stopping vs. pruning Stopping: Prevent the split on an attribute (predictor variable) if it is below a level of statistical significance - simply make it a leaf (CHAID) Pruning: After a complex tree has been grown, replace a split (subtree) with a leaf if the predicted validation error is no worse than the more complex tree (CART, C4.5) Integration of the two: PUBLIC (Rastogi and Shim 98) – estimate pruning conditions (lower bound to minimum cost subtrees) during construction, and use them to stop.
57
Categorical vs. continuous attributes
The information gain criterion may be adapted to continuous attributes using binary splits; the Gini index may be adapted to categorical attributes. Typically, discretization is not a pre-processing step but is performed dynamically during decision tree construction, as sketched below.
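One common way to realize such a dynamic binary split, sketched in Python (this is an illustration of the idea, not necessarily the exact procedure of C4.5 or CART): candidate thresholds are the midpoints between consecutive distinct sorted values, and the one with the highest information gain is kept.

```python
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def best_binary_threshold(values, classes):
    """Try the midpoint between each pair of consecutive distinct sorted values
    and return the threshold with the highest information gain."""
    pairs = sorted(zip(values, classes))
    base = entropy(list(Counter(classes).values()))
    best_gain, best_t = -1.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        gain = (base
                - len(left) / len(pairs) * entropy(list(Counter(left).values()))
                - len(right) / len(pairs) * entropy(list(Counter(right).values())))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```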
58
Scalability to large databases
What if the dataset does not fit in main memory? Early approaches: incremental tree construction (Quinlan 86); merging trees constructed on separate data partitions (Chan & Stolfo 93); data reduction via sampling (Catlett 91). Goal: handle on the order of 1G samples and 1K attributes. Successful contributions from data mining research: SLIQ (Mehta et al. 96), SPRINT (Shafer et al. 96), PUBLIC (Rastogi & Shim 98), RainForest (Gehrke et al. 98).
59
Checkpoint The classification task Main classification techniques
Bayesian classifiers Decision trees Other Prediction techniques (not covered) K-nearest neighbors classifiers Association-based classification Support Vector Machines Backpropagation (neural nets) Regression methods from statistics Boosting, bagging and classifier ensembles
60
Classifiers: Summary Among the most popular and useful techniques of data mining. Comparatively simple to understand and apply. Still delivering good accuracy in most applications. Boosting and bagging can be used to improve the accuracy of a classifier.
61
Boosting a single classifier
Build a first classifier using the training set. Build a new classifier by assigning a higher weight to the samples of the training set that were misclassified. Repeat the operation while accuracy keeps increasing. In most cases accuracy improves for a few iterations and then oscillates. (A sketch of this loop follows.)
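A Python sketch of this reweighting loop (illustrative): it assumes scikit-learn is available to supply a base learner, and the weight factor bump and the depth-2 trees are arbitrary choices; this is the simple scheme described on the slide, not full AdaBoost.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed available as the base learner

def boost_single_classifier(X, y, rounds=10, bump=2.0):
    """Train, increase the weight of misclassified samples, retrain, and keep
    going while the (training) accuracy keeps increasing."""
    weights = np.ones(len(y)) / len(y)
    best_model, best_acc = None, 0.0
    for _ in range(rounds):
        model = DecisionTreeClassifier(max_depth=2)
        model.fit(X, y, sample_weight=weights)
        predictions = model.predict(X)
        acc = np.mean(predictions == y)
        if acc <= best_acc:
            break                            # accuracy no longer increasing
        best_model, best_acc = model, acc
        weights[predictions != y] *= bump    # boost the misclassified samples
        weights /= weights.sum()
    return best_model, best_acc
```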
62
Ensembles How to develop n different classifiers? Typically n is between 5 and 20. The available examples are split into a test set (30%) and a pool of the remaining 70%; sampling from this pool is repeated n times, giving n training sets and n classifiers, all evaluated on the same test set. The ensemble of classifiers is created using different sample sets (drawn with replacement) of the data; thus each tuple can appear in zero or more training sets.
63
Classifier Ensembles: Boosting
Create an ensemble of classifiers; decisions are taken by majority voting. The first K < N classifiers take a decision; tuples that are misclassified by the first K are assigned a boosted weight when building the (K+1)-th classifier. A fast learner: it learns with few samples and shallow classifiers, but is oversensitive to noise.
64
Classifier Ensembles: Bagging
Create an ensemble of classifiers; decisions are taken by voting: either just count the votes (simple majority), or weight them according to the classifiers' strength or accuracy. Bagging in SQL? (A minimal sketch in Python follows.)
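A minimal Python sketch of bagging (illustrative, not from the notes): train stands for any learner that maps a list of rows to a model with a predict method, each classifier is built on a bootstrap sample drawn with replacement, and predictions are combined by unweighted majority vote.

```python
import random
from collections import Counter

def bagging(rows, n_classifiers, train):
    """Build `n_classifiers` models on bootstrap samples of `rows` and return
    a predictor that takes the (unweighted) majority vote of their outputs."""
    models = []
    for _ in range(n_classifiers):
        sample = [random.choice(rows) for _ in range(len(rows))]   # with replacement
        models.append(train(sample))

    def predict(x):
        votes = Counter(model.predict(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict
```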