Knowledge discovery & data mining: Classification


Knowledge discovery & data mining: Classification
UCLA CS240A, Winter 2002
Notes from a tutorial presented at EDBT2000
by Fosca Giannotti and Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
http://www-kdd.di.unipi.it/

Module outline
- The classification task
- Main classification techniques
  - Bayesian classifiers
  - Decision trees
  - Hints to other methods
- Discussion

The classification task
- Input: a training set of tuples, each labelled with one class label.
- Output: a model (classifier) which assigns a class label to each tuple based on the other attributes.
- The model can be used to predict the class of new tuples, for which the class label is missing or unknown.
- Some natural applications: credit approval, medical diagnosis, treatment effectiveness analysis.

Classification systems and inductive learning
[Diagram: basic framework for inductive learning. The environment supplies training examples (x, f(x)) to an inductive learning system, which induces a model (classifier) h; testing examples are then classified by the model, producing output classifications (x, h(x)), and the question is whether h(x) = f(x).]
Classification is a problem of representation and search for the best hypothesis h(x).

Train & test
- The tuples (observations, samples) are partitioned into a training set and a test set.
- Classification is performed in two steps:
  - training: build the model from the training set
  - test: check the accuracy of the model using the test set

Train & test
- Kinds of models: IF-THEN rules, other logical formulae, decision trees.
- Accuracy of models: the known class of the test samples is matched against the class predicted by the model.
- Accuracy rate = % of test set samples correctly classified by the model.
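A minimal sketch (not from the tutorial) of the accuracy-rate computation just described; the toy rule-based model and the test tuples are hypothetical.

```python
# Accuracy rate = fraction of test tuples whose known class matches the
# class predicted by the model.
def accuracy(model, test_set):
    correct = sum(1 for attributes, true_class in test_set
                  if model(attributes) == true_class)
    return correct / len(test_set)

# Toy "model" and test set, purely for illustration:
model = lambda x: "good" if x.get("income") == "high" else "bad"
test_set = [({"income": "high"}, "good"), ({"income": "low"}, "good")]
print(accuracy(model, test_set))   # 0.5, i.e. 50% of test samples correct
```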

Training step
[Diagram: the training data is fed to a classification algorithm, which produces a classifier (model), e.g. the rule IF age = 30-40 OR income = high THEN credit = good.]

Test step
[Diagram: the test data is fed to the learned classifier (model) to check its accuracy.]

Prediction
[Diagram: unseen data is fed to the classifier (model), which predicts its class.]

Machine learning terminology
- Classification = supervised learning: use training samples with known classes to classify new data.
- Clustering = unsupervised learning: training samples have no class information; guess classes or clusters in the data.

Comparing classifiers
- Accuracy
- Speed
- Robustness w.r.t. noise and missing values
- Scalability: efficiency on large databases
- Interpretability of the model
- Simplicity: decision tree size, rule compactness
- Domain-dependent quality indicators

Classical example: play tennis?
Training set from Quinlan's book: 14 observations with attributes outlook, temperature, humidity and windy, each labelled with class P (play) or N (don't play).

Module outline
- The classification task
- Main classification techniques
  - Bayesian classifiers
  - Decision trees
  - Hints to other methods
- Application to a case study in fraud detection: planning of fiscal audits

Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities: P(C|X) = probability that the sample tuple X = <x1,…,xk> is of class C.
- E.g. P(class = N | outlook = sunny, windy = true, …)
- Idea: assign to sample X the class label C such that P(C|X) is maximal.

Estimating a-posteriori probabilities
- Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X)
- P(X) is constant for all classes.
- P(C) = relative frequency of class C samples.
- The C such that P(C|X) is maximum is the C such that P(X|C)·P(C) is maximum.
- Problem: computing P(X|C) is infeasible!

Naïve Bayesian classification
- Naïve assumption: attribute independence, i.e. P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
- If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi for the i-th attribute in class C.
- If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function.
- Computationally easy in both cases.

Play-tennis example: estimating P(xi|C)
P(p) = 9/14    P(n) = 5/14

outlook:      P(sunny|p) = 2/9     P(sunny|n) = 3/5
              P(overcast|p) = 4/9  P(overcast|n) = 0
              P(rain|p) = 3/9      P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9       P(hot|n) = 2/5
              P(mild|p) = 4/9      P(mild|n) = 2/5
              P(cool|p) = 3/9      P(cool|n) = 1/5
humidity:     P(high|p) = 3/9      P(high|n) = 4/5
              P(normal|p) = 6/9    P(normal|n) = 1/5
windy:        P(true|p) = 3/9      P(true|n) = 3/5
              P(false|p) = 6/9     P(false|n) = 2/5

Play-tennis example: classifying X
- An unseen sample: X = <rain, hot, high, false>
- P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
- P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
- Sample X is classified in class n (don't play).
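The same computation can be checked with a few lines of Python; this is only an illustration of the arithmetic above, using the conditional probabilities from the previous slide.

```python
# Naive Bayes scoring of X = <rain, hot, high, false> (probabilities from the
# previous slide; an illustration, not the tutorial's own code).
p_table = {"rain": 3/9, "hot": 2/9, "high": 3/9, "false": 6/9}   # P(value|p)
n_table = {"rain": 2/5, "hot": 2/5, "high": 4/5, "false": 2/5}   # P(value|n)
prior = {"p": 9/14, "n": 5/14}

X = ["rain", "hot", "high", "false"]

def score(cond, prior_c):
    s = prior_c
    for v in X:
        s *= cond[v]
    return s

score_p = score(p_table, prior["p"])   # ~0.010582
score_n = score(n_table, prior["n"])   # ~0.018286
print(score_p, score_n)
print("class:", "p (play)" if score_p > score_n else "n (don't play)")
```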

The independence hypothesis…
- … makes computation possible
- … yields optimal classifiers when satisfied
- … but is seldom satisfied in practice, as attributes (variables) are often correlated.
- Attempts to overcome this limitation:
  - Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  - Decision trees, which reason on one attribute at a time, considering the most important attributes first

Module outline
- The classification task
- Main classification techniques
  - Bayesian classifiers
  - Decision trees
  - Hints to other methods

Decision trees
A tree where:
- internal node = test on a single attribute
- branch = an outcome of the test
- leaf node = class or class distribution
[Diagram: a small abstract tree with test nodes A?, B?, C?, D? and class leaves such as Yes.]

Classical example: play tennis?
Training set from Quinlan's book (the 14-sample weather data with attributes outlook, temperature, humidity and windy, and class P/N).

Decision tree obtained with ID3 (Quinlan 86)

outlook?
  sunny    -> humidity?
                high   -> N
                normal -> P
  overcast -> P
  rain     -> windy?
                true   -> N
                false  -> P

From decision trees to classification rules
- One rule is generated for each path in the tree from the root to a leaf.
- Rules are generally simpler to understand than trees.
- E.g., from the tree above: IF outlook = sunny AND humidity = normal THEN play tennis
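A small sketch of the rule-extraction idea: walk each root-to-leaf path and emit one IF-THEN rule. The nested-dict tree encoding is an assumption made for illustration; it mirrors the ID3 tree shown earlier.

```python
# One rule per root-to-leaf path.  Internal nodes are {attribute: {value: child}},
# leaves are class labels (hypothetical encoding, chosen for this sketch).
tree = {"outlook": {
    "sunny":    {"humidity": {"high": "N", "normal": "P"}},
    "overcast": "P",
    "rain":     {"windy": {"true": "N", "false": "P"}},
}}

def rules(node, conditions=()):
    if not isinstance(node, dict):            # leaf: node is a class label
        print("IF", " AND ".join(conditions) or "TRUE", "THEN class =", node)
        return
    (attribute, branches), = node.items()
    for value, child in branches.items():
        rules(child, conditions + (f"{attribute}={value}",))

rules(tree)   # e.g. "IF outlook=sunny AND humidity=normal THEN class = P"
```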

Decision tree induction
- Basic algorithm: top-down, recursive, divide & conquer, greedy (may get trapped in local maxima).
- Many variants:
  - from machine learning: ID3 (Iterative Dichotomizer), C4.5 (Quinlan 86, 93)
  - from statistics: CART (Classification and Regression Trees) (Breiman et al. 84)
  - from pattern recognition: CHAID (Chi-squared Automatic Interaction Detection) (Magidson 94)
- Main difference: the divide (split) criterion.

Generate_DT(samples, attribute_list):
  Create a new node N;
  If samples are all of class C, then label N with C and exit;
  If attribute_list is empty, then label N with majority_class(N) and exit;
  Select best_split from attribute_list;
  For each value v of attribute best_split:
    Let S_v = set of samples with best_split = v;
    Let N_v = Generate_DT(S_v, attribute_list \ best_split);
    Create a branch from N to N_v labeled with the test best_split = v;
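A rough Python transcription of the pseudocode above, assuming samples are dicts with a "class" key and leaving the split-selection criterion as a pluggable function (see the criteria on the following slides).

```python
from collections import Counter

# Sketch of Generate_DT.  select_best_split(samples, attributes) is any split
# criterion (information gain, gini, ...); the tree is returned as nested dicts
# with class labels at the leaves.
def generate_dt(samples, attribute_list, select_best_split):
    classes = [s["class"] for s in samples]
    if len(set(classes)) == 1:                        # all of one class
        return classes[0]
    if not attribute_list:                            # no attributes left
        return Counter(classes).most_common(1)[0][0]  # majority class
    best = select_best_split(samples, attribute_list)
    node = {best: {}}
    remaining = [a for a in attribute_list if a != best]
    for v in {s[best] for s in samples}:              # one branch per value
        s_v = [s for s in samples if s[best] == v]
        node[best][v] = generate_dt(s_v, remaining, select_best_split)
    return node
```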

Criteria for finding the best split
- Information gain (ID3, C4.5): entropy, an information-theoretic concept, measures the impurity of a split; select the attribute that maximizes the entropy reduction.
- Gini index (CART): another measure of the impurity of a split; select the attribute that minimizes the impurity.
- χ2 contingency table statistic (CHAID): measures the correlation between each attribute and the class label; select the attribute with maximal correlation.

Information gain (ID3 – C4.5)
E.g., two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements.
Information needed to classify a sample in a set S containing p Pos and n Neg:
  fp = p/(p+n)    fn = n/(p+n)
  I(p,n) = -fp·log2(fp) - fn·log2(fn)
If p = 0 or n = 0, then I(p,n) = 0.

Information gain (ID3 – C4.5)
Entropy = information still needed to classify samples after a split by attribute A, which has k values.
The split results in the partition {S1, S2, …, Sk}; pi (resp. ni) = # elements of Si from Pos (resp. Neg).
  E(A) = Σi=1..k I(pi, ni) · (pi + ni)/(p + n)
  gain(A) = I(p,n) - E(A)
Select the attribute A which maximizes gain(A).
Extensible to continuous attributes.
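A short sketch of these formulas, checked against the root split of the play-tennis example (the per-value Pos/Neg counts follow from the conditional probabilities given earlier: sunny 2/3, overcast 4/0, rain 3/2).

```python
from math import log2

# I(p,n), E(A) and gain(A) as defined above.
def info(p, n):
    if p == 0 or n == 0:
        return 0.0
    fp, fn = p / (p + n), n / (p + n)
    return -fp * log2(fp) - fn * log2(fn)

def gain(p, n, partition):            # partition = [(p_i, n_i), ...]
    e = sum(info(pi, ni) * (pi + ni) / (p + n) for pi, ni in partition)
    return info(p, n) - e

# Root split on outlook: sunny (2 Pos, 3 Neg), overcast (4, 0), rain (3, 2)
print(gain(9, 5, [(2, 3), (4, 0), (3, 2)]))   # ~0.247 (quoted as 0.246 on the next slide)
```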

Information gain - play tennis example
Choosing the best split at the root node:
  gain(outlook) = 0.246
  gain(temperature) = 0.029
  gain(humidity) = 0.151
  gain(windy) = 0.048
The criterion is biased towards attributes with many values; corrections have been proposed (gain ratio).

Gini index
E.g., two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements.
  fp = p/(p+n)    fn = n/(p+n)
  gini(S) = 1 - fp² - fn²
If dataset S is split into S1, S2, then
  ginisplit(S1, S2) = gini(S1)·(p1+n1)/(p+n) + gini(S2)·(p2+n2)/(p+n)
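A sketch of the two-class gini and split gini defined above, applied to the root split on outlook = {overcast} vs {sunny, rain} discussed on the next slide.

```python
# Two-class gini impurity and the weighted gini of a binary split.
def gini(p, n):
    if p + n == 0:
        return 0.0
    fp, fn = p / (p + n), n / (p + n)
    return 1.0 - fp**2 - fn**2

def gini_split(partition):            # partition = [(p_i, n_i), ...]
    p = sum(pi for pi, _ in partition)
    n = sum(ni for _, ni in partition)
    return sum(gini(pi, ni) * (pi + ni) / (p + n) for pi, ni in partition)

# Play-tennis root split: {overcast} = (4 Pos, 0 Neg), {sunny, rain} = (5, 5)
print(gini_split([(4, 0), (5, 5)]))   # ~0.357
```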

Gini index - play tennis example
The two best splits at the root node:
- Split on outlook: S1 = {overcast} (4 Pos, 0 Neg: 100% P), S2 = {sunny, rain}
- Split on humidity: S1 = {normal} (6 Pos, 1 Neg: 86% P), S2 = {high}

Other criteria in decision tree construction
- Branching scheme: binary vs. k-ary splits; categorical vs. continuous attributes.
- Stop rule: how to decide that a node is a leaf:
  - all samples belong to the same class
  - impurity measure below a given threshold
  - no more attributes to split on
  - no samples in the partition
- Labeling rule: a leaf node is labeled with the class to which most samples at the node belong.

The overfitting problem
- Ideal goal of classification: find the simplest decision tree that fits the data and generalizes to unseen data; intractable in general.
- A decision tree may become too complex if it overfits the training samples, due to noise and outliers, too little training data, or local maxima in the greedy search.
- Two heuristics to avoid overfitting:
  - Stop earlier: stop growing the tree earlier.
  - Post-prune: allow overfitting, then simplify the tree.

Stopping vs. pruning
- Stopping: prevent the split on an attribute (predictor variable) if it is below a level of statistical significance - simply make it a leaf (CHAID).
- Pruning: after a complex tree has been grown, replace a split (subtree) with a leaf if the predicted validation error is no worse than that of the more complex tree (CART, C4.5).
- Integration of the two: PUBLIC (Rastogi and Shim 98) estimates pruning conditions (a lower bound on the cost of minimum-cost subtrees) during construction, and uses them to stop.

If the dataset is large
[Diagram: the available examples are divided randomly into a training set (70%), used to develop one tree, and a test set (30%), used to check accuracy; generalization is estimated by the test-set accuracy.]

If the dataset is not so large: cross-validation
[Diagram: repeat 10 times - use 90% of the available examples as training set to develop a tree and the remaining 10% as test set; tabulate the accuracies. Generalization is estimated by the mean and standard deviation of the 10 accuracies.]
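A sketch of the 10-fold procedure, with build_tree and accuracy left as placeholders for any learner and evaluation routine.

```python
import random
from statistics import mean, stdev

# 10-fold cross-validation as described above.  build_tree(train) returns a
# model; accuracy(model, test) returns its accuracy rate on the held-out fold.
def cross_validate(examples, build_tree, accuracy, k=10, seed=0):
    examples = examples[:]
    random.Random(seed).shuffle(examples)
    folds = [examples[i::k] for i in range(k)]       # k roughly equal parts
    scores = []
    for i in range(k):
        test = folds[i]                              # 10% held out
        train = [e for j, f in enumerate(folds) if j != i for e in f]  # 90%
        scores.append(accuracy(build_tree(train), test))
    return mean(scores), stdev(scores)               # generalization estimate
```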

Categorical vs. continuous attributes
- The information gain criterion may be adapted to continuous attributes using binary splits.
- The gini index may be adapted to categorical attributes.
- Typically, discretization is not a pre-processing step, but is performed dynamically during the decision tree construction.
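A sketch of one way such a dynamic binary split can be found for a numeric attribute: scan the midpoints between consecutive sorted values and keep the threshold with the lowest weighted gini impurity. The helper and the example readings are illustrative assumptions, not the tutorial's algorithm.

```python
# Find a binary split threshold for a continuous attribute by minimizing the
# weighted two-class gini impurity of the resulting partition.
def gini2(labels):
    if not labels:
        return 0.0
    fp = labels.count("P") / len(labels)
    return 1.0 - fp**2 - (1.0 - fp)**2

def best_threshold(points):            # points = [(numeric value, 'P' or 'N'), ...]
    points = sorted(points)
    best_impurity, best_t = float("inf"), None
    for (v1, _), (v2, _) in zip(points, points[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2              # candidate threshold between two values
        left = [c for v, c in points if v <= t]
        right = [c for v, c in points if v > t]
        impurity = (len(left) * gini2(left) + len(right) * gini2(right)) / len(points)
        if impurity < best_impurity:
            best_impurity, best_t = impurity, t
    return best_t

# Hypothetical humidity readings with play (P) / don't play (N) labels:
print(best_threshold([(65, "P"), (70, "P"), (80, "N"), (90, "N"), (95, "N")]))  # 75.0
```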

Summarizing …

                      C4.5                       CART                       CHAID
arity of split        binary and k-ary           binary                     k-ary
split criterion       information gain           gini index                 χ2
stop vs. prune        prune                      prune                      stop
type of attributes    categorical + continuous   categorical + continuous   categorical

Scalability to large databases
What if the dataset does not fit in main memory?
- Early approaches:
  - incremental tree construction (Quinlan 86)
  - merging trees constructed on separate data partitions (Chan & Stolfo 93)
  - data reduction via sampling (Catlett 91)
- Goal: handle on the order of 1G samples and 1K attributes.
- Successful contributions from data mining research: SLIQ (Mehta et al. 96), SPRINT (Shafer et al. 96), PUBLIC (Rastogi & Shim 98), RainForest (Gehrke et al. 98)

Classification with decision trees
- Reference technique: Quinlan's C4.5, and its evolution C5.0.
- Advanced mechanisms used: pruning factor, misclassification weights, boosting factor.

Bagging and boosting
- Bagging: build a set of classifiers from different samples of the same training set; decide by voting.
- Boosting: assign more weight to misclassified tuples. Can be used to build the (n+1)-th classifier, or to improve the old one.
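A minimal bagging sketch along the lines just described; the base learner build_classifier is a placeholder for any training routine (e.g. a decision tree inducer).

```python
import random
from collections import Counter

# Bagging: train several classifiers on bootstrap samples of the same training
# set and classify new tuples by majority vote.
def bagging(training_set, build_classifier, n_classifiers=10, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_classifiers):
        sample = [rng.choice(training_set) for _ in training_set]  # bootstrap
        models.append(build_classifier(sample))
    def vote(x):                       # majority vote over the ensemble
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return vote
```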

Module outline
- The classification task
- Main classification techniques
  - Decision trees
  - Bayesian classifiers
  - Hints to other methods

Backpropagation
- A neural network algorithm operating on multilayer feed-forward networks (Rumelhart et al. 86).
- A network is a set of connected input/output units where each connection has an associated weight.
- The weights are adjusted during the training phase in order to correctly predict the class labels of the samples.

Backpropagation
PROS:
- high accuracy
- robustness w.r.t. noise and outliers
CONS:
- long training time
- network topology must be chosen empirically
- poor interpretability of the learned weights

Prediction and (statistical) regression
- Regression = construction of models of continuous attributes as functions of other attributes.
- The constructed model can be used for prediction, e.g. a model to predict the sales of a product given its price.
- Many problems are solvable by linear regression, where the attribute Y (response variable) is modeled as a linear function of other attribute(s) X (predictor variable(s)): Y = a + b·X.
- The coefficients a and b are computed from the samples using the least squares method.
[Figure: a scatter plot of x vs. f(x) with a fitted regression line.]
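A sketch of the least-squares fit for Y = a + b·X; the price/sales numbers are made up purely to illustrate the call.

```python
# Fit Y = a + b*X by ordinary least squares.
def least_squares(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical price/sales pairs:
a, b = least_squares([1.0, 2.0, 3.0, 4.0], [10.0, 8.5, 7.0, 5.5])
print(a, b)   # a = 11.5, b = -1.5 ; predicted sales at price x: a + b*x
```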

Other methods (not covered)
- K-nearest neighbors
- Case-based reasoning
- Genetic algorithms
- Rough sets
- Fuzzy logic
- Association-based classification (Liu et al. 98)

References - classification
- C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
- F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. Using Data Mining Techniques in Fiscal Fraud Detection. In Proc. DaWaK'99, First Int. Conf. on Data Warehousing and Knowledge Discovery, Sept. 1999.
- F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. A Classification-based Methodology for Planning Audit Strategies in Fraud Detection. In Proc. KDD-99, ACM-SIGKDD Int. Conf. on Knowledge Discovery & Data Mining, Aug. 1999.
- J. Catlett. Megainduction: machine learning on very large databases. PhD Thesis, Univ. Sydney, 1991.
- P. K. Chan and S. J. Stolfo. Metalearning for multistrategy and parallel learning. In Proc. 2nd Int. Conf. on Information and Knowledge Management, p. 314-323, 1993.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
- L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
- P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. KDD'95, August 1995.
- J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.
- B. Liu, W. Hsu and Y. Ma. Integrating classification and association rule mining. In Proc. KDD'98, New York, 1998.

References - classification (cont.)
- J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
- M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
- S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4):345-389, 1998.
- J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), 725-730, Portland, OR, Aug. 1996.
- R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, 404-415, New York, NY, August 1998.
- J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, 544-555, Bombay, India, Sept. 1996.
- S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
- D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed Processing. The MIT Press, 1986.