Classification Today: Basic Problem Decision Trees

Classification Problem
Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to exactly one class. The mapping effectively divides D into equivalence classes. Prediction is similar, but may be viewed as having an infinite number of classes.

Classification Ex: Grading
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.
(The corresponding decision tree tests x against 90, 80, 70, and 60 in turn, with leaves A, B, C, D, and F.)
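The same rules can be written directly as a nested if/else chain, which is exactly the structure the decision tree encodes. A minimal Python sketch (the function name is illustrative):

```python
def grade(x):
    """Grading rules from the slide, expressed as a nested decision tree."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print(grade(85))  # B
```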

Classification Techniques
Approach:
1. Create a specific model by evaluating training data (or using domain experts' knowledge).
2. Apply the model to new data.
Classes must be predefined. Most common techniques use decision trees (DTs) or are based on distances or statistical methods.

Defining Classes
(Figures contrasting partitioning-based and distance-based class definitions; not reproduced in the transcript.)

Issues in Classification
Missing data
– Ignore
– Replace with an assumed value
Measuring performance
– Classification accuracy on test data
– Confusion matrix
– Operating characteristic (OC) curve

Height Example Data
(Training table of 15 tuples with gender and height attributes; the table itself is not reproduced in the transcript.)

Classification Performance
Four possible outcomes: true positive, true negative, false positive, false negative.

Confusion Matrix Example
Using the height data example, with Output1 as the correct assignment and Output2 as the actual (predicted) assignment.
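The height table itself did not survive the transcript, so here is a minimal sketch of how such a matrix is tallied, using hypothetical Output1/Output2 label lists in place of the slide's columns:

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Count (actual, predicted) pairs; rows = actual class, cols = predicted."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

# Hypothetical assignments standing in for the slide's Output1 (correct)
# and Output2 (actual assignment) columns.
classes = ["Short", "Medium", "Tall"]
output1 = ["Short", "Medium", "Tall", "Medium", "Short"]
output2 = ["Short", "Tall", "Tall", "Medium", "Medium"]
for row in confusion_matrix(output1, output2, classes):
    print(row)
```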

Operating Characteristic Curve

Classification Using Decision Trees
Partitioning based: divides the search space into rectangular regions. A tuple is placed into a class based on the region within which it falls. DT approaches differ in how the tree is built (DT induction). Internal nodes are associated with attributes, and arcs with values of those attributes. Algorithms: ID3, C4.5, CART.

Decision Tree
Given:
– D = {t1, …, tn} where ti = ⟨ti1, …, tih⟩
– Database schema contains {A1, A2, …, Ah}
– Classes C = {C1, …, Cm}
A decision or classification tree is a tree associated with D such that:
– Each internal node is labeled with an attribute Ai
– Each arc is labeled with a predicate that can be applied to the attribute at its parent
– Each leaf node is labeled with a class Cj

DT Induction
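The generic induction algorithm shown on this slide did not survive the transcript. Below is a minimal Python sketch of the usual recursive build process, using information gain as the splitting criterion (as ID3 does); the toy tuples at the end are hypothetical:

```python
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def info_gain(rows, attr, target):
    """Reduction in entropy from splitting `rows` on `attr`."""
    labels = [r[target] for r in rows]
    before = entropy(labels)
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

def build_tree(rows, attributes, target):
    """Generic DT induction: choose the best split, partition, recurse."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:              # pure node -> leaf
        return labels[0]
    if not attributes:                     # nothing left to split on -> majority
        return max(set(labels), key=labels.count)
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    rest = [a for a in attributes if a != best]
    return {best: {v: build_tree([r for r in rows if r[best] == v], rest, target)
                   for v in {r[best] for r in rows}}}

# Toy usage with hypothetical height tuples:
rows = [{"gender": "F", "height": "short", "class": "Short"},
        {"gender": "M", "height": "tall", "class": "Tall"},
        {"gender": "F", "height": "tall", "class": "Tall"}]
print(build_tree(rows, ["gender", "height"], "class"))
```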

DT Splits Area
(Figure: the attribute space partitioned into rectangular regions by splits on Gender (M/F) and Height; not reproduced in the transcript.)

Comparing DTs
(Figure contrasting a balanced tree with a deep tree; not reproduced in the transcript.)

DT Issues
– Choosing splitting attributes
– Ordering of splitting attributes
– Splits
– Tree structure
– Stopping criteria
– Training data
– Pruning

Information/Entropy
Given probabilities p1, p2, …, ps whose sum is 1, entropy is defined as:
H(p1, …, ps) = Σi pi log(1/pi) = −Σi pi log pi
Entropy measures the amount of randomness, surprise, or uncertainty in the data. The goal in classification is no surprise, i.e., entropy = 0. (For example, a fair coin has entropy log 2, the maximum for two outcomes, while a two-headed coin has entropy 0.)

ID3
Creates the tree using information-theoretic concepts and tries to reduce the expected number of comparisons. ID3 chooses the split attribute with the highest information gain:
Gain(D, S) = H(D) − Σi P(Di) H(Di)
where the split S partitions D into subsets D1, …, Ds.

ID3 Example (Output1)
Starting state entropy: 4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384 (base-10 logs)
Gain using gender:
– Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
– Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
– Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.3415
– Gain: 0.4384 − 0.3415 = 0.0969
Gain using height: 0.4384 − (2/15)(0.301) = 0.3983
Choose height as the first splitting attribute.
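A short Python sketch that reproduces the slide's arithmetic (base-10 logarithms, per the 0.301 = log 2 term); the per-gender class counts are read off the fractions above:

```python
from math import log10

def entropy(counts):
    """H = sum of (c/n) * log10(n/c) over nonzero class counts."""
    n = sum(counts)
    return sum(c / n * log10(n / c) for c in counts if c)

start = entropy([4, 8, 3])                          # ≈ 0.4384
female, male = entropy([3, 6]), entropy([1, 2, 3])  # ≈ 0.2764, 0.4392
weighted = 9 / 15 * female + 6 / 15 * male          # ≈ 0.3415
print(round(start - weighted, 4))                   # gain(gender) ≈ 0.0969
```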

C4.5
ID3 favors attributes with a large number of divisions. C4.5 is an improved version of ID3 that adds handling for:
– Missing data
– Continuous data
– Pruning
– Rules
– GainRatio: GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, …, |Ds|/|D|)
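A minimal sketch of the GainRatio computation, assuming (as in the gender example above) a split into groups of 9 and 6, with base-10 logs as elsewhere in these slides:

```python
from math import log10

def split_entropy(subset_sizes):
    """Entropy of the split proportions |Di|/|D| (base-10, matching the slides)."""
    n = sum(subset_sizes)
    return sum(s / n * log10(n / s) for s in subset_sizes if s)

# Gender splits the 15 tuples into groups of 9 and 6; gain from the ID3 slide.
gain = 0.0969
print(round(gain / split_entropy([9, 6]), 4))   # GainRatio ≈ 0.3315
```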

CART
– Creates a binary tree
– Uses an entropy-style impurity measure
– Formula to choose the split point, s, for node t:
Φ(s/t) = 2 PL PR Σj |P(Cj | tL) − P(Cj | tR)|
– PL, PR: probability that a tuple in the training set will be on the left or right side of the tree

CART Example
At the start, there are six choices for the split point (right branch on equality):
– P(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.288
– P(1.6) = 0
– P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
– P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
– P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.288
– P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
Split at 1.8, the largest value.
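A sketch that reproduces the slide's arithmetic. Note that the slide evaluates each class-difference term as |nLj − nRj| / n over the whole training set, rather than as the conditional probabilities in the formula above; the per-side class counts are reconstructed from the fractions in the ID3 slide:

```python
def phi(left_counts, right_counts):
    """CART split measure as evaluated on the slide:
    2 * P_L * P_R * sum_j |n_Lj - n_Rj| / n."""
    n_l, n_r = sum(left_counts), sum(right_counts)
    n = n_l + n_r
    spread = sum(abs(l - r) / n for l, r in zip(left_counts, right_counts))
    return 2 * (n_l / n) * (n_r / n) * spread

# Gender split, per-class counts (Short, Medium, Tall) on each side,
# reconstructed from the ID3 slide: Female = (3, 6, 0), Male = (1, 2, 3).
print(round(phi([3, 6, 0], [1, 2, 3]), 3))   # 0.288
```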

Problem to Work On: Training Dataset
This follows an example from Quinlan's ID3. (The 14-tuple buys_computer training table is not reproduced in the transcript.)

Output: A Decision Tree for "buys_computer"
age?
– <=30: student? (no → no; yes → yes)
– 31…40: yes
– >40: credit_rating? (excellent → no; fair → yes)

Bayesian Classification: Why?
– Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
– Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
– Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
– Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
– Let X be a data sample whose class label is unknown
– Let H be the hypothesis that X belongs to class C
– For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
– P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects background knowledge)
– P(X): probability that the sample data is observed
– P(X|H): probability of observing the sample X, given that the hypothesis holds

Bayes Theorem (Recap)
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows from Bayes theorem:
P(H|X) = P(X|H) P(H) / P(X)
The MAP (maximum a posteriori) hypothesis is the h maximizing P(h|X), equivalently maximizing P(X|h) P(h).
Practical difficulty: requires initial knowledge of many probabilities, carries significant computational cost, and suffers when data are insufficient.

Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally independent given the class. The probability of observing, say, two attribute values y1 and y2 given class C is the product of their individual conditional probabilities:
P([y1, y2] | C) = P(y1 | C) × P(y2 | C)
No dependence relation between attributes is modeled. This greatly reduces the computational cost: only the per-class distributions need to be counted. Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci) P(Ci).

Training dataset
Classes:
C1: buys_computer = "yes"
C2: buys_computer = "no"
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Naïve Bayesian Classifier: Example
Compute P(X|Ci) for each class:
P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|buys_computer="yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer="no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
Multiplying by the priors P(Ci) (9/14 for "yes", 5/14 for "no") gives 0.028 vs. 0.007, so X belongs to class buys_computer = "yes".
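A minimal sketch of the whole computation; the priors 9/14 and 5/14 are inferred from the denominators of the conditional probabilities (9 "yes" and 5 "no" tuples, 14 in total):

```python
def naive_bayes_score(prior, cond_probs):
    """P(Ci) times the product of per-attribute conditionals P(x_k|Ci)."""
    score = prior
    for p in cond_probs:
        score *= p
    return score

# Conditional probabilities from the slide; priors inferred from the
# 14-tuple buys_computer training set (9 "yes", 5 "no").
yes = naive_bayes_score(9 / 14, [2/9, 4/9, 6/9, 6/9])
no = naive_bayes_score(5 / 14, [3/5, 2/5, 1/5, 2/5])
print(round(yes, 4), round(no, 4))   # 0.0282 0.0069 -> predict "yes"
```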

Naïve Bayesian Classifier: Comments
Advantages:
– Easy to implement
– Good results obtained in most cases
Disadvantages:
– Assumes class-conditional independence, hence a loss of accuracy
– In practice, dependencies exist among variables
– E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
– Dependencies among these cannot be modeled by the naïve Bayesian classifier
How to deal with these dependencies? Bayesian belief networks.

Classification Using Distance
Place items in the class to which they are "closest". Must determine the distance between an item and a class. Classes may be represented by:
– Centroid: central value
– Medoid: a representative point
– Individual points
Algorithm: KNN
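A minimal sketch of the centroid variant, assuming numeric features and Euclidean distance; the class names and centroid values are hypothetical:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def nearest_centroid(x, centroids):
    """Assign x to the class whose centroid (central value) is closest."""
    return min(centroids, key=lambda c: dist(x, centroids[c]))

# Hypothetical 1-D height centroids for three classes (metres).
centroids = {"Short": (1.55,), "Medium": (1.75,), "Tall": (1.95,)}
print(nearest_centroid((1.8,), centroids))   # Medium
```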

K Nearest Neighbor (KNN):
The training set includes class labels. Examine the K items nearest to the item to be classified. The new item is placed in the class with the largest number of these close items. Classification costs O(q) per tuple, where q is the size of the training set.

KNN

KNN Algorithm
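The algorithm listing on this slide did not survive the transcript; below is a minimal Python sketch of KNN as described above (the training values are hypothetical):

```python
from collections import Counter
from math import dist

def knn_classify(x, training, k=3):
    """Classify x by majority vote among its k nearest training items.
    `training` is a list of (point, class_label) pairs; O(q) per query."""
    neighbors = sorted(training, key=lambda item: dist(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical height training set (metres).
train = [((1.5,), "Short"), ((1.6,), "Short"), ((1.7,), "Medium"),
         ((1.8,), "Medium"), ((1.95,), "Tall"), ((2.0,), "Tall")]
print(knn_classify((1.85,), train, k=3))   # Medium
```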