SEEM Tutorial 2 Classification: Decision tree, Naïve Bayes & k-NN


SEEM4630 2013-2014 Tutorial 2 Classification: Decision tree, Naïve Bayes & k-NN Wentao TIAN, wttian@se.cuhk.edu.hk

Classification: Definition. Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Three classifiers are covered in this tutorial: decision tree, Naïve Bayes, and k-NN. Goal: previously unseen records should be assigned a class as accurately as possible.

Decision Tree
Goal: construct a tree so that instances belonging to different classes are separated.
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive manner.
- At the start, all the training examples are at the root.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
- Examples are partitioned recursively based on the selected attributes.
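As a rough illustration of this greedy, top-down procedure (a minimal sketch, not the exact C4.5 implementation used in the slides), the following Python snippet builds a tree from a list of record dictionaries; the function name `build_tree` and the pluggable `select_attribute` argument are my own assumptions:

```python
from collections import Counter

def build_tree(rows, attributes, target, select_attribute):
    """Greedy top-down induction. rows: list of dicts; select_attribute picks
    the split attribute (e.g., by information gain)."""
    labels = [r[target] for r in rows]
    # Stop if the node is pure or no attributes remain: return a leaf with the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = select_attribute(rows, attributes, target)
    tree = {best: {}}
    # Partition the examples by each value of the chosen attribute and recurse.
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, target, select_attribute)
    return tree
```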

Attribute Selection Measure 1: Information Gain
Let pi be the probability that a tuple in D belongs to class Ci, estimated by |Ci,D|/|D|.
Expected information (entropy) needed to classify a tuple in D: Info(D) = -Σi pi log2(pi)
Information needed (after using A to split D into v partitions) to classify D: InfoA(D) = Σj=1..v (|Dj|/|D|) × Info(Dj)
Information gained by branching on attribute A: Gain(A) = Info(D) - InfoA(D)
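The same two formulas written out as a small Python sketch (the helper names `entropy` and `information_gain` are my own, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the classes present in D."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(A) = Info(D) - sum(|Dj|/|D| * Info(Dj)) over the partitions induced by A."""
    labels = [r[target] for r in rows]
    total = len(rows)
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder
```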

Attribute Selection Measure 2: Gain Ratio
The information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses the gain ratio, a normalization of information gain, to overcome this problem:
SplitInfoA(D) = -Σj=1..v (|Dj|/|D|) log2(|Dj|/|D|)
GainRatio(A) = Gain(A) / SplitInfoA(D)
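A minimal sketch of the split information term (helper names are my own; the gain value itself would come from the `information_gain` sketch above):

```python
import math
from collections import Counter

def split_info(rows, attribute):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the values of attribute A."""
    total = len(rows)
    counts = Counter(r[attribute] for r in rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gain_ratio(gain, rows, attribute):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); guard against a zero split information."""
    si = split_info(rows, attribute)
    return gain / si if si > 0 else 0.0
```

For instance, on the Play Tennis data used later in this tutorial, SplitInfo(Outlook) is roughly 1.58, so GainRatio(Outlook) ≈ 0.25 / 1.58 ≈ 0.16.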

Attribute Selection Measure 3: Gini Index
If a data set D contains examples from n classes, the gini index gini(D) is defined as gini(D) = 1 - Σj pj^2, where pj is the relative frequency of class j in D.
If D is split on A into two subsets D1 and D2, the gini index of the split is giniA(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2).
Reduction in impurity: Δgini(A) = gini(D) - giniA(D)
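A minimal sketch of these three quantities (function names are my own assumptions):

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum(p_j^2) over the classes present in D."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(labels_d1, labels_d2):
    """gini_A(D) for a binary split of D into D1 and D2."""
    total = len(labels_d1) + len(labels_d2)
    return (len(labels_d1) / total) * gini(labels_d1) + (len(labels_d2) / total) * gini(labels_d2)

def gini_reduction(labels, labels_d1, labels_d2):
    """Reduction in impurity achieved by the split."""
    return gini(labels) - gini_split(labels_d1, labels_d2)
```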

Example: the Play Tennis data set. Attributes and values: Outlook ∈ {Sunny, Overcast, Rain}; Temperature ∈ {>25, 15-25, <15}; Humidity ∈ {High, Normal}; Wind ∈ {Weak, Strong}; class Play Tennis ∈ {Yes, No}. (The full 14-record table is listed with the tree induction example below.)

Tree induction example. Entropy of the data S: Info(S) = -9/14(log2(9/14)) - 5/14(log2(5/14)) = 0.94
Split data by attribute Outlook: S[9+,5-] → Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]
Gain(Outlook) = 0.94 – 5/14[-2/5(log2(2/5))-3/5(log2(3/5))] – 4/14[-4/4(log2(4/4))-0/4(log2(0/4))] – 5/14[-3/5(log2(3/5))-2/5(log2(2/5))] = 0.94 – 0.69 = 0.25

Tree induction example. Split data by attribute Temperature: S[9+,5-] → <15 [3+,1-], 15-25 [4+,2-], >25 [2+,2-]
Gain(Temperature) = 0.94 – 4/14[-3/4(log2(3/4))-1/4(log2(1/4))] – 6/14[-4/6(log2(4/6))-2/6(log2(2/6))] – 4/14[-2/4(log2(2/4))-2/4(log2(2/4))] = 0.94 – 0.91 = 0.03

Tree induction example. Split data by attribute Humidity: S[9+,5-] → High [3+,4-], Normal [6+,1-]
Gain(Humidity) = 0.94 – 7/14[-3/7(log2(3/7))-4/7(log2(4/7))] – 7/14[-6/7(log2(6/7))-1/7(log2(1/7))] = 0.94 – 0.79 = 0.15
Split data by attribute Wind: S[9+,5-] → Weak [6+,2-], Strong [3+,3-]
Gain(Wind) = 0.94 – 8/14[-6/8(log2(6/8))-2/8(log2(2/8))] – 6/14[-3/6(log2(3/6))-3/6(log2(3/6))] = 0.94 – 0.89 = 0.05

Tree induction example. Full training data (Play Tennis):

Outlook   Temperature  Humidity  Wind    Play Tennis
Sunny     >25          High      Weak    No
Sunny     >25          High      Strong  No
Overcast  >25          High      Weak    Yes
Rain      15-25        High      Weak    Yes
Rain      <15          Normal    Weak    Yes
Rain      <15          Normal    Strong  No
Overcast  <15          Normal    Strong  Yes
Sunny     15-25        High      Weak    No
Sunny     <15          Normal    Weak    Yes
Rain      15-25        Normal    Weak    Yes
Sunny     15-25        Normal    Strong  Yes
Overcast  15-25        High      Strong  Yes
Overcast  >25          Normal    Weak    Yes
Rain      15-25        High      Strong  No

Comparing the measures: Gain(Outlook) = 0.25, Gain(Temperature) = 0.03, Gain(Humidity) = 0.15, Gain(Wind) = 0.05. Outlook has the highest gain, so it becomes the root: the Overcast branch is pure (all Yes), while the Sunny and Rain branches (??) still need to be split.
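These four gains can be reproduced with a short Python sketch over the table above (the tuple encoding and function names are my own):

```python
import math
from collections import Counter

# Play Tennis data from the table above: (Outlook, Temperature, Humidity, Wind, PlayTennis).
DATA = [
    ("Sunny", ">25", "High", "Weak", "No"),         ("Sunny", ">25", "High", "Strong", "No"),
    ("Overcast", ">25", "High", "Weak", "Yes"),     ("Rain", "15-25", "High", "Weak", "Yes"),
    ("Rain", "<15", "Normal", "Weak", "Yes"),       ("Rain", "<15", "Normal", "Strong", "No"),
    ("Overcast", "<15", "Normal", "Strong", "Yes"), ("Sunny", "15-25", "High", "Weak", "No"),
    ("Sunny", "<15", "Normal", "Weak", "Yes"),      ("Rain", "15-25", "Normal", "Weak", "Yes"),
    ("Sunny", "15-25", "Normal", "Strong", "Yes"),  ("Overcast", "15-25", "High", "Strong", "Yes"),
    ("Overcast", ">25", "Normal", "Weak", "Yes"),   ("Rain", "15-25", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, idx):
    """Information gain of splitting rows on the attribute at column idx (class is the last column)."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[idx] for r in rows):
        subset = [r[-1] for r in rows if r[idx] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

for name, idx in ATTRS.items():
    print(name, round(gain(DATA, idx), 2))
# Expected output (rounded): Outlook 0.25, Temperature 0.03, Humidity 0.15, Wind 0.05
```

Filtering DATA to the Sunny (or Rain) rows and re-running the same loop reproduces the branch-level gains computed next.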

Splitting the Sunny branch. Entropy of branch Sunny: Info(Sunny) = -2/5(log2(2/5)) - 3/5(log2(3/5)) = 0.97
Split Sunny branch by attribute Temperature: Sunny[2+,3-] → <15 [1+,0-], 15-25 [1+,1-], >25 [0+,2-]
Gain(Temperature) = 0.97 – 1/5[-1/1(log2(1/1))-0/1(log2(0/1))] – 2/5[-1/2(log2(1/2))-1/2(log2(1/2))] – 2/5[-0/2(log2(0/2))-2/2(log2(2/2))] = 0.97 – 0.4 = 0.57
Split Sunny branch by attribute Humidity: Sunny[2+,3-] → High [0+,3-], Normal [2+,0-]
Gain(Humidity) = 0.97 – 3/5[-0/3(log2(0/3))-3/3(log2(3/3))] – 2/5[-2/2(log2(2/2))-0/2(log2(0/2))] = 0.97 – 0 = 0.97
Split Sunny branch by attribute Wind: Sunny[2+,3-] → Weak [1+,2-], Strong [1+,1-]
Gain(Wind) = 0.97 – 3/5[-1/3(log2(1/3))-2/3(log2(2/3))] – 2/5[-1/2(log2(1/2))-1/2(log2(1/2))] = 0.97 – 0.95 = 0.02
Humidity has the highest gain, so it is chosen for the Sunny branch.

Tree induction example. Tree so far:
Outlook = Sunny → split on Humidity: High → No, Normal → Yes
Outlook = Overcast → Yes
Outlook = Rain → ?? (still to be split)

Splitting the Rain branch. Entropy of branch Rain: Info(Rain) = -3/5(log2(3/5)) - 2/5(log2(2/5)) = 0.97
Split Rain branch by attribute Temperature: Rain[3+,2-] → <15 [1+,1-], 15-25 [2+,1-], >25 [0+,0-]
Gain(Temperature) = 0.97 – 2/5[-1/2(log2(1/2))-1/2(log2(1/2))] – 3/5[-2/3(log2(2/3))-1/3(log2(1/3))] – 0/5 × 0 = 0.97 – 0.95 = 0.02
Split Rain branch by attribute Humidity: Rain[3+,2-] → High [1+,1-], Normal [2+,1-]
Gain(Humidity) = 0.97 – 2/5[-1/2(log2(1/2))-1/2(log2(1/2))] – 3/5[-2/3(log2(2/3))-1/3(log2(1/3))] = 0.97 – 0.95 = 0.02
Split Rain branch by attribute Wind: Rain[3+,2-] → Weak [3+,0-], Strong [0+,2-]
Gain(Wind) = 0.97 – 3/5[-3/3(log2(3/3))-0/3(log2(0/3))] – 2/5[-0/2(log2(0/2))-2/2(log2(2/2))] = 0.97 – 0 = 0.97
Wind has the highest gain, so it is chosen for the Rain branch.

Final decision tree:
Outlook = Sunny → Humidity: High → No, Normal → Yes
Outlook = Overcast → Yes
Outlook = Rain → Wind: Weak → Yes, Strong → No

Bayesian Classification
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
Foundation: Bayes' Theorem. For a record X = (x1, ..., xn), where xi is the value of attribute Ai:
P(C|X) = P(X|C) P(C) / P(X)
posterior probability = likelihood × prior probability / evidence
Choose the class label with the highest posterior probability. Model: the probabilities P(X|C) and P(C) are computed from the training data.

Naïve Bayes Classifier
Problem: the joint probability P(x1, ..., xn | C) is difficult to estimate directly.
Naïve Bayes assumption: the attributes are conditionally independent given the class, so
P(x1, ..., xn | C) = P(x1|C) × P(x2|C) × ... × P(xn|C)

Example: Naïve Bayes Classifier
[Training data table over attributes A, B and class C not reproduced in the transcript; the estimated probabilities are:]
P(C=t) = 1/2, P(C=f) = 1/2
P(A=m|C=t) = 2/5, P(A=m|C=f) = 1/5
P(B=q|C=t) = 2/5, P(B=q|C=f) = 2/5
Test record: A=m, B=q, C=?

Example: Naïve Bayes Classifier
For C = t: P(A=m|C=t) × P(B=q|C=t) × P(C=t) = 2/5 × 2/5 × 1/2 = 2/25, so P(C=t|A=m, B=q) = (2/25) / P(A=m, B=q)
For C = f: P(A=m|C=f) × P(B=q|C=f) × P(C=f) = 1/5 × 2/5 × 1/2 = 1/25, so P(C=f|A=m, B=q) = (1/25) / P(A=m, B=q)
The denominator P(A=m, B=q) is the same for both classes, and 2/25 > 1/25, so the test record A=m, B=q is classified as C=t.
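A minimal Python sketch of this calculation, with the probability tables hard-coded from the worked example above (the variable and function names are my own):

```python
# Priors and conditional probabilities taken from the example above.
priors = {"t": 1/2, "f": 1/2}
cond = {
    ("A", "m"): {"t": 2/5, "f": 1/5},
    ("B", "q"): {"t": 2/5, "f": 2/5},
}

def naive_bayes_score(record, cls):
    """Unnormalized posterior: P(C=cls) * product over attributes of P(attr=value | C=cls)."""
    score = priors[cls]
    for attr_value in record:
        score *= cond[attr_value][cls]
    return score

record = [("A", "m"), ("B", "q")]
scores = {cls: naive_bayes_score(record, cls) for cls in priors}
print(scores)                       # {'t': 0.08, 'f': 0.04}, i.e. 2/25 vs 1/25
print(max(scores, key=scores.get))  # 't'
```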

Nearest Neighbor Classification
Input: a set of stored records and k, the number of nearest neighbors.
To classify an unknown record:
- Compute its distance to each stored record, e.g., Euclidean distance d(p, q) = sqrt(Σi (pi - qi)^2)
- Identify the k nearest neighbors
- Determine the class label of the unknown record from the class labels of its nearest neighbors (i.e., by taking a majority vote)

Nearest Neighbor Classification: A Discrete Example
Input: 8 training instances
P1 (4, 2) → Orange
P2 (0.5, 2.5) → Orange
P3 (2.5, 2.5) → Orange
P4 (3, 3.5) → Orange
P5 (5.5, 3.5) → Orange
P6 (2, 4) → Black
P7 (4, 5) → Black
P8 (2.5, 5.5) → Black
New instance: Pn (4, 4) → ?, with k = 1 and k = 3
Calculate the distances:
d(P1, Pn) = 2, d(P2, Pn) = 3.80, d(P3, Pn) = 2.12, d(P4, Pn) = 1.12, d(P5, Pn) = 1.58, d(P6, Pn) = 2, d(P7, Pn) = 1, d(P8, Pn) = 2.12
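A small Python sketch that reproduces this example (the dictionary layout and the name `knn_classify` are my own; `math.dist` requires Python 3.8+):

```python
import math
from collections import Counter

# Training instances from the example above: (x, y) -> class label.
TRAIN = {
    (4, 2): "Orange", (0.5, 2.5): "Orange", (2.5, 2.5): "Orange", (3, 3.5): "Orange",
    (5.5, 3.5): "Orange", (2, 4): "Black", (4, 5): "Black", (2.5, 5.5): "Black",
}

def knn_classify(query, k):
    """Classify `query` by majority vote among its k nearest training points (Euclidean distance)."""
    ranked = sorted(TRAIN.items(), key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

pn = (4, 4)
print(knn_classify(pn, 1))  # 'Black'  -- the single nearest neighbor is (4, 5)
print(knn_classify(pn, 3))  # 'Orange' -- two of the three nearest neighbors are Orange
```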

Nearest Neighbor Classification: result for k = 1 and k = 3
With k = 1, the single nearest neighbor is P7 (Black), so Pn is classified as Black.
With k = 3, the nearest neighbors are P7 (Black), P4 (Orange) and P5 (Orange), so the majority vote classifies Pn as Orange.

Nearest Neighbor Classification: Scaling Issues
Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes; each attribute should be mapped onto the same range, e.g., with min-max normalization: v' = (v - min) / (max - min).
Example: for the two data records a = (1, 1000) and b = (0.5, 1), what is dis(a, b)? Without scaling, the second attribute completely dominates the distance.
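A quick sketch of min-max normalization applied to the two records from the example (in practice the per-attribute min and max would come from the full training set, not just two records; the helper name is my own):

```python
import math

def min_max_normalize(records):
    """Map each attribute onto [0, 1] using v' = (v - min) / (max - min), computed per attribute."""
    columns = list(zip(*records))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(rec, lows, highs))
        for rec in records
    ]

a, b = (1, 1000), (0.5, 1)
print(math.dist(a, b))       # ~999.0: the distance is dominated by the second attribute
a_n, b_n = min_max_normalize([a, b])
print(math.dist(a_n, b_n))   # ~1.41: after scaling, both attributes contribute equally
```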

Classification: Lazy & Eager Learning
Two types of learning methodologies:
Lazy learning: instance-based learning, e.g., k-NN.
Eager learning: decision trees and Bayesian classification, as well as ANNs and SVMs.

Differences Between Lazy & Eager Learning
Lazy learning: does not require model building; less time is spent training but more time predicting; it effectively uses a richer hypothesis space, since it combines many local linear functions to form an implicit global approximation to the target function.
Eager learning: requires model building; more time is spent training but less time predicting; it must commit to a single hypothesis that covers the entire instance space.

Thank you & Questions?