Classification Continued


Classification Continued: Decision Trees
Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside, Riverside, CA 92521
eamonn@cs.ucr.edu

Let's review the classification techniques we have seen so far, in terms of their decision surfaces.


Linear Classifier (decision surface figure)

Nearest Neighbor Classifier


Decision Tree Classification
A flow-chart-like tree structure:
- An internal node denotes a test on an attribute
- A branch represents an outcome of the test
- Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases:
- Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes
- Tree pruning: identify and remove branches that reflect noise or outliers
Use of a decision tree: to classify an unknown sample, test the attribute values of the sample against the decision tree.

Decision Tree Example I
We have the above customer data in our database. Based upon this data, we want to predict whether potential customers are likely to buy a computer. For example: will Joe, a 25 year old lumberjack with medium income and a fair credit rating, buy a PC?

Decision Tree Example II
Joe, a 25 year old lumberjack with medium income and a fair credit rating: ?, <=30, medium, no, fair

Age?
- <=30: test Student? (branches: no / yes)
- 31 to 40: Yes
- >40: test CreditRating? (branches: excellent / fair)
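As an illustration, here is a minimal Python sketch of this tree as nested dictionaries. The internal nodes and branch labels come from the slide; the leaf labels under Student? and CreditRating? are assumed from the standard "buys computer" example this slide appears to follow.

```python
# Sketch only: the leaf labels ("No"/"Yes") under Student? and CreditRating?
# are assumed, not read from the slide.
tree = {
    "attribute": "Age",
    "branches": {
        "<=30":     {"attribute": "Student",
                     "branches": {"no": "No", "yes": "Yes"}},
        "31 to 40": "Yes",
        ">40":      {"attribute": "CreditRating",
                     "branches": {"excellent": "No", "fair": "Yes"}},
    },
}

def classify(node, record):
    """Walk from the root until a leaf (a plain class label) is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

joe = {"Age": "<=30", "Student": "no", "CreditRating": "fair"}
print(classify(tree, joe))   # -> "No": Joe is not predicted to buy a PC
```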

How do we construct the decision tree?
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they can be discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left
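A minimal Python sketch of this recursion (illustrative only; choose_best_attribute is a placeholder for the selection heuristic, e.g. information gain, discussed later):

```python
from collections import Counter

def build_tree(examples, attributes, choose_best_attribute):
    """Greedy top-down induction. examples is a list of (features_dict, label)."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:              # all samples belong to the same class
        return labels[0]
    if not attributes:                     # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_attribute(examples, attributes)
    node = {"attribute": best, "branches": {}}
    for value in {feats[best] for feats, _ in examples}:   # only observed values
        subset = [(f, lab) for f, lab in examples if f[best] == value]
        remaining = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(subset, remaining, choose_best_attribute)
    return node
```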

Imagine this dataset shows two classes of people, healthy and sick. The X-axis shows their blood sugar count, the Y-axis shows their white cell count. We want to find the single best rule of the form:
if somefeature > somevalue then class = sick else class = healthy
For this data, the best such rule is:
if blood sugar > 3.5 then class = sick else class = healthy
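A small sketch of how such a threshold could be found by brute force, scanning the midpoints between consecutive sorted values (the blood-sugar readings below are made up for illustration):

```python
def best_threshold(values, labels):
    """Return the threshold t minimizing errors of the rule
    'value > t -> sick, else healthy', scanning midpoints of sorted values."""
    pairs = sorted(zip(values, labels))
    best_t, best_errors = None, len(pairs) + 1
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        t = (v1 + v2) / 2
        errors = sum((v > t) != (lab == "sick") for v, lab in pairs)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

# Hypothetical blood sugar readings and their classes:
sugar  = [1.0, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0, 6.0]
labels = ["healthy"] * 4 + ["sick"] * 4
print(best_threshold(sugar, labels))   # -> (3.5, 0): the rule from the slide
```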

Blood Sugar > 3.5?
- no: Healthy
- yes: Sick


We have only informally shown how the decision tree chooses the splitting point for continuous attributes. How do we choose a splitting criterion for nominal or Boolean attributes? We want to find the single best rule of the form:
if somefeature = somevalue then class = sick else class = healthy
Figure: a split on the nominal attribute Gender (M / F), with Height on the other axis.

Example of a problem that decision trees do poorly on.

We have now seen several classification algorithms. How should we compare them?
- Predictive accuracy
- Speed and scalability: time to construct the model, time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model

What happens if we run out of features to test before correctly partitioning the test set?
Here we have a dataset with 3 features: Weight, Blood PH, and Height. We are trying to classify people into two classes, yes or no (i.e., yes, they will get sick, or no, they won't).
Figure: a tree that splits on Weight (<=30, 31 to 40, >40), then on tests such as Blood PH < 7? and Height > 178?.
Most items are classified, but 28 individuals (18 no, 10 yes) remain unclassified after using all the features. Recall from the stopping conditions that when no attributes remain, majority voting is used to label such a leaf.

Feature Generation
Figure: Weight plotted against Height, together with the derived feature BMI = kg/m².

Feature Generation Case Study
Suppose we have the following two classes:
Class A: 100 random coin tosses
Class B: a human "faking" 100 random coin tosses
A 10100010101010101010101010101010110101…
A 11010101001010101000101010100101010101…
B 10100010101010101010101010101010101111…
B 10100101010101001100111010101010101010…
A 11110101010111101000111010101010111010…
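The slides do not say which feature separates the classes, but one plausible generated feature is the length of the longest run of identical tosses, since people faking randomness tend to avoid long runs. A hedged sketch:

```python
from itertools import groupby

def longest_run(tosses):
    """Length of the longest run of identical symbols in a 0/1 string."""
    return max(len(list(group)) for _, group in groupby(tosses))

# Made-up short sequences for illustration:
print(longest_run("10100010101"))   # 3  (the run of three 0s)
print(longest_run("10101010101"))   # 1  (no symbol ever repeats)
```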

What splitting criteria should we use?

How many bits do I need to label all the objects in this box? How many bits do I need to label all the objects in these boxes?

Information Gain as a Splitting Criterion
Select the attribute with the highest information gain (information gain is the expected reduction in entropy).
Assume there are two classes, P and N, and let the set of examples S contain p elements of class P and n elements of class N.
The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as:
I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
(0 log2(0) is defined as 0.)
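The same quantity written as a small Python function (a sketch; note the 0·log2(0) = 0 convention):

```python
from math import log2

def entropy(p, n):
    """I(p, n): information needed to decide the class of an example
    drawn from a set with p elements of P and n elements of N."""
    total = p + n
    info = 0.0
    for count in (p, n):
        if count:                      # 0 * log2(0) is defined as 0
            info -= (count / total) * log2(count / total)
    return info
```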

Information Gain in Decision Tree Induction
Assume that using attribute A, the current set S will be partitioned into child sets S1, S2, …, Sv, where child Si contains pi elements of P and ni elements of N. The expected information after branching is the size-weighted sum:
E(A) = Σ_i ((pi + ni)/(p + n)) · I(pi, ni)
The encoding information that would be gained by branching on A is:
Gain(A) = I(p, n) - E(A)
Note: entropy is at its minimum (zero) when the collection of objects is completely uniform, i.e., all objects belong to the same class.

Worked examples (recall that log2(1) = 0 and 0·log2(0) is taken as 0):
Entropy(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Entropy(9, 0) = -(9/9) log2(9/9) - (0/9) log2(0/9) = 0
Entropy(0, 5) = -(0/5) log2(0/5) - (5/5) log2(5/5) = 0

Gain(A) = Entropy(9, 5) - [ (9/14)·Entropy(9, 0) + (5/14)·Entropy(0, 5) ] = 0.940 - 0 = 0.940
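Using the entropy sketch above, these numbers can be checked directly (assuming the split sends the 9 P examples into one child and the 5 N examples into the other):

```python
print(round(entropy(9, 5), 3))                 # 0.94
print(entropy(9, 0), entropy(0, 5))            # 0.0 0.0

e_after = (9/14) * entropy(9, 0) + (5/14) * entropy(0, 5)   # expected info after split
print(round(entropy(9, 5) - e_after, 3))       # 0.94: a perfect split gains all the entropy
```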

Avoiding Overfitting in Classification
The generated tree may overfit the training data:
- Too many branches, some of which may reflect anomalies due to noise or outliers
- The result is poor accuracy on unseen samples
Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
- Postpruning: remove branches from a "fully grown" tree, giving a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree"
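In practice, postpruning is often done with cost-complexity pruning. A sketch using scikit-learn (an assumption; the slides do not name any library, and the dataset here is just a stand-in): grow the full tree, enumerate the sequence of progressively pruned trees, and keep the one that does best on held-out data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=0)

# ccp_alphas parameterizes the sequence of progressively pruned trees
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_tree = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),     # "best pruned tree" on held-out data
)
print(best_tree.get_depth(), best_tree.score(X_val, y_val))
```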

Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross validation, e.g., 10-fold cross validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution
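For example, 10-fold cross validation of a decision tree (again a sketch assuming scikit-learn and a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())   # accuracy averaged over the 10 folds
```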

Feature Selection
One of the nice properties of decision trees is that they automatically discover the best features to use (the ones near the top of the tree) and which features are irrelevant for the problem (the features which are not used). How do we decide which features to use for nearest neighbor, or for the linear classifier?
Suppose we are trying to decide if tomorrow is a good day to play tennis, based on the temperature, the windspeed, the humidity and the outlook. We could use just {temperature}, or just {temperature, windspeed}, or just {…}.
This sounds like a search problem!

Forward Selection
Backward Elimination
Bi-directional Search
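A sketch of forward selection as a wrapper search around a nearest neighbor classifier (illustrative; assumes scikit-learn and a NumPy feature matrix X, neither of which is specified in the slides):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, feature_names):
    """Greedily add the single feature that most improves cross-validated accuracy."""
    selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
    while remaining:
        trials = {f: cross_val_score(KNeighborsClassifier(),
                                     X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f, score = max(trials.items(), key=lambda kv: kv[1])
        if score <= best_score:        # no remaining feature helps: stop the search
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return [feature_names[i] for i in selected], best_score
```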

Nearest Neighbor Classifier (decision surface figure)