Demo: Classification Programs C4.5 and CBA
Minqing Hu, CS594, Fall 2003, UIC

C4.5

Classification using decision trees.

Where to find the program?
– C4.5 Release 8, by Ross Quinlan
– Runs under Unix

Reference book: "C4.5: Programs for Machine Learning", J. Ross Quinlan

C4.5 Files

Names file (filestem.names)
– Provides names for the classes, attributes, and attribute values.
– Consists of a series of entries, each starting on a new line and ending with a period. The first entry gives the class names, separated by commas. The rest of the file consists of a single entry for each attribute.
– Each attribute entry begins with the attribute name followed by a colon, then a specification of the values that the attribute can take.
– Four specifications are possible:
  » ignore: causes the value of the attribute to be disregarded
  » continuous: the attribute has numeric values
  » discrete N: N is a positive integer; the attribute has no more than N discrete values
  » a list of names separated by commas: the attribute takes exactly these discrete values

Example: golf.names

Play, Don't Play.    | class labels
outlook: sunny, overcast, rain.
temperature: continuous.
humidity: continuous.
windy: true, false.
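
The format is simple enough to read programmatically. Below is a minimal Python sketch (parse_names is my own helper, not part of the C4.5 distribution; it assumes one entry per line with "|" comments, a simplification of what C4.5 itself accepts):

def parse_names(path):
    """Parse a C4.5 .names file into (classes, attributes)."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.split('|')[0].strip()          # drop comments
            if line:
                entries.append(line.rstrip('.').strip())
    classes = [c.strip() for c in entries[0].split(',')]
    attributes = {}
    for entry in entries[1:]:
        name, spec = (s.strip() for s in entry.split(':', 1))
        if spec in ('ignore', 'continuous'):
            attributes[name] = spec                    # keyword spec
        elif spec.startswith('discrete'):
            attributes[name] = ('discrete', int(spec.split()[1]))
        else:                                          # explicit value list
            attributes[name] = [v.strip() for v in spec.split(',')]
    return classes, attributes

# parse_names('golf.names') ->
# (['Play', "Don't Play"],
#  {'outlook': ['sunny', 'overcast', 'rain'], 'temperature': 'continuous',
#   'humidity': 'continuous', 'windy': ['true', 'false']})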

C4.5 Files (cont)

Data file (filestem.data)
– Describes the training cases from which the decision tree and/or rules are generated
– Each line describes one case, giving the values of all the attributes and then the case's class, separated by commas and terminated by a period
– Attribute values must appear in the same order in which the attributes were given in the names file
– For a missing or unknown value, use ?

Test file (filestem.test)
– Used to evaluate the classifier you have produced
– Exactly the same format as the data file

Example: golf.data

| outlook, temperature, humidity, windy, class label
sunny, 85, 85, false, Don't Play
sunny, 80, 90, true, Don't Play
overcast, 83, 78, false, Play
rain, 70, 96, ?, Play
rain, 68, ?, false, Play
rain, 65, 70, true, Don't Play
overcast, 64, 65, true, Play
sunny, 72, 95, false, Don't Play
sunny, 69, 70, false, Play
overcast, 72, 90, true, Play
overcast, 81, 75, false, Play
rain, 71, 80, true, Don't Play
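
Going the other way, writing a .data file from records held in Python takes only a few lines. A minimal sketch matching the example above (write_data is a hypothetical helper of mine; None marks a missing value, written out as "?"):

def write_data(path, records):
    """Write cases in C4.5 .data format: comma-separated attribute
    values with the class last; missing values become '?'."""
    with open(path, 'w') as f:
        for rec in records:
            f.write(', '.join('?' if v is None else str(v)
                              for v in rec) + '\n')

records = [
    ['sunny', 85, 85, 'false', "Don't Play"],
    ['rain',  70, 96, None,   'Play'],        # unknown 'windy' value
]
write_data('golf.data', records)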

Running the programs

c4.5: decision tree generation

    c4.5 -f filestem [-u]

– -f filestem (default: DF): specifies the filestem of the task
– -u (default: no test set): invoke this option when a test file has been prepared

Examples:
– training only: c4.5 -f ../Data/vote
– training and testing: c4.5 -f ../Data/vote -u
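
Because c4.5 writes its whole report to standard output, runs are easy to script. A minimal Python sketch, assuming the c4.5 binary is on the PATH and golf.names/golf.data are in the working directory:

import subprocess

# Run C4.5 on the 'golf' task; append '-u' when golf.test exists.
result = subprocess.run(['c4.5', '-f', 'golf'],
                        capture_output=True, text=True, check=True)
print(result.stdout)        # the decision tree and evaluation report

# c4.5 also saves the trees it builds (filestem.tree and
# filestem.unpruned); c4.5rules reads the latter.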

c4.5 output

C4.5 [release 8] decision tree generator    Fri Sep 12 12:02
Options: File stem

Read 300 cases (16 attributes) from ../Data/vote.data

Decision Tree:

physician fee freeze = n:
|   adoption of the budget resolution = y: democrat (151.0)
|   adoption of the budget resolution = u: democrat (1.0)
|   adoption of the budget resolution = n:
|   |   education spending = n: democrat (6.0)
|   |   education spending = y: democrat (9.0)
|   |   education spending = u: republican (1.0)
physician fee freeze = y:
|   synfuels corporation cutback = n: republican (97.0/3.0)
|   synfuels corporation cutback = u: republican (4.0)
|   synfuels corporation cutback = y:
|   |   duty free exports = y: democrat (2.0)
|   |   duty free exports = u: republican (1.0)
|   |   duty free exports = n:
|   |   |   education spending = n: democrat (5.0/2.0)
|   |   |   education spending = y: republican (13.0/2.0)
|   |   |   education spending = u: democrat (1.0)
physician fee freeze = u:
|   water project cost sharing = n: democrat (0.0)
|   water project cost sharing = y: democrat (4.0)
|   water project cost sharing = u:
|   |   mx missile = n: republican (0.0)
|   |   mx missile = y: democrat (3.0/1.0)
|   |   mx missile = u: republican (2.0)

The numbers at the leaves have the form (N) or (N/E):
– N is the number of training cases that reach the leaf
– E is the number of those cases that belong to classes other than the nominated class
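
To summarize such a report programmatically, the leaf annotations can be pulled out with a regular expression. A minimal sketch (my own helper, not part of C4.5) that sums N and E over all leaves to get the resubstitution error of the printed tree:

import re

LEAF = re.compile(r'\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$')

def leaf_totals(tree_text):
    """Sum N and E over every leaf line of a printed C4.5 tree."""
    n_total = e_total = 0.0
    for line in tree_text.splitlines():
        m = LEAF.search(line)
        if m:
            n_total += float(m.group(1))
            e_total += float(m.group(2) or 0)
    return n_total, e_total

# On the vote tree above, N sums to 300 and E to 8, matching the
# "8 ( 2.7%)" training errors reported before pruning.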

c4.5 output (cont)

Simplified Decision Tree:

physician fee freeze = n: democrat (168.0/2.6)
physician fee freeze = y: republican (123.0/13.9)
physician fee freeze = u:
|   mx missile = n: democrat (3.0/1.1)
|   mx missile = y: democrat (4.0/2.2)
|   mx missile = u: republican (2.0/1.0)

In the simplified (pruned) tree, E is no longer the observed error count but C4.5's pessimistic estimate of the number of errors the leaf will make on unseen cases, which is why it is not an integer.

c4.5 output (cont)

Evaluation on training data (300 items):

    Before Pruning           After Pruning
    --------------    ----------------------------
    Size   Errors     Size   Errors      Estimate
      25   8( 2.7%)      7   13( 4.3%)    ( 6.9%)   <<

Evaluation on test data (135 items):

    Before Pruning           After Pruning
    --------------    ----------------------------
    Size   Errors     Size   Errors      Estimate
      25   7( 5.2%)      7    4( 3.0%)    ( 6.9%)   <<

    (a)   (b)    <-classified as
    ---   ---
     80     3    (a): class democrat
      1    51    (b): class republican

Running the programs (cont)

c4.5rules: rule induction

Should only be used after running the decision tree program c4.5, since it reads the unpruned file containing the unpruned tree.

    c4.5rules -f filestem [-u]

Example: c4.5rules -f ../Data/vote

c4.5rules output

C4.5 [release 8] rule generator    Fri Sep 12 12:07
Options: File stem

Read 300 cases (16 attributes) from ../Data/vote

Processing tree 0

Final rules from tree 0:

Rule 2:
    physician fee freeze = n
    -> class democrat [98.4%]

Rule 9:
    synfuels corporation cutback = y
    duty free exports = y
    -> class democrat [97.5%]

…

Rule 13:
    physician fee freeze = u
    mx missile = u
    -> class republican [50.0%]

Default class: democrat
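
A rule set like this is applied as an ordered list: a case takes the class of the first rule all of whose conditions it satisfies, and the default class when none fire. A minimal sketch of that decision procedure (the data structures are my own; c4.5rules' actual ordering and tie-breaking are more involved):

# Each rule: (list of (attribute, value) tests, class label).
rules = [
    ([('physician fee freeze', 'n')], 'democrat'),
    ([('synfuels corporation cutback', 'y'),
      ('duty free exports', 'y')], 'democrat'),
    ([('physician fee freeze', 'u'),
      ('mx missile', 'u')], 'republican'),
]
default_class = 'democrat'

def classify(case, rules, default):
    """Return the class of the first rule whose tests all match."""
    for tests, label in rules:
        if all(case.get(attr) == value for attr, value in tests):
            return label
    return default

case = {'physician fee freeze': 'y',
        'synfuels corporation cutback': 'y',
        'duty free exports': 'y'}
print(classify(case, rules, default_class))   # -> democrat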

c4.5rules output (cont)

Evaluation on training data (300 items):

Rule  Size  Error   Used  Wrong           Advantage
----  ----  -----   ----  -----           ---------
  ..    ..    ..%     ..  .. ( 0.6%)      -1 (0|1)    democrat
  ..    ..    ..%      3   0 ( 0.0%)       0 (0|0)    democrat
  ..    ..    ..%      3   0 ( 0.0%)       0 (0|0)    democrat
  ..    ..    ..%     97   3 ( 3.1%)      21 (23|2)   republican
  ..    ..    ..%     15   2 (13.3%)      11 (13|2)   republican
  ..    ..    ..%      2   0 ( 0.0%)       2 (2|0)    republican
  ..    ..    ..%      2   0 ( 0.0%)       2 (2|0)    republican

Drop rule 2

Rule  Size  Error   Used  Wrong           Advantage
----  ----  -----   ----  -----           ---------
  ..    ..    ..%     54   0 ( 0.0%)       0 (0|0)    democrat
  ..    ..    ..%      3   0 ( 0.0%)       0 (0|0)    democrat
  ..    ..    ..%     97   3 ( 3.1%)      21 (23|2)   republican
  ..    ..    ..%     15   2 (13.3%)      11 (13|2)   republican
  ..    ..    ..%      3   0 ( 0.0%)       3 (3|0)    republican
  ..    ..    ..%      2   0 ( 0.0%)       2 (2|0)    republican

Tested 300, errors 9 (3.0%)   <<

    (a)   (b)    <-classified as
    ---   ---
     ..    ..    (a): class democrat
     ..    ..    (b): class republican

Evaluation on test data (135 items):

Rule  Size  Error   Used  Wrong           Advantage
----  ----  -----   ----  -----           ---------
  ..    ..    ..%     24   2 ( 8.3%)       0 (0|0)    democrat
  ..    ..    ..%      1   0 ( 0.0%)       0 (0|0)    democrat
  ..    ..    ..%     41   0 ( 0.0%)       6 (6|0)    republican
  ..    ..    ..%      8   3 (37.5%)       2 (5|3)    republican
  ..    ..    ..%      2   0 ( 0.0%)       2 (2|0)    republican

Tested 135, errors 7 (5.2%)   <<

    (a)   (b)    <-classified as
    ---   ---
     80     3    (a): class democrat
      4    48    (b): class republican

Confusion matrix & error rate

                      Predicted class
                        A      B
    Actual class   A   80      3
                   B    4     48

Error rate of this classifier: (4+3)/(83+52) = 7/135 = 5.2%
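
The same computation in a few lines of Python, for any square confusion matrix (rows = actual class, columns = predicted class):

def error_rate(matrix):
    """Fraction of off-diagonal (misclassified) cases."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

cm = [[80, 3],    # actual A: 80 predicted A, 3 predicted B
      [4, 48]]    # actual B: 4 predicted A, 48 predicted B
print(f"{error_rate(cm):.1%}")   # -> 5.2%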

CBA

Classification Based on Associations
– Download at
– Uses the same data format as c4.5, i.e., *.names, *.data, and *.test
– Refer to the help topics
– Discretization function: the discretization program is sometimes incompatible with some systems; if errors occur, try the DOS version of the discretizer under the CBA directory: "discretize"
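
CBA mines class association rules, so continuous attributes must be discretized first; that is what the bundled discretizer does (reportedly an entropy-based method). As a stand-in illustration only, here is a minimal equal-frequency binning sketch in Python, which is not the algorithm CBA ships:

def equal_frequency_bins(values, n_bins):
    """Return cut points that split the sorted values into n_bins
    roughly equal-sized groups (a simple alternative to the
    entropy-based discretization CBA actually uses)."""
    ordered = sorted(values)
    step = len(ordered) / n_bins
    return [ordered[int(i * step)] for i in range(1, n_bins)]

def discretize(value, cuts):
    """Map a continuous value to a bin index given cut points."""
    return sum(value >= c for c in cuts)

temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 72, 81, 71]
cuts = equal_frequency_bins(temps, 3)          # two cut points
print([discretize(t, cuts) for t in temps])    # bin index per case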

Data Repository online

UCI machine learning repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html