C4.5 Demo
Andrew Rosenberg, CS4701, 11/30/04



What is c4.5?
c4.5 is a program that creates a decision tree based on a set of labeled input data. This decision tree can then be tested against unseen labeled test data to quantify how well it generalizes.

Running c4.5
On cunix.columbia.edu
  - ~amr2104/c4.5/bin/c4.5 -u -f filestem
On cluster.cs.columbia.edu
  - ~amaxwell/c4.5/bin/c4.5 -u -f filestem
c4.5 expects to find 3 files
  - filestem.names
  - filestem.data
  - filestem.test

File Format: .names
The file begins with a comma-separated list of classes ending with a period, followed by a blank line.
  - E.g., >50K, <=50K.
The remaining lines have the following format (note the end-of-line period):
  - Attribute: {ignore, discrete n, continuous, list}.
Here "list" stands for an explicit comma-separated list of the values the attribute can take.
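As an illustration of the layout described above, the following Python sketch builds the text of a .names file from a class list and a list of attribute specifications. The function name `format_names` and its argument shapes are assumptions made for this example; they are not part of c4.5.

```python
def format_names(classes, attributes):
    """Return the text of a .names file (hypothetical helper).

    classes: list of class labels, e.g. [">50K", "<=50K"]
    attributes: list of (name, spec) pairs, where spec is the string
        "continuous", "ignore", "discrete N", or a list of allowed values.
    """
    # First line: comma-separated classes ending with a period, then a blank line.
    lines = [", ".join(classes) + ".", ""]
    for name, spec in attributes:
        if isinstance(spec, (list, tuple)):
            spec = ", ".join(spec)        # explicit list of discrete values
        lines.append(f"{name}: {spec}.")  # note the end-of-line period
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_names(
        [">50K", "<=50K"],
        [("age", "continuous"), ("sex", ["Female", "Male"])],
    ), end="")
```

Writing the returned string to filestem.names gives c4.5 the first of the three files it expects.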

Example: census.names
  >50K, <=50K.

  age: continuous.
  workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, etc.
  fnlwgt: continuous.
  education: Bachelors, Some-college, 11th, HS-grad, Prof-school, etc.
  education-num: continuous.
  marital-status: Married-civ-spouse, Divorced, Never-married, etc.
  occupation: Tech-support, Craft-repair, Other-service, Sales, etc.
  relationship: Wife, Own-child, Husband, Not-in-family, Unmarried.
  race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  sex: Female, Male.
  capital-gain: continuous.
  capital-loss: continuous.
  hours-per-week: continuous.
  native-country: United-States, Cambodia, England, Puerto-Rico, Canada, etc.

File Format: .data, .test
Each line in these data files is a comma-separated list of attribute values ending with a class label followed by a period.
  - The attributes must be in the same order as described in the .names file.
  - Unavailable values can be entered as '?'
When creating test sets, make sure that you remove the test data points from the training data.
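The record layout above can be sketched in Python as a pair of helpers, one that formats a record and one that splits a line back into attribute values and class label. Both function names are hypothetical, and the sketch assumes that attribute values contain no commas.

```python
def format_record(values, label):
    """Format one .data/.test line: attribute values, class label, period.

    None stands for an unavailable value and is written as '?'.
    """
    fields = ["?" if v is None else str(v) for v in values]
    return ", ".join(fields + [label]) + "."


def parse_record(line):
    """Split a .data/.test line into (attribute values, class label)."""
    # Drop surrounding whitespace and the terminating period, then split on commas.
    fields = [f.strip() for f in line.strip().rstrip(".").split(",")]
    return fields[:-1], fields[-1]


if __name__ == "__main__":
    line = format_record([18, None, "Some-college"], "<=50K")
    print(line)
    print(parse_record(line))
```

Comparing `len(values)` against the number of attributes in the .names file is a cheap way to catch misordered or missing columns before running c4.5.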

Example: adult.test
  25, Private, , 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.
  38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.
  28, Local-gov, , Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.
  44, Private, , Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States, >50K.
  18, ?, , Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K.
  34, Private, , 10th, 6, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K.
  29, ?, , HS-grad, 9, Never-married, ?, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K.
  63, Self-emp-not-inc, , Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3103, 0, 32, United-States, >50K.
  24, Private, , Some-college, 10, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K.
  55, Private, , 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 10, United-States, <=50K.
  65, Private, , HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 6418, 0, 40, United-States, >50K.
  36, Federal-gov, , Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K.

c4.5 Output
The decision tree proper.
  - Each leaf is annotated with (weighted training examples/weighted training error)
Tables of training error and testing error
Confusion matrix
You'll want to redirect the output of c4.5 to a text file for later viewing.
  - E.g., c4.5 -u -f filestem > filestem.results

Example output
  capital-gain > 6849 : >50K (203.0/6.2)
  | capital-gain <= 6849 :
  | | capital-gain > 6514 : <=50K (7.0/1.3)
  | | capital-gain <= 6514 :
  | | | marital-status = Married-civ-spouse: >50K (18.0/1.3)
  | | | marital-status = Divorced: <=50K (2.0/1.0)
  | | | marital-status = Never-married: >50K (0.0)
  | | | marital-status = Separated: >50K (0.0)
  | | | marital-status = Widowed: >50K (0.0)
  | | | marital-status = Married-spouse-absent: >50K (0.0)
  | | | marital-status = Married-AF-spouse: >50K (0.0)

  Tree saved

  Evaluation on training data (4660 items):
      Before Pruning      After Pruning
      Size  Errors        Size  Errors   Estimate
            ( 7.9%)             (14.1%)  (16.0%)  <<

  Evaluation on test data (2376 items):
      Before Pruning      After Pruning
      Size  Errors        Size  Errors   Estimate
            (17.7%)             (14.9%)  (16.0%)  <<

      (a)  (b)  <-classified as
                (a): class >50K
                (b): class <=50K
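To read the leaf annotations programmatically, a small parser can pull the class label and the two numbers out of each leaf line. The sketch below is written against the lines shown in this example only; the regular expression and the name `parse_leaf` are assumptions, not something c4.5 provides.

```python
import re

# Matches the tail of a leaf line: ": <class> (<examples>)" or
# ": <class> (<examples>/<errors>)".
LEAF = re.compile(r":\s*(\S+)\s+\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def parse_leaf(line):
    """Return (label, examples, errors) for a leaf line, or None otherwise."""
    m = LEAF.search(line)
    if m is None:
        return None  # an internal node such as "capital-gain <= 6849 :"
    errors = float(m.group(3)) if m.group(3) else 0.0
    return m.group(1), float(m.group(2)), errors


if __name__ == "__main__":
    label, examples, errors = parse_leaf("capital-gain > 6849 : >50K (203.0/6.2)")
    print(label, examples, errors, errors / examples)
```

Dividing the errors by the examples gives the fraction of weighted training examples misclassified at that leaf; leaves marked (0.0) received no training examples, so the ratio is undefined for them.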

k-fold Cross Validation
Start with one large data set.
Using a script, randomly divide this data set into k sets.
At each iteration, use k-1 sets to train the decision tree, and the remaining set to test the model.
Repeat this k times, holding out a different set each time, and take the average testing error.
The average error describes how well the learning algorithm can be applied to the data set.
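One possible form for the splitting script is sketched below in Python. The names `k_fold_splits`, `write_folds`, and `average_error`, and the `stem_i` file naming, are assumptions for illustration. Since c4.5 looks for filestem.names next to the data, a copy of the .names file would also be needed under each generated stem.

```python
import random


def k_fold_splits(records, k, seed=0):
    """Yield (train, test) pairs, one per fold."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)       # random division of the data
    folds = [shuffled[i::k] for i in range(k)]  # k nearly equal-sized sets
    for i in range(k):
        # Train on the k-1 other sets, test on the held-out one.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]


def write_folds(data_path, stem, k, seed=0):
    """Write stem_i.data and stem_i.test for i = 0..k-1 (hypothetical naming)."""
    with open(data_path) as f:
        records = [line.rstrip("\n") + "\n" for line in f if line.strip()]
    for i, (train, test) in enumerate(k_fold_splits(records, k, seed)):
        with open(f"{stem}_{i}.data", "w") as out:
            out.writelines(train)
        with open(f"{stem}_{i}.test", "w") as out:
            out.writelines(test)


def average_error(fold_errors):
    """Average the k testing errors reported by the k runs."""
    return sum(fold_errors) / len(fold_errors)
```

After running c4.5 once per stem, the test error from each results file can be collected by hand or by script and passed to `average_error`.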