C4.5 Demo Andrew Rosenberg CS /30/04
What is c4.5?
c4.5 is a program that creates a decision tree based on a set of labeled input data.
This decision tree can then be tested against unseen labeled test data to quantify how well it generalizes.

Running c4.5
On cunix.columbia.edu
– ~amr2104/c4.5/bin/c4.5 -u -f filestem
On cluster.cs.columbia.edu
– ~amaxwell/c4.5/bin/c4.5 -u -f filestem
c4.5 expects to find 3 files
– filestem.names
– filestem.data
– filestem.test

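Before launching a run, it can help to confirm that all three input files exist. The following is a minimal sketch, not part of c4.5 itself; the helper name `missing_inputs` is hypothetical and the stem `filestem` is just the placeholder used above.

```python
from pathlib import Path


def missing_inputs(filestem):
    """Return the names of the expected c4.5 input files that do not exist.

    Hypothetical helper: it only checks for filestem.names, filestem.data
    and filestem.test on disk; it does not validate their contents.
    """
    expected = [Path(f"{filestem}{ext}") for ext in (".names", ".data", ".test")]
    return [p.name for p in expected if not p.is_file()]


if __name__ == "__main__":
    # "filestem" is a placeholder; substitute your own stem, e.g. "census".
    missing = missing_inputs("filestem")
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All three input files found.")
```
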
File Format: .names
The file begins with a comma-separated list of classes ending with a period, followed by a blank line.
– E.g., >50K, <=50K.
The remaining lines have the following format (note the end-of-line period):
– Attribute: {ignore, discrete n, continuous, list}.

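The layout described above can be generated from a script rather than typed by hand. This is a rough sketch under stated assumptions: the function name `format_names` and its argument shapes are illustrative, and it only covers the layout given on this slide (class list, blank line, one attribute per line, each ending in a period).

```python
def format_names(classes, attributes):
    """Build the text of a .names file.

    classes:    list of class labels, e.g. [">50K", "<=50K"]
    attributes: list of (name, spec) pairs, where spec is either a string
                such as "continuous", "ignore" or "discrete 5", or a list
                of the discrete values the attribute can take.
    """
    lines = [", ".join(classes) + ".", ""]  # class list, then a blank line
    for name, spec in attributes:
        value = spec if isinstance(spec, str) else ", ".join(spec)
        lines.append(f"{name}: {value}.")  # note the end-of-line period
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = format_names(
        [">50K", "<=50K"],
        [("age", "continuous"), ("sex", ["Female", "Male"])],
    )
    print(text, end="")
```
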
Example: census.names
>50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, etc.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, etc.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, etc.
occupation: Tech-support, Craft-repair, Other-service, Sales, etc.
relationship: Wife, Own-child, Husband, Not-in-family, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, etc.

File Format: .data, .test
Each line in these data files is a comma-separated list of attribute values ending with a class label followed by a period.
– The attributes must be in the same order as described in the .names file.
– Unavailable values can be entered as ‘?’
When creating test sets, make sure that you remove these data points from the training data.

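A quick way to catch malformed lines before running c4.5 is to parse each record and check its field count. The sketch below is an illustration only: `parse_record` is a hypothetical helper, and it assumes that attribute values themselves contain no commas.

```python
def parse_record(line, n_attributes):
    """Split one .data/.test line into (attribute values, class label).

    Assumes values contain no commas. A '?' is returned as None to mark an
    unavailable value. Raises ValueError if the field count is wrong.
    """
    text = line.strip()
    if text.endswith("."):
        text = text[:-1]  # drop only the final period after the class label
    fields = [f.strip() for f in text.split(",")]
    if len(fields) != n_attributes + 1:
        raise ValueError(
            f"expected {n_attributes} attributes plus a class label, "
            f"got {len(fields)} fields"
        )
    *values, label = fields
    attributes = [None if v == "?" else v for v in values]
    return attributes, label


if __name__ == "__main__":
    # A shortened, made-up record with four attributes and one class label.
    print(parse_record("18, ?, 10, Never-married, <=50K.", 4))
```
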
Example: adult.test
25, Private, , 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.
38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.
28, Local-gov, , Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.
44, Private, , Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States, >50K.
18, ?, , Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K.
34, Private, , 10th, 6, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K.
29, ?, , HS-grad, 9, Never-married, ?, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K.
63, Self-emp-not-inc, , Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3103, 0, 32, United-States, >50K.
24, Private, , Some-college, 10, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K.
55, Private, , 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 10, United-States, <=50K.
65, Private, , HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 6418, 0, 40, United-States, >50K.
36, Federal-gov, , Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K.

c4.5 Output
The decision tree proper.
– Each leaf is followed by (weighted training examples/weighted training error)
Tables of training error and testing error
Confusion matrix
You’ll want to redirect the output of c4.5 to a text file for later viewing.
– E.g., c4.5 -u -f filestem > filestem.results

Example output
capital-gain > 6849 : >50K (203.0/6.2)
|   capital-gain <= 6849 :
|   |   capital-gain > 6514 : <=50K (7.0/1.3)
|   |   capital-gain <= 6514 :
|   |   |   marital-status = Married-civ-spouse: >50K (18.0/1.3)
|   |   |   marital-status = Divorced: <=50K (2.0/1.0)
|   |   |   marital-status = Never-married: >50K (0.0)
|   |   |   marital-status = Separated: >50K (0.0)
|   |   |   marital-status = Widowed: >50K (0.0)
|   |   |   marital-status = Married-spouse-absent: >50K (0.0)
|   |   |   marital-status = Married-AF-spouse: >50K (0.0)

Tree saved

Evaluation on training data (4660 items):
    Before Pruning        After Pruning
    Size   Errors     Size   Errors    Estimate
           ( 7.9%)           (14.1%)   (16.0%)   <<

Evaluation on test data (2376 items):
    Before Pruning        After Pruning
    Size   Errors     Size   Errors    Estimate
           (17.7%)           (14.9%)   (16.0%)   <<

    (a)   (b)   <-classified as
                (a): class >50K
                (b): class <=50K

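Once the output has been saved to a file, the leaf annotations can be pulled out with a short script. The sketch below is an assumption-laden illustration: the regular expression and the `parse_leaf` helper are hypothetical and were written only against leaf lines shaped like the ones in the example output, where depth is marked by leading `|` characters and the counts appear as `(N)` or `(N/E)`.

```python
import re

# Leading '|' bars, the test, a colon, the predicted class, then (N) or (N/E).
LEAF = re.compile(
    r"^(?P<bars>(?:\|\s*)*)(?P<test>.+?)\s*:\s*(?P<label>\S+)\s+"
    r"\((?P<n>[\d.]+)(?:/(?P<e>[\d.]+))?\)\s*$"
)


def parse_leaf(line):
    """Return (depth, test, label, n, e) for a leaf line, else None.

    n is the weighted number of training examples at the leaf and e the
    weighted error; e defaults to 0.0 when the output shows only (N).
    """
    match = LEAF.match(line.rstrip())
    if match is None:
        return None  # internal node or a non-tree line
    depth = match.group("bars").count("|")
    e = float(match.group("e")) if match.group("e") else 0.0
    return depth, match.group("test"), match.group("label"), float(match.group("n")), e


if __name__ == "__main__":
    for sample in (
        "capital-gain > 6849 : >50K (203.0/6.2)",
        "|   |   capital-gain <= 6514 :",
        "|   |   |   marital-status = Never-married: >50K (0.0)",
    ):
        print(parse_leaf(sample))
```

Internal nodes such as `capital-gain <= 6514 :` carry no class or counts, so the helper returns None for them.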
k-fold Cross Validation
Start with one large data set.
Using a script, randomly divide this data set into k sets.
At each iteration, use k-1 sets to train the decision tree, and the remaining set to test the model.
Repeat this k times, holding out a different set each time, and take the average testing error.
The average error estimates how well the learning algorithm generalizes on this data set.
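
The splitting step can be sketched as follows. This is one possible script, not a prescribed one: the names `k_fold_splits` and `mean_error` are hypothetical, the fixed seed is an assumption made for reproducibility, and actually running c4.5 on each train/test pair is left out.

```python
import random


def k_fold_splits(records, k, seed=0):
    """Yield k (train, test) pairs from a list of data lines.

    The records are shuffled once, dealt into k folds, and each fold is
    used exactly once as the test set while the other k-1 folds form the
    training set, so no test record appears in its own training data.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test


def mean_error(errors):
    """Average the k per-fold test error rates."""
    return sum(errors) / len(errors)


if __name__ == "__main__":
    lines = [f"record-{i}" for i in range(10)]  # stand-ins for .data lines
    for fold, (train, test) in enumerate(k_fold_splits(lines, k=5)):
        # In practice, write train to stem_<fold>.data and test to
        # stem_<fold>.test (hypothetical names), then run c4.5 on that stem.
        print(fold, len(train), len(test))
```

Each fold's test error would be read from the corresponding c4.5 output and passed to `mean_error` to get the cross-validation estimate.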