Greedy rule generation from discrete data and its use in neural network rule extraction
Koichi Odajima & Yoichi Hayashi, Meiji University, Japan
Rudy Setiono, National University of Singapore
Motivation
Rule extraction from neural networks using the decompositional approach:
1. Cluster the hidden unit activation values.
2. Generate rules to explain the network's outputs using the clustered activation values.
3. Replace the conditions of the rules generated in Step 2 by equivalent conditions involving the original input data.
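The slides do not prescribe a particular clustering method for Step 1. A minimal Python sketch of one simple greedy scheme for discretizing the activation values of a hidden unit is shown below; the gap threshold delta and the function name are assumptions for illustration only.

import numpy as np

def cluster_activations(values, delta=0.2):
    # Sort the activation values of one hidden unit and start a new cluster
    # whenever the gap between consecutive sorted values exceeds delta.
    # Returns one cluster index per sample; illustrative only.
    order = np.argsort(values)
    labels = np.empty(len(values), dtype=int)
    current = 0
    labels[order[0]] = current
    for prev, cur in zip(order[:-1], order[1:]):
        if values[cur] - values[prev] > delta:
            current += 1
        labels[cur] = current
    return labels

# e.g. cluster_activations(np.array([0.02, 0.05, 0.91, 0.97])) -> [0, 0, 1, 1]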
Motivation (continued)
The Greedy Rule Generation (GRG) algorithm is proposed for Step 2: generating rules that explain the network's outputs from the clustered activation values.
Greedy Rule Generation
Input: labeled data with discrete-valued attributes. J is the number of attributes; Nj is the number of distinct values of attribute Aj, j = 1, 2, …, J.
Output: ordered classification rules.
Step 1. Initialization.
Decompose the input space into S = N1 x N2 x … x NJ subspaces.
For each subspace Sp:
- generate the rule Rp having J conditions (RpC1, RpC2, …, RpCJ);
- count the number of samples of class i in subspace Sp: Fp,i;
- let Fp,I = max over i of Fp,i; if Fp,I > 0, let the conclusion of the rule be yI, i.e. the samples in Sp belong to class I (otherwise Sp is "unlabeled").
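A minimal Python sketch of Step 1, assuming the samples are given as rows of discrete attribute values; the function name and data layout are illustrative, not taken from the paper.

import itertools
from collections import Counter

def initialize_subspaces(X, y, classes):
    # X: list of samples, each a tuple of J discrete attribute values; y: class labels.
    # Enumerate every combination of attribute values (one subspace Sp per combination),
    # count the class frequencies Fp,i, and label each non-empty subspace with its
    # majority class; empty subspaces stay "unlabeled" (None).
    value_sets = [sorted(set(col)) for col in zip(*X)]   # the N1, ..., NJ distinct values
    freq = {p: Counter() for p in itertools.product(*value_sets)}
    for row, c in zip(X, y):
        freq[tuple(row)][c] += 1
    label = {p: (max(classes, key=lambda c: counts[c]) if counts else None)
             for p, counts in freq.items()}
    return freq, label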
Greedy Rule Generation
Set the optimized rule set to ∅.
Step 2. Rule generation.
(a) Let P and I be such that FP,I = max over p, i of Fp,i, and let R be the rule for the samples in subspace SP. If FP,I = 0, stop.
(b) Generate a list of rules consisting of R and all the other rules R′ whose conclusion y′ = yI or y′ = "unlabeled".
(c) For all mergeable pairs of rules in the list (b), add the merged rule to the list and compute the class frequencies of the samples covered by the new rule. Repeat this step until no new rule can be added to the list by merging.
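Step 2(c) can be sketched as follows, assuming (as in the Iris example later in the slides) that two rules are mergeable when their conditions agree on all attributes except exactly one. Rules are represented here as tuples of frozensets of admissible values, which is an implementation choice of this sketch, not the paper's notation.

def try_merge(r1, r2):
    # Merge two rules that differ in exactly one attribute by taking the
    # union of their admissible values on that attribute; return None otherwise.
    diff = [j for j, (a, b) in enumerate(zip(r1, r2)) if a != b]
    if len(diff) != 1:
        return None
    j = diff[0]
    return r1[:j] + (r1[j] | r2[j],) + r1[j + 1:]

def close_under_merging(rules):
    # Repeat Step 2(c) until no new rule can be added to the list by merging.
    rules = set(rules)
    added = True
    while added:
        added = False
        for r1 in list(rules):
            for r2 in list(rules):
                merged = try_merge(r1, r2)
                if merged is not None and merged not in rules:
                    rules.add(merged)
                    added = True
    return rules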
Greedy Rule Generation
(d) Among all the rules in the list, select the best rule R* (criteria applied in this order):
- it must include R;
- it covers the maximum number of samples with the correct label;
- it has the highest number of irrelevant attributes;
- it covers the largest subspace of the input.
(e) Add R* to the optimized rule set.
(f) Set the class labels of all samples in the subspaces covered by rule R* to "unlabeled" and their corresponding frequencies to 0. Repeat from Step 2(a).
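A sketch of Step 2(d), using the same rule representation as above and a table freq mapping each subspace (a tuple of attribute values) to a Counter of class frequencies, as in the Step 1 sketch. Treating the three coverage criteria as lexicographic tie-breakers is an assumption made explicit here.

def pick_best(candidates, R, freq, target, n_values):
    # Select R* among the candidate rules that include R, maximizing in order:
    # samples covered with the correct label, number of irrelevant attributes,
    # and the size of the covered subspace.
    def includes(big, small):
        return all(s <= b for b, s in zip(big, small))      # condition-wise superset
    def covered(rule):
        return [p for p in freq if all(v in cond for v, cond in zip(p, rule))]
    def score(rule):
        subs = covered(rule)
        correct = sum(freq[p][target] for p in subs)        # samples with the correct label
        irrelevant = sum(len(cond) == n for cond, n in zip(rule, n_values))
        size = 1
        for cond in rule:
            size *= len(cond)
        return (correct, irrelevant, size)
    return max((r for r in candidates if includes(r, R)), key=score)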
Illustrative example: the Iris data set
Class frequencies (setosa, versicolor, virginica) in each subspace:
Petal length \ Petal width | small               | medium              | large
small                      | (49,0,0) setosa     | (1,0,0) setosa      | (0,0,0) "unlabeled"
medium                     | (0,0,0) "unlabeled" | (0,47,0) versicolor | (0,1,4) virginica
large                      | (0,0,0) "unlabeled" | (0,1,6) virginica   | (0,1,40) virginica
Illustrative example: the Iris data set
Petal length:
- small: if petal length in [0.0, 2.0)
- medium: if petal length in [2.0, 4.93)
- large: if petal length in [4.93, 6.90]
Petal width:
- small: if petal width in [0.0, 0.6)
- medium: if petal width in [0.6, 1.70)
- large: if petal width in [1.70, 2.50]
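The interval boundaries above translate directly into a small discretization helper; a minimal sketch (constant and function names are illustrative):

import bisect

PETAL_LENGTH_CUTS = [2.0, 4.93]   # [0.0, 2.0) -> small, [2.0, 4.93) -> medium, [4.93, 6.90] -> large
PETAL_WIDTH_CUTS  = [0.6, 1.70]   # [0.0, 0.6) -> small, [0.6, 1.70) -> medium, [1.70, 2.50] -> large

def discretize(value, cuts, names=("small", "medium", "large")):
    # Map a continuous measurement to its interval label.
    return names[bisect.bisect_right(cuts, value)]

# e.g. discretize(1.4, PETAL_LENGTH_CUTS) -> 'small'
#      discretize(1.8, PETAL_WIDTH_CUTS)  -> 'large'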
Illustrative example: the Iris data set
The first rule generated:
R = (petal length = small, petal width = small) ⇒ setosa
followed by the list of rules of Step 2(b):
R0 = (petal length = small, petal width = small) ⇒ setosa
R1 = (petal length = small, petal width = medium) ⇒ setosa
R2 = (petal length = small, petal width = large) ⇒ unlabeled
R3 = (petal length = medium, petal width = small) ⇒ unlabeled
R4 = (petal length = large, petal width = small) ⇒ unlabeled
Illustrative example: the Iris data set
Merge (R0,R1), (R1,R2), (R0,R3), and (R3,R4) to obtain:
R5 = R0 ∪ R1 = (petal length = small, petal width = small or medium) ⇒ setosa
R6 = R1 ∪ R2 = (petal length = small, petal width = medium or large) ⇒ setosa
R7 = R0 ∪ R3 = (petal length = small or medium, petal width = small) ⇒ setosa
R8 = R3 ∪ R4 = (petal length = medium or large, petal width = small) ⇒ unlabeled
Finally merge R0 with R6 and R0 with R8:
R9 = R0 ∪ R6 = (petal length = small, petal width = small or medium or large) ⇒ setosa
R10 = R0 ∪ R8 = (petal length = small or medium or large, petal width = small) ⇒ setosa
Illustrative example: the Iris data set
Choose the best rule R*:
It must include R: R1, R2, R3, R4, R6, and R8 are excluded.
R9 is selected as it covers the maximum number of samples (50).
Hence R* = R9 = (petal length = small, petal width = small or medium or large) ⇒ setosa is added to the rule set.
Set the labels of all samples covered by R9 to "unlabeled".
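The coverage count of 50 can be checked directly against the frequency table shown earlier; a small self-contained check (bin names and the tuple ordering setosa/versicolor/virginica are just this sketch's conventions):

# Class frequencies (setosa, versicolor, virginica) per (petal length bin, petal width bin).
freq = {
    ("small", "small"): (49, 0, 0), ("small", "medium"): (1, 0, 0), ("small", "large"): (0, 0, 0),
    ("medium", "small"): (0, 0, 0), ("medium", "medium"): (0, 47, 0), ("medium", "large"): (0, 1, 4),
    ("large", "small"): (0, 0, 0), ("large", "medium"): (0, 1, 6), ("large", "large"): (0, 1, 40),
}

# R9: petal length = small, petal width = small or medium or large  =>  setosa
R9 = ({"small"}, {"small", "medium", "large"})
covered = [p for p in freq if p[0] in R9[0] and p[1] in R9[1]]
print(sum(freq[p][0] for p in covered))   # 50 setosa samples covered by R9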
Illustrative example: the Iris data set
After one iteration:
Petal length \ Petal width | small               | medium              | large
small                      | (0,0,0) "unlabeled" | (0,0,0) "unlabeled" | (0,0,0) "unlabeled"
medium                     | (0,0,0) "unlabeled" | (0,47,0) versicolor | (0,1,4) virginica
large                      | (0,0,0) "unlabeled" | (0,1,6) virginica   | (0,1,40) virginica
Illustrative example: the Iris data set
For the next iteration, start with the rule
R = (petal length = medium, petal width = medium) ⇒ versicolor
as it covers the largest number of samples (47).
The rule R* = (petal length = small or medium, petal width = small or medium) ⇒ versicolor is generated.
Illustrative example: the Iris data set
After two iterations:
Petal length \ Petal width | small               | medium              | large
small                      | (0,0,0) "unlabeled" | (0,0,0) "unlabeled" | (0,0,0) "unlabeled"
medium                     | (0,0,0) "unlabeled" | (0,0,0) "unlabeled" | (0,1,4) virginica
large                      | (0,0,0) "unlabeled" | (0,1,6) virginica   | (0,1,40) virginica
Illustrative example: the Iris data set
The last iteration of the algorithm generates a rule that classifies the remaining samples as virginica.
Complete rule set:
If petal length = small, then setosa
Else if petal length = small or medium and petal width = small or medium, then versicolor
Else virginica.
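The complete rule set transcribes directly into nested if/else statements over the discretized inputs; a minimal sketch (the function name is illustrative):

def classify_iris(petal_length_bin, petal_width_bin):
    # Ordered rule set extracted for the Iris data, applied top to bottom.
    if petal_length_bin == "small":
        return "setosa"
    elif petal_length_bin in ("small", "medium") and petal_width_bin in ("small", "medium"):
        return "versicolor"
    else:
        return "virginica"

# e.g. classify_iris("medium", "medium") -> 'versicolor'
#      classify_iris("large", "large")   -> 'virginica'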
Illustrative example: artificial data set
If x1 ≤ 0.25, then y = 0,
Else if x1 ≤ 0.75 and x2 > 0.75, then y = 0,
Else if x1 - x2 > 0.25, then y = 1,
Else y = 2.
1000 data points generated, with 5% noise in the class labels.
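The slides do not state how the 1000 points are sampled or how the 5% label noise is injected; one plausible reading (uniform sampling on [0, 1]^2, each label replaced by a random class with probability 0.05) is sketched below.

import numpy as np

rng = np.random.default_rng(0)            # seed chosen arbitrarily for reproducibility
X = rng.random((1000, 2))                 # assumed: x1, x2 uniform on [0, 1]

def target(x1, x2):
    # The generating rule from the slide.
    if x1 <= 0.25:
        return 0
    elif x1 <= 0.75 and x2 > 0.75:
        return 0
    elif x1 - x2 > 0.25:
        return 1
    else:
        return 2

y = np.array([target(x1, x2) for x1, x2 in X])
noisy = rng.random(1000) < 0.05           # assumed noise model: 5% of the labels randomized
y[noisy] = rng.integers(0, 3, noisy.sum())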
Illustrative example: artificial data set
Extracted rule:
If x1 - x2 > 0.24, then y = 1,
Else if x2 ≤ 0.75 and x1 ≥ 0.25, then y = 2,
Else if x1 < 0.74, then y = 0,
Else y = 2.
Experimental results
Data set                   | No. of samples | Attributes
Australian credit approval | 690            | 8 discrete, 6 continuous
Boston housing             | 506            | 1 discrete, 12 continuous
Cleveland heart disease    | 297            | 5 discrete, 8 continuous
Wisconsin breast cancer    | 699            | 9 discrete
Results are from 10-fold cross-validation runs.
Experimental results
Data set                   | C4.5: % Acc / #rules | NeuroLinear: % Acc / #rules | NL + GRG: % Acc / #rules
Australian credit approval | 84.36 / 9.30         | 83.64 / 6.60                | 86.40 / 2.80
Boston housing             | 86.13 / 11.30        | 80.60 / 3.05                | 85.71 / 2.90
Cleveland heart disease    | 77.26 / 10.20        | 78.15 / 5.69                | 81.72 / 2.20
Wisconsin breast cancer    | 96.10 / 8.80         | 95.73 / 2.89                | 95.96 / 2.00
Conclusion
The GRG method is proposed for generating classification rules.
It can be applied directly to small data sets with discrete attribute values.
For larger data sets, GRG can be used in the context of neural network rule extraction by applying it to the discretized hidden unit activation values.
Results on several UCI data sets show that the neural network rule extraction approach with GRG produces concise rule sets.
Thank you!