Greedy rule generation from discrete data and its use in neural network rule extraction
Koichi Odajima & Yoichi Hayashi, Meiji University, Japan
Rudy Setiono, National University of Singapore

Motivation
Rule extraction from neural networks using the decompositional approach:
Step 1. Cluster the hidden unit activation values.
Step 2. Generate rules that explain the network's outputs using the clustered activation values.
Step 3. Replace the conditions of the rules generated in Step 2 by equivalent conditions involving the original input data.
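
A hedged sketch of this three-step pipeline, to make the data flow concrete. It uses scikit-learn's MLPClassifier and per-unit k-means as stand-ins; this is not the NeuroLinear procedure used by the authors, and the network size, clustering choices, and all names here are illustrative assumptions.

```python
# Illustrative sketch of the decompositional rule-extraction pipeline.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# A small one-hidden-layer network (tanh hidden units, as is common here).
net = MLPClassifier(hidden_layer_sizes=(3,), activation='tanh',
                    max_iter=2000, random_state=0).fit(X, y)

# Step 1: compute the hidden unit activations and cluster them per unit,
# giving each sample a vector of discrete cluster indices.
H = np.tanh(X @ net.coefs_[0] + net.intercepts_[0])
H_discrete = np.column_stack([
    KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(H[:, [j]])
    for j in range(H.shape[1])
])

# Step 2 would generate rules (e.g. with GRG) that predict net.predict(X)
# from H_discrete; Step 3 would rewrite each cluster-index condition as an
# equivalent interval condition on the original inputs.
```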

Motivation (continued)
The Greedy Rule Generation (GRG) algorithm is proposed to perform Step 2: generate rules that explain the network's outputs using the clustered activation values.

Greedy Rule Generation
Input: labeled data with discrete-valued attributes. J is the number of attributes; Nj is the number of distinct values of attribute Aj, j = 1, 2, ..., J.
Output: ordered classification rules.
Step 1. Initialization.
- Decompose the input space into S = N1 x N2 x ... x NJ subspaces.
- For each subspace Sp:
  - Generate the rule Rp having J conditions (RpC1, RpC2, ..., RpCJ).
  - Count the number of samples of class i in subspace Sp: Fp,i.
  - Let Fp,I = max_i Fp,i. If Fp,I > 0, let the conclusion of the rule be yI, i.e. the samples in Sp are assigned to class I.

Greedy Rule Generation
Set the optimized rule set R = ∅.
Step 2. Rule generation.
(a) Let P and I be such that FP,I = max_{p,i} Fp,i, and let R be the rule for the samples in subspace SP. If FP,I = 0, stop.
(b) Generate a list of rules consisting of R and all the other rules R' whose conclusion is yI or "unlabeled".
(c) For all mergeable pairs of rules in the list (b), add the merged rule to the list and compute the class frequencies of the samples covered by the new rule. Repeat this step until no new rule can be added to the list by merging.

Greedy Rule Generation
Among all the rules in the list, select the best rule R*:
- it must include R;
- it covers the maximum number of samples with the correct label;
- it has the highest number of irrelevant attributes;
- it covers the largest subspace of the input.
Let R = R ∪ R*.
Set the class label of all samples in the subspaces covered by rule R* to "unlabeled" and their corresponding frequencies to 0. Repeat from Step 2(a).
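
A minimal Python sketch of the core operations described in the three slides above: the Step 1 frequency table, the merge used in Step 2(c), and the ordering used to pick the best rule R*. It is a reconstruction from the slide text only; the exact mergeability test (two rules identical except in one attribute), the reading of "irrelevant attribute" as a condition that allows every value, and all function names are my assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import product

def init_frequencies(samples, labels, attr_values):
    """Step 1: count the samples of each class in every subspace S_p.
    attr_values[j] lists the N_j distinct values of attribute A_j."""
    freq = {p: Counter() for p in product(*attr_values)}
    for x, y in zip(samples, labels):
        freq[tuple(x)][y] += 1
    return freq  # the majority class of S_p, if any, is freq[p].most_common(1)

def mergeable(r1, r2):
    """Rules are tuples of value sets, one set per attribute. Assume two
    rules can be merged when they agree on all attributes but one."""
    return sum(a != b for a, b in zip(r1, r2)) == 1

def merge(r1, r2):
    """Step 2(c): union the value sets attribute by attribute."""
    return tuple(a | b for a, b in zip(r1, r2))

def rank_key(rule, freq, target_class):
    """Ordering used to pick R* among candidates that already include R:
    most correctly labeled samples covered, then most irrelevant attributes
    (conditions allowing every value), then largest covered subspace."""
    covered = [p for p in freq
               if all(p[j] in rule[j] for j in range(len(rule)))]
    correct = sum(freq[p][target_class] for p in covered)
    all_values = [set(vals) for vals in zip(*freq)]  # distinct values per attribute
    irrelevant = sum(1 for j, cond in enumerate(rule) if cond == all_values[j])
    return (correct, irrelevant, len(covered))
```

With these helpers, the best rule in an iteration would be chosen as `max(candidates, key=lambda r: rank_key(r, freq, I))` over the candidates in the merged list that include the seed rule R.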

Illustrative example: the Iris data set
Class frequencies (setosa, versicolor, virginica) in each subspace, with the majority-class label:

                        Petal width = Small     Petal width = Medium     Petal width = Large
Petal length = Small    (49,0,0) setosa         (1,0,0) setosa           (0,0,0) "unlabeled"
Petal length = Medium   (0,0,0) "unlabeled"     (0,47,0) versicolor      (0,1,6) virginica
Petal length = Large    (0,0,0) "unlabeled"     (0,1,4) virginica        (0,1,40) virginica

Illustrative example: the Iris data set
Petal length:
- small: petal length in [0.0, 2.0)
- medium: petal length in [2.0, 4.93)
- large: petal length in [4.93, 6.90]
Petal width:
- small: petal width in [0.0, 0.6)
- medium: petal width in [0.6, 1.70)
- large: petal width in [1.70, 2.50]
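
A small sketch of this discretization, with the cut points copied from the slide; np.digitize and the label names are just one way to render the small/medium/large bins.

```python
import numpy as np

PETAL_LENGTH_CUTS = [2.0, 4.93]   # small < 2.0 <= medium < 4.93 <= large
PETAL_WIDTH_CUTS  = [0.6, 1.70]   # small < 0.6 <= medium < 1.70 <= large
LABELS = ('small', 'medium', 'large')

def discretize(value, cuts):
    """Map a continuous measurement to its small/medium/large bin."""
    return LABELS[int(np.digitize(value, cuts))]

# e.g. discretize(1.4, PETAL_LENGTH_CUTS) -> 'small'
#      discretize(5.1, PETAL_LENGTH_CUTS) -> 'large'
#      discretize(0.2, PETAL_WIDTH_CUTS)  -> 'small'
```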

Illustrative example: the Iris data set
The first rule generated is
R = (petal length = small, petal width = small) ⇒ setosa
and the candidate list consists of
R0 = (petal length = small, petal width = small) ⇒ setosa
R1 = (petal length = small, petal width = medium) ⇒ setosa
R2 = (petal length = small, petal width = large) ⇒ unlabeled
R3 = (petal length = medium, petal width = small) ⇒ unlabeled
R4 = (petal length = large, petal width = small) ⇒ unlabeled

Illustrative example: the Iris data set
Merge (R0,R1), (R1,R2), (R0,R3), and (R3,R4) to obtain:
R5 = R0 ∪ R1 = (petal length = small, petal width = small or medium) ⇒ setosa
R6 = R1 ∪ R2 = (petal length = small, petal width = medium or large) ⇒ setosa
R7 = R0 ∪ R3 = (petal length = small or medium, petal width = small) ⇒ setosa
R8 = R3 ∪ R4 = (petal length = medium or large, petal width = small) ⇒ unlabeled
Finally, merge R0 with R6 and R0 with R8:
R9 = R0 ∪ R6 = (petal length = small, petal width = small or medium or large) ⇒ setosa
R10 = R0 ∪ R8 = (petal length = small or medium or large, petal width = small) ⇒ setosa

Illustrative example: the Iris data set
Choose the best rule R*:
- It must include R, so R1, R2, R3, R4, R6, and R8 are excluded.
- R9 is selected as it covers the maximum number of samples (50).
Hence, R = R ∪ R* = R9: (petal length = small, petal width = small or medium or large) ⇒ setosa.
Set the label of all samples covered by R9 to "unlabeled".

Illustrative example: the Iris data set
After one iteration:

                        Petal width = Small     Petal width = Medium     Petal width = Large
Petal length = Small    (0,0,0) "unlabeled"     (0,0,0) "unlabeled"      (0,0,0) "unlabeled"
Petal length = Medium   (0,0,0) "unlabeled"     (0,47,0) versicolor      (0,1,6) virginica
Petal length = Large    (0,0,0) "unlabeled"     (0,1,4) virginica        (0,1,40) virginica

Illustrative example: the Iris data set
For the next iteration, start with the rule R = (petal length = medium, petal width = medium) ⇒ versicolor, as it covers the largest number of samples (47). The rule R* = (petal length = small or medium, petal width = small or medium) ⇒ versicolor is generated.

Illustrative example: the Iris data set
After two iterations:

                        Petal width = Small     Petal width = Medium     Petal width = Large
Petal length = Small    (0,0,0) "unlabeled"     (0,0,0) "unlabeled"      (0,0,0) "unlabeled"
Petal length = Medium   (0,0,0) "unlabeled"     (0,0,0) "unlabeled"      (0,1,6) virginica
Petal length = Large    (0,0,0) "unlabeled"     (0,1,4) virginica        (0,1,40) virginica

Illustrative example: the Iris data set
The last iteration of the algorithm generates a rule that classifies the remaining samples as virginica. Complete rule set:
If petal length = small, then setosa
Else if petal length = small or medium and petal width = small or medium, then versicolor
Else virginica.
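
The complete rule set, transcribed as a plain Python function over the discretized attributes; the ordering of the checks matters, exactly as in the ordered rule list. This is only an illustration of the rules as stated, not code from the paper.

```python
def classify_iris(petal_length, petal_width):
    """petal_length and petal_width are 'small', 'medium' or 'large'."""
    if petal_length == 'small':
        return 'setosa'
    if petal_length in ('small', 'medium') and petal_width in ('small', 'medium'):
        return 'versicolor'
    return 'virginica'
```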

Illustrative example: artificial data set
If x1 ≤ 0.25, then y = 0,
Else if x1 ≤ 0.75 and x2 > 0.75, then y = 0,
Else if x1 - x2 > 0.25, then y = 1,
Else y = 2.
1000 data points were generated, with 5% noise in the class labels.
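
One possible way to generate this data set: 1000 points drawn uniformly from the unit square, labeled by the rules above, with 5% of the labels flipped to a different class. The uniform sampling and the exact noise mechanism are my assumptions; the slide does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 2))

def true_label(x1, x2):
    if x1 <= 0.25:
        return 0
    if x1 <= 0.75 and x2 > 0.75:
        return 0
    if x1 - x2 > 0.25:
        return 1
    return 2

y = np.array([true_label(x1, x2) for x1, x2 in X])

# 5% label noise: flip the chosen labels to a different random class.
flip = rng.choice(len(y), size=int(0.05 * len(y)), replace=False)
for i in flip:
    y[i] = rng.choice([c for c in (0, 1, 2) if c != y[i]])
```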

Illustrative example: artificial data set
Extracted rule set:
If x1 - x2 > 0.24, then y = 1,
Else if x2 ≤ 0.75 and x1 ≥ 0.25, then y = 2,
Else if x1 < 0.74, then y = 0,
Else y = 2.
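
The extracted rules written in the same style as the generating rules, which makes it easy to compare the two decision regions, e.g. against the noise-free labels from the generation sketch above.

```python
def extracted_label(x1, x2):
    if x1 - x2 > 0.24:
        return 1
    if x2 <= 0.75 and x1 >= 0.25:
        return 2
    if x1 < 0.74:
        return 0
    return 2

# e.g. agreement with the generating rules on the sampled points:
# np.mean([extracted_label(a, b) == true_label(a, b) for a, b in X])
```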

Experimental results

Data set                     No. of samples   No. of attributes
Australian credit approval   690              8 discrete, 6 continuous
Boston housing               506              1 discrete, 12 continuous
Cleveland heart disease      297              5 discrete, 8 continuous
Wisconsin breast cancer      699              9 discrete

Results are from 10-fold cross-validation runs.

Experimental results

                             C4.5               NeuroLinear        NL + GRG
Data set                     % Acc   # rules    % Acc   # rules    % Acc   # rules
Australian credit approval   84.36   9.30       83.64   6.60       86.40   2.80
Boston housing               86.13   11.30      80.60   3.05       85.71   2.90
Cleveland heart disease      77.26   10.20      78.15   5.69       81.72   2.20
Wisconsin breast cancer      96.10   8.80       95.73   2.89       95.96   2.00

Conclusion
The GRG method is proposed for generating classification rules. It can be applied directly to small data sets with discrete attribute values. For larger data sets, GRG can be used in the context of neural network rule extraction by applying it to the discretized hidden unit activation values. Results on several UCI data sets show that the neural network rule extraction approach with GRG produces concise rule sets.

Thank you!