Heterogeneous Forests of Decision Trees Krzysztof Grąbczewski & Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Torun, Poland.

Motivation
Different classification systems:
Black box systems (statistical, neural) lack comprehensibility.
Fuzzy logic or rough sets usually lead to complicated systems that are hard to understand.
Crisp logical rules may be the best solution.
Advantages of logical rules:
Comprehensibility (sometimes more important than the highest accuracy).
They can find the most important concepts of the problem.
They explain classification results (very important, for instance, in medicine).
If simple, they show the most important features.

Heterogeneous systems
Homogeneous systems use one type of "building blocks" and the same type of decision borders. Examples: neural networks, SVMs, decision trees, kNN.
Committees combine many models together, but lead to complex models that are difficult to understand.
Discovering the simplest class structures (the right inductive bias) requires heterogeneous adaptive systems (HAS).
Ockham's razor: simpler systems are better.
HAS examples:
NN with many types of neuron transfer functions.
k-NN with different distance functions.
DT with different types of test criteria.

DT Forests
Problem with DTs (also NNs): they are not stable; small changes in the input data lead to different tree (network) structures.
Heterogeneous Forests of Decision Trees: all simple trees may be interesting!
An expert gets alternative descriptions of the problem.
Solutions with different sensitivity and specificity but similar accuracy are generated.

Similarity-based HAS
Local distance functions optimized differently in different regions of feature space.
Weighted Minkowski distance functions:
D(X,Y) = ( Σ_i s_i |X_i - Y_i|^α )^(1/α), e.g. α = 20,
and other types of functions, including probabilistic distance functions, changing the piecewise-linear decision borders.
RBF networks with different transfer functions; LVQ with different local functions.
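A minimal sketch of such a weighted Minkowski distance (names and the exact weighting scheme are illustrative, not taken from the paper):

```python
import numpy as np

def weighted_minkowski(x, y, weights, alpha=2.0):
    """D(x, y) = (sum_i w_i |x_i - y_i|**alpha) ** (1/alpha).
    alpha=2 gives Euclidean (spherical) borders; a large alpha such as 20
    approaches the Chebyshev metric, giving nearly rectangular borders."""
    x, y, weights = map(np.asarray, (x, y, weights))
    return float(np.sum(weights * np.abs(x - y) ** alpha) ** (1.0 / alpha))

# The same pair of points under different exponents:
a, b, w = [0.0, 0.0], [1.0, 2.0], [1.0, 1.0]
print(weighted_minkowski(a, b, w, alpha=2))    # ~2.236 (Euclidean)
print(weighted_minkowski(a, b, w, alpha=20))   # ~2.0   (close to max_i |x_i - y_i|)
```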

HAS decision trees
Decision trees select the best feature and threshold value (univariate trees) or the best feature combination (multivariate trees).
Decision borders: hyperplanes.
Introducing tests based on the L_α Minkowski metric:
for L_2, spherical decision borders are produced;
for L_∞, rectangular decision borders are produced.
Many choices are possible, for example Fisher Linear Discrimination decision trees.

Separability criterion
Separate different classes as well as possible.
Use both continuous and discrete attributes to generate class separation indices that are comparable across different features. How?
Split continuous attributes (automatic, context-dependent generation of linguistic variables).
Choose the best subsets of discrete feature values (by analysis of all subsets; due to the 2^N complexity it is recommended to avoid discrete features with more than 10 values).
Combine the best intervals and subsets in a tree, which can easily be converted to a set of classification rules.
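A minimal sketch of how candidate tests could be generated for both attribute types (helper names are illustrative, not from the original SSV implementation):

```python
from itertools import combinations

def continuous_split_candidates(values):
    """Candidate thresholds: midpoints between consecutive distinct sorted values."""
    v = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(v, v[1:])]

def discrete_split_candidates(values):
    """Candidate value subsets: all non-empty proper subsets of the observed values
    (2^N growth, hence the advice to avoid features with more than ~10 values)."""
    v = sorted(set(values))
    return [frozenset(c) for k in range(1, len(v)) for c in combinations(v, k)]

print(continuous_split_candidates([1.0, 2.4, 2.5, 6.1]))   # [1.7, 2.45, 4.3]
print(len(discrete_split_candidates(["a", "b", "c"])))      # 6
```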

SSV HAS DT
Define the left and right sides of the data for a test T with threshold (or value subset) s.
Count how many pairs of vectors from different classes are separated by the split, and how many vectors from the same class are separated.
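A minimal sketch of this separability count (written to follow the published SSV definition; a sketch under that assumption, not the authors' code):

```python
from collections import Counter

def ssv(left_labels, right_labels):
    """Separability of a split: reward pairs of vectors from different classes
    that fall on opposite sides, penalize vectors of the same class that are
    split across both sides."""
    L, R = Counter(left_labels), Counter(right_labels)
    classes = set(L) | set(R)
    n_right = sum(R.values())
    separated_pairs = sum(L[c] * (n_right - R[c]) for c in classes)
    same_class_split = sum(min(L[c], R[c]) for c in classes)
    return 2 * separated_pairs - same_class_split

print(ssv([0, 0, 0, 1], [1, 1, 1, 0]))   # 18: an almost clean split scores high
```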

SSV HAS algorithm
Compromise between complexity and flexibility:
Use the training vectors as the reference set R.
Calculate T_R(X) = D(X,R) for all data vectors, i.e. the distance matrix.
Use T_R(X) as additional test conditions.
Calculate SSV(s) for each condition and select the best split.
Different distance functions lead to different decision borders; several distance functions may be used simultaneously.
Example: points from a noisy 10-D plane rotated by 45°, plus a half-sphere centered on the plane.
Standard SSV tree: 44 rules, 99.7% accuracy.
HAS SSV tree (Euclidean): 15 rules, 99.9% accuracy.
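A schematic sketch of these distance-based tests (the ssv() helper from the previous sketch is passed in explicitly; X and y are assumed to be numpy arrays; this illustrates the idea, it is not the authors' implementation):

```python
import numpy as np

def best_distance_test(X, y, ssv):
    """Treat the Euclidean distance D(X, R) to each reference vector R
    (here: every training vector) as an extra continuous attribute and score
    candidate thresholds on it with the SSV criterion.
    Returns (best_score, reference_index, threshold)."""
    best_score, best_ref, best_thr = float("-inf"), None, None
    for r_idx, ref in enumerate(X):
        d = np.linalg.norm(X - ref, axis=1)              # distances to this reference
        cuts = sorted(set(d))
        for thr in ((a + b) / 2 for a, b in zip(cuts, cuts[1:])):
            score = ssv(y[d < thr], y[d >= thr])
            if score > best_score:
                best_score, best_ref, best_thr = score, r_idx, thr
    return best_score, best_ref, best_thr
```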

What to measure?
Overall accuracy is not always the most important thing.
Given a model M, the confusion matrix for a class C+ versus all other classes has rows = true classes and columns = classes predicted by M.

Quantities derived from p(Ci|Cj)
Several quantities are used to evaluate classification models M created to distinguish the C+ class:
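The formulas themselves did not survive in this transcript; below is a minimal sketch of the standard two-class quantities quoted on the following slides (accuracy A, sensitivity S+, specificity S-). The cost measure K with parameter a, also quoted later, is defined in the source paper and is not reconstructed here.

```python
def two_class_measures(tp, fn, fp, tn):
    """Quantities from a 2x2 confusion matrix for class C+ vs. the rest
    (rows = true class, columns = predicted class, as on the previous slide)."""
    accuracy    = (tp + tn) / (tp + fn + fp + tn)   # A  : overall accuracy
    sensitivity = tp / (tp + fn)                    # S+ : fraction of C+ recognized
    specificity = tn / (tn + fp)                    # S- : fraction of the rest recognized
    return accuracy, sensitivity, specificity

print(two_class_measures(tp=90, fn=10, fp=5, tn=95))   # (0.925, 0.9, 0.95)
```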

SSV HAS Iris
Iris data: 3 classes, 50 samples per class.
SSV solution with the usual conditions (6 errors, 96%), or with a distance test using vectors from a given node only:
1. if petal length < 2.45 then class 1
2. if petal length > 2.45 and petal width < 1.65 then class 2
3. if petal length > 2.45 and petal width > 1.65 then class 3
SSV with Euclidean distance tests using all training vectors as reference (5 errors, 96.7%):
1. if petal length < 2.45 then class 1
2. if petal length > 2.45 and ||X-R15|| < 4.02 then class 2
3. if petal length > 2.45 and ||X-R15|| > 4.02 then class 3
||X-R15|| is the Euclidean distance to the reference vector R15.

SSV HAS Wisconsin
Wisconsin breast cancer dataset (UCI): 699 cases, 9 features (cell parameters, values 1-10).
Classes: benign 458 (65.5%) and malignant 241 (34.5%).
A single rule gives the simplest known description of this data:
IF ||X-R303|| < … THEN malignant ELSE benign
18 errors, A = 97.4%, S+ = 97.9%, S- = 96.9%, K = … (cost K for a = 5). A good prototype for malignant!
Simple thresholds, that's what MDs like the most!
Best leave-one-out error: 98.3% (FSM); best 10-fold CV: around 97.5% (Naive Bayes + kernel, SVM); C4.5 gives 94.7±2.0%; SSV without distances: 96.4±2.1%.

Wisconsin results 1
Tree 1:
R1: If F3 < 2.5 then benign
R2: If F6 < 2.5 ∧ F5 < 3.5 then benign
R3: else malignant
A = 95.6% (25 errors + 6 unclassified), S+ = 95.0%, S- = 95.9%, K = 0.104
Tree 2 (a = 5):
R1: If F2 < 2.5 then benign
R2: If F2 < 4.5 ∧ F6 < 3.5 then benign
R3: else malignant
A = 95.0% (33 errors + 2 unclassified), S+ = 90.5%, S- = 97.4%, K = 0.107

Wisconsin results 2
Tree 3:
R1: If F3 < 2.5 then benign
R2: If F5 < 2.5 ∧ F6 < 2.5 then benign
R3: else malignant
A = 95.1% (34 errors), S+ = 95.0%, S- = 95.2%, K = …
Tree 4:
R1: If F3 < 2.5 then benign
R2: If F2 < 2.5 ∧ F5 < 2.5 then benign
R3: else malignant
A = 95.1% (34 errors), S+ = 95.9%, S- = 94.8%, K = 0.1166

Breast cancer recurrence (Ljubljana)
286 cases: 201 no-recurrence-events (70.3%), 85 recurrence-events (29.7%).
9 attributes with 2-13 different values each.
Difficult and noisy data, from the UCI repository.
Tree 1:
R1: If deg-malig > 2.5 ∧ inv-nodes > 2.5 then recurrence-events
R2: If deg-malig > 2.5 ∧ inv-nodes < 2.5 ∧ (tumor-size ∈ [25-34] ∨ tumor-size ∈ [50-54]) then recurrence-events
R3: else no-recurrence-events
A = 76.9% (66 errors), S+ = 47.1%, S- = 89.6%, K = 0.526

Breast cancer recurrence (Ljubljana)
Tree 2:
R1: If breast = left ∧ inv-nodes > 2.5 then recurrence-events
R2: else no-recurrence-events
A = 75.5% (70 errors), S+ = 30.5%, S- = 94.5%, K = 0.570
Tree 3:
R1: If deg-malig > 2.5 ∧ inv-nodes > 2.5 then recurrence-events
R2: else no-recurrence-events
A = 76.2% (68 errors), S+ = 31.8%, S- = 95.0%, K = 0.554
The best rule in CV tests, with the best interpretation.

Conclusions
Heterogeneous systems are worth investigating.
The HAS approach has good biological justification.
Better learning cannot repair the wrong bias of a model.
StatLog report: large differences between RBF and MLP results on many datasets.
Networks, trees and kNN should select and optimize their functions.
Radial and sigmoidal functions in NNs are not the only choice.
Simple solutions may be discovered by HAS systems.
Open questions:
How to train heterogeneous systems?
How to find the optimal balance between complexity and flexibility, e.g. complexity of the nodes vs. their interactions (weights)?
Hierarchical, modular networks: nodes that are networks themselves.

The End? Perhaps still the beginning...