Heterogeneous Forests of Decision Trees
Krzysztof Grąbczewski & Włodzisław Duch
Department of Informatics, Nicholas Copernicus University, Torun, Poland.
Motivation
Different classification systems:
  Black-box systems (statistical, neural) lack comprehensibility.
  Fuzzy logic or rough sets usually lead to complicated systems that are not understandable.
  Crisp logical rules may be the best solution.
Advantages of logical rules:
  Comprehensibility (sometimes more important than the best accuracy).
  They can find the most important concepts of the problem.
  They explain classification results (very important, for instance, in medicine).
  If simple, they show the most important features.
Heterogeneous systems
Homogeneous systems: one type of "building block", the same type of decision borders. Ex: neural networks, SVMs, decision trees, kNNs.
Committees combine many models, but lead to complex models that are difficult to understand.
Discovering the simplest class structures and their inductive bias requires heterogeneous adaptive systems (HAS).
Ockham's razor: simpler systems are better.
HAS examples:
  NN with many types of neuron transfer functions.
  k-NN with different distance functions.
  DT with different types of test criteria.
DT Forests
Problem with DT (also NN): they are not stable; small changes in the input lead to different tree (network) structures.
Heterogeneous Forests of Decision Trees: all simple trees may be interesting!
An expert gets alternative problem descriptions.
Solutions with different sensitivity and specificity at similar accuracy are generated.
Similarity-based HAS
Local distance functions are optimized differently in different regions of the feature space.
Weighted Minkowski distance functions (Ex: exponent = 20) and other types of functions, including probabilistic functions, change the piecewise linear decision borders.
RBF networks with different transfer functions; LVQ with different local functions.
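The slide's own distance formula did not survive in this text, so the following is only a minimal sketch of a weighted Minkowski distance in its common textbook form, D(X,Y) = (sum_i s_i |X_i - Y_i|^alpha)^(1/alpha). The function name and the parameter names `weights` and `alpha` are assumptions made for illustration.

```python
import math

def weighted_minkowski(x, y, weights, alpha):
    """Weighted Minkowski distance between two equal-length vectors.

    Assumed form: (sum_i w_i * |x_i - y_i|**alpha) ** (1/alpha).
    For alpha = infinity the limit is the largest absolute difference
    over the features that have a positive weight.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(alpha):
        return max(d for w, d in zip(weights, diffs) if w > 0)
    total = sum(w * d ** alpha for w, d in zip(weights, diffs))
    return total ** (1.0 / alpha)

# The same pair of points under different exponents.
x, y, w = (0.0, 0.0), (3.0, 4.0), (1.0, 1.0)
d1 = weighted_minkowski(x, y, w, 1)          # city-block distance
d2 = weighted_minkowski(x, y, w, 2)          # Euclidean distance
d20 = weighted_minkowski(x, y, w, 20)        # already close to the maximum norm
dinf = weighted_minkowski(x, y, w, math.inf)
```

With a large exponent such as 20 the distance is dominated by the largest coordinate difference, which is why high exponents give nearly rectangular neighborhoods.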
HAS decision trees
Decision trees select the best feature/threshold value for univariate and multivariate trees.
Decision borders: hyperplanes.
Introducing tests based on Minkowski (L-norm) metrics:
  For L2, spherical decision borders are produced.
  For L∞, rectangular borders are produced.
Many choices, for example Fisher Linear Discrimination decision trees.
Separability criterion
Separate different classes as well as possible.
Use both continuous and discrete attributes, generating class separation indices that are comparable across different features.
How?
  Splitting continuous attributes (automatic, context-dependent generation of linguistic variables).
  Choosing the best sets of discrete feature values (by analysis of all subsets; because of the 2^N complexity it is recommended to avoid discrete features with more than 10 values).
  Combining the best intervals and sets in a tree, which can easily be converted to a set of classification rules.
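To illustrate why the subset analysis is limited to features with few values, here is a small sketch that enumerates every candidate subset of a discrete feature's values. The helper name `candidate_subsets` is hypothetical, and the source does not say how the original implementation enumerates subsets.

```python
from itertools import combinations

def candidate_subsets(values):
    """Yield every non-empty proper subset of the distinct feature values.

    For N distinct values there are 2**N - 2 such subsets, so the number
    of candidate tests grows exponentially with N.
    """
    distinct = sorted(set(values))
    for size in range(1, len(distinct)):
        for subset in combinations(distinct, size):
            yield frozenset(subset)

three_values = list(candidate_subsets(["left", "right", "central"]))
ten_values = list(candidate_subsets(range(10)))
```

A subset and its complement describe the same split, so an implementation could halve the work, but the growth remains exponential.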
SSV HAS DT
Define the left and right areas for test T with threshold (or subset) s, then count how many pairs of vectors from different classes are separated and how many vectors from the same class are separated.
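The criterion's formula is not present in this text. The sketch below assumes the form usually given for the SSV criterion: twice the number of separated pairs of vectors from different classes, minus, for each class, the smaller of its left and right counts. The function name and the convention that values below the threshold go left are assumptions.

```python
from collections import Counter

def ssv(values, labels, threshold):
    """Separability of a split value, under the assumed SSV form.

    Vectors with value < threshold form the left area, the rest the right.
    """
    left = Counter(c for v, c in zip(values, labels) if v < threshold)
    right = Counter(c for v, c in zip(values, labels) if v >= threshold)
    n_right = sum(right.values())
    classes = set(labels)
    # Pairs of vectors from different classes that end up on opposite sides.
    separated_pairs = sum(left[c] * (n_right - right[c]) for c in classes)
    # Penalty for classes that the split cuts into two parts.
    split_same_class = sum(min(left[c], right[c]) for c in classes)
    return 2 * separated_pairs - split_same_class

values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]
clean_split = ssv(values, labels, 2.5)   # separates the classes perfectly
poor_split = ssv(values, labels, 1.5)    # cuts class "a" in two
```

The clean split scores higher because it separates all four cross-class pairs and splits no class.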
SSV HAS algorithm
Compromise between complexity and flexibility:
  Use training vectors as reference vectors R.
  Calculate T_R(X) = D(X,R) for all data vectors, i.e. the distance matrix.
  Use T_R(X) as additional test conditions.
  Calculate SSV(s) for each condition and select the best split.
Different distance functions lead to different decision borders.
Several distance functions are used simultaneously.
Example: points on a noisy 10D plane, rotated by 45°, plus a half-sphere centered on the plane.
  Standard SSV tree: 44 rules, 99.7%.
  HAS SSV tree (Euclidean): 15 rules, 99.9%.
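The steps above can be sketched as a search over distance-based tests. This is an illustration under stated assumptions, not the authors' implementation: the SSV form is the commonly cited one (the slides' formula is missing here), candidate thresholds are taken midway between consecutive distinct distances, only the Euclidean distance is used, and all function names are hypothetical.

```python
import math
from collections import Counter

def ssv(values, labels, threshold):
    """Assumed SSV form: 2 * separated cross-class pairs - split same-class count."""
    left = Counter(c for v, c in zip(values, labels) if v < threshold)
    right = Counter(c for v, c in zip(values, labels) if v >= threshold)
    n_right = sum(right.values())
    classes = set(labels)
    pairs = sum(left[c] * (n_right - right[c]) for c in classes)
    same = sum(min(left[c], right[c]) for c in classes)
    return 2 * pairs - same

def best_distance_split(X, y):
    """Return (score, reference index, threshold) of the best test D(X, R) < s."""
    best = (float("-inf"), None, None)
    for r_idx, ref in enumerate(X):
        # One column of the distance matrix: T_R(X) for every data vector.
        dist = [math.dist(x, ref) for x in X]
        cuts = sorted(set(dist))
        for lo, hi in zip(cuts, cuts[1:]):
            s = (lo + hi) / 2.0
            score = ssv(dist, y, s)
            if score > best[0]:
                best = (score, r_idx, s)
    return best

# Two well-separated toy clusters.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 1, 1, 1]
score, ref_index, threshold = best_distance_split(X, y)
```

In a full tree the same search would also cover the ordinary feature tests, and the best condition overall would define the node.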
What to measure?
Overall accuracy is not always the most important thing.
Given a model M, the confusion matrix for class + against all other classes has rows for the true class and columns for the class predicted by M.
Quantities derived from p(C_i|C_j)
Several quantities are used to evaluate a classification model M created to distinguish the class C+. The results that follow report accuracy A, sensitivity S+, specificity S- and cost K.
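As a hedged illustration, accuracy, sensitivity and specificity can be computed from the four cells of a two-class confusion matrix as below. The function name is hypothetical, and the counts in the usage lines are not taken from the slides: they were chosen so that they reproduce the percentages reported for Tree 3 on the Ljubljana data (85 recurrence cases, 201 others, 68 errors). The cost K is left out because its definition is not available in this text.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity for the class of interest (+).

    tp, fn: true + cases predicted as + and as -.
    fp, tn: true - cases predicted as + and as -.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # fraction of + cases recognized
        "specificity": tn / (tn + fp),   # fraction of - cases recognized
    }

# Illustrative counts consistent with A=76.2%, S+=31.8%, S-=95.0%.
metrics = binary_metrics(tp=27, fn=58, fp=10, tn=191)
```

Two models with the same accuracy can differ widely in sensitivity and specificity, which is why the results list all three.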
SSV HAS Iris
Iris data: 3 classes, 50 samples/class.
SSV solution with the usual conditions (6 errors, 96%), or with distance tests using vectors from a given node only:
  1. if petal length < 2.45 then class 1
  2. if petal length > 2.45 and petal width < 1.65 then class 2
  3. if petal length > 2.45 and petal width > 1.65 then class 3
SSV with Euclidean distance tests using all training vectors as reference (5 errors, 96.7%):
  1. if petal length < 2.45 then class 1
  2. if petal length > 2.45 and ||X-R15|| < 4.02 then class 2
  3. if petal length > 2.45 and ||X-R15|| > 4.02 then class 3
||X-R15|| is the Euclidean distance to the vector R15.
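The first rule set translates directly into code. This sketch assumes measurements in centimeters as in the standard Iris data and sends values exactly equal to a threshold to the higher branch, a case the rules leave open. The distance-based rules are not reproduced because the reference vector R15 is not given in this text.

```python
def classify_iris(petal_length, petal_width):
    """Apply the three threshold rules; returns class 1, 2 or 3."""
    if petal_length < 2.45:
        return 1
    if petal_width < 1.65:
        return 2
    return 3

predictions = [
    classify_iris(1.4, 0.2),   # short petal
    classify_iris(4.5, 1.5),   # long, narrow petal
    classify_iris(6.0, 2.5),   # long, wide petal
]
```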
SSV HAS Wisconsin
Wisconsin breast cancer dataset (UCI): 699 cases, 9 features (cell parameters, values 1..10).
Classes: benign 458 (65.5%) and malignant 241 (34.5%).
A single rule gives the simplest known description of this data:
  IF ||X-R303|| < then malignant else benign
  18 errors, A = 97.4%, S+ = 97.9%, S- = 96.9%, K =
Good prototype for malignant! Cost K for a = 5.
Simple thresholds, that's what MDs like the most!
Best leave-one-out (L1O) result 98.3% (FSM); best 10CV around 97.5% (Naïve Bayes + kernel, SVM).
C4.5 gives 94.7±2.0%.
SSV without distances: 96.4±2.1%.
Wisconsin results 1
Tree 1
  R1: If F3 < 2.5 then benign
  R2: If F6 < 2.5 and F5 < 3.5 then benign
  R3: else malignant
  A = 95.6% (25 err. + 6 uncl.), S+ = 95.0%, S- = 95.9%, K = 0.104
Tree 2, a = 5
  R1: If F2 < 2.5 then benign
  R2: If F2 < 4.5 and F6 < 3.5 then benign
  R3: else malignant
  A = 95.0% (33 err. + 2 uncl.), S+ = 90.5%, S- = 97.4%, K = 0.107
Wisconsin results 2
Tree 3
  R1: If F3 < 2.5 then benign
  R2: If F5 < 2.5 and F6 < 2.5 then benign
  R3: else malignant
  A = 95.1% (34 err.), S+ = 95.0%, S- = 95.2%, K =
Tree 4
  R1: If F3 < 2.5 then benign
  R2: If F2 < 2.5 and F5 < 2.5 then benign
  R3: else malignant
  A = 95.1% (34 err.), S+ = 95.9%, S- = 94.8%, K = 0.1166
Breast cancer recurrence (Ljubljana)
286 cases: 201 no-recurrence-events (70.3%), 85 recurrence-events (29.7%).
9 attributes with 2-13 different values each.
Difficult and noisy data, from UCI.
Tree 1
  R1: If deg-malig > 2.5 and inv-nodes > 2.5 then recurrence-events
  R2: If deg-malig > 2.5 and inv-nodes < 2.5 and (tumor-size in [25-34] or tumor-size in [50-54]) then recurrence-events
  R3: else no-recurrence-events
  A = 76.9% (66 err.), S+ = 47.1%, S- = 89.6%, K = 0.526
Breast cancer recurrence (Ljubljana)
Tree 2
  R1: If breast = left and inv-nodes > 2.5 then recurrence-events
  R2: else no-recurrence-events
  A = 75.5% (70 err.), S+ = 30.5%, S- = 94.5%, K = 0.570
Tree 3
  R1: If deg-malig > 2.5 and inv-nodes > 2.5 then recurrence-events
  R2: else no-recurrence-events
  A = 76.2% (68 err.), S+ = 31.8%, S- = 95.0%, K = 0.554
  Best rule in CV tests, best interpretation.
Conclusions
Heterogeneous systems are worth investigating.
Good biological justification of the HAS approach.
Better learning cannot repair a wrong bias of the model.
StatLog report: large differences between RBF and MLP on many datasets.
Networks, trees, kNN should select/optimize their functions.
Radial and sigmoidal functions in NN are not the only choice.
Simple solutions may be discovered by HAS systems.
Open questions:
  How to train heterogeneous systems?
  How to find the optimal balance between complexity and flexibility? Ex. complexity of nodes vs. interactions (weights)?
  Hierarchical, modular networks: nodes that are networks themselves.
The End ? Perhaps still the beginning...