Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains
Jinyan Li and Limsoon Wong
Copyright © 2004 by Jinyan Li and Limsoon Wong
Part 2: Rule-Based Approaches
Outline
Overview of Supervised Learning
Decision Trees
Ensembles
–Bagging
–Boosting
–Random forest
–Randomization trees
–CS4
Overview of Supervised Learning
Computational Supervised Learning
Also called classification
Learn from past experience, and use the learned knowledge to classify new data
The knowledge is learned by intelligent algorithms
Examples:
–Clinical diagnosis for patients
–Cell type classification
Data
A classification application involves more than one class of data. E.g.,
–Normal vs disease cells for a diagnosis problem
Training data is a set of instances (samples, points) with known class labels
Test data is a set of instances whose class labels are to be predicted
Notation
Training data: {⟨x_1, y_1⟩, ⟨x_2, y_2⟩, …, ⟨x_m, y_m⟩}, where the x_j are n-dimensional vectors and the y_j are from a discrete space Y. E.g., Y = {normal, disease}.
Test data: {⟨u_1, ?⟩, ⟨u_2, ?⟩, …, ⟨u_k, ?⟩}
Process
Training data X with class labels Y → learn f, a classifier (a mapping, a hypothesis)
Test data U → predicted class labels f(U)
Relational Representation of Gene Expression Data
An m × n matrix: m samples (rows) by n features (columns, on the order of 1000), where entry x_ij is the expression of gene j in sample i; each sample also carries a class label (e.g., P or N).

gene 1  gene 2  gene 3  gene 4  …  gene n   class
x_11    x_12    x_13    x_14    …  x_1n     P
x_21    x_22    x_23    x_24    …  x_2n     N
…       …       …       …       …  …        …
x_m1    x_m2    x_m3    x_m4    …  x_mn     P
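As a concrete illustration, the matrix above maps directly onto a pair of arrays. The numbers and labels below are a made-up toy example, not real expression data.

```python
import numpy as np

# Hypothetical toy example: m = 4 samples, n = 5 genes
# (real microarray datasets have thousands of genes)
X = np.array([[2.1, 0.3, 5.7, 1.2, 0.9],   # sample 1
              [1.8, 4.1, 0.6, 3.0, 1.1],   # sample 2
              [2.4, 0.2, 6.0, 1.4, 0.8],   # sample 3
              [1.7, 3.9, 0.7, 2.8, 1.0]])  # sample 4: rows = samples, columns = genes
y = np.array(["P", "N", "P", "N"])         # class label of each sample
```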
Features
Also called attributes
Categorical features
–e.g., feature color = {red, blue, green}
Continuous or numerical features
–gene expression
–age
–blood pressure
Discretization converts continuous features into categorical ones
An Example
Overall Picture of Supervised Learning
Data from many domains (biomedical, financial, government, scientific) are fed to classifiers ("M-Doctors"): decision trees, emerging patterns, SVMs, neural networks
Evaluation of a Classifier
Performance on independent blind test data
K-fold cross validation: divide the dataset into k even parts; in turn, k-1 parts are used for training and the remaining part is treated as test data
LOOCV (leave-one-out cross validation), a special case of k-fold CV with k equal to the number of samples
Accuracy, error rate
False positive rate, false negative rate, sensitivity, specificity, precision (a small sketch of these measures follows below)
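A minimal sketch of the measures listed above, computed from true vs predicted labels on a blind test set; the "P"/"N" label names are placeholders rather than anything from the original slides.

```python
# Accuracy, error rate, sensitivity, specificity, precision, FP/FN rates
# from true vs predicted labels (binary case, positive class 'P').
def evaluate(y_true, y_pred, pos="P", neg="N"):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == pos and p == pos)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == neg and p == neg)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == neg and p == pos)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == pos and p == neg)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "error rate": (fp + fn) / len(y_true),
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # 1 - false positive rate
        "precision": tp / (tp + fp),
        "false positive rate": fp / (fp + tn),
        "false negative rate": fn / (fn + tp),
    }

# K-fold cross validation: split the data into k even parts; in turn, hold one
# part out as test data and train on the other k-1 parts. LOOCV is the special
# case where k equals the number of samples.
```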
Requirements of Biomedical Classification
High accuracy
High comprehensibility
Importance of Rule-Based Methods
Systematic selection of a small number of features used for decision making
Increases the comprehensibility of the knowledge patterns
C4.5 and CART are two commonly used rule induction algorithms, also called decision tree induction algorithms
Structure of Decision Trees
A tree consists of a root node, internal nodes, and leaf nodes; internal nodes test features (x_1, x_2, x_3, x_4, …) against thresholds (a_1, a_2, …) and leaf nodes carry class labels (A, B)
Each root-to-leaf path is a rule, e.g.: if x_1 > a_1 and x_2 > a_2, then it is class A
C4.5 and CART are two of the most widely used
Easy interpretation, but accuracy generally unattractive
Elegance of Decision Trees
(figure: a small decision tree with leaves labelled A and B)
Brief History of Decision Trees
CLS (Hunt et al., 1966): cost driven
ID3 (Quinlan, 1986, MLJ): information driven
C4.5 (Quinlan, 1993): gain ratio + pruning ideas
CART (Breiman et al., 1984): Gini index
A Simple Dataset
9 "Play" samples, 5 "Don't" samples, a total of 14
A Decision Tree
Root: outlook. If outlook = overcast, Play. If outlook = sunny, test humidity: <= 75 gives Play, > 75 gives Don't. If outlook = rain, test windy: false gives Play, true gives Don't.
Finding an optimal (smallest) decision tree is an NP-complete problem.
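Written out as code, the tree above becomes a set of nested if-then rules; this is just a transcription of the tree, using the feature names from the dataset.

```python
def classify(outlook, humidity, windy):
    # Each root-to-leaf path of the tree is one rule
    if outlook == "overcast":
        return "Play"
    if outlook == "sunny":
        return "Play" if humidity <= 75 else "Don't"
    # outlook == "rain"
    return "Don't" if windy else "Play"

print(classify("sunny", 70, True))   # instance 1 of the dataset -> Play
```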
Construction of a Decision Tree
Determination of the root node of the tree and the root nodes of its sub-trees
Most Discriminatory Feature
Every feature can be used to partition the training data
If the partitions contain a pure class of training instances, then this feature is the most discriminatory
Example of Partitions
Categorical feature
–The number of partitions of the training data is equal to the number of values of this feature
Numerical feature
–Two partitions (below and above a chosen split point)
Instance #  Outlook   Temp  Humidity  Windy  Class
1           Sunny     75    70        true   Play
2           Sunny     80    90        true   Don't
3           Sunny     85    85        false  Don't
4           Sunny     72    95        true   Don't
5           Sunny     69    70        false  Play
6           Overcast  72    90        true   Play
7           Overcast  83    78        false  Play
8           Overcast  64    65        true   Play
9           Overcast  81    75        false  Play
10          Rain      71    80        true   Don't
11          Rain      65    70        true   Don't
12          Rain      75    80        false  Play
13          Rain      68    80        false  Play
14          Rain      70    96        false  Play
Partition by Outlook (total 14 training instances):
Outlook = sunny: instances 1, 2, 3, 4, 5 with classes P, D, D, D, P
Outlook = overcast: instances 6, 7, 8, 9 with classes P, P, P, P
Outlook = rain: instances 10, 11, 12, 13, 14 with classes D, D, P, P, P
Partition by Temperature (total 14 training instances):
Temperature <= 70: instances 5, 8, 11, 13, 14 with classes P, P, D, P, P
Temperature > 70: instances 1, 2, 3, 4, 6, 7, 9, 10, 12 with classes P, D, D, D, P, P, P, D, P
Three Measures
Gini index
Information gain
Gain ratio
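For reference, the standard definitions of these measures (S is the set of training instances at a node, p_c the fraction of S in class c, and S_v the subset taking value v of the candidate feature F):

```latex
\[
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
\mathrm{Gini}(S) = 1 - \sum_{c} p_c^2
\]
\[
\mathrm{Gain}(S, F) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)
\qquad
\mathrm{GainRatio}(S, F) = \frac{\mathrm{Gain}(S, F)}{-\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}
\]
```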
Steps of Decision Tree Construction
Select the best feature as the root node of the whole tree
After partitioning by this feature, select the best feature (w.r.t. the subset of training data) as the root node of each sub-tree
Recurse until the partitions become pure or almost pure
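A minimal sketch (not the authors' code) that computes the three measures for the Outlook split on the 14-instance dataset above:

```python
from collections import Counter
from math import log2

# (outlook, class) pairs for the 14 training instances in the table
data = [
    ("sunny", "P"), ("sunny", "D"), ("sunny", "D"), ("sunny", "D"), ("sunny", "P"),
    ("overcast", "P"), ("overcast", "P"), ("overcast", "P"), ("overcast", "P"),
    ("rain", "D"), ("rain", "D"), ("rain", "P"), ("rain", "P"), ("rain", "P"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

labels = [y for _, y in data]
parts = {v: [y for x, y in data if x == v] for v in set(x for x, _ in data)}

# Information gain = entropy before the split - weighted entropy after it
gain = entropy(labels) - sum(len(p) / len(data) * entropy(p) for p in parts.values())

# Split information penalises features with many values; gain ratio = gain / split info
split_info = -sum(len(p) / len(data) * log2(len(p) / len(data)) for p in parts.values())
gain_ratio = gain / split_info

# Weighted Gini index of the partitions (lower is better)
gini_split = sum(len(p) / len(data) * gini(p) for p in parts.values())

print(f"gain={gain:.3f}  gain_ratio={gain_ratio:.3f}  gini={gini_split:.3f}")
# For 'outlook' this prints gain of about 0.247, matching the classic result.
```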
Characteristics of C4.5 Trees
Single coverage of the training data (elegance)
Divide-and-conquer splitting strategy
Fragmentation problem
Rules that are locally reliable but globally insignificant
Many globally significant rules are missed, which can mislead the system
Decision Tree Ensembles
Bagging
Boosting
Random forest
Randomization trees
CS4
Motivating Example
Suppose h_1, h_2, h_3 are independent classifiers, each with accuracy 60%; C_1 and C_2 are the only classes; and t is a test instance in C_1.
The committee predicts by majority vote: h(t) = argmax_{C in {C_1, C_2}} |{h_j in {h_1, h_2, h_3} : h_j(t) = C}|
Then prob(h(t) = C_1)
= prob(h_1(t)=C_1 & h_2(t)=C_1 & h_3(t)=C_1) + prob(h_1(t)=C_1 & h_2(t)=C_1 & h_3(t)=C_2) + prob(h_1(t)=C_1 & h_2(t)=C_2 & h_3(t)=C_1) + prob(h_1(t)=C_2 & h_2(t)=C_1 & h_3(t)=C_1)
= 60% * 60% * 60% + 60% * 60% * 40% + 60% * 40% * 60% + 40% * 60% * 60%
= 64.8%
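The same calculation generalises to any committee size via the binomial formula for a correct majority; this quick check is an addition, not part of the original slides.

```python
from math import comb

def majority_accuracy(p, k):
    """Probability that a strict majority of k independent classifiers,
    each correct with probability p, votes for the true class."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(k // 2 + 1, k + 1))

print(majority_accuracy(0.6, 3))    # 0.648, the 64.8% computed above
print(majority_accuracy(0.6, 21))   # accuracy keeps improving as the committee grows
```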
Bagging
Proposed by Breiman (1996)
Also called bootstrap aggregating
Makes use of randomness injected into the training data
Main Ideas
From the original training set (e.g., 50 positives + 50 negatives), draw bootstrap samples such as 48 p + 52 n, 49 p + 51 n, 53 p + 47 n, …
A base inducer such as C4.5 is trained on each sample, giving a committee H of classifiers: h_1, h_2, …, h_k
Decision Making by Bagging
Given a new test sample T, every classifier in the committee casts one vote, and the majority class is predicted (equal voting)
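A minimal sketch of bagging, assuming scikit-learn's DecisionTreeClassifier as a stand-in for the C4.5 base inducer (the original work used C4.5 itself):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(k):
        # Bootstrap sample: draw |X| instances with replacement
        idx = rng.integers(0, len(X), size=len(X))
        committee.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return committee

def bagging_predict(committee, T):
    # Equal voting: each tree casts one vote per test sample
    votes = np.array([h.predict(T) for h in committee])
    return np.array([max(set(col), key=list(col).count) for col in votes.T])
```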
Boosting
AdaBoost by Freund & Schapire (1995)
Also called adaptive boosting
Makes use of weighted instances and weighted voting
Main Ideas
Start with (say) 100 instances, all with equal weight; train a classifier h_1 and compute its error e_1
If the error is 0 or > 0.5, stop
Otherwise re-weight: multiply the weights of correctly classified instances by e_1/(1 - e_1), then renormalize
Train the next classifier h_2 on the instances with their new weights, compute its error, and repeat
Decision Making by AdaBoost.M1
Given a new test sample T, the committee votes with weights: classifier h_i votes with weight log(1/β_i), where β_i = e_i/(1 - e_i), and the class with the largest total weight is predicted
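A minimal sketch of AdaBoost.M1's re-weighting and weighted voting, assuming shallow scikit-learn trees as the weak learner; an illustration of the idea, not the exact implementation behind the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, rounds=10):
    w = np.ones(len(X)) / len(X)                 # start with equal weights
    committee = []
    for _ in range(rounds):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = h.predict(X) != y
        err = w[wrong].sum()
        if err == 0 or err > 0.5:                # stopping rule from the slide
            break
        beta = err / (1 - err)
        w[~wrong] *= beta                        # down-weight correctly classified instances
        w /= w.sum()                             # renormalize
        committee.append((h, np.log(1 / beta)))  # classifier's voting weight = log(1/beta)
    return committee

def adaboost_m1_predict(committee, T):
    # Weighted voting over the committee
    classes = committee[0][0].classes_
    scores = sum(wt * (h.predict(T)[:, None] == classes) for h, wt in committee)
    return classes[np.argmax(scores, axis=1)]
```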
Bagging vs Boosting
Bagging
–The bagging classifiers are constructed independently of one another
–Equal voting
Boosting
–The construction of a new boosting classifier depends on the performance of the previous classifier, i.e. sequential construction (a series of classifiers)
–Weighted voting
Random Forest
Proposed by Breiman (2001)
Similar to bagging, but the base inducer is not the standard C4.5
Makes use of randomness twice: in the bootstrap samples and in the feature selection at each node
Main Ideas
As in bagging, bootstrap samples (48 p + 52 n, 49 p + 51 n, 53 p + 47 n, …) are drawn from the original training set (50 p + 50 n)
A base inducer (not standard C4.5 but a revised version) is trained on each sample, giving a committee H of classifiers: h_1, h_2, …, h_k
A Revised C4.5 as Base Classifier
When splitting a node, the best feature is selected not from the original n features but from m_try randomly chosen features
Decision Making by Random Forest
Given a new test sample T, each tree in the committee votes and the majority class is predicted (equal voting); a usage sketch follows below
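In practice the two layers of randomness map directly onto scikit-learn's RandomForestClassifier; a usage sketch with illustrative parameter values:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,      # size of the committee
    max_features="sqrt",   # m_try = sqrt(n) features examined at each node
    bootstrap=True,        # bagging-style resampling of the training set
)
# forest.fit(X_train, y_train); forest.predict(T) then aggregates the trees'
# decisions (scikit-learn averages class probabilities rather than counting raw votes).
```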
Randomization Trees
Proposed by Dietterich (2000)
Makes use of randomness in the selection of the best split point
Main Ideas
At a node, instead of always taking the single best split over the original n features, the candidate splits (e.g., feature 1: choices 1, 2, 3; feature 2: choices 1, 2; …; feature 8: choices 1, 2, 3) are ranked, and one of the 20 best candidates is selected at random
Equal voting on the committee of such decision trees
A small sketch of this random split selection follows below
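A minimal sketch of the randomised split choice; the helper arguments are hypothetical, and in practice the scoring function would be information gain or gain ratio on the node's data.

```python
import random

def choose_split(candidate_splits, score, top=20):
    """Pick a split at random from the `top` best-scoring candidates.

    candidate_splits: list of (feature, split point) pairs.
    score: a function rating each candidate, e.g. its information gain.
    """
    ranked = sorted(candidate_splits, key=score, reverse=True)
    return random.choice(ranked[:top])
```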
CS4
Proposed by Li et al. (2003)
CS4: Cascading-and-Sharing for decision trees
Does not make use of randomness
Main Ideas
Selection of root nodes is done in a cascading manner: the top-ranked feature becomes the root of tree 1, the second-ranked feature the root of tree 2, …, the k-th ranked feature the root of tree k, giving a total of k trees
Decision Making by CS4
Given a new test sample, the k trees vote, but not with equal weight (weighted voting)
Summary of Ensemble Classifiers
Bagging, Random Forest, AdaBoost.M1, Randomization Trees: their rules may not be correct when applied to the training data
CS4: its rules are correct on the training data
Any Questions?