Using Emerging Patterns to Analyze Gene Expression Data Jinyan Li BioComputing Group Knowledge & Discovery Program Laboratories for Information Technology Singapore
Outline Introduction Brief history of decision trees and emerging patterns Basic ideas for decision trees and EPs Advanced topics Comparisons using gene expression data Summary
Introduction Decision trees and emerging patterns both classification rules Sharp discrimination power (no or little uncertainty) Advantage over black-box learning models Decision trees not the best (on accuracy) EP-based classifiers: competitive to the best
Brief History of Decision Trees CLS (Hunt etal. 1966)--- cost driven ID3 (Quinlan, 1986 MLJ) --- Information-driven C4.5 (Quinlan, 1993) --- Pruning ideas CART (Breiman et al. 1984) --- Gini Index
Brief history of emerging patterns General EP (Dong & Li, 1999 Sigkdd) CAEP (Dong etal, DS99), JEP-C (Li etal, KAIS00) EP-space (Li etal, ICML00), DeEPs (Li etal, MLJ) PCL (Li & Wong, ECML02)
Basic definitions Relational data Attributes (color, gene_x), attribute values (red, 200.1), attribute-value pair (equivalently, condition, item) Patterns, instances Training data, test data
A simple dataset OutlookTempHumidityWindy class Sunny7570truePlay Sunny8090 trueDon’t Sunny8585 falseDon’t Sunny 7295trueDon’t Sunny6970falsePlay Overcast7290truePlay Overcast8378falsePlay Overcast6465truePlay Overcast8175falsePlay Rain7180trueDon’t Rain6570trueDon’t Rain 7580false Play Rain6880false Play Rain7096falsePlay 9 Play samples 5 Don’t A total of 14.
A decision tree outlook windy humidity Play Don’t sunny overcast rain <= 75 > 75 false true NP-complete problem 3
C4.5 A heuristic Using information gain to select the most discriminatory feature (for tree and sub- trees) Recursive subdivision over the original training data
Characteristics of C4.5 trees Single coverage of training data (elegance) Divide-and-conquer splitting strategy Fragmentation problem Locally reliable but globally un-significant rules Missing many globally significant rules; mislead the system.
Emerging Patterns (1) An emerging pattern is a set of conditions usually involving several genes, with which most of a class satisfy but none of the other class satisfy. Real example: {gene(37720_at) > 215, gene(38028_at)<=12}. 73% vs 0% EPs are multi-gene discriminators.
Emerging Patterns (2) An EP = {cond.1, cond.2, and cond.3} C1 C2 100 C1 Samples100 C2 Samples 80% 0%
Boundary emerging patterns Definition: A boundary EP is an EP whose proper subsets are not EPs. Boundary EPs separate EPs from non- EPs Distinguish EPs with high frequency from EPs with low frequency. Boundary EPs are of our greatest interests.
EP rules derived A total of 12 EPs, some important ones of them never discovered by C4.5. Examples: {Humi Play (5:0). A total of 5 rules in the decision tree induced by C4.5. C4.5 missed many important rules.
Characteristics of EP approach Each EP is a tree with only one branch A cluster of trees: EPs combined (loss of elegance) Globally significant rules Exponential number in size (need to focus on very important feature if large number of features exist.)
Usefulness of Emerging Patterns in Classification PCL (Prediction by Collective Likelihood of Emerging Patterns). Accurate Easily understandable.
Spirit of the PCL Classifier (1) Top-Ranked EPs in Positive class Top-Ranked EPs in Negative class EP_1 (90%) EP_2 (86%). EP_n (68%) EP_1 (100%) EP_2 (95%). EP_n (80%) The idea of summarizing multiple top-ranked EPs is intended to avoid some rare tie cases.
Spirit of the PCL Classifier (2) Score_p = EP’_1_p / EP_1_p + … + EP’_k_p / EP_k_p Most freq. EP from posi. class in the test sample Most freq. EP of posi class Similarly, Score_n = EP’_1_n / EP_1_n + … + EP’_k_n / EP_k_n If Score_p > Score_n, then positive class, otherwise negative class. K=10, Ideal Score_p = 10 Score_n = 0
C4.5 and PCL Differences: –C4.5 is a greedy search algorithm using the divide-and-conquer idea. –PCL is a global search algorithm. –Leaves in C4.5 are allowed to contain mixed samples, however PCL does not. –Multiple trees used by PCL, but single tree used by C4.5. Similarities: –Both can provide high- level rules.
Advanced topics Bagging, boosting and C4.5 Convexity of EP spaces (ICML00, Li etal). Decomposition of EP spaces into a series of P-spaces and a small convex space (ECML02, Li and Wong). DeEPs (to appear in MLJ, Li etal). Version spaces (Mitchell, 1982 AI).
Gene Expression Profiles Huge number of features Most of them can be ignored when for classification Many good discriminating feautures Number of instances relatively small
Expression data in this talk Prostate disease, 102 instances, Cancer Cell v.1, issue 1, 2002 ALL disease, 327 instances, Cancer Cell, v.1, issue 2, 2002 MLL disease, 72 instances, Nature Genetics, Jan., 2002 Breast cancer, 98 instances, Nature, Jan. 2002
Classification models in this talk K-nearest neighbor (simplest) C4.5 (decision tree based, easily understandable ) Bagged and Boosted C4.5 Support Vector Machines (black box) Our PCL classifier
Our Work Flow Original Training Data Feature selection Establishing Classification Model Giving a Test Sample Making a Prediction
Selecting Discriminatory Genes T-statistics and MIT-correlation. –Based on expression average and deviation between two classes of cell. Entropy-based discretization methods, including Chi-Square statistics, and CFS. –Based on clear boundaries in the expression range of a gene.
An Ideal Gene Expression Range Expression values are two-end distributed. C1C2 Xyz.x
Results on Prostate Dataset 52 tumor samples and 50 normal samples, each represented by ~12,500 numeric values. Two Problems: -1- What is the main difference? How to use rules to represent the difference? -2- What’s the LOOCV accuracy by PCL and C4.5?
C4.5 Tree 32598_at 40707_at 33886_at Tumor Normal <=29>29 <= 10 > 10 <= -6 > -6 > _at Normal <=
Emerging patterns PatternsFrequency (T)Frequency(N) {9, 36}38 instances0 {9, 23}380 {4, 9}380 {9, 14}380 {6, 9}380 {7, 21}036 {7, 11}035 {7, 43}035 {7, 39}034 {24, 29}034 Reference number 9: the expression of 37720_at > 215. Reference number 36: the expression of 38028_at <= 12.
LOOCV accuracies ClassifierPCLC4.5 SVM3-NN Single Bagged Boosted Accuracy Error rate
Subtype classification of Childhood Leukemia Using Gene Expression Profiling One of our important projects.
Collaborating Parties St. Jude Children’s Research Hospital, USA. –Mary E. Ross, Sheila A. Shurtleff, W. Kent Williams, Divyen Patel, Rami Mahfouz, Fred G. Behm, Susana C. Raimondi, Mary V. Relling, Anami Patel, Cheng Cheng, Dario Campana, Ching-Hon Pui, William E. Evans, Clayton Naeve, and James R. Downing NUH, Singapore. –Eng-Juh Yeoh University of Mississippi, USA. –Dawn Wilkins, Xiaodong Zhou LIT, Singapore. –Jinyan Li, Huiqing Liu, Limsoon Wong
Important Motivations Leukemia is a heterogeneous disease (T-ALL, E2A-PBX1, TEL-AML1, BCR- ABL, MLL, and Hyperdip>50). Response is different. Leukemia is 80% curable if subtype is correctly diagnosed. Correct diagnosis needs many different tests and experts. Tests and experts not commonly available in a single hospital, especially in less advanced countries. So, developing new methods is needed.
ALL Data Description Subtype Training Testing T-ALL E2A-PBX TEL-AML BCR-ABL 9 6 MLL 14 6 Hyperdip Others Total A sample = {gene_1, gene_2, …, gene_12558}
Central questions Diagnosis: Classification of more than six subtypes of the leukemia disease Prognosis: Prediction of outcome of therapy
Classification Strategy A new sample T-ALL? Yes T-ALL predicted no E2A-PBX1? Yes E2A-PBX1 predicted no TEL-AML1? Yes TEL-AML1 predicted no BCR-ABL? Yes BCR-ABL predicted no MLL? Yes MLL predicted no Hyperdip50? Yes Hyperdip50 predicted no
Total Mis-classifications of the 112 Test Samples Feature Selection SVM NB k-NN C4.5 PCL Top20-ChiSquare Entropy Top20-mit Overall, we achieved 96% accuracy level in the prediction.
An Excellent Publication Cancer Cell, March ‘02 Lead article Cover page
Other Questions for the Childhood Leukemia Can one subtype separating from other subtypes? Can we do classification in parallel, rather than using a tree structure?
Test Error Rates DatasetsPCL C4.5 (112 test instances)Single Bagged Boosted BCR-ABL vs others 1:04:4 6:0 4:4 E2A-PBX1 vs others 0:0 0:0 0:0 0:0 Hyperdip50 vs others 2:24:7 4:2 4:7 MLL vs others 0:22:2 1:0 2:2 T-ALL vs others 0:0 1:0 1:0 1:0 TEL-AML1 vs others 2:0 2:2 2:1 2:2 Parallel classification Overall, PCL is better than C4.5 (single, bagged, or boosted).
MLL Distinction from Conventional ALL Armstrong, etal, Nature Genetics, Jan classes, 57 training samples, 15 test instances. Much smaller than the St.Jude’s data. Independently obtained data.
Test Error Rates by PCL and C4.5 DatasetsPCLC4.5 Single Bagged Boosted ALL vs others 0:0 1:2 0:0 1:2 AML vs others 0:0 1:0 0:0 0:0 MLL vs others 0:1 0:0 0:0 0:0 ALL vs AML 0:0 0:0 1:0 0:0 ALL vs MLL 0:0 0:0 0:0 0:0 AML vs MLL 0:0 0:0 0:0 0:0
Relapse Study on Breast Cancer Veer etal, Nature, Jan training samples, 19 test data for relapse study.
Test Error Rates by Classifiers DatasetPCLC4.5SVM3-NN Single Bagged Boosted Relapse vs Non-relapse 3:3 5:0 5:2 2:4 6:2 7:3 (12:7) Accuracy not good; may need other more sophisticated methods; Gene expression may be not sufficient for relapse study.
Summary Discussed similarities and differences between decision trees and emerging patterns. Discussed advanced topics such as bagging, boosting, convexity. Performance comparison using 4 gene expression datasets. Overall, PCL is better that C4.5 (single, bagged, or boosted) on accuracy and rules.
Thank you! May 27, 2002