Using Emerging Patterns to Analyze Gene Expression Data
Jinyan Li
BioComputing Group, Knowledge & Discovery Program, Laboratories for Information Technology, Singapore

Outline
– Introduction
– Brief history of decision trees and emerging patterns
– Basic ideas for decision trees and EPs
– Advanced topics
– Comparisons using gene expression data
– Summary

Introduction
– Decision trees and emerging patterns are both rule-based classification methods.
– Their rules have sharp discrimination power (little or no uncertainty).
– This is an advantage over black-box learning models.
– Decision trees are not the best on accuracy.
– EP-based classifiers are competitive with the best.

Brief History of Decision Trees
– CLS (Hunt et al., 1966) --- cost-driven
– ID3 (Quinlan, 1986, MLJ) --- information-driven
– C4.5 (Quinlan, 1993) --- pruning ideas
– CART (Breiman et al., 1984) --- Gini index

Brief History of Emerging Patterns
– General EPs (Dong & Li, SIGKDD 1999)
– CAEP (Dong et al., DS 1999), JEP-C (Li et al., KAIS 2000)
– EP spaces (Li et al., ICML 2000), DeEPs (Li et al., MLJ)
– PCL (Li & Wong, ECML 2002)

Basic Definitions
– Relational data
– Attributes (color, gene_x), attribute values (red, 200.1), attribute-value pairs (equivalently: conditions, items)
– Patterns, instances
– Training data, test data

A simple dataset

Outlook    Temp (°F)  Humidity (%)  Windy?  Class
Sunny      75         70            true    Play
Sunny      80         90            true    Don't
Sunny      85         85            false   Don't
Sunny      72         95            true    Don't
Sunny      69         70            false   Play
Overcast   72         90            true    Play
Overcast   83         78            false   Play
Overcast   64         65            true    Play
Overcast   81         75            false   Play
Rain       71         80            true    Don't
Rain       65         70            true    Don't
Rain       75         80            false   Play
Rain       68         80            false   Play
Rain       70         96            false   Play

9 Play samples, 5 Don't samples, 14 in total.

A decision tree (finding an optimal tree is an NP-complete problem):

outlook = sunny    -> humidity <= 75 -> Play
                      humidity > 75  -> Don't
outlook = overcast -> Play
outlook = rain     -> windy = false  -> Play
                      windy = true   -> Don't

C4.5
– A heuristic algorithm.
– Uses information gain to select the most discriminatory feature (for the tree and its subtrees).
– Recursively subdivides the original training data.
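As a concrete illustration of the splitting criterion, here is a minimal sketch (not C4.5 itself) that computes information gain on the weather data above; the function names and the restriction to the categorical Outlook and Windy attributes are illustrative choices.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Reduction in entropy obtained by splitting on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# (Outlook, Windy) and class for the 14 samples in the table above.
rows = [("Sunny", True), ("Sunny", True), ("Sunny", False), ("Sunny", True),
        ("Sunny", False), ("Overcast", True), ("Overcast", False),
        ("Overcast", True), ("Overcast", False), ("Rain", True),
        ("Rain", True), ("Rain", False), ("Rain", False), ("Rain", False)]
labels = ["Play", "Don't", "Don't", "Don't", "Play", "Play", "Play", "Play",
          "Play", "Don't", "Don't", "Play", "Play", "Play"]

# C4.5 picks the attribute with the largest gain and then recurses on each branch.
print("Outlook:", information_gain(rows, labels, 0))
print("Windy:  ", information_gain(rows, labels, 1))
```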

Characteristics of C4.5 Trees
– Single coverage of the training data (elegance).
– Divide-and-conquer splitting strategy.
– Fragmentation problem.
– Rules that are locally reliable but globally insignificant.
– Many globally significant rules are missed, which can mislead the system.

Emerging Patterns (1)
An emerging pattern is a set of conditions, usually involving several genes, that most samples of one class satisfy but no samples of the other class satisfy.
Real example: {gene(37720_at) > 215, gene(38028_at) <= 12}, satisfied by 73% of one class vs 0% of the other.
EPs are multi-gene discriminators.

Emerging Patterns (2)
[Diagram: an EP = {cond.1, cond.2, cond.3}; among 100 C1 samples and 100 C2 samples, the pattern occurs in 80% of C1 and 0% of C2.]

Boundary Emerging Patterns
– Definition: a boundary EP is an EP whose proper subsets are not EPs.
– Boundary EPs separate EPs from non-EPs.
– They distinguish EPs with high frequency from EPs with low frequency.
– Boundary EPs are of greatest interest to us.
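The following is a minimal sketch, under simplifying assumptions, of how one might check whether a given set of conditions is an EP or a boundary EP. The representation of samples as gene-to-value dicts, the condition format, the 50% home-class frequency threshold, and the toy samples are all illustrative; only the two-gene pattern itself comes from the slide.

```python
from itertools import combinations

def satisfies(sample, pattern):
    """True if a sample (dict gene -> expression value) meets every condition.
    Each condition is a (gene, op, threshold) tuple with op in {'>', '<='}."""
    return all(sample[g] > t if op == '>' else sample[g] <= t
               for g, op, t in pattern)

def support(pattern, samples):
    """Fraction of samples that contain the pattern."""
    return sum(satisfies(s, pattern) for s in samples) / len(samples)

def is_ep(pattern, home, other, min_home=0.5):
    """EP: frequent in the home class, absent from the other class."""
    return support(pattern, home) >= min_home and support(pattern, other) == 0.0

def is_boundary_ep(pattern, home, other, min_home=0.5):
    """Boundary EP: an EP none of whose proper subsets is itself an EP."""
    return is_ep(pattern, home, other, min_home) and all(
        not is_ep(set(sub), home, other, min_home)
        for r in range(1, len(pattern))
        for sub in combinations(pattern, r))

# The pattern from the slide, checked against a few synthetic samples.
pattern = {("37720_at", ">", 215), ("38028_at", "<=", 12)}
tumour = [{"37720_at": 300, "38028_at": 5},
          {"37720_at": 250, "38028_at": 10},
          {"37720_at": 100, "38028_at": 50}]
normal = [{"37720_at": 250, "38028_at": 40},   # meets only the first condition
          {"37720_at": 80,  "38028_at": 5}]    # meets only the second condition
print(is_ep(pattern, tumour, normal), is_boundary_ep(pattern, tumour, normal))
```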

EP Rules Derived
– A total of 12 EPs in this dataset; some important ones among them are never discovered by C4.5.
– Example: {Humi…} -> Play (5:0).
– The decision tree induced by C4.5 yields only 5 rules.
– C4.5 missed many important rules.

Characteristics of the EP Approach
– Each EP is a tree with only one branch.
– EPs combined form a cluster of trees (loss of elegance).
– Globally significant rules.
– Exponentially many in number (so we need to focus on the most important features when there are many features).

Usefulness of Emerging Patterns in Classification
– PCL (Prediction by Collective Likelihood of emerging patterns).
– Accurate.
– Easily understandable.

Spirit of the PCL Classifier (1)

Top-ranked EPs in the positive class: EP_1 (90%), EP_2 (86%), ..., EP_n (68%)
Top-ranked EPs in the negative class: EP_1 (100%), EP_2 (95%), ..., EP_n (80%)

The idea of summarizing multiple top-ranked EPs is intended to avoid some rare tie cases.

Spirit of the PCL Classifier (2)

Score_p = freq(EP'_1_p) / freq(EP_1_p) + ... + freq(EP'_k_p) / freq(EP_k_p)

where EP'_i_p is the i-th most frequent positive-class EP contained in the test sample, and EP_i_p is the i-th most frequent EP of the positive class overall. Similarly,

Score_n = freq(EP'_1_n) / freq(EP_1_n) + ... + freq(EP'_k_n) / freq(EP_k_n)

If Score_p > Score_n, predict the positive class; otherwise predict the negative class. With k = 10, the ideal scores are Score_p = 10 and Score_n = 0.
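A minimal sketch of the scoring step just described, assuming the EPs of each class have already been mined and ranked by their frequency in that class; `satisfies` is the containment test from the earlier sketch, and all names are illustrative rather than the actual PCL implementation.

```python
def pcl_score(test_sample, ranked_eps, satisfies, k=10):
    """Collective-likelihood score of one class for a test sample.

    ranked_eps: (pattern, frequency) pairs for that class, most frequent first.
    satisfies:  function telling whether the test sample contains a pattern.
    """
    # EPs of this class that the test sample actually contains, in rank order.
    contained = [(p, f) for p, f in ranked_eps if satisfies(test_sample, p)]
    # Score = sum_i freq(EP'_i) / freq(EP_i); the ideal value is k.
    return sum(fc / fo
               for (_, fc), (_, fo) in zip(contained[:k], ranked_eps[:k]))

def pcl_predict(test_sample, pos_eps, neg_eps, satisfies, k=10):
    """Predict positive if Score_p > Score_n, otherwise negative."""
    score_p = pcl_score(test_sample, pos_eps, satisfies, k)
    score_n = pcl_score(test_sample, neg_eps, satisfies, k)
    return "positive" if score_p > score_n else "negative"
```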

C4.5 and PCL
Differences:
– C4.5 is a greedy search algorithm using the divide-and-conquer idea.
– PCL is a global search algorithm.
– Leaves in C4.5 are allowed to contain mixed samples; PCL does not allow this.
– PCL uses multiple trees, whereas C4.5 uses a single tree.
Similarities:
– Both can provide high-level rules.

Advanced Topics
– Bagging, boosting, and C4.5.
– Convexity of EP spaces (Li et al., ICML 2000).
– Decomposition of EP spaces into a series of P-spaces and a small convex space (Li & Wong, ECML 2002).
– DeEPs (Li et al., to appear in MLJ).
– Version spaces (Mitchell, 1982, AI).

Gene Expression Profiles
– Huge number of features.
– Most of them can be ignored for classification.
– Many good discriminating features.
– Relatively small number of instances.

Expression Data in This Talk
– Prostate disease, 102 instances (Cancer Cell, vol. 1, issue 1, 2002)
– ALL disease, 327 instances (Cancer Cell, vol. 1, issue 2, 2002)
– MLL disease, 72 instances (Nature Genetics, Jan. 2002)
– Breast cancer, 98 instances (Nature, Jan. 2002)

Classification Models in This Talk
– K-nearest neighbour (simplest)
– C4.5 (decision-tree based, easily understandable)
– Bagged and boosted C4.5
– Support vector machines (black box)
– Our PCL classifier

Our Work Flow
Original training data -> feature selection -> establishing the classification model -> giving a test sample -> making a prediction

Selecting Discriminatory Genes
– T-statistics and MIT correlation: based on the average and deviation of expression between the two classes of cells.
– Entropy-based discretization methods, including Chi-square statistics and CFS: based on clear boundaries in the expression range of a gene.
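A small sketch of the t-statistic ranking mentioned above (the other criteria, such as Chi-square or CFS, would slot in the same way); the function names and the toy random data are illustrative.

```python
import numpy as np

def t_statistic(expr_a, expr_b):
    """Two-sample t-statistic for one gene measured in two classes of samples."""
    a, b = np.asarray(expr_a, dtype=float), np.asarray(expr_b, dtype=float)
    denom = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / denom

def top_genes(class_a, class_b, n=20):
    """Rank genes (rows of a genes-by-samples matrix) by |t| and keep the n best."""
    scores = [abs(t_statistic(class_a[g], class_b[g]))
              for g in range(class_a.shape[0])]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:n]

# Toy usage: 100 genes, 10 samples in one class and 8 in the other.
rng = np.random.default_rng(0)
print(top_genes(rng.normal(size=(100, 10)), rng.normal(size=(100, 8)), n=5))
```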

An Ideal Gene Expression Range
[Diagram: expression values are two-end distributed, with class C1 clustered at one end of the range and class C2 at the other.]

Results on the Prostate Dataset
52 tumor samples and 50 normal samples, each represented by ~12,500 numeric values.
Two problems:
1. What is the main difference between the classes, and how can rules represent that difference?
2. What LOOCV accuracy do PCL and C4.5 achieve?

C4.5 Tree
[Tree for the prostate data: internal nodes test genes 32598_at (<= 29 vs > 29), 40707_at (<= 10 vs > 10), and 33886_at (<= -6 vs > -6); leaves predict Tumor or Normal.]

Emerging Patterns

Pattern     Frequency (Tumor)   Frequency (Normal)
{9, 36}     38 instances        0
{9, 23}     38                  0
{4, 9}      38                  0
{9, 14}     38                  0
{6, 9}      38                  0
{7, 21}     0                   36
{7, 11}     0                   35
{7, 43}     0                   35
{7, 39}     0                   34
{24, 29}    0                   34

Reference number 9: the expression of 37720_at > 215.
Reference number 36: the expression of 38028_at <= 12.

LOOCV Accuracies
[Table: accuracy and error rate of PCL, C4.5 (single, bagged, boosted), SVM, and 3-NN under leave-one-out cross-validation on the prostate dataset.]

Subtype Classification of Childhood Leukemia Using Gene Expression Profiling
One of our important projects.

Collaborating Parties
– St. Jude Children's Research Hospital, USA: Mary E. Ross, Sheila A. Shurtleff, W. Kent Williams, Divyen Patel, Rami Mahfouz, Fred G. Behm, Susana C. Raimondi, Mary V. Relling, Anami Patel, Cheng Cheng, Dario Campana, Ching-Hon Pui, William E. Evans, Clayton Naeve, and James R. Downing
– NUH, Singapore: Eng-Juh Yeoh
– University of Mississippi, USA: Dawn Wilkins, Xiaodong Zhou
– LIT, Singapore: Jinyan Li, Huiqing Liu, Limsoon Wong

Important Motivations
– Leukemia is a heterogeneous disease (T-ALL, E2A-PBX1, TEL-AML1, BCR-ABL, MLL, and Hyperdip>50); each subtype responds to therapy differently.
– Leukemia is 80% curable if the subtype is correctly diagnosed.
– Correct diagnosis requires many different tests and experts.
– These tests and experts are not commonly available in a single hospital, especially in less advanced countries.
– So new diagnostic methods are needed.

ALL Data Description
[Table: numbers of training and testing samples per subtype (T-ALL, E2A-PBX1, TEL-AML1, BCR-ABL: 9/6, MLL: 14/6, Hyperdip>50, Others, Total).]
A sample = {gene_1, gene_2, ..., gene_12558}

Central Questions
– Diagnosis: classification of more than six subtypes of the leukemia disease.
– Prognosis: prediction of the outcome of therapy.

Classification Strategy
A new sample is tested against the subtypes in a fixed order:
T-ALL? yes -> predict T-ALL; no ->
E2A-PBX1? yes -> predict E2A-PBX1; no ->
TEL-AML1? yes -> predict TEL-AML1; no ->
BCR-ABL? yes -> predict BCR-ABL; no ->
MLL? yes -> predict MLL; no ->
Hyperdip>50? yes -> predict Hyperdip>50; no -> Others.
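A minimal sketch of this cascaded strategy, assuming one one-vs-rest classifier per subtype (for example a PCL model trained on "this subtype vs the rest"); the dictionary-of-predicates interface is an illustrative assumption, not the actual system.

```python
SUBTYPE_ORDER = ["T-ALL", "E2A-PBX1", "TEL-AML1", "BCR-ABL", "MLL", "Hyperdip>50"]

def classify_cascade(sample, classifiers, order=SUBTYPE_ORDER):
    """Ask the one-vs-rest classifiers in a fixed order; the first 'yes' wins.

    classifiers: dict mapping subtype -> predicate(sample) -> bool.
    """
    for subtype in order:
        if classifiers[subtype](sample):
            return subtype
    return "Others"  # none of the six subtypes matched
```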

Total Misclassifications of the 112 Test Samples
[Table: total misclassifications of SVM, NB, k-NN, C4.5, and PCL under three feature-selection schemes (Top20-ChiSquare, Entropy, Top20-MIT).]
Overall, we achieved a 96% accuracy level in the prediction.

An Excellent Publication
Cancer Cell, March 2002: lead article, cover page.

Other Questions for Childhood Leukemia
– Can one subtype be separated from all the other subtypes?
– Can we do the classification in parallel, rather than using a tree structure?

Test Error Rates (Parallel Classification)

Dataset (112 test instances)   PCL    C4.5 Single  C4.5 Bagged  C4.5 Boosted
BCR-ABL vs others              1:0    4:4          6:0          4:4
E2A-PBX1 vs others             0:0    0:0          0:0          0:0
Hyperdip>50 vs others          2:2    4:7          4:2          4:7
MLL vs others                  0:2    2:2          1:0          2:2
T-ALL vs others                0:0    1:0          1:0          1:0
TEL-AML1 vs others             2:0    2:2          2:1          2:2

Overall, PCL is better than C4.5 (single, bagged, or boosted).

MLL Distinction from Conventional ALL
– Armstrong et al., Nature Genetics, Jan. 2002. Three classes, 57 training samples, 15 test instances.
– Much smaller than the St. Jude data.
– Independently obtained data.

Test Error Rates by PCL and C4.5

Dataset          PCL   C4.5 Single  C4.5 Bagged  C4.5 Boosted
ALL vs others    0:0   1:2          0:0          1:2
AML vs others    0:0   1:0          0:0          0:0
MLL vs others    0:1   0:0          0:0          0:0
ALL vs AML       0:0   0:0          1:0          0:0
ALL vs MLL       0:0   0:0          0:0          0:0
AML vs MLL       0:0   0:0          0:0          0:0

Relapse Study on Breast Cancer
Veer et al., Nature, Jan. 2002. Training samples and 19 test instances for the relapse study.

Test Error Rates by Classifiers

Dataset                  PCL   C4.5 Single  C4.5 Bagged  C4.5 Boosted  SVM   3-NN
Relapse vs Non-relapse   3:3   5:0          5:2          2:4           6:2   7:3
(12:7)

The accuracy is not good; other, more sophisticated methods may be needed, and gene expression alone may not be sufficient for relapse study.

Summary
– Discussed similarities and differences between decision trees and emerging patterns.
– Discussed advanced topics such as bagging, boosting, and convexity.
– Compared performance using 4 gene expression datasets.
– Overall, PCL is better than C4.5 (single, bagged, or boosted) on both accuracy and rules.

Thank you! May 27, 2002