Fast Decision Tree Learning Techniques for Microarray Data Collections


1 Fast Decision Tree Learning Techniques for Microarray Data Collections
Xiaoyong Li and Christoph F. Eick, Department of Computer Science, University of Houston
Talk Organization: Introduction; Microarray Data Collections; Attribute Histogram based Techniques; Speeding Up Leave-one-out Cross Validation; Evaluation; Related Ideas; Summary

2 Microarray Experiments

3 cDNA Gene Expression Data
Data on G genes for n mRNA samples, arranged as a matrix with genes in rows and samples (sample1, sample2, …) in columns.
Gene expression level of gene i in mRNA sample j = (normalized) log(red intensity / green intensity)

4 Applications of Microarrays to Tumor Classification
Class Prediction (supervised): assign samples to known classes; e.g. support vector machines, discriminant analysis, k-nearest neighbor, etc.
Class Discovery (unsupervised): find new classes from the data; e.g. hierarchical clustering, principal component analysis, self-organizing maps, etc.
Gene Discovery

5 Characteristics of Microarray Data Collections
Very large number of attributes (e.g. 7,000)
Typically, a small number of examples (e.g. 100)
All attributes are numerical

6 Goals of the Research
Design and implement a decision tree learning algorithm that is well suited for microarray data collections
Develop efficient algorithms for leave-one-out cross-validation for decision trees
Try to find "short cuts" in decision tree learning algorithms to increase speed
Remark: the focus is on binary classification problems

7 Classification Tree

8 Decision Trees
Partition the feature space into a set of rectangles, then fit a simple model in each one
Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets (starting with X itself)
Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier

9 How to Determine Node Tests
Entropy function (C4.5): E(S) = - Σi pi * log2(pi), where pi is the fraction of examples in S belonging to class i
Information gain: Gain(S, A) = E(S) - Σv (|Sv| / |S|) * E(Sv), where the Sv are the subsets of S induced by the test on attribute A
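The formulas appeared as images on the original slide; below is a minimal Python sketch of the two measures (function names are mine, not from the presented tool):

import math

def entropy(counts):
    # Entropy of a class distribution, e.g. counts = [9, 5]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, children):
    # Information gain of a split: parent entropy minus the size-weighted
    # average entropy of the child nodes
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent_counts) - remainder

# e.g. splitting [9+, 5-] into [4+, 2-] and [5+, 3-]:
# info_gain([9, 5], [[4, 2], [5, 3]]) is roughly 0.0013 bits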

10 Node Tests for Numerical Attributes
Standard method: binary splits (e.g. temp < 45)
In contrast to nominal attributes, numerical attributes offer many possible split points
The solution is a straightforward extension: evaluate the info gain (or other measure) for every possible split point of the attribute, choose the "best" split point, and take its info gain as the info gain of the attribute
Computationally more demanding
Other work (e.g. [FaIr93]) suggests using non-binary splits for numerical attributes

11 An Example
Split on the temperature attribute from the weather data:
E.g. 4 +'s and 2 -'s for temperature < 71.5, and 5 +'s and 3 -'s for temperature >= 71.5
Gain([4,2],[5,3]) = E([9,5]) - ((6/14)*E([4,2]) + (8/14)*E([5,3]))
Split points are placed halfway between values
All split points can be evaluated in one pass!
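To illustrate the "one pass" claim, here is a sketch of scanning all candidate split points after a single sort (a hypothetical helper in Python, not the presented tool; labels are booleans, True for '+'):

import math

def entropy2(p, n):
    # Binary entropy of p positives and n negatives
    e = 0.0
    for c in (p, n):
        if c:
            e -= (c / (p + n)) * math.log2(c / (p + n))
    return e

def best_split(values, labels):
    # Returns (best_gain, best_threshold) using midpoints as candidate splits
    pairs = sorted(zip(values, labels))
    total_p = sum(labels); total_n = len(labels) - total_p
    base = entropy2(total_p, total_n)
    lp = ln = 0                      # class counts left of the split
    best = (-1.0, None)
    for i in range(len(pairs) - 1):
        lp += pairs[i][1]; ln += (not pairs[i][1])
        if pairs[i][0] == pairs[i + 1][0]:
            continue                 # no split point between equal values
        left, right = i + 1, len(pairs) - i - 1
        rem = (left / len(pairs)) * entropy2(lp, ln) \
            + (right / len(pairs)) * entropy2(total_p - lp, total_n - ln)
        gain = base - rem
        if gain > best[0]:
            best = (gain, (pairs[i][0] + pairs[i + 1][0]) / 2)
    return best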

12 Attribute Histogram – a Special Data Structure
Figure a. A set of attribute values sorted into ascending order according to the numerical value. The class labels of the examples are also shown (P: +, N: -).
Figure b. The original set of 7 examples can be reduced to 4 blocks. Only the class frequencies within the blocks need to be known.

13 Attribute Histogram – a Special Data Structure, cont’d
Figure c. For the current histogram, only 3 split points need to be considered
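A minimal sketch of how such a histogram might be built (my own reconstruction from Figures a and b, not the original code):

def attribute_histogram(values, labels):
    # Sort one attribute and collapse consecutive examples with the same
    # class label into blocks; returns a list of [label, block_size] pairs
    blocks = []
    for _, label in sorted(zip(values, labels)):
        if blocks and blocks[-1][0] == label:
            blocks[-1][1] += 1
        else:
            blocks.append([label, 1])
    return blocks

# Made-up example: 7 sorted examples with classes P P N N N P N
# collapse to 4 blocks, so only 3 block boundaries remain as split points:
# attribute_histogram([1, 2, 3, 4, 5, 6, 7], list("PPNNNPN"))
# -> [['P', 2], ['N', 3], ['P', 1], ['N', 1]]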

14 Using Attribute Histograms for Decision Tree Induction
Compare splitting att1 = (1|1|1|1|1|1|1|1|1|1|1|1) vs. splitting att2 = (3|3|3|3). For the classical decision tree learning algorithm both split points are equally bad, but we claim that att2 is much more suitable to be used in a node test than att1.
Idea: develop heuristics based on attribute histograms instead of information gain. We compute the "hist index" from the attribute histogram (it uses the fact that a² ≥ b² + c² whenever a = b + c and b, c > 0):
Hist(S) = Σi Pi²
Hist(att1) = 1² + … + 1² = 12; Hist(att2) = 4*3² = 36
Hist is used in our work for attribute pruning; attributes with "flip-flopping" class memberships (that is, a low hist index) are no longer considered for node tests.
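A sketch of the hist index over the block sizes of an attribute histogram (pure blocks only; the mixed-block generalization discussed on the next slide is not handled here):

def hist_index(blocks):
    # Hist(S) = sum of squared block sizes; long runs of one class give a
    # large value, "flip-flopping" class membership gives a small one
    return sum(p * p for p in blocks)

print(hist_index([1] * 12))      # att1: 12 blocks of size 1 -> 12
print(hist_index([3, 3, 3, 3]))  # att2: 4 blocks of size 3  -> 36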

15 Using Attribute Histograms for Decision Tree Induction (continued)
Hist has to be generalized to cope with duplicate attribute values having different class memberships; e.g. for att1 we receive the following histogram: (1|1|3|1|1/3|2|2|1); Hist(att1) = 1² + … + 1 + 1 + |3-1| + …
Efficient algorithms are needed to propagate attribute histograms during decision tree generation.

16 Leave-one-out Cross Validation
Why use leave-one-out cross-validation? Datasets are limited in size, which is especially true for microarray data collections.
Advantage: bias-free error estimate
Disadvantage: computational overhead
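For reference, a minimal sketch of the plain, unoptimised procedure (Python lists; train_fn and predict_fn are hypothetical placeholders for the tree learner); the next slide is about avoiding the n full re-trainings this loop implies:

def leave_one_out_error(examples, labels, train_fn, predict_fn):
    # Train on all but one example, test on the held-out one, repeat n times
    errors = 0
    for i in range(len(examples)):
        model = train_fn(examples[:i] + examples[i + 1:],
                         labels[:i] + labels[i + 1:])
        errors += predict_fn(model, examples[i]) != labels[i]
    return errors / len(examples)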

17 Approaches to Improve Leave-one-out Efficiency
Build the tree from the whole dataset first; then construct the trees for the datasets missing a particular example.
Reuse sub-trees from previous runs (using hashing techniques to find them quickly).
Reuse previous computations of split points.
Use approximations when computing information gain: we compute absolute differences between class cardinalities instead, e.g.:
Gain(3|5|7|2|4) ≈ |+3-5| + |+7-2+4| = 2+9 = 11
Gain(3|5|7|2|4) ≈ |+3-5+7| + |-2+4| = 5+2 = 7
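A sketch of the approximation as I read the example (the alternating-sign convention for the histogram blocks is inferred from the slide's numbers, not stated explicitly):

def approx_gain(blocks, split):
    # Score a split point of an attribute histogram whose blocks alternate
    # in class: on each side, take the absolute difference between the two
    # class counts (alternating signs +, -, +, ...), then add the two sides
    signed = [size if i % 2 == 0 else -size for i, size in enumerate(blocks)]
    return abs(sum(signed[:split])) + abs(sum(signed[split:]))

print(approx_gain([3, 5, 7, 2, 4], 2))  # |+3-5| + |+7-2+4| = 11
print(approx_gain([3, 5, 7, 2, 4], 3))  # |+3-5+7| + |-2+4| = 7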

18 Evaluation
Three tools were compared:
C5.0 (run with default parameter settings)
Microarray Decision Tree (our implementation; also uses pre-pruning)
Optimized Microarray Decision Tree Tool (approximate information gain computations, attribute pruning, reuse of sub-trees)

19 Used datasets - 1. Leukemia
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring (Golub et al., 1999).
n = 72 mRNA samples, two classes: Acute myeloid leukemia (AML), 25 cases; Acute lymphoblastic leukemia (ALL), 47 cases.
P = 6,187 genes

20 Used datasets - 2. Colon tissue
Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays (Alon et al., 1999).
n = 62 mRNA samples, two classes: Normal tissues, 22 cases; Tumor tissues, 40 cases.
P ≈ 6,500 genes

21 Used datasets - 3. Breast cancer
Gene expression profiling predicts clinical outcome of breast cancer (van 't Veer et al., 2002).
n = 78 mRNA samples, two classes: Greater than 5 years disease-free, 44 cases; Less than 5 years disease-free, 34 cases.
P ≈ 25,000 genes

22 Gene Pre-selection ([DSF2003])
Removing genes based on the ratio of their between-groups to within-groups sum of squares. For a particular gene j, the ratio is defined as:
BSS(j)/WSS(j) = Σi Σk I(yi = k)(x̄kj - x̄.j)² / Σi Σk I(yi = k)(xij - x̄kj)²
where x̄.j denotes the overall mean of gene j and x̄kj its mean within class k. After we calculate all the BSS/WSS ratios, only the p genes with the largest ratios will be input to our classifiers.
Idea of BSS/WSS removal: gene j is removed if the class means for gene j are close to the overall mean (low BSS value) and/or if the individual examples are far away from their class mean (high WSS value).
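A minimal NumPy sketch of this pre-selection step (function names are mine; assumes X is an n-samples-by-genes matrix and y a NumPy array of class labels, not the original implementation):

import numpy as np

def bss_wss_ratios(X, y):
    # BSS/WSS ratio for every gene (column of X)
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        bss += len(Xk) * (Xk.mean(axis=0) - overall) ** 2
        wss += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return bss / wss

def preselect(X, y, p):
    # Keep only the p genes with the largest BSS/WSS ratio
    keep = np.argsort(bss_wss_ratios(X, y))[::-1][:p]
    return X[:, keep], keep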

23 CPU time comparison of three different decision tree tools

24 Accuracy for Tools and Data Sets

25 Beta adaptin protein mRNA
See5, Leukemia: 1024 genes used. Two tree diagrams (labeled 45/72 and 27/72): the first tests Gene906 <= 2.66 (Beta adaptin protein mRNA; GB DEF = CD190 protein) and Gene667 <= 3.18 (Beta adaptin protein mRNA); the second tests only Gene906 <= 2.66. Error rate = 5/72.

26 Beta adaptin protein mRNA
DTM, Leukemia: 1024 genes used. Tree diagram (labeled 72/72) testing Gene906 <= 2.66 (Beta adaptin protein mRNA) and Gene30 <= 2.33 (PAGA, Proliferation-associated gene A, natural killer-enhancing factor A). Error rate = 6/72.

27 Related Ideas
Use one-dimensional clustering based on histograms (determine a subset of the available split points that maximizes class purity and minimizes the number of clusters); then create n-ary splits based on the obtained clusters (current work).
We believe that clustering with respect to class membership for an attribute results in much better predictors of attribute usefulness than traditional statistical techniques, such as information gain or statistics based on class means (such as BSS/WSS).
The hist index can be used to replace, or be used in conjunction with, information gain to determine node tests; e.g.:
Gain((2|4|8|2)) = 6/16*Hist(2|4) + 10/16*Hist(8|2) = … + 10/16*68
Gain((2|4|4|1|4|1)) = 6/16*Hist(2|4) + 10/16*Hist(4|1|4|1) = … + 10/16*34
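A sketch of this hist-based node-test score as it appears in the example above (hypothetical helper names, not the tool's code):

def hist(blocks):
    return sum(b * b for b in blocks)

def hist_gain(left_blocks, right_blocks):
    # Size-weighted sum of the hist indices of the two children of a split
    n_left, n_right = sum(left_blocks), sum(right_blocks)
    n = n_left + n_right
    return n_left / n * hist(left_blocks) + n_right / n * hist(right_blocks)

print(hist_gain([2, 4], [8, 2]))        # 6/16*20 + 10/16*68 = 50.0
print(hist_gain([2, 4], [4, 1, 4, 1]))  # 6/16*20 + 10/16*34 = 28.75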

28 Summary
Attribute histograms seem to be very useful for attribute pruning and node test selection, but more empirical work is needed to confirm this hypothesis.
Reuse of sub-trees, attribute pruning, and approximate information gain computations resulted in a significant speedup for leave-one-out cross-validation.

29 Colon Tissue: 1200 genes used
See5 decision tree, Colon Tissue: 1200 genes used. Node tests and gene annotations, in the order shown: Gene3 <= 1.76, Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds; Gene223 <= 1.93, CREB-BINDING PROTEIN (Mus musculus), Human cysteine-rich protein (CRP) gene, exons 5 and 6; Gene10 <= 2.87, ATP SYNTHASE A CHAIN (Trypanosoma brucei brucei); Gene565 <= 1.99. Error rate = 12/62.

30 Colon Tissue: 1200 genes used
See5 decision tree, Colon Tissue: 1200 genes used. Node tests and gene annotations, in the order shown: Gene3 <= 1.76, Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds; Gene586 <= 1.43, Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds; Gene492 <= 2.87, ESTROGEN SULFOTRANSFERASE (Bos taurus). Error rate = 12/62.

31 Colon Tissue: 1200 genes used
See5 decision tree, Colon Tissue: 1200 genes used. Node tests and gene annotations, in the order shown: Gene3 <= 1.76, Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds; Gene586 <= 1.43, Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds; Gene312 <= 3.07, FERRITIN LIGHT CHAIN (HUMAN). Error rate = 12/62.

32 Colon Tissue: 1200 genes used
See5 decision tree, Colon Tissue: 1200 genes used. Node tests and gene annotations, in the order shown: Gene3 <= 1.76, Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds; Gene223 <= 1.93, CREB-BINDING PROTEIN (Mus musculus), Human cysteine-rich protein (CRP) gene, exons 5 and 6; Gene10 <= 2.87. Error rate = 12/62.

33 Colon Tissue: 1200 genes used
See5 decision tree, Colon Tissue: 1200 genes used. Node tests and gene annotations, in the order shown: Gene3 <= 1.76, Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds; Gene7 <= 2.89, MYOSIN REGULATORY LIGHT CHAIN 2, SMOOTH MUSCLE ISOFORM (HUMAN), contains element TAR1; Gene1074 <= 1.14, G1/S-SPECIFIC CYCLIN D2 (Homo sapiens). Error rate = 12/62.

34 Colon Tissue: 1200 genes used
See5 decision tree, Colon Tissue: 1200 genes used. Node tests and gene annotations, in the order shown: Gene3 <= 1.76, Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds; Gene223 <= 1.93, G1/S-SPECIFIC CYCLIN D2 (Homo sapiens), CREB-BINDING PROTEIN (Mus musculus); Gene2 <= 3.4. Error rate = 12/62.

35 Colon Tissue: 1200 genes used
DMT decision tree, Colon Tissue: 1200 genes used. Node tests and gene annotations, in the order shown: Gene2 <= 3.23, G1/S-SPECIFIC CYCLIN D2 (Homo sapiens); Gene1 <= 2.58; Gene210 <= 2.19, MYOSIN HEAVY CHAIN, NONMUSCLE (Gallus gallus), Human mRNA for ORF, complete cds. Error rate = 13/62.

36 Colon Tissue: 1200 genes used
DMT decision tree, Colon Tissue: 1200 genes used. Node tests and gene annotations, in the order shown: Gene2 <= 3.23, G1/S-SPECIFIC CYCLIN D2 (Homo sapiens); Gene1 <= 2.58; Gene83 <= 1.625, ACTIN, AORTIC SMOOTH MUSCLE (HUMAN), MYOSIN HEAVY CHAIN, NONMUSCLE (Gallus gallus). Error rate = 13/62.

37 Colon Tissue: 1200 genes used
DMT decision tree, Colon Tissue: 1200 genes used. Node tests and gene annotations, in the order shown: Gene2 <= 3.23, G1/S-SPECIFIC CYCLIN D2 (Homo sapiens); Gene1 <= 2.58; Gene81 <= 1.405, MYOSIN HEAVY CHAIN, NONMUSCLE (Gallus gallus), RIBOSE-PHOSPHATE PYROPHOSPHOKINASE I (HUMAN). Error rate = 13/62.

38 Colon Tissue: 1200 genes used
DMT decision tree, Colon Tissue: 1200 genes used. Node tests and gene annotations, in the order shown: Gene3 <= 1.78, Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds; Gene2 <= 3.415, G1/S-SPECIFIC CYCLIN D2 (Homo sapiens); Gene81 <= 1.405, RIBOSE-PHOSPHATE PYROPHOSPHOKINASE I (HUMAN). Error rate = 12/62.

