Published by Lisa Shannon Dennis. Modified over 9 years ago.
1
Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains
Jinyan Li, Limsoon Wong
Copyright © 2004 by Jinyan Li and Limsoon Wong
2
Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains
Part 4: Interesting Rules and Patterns
3
Outline
Some interesting decision trees
Performance of CS4
Demo
4
Some Interesting Decision Trees
5
Decision Tree on a Prostate Data Set
Singh et al., Cancer Cell 1:203-209, 2002
102 instances
–52 tumor samples
–50 normal samples
~12,500 numeric features
–Each one represents a gene (or probe)
–Its value is the expression level of that gene
6
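C4.5 Tree
[Figure: C4.5 decision tree on the prostate data set. Internal nodes and split thresholds: 32598_at (<= 29 / > 29), 33886_at (<= 10 / > 10), 34950_at (<= 5 / > 5), 40707_at (<= -6 / > -6). Leaves are labeled Tumor or Normal; two leaves carry the annotations 3(+1) and 6.]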
7
Rule Translation
The tree can be translated into 5 rules
Two of them are significant rules; the remaining three are trivial
The two significant rules dominate the two classes: the normal class and the tumor class
[Figure: the C4.5 tree with nodes 32598_at, 40707_at, 33886_at, 34950_at and leaves Tumor / Normal]
8
Significance of the Rules
Two significant rules
–If x <= 29 and y <= 10 and z <= 5, then this is a tumor cell (94%), where x, y, z represent 32598_at, 33886_at, 34950_at respectively
–If x > 29 and 40707_at > -6, then this is a normal cell (82%)
Three trivial rules: 12%, 6%, 6%
[Figure: the C4.5 tree with nodes 32598_at, 40707_at, 33886_at, 34950_at and leaves Tumor / Normal]
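The two significant rules can be written out as a small function. This is only a sketch: the function name and the dictionary input are assumptions made for illustration, and the three trivial rules are not spelled out on the slide, so any sample they would cover is returned as `None` rather than guessed.

```python
from typing import Dict, Optional


def classify_prostate_sample(expr: Dict[str, float]) -> Optional[str]:
    """Apply the two significant rules read off the C4.5 tree.

    `expr` maps probe IDs to expression values (hypothetical input format).
    Returns "tumor", "normal", or None when neither significant rule fires.
    """
    x = expr["32598_at"]
    if x <= 29:
        # Rule 1: x <= 29 and y <= 10 and z <= 5  ->  tumor
        if expr["33886_at"] <= 10 and expr["34950_at"] <= 5:
            return "tumor"
    elif expr["40707_at"] > -6:
        # Rule 2: x > 29 and 40707_at > -6  ->  normal
        return "normal"
    # Covered by one of the three trivial rules, which the slide does not list.
    return None
```

Note that each branch reads only the probes on its own path, so a sample with x > 29 is decided from two expression values alone.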
9
Another Gene Expression Data Set
Yeoh et al., Cancer Cell 1:133-143, 2002
Differentiating MLL subtype from other subtypes of childhood leukemia
Training data
–14 MLL vs 201 others
Test data
–6 MLL vs 106 others
Number of features
–12558
10
The Decision Tree
4 mistakes on test data
11
Translating the Tree into a Mathematical Function
Given a test sample, at most 3 of the 4 genes’ expression values are needed to make a decision!
12
Performance of CS4
13
Four Points to Demonstrate
Whether top-ranked features have similar gain ratios
Whether cascading trees have similar training performance
Whether the trees have similar structure
Whether the expanding tree committees can gradually reduce the test errors
14
An Example
For differentiation between the subtype Hyperdip>50 and some other subtypes of childhood leukemia
15
Gain Ratios of Top 20 Features
Gain ratios are: 0.39, 0.36, 0.35, 0.33, 0.33, 0.33, 0.33, 0.32, 0.31, 0.30, 0.30, 0.30, 0.30, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, 0.28.
The difference between the 1st and the 20th is only 0.11.
In fact, the partitionings induced by the two features differ in only a few samples
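For reference, the gain ratio of a single threshold split can be computed as below. This is a sketch of the textbook definition (information gain divided by split information), not necessarily the exact procedure used to produce the numbers on the slide; the function names are assumptions, and C4.5 itself applies further refinements that are not shown. The last lines simply recheck the 0.11 spread from the listed values.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` vs `value > threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    info_gain = (entropy(labels)
                 - (len(left) / n) * entropy(left)
                 - (len(right) / n) * entropy(right))
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return info_gain / split_info


# Gain ratios of the top 20 features, as listed on the slide.
top20 = [0.39, 0.36, 0.35, 0.33, 0.33, 0.33, 0.33, 0.32, 0.31, 0.30,
         0.30, 0.30, 0.30, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, 0.28]
spread = round(top20[0] - top20[-1], 2)  # difference between 1st and 20th
```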
16
Training and Test Performance
17
Two Observations
The first tree does not always have the best performance
Alternative trees rooted at other top-ranked features may perform better than the first tree
18
The Power of Committee
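The effect of an expanding committee can be sketched as follows: given the test predictions of k trees (ordered by the rank of their root feature), count the test errors made by the committee of the first 1, 2, ..., k trees. The slides do not state how CS4 combines its trees, so this sketch substitutes a plain majority vote as a stand-in; the function name and the input layout are assumptions.

```python
from collections import Counter


def committee_errors(tree_predictions, true_labels):
    """Test errors of committees built from the first 1..k trees.

    tree_predictions[i][j] is the label tree i predicts for test sample j.
    Combination is a plain majority vote (an assumption, not CS4's own scheme);
    ties resolve to the label that appears first in the vote order.
    """
    errors = []
    for m in range(1, len(tree_predictions) + 1):
        mistakes = 0
        for j, truth in enumerate(true_labels):
            votes = Counter(tree[j] for tree in tree_predictions[:m])
            predicted = votes.most_common(1)[0][0]
            if predicted != truth:
                mistakes += 1
        errors.append(mistakes)
    return errors
```

Called with one prediction list per tree, the function returns a list whose m-th entry is the error count of the m-tree committee, which is the kind of curve this slide plots.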
19
Compared to Bagging & Boosting
Bagging made a similar number of mistakes: 2 mistakes
However, Boosting made 13 mistakes
20
Demo