Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains
Jinyan Li, Limsoon Wong
Copyright © 2004 by Jinyan Li and Limsoon Wong

Decision Tree on a Prostate Data Set
Singh et al., Cancer Cell 1:203-209, 2002
102 instances: 52 tumor samples and 50 normal samples
~12,500 numeric features
– Each one represents a gene (or probe)
– Its value is the expression level of that gene

Rule Translation
The tree can be translated into 5 rules
Two of them are significant rules; the remaining three are trivial
The two significant rules dominate the two classes: the normal class and the tumor class
[Figure: decision tree with internal nodes 32598_at, 40707_at, 33886_at, 34950_at and leaves Tumor and Normal]

Significance of the Rules
Two significant rules
– If x <= 29 and y <= 10 and z <= 5, then this is a tumor cell (94%), where x, y, z represent 32598_at, 33886_at, 34950_at respectively
– If x > 29 and 40707_at > -6, then this is a normal cell (82%)
Three trivial rules: 12%, 6%, 6%
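
The two significant rules can be written as a small function. This is a minimal sketch, not the authors' code: the function name, the dictionary input format, and the sample values in the usage lines are hypothetical, and the three trivial rules are not given on the slide, so any sample they would cover is returned as `None`.

```python
def classify_prostate_sample(expr):
    """Apply the two significant rules to one sample.

    `expr` maps probe IDs to expression values (hypothetical input format).
    Returns "tumor", "normal", or None when neither significant rule fires;
    the three trivial rules are not spelled out in the slides.
    """
    x = expr["32598_at"]
    if x <= 29:
        # Rule 1: x <= 29 and y <= 10 and z <= 5 -> tumor
        if expr["33886_at"] <= 10 and expr["34950_at"] <= 5:
            return "tumor"
    elif expr["40707_at"] > -6:
        # Rule 2: x > 29 and 40707_at > -6 -> normal
        return "normal"
    return None


# Illustrative values only; they are not taken from the data set.
tumor_like = {"32598_at": 20, "33886_at": 8, "34950_at": 3}
normal_like = {"32598_at": 40, "40707_at": 0}
print(classify_prostate_sample(tumor_like))
print(classify_prostate_sample(normal_like))
```

Each branch reads only the probes its rule mentions, so a sample does not need values for all four probes to be classified.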

Another Gene Expression Data Set
Yeoh et al., Cancer Cell 1:133-143, 2002
Differentiating the MLL subtype from other subtypes of childhood leukemia
Training data
– 14 MLL vs 201 others
Test data
– 6 MLL vs 106 others
Number of features
– 12558

Translating the Tree into a Mathematical Function
Given a test sample, at most 3 of the 4 genes’ expression values are needed to make a decision!

Four Points to Demonstrate
Whether top-ranked features have similar gain ratios
Whether cascading trees have similar training performance
Whether the trees have similar structure
Whether the expanding tree committees can reduce the test errors gradually

Gain Ratios of Top 20 Features
Gain ratios are: 0.39, 0.36, 0.35, 0.33, 0.33, 0.33, 0.33, 0.32, 0.31, 0.30; 0.30, 0.30, 0.30, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, 0.28
The difference between the 1st and the 20th is only 0.11
In fact, the two features’ partitionings differ in only a few samples
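
For readers unfamiliar with the measure, the sketch below computes a gain ratio for one numeric feature split at a threshold, using the basic definition (information gain divided by split information). The slides do not say how their values were computed, so the function names, the single fixed threshold, and the toy data are assumptions made for illustration.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` vs `value > threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    info_gain = entropy(labels) - remainder
    # Split information is the entropy of the partition sizes themselves.
    split_info = entropy(["left"] * len(left) + ["right"] * len(right))
    return info_gain / split_info


# Toy data (hypothetical): the threshold separates the two classes perfectly.
print(gain_ratio([1.0, 2.0, 3.0, 4.0], ["MLL", "MLL", "other", "other"], 2.0))
```

Two features whose partitions differ in only a few samples produce nearly the same left and right label lists, which is why their gain ratios end up close together.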

Two Observations
The first tree does not always have the best performance
Alternative trees rooted by other top-ranked features may have better performance than the first tree
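
One way to picture an expanding committee is to count test errors as more trees are allowed to vote. The sketch below assumes a plain majority vote over the first k trees; the slides do not state how the committee members are combined, so the voting scheme, the function names, the probe names `g1` to `g3`, and the toy samples are all hypothetical.

```python
from collections import Counter


def committee_predict(trees, sample, k):
    """Majority vote of the first k trees; each tree is a callable sample -> label."""
    votes = Counter(tree(sample) for tree in trees[:k])
    return votes.most_common(1)[0][0]


def committee_errors(trees, samples, labels, k):
    """Number of samples misclassified by the committee of the first k trees."""
    return sum(committee_predict(trees, s, k) != y for s, y in zip(samples, labels))


# Stand-ins for trees rooted at the 1st, 2nd and 3rd ranked features (toy, hypothetical).
trees = [
    lambda s: "MLL" if s["g1"] > 0 else "other",
    lambda s: "MLL" if s["g2"] > 0 else "other",
    lambda s: "MLL" if s["g3"] > 0 else "other",
]
samples = [
    {"g1": 1, "g2": 1, "g3": 1},
    {"g1": 1, "g2": -1, "g3": -1},
    {"g1": -1, "g2": -1, "g3": -1},
]
labels = ["MLL", "other", "other"]

for k in (1, 3):
    print(k, committee_errors(trees, samples, labels, k))
```

In this toy case the first tree alone misclassifies the second sample, and the two alternative trees outvote it once the committee grows to three.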

Compared to Bagging & Boosting
Bagging made a similar number of mistakes: 2 mistakes
However, Boosting made 13 mistakes
