1
Skewing: An Efficient Alternative to Lookahead for Decision Tree Induction
David Page and Soumya Ray
Department of Biostatistics and Medical Informatics / Department of Computer Sciences
University of Wisconsin, Madison, USA
2
Main Contribution
Greedy tree learning algorithms suffer from myopia.
This is remedied by lookahead, which is computationally very expensive.
We present an approach to efficiently address the myopia of tree learners.
3
Task Setting
Given: m examples over n Boolean attributes each, labeled according to a function f defined over some subset of the n attributes.
Do: Learn the Boolean function f.
4
TDIDT Algorithm
Top-Down Induction of Decision Trees.
A greedy algorithm: at each node it chooses the feature that locally optimizes some measure of "purity" of the class labels, e.g. Information Gain or the Gini Index (sketched below).
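To make the split criterion concrete, here is a minimal, self-contained sketch (illustrative only, not the authors' code) of the two purity measures named above and of information gain for a Boolean split:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Entropy reduction from splitting on one Boolean attribute.

    examples: list of dicts mapping attribute name -> 0/1
    labels:   class labels, parallel to examples
    """
    gain = entropy(labels)
    n = len(labels)
    for value in (0, 1):
        part = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        if part:
            gain -= (len(part) / n) * entropy(part)
    return gain
```

TDIDT greedily splits on the attribute that maximizes information gain (or, equivalently in spirit, minimizes weighted Gini impurity) at each node.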
5
TDIDT Example
x1  x2  x3 | Label
 0   1   1 |   +
 0   1   0 |   −
 1   0   1 |   −
 0   0   0 |   +
6
TDIDT Example
Split on x1:
  x1 = 1: (1−), leaf labeled −
  x1 = 0: (2+, 1−), split on x2:
    x2 = 0: (1+)
    x2 = 1: (1+, 1−)
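For completeness, a greedy recursive induction sketch in the same spirit (again illustrative, not the authors' implementation; it assumes the information_gain helper from the previous sketch). On the four examples above it first splits on x1 and then, in the x1 = 0 branch, on x2, as in the tree above:

```python
def build_tree(examples, labels, attributes):
    """Greedy TDIDT sketch: recursively split on the highest-gain attribute."""
    if len(set(labels)) == 1:           # pure node -> leaf
        return labels[0]
    if not attributes:                  # nothing left to split on -> majority leaf
        return max(set(labels), key=labels.count)
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {"split": best, "children": {}}
    for v in (0, 1):
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        if idx:
            tree["children"][v] = build_tree(
                [examples[i] for i in idx],
                [labels[i] for i in idx],
                [a for a in attributes if a != best])
    return tree

# The four examples from the previous slide (x1, x2, x3 -> label).
examples = [{"x1": 0, "x2": 1, "x3": 1}, {"x1": 0, "x2": 1, "x3": 0},
            {"x1": 1, "x2": 0, "x3": 1}, {"x1": 0, "x2": 0, "x3": 0}]
labels = ["+", "-", "-", "+"]
print(build_tree(examples, labels, ["x1", "x2", "x3"]))
```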
7
Outline
Introduction to the TDIDT algorithm
Myopia and "Hard" Functions
Skewing
Experiments with the Skewing Algorithm
Sequential Skewing
Experiments with Sequential Skewing
Conclusions and Future Work
8
Myopia and Correlation Immunity
For certain Boolean functions, no variable has "gain" according to standard purity measures (e.g., entropy, Gini): no single variable is correlated with the class.
In cryptography, such functions are called correlation immune.
Given such a target function, every variable looks equally good (or equally bad).
In an application, the learner will therefore be unable to differentiate between relevant and irrelevant variables.
9
A Correlation Immune Function: f = x1 ⊕ x2 (exclusive-or)
x1  x2 | f
 0   0 | 0
 0   1 | 1
 1   0 | 1
 1   1 | 0
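A quick, self-contained check (illustrative) that neither variable alone carries any information gain on this truth table:

```python
from math import log2

def H(labels):
    """Shannon entropy of a label list."""
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

# Truth table of f = x1 XOR x2.
rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
labels = [f for _, _, f in rows]

for i, name in enumerate(("x1", "x2")):
    gain = H(labels)
    for v in (0, 1):
        part = [row[2] for row in rows if row[i] == v]
        gain -= (len(part) / len(rows)) * H(part)
    print(name, gain)  # both print 0.0: neither variable alone is informative
```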
10
Examples
In Drosophila, Survival is an exclusive-or function of Gender and the expression of the SxL gene.
In drug binding (ligand-domain), binding may have an exclusive-or subfunction of Ligand Charge and Domain Charge.
11
Learning Hard Functions
Standard method of learning hard functions with TDIDT: depth-k lookahead, which takes O(m · n^(2^(k+1) − 1)) time for m examples over n variables.
Can we devise a technique that allows TDIDT algorithms to efficiently learn hard functions?
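As a rough illustration of that cost (my arithmetic, not a number from the talk): even depth-2 lookahead (k = 2) over n = 30 Boolean variables scales as m · n^(2^3 − 1) = m · 30^7, on the order of 10^10 · m gain evaluations, versus roughly m · n for a single greedy split choice.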
12
Key Idea
Correlation immune functions aren't hard if the data distribution is significantly different from uniform.
13
Example
The uniform distribution can be sampled by setting each variable (feature) independently of all others, with probability 0.5 of being set to 1.
Consider instead a distribution where each variable has probability 0.75 of being set to 1.
14
Example
[Table: examples over x1, x2, x3 labeled by f = x1 ⊕ x2; x3 is irrelevant to f.]
15
[Table: the same examples with Weight and Sum columns added under the skewed distribution; the numeric values are not preserved here.]
16
Example
[Table: weighted sums for the examples under the skewed distribution.]
17
Example
[Table: one partition of the examples with its weighted sums.]
18
Example
[Table: the other partition of the examples with its weighted sums.]
19
Example
[Table: the weight assigned to each example under the skewed distribution.]
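The numeric weights and sums in the preceding tables did not survive extraction, so here is a small reconstruction (my illustration under the 0.75 skew described earlier, not the original slides' numbers) of the effect they demonstrate: under uniform weights no variable has gain on f = x1 ⊕ x2, but under the skewed weighting the relevant variables x1 and x2 show gain while the irrelevant x3 does not.

```python
from math import log2

def w_entropy(pairs):
    """Entropy of (weight, label) pairs."""
    total = sum(w for w, _ in pairs)
    ent = 0.0
    for label in set(lab for _, lab in pairs):
        p = sum(w for w, lab in pairs if lab == label) / total
        if p > 0:
            ent -= p * log2(p)
    return ent

def w_gain(examples, labels, weights, attr):
    """Weighted information gain of splitting on a Boolean attribute."""
    total_w = sum(weights)
    gain = w_entropy(list(zip(weights, labels)))
    for v in (0, 1):
        part = [(w, lab) for w, lab, ex in zip(weights, labels, examples)
                if ex[attr] == v]
        if part:
            gain -= (sum(w for w, _ in part) / total_w) * w_entropy(part)
    return gain

# f = x1 XOR x2; x3 is irrelevant.
examples = [{"x1": a, "x2": b, "x3": c}
            for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [ex["x1"] ^ ex["x2"] for ex in examples]

uniform = [1.0] * len(examples)
# Skew: favor the value 1 for every variable with probability 0.75.
skewed = []
for ex in examples:
    w = 1.0
    for v in ("x1", "x2", "x3"):
        w *= 0.75 if ex[v] == 1 else 0.25
    skewed.append(w)

for attr in ("x1", "x2", "x3"):
    # x1 and x2 show positive gain under the skew; x3 stays at 0.
    print(attr,
          round(w_gain(examples, labels, uniform, attr), 3),
          round(w_gain(examples, labels, skewed, attr), 3))
```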
20
Key Idea
Given a large enough sample and a second distribution sufficiently different from the first, we can learn functions that are hard for TDIDT algorithms under the original distribution.
21
Issues to Address
How can we get a "sufficiently different" distribution? Our approach: "skew" the given sample by choosing "favored settings" for the variables.
What about the effects of a not-large-enough sample? Our approach: average the "goodness" of each variable over multiple skews.
22
Skewing Algorithm
For T trials do:
  Choose a favored setting for each variable.
  Reweight the sample accordingly.
  Calculate the entropy of each variable's split under this weighting.
  For each variable that has sufficient gain, increment a counter.
Split on the variable with the highest count (see the sketch below).
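A minimal sketch of this loop (illustrative: the reweighting scheme, trial count, and gain threshold are assumptions rather than the paper's exact settings; it reuses the w_gain helper from the previous sketch):

```python
import random

def skewed_split(examples, labels, attributes, trials=30,
                 skew=0.75, gain_threshold=0.05):
    """Pick a split variable by counting, over several random skews,
    how often each attribute shows sufficient weighted gain."""
    counts = {a: 0 for a in attributes}
    for _ in range(trials):
        # Choose a favored setting (0 or 1) for every variable.
        favored = {a: random.randint(0, 1) for a in attributes}
        # Reweight: examples matching more favored settings get more weight.
        weights = []
        for ex in examples:
            w = 1.0
            for a in attributes:
                w *= skew if ex[a] == favored[a] else (1.0 - skew)
            weights.append(w)
        # Credit every attribute whose weighted gain clears the threshold.
        for a in attributes:
            if w_gain(examples, labels, weights, a) > gain_threshold:
                counts[a] += 1
    # Split on the attribute that had gain under the most skews.
    return max(counts, key=counts.get)
```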
23
Experiments
ID3 vs. ID3 with Skewing (ID3 chosen to avoid issues with parameters, pruning, etc.).
Synthetic propositional data: examples of 30 Boolean variables; target Boolean functions of 2-6 of these variables; randomly chosen targets and randomly chosen hard targets.
UCI datasets (Perlich et al., JMLR 2003).
10-fold cross validation.
Evaluation metric: Weighted Accuracy = average of accuracy over positives and accuracy over negatives (see the snippet below).
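For reference, a small sketch of the evaluation metric as defined above (assuming 0/1 labels; not the authors' evaluation code):

```python
def weighted_accuracy(y_true, y_pred):
    """Average of accuracy on positive examples and accuracy on negative examples."""
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    acc_pos = sum(t == p for t, p in pos) / len(pos)
    acc_neg = sum(t == p for t, p in neg) / len(neg)
    return (acc_pos + acc_neg) / 2.0

# weighted_accuracy([1, 1, 0, 0], [1, 0, 0, 0]) == 0.75
```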
24
Results (3-variable Boolean functions)
[Plots: random functions; hard functions]
25
Results (4-variable Boolean functions)
[Plots: random functions; hard functions]
26
Results (5-variable Boolean functions)
[Plots: random functions; hard functions]
27
Results (6-variable Boolean functions)
[Plots: random functions; hard functions]
28
Current Shortcomings
Sensitive to noise and to high-dimensional data.
Very small signal on the hardest correlation immune functions (parity) given more than 3 relevant variables.
Only very small gains on the real-world datasets attempted so far. Few correlation immune functions in practice? Noise, dimensionality, not enough examples?