1 Recent developments in tree induction for KDD « Towards soft tree induction ». Louis WEHENKEL, University of Liège – Belgium, Department of Electrical and Computer Engineering

2 A. Supervised learning (notation)
- x = (x_1, …, x_m): vector of input variables (numerical and/or symbolic)
- y: single output variable
  - Symbolic y: classification problem
  - Numeric y: regression problem
- LS = ((x_1, y_1), …, (x_N, y_N)): sample of I/O pairs
- Learning (or modeling) algorithm
  - Mapping from the sample space to a hypothesis space H
  - Say y = f(x) + e, where e is the modeling error
  - « Guess » f_LS in H so as to minimize e

3 Statistical viewpoint
- x and y are random variables distributed according to p(x,y)
- LS is distributed according to p^N(x,y)
- f_LS is a random function (selected in H)
- e(x) = y - f_LS(x) is also a random variable
- Given a 'metric' to measure the error, we can define the best possible model (Bayes model)
  - Regression: f_B(x) = E(y|x)
  - Classification: f_B(x) = argmax_y P(y|x)

4 B. Crisp decision trees (what are they?)
[Tree diagram] Root test X_1 < 0.6 (Yes / No): one branch is the leaf « Y is big »; the other leads to the test X_2 < 1.5, whose branches are the leaves « Y is small » and « Y is very big ».
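Read purely as code, such a crisp tree is nothing more than nested if/else statements. A minimal Python sketch follows; the branch orientation (which side of each test is « Yes ») is assumed from the figure and may differ from the original slide:

```python
def crisp_tree_predict(x1, x2):
    """Crisp decision tree of the slide written as nested if/else.

    Branch orientation is an assumption taken from the figure;
    swap the branches if the original tree differs.
    """
    if x1 < 0.6:           # root test
        return "Y is big"
    else:
        if x2 < 1.5:       # second test, reached only when x1 >= 0.6
            return "Y is small"
        else:
            return "Y is very big"

print(crisp_tree_predict(0.3, 2.0))   # -> "Y is big"
print(crisp_tree_predict(0.8, 1.0))   # -> "Y is small"
```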

5 B. Crisp decision trees (what are they?)
[Figure: partition of the input space induced by the tree, with axis-parallel splits at X_1 = 0.6 and X_2 = 1.5]

6 Tree induction (overview)
- Growing the tree (uses GS, a part of LS)
  - Top down (until all nodes are closed)
  - At each step:
    - Select the open node to split (best first, greedy approach)
    - Find the best input variable and the best question
    - If the node can be purified, split it; otherwise close it
- Pruning the tree (uses PS, the rest of LS)
  - Bottom up (until all nodes are contracted)
  - At each step:
    - Select the test node to contract (worst first, greedy)
    - Contract and evaluate
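As a rough illustration of the growing loop above, here is a hedged Python sketch of greedy top-down induction of a regression tree. It is a simplification under stated assumptions: nodes are expanded depth-first rather than best-first, the score is residual-variance reduction, the stopping rule is a minimum node size, and all helper names are illustrative rather than the author's algorithm:

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the (variable, threshold) pair minimizing
    the total residual variance of y in the two sub-nodes."""
    best = None
    for j in range(X.shape[1]):
        vals = np.unique(X[:, j])
        for t in (vals[:-1] + vals[1:]) / 2.0:      # midpoints between observed values
            left = X[:, j] < t
            score = left.sum() * y[left].var() + (~left).sum() * y[~left].var()
            if best is None or score < best[0]:
                best = (score, j, float(t))
    return best                                     # (score, variable, threshold) or None

def grow(X, y, min_samples=5):
    """Greedy top-down growing; a node is closed (made a leaf) when it
    cannot be purified further or contains too few objects."""
    split = None if len(y) < min_samples or y.var() == 0.0 else best_split(X, y)
    if split is None:
        return {"leaf": True, "value": float(y.mean())}
    _, j, t = split
    left = X[:, j] < t
    return {"leaf": False, "var": j, "thr": t,
            "yes": grow(X[left], y[left], min_samples),
            "no":  grow(X[~left], y[~left], min_samples)}

def predict(node, x):
    while not node["leaf"]:
        node = node["yes"] if x[node["var"]] < node["thr"] else node["no"]
    return node["value"]

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 2)); y = np.where(X[:, 0] < 0.6, 1.0, 5.0)
tree = grow(X, y); print(predict(tree, np.array([0.3, 0.9])))   # -> 1.0
```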

7 Tree growing
- Demo: Titanic database
- Comments:
  - Tree growing is a local process
  - Very efficient
  - Can select relevant input variables
  - Cannot determine the appropriate tree shape
  - (Just like real trees…)

8 Tree pruning
- Strategy: to determine the appropriate tree shape, let the tree grow too big (along all branches), then reshape it by pruning away irrelevant parts
- Tree pruning uses a global criterion to determine the appropriate shape
- Tree pruning is even faster than growing
- Tree pruning avoids overfitting the data

9 Growing – pruning (graphically)
[Figure: error on GS and PS versus tree complexity; growing moves from underfitting towards overfitting, pruning moves back to the final tree]

10 C. Soft trees (what are they?)
- Generalization of crisp trees using continuous splits and aggregation of terminal node predictions
- [Figure: a soft split, with the propagation probability rising continuously from 0 to 1 across the transition region]

11 Soft trees (discussion)
- Each split is defined by two parameters: the position and the width of the transition region
- They generalize decision/regression trees into a model that is continuous and differentiable w.r.t. its parameters
  - Test nodes: two parameters each (position, width)
  - Terminal nodes: one label each
- Other names (of similar models):
  - Fuzzy trees, continuous trees
  - Tree-structured (neural, Bayesian) networks
  - Hierarchical models
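One common way to realize a split with a position and a width parameter is a sigmoid transition; this is an assumption for illustration, not necessarily the exact discriminator used in the presentation. The sketch below shows how a crisp threshold becomes a continuous membership in [0, 1]:

```python
import numpy as np

def soft_split(x, position, width):
    """Membership of the 'right' branch as a sigmoid of x.

    position : centre of the transition region (the crisp threshold)
    width    : size of the transition region; width -> 0 recovers the
               crisp 0/1 test.
    """
    z = np.clip((x - position) / max(width, 1e-12), -60.0, 60.0)
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(0.0, 1.2, 7)
print(soft_split(x, position=0.6, width=0.05))   # smooth transition around 0.6
print(soft_split(x, position=0.6, width=1e-9))   # essentially a crisp test
```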

12 Soft trees (motivations)
- Improve performance (w.r.t. crisp trees):
  - Use of a larger hypothesis space
  - Reduced variance and bias
  - Improved optimization (à la backprop)
- Improve interpretability:
  - More « honest » model
  - Reduced parameter variance
  - Reduced complexity

13 D. Plan of the presentation
- Bias/variance tradeoff (in tree induction)
- Main techniques to reduce variance
- Why soft trees have lower variance
- Techniques for learning soft trees

14 Concept of variance
- The learning sample is random
- The learned model is a function of the sample
- Hence the model is also random: it has variance
  - Model predictions have variance
  - Model structure / parameters have variance
- Variance reduces accuracy and interpretability
- Variance can be reduced by various 'averaging or smoothing' techniques

15 Theoretical explanation
- Bias, variance and residual error:
  - Residual error: difference between the output variable and the best possible model (i.e. the error of the Bayes model)
  - Bias: difference between the best possible model and the average model produced by the algorithm
  - Variance: average variability of the model around the average model
- Expected error² = res² + bias² + var
- NB: these notions depend on the 'metric' used for measuring the error

16 Regression (locally, at point x)
- Find y' = f(x) such that E_{y|x}{err(y, y')} is minimum, where err is an error measure
- Usually err = squared error = (y - y')²
- f(x) = E_{y|x}{y} minimizes the error at every point x
- The Bayes model is the conditional expectation
- [Figure: the conditional density p(y|x)]

17 Learning algorithm (1)
- Usually p(y|x) is unknown, so we use LS = ((x_1, y_1), …, (x_N, y_N)) and a learning algorithm to choose a hypothesis in H: ŷ_LS(x) = f(LS, x)
- At each input point x, the prediction ŷ_LS(x) is a random variable
- The distribution of ŷ_LS(x) depends on the sample size N and on the learning algorithm used

18 Learning algorithm (2)
- Since LS is randomly drawn, the estimation ŷ(x) is a random variable
- [Figure: the sampling distribution p_LS(ŷ(x)) of the prediction]

19 Good learning algorithm
- A good learning algorithm should minimize the average (generalization) error over all learning sets
- In regression, the usual error is the mean squared error, so we want to minimize (at each point x): Err(x) = E_LS{ E_{y|x}{ (y - ŷ_LS(x))² } }
- There exists a useful additive decomposition of this error into three (positive) terms

20 Bias/variance decomposition (1)
- Err(x) = E_{y|x}{ (y - E_{y|x}{y})² } + …
- E_{y|x}{y} = argmin_{y'} E_{y|x}{ (y - y')² } = Bayes model
- var_{y|x}{y} = residual error = minimal error
- [Figure: p(y|x) with its mean E_{y|x}{y} and spread var_{y|x}{y}]

21 Bias/variance decomposition (2)
- Err(x) = var_{y|x}{y} + (E_{y|x}{y} - E_LS{ŷ(x)})² + …
- E_LS{ŷ(x)} = average model (w.r.t. LS)
- bias²(x) = error between the Bayes model and the average model
- [Figure: distribution of ŷ, showing E_{y|x}{y}, E_LS{ŷ(x)} and bias²(x)]

22 Bias/variance decomposition (3)
- Err(x) = var_{y|x}{y} + bias²(x) + E_LS{ (ŷ(x) - E_LS{ŷ(x)})² }
- var_LS{ŷ(x)} = variance
- [Figure: distribution of ŷ around E_LS{ŷ}, with spread var_LS{ŷ}]

23 Bias/variance decomposition (4)
- Local error decomposition: Err(x) = var_{y|x}{y} + bias²(x) + var_LS{ŷ(x)}
- Global error decomposition (take the average w.r.t. p(x)): E_x{Err(x)} = E_x{var_{y|x}{y}} + E_x{bias²(x)} + E_x{var_LS{ŷ(x)}}
- [Figure: the three terms shown on the distributions of y and ŷ]

24 Illustration (1)
- Problem definition:
  - One input x, a uniform random variable in [0,1]
  - y = h(x) + ε, where ε ~ N(0,1)
- [Figure: the true regression function h(x) = E_{y|x}{y} as a function of x]

25 Illustration (2)
- A small variance, high bias method
- [Figure: learned models and their average E_LS{ŷ(x)}]

26 Illustration (3)
- A small bias, high variance method
- [Figure: learned models and their average E_LS{ŷ(x)}]

27 Illustration (methods comparison)
- Artificial problem with 10 inputs, all uniform random variables in [0,1]
- The true function depends only on 5 inputs: y(x) = 10·sin(π·x_1·x_2) + 20·(x_3 - 0.5)² + 10·x_4 + 5·x_5 + ε, where ε is a N(0,1) random variable
- Experimentation:
  - E_LS: average over 50 learning sets of size 500
  - E_{x,y}: average over 2000 cases
  - Estimate variance and bias (+ residual error)
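This protocol can be reproduced almost literally. The sketch below is an assumption-laden stand-in: it uses scikit-learn's DecisionTreeRegressor as the learner (not the implementation used in the presentation), so the numbers will not match the tables that follow exactly, but the Err², bias²+noise and variance estimates are computed in the same way:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor   # assumption: stand-in learner

rng = np.random.default_rng(0)

def friedman(n):
    """Artificial problem of the slide: 10 uniform inputs, 5 of them relevant."""
    X = rng.uniform(0.0, 1.0, size=(n, 10))
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3] + 5.0 * X[:, 4] + rng.normal(0.0, 1.0, size=n))
    return X, y

X_test, y_test = friedman(2000)                   # E_{x,y}: 2000 test cases
preds = np.empty((50, len(y_test)))
for i in range(50):                               # E_LS: 50 learning sets of size 500
    X_ls, y_ls = friedman(500)
    preds[i] = DecisionTreeRegressor().fit(X_ls, y_ls).predict(X_test)

err = ((y_test - preds) ** 2).mean()              # Err²
variance = preds.var(axis=0).mean()               # var_LS
print(f"Err²={err:.1f}  Bias²+Noise={err - variance:.1f}  Variance={variance:.1f}")
```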

28 Illustration (linear regression)
- Very few parameters: small variance
- The goal function is not linear: high bias

Method          Err²   Bias²+Noise   Variance
Linear regr.     7.0       6.8          0.2
k-NN (k=1)      15.4       5           10.4
k-NN (k=10)      8.5       7.2          1.3
MLP (10)         2.0       1.2          0.8
MLP (10 – 10)    4.6       1.4          3.2
Regr. tree      10.2       3.5          6.7

29 Illustration (k-nearest neighbors)
- Small k: high variance and moderate bias
- High k: smaller variance but higher bias
- (Same table as slide 28)

30 Illustration (multilayer perceptrons)
- Small bias
- Variance increases with the model complexity
- (Same table as slide 28)

31 Illustration (regression trees)
- Small bias: a (complex enough) tree can approximate any non-linear function
- High variance (see later)
- (Same table as slide 28)

32 Variance reduction techniques
- In the context of a given method:
  - Adapt the learning algorithm to find the best trade-off between bias and variance
  - Not a panacea, but the least we can do
  - Examples: pruning, weight decay
- Wrapper techniques:
  - Change the bias/variance trade-off
  - Universal, but destroys some features of the initial method
  - Example: bagging

33 Variance reduction: 1 model (1)
- General idea: reduce the ability of the learning algorithm to overfit the LS
  - Pruning: reduces the model complexity explicitly
  - Early stopping: reduces the amount of search
  - Regularization: reduces the size of the hypothesis space

34 Variance reduction: 1 model (2)
- Bias² ≈ error on the learning set; E ≈ error on an independent test set
- Selection of the optimal level of tuning:
  - a priori (not optimal)
  - by cross-validation (less efficient)
- [Figure: bias², var and E = bias² + var as functions of the degree of fitting, with the optimal fitting at the minimum of E]

35 Variance reduction: 1 model (3)
- As expected, this reduces variance and increases bias
- Examples:
  - Post-pruning of regression trees
  - Early stopping of MLPs by cross-validation

Method                   E     Bias   Variance
Full regr. tree (488)   10.2    3.5      6.7
Pruned regr. tree (93)   9.1    4.3      4.8
Fully learned MLP        4.6    1.4      3.2
Early stopped MLP        3.8    1.5      2.3

36 Variance reduction: bagging (1)
- Idea: the average model E_LS{ŷ(x)} has the same bias as the original method but zero variance
- Bagging (Bootstrap AGGregatING):
  - To compute E_LS{ŷ(x)}, we should draw an infinite number of learning sets (of size N)
  - Since we have only one single LS, we simulate sampling from nature by bootstrap sampling from the given LS
  - Bootstrap sampling = sampling with replacement of N objects from LS (N is the size of LS)
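A hedged sketch of the bootstrap aggregating procedure just described; the base learner (scikit-learn's DecisionTreeRegressor) and the number of replicates k are arbitrary placeholders:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor   # assumption: any base learner works

def bag(X, y, k=25, seed=0):
    """Bagging: fit k models, each on a bootstrap sample of LS
    (sampling with replacement, same size N as LS)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)          # bootstrap sample LS_i
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bag_predict(models, X):
    """ŷ(x) = 1/k · (ŷ_1(x) + … + ŷ_k(x))."""
    return np.mean([m.predict(X) for m in models], axis=0)
```

Used as `yhat = bag_predict(bag(X_ls, y_ls), X_test)`, this averages k high-variance trees, which is exactly the mechanism behind the variance figures on the next slides.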

37 Variance reduction: bagging (2)
- [Diagram: bootstrap samples LS_1, LS_2, …, LS_k drawn from LS; a model is built on each, giving predictions ŷ_1(x), ŷ_2(x), …, ŷ_k(x)]
- ŷ(x) = 1/k · (ŷ_1(x) + ŷ_2(x) + … + ŷ_k(x))

38 Variance reduction: bagging (3)
- Application to regression trees
- Strong variance reduction without increasing bias (although the model is much more complex than a single tree)

Method               E     Bias   Variance
3-test regr. tree   14.8   11.1      3.7
Bagged              11.7   10.7      1.0
Full regr. tree     10.2    3.5      6.7
Bagged               5.3    3.8      1.5

39 Dual bagging (1)
- Instead of perturbing learning sets to obtain several predictions, directly perturb the test case at the prediction stage
- Given a model ŷ(.) and a test case x:
  - Form k attribute vectors by adding Gaussian noise to x: {x+ε_1, x+ε_2, …, x+ε_k}
  - Average the predictions of the model at these points to get the prediction at point x: 1/k · (ŷ(x+ε_1) + ŷ(x+ε_2) + … + ŷ(x+ε_k))
- The noise level λ (variance of the Gaussian noise) is selected by cross-validation
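A minimal sketch of this dual perturb-and-combine step; it assumes a fitted model object with a `predict` method (e.g. the trees above) and takes the noise level as a plain parameter rather than selecting it by cross-validation:

```python
import numpy as np

def dual_bagging_predict(model, X, noise_std, k=50, seed=0):
    """Average the model's predictions over k noisy copies of each test case.

    noise_std plays the role of the noise level lambda; in the presentation
    it is chosen by cross-validation, here it is simply an argument.
    """
    rng = np.random.default_rng(seed)
    preds = np.zeros(len(X))
    for _ in range(k):
        preds += model.predict(X + rng.normal(0.0, noise_std, size=X.shape))
    return preds / k
```

Note the design point of the slide: the learning stage is untouched; only the prediction stage is randomized, which is why this is much cheaper than bagging.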

40 Dual bagging (2)
- With regression trees:
  - Smooths the function ŷ(.)
  - Too much noise increases bias: there is a (new) trade-off between bias and variance

Noise level    E     Bias   Variance
0.0           10.2    3.5      6.7
0.2            6.3    3.5      2.8
0.5            5.3    4.4      0.9
2.0           13.3   13.1      0.2

41 Dual bagging (classification trees)
[Figure: decision boundaries obtained for three noise levels]
- λ = 0: error = 3.7 %
- λ = 1.5: error = 4.6 %
- λ = 0.3: error = 1.4 %

42 Variance in tree induction
- Tree induction is among the ML methods with the highest variance (together with 1-NN)
- Main reason:
  - Generalization is local
  - It depends on small parts of the learning set
- Sources of variance:
  - Discretization of numerical attributes (60 %): the selected thresholds have a high variance
  - Structure choice (10 %): sometimes attribute scores are very close
  - Estimation at leaf nodes (30 %): because of the recursive partitioning, prediction at leaf nodes is based on very small samples of objects
- Consequences: questionable interpretability and higher error rates

43 Threshold variance (1)
- Test on numerical attributes: [a(o) < a_th]
- Discretization: find the threshold a_th which optimizes the score
  - Classification: maximize information
  - Regression: minimize residual variance
- [Figure: score as a function of the threshold a_th along the attribute a(o)]
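The threshold variance that the next two slides illustrate can be made visible by re-running the threshold search on bootstrap copies of the sample. A sketch under assumptions of my own (regression score = residual variance, a synthetic step function, all names illustrative):

```python
import numpy as np

def best_threshold(a, y):
    """Threshold on attribute a(o) minimizing the residual variance of y."""
    vals = np.unique(a)
    cands = (vals[:-1] + vals[1:]) / 2.0
    scores = [(a < t).sum() * y[a < t].var() + (a >= t).sum() * y[a >= t].var()
              for t in cands]
    return cands[int(np.argmin(scores))]

rng = np.random.default_rng(0)
a = rng.uniform(0, 1, 200)
y = (a > 0.6).astype(float) + rng.normal(0, 0.3, 200)   # true step at 0.6
thresholds = []
for _ in range(100):                                    # bootstrap replicates
    idx = rng.integers(0, len(a), len(a))
    thresholds.append(best_threshold(a[idx], y[idx]))
print(np.mean(thresholds), np.std(thresholds))          # spread = threshold variance
```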

44 Threshold variance (2)

45 Threshold variance (3)

46 Tree variance
- DT/RT are among the machine learning methods which present the highest variance

Method                  E     Bias   Variance
RT, no test            25.5   25.4      0.1
RT, 1 test             19.0   17.7      1.3
RT, 3 tests            14.8   11.1      3.7
RT, full (250 tests)   10.2    3.5      6.7

47 DT variance reduction
- Pruning:
  - Necessary to select the right complexity
  - Decreases variance but increases bias: small effect on accuracy
- Threshold stabilization:
  - Smoothing of score curves, bootstrap sampling, …
  - Reduces parameter variance, but has only a slight effect on accuracy and prediction variance
- Bagging:
  - Very efficient at reducing variance
  - But jeopardizes interpretability of trees and computational efficiency
- Dual bagging:
  - In terms of variance reduction, similar to bagging
  - Much faster, and can be simulated by soft trees
- Fuzzy tree induction:
  - Builds soft trees in a full-fledged approach

48 Dual tree bagging = soft trees
- Reformulation of dual bagging as an explicit soft tree propagation algorithm
- Algorithms:
  - Forward-backward propagation in soft trees
  - Softening of thresholds during the learning stage
- Some results

49 Dual bagging = soft thresholds
- x + ε < x_th: sometimes left, sometimes right
- Multiple 'crisp' propagations can be 'replaced' by one 'soft' propagation
- E.g. if ε has a uniform pdf, the probability of right propagation rises linearly from 0 to 1 over [a_th - λ/2, a_th + λ/2]
- [Figure: the resulting propagation probability around a_th, splitting the sample into TS_left and TS_right]
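Under the uniform-noise assumption of this slide, the propagation probability has a simple closed form. A sketch, where the convention that « right » means a(o) + ε ≥ a_th is my assumption:

```python
import numpy as np

def p_right(a, a_th, lam):
    """P[a + eps >= a_th] for eps uniform on [-lam/2, +lam/2]:
    0 below a_th - lam/2, 1 above a_th + lam/2, linear in between."""
    if lam <= 0.0:
        return (a >= a_th).astype(float)            # crisp test
    return np.clip((a - (a_th - lam / 2.0)) / lam, 0.0, 1.0)

a = np.linspace(0.0, 2.0, 9)
print(p_right(a, a_th=1.0, lam=0.5))                # piecewise-linear transition
```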

50 Forward-backward algorithm
- Top-down propagation of probability:
  - P(Root|x) = 1
  - P(N_1|x) = P(Test1|x) · P(Root|x)
  - P(L_3|x) = P(¬Test1|x) · P(Root|x)
  - P(L_1|x) = P(Test2|x) · P(N_1|x)
  - P(L_2|x) = P(¬Test2|x) · P(N_1|x)
- Bottom-up aggregation of predictions
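A compact sketch of this forward-backward pass on a binary soft tree: reaching probabilities are pushed top-down exactly as on the slide, and the prediction is the probability-weighted aggregation of the leaf labels. The node representation and the inline discriminator are mine, not the author's code:

```python
def soft_tree_predict(node, x, p_reach=1.0):
    """Forward: propagate P(node|x) top-down; backward: aggregate leaf labels.

    A node is either {'leaf': True, 'value': v} or
    {'leaf': False, 'p_yes': f, 'yes': subtree, 'no': subtree}, where
    f(x) in [0, 1] is the soft discriminator of the test."""
    if node["leaf"]:
        return p_reach * node["value"]
    p_yes = node["p_yes"](x)                          # P(Test | x)
    return (soft_tree_predict(node["yes"], x, p_reach * p_yes) +
            soft_tree_predict(node["no"],  x, p_reach * (1.0 - p_yes)))

# Tiny example: one test on x[0], linear transition over [0.5, 0.7]
tree = {"leaf": False,
        "p_yes": lambda x: min(max((x[0] - 0.5) / 0.2, 0.0), 1.0),
        "yes": {"leaf": True, "value": 1.0},
        "no":  {"leaf": True, "value": 5.0}}
print(soft_tree_predict(tree, [0.55]))                # blend of the two leaf labels
```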

51 Learning of λ values
- Use of an independent 'validation' set and bisection search
- One single λ value can be learned very efficiently (amounts to 10 full tests of a DT/RT on the validation set)
- A combination of several λ values can also be learned, with the risk of overfitting
- (See fuzzy tree induction, in what follows)

52 Some results with dual bagging

53 Fuzzy tree induction
- General ideas
- Learning algorithm:
  - Growing
  - Refitting
  - Pruning
  - Backfitting

54 General ideas
- Obviously, soft trees have much lower variance than crisp trees
- In the « dual bagging » approach, attribute selection is carried out in a classical way, and tests are softened only in a post-processing stage
- It might be more effective to combine the two steps: fuzzy tree induction

55 Soft trees
- Samples are handled as fuzzy subsets
  - Each observation belongs to such a fuzzy subset with a certain membership degree
- The SCORE measure is modified
  - Objects are weighted by their membership degree
- The output y denotes the membership degree to a class
- Goal of fuzzy tree induction: provide a smooth model of y as a function of the input variables

56 Fuzzy discretization
- Same as fuzzification
- Carried out locally, at the tree growing stage
- At each test node, on the basis of the local fuzzy sub-training set:
  - Select the attribute, together with the discriminator, so as to maximize the local SCORE
  - Split in a soft way and proceed recursively
- Criteria for SCORE:
  - Minimal residual variance
  - Maximal (fuzzy) information quantity
  - Etc.

57 Attaching labels to leaves
- Basically, for each terminal node we need to determine a local estimate ŷ_i of y
- During intermediate steps:
  - Use the average of y in the local sub-learning set
  - Direct computation
- Refitting of the labels:
  - Once the full tree has been grown, and at each step of pruning, determine all label values simultaneously so as to minimize the square error
  - Amounts to a linear least squares problem: direct solution

58 Refitting (explanation)
- A leaf i corresponds to a basis function μ_i(x): the product of the discriminators encountered on the path from the root
- The tree prediction is a weighted combination of these basis functions: ŷ(x) = ŷ_1·μ_1(x) + ŷ_2·μ_2(x) + … + ŷ_k·μ_k(x)
- The weights ŷ_i are the labels attached to the terminal nodes
- Refitting amounts to tuning the ŷ_i parameters to minimize the square error on the training set
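Since the tree output is linear in the leaf labels ŷ_i, refitting is an ordinary least squares problem. A sketch assuming the leaf memberships μ_i(x_n) have already been collected into a matrix, with numpy's lstsq as the direct solver (the actual solver used in the presentation is not specified):

```python
import numpy as np

def refit_leaf_labels(M, y):
    """Refit all leaf labels simultaneously.

    M : (N, k) matrix whose entry M[n, i] is the membership mu_i(x_n) of
        training object n in leaf i (product of discriminators on its path).
    y : (N,) vector of training outputs.
    Returns the k labels minimizing || M @ labels - y ||^2.
    """
    labels, *_ = np.linalg.lstsq(M, y, rcond=None)
    return labels

# Toy check: 2 leaves whose memberships sum to 1 for each object
M = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.7, 0.3]])
y = np.array([1.1, 4.6, 3.0, 2.0])
print(refit_leaf_labels(M, y))
```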

59 Tree growing and pruning
- Grow the tree
- Refit the leaf labels
- Prune the tree, refitting the leaf labels at each stage
- Test the sequence of pruned trees on a validation set
- Select the best pruning level

60 Backfitting (1)
- After growing and pruning, the fuzzy tree structure has been determined
- The leaf labels are globally optimal, but not the parameters of the discriminators (which were tuned locally)
- The resulting model has 2 parameters per test node and 1 parameter per terminal node
- The output (and hence the mean square error) of the fuzzy tree is a smooth function of these parameters
- The parameters can be optimized using a standard (nonlinear) least squares technique, e.g. Levenberg-Marquardt

61 Backfitting (2)
- How to compute the derivatives needed by the nonlinear optimization technique?
  - Use a modified version of backpropagation to compute derivatives with respect to the parameters
  - Yields an efficient algorithm (linear in the size of the tree)
- Backfitting starts from the tree produced after growing and pruning:
  - Already a good approximation of a local optimum
  - Only a small number of iterations are necessary to backfit
- Backfitting may also lead to overfitting…
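A minimal backfitting sketch on the smallest possible soft tree (one test node, two leaves): all parameters (position, width and the two leaf labels) are tuned jointly with scipy.optimize.least_squares using the Levenberg-Marquardt method. The sigmoid discriminator, the synthetic data and the numerical Jacobian are my assumptions; the presentation computes exact derivatives by a backpropagation-like pass instead.

```python
import numpy as np
from scipy.optimize import least_squares   # generic nonlinear least squares solver

def soft_tree_output(params, x):
    """One-test soft tree: a sigmoid discriminator blending two leaf labels."""
    pos, width, y_left, y_right = params
    z = np.clip((x - pos) / max(abs(width), 1e-6), -60.0, 60.0)
    p_right = 1.0 / (1.0 + np.exp(-z))
    return (1.0 - p_right) * y_left + p_right * y_right

def residuals(params, x, y):
    return soft_tree_output(params, x) - y

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 300)
y = np.where(x < 0.6, 1.0, 5.0) + rng.normal(0.0, 0.2, 300)   # noisy step to recover

init = np.array([0.5, 0.2, 0.0, 4.0])        # e.g. values found by growing/pruning
fit = least_squares(residuals, init, args=(x, y), method="lm")
print(fit.x)                                  # backfitted position, width and labels
```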

62 Summary and conclusions
- Variance is problem number one in decision/regression tree induction
- It is possible to reduce variance significantly: bagging and/or tree softening
- Soft trees have the advantage of preserving interpretability and computational efficiency
- Two approaches have been presented to obtain soft trees:
  - Dual bagging:
    - Generic approach
    - Fast and simple
    - Best approach for very large databases
  - Fuzzy tree induction:
    - Similar to an ANN type of model, but (more) interpretable
    - Best approach for small learning sets (probably)

63 Some references for further reading
- Variance evaluation/reduction, bagging
- Contact: Pierre GEURTS (PhD student), geurts@montefiore.ulg.ac.be
- Papers:
  - Discretization of continuous attributes for supervised learning – Variance evaluation and variance reduction. (Invited) L. Wehenkel. Proc. of IFSA'97, International Fuzzy Systems Association World Congress, Prague, June 1997, pp. 381-388.
  - Investigation and Reduction of Discretization Variance in Decision Tree Induction. P. Geurts and L. Wehenkel. Proc. of ECML'2000.
  - Some Enhancements of Decision Tree Bagging. P. Geurts. Proc. of PKDD'2000.
  - Dual Perturb and Combine Algorithm. P. Geurts. Proc. of AI and Statistics 2001.

64 See also: www.montefiore.ulg.ac.be/services/stochastic/
- Fuzzy/soft tree induction
- Contact: Cristina OLARU (PhD student), olaru@montefiore.ulg.ac.be
- Papers:
  - Automatic induction of fuzzy decision trees and its application to power system security assessment. X. Boyen, L. Wehenkel. Int. Journal on Fuzzy Sets and Systems, Vol. 102, No. 1, pp. 3-19, 1999.
  - On neurofuzzy and fuzzy decision tree approaches. C. Olaru, L. Wehenkel. (Invited) Proc. of IPMU'98, 7th Int. Congr. on Information Processing and Management of Uncertainty in Knowledge-based Systems, 1998.

