1 Data Mining dr Iwona Schab Decision Trees
2 Method of classification: a recursive procedure which progressively divides a set of n units into groups according to a division rule. Designed for supervised prediction problems (i.e. a set of input variables is used to predict the value of a target variable). The primary goal is prediction: the fitted tree model is used to predict the target for new cases (i.e. to score new cases/data). Result: a final partition of the observations and the Boolean rules needed to score new data.
3 Decision Tree: a predictive model represented in a tree-like structure. Root node, internal nodes, terminal nodes (leaves); each split is based on the values of an input.
4 Decision tree: a nonparametric method. Allows modelling of nonlinear relationships. Sound concept, easy to interpret. Robust against outliers. Detects and takes into account potential interactions between input variables. Additional uses: categorisation of continuous variables, grouping of nominal values.
5 Decision Trees. Types: classification trees (categorical response variable), where the leaves give the predicted class and the probability of class membership; regression trees (continuous response variable), where the leaves give the predicted value of the target. Exemplary applications: handwriting recognition, medical research, financial and capital markets.
6 Decision Tree: the path to each leaf is expressed as a Boolean rule: if … then … The 'regions' of the input space are determined by the split values: intersections of subspaces defined by a single splitting variable. A regression tree model is a multivariate step function. Leaves represent the predicted target; all cases in a particular leaf are given the same predicted target (an illustrative rule is sketched below). Splits: binary, or multiway splits (inputs partitioned into disjoint ranges).
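A minimal illustration of one such leaf rule; the variable names, split values and predicted classes below are hypothetical, not taken from the lecture data:

```python
def score(case):
    # Hypothetical Boolean rule read off one root-to-leaf path:
    # if age <= 30 and income is "low" then predicted class = "bad".
    if case["age"] <= 30 and case["income"] == "low":
        return "bad"    # every case reaching this leaf gets the same label
    return "good"       # all other leaves predict "good" in this toy rule

print(score({"age": 25, "income": "low"}))   # -> bad
```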
7 Analytical decisions: the recursive partitioning rule / splitting criterion; the pruning criterion / stopping criterion; the assignment of the predicted target value.
8 Recursive partitioning rule: the method used to fit the tree. A top-down, greedy algorithm: it starts at the root node, and splits involving each single input are examined (disjoint subsets of nominal inputs, disjoint ranges of ordinal / interval inputs). The splitting criterion measures the reduction in variability of the target distribution in the child nodes and is used to choose the split. The chosen split determines the partitioning of the observations. The partition is repeated in each child node as if it were the root node of a new tree. The partitioning continues deeper in the tree; the process is repeated recursively until it is stopped by the stopping rule.
9 Splits on (at least) ordinal input
10 Splits on nominal input
11 Binary splits
12 Partitioning rule – possible variations: Incorporating some type of look-ahead or backup (these often produce inferior trees and have not been shown to be an improvement; Murthy and Salzberg, 1995). Oblique splits: splits on linear combinations of inputs (as opposed to the standard coordinate-axis splits, i.e. boundaries parallel to the input coordinates).
13 Recursive partitioning algorithm
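A minimal sketch, in Python, of the greedy top-down procedure described on slide 8, assuming a 0/1 target, binary splits on a single numeric input, and the basic (minority-share) impurity index; the function names and the simple stopping limits are illustrative, not the lecture's exact algorithm:

```python
def impurity(labels):
    # Basic impurity index: share of the minority class in the node.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)          # labels are 0/1
    return min(p, 1 - p)

def best_split(x, y):
    # Examine every candidate threshold on one numeric input and
    # return the split with the largest drop in impurity.
    parent = impurity(y)
    best = None
    for t in sorted(set(x)):
        left  = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        if not left or not right:
            continue
        child = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        drop = parent - child
        if best is None or drop > best[0]:
            best = (drop, t)
    return best                            # (impurity drop, threshold) or None

def grow(x, y, depth=0, max_depth=3, min_size=10):
    # Stop when the node is too small, too deep, or no split helps;
    # otherwise split greedily and recurse into both child nodes.
    split = best_split(x, y)
    if depth >= max_depth or len(y) < min_size or split is None or split[0] <= 0:
        return {"leaf": round(sum(y) / len(y))}   # majority-class label
    _, t = split
    l = [(xi, yi) for xi, yi in zip(x, y) if xi <= t]
    r = [(xi, yi) for xi, yi in zip(x, y) if xi > t]
    return {"threshold": t,
            "left":  grow([a for a, _ in l], [b for _, b in l], depth + 1, max_depth, min_size),
            "right": grow([a for a, _ in r], [b for _, b in r], depth + 1, max_depth, min_size)}

# Tiny usage example with a single numeric input:
x = [18, 22, 25, 31, 40, 45, 52, 60, 61, 70]
y = [ 1,  1,  1,  1,  0,  0,  0,  0,  1,  1]
print(grow(x, y, max_depth=2, min_size=2))
```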
14 Stopping criterion: governs the depth and complexity of the tree; the right balance between depth and complexity must be found. When the tree is too complex: perfect discrimination in the training sample, lost stability, lost ability to generalise the discovered patterns and relations, overfitting to the training sample, difficulties with the interpretation of predictive rules. There is a trade-off between the fit to the training sample and the ability to generalise.
15 Splitting criterion: impurity reduction, chi-square test. An exhaustive tree algorithm considers all possible partitions of all inputs at every node – a combinatorial explosion.
16 Splitting criterion: minimise impurity within child nodes / maximise the differences between the newly split child nodes, i.e. choose the split which maximises the drop in impurity resulting from the partition of the parent node and maximises the difference between the child nodes. Measures of impurity: basic ratio, Gini impurity index, entropy (sketched below). Measures of difference: based on relative frequencies (classification tree), based on the target variance (regression tree).
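A short sketch of the three impurity measures named above, written for a vector of class proportions; these are the standard textbook forms, not code from the lecture:

```python
import math

def basic_ratio(p):
    # Basic impurity index: proportion of the minority class.
    return min(p)

def gini(p):
    # Gini impurity: 1 minus the sum of squared class proportions.
    return 1.0 - sum(pk ** 2 for pk in p)

def entropy(p):
    # Entropy in bits; 0 * log(0) is treated as 0.
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

# Root node of the worked example: 1600 "good" vs 400 "bad" cases.
print(basic_ratio([0.8, 0.2]))          # 0.2
print(round(gini([0.8, 0.2]), 2))       # 0.32
print(round(entropy([0.8, 0.2]), 3))    # 0.722
```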
17 Binary decision trees: a nonparametric model, no assumptions regarding distributions are needed. Classifies observations into pre-defined groups; the target is predicted for the whole leaf. Supervised segmentation: in the basic case, a recursive partition into two separate categories that maximises the similarity of observations within a leaf and maximises the differences between leaves. Tree model = rules of segmentation. No prior selection of input variables is required.
18 Trees vs hierarchical segmentation. Hierarchical segmentation: a descriptive approach, unsupervised classification, segmentation based on all variables, each partitioning based on all variables at a time (driven by a distance measure). Trees: a predictive approach, supervised classification, segmentation driven by the target variable, each partitioning based on one variable at a time (usually).
19 Requirements: a large data sample; in the case of classification trees, a sufficient number of cases falling into each class of the target (suggested: min. 500 cases per class).
20 Stopping criterion: the node reaches a pre-defined size (e.g. 10 or fewer cases); the algorithm has run for the predefined number of generations; the split results in a (too) small drop of impurity; expected losses in the testing sample; stability of results in the testing sample; probabilistic assumptions regarding the variables (e.g. the CHAID algorithm). A sketch of how such limits look in practice follows below.
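Purely as an illustration (scikit-learn is not part of the lecture material), the same kinds of stopping limits appear as hyperparameters of a widely used tree implementation:

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument mirrors one of the stopping rules listed above.
tree = DecisionTreeClassifier(
    min_samples_leaf=10,          # node reaches a pre-defined minimum size
    max_depth=5,                  # limit on how deep the recursion may go
    min_impurity_decrease=0.001,  # reject splits with too small an impurity drop
    criterion="gini",             # impurity measure used by the splitting criterion
)
# tree.fit(X_train, y_train) would then grow the tree on training data.
```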
21 Target assignment to the leaf
22 Disadvantages: Lack of stability (often). Stability assessment on the basis of a testing sample, without formal statistical inference. In the case of a classification tree, the target value is calculated in a separate step with a „simplistic” method (assignment of the dominating class). The target value is calculated at the leaf level, not at the level of the individual observation.
Splitting example: drop of impurity ΔI, the basic impurity index, and the average impurity of the child nodes.
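The formulas behind these three quantities, reconstructed in standard textbook form from the worked numbers on the following slides (the original slide carried them as images):

```latex
\[
I(v) = \min\{p_{\text{good}}(v),\, p_{\text{bad}}(v)\}
\qquad \text{(basic impurity index of node } v\text{)}
\]
\[
\bar{I} = p(l)\,I(l) + p(r)\,I(r)
\qquad \text{(average impurity of the child nodes)}
\]
\[
\Delta I = I(v) - \bigl[\,p(l)\,I(l) + p(r)\,I(r)\,\bigr]
\qquad \text{(drop of impurity achieved by the split)}
\]
```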
Splitting example: Gini impurity index, entropy, Pearson's test for relative frequencies.
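The standard forms of these three splitting measures, again reconstructed rather than copied from the slide:

```latex
\[
\text{Gini}(v) = 1 - \sum_{k} p_k^2,
\qquad
\text{Entropy}(v) = -\sum_{k} p_k \log_2 p_k,
\]
\[
\chi^2 = \sum_{\text{cells}} \frac{(O - E)^2}{E}
\qquad \text{(Pearson's test comparing child-node class frequencies with the parent's)}
\]
```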
Splitting example: how to split the ordinal (in this case) variable „age”? (young + older) vs. medium, or (young + medium) vs. older? The class counts implied by the calculations on the following slides:
Age      #G     #B     Odds (of being good)
Young    800    200    4 : 1
Medium   500    100    5 : 1
Older    300    100    3 : 1
Total    1600   400    4 : 1
Splitting example, split 1: (Young + Older) = r versus Medium = l. I(v) = min{400/2000; 1600/2000} = 0.2; p(r) = 1400/2000 = 0.7; p(l) = 600/2000 = 0.3; I(r) = 300/1400; I(l) = 100/600.
Splitting example, split 2: (Young + Medium) = r versus Older = l. I(v) = min{400/2000; 1600/2000} = 0.2; p(r) = 1600/2000 = 0.8; p(l) = 400/2000 = 0.2; I(r) = 300/1600; I(l) = 100/400.
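Completing the arithmetic for both candidate splits with the basic impurity index:

```latex
\[
\Delta I_1 = 0.2 - \Bigl(0.7 \cdot \tfrac{300}{1400} + 0.3 \cdot \tfrac{100}{600}\Bigr)
           = 0.2 - (0.15 + 0.05) = 0
\]
\[
\Delta I_2 = 0.2 - \Bigl(0.8 \cdot \tfrac{300}{1600} + 0.2 \cdot \tfrac{100}{400}\Bigr)
           = 0.2 - (0.15 + 0.05) = 0
\]
```

Both drops are zero, so the basic index cannot rank the two splits; this is one motivation for the finer measures (Gini, entropy, chi-square) introduced above.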
Splitting example, split 1 revisited: (Young + Older) = r versus Medium = l; p(r) = 1400/2000 = 0.7; p(l) = 600/2000 = 0.3.
Splitting example, split 2 revisited: (Young + Medium) = r versus Older = l; p(r) = 1600/2000 = 0.8; p(l) = 400/2000 = 0.2.
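A small Python check of the two candidate splits using the Gini index, with class counts taken from the example above; the original slides carried the corresponding calculations as images, so this is a reconstruction, and the choice of Gini (rather than entropy or the chi-square test) is an assumption:

```python
def gini(bad, total):
    # Gini impurity of a two-class node with `bad` minority cases out of `total`.
    p = bad / total
    return 2 * p * (1 - p)

def gini_drop(parent, children):
    # parent, children: (bad, total) tuples; children are weighted by their size.
    n = parent[1]
    return gini(*parent) - sum(t / n * gini(b, t) for b, t in children)

root = (400, 2000)
split1 = [(300, 1400), (100, 600)]   # (Young + Older) vs Medium
split2 = [(300, 1600), (100, 400)]   # (Young + Medium) vs Older

print(round(gini_drop(root, split1), 5))   # ~0.00095
print(round(gini_drop(root, split2), 5))   # ~0.00125
```

Under Gini, split 2 ((Young + Medium) vs Older) yields the larger impurity drop and would be preferred.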