Martin Ralbovský KIZI FIS VŠE
The GUHA method
- Provides a general framework for retrieving interesting information from data
- Strong foundations in logic and statistics
- One of the main principles of the method is to provide "everything interesting" to the user
Decision trees
- One of the best-known classification methods
- Several well-known algorithms exist for constructing decision trees (ID3, C4.5, ...)
- Algorithm outline: iterate through the attributes; in each step choose the best attribute for branching and create a node from that attribute
- The single best decision tree is output
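The greedy step in the outline above can be sketched as follows. This is an illustrative sketch, not the deck's implementation: it uses information gain as the "best attribute" criterion (as in ID3), and the toy data and names (`entropy`, `information_gain`) are my own assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction achieved by splitting the data on attribute `attr`."""
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in split.values())
    return entropy(labels) - remainder

# Toy data: each row maps attribute name -> categorical value
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain",  "windy": "no"},
    {"outlook": "rain",  "windy": "yes"},
]
labels = ["play", "stay", "play", "stay"]

# The greedy step: pick the attribute with the highest information gain
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
```

Here `windy` separates the classes perfectly (gain 1.0) while `outlook` carries no information (gain 0.0), so the greedy step branches on `windy`.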
Making decision trees the GUHA way (Petr Berka)
- A decision tree can be viewed as a GUHA verification/hypothesis
- But the classical algorithm yields only one tree in the output
- Modification of the initial algorithm: the ETree procedure
- We do not branch only on the best attribute, but on the n best attributes
- In each iteration, nodes suitable for branching are selected from the existing trees and branched
- Only sound decision trees are put to the output
ETree parameters (Petr Berka)
- Criterion for attribute ordering: χ²
- Trees: maximal tree depth (parameter); allow only full-length trees; number of attributes for branching
- Branching: minimal node frequency; minimal node purity; stopping criterion (frequency, purity, or frequency OR purity)
- Sound trees: confusion matrix, F-measure + any 4ft-quantifier in Ferda
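The branching parameters above can be sketched as a stopping test. This is a minimal sketch under my own assumptions: the function names (`node_purity`, `stop_branching`) and the exact semantics of combining the criteria are illustrative, not Ferda's API.

```python
from collections import Counter

def node_purity(labels):
    """Fraction of the node's cases belonging to its majority class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def stop_branching(labels, min_frequency, min_purity, mode="or"):
    """Decide whether a node should no longer be branched.

    mode: 'frequency' (node too small), 'purity' (node pure enough),
    or 'or' (either criterion suffices), mirroring the slide's
    frequency / purity / frequency OR purity options."""
    too_small  = len(labels) < min_frequency
    pure_enough = node_purity(labels) >= min_purity
    if mode == "frequency":
        return too_small
    if mode == "purity":
        return pure_enough
    return too_small or pure_enough

# Example with the deck's experiment settings (purity 0.8, frequency in cases):
stop = stop_branching(["a"] * 9 + ["b"], min_frequency=5, min_purity=0.8)
```

With nine `a` cases and one `b`, purity is 0.9 ≥ 0.8, so branching stops even though the node is frequent enough.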
How to branch I & II
[Diagrams: branching variants a) and b) for Attribute1 = {A, B, C} and Attribute2 = {1, 2}]
Pseudocode algorithm

LIFO stack;
stack.Push(MakeSeedTree());
while (stack.Length > 0)
{
    Tree processTree = stack.Pop();
    foreach (Node n in NodesForBranching(processTree))
    {
        stack.Push(CreateTree(processTree, n));
    }
    if (QualityTree(processTree))
    {
        PutToOutput(processTree);
    }
}
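The stack loop above can be made runnable with toy stand-ins for the helpers. Everything below except the loop structure is an illustrative assumption: a tree is reduced to the list of attributes on its path, `quality_tree` stands in for the 4ft-quantifier test, and the depth limit plays the role of the "maximal tree depth" parameter.

```python
# Toy runnable version of the ETree stack loop: every tree popped from the
# stack is both branched into derived trees and checked for output quality.

MAX_DEPTH = 2
ATTRIBUTES = ["a1", "a2", "a3"]

def seed_tree():
    return []                          # a tree = attributes used so far (toy)

def nodes_for_branching(tree):
    """Attributes still usable for branching, respecting the depth limit."""
    if len(tree) >= MAX_DEPTH:
        return []
    return [a for a in ATTRIBUTES if a not in tree]

def create_tree(tree, attr):
    return tree + [attr]               # derived tree with one more branch

def quality_tree(tree):
    """Stand-in for a 4ft-quantifier soundness test; here: non-empty."""
    return len(tree) > 0

output = []
stack = [seed_tree()]                  # LIFO stack
while stack:                           # the slide's `>= 0` would never end
    tree = stack.pop()
    for node in nodes_for_branching(tree):
        stack.append(create_tree(tree, node))
    if quality_tree(tree):
        output.append(tree)
```

With three attributes and depth 2 this emits the 3 depth-one trees and the 6 depth-two trees (9 in total), showing how ETree outputs many sound trees rather than a single best one.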
Implementation in Ferda
- Instead of creating a new data-mining tool, the modularity of Ferda was used
- Data preparation boxes
- 4ft-quantifiers can be used to measure the quality of trees
- Uses the MiningProcessor (bit string generation engine)
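One such quality measure, the F-measure over a confusion matrix mentioned earlier, can be written out directly. This sketch assumes the usual 4ft-table convention (a = cases satisfying both antecedent and succedent, b = antecedent only, c = succedent only; d does not enter the F-measure); the function name is mine.

```python
def f_measure(a, b, c):
    """F1 score from a 4ft (confusion) table.

    a = true positives, b = false positives, c = false negatives;
    precision = a/(a+b), recall = a/(a+c)."""
    precision = a / (a + b)
    recall = a / (a + c)
    return 2 * precision * recall / (precision + recall)

# Example: 80 correct positives, 20 false alarms, 20 misses
score = f_measure(80, 20, 20)
```

Precision and recall are both 0.8 here, so the F-measure is 0.8; a threshold on this value (the "F-threshold" in the experiments below) decides which trees count as sound.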
ETree task example Existing data preparation boxes 4ft-quantifiers ETree task box
Output + settings example …
Experiment 1 - Barbora
- Barbora bank, cca 6,100 clients; classification of client status from: loan amount, client district, loan duration, client salary
- Number of attributes for branching = 4
- Minimal node purity = 0.8
- Minimal node frequency = 61 (1% of the data)
Results - Barbora
[Table: tree depth, F-threshold, verifications, hypotheses, best hypothesis]
Performance: 36 verifications/sec
Experiment 2: Forest tree cover
- UCI KDD dataset for classification (10K sample)
- Classification of tree cover based on characteristics: wilderness area, elevation, slope, horizontal + vertical distance to hydrology, horizontal distance to fire point
- Number of attributes for branching: 1, 3, 5
- Minimal node purity: 0.8
- Minimal node frequency: 100 (1% of the dataset)
Results – Forest tree cover
[Three tables: tree depth, F-threshold, verifications, hypotheses, best hypothesis]
- Attributes for branching: 1, performance: 39 VPS
- Attributes for branching: 3, performance: 86 VPS
- Attributes for branching: 5, performance: 71 VPS
Experiment 3: Forest tree cover
- Construction of trees for the whole dataset (cca 600K rows)
- Does increasing the number of attributes for branching result in better trees?
- Tree depth = 3, the other parameters same as in experiment 2
- Number of attributes for branching = 1: best hypothesis 0.30, 6 VPS (bit strings in cache)
- Number of attributes for branching = 4: best hypothesis 0.52, 2 VPS (bit strings in cache)
Verifications: 4FT vs. ETree
- On tasks over data tables of similar length: 4FT (in Ferda) approx. VPS, ETree about 70 VPS
- The ETree verification is far more complicated:
- In addition to computing the quantifier, χ² is computed for each node suitable for branching
- Hard operations (sums) instead of easy operations (conjunctions, ...)
- Not only verification of a tree, but also construction of the trees derived from it
Further work
- How new/known is the method?
- Boxes for attribute selection criteria
- Classification box
- Better result browsing + result reduction
- Optimization
- Elective classification: each tree has a vote (Petr Berka)
- Experiments with various data sources
- Decision trees from fuzzy attributes
- Better estimation of the count of relevant questions