Slide 1: Decision Tree Algorithms
– Rule based
– Suitable for automatic generation
Slide 2: Decision Trees
McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved.
– Logical branching
– Historical: ID3 – an early rule-generating system
– Branches: the different possible values of a variable
– Nodes: the points from which branches emanate
Slide 3: Goal-Driven Data Mining
– Define the goal (e.g., identify fraudulent cases)
– Develop rules identifying attributes that attain the goal
  IF attorney = Smith THEN better check
Slide 4: Tree Structure
– Sorts out data with IF-THEN rules
– Loan variables:
  Age: {young, middle, old}
  Income: {low, average, high}
  Risk: {low, medium, high}
– An exhaustive tree enumerates all combinations: 3 × 3 × 3 = 27 combinations, each of which is classified
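The exhaustive enumeration above can be sketched in a few lines of Python (the variable names are illustrative, not from any particular library):

```python
from itertools import product

# Loan variables from the slide above
age = ["young", "middle", "old"]
income = ["low", "average", "high"]
risk = ["low", "medium", "high"]

# An exhaustive tree has one leaf per combination of variable values
combinations = list(product(age, income, risk))
print(len(combinations))  # 27
```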
Slide 5: Types of Trees
– Classification tree: variable values are classes; a finite set of conditions
– Regression tree: variable values are continuous numbers; used for prediction or estimation
Slide 6: Rule Induction
– Automatically processes data
  Classification (logical, easier)
  Regression (estimation, messier)
– Searches through data for patterns & relationships
– Pure knowledge discovery: assumes no prior hypothesis, disregards human judgment
Slide 7: Example
– Three variables: Age, Income, Risk
– Outcomes: On-time (OT), Late
Slide 8: Combinations

Variable  Value    Cases  OT  Late  Pr(OT)
Age       Young    12     8   4     0.67
          Middle   5      4   1     0.80
          Old      3      3   0     1.00
Income    Low      5      3   2     0.60
          Average  9      7   2     0.78
          High     6      5   1     0.83
Risk      High     9      5   4     0.55
          Average  1      0   1     0.00
          Low      10     10  0     1.00
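The Pr(OT) column can be recomputed directly from the case counts. A minimal sketch, with the counts transcribed from the table (the dictionary layout is my own):

```python
# (cases, on-time) counts transcribed from the combinations table
counts = {
    ("Age", "Young"): (12, 8), ("Age", "Middle"): (5, 4), ("Age", "Old"): (3, 3),
    ("Income", "Low"): (5, 3), ("Income", "Average"): (9, 7), ("Income", "High"): (6, 5),
    ("Risk", "High"): (9, 5), ("Risk", "Average"): (1, 0), ("Risk", "Low"): (10, 10),
}

# Empirical probability of an on-time outcome for each variable value
pr_ot = {key: ot / cases for key, (cases, ot) in counts.items()}
print(round(pr_ot[("Age", "Young")], 2))  # 0.67
```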
Slide 9: Basis for Classification
– If a category has all outcomes of one kind, it makes a good rule
  e.g., IF Income = High, they always paid on time
– ENTROPY: a measure of information content (actually a measure of randomness)
Slide 10: Entropy Formula

Information = -[p/(p+n)] log2[p/(p+n)] - [n/(p+n)] log2[n/(p+n)]

– The lower the measure, the greater the information content
– Can be used to automatically select the variable with the most productive rule potential
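The formula above, as a small Python function (the function name is my own; the 0·log2(0) term is taken as 0, the usual convention):

```python
from math import log2

def entropy(p, n):
    """Two-class entropy of a group with p positive and n negative cases."""
    total = p + n
    e = 0.0
    for count in (p, n):
        if count:  # skip empty classes: 0 * log2(0) is treated as 0
            frac = count / total
            e -= frac * log2(frac)
    return e

print(round(entropy(8, 4), 3))  # 0.918, the Young group from the next slide
```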
Slide 11: Entropy

Young:  [-(8/12) log2(8/12) - (4/12) log2(4/12)] × 12/20 = 0.551
Middle: [-(4/5) log2(4/5) - (1/5) log2(1/5)] × 5/20 = 0.180
Old:    [-(3/3) log2(3/3) - (0/3) log2(0/3)] × 3/20 = 0.000
Age (sum): 0.731
Income:    0.782
Risk:      0.446
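The weighted sums above can be reproduced from the combinations table; a sketch (helper names are mine):

```python
from math import log2

def entropy(p, n):
    """Two-class entropy; empty classes contribute 0."""
    total = p + n
    return -sum((c / total) * log2(c / total) for c in (p, n) if c)

def weighted_entropy(groups):
    """groups: one (on_time, late) pair per value of the candidate variable."""
    total = sum(p + n for p, n in groups)
    return sum(((p + n) / total) * entropy(p, n) for p, n in groups)

# (OT, Late) counts per value, from the combinations table
age    = [(8, 4), (4, 1), (3, 0)]
income = [(3, 2), (7, 2), (5, 1)]
risk   = [(5, 4), (0, 1), (10, 0)]

for name, groups in (("Age", age), ("Income", income), ("Risk", risk)):
    print(name, round(weighted_entropy(groups), 3))
# Risk has the lowest weighted entropy (0.446), so it is chosen for the first split
```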
Slide 12: Rule
1. IF Risk = Low THEN OT
2. ELSE Late
Slide 13: All Rules
1. IF Risk = Low THEN OT
2. IF Risk NOT Low AND Age = Middle THEN Late
3. IF Risk NOT Low AND Age NOT Middle AND Income = High THEN Late
4. ELSE OT
Slide 14: Sample Case
– Age 36 → Middle
– Income $70K/year → Average
– Risk: assets $42K, debts $40K, wants $5K → Average
– Rule 2 applies, says Late
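The four rules, applied in order to the sample case, can be sketched as (the function name is my own):

```python
def classify(age, income, risk):
    """The rule set from the slides, checked in order."""
    if risk == "low":
        return "OT"    # Rule 1
    if age == "middle":
        return "late"  # Rule 2
    if income == "high":
        return "late"  # Rule 3
    return "OT"        # Rule 4 (ELSE)

# Sample case: age 36 -> middle, $70K/year -> average, risk -> average
print(classify("middle", "average", "average"))  # late (Rule 2)
```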
Slide 15: Fuzzy Decision Trees
– So far we have assumed distinct (crisp) outcomes
– Many data points are not that clear-cut
– Fuzzy: a membership function represents degree of belief (between 0 and 1)
– Fuzzy relationships have been incorporated into decision tree algorithms
Slide 16: Fuzzy Example

Age:    Young 0.3   Middle 0.9   Old 0.2
Income: Low 0.0     Average 0.8  High 0.3
Risk:   Low 0.1     Average 0.8  High 0.3

Definitions:
– Memberships will not necessarily sum to 1.0
– If ambiguous, select the alternative with the larger membership value
– Aggregate with the mean
Slide 17: Fuzzy Model
Rule 1: IF Risk = Low THEN OT
– Membership function: 0.1
Rule 2: IF Risk NOT Low AND Age = Middle THEN Late
– Risk NOT Low: MAX(0.8, 0.3) = 0.8
– Age Middle: 0.9
– Membership function: mean = 0.85
Slide 18: Fuzzy Model (cont.)
Rule 3: IF Risk NOT Low AND Age NOT Middle AND Income = High THEN Late
– Risk NOT Low: MAX(0.8, 0.3) = 0.8
– Age NOT Middle: MAX(0.3, 0.2) = 0.3
– Income High: 0.3
– Membership function: mean = (0.8 + 0.3 + 0.3)/3 ≈ 0.467
Slide 19: Fuzzy Model (cont.)
Rule 4: IF Risk NOT Low AND Age NOT Middle AND Income NOT High THEN OT
– Risk NOT Low: MAX(0.8, 0.3) = 0.8
– Age NOT Middle: MAX(0.3, 0.2) = 0.3
– Income NOT High: MAX(0.0, 0.8) = 0.8
– Membership function: mean = (0.8 + 0.3 + 0.8)/3 ≈ 0.633
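The rule memberships above can be sketched with the slides' aggregation scheme: NOT v is taken as the largest membership among the remaining values, and a rule's conditions are combined with the mean (helper names are mine):

```python
# Membership values from the fuzzy example slide
age    = {"young": 0.3, "middle": 0.9, "old": 0.2}
income = {"low": 0.0, "average": 0.8, "high": 0.3}
risk   = {"low": 0.1, "average": 0.8, "high": 0.3}

def not_value(memberships, excluded):
    """NOT v: the largest membership among the remaining values, as on the slides."""
    return max(m for v, m in memberships.items() if v != excluded)

def mean(values):
    return sum(values) / len(values)

rule2 = mean([not_value(risk, "low"), age["middle"]])            # Late
rule3 = mean([not_value(risk, "low"), not_value(age, "middle"),
              income["high"]])                                   # Late
rule4 = mean([not_value(risk, "low"), not_value(age, "middle"),
              not_value(income, "high")])                        # OT
print(round(rule2, 2), round(rule3, 3), round(rule4, 3))  # 0.85 0.467 0.633
```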
Slide 20: Fuzzy Model (cont.)
– The highest membership function is 0.633, for Rule 4
– Conclusion: On-time
Slide 21: Applications
– Inventory prediction
– Clinical databases
– Software development quality
Slide 22: Inventory Prediction
– Groceries: possibly over 100,000 SKUs; barcode data input
– Data mining to discover patterns
  Random sample of over 1.6 million records, covering 30 months and 95 outlets
  Test sample of 400,000 records
– Rule induction proved more workable than regression
  28,000 rules
  Very accurate: up to 27% improvement
Slide 23: Clinical Database
– Headache: over 60 possible causes
– Exclusive reasoning uses negative rules (applied when a symptom is absent)
– Inclusive reasoning uses positive rules
– Probabilistic rule induction expert system
  Headache: training sample of over 50,000 cases, 45 classes, 147 attributes
  Meningitis: 1,200 samples on 41 attributes, 4 outputs
Slide 24: Clinical Database (cont.)
– AQ15 and C4.5: average accuracy 82%
– Expert system: average accuracy 92%
– Rough set rule system: average accuracy 70%
– Using both positive & negative rules from rough sets: average accuracy over 90%
Slide 25: Software Development Quality
– Telecommunications company
– Goal: find patterns in modules under development that are likely to contain faults discovered by customers
  Typical module: several million lines of code
  Probability of a fault averaged 0.074
– Apply greater effort to those modules: specification, testing, inspection
Slide 26: Software Quality
– Preprocessed and reduced the data
– Used CART (Classification & Regression Trees); could specify prior probabilities
– First model: 9 rules, 6 variables
  Better at cross-validation, but variable values not available until late in development
– Second model: 4 rules, 2 variables
  About the same accuracy, with data available earlier
Slide 27: Decision Trees
– Very effective & useful
– Automatic machine learning: unbiased (but omits judgment)
– Can handle very large data sets; not affected much by missing data
– Lots of software available