Slide 1: Decision Tree Algorithms
– Rule based
– Suitable for automatic generation
Slide 2: Decision Trees
McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved.
– Logical branching
– Historical: ID3 – an early rule-generating system
– Branches: the different possible values of a variable
– Nodes: the points from which branches emanate
Slide 3: Goal-Driven Data Mining
– Define the goal (e.g., identify fraudulent cases)
– Develop rules identifying attributes that attain the goal
  IF attorney = Smith THEN better check
Slide 4: Tree Structure
– Sorts out data with IF-THEN rules
– Loan variables:
  Age: {young, middle, old}
  Income: {low, average, high}
  Risk: {low, medium, high}
– An exhaustive tree enumerates all combinations: 3 × 3 × 3 = 27 combinations, each of which is classified
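The exhaustive enumeration above can be sketched in a few lines of Python (the variable names are illustrative, not from any particular library):

```python
from itertools import product

# Loan variables from the slide above
age = ["young", "middle", "old"]
income = ["low", "average", "high"]
risk = ["low", "medium", "high"]

# An exhaustive tree has one leaf per combination of variable values
combinations = list(product(age, income, risk))
print(len(combinations))  # 27
```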
Slide 5: Types of Trees
– Classification tree: variable values are classes; a finite set of conditions
– Regression tree: variable values are continuous numbers; used for prediction or estimation
Slide 6: Rule Induction
– Automatically processes data
  Classification (logical, easier)
  Regression (estimation, messier)
– Searches through data for patterns & relationships
– Pure knowledge discovery: assumes no prior hypothesis, disregards human judgment
Slide 7: Example
– Three variables: Age, Income, Risk
– Outcomes: On-time (OT), Late
Slide 8: Combinations

Variable  Value    Cases  OT  Late  Pr(OT)
Age       Young    12     8   4     0.67
          Middle   5      4   1     0.80
          Old      3      3   0     1.00
Income    Low      5      3   2     0.60
          Average  9      7   2     0.78
          High     6      5   1     0.83
Risk      High     9      5   4     0.55
          Average  1      0   1     0.00
          Low      10     10  0     1.00
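The Pr(OT) column can be recomputed directly from the case counts. A minimal sketch, with the counts transcribed from the table (the dictionary layout is my own):

```python
# (cases, on-time) counts transcribed from the combinations table
counts = {
    ("Age", "Young"): (12, 8), ("Age", "Middle"): (5, 4), ("Age", "Old"): (3, 3),
    ("Income", "Low"): (5, 3), ("Income", "Average"): (9, 7), ("Income", "High"): (6, 5),
    ("Risk", "High"): (9, 5), ("Risk", "Average"): (1, 0), ("Risk", "Low"): (10, 10),
}

# Empirical probability of an on-time outcome for each variable value
pr_ot = {key: ot / cases for key, (cases, ot) in counts.items()}
print(round(pr_ot[("Age", "Young")], 2))  # 0.67
```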
Slide 9: Basis for Classification
– If a category has all outcomes of one kind, it makes a good rule
  e.g., IF Income = High, they always paid on time
– ENTROPY: a measure of information content (actually a measure of randomness)
Slide 10: Entropy Formula

Information = -[p/(p+n)] log2[p/(p+n)] - [n/(p+n)] log2[n/(p+n)]

– The lower the measure, the greater the information content
– Can be used to automatically select the variable with the most productive rule potential
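The formula above, as a small Python function (the function name is my own; the 0·log2(0) term is taken as 0, the usual convention):

```python
from math import log2

def entropy(p, n):
    """Two-class entropy of a group with p positive and n negative cases."""
    total = p + n
    e = 0.0
    for count in (p, n):
        if count:  # skip empty classes: 0 * log2(0) is treated as 0
            frac = count / total
            e -= frac * log2(frac)
    return e

print(round(entropy(8, 4), 3))  # 0.918, the Young group from the next slide
```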
Slide 11: Entropy

Young:  [-(8/12) log2(8/12) - (4/12) log2(4/12)] × 12/20 = 0.551
Middle: [-(4/5) log2(4/5) - (1/5) log2(1/5)] × 5/20 = 0.180
Old:    [-(3/3) log2(3/3) - (0/3) log2(0/3)] × 3/20 = 0.000
Age (sum): 0.731
Income:    0.782
Risk:      0.446
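The weighted sums above can be reproduced from the combinations table; a sketch (helper names are mine):

```python
from math import log2

def entropy(p, n):
    """Two-class entropy; empty classes contribute 0."""
    total = p + n
    return -sum((c / total) * log2(c / total) for c in (p, n) if c)

def weighted_entropy(groups):
    """groups: one (on_time, late) pair per value of the candidate variable."""
    total = sum(p + n for p, n in groups)
    return sum(((p + n) / total) * entropy(p, n) for p, n in groups)

# (OT, Late) counts per value, from the combinations table
age    = [(8, 4), (4, 1), (3, 0)]
income = [(3, 2), (7, 2), (5, 1)]
risk   = [(5, 4), (0, 1), (10, 0)]

for name, groups in (("Age", age), ("Income", income), ("Risk", risk)):
    print(name, round(weighted_entropy(groups), 3))
# Risk has the lowest weighted entropy (0.446), so it is chosen for the first split
```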
Slide 12: Rule
1. IF Risk = Low THEN OT
2. ELSE Late
Slide 13: All Rules
1. IF Risk = Low THEN OT
2. IF Risk NOT Low AND Age = Middle THEN Late
3. IF Risk NOT Low AND Age NOT Middle AND Income = High THEN Late
4. ELSE OT
Slide 14: Sample Case
– Age 36 → Middle
– Income $70K/year → Average
– Risk: assets $42K, debts $40K, wants $5K → Average
– Rule 2 applies, says Late
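The four rules, applied in order to the sample case, can be sketched as (the function name is my own):

```python
def classify(age, income, risk):
    """The rule set from the slides, checked in order."""
    if risk == "low":
        return "OT"    # Rule 1
    if age == "middle":
        return "late"  # Rule 2
    if income == "high":
        return "late"  # Rule 3
    return "OT"        # Rule 4 (ELSE)

# Sample case: age 36 -> middle, $70K/year -> average, risk -> average
print(classify("middle", "average", "average"))  # late (Rule 2)
```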
Slide 15: Fuzzy Decision Trees
– So far we have assumed distinct (crisp) outcomes
– Many data points are not that clear-cut
– Fuzzy: a membership function represents degree of belief (between 0 and 1)
– Fuzzy relationships have been incorporated into decision tree algorithms
Slide 16: Fuzzy Example

Age:    Young 0.3   Middle 0.9   Old 0.2
Income: Low 0.0     Average 0.8  High 0.3
Risk:   Low 0.1     Average 0.8  High 0.3

Definitions:
– Memberships will not necessarily sum to 1.0
– If ambiguous, select the alternative with the larger membership value
– Aggregate with the mean
Slide 17: Fuzzy Model
Rule 1: IF Risk = Low THEN OT
– Membership function: 0.1
Rule 2: IF Risk NOT Low AND Age = Middle THEN Late
– Risk NOT Low: MAX(0.8, 0.3) = 0.8
– Age Middle: 0.9
– Membership function: mean = 0.85
Slide 18: Fuzzy Model (cont.)
Rule 3: IF Risk NOT Low AND Age NOT Middle AND Income = High THEN Late
– Risk NOT Low: MAX(0.8, 0.3) = 0.8
– Age NOT Middle: MAX(0.3, 0.2) = 0.3
– Income High: 0.3
– Membership function: mean = (0.8 + 0.3 + 0.3)/3 ≈ 0.467
Slide 19: Fuzzy Model (cont.)
Rule 4: IF Risk NOT Low AND Age NOT Middle AND Income NOT High THEN OT
– Risk NOT Low: MAX(0.8, 0.3) = 0.8
– Age NOT Middle: MAX(0.3, 0.2) = 0.3
– Income NOT High: MAX(0.0, 0.8) = 0.8
– Membership function: mean = (0.8 + 0.3 + 0.8)/3 ≈ 0.633
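The rule memberships above can be sketched with the slides' aggregation scheme: NOT v is taken as the largest membership among the remaining values, and a rule's conditions are combined with the mean (helper names are mine):

```python
# Membership values from the fuzzy example slide
age    = {"young": 0.3, "middle": 0.9, "old": 0.2}
income = {"low": 0.0, "average": 0.8, "high": 0.3}
risk   = {"low": 0.1, "average": 0.8, "high": 0.3}

def not_value(memberships, excluded):
    """NOT v: the largest membership among the remaining values, as on the slides."""
    return max(m for v, m in memberships.items() if v != excluded)

def mean(values):
    return sum(values) / len(values)

rule2 = mean([not_value(risk, "low"), age["middle"]])            # Late
rule3 = mean([not_value(risk, "low"), not_value(age, "middle"),
              income["high"]])                                   # Late
rule4 = mean([not_value(risk, "low"), not_value(age, "middle"),
              not_value(income, "high")])                        # OT
print(round(rule2, 2), round(rule3, 3), round(rule4, 3))  # 0.85 0.467 0.633
```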
Slide 20: Fuzzy Model (cont.)
– The highest membership function is 0.633, for Rule 4
– Conclusion: On-time
Slide 21: Applications
– Inventory prediction
– Clinical databases
– Software development quality
Slide 22: Inventory Prediction
– Groceries: possibly over 100,000 SKUs; barcode data input
– Data mining to discover patterns
  Random sample of over 1.6 million records, covering 30 months and 95 outlets
  Test sample of 400,000 records
– Rule induction proved more workable than regression
  28,000 rules
  Very accurate: up to 27% improvement
Slide 23: Clinical Database
– Headache: over 60 possible causes
– Exclusive reasoning uses negative rules (applied when a symptom is absent)
– Inclusive reasoning uses positive rules
– Probabilistic rule induction expert system
  Headache: training sample of over 50,000 cases, 45 classes, 147 attributes
  Meningitis: 1,200 samples on 41 attributes, 4 outputs
Slide 24: Clinical Database (cont.)
– AQ15 and C4.5: average accuracy 82%
– Expert system: average accuracy 92%
– Rough set rule system: average accuracy 70%
– Using both positive & negative rules from rough sets: average accuracy over 90%
Slide 25: Software Development Quality
– Telecommunications company
– Goal: find patterns in modules under development that are likely to contain faults discovered by customers
  Typical module: several million lines of code
  Probability of a fault averaged 0.074
– Apply greater effort to those modules: specification, testing, inspection
Slide 26: Software Quality
– Preprocessed and reduced the data
– Used CART (Classification & Regression Trees); could specify prior probabilities
– First model: 9 rules, 6 variables
  Better at cross-validation, but variable values not available until late in development
– Second model: 4 rules, 2 variables
  About the same accuracy, with data available earlier
Slide 27: Decision Trees
– Very effective & useful
– Automatic machine learning: unbiased (but omits judgment)
– Can handle very large data sets; not affected much by missing data
– Lots of software available