Overview of Methods
  Data mining techniques
  What techniques do, examples, advantages & disadvantages
McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved
History
  Statistics
  AI:
    – genetic algorithms, neural networks (analogies with biology)
    – memory-based reasoning
    – link analysis (from graph theory)
Techniques
  Statistical
    – Market-Basket Analysis - find groups of items
    – Memory-Based Reasoning - case based
    – Cluster Detection - undirected (quantitative MBA)
  Artificial Intelligence
    – Link Analysis - MCI’s Friends & Family
    – Decision Trees, Rule Induction - production rule
    – Neural Networks - automatic pattern detection
    – Genetic Algorithms - keep best parameters
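To make "find groups of items" concrete, here is a minimal sketch of the counting step behind market-basket analysis. The baskets, the item names, and the `min_support` cutoff are all hypothetical; "support" is used in its usual sense of the share of baskets containing a pair, a term the slides themselves do not define.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; the item names are illustrative only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]

def pair_counts(transactions):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives every pair one canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for pairs found in at least min_support of baskets."""
    n = len(transactions)
    return {
        pair: count / n
        for pair, count in pair_counts(transactions).items()
        if count / n >= min_support
    }

result = frequent_pairs(baskets, min_support=0.4)
for pair, support in sorted(result.items()):
    print(pair, support)
```

Real tools extend this idea to larger item groups and to rules with confidence measures; the sketch stops at pairs to keep the counting visible.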
Models
  Regression: Y = a + bX
  Classification: assign new record to class
  Predictive: assign value to new record
  Clustering: groups for data
  Time-series: assign future value
  Links: patterns in data
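The regression model Y = a + bX can be fitted by ordinary least squares. The sketch below is one simple way to do it, assuming a single predictor; the function names and the noise-free toy data are illustrative, not taken from the slides.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a (intercept) and b (slope) in Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of X and Y divided by variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict(a, b, x):
    """Assign a value to a new record using the fitted line."""
    return a + b * x

# Hypothetical data lying exactly on Y = 1 + 2X.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)
print(a, b, predict(a, b, 6))
```

With real data the points would not fall exactly on a line, and a and b would be the values that minimize the squared prediction error.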
Fitting
  Underfitting: not enough detail
    – leave out important variables
  Overfitting: too much detail
    – memorizes training set, but doesn’t help with new data
    – data set too small
    – redundancy in data
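The following toy sketch illustrates what "memorizes training set, but doesn’t help with new data" can look like. It compares a 1-nearest-neighbor model, which stores every training case, with a one-variable cutoff rule. The data, the cutoff of 35, and the function names are all hypothetical, and the training set deliberately contains one noisy case.

```python
# Hypothetical toy data: (income in $1000s, label), 1 = on-time, 0 = late.
# The case (25, 1) is a deliberately noisy observation.
train = [(10, 0), (20, 0), (30, 0), (40, 1), (50, 1), (60, 1), (25, 1)]
test = [(12, 0), (24, 0), (26, 0), (33, 0), (38, 1), (46, 1), (58, 1)]

def memorizer(x):
    """1-nearest-neighbor: return the label of the closest training case."""
    _, label = min(train, key=lambda row: abs(row[0] - x))
    return label

def simple_rule(x, cutoff=35):
    """A deliberately simple model: one cutoff on a single variable."""
    return 1 if x >= cutoff else 0

def accuracy(predict, rows):
    """Share of rows for which the prediction matches the label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name, accuracy(model, train), accuracy(model, test))
```

On this toy data the memorizer gets all 7 training cases right but only 5 of the 7 new cases, because it reproduces the noisy case; the simple rule misses that one training case but gets all 7 new cases right.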
Comparison of Features
                   Rules       Neural Net   Case Base   Genetic
  Noisy data       Good        Very good    Good        Very good
  Missing data     Good        Good         Very good   Good
  Large sets       Very good   Poor         Good        Good
  Different types  Good        Numerical    Very good   Transform
  Accuracy         High        Very high    High        High
  Explanation      Very good   Poor         Very good   Good
  Integration      Good        Good         Good        Very good
  Ease             Easy        Difficult    Easy        Difficult
Data Mining Functions
  Classification
    – Identify categories in data
  Prediction
    – Formula to predict future observations
  Association
    – Rules using relationships among entities
  Detection
    – Anomalies & irregularities (fraud detection)
Financial Applications
  Technique    Application              Problem Type
  Neural net   Forecast stock price     Prediction
  NN, Rule     Forecast bankruptcy      Prediction
               Fraud detection          Detection
  NN, Case     Forecast interest rate   Prediction
  NN, visual   Late loan detection      Detection
  Rule         Credit assessment        Prediction
               Risk classification      Classification
  Rule, Case   Corporate bond rate      Prediction
Telecom Applications
  Technique                 Application                 Problem Type
  Neural net, Rule induct   Forecast network behavior   Prediction
  Rule induct               Churn                       Classification
                            Fraud detection             Detection
  Case based                Call tracking               Classification
Marketing Applications
  Technique                      Application             Problem Type
  Rule induct                    Market segment          Classification
                                 Cross-selling           Association
  Rule induct, visual            Lifestyle analysis      Classification
                                 Performance analysis    Association
  Rule induct, genetic, visual   Reaction to promotion   Prediction
  Case based                     Online sales support    Classification
Web Applications
  Technique                    Application                         Problem Type
  Rule induct, Visualization   User browsing similarity analysis   Classification, Association
  Rule-based heuristics        Web page content similarity         Association
Other Applications
  Technique                 Application             Problem Type
  Neural net                Software cost           Detection
  Neural net, rule induct   Litigation assessment   Prediction
  Rule induct               Insurance fraud         Detection
                            Healthcare except.
  Case based                Insurance claim         Prediction
                            Software quality        Classification
  Genetic algorithm         Budget spending         Classification
Data Sets
  Loan Applications
    – classification
  Job Applications
    – classification
  Insurance Fraud
    – detection
  Expenditure Data
    – prediction
Loan Data
  650 observations
  OUTCOMES (binary):
    – On-time: cost of error $300
    – Late (default): cost of error $2,000
  Variables
    – Age, Income, Assets, Debts, Want, Credit
    – Credit is ordinal
    – Transform: Assets, Debts, & Want → Risk
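Because the two kinds of error carry different costs, a model on the loan data is better judged by total error cost than by a plain error count. The sketch below assumes that the $300 applies when an on-time loan is classified as late and the $2,000 when a late loan is classified as on-time; the slide does not spell this out. The label strings and the example records are hypothetical.

```python
# Costs from the slide; which direction each applies to is an assumption.
COST_ON_TIME_ERROR = 300   # actual on-time, classified as late
COST_LATE_ERROR = 2000     # actual late (default), classified as on-time

def total_error_cost(actual, predicted):
    """Sum the cost of every misclassified loan application."""
    cost = 0
    for a, p in zip(actual, predicted):
        if a == "on-time" and p == "late":
            cost += COST_ON_TIME_ERROR
        elif a == "late" and p == "on-time":
            cost += COST_LATE_ERROR
    return cost

# Hypothetical outcomes for five applications.
actual = ["on-time", "on-time", "late", "late", "on-time"]
predicted = ["on-time", "late", "on-time", "late", "on-time"]

print(total_error_cost(actual, predicted))  # one $300 error + one $2,000 error = 2300
```

Two models with the same number of errors can differ widely on this measure, depending on which kind of error each one makes.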
Job Application Data
  500 observations
  OUTCOMES (ordinal):
    – Unacceptable
    – Minimal
    – Acceptable
    – Excellent
  Variables
    – Age, State, Degree, Major, Experience
    – State is nominal; degree & major are ordinal
    – State is superfluous
Insurance Claim Data
  5,000 observations
  OUTCOMES (binary):
    – OK: cost of error $500
    – Fraudulent: cost of error $2,500
  Variables
    – Age, Gender, Claim, Tickets, Prior claims, Attorney
    – Gender & attorney are nominal; tickets & prior claims are categorical
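Unequal error costs also shift the point at which a claim should be flagged. The sketch below assumes a model that outputs a fraud probability for each claim, and assumes the $500 is the cost of flagging an OK claim while the $2,500 is the cost of passing a fraudulent one; neither assumption is stated on the slide. Flagging costs (1 - p) × 500 on average and passing costs p × 2,500, so flagging is cheaper once p exceeds 500 / (500 + 2,500).

```python
# Costs from the slide; which direction each applies to is an assumption.
COST_FALSE_ALARM = 500     # an OK claim wrongly flagged as fraudulent
COST_MISSED_FRAUD = 2500   # a fraudulent claim wrongly passed as OK

def break_even_probability(cost_false_alarm, cost_missed):
    """Fraud probability at which flagging and passing cost the same on average."""
    return cost_false_alarm / (cost_false_alarm + cost_missed)

def flag_as_fraud(p_fraud, threshold):
    """Flag the claim when the estimated fraud probability exceeds the threshold."""
    return p_fraud > threshold

threshold = break_even_probability(COST_FALSE_ALARM, COST_MISSED_FRAUD)
print(threshold)                      # 500 / 3000, i.e. one sixth
print(flag_as_fraud(0.20, threshold))
print(flag_as_fraud(0.10, threshold))
```

Under these assumptions a claim is worth flagging at a fraud probability well below 50%, because a missed fraud costs five times as much as a false alarm.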
Expenditure Data
  10,000 observations
  OUTCOMES:
    – Could predict response in a number of categories
    – Others
  Variables:
    – Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Drivers license, Own home, Number of credit cards
    – Churn, proportion of income spent on seven categories