Overview of Data Mining Methods
- Data mining techniques
- What the techniques do, examples, advantages and disadvantages
Contents
- Reviews data mining tools
- Compares data mining perspectives
- Discusses data mining functions
- Presents four sets of data used to demonstrate tools in subsequent chapters
- Shows the Enterprise Miner structure for data mining analysis in the appendix
Data mining applications
- Automobile insurance company: fraud detection
- Business applications: loan evaluation, customer segmentation, employee evaluation…
- Data mining tools are categorized by task: classification, estimation, prediction, clustering, and summarization
- Classification, estimation, and prediction are predictive; clustering and summarization are descriptive
History
- Statistics
- AI: genetic algorithms and neural networks, by analogy with biology
- Memory-based reasoning
- Link analysis, from graph theory
- See Table 4.1
Data mining perspectives
Methods can be viewed from different perspectives. Data mining methods include:
- Cluster analysis (Chapter 5)
- Regression of various forms (best-fit methods, Chapter 6)
- Discriminant analysis (use of regression for classification, Chapter 6)
- Line fitting through the operations research tool of multiple objective linear programming (Chapter 9)
- AI:
  - Artificial neural networks (Chapter 7)
  - Rule induction (decision trees, Chapter 8)
  - Genetic algorithms (supplement)
See page 55 for more descriptions.
Techniques
Statistical
- Market-basket analysis: find groups of items
- Memory-based reasoning: case based
- Cluster detection: undirected (quantitative)
Artificial intelligence
- Link analysis: MCI's Friends & Family
- Decision trees, rule induction: production rules
- Neural networks: automatic pattern detection
- Genetic algorithms: keep best parameters
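Market-basket analysis is described above as finding groups of items. A minimal sketch of that idea is to count how often each pair of items appears in the same basket. The basket contents and the function name `pair_support` are hypothetical, and "support" (the share of baskets containing the pair) is a standard measure that the slides themselves do not name.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(baskets)
# Pairs ranked by how often they are bought together.
ranked = sorted(support.items(), key=lambda item: item[1], reverse=True)
```

Pairs with high support are candidate item groups. Real tools typically extend this to larger item sets and add further measures, which this sketch does not attempt.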
Models
- Regression: Y = a + bX
- Classification: assign a new record to a class
- Predictive: assign a value to a new record
- Clustering: groups for data
- Time series: assign a future value
- Links: patterns in data
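The regression form Y = a + bX can be illustrated with an ordinary least-squares fit. This is a minimal sketch on made-up data; the function name `fit_line` and the sample values are assumptions for illustration, not taken from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + bX; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical observations that lie exactly on Y = 1 + 2X.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b = fit_line(xs, ys)

# Predictive use: assign a value to a new record with X = 5.
predicted = a + b * 5
```

Once a and b are estimated, the same formula serves the predictive role listed above: a new record's X is plugged in to obtain its value.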
Fitting
- Underfitting: not enough detail
  - leaves out important variables
- Overfitting: too much detail
  - memorizes the training set, but doesn't help with new data
  - data set too small
  - redundancy in data
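The contrast between underfitting and overfitting can be sketched by comparing three models on a small training set and on separate holdout data. Everything here is an assumption for illustration: the data are made up (roughly Y = 2X with alternating noise), the "underfit" model ignores X entirely, and the "overfit" model simply memorizes the training records and returns the Y of the nearest one.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function of x) on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_mean(xs, ys):
    """Underfit: ignores X and always predicts the average Y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line_model(xs, ys):
    """Least-squares line Y = a + bX."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x


def fit_memorizer(xs, ys):
    """Overfit: returns the Y of the closest memorized training record."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Hypothetical training data: Y = 2X plus alternating noise of +1 / -1.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [3, 3, 7, 7, 11, 11]
# Hypothetical holdout data on the noise-free relationship Y = 2X.
test_x = [1.4, 2.6, 3.4, 4.6, 5.4]
test_y = [2.8, 5.2, 6.8, 9.2, 10.8]

results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line_model),
                  ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
```

In `results`, each entry holds (training error, holdout error). The memorizer has zero training error yet does worse than the line on the holdout data, while the mean-only model does poorly on both.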
Comparison of Features
Feature         | Rules     | Neural Net | Case Base | Genetic
Noisy data      | Good      | Very good  | Good      | Very good
Missing data    | Good      | Good       | Very good | Good
Large sets      | Very good | Poor       | Good      | Good
Different types | Good      | Numerical  | Very good | Transform
Accuracy        | High      | Very high  | High      | High
Explanation     | Very good | Poor       | Very good | Good
Integration     | Good      | Good       | Good      | Very good
Ease            | Easy      | Difficult  | Easy      | Difficult
Data Mining Functions
- Classification: identify categories in data
- Prediction: formula to predict future observations
- Association: rules using relationships among entities
- Detection: anomalies (unusual) and irregularities (fraud detection)
Financial Applications
Technique                  | Application                            | Problem type
Neural net                 | Forecast stock price                   | Prediction
Neural net, rule induction | Forecast bankruptcy; fraud detection   | Prediction; detection
Neural net, case based     | Forecast interest rate                 | Prediction
Neural net, visualization  | Late loan detection                    | Detection
Rule induction             | Credit assessment; risk classification | Prediction; classification
Rule induction, case based | Corporate bond rate                    | Prediction
Telecom Applications
Technique                  | Application               | Problem type
Neural net, rule induction | Forecast network behavior | Prediction
Rule induction             | Churn; fraud detection    | Classification; detection
Case based                 | Call tracking             | Classification
Marketing Applications
Technique                              | Application                              | Problem type
Rule induction                         | Market segment; cross-selling            | Classification; association
Rule induction, visualization          | Lifestyle analysis; performance analysis | Classification; association
Rule induction, genetic, visualization | Reaction to promotion                    | Prediction
Case based                             | Online sales support                     | Classification
Web Applications
Technique                     | Application                       | Problem type
Rule induction, visualization | User browsing similarity analysis | Classification, association
Rule-based heuristics         | Web page content similarity       | Association
Other Applications
Technique                  | Application                         | Problem type
Neural net                 | Software cost                       | Detection
Neural net, rule induction | Litigation assessment               | Prediction
Rule induction             | Insurance fraud; healthcare except. | Detection
Case based                 | Insurance claim; software quality   | Prediction; classification
Genetic algorithm          | Budget spending                     | Classification
Data Sets
- Loan applications: classification
- Job applications: classification
- Insurance fraud: detection
- Expenditure data: prediction
Loan Data
- 650 observations
- Outcomes (binary):
  - On-time: cost of error $300
  - Late (default): cost of error $2,000
- Variables: Age, Income, Assets, Debts, Want, Credit
  - Credit is ordinal
  - Transform: Assets, Debts, and Want → Risk
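The unequal error costs for the loan data suggest judging a classifier by total cost rather than by error count alone. The sketch below is one way to do that, under the assumption that $300 is the cost of misclassifying a loan that is actually on-time and $2,000 is the cost of misclassifying a loan that is actually late. The function names, the sample labels, and the probability input `p_late` are hypothetical.

```python
COST_ON_TIME_ERROR = 300   # assumed: actual on-time loan classified as late
COST_LATE_ERROR = 2000     # assumed: actual late loan classified as on-time


def total_error_cost(actual, predicted):
    """Sum the cost of every misclassified loan."""
    cost = 0
    for truth, guess in zip(actual, predicted):
        if truth == guess:
            continue  # correct classifications cost nothing
        cost += COST_ON_TIME_ERROR if truth == "on-time" else COST_LATE_ERROR
    return cost


def min_cost_label(p_late):
    """Pick the label with the lower expected error cost.

    p_late is a model's estimated probability that the loan will be late.
    """
    cost_if_predict_on_time = p_late * COST_LATE_ERROR
    cost_if_predict_late = (1 - p_late) * COST_ON_TIME_ERROR
    if cost_if_predict_late < cost_if_predict_on_time:
        return "late"
    return "on-time"


# Hypothetical outcomes: one error of each kind.
actual = ["on-time", "late", "late", "on-time"]
predicted = ["late", "late", "on-time", "on-time"]
cost = total_error_cost(actual, predicted)
```

Under these assumed costs, predicting "late" is cheaper in expectation whenever `p_late` exceeds 300 / (300 + 2000), about 0.13, which is far below the 0.5 cutoff that would minimize the raw error count.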
Job Application Data
- 500 observations
- Outcomes (ordinal): Unacceptable, Minimal, Acceptable, Excellent
- Variables: Age, State, Degree, Major, Experience
  - State is nominal; Degree and Major are ordinal
  - State is superfluous
Insurance Claim Data
- 5,000 observations
- Outcomes (binary):
  - OK: cost of error $500
  - Fraudulent: cost of error $2,500
- Variables: Age, Gender, Claim, Tickets, Prior claims, Attorney
  - Gender and Attorney are nominal; Tickets and Prior claims are categorical
Expenditure Data
- 10,000 observations
- Outcomes:
  - Could predict response in a number of categories
  - Others
- Variables: Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Drivers license, Own home, Number of credit cards
- Churn, proportion of income spent on seven categories