Chapter 2 Data Mining Processes and Knowledge Discovery

Presentation transcript:

Chapter 2 Data Mining Processes and Knowledge Discovery
Identify actionable results

Contents
- Describes the Cross-Industry Standard Process for Data Mining (CRISP-DM), a set of phases that can be used in data mining studies
- Discusses each phase in detail
- Gives an example illustration
- Discusses a knowledge discovery process

CRISP-DM
- Cross-Industry Standard Process for Data Mining
- One of the first comprehensive attempts toward a standard process model for data mining
- Independent of industry sector and technology

CRISP-DM Phases
A systematic process to try to make sense of the massive amounts of data generated from daily operations.
1. Business (or problem) understanding
2. Data understanding
3. Data preparation: transform and create the data set for modeling
4. Modeling
5. Evaluation: check good models; evaluate to ensure nothing is missing
6. Deployment

Business Understanding
- Solve a specific problem: determine business objectives, assess the current situation, establish data mining goals, and develop a project plan.
- A clear definition helps
- Measurable success criteria
- Convert business objectives into a set of data mining goals: what to achieve in technical terms, such as
  - What types of customers are interested in each of our products?
  - What are typical profiles of customers …

Data Understanding
Initial data collection, data description, data exploration, and the verification of data quality.
Three issues considered in data selection:
1. Set up a concise and clear description of the problem. For example, a retail DM project may seek to identify spending behaviors of female shoppers who purchase seasonal clothes.
2. Identify the relevant data for the problem description, such as demographic, credit card transactional, and financial data…
3. Select the variables that are relevant and important for the project.

Data Understanding (cont.)
Data types:
- Demographic data (income, education, age, …)
- Socio-graphic data (hobby, club membership, …)
- Transactional data (sales records, credit card spending, …)
- Quantitative data: measurable using numerical values
- Qualitative data: also known as categorical data; contains both nominal and ordinal data (see also page 22)
Related data can come from many sources:
- Internal: ERP (or MIS), data warehouse
- External: government data, commercial data
- Created: research

Data Preparation
Once the available data sources are identified, the data need to be selected, cleaned, and built into the desired form and format.
- Clean data: fix formats, fill gaps, and filter outliers and redundancies (see page 22)
- Unified numerical scales:
  - Nominal data: code (such as gender data: male and female)
  - Ordinal data: nominal code or scale (excellent, fair, poor)
  - Cardinal data (categorical; A, B, C levels)

Types of Data
Type         Features     Synonyms
Numerical    Continuous   Range
Integer
Binary       Yes/No       Flag
Categorical  Finite       Set
Date/Time
String                    Typeless
Text
- Range: numeric values (integer, real, or date/time)
- Set: data with distinct multiple values (numeric, string, or date/time)
- Typeless: for other types of data

Data Preparation (cont.)
- Several statistical methods and visualization tools can be used to preprocess the selected data.
  - Statistics such as max, min, mean, and mode can be used to aggregate or smooth the data.
  - Scatter plots and box plots can be used to filter outliers.
- More advanced techniques, such as regression analysis, cluster analysis, decision trees, or hierarchical analysis, may be applied in data preprocessing.
- In some cases, data preprocessing can take over 50% of the time of the entire data mining process.
- Shortening data processing time can reduce much of the total computation time in data mining.

Data Preparation: data transformation
Data transformation uses simple mathematical formulations or learning curves to convert the different measurements of the selected and cleaned data into a unified numerical scale for data analysis. Data transformation can be used to:
- Transform from one numerical scale to another, to shrink or enlarge the given data. For example, (x - min)/(max - min) shrinks the data into the interval [0, 1].
- Recode categorical data to numerical scales. Categorical data can be ordinal (less, moderate, strong) or nominal (red, yellow, blue, …). For example, 1 = yes, 0 = no.
See page 24 for more details.
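The two transformations named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function name `min_max_scale`, the income figures, and the answer list are all hypothetical.

```python
def min_max_scale(values):
    """Rescale numbers into [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the formula would divide by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical incomes, used only to exercise the formula.
incomes = [20_000, 35_000, 50_000, 80_000]
scaled = min_max_scale(incomes)

# Recode a yes/no attribute as 1 = yes, 0 = no.
yes_no_code = {"yes": 1, "no": 0}
answers = ["yes", "no", "yes"]  # hypothetical answers
coded = [yes_no_code[a] for a in answers]

print(scaled)
print(coded)
```

The guard for `hi == lo` is a design choice of this sketch; the slide does not say how a constant column should be handled.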

Modeling
- Data modeling is where the data mining software is used to generate results for various situations. Data visualization and cluster analysis are useful for initial analysis.
- The choice of technique depends on the data type and the task. If the task is to group data, discriminant analysis is applied. If the purpose is estimation, regression is appropriate if the data are continuous (and logistic regression if they are not). Neural networks could be applied for both tasks.
- Data treatment:
  - Training set for development of the model
  - Test set for testing the model that is built
  - Maybe others for refining the model
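A training/test split of the kind described under data treatment might look like the following sketch. The helper name `split_train_test`, the two-thirds training fraction, the fixed seed, and the twelve placeholder cases are assumptions made for illustration.

```python
import random


def split_train_test(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records and cut it into training and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Twelve hypothetical case identifiers.
cases = [f"case{i}" for i in range(1, 13)]
train_set, test_set = split_train_test(cases)

print(len(train_set), len(test_set))
```

Shuffling before cutting keeps any ordering in the source file from leaking into one of the two sets.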

Data mining techniques
- Association: the relationship of a particular item in a data transaction to other items in the same transaction is used to predict patterns. See also page 25 for an example.
- Classification: methods intended for learning functions that map each item of the selected data into one of a predefined set of classes. Two key research problems related to classification results are the evaluation of misclassification and prediction power (C4.5). Mathematical models often used to construct classification methods are binary decision trees (CART), neural networks (nonlinear), linear programming (boundary), and statistics. See also pages 25 and 26 for more explanation.

Data mining techniques (cont.)
- Clustering: takes ungrouped data and uses automatic techniques to put the data into groups. Clustering is unsupervised and does not require a learning set. (Chapter 5)
- Prediction: related to regression techniques; discovers the relationship between the dependent and independent variables.
- Sequential patterns: seeks to find similar patterns in data transactions over a business period. The mathematical models behind sequential patterns are logic rules, fuzzy logic, and so on.
- Similar time sequences: applied to discover sequences similar to a known sequence over both past and current business periods.

Evaluation
- Does the model meet business objectives?
- Are any important business objectives not addressed?
- Does the model make sense?
- Is the model actionable?
[Figure: PDCA and CRISP-DM]

Deployment
- DM can be used to verify previously held hypotheses or for knowledge discovery.
- DM models can be applied to business purposes, including prediction or identification of key situations.
- Ongoing monitoring and maintenance
  - Evaluate performance against success criteria
  - Market reaction and competitor changes (remodeling or fine-tuning)

Example: Training set for computer purchase
- 16 records
- 5 attributes
- Goal: find a classifier for consumer behavior

Database (1st half) Case Age Income Student Credit Gender Buy? A1 31-40 High No Fair Male Yes A2 >40 Medium Female A3 Low A4 Excellent A5 ≤30 A6 A7 A8

Database (2nd half) Case Age Income Student Credit Gender Buy? A9 31-40 High Yes Fair Male A10 ≤30 No A11 Excellent Female A12 >40 Low A13 Medium A14 A15 Unknown A16 N/A

Data Selection
- Gender has a weak relationship with purchase (based on correlation)
- Drop gender
- Selected attribute set: {Age, Income, Student, Credit}

Data Preprocessing
- Income unknown in Case 15
- Credit not available in Case 16
- Drop these noisy cases

Data Transformation
Assign numerical values to each attribute:
- Age: ≤30 = 3; 31-40 = 2; >40 = 1
- Income: High = 3; Medium = 2; Low = 1
- Student: Yes = 2; No = 1
- Credit: Excellent = 2; Fair = 1
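The coding scheme above can be written as a lookup table. The sketch below applies it to case A1 from the database slide (Age 31-40, Income High, Student No, Credit Fair); the names `CODES` and `encode` are hypothetical.

```python
# Attribute codes exactly as listed on the Data Transformation slide.
CODES = {
    "Age": {"≤30": 3, "31-40": 2, ">40": 1},
    "Income": {"High": 3, "Medium": 2, "Low": 1},
    "Student": {"Yes": 2, "No": 1},
    "Credit": {"Excellent": 2, "Fair": 1},
}


def encode(record):
    """Turn one record into its numeric vector, in CODES attribute order."""
    return [CODES[attribute][record[attribute]] for attribute in CODES]


# Case A1 from the training set.
a1 = {"Age": "31-40", "Income": "High", "Student": "No", "Credit": "Fair"}
print(encode(a1))
```

A lookup that raises `KeyError` on an unseen value (such as "Unknown") is deliberate here, since the deck drops those cases during preprocessing rather than coding them.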

Data Mining
- Categorize output: Buys = C1; Doesn't buy = C2
- Conduct analysis: the model says A8 and A10 don't buy; the rest do
  - Of the actual yes, 7 correct and 1 not
  - Of the actual no, 2 correct
- Confusion matrix

Data Interpretation and Test Data Set Test on independent data Case Actual Model B1 Yes Yes (1) B2 Yes (2) B3 Yes (3) B4 Yes (4) B5 Yes (5) B6 Yes (6) B7 Yes (7) B8 (do not) No B9 B10 (do not)

Confusion Matrix
              Model Buy   Model Not   Totals
Actual Buy        7           0          7
Actual Not        1           2          3
Totals            8           2         10

Measures
- Correct classification rate: 9/10 = 0.90
- Cost function (cost of error):
  - model says buy, actual no: $20
  - model says no, actual buy: $200
  - 1 x $20 + 0 x $200 = $20
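Both measures can be recomputed from the cell counts of the test-set confusion matrix. In this sketch the counts are taken to be 7 correct buys, 2 correct non-buys, 1 case where the model says buy but the customer does not, and 0 missed buyers, which is what the figures on this slide imply; the variable names are hypothetical.

```python
# Cell counts keyed by (actual, model).
counts = {
    ("buy", "buy"): 7,
    ("buy", "no"): 0,
    ("no", "buy"): 1,
    ("no", "no"): 2,
}

# Dollar cost of each kind of error, keyed the same way.
error_cost = {
    ("no", "buy"): 20,    # model says buy, actual no
    ("buy", "no"): 200,   # model says no, actual buy
}

total = sum(counts.values())
correct = sum(n for (actual, model), n in counts.items() if actual == model)
classification_rate = correct / total

# Correct cells have no entry in error_cost, so they contribute zero.
total_cost = sum(n * error_cost.get(cell, 0) for cell, n in counts.items())

print(classification_rate, total_cost)
```

Keeping the two error costs separate is what lets a model with a lower classification rate still win on cost when missed buyers are ten times as expensive as wasted mailings.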

Goals
- Avoid broad concepts:
  - Gain insight; discover meaningful patterns; learn interesting things
  - Can't measure attainment
- Narrow and specify:
  - Identify customers likely to renew; reduce churn
  - Rank order by propensity (favor) to …

Goals
- Description: what is
  - understand
  - explain
  - discover knowledge
- Prescription: what should be done
  - classify
  - predict

Goal
- Method A: four rules, explains 70%
- Method B: fifty rules, explains 72%
- Which is best?
  - Gain understanding: Method A better (minimum description length, MDL)
  - Reduce cost of mailing: Method B better

Measurement
- Accuracy: how well does the model describe observed data?
- Confidence levels: a proportion of the time between lower and upper limits
- Comprehensibility: whole or parts?

Measuring Predictive
- Classification & prediction:
  - error rate = incorrect/total
  - requires the evaluation set to be representative
- Estimators:
  - predicted - actual (MAD, MSE, MAPE)
  - variance = sum((predicted - actual)^2) / n
  - standard deviation = square root of variance
  - distance: how far off
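The error measures named on this slide can be sketched as follows, using their standard definitions (mean absolute deviation, mean squared error, mean absolute percentage error). The function names and the three actual/predicted pairs are hypothetical and serve only to exercise the formulas.

```python
def error_rate(actual_labels, predicted_labels):
    """Share of cases where the predicted class differs from the actual class."""
    wrong = sum(a != p for a, p in zip(actual_labels, predicted_labels))
    return wrong / len(actual_labels)


def mad(actual, predicted):
    """Mean absolute deviation of predicted from actual."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error of predicted from actual."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction; actual values must be nonzero."""
    return sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

print(mad(actual, predicted), mse(actual, predicted), mape(actual, predicted))
print(error_rate(["buy", "buy", "no"], ["buy", "no", "no"]))
```

MSE penalizes large misses more heavily than MAD, while MAPE expresses the miss relative to the size of the actual value.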

Statistics
- Population: entire group studied
- Sample: subset from population
- Bias: difference between sample average and population average
- mean, median, mode
- distribution
- significance
- correlation, regression (Hamming distance)

Classification Models
- LIFT = probability in class by sample divided by probability in class by population
  - If the population probability is 20% and the sample probability is 30%, LIFT = 0.3/0.2 = 1.5
- The best lift is not necessarily best
  - A sufficient sample size is needed as confidence increases.
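The lift calculation on this slide is a single division. The sketch below reproduces the 30% versus 20% example; the function name `lift` is hypothetical.

```python
def lift(sample_probability, population_probability):
    """Probability of the class in the selected sample over its probability in the population."""
    return sample_probability / population_probability


# The slide's example: 30% in the sample against 20% in the population.
example_lift = lift(0.30, 0.20)

# Floating-point division lands very close to 1.5, so round for display.
print(round(example_lift, 2))
```

A lift above 1 means the selected sample is richer in the target class than the population as a whole.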

Lift Chart

Measuring Impact
- Ideal: $ (NPV) because of expenditure
- Mass mailing may be better
- Depends on:
  - fixed cost
  - cost per recipient
  - cost per respondent
  - value of positive response
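How the four factors listed above decide between a mass mailing and a targeted one can be sketched with a simple profit formula. The formula, the function name `mailing_profit`, and every number below are hypothetical; the slide gives no figures.

```python
def mailing_profit(recipients, responses, fixed_cost,
                   cost_per_recipient, cost_per_respondent, value_per_response):
    """Net return of a mailing under a simple, assumed cost model."""
    revenue = responses * value_per_response
    costs = (fixed_cost
             + recipients * cost_per_recipient
             + responses * cost_per_respondent)
    return revenue - costs


def compare(value_per_response):
    """Profit of a mass mailing and of a targeted mailing for one response value."""
    # Assumed: mass mailing reaches 10,000 people with 200 responses,
    # targeted mailing reaches 2,000 people with 120 responses.
    mass = mailing_profit(10_000, 200, 1_000, 1, 5, value_per_response)
    targeted = mailing_profit(2_000, 120, 1_000, 1, 5, value_per_response)
    return mass, targeted


print(compare(50))   # low value per response
print(compare(150))  # high value per response
```

Under these assumed numbers the targeted mailing wins when a response is worth $50, and the mass mailing wins when a response is worth $150, because the extra responses then outweigh the extra postage.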

Bottom Line Return on investment

Example Application
- Telephone industry
- Problem: unpaid bills
- Data mining used to develop models to predict nonpayment as early as possible
- See page 27

Knowledge Discovery Process
1. Data Selection
   - Learning the application domain
   - Creating target data set
2. Data Preprocessing
   - Data cleaning & preprocessing
3. Data Transformation
   - Data reduction & projection
4. Data Mining
   - Choosing function
   - Choosing algorithms
   - Data mining
5. Data Interpretation
   - Interpretation
   - Using discovered knowledge

1: Business Understanding
- Predict which customers would be insolvent
  - In time for the firm to take preventive measures (and avert losing good customers)
- Hypothesis: insolvent customers would change calling habits and phone usage during a critical period before and immediately after termination of the billing period

2: Data Understanding
- Static customer information available in files
  - Bills, payments, usage
- Used data warehouse to gather and organize data
  - Coded to protect customer privacy

Creating Target Data Set
- Customer files: customer information, disconnects, reconnections
- Time-dependent data: bills, payments, usage
- 100,000 customers over a 17-month period
- Stratified (hierarchical) sampling to ensure all groups are appropriately represented

3: Data Preparation
- Filtered out incomplete data
- Deleted inexpensive calls
  - Reduced data volume by about 50%
- Low number of fraudulent cases
- Cross-checked with phone disconnects
- Lagged data made synchronization necessary

Data Reduction & Projection
- Information grouped by account
- Customer data aggregated by 2-week periods
- Discriminant analysis on 23 categories
- Calculated average owed by category (significant)
- Identified extra charges (significant)
- Investigated payment by installments (not significant)

Choosing Data Mining Function
- Classes: most possibly solvent (99.3%); most possibly insolvent (0.7%)
- Costs of error widely different
- New data set created through stratified sampling
  - Retained all insolvent
  - Altered distribution to 90% solvent
  - Used 2,066 cases total
- Critical period identified: last 15 two-week periods before service interruption
- Variables defined by counting measures in two-week periods
- 46 variables as candidate discriminant factors
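The resampling step described here (keep every insolvent case, then draw enough solvent cases to reach a 90% solvent mix) could be sketched as below. The function name `rebalance`, the record layout, and the 1,000-customer population with 7 insolvent cases are hypothetical; only the 0.7% and 90% proportions come from the slide.

```python
import random


def rebalance(customers, solvent_share=0.9, seed=0):
    """Keep all insolvent customers and sample solvent ones to reach solvent_share."""
    insolvent = [c for c in customers if c["insolvent"]]
    solvent = [c for c in customers if not c["insolvent"]]
    # Solvent cases needed so that solvent / (solvent + insolvent) = solvent_share.
    needed = round(len(insolvent) * solvent_share / (1 - solvent_share))
    needed = min(needed, len(solvent))
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return insolvent + rng.sample(solvent, needed)


# Hypothetical population: 1,000 customers, 7 of them insolvent (0.7%).
population = [{"id": i, "insolvent": i < 7} for i in range(1000)]
sample = rebalance(population)

solvent_in_sample = sum(not c["insolvent"] for c in sample)
print(len(sample), solvent_in_sample / len(sample))
```

Oversampling the rare class this way gives the classifier enough insolvent examples to learn from, at the price of a training distribution that no longer matches the real one.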

4: Modeling
- Discriminant analysis: linear model; SPSS, stepwise forward selection
- Decision trees: rule-based classifier; C5, C4.5
- Neural networks: nonlinear model

Data Mining
- Training set about 2/3; the rest used for testing
- Discriminant analysis (used 17 variables)
  - Equal costs: 0.875 correct
  - Unequal costs: 0.930 correct
- Rule-based: 0.952 correct
- Neural network: 0.929 correct

5: Evaluation
- 1st objective: maximize accuracy of predicting insolvent customers
  - Decision tree classifier best
- 2nd objective: minimize error rate for solvent customers
  - Neural network model close to decision tree
- Used all 3 on a case-by-case basis

Coincidence Matrix: Combined Models
                   Model insolvent   Model solvent   Unclass   Totals
Actual insolvent          19               17           28        64
Actual solvent             1              626           27       654
Totals                    20              643           55       718

6: Implementation
- Every customer examined using all 3 algorithms
  - If all 3 agreed, used that classification
  - If there was disagreement, categorized as unclassified
- Correct on test data: 0.898
  - Only 1 actually solvent customer would have been disconnected
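The agreement rule and the 0.898 figure can both be sketched in Python. The function name `combine` and the label strings are hypothetical; the counts are the cells of the coincidence matrix for the combined models.

```python
def combine(predictions):
    """Return the shared label if every model agrees, otherwise 'unclassified'."""
    first = predictions[0]
    if all(p == first for p in predictions):
        return first
    return "unclassified"


print(combine(["solvent", "solvent", "solvent"]))
print(combine(["insolvent", "solvent", "insolvent"]))

# Coincidence matrix cells, keyed by (actual, combined model output).
matrix = {
    ("insolvent", "insolvent"): 19,
    ("insolvent", "solvent"): 17,
    ("insolvent", "unclassified"): 28,
    ("solvent", "insolvent"): 1,
    ("solvent", "solvent"): 626,
    ("solvent", "unclassified"): 27,
}

total = sum(matrix.values())
correct = sum(n for (actual, model), n in matrix.items() if actual == model)
rate = round(correct / total, 3)
print(total, correct, rate)
```

Counting unclassified cases as not correct is what yields 645 of 718, which matches the reported 0.898.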