Chapter 2 Data Mining Processes and Knowledge Discovery

1 Chapter 2 Data Mining Processes and Knowledge Discovery
Identify actionable results

2 Contents
Describes the Cross-Industry Standard Process for Data Mining (CRISP-DM), a set of phases that can be used in data mining studies
Discusses each phase in detail
Gives an example illustration
Discusses a knowledge discovery process

3 CRISP-DM
Cross-Industry Standard Process for Data Mining
One of the first comprehensive attempts toward a standard process model for data mining
Independent of industry sector & technology

4 CRISP-DM Phases
A systematic process for making sense of the massive amounts of data generated by daily operations:
Business (or problem) understanding
Data understanding
Data preparation – transform & create the data set for modeling
Modeling
Evaluation – check that the models are good and that nothing has been missed
Deployment

5 Business Understanding
Solve a specific problem
Determine business objectives, assess the current situation, establish data mining goals, and develop a project plan
A clear problem definition helps, with measurable success criteria
Convert business objectives into a set of data-mining goals: what to achieve in technical terms, such as
  What types of customers are interested in each of our products?
  What are typical profiles of customers …

6 Data Understanding
Initial data collection, data description, data exploration, and verification of data quality
Three issues considered in data selection:
Set up a concise and clear description of the problem. For example, a retail DM project may seek to identify spending behaviors of female shoppers who purchase seasonal clothes.
Identify the data relevant to the problem description, such as demographic, credit card transactional, and financial data …
Select the variables that are relevant and important for the project.

7 Data Understanding (cont.)
Data types:
Demographic data (income, education, age, …)
Socio-graphic data (hobby, club membership, …)
Transactional data (sales records, credit card spending, …)
Quantitative data: measurable using numerical values
Qualitative data: also known as categorical data; contains both nominal and ordinal data (see also page 22)
Related data can come from many sources:
Internal – ERP (or MIS), data warehouse
External – government data, commercial data
Created – research

8 Data Preparation
Once the available data sources are identified, the data need to be selected, cleaned, and built into the desired format
Clean data: fix formats, fill gaps, filter outliers & redundancies (see page 22)
Unify numerical scales:
Nominal data – code (such as gender data: male and female)
Ordinal data – nominal code or scale (excellent, fair, poor)
Cardinal data – categorical levels (A, B, C)

9 Types of Data
Type – Features – Synonyms:
Numerical (continuous, integer) – Range
Binary (Yes/No) – Flag
Categorical (finite set) – Set
Date/Time, String
Typeless – Text

Range: numeric values (integer, real, or date/time)
Set: data with distinct multiple values (numeric, string, or date/time)
Typeless: for other types of data

10 Data Preparation (Cont.)
Several statistical methods and visualization tools can be used to preprocess the selected data:
Statistics such as max, min, mean, and mode can be used to aggregate or smooth the data
Scatter plots and box plots can be used to filter outliers (a sketch follows below)
More advanced techniques, such as regression analysis, cluster analysis, decision trees, or hierarchical analysis, may also be applied in data preprocessing
In some cases, data preprocessing can take over 50% of the time of the entire data mining process, so shortening it can reduce much of the total computation time in data mining
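
A minimal sketch of the box-plot outlier rule mentioned above, using pandas; the column name "amount" and the numbers are made up for illustration.

```python
import pandas as pd

def filter_outliers_iqr(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop rows whose value falls outside the box-plot whiskers (1.5 * IQR)."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Hypothetical usage: remove extreme spending amounts before modeling.
spending = pd.DataFrame({"amount": [12, 15, 14, 13, 900, 16, 11]})
cleaned = filter_outliers_iqr(spending, "amount")   # the 900 row is dropped
```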

11 Data Preparation – data transformation
Data transformation uses simple mathematical formulations or learning curves to convert different measurements of the selected, cleaned data into a unified numerical scale for analysis. Data transformation can be used to:
Transform from one numerical scale to another, to shrink or enlarge the given data, such as (x − min)/(max − min) to shrink the data into the interval [0, 1]
Recode categorical data to numerical scales. Categorical data can be ordinal (less, moderate, strong) or nominal (red, yellow, blue, …), e.g. 1 = yes, 0 = no. See page 24 for more details.
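
A minimal sketch of the two transformations described above (min-max rescaling into [0, 1] and recoding a yes/no attribute); the income values are illustrative only.

```python
def min_max_scale(values):
    """Rescale numeric data into [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def recode_binary(values, mapping=None):
    """Recode nominal categories (e.g. yes/no answers) to numeric codes."""
    mapping = mapping or {"yes": 1, "no": 0}
    return [mapping[v.lower()] for v in values]

incomes = [18000, 42000, 30000, 66000]
print(min_max_scale(incomes))          # [0.0, 0.5, 0.25, 1.0]
print(recode_binary(["Yes", "No"]))    # [1, 0]
```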

12 Modeling
Data modeling is where the data mining software is used to generate results for the various situations. Data visualization and cluster analysis are useful for initial analysis. Depending on the data type and task: if the task is to group data, discriminant analysis is applied; if the purpose is estimation, regression is appropriate when the data are continuous (and logistic regression when they are not). Neural networks can be applied to both tasks.
Data Treatment
Training set – for development of the model
Test set – for testing the model that is built
Maybe others – for refining the model
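
A minimal sketch of the data treatment idea above: shuffle the records and split them into training, test, and an optional refinement set. The fractions and the seed are arbitrary choices, not values from the source.

```python
import random

def split_dataset(records, train_frac=0.7, test_frac=0.2, seed=42):
    """Shuffle and split records into training, test, and refinement subsets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    refine = shuffled[n_train + n_test:]          # "maybe others for refining"
    return train, test, refine

cases = list(range(100))                          # stand-in for real records
train, test, refine = split_dataset(cases)        # roughly a 70 / 20 / 10 split
```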

13 Data mining techniques
Association: the relationship of a particular item in a data transaction to other items in the same transaction is used to predict patterns. See also page 25 for an example.
Classification: methods intended for learning functions that map each item of the selected data into one of a predefined set of classes. Two key research problems related to classification results are the evaluation of misclassification and predictive power (e.g. C4.5). Mathematical models often used to construct classification methods are binary decision trees (CART), neural networks (nonlinear), linear programming (boundary), and statistics. See also pages 25–26 for more explanation.
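
As one concrete instance of the classification methods named above, here is a minimal decision tree sketch using scikit-learn (a library choice of this note, not something named in the source); the tiny coded dataset is made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: [age_code, income_code, student, credit] -> buys computer?
X = [[2, 3, 0, 1], [1, 2, 0, 1], [3, 1, 1, 2], [2, 2, 1, 1], [1, 3, 0, 2]]
y = [1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict([[2, 2, 0, 1]]))   # predicted class for a new customer
```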

14 Data mining techniques (Cont.)
Clustering: takes ungrouped data and uses automatic techniques to put the data into groups. Clustering is unsupervised and does not require a learning set. (Chapter 5)
Prediction: related to regression techniques; seeks to discover the relationship between dependent and independent variables.
Sequential patterns: seeks to find similar patterns in data transactions over a business period. The mathematical models behind sequential patterns include logic rules, fuzzy logic, and so on.
Similar time sequences: applied to discover sequences similar to a known sequence over both past and current business periods.
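
A minimal, unsupervised clustering sketch using k-means (one common clustering technique; the source only describes clustering generically and covers methods in Chapter 5). The two-dimensional points are invented for illustration.

```python
from sklearn.cluster import KMeans

points = [[1.0, 1.2], [0.9, 1.1], [5.0, 5.2], [5.1, 4.9], [9.0, 0.5]]
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
print(labels)   # cluster index assigned to each point; no learning set required
```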

15 Evaluation
Does the model meet the business objectives?
Are any important business objectives not addressed?
Does the model make sense?
Is the model actionable?
(Compare the PDCA cycle with CRISP-DM.)

16 Deployment
DM can be used to verify previously held hypotheses or for knowledge discovery
DM models can be applied to business purposes, including prediction or identification of key situations
Ongoing monitoring & maintenance
Evaluate performance against the success criteria
Watch market reaction & competitor changes (remodel or fine-tune as needed)

17 Example: Training Set for Computer Purchase
16 records, 5 attributes
Goal: find a classifier for consumer behavior

18 Database (1st half)
[Table: cases A1–A8 with attributes Age (≤30, 31–40, >40), Income (High/Medium/Low), Student (Yes/No), Credit (Fair/Excellent), Gender (Male/Female), and the target Buy? (Yes/No)]

19 Database (2nd half)
[Table: cases A9–A16 with the same attributes; A15 has an unknown Income value and A16 has Credit marked N/A]

20 Data Selection
Gender has a weak relationship with purchase (based on correlation), so drop Gender
Selected attribute set: {Age, Income, Student, Credit}
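
A sketch of the kind of correlation check used to drop Gender; the 0/1 codings and the records below are hypothetical, not the actual A1–A16 data.

```python
import pandas as pd

# Hypothetical coded records: gender 1 = male, 0 = female; buy 1 = yes, 0 = no.
df = pd.DataFrame({
    "gender": [1, 0, 1, 0, 1, 0, 1, 0],
    "buy":    [1, 1, 0, 0, 1, 1, 0, 0],
})
corr = df["gender"].corr(df["buy"])   # Pearson correlation with the target
print(round(corr, 2))                 # near zero -> weak relationship, drop Gender
```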

21 Data Preprocessing
Income unknown in case A15
Credit not available in case A16
Drop these two noisy cases

22 Data Transformation
Assign numerical values to each attribute:
Age: ≤30 = 3, 31–40 = 2, >40 = 1
Income: High = 3, Medium = 2, Low = 1
Student: Yes = 2, No = 1
Credit: Excellent = 2, Fair = 1
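
A sketch of this numeric coding applied to one case; the dictionaries simply mirror the assignments above, and the sample record stands in for case A1.

```python
codes = {
    "Age":     {"≤30": 3, "31-40": 2, ">40": 1},
    "Income":  {"High": 3, "Medium": 2, "Low": 1},
    "Student": {"Yes": 2, "No": 1},
    "Credit":  {"Excellent": 2, "Fair": 1},
}

def transform(record):
    """Convert one raw case into its numeric attribute vector."""
    return [codes[attr][record[attr]] for attr in ("Age", "Income", "Student", "Credit")]

case_a1 = {"Age": "31-40", "Income": "High", "Student": "No", "Credit": "Fair"}
print(transform(case_a1))   # [2, 3, 1, 1]
```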

23 Data Mining
Categorize the output: buys = C1, doesn't buy = C2
Conduct the analysis: the model says A8 and A10 don't buy; the rest do
Of the actual yes cases, 7 classified correctly and 1 not; of the actual no cases, 2 classified correctly
Summarize the results in a confusion matrix

24 Data Interpretation and Test Data Set
Test the model on independent data (cases B1–B10):
B1–B7: actual Yes, model Yes
B8–B10: actual No (do not buy); the model classifies two of these correctly as No and one as Buy (see the confusion matrix on the next slide)

25 Confusion Matrix
              Model Buy   Model Not   Totals
Actual Buy        7           0          7
Actual Not        1           2          3
Totals            8           2         10   (9 right)

26 Measures
Correct classification rate: 9/10 = 0.90
Cost function (cost of errors):
  model says buy, actual no – $20
  model says no, actual buy – $200
Total cost: 1 × $20 + 0 × $200 = $20
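
A short sketch recomputing the correct classification rate and the cost function directly from the confusion matrix above.

```python
# Confusion matrix from the test data (keys: (actual, predicted)).
matrix = {("buy", "buy"): 7, ("buy", "not"): 0,
          ("not", "buy"): 1, ("not", "not"): 2}

total = sum(matrix.values())
correct = matrix[("buy", "buy")] + matrix[("not", "not")]
accuracy = correct / total                               # 9 / 10 = 0.90

# Error costs: model says buy but actual no -> $20; model says no but actual buy -> $200.
cost = matrix[("not", "buy")] * 20 + matrix[("buy", "not")] * 200
print(accuracy, cost)                                    # 0.9 20
```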

27 Goals
Avoid broad concepts: gain insight; discover meaningful patterns; learn interesting things – attainment of such goals can't be measured
Narrow and specify: identify customers likely to renew; reduce churn; rank order by propensity to …

28 Goals
Description (what is): understand, explain, discover knowledge
Prescription (what should be done): classify, predict

29 Goal
Method A: four rules, explains 70%
Method B: fifty rules, explains 72%
Which is best? It depends on the goal:
Gain understanding: Method A is better (minimum description length, MDL)
Reduce cost of mailing: Method B is better

30 Measurement
Accuracy: how well does the model describe the observed data?
Confidence levels: the proportion of the time the value falls between the lower and upper limits
Comprehensibility: of the whole model or of its parts?

31 Measuring Predictive Accuracy
Classification & prediction:
error rate = incorrect / total (requires that the evaluation set be representative)
Estimators based on (predicted − actual): MAD, MSE, MAPE
variance = Σ(predicted − actual)² / n
standard deviation = square root of the variance
distance – how far off the prediction is
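
A minimal sketch of the error estimators listed above; the predicted and actual values in the usage line are invented.

```python
def error_estimators(predicted, actual):
    """Return MAD, MSE (variance of the errors), RMSE, and MAPE."""
    errors = [p - a for p, a in zip(predicted, actual)]
    n = len(errors)
    mad = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n          # variance of the errors
    rmse = mse ** 0.5                              # standard deviation
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    return mad, mse, rmse, mape

print(error_estimators([110, 95, 102], [100, 100, 100]))
```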

32 Statistics
Population – the entire group studied
Sample – a subset drawn from the population
Bias – the difference between the sample average & the population average
Mean, median, mode
Distribution, significance
Correlation, regression (Hamming distance)

33 Classification Models
LIFT = probability in class for the sample divided by probability in class for the population
Example: if the population probability is 20% and the sample probability is 30%, LIFT = 0.3/0.2 = 1.5
The best lift is not necessarily the best model; a sufficient sample size is needed as confidence increases
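
A one-line illustration of the lift calculation above.

```python
def lift(sample_rate, population_rate):
    """Lift = response rate in the targeted sample / response rate in the population."""
    return sample_rate / population_rate

print(lift(0.30, 0.20))   # 1.5, matching the example above
```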

34 Lift Chart

35 Measuring Impact
Ideal: dollar value (e.g. NPV) generated by the expenditure
A mass mailing may sometimes be better than a targeted one, depending on:
fixed cost
cost per recipient
cost per respondent
value of a positive response

36 Bottom Line: Return on investment

37 Example Application
Telephone industry
Problem: unpaid bills
Data mining used to develop models to predict nonpayment as early as possible
See page 27

38 Knowledge Discovery Process
1. Data Selection – learning the application domain; creating the target data set
2. Data Preprocessing – data cleaning & preprocessing
3. Data Transformation – data reduction & projection
4. Data Mining – choosing the function; choosing the algorithms; mining the data
5. Data Interpretation – interpretation; using the discovered knowledge

39 1: Business Understanding
Predict which customers will become insolvent, in time for the firm to take preventive measures (and avert losing good customers)
Hypothesis: insolvent customers change their calling habits & phone usage during a critical period before & immediately after the termination of the billing period

40 2: Data Understanding
Static customer information available in files
Bills, payments, usage
Used a data warehouse to gather & organize the data
Data coded to protect customer privacy

41 Creating the Target Data Set
Customer files: customer information, disconnects, reconnections
Time-dependent data: bills, payments, usage
100,000 customers over a 17-month period
Stratified (hierarchical) sampling to assure all groups are appropriately represented

42 3: Data Preparation
Filtered out incomplete data
Deleted inexpensive calls
Reduced data volume by about 50%
Low number of fraudulent cases – cross-checked with phone disconnects
Lagged data made synchronization necessary

43 Data Reduction & Projection
Information grouped by account
Customer data aggregated by 2-week periods
Discriminant analysis on 23 categories:
  calculated average amount owed by category (significant)
  identified extra charges (significant)
  investigated payment by installments (not significant)

44 Choosing Data Mining Function
Classes: most probably solvent (99.3%) vs. most probably insolvent (0.7%)
Costs of the two types of error are widely different
A new data set was created through stratified sampling (see the sketch below):
  retained all insolvent cases
  altered the distribution to 90% solvent
  used 2,066 cases in total
Critical period identified: the last 15 two-week periods before service interruption
Variables defined by counting measures in two-week periods; 46 variables used as candidate discriminant factors
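
A sketch of the resampling step described above: keep every insolvent case and subsample solvent cases until they make up about 90% of the new data set. The record structure (a "solvent" flag per case) is an assumption for illustration.

```python
import random

def rebalance(cases, solvent_share=0.90, seed=1):
    """Keep all insolvent cases; sample solvent cases to reach the target share."""
    insolvent = [c for c in cases if not c["solvent"]]
    solvent = [c for c in cases if c["solvent"]]
    n_solvent = int(len(insolvent) * solvent_share / (1 - solvent_share))
    rng = random.Random(seed)
    return insolvent + rng.sample(solvent, min(n_solvent, len(solvent)))

# Hypothetical usage: raw data ~0.7% insolvent; the rebalanced set is ~10% insolvent.
```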

45 4: Modeling
Discriminant analysis – linear model (SPSS, stepwise forward selection)
Decision trees – rule-based classifiers (C4.5, C5)
Neural networks – nonlinear model

46 Data Mining
Training set: about 2/3rds of the cases; the rest used for testing
Discriminant analysis used 17 variables
Equal costs – correct
Unequal costs – correct
Rule-based – correct
Neural network – correct

47 5: Evaluation
1st objective: maximize the accuracy of predicting insolvent customers – the decision tree classifier was best
2nd objective: minimize the error rate for solvent customers – the neural network model was close to the decision tree
All 3 models were used on a case-by-case basis

48 Coincidence Matrix – Combined Models
                   Model insolvent   Model solvent   Unclassified   Totals
Actual insolvent         19               17              28           64
Actual solvent            1              626              27          654
Totals                    20              643              55          718

49 6: Implementation
Every customer examined using all 3 algorithms
If all 3 agreed, that classification was used; if they disagreed, the customer was categorized as unclassified
Correct on test data: (19 + 626)/718 = 0.898
Only 1 actually solvent customer would have been disconnected
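
A sketch of the case-by-case combination rule described above: accept a label only when all three classifiers agree, otherwise leave the customer unclassified. The model objects and their predict interface are assumptions, not the study's actual code.

```python
def combined_label(case, models):
    """Return the agreed class, or None (unclassified) if the models disagree."""
    votes = {model.predict(case) for model in models}
    return votes.pop() if len(votes) == 1 else None

# Hypothetical usage, assuming three fitted models with a predict(case) method:
# labels = [combined_label(c, [discriminant, tree, neural_net]) for c in test_cases]
```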

