Methodology Qiang Yang, MTM521 Material
A High-level Process View for Data Mining
1. Develop an understanding of the application, set goals, and lay down all questions a user might pose as queries.
2. Create a dataset for study (from a data warehouse, Web site, or surveys).
3. Data cleaning and preprocessing.
4. Data reduction and projection.
5. Choose the data mining task: black box or white box? Classification or clustering?
6. Choose data mining algorithms.
7. Use the algorithms to perform the task.
8. Interpret, evaluate, and cross-validate; iterate through 1-7 if necessary.
9. Deploy: integrate into operational systems, gather feedback, revise goals, and redo 1-9.

Case Study: German Bank Credit Application
Bank credit assessment
Decision: approve the loan or not
Usage: automatic online screening or human assistant
Objective: accurate prediction of values
Giving reasons behind the decision is important

Potential Queries
Who is likely to be approved for a loan?
What are the most important characteristics of an applicant to look at?
What are the most indicative features for yes/no answers?
What subset of customers should we market to, and what is the associated profit?
Added: what advice should we give an applicant to improve their chances in the future?

Create Data Set for Study
Access the bank data warehouse or conduct a customer survey
Must the cost of obtaining data be factored in?
How likely is it that quality data can be obtained in a limited amount of time?

Questions to be Asked
Attribute 1: (qualitative) Status of existing checking account
  A11: ... < 0 DM
  A12: 0 <= ... < 200 DM
  A13: ... >= 200 DM / salary assignments for at least 1 year
  A14: no checking account
Attribute 2: (numerical) Duration in months
Attribute 3: (qualitative) Credit history
  A30: no credits taken / all credits paid back duly
  A31: all credits at this bank paid back duly
  A32: existing credits paid back duly till now
  A33: delay in paying off in the past
  A34: critical account / other credits existing (not at this bank)
Attribute 4: (qualitative) Purpose
  A40: car (new)
  A41: car (used)
  A42: furniture/equipment
  A43: radio/television
  A44: domestic appliances
  A45: repairs
  A46: education
  A47: (vacation - does not exist?)
  A48: retraining
  A49: business
  A410: others
Attribute 5: (numerical) Credit amount
Attribute 6: (qualitative) Savings account/bonds
  A61: ... < 100 DM
  A62: 100 <= ... < 500 DM
  A63: 500 <= ... < 1000 DM
  A64: ... >= 1000 DM
  A65: unknown / no savings account

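The coded values above are easier to inspect once they are decoded into readable labels. The following is a minimal sketch of such a lookup for attributes 1 and 6 only; the field names (`checking_status`, `savings`, and so on) and the sample applicant are hypothetical, invented for illustration.

```python
# Code-to-label tables copied from the attribute list (attributes 1 and 6 only).
CHECKING_STATUS = {
    "A11": "< 0 DM",
    "A12": "0 <= ... < 200 DM",
    "A13": ">= 200 DM / salary assignments for at least 1 year",
    "A14": "no checking account",
}
SAVINGS = {
    "A61": "< 100 DM",
    "A62": "100 <= ... < 500 DM",
    "A63": "500 <= ... < 1000 DM",
    "A64": ">= 1000 DM",
    "A65": "unknown / no savings account",
}

# Hypothetical field names; the source only numbers the attributes.
TABLES = {"checking_status": CHECKING_STATUS, "savings": SAVINGS}


def decode_record(record):
    """Replace coded values with labels; leave numeric or unknown values as-is."""
    return {
        field: TABLES.get(field, {}).get(value, value)
        for field, value in record.items()
    }


# Hypothetical applicant row, invented for illustration.
applicant = {
    "checking_status": "A11",
    "duration_months": 24,
    "credit_amount": 5000,
    "savings": "A65",
}
decoded = decode_record(applicant)
print(decoded)
```

Note that code A65 mixes "unknown" with "no savings account", so it may need special handling as a missing-value marker during cleaning.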
Data Cleaning and Preprocessing
What should we do with missing values?
How do we fill in missing values, and how do we identify and correct incorrect values?
Do we know the cost of classification mistakes?
Do we know the cost of obtaining each feature?
How do we reduce noise?
What are the sources of noise for each attribute?

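One common answer to the missing-value question is simple imputation: fill numerical attributes with the median and qualitative attributes with the most frequent value. The sketch below assumes missing values are stored as `None`; the field names and rows are hypothetical, and this is only one of several reasonable strategies.

```python
from statistics import median, mode


def impute(rows, numeric_fields, categorical_fields):
    """Return copies of rows with None replaced by the median (numeric
    fields) or the most frequent value (categorical fields)."""
    fill = {}
    for field in numeric_fields:
        fill[field] = median([r[field] for r in rows if r[field] is not None])
    for field in categorical_fields:
        fill[field] = mode([r[field] for r in rows if r[field] is not None])
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical toy rows, invented for illustration.
rows = [
    {"duration_months": 12, "purpose": "A43"},
    {"duration_months": None, "purpose": "A43"},
    {"duration_months": 24, "purpose": None},
    {"duration_months": 36, "purpose": "A40"},
]
cleaned = impute(rows, ["duration_months"], ["purpose"])
print(cleaned)
```

The function returns new dictionaries rather than editing in place, so the raw data stays available for comparing imputation strategies.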
Rudimentary Analysis
What is the data distribution?
How can you view data from different angles?
What does the rudimentary data analysis tell you?
Are you satisfied with the analysis?
Are there more queries that you cannot answer through this analysis?

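A first look at the data distribution can be as simple as counting values per attribute and cross-tabulating them against the decision. The sketch below does this for the checking-account status; the rows and the `good`/`bad` labels are hypothetical, invented for illustration.

```python
from collections import Counter

# Hypothetical (checking_status, outcome) pairs, invented for illustration.
rows = [
    ("A11", "bad"),
    ("A11", "good"),
    ("A14", "good"),
    ("A14", "good"),
    ("A12", "bad"),
    ("A11", "bad"),
]

# Marginal distribution of the attribute.
counts = Counter(status for status, _ in rows)

# Joint distribution: one count per (status, outcome) cell.
crosstab = Counter(rows)


def bad_rate(status):
    """Fraction of applicants with this status whose outcome is 'bad'."""
    return crosstab[(status, "bad")] / counts[status]


for status in sorted(counts):
    print(status, counts[status], round(bad_rate(status), 2))
```

Viewing the same attribute both marginally and split by outcome is one way to "view data from different angles" before any modelling.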
Data reduction
How many data features do we want in the end?
Is it a data reduction problem or a data transformation problem?
Is it a supervised or an unsupervised data reduction problem?
Is it a linear or a nonlinear data reduction problem?

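As one illustration of supervised data reduction, features can be ranked by information gain with respect to the decision and only the top few kept. This is a sketch of that single technique, not a recommendation over the alternatives; the rows and field names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Reduction in target entropy obtained by splitting rows on feature."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy rows, invented for illustration.
rows = [
    {"checking": "A11", "purpose": "A40", "label": "bad"},
    {"checking": "A11", "purpose": "A43", "label": "bad"},
    {"checking": "A14", "purpose": "A40", "label": "good"},
    {"checking": "A14", "purpose": "A43", "label": "good"},
]
gains = {f: information_gain(rows, f, "label") for f in ("checking", "purpose")}
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked, gains)
```

In this toy data `checking` separates the classes perfectly while `purpose` carries no information, so a reduction to one feature would keep `checking`.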
Choose data mining task
Do we apply rule-based methods for better understanding?
Do we apply K nearest neighbor methods for dense data sets?
Do we apply SVM methods for accuracy, at the cost of a black-box model?
Is the final result (yes/no) important, or is the action important (what to do to reduce a customer's likelihood of being rejected)?

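To make the K nearest neighbor option concrete, here is a minimal sketch of a majority-vote classifier over two numerical attributes. The training points (duration in months, credit amount in thousands of DM) and labels are hypothetical, and the features are left unscaled for brevity, which a real study would likely need to address.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Label query by majority vote among its k nearest training points
    (Euclidean distance)."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical (duration_months, credit_amount_thousand_DM) points.
train = [
    ((6, 1.0), "good"),
    ((12, 2.0), "good"),
    ((10, 1.5), "good"),
    ((48, 9.0), "bad"),
    ((60, 12.0), "bad"),
    ((36, 8.0), "bad"),
]

print(knn_predict(train, (9, 1.2)))
print(knn_predict(train, (50, 10.0)))
```

A classifier like this returns a yes/no answer but gives little guidance on what an applicant could change, which is the trade-off the last question on the slide raises.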
Use Algorithm to Perform Task
Which hardware platform to use?
Which software platform to use?
Are speed and scale more important than visual effects?
Is data porting an important issue?
Is the API important, or only the final answer?
How much does each package cost?

Evaluation
Do we have separate training and testing data?
Is data scarce?
What kind of cross validation do we use? N folds, N=?
Bootstrapping or not?
Is ranking (lift, ROC) important, or is the confusion matrix important?

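The N-fold question and the confusion matrix can both be sketched in a few lines. The code below splits sample indices into N contiguous folds and tallies a two-class confusion matrix; the labels are hypothetical, and in practice the indices would usually be shuffled (or stratified by class) before splitting.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def confusion_matrix(actual, predicted, positive="good"):
    """Count true/false positives and negatives for a two-class problem."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(a == positive and p == positive for a, p in pairs),
        "fn": sum(a == positive and p != positive for a, p in pairs),
        "fp": sum(a != positive and p == positive for a, p in pairs),
        "tn": sum(a != positive and p != positive for a, p in pairs),
    }


folds = list(k_fold_indices(10, 3))

# Hypothetical labels, invented for illustration.
actual = ["good", "good", "bad", "bad", "good"]
predicted = ["good", "bad", "bad", "good", "good"]
matrix = confusion_matrix(actual, predicted)
print([len(test) for _, test in folds], matrix)
```

If the costs of a false approval and a false rejection are known, they can be applied to the `fp` and `fn` counts to compare models by total cost rather than raw accuracy.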
Interpretation
What do the results mean?
Do we need to support the causal effects of the final decisions?
Do we need to go back to experts in the domain of application?
Do we need visual effects or a ranking of the final results?

Iteration
After obtaining one set of results, do we need to return to the beginning to revise our objectives and obtain new data?
How many iterations are needed?
Is the process a one-shot or a continuous process?

Deployment Issues
Do we need to integrate with a real online banking system?
Do we need to provide an API for the software?
Do we need to use new data to supplement the training data set? If so, how often?