Methodology. Qiang Yang, MTM521 Material.


A High-level Process View for Data Mining
1. Develop an understanding of the application, set goals, and lay down all the questions a user might pose as queries
2. Create a dataset for study (from a data warehouse, Web site, or surveys)
3. Data cleaning and preprocessing
4. Data reduction and projection
5. Choose the data mining task: black box or white box? Classification or clustering?
6. Choose the data mining algorithms
7. Use the algorithms to perform the task
8. Interpret, evaluate and cross-validate, and iterate through steps 1-7 if necessary
9. Deploy: integrate into operational systems, gather feedback, revise goals and redo steps 1-9

Case Study: German Bank Credit Application
- Bank credit assessment
- Decision: approve or reject the loan
- Usage: automatic online screening or human assistant
Objective:
- Accurate prediction of values
- Giving the reasons behind a decision is important

Potential Queries
- Who is likely to be approved for a loan?
- What are the most important characteristics of an applicant to look at?
- What are the most indicative features for yes/no answers?
- Which subset of customers should we market to, and what is the associated profit?
- Added: what advice should we give an applicant to improve their chances in the future?

Create Data Set for Study
- Access the bank's data warehouse or conduct a customer survey
  - Must the cost of obtaining data be factored in?
  - How likely is it that quality data can be obtained in a limited amount of time?

Questions to be Asked

Attribute 1: (qualitative) Status of existing checking account
  A11 : ... < 0 DM
  A12 : 0 <= ... < 200 DM
  A13 : ... >= 200 DM / salary assignments for at least 1 year
  A14 : no checking account

Attribute 2: (numerical) Duration in months

Attribute 3: (qualitative) Credit history
  A30 : no credits taken / all credits paid back duly
  A31 : all credits at this bank paid back duly
  A32 : existing credits paid back duly till now
  A33 : delay in paying off in the past
  A34 : critical account / other credits existing (not at this bank)

Attribute 4: (qualitative) Purpose
  A40 : car (new)
  A41 : car (used)
  A42 : furniture/equipment
  A43 : radio/television
  A44 : domestic appliances
  A45 : repairs
  A46 : education
  A47 : (vacation - does not exist?)
  A48 : retraining
  A49 : business
  A410 : others

Attribute 5: (numerical) Credit amount

Attribute 6: (qualitative) Savings account/bonds
  A61 : ... < 100 DM
  A62 : 100 <= ... < 500 DM
  A63 : 500 <= ... < 1000 DM
  A64 : ... >= 1000 DM
  A65 : unknown / no savings account
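The coded values above are compact but hard to read during analysis. A minimal sketch of decoding them follows, covering only attributes 1 and 6. The record layout and the field names (checking_status, duration_months, credit_amount, savings) are assumptions made for illustration and are not part of the original material.

```python
# Code tables copied from the attribute list above (attributes 1 and 6 only).
CHECKING_STATUS = {
    "A11": "... < 0 DM",
    "A12": "0 <= ... < 200 DM",
    "A13": "... >= 200 DM / salary assignments for at least 1 year",
    "A14": "no checking account",
}
SAVINGS = {
    "A61": "... < 100 DM",
    "A62": "100 <= ... < 500 DM",
    "A63": "500 <= ... < 1000 DM",
    "A64": "... >= 1000 DM",
    "A65": "unknown / no savings account",
}

# Hypothetical field names mapped to their code tables.
TABLES = {"checking_status": CHECKING_STATUS, "savings": SAVINGS}


def decode(record):
    """Replace coded values with descriptions; leave other fields unchanged."""
    decoded = {}
    for field, value in record.items():
        table = TABLES.get(field)
        # Unknown codes are passed through rather than guessed.
        decoded[field] = table.get(value, value) if table else value
    return decoded


# Hypothetical applicant record, for illustration only.
applicant = {
    "checking_status": "A11",
    "duration_months": 24,
    "credit_amount": 3000,
    "savings": "A65",
}
decoded = decode(applicant)
print(decoded)
```

Keeping the code tables as plain dictionaries makes it easy to add the remaining attributes in the same way.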

Data Cleaning and Preprocessing
- What do we do with missing values?
  - How do we fill in missing values, and how do we identify and correct incorrect values?
- Do we know the cost of classification mistakes?
- Do we know the cost of obtaining each feature?
- How do we reduce noise? What are the sources of noise for each attribute?
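Two of the questions above can be made concrete with a small sketch: filling in missing values, and pricing classification mistakes. The example below assumes median imputation for numerical fields and mode imputation for qualitative fields, which is only one of several possible strategies. The toy records, the field names, and the cost values are all hypothetical.

```python
from statistics import median, mode

# Toy records with hypothetical values; None marks a missing entry.
records = [
    {"duration_months": 12, "purpose": "A40"},
    {"duration_months": None, "purpose": "A43"},
    {"duration_months": 36, "purpose": None},
    {"duration_months": 24, "purpose": "A43"},
]


def impute(rows, numeric_fields, categorical_fields):
    """Fill numeric gaps with the median and categorical gaps with the mode."""
    fill = {}
    for f in numeric_fields:
        fill[f] = median([r[f] for r in rows if r[f] is not None])
    for f in categorical_fields:
        fill[f] = mode([r[f] for r in rows if r[f] is not None])
    return [{k: (fill[k] if v is None else v) for k, v in r.items()} for r in rows]


cleaned = impute(records, ["duration_months"], ["purpose"])

# Hypothetical misclassification costs, keyed by (actual, predicted).
# Here approving a bad applicant is assumed to cost more than rejecting a good one.
COST = {
    ("good", "good"): 0,
    ("good", "bad"): 1,
    ("bad", "good"): 5,
    ("bad", "bad"): 0,
}


def total_cost(actual, predicted):
    """Sum the cost of every prediction under the COST table."""
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))


cost = total_cost(["good", "bad", "good", "bad"], ["good", "good", "bad", "bad"])
print(cleaned)
print(cost)
```

If the costs of the two kinds of mistake differ, comparing models by total cost rather than raw accuracy may lead to a different choice.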

Rudimentary Analysis
- What is the data distribution?
- How can you view data from different angles?
- What does the rudimentary data analysis tell you?
- Are you satisfied with the analysis?
- Are there more queries that you cannot answer through this analysis?
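A first look at the data distribution can be as simple as counting values and cross-tabulating them against the decision. The sketch below does this for the credit history attribute. The sample pairs and the decision labels are made up for illustration.

```python
from collections import Counter

# Hypothetical (credit_history_code, decision) pairs.
sample = [
    ("A32", "approve"),
    ("A34", "approve"),
    ("A32", "reject"),
    ("A33", "reject"),
    ("A32", "approve"),
    ("A30", "reject"),
]

# Distribution of a single attribute.
history_counts = Counter(code for code, _ in sample)

# Joint distribution of attribute value and decision (a simple cross-tab).
crosstab = Counter(sample)


def approval_rate(code):
    """Fraction of applicants with this credit history code who were approved."""
    total = history_counts[code]
    return crosstab[(code, "approve")] / total if total else 0.0


for code in sorted(history_counts):
    print(code, history_counts[code], f"{approval_rate(code):.2f}")
```

Viewing the same counts per attribute and per decision is one way of looking at the data "from different angles".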

Data Reduction
- How many data features do we want in the end?
- Is it a data reduction problem or a data transformation problem?
- Is it a supervised or an unsupervised data reduction problem?
- Is it a linear or a nonlinear data reduction problem?
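As one illustration of supervised data reduction, features can be ranked by how much they tell us about the decision, keeping only the top few. The sketch below uses information gain as the ranking score. This is an assumed choice for the example, not a method prescribed by the slides, and the toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy after splitting on a qualitative feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy columns: one decision column and two candidate features.
decision = ["approve", "approve", "reject", "reject"]
features = {
    "checking_status": ["A13", "A13", "A11", "A11"],
    "purpose": ["A40", "A43", "A40", "A43"],
}

gains = {name: information_gain(col, decision) for name, col in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(gains)
print(ranking)
```

An unsupervised alternative would ignore the decision column and look only at the features themselves, for example by dropping near-constant columns.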

Choose Data Mining Task
- Do we apply rule-based methods for better understanding?
- Do we apply K nearest neighbor methods for dense data sets?
- Do we apply SVM methods for accuracy, accepting black-box models?
- Is the final result (yes/no) important, or is the action important (what to do to reduce a customer's likelihood of being rejected)?
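To make the K nearest neighbor option concrete, here is a minimal sketch that classifies an applicant by majority vote among the k closest training examples. The two features (duration in months, credit amount in thousands of DM), the training points, and the labels are hypothetical. In practice the features would likely need scaling so that one of them does not dominate the distance. The code uses math.dist, which requires Python 3.8 or later.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to the query.

    train is a list of (feature_tuple, label) pairs.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (duration_months, credit_amount_kDM) -> decision.
train = [
    ((6, 1.0), "approve"),
    ((12, 2.0), "approve"),
    ((9, 1.5), "approve"),
    ((48, 9.0), "reject"),
    ((36, 8.0), "reject"),
    ((60, 12.0), "reject"),
]

short_small = knn_predict(train, (10, 1.8))
long_large = knn_predict(train, (40, 8.5))
print(short_small, long_large)
```

Note that this method returns a decision but no explicit reason for it, which matters if explaining the decision to the applicant is one of the goals.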

Use Algorithm to Perform Task
- Which hardware platform to use?
- Which software platform to use?
- Are speed and scale more important than visual effects?
- Is data porting an important issue?
- Is the API important, or only the final answer?
- How much does each package cost?

Evaluation
- Do we have separate training and testing data?
- Is data scarce?
- What kind of cross-validation do we use?
  - N folds, N=?
  - Bootstrapping or not?
- Is ranking (lift, ROC) important, or is the confusion matrix important?
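The evaluation choices above can be sketched in a few lines: N-fold splitting, bootstrap sampling, and a confusion matrix. The helper names, the fixed random seed, and the approve/reject labels are assumptions made for this example.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def bootstrap_indices(n, seed=0):
    """Sample n indices with replacement; the rest form the out-of-bag set."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(sample))
    return sample, out_of_bag


def confusion_matrix(actual, predicted, labels=("approve", "reject")):
    """Return {(actual, predicted): count} for every pair of labels."""
    counts = {(a, p): 0 for a in labels for p in labels}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


splits = list(k_fold_indices(10, 5))
sample, out_of_bag = bootstrap_indices(10)
matrix = confusion_matrix(
    ["approve", "approve", "reject", "reject"],
    ["approve", "reject", "reject", "reject"],
)
print(matrix)
```

With scarce data, more folds leave more examples for training in each round, at the price of more training runs.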

Interpretation
- What do the results mean?
- Do we need to support the causal effect of the final decisions?
- Do we need to go back to experts in the domain of application?
- Do we need visual effects or a ranking of the final results?

Iteration
- After obtaining one set of results, do we need to return to the beginning to revise our objectives and obtain new data?
- How many iterations are needed?
- Is the process a one-shot or a continuous process?

Deployment Issues
- Do we need to integrate with a real online banking system?
- Do we need to provide an API for the software?
- Do we need to use new data to supplement the training data set? If so, how often?