Data Mining Process A manifestation of best practices A systematic way to conduct DM projects Different groups has different versions Most common standard.

Slides:



Advertisements
Similar presentations
Environment case Episode 3 - CAATS II Final Dissemination Event Brussels, 13 & 14 Oct 2009 Hellen Foster, Jarlath Molloy NATS, Imperial College London.
Advertisements

Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
1 Lecture 2: Processes, Requirements, and Use Cases.
<<Date>><<SDLC Phase>>
Machine Learning and Data Mining Course Summary. 2 Outline  Data Mining and Society  Discrimination, Privacy, and Security  Hype Curve  Future Directions.
Fakultät für Informatik Technische Universität München A Quantitative Perspective on Systems of Systems Formerly: Upscaling for Systems of Systems Astrid,
Rational Unified Process
/faculteit technologie management Introduction to Data Mining a.j.m.m. (ton) weijters (slides are partially based on an introduction of Gregory Piatetsky-Shapiro)
Civil and Environmental Engineering Carnegie Mellon University Sensors & Knowledge Discovery (a.k.a. Data Mining) H. Scott Matthews April 14, 2003.
Mining the Data Ira M. Schoenberger, FACHCA Senior Administrator 2011 AHCA/NCAL Quality Symposium Friday February 18, 2011.
Business Systems Intelligence: 7. B.I. Methodologies Dr. Brian Mac Namee (
Knowledge Acquisitioning. Definition The transfer and transformation of potential problem solving expertise from some knowledge source to a program.
© Prentice Hall1 DATA MINING TECHNIQUES Introductory and Advanced Topics Eamonn Keogh (some slides adapted from) Margaret Dunham Dr. M.H.Dunham, Data Mining,
Lecture 13 Revision IMS Systems Analysis and Design.
Part II Tools for Knowledge Discovery. Knowledge Discovery in Databases Chapter 5.
CS590D: Data Mining Chris Clifton March 22, 2006 Data Mining Process Thanks to Laura Squier, SPSS for some of the material used.
SLIDE 1IS 257 – Fall 2008 Data Mining and the Weka Toolkit University of California, Berkeley School of Information IS 257: Database Management.
Overview of Software Requirements
Pertemuan Matakuliah: A0214/Audit Sistem Informasi Tahun: 2007.
Chapter 10: Architectural Design
Computer Science Universiteit Maastricht Institute for Knowledge and Agent Technology Data mining and the knowledge discovery process Summer Course 2005.
Codex Guidelines for the Application of HACCP
United Nations Economic Commission for Europe Statistical Division Applying the GSBPM to Business Register Management Steven Vale UNECE
LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.
Dr. Awad Khalil Computer Science Department AUC
More on Data Mining KDnuggets Datanami ACM SIGKDD
Chapter 10 Architectural Design
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
Understanding Data Analytics and Data Mining Introduction.
Overview of the Database Development Process
1 Validation & Verification Chapter VALIDATION & VERIFICATION Very Difficult Very Important Conceptually distinct, but performed simultaneously.
The CRISP-DM Process Model
ITEC224 Database Programming
1 Chapter 2 The Process. 2 Process  What is it?  Who does it?  Why is it important?  What are the steps?  What is the work product?  How to ensure.
Requirements Engineering
Chapter 6 : Software Metrics
1 Controversial Issues  Data mining (or simple analysis) on people may come with a profile that would raise controversial issues of  Discrimination 
Lecture 7: Requirements Engineering
1 Introduction to Software Engineering Lecture 1.
Software Engineering Prof. Ing. Ivo Vondrak, CSc. Dept. of Computer Science Technical University of Ostrava
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
| 1 › Matthias Galster, University of Groningen, NL › Armin Eberlein, American University of Sharjah, UAE Facilitating Software Architecting by.
Search Engine Optimization © HiTech Institute. All rights reserved. Slide 1 What is Solution Assessment & Validation?
1 Chapter 3 1.Quality Management, 2.Software Cost Estimation 3.Process Improvement.
Part II Tools for Knowledge Discovery Ch 5. Knowledge Discovery in Databases Ch 6. The Data Warehouse Ch 7. Formal Evaluation Technique.
Process Quality in ONS Rachel Skentelbery, Rachael Viles & Sarah Green
Software Architecture Evaluation Methodologies Presented By: Anthony Register.
MODEL-BASED SOFTWARE ARCHITECTURES.  Models of software are used in an increasing number of projects to handle the complexity of application domains.
October 2-3, 2015, İSTANBUL Boğaziçi University Prof.Dr. M.Erdal Balaban Istanbul University Faculty of Business Administration Avcılar, Istanbul - TURKEY.
WERST – Methodology Group
Requirements Engineering Processes. Syllabus l Definition of Requirement engineering process (REP) l Phases of Requirements Engineering Process: Requirements.
Data Mining and Decision Support
Modelling the Process and Life Cycle. The Meaning of Process A process: a series of steps involving activities, constrains, and resources that produce.
ANALYSIS PHASE OF BUSINESS SYSTEM DEVELOPMENT METHODOLOGY.
Data Mining Copyright KEYSOFT Solutions.
CRISP-DM Tommy Wei Cory Hutchinson ISDS Overview What is CRISP-DM (CRoss Industry Standard Process for Data Mining) Blueprint Phases and Tasks Summary.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Chapter Two Copyright © 2006 McGraw-Hill/Irwin The Marketing Research Process.
The Software Lifecycle Stuart Faulk. Definition Software Life Cycle: evolution of a software development effort from concept to retirement Life Cycle.
Chapter (12) – Old Version
DATA MINING © Prentice Hall.
A Methodology for Finding Bad Data
Generic Statistical Business Process Model (GSBPM)
A Unifying View on Instance Selection
Regulatory Oversight of HOF in Finland
CRISP Process Stephen Wyrick.
Practical Database Design and Tuning Objectives
Presentation transcript:

Data Mining Process A manifestation of best practices A systematic way to conduct DM projects Different groups has different versions Most common standard processes: – CRISP-DM (Cross-Industry Standard Process for Data Mining) – SEMMA (Sample, Explore, Modify, Model, and Assess) – KDD (Knowledge Discovery in Databases)

What is CRISP-DM? Cross-Industry Standard Process for Data Mining Aim: – To develop an industry, tool and application neutral process for conducting Knowledge Discovery – Define tasks, outputs from these tasks, terminology and mining problem type characterization

Data Mining Process: CRISP-DM Not linear, repeatedly backtracking

Data Mining Process: CRISP-DM Step 1: Business Understanding Step 2: Data Understanding Step 3: Data Preparation (!) Step 4: Model Building Step 5: Testing and Evaluation Step 6: Deployment The process is highly repetitive and experimental (DM: art versus science?) Accounts for ~85% of total project time

Business Understanding Phase Understand the business objectives – Understand business processes – Define the success criteria – Develop a glossary of terms – Cost/Benefit Analysis Current Systems Assessment – Identify the key actors – What forms should the output take? – Integration of output with existing technology landscape – Understand market norms and standards

Business Understanding Phase Task Decomposition – Break down the objective into sub-tasks – Map sub-tasks to data mining problem definitions Identify Constraints – Resources – Law e.g. Data Protection Build a project plan – List assumptions and risk (technical/financial/business/ organisational) factors

Data Understanding Phase Define data sources – What are the data sources? Internal and External Sources Depend on a domain expert Accessibility issues – Legal and technical – Are there issues regarding data distribution across different databases/legacy systems Where are the disconnects?

Data Understanding Phase II Data Description – Document data quality issues requirements for data preparation – Compute basic statistics Data Exploration – Investigate attribute interactions – Data Quality Issues Missing Values – Understand its source: Missing vs Null values Strange Distributions

Data Preparation Phase Integrate Data – Joining multiple data tables – Summarisation/aggregation of data Select Data – Attribute subset selection – Data sampling Training/Validation and Test sets

Data Preparation Phase II Data Transformation – Using functions such as log – Factor/Principal Components analysis – Normalization/Discretisation/Binarisation Clean Data – Handling missing values/Outliers Data Construction – Derived Attributes

Data Preparation – A Critical DM Task

The Modelling Phase Select of the appropriate modelling technique – Data pre-processing implications Attribute independence Data types/Normalisation/Distributions – Dependent on Data mining problem type Output requirements Develop a testing regime – Sampling Verify samples have similar characteristics and are representative of the population

The Modelling Phase Build Model – Choose initial parameter settings – Study model behaviour Sensitivity analysis Assess the model – Beware of over-fitting – Investigate the error distribution Identify segments of the state space where the model is less effective – Iteratively adjust parameter settings Document reasons of these changes

The Evaluation Phase Validate Model – Human evaluation of results by domain experts – Evaluate usefulness of results from business perspective Define control groups Expected Return on Investment Review Process Determine next steps – Potential for deployment – Deployment architecture – Metrics for success of deployment

The Deployment Phase Knowledge Deployment is specific to objectives – Knowledge Presentation – Integration with the current IT infrastructure – Generation of a report Online/Offline – Monitoring and evaluation of effectiveness Process deployment/production Produce final project report – Document everything along the way