Data Preparation as a Process Markku Ursin

Introduction
– Purpose: make the data more accessible to the mining tool
– There are no magical general-purpose techniques; preparation is half art, half science
– Knowing the limitations and correct use of the techniques is more important than thoroughly understanding the techniques themselves

Data Mining Process (simplified)
1. Data Preparation
2. Data Survey
3. Data Modeling

Data Preparation Process

Training and Test Data Sets
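The slide itself carries only a title, so the following is a minimal sketch of one common way to separate training and test data: a seeded random split. The function name `split_train_test`, the 30% test share, and the seed are assumptions for illustration, not taken from the presentation.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(range(10), test_fraction=0.3, seed=42)
print(len(train), len(test))
```

The test set is held back until the model is built, so that it measures performance on records the model has not seen.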

Prepared Information Environment Modules
– Input module (PIE-I) transforms raw execution data:
  – categorical values into numerical
  – filling in / ignoring missing values
– Output module undoes the effect of PIE-I
– Used between the model and the real world
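As a rough illustration of the two modules, the sketch below encodes a categorical field, fills a missing numeric field, and reverses a target scaling on the way out. The class `ToyPIE`, its method names, the integer coding, and the fill value are all hypothetical choices, not the presentation's actual design.

```python
class ToyPIE:
    """Toy prepared information environment (all names are hypothetical)."""

    def __init__(self, categories, fill_value, target_min, target_max):
        # Assumption: categories get arbitrary integer codes in list order.
        self.code_of = {c: i for i, c in enumerate(categories)}
        self.fill_value = fill_value
        self.target_min = target_min
        self.target_span = target_max - target_min

    def prepare_input(self, category, amount):
        """Input side: categorical -> numerical, missing amount -> fill value."""
        filled = self.fill_value if amount is None else amount
        return [float(self.code_of[category]), float(filled)]

    def prepare_target(self, y):
        """Scale a raw target into [0, 1] for training."""
        return (y - self.target_min) / self.target_span

    def restore_output(self, y_scaled):
        """Output side: undo the target scaling applied during preparation."""
        return y_scaled * self.target_span + self.target_min


pie = ToyPIE(["red", "green", "blue"], fill_value=0.0,
             target_min=100.0, target_max=300.0)
features = pie.prepare_input("green", None)
restored = pie.restore_output(pie.prepare_target(250.0))
print(features, restored)
```

Plain integer codes impose an artificial order on the categories, so a real input module would likely choose the numeric values more carefully.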

Modeling Tools and Data Preparation
– Right tool for the right job
– Early general-purpose mining tools were algorithm centric
– Modern tools concentrate on business problems
– “Getting the job done is enough, we don’t need to know how.”

Data Separation
– Straight lines parallel to axes
– Straight lines not parallel to axes
– Curves
– Closed area
– Ideal arrangement

Algorithms for Data Separation
– Decision Trees
– Decision Lists
– Neural Networks
– Evolution Programs
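Decision trees separate data with splits of the form "feature > threshold", which are straight lines parallel to the axes. The sketch below searches for the single best such split, the building block a tree repeats recursively. The function name and the toy points are assumptions for illustration.

```python
def best_axis_parallel_split(points, labels):
    """Return (errors, feature_index, threshold) for the best single split.

    A split assigns one class to point[feature] > threshold and the other
    class to the rest; either orientation is allowed.
    """
    best = None
    n_features = len(points[0])
    for f in range(n_features):
        values = sorted({p[f] for p in points})
        # Candidate thresholds: midpoints between neighbouring values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            errors = sum((p[f] > t) != bool(y) for p, y in zip(points, labels))
            errors = min(errors, len(points) - errors)  # flipped orientation
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best


points = [(1, 5), (2, 1), (3, 4), (6, 2), (7, 5), (8, 1)]
labels = [0, 0, 0, 1, 1, 1]
split = best_axis_parallel_split(points, labels)
print(split)
```

Here the classes differ only along the first feature, so one vertical line separates them; data that needs oblique lines or curves would require many such splits or a different tool.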

Modeling Data with the Tools
– Discrete and continuous tools: different approaches to different problems
– Binning vs. continuous algorithms
– It may be worthwhile to try different preparation techniques
– Missing and empty values
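Binning turns a continuous variable into a discrete one so that discrete tools can use it. The sketch below shows equal-width binning, which is only one of several binning schemes; the function name and the choice of four bins are assumptions.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


bins = equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4)
print(bins)
```

Binning discards the ordering within each bin, which is one reason it can pay to compare a binned and a continuous preparation of the same variable.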

Stages of Data Preparation
– Accessing the data
  – not trivial in many cases!
  – very case dependent

Stages of Data Preparation
– Accessing the data
– Auditing the data
  – examining the quality, quantity and source of the data
  – make sure the minimum requirements for a solution are met; drop unsupported hopes
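A first pass at auditing quality and quantity can be as simple as counting rows, missing entries, and distinct values per column. The sketch below assumes records arrive as dictionaries and treats `None` and the empty string as missing; both the helper name `audit_columns` and those conventions are assumptions.

```python
def audit_columns(rows, missing=(None, "")):
    """Summarise each column: row count, missing count, distinct values."""
    report = {}
    for row in rows:
        for name, value in row.items():
            stats = report.setdefault(
                name, {"rows": 0, "missing": 0, "distinct": set()})
            stats["rows"] += 1
            if value in missing:
                stats["missing"] += 1
            else:
                stats["distinct"].add(value)
    # Replace the value sets with their sizes for a compact report.
    return {name: {"rows": s["rows"], "missing": s["missing"],
                   "distinct": len(s["distinct"])}
            for name, s in report.items()}


rows = [
    {"age": 34, "city": "Espoo"},
    {"age": None, "city": "Espoo"},
    {"age": 51, "city": ""},
]
audit = audit_columns(rows)
print(audit)
```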

Stages of Data Preparation
– Accessing the data
– Auditing the data
– Enhancing and enriching the data
  – add more data if needed
  – apply domain knowledge to ease the work of the tool

Stages of Data Preparation
– Accessing the data
– Auditing the data
– Enhancing and enriching the data
– Looking for sampling bias
  – data sets must accurately represent the population
  – failure may lead to useless models
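One simple check for sampling bias, when the population's makeup is known from another source, is to compare category shares in the sample against those reference shares. The sketch below assumes such reference shares exist; the function name and the example figures are invented for illustration.

```python
from collections import Counter


def proportion_gaps(sample, reference_shares):
    """Return sample share minus reference share for each category."""
    counts = Counter(sample)
    n = len(sample)
    return {category: counts.get(category, 0) / n - share
            for category, share in reference_shares.items()}


# Hypothetical sample: category "a" is over-represented.
sample = ["a"] * 8 + ["b"] * 2
gaps = proportion_gaps(sample, {"a": 0.5, "b": 0.5})
print(gaps)
```

Large gaps suggest the sample does not represent the population on that variable; how large is too large depends on the sample size and the problem.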

Stages of Data Preparation
– Accessing the data
– Auditing the data
– Enhancing and enriching the data
– Looking for sampling bias
– Determining data structure
  – superstructure: selected scaffolding
  – macrostructure: e.g. granularity
  – microstructure: relationships between variables

Stages of Data Preparation
– Building the PIE, data issues:
  – representative samples
  – categorical values
  – normalization
  – missing and empty values
  – reducing width and depth
  – well- and ill-formed manifolds
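Of the issues listed, normalization is the easiest to show in a few lines. The sketch below uses min-max scaling fitted on training values. Clipping values that fall outside the training range is an assumption made here to keep outputs in [0, 1]; the presentation does not say how out-of-range values are handled.

```python
def fit_min_max(training_values):
    """Return a function that scales values into [0, 1] using training limits."""
    lo, hi = min(training_values), max(training_values)

    def normalize(value):
        if hi == lo:
            # Constant variable: map everything to the middle of the range.
            return 0.5
        scaled = (value - lo) / (hi - lo)
        # Assumption: values outside the training range are clipped.
        return min(1.0, max(0.0, scaled))

    return normalize


normalize = fit_min_max([10, 20, 30])
print(normalize(20), normalize(40))
```

The limits are fitted once on the training data and reused at execution time, so the model always sees values on the same scale.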

Correcting Problems with Ill-Formed Manifolds

Stages of Data Preparation
– Accessing the data
– Auditing the data
– Enhancing and enriching the data
– Looking for sampling bias
– Determining data structure
– Building the PIE
– Surveying the Data
– Modeling the Data

Summary
– Some data preparation is needed for all mining tools
– The purpose of preparation is to transform data sets so that their information content is best exposed to the mining tool
– The prediction error rate should be lower after preparation than before it, or at least no higher
– The miner gains very good insight into the problem during the preparation process