5 Data Preparation for Data Mining: Chapter 4, Basic Preparation Markku Roiha Global Technology Platform Datex-Ohmeda Group, Instrumentarium Corp.

Slides:



Advertisements
Similar presentations
Organisation Of Data (1) Database Theory
Advertisements

C6 Databases.
Type your name in Footer Type file name in Footer Annotating Course Work – A PowerPoint Application Year 8, Unit 5 Use this set of PowerPoint slides to.
The Database Environment
Unit 7: Store and Retrieve it Database Management Systems (DBMS)
Report on Intrusion Detection and Data Fusion By Ganesh Godavari.
Database Management: Getting Data Together Chapter 14.
Chapter 3: System design. System design Creating system components Three primary components – designing data structure and content – create software –
1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) Data Staging Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential Chair of.
Pre-processing for Data Mining 3.1 COT5230 Data Mining Week 3 Pre-processing for Data Mining M O N A S H A U S T R A L I A ’ S I N T E R N A T I O N A.
Data Mining By Archana Ketkar.
Database Design Concepts Info 1408 Lecture 2 An Introduction to Data Storage.
Database – Part 2 Dr. V.T. Raja Oregon State University.
Employee Central Presentation
Data Resource Management Data Concepts Database Management Types of Databases Chapter 5 McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies,
Data Mining Concepts 1.1 COT5230 Data Mining Week 1 Data Mining Concepts M O N A S H A U S T R A L I A ’ S I N T E R N A T I O N A L U N I V E R S I T.
DASHBOARDS Dashboard provides the managers with exactly the information they need in the correct format at the correct time. BI systems are the foundation.
Major Tasks in Data Preprocessing(Ref Chap 3) By Prof. Muhammad Amir Alam.
Data Mining: A Closer Look
Data Warehousing: Defined and Its Applications Pete Johnson April 2002.
Enterprise systems infrastructure and architecture DT211 4
Data Mining: Concepts & Techniques. Motivation: Necessity is the Mother of Invention Data explosion problem –Automated data collection tools and mature.
Chapter 4-1. Chapter 4-2 Database Management Systems Overview  Not a database  Separate software system Functions  Enables users to utilize database.
CS490D: Introduction to Data Mining Prof. Chris Clifton April 14, 2004 Fraud and Misuse Detection.
Computers Are Your Future Tenth Edition Chapter 12: Databases & Information Systems Copyright © 2009 Pearson Education, Inc. Publishing as Prentice Hall1.
JOB ORDER COST ACCOUNTING
Chapter 6: Foundations of Business Intelligence - Databases and Information Management Dr. Andrew P. Ciganek, Ph.D.
Record matching for census purposes in the Netherlands Eric Schulte Nordholt Senior researcher and project leader of the Census Statistics Netherlands.
Chapter 7: Database Systems Succeeding with Technology: Second Edition.
1 Controversial Issues  Data mining (or simple analysis) on people may come with a profile that would raise controversial issues of  Discrimination 
Case 2: Emerson and Sanofi Data stewards seek data conformity
Introduction, or what is data mining? Introduction, or what is data mining? Data warehouse and query tools Data warehouse and query tools Decision trees.
Report on Intrusion Detection and Data Fusion By Ganesh Godavari.
Succeeding with Technology Database Systems Basic Data Management Concepts Organizing Data in a Database Database Management Systems Using Database Systems.
Database Design Part of the design process is deciding how data will be stored in the system –Conventional files (sequential, indexed,..) –Databases (database.
Lecturer: Gareth Jones. How does a relational database organise data? What are the principles of a database management system? What are the principal.
Professor Michael J. Losacco CIS 1110 – Using Computers Database Management Chapter 9.
1 n 1 n 1 1 n n Schema for part of a business application relational database.
Databases and progression. Learning objectives Distinguish between branching tree (binary), flat file, relational and spreadsheet databases Begin to explore.
C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.
5-1 McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.
Building Data and Document-Driven Decision Support Systems How do managers access and use large databases of historical and external facts?
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
CRM - Data mining Perspective. Predicting Who will Buy Here are five primary issues that organizations need to address to satisfy demanding consumers:
13-1 Sequential File Processing Chapter Chapter Contents Overview of Sequential File Processing Sequential File Updating - Creating a New Master.
DATA RESOURCE MANAGEMENT
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
13- 1 Chapter 13.  Overview of Sequential File Processing  Sequential File Updating - Creating a New Master File  Validity Checking in Update Procedures.
Data Profiling 13 th Meeting Course Name: Business Intelligence Year: 2009.
Data Mining Copyright KEYSOFT Solutions.
1 The System life cycle 16 The system life cycle is a series of stages that are worked through during the development of a new information system. A lot.
Fundamentals of Information Systems, Sixth Edition Chapter 3 Database Systems, Data Centers, and Business Intelligence.
Chapter 5-1. Chapter 5-2 Chapter 5: Organizing and Manipulating the Data in Databases Introduction Normalization Validating the Data in Databases Extracting.
MIS 451 Building Business Intelligence Systems Data Staging.
Sports Market Research. Know Your Customer How do businesses know their customers needs and wants?  Ask them/talking to customers  Surveys  Questionnaires.
Copyright © 2016 Pearson Education, Inc. Modern Database Management 12 th Edition Jeff Hoffer, Ramesh Venkataraman, Heikki Topi CHAPTER 11: BIG DATA AND.
Data Mining What is to be done before we get to Data Mining?
Information Systems (5 ECTS Credits) Diarmuid Ó Muirgheasa BA BAI MIEI
Chapter 3 Building Business Intelligence Chapter 3 DATABASES AND DATA WAREHOUSES Building Business Intelligence 6/22/2016 1Management Information Systems.
INTRODUCTION TO INFORMATION SYSTEMS LECTURE 9: DATABASE FEATURES, FUNCTIONS AND ARCHITECTURES PART (2) أ/ غدير عاشور 1.
Data Mining.
MIS2502: Data Analytics Advanced Analytics - Introduction
Introduction to Computing
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Databases.
Relational Database Model
MIS2502: Data Analytics Introduction to Advanced Analytics
Data Warehousing Concepts
MIS2502: Data Analytics Introduction to Advanced Analytics and R
Presentation transcript:

5 Data Preparation for Data Mining: Chapter 4, Basic Preparation Markku Roiha Global Technology Platform Datex-Ohmeda Group, Instrumentarium Corp.

5 Datex-Ohmeda

5 Shortly what is Basic Preparation about  Finding data for mining  Creating and understanding of the quality of data  Manipulating data to remove inaccuracies and redundancies  Making a row-column (text) file of the data

5 Assessing Data - “Data assay”  Data Discovery  Discovering and locating data to be used. Coping with the bureaucrats and data hiders.  Data Characterization  What is it, the data found ? Does it contain stuff needed or is it mere garbage ?  Data Set Assembly  Making an (ascii) table into a file of the data coming from different sources

5 Outcome of data assay  Detailed knowledge in the form of a report on  quality, problems, shortcomings, and suitability of the data for mining.  Tacit knowledge of the database  The miner has a perception of data suitability

5 Data Discovery  Input to data mining is a row-column text file  Original source of the data may be like various databases, flat measurement data files, binary data files etc.  Data Access Issues  Overcome accessibility challenges, like legal issues, cross-departmental access limitations, company politics  Overcome technical challenges, like data format incompatibilities, data storage mediums, database architecture incompatibilities, measurement concurrency issues  Internal/external source and the cost of data

5 Data characterization  Characterize the nature of data sources  Study the nature of variables and usefulness for for modelling  Looking frequency distributions and cross-tabs  Avoiding Garbage In

5 Characterization: Granularity  Variables fall within continuum of very detailed and very aggregated  Sum means aggregation, as well as a mean value  General rule: detailed is preferred over aggregated for mining  Level of aggregation determines the accuracy of model.  One level of aggregation less in input compared to requirement of output  Model outputs weekly variance, use daily measurements for modelling.

5 Characterization: Consistency  Undiscovered inconsistency in data stream leads to Garbage Out model.  If car model is stored as M-B, Mercedes, M-Benz, Mersu it is impossible to detect cross relations between person characteristics and the model of car owned.  Labelling of variables is dependent on the system producing variable data.  Employee means different thing for HR department system and to Payroll system in the precense of contractors  So, how many employees do we have?

5 Characterization: Pollution  Data is polluted if variable label does not reveal the meaning of variable data  Typical sources of pollution  Misuse of record field  B to signify “Business” in gender field of credit card holders -> How do you do statistical analysis based on gender then ?  Data transfer unsuccesful  Misinterpreted fields while copying (comma)  Human resistance  Car sales example, work time report

5 Characterization: Objects  The precise nature of object measured needs to be known  Employee example  Data miner needs to understand why information was captured in the first place  Perspective may color data

5 Characterization: Relationships  Data mining needs a row-column text file for input - This file is created from multiple data streams  Data streams may be difficult to merge  There must be some sort of a key that is common to each stream  Example: different customer ID values in different databases.  Key may be inconsistent, polluted or difficult to get access; there may be duplicates etc.

5 Characterization: Domain  Variable values must be within permissible range of values  Summary statistics and frequency counts reveal out-of-bounds values.  Conditional domanis:  Diagnosis bound to gender  Business rules, like fraud investigation for claims of > $1k  Automated tools to find unknown business rules  WizRule in the CD ROM of the book

5 Characterization: Defaults  Default values in data may cause problems  Conditional defaults dependent on other entries may create fake patterns  but really it is question of lack of data  May be useful patterns but often of limited use

5 Characterization: Integrity  Checking the possible/permitted relationships between variables  Many cars perhaps, but one spouse (except in Utah)  Acceptable range  Outlier may actually be the data we are looking for  Fraud looks often like outlying data because majority of claims are not fraudulent.

5 Characterization: Concurrency  Data capture may be of different epochs  Thus streams may not be comparable at all  Example: Last years tax report and current income/posessions may not match

5 Characterization: Duplicates/Redundancies  Different data streams may involve redundant data - even one source may have redundancies  like dob and age, or  price_per_unit - number_purchased - total_price  Removing redundancies may increase modelling speed  Some algorithms may crash if two variables are identical  Tip: if two variables are almost colinear use difference

5 Data Set Assembly  Data is assembled from different data streams to row-column text file  Then data assessment continues from this file

5 Data Set Assembly: Reverse Pivoting  Feature extraction by sorting data by one key from transactions and deriving new fields  E.g. from transaction data to customer profile

5 Data Set Assembly: Feature Extraction  Choice of variables to extract means how data is presented to data mining tool  Miner must judge which features are predictive  Choice cannot be automated but actual extraction of features can.  Reverse pivot is not the only way extract features  Source variables may be replaced by derived variables  Physical models: flat most of time - take only sequences where there is rapid changes

5 Data Set Assembly: Explanatory Structure  Data miner needs to have an idea how data set can address problem area  It is called the explanatory structure of data set  Explains how variables are expected to relate to each other  How data set relates to solving the problem  Sanity check: Last phase of data assay  Checking that explanatory structure actually holds as expected  Many tools like OLAP

5 Data Set Assembly: Enhancement/Enrichment  Assembled data set may not be sufficent  Data set enrichment  Adding external data to data set  Data enhancement  embellishing or expanding data set w/o external data  Feature extraction,  adding bias  remove non-responders from data set  data multiplication  Generate rare events (add some noise)

5 Data Set Assembly: Sampling Bias  Undetected sampling bias may ruin the model  US Census: cannot find poorest segment of the society - no home, no address  Telephone polls: have to own a telephone, have to be willing to share opinions over phone lines  At this phase - the end of data assay miner needs to realize existence of possible bias and explain it

5 Example 1: CREDT  Study of data source report  to find out integrity of variables  to find out expected relationships between variables for integrity assessment  Tools for single variable integrity study  Status report for Credit file  Complete Content Report  Leads to removing some variables  Tools for cross correlation analysis  KnowledgeSeeker - chi-square analysis  Checking that expected relationships are there

5 Example 1: CREDT: Single-variable status  Conclusions  BEACON_C lin>0.98  CRITERIA: constant  EQBAL: empty, distinct values?  DOB month: sparse,14 values?  HOME VALUE:min 0.0? Rent/own?

5 Example 1: Relationships  Chi-square analysis  AGE_INFERR expectation it correlates w/ DOB_YEAR  Right, it does - data seems ok  Do we need both ? Remove other ?  HOME_ED correlates with PRCNT_PROF  Right, it does - data seems ok  Talk about bias  Introducing bias for e.g. increase number of child-bearing families to study marketing of child- related products.

5 Example 2: Shoe  What is interesting here:  WizRule to find out probable hidden rules from data set.

5 Data Assay  Assessment of quality of data for mining  Leads to assembly of data sources to one file.  How to get data and does it suit the purpose  Main goal: miner understands where the data come from, what is there, and what remains to be done.  It is helpful to make a report on the state of data  It involves miner directly - rather than using automated tools  After assay rest can be carried out with tools