Chapter 17 Preparing Data for Mining

2 Introduction
Just as manufacturing and refining transform raw materials into finished products, data preparation transforms raw data into a form that can be used for data mining.
ECTL (extract, clean, transform, load) is the process/methodology for preparing data for data mining.
The goal: the ideal DM environment (Ch 16).
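The four ECTL steps can be sketched as a chain of small functions. This is a minimal illustration, not the chapter's own implementation: the CSV layout, the column names (`customer_id`, `state`, `sales`), and the in-memory list standing in for the mining table are all assumptions made for the example.

```python
import csv
import io

# Hypothetical raw source data; the columns are assumed for illustration.
RAW = """customer_id,state,sales
 101 ,ca,120.50
102,AZ,
103,ut,75
"""

def extract(text):
    """Extract: read raw rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Clean: trim whitespace and turn empty strings into None."""
    return [{k: (v.strip() or None) for k, v in row.items()} for row in rows]

def transform(rows):
    """Transform: standardize codes and convert text to numbers."""
    out = []
    for row in rows:
        out.append({
            "customer_id": int(row["customer_id"]),
            "state": row["state"].upper() if row["state"] else None,
            "sales": float(row["sales"]) if row["sales"] is not None else None,
        })
    return out

def load(rows, table):
    """Load: append prepared rows to the mining table (a plain list here)."""
    table.extend(rows)
    return table

mining_table = load(transform(clean(extract(RAW))), [])
for row in mining_table:
    print(row)
```

Keeping each step as its own function makes it possible to inspect the data between stages, which helps when a cleaning rule turns out to be wrong.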

3 What the Data Should Look Like
All data mining algorithms want their input in tabular form: rows and columns, as in a spreadsheet or database table.

4 What the Data Should Look Like
Customer Signature
–A continuous “snapshot” of customer behavior
Each row represents one customer and whatever might be useful for data mining.

5 What the Data Should Look Like
The columns
–Contain data that describe aspects of the customer (e.g., sales $ and quantity for each of products A, B, C)
–Contain the results of calculations, referred to as derived variables (e.g., total sales $)
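One way to picture a customer signature is as a roll-up of transaction records into a single row per customer. The sketch below assumes a hypothetical transaction layout of (customer_id, product, sales dollars, quantity); the column names `sales_A`, `qty_A`, and `total_sales` are invented for the example.

```python
from collections import defaultdict

# Hypothetical transactions: (customer_id, product, sales_dollars, quantity)
transactions = [
    (1, "A", 20.0, 2),
    (1, "B", 15.0, 1),
    (2, "A", 10.0, 1),
    (1, "A", 30.0, 3),
    (2, "C", 40.0, 4),
]
PRODUCTS = ["A", "B", "C"]

def build_signatures(transactions, products):
    """Roll transactions up to one row per customer."""
    totals = defaultdict(lambda: defaultdict(float))
    for customer_id, product, dollars, quantity in transactions:
        totals[customer_id][f"sales_{product}"] += dollars
        totals[customer_id][f"qty_{product}"] += quantity

    signatures = []
    for customer_id in sorted(totals):
        sig = {"customer_id": customer_id}
        for p in products:
            # Products the customer never bought default to zero.
            sig[f"sales_{p}"] = totals[customer_id][f"sales_{p}"]
            sig[f"qty_{p}"] = int(totals[customer_id][f"qty_{p}"])
        # Derived variable: calculated from the other columns.
        sig["total_sales"] = sum(sig[f"sales_{p}"] for p in products)
        signatures.append(sig)
    return signatures

signatures = build_signatures(transactions, PRODUCTS)
for sig in signatures:
    print(sig)
```

Each resulting dictionary is one row of the signature table: the per-product columns describe the customer, and `total_sales` is a derived variable.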

6 What the Data Should Look Like
Columns that need special attention:
1. Columns with one value: often not very useful
2. Columns with almost only one value
3. Columns with unique values
4. Columns correlated with the target variable (synonyms of the target variable)
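The first three column types above can be detected by counting distinct values. The following is a rough sketch; the 0.9 cutoff for "almost only one value" is an arbitrary assumption, and the sample table is made up. The fourth type (columns that are synonyms of the target) is not covered here because spotting it usually requires knowing how the target was defined.

```python
from collections import Counter

def profile_column(values, almost_threshold=0.9):
    """Classify a column by how its values are distributed."""
    if not values:
        raise ValueError("column has no values")
    counts = Counter(values)
    n = len(values)
    if len(counts) == 1:
        return "one value"
    if len(counts) == n:
        return "unique values"
    # Share of rows taken by the most common value.
    top_share = counts.most_common(1)[0][1] / n
    if top_share >= almost_threshold:
        return "almost one value"
    return "ok"

# Hypothetical 20-row table stored column by column.
table = {
    "country": ["US"] * 20,
    "customer_id": list(range(20)),
    "is_employee": ["N"] * 19 + ["Y"],
    "state": ["CA", "AZ", "UT", "CA"] * 5,
}

profile = {name: profile_column(col) for name, col in table.items()}
print(profile)
```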

7 What the Data Should Look Like
Columns have important model roles in data mining:
–Input columns: used as input into the model
–Target column(s): used only for predictive models; these hold the values the model is built to predict
–Ignored columns: not used in a particular data mining analysis
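Model roles can be recorded as a simple mapping from column name to role, then used to split each row into inputs and targets. This is only an illustration; the column names and the `churned` target are hypothetical.

```python
# Hypothetical role assignment for one particular analysis.
ROLES = {
    "customer_id": "ignored",   # an ID, unique per row
    "state": "input",
    "total_sales": "input",
    "churned": "target",        # the value a predictive model would predict
}

rows = [
    {"customer_id": 1, "state": "CA", "total_sales": 65.0, "churned": 0},
    {"customer_id": 2, "state": "AZ", "total_sales": 50.0, "churned": 1},
]

def split_by_role(rows, roles):
    """Separate input values from target values; ignored columns are dropped."""
    inputs = [c for c, r in roles.items() if r == "input"]
    targets = [c for c, r in roles.items() if r == "target"]
    X = [[row[c] for c in inputs] for row in rows]
    y = [[row[c] for c in targets] for row in rows]
    return inputs, X, targets, y

inputs, X, targets, y = split_by_role(rows, ROLES)
print(inputs, X)
print(targets, y)
```

Because roles belong to an analysis rather than to the data, the same table can be reused with a different `ROLES` mapping for a different model.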

8 What the Data Should Look Like
Variable Measures
–Categorical variables (e.g., CA, AZ, UT…)
–Ordered variables (e.g., course grades)
–Interval variables (e.g., temperatures)
–True numeric variables (e.g., money)
Dates & Times
Fixed-Length Character Strings (e.g., Zip Codes)
IDs and Keys: used for linkage to other data in other tables
Names (e.g., Company Names)
Addresses
Free Text (e.g., annotations, comments, memos)
Binary Data (e.g., audio, images)
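The measure of a variable affects how it can sensibly be encoded. The sketch below shows one possible treatment for three of the types listed above; the grade ordering, the state list, and the 5-digit zip code width are assumptions for the example, and other encodings are equally valid.

```python
GRADE_ORDER = ["F", "D", "C", "B", "A"]  # assumed ordering, lowest to highest
STATES = ["AZ", "CA", "UT"]              # assumed list of categories

def encode_ordered(value, order):
    """Ordered variable: map to a rank so that comparisons are meaningful."""
    return order.index(value)

def encode_categorical(value, categories):
    """Categorical variable: one indicator per category, no implied order."""
    return [1 if value == c else 0 for c in categories]

def normalize_zip(value, width=5):
    """Fixed-length string: keep as text and restore lost leading zeros."""
    return str(value).strip().zfill(width)

print(encode_ordered("B", GRADE_ORDER))    # 3
print(encode_categorical("CA", STATES))    # [0, 1, 0]
print(normalize_zip(2139))                 # 02139
```

The zip code case shows why fixed-length strings should not be stored as numbers: arithmetic on them is meaningless, and numeric storage silently drops leading zeros.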

9 What the Data Should Look Like
Data Format Expectations for Data Mining
–All data in a single table (rows/columns)
–Each row corresponds to an entity (e.g., a customer)
–Columns with a single value should be ignored
–Columns with unique values should be ignored
–Target column identified for predictive DM

10 Constructing the Customer Signature

11 Typical Customer Model
3 different definitions of Customer

12 The “Dark Side” of Data
Missing values (nulls = empty or something else)
Dirty data (erroneous zip codes, etc.)
Inconsistent values (different revisions)
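The three problems above can each be handled with a small rule. The following sketch is illustrative only: the set of placeholder strings treated as missing, the 5-digit zip code pattern, and the state synonym table are all assumptions, and real data would need rules derived from inspecting it.

```python
import re

MISSING_MARKERS = {"", "N/A", "NULL", "?", "-"}  # assumed placeholder spellings
ZIP_PATTERN = re.compile(r"\d{5}")               # assumes 5-digit US zip codes
STATE_SYNONYMS = {"CALIF": "CA", "CALIFORNIA": "CA"}  # hypothetical lookup

def clean_record(record):
    """Return a cleaned copy of the record plus a list of issues found."""
    cleaned, issues = {}, []

    # Missing values: map every placeholder spelling to None.
    for key, value in record.items():
        text = value.strip() if isinstance(value, str) else value
        if text is None or text in MISSING_MARKERS:
            cleaned[key] = None
            issues.append(f"{key}: missing")
        else:
            cleaned[key] = text

    # Dirty data: reject zip codes that do not match the expected pattern.
    zip_code = cleaned.get("zip")
    if zip_code is not None and not ZIP_PATTERN.fullmatch(zip_code):
        cleaned["zip"] = None
        issues.append("zip: invalid")

    # Inconsistent values: map different spellings to one code.
    state = cleaned.get("state")
    if state is not None:
        cleaned["state"] = STATE_SYNONYMS.get(state.upper(), state.upper())

    return cleaned, issues

record = {"customer_id": "101", "zip": "9410", "state": "Calif", "phone": "N/A"}
cleaned, issues = clean_record(record)
print(cleaned)
print(issues)
```

Returning the list of issues alongside the cleaned record keeps a trace of what was changed, so that systematic problems in the source can be reported rather than silently patched.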

13 Conclusion
There is a lot to think about and act on when preparing data for mining.
Remember that a process/methodology is needed, which includes ECTL (extract, clean, transform, load).
Remember the skills needed in the data mining group.

14 End of Chapter 17