

Data Understanding, Cleaning, Transforming

Recall the Data Science Process
Data acquisition
Data extraction (wrapper, IE)
Understand/clean/transform
Integration (resolving schema/instance conflicts)
Understand/clean/transform (again if necessary)
Further pre-processing
Modeling/understand the problem
Debug, iterate
Report, visualization

Other Names for This Step
exploration, visualization, summarize, profiling, pre-processing, understand, cleanse, scrub, transform, validation, verification, data quality management, …

Data
Typically taken to mean schema + data instances
Ideally we should use “schema” and “data instances”
But often we will say “schema” and “data”

Schema Often Has Many Constraints
Key, uniqueness, functional dependencies, foreign keys
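
The constraint types named on this slide can each be checked with a few lines of code. The following is a minimal sketch, not a definitive implementation: the tables, column names (`emp_id`, `zip`, `city`, `dept_id`) and helper functions are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def violates_key(rows, key_cols):
    """Return key values that occur more than once (key/uniqueness violations)."""
    seen, dups = set(), set()
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key in seen:
            dups.add(key)
        seen.add(key)
    return dups

def violates_fd(rows, lhs, rhs):
    """Return lhs values mapped to more than one rhs value (FD lhs -> rhs broken)."""
    images = defaultdict(set)
    for row in rows:
        images[tuple(row[c] for c in lhs)].add(tuple(row[c] for c in rhs))
    return {k for k, v in images.items() if len(v) > 1}

def violates_fk(child_rows, child_col, parent_rows, parent_col):
    """Return child values with no matching parent value (dangling foreign keys)."""
    parent_values = {row[parent_col] for row in parent_rows}
    return {row[child_col] for row in child_rows} - parent_values

# Hypothetical toy tables, invented for this example.
departments = [{"dept_id": 1}, {"dept_id": 2}]
employees = [
    {"emp_id": 10, "zip": "53706", "city": "Madison", "dept_id": 1},
    {"emp_id": 11, "zip": "53706", "city": "Madsion", "dept_id": 2},
    {"emp_id": 11, "zip": "60601", "city": "Chicago", "dept_id": 3},
]

key_violations = violates_key(employees, ["emp_id"])        # emp_id 11 repeats
fd_violations = violates_fd(employees, ["zip"], ["city"])   # one zip, two cities
fk_violations = violates_fk(employees, "dept_id", departments, "dept_id")
```

Each helper returns the offending values rather than a yes/no answer, so the result can feed directly into the later "identify the problems" step.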

Data Often Has Many Constraints Too
value range, format, etc.
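
Value-range and format constraints can be expressed as small per-record checks. This is a sketch under stated assumptions: the age range of 0 to 120 and the `NNN-NNN-NNNN` phone pattern are hypothetical constraints picked for illustration, not taken from the slides.

```python
import re

# Hypothetical constraints for this example.
AGE_RANGE = (0, 120)
PHONE_FORMAT = re.compile(r"\d{3}-\d{3}-\d{4}")

def check_record(record):
    """Return a list of constraint violations found in one record."""
    problems = []
    age = record.get("age")
    # A missing age is reported together with out-of-range ages.
    if age is None or not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
        problems.append("age missing or out of range")
    phone = record.get("phone") or ""
    if not PHONE_FORMAT.fullmatch(phone):
        problems.append("phone has wrong format")
    return problems

records = [
    {"age": 34, "phone": "608-555-0100"},
    {"age": 214, "phone": "608-555-0101"},
    {"age": 51, "phone": "(608) 555 0102"},
]
report = [check_record(r) for r in records]
```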

Understanding, Cleaning, & Transformation
understand what schema/data should ideally look like
understand what schema/data look like right now
identify problems
solve problems
additional transformation

Understand the Current Schema/Data
To understand one attribute:
  – min, max, avg, histogram, amount of missing values, value range
  – data type, length of values, etc.
  – synonyms, formats
To understand the relationship between two attributes:
  – various plots
To understand 3+ attributes
Data profiling tools can help with inferring constraints
  – e.g., keys, functional dependencies, foreign key dependencies
Other issues:
  – cryptic values, abbreviations, cryptic attributes
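
The single-attribute statistics listed above (min, max, avg, histogram, amount of missing values) can be computed with a short profiling routine. A minimal sketch follows, assuming missing values are represented as `None` and that at least one value is present; the function name `profile_numeric` and the sample `ages` column are hypothetical.

```python
from collections import Counter
from statistics import mean

def profile_numeric(values, bin_width):
    """Summarize one numeric attribute; None marks a missing value."""
    present = [v for v in values if v is not None]
    # Histogram keyed by the lower edge of each fixed-width bin.
    histogram = Counter((v // bin_width) * bin_width for v in present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "avg": mean(present),
        "histogram": dict(sorted(histogram.items())),
    }

ages = [23, 35, None, 41, 35, 29, None, 350]
summary = profile_numeric(ages, bin_width=10)
```

In this sample, the isolated histogram bin at 350 and the inflated average both point at a likely illegal value, which is the kind of signal profiling is meant to surface.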

Understand the Ideal Schema/Data
While trying to understand the current schema/data, you will gain some understanding of the ideal ones
May need more information
  – read documents
  – talk with domain experts, owners of schema/data

Identify the Problems
Problems are basically clashes between the current and the ideal schema/data
  – i.e., violations of constraints for the ideal schema/data
Schema problems
  – misspelt names
  – violating constraints (key, uniqueness, foreign key, etc.)
Data problems
  – missing values
  – incorrect values, illegal values, outliers
  – synonyms
  – misspellings
  – conflicting data (e.g., age and birth year)
  – wrong value formats
  – variations of values
  – duplicate tuples
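
Three of the data problems listed above (missing values, conflicting data, duplicate tuples) can be detected mechanically. The sketch below is illustrative only: the column names, the sample rows, and `REFERENCE_YEAR` (the year in which `age` is assumed to have been recorded) are all hypothetical.

```python
from collections import Counter

REFERENCE_YEAR = 2024  # assumption: the year in which "age" was recorded

def find_problems(rows):
    """Return (row index, problem description) pairs."""
    problems = []
    # Count identical tuples; each row is compared on all of its fields.
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    for i, row in enumerate(rows):
        missing = [k for k, v in row.items() if v is None]
        if missing:
            problems.append((i, "missing: " + ", ".join(missing)))
        age, born = row.get("age"), row.get("birth_year")
        if age is not None and born is not None:
            # Before the birthday, age is one less than the year difference.
            if REFERENCE_YEAR - born - age not in (0, 1):
                problems.append((i, "age conflicts with birth_year"))
        if counts[tuple(sorted(row.items()))] > 1:
            problems.append((i, "duplicate tuple"))
    return problems

rows = [
    {"name": "Ann", "age": 30, "birth_year": 1994},
    {"name": "Bob", "age": 25, "birth_year": 1980},
    {"name": "Cy", "age": None, "birth_year": 1990},
    {"name": "Ann", "age": 30, "birth_year": 1994},
]
problems = find_problems(rows)
```

This only catches exact duplicates; near-duplicates caused by misspellings or value variations need approximate matching, which is outside this sketch.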

Solving the Problems
Problems are basically clashes between the current and the ideal schema/data
  – i.e., violations of constraints for the ideal schema/data
Schema problems
  – misspelt names
  – violating constraints (key, uniqueness, foreign key, etc.)
Data problems
  – missing values
  – incorrect values, illegal values, outliers
  – synonyms
  – misspellings
  – conflicting data (e.g., age and birth year)
  – wrong value formats
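
The slide does not say how each problem is fixed, so the following is only one possible hand-rolled approach, shown for synonyms, wrong value formats, and exact duplicates. The synonym table, the target phone format, and all function names are assumptions made for this example.

```python
import re

# Hypothetical synonym table mapping variants to one canonical value.
STATE_SYNONYMS = {"wisconsin": "WI", "wis.": "WI", "wi": "WI"}

def clean_state(value):
    """Map known synonyms to the canonical code; otherwise just trim."""
    if value is None:
        return None
    stripped = value.strip()
    return STATE_SYNONYMS.get(stripped.lower(), stripped)

def clean_phone(value):
    """Rewrite a 10-digit number as NNN-NNN-NNNN; return None otherwise."""
    digits = re.sub(r"\D", "", value or "")
    if len(digits) != 10:
        return None  # cannot be repaired automatically; needs review
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def dedupe(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

raw = [
    {"state": "Wisconsin", "phone": "(608) 555-0100"},
    {"state": " wi ", "phone": "608.555.0100"},
    {"state": "IL", "phone": "555-0100"},
]
cleaned = dedupe(
    [{"state": clean_state(r["state"]), "phone": clean_phone(r["phone"])} for r in raw]
)
```

Normalizing before deduplicating matters here: the first two raw rows only become recognizable as duplicates once their synonyms and formats agree.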

Solving the Problems
Good tools exist for certain types of attributes
  – names, addresses
But in general there are no really good generic tools out there
Much research has been done
People mostly roll their own set of tools

Examples

Examples (see Google Doc)

Additional Transformations
These are not to correct something wrong in schema/data per se
Not data cleaning
But rather transformations of schema/data into something better suited for our purposes
Examples
  – split a field (e.g., full name)
  – concatenate multiple values/fields
  – schema transformation
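
The first two example transformations can be sketched in a few lines. This is a simplified illustration: splitting a full name on its last whitespace is an assumption that fails for many real names (multi-word surnames, suffixes), and the field names are hypothetical.

```python
def split_full_name(full_name):
    """Split on the last run of whitespace into first and last name."""
    parts = full_name.rsplit(None, 1)
    return {
        "first_name": parts[0] if parts else "",
        "last_name": parts[1] if len(parts) == 2 else "",
    }

def concat_fields(record, fields, sep=", "):
    """Join the non-empty values of the given fields into one string."""
    return sep.join(str(record[f]) for f in fields if record.get(f))

name = split_full_name("Mary Ann Smith")
address = concat_fields(
    {"street": "1210 W Dayton St", "city": "Madison", "state": "WI"},
    ["street", "city", "state"],
)
```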

Examples

Do These for Each Source, then Integrate
understand what schema/data should ideally look like
understand what schema/data look like right now
identify problems
solve problems
additional transformation

Examples

After Data Integration, May Have to Do Understand/Clean/Transform Again
Conflicting values (e.g., age)
Inconsistent formats (e.g., UPC)
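
Conflicting values only become visible once records from different sources are lined up. The sketch below assumes data matching has already assigned a shared identifier (the hypothetical `person_id`) to records describing the same entity; the sources and values are invented for illustration.

```python
from collections import defaultdict

def find_conflicts(sources, key, attribute):
    """Return {key value: sorted distinct attribute values} where sources disagree."""
    values = defaultdict(set)
    for rows in sources:
        for row in rows:
            if row.get(attribute) is not None:
                values[row[key]].add(row[attribute])
    return {k: sorted(v) for k, v in values.items() if len(v) > 1}

source_a = [{"person_id": 1, "age": 34}, {"person_id": 2, "age": 51}]
source_b = [{"person_id": 1, "age": 43}, {"person_id": 2, "age": 51}]
conflicts = find_conflicts([source_a, source_b], "person_id", "age")
```

Deciding which of the conflicting values to keep is a separate step that usually needs domain knowledge or source-reliability information.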

Some Other Possible Steps
Data enrichment

What Have We Covered So Far?
For data from each source
  – understand current vs ideal schema/data
  – compare the two and identify possible problems
  – clean and transform
  – perform additional transformations if necessary
  – possibly enrich/enhance
Integrate data from the multiple sources
  – schema matching, data matching
May need to do another round of understand/clean/transform (+ enrich/enhance)


Further Generic Pre-Processing
Sampling
Re-scaling
Dimensionality reduction
Discretization
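
Three of these steps have simple textbook forms, sketched below with the standard library only. The specific choices (uniform random sampling without replacement, min-max re-scaling to [0, 1], equal-width discretization) are assumptions, since the slide names only the general steps. Dimensionality reduction is left out because it typically needs a linear algebra library.

```python
import random

def sample(rows, k, seed=0):
    """Draw k rows uniformly at random without replacement (seeded)."""
    return random.Random(seed).sample(rows, k)

def min_max_rescale(values):
    """Linearly map values onto [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

incomes = [20, 30, 40, 60, 100]
subset = sample(incomes, 3)
rescaled = min_max_rescale(incomes)
bins = equal_width_bins(incomes, n_bins=4)
```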

Task-Specific Pre-processing
E.g., incorrect labels