Data Understanding, Cleaning, Transforming

Recall the Data Science Process
- Data acquisition
- Data extraction (wrapper, IE)
- Understand/clean/transform
- Integration (resolving schema/instance conflicts)
- Understand/clean/transform (again if necessary)
- Further pre-processing
- Modeling/understanding the problem
- Debug, iterate
- Report, visualization

Other Names for This Step
- Exploration, visualization, summarization, profiling, pre-processing, understanding, cleansing, scrubbing, transformation, validation, verification, data quality management, …

Data
- Typically taken to mean schema + data instances
- Ideally we would say "schema" and "data instances", but often we will just say "schema" and "data"

Schema Often Has Many Constraints
- Keys, uniqueness, functional dependencies, foreign keys

Data Often Has Many Constraints Too
- Value range, format, etc.
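
Such constraints can be checked mechanically. Below is a minimal sketch; the records, fields, and the two constraints (an age range and a five-digit ZIP format) are made up for illustration.

```python
import re

# Hypothetical records with a value-range constraint (0 <= age <= 120)
# and a format constraint (ZIP code must be exactly five digits).
rows = [
    {"name": "Ann",  "age": 34, "zip": "53706"},
    {"name": "Bob",  "age": -2, "zip": "53706"},   # violates the age range
    {"name": "Cara", "age": 40, "zip": "9070A"},   # violates the ZIP format
]

def violations(row):
    """Return a list of constraint violations for one record."""
    errs = []
    if not (0 <= row["age"] <= 120):
        errs.append("age out of range")
    if not re.fullmatch(r"\d{5}", row["zip"]):
        errs.append("bad zip format")
    return errs

# Map each offending record to its violations.
bad = {r["name"]: violations(r) for r in rows if violations(r)}
```

In practice the constraints would come from documentation or domain experts rather than being hard-coded like this.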

Understanding, Cleaning, & Transformation
- Understand what the schema/data look like right now
- Understand what the schema/data should ideally look like
- Identify problems
- Solve problems
- Apply additional transformations

Understand the Current Schema/Data
- To understand one attribute: min, max, avg, histogram, amount of missing values, value range, data type, length of values, synonyms, formats, etc.
- To understand the relationship between two attributes: various plots
- To understand 3+ attributes
- Data profiling tools can help with inferring constraints, e.g., keys, functional dependencies, foreign key dependencies
- Other issues: cryptic values, abbreviations, cryptic attributes
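
A one-attribute summary of this kind is easy to sketch with pandas; the table and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical table with one numeric and one categorical attribute.
df = pd.DataFrame({
    "age":  [23, 35, None, 41, 35, 29],
    "city": ["Madison", "madison", "Chicago", None, "Chicago", "Chicago"],
})

# One-attribute profile: min, max, average, and missing-value count.
age_min = df["age"].min()
age_max = df["age"].max()
age_mean = df["age"].mean()
age_missing = df["age"].isna().sum()

# Value distribution for a categorical attribute (a text-mode "histogram");
# note it also surfaces a value-variation problem ("Madison" vs "madison").
city_counts = df["city"].value_counts(dropna=False)
```

For pairwise relationships, `df.plot.scatter(...)` or a correlation matrix would play the role of the "various plots" above.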

Understand the Ideal Schema/Data
- While trying to understand the current schema/data, we also gain some understanding of the ideal ones
- May need more information: read documents; talk with domain experts and the owners of the schema/data

Identify the Problems
- Basically clashes between the current and the ideal schema/data, i.e., violations of the constraints of the ideal schema/data
- Schema problems: misspelled names; violated constraints (key, uniqueness, foreign key, etc.)
- Data problems: missing values; incorrect, illegal, or outlier values; synonyms; misspellings; conflicting data (e.g., age and birth year); wrong value formats; variations of values; duplicate tuples
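
Several of these data problems can be detected mechanically. A pandas sketch with made-up data follows; the plausible age range and the reference year 2024 are assumptions for the example.

```python
import pandas as pd

# Hypothetical data exhibiting three of the problems listed above.
df = pd.DataFrame({
    "id":         [1, 2, 2, 3],
    "age":        [30, 25, 25, 200],          # 200 is an illegal value
    "birth_year": [1994, 1999, 1999, 1980],
})

# Duplicate tuples: ids that appear more than once.
dup_ids = df[df.duplicated(subset="id", keep=False)]["id"].unique().tolist()

# Illegal values: ages outside a plausible range.
illegal_age = df[(df["age"] < 0) | (df["age"] > 120)]["id"].tolist()

# Conflicting data: age inconsistent with birth year
# (assuming the data refers to the year 2024).
conflict = df[((2024 - df["birth_year"]) - df["age"]).abs() > 1]["id"].tolist()
```

Each check yields a list of offending ids that can be routed to the appropriate fix in the next step.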

Solving the Problems
- Good tools exist for certain types of attributes (e.g., names, addresses)
- But in general there are no good generic tools
- Much research has been done; people mostly roll their own set of tools
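
A minimal roll-your-own cleaning pass might look like the sketch below; the columns, the synonym mapping, and the median-fill policy are all illustrative choices, not a prescribed method.

```python
import pandas as pd

# Hypothetical dirty data: variant spellings and a missing value.
df = pd.DataFrame({
    "state": ["WI", "Wisc.", "Wisconsin", "IL", None],
    "price": [10.0, None, 12.0, 11.0, 10.0],
})

# Normalize synonyms / variant spellings with a hand-built mapping.
canon = {"Wisc.": "WI", "Wisconsin": "WI"}
df["state"] = df["state"].replace(canon)

# Fill missing numeric values with the column median
# (one of several reasonable policies; dropping the row is another).
df["price"] = df["price"].fillna(df["price"].median())
```

Real cleaning pipelines layer many such small, domain-specific rules, which is why they tend to be hand-rolled.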

Examples (see Google Doc)

Additional Transformations
- These do not correct something wrong in the schema/data per se, so they are not data cleaning
- Rather, they transform the schema/data into something better suited for our purposes
- Examples: split a field (e.g., full name); concatenate multiple values/fields; schema transformation
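
The split and concatenate transformations above can be sketched in a few lines of pandas; the fields and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "city":      ["London", "Wilmslow"],
    "country":   ["UK", "UK"],
})

# Split one field into two (first split only, so multi-part
# surnames stay together in "last").
df[["first", "last"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Concatenate multiple fields into one.
df["location"] = df["city"] + ", " + df["country"]
```

Neither step fixes an error; both reshape correct data into a form the downstream task prefers.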

Examples

Do These for Each Source, then Integrate
- Understand what the schema/data look like right now
- Understand what the schema/data should ideally look like
- Identify problems
- Solve problems
- Apply additional transformations

Examples

After Data Integration, May Have to Do Understand/Clean/Transform Again
- Conflicting values (e.g., age)
- Inconsistent formats (e.g., UPC)
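
A sketch of such post-integration checks; the two sources, columns, and phone formats below are invented for illustration.

```python
import pandas as pd

# Two sources that disagree on age and use different phone formats.
a = pd.DataFrame({"id": [1, 2], "age": [30, 41],
                  "phone": ["608-555-0101", "608-555-0102"]})
b = pd.DataFrame({"id": [1, 2], "age": [31, 41],
                  "phone": ["(608) 555 0101", "6085550102"]})

m = a.merge(b, on="id", suffixes=("_a", "_b"))

# Conflicting values: ids where the sources disagree on age.
age_conflicts = m[m["age_a"] != m["age_b"]]["id"].tolist()

# Inconsistent formats: normalize phones to digits only *before*
# comparing, so format differences are not mistaken for conflicts.
digits = lambda s: s.str.replace(r"\D", "", regex=True)
phone_conflicts = m[digits(m["phone_a"]) != digits(m["phone_b"])]["id"].tolist()
```

The phone case shows why normalization must precede conflict detection: compared as raw strings, every pair here would look conflicting.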

Some Other Possible Steps
- Data enrichment

What Have We Covered So Far?
- For data from each source: understand current vs. ideal schema/data; compare the two and identify possible problems; clean and transform; perform additional transformations if necessary; possibly enrich/enhance
- Integrate data from the multiple sources: schema matching, data matching
- May need to do another round of understand/clean/transform (+ enrich/enhance)

Further Generic Pre-Processing
- Sampling
- Re-scaling
- Dimensionality reduction
- Discretization
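
Sampling, re-scaling, and discretization can each be done in a line or two of pandas, as sketched below; dimensionality reduction typically needs a library such as scikit-learn and is omitted. The data and bin edges are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"income": [20.0, 40.0, 60.0, 80.0, 100.0]})

# Sampling: a random subset of rows (fixed seed for reproducibility).
sample = df.sample(n=3, random_state=0)

# Re-scaling: min-max normalization to [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)

# Discretization: bin the continuous value into labeled ranges.
df["income_band"] = pd.cut(df["income"], bins=[0, 50, 100],
                           labels=["low", "high"])
```

Which of these steps apply, and with what parameters, depends on the downstream modeling task.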

Task-Specific Pre-Processing
- E.g., incorrect labels