Data Cleansing with SQL and R Kevin Feasel


Data Cleansing with SQL and R Kevin Feasel Manager, Predictive Analytics ChannelAdvisor

Who Am I? What Am I Doing Here? Curated SQL: https://curatedsql.com | Tribal SQL: http://tribalsql.com | @feaselkl

Dirty Data What is dirty data? Inconsistent data Invalid data Incomplete data Inaccurate data Duplicate data

Philosophy The ideal solution is to clean data as close to the source as possible. In rank order: Before it gets into the OLTP system Once it is in the OLTP system During the ETL process to the warehouse Once it is in the warehouse During data analysis Not all systems follow the OLTP => DW => Analysis flow, so it is valuable to know multiple techniques for data cleansing.

Motivation Today's talk will focus on data cleansing within SQL Server and R, with an emphasis on R. In SQL Server, we will focus on data structures. In R, we will focus on the concept of tidy data. This will necessarily be an incomplete survey of data cleansing techniques, but should serve as a starting point for further exploration. We will not look at Data Quality Services or other data provenance tools in this talk, but these tools are important.

Agenda High-Level Concepts SQL Server – Constraints SQL Server – Mapping Tables R – tidyr R – dplyr R – Data and Outlier Analysis

Data Consistency Important questions for data consistency: Are there misspellings? Do we use the same values to represent the same things? Do we have data stored in multiple locations which can get out of sync? Are the answers to one question consistent with answers to another question? (Ex: less than 12 years of schooling but has a PhD?)
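
One quick way to surface misspellings and inconsistent representations is to profile the distinct values of a column. A sketch against a hypothetical Customers table (the table and column names are illustrative, not from the talk's demos):

```sql
-- Profile the distinct values of a column; low-count variants
-- sitting next to common ones ("Releigh" next to "Raleigh")
-- are often misspellings or inconsistent entries.
SELECT
    City,
    COUNT(*) AS NumberOfRecords
FROM dbo.Customers
GROUP BY City
ORDER BY City;
```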

Data Validity Important questions for data validity: Are the answers physically possible? (Ex: count of pregnant males) Are the answers logically possible? (Ex: 6-year-old with a driver's license) Does a test actually measure what it purports to measure?

Data Completeness Important questions for data completeness: Are there missing values represented by NULL or some default? Are any of the missing values vital for analysis? Do we have a reasonable default for a missing value? (Ex: use the mean/median value or a hard-coded default)
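
A per-column missing-value count is a quick way to answer these questions. A sketch using hypothetical column names:

```sql
-- Count missing values per column to gauge completeness.
SELECT
    COUNT(*) AS TotalRows,
    SUM(CASE WHEN PhoneNumber IS NULL THEN 1 ELSE 0 END) AS MissingPhoneNumber,
    SUM(CASE WHEN BirthDate IS NULL THEN 1 ELSE 0 END) AS MissingBirthDate
FROM dbo.Customers;
```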

Data Accuracy Important questions for data accuracy: Does an answer look absurd? (Ex: $500 million/hour wage) Do we have multiple sources of data with conflicting results? Do we know how this data was collected? (Ex: survey, sample, direct measurement)

Data Duplication Important questions for data duplication: Do I have a way of knowing whether data is duplicated? (Ex: natural key) Are there "duplicated" results which are not actually duplicates? Can I filter out duplicated results?
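
Given a natural key, one common T-SQL pattern for removing duplicates is ROW_NUMBER over the key columns. This is a sketch with invented column names, not the talk's demo code:

```sql
-- Number rows within each natural-key group (here: name + postal code),
-- keeping the earliest row and deleting the rest.
WITH NumberedRows AS
(
    SELECT
        CustomerID,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerName, PostalCode
            ORDER BY CreatedDate ASC
        ) AS RowNum
    FROM dbo.Customers
)
DELETE FROM NumberedRows
WHERE RowNum > 1;
```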

Quick Determinations These are rules of thumb. Impossible measurements (e.g., count of pregnant males) should go; don't waste the space storing them. "Missing" data (e.g., records with some NULL values) should stay, although it might not be viable for all analyses. Fixable bad data (e.g., misspellings, errors where the intention is known) should be fixed and stay. Unfixable bad data is a tougher call: set it to a default, make a "best guess" change (risky!), set it to NA/NULL/Unknown, or drop it from the analysis.
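
In R, these rules of thumb might look like the following. The data frame, columns, and cutoffs are all invented for illustration:

```r
# df is a hypothetical data frame of survey responses.

# Impossible measurements: drop the rows outright.
df <- df[!(df$Sex == "Male" & df$IsPregnant == TRUE), ]

# Fixable bad data: correct known misspellings in place.
df$City[df$City == "Releigh"] <- "Raleigh"

# Unfixable bad data: set to NA rather than guessing.
df$HourlyWage[df$HourlyWage > 10000] <- NA
```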

Agenda High-Level Concepts SQL Server – Constraints SQL Server – Mapping Tables R – tidyr R – dplyr R – Data and Outlier Analysis

Keys and Constraints Relational databases have several data quality constraints: Normalization Data types Primary key constraints Unique key constraints Foreign key constraints Check constraints Default constraints

Normalization When in doubt, go with Boyce-Codd Normal Form. First Normal Form - consistent shape + unique entities + atomic attributes Boyce-Codd Normal Form - 1NF + all attributes fully dependent upon a candidate key + every determinant is a key. Check the links for more details!

Data Types Think through your data type choices. Use the best data type (int/decimal for numeric, date/datetime/datetime2/time for date data, etc.) Use the smallest data type which solves the problem (Ex: date instead of datetime, varchar(10) instead of varchar(max))

Constraints Use constraints liberally. Primary key to describe the primary set of attributes which describes an entity. Unique keys to describe alternate sets of attributes which describe an entity. Foreign keys to describe how entities relate. Check constraints to explain valid domains for attributes and attribute combinations. Default constraints when there is a reasonable alternative to NULL.
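
A sketch showing each constraint type from the list above on one hypothetical table (the table, columns, and rules are illustrative):

```sql
CREATE TABLE dbo.Employee
(
    EmployeeID INT NOT NULL,
    NationalID CHAR(9) NOT NULL,
    DepartmentID INT NOT NULL,
    -- Default constraint: a reasonable alternative to NULL.
    HireDate DATE NOT NULL
        CONSTRAINT DF_Employee_HireDate DEFAULT (GETDATE()),
    HourlyWage DECIMAL(8, 2) NOT NULL,
    -- Primary key: the primary set of identifying attributes.
    CONSTRAINT PK_Employee PRIMARY KEY (EmployeeID),
    -- Unique key: an alternate set of identifying attributes.
    CONSTRAINT UQ_Employee_NationalID UNIQUE (NationalID),
    -- Foreign key: how this entity relates to another.
    CONSTRAINT FK_Employee_Department
        FOREIGN KEY (DepartmentID) REFERENCES dbo.Department (DepartmentID),
    -- Check constraint: the valid domain for an attribute.
    CONSTRAINT CK_Employee_HourlyWage CHECK (HourlyWage > 0)
);
```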

Demo Time

Agenda High-Level Concepts SQL Server – Constraints SQL Server – Mapping Tables R – tidyr R – dplyr R – Data and Outlier Analysis

Mapping Tables Sometimes we want to create new tables and data relationships: for example, categorizing data for easier high-level understanding. The ideal way to do this would be to modify the current schema and code to support these new relationships, keeping the data as close to the source as possible. When that is not possible, sometimes we can do the next-best thing: create external relationships without modifying source tables. The worst-case scenario would be to do this cleanup late in the analysis process, as then other analyses cannot take advantage of this data cleansing work.
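
The "next-best thing" might look like this: a standalone mapping table which categorizes rows without touching the source table. Table and column names here are illustrative:

```sql
-- External mapping table: categorize products without modifying
-- the source Product table.
CREATE TABLE dbo.ProductCategoryMapping
(
    ProductID INT NOT NULL PRIMARY KEY,
    Category VARCHAR(50) NOT NULL
);

-- Analyses then join through the mapping table:
SELECT
    m.Category,
    COUNT(*) AS NumberOfProducts
FROM dbo.Product p
    INNER JOIN dbo.ProductCategoryMapping m
        ON p.ProductID = m.ProductID
GROUP BY m.Category;
```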

Demo Time

Agenda High-Level Concepts SQL Server – Constraints SQL Server – Mapping Tables R – tidyr R – dplyr R – Data and Outlier Analysis

What Is Tidy Data? Notes from Hadley Wickham's Structuring Datasets to Facilitate Analysis. Data sets are made of variables & observations (attributes & entities) Variables contain all values that measure the same underlying attribute (e.g., height, temperature, duration) across units Observations contain all values measured on the same unit (a person, a day, a hospital stay) across attributes

What Is Tidy Data? Notes from Hadley Wickham's Structuring Datasets to Facilitate Analysis. It is easier to describe relationships between variables (age is a function of birthdate and current date) It is easier to make comparisons between groups of attributes (how many people are using this phone number?) Tidy data IS third normal form (or, preferably, Boyce-Codd Normal Form)!

TidyR tidyr is a library whose purpose is to use simple functions to make data frames tidy. It includes functions like gather (unpivot), separate (split apart a variable), and spread (pivot).
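
A minimal round trip through the functions named above, using a small invented data frame (the year/cases data is made up for illustration):

```r
library(tidyr)

# Untidy: one column per year, so the "year" variable is
# spread across column headers.
untidy <- data.frame(
    country = c("A", "B"),
    `1999` = c(745, 37737),
    `2000` = c(2666, 80488),
    check.names = FALSE
)

# gather unpivots the year columns into key/value pairs...
tidy <- gather(untidy, key = "year", value = "cases", -country)

# ...and spread pivots them back out again.
untidy_again <- spread(tidy, key = "year", value = "cases")
```

(Newer versions of tidyr supersede gather and spread with pivot_longer and pivot_wider, but the older functions still work.)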

Demo Time

Agenda High-Level Concepts SQL Server – Constraints SQL Server – Mapping Tables R – tidyr R – dplyr R – Data and Outlier Analysis

dplyr tidyr is just one part of the tidyverse. Other tidyverse packages include dplyr, lubridate, and readr. We will take a closer look at dplyr with the next example.
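A typical dplyr cleansing pipeline chains its verbs together with the pipe operator. This sketch assumes a hypothetical flights_df data frame with dep_delay and carrier columns:

```r
library(dplyr)

# Filter out rows with missing values, standardize a column,
# then summarize by group and sort.
results <- flights_df %>%
    filter(!is.na(dep_delay)) %>%
    mutate(carrier = toupper(carrier)) %>%
    group_by(carrier) %>%
    summarize(
        num_flights = n(),
        mean_delay = mean(dep_delay)
    ) %>%
    arrange(desc(mean_delay))
```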

Demo Time

Agenda High-Level Concepts SQL Server – Constraints SQL Server – Mapping Tables R – tidyr R – dplyr R – Data and Outlier Analysis

Data and Outliers Using tidyr, dplyr, and some basic visualization techniques, we can perform univariate and multivariate analysis to determine whether the data is clean. We will focus mostly on univariate and visual analysis in the following example.
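
A univariate pass might combine summary statistics, quick plots, and an interquartile-range cutoff. The data frame and column are hypothetical, and the 1.5 * IQR multiplier is a common rule of thumb rather than anything from the talk:

```r
# Univariate checks on a hypothetical numeric column.
summary(df$HourlyWage)   # five-number summary plus mean
hist(df$HourlyWage)      # distribution shape
boxplot(df$HourlyWage)   # visual outlier check

# Flag values beyond 1.5 * IQR from the quartiles.
q <- quantile(df$HourlyWage, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
outliers <- df[df$HourlyWage < q[1] - 1.5 * iqr |
               df$HourlyWage > q[2] + 1.5 * iqr, ]
```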

Demo Time

Wrapping Up This has been a quick survey of data cleansing techniques. For next steps, look at: SQL Server Data Quality Services Integration with external data sources (APIs to look up UPCs, postal addresses, etc.) Value distribution analysis (Ex: Benford's Law)

Wrapping Up To learn more, go here: https://CSmore.info/on/cleansing And for help, contact me: feasel@catallaxyservices.com | @feaselkl