An Approach for Testing the Extract-Transform-Load Process in Data Warehouse Systems Master’s Defense Hajar Homayouni Dec. 1, 2017.


Outline
- Survey of data warehouse testing research
- Problem description and goal
- Motivating example
- Proposed approach for functional testing of ETL
- Demonstration and evaluation
- Conclusions
- Future work

Data Warehouse Systems
Help researchers and analysts make accurate analyses and decisions, and find precise patterns and trends in data.
[Diagram: a source area (relational/non-relational databases, XML files, and CSV files from hospitals) feeds through an Extract-Transform-Load (ETL) process into a health data warehouse on Google BigQuery, which serves front-end applications such as analysis and decision support tools, e.g., analysis of drugs and treatments.]

Extract-Transform-Load Process

Data Warehouse Testing Survey Goals
- Create a classification framework for data warehouse testing approaches
- Survey existing techniques and their limitations
- Identify open problems

Classification Framework for Data Warehouse Testing Approaches
[Diagram: testing approaches are classified by component — source area, target data warehouse, Extract-Transform-Load process, and front-end applications — and, for the source area and target data warehouse, by what is tested: the underlying data, the data model, and the data management product. Test types across the framework include functional, performance, stress, recovery, scalability, reliability, regression, security, and usability testing, plus structural, usability, and maintainability evaluation of data models.]

Functional Testing of the ETL Process
Data quality tests: validate data in the target data warehouse in isolation to detect syntactic and semantic violations.
- Example: a patient's height and weight values stored in the target data warehouse must be positive.
Balancing tests: analyze the data in the source databases and target data warehouse and report differences.
- Example: the date of birth must be consistent between the target data warehouse and the source hospital for the same patient.
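A data quality test of the kind described above can be sketched as follows. This is a minimal illustration using an in-memory SQLite database as a stand-in for the warehouse; the table and column names are illustrative, not the actual OMOP schema.

```python
import sqlite3

# Mock of the target warehouse (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (Person_id INTEGER, Height REAL, Weight REAL)")
conn.executemany("INSERT INTO Person VALUES (?, ?, ?)",
                 [(1, 170.0, 65.0), (2, -5.0, 70.0), (3, 160.0, 0.0)])

# Data quality test: height and weight must be positive.
# The target is checked in isolation, without consulting any source.
violations = conn.execute(
    "SELECT Person_id FROM Person WHERE Height <= 0 OR Weight <= 0 "
    "ORDER BY Person_id"
).fetchall()
print(violations)  # [(2,), (3,)] -- records 2 and 3 violate the property
```
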

Limitations of Existing Balancing Test Approaches
- Stare and compare: a manual approach that requires exhaustive comparison of data
- QuerySurge: an automated approach, but it only verifies data that is not modified during the ETL transformation

Goal
Propose an automated approach for validating ETL processes using balancing tests that:
- Ensures that the data obtained from the source databases is not lost or incorrectly modified by the ETL process
- Validates data that has been reformatted and modified by the ETL process

Sources of Faults in Data
- The complexity of transformations makes ETL implementations fault prone
- Faulty ETL implementations lead to incorrect data in the data warehouse
- System failures or connection loss may result in data loss or duplication
- Erroneous settings of ETL parameters can result in incorrect data
- Malicious programs may remove or modify data in data warehouses

Motivating Example
Table mappings:
- One-to-one table mapping: does the Location table lose any addresses of patients admitted after year 2000?
  Source table: Address; source attribute: Address_key; target table: Location; target attribute: Location_id; selection condition: transform all the new addresses (Year > 2000)
- Many-to-one table mapping (LEFT JOIN): does the Person table lose any current patients?
  Source tables: Patient LEFT JOIN Address; source attribute: Address_key; target table: Person; target attribute: Location_id; selection condition: transform all the current patients with their new addresses (Year > 2000)

Motivating Example (cont.)
Attribute mappings:
- One-to-one attribute mapping: are any of the attribute values incorrect?
  Source table: Address; source attribute: Address_key; target table: Location; target attribute: Location_id; selection condition: transform all the new addresses (Year > 2000)
- Many-to-one attribute mapping: is Date_of_birth the correct concatenation of the three attributes in the source?
  Source table: Patient; source attributes: Day_of_birth, Month_of_birth, Year_of_birth; target table: Person; target attribute: Date_of_birth; selection condition: transform all the current patients

Proposed Approach
- Define properties to be checked in balancing tests
- Identify source-to-target mappings from transformation rules
- Generate balancing tests

Balancing Properties
- Completeness
- Consistency
- Syntactic validity

Completeness
Ensures that all the source data that must be transformed and loaded are present in the warehouse.
Verified by checking:
- Record count match
- Distinct record count match

Record Count Match
One-to-one table mapping:
  COUNT(target_table) = COUNT(source_table) WHERE condition
  Example: COUNT(Location) = COUNT(Address) WHERE (Year > 2000)
Many-to-one table mapping (N and M are the numbers of records in the left and right source tables):
  Upper bound on COUNT(target_table) in UNION = N + M
  Upper bound on COUNT(target_table) in JOIN = N * M
  Lower bound on COUNT(target_table) in JOIN = 0
  Lower bound on COUNT(target_table) in LEFT JOIN = N
  Lower bound on COUNT(target_table) in RIGHT JOIN = M
  Example: COUNT(Person) >= COUNT(Patient) WHERE (IsCurrent = TRUE)
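The one-to-one record count match above can be sketched as follows, with SQLite standing in for the actual MS SQL Server source and BigQuery target; the schema and data are illustrative.

```python
import sqlite3

# Mock source and target stores (illustrative schemas).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Address (Address_key INTEGER, Year INTEGER)")
conn.execute("CREATE TABLE Location (Location_id INTEGER)")
conn.executemany("INSERT INTO Address VALUES (?, ?)",
                 [(1, 1998), (2, 2005), (3, 2010)])
# ETL loads only the new addresses (Year > 2000).
conn.executemany("INSERT INTO Location VALUES (?)", [(2,), (3,)])

# Record count match: COUNT(target) must equal COUNT(source) under the
# selection condition of the mapping (Year > 2000).
source_count = conn.execute(
    "SELECT COUNT(*) FROM Address WHERE Year > 2000").fetchone()[0]
target_count = conn.execute("SELECT COUNT(*) FROM Location").fetchone()[0]
assert target_count == source_count  # the property holds for this data
```
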

Distinct Record Count Match
One-to-one table mapping:
  DISTINCT COUNT(target_table) = COUNT(source_table) WHERE condition
  Example: DISTINCT COUNT(Location) = COUNT(Address) WHERE (Year > 2000)
Many-to-one table mapping (N and M are the numbers of records in the left and right source tables):
  Upper bound on DISTINCT COUNT(target_table) in UNION = N + M
  Upper bound on DISTINCT COUNT(target_table) in JOIN = N * M
  Lower bound on DISTINCT COUNT(target_table) in JOIN = 0
  Lower bound on DISTINCT COUNT(target_table) in LEFT JOIN = N
  Lower bound on DISTINCT COUNT(target_table) in RIGHT JOIN = M
  Example: DISTINCT COUNT(Person) >= COUNT(Patient) WHERE (IsCurrent = TRUE)

Consistency
Ensures that attribute contents in the target table conform to the corresponding ones in the source table(s) based on the transformation rules.
Verified by checking:
- Attribute value match
- Attribute constraint match
- Attribute outlier match
- Attribute average match

Consistency Example
Attribute value match:
- One-to-one (Address.Address_key -> Location.Location_id, for new addresses with Year > 2000): is every target value IN the set of source values?
- Many-to-one (Patient.Day_of_birth, Month_of_birth, Year_of_birth -> Person.Date_of_birth, for current patients): is every target value IN the set of CONCAT(Day_of_birth, Month_of_birth, Year_of_birth) values computed from the source?

Consistency Example (cont.)
Attribute constraint match:
- One-to-one (Address.Address_key -> Location.Location_id, Year > 2000): Address_key is NOT NULL in the source; is Location_id NOT NULL in the target?
- Many-to-one (Patient birth attributes -> Person.Date_of_birth, current patients): Year_of_birth < Year.Now() holds in the source; does Date_of_birth < Date.Now() hold in the target?

Consistency Example (cont.)
Attribute outlier/average match:
- One-to-one (Address.Address_key -> Location.Location_id, Year > 2000): do the Min/Max/Avg of the source attribute match the Min/Max/Avg of the target attribute?
- Many-to-one (Patient.Weight, Patient.Height -> Person.BMI, current patients): compare the Min/Max/Avg of BMI in the target against BMI = Avg(Weight) / Avg(Height)^2 computed from the source.
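The many-to-one average match above can be sketched as follows, again with SQLite and an illustrative schema. Note that exact equality is not expected here, since the average of ratios generally differs from the ratio of averages; a tolerance or bound would be used in practice, which is why the check below only confirms the source-derived value lies within the target's observed range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Patient (Weight REAL, Height REAL, IsCurrent INTEGER)")
conn.execute("CREATE TABLE Person (BMI REAL)")
conn.executemany("INSERT INTO Patient VALUES (?, ?, ?)",
                 [(60.0, 1.5, 1), (90.0, 2.0, 1), (80.0, 1.8, 0)])
# ETL computes BMI = Weight / Height^2 for current patients.
for w, h in [(60.0, 1.5), (90.0, 2.0)]:
    conn.execute("INSERT INTO Person VALUES (?)", (w / h ** 2,))

# Source side: Avg(Weight) / Avg(Height)^2 for current patients.
avg_w, avg_h = conn.execute(
    "SELECT AVG(Weight), AVG(Height) FROM Patient WHERE IsCurrent = 1"
).fetchone()
source_value = avg_w / avg_h ** 2

# Target side: Min/Max/Avg of BMI.
target_min, target_max, target_avg = conn.execute(
    "SELECT MIN(BMI), MAX(BMI), AVG(BMI) FROM Person").fetchone()

# Bound check: the source-derived value should fall within the target range.
assert target_min <= source_value <= target_max
```
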

Syntactic Validity
Ensures that the syntax of attributes in the target table conforms to the syntax of the corresponding ones in the source table(s) based on the transformation rules.
Verified by checking:
- Attribute data type match
- Attribute length match

Syntactic Validity Example
Attribute data type match (one-to-one: Address.Address_key -> Location.Location_id, Year > 2000): the source type is BigInt; is the target type INT64?
Attribute length match (many-to-one: Patient.Day_of_birth, Month_of_birth, Year_of_birth -> Person.Date_of_birth, current patients): does the average length of Date_of_birth in the target equal the average length of the CONCAT of the three source attributes?
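The attribute length match above can be sketched as follows (SQLite, illustrative schema and data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Patient (Day_of_birth TEXT, Month_of_birth TEXT, "
             "Year_of_birth TEXT, IsCurrent INTEGER)")
conn.execute("CREATE TABLE Person (Date_of_birth TEXT)")
conn.executemany("INSERT INTO Patient VALUES (?, ?, ?, ?)",
                 [("01", "02", "1990", 1), ("15", "11", "1985", 1)])
# ETL concatenates the three source attributes for current patients.
conn.execute("INSERT INTO Person SELECT Day_of_birth || Month_of_birth || "
             "Year_of_birth FROM Patient WHERE IsCurrent = 1")

# Attribute length match: the average length of the target attribute should
# equal the average length of the concatenated source attributes.
source_len = conn.execute(
    "SELECT AVG(LENGTH(Day_of_birth || Month_of_birth || Year_of_birth)) "
    "FROM Patient WHERE IsCurrent = 1").fetchone()[0]
target_len = conn.execute(
    "SELECT AVG(LENGTH(Date_of_birth)) FROM Person").fetchone()[0]
assert source_len == target_len
```
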

Identify Source-to-Target Mappings
Input: ETL transformation rules; output: mapping table.
Table mapping:
- Target table: INSERT INTO clause
- Source table(s): FROM clause
- Table operation(s): JOIN/UNION clause(s)
- Selection condition: WHERE clause
Attribute mapping:
- Source/target attribute(s): AS clause
- Attribute operation(s): aggregation function(s)
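A simplified sketch of how the clauses above yield one row of the mapping table. The transformation rule and the regex-based extraction are illustrative only; a real implementation would use a proper SQL parser rather than regular expressions.

```python
import re

# Hypothetical ETL transformation rule for the one-to-one mapping example.
rule = ("INSERT INTO Location (Location_id) "
        "SELECT Address_key AS Location_id FROM Address WHERE Year > 2000")

# Extract one table/attribute mapping row from the rule's clauses.
mapping = {
    "target_table": re.search(r"INSERT INTO (\w+)", rule).group(1),
    "source_table": re.search(r"FROM (\w+)", rule).group(1),
    "source_attribute": re.search(r"SELECT (\w+) AS", rule).group(1),
    "target_attribute": re.search(r"AS (\w+)", rule).group(1),
    "selection_condition": re.search(r"WHERE (.+)$", rule).group(1),
}
print(mapping)
```
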

Generate Balancing Tests
[Diagram: the mapping table and the balancing properties feed the generation of analysis queries; a translator for SQL dialects runs these queries against the source and target systems to produce source and target analysis values, from which the test assertions are generated.]

Generate Analysis Queries
For each row of the mapping table, generate analysis queries to check the balancing properties.
Record count match:
- Source analysis query: SELECT COUNT(Patient) WHERE IsCurrent = TRUE
- Target analysis query: SELECT COUNT(Person)
Attribute outlier match:
- Source analysis query: SELECT MIN/MAX(Avg(Weight) / Avg(Height)^2) WHERE IsCurrent = TRUE
- Target analysis query: SELECT MIN/MAX(BMI)
Example mapping rows:
- Source tables: [0] Patient, [1] Address; table operation: LEFT JOIN; source attribute: Address_Key; target table: Person; target attribute: Location_id; selection condition: IsCurrent = TRUE and Year > 2000
- Source table: Patient; source attributes: [0] Weight, [1] Height; attribute operation: Avg(Weight)/Avg(Height)^2; target attribute: BMI; selection condition: IsCurrent = TRUE
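Query generation for the record count match property can be sketched as a function over a mapping row; the row's field names are illustrative, not the thesis's actual mapping-table schema.

```python
# Generate source/target analysis queries for the record count match
# property from one row of the mapping table (illustrative field names).
def record_count_queries(row):
    source = f"SELECT COUNT(*) FROM {row['source_table']}"
    if row.get("selection_condition"):
        source += f" WHERE {row['selection_condition']}"
    target = f"SELECT COUNT(*) FROM {row['target_table']}"
    return source, target

row = {"source_table": "Patient", "target_table": "Person",
       "selection_condition": "IsCurrent = TRUE"}
src_q, tgt_q = record_count_queries(row)
print(src_q)  # SELECT COUNT(*) FROM Patient WHERE IsCurrent = TRUE
print(tgt_q)  # SELECT COUNT(*) FROM Person
```

A dialect translator (as in the approach's architecture) would then rewrite these strings for the source DBMS and for BigQuery before execution.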

Generate Test Assertions
Compare the analysis values from the source with those from the target and store the mismatches in a test result table.
Example: Source_analysis_value = COUNT(Patient) WHERE (IsCurrent = TRUE); Target_analysis_value = COUNT(Person).
Test assertion for record count match:
  If Source_analysis_value > Target_analysis_value then Assertion_id = 100 and Description = "Record count mismatch: target_table VS source_table"
Test result table:
  Assertion ID: 100; Description: Record count mismatch: Person VS Patient
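The assertion step above can be sketched as follows. Assertion ID 100 and the description text come from the slide; the function and dictionary shapes are illustrative.

```python
# Evaluate the record count match assertion and, on mismatch, produce a row
# for the test result table (illustrative structure).
def check_record_count(source_value, target_value, source_table, target_table):
    if source_value > target_value:
        return {"assertion_id": 100,
                "description": f"Record count mismatch: {target_table} "
                               f"VS {source_table}"}
    return None  # assertion holds, nothing to report

test_result_table = []
# Hypothetical analysis values returned by the source and target queries.
result = check_record_count(source_value=1200, target_value=1150,
                            source_table="Patient", target_table="Person")
if result is not None:
    test_result_table.append(result)
print(test_result_table)
```
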

Empirical Evaluation Objectives
- Validation of ETL scripts
- Evaluation of the fault-finding ability of assertions

Validation of ETL Scripts
Goal: demonstrate that the ETL in the health data warehouse is correct with respect to the assertions.
Question: do the test assertions find faults that were previously undetected by the original developers?
Metric: number of faults detected by the assertions.

Validation of ETL Scripts (cont.)
Experiment subjects:
- Sources: Hospital A (72,652,442 records) and Hospital B (32,726,923 records), both with the Caboodle data model on MS SQL Server
- Target: data warehouse (127,714,325 records) with the OMOP data model on Google BigQuery
Results:
- Generated 44 test assertions
- Revealed 11 previously undetected faults
- Six assertion violations for Hospital A: three average mismatches, two record count mismatches, one outlier mismatch
- Five assertion violations for Hospital B: four average mismatches

Fault-Finding Ability of Assertions
Goal: evaluate the ability of the generated assertions to detect faults injected into mock data.
Question: does at least one assertion fail as a result of each fault?
Metric: percentage of faults in the mock data detected by the test assertions.
Experiment subjects:
- Source: mock database (900 records) with the Caboodle data model on MS SQL Server
- Target: data warehouse (112 records) with the OMOP data model on Google BigQuery
Results:
- Detected 92% of 118 injected faults
- All undetected faults resulted from mutating attributes in the target table that did not map to any attribute in the source table(s)
- All the assertions that should have failed actually failed

Discussion
Properties:
- Completeness: necessary but not sufficient
Limitations of our evaluation:
- Single ETL program used
- No comparison with other approaches
- Single test data generator

Conclusions
- Proposed a classification framework based on what is tested and how it is tested
- Identified gaps in the literature and proposed directions for future research
- Presented an approach to automatically create balancing tests
- The assertions found previously undetected faults in the health data warehouse
- The assertions detected faults injected into mock data in the target data warehouse

Future Work
- Define a structured format to specify source-to-target mappings in a consistent manner
- Implement an algorithm to generate effective test assertions from these mappings
- Develop an approach that aids developers in effectively localizing faults in the ETL process
- Develop an approach that selects test cases to make regression testing more efficient than a test-all process

Publications
- Hajar Homayouni, Sudipto Ghosh, and Indrakshi Ray. Data Warehouse Testing. Accepted for publication in Advances in Computers, 2017.
- Hajar Homayouni, Sudipto Ghosh, and Indrakshi Ray. On Generating Balancing Tests for Validating the Extract-Transform-Load Process for Data Warehouse Systems. Submitted to the 11th IEEE Conference on Software Testing, Validation and Verification, 2018.

Thank You

Mutation Analysis
Mutation operators:
- AR: add random record
- DR: delete random record
- MNF: modify numeric field
- MSF: modify string field
- MNFmin: modify min of numeric field
- MNFmax: modify max of numeric field
- MSFlength: modify string field length
- MFnull: modify field to null
Property violations for each operator:
- AR: record count and distinct record count match violation
- DR: record count and distinct record count match violation
- MNF: attribute value match and average match violation
- MSF: attribute value match and attribute length match violation
- MNFmin: outlier match and average match violation
- MNFmax: outlier match and average match violation
- MSFlength: attribute length match violation
- MFnull: attribute constraint match violation
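The DR operator and its detection by the record count match property can be sketched as follows; SQLite stands in for the mock source and target, and the schema is illustrative.

```python
import sqlite3

# Mock source and correctly loaded mock target (illustrative schemas).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Patient (Patient_id INTEGER, IsCurrent INTEGER)")
conn.execute("CREATE TABLE Person (Person_id INTEGER)")
conn.executemany("INSERT INTO Patient VALUES (?, 1)", [(i,) for i in range(5)])
conn.executemany("INSERT INTO Person VALUES (?)", [(i,) for i in range(5)])

# Record count match assertion: COUNT(Person) >= COUNT(Patient) for
# current patients (the LEFT JOIN lower bound from the slides).
def record_count_match():
    src = conn.execute(
        "SELECT COUNT(*) FROM Patient WHERE IsCurrent = 1").fetchone()[0]
    tgt = conn.execute("SELECT COUNT(*) FROM Person").fetchone()[0]
    return src <= tgt

assert record_count_match()                             # holds before mutation
conn.execute("DELETE FROM Person WHERE Person_id = 2")  # DR: delete a record
assert not record_count_match()                         # the fault is detected
```
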