UIS Data Transformation and Validations As it pertains to the SDMX TWG EXL Initiative.


Gathering Data
- Each data point to be collected is described with dimensions prior to collection.
- Unique identifier is assigned to each data point/dimensional grouping.
- Data is collected via surveys.
- Data is inserted into the database by country/year for each survey returned.
- Data goes through a cleaning process that involves both human and automated validation (ERS).

Data Encoding

Field      Value
EMC_ID     20062
ACTIVE     1
EC_PRIO    2
EC_UNIT    210
EC_SECTO
EC_PRGDS
EC_GRADE
EC_AGE
EC_FIELD
EC_FORGN
EC_ISCED   10
EC_SEX
EC_PRGDU
EC_DGPOS
EC_PRGLO
EC_ADULT   10

- EMC_ID: Internal unique identifier used to store data. Each EMC_ID summarizes a set of dimensions for the data that we collect.
- In this case, the data point refers to ENROLLMENT (EC_UNIT = 210) in ISCED 1 (EC_ISCED = 10).
- Labels for each dimensional value are stored in separate dimension tables.
- For a more human-legible format, each EMC_ID used in indicator definitions is also given an alphanumeric code that summarizes the dimensions. In this case, “E.1” is used, for ENROLLMENT in ISCED 1.

Raw Data Validation (ERS)
- Database (T-SQL) implementation: stored procedures and reporting services.
- Based on CONCEPTS.

Example:
- Concept: Redundant Data Check
- Description: UIS surveys often include redundant cells, so that a value entered in one cell can be checked against another and human input errors caught.
- Purpose: Verify that one cell equals another, redundant, cell.
- Method: Validates that a specific “MASTER” cell is equal to any other redundant cell. Redundant cells are identified by having all dimensional values equal to those of the master cell, with the exception of the PRIORITY dimension.
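The redundant data check can be sketched as follows. This is a minimal illustration in Python rather than the T-SQL stored procedures the slides describe. The cell layout, the function names, and the sample values are hypothetical, and EC_PRIO is assumed to be the column holding the PRIORITY dimension.

```python
# Hypothetical cell records: dimension values plus the reported number.
# EC_PRIO is assumed to hold the PRIORITY dimension; all data is made up.
PRIORITY = "EC_PRIO"

def redundant_cells(master, cells):
    """Return cells whose dimensions match the master except for PRIORITY."""
    def key(cell):
        return {k: v for k, v in cell["dims"].items() if k != PRIORITY}
    return [
        c for c in cells
        if c is not master
        and key(c) == key(master)
        and c["dims"][PRIORITY] != master["dims"][PRIORITY]
    ]

def redundant_data_check(master, cells):
    """Return the redundant cells whose value differs from the master's."""
    return [c for c in redundant_cells(master, cells)
            if c["value"] != master["value"]]

master = {"dims": {"EC_UNIT": 210, "EC_ISCED": 10, "EC_PRIO": 1}, "value": 5000}
cells = [
    master,
    {"dims": {"EC_UNIT": 210, "EC_ISCED": 10, "EC_PRIO": 2}, "value": 5000},
    {"dims": {"EC_UNIT": 210, "EC_ISCED": 10, "EC_PRIO": 3}, "value": 5100},
    {"dims": {"EC_UNIT": 210, "EC_ISCED": 20, "EC_PRIO": 2}, "value": 5000},
]
failures = redundant_data_check(master, cells)
```

Here only the cell with EC_PRIO 3 is reported: it shares every other dimension with the master but carries a different value. The ISCED 2 cell is ignored because it differs in more than the priority dimension.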

Preparing Indicators (transformations)
- Indicators are encoded in XML using extended MathML.
- Resulting XML file can render in a friendly manner in any browser, providing immediate documentation.
- Indicator XML file is “parsed” to convert the XML into database records.
- Indicator definitions are validated when parsed to ensure completeness as well as the existence of any needed indicators.

Indicator Definition
- Indicators are defined using MathML, with custom tags implemented by the UIS.
- [XML listing; text content only: Graduation age population; Population d'âge de graduation; (isc).(sex); thAge.(isc); thDur.(isc); 1; P.(age).(sex)]

Indicator Definition (cont.)
- When loaded into a MathML-enabled browser, the indicator definition becomes human-readable and self-documenting.
- Rendering the XML in a browser also helps to validate that the XML indicator specification is well-formed.

Indicator Definition (cont.)
- A parser is then used to convert the XML indicator specification to a database structure for use in processing the transformations.

IndicCode  Term  Parent  Action  parentAction  Nterms  Value    Sequence  Source
GAP.1      1     0       offset  root          27               1
GAP.1      6     1       d       offset        0       P.Ag1    3         POP
GAP.1      7     1       d       offset        0       P.Ag2    4         POP
GAP.1      8     1       d       offset        0       P.Ag3    5         POP
GAP.1      9     1       d       offset        0       P.Ag4    6         POP
GAP.1      10    1       d       offset        0       P.Ag5    7         POP
GAP.1      11    1       d       offset        0       P.Ag6    8         POP
GAP.1      12    1       d       offset        0       P.Ag7    9         POP
GAP.1      13    1       d       offset        0       P.Ag8    10        POP
GAP.1      14    1       d       offset        0       P.Ag9    11        POP
GAP.1      15    1       d       offset        0       P.Ag10   12        POP
GAP.1      16    1       d       offset        0       P.Ag11   13        POP
GAP.1      17    1       d       offset        0       P.Ag12   14        POP
GAP.1      18    1       d       offset        0       P.Ag13   15        POP
GAP.1      19    1       d       offset        0       P.Ag14   16        POP
GAP.1      20    1       d       offset        0       P.Ag15   17        POP
GAP.1      21    1       d       offset        0       P.Ag16   18        POP
GAP.1      22    1       d       offset        0       P.Ag17   19        POP
GAP.1      23    1       d       offset        0       P.Ag18   20        POP
GAP.1      24    1       d       offset        0       P.Ag19   21        POP
GAP.1      25    1       d       offset        0       P.Ag20   22        POP
GAP.1      26    1       d       offset        0       P.Ag21   23        POP
GAP.1      27    1       d       offset        0       P.Ag22   24        POP
GAP.1      28    1       d       offset        0       P.Ag23   25        POP
GAP.1      29    1       d       offset        0       P.Ag24   26        POP
GAP.1      30    1       d       offset        0       P.Ag25   27        POP
GAP                      sum     offset        2                1
GAP                      d       sum           0       thAge.1  1         EDU
GAP                      d       sum           0       thDur.1  2         EDU
GAP                      c       offset        0       1        2
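One way to picture the conversion from a nested definition to flat database rows is the sketch below. It is not the UIS parser: the nested tuple format, the `flatten` function, and the depth-first numbering of terms are assumptions made for illustration. Only the action names (offset, sum, d, c) and the column names are taken from the slide, and the Source column is left out.

```python
from itertools import count

# Hypothetical nested form of a parsed indicator: (action, value, children).
# The tuple layout and the numbering scheme are assumptions, not the UIS format.
def flatten(indic_code, tree):
    """Walk the tree depth-first and emit one flat row per term."""
    rows = []
    next_term = count(1)

    def visit(node, parent_term, parent_action, sequence):
        action, value, children = node
        term = next(next_term)
        rows.append({
            "IndicCode": indic_code, "Term": term, "Parent": parent_term,
            "Action": action, "parentAction": parent_action,
            "Nterms": len(children), "Value": value, "Sequence": sequence,
        })
        # Children are numbered 1..n within their parent.
        for seq, child in enumerate(children, start=1):
            visit(child, term, action, seq)

    visit(tree, 0, "root", 1)
    return rows

populations = [("d", f"P.Ag{i}", []) for i in range(1, 26)]
gap = ("offset", None, [
    ("sum", None, [("d", "thAge.1", []), ("d", "thDur.1", [])]),
    ("c", "1", []),
    *populations,
])
rows = flatten("GAP.1", gap)
```

Under this assumed numbering the root `offset` row has 27 child terms, and the first population term, P.Ag1, receives Term 6 and Sequence 3. A flat layout of this kind lets each term be joined to its data with ordinary SQL.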

calcIndic
- Seasoned for 7 years.
- Currently on its 4th version.
- Entirely developed using database stored procedures and T-SQL.
- Leverages well-seasoned database functionality.
- Data, indicator definitions and transformation code are all in a single database.
- Fast.

calcIndic (part 2)
- Indicator definitions are read.
- Each (data) or (indicator) tag is resolved by joining the required data point to the indicator definition for each country and year involved in the transformation.
- The steps for performing the calculation are performed based on the indicator definition.
- Data is written to domain-specific tables.
- Indicator validations are performed and problematic results are flagged. The reasons for each flag are logged to permit easy auditing.
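The join-based resolution step can be illustrated with a small self-contained sketch. It uses Python's built-in sqlite3 module rather than T-SQL. The table names, the indicator "TOTAL.1", the country codes, and the figures are all hypothetical, and the flagging rule (flag a result when a required term has no data) is an assumption standing in for the real validations.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE indicator_terms (indic_code TEXT, data_code TEXT);
    CREATE TABLE data_points (data_code TEXT, country TEXT, year INTEGER, value REAL);
    -- Hypothetical indicator "TOTAL.1" defined as the sum of two data points.
    INSERT INTO indicator_terms VALUES ('TOTAL.1', 'E.1.M'), ('TOTAL.1', 'E.1.F');
    -- Made-up figures; country BBB has no value for E.1.F.
    INSERT INTO data_points VALUES
        ('E.1.M', 'AAA', 2005, 120.0), ('E.1.F', 'AAA', 2005, 110.0),
        ('E.1.M', 'BBB', 2005, 300.0);
""")

# Join every term of the definition to its data for each country and year,
# then aggregate. If a required term has no data, the result is NULL and
# the row is flagged.
query = """
    WITH needed AS (
        SELECT indic_code, COUNT(*) AS n
        FROM indicator_terms GROUP BY indic_code
    )
    SELECT t.indic_code, d.country, d.year,
           CASE WHEN COUNT(d.value) = MAX(n.n) THEN SUM(d.value) END AS result,
           COUNT(d.value) < MAX(n.n) AS flagged
    FROM indicator_terms AS t
    JOIN needed AS n ON n.indic_code = t.indic_code
    JOIN data_points AS d ON d.data_code = t.data_code
    GROUP BY t.indic_code, d.country, d.year
    ORDER BY d.country
"""
results = con.execute(query).fetchall()
```

Because the definition and the data sit in the same database, one set-based query resolves all countries and years at once, which is consistent with the slides' point about keeping everything in a single database.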

User Defined Indicator Validation (DIVA) (in development)
- XML based. Validation rules for a particular indicator are defined alongside the indicator definition.
- MathML based with extended custom tags.
- Validation process is SQL based.
- As with the indicator definition, a browser plugin makes the XML definition self-documenting.

User Defined Indicator Validation (DIVA) (in development)
- [XML listing; text content only: (isc).(sex); SAP.(isc).(sex); SAP.(isc); SAP.(isc).M; SAP.(isc).F]

Dealing with missing/special data
- Both ERS and calcIndic allow for special processing of missing data.
- Rule coding allows for custom treatment of special data.
- Normal rule for formulas: “Special data” properties are viral. If you add a list of numbers together, and one value is “missing”, the sum will be “missing”.
- Normal rule for comparisons: Special data is only equal to similar special data (missing = missing).

Dealing with missing/special data (cont.)
- Alternate processing rules can be specified on a case-by-case basis.
- When defining an indicator, each data point can have a rule specified to enable an alternate way of dealing with special data.
- When defining a validation concept in ERS, each concept can have an alternate rule specified for comparisons.

ERS: Example of special data rules for comparisons

EQUAL 1 - Direct comparison between 2 cells (Default Comparison)

Result (rows: Master; columns: compared cell)

Master          missing  inclusion  nil    not applicable  value
missing         TRUE     FALSE      FALSE  FALSE           FALSE
inclusion       FALSE    TRUE       FALSE  FALSE           FALSE
nil             FALSE    FALSE      TRUE   FALSE           FALSE
not applicable  FALSE    FALSE      FALSE  TRUE            FALSE
value           FALSE    FALSE      FALSE  FALSE           numeric

EQUAL 2 - Comparison between one cell (master) and a sum (Alternate Comparison for INCLUSION, when the data is included in the master cell): The sum might be “inclusion”, because all is included in the master.

Result (rows: Master; columns: sum)

Master          missing  inclusion  nil    not applicable  value
missing         TRUE     FALSE      FALSE  FALSE           FALSE
inclusion       FALSE    TRUE       FALSE  FALSE           FALSE
nil             FALSE    FALSE      TRUE   FALSE           FALSE
not applicable  FALSE    FALSE      FALSE  TRUE            FALSE
value           FALSE    TRUE       FALSE  FALSE           numeric
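The two comparison rules can be expressed as a short sketch. The string markers for special values and the function names `equal_1` and `equal_2` are assumptions for illustration; ERS itself is implemented in T-SQL. The sketch reads the tables with the master cell as the row and the other cell (or sum) as the column.

```python
# Assumed string markers for the special values named on the slide.
SPECIAL = {"missing", "inclusion", "nil", "not applicable"}

def kind(cell):
    """Classify a cell as one of the special values or as a plain value."""
    return cell if cell in SPECIAL else "value"

def equal_1(master, other):
    """Default rule: special data equals only the same special data."""
    if kind(master) != kind(other):
        return False
    if kind(master) == "value":
        return master == other  # the "numeric" cell: compare the numbers
    return True

def equal_2(master, total):
    """Alternate rule: a sum marked 'inclusion' matches a master value."""
    if kind(master) == "value" and total == "inclusion":
        return True
    return equal_1(master, total)
```

For example, `equal_1(100, "inclusion")` is false under the default rule, while `equal_2(100, "inclusion")` is true because the summed cells are taken to be included in the master. The reverse case, `equal_2("inclusion", 100)`, stays false.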

calcIndic: Example of special data rules for calculations
- Data point: E.(isc).(age).(sex)
  By default, if the above data point is missing, the indicator calculated will also be labeled as missing.
- Data point with MG=“2”: E.(isc).(age).(sex)
  The MG=“2” code alters the behavior of the data point. Missing data for this data point will now be considered ‘nil’ or 0.
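The effect of the MG code can be sketched as below. The `MISSING` marker, the `resolve` and `add` functions, and the sample figure are hypothetical. Only the behaviour (missing is viral by default, and MG="2" turns missing into nil/0) comes from the slides.

```python
MISSING = "missing"  # assumed marker for a missing value

def resolve(value, mg=None):
    """Apply a data point's special-data rule before it enters the formula."""
    if value == MISSING and mg == "2":
        return 0  # MG="2": treat missing as nil/0
    return value

def add(a, b):
    """Default formula rule: a missing operand makes the result missing."""
    if MISSING in (a, b):
        return MISSING
    return a + b

# Default: the missing data point makes the whole result missing.
default_result = add(resolve(250), resolve(MISSING))
# With MG="2" on the second data point, it contributes 0 instead.
mg2_result = add(resolve(250), resolve(MISSING, mg="2"))
```

Applying the rule per data point, before the formula runs, keeps the default viral behaviour for every other term in the same indicator.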

Future Development
- DIVA
- Ability to launch “on command”, instancing
- Ability to calculate only the indicators that are affected by an underlying data change