Published by Opal Cooper. Modified over 9 years ago.
1
UIS Data Transformation and Validations
As it pertains to the SDMX TWG EXL Initiative
2
Gathering Data
- Each data point to be collected is described with dimensions prior to collection
- A unique identifier is assigned to each data point/dimensional grouping
- Data is collected via surveys
- Data is inserted into the database by country/year for each survey returned
- Data goes through a cleaning process that involves both human and automated validation (ERS)
3
Data Encoding

Dimension   Value
EMC_ID      20062
ACTIVE      1
EC_PRIO     2
EC_UNIT     210
EC_SECTO    100000
EC_PRGDS    100000
EC_GRADE    100000
EC_AGE      100000
EC_FIELD    10000000
EC_FORGN    100000
EC_ISCED    10
EC_SEX      100000
EC_PRGDU    100000
EC_DGPOS    100000
EC_PRGLO    100000
EC_ADULT    10

EMC_ID: Internal unique identifier used to store data. Each EMC_ID summarizes a set of dimensions for data that we collect. In this case, the data point refers to ENROLLMENT (EC_UNIT = 210) in ISCED 1 (EC_ISCED = 10). Labels for each dimensional value are stored in separate dimension tables.
For a more human-legible format, each EMC_ID used in indicator definitions is also given an alphanumeric code that summarizes the dimensions. In this case, "E.1" is used, for ENROLLMENT in ISCED 1.
4
Raw Data Validation (ERS)
- Database (T-SQL) implementation: stored procedures and reporting services
- Based on CONCEPTS
Example:
- Concept: Redundant Data Check
- Description: UIS surveys often have cells that are redundant, in order to verify that the value entered in one cell is accurate and not the victim of a human input error
- Purpose: Verify that one cell equals another, redundant, cell
- Method: Validates that a specific "MASTER" cell is equal to any other redundant cell. Redundant cells are identified by having all dimensional values equal to the master cell with the exception of the PRIORITY dimension.
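The redundant data check can be sketched in a few lines. This is only an illustration in Python, not the T-SQL used by ERS: the cell layout (a dict of dimension values plus a "value" entry), the function names, and the use of EC_PRIO as the PRIORITY dimension are all assumptions.

```python
from typing import Dict, List

# Hypothetical cell record: dimension values plus a "value" entry.
Cell = Dict[str, object]


def redundant_cells(master: Cell, cells: List[Cell], priority_dim: str = "EC_PRIO") -> List[Cell]:
    """Return cells that match the master on every dimension except priority."""
    def dims(cell: Cell) -> Cell:
        # Drop the priority dimension and the stored value before comparing.
        return {k: v for k, v in cell.items() if k not in (priority_dim, "value")}

    master_dims = dims(master)
    return [
        c for c in cells
        if c is not master
        and dims(c) == master_dims
        and c[priority_dim] != master[priority_dim]
    ]


def redundant_data_check(master: Cell, cells: List[Cell], priority_dim: str = "EC_PRIO") -> List[Cell]:
    """Return the redundant cells whose value differs from the master's value."""
    return [c for c in redundant_cells(master, cells, priority_dim) if c["value"] != master["value"]]


# Illustrative data only; the values are invented for the example.
master = {"EC_UNIT": 210, "EC_ISCED": 10, "EC_PRIO": 1, "value": 500}
copy_ok = {"EC_UNIT": 210, "EC_ISCED": 10, "EC_PRIO": 2, "value": 500}
copy_bad = {"EC_UNIT": 210, "EC_ISCED": 10, "EC_PRIO": 3, "value": 505}
unrelated = {"EC_UNIT": 210, "EC_ISCED": 20, "EC_PRIO": 2, "value": 999}

failures = redundant_data_check(master, [master, copy_ok, copy_bad, unrelated])
```

Here only copy_bad is reported: it shares every dimension with the master except priority, but carries a different value.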
5
Preparing Indicators (transformations)
- Indicators are encoded in XML using extended MathML
- The resulting XML file can render in a friendly manner in any browser, providing immediate documentation
- The indicator XML file is "parsed" to convert the XML into database records
- Indicator definitions are validated when parsed to ensure completeness as well as the existence of any needed indicators
6
Indicator Definition
Indicators are defined using MathML, with custom tags implemented by the UIS.
[XML example; markup not preserved. Text content, in order: Graduation age population / Population d'âge de graduation / (isc).(sex) / thAge.(isc) / thDur.(isc) / 1 / P.(age).(sex)]
7
Indicator Definition (cont.)
When loaded into a MathML-enabled browser, the indicator definition becomes human-readable and self-documenting. Rendering the XML in a browser also helps to validate that the XML indicator specification is well formed.
8
Indicator Definition (cont.)
A parser is then used to convert the XML indicator specification to a database structure for use in processing the transformations.

IndicCode  Term  Parent  Action  parentAction  Nterms  Value    Sequence  Source
GAP.1      1     0       offset  root          27               1
GAP.1      6     1       d       offset        0       P.Ag1    3         POP
GAP.1      7     1       d       offset        0       P.Ag2    4         POP
GAP.1      8     1       d       offset        0       P.Ag3    5         POP
GAP.1      9     1       d       offset        0       P.Ag4    6         POP
GAP.1      10    1       d       offset        0       P.Ag5    7         POP
GAP.1      11    1       d       offset        0       P.Ag6    8         POP
GAP.1      12    1       d       offset        0       P.Ag7    9         POP
GAP.1      13    1       d       offset        0       P.Ag8    10        POP
GAP.1      14    1       d       offset        0       P.Ag9    11        POP
GAP.1      15    1       d       offset        0       P.Ag10   12        POP
GAP.1      16    1       d       offset        0       P.Ag11   13        POP
GAP.1      17    1       d       offset        0       P.Ag12   14        POP
GAP.1      18    1       d       offset        0       P.Ag13   15        POP
GAP.1      19    1       d       offset        0       P.Ag14   16        POP
GAP.1      20    1       d       offset        0       P.Ag15   17        POP
GAP.1      21    1       d       offset        0       P.Ag16   18        POP
GAP.1      22    1       d       offset        0       P.Ag17   19        POP
GAP.1      23    1       d       offset        0       P.Ag18   20        POP
GAP.1      24    1       d       offset        0       P.Ag19   21        POP
GAP.1      25    1       d       offset        0       P.Ag20   22        POP
GAP.1      26    1       d       offset        0       P.Ag21   23        POP
GAP.1      27    1       d       offset        0       P.Ag22   24        POP
GAP.1      28    1       d       offset        0       P.Ag23   25        POP
GAP.1      29    1       d       offset        0       P.Ag24   26        POP
GAP.1      30    1       d       offset        0       P.Ag25   27        POP
GAP.1      1002  1       sum     offset        2                1
GAP.1      1003  1002    d       sum           0       thAge.1  1         EDU
GAP.1      1004  1002    d       sum           0       thDur.1  2         EDU
GAP.1      1005  1       c       offset        0       1        2
9
calcIndic
- Seasoned for 7 years
- Currently on its 4th version
- Entirely developed using database stored procedures and T-SQL
- Leverages well-seasoned database functionality
- Data, indicator definitions and transformation code are all in a single database
- Fast
10
calcIndic (part 2)
- Indicator definitions are read
- Each (data) or (indicator) tag is resolved by joining the required data point to the indicator definition for each country and year involved in the transformation
- The steps for performing the calculation are performed based on the indicator definition
- Data is written to domain-specific tables
- Indicator validations are performed and problematic results are flagged. The reasons for each flag are logged to permit easy auditing.
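The tag-resolution step above amounts to a join between the codes an indicator needs and the stored data, grouped by country and year. A minimal Python sketch of that idea follows; calcIndic itself does this in T-SQL, and the data layout, the function name, the country codes, the values, and the choice to fill absent inputs with "missing" are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical storage layout: (country, year, data point code) -> value.
DataStore = Dict[Tuple[str, int, str], object]


def resolve_inputs(required_codes: List[str], data: DataStore) -> Dict[Tuple[str, int], Dict[str, object]]:
    """Group the required data points by (country, year)."""
    resolved: Dict[Tuple[str, int], Dict[str, object]] = {}
    for (country, year, code), value in data.items():
        if code in required_codes:
            resolved.setdefault((country, year), {})[code] = value
    # Assumption: an input with no stored value is treated as "missing".
    for inputs in resolved.values():
        for code in required_codes:
            inputs.setdefault(code, "missing")
    return resolved


# Illustrative values only.
data = {
    ("AAA", 2006, "thAge.1"): 6,
    ("AAA", 2006, "thDur.1"): 6,
    ("BBB", 2006, "thAge.1"): 7,
    ("BBB", 2006, "E.1"): 1200,
}

inputs_by_country_year = resolve_inputs(["thAge.1", "thDur.1"], data)
```

Each (country, year) entry then holds exactly the inputs the indicator definition asks for, ready for the calculation steps.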
11
User Defined Indicator Validation (DIVA) (in development)
- XML based. Validation rules for a particular indicator are defined alongside the indicator definition.
- MathML based, with extended custom tags
- Validation process is SQL based
- As with the indicator definition, a browser plugin makes the XML definition self-documenting
12
User Defined Indicator Validation (DIVA) (in development)
[XML example; markup not preserved. Text content, in order: (isc).(sex) / SAP.(isc).(sex) / SAP.(isc) / SAP.(isc).M / SAP.(isc).F]
13
Dealing with missing/special data
- Both ERS and calcIndic allow for special processing of missing data
- Rules coding allows for custom treatment of special data
- Normal rule for formulas: "special data" properties are viral. If you add a list of numbers together, and one value is "missing", the sum will be "missing".
- Normal rule for comparisons: special data is only equal to similar special data (missing = missing).
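The "viral" rule for formulas can be illustrated with a short Python sketch. The Special enum and the function name are hypothetical, and returning the first special value encountered when several are present is an assumption; the slides only state the case of a single "missing" value.

```python
from enum import Enum
from typing import List, Union


class Special(Enum):
    """Special data markers named on the slides."""
    MISSING = "missing"
    INCLUSION = "inclusion"
    NIL = "nil"
    NOT_APPLICABLE = "not applicable"


Value = Union[int, float, Special]


def viral_sum(values: List[Value]) -> Value:
    """Add values; a special value in the list makes the whole sum special."""
    for v in values:
        if isinstance(v, Special):
            # Assumption: the first special value found is the one propagated.
            return v
    return sum(values)


total_ok = viral_sum([10, 20, 30])
total_missing = viral_sum([10, Special.MISSING, 30])
```

The default comparison rule falls out of the same representation: Special.MISSING == Special.MISSING is true, while Special.MISSING == Special.NIL and Special.MISSING == 10 are both false.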
14
Dealing with missing/special data (cont.)
- Alternate processing rules can be specified on a case-by-case basis
- When defining an indicator, each data point can have a rule specified to enable an alternate way of dealing with special data
- When defining a validation concept in ERS, each concept can have an alternate rule specified for comparisons
15
ERS: Example of special data rules for comparisons

EQUAL 1 - Direct comparison between 2 cells
Default Comparison (rows: master cell; columns: compared cell)

Master          missing  inclusion  nil    not applicable  value
missing         TRUE     FALSE      FALSE  FALSE           FALSE
inclusion       FALSE    TRUE       FALSE  FALSE           FALSE
nil             FALSE    FALSE      TRUE   FALSE           FALSE
not applicable  FALSE    FALSE      FALSE  TRUE            FALSE
value           FALSE    FALSE      FALSE  FALSE           numeric

EQUAL 2 - Comparison between one cell (master) and a sum: the sum might be "inclusion", because all is included in the master.
Alternate Comparison for INCLUSION (when the data is included in the master cell)

Master          missing  inclusion  nil    not applicable  value
missing         TRUE     FALSE      FALSE  FALSE           FALSE
inclusion       FALSE    TRUE       FALSE  FALSE           FALSE
nil             FALSE    FALSE      TRUE   FALSE           FALSE
not applicable  FALSE    FALSE      FALSE  TRUE            FALSE
value           FALSE    TRUE       FALSE  FALSE           numeric
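The two comparison rules can be expressed as small functions. This Python sketch is not the ERS implementation (which is T-SQL); the function names and the string markers are assumptions, and it reads the slide's tables with the master cell as the row and the compared cell as the column.

```python
SPECIAL = {"missing", "inclusion", "nil", "not applicable"}


def kind(cell):
    """Classify a cell as one of the special markers or as a plain value."""
    return cell if cell in SPECIAL else "value"


def equal_1(master, other):
    """Default comparison: special data only equals the same special data."""
    if kind(master) == "value" and kind(other) == "value":
        return master == other  # the "numeric" cell of the table
    return kind(master) == kind(other)


def equal_2(master, other_sum):
    """Alternate comparison for INCLUSION: a master value accepts a sum
    marked "inclusion", since the data is already counted in the master."""
    if kind(master) == "value" and other_sum == "inclusion":
        return True
    return equal_1(master, other_sum)


default_result = equal_1(120, "inclusion")
alternate_result = equal_2(120, "inclusion")
```

Only the (value, inclusion) case differs between the two rules; every other combination falls through to the default comparison.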
16
calcIndic: Example of special data rules for calculations
[XML example; markup not preserved. Data point: E.(isc).(age).(sex)]
By default, if the above data point is missing, the indicator calculated will also be labeled as missing.
[XML example; markup not preserved. Data point: E.(isc).(age).(sex), with MG="2"]
The MG="2" code above alters the behavior of the data point. Missing data for this data point will now be considered "nil", or 0.
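The effect of the MG="2" rule can be shown with a small Python sketch. The function names and the (value, mg) pair layout are hypothetical, and only the MG="2" behavior described on the slide is modeled; other MG codes, if any exist, are not covered.

```python
MISSING = "missing"


def resolve(value, mg=None):
    """Apply the per-data-point rule: with mg == "2", missing counts as 0."""
    if value == MISSING and mg == "2":
        return 0
    return value


def add_terms(terms):
    """Sum (value, mg) pairs; an unresolved missing value makes the sum missing."""
    resolved = [resolve(value, mg) for value, mg in terms]
    if MISSING in resolved:
        return MISSING
    return sum(resolved)


# Illustrative values only.
default_total = add_terms([(100, None), (MISSING, None)])
alternate_total = add_terms([(100, None), (MISSING, "2")])
```

With the default rule the result is "missing"; with MG="2" on the second data point the missing value contributes 0 and the result is 100.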
17
Future Development
- DIVA
- Ability to launch "on command", instancing
- Ability to calculate only the indicators that are affected by an underlying data change
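Calculating only the affected indicators is essentially a dependency-graph traversal. The following Python sketch shows one way it could work, assuming each indicator definition lists the data points and indicators it reads; the function name and the IND.A and IND.B indicators are hypothetical and not taken from the UIS system.

```python
from collections import defaultdict, deque


def affected_indicators(definitions, changed):
    """Return every indicator that depends, directly or not, on a changed code."""
    # Invert the definitions: input code -> indicators that read it.
    dependents = defaultdict(set)
    for indicator, inputs in definitions.items():
        for code in inputs:
            dependents[code].add(indicator)

    affected = set()
    queue = deque(changed)
    while queue:
        code = queue.popleft()
        for indicator in dependents[code]:
            if indicator not in affected:
                affected.add(indicator)
                queue.append(indicator)  # indicators can feed other indicators
    return affected


# GAP.1 inputs follow the parsed definition shown earlier (abridged);
# IND.A and IND.B are invented for the example.
definitions = {
    "GAP.1": ["P.Ag1", "thAge.1", "thDur.1"],
    "IND.A": ["E.1", "GAP.1"],
    "IND.B": ["E.1"],
}

to_recalculate = affected_indicators(definitions, ["thAge.1"])
```

A change to thAge.1 marks GAP.1 and, through it, IND.A for recalculation, while IND.B is left alone.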