A handbook on validation methodology
Marco Di Zio, Istat
Workshop ValiDat Foundation – Wiesbaden, 10-11 November 2015
Underlying idea of the HB
Why a handbook (HB) on methodology for data validation (DV)?
To standardize language and elements, provide common measures for evaluation… that is, to establish a common reference framework and develop metrics for evaluating DV.
The HB is composed of two main parts:
A generic framework for data validation.
A discussion of metrics to evaluate a validation procedure (tuning, evaluating the procedure..).
ValiDat foundation workshop - Wiesbaden 10-11 November 2015
Generic framework for data validation
The objective of this first section is to clarify what, why, how, and …
Generic framework for data validation
Clearly establish the relation with other phases of the statistical production process and with international standards such as GSBPM, GSDEMs, and GSIM.
Describe the data validation life cycle, which is useful for managing the data validation process.
What is data validation…
Definition: Data validation is an activity verifying whether or not a combination of values is a member of a set of acceptable combinations.
This is not far from the UNECE definition: an activity aimed at verifying whether the value of a data item comes from the given (finite or infinite) set of acceptable values. It is, however, essentially different…
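The contrast between the two definitions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variables, their domains, and the rule that persons under 16 can only be single are hypothetical and do not come from the handbook.

```python
# Hypothetical domains for two variables (illustrative only).
AGE_VALUES = set(range(0, 121))
MARITAL_VALUES = {"single", "married", "widowed", "divorced"}


def value_is_acceptable(value, acceptable_values):
    """Single-item check: is the value in its own set of acceptable values?"""
    return value in acceptable_values


def combination_is_acceptable(age, marital_status):
    """Combination check: is the pair (age, marital_status) acceptable?

    The set of acceptable combinations is described by a rule instead of
    being enumerated. Assumed rule: persons under 16 can only be 'single'.
    """
    if age not in AGE_VALUES or marital_status not in MARITAL_VALUES:
        return False
    return age >= 16 or marital_status == "single"


record = {"age": 5, "marital_status": "married"}

# Every value is acceptable on its own ...
each_value_ok = (
    value_is_acceptable(record["age"], AGE_VALUES)
    and value_is_acceptable(record["marital_status"], MARITAL_VALUES)
)
# ... yet the combination is not a member of the acceptable set.
combination_ok = combination_is_acceptable(record["age"], record["marital_status"])
```

In this sketch the record passes both single-value checks but fails the combination check, which is the kind of case a definition based on combinations of values can express and a definition based on a single data item cannot.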
What…
Data validation is a decisional procedure ending with the acceptance or refusal of the data.
The decisional procedure is generally based on rules expressing the acceptable combinations of values.
Why do we perform data validation…
The purpose of data validation is to ensure a certain level of quality of the final data, but quality has several aspects.
We clarified which aspects are related to DV: essentially the ones related to the ‘structure of the data’, that is, accuracy, comparability, and coherence.
Other aspects are connected as well; e.g., timeliness can be seen as a constraining factor.
How to perform DV…
Two main elements:
Validation levels: to what extent a data set has been validated.
Validation rules: rules are applied to data; a failure of a rule implies that the corresponding validation level is not attained by the data at hand (decisional process: accept/not accept).
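The interplay between rules and levels can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for the example: the level numbers, the three variables, the rules themselves, and the choice to treat levels as cumulative (a level counts as attained only if all lower levels are attained too).

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Rule = Callable[[Record], bool]

# Hypothetical rules grouped by hypothetical validation level.
RULES_BY_LEVEL: Dict[int, List[Rule]] = {
    # Level 0: the expected variables are present.
    0: [lambda r: all(name in r for name in ("turnover", "costs", "profit"))],
    # Level 1: single values lie in their acceptable range.
    1: [lambda r: r["turnover"] >= 0, lambda r: r["costs"] >= 0],
    # Level 2: the values are consistent with each other.
    2: [lambda r: abs(r["turnover"] - r["costs"] - r["profit"]) < 1e-9],
}


def attained_level(record: Record) -> int:
    """Highest level whose rules, and those of all lower levels, pass.

    Returns -1 when even the lowest level fails.
    """
    attained = -1
    for level in sorted(RULES_BY_LEVEL):
        if not all(rule(record) for rule in RULES_BY_LEVEL[level]):
            break  # a failed rule means this level is not attained
        attained = level
    return attained


def accept(record: Record, required_level: int) -> bool:
    """Decisional step: accept the record only if the required level is attained."""
    return attained_level(record) >= required_level


consistent = {"turnover": 100.0, "costs": 60.0, "profit": 40.0}
inconsistent = {"turnover": 100.0, "costs": 60.0, "profit": 50.0}
```

With these assumed rules, `consistent` attains level 2, while `inconsistent` stops at level 1 and is refused whenever level 2 is required.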
Validation levels
They are related to the perspective of the ‘validator’ …
In the HB:
Business perspective: starting from the elements that usually characterise the DV process (increasing information).
A formal approach: looking at the elements characterising a data point in a statistical setting.
Validation levels: business perspective
Validation levels: formal approach
Metadata aspects that are necessary to identify a data point:
The universe U from which a statistical object originates (e.g., household, company).
The time t of selecting an element u from the current population p(t).
The selected element u. This determines the values of the variables X over time that may be observed.
The variable selected for measurement.
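As a rough illustration, the four metadata aspects listed above can be collected in a small key that identifies a data point. The class name, field names, and sample values below are hypothetical. The helper that lists the aspects on which two data points differ is likewise an assumption about how such a key might be used, not something stated on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPointKey:
    """Metadata identifying one data point (all names are illustrative)."""
    universe: str  # U: where the statistical object originates, e.g. "company"
    time: str      # t: time of selecting the element from the population p(t)
    unit: str      # u: the selected element
    variable: str  # the variable selected for measurement


def differing_aspects(a: DataPointKey, b: DataPointKey) -> list:
    """Names of the metadata aspects on which two data points differ."""
    names = ("universe", "time", "unit", "variable")
    return [name for name in names if getattr(a, name) != getattr(b, name)]


# Frozen dataclasses are hashable, so the key can index observed values.
observations = {
    DataPointKey("company", "2014", "c001", "turnover"): 95.0,
    DataPointKey("company", "2015", "c001", "turnover"): 100.0,
}

key_2014 = DataPointKey("company", "2014", "c001", "turnover")
key_2015 = DataPointKey("company", "2015", "c001", "turnover")
changed = differing_aspects(key_2014, key_2015)
```

Here the two keys describe the same variable for the same unit and universe, and differ only in the time aspect.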
Data validation - GSDEMs
Generic Statistical Data Editing Models (GSDEMs): statistical data editing is composed of three different function types: Review, Selection and Amendment.
The review functions are defined as: functions that examine the data to identify potential problems. This may be by evaluating formally specified quality measures or edit rules or by assessing the plausibility of the data in a less formal sense, for instance by using graphical displays.
Data validation - GSDEMs
Among the different GSDEMs function categories there is ‘Review of data validity’, that is: functions that check the validity of data values against a specified range or a set of values and also the validity of specified combinations of values.
Each check leads to a binary value (TRUE, FALSE).
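The three kinds of check named in this category (against a range, against a set of values, and on a combination of values) can be sketched as follows. The variable names, bounds, and code list are assumptions chosen for the example; only the idea that each check returns a binary value comes from the slide.

```python
def in_range(value, low, high):
    """Range check: TRUE if low <= value <= high."""
    return low <= value <= high


def in_set(value, allowed):
    """Set check: TRUE if the value belongs to the allowed set."""
    return value in allowed


def review_validity(record):
    """Run every check and return one boolean per check (illustrative rules)."""
    return {
        "hours_in_range": in_range(record["weekly_hours"], 0, 168),
        "status_in_set": in_set(
            record["status"], {"employed", "unemployed", "inactive"}
        ),
        # Combination check: only employed persons may report working hours.
        "hours_consistent_with_status": (
            record["status"] == "employed" or record["weekly_hours"] == 0
        ),
    }


result = review_validity({"weekly_hours": 40, "status": "unemployed"})
```

In this sketch the record passes the range and set checks and fails only the combination check, so `result` holds two TRUE values and one FALSE.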
Data Validation - GSBPM
Data validation life cycle
Second part of the document: Metrics
Evaluating a validation procedure …next presentation…
Thanks for your attention