Towards a Process Oriented View on Statistical Data Quality Michaela Denk, Wilfried Grossmann.

1 Towards a Process Oriented View on Statistical Data Quality Michaela Denk, Wilfried Grossmann

2 Contents
- Approaches Towards Data Quality
- Example Data Integration
- A Generic Statistical Workflow Model
- Quality Assessment
- Conclusions
Grossmann, Denk

3 Approaches Towards Data Quality
- The usual approach towards data quality is the Reporting View
- Define a number of so-called quality dimensions and evaluate the final product according to criteria for these dimensions
  - Some frequently used dimensions: Accuracy, Relevance, Accessibility, Timeliness, Coherence, Comparability, ...

4 Approaches Towards Data Quality
- These dimensions are often broken down into sub-dimensions
  - Example, Accuracy: Sampling Effects, Representativity, Over-Coverage, Under-Coverage, Missing Values, Imputation Error, ...
- Such an approach is fine as long as the production of data follows a predefined scheme with limited degrees of freedom

5 Approaches Towards Data Quality
- If there are several different options for data production, such an approach is probably not the best one
- Compare the ideas of Total Quality Management (TQM) in industrial production: systematic treatment of the influence of different production steps on the quality of the final product
- We need a Processing View on data quality: How is data quality influenced by production?

6 Approaches Towards Data Quality
- How can we arrive at a Processing View on data quality?
- We need a statistical workflow model
- We have to organize the processing information necessary for quality assessment in an appropriate way
  - Compare the (older) ideas of B. Sundgren on the capture of metadata

7 Approaches Towards Data Quality
- We have to know functions for assessing quality:
  Output_Quality = F(Input_Quality, Processing_Quality)
- Such functions have to be applied according to
  - The object we are interested in, e.g. a variable, a population, or a classification
  - The quality aspect we are interested in
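The relation Output_Quality = F(Input_Quality, Processing_Quality) can be illustrated with a minimal Python sketch. The slides do not specify F. The multiplicative form below, the names `output_quality` and `chain_quality`, and the reading of "quality" as the probability that a value is correct are all assumptions made for illustration only.

```python
from typing import Sequence


def output_quality(input_quality: float, processing_quality: float) -> float:
    """Hypothetical F: probability that a value is still correct after one step.

    Assumes both arguments are probabilities in [0, 1] and that input errors
    and processing errors occur independently (an illustrative assumption,
    not a rule taken from the presentation).
    """
    for q in (input_quality, processing_quality):
        if not 0.0 <= q <= 1.0:
            raise ValueError("quality values must lie in [0, 1]")
    return input_quality * processing_quality


def chain_quality(input_quality: float, step_qualities: Sequence[float]) -> float:
    """Apply the hypothetical F once per processing step, in order."""
    quality = input_quality
    for step_quality in step_qualities:
        quality = output_quality(quality, step_quality)
    return quality
```

Under these assumptions, quality can only stay equal or decrease along a chain of steps. A different F would be needed for steps such as editing or imputation that are meant to improve quality.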

8 Example Data Integration
- Data integration occurs frequently in statistical data production, in particular when data are produced from administrative sources
- It uses a number of operations usually understood as data pre-processing
- Basic goal: Combine information from two or more already existing data sets

9 Example Data Integration
- Example of a data integration dataflow
  [Figure: dataflow with the stages Input → Integration → Post-alignment]

10 Example Data Integration
- Top level task description
  - Match the datasets according to the matching key
  - Align V1 (gender)
  - Align V2 (status)
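The three top-level tasks can be sketched in plain Python. The record layout (dictionaries with the fields `id`, `gender`, and `status`), the function name `integrate`, and the alignment rules are hypothetical. The slides leave the choice of matching and alignment methods open.

```python
def integrate(dataset_a, dataset_b, key="id"):
    """Match records on `key`, then align V1 (gender) and V2 (status).

    Illustrative rules (assumptions, not from the slides):
    - V1: keep the value only if both sources agree, otherwise mark it
      as ambiguous with None.
    - V2: prefer source A, fall back to source B, otherwise leave it
      missing (None) so that it can be imputed later.
    """
    index_b = {record[key]: record for record in dataset_b}
    integrated = []
    for rec_a in dataset_a:
        rec_b = index_b.get(rec_a[key])
        if rec_b is None:
            continue  # unmatched record; a real workflow would report it
        gender = rec_a["gender"] if rec_a["gender"] == rec_b["gender"] else None
        status = rec_a.get("status") or rec_b.get("status")
        integrated.append({key: rec_a[key], "gender": gender, "status": status})
    return integrated


# Made-up example records, for illustration only
source_a = [
    {"id": 1, "gender": "f", "status": "employed"},
    {"id": 2, "gender": "m", "status": None},
    {"id": 3, "gender": "f", "status": None},
]
source_b = [
    {"id": 1, "gender": "f", "status": None},
    {"id": 2, "gender": "f", "status": "student"},
    {"id": 4, "gender": "m", "status": "retired"},
]
matched = integrate(source_a, source_b)
```

Here record 2 carries an ambiguous gender (the sources disagree), and records 3 and 4 find no partner, which corresponds to the decisions about ambiguities and match quality raised on the next slide.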

11 Example Data Integration
- Details, decisions to be made
  - Are the datasets appropriate?
    - Quality of matching keys
    - Quality of data sources
  - Method for identification of matches?
  - Method for handling ambiguities in V1 (Gender)?
  - Method for imputation of V2 (Status)?
  - How is quality measured?
    - At the level of a summary measure?
    - At the level of a specific variable?
    - At the level of individual records?

12 Example Data Integration
- There are no generally accepted standard tools and methods for answering such questions
- We probably have to compare a number of alternative approaches
  - Apply the generic format to different datasets
  - Try different statistical methods and models
  - Use different methods for quality assessment
    - Traditional formulas
    - Simulation-based evaluation
    - Assessment using strategic surveys

13 Example Data Integration
- Conclusion
  - Different statistical methods may be an essential part of data production and quality assessment
  - There is no longer such a clear distinction between “objective” data collection and statistical analysis
  - Statistics generates added value beyond (administrative) accounting and IT

14 A Generic Statistical Workflow Model
- Statistical Workflow: A mixture of
  - Business Workflow (process oriented)
  - Scientific Workflow (data oriented)
- Quality evaluation is the main control element of the process
- We have to consider the workflow at two levels
  - Meta-level (control of the process)
  - Data-level (production of data)

15 A Generic Statistical Workflow Model
- Building blocks of the workflow model
  - Transformations (basic data operations)
  - Process components (tasks) defined by:
    - Task definition
    - Pre-Alignment
    - Feasibility Check
    - Main Transformation
    - Post-Alignment
    - Quality Evaluation
  - Workflow (sequence of process components)
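The building blocks listed above can be sketched as a small Python data structure. The class name `ProcessComponent`, the function `run_workflow`, and the choice to represent every stage as a plain callable are assumptions for illustration. The presentation does not prescribe an implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


def identity(data: Any) -> Any:
    """Default stage that leaves the data unchanged."""
    return data


@dataclass
class ProcessComponent:
    """One task: the six parts named on the slide, each as a callable."""
    task_definition: str
    main_transformation: Callable[[Any], Any]
    pre_alignment: Callable[[Any], Any] = identity
    feasibility_check: Callable[[Any], bool] = lambda data: True
    post_alignment: Callable[[Any], Any] = identity
    quality_evaluation: Callable[[Any], float] = lambda data: 1.0

    def run(self, data: Any) -> Tuple[Any, float]:
        """Data level: transform the data. Meta level: report its quality."""
        data = self.pre_alignment(data)
        if not self.feasibility_check(data):
            raise ValueError(f"task not feasible: {self.task_definition}")
        data = self.main_transformation(data)
        data = self.post_alignment(data)
        return data, self.quality_evaluation(data)


def run_workflow(components: Sequence[ProcessComponent], data: Any):
    """A workflow is a sequence of process components run in order."""
    quality_log: List[Tuple[str, float]] = []
    for component in components:
        data, quality = component.run(data)
        quality_log.append((component.task_definition, quality))
    return data, quality_log
```

The quality log is what a controlling meta-level could inspect after each task, for example to stop or re-route the workflow when a quality value falls below a chosen threshold.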

16 A Generic Statistical Workflow Model
- Example of a data integration component workflow
  [Figure: component workflow for data integration]

17 A Generic Statistical Workflow Model
- In order to understand how statistics influences the boxes and data quality, let us zoom into the box for post-alignment

18 Quality Assessment
- For quality assessment we need a detailed description of the changes in meta-information during the dataflow

19 Quality Assessment
- Example of meta-information flow in data integration
- For details on the register-based census, see the presentation by Fiedler/Lenk in Session 26 (Thursday)

20 Quality Assessment
- Example: Assessment of the accuracy of variables V1 (Gender) and V2 (Status) in the example

21 Quality Assessment
- V1 (Gender)
  - Input
    - Coincidence of matching keys in both datasets
    - Matching of the variable Gender in both datasets
    - Beliefs about the quality of the variable in both sources
  - Accuracy Assessment
    - It seems that models developed in decision analysis (calculus from belief networks) are appropriate
    - Alternatively, we can use a strategic sample to check whether our prior beliefs are correct and our decision rule is confirmed by statistical arguments
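How beliefs about the two sources might be combined can be sketched with a single application of Bayes' rule, which is the simplest special case of a belief-network calculation. The function name `posterior_for_source_a`, the two-category treatment of the variable, and the assumption that the sources err independently are illustrative choices, not the authors' model.

```python
def posterior_for_source_a(p_a, p_b, agree, prior=0.5):
    """Probability that the value reported by source A is the true one.

    p_a, p_b : believed accuracies of sources A and B (probability that
               each reports the true value)
    agree    : True if both sources report the same value
    prior    : prior probability that A's reported value is true

    Assumes a variable with two categories and sources that err
    independently of each other (illustrative assumptions).
    """
    if agree:
        like_true = p_a * p_b                # both sources are right
        like_false = (1 - p_a) * (1 - p_b)   # both sources are wrong
    else:
        like_true = p_a * (1 - p_b)          # A is right, B is wrong
        like_false = (1 - p_a) * p_b         # A is wrong, B is right
    numerator = like_true * prior
    denominator = numerator + like_false * (1 - prior)
    if denominator == 0:
        raise ValueError("the observation is impossible under these beliefs")
    return numerator / denominator
```

For example, with believed accuracies of 0.9 and 0.8, agreement between the sources raises the probability that the shared value is correct to about 0.97, while disagreement leaves only about 0.69 in favour of source A. A strategic sample, as mentioned on the slide, could then be used to check whether the assumed accuracies are realistic.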

22 Quality Assessment
- V2 (Status)
  - Input
    - Coincidence of matching keys in both datasets
    - Reliability of the model used for imputation
    - Measurement technique for the quality of imputation
  - Accuracy Assessment
    - In this case we can apply traditional statistical techniques such as the misclassification rate, the ROC curve, or simulation
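Two of the traditional measures named above can be sketched in plain Python. The function names and the evaluation setup (comparing imputed values with known true values, for instance on records whose status was deliberately masked in a simulation) are assumptions for illustration.

```python
def misclassification_rate(true_values, imputed_values):
    """Share of records whose imputed category differs from the true one."""
    if len(true_values) != len(imputed_values) or len(true_values) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    errors = sum(t != i for t, i in zip(true_values, imputed_values))
    return errors / len(true_values)


def roc_point(labels, scores, threshold):
    """One point (false positive rate, true positive rate) of a ROC curve.

    labels : true binary labels (1 = positive, 0 = negative)
    scores : model scores; a score >= threshold is classified as positive
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    true_pos = sum(1 for y, s in zip(labels, scores) if y and s >= threshold)
    false_pos = sum(1 for y, s in zip(labels, scores) if not y and s >= threshold)
    return false_pos / negatives, true_pos / positives
```

Evaluating `roc_point` over a range of thresholds traces out the ROC curve for a binary status such as employed versus not employed. For a status variable with more than two categories, the misclassification rate applies directly, whereas the ROC curve would have to be computed one category at a time.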

23 Conclusions
- We have presented a model that allows tighter coupling of quality assessment to the data production process
- Such a model seems useful if data production has more degrees of freedom
  - What data should be used?
  - What techniques should be used?
- The approach allows identification of the different factors influencing quality

24 Conclusions
- It allows the formulation of precise questions about possible alternatives and defines new issues for research in statistical data quality
- Hopefully it helps to better understand the added value generated by statistics

25 Thank you for your attention

