Towards a Process Oriented View on Statistical Data Quality
Michaela Denk, Wilfried Grossmann
Contents
- Approaches Towards Data Quality
- Example: Data Integration
- A Generic Statistical Workflow Model
- Quality Assessment
- Conclusions
Approaches Towards Data Quality
The usual approach towards data quality is the Reporting View:
- Define a number of so-called quality dimensions and evaluate the final product according to criteria for these dimensions.
- Some frequently used dimensions: Accuracy, Relevance, Accessibility, Timeliness, Coherence, Comparability, ...
Approaches Towards Data Quality
These dimensions are often broken down into sub-dimensions.
- Example, Accuracy: Sampling Effects, Representativity, Over-Coverage, Under-Coverage, Missing Values, Imputation Error, ...
Such an approach is fine as long as the production of data follows a predefined scheme with limited degrees of freedom.
Approaches Towards Data Quality
If we have a number of different options for data production, such an approach is probably not the best one.
Compare the ideas of Total Quality Management (TQM) in industrial production: systematic treatment of the influence of the different production steps on the quality of the final product.
We need a Processing View on data quality: how is data quality influenced by production?
Approaches Towards Data Quality
How can we arrive at a Processing View on data quality?
- We need a statistical workflow model.
- We have to organize the processing information necessary for quality assessment in an appropriate way.
- Compare the (old) ideas of B. Sundgren about the capture of metadata.
Approaches Towards Data Quality
We have to know functions for assessing quality:
Output_Quality = F(Input_Quality, Processing_Quality)
Such functions have to be applied according to:
- The object we are interested in, e.g. a variable, a population, or a classification
- The quality aspect we are interested in
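As an illustration only (not part of the original slides), here is a minimal Python sketch of such a composition. The 0-1 quality scores and the multiplicative combination rule are assumptions for illustration, not the authors' method; any monotone function F could be substituted.

```python
# Hypothetical sketch: combining input quality and processing quality
# into an output quality score on a 0-1 scale.

def output_quality(input_quality: float, processing_quality: float) -> float:
    """F(Input_Quality, Processing_Quality) on a 0-1 scale (assumed)."""
    for q in (input_quality, processing_quality):
        if not 0.0 <= q <= 1.0:
            raise ValueError("quality scores are expected in [0, 1]")
    # Assumption: a processing step can only preserve or degrade input quality.
    return input_quality * processing_quality

# Example: a fairly accurate source (0.95) passed through an imputation
# step whose own reliability is assessed at 0.9.
print(output_quality(0.95, 0.9))  # 0.855
```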
Example: Data Integration
Data integration occurs frequently in statistical data production, in particular when data are produced from administrative sources.
It uses a number of operations usually understood as data pre-processing.
Basic goal: combine information from two or more already existing data sets.
Example: Data Integration
Example of a data integration dataflow: Input → Integration → Post-alignment
Example: Data Integration
Top-level task description:
- Match the datasets according to the matching key
- Align V1 (gender)
- Align V2 (status)
Example: Data Integration
Details, decisions to be made (see the sketch after this list):
- Are the datasets appropriate?
  - Quality of the matching keys
  - Quality of the data sources
- Method for identification of matches?
- Method for handling ambiguities in V1 (gender)?
- Method for imputation of V2 (status)?
- How is quality measured?
  - At the level of a summary measure?
  - At the level of a specific variable?
  - At the level of individual records?
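The following Python sketch (mine, not from the slides) shows one possible shape of this integration step: an exact-key merge, a simple rule for gender conflicts, and a trivial fallback for status. The column names, the conflict rule, and the fallback "imputation" are illustrative assumptions only.

```python
import pandas as pd

# Illustrative register and survey extracts; "key" plays the role of the matching key.
register = pd.DataFrame({"key": [1, 2, 3],
                         "gender": ["M", "F", "F"],
                         "status": ["employed", None, "retired"]})
survey = pd.DataFrame({"key": [1, 2, 4],
                       "gender": ["M", "M", "F"],
                       "status": [None, "unemployed", "employed"]})

# 1. Match the datasets on the matching key (exact matching assumed here).
matched = register.merge(survey, on="key", how="inner", suffixes=("_reg", "_srv"))

# 2. Align V1 (gender): keep agreeing values, flag conflicts for later resolution.
matched["gender"] = matched.apply(
    lambda r: r["gender_reg"] if r["gender_reg"] == r["gender_srv"] else None, axis=1
)

# 3. Align V2 (status): prefer the register value, fall back to the survey value
#    (a placeholder for a proper imputation model).
matched["status"] = matched["status_reg"].fillna(matched["status_srv"])

print(matched[["key", "gender", "status"]])
```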
Example: Data Integration
There are no generally accepted standard tools and methods for answering such questions. Probably we have to compare a number of alternative approaches:
- Apply the generic format to different datasets
- Try different statistical methods and models
- Use different methods for quality assessment:
  - Traditional formulas
  - Simulation-based evaluation (a sketch follows below)
  - Assessment by means of strategic surveys
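As one illustration of simulation-based evaluation (my own sketch, not from the slides), the code below corrupts a synthetic matching key with a given error rate and estimates how often records end up matched to the wrong unit under exact-key matching. The error model and sizes are assumptions.

```python
import random

def simulate_false_match_rate(n_records: int, key_error_rate: float,
                              n_runs: int = 200) -> float:
    """Estimate the share of records matched to the wrong unit when the
    matching key of one source is corrupted with probability key_error_rate."""
    false_matches = 0
    total = 0
    for _ in range(n_runs):
        true_keys = list(range(n_records))
        observed = [
            random.randrange(n_records) if random.random() < key_error_rate else k
            for k in true_keys
        ]
        # Exact-key matching against a complete register: any corrupted key
        # points to some other (wrong) unit.
        false_matches += sum(1 for t, o in zip(true_keys, observed) if t != o)
        total += n_records
    return false_matches / total

# A 2% key error rate translates roughly into a 2% false-match rate here.
print(simulate_false_match_rate(n_records=1000, key_error_rate=0.02))
```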
Example: Data Integration
Conclusion:
- Different statistical methods may be an essential part of data production and quality assessment.
- There is no longer such a clear distinction between "objective" data collection and statistical analysis.
- Statistics generates added value beyond (administrative) accounting and IT.
A Generic Statistical Workflow Model
Statistical Workflow: a mixture of
- Business Workflow (process oriented)
- Scientific Workflow (data oriented)
Quality evaluation is the main control element of the process.
We have to consider the workflow at two levels:
- Meta-level (control of the process)
- Data-level (production of data)
A Generic Statistical Workflow Model
Building blocks of the workflow model:
- Transformations (basic data operations)
- Process components (tasks), each defined by:
  - Task definition
  - Pre-alignment
  - Feasibility check
  - Main transformation
  - Post-alignment
  - Quality evaluation
- Workflow (sequence of process components)
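A minimal Python sketch of how these building blocks could be represented; this is my own reading of the list above, not code from the paper. The step names mirror the slide, everything else (signatures, the simple sequential run loop) is assumed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# One step = a function that takes a dataset (plus its quality metadata)
# and returns the transformed dataset.
Step = Callable[[Any], Any]

@dataclass
class ProcessComponent:
    task_definition: str
    pre_alignment: Step
    feasibility_check: Step
    main_transformation: Step
    post_alignment: Step
    quality_evaluation: Step

    def run(self, data: Any) -> Any:
        # Execute the steps of the component in the order listed on the slide.
        for step in (self.pre_alignment, self.feasibility_check,
                     self.main_transformation, self.post_alignment,
                     self.quality_evaluation):
            data = step(data)
        return data

@dataclass
class Workflow:
    components: list = field(default_factory=list)  # sequence of ProcessComponents

    def run(self, data: Any) -> Any:
        for component in self.components:
            data = component.run(data)
        return data
```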
A Generic Statistical Workflow Model
Example: workflow for the data integration component (figure).
A Generic Statistical Workflow Model
In order to understand how statistics influences these boxes and data quality, let us zoom into the box for post-alignment.
Quality Assessment
For quality assessment we need a detailed description of the changes in meta-information during the dataflow.
Quality Assessment
Example of the meta-information flow in data integration.
Details for the register-based census are given in the presentation by Fiedler/Lenk in Session 26 (Thursday).
Quality Assessment
Example: assessment of the accuracy of the variables V1 (gender) and V2 (status) in the data integration example.
Quality Assessment
V1 (Gender)
Input:
- Coincidence of the matching keys in both datasets
- Matching of the variable gender in both datasets
- Beliefs about the quality of the variable in both sources
Accuracy assessment:
- It seems that models developed in decision analysis (the calculus of belief networks) are appropriate.
- Alternatively, we can use a strategic sample to check whether our prior beliefs are correct and our decision rule is confirmed by statistical arguments.
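To make the belief-based idea concrete, here is a small Python sketch (my own illustration, not the authors' belief-network calculus) that resolves a gender conflict between two sources using prior beliefs about their reliability; the combination formula is a simple two-outcome likelihood-ratio rule and is an assumption.

```python
def resolve_conflict(value_a: str, value_b: str,
                     reliability_a: float, reliability_b: float) -> tuple:
    """Return (chosen value, posterior belief) for a binary attribute reported
    by two sources with given prior reliabilities (assumed independent)."""
    if value_a == value_b:
        # Both sources independently report the same value: reinforced belief.
        agree = reliability_a * reliability_b
        belief = agree / (agree + (1 - reliability_a) * (1 - reliability_b))
        return value_a, belief
    # Disagreement: weigh "A right, B wrong" against "B right, A wrong".
    a_right = reliability_a * (1 - reliability_b)
    b_right = reliability_b * (1 - reliability_a)
    if a_right >= b_right:
        return value_a, a_right / (a_right + b_right)
    return value_b, b_right / (a_right + b_right)

# Register gender believed 95% reliable, survey gender 80% reliable:
print(resolve_conflict("F", "M", 0.95, 0.80))  # ('F', ~0.83)
```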
Quality Assessment
V2 (Status)
Input:
- Coincidence of the matching keys in both datasets
- Reliability of the model used for imputation
- Measurement technique for the quality of the imputation
Accuracy assessment:
- In this case we can apply traditional statistical techniques such as the false classification rate, the ROC curve, or simulation.
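A short Python sketch of these traditional checks (illustrative data, not from the slides), using scikit-learn for the ROC curve; in practice the true statuses would come from a strategic sample or an audit, and the 0.5 cut-off is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical evaluation sample: true status (1 = employed) versus the
# imputation model's predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.6, 0.8, 0.3, 0.35, 0.45, 0.1])

# False classification rate at a 0.5 cut-off.
y_pred = (y_prob >= 0.5).astype(int)
print(f"false classification rate: {np.mean(y_pred != y_true):.2f}")

# ROC curve and area under it, summarising quality over all cut-offs.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print(f"AUC: {roc_auc_score(y_true, y_prob):.2f}")
```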
Conclusions
We have presented a model which allows a tighter coupling of quality assessment to the data production process.
Such a model seems useful if data production has more degrees of freedom:
- What data should be used?
- What techniques should be used?
The approach allows identification of the different factors influencing quality.
Conclusions
It allows the formulation of precise questions about possible alternatives and defines new issues for research in statistical data quality.
Hopefully it helps to better understand the added value generated by statistics.
Thank you for your attention