New and Emerging Methods Maria Garcia and Ton de Waal UN/ECE Work Session on Statistical Data Editing, 16-18 May 2005, Ottawa.


Introduction  New methods of data editing and imputation  Subdivided into 5 different themes: Automatic editing Imputation E & I for demographic variables Selective editing Software

Invited Papers WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy) WP 32: Using a quadratic programming approach to solve simultaneous ratio and balance edit problems (USCB, US) WP 31: Smoothing Imputations for categorical data in the linear regression paradigm (USCB, US)

Automatic editing: papers (1/2) Six papers: WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy) WP 32: Using a quadratic programming approach to solve simultaneous ratio and balance edit problems (USCB, US) WP 33: Data editing and logic (Australia)

Automatic editing: papers (2/2) WP 43: Automatic editing system for the case of two short-term business surveys (Republic of Slovenia) WP 44: A variable neighbourhood local search approach for the continuous data editing problem (Spain) WP 46: Implicit linear inequality edits and error localization in the SPEER edit system (USCB, US)

Automatic Editing: main developments Methods based on Fellegi-Holt model  Developments at SORS General system combines error localization with outlier detection Plans for automation of implied edit generation  Further improvements of SPEER Preprocessing program for generation of implied edits Improve error localization

Automatic Editing: main developments  Framework of Fellegi-Holt theory in propositional logic Generation of implied edits framed as logical deduction Automatic tools that can potentially be used for finding minimal deletion set
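To make the idea of a minimal deletion set concrete, here is a minimal brute-force sketch of Fellegi-Holt style error localization for categorical data. The field names, domains and edits are hypothetical and chosen only for illustration; the systems discussed in the papers rely on implied edits or dedicated solvers rather than exhaustive search.

```python
from itertools import combinations, product

# Hypothetical domains and edits, made up for illustration only.
DOMAINS = {
    "age": ["child", "adult"],
    "marital": ["single", "married"],
    "employed": ["no", "yes"],
}

# Each edit returns True when the record FAILS that edit.
EDITS = [
    lambda r: r["age"] == "child" and r["marital"] == "married",
    lambda r: r["age"] == "child" and r["employed"] == "yes",
]


def passes(record):
    """A record passes when it fails none of the edits."""
    return not any(edit(record) for edit in EDITS)


def minimal_deletion_sets(record):
    """Return every smallest set of fields whose values can be
    changed so that the record passes all edits."""
    fields = list(DOMAINS)
    for size in range(len(fields) + 1):
        found = []
        for subset in combinations(fields, size):
            # Try every combination of values for the chosen fields.
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = dict(record, **dict(zip(subset, values)))
                if passes(candidate):
                    found.append(set(subset))
                    break
        if found:
            return found  # stop at the smallest size that works
    return []


record = {"age": "child", "marital": "married", "employed": "yes"}
solutions = minimal_deletion_sets(record)
print(solutions)
```

Here changing the single field "age" resolves both failed edits, whereas changing "marital" or "employed" alone leaves one edit failed. The search is exponential in the number of fields, which is one reason implied edits and logic-based tools are of interest.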

Methods based on some other approach  Erroneous unit measures Model as cluster analysis problem  Ratio and balance constraints Hybrid ratio editing and quadratic programming Controlled rounding  Error localization as a combinatorial optimization problem Continuous data Successful on very large data sets
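As an illustration of the quadratic programming idea, the sketch below handles only the simplest case: one linear balance edit, with the reported values adjusted so that the sum of squared changes is minimal. This case has a closed-form solution. The variable names and figures are assumptions, and the approach in WP 32 also handles ratio edits (inequalities), which need a general quadratic programming solver.

```python
def adjust_to_balance(x, a, b=0.0):
    """Minimise sum((y_i - x_i)**2) subject to sum(a_i * y_i) == b.

    The solution is the orthogonal projection of x onto the
    hyperplane defined by the balance edit.
    """
    residual = sum(ai * xi for ai, xi in zip(a, x)) - b
    norm_sq = sum(ai * ai for ai in a)
    return [xi - ai * residual / norm_sq for ai, xi in zip(a, x)]


# Hypothetical record: turnover, costs, profit.
# Balance edit: turnover - costs - profit = 0.
reported = [100.0, 60.0, 31.0]
adjusted = adjust_to_balance(reported, [1.0, -1.0, -1.0])
print(adjusted)
```

The reported values violate the edit by 9, and the adjustment spreads that discrepancy equally over the three variables. A weighted objective would be a natural extension when some variables are considered more reliable than others.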

Imputation: papers (1/2) Six papers: WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy) WP 31: Smoothing imputations for categorical data in the linear regression paradigm (USCB, US) WP 36: Integrated modeling approach to imputation and discussion on imputation variance (Statistics Finland)

Imputation: papers (2/2) WP 40: Imputation of data subject to balance and inequality restrictions using the truncated normal distribution (Statistics Netherlands) WP 41: On the imputation of categorical data subject to edit restrictions using loglinear models (Statistics Netherlands) WP 48: Improving imputation: the plan to examine count, status, vacancy and item imputation in the decennial census (USCB, US)

Imputation: main developments Model based methods  Discrete Data Constrained loglinear model Linear regression model  Continuous Data Truncated normal distribution followed by MCEM
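A minimal sketch of the basic building block for the continuous case: drawing an imputed value from a normal distribution truncated to an interval, here by simple rejection sampling. The mean, standard deviation and bounds are assumed values. The method in WP 40 is multivariate and also handles balance restrictions, and the MCEM estimation step is not shown.

```python
import random


def draw_truncated_normal(mean, sd, lower, upper, rng, max_tries=10000):
    """Draw from N(mean, sd**2) restricted to [lower, upper]
    by rejecting draws that fall outside the interval."""
    for _ in range(max_tries):
        value = rng.gauss(mean, sd)
        if lower <= value <= upper:
            return value
    raise RuntimeError("no draw fell inside the bounds")


# Hypothetical inequality edit: 0 <= value <= 80.
rng = random.Random(0)
draws = [draw_truncated_normal(50.0, 20.0, 0.0, 80.0, rng) for _ in range(5)]
print(draws)
```

Rejection sampling is easy to verify but becomes slow when the admissible interval lies far in the tail of the distribution; an inverse-CDF method would be a common alternative in that case.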

Imputation: main developments Implementation of imputation methods  Use Bayesian networks for imputation of discrete data  Development of QUIS for imputation of continuous data: written in SAS; uses EM algorithm, nearest neighbor, and MI

Imputation: main developments Implementation of imputation methods  Integrated Modeling Approach (IMAI) Summary and analysis of principles of IMAI Estimation of imputation variance  U.S. Decennial Census Research on alternative imputation options Administrative records, model based imputation, CANCEIS, hot deck Development of a truth deck for evaluation

E & I for demographic variables: papers Three papers: WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy) WP 35: Edit and imputation for the 2006 Canadian Census (Statistics Canada) WP 38: New procedures for editing and imputation of demographic variables (ISTAT, Italy)

E & I for demographic variables: main developments  Further improvement of CANCEIS capability of processing all census variables improved editing and imputation of alphanumeric, discrete, continuous and coded variables improved user interface  Development of DIESIS combined use of “data driven” approach (NIM) and “minimum change” approach (Fellegi-Holt)

E & I for demographic variables: main developments  Development of DIESIS Use of graph theory to improve quality of sequential imputation Optimization procedure to locate the household reference person New approach for selection of donors  based on partitioning passed records into smaller subsets of similar characteristics  search for donor records within the smaller clusters

Selective editing: papers Two papers: WP 42: Evaluation of score functions for selective editing of annual structural business statistics (Statistics Netherlands) WP 45: An editing procedure for low pay data in the Annual Survey of Hours and Earnings (Office for National Statistics, UK)

Selective editing: main developments  Continued use and development of selective editing  Evaluation of selective editing approaches experiments with different sets of score functions  Development of “hybrid editing” validate a sample of failed records use associated data to impute remaining records
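For readers unfamiliar with score functions, the sketch below shows one commonly used form: the weighted absolute difference between the reported and an anticipated value, scaled by an estimated total. Records scoring above a threshold go to manual review. The field names, the figures and the threshold are assumptions, and this is not necessarily one of the score functions evaluated in WP 42.

```python
def local_score(weight, reported, anticipated, estimated_total):
    """Approximate influence of a suspected error on the estimated total."""
    return weight * abs(reported - anticipated) / estimated_total


def select_for_review(records, estimated_total, threshold):
    """Return the ids of records whose score exceeds the threshold."""
    selected = []
    for rec in records:
        score = local_score(
            rec["weight"], rec["reported"], rec["anticipated"], estimated_total
        )
        if score > threshold:
            selected.append(rec["id"])
    return selected


# Hypothetical records; "anticipated" could come from last year's data.
records = [
    {"id": "u1", "weight": 10.0, "reported": 120.0, "anticipated": 118.0},
    {"id": "u2", "weight": 10.0, "reported": 1200.0, "anticipated": 125.0},
    {"id": "u3", "weight": 2.0, "reported": 90.0, "anticipated": 60.0},
]
flagged = select_for_review(records, estimated_total=50000.0, threshold=0.01)
print(flagged)
```

Only the record with a large weighted deviation is flagged; the other two would be left to automatic treatment. In practice the choice of threshold and of the anticipated value largely determines how much clerical work is saved.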

Software: papers Four papers: WP 34: The transition from GEIS to BANFF (Statistics Canada) WP 37: Concepts, materials and IT modules for data editing of German statistics (Destatis, Germany) WP 39: SLICE 1.5: a software framework for automatic edit and imputation (Statistics Netherlands) WP 47: Improving an edit and imputation system for the US Census of agriculture (NASS, US)

Software: main developments  Flexibility modules rather than large systems are developed standard statistical packages are used (SAS in BANFF and US Census of Agriculture)  Testing and implementation of the software  Quality control measures e.g. for (donor) imputation  Integration of the edit and imputation software in entire production process process chain: planning, data collection, edit and imputation

General points for discussion  Are there any really new approaches? Are new approaches extensions of existing ideas? Are new approaches combinations of old ones?  Develop new approaches or consolidate old approaches? development versus evaluation studies and testing prototype software versus implementation of production software  Is our focus shifting? from editing towards imputation? from development towards implementation? from computational aspects towards quality issues?

Automatic editing: points for discussion  Can operations research techniques be combined with techniques from mathematical logic?  What are the (dis)advantages of using SAT solvers when compare to direct integer programming methods?  What is the quality of the imputations when editing data using the quadratic programming approach?

Automatic editing: points for discussion  What is the quality of the solutions found by using the combinatorial optimization approach on real survey data? How fast is this approach on realistic data?  Can finite mixture models be used for detection of other types of systematic errors?  Should we invest on developing generic tools or software tools tailored to a particular application?

Automatic editing: points for discussion  Are there any other types of surveys that are worth the effort of generating implied edits prior to error localization?  What are the most cost-effective methods for edit/imputation in terms of resources, time, clerical intervention, quality of results?

Imputation: points for discussion  What are the (dis)advantages of using complex mathematical models for missing data imputation? Are these models too complex for survey practitioners?  What are the expected computational difficulties of applying complex models to real survey data?  What are the largest (most complex) surveys that can be imputed using these models?

Imputation: points for discussion  What is the quality of the imputations carried out using model based methods for filling-in missing data?  Can we compare the different imputation models?

Imputation: points for discussion  Can more guidelines for the IMAI process be developed?  To what extent can we develop a systematic way of applying IMAI?  Is imputation variance an important issue at the moment, or should we (still) focus on imputation bias?

E & I for demographic variables: points for discussion  Can CANCEIS/DIESIS be used for other data besides demographic census data?  Can CANCEIS/DIESIS be further developed?  Should we use a combination of edit and imputation methods or a single method for demographic variables?

Selective editing: points for discussion  Can selective editing be successfully applied to large/complex surveys?  Can current methods for selective editing be further developed?  Can a general theory for selective editing be developed?  How promising is hybrid editing?

Software: points for discussion  Should we develop generic software or software tools for particular applications?  How can we ensure the flexibility of software?  Are the software tools fast enough for large/complex data sets?  To what extent should we aim to automate the editing process?
