Data Imputation United Nations Statistics Division (UNSD) 16 March 2011 Santiago, Chile.

Presentation transcript:

Data Imputation United Nations Statistics Division (UNSD) 16 March 2011 Santiago, Chile

2 Imputation
Imputation resolves the problems of missing, invalid or incomplete responses identified during editing.

3 Imputation Options
Interactive/manual
Subjective imputation
Donor-based imputation
Regression (model-based) imputation
Imputations can be done manually or automatically

4 Interactive/Manual Treatment
Manual review of the record
Obvious and easily corrected records can be treated interactively at the data capture stage
  – Ex: in a table-formatted input, responses may be accidentally shifted by a row
Often a subject matter expert reviews the hard copy/original questionnaire
  – Errors can be found in the questionnaire that are otherwise undiscoverable
  – Manual imputation procedures, e.g. with historic data
Re-contact the respondent to correct data

5 Imputation Cells
Usually, data are split into imputation cells, similar to strata
  – Example criteria include industry type, geography, employment size, etc.
Imputation cells are intended to be relatively homogeneous
This ensures that imputations are done among similar respondents

6 Subjective Imputation
Generally rule or logic based
Can be used when there is only one (reasonably) possible response to the question
  – Ex: balance edit – a single missing variable in a balance edit
  – Ex: rule based – if the respondent reports zero months worked, then income can be imputed to be zero
Can be used when missing/erroneous values can be determined unambiguously from edits
  – Ex: rule based – if the ratio of the current value to the anticipated value (e.g. historic value) is greater than 300, assume a thousands error. Value = 135,000; previous value = 130
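The three rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular statistical office: the function names are hypothetical, the threshold of 300 is taken from the slide, and the thousands rule compares the current value to the previous value, as the slide's own example (135,000 against 130) suggests.

```python
def impute_balance(total, parts):
    """Balance edit: if exactly one component is missing (None),
    fill it so that the components sum to the reported total."""
    missing = [name for name, value in parts.items() if value is None]
    if len(missing) != 1:
        return dict(parts)  # the rule only applies to a single missing part
    known = sum(value for value in parts.values() if value is not None)
    filled = dict(parts)
    filled[missing[0]] = total - known
    return filled


def impute_income(months_worked, income):
    """Logic rule: zero months worked implies zero income."""
    if income is None and months_worked == 0:
        return 0
    return income


def fix_thousands_error(current, previous, threshold=300):
    """Thousands rule (threshold is an assumption taken from the slide):
    a current value more than `threshold` times the previous value is
    assumed to be reported in units instead of thousands."""
    if previous and current / previous > threshold:
        return current / 1000
    return current


print(impute_balance(400, {"P": 100, "C": None}))
print(impute_income(0, None))
print(fix_thousands_error(135_000, 130))
```

With the slide's figures, the reported 135,000 is rescaled to 135, which is in line with the previous value of 130.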

7 Donor Imputation
Donor based – replacement by non-erroneous donors
  – Hot deck – replace with values from the current survey
  – Cold deck – replace with values from another source (e.g. previous surveys)

8 Donor Imputation – Substitution
Historic value
  – Simple historic value is a cold deck imputation
Historic value with trend
  – Trend can be based on growth in another variable within the record, variables in other records, etc.
This is a very common imputation technique
Suggestions
Useful method when variables or growth rates are stable over time
Less useful method when changes in variables are of primary interest
  – Ex: monthly employment in monthly employment surveys
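Historic-value imputation with a trend can be written as a one-line formula. The sketch below assumes the trend is the growth of one auxiliary variable observed in both periods; the function name and the example figures are hypothetical.

```python
def impute_with_trend(previous_value, aux_previous, aux_current):
    """Carry the previous period's value forward, scaled by the growth
    of an auxiliary variable (e.g. turnover) between the two periods.
    Falls back to the plain historic value when no growth rate can be
    computed."""
    if not aux_previous:
        return previous_value  # simple historic (cold deck) value
    return previous_value * aux_current / aux_previous


# Hypothetical record: wages of 200 last period, turnover grew from 50 to 75.
print(impute_with_trend(200, 50, 75))
```

Setting the growth factor to 1 reduces this to the simple historic value, which is why the plain method suits variables that are stable over time.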

9 Donor Imputation – Mean/Mode
Missing value is replaced by the mean or mode of the respondents' values for a variable (within a subset or imputation cell of similar respondents)
  – E.g. if wages are missing for one respondent, the average wage within the imputation cell can be used
Suggestions
Useful method when variance is small within an imputation cell
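A minimal sketch of mean imputation within imputation cells follows. The record layout (dictionaries with an `industry` cell key and a `wages` variable) and the figures are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def impute_cell_mean(records, cell_key, target):
    """Replace missing (None) values of `target` with the mean of the
    reported values in the same imputation cell."""
    reported = defaultdict(list)
    for record in records:
        if record[target] is not None:
            reported[record[cell_key]].append(record[target])

    result = []
    for record in records:
        record = dict(record)  # leave the input untouched
        donors = reported.get(record[cell_key], [])
        if record[target] is None and donors:
            record[target] = mean(donors)
        result.append(record)
    return result


# Hypothetical data: two cells defined by industry.
data = [
    {"industry": "food", "wages": 30_000},
    {"industry": "food", "wages": 34_000},
    {"industry": "food", "wages": None},
    {"industry": "steel", "wages": 50_000},
]
for row in impute_cell_mean(data, "industry", "wages"):
    print(row)
```

A record whose cell has no reported values is left missing, so it can be handled by another method.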

10 Donor Imputation – Nearest Neighbor
For each missing value, find a donor value from the record that is closest to the record with the missing value, based on the distance between a set of variables
  – E.g. Employees (E), Additions (A), Dismissals (D)
  – Record to be imputed (t): E = 100, A = ?, D = ?
  – Donor records (s):
      1. E = 80, A = 10, D = 5: Distance = 20
      2. E = 90, A = 12, D = 4: Distance = 10
  – Imputed record: E = 100, A = 12, D = 4
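The slide's example can be reproduced with a short sketch. It assumes an absolute-difference distance summed over the matching variables (here only E), which gives the distances of 20 and 10 shown on the slide; the function name is hypothetical, and other distance functions are equally possible.

```python
def nearest_neighbor_impute(recipient, donors, match_vars, impute_vars):
    """Copy `impute_vars` from the donor closest to the recipient,
    measuring closeness by summed absolute differences on `match_vars`."""
    def distance(donor):
        return sum(abs(recipient[v] - donor[v]) for v in match_vars)

    best = min(donors, key=distance)
    filled = dict(recipient)
    for var in impute_vars:
        filled[var] = best[var]
    return filled


recipient = {"E": 100, "A": None, "D": None}
donors = [
    {"E": 80, "A": 10, "D": 5},  # distance 20
    {"E": 90, "A": 12, "D": 4},  # distance 10
]
print(nearest_neighbor_impute(recipient, donors, ["E"], ["A", "D"]))
```

In practice the matching variables are often rescaled first so that no single variable dominates the distance.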

11 Donor Imputation – Nearest Neighbor (2)

12 Donor Imputation – Ratio
Missing values are replaced using ratios taken from the donor record's values
  – E.g.: T = P + C
  – Record to be imputed: T = 400, P = ?, C = ?
  – Donor record: T = 100, P = 25, C = 75
  – Imputed record: T = 400, P = 100, C = 300
The donor can be:
  – Chosen using a distance function
  – The mean value within the imputation cell
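As a sketch of the arithmetic in this example: each missing component is the recipient's total multiplied by the donor's share of that component. The function name is hypothetical.

```python
def ratio_impute(total, donor, total_var, part_vars):
    """Split the recipient's reported total across the missing parts
    using the donor's part-to-total ratios."""
    return {var: total * donor[var] / donor[total_var] for var in part_vars}


donor = {"T": 100, "P": 25, "C": 75}
print(ratio_impute(400, donor, "T", ["P", "C"]))
```

Because the donor's shares sum to one (25/100 + 75/100), the imputed parts add up to the reported total, so the balance edit T = P + C still holds.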

13 Donor Imputation – Ratio (2)

14 Regression (model-based) Imputation
Regression/model
  – An imputation model predicts a missing or erroneous value using a function of some auxiliary variables
  – Auxiliary variables can be from the current survey or other sources, e.g. the sampling frame (size class, branch of economic activity) or historical information (previous period value)
  – Regression coefficients can be determined from historic survey data
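A minimal sketch of regression imputation with a single auxiliary variable is given below, assuming a simple linear model y = a + b·x fitted by ordinary least squares on complete records. The variable names and data are illustrative only; real applications typically use several auxiliary variables and fit within imputation cells.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regression_impute(aux_value, intercept, slope):
    """Predict the missing value from the auxiliary variable."""
    return intercept + slope * aux_value


# Hypothetical complete records: previous-period value (x), current value (y).
previous = [10, 20, 30]
current = [12, 22, 32]
intercept, slope = fit_line(previous, current)

# A record with previous value 40 and a missing current value.
print(regression_impute(40, intercept, slope))
```

The coefficients could equally be estimated from historic survey data, as the slide notes, and then applied to the current period.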

15 Model-based Imputation (2)

16 Imputation Process: Fellegi-Holt
An isolated imputation may not satisfy all editing rules
Key principle: the data of a record should be made to satisfy all edits by changing the fewest possible number of fields
Solves edit rules simultaneously through linear programming
Advantages
  – Preserves as much original data as possible
  – Leads to consistent data satisfying all edits
Disadvantages
  – All edits specified for a certain record are considered fatal
  – Powerful edits are required
  – Not easy to implement
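The minimum-change principle can be illustrated with a brute-force search. This is only a toy sketch: production systems use optimization techniques rather than enumeration, and the edits, field names and candidate domains below are all hypothetical. The search tries to change zero fields, then one, then two, and stops at the first set of changes that satisfies every edit.

```python
from itertools import combinations, product


def localize_and_impute(record, edits, domains):
    """Find the smallest set of fields whose values can be changed so
    that the record passes all edits. Returns (changed_fields, record)
    or None if no combination of domain values works."""
    fields = list(record)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            for values in product(*(domains[f] for f in subset)):
                candidate = dict(record)
                candidate.update(zip(subset, values))
                if all(edit(candidate) for edit in edits):
                    return set(subset), candidate
    return None


# Hypothetical edits: children under 15 are neither married nor employed.
edits = [
    lambda r: not (r["age"] < 15 and r["marital"] == "married"),
    lambda r: not (r["age"] < 15 and r["employed"] == "yes"),
]
domains = {
    "age": [5, 30, 60],
    "marital": ["single", "married"],
    "employed": ["yes", "no"],
}
record = {"age": 5, "marital": "married", "employed": "yes"}
print(localize_and_impute(record, edits, domains))
```

Changing the age alone satisfies both edits, whereas fixing the edits one at a time would change two fields (marital status and employment), which is the point of treating all edits simultaneously.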

17 How/Why to choose one method over the other?
Depends on the specifics of the survey and the available time, cost, expertise, etc.
  – Ex: in a short-term survey estimating changes in employment in the manufacturing sector, using historic data for employment would bias the estimate of change downwards
When designing imputation processes, simulations should be run to compare a variety of imputation techniques
Fine-tuning of the imputation process to the particulars of the survey is necessary
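One simple form of such a simulation is to delete values that were actually reported, impute them with each candidate method, and compare the imputations with the truth. The sketch below uses made-up figures and compares only two methods (cell mean and historic value) by mean absolute error; the function name is hypothetical.

```python
def compare_methods(truth, previous, masked):
    """Mask the records in `masked`, impute them by two methods and
    return the mean absolute error of each method."""
    observed = [v for i, v in enumerate(truth) if i not in masked]
    cell_mean = sum(observed) / len(observed)

    imputations = {
        "cell_mean": {i: cell_mean for i in masked},
        "historic": {i: previous[i] for i in masked},
    }
    return {
        method: sum(abs(truth[i] - values[i]) for i in masked) / len(masked)
        for method, values in imputations.items()
    }


# Illustrative (invented) values for one imputation cell.
truth = [100, 110, 95, 105, 400]    # current period, fully reported
previous = [98, 108, 97, 100, 390]  # previous period
errors = compare_methods(truth, previous, masked=[1, 4])
print(errors)
```

In this invented cell the historic value does better because one unit is far larger than the rest; with a different variable, or when period-to-period change is what the survey measures, the ranking could reverse, which is why the comparison needs to be run on the survey's own data.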