Prediction and Imputation in ISEE - Tools for more efficient use of combined data sources Li-Chun Zhang, Statistics Norway Svein Nordbotton, University of Bergen

ISEE model: A sketch

Use of a statistical register
- Combining administrative and survey data
  - Model-based prediction or weighting
  - Construction of statistical registers
- Uses of a statistical register
  - Prediction of (sub-)population totals
  - Multiple uses & general database quality => inferential concerns associated with imputation
- How to balance between the two types of inferential concerns?

A triple-goal criterion for statistical registers
A. Efficient estimates of the population totals of interest
B. Correct co-variances among survey variables, as well as between survey and auxiliary variables
C. Non-stochastic & constant tabulation

A simultaneous prediction method
- NNI as the only feasible approach in terms of preserving co-variances among all the variables
  - To improve efficiency: introduce restrictions on the imputed totals, which may be obtained separately from imputation, say, through regression prediction
  - To be referred to as NNI with restrictions (NNI-WR)
- A simultaneous prediction method
  - Values are generated outside of the sample
  - Efficient for prediction of population totals
  - Not the optimal (or best) prediction of each specific unit, but of the ensemble of units, since attention is given to the co-variances among the variables

About NNI-WR
- Separation of prediction of totals from general imputation concerns, allowing full freedom in search of efficient methods
- Solves the variance estimation problem at the same time
- Genuine multivariate imputation with realistic imputed values
- Non-parametric nature and mild regularity conditions suggest robustness, compared to standard regression-based approaches
- NNI can be made non-stochastic, yielding constant tabulations on repetition

An algorithm and current research
- An algorithm
  - Jump-start phase: to speed up the imputation procedure if desirable
  - Fine-tune phase: relaxation to k-nearest neighbor imputation for better agreement with restrictions; consistency remains
  - Adjustment between the two phases
- Current research
  - How well does the algorithm perform in real statistical production?
  - Effective ways of setting up the restrictions, i.e. maximum control with a minimum number of explicit restrictions for imputation?
  - Evaluation of micro-data quality
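The slides do not spell out the NNI-WR algorithm, so the following is only a rough illustration of the fine-tune idea: start from plain nearest neighbor imputation, then allow a recipient to take another of its k nearest donors when that brings the imputed total closer to a restriction. The function name `nni_with_restriction`, the single covariate, the greedy pass and the target total of 100 are all assumptions made for this sketch, not the authors' procedure.

```python
def nni_with_restriction(x_obs, y_obs, x_mis, target_total, k=2):
    """Greedy sketch: NNI first, then relax to k nearest donors to approach a total."""
    # For each recipient, list its k nearest donors (ties broken by donor index).
    ranked = [
        sorted(range(len(x_obs)), key=lambda j: (abs(x_obs[j] - xi), j))[:k]
        for xi in x_mis
    ]
    # Start from plain nearest neighbor imputation.
    choice = [candidates[0] for candidates in ranked]
    total = sum(y_obs[j] for j in choice)
    # Swap donors only when the imputed total moves closer to the restriction.
    for i, candidates in enumerate(ranked):
        for j in candidates:
            new_total = total - y_obs[choice[i]] + y_obs[j]
            if abs(new_total - target_total) < abs(total - target_total):
                total, choice[i] = new_total, j
    return [y_obs[j] for j in choice], total


# Hypothetical data: five observed units, three units to impute.
x_obs = [1, 2, 3, 4, 5]
y_obs = [10, 20, 30, 40, 50]
x_mis = [1.4, 3.6, 4.4]

plain_values, plain_total = nni_with_restriction(x_obs, y_obs, x_mis, target_total=100, k=1)
values, total = nni_with_restriction(x_obs, y_obs, x_mis, target_total=100, k=2)
print(plain_total, total)
```

With k=1 the function reduces to ordinary NNI. In practice the target total would come from a separate, more efficient predictor (for example regression prediction), as the slides suggest.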

Background information: Some standard methods of prediction and imputation

Basic prediction approach
- Under the general linear model:
  - Target parameter T = linear combination of y-values in the population
  - Estimation of T amounts to prediction of T outside of the selected sample
  - Prediction of individuals: a special case
- Main problems for a statistical register:
  - Lack of natural variation in data, especially if many units have the same x-values
  - Infeasible simultaneously for a large number of variables; impractical as production mode; leads to inconsistent cross-tabulations
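As a minimal sketch of the prediction approach, assume T is the population total and the model is a simple linear regression on one covariate. The helper names `fit_line` and `predict_total` and the data are hypothetical. The estimate is the observed sample sum plus the predicted y-values for units outside the sample.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single covariate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def predict_total(x_sample, y_sample, x_nonsample):
    """Observed sample sum plus model predictions for the non-sampled units."""
    a, b = fit_line(x_sample, y_sample)
    predicted = [a + b * xi for xi in x_nonsample]
    return sum(y_sample) + sum(predicted), predicted


# Hypothetical data: four sampled units, three units outside the sample.
x_s = [1.0, 2.0, 3.0, 4.0]
y_s = [2.1, 3.9, 6.2, 7.8]
x_r = [2.0, 2.0, 5.0]

total, y_hat = predict_total(x_s, y_s, x_r)
print(total, y_hat)
```

The first two non-sampled units share the same x-value and therefore receive identical predictions, which is the lack of natural variation noted in the slide.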

Random regression imputation (RRI)
- To emulate the natural variation in data: add a random residual to the best predicted y-value
- Hot-deck as a special case
- Main problems:
  - Extra variance of the imputed estimator due to random imputation => never fully efficient
  - Random imputation is not the only means of creating natural variation in data
  - Different tabulations on repetition => lack of acceptability and face value in official statistics
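A small sketch of RRI, assuming a one-covariate linear model and residuals drawn at random from the observed residuals (one of several possible choices; a parametric residual distribution would work as well). The function names and data are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single covariate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def random_regression_impute(x_obs, y_obs, x_mis, seed):
    """Predicted value plus a residual drawn at random from the observed residuals."""
    a, b = fit_line(x_obs, y_obs)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x_obs, y_obs)]
    rng = random.Random(seed)  # the seed stands in for one production run
    return [a + b * xi + rng.choice(residuals) for xi in x_mis]


# Hypothetical data; two runs with different seeds mimic repeating the imputation.
x_s = [1.0, 2.0, 3.0, 4.0]
y_s = [2.1, 3.9, 6.2, 7.8]
x_r = [2.0, 2.0, 5.0]

run_a = random_regression_impute(x_s, y_s, x_r, seed=1)
run_b = random_regression_impute(x_s, y_s, x_r, seed=2)
print(sum(run_a), sum(run_b))
```

Each run is reproducible only for a fixed seed. Repeating the imputation with fresh random draws will generally change the imputed values and hence the tabulations, which is the acceptability problem listed above.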

Multiple imputation (MI)
- Independent random imputations + formulae for combining results
- Bayesian or frequentist approach
- Main problems:
  - Removes all the extra imputation variance only with an infinite number of repetitions; otherwise still not fully efficient & non-constant tabulations
  - A common misunderstanding: only MI can yield acceptable measures of accuracy
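The slide does not say which combining formulae are meant. Assuming the standard Rubin combining rules, a sketch looks as follows; the function name `combine_mi` and the numbers are hypothetical.

```python
from statistics import mean, variance


def combine_mi(estimates, within_variances):
    """Combine m completed-data estimates using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)          # combined point estimate
    w = mean(within_variances)       # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor m - 1)
    total_variance = w + (1 + 1 / m) * b
    return q_bar, total_variance


# Hypothetical results from m = 3 independent imputations.
estimate, total_variance = combine_mi([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
print(estimate, total_variance)
```

The `b / m` part of the total variance reflects the use of a finite number of imputations and vanishes only as m grows without bound, which matches the first problem listed above.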

Predictive mean matching (PMM)
- Find the donor among the observed units that has the same predicted y-value & impute the donor's observed y-value
- Noticeable difference from RRI as the chance of multiple donors decreases; PMM is more efficient due to the removal of imputation variance
- Essentially a marginal, variable-by-variable approach
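A minimal PMM sketch under two assumptions: a one-covariate linear model supplies the predicted values, and the donor is the observed unit with the closest predicted value (an exact match is rarely available), with ties broken by donor order. Names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single covariate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def pmm_impute(x_obs, y_obs, x_mis):
    """Impute the observed y of the donor whose predicted value is closest."""
    a, b = fit_line(x_obs, y_obs)
    pred_obs = [a + b * xi for xi in x_obs]
    imputed = []
    for xi in x_mis:
        target = a + b * xi
        # Deterministic donor choice: smallest gap in predicted value, then lowest index.
        donor = min(range(len(y_obs)), key=lambda j: (abs(pred_obs[j] - target), j))
        imputed.append(y_obs[donor])
    return imputed


# Hypothetical data.
x_s = [1.0, 2.0, 3.0, 4.0]
y_s = [2.1, 3.9, 6.2, 7.8]
imputed = pmm_impute(x_s, y_s, [2.0, 3.4, 5.0])
print(imputed)
```

Every imputed value is an actually observed y-value, and no random residual is added. The matching is done for one y-variable at a time, which is why the slide calls the approach marginal.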

Nearest neighbor imputation (NNI)
- Given a set of covariates and a distance metric, the donor is the 'nearest' observed unit.
- A non-parametric generalization of PMM, with hot-deck as a special case. More flexible and practical for multivariate imputation than regression models.
- Chen and Shao (2000): consistent estimator of totals as well as finite population distributions, provided the absolute difference in conditional means of y is bounded by the 'distance' between two units. Linear models as special cases.
- Can be made non-stochastic by introducing extra, seemingly uncorrelated covariates, such as Zip code.
- Main drawback: usually not efficient (i.e. local smoothing instead of a global regression predictor)
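A sketch of multivariate NNI with a deterministic donor choice, assuming Euclidean distance on the covariates and a zip code used only to break distance ties (the record layout, field names and data are hypothetical). The donor's whole y-record is copied, so the relationships among the y-variables are carried over.

```python
import math


def nni_impute(donors, recipients):
    """Copy the full y-record of the nearest donor; ties resolved by zip code, then index."""
    imputed = []
    for rec in recipients:
        best = min(
            range(len(donors)),
            key=lambda j: (
                math.dist(donors[j]["x"], rec["x"]),   # primary: covariate distance
                abs(donors[j]["zip"] - rec["zip"]),    # secondary: extra covariate
                j,                                     # final tie-break: donor order
            ),
        )
        imputed.append(donors[best]["y"])
    return imputed


# Hypothetical donors (observed units) and recipients (units to impute).
donors = [
    {"x": (1.0, 0.0), "zip": 5003, "y": (10, 1)},
    {"x": (3.0, 0.0), "zip": 5020, "y": (30, 3)},
    {"x": (2.0, 2.0), "zip": 5011, "y": (20, 5)},
]
recipients = [
    {"x": (2.0, 0.0), "zip": 5018},   # equally far from the first two donors
    {"x": (2.1, 1.9), "zip": 5000},
]

imputed = nni_impute(donors, recipients)
print(imputed)
```

The first recipient is equidistant from two donors; the zip code settles the choice without any random draw, so repeating the imputation gives the same result.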

Artificial neural network (ANN)
- Class of functional imputation
- ANN as generalized regression functions (Bishop, 1995)
- No analytic predictor
- Unrealistic imputed values for categorical variables of interest
- Usually not fully efficient