CountrySTAT Team-I, 10-13 November 2014, ECO Secretariat, Teheran.



- Introduction
- Origin of missing data
- Nature of missing data
- Implemented methodologies
- Proposed methodologies
- Results
- Conclusion

- The objective of this presentation is to introduce basic tools for handling missing data in the CountrySTAT and FAOSTAT domains. They follow a simple, user-friendly approach.
- The CountrySTAT agricultural production domain was used as a basis to develop and test imputation and validation methodologies that could help standardise practice across the different statistical domains present at FAO level.

Data are missing for different reasons:
1) The value has not been measured (for example, it was forgotten);
2) The value was measured but lost;
3) The value was measured but is considered unusable (outliers, etc.);
4) The value was measured but is unavailable.

In a dataset, data can be (r denotes the missing-data indicator):
1) Missing completely at random (MCAR): the events that lead to any particular data item being missing are independent of both the observable variables and the unobservable parameters of interest, and occur entirely at random.
   P(r | Y_observed, Y_missing) = P(r)
2) Missing at random (MAR): the missingness may depend on observed variables, but not on the value of the variable that has missing data.
   P(r | Y_observed, Y_missing) = P(r | Y_observed)
3) Not missing at random (NMAR): the data are neither MCAR nor MAR. The missingness depends on the missing values themselves, so
   P(r | Y_observed, Y_missing) does not simplify.
4) Censored and truncated data.
Data are usually MCAR or MAR.
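The three mechanisms can be made concrete with a small simulation. The following is a minimal Python sketch, not part of the original slides: the function name `make_masks`, the 20%, 40% and 5% probabilities, and the use of the median as a threshold are all illustrative assumptions.

```python
import random

def make_masks(area, production, seed=0):
    """Return three boolean lists (True = the production value is missing)."""
    rng = random.Random(seed)
    n = len(production)
    median_area = sorted(area)[n // 2]
    median_prod = sorted(production)[n // 2]
    # MCAR: every value has the same chance of being missing.
    mcar = [rng.random() < 0.2 for _ in range(n)]
    # MAR: missingness depends only on an observed variable (area).
    mar = [rng.random() < (0.4 if a > median_area else 0.05) for a in area]
    # NMAR: missingness depends on the unobserved production value itself.
    nmar = [rng.random() < (0.4 if p > median_prod else 0.05)
            for p in production]
    return mcar, mar, nmar
```

Under the NMAR mask, large production values are hidden more often, so the observed values alone give a biased picture of the series.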

A) Deductive or logical imputation;
B) Mean imputation;
C) Ratio imputation;
D) Regression imputation;
E) Donor imputation (hot-deck, cold-deck, nearest neighbour);
F) Multiple imputation: because it is not deterministic, it is not applicable to official statistics.
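As an illustration of two of the simpler methods in this list, here is a minimal Python sketch of mean imputation (B) and ratio imputation (C). Missing values are represented by `None`; the function names and the choice of ratio estimator (sum of the target over sum of the auxiliary variable on complete pairs) are assumptions made for the example.

```python
def mean_impute(values):
    """Replace each missing value by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def ratio_impute(target, auxiliary):
    """Impute target as R * auxiliary, with R estimated on complete pairs."""
    pairs = [(t, a) for t, a in zip(target, auxiliary)
             if t is not None and a is not None]
    ratio = sum(t for t, _ in pairs) / sum(a for _, a in pairs)
    return [ratio * a if t is None and a is not None else t
            for t, a in zip(target, auxiliary)]
```

For example, `ratio_impute([2.0, None, 6.0], [1.0, 2.0, 3.0])` estimates the ratio as 8 / 4 = 2 and fills the gap with 4.0.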

- expert judgment
- last observation carried forward
- linear interpolation
- growth-rate benchmarking
- yield estimation
- multivariate approach

Status labels on the slide: already applied; under development; trend smoothing; tested but not applied.

These imputations are based on deductive or logical imputation, ratio imputation and donor imputation. The selected method is based on regression imputation. Why?

[Figure or table: Area by Year]

A linear trend is assumed to exist between the start and end points of each gap in the time series. Let y_0, y_1, ..., y_{t-l} denote the data points with values obtained from official sources before the gap, and y_{t+r}, y_{t+r+1}, ..., y_m the data points with official values after the gap. The imputed values are calculated as:

y_i = y_{t-l} + ((i - (t - l)) / (l + r)) * (y_{t+r} - y_{t-l}), for t - l < i < t + r
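A minimal Python sketch of this rule follows. It assumes the standard straight-line interpolation between the last official value before a gap and the first official value after it; missing values are represented by `None`, and the function name is hypothetical.

```python
def linear_interpolate(series):
    """Fill interior runs of None on a straight line between their neighbours."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        # left plays the role of t-l, right the role of t+r.
        total_change = series[right] - series[left]
        for i in range(left + 1, right):
            filled[i] = series[left] + (i - left) * total_change / (right - left)
    return filled
```

Gaps at the very start or end of the series have only one neighbour, so this sketch leaves them as `None`.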

[Figure or table: Area by Year]

[Figure or table: Area and Production by Year]

The methods used are based on regression imputation and use the EM algorithm:
1) Yield estimation: estimate yield using an ARIMA model;
2) Linear regression: fit a linear regression between P_t and A_t, including a trend;
3) ARIMA model: estimate P_t and A_t using an ARIMA model;
4) Spline regression: estimate P_t and A_t using splines.

How does it work?

- Compute a yield time series Y_t containing missing data: Y_t = P_t / A_t, where P_t is the production and A_t is the area harvested at time t;
- Use linear interpolation to obtain starting values;
- Fit an ARIMA(0,1,1) model: Y_t = Y_{t-1} + α + ε_t - θ_1 * ε_{t-1};
- Apply the EM algorithm;
- Use the yield estimate to impute production and area harvested;
- Where both P_t and A_t are missing, use last observation carried forward to impute area harvested.
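The flow of the yield-based method can be sketched in Python as below. This is a deliberate simplification and not the presenters' implementation: the ARIMA(0,1,1) and EM refinement is skipped, so the yield estimate is just the linearly interpolated starting value. Function names are hypothetical and missing values are `None`.

```python
def interpolate(series):
    """Linear interpolation of interior None runs (starting values only)."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + (i - left) * step
    return filled

def impute_with_yield(production, area):
    # Yield Y_t = P_t / A_t is known only where both values are observed.
    yields = [p / a if p is not None and a is not None else None
              for p, a in zip(production, area)]
    # Assumption: interpolated yield stands in for the ARIMA + EM estimate.
    yields = interpolate(yields)
    prod, ar = list(production), list(area)
    for t, y in enumerate(yields):
        if y is None:
            continue  # no yield estimate at the start or end of the series
        if prod[t] is None and ar[t] is None and t > 0:
            ar[t] = ar[t - 1]  # last observation carried forward
        if prod[t] is None and ar[t] is not None:
            prod[t] = y * ar[t]      # P_t = Y_t * A_t
        elif ar[t] is None and prod[t] is not None:
            ar[t] = prod[t] / y      # A_t = P_t / Y_t
    return prod, ar
```

With production `[100, None, None, 160]` and area `[50, 55, None, 80]`, the yield is 2 throughout, so the sketch returns production `[100, 110, 110, 160]` and area `[50, 55, 55, 80]`.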

- The model assumes a linear relationship between production and area harvested: P_t = Y_t * A_t, where
  - P_t = production in year t;
  - A_t = area harvested in year t;
  - Y_t = yield in year t.
- Algorithm:
  1) Linear interpolation of area for starting values;
  2) Repeat and update until the predicted values converge:
     P_t = α + β_1 * Trend + β_2 * A_t + ε_t (EM algorithm to impute P_t)
     A_t = α + β_1 * Trend + β_2 * P_t + ε_t (EM algorithm to impute A_t)
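The alternating update in step 2 can be sketched in Python as follows. This is an EM-like iterative regression imputation written under stated assumptions, not the exact algorithm used by the authors: each regression gets its own coefficients, the trend is the time index, complete starting series are passed in by the caller, and the helper names `ols` and `impute_by_regression` are hypothetical.

```python
def ols(rows, targets):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, targets))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

def impute_by_regression(production, area, start_p, start_a,
                         tol=1e-6, max_iter=200):
    """production/area contain None where missing; start_* are complete."""
    p, a = list(start_p), list(start_a)
    n = len(p)
    for _ in range(max_iter):
        # P_t = alpha + beta1 * trend + beta2 * A_t, refit on current values.
        b = ols([[1.0, t, a[t]] for t in range(n)], p)
        new_p = [b[0] + b[1] * t + b[2] * a[t] if production[t] is None
                 else p[t] for t in range(n)]
        # A_t = alpha + beta1 * trend + beta2 * P_t, with separate coefficients.
        c = ols([[1.0, t, new_p[t]] for t in range(n)], a)
        new_a = [c[0] + c[1] * t + c[2] * new_p[t] if area[t] is None
                 else a[t] for t in range(n)]
        change = max(abs(x - y) for x, y in zip(new_p + new_a, p + a))
        p, a = new_p, new_a
        if change < tol:
            break
    return p, a
```

Only the missing entries are updated; observed values are never overwritten. The sketch breaks down if area is an exact linear function of the trend, because the regressors are then collinear.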

- The ARIMA model must be identified first;
- ARIMA(0,1,1): Y_t = Y_{t-1} + α + ε_t - θ_1 * ε_{t-1};
- Use the relation between production and area;
- Treat these variables as time series and impute using the EM algorithm;
- R package mtsdi;
- Impute P_t and A_t using the ARIMA model.
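To show how the ARIMA(0,1,1) equation produces values for a gap, here is a minimal Python sketch that fills missing points with forward forecasts. It is a simplification under stated assumptions: α and θ_1 are supplied by the caller rather than estimated, only past values are used (no EM step and no smoothing from later observations), and the first value must be observed. The function name is hypothetical.

```python
def arima011_fill(series, alpha, theta):
    """Fill None values with ARIMA(0,1,1) forecasts: Y_{t-1} + alpha - theta * e_{t-1}."""
    filled = [series[0]]
    prev_error = 0.0
    for y in series[1:]:
        forecast = filled[-1] + alpha - theta * prev_error
        if y is None:
            filled.append(forecast)
            prev_error = 0.0  # expected value of the unobserved shock
        else:
            filled.append(y)
            prev_error = y - forecast  # one-step-ahead residual
    return filled
```

For `[10.0, 12.0, None, None]` with `alpha=1.0` and `theta=0.5`, the residual at the second point is 1, so the first filled value is 12 + 1 - 0.5 = 12.5 and the second is 12.5 + 1 = 13.5.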

- A form of interpolation where the interpolant is a special type of piecewise polynomial called a spline;
- For each interval, we estimate a polynomial function that fits the data well;
- Spline interpolation is preferred over polynomial interpolation because the interpolation error can be made small even when using low-degree polynomials for the spline;
- R package mtsdi;
- Impute P_t and A_t using spline regression.
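A natural cubic spline is one common choice of interpolating spline. The sketch below is a self-contained Python illustration of that choice only; it is not the smoothing procedure used by the R package mentioned on the slide, and the function names are hypothetical. Missing values are `None` and time is the list index.

```python
import bisect

def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Tridiagonal system for the second derivatives m[1..n-1]; m[0] = m[n] = 0.
    diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
    rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
           for i in range(1, n)]
    for i in range(1, n - 1):  # forward elimination (Thomas algorithm)
        w = h[i] / diag[i - 1]
        diag[i] -= w * h[i]
        rhs[i] -= w * rhs[i - 1]
    m = [0.0] * (n + 1)
    for i in range(n - 2, -1, -1):  # back substitution
        m[i + 1] = (rhs[i] - h[i + 1] * m[i + 2]) / diag[i]

    def evaluate(x):
        # Locate the interval [xs[i], xs[i+1]] that contains x.
        i = max(0, min(n - 1, bisect.bisect_right(xs, x) - 1))
        a, b = xs[i + 1] - x, x - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)
    return evaluate

def spline_impute(series):
    """Replace None values by the spline fitted through the observed points."""
    known = [(t, v) for t, v in enumerate(series) if v is not None]
    spline = natural_cubic_spline([t for t, _ in known], [v for _, v in known])
    return [spline(t) if v is None else v for t, v in enumerate(series)]
```

The sketch needs at least two observed points. On data that lie exactly on a straight line the spline reduces to that line, which gives a quick sanity check.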

- We use real data to test the proposed methodologies: yield estimation, linear regression, ARIMA and spline;
- We also add linear interpolation;
- Data are from the CountrySTAT-Mali website;
- Missing data are generated randomly;
- The data start in 1984.

- Test case: maize.
- Missing data: 10%.

- We rerun these methods on the same dataset at different percentages of missing data.
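The test procedure can be sketched in Python as follows: hide a random share of a complete series, impute it, and summarise the relative errors at the hidden points. The summary statistics (minimum, maximum, mean, standard deviation) are an assumption about how the comparison was scored, and `evaluate_method` and `carry_forward` are hypothetical names; any imputation function that takes a list with `None` gaps can be passed in.

```python
import random
import statistics

def carry_forward(series):
    """Example imputer: last observation carried forward."""
    filled = list(series)
    for t in range(1, len(filled)):
        if filled[t] is None:
            filled[t] = filled[t - 1]
    return filled

def evaluate_method(series, impute, share_missing, seed=0):
    """Mask a random share of a complete series and score the imputation."""
    rng = random.Random(seed)
    # Keep the first and last points observed so every gap has neighbours.
    count = max(1, round(share_missing * len(series)))
    hidden = rng.sample(range(1, len(series) - 1), count)
    masked = [None if t in hidden else v for t, v in enumerate(series)]
    imputed = impute(masked)
    errors = [abs(imputed[t] - series[t]) / abs(series[t]) for t in hidden]
    return {"min": min(errors), "max": max(errors),
            "mean": statistics.mean(errors),
            "std": statistics.pstdev(errors)}
```

Calling `evaluate_method(series, carry_forward, 0.10)` and then repeating with larger shares mirrors the comparison across percentages of missing data.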

[Table: columns % Missing, Method, Min, Max, Mean, Std. Dev.; rows for Linear Interpolation, Yield, Linear Regression, ARIMA and Spline at each of three percentages of missing data, starting at 10%. Values not shown.]

- In the three test cases, the spline method gives the smallest relative errors in most cases when more than 10% of the data are missing.
- The ARIMA method is better suited when less than 10% of the data in the dataset are missing.
- The above tests use only two variables for the same crop (area and production). If the share of missing data exceeds 40%, it would be appropriate to use a third, correlated control variable.

THANK YOU