CountrySTAT Team-I, November 2014, ECO Secretariat, Teheran
Introduction; Origin of missing data; Nature of missing data; Implemented methodologies; Proposed methodologies; Results; Conclusion
The objective of this presentation is to introduce basic tools for handling missing data in the CountrySTAT and FAOSTAT domains. They are based on a simple, user-friendly approach. The CountrySTAT agricultural production domain was used as a basis to develop and test imputation and validation methodologies that could assist in standardisation across the different statistical domains present at FAO level.
Data are missing for different reasons: 1) The value has not been measured (forgotten, etc.); 2) The value was measured but lost; 3) The value was measured but is considered unusable (outliers, etc.); 4) The value was measured but is unavailable.
In a dataset, data can be:
1) Missing completely at random (MCAR): the events that lead to any particular data item being missing are independent of both the observable variables and the unobservable parameters of interest, and occur entirely at random: $P(r \mid Y_{\text{observed}}, Y_{\text{missing}}) = P(r)$;
2) Missing at random (MAR): the missingness is related to a particular observed variable, but not to the value of the variable that has missing data: $P(r \mid Y_{\text{observed}}, Y_{\text{missing}}) = P(r \mid Y_{\text{observed}})$;
3) Not missing at random (NMAR): the data are neither MCAR nor MAR, so the expression does not simplify: $P(r \mid Y_{\text{observed}}, Y_{\text{missing}}) = P(r \mid Y_{\text{observed}}, Y_{\text{missing}})$;
4) Censored and truncated data.
Here $r$ is the missingness indicator. Data are usually MCAR or MAR.
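The difference between MCAR and MAR can be illustrated with a small simulation. This is only a sketch: the function names, the rates, the threshold and the area values are hypothetical and do not come from CountrySTAT.

```python
import random


def mcar_mask(n, rate, rng):
    """MCAR: every value is missing with the same probability, whatever the data."""
    return [rng.random() < rate for _ in range(n)]


def mar_mask(observed, threshold, low_rate, high_rate, rng):
    """MAR: missingness depends only on a fully observed variable (here, area),
    not on the value that goes missing (here, production)."""
    return [rng.random() < (high_rate if x < threshold else low_rate)
            for x in observed]


rng = random.Random(42)  # fixed seed so the illustration is repeatable
area = [100.0, 110.0, 90.0, 150.0, 160.0, 80.0]  # hypothetical, fully observed

# True marks a year in which production would be missing.
production_missing_mcar = mcar_mask(len(area), 0.2, rng)
production_missing_mar = mar_mask(area, threshold=100.0,
                                  low_rate=0.05, high_rate=0.5, rng=rng)
```

In the MAR case, years with a small area are more likely to lose their production figure, but within each group the loss is still random.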
A) Deductive or logical imputation; B) Mean imputation; C) Ratio imputation; D) Regression imputation; E) Donor imputation (hot-deck, cold-deck, nearest neighbour); F) Multiple imputation: because it is not deterministic, it is not applicable to official statistics.
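Two of the simpler methods in this list, mean imputation and ratio imputation, can be sketched as follows. The function names and the small production/area series are hypothetical, and the ratio is taken here as the ratio of sums over complete records, which is one common choice rather than a documented CountrySTAT rule.

```python
from statistics import mean


def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def ratio_impute(target, auxiliary):
    """Impute a missing target as R * auxiliary, where R is the ratio of sums
    over the records in which both variables are observed."""
    pairs = [(t, a) for t, a in zip(target, auxiliary)
             if t is not None and a is not None]
    ratio = sum(t for t, _ in pairs) / sum(a for _, a in pairs)
    return [ratio * a if t is None and a is not None else t
            for t, a in zip(target, auxiliary)]


# Hypothetical data: production is missing in the second year.
production = [200.0, None, 260.0, 300.0]
area = [100.0, 110.0, 130.0, 150.0]

mean_filled = mean_impute(production)
ratio_filled = ratio_impute(production, area)
```

Mean imputation ignores the area entirely, whereas ratio imputation scales the observed area by the average production per unit of area.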
Methods: expert judgment; last observation carried forward; linear interpolation; growth-rate benchmarking; yield estimation; multivariate approach; trend smoothing. Status legend: already applied; under development; tested but not applied. These imputations are based on deductive or logical imputation, ratio imputation and donor imputation. The selected method is based on the regression imputation method. Why?
[Example table: Year, Area]
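A linear trend is assumed to exist between the start and end points of gaps in the time series. Let $y_0, y_1, \ldots, y_{t-1}$ denote the data points with values obtained from official sources before the gap, and $y_{t+r}, y_{t+r+1}, \ldots, y_m$ the data points with official values after the gap, so that the gap covers the $r$ points $y_t, \ldots, y_{t+r-1}$. The imputed values are calculated as: $\hat{y}_{t+i} = y_{t-1} + \frac{i+1}{r+1}\,(y_{t+r} - y_{t-1})$, for $i = 0, 1, \ldots, r-1$.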
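The linear interpolation described above can be sketched in a few lines. The function name and the area series are hypothetical. Missing values are represented by None, and gaps at the very start or end of the series are left untouched because there is no second endpoint to draw the line to.

```python
def linear_interpolate(series):
    """Fill interior gaps (None) on a straight line between the nearest
    observed neighbours; leading and trailing gaps are left as None."""
    filled = list(series)
    observed = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(observed, observed[1:]):
        # 'left' is the last official point before the gap, 'right' the
        # first official point after it.
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


area = [100.0, None, None, 130.0, 140.0]  # hypothetical series
print(linear_interpolate(area))
```

For this series the two missing years are filled with equally spaced values between 100 and 130.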
[Example table: Year, Area]
[Example table: Year, Area, Production]
The methods used are based on regression imputation and use the EM algorithm:
1) Yield estimation: estimate the yield using an ARIMA model;
2) Linear regression: use a linear regression between $P_t$ and $A_t$, including a trend;
3) ARIMA model: estimate $P_t$ and $A_t$ using an ARIMA model;
4) Spline regression: estimate $P_t$ and $A_t$ using a spline.
How does it work?
1) Compute a yield time series $Y_t$ containing missing data: $Y_t = P_t / A_t$, where $P_t$ is the production and $A_t$ is the area harvested at time $t$;
2) Use linear interpolation to obtain starting values;
3) Fit an ARIMA(0,1,1) model: $Y_t = Y_{t-1} + \alpha + \varepsilon_t - \theta_1 \varepsilon_{t-1}$;
4) Apply the EM algorithm;
5) Use the yield estimate to impute production and area harvested.
Where both $P_t$ and $A_t$ are missing, the last-observation-carried-forward method is used to impute the area harvested.
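A simplified sketch of the yield-based imputation follows. It covers only the yield computation, the linear-interpolation starting values, the back-substitution into production and area, and the carry-forward rule for years where both are missing. The ARIMA(0,1,1) fit and the EM refinement are deliberately left out, so the interpolated yield stands in for the model estimate. All names and data are hypothetical.

```python
def interpolate(series):
    """Linear interpolation of interior gaps (None)."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def impute_by_yield(production, area):
    """Impute P and A from an estimated yield Y = P / A."""
    production, area = list(production), list(area)
    # Yield is only computable where both P and A are observed.
    yields = interpolate([p / a if p is not None and a is not None else None
                          for p, a in zip(production, area)])
    for t, y in enumerate(yields):
        if y is None:
            continue  # outside the range of observed yields
        if production[t] is None and area[t] is None and t > 0:
            area[t] = area[t - 1]            # last observation carried forward
        if production[t] is None and area[t] is not None:
            production[t] = y * area[t]      # P = Y * A
        elif area[t] is None and production[t] is not None:
            area[t] = production[t] / y      # A = P / Y
    return production, area


# Hypothetical data with one gap of each kind.
production = [200.0, None, 260.0, None, 330.0]
area = [100.0, 110.0, None, None, 150.0]
p_filled, a_filled = impute_by_yield(production, area)
```

The forward pass matters: by the time a year with both values missing is reached, the previous year's area has already been filled where possible, so there is something to carry forward.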
The model assumes a linear relationship between production and area harvested: $P_t = Y_t A_t$, where $P_t$ is the production in year $t$, $A_t$ is the area harvested in year $t$ and $Y_t$ is the yield in year $t$.
Algorithm:
1) Use linear interpolation for area to obtain starting values;
2) Repeat and update until the predicted values converge:
$P_t = \alpha + \beta_1 \,\text{Trend} + \beta_2 A_t + \varepsilon_t$ (EM algorithm to impute $P_t$);
$A_t = \alpha + \beta_1 \,\text{Trend} + \beta_2 P_t + \varepsilon_t$ (EM algorithm to impute $A_t$).
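The alternating scheme above can be sketched with ordinary least squares in plain Python. This is an EM-style iteration under simplifying assumptions, not the exact procedure behind the presentation: each regression is fitted on the years where its target is observed, gaps are assumed to be interior so that interpolation yields a starting value for every year, and all names and data are hypothetical.

```python
def interpolate(series):
    """Linear interpolation of interior gaps (None); starting values only."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_predict(y, x, y_missing):
    """Fit y = a + b1*trend + b2*x on the years where y is observed and
    replace the missing years with the fitted values."""
    rows = [[1.0, float(t), x[t]] for t in range(len(y))]
    fit = [(r, v) for r, v, m in zip(rows, y, y_missing) if not m]
    xtx = [[sum(r[i] * r[j] for r, _ in fit) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * v for r, v in fit) for i in range(3)]
    coef = solve(xtx, xty)
    return [sum(c * ri for c, ri in zip(coef, r)) if m else v
            for r, v, m in zip(rows, y, y_missing)]


def regression_impute(production, area, max_iter=100, tol=1e-6):
    miss_p = [v is None for v in production]
    miss_a = [v is None for v in area]
    a = interpolate(area)                    # step 1: starting values for A
    p = fit_predict(production, a, miss_p)   # first pass for P
    for _ in range(max_iter):                # step 2: alternate until stable
        new_a = fit_predict(a, p, miss_a)
        new_p = fit_predict(p, new_a, miss_p)
        change = max(abs(u - v) for u, v in zip(new_a + new_p, a + p))
        a, p = new_a, new_p
        if change < tol:
            break
    return p, a


# Hypothetical series with interior gaps only.
area = [100.0, 104.0, None, 115.0, 118.0, None, 130.0, 133.0]
production = [210.0, None, 230.0, 245.0, None, 262.0, 280.0, 285.0]
p_filled, a_filled = regression_impute(production, area)
```

Observed values are never overwritten; only the missing years are updated at each pass. If area happened to be an exact linear function of the trend, the normal equations would be singular and this sketch would fail.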
The ARIMA models must be identified, e.g. ARIMA(0,1,1): $Y_t = Y_{t-1} + \alpha + \varepsilon_t - \theta_1 \varepsilon_{t-1}$. Use the relation between production and area: treat these variables as time series and impute using the EM algorithm (R package mtsdi), with an ARIMA model for the imputation of $P_t$ and $A_t$.
Spline interpolation is a form of interpolation where the interpolant is a special type of piecewise polynomial called a spline. For each interval, we estimate a polynomial function that fits the data well. Spline interpolation is preferred over polynomial interpolation because the interpolation error can be made small even when using low-degree polynomials for the spline. Imputation of $P_t$ and $A_t$ uses spline regression (R package mtsdi).
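To make the piecewise-polynomial idea concrete, here is a sketch of natural cubic spline interpolation applied to a single series. It is a simpler stand-in for the spline regression with the EM algorithm in the R package mtsdi, not a reimplementation of it. The function names and the area series are hypothetical.

```python
import bisect


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).
    xs must be strictly increasing and contain at least two points."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = (3 * (ys[i + 1] - ys[i]) / h[i]
                    - 3 * (ys[i] - ys[i - 1]) / h[i - 1])
    # Forward sweep of the tridiagonal system for the quadratic coefficients.
    l, mu, z = [1.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution; c is zero at both ends (natural boundary condition).
    b, c, d = [0.0] * n, [0.0] * (n + 1), [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Pick the interval [xs[j], xs[j+1]] that contains x.
        j = max(0, min(n - 1, bisect.bisect_right(xs, x) - 1))
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


def spline_impute(series):
    """Fill interior gaps (None) with the spline through the observed points."""
    xs = [i for i, v in enumerate(series) if v is not None]
    spline = natural_cubic_spline(xs, [series[i] for i in xs])
    return [spline(i) if v is None and xs[0] < i < xs[-1] else v
            for i, v in enumerate(series)]


area = [100.0, 104.0, None, 115.0, 118.0, None, 130.0, 133.0]  # hypothetical
area_filled = spline_impute(area)
```

Each interval gets its own cubic, and the pieces are joined so that the curve and its first two derivatives are continuous at the observed points.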
We use real data to test the proposed methodologies: yield estimation, linear regression, ARIMA and spline. We also add linear interpolation. The data are from the CountrySTAT-Mali website and cover 1984 to [end year not given]. Missing data are generated randomly.
Test case: Maize. Missing data at 10 %.
We apply these methods again to the same dataset at different percentages of missing data.
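The test procedure (hide known values at random, impute them, compare with the truth) can be sketched as follows. Linear interpolation is used as the imputation method for brevity. The series, the seed and the missing-data fractions are hypothetical and are not the CountrySTAT-Mali data or the percentages used in the presentation.

```python
import random
from statistics import mean, stdev


def interpolate(series):
    """Linear interpolation of interior gaps (None)."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def hide_at_random(series, fraction, seed):
    """Hide a fraction of the interior points completely at random."""
    rng = random.Random(seed)
    count = max(1, round(fraction * len(series)))
    hidden = sorted(rng.sample(range(1, len(series) - 1), count))
    masked = [None if i in hidden else v for i, v in enumerate(series)]
    return masked, hidden


def relative_errors(truth, imputed, hidden):
    """Absolute relative error at each hidden position."""
    return [abs(imputed[i] - truth[i]) / abs(truth[i]) for i in hidden]


def summarise(errors):
    spread = stdev(errors) if len(errors) > 1 else 0.0
    return {"min": min(errors), "max": max(errors),
            "mean": mean(errors), "std": spread}


# Hypothetical complete series standing in for a real crop series.
series = [float(v) for v in (120, 131, 128, 140, 152, 149, 160, 171, 169, 180,
                             192, 188, 201, 210, 207, 220, 231, 229, 240, 251)]

for fraction in (0.10, 0.20, 0.30):
    masked, hidden = hide_at_random(series, fraction, seed=1)
    summary = summarise(relative_errors(series, interpolate(masked), hidden))
    print(f"{fraction:.0%} missing: {summary}")
```

Only interior points are hidden so that every gap has two endpoints. Swapping `interpolate` for another imputation function gives a like-for-like comparison on the same hidden positions.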
[Results table. Columns: % Missing, Method, Min, Max, Mean, Std. Dev. Rows: Linear Int., Yield, Linear Reg., ARIMA and Spline for each of three missing-data percentages, the first being 10.]
Across the three test cases, the spline method gives the smallest relative errors in most cases when more than 10% of the data are missing. The ARIMA method is better suited when less than 10% of the data are missing. The above tests use only two variables for the same crop (area and production). If the share of missing data exceeds 40%, it would be appropriate to use a third, correlated control variable.
THANK YOU