Published by Laurel Bradford. Modified over 8 years ago.
1
Use of administrative data for outlier detection in the VI Italian agriculture census
A. Reale (1), M. Riani (2), M. Greco (1), G. Ruocco (1)
(1) ISTAT, Census Department; (2) University of Parma, Department of Economics (giruocco@istat.it)
ICAS-VI, Rio de Janeiro, Brazil, 23-25 October 2013
2
Outline
1. Introduction
2. E&I during data collection
3. Outlier detection procedure
4. Main steps of outlier detection
5. Results
6. Conclusions
7. References
3
Introduction
For the 6th Italian Agricultural Census, following a quality-oriented approach, the detection of outliers and influential errors (selective editing) was performed mainly during data collection. After data capture, two main correction stages were managed centrally by Istat.
4
E&I during data collection (1)
To prevent and correct fatal errors and missing values during data capture, several checking tools were implemented in the Survey Management System (SGR).
Census Data Collection System:
- Questionnaire editing (farms/enumerators): a subset of 220 checking rules (fatal and query) was integrated into the web-based data entry system.
- Automatic check (data collection staff): run before the final release of data to the census DB, to locate potential errors that slipped through during data gathering.
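The distinction between fatal and query rules can be illustrated with a minimal Python sketch. The three rules, the field names (`uaa_ha`, `total_area_ha`, `vineyard_ha`, `fiscal_code`) and the 500 ha limit are hypothetical: the slides do not list the 220 census rules, and the SGR implementation is not shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    severity: str  # "fatal" must be corrected; "query" asks for confirmation
    is_violated: Callable[[dict], bool]


# HYPOTHETICAL rules, for illustration only (not the census rule set).
RULES = [
    Rule("uaa_exceeds_total_area", "fatal",
         lambda r: r.get("uaa_ha", 0.0) > r.get("total_area_ha", 0.0)),
    Rule("missing_fiscal_code", "fatal",
         lambda r: not r.get("fiscal_code")),
    Rule("very_large_vineyard", "query",
         lambda r: r.get("vineyard_ha", 0.0) > 500.0),
]


def check_record(record: dict) -> dict:
    """Return the names of the violated rules, grouped by severity."""
    report = {"fatal": [], "query": []}
    for rule in RULES:
        if rule.is_violated(record):
            report[rule.severity].append(rule.name)
    return report
```

In a data entry form, a non-empty `fatal` list would block submission, while a `query` entry would only ask the enumerator to confirm the value.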
5
E&I during data collection (2)
During data gathering, two distinct procedures were implemented and managed centrally by Istat to detect influential errors and outliers.
ISTAT E&I SYSTEM:
- Outlier detection: 1. Forward Search technique; 2. Manual review (of anomalous values by data collection staff).
- Micro-editing check: highlights inconsistent data by analysing, at unit level, the coherence between answers on related topics.
6
Outlier detection procedure
The special procedure to detect outliers was implemented in partnership with the University of Parma and applied centrally by Istat. The selection of outliers was based on the robust Forward Search technique, used to identify the farms whose census values diverged significantly from the information registered by the Agency for Subsidies in Agriculture (AGEA). The computations used the core routines of the FSDA toolbox for Matlab, jointly developed by the University of Parma and the Joint Research Centre of the European Commission and freely downloadable from:
http://www.riani.it/MATLAB
http://fsda.jrc.ec.europa.eu
7
Main steps of outlier detection
1. Data linkage
2. Data input for Matlab
3. Methodological issues
4. Matlab processing
5. Matlab output
6. List of outliers sent to the regional census offices for manual revision
8
Data linkage
AGEA and census records were linked using the fiscal code of the holder as the linking key. About 85% of census farms matched administrative units that had at least one of the checked areas.
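The linkage step can be sketched in a few lines of Python. This is an illustration under assumptions, not the procedure used by Istat: records are plain dictionaries, and the field names (`fiscal_code`, `uaa_ha`, `total_area_ha`, `vineyard_ha`, `olive_ha`) are hypothetical.

```python
CHECKED_AREAS = ("uaa_ha", "total_area_ha", "vineyard_ha", "olive_ha")  # assumed names


def link_by_fiscal_code(census, agea, checked=CHECKED_AREAS):
    """Pair each census farm with its AGEA record via the holder's fiscal code."""
    agea_by_code = {rec["fiscal_code"]: rec for rec in agea}
    linked = []
    for farm in census:
        admin = agea_by_code.get(farm["fiscal_code"])
        # Keep the pair only if the register holds at least one checked area.
        if admin is not None and any(admin.get(k) is not None for k in checked):
            linked.append((farm, admin))
    return linked


def match_rate(census, linked):
    """Share of census farms that found a usable administrative record."""
    return len(linked) / len(census) if census else 0.0
```

Applied to the real files, `match_rate` would correspond to the figure of about 85% quoted on the slide.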
9
Data input for Matlab
Census units were divided into strata defined by farm size, location and the area devoted to the following surfaces: Utilized Agricultural Area (UAA), Total Area, Vineyards and Olive Plantations. Only strata with at least 10 observations were processed.
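A minimal sketch of the stratification and of the 10-observation filter follows. The size classes and the field names (`region`, `uaa_ha`) are hypothetical, since the slides do not give the class limits; only the minimum stratum size comes from the source.

```python
from collections import defaultdict

MIN_STRATUM_SIZE = 10  # threshold stated in the slides


def size_class(uaa_ha):
    """HYPOTHETICAL size classes; the census class limits are not given."""
    if uaa_ha < 2.0:
        return "<2 ha"
    if uaa_ha < 10.0:
        return "2-10 ha"
    return ">=10 ha"


def build_strata(units):
    """Group units by (location, size class) and drop strata that are too small."""
    strata = defaultdict(list)
    for unit in units:
        strata[(unit["region"], size_class(unit["uaa_ha"]))].append(unit)
    return {key: members for key, members in strata.items()
            if len(members) >= MIN_STRATUM_SIZE}
```

Each surviving stratum would then be passed to the outlier detection routine as a separate data set.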
10
Methodological issues (1)
Outlier: a unit whose behaviour deviates markedly from most of the observations in the distribution.
The hypothesis underlying the method is that admissible inconsistencies between the two data sources should stem from differences in classification schemes, reference time or target population.
A robust method such as the Forward Search (FS) technique is used to avoid masking (false negatives) and swamping (false positives) in outlier detection.
11
Methodological issues (2)
Total Agricultural Area analysis: examples of distributions with outliers whose behaviour follows a systematic pattern, unlikely to be due to random recording problems.
12
Methodological issues (3)
Main steps of the FS for regression:
- Start from a subset of size m_0 = p, where p is the number of explanatory variables, and increase the subset size m one unit at a time until all units not in the subset are identified as outliers.
- Order the units by their squared residuals (by Mahalanobis distance in the multivariate version of the FS) and form the next subset from the m+1 observations with the smallest squared residuals.
- Monitor how the fitted model changes as units are added.
- Use least squares to estimate the parameters on each subset.
- Analyse the minimum deletion residuals.
13
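Methodological issues (4)
Outliers are detected by monitoring the minimum deletion residual of the observations not in the outlier-free subset $S^{(m)}$:

$$r_{\min}(m) = \min_{i \notin S^{(m)}} \frac{\left| y_i - x_i^{T}\hat{\beta}(m) \right|}{s(m)\sqrt{1 + h_i(m)}}, \qquad h_i(m) = x_i^{T}\left\{X(m)^{T}X(m)\right\}^{-1}x_i,$$

where $s(m)$ is the square root of the estimated residual variance computed from the observations in $S^{(m)}$, $\hat{\beta}(m)$ is the least squares estimate from the same subset and $X(m)$ is its design matrix.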
14
Methodological issues (5)
The analysis of the minimum deletion residuals highlights a peak at the iteration immediately preceding the inclusion of the first outlier.
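The forward search steps and the monitoring of the minimum deletion residual can be sketched in plain Python for a single-regressor model y = a*x + b. This is a simplified illustration, not the FSDA code used in the census: the start is the pair of units whose line has the smallest median squared residual (an assumed stand-in for a robust start), p = 2 because the constant is counted as a column of the design matrix, no confidence envelopes are computed, and the data are synthetic. All function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def fit_line(xs, ys):
    """Least squares fit of y = a*x + b; assumes the x values are not all equal."""
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sxx = sum((xv - mean_x) ** 2 for xv in xs)
    sxy = sum((xv - mean_x) * (yv - mean_y) for xv, yv in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x, mean_x, sxx


def initial_subset(x, y):
    """Assumed robust start: the pair with the smallest median squared residual."""
    best, best_crit = None, math.inf
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical pair does not define a regression line
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        crit = median((yv - a * xv - b) ** 2 for xv, yv in zip(x, y))
        if crit < best_crit:
            best, best_crit = [i, j], crit
    return best


def forward_search(x, y):
    """Return [(m, minimum deletion residual)] for subset sizes m > p."""
    n, p = len(x), 2
    subset = initial_subset(x, y)
    trajectory = []
    while len(subset) < n:
        m = len(subset)
        a, b, mean_x, sxx = fit_line([x[i] for i in subset], [y[i] for i in subset])
        res = [y[i] - a * x[i] - b for i in range(n)]
        if m > p:
            s2 = sum(res[i] ** 2 for i in subset) / (m - p)
            inside = set(subset)
            if s2 > 0:
                # Deletion residual of each unit outside the subset;
                # 1/m + (x - mean_x)^2 / sxx is its leverage.
                dels = [abs(res[i]) / math.sqrt(s2 * (1 + 1 / m + (x[i] - mean_x) ** 2 / sxx))
                        for i in range(n) if i not in inside]
                trajectory.append((m, min(dels)))
        # Next subset: the m+1 units with the smallest squared residuals.
        subset = sorted(range(n), key=lambda i: res[i] ** 2)[: m + 1]
    return trajectory


def peak_step(trajectory, m_start):
    """Subset size with the largest monitored value among steps m >= m_start."""
    return max((t for t in trajectory if t[0] >= m_start), key=lambda t: t[1])[0]


# Synthetic example: 30 units on a line with small noise, 3 of them shifted.
x = [float(i) for i in range(1, 31)]
y = [2.0 * xv + 1.0 + 0.3 * math.sin(xv * xv) for xv in x]
for k in (4, 14, 24):
    y[k] += 40.0
trajectory = forward_search(x, y)
```

With three contaminated units out of 30, the monitored value should peak at the step where the subset holds the 27 clean units, which is the iteration just before the first outlier enters. The early steps are based on very few units and are noisy, so `peak_step` is meant to be applied to the later part of the search.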
15
Methodological issues (6)
Regression line Y = aX + b: estimation of the parameters a and b, with and without outliers; statistical significance and goodness of fit of the regression model (R²).
Plot axes: Census, Administrative Register.
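The comparison of the fit with and without outliers can be sketched as follows. The example assumes the administrative value on the X axis and the census value on the Y axis, which the slide does not state; the function names are hypothetical, and significance tests are left out.

```python
def fit_with_r2(x, y):
    """Least squares estimates (a, b) of y = a*x + b and the R-squared of the fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xv - mean_x) ** 2 for xv in x)
    sxy = sum((xv - mean_x) * (yv - mean_y) for xv, yv in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    ss_res = sum((yv - a * xv - b) ** 2 for xv, yv in zip(x, y))
    ss_tot = sum((yv - mean_y) ** 2 for yv in y)
    return a, b, 1.0 - ss_res / ss_tot


def compare_fits(x, y, outlier_idx):
    """Fit the line on all units and again after removing the flagged units."""
    flagged = set(outlier_idx)
    keep = [i for i in range(len(x)) if i not in flagged]
    return {
        "all": fit_with_r2(x, y),
        "clean": fit_with_r2([x[i] for i in keep], [y[i] for i in keep]),
    }
```

For instance, with register values `[1, 2, 3, 4, 5, 6]`, census values `[1, 2, 3, 4, 5, 60]` and the last unit flagged, the clean fit recovers a = 1 and b = 0, while the fit on all units is pulled away by the single outlier.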
16
Matlab output (1)
Outliers were identified by setting a 99% confidence band. For normally distributed data, outliers are expected to be found in 1% of the processed subsets. The graphical approach implemented in Matlab highlights both model inadequacy and model adjustment.
Figure: example of results of the outlier detection procedure.
17
Matlab output (2)
Matlab writes several txt files listing the estimated parameters, both for strata and for single observations. The whole procedure was managed by CONCERT, a web-based Java console implemented for scheduling and monitoring processes and their outputs.
To rank the detected outliers, a score function was computed from the main parameters of the procedure and the percent variation between administrative and census values.
Census units with a total area or UAA below 1 ha, or a vineyard or olive plantation area below 0.5 ha, were excluded from manual revision.
Units with fatal errors, identified according to the whole set of editing rules, were added to the detected outliers in the reports sent to the regional census offices (only for the Regions that adopted the High Level Participation Model and recorded the collected data).
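The ranking and exclusion rules can be illustrated with a short Python sketch. The area thresholds come from the slide; everything else is an assumption. In particular, the score below (deletion residual times absolute percent variation) is hypothetical, because the slides do not give the actual score function, and the field names are invented for the example.

```python
import math

# Exclusion thresholds in hectares, as stated in the slides.
MIN_AREA_HA = {"total_area": 1.0, "uaa": 1.0, "vineyards": 0.5, "olive": 0.5}


def percent_variation(census_ha, admin_ha):
    """Percent difference of the census value relative to the register value."""
    if admin_ha == 0:
        return math.inf if census_ha != 0 else 0.0
    return 100.0 * (census_ha - admin_ha) / admin_ha


def score(unit):
    """HYPOTHETICAL score: deletion residual times absolute percent variation."""
    variation = percent_variation(unit["census_ha"], unit["admin_ha"])
    return abs(unit["residual"]) * abs(variation)


def rank_outliers(units):
    """Drop units below the area thresholds, then sort by decreasing score."""
    kept = [u for u in units if u["census_ha"] >= MIN_AREA_HA[u["variable"]]]
    return sorted(kept, key=score, reverse=True)
```

The sorted list is the kind of output that would be sent to the regional census offices, with the highest-scoring units reviewed first.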
18
Overview of outlier detection results (1)
* For the Regions with an integrative participation model, only outlier detection was performed during data collection, owing to the limited number of variables recorded for provisional figures.
19
Overview of outlier detection results (2)
Distribution of outlier values by type of organizational model, variable and outcome of further investigation: absolute and percentage values (in brackets).
20
Conclusions
Selective editing, supported by the information available from administrative sources, limited the respondent burden. The review of outliers and fatal errors during data gathering improved the quality of provisional data, thus reducing the gap between initial and final results.
21
References
Atkinson, A.C., Riani, M. (2000). Robust Diagnostic Regression Analysis. Springer, New York.
Atkinson, A.C., Riani, M., Cerioli, A. (2004). Exploring Multivariate Data with the Forward Search. Springer, New York.
García-Escudero, L.A., Gordaliza, A., Mayo-Iscar, A., San Martin, R. (2010). Robust clusterwise linear regression through trimming. Computational Statistics and Data Analysis 54, 3057-3069. doi:10.1016/j.csda.2009.07.002.
Maronna, R.A., Martin, R.D., Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.
Reale, A., Torti, F., Riani, M. (2012). Robust methods for correction and control of Italian Agriculture Census data. 46th Scientific Meeting of the Italian Statistical Society, Sapienza University of Rome, Faculty of Economics, June 20-22, 2012.
Torti, F., Perrotta, D., Francescangeli, P., Bianchi, G. (2013). A robust procedure based on forward search to detect outliers. Conference New Techniques and Technologies for Statistics 2013, Brussels, 5-7 March 2013.
Riani, M., Atkinson, A.C. (2007). Fast calibrations of the forward search for testing multiple outliers in regression. Advances in Data Analysis and Classification 1, 123-141.
Riani, M., Perrotta, D., Torti, F. (2012). FSDA: A MATLAB toolbox for robust analysis and interactive data exploration. Chemometrics and Intelligent Laboratory Systems, in press. doi:10.1016/j.chemolab.2012.03.017.
22
Thank you!!!