Use of administrative data for outlier detection in the VI Italian agriculture census A. Reale 1, M. Riani 2, M. Greco 1, G. Ruocco 1 1 ISTAT, Census Department;

Presentation transcript:

Use of administrative data for outlier detection in the VI Italian agriculture census
A. Reale (1), M. Riani (2), M. Greco (1), G. Ruocco (1)
(1) ISTAT, Census Department; (2) University of Parma, Department of Economics
ICAS-VI, Rio de Janeiro, Brazil, October 2013

Outline
1. Introduction
2. E&I during data collection
3. Outlier detection procedure
4. Main steps of outlier detection
5. Results
6. Conclusions
7. References

Introduction
For the 6th Italian Agricultural Census, following a quality-oriented approach, the detection of outliers and influential errors (selective editing) has been performed mainly during data collection. After data capture, two main correction stages have been centrally managed by Istat.

E&I during data collection (1)
In order to prevent and correct fatal errors and missing values during data capture, different check tools have been implemented in the Survey Management System (SGR).
Census Data Collection System:
- Questionnaire editing (farms/enumerators): a subset of 220 checking rules (fatal and query) has been integrated in the web-based data entry system.
- Automatic check (data collection staff): run before the final release of data to the census DB, to localize potential errors that slipped through during data gathering.
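The distinction between fatal and query rules can be illustrated with a minimal Python sketch. The two rules, the field names and the 0.95 threshold are hypothetical and are not taken from the 220 census rules. The sketch also assumes the usual reading of the two labels: a fatal rule flags a value that cannot be accepted, a query rule flags a value that should be confirmed.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    severity: str                      # "fatal" or "query" (assumed labels)
    violated: Callable[[dict], bool]   # True when the record breaks the rule


# Hypothetical rules, for illustration only.
RULES = [
    Rule("uaa_exceeds_total_area", "fatal",
         lambda r: r["uaa_ha"] > r["total_area_ha"]),
    Rule("vineyard_share_unusually_high", "query",
         lambda r: r["total_area_ha"] > 0
         and r["vineyard_ha"] / r["total_area_ha"] > 0.95),
]


def check_record(record: dict) -> dict:
    """Return the names of the violated rules, grouped by severity."""
    report = {"fatal": [], "query": []}
    for rule in RULES:
        if rule.violated(record):
            report[rule.severity].append(rule.name)
    return report


# Made-up questionnaire record (areas in hectares).
record = {"total_area_ha": 10.0, "uaa_ha": 12.0, "vineyard_ha": 9.8}
print(check_record(record))
```

Keeping each rule as data (name, severity, predicate) makes it easy to run the same set both at data entry and again before the release to the census DB.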

E&I during data collection (2)
During data gathering, two distinct procedures have been implemented and managed centrally by Istat to detect influential errors and outlier values.
ISTAT E&I system:
- Outlier detection: 1. Forward Search technique; 2. manual review of anomalous values by data collection staff.
- Micro-editing check: highlights inconsistent data by analysing, at unit level, the coherence between the answers referring to related topics.

Outlier detection procedure
The special procedure to detect outliers has been implemented in partnership with the University of Parma and centrally applied by Istat.
The selection of outliers has been based on the robust Forward Search technique, used to identify the farms whose census values diverged significantly from the information registered by the Agency for Subsidies in Agriculture (AGEA).
Data have been processed using the core routines of the FSDA toolbox for Matlab, jointly developed by the University of Parma and the Joint Research Centre of the European Commission and freely downloadable from:

Main steps of outlier detection
Data linkage → Data input for Matlab → Methodological issues → Matlab → Matlab output processing → List of outliers sent to the regional census offices for manual revision

Data linkage
AGEA and census records have been linked using the fiscal code of the holder as the linking key.
About 85% of census farms have matched with administrative units having at least one of the checked areas.
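A minimal Python sketch of such a linkage step is given here, under stated assumptions: the record layout, the field names and the fiscal codes are made up, and each fiscal code is assumed to appear at most once on the administrative side. It is not the procedure used by Istat, only an illustration of an exact-key join and of how a match rate is obtained.

```python
def link_by_fiscal_code(census, agea):
    """Join census farms to administrative units on the holder's fiscal code."""
    agea_by_code = {unit["fiscal_code"]: unit for unit in agea}
    matched, unmatched = [], []
    for farm in census:
        unit = agea_by_code.get(farm["fiscal_code"])
        if unit is None:
            unmatched.append(farm)
        else:
            matched.append({
                "fiscal_code": farm["fiscal_code"],
                "census_uaa_ha": farm["uaa_ha"],
                "agea_uaa_ha": unit["uaa_ha"],
            })
    return matched, unmatched


# Made-up records, for illustration only.
census = [
    {"fiscal_code": "CF0001", "uaa_ha": 12.0},
    {"fiscal_code": "CF0002", "uaa_ha": 3.5},
    {"fiscal_code": "CF0003", "uaa_ha": 40.0},
    {"fiscal_code": "CF0004", "uaa_ha": 7.2},
]
agea = [
    {"fiscal_code": "CF0001", "uaa_ha": 11.8},
    {"fiscal_code": "CF0002", "uaa_ha": 9.0},
    {"fiscal_code": "CF0004", "uaa_ha": 7.0},
    {"fiscal_code": "CF0009", "uaa_ha": 2.0},
]

matched, unmatched = link_by_fiscal_code(census, agea)
match_rate = len(matched) / len(census)
print(f"matched {len(matched)} of {len(census)} census farms ({match_rate:.0%})")
```

Only the matched pairs carry both a census and an administrative value, so only they can be passed on to the comparison between the two sources.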

Data input for Matlab
Census units have been divided into strata, defined according to farm size, location and the area devoted to the following surfaces: Utilized Agricultural Area (UAA), Total Area, Vineyards and Olive Plantations. Only strata with at least 10 observations have been processed.
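The stratification and the minimum-size filter can be sketched in a few lines of Python. The key fields (`size_class`, `region`, `variable`) and the sample units are assumptions made for this illustration; only the threshold of 10 observations comes from the slide.

```python
from collections import defaultdict

MIN_STRATUM_SIZE = 10  # from the slide: smaller strata are not processed


def build_strata(units, key_fields=("size_class", "region", "variable")):
    """Group units by the tuple of their stratification fields."""
    groups = defaultdict(list)
    for unit in units:
        groups[tuple(unit[f] for f in key_fields)].append(unit)
    return dict(groups)


def drop_small_strata(strata, min_size=MIN_STRATUM_SIZE):
    """Keep only the strata with at least min_size observations."""
    return {key: members for key, members in strata.items()
            if len(members) >= min_size}


# Made-up units: one stratum with 12 farms, one with only 3.
units = (
    [{"id": i, "size_class": "small", "region": "R1", "variable": "UAA"}
     for i in range(12)]
    + [{"id": 100 + i, "size_class": "large", "region": "R1", "variable": "Vineyards"}
       for i in range(3)]
)

strata = build_strata(units)
kept = drop_small_strata(strata)
print({key: len(members) for key, members in kept.items()})
```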

Methodological issues (1)
Outlier: a unit whose behaviour markedly deviates from that of most observations in the distribution.
The hypothesis underlying this method is that admissible inconsistencies between two data sources should depend on differences in classification schemes, reference time, or target population.
A robust method such as the Forward Search (FS) technique is used to avoid masking (false negatives) and swamping (false positives) in outlier detection.

Methodological issues (2)
Total Agricultural Area analysis: examples of distributions having outliers whose behaviour follows a systematic pattern, unlikely to be due to random recording problems.

Methodological issues (3)
Main steps of the FS for regression:
- Start from a subset of size m (m0 = p, where p is the number of explanatory variables) and increase m until all units not in the subset are identified as outliers.
- Sort the residuals by Mahalanobis distance and include the m+1 observations having the minimum squared residuals.
- Monitor the changes in the fitted model due to the added units.
- Use least squares for parameter estimation for each subset.
- Analyse the minimum deletion residuals.
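The steps above can be sketched in plain Python for a straight-line model y = a*x + b. This is a simplified illustration and not the FSDA implementation: p is taken as the two fitted parameters, the initial subset is chosen as the pair of units whose exact-fit line has the smallest median squared residual (an assumption of this sketch), units are ordered by squared residuals, and no confidence envelopes are computed. The data are made up, with the last two units shifted upwards.

```python
import itertools
import math
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b on the given points; returns (a, b)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((v - xbar) ** 2 for v in xs)
    a = sum((v - xbar) * (w - ybar) for v, w in zip(xs, ys)) / sxx
    return a, ybar - a * xbar


def forward_search(x, y):
    """Return {m: (minimum deletion residual, unit attaining it)} for m > p."""
    n, p = len(x), 2                       # p = 2 fitted parameters (a and b)
    units = list(range(n))

    # Simplified robust start (assumption): the pair of units whose exact-fit
    # line has the smallest median squared residual over all units.
    def median_sq_resid(pair):
        a, b = fit_line([x[i] for i in pair], [y[i] for i in pair])
        return statistics.median((y[k] - a * x[k] - b) ** 2 for k in units)

    pairs = [pr for pr in itertools.combinations(units, p) if x[pr[0]] != x[pr[1]]]
    subset = list(min(pairs, key=median_sq_resid))

    trace = {}
    for m in range(p, n):
        xs = [x[i] for i in subset]
        a, b = fit_line(xs, [y[i] for i in subset])    # least squares on the subset
        resid = [y[i] - a * x[i] - b for i in units]
        if m > p:                                       # s(m) needs m - p > 0
            xbar = sum(xs) / m
            sxx = sum((v - xbar) ** 2 for v in xs)
            s = math.sqrt(sum(resid[i] ** 2 for i in subset) / (m - p))
            if s > 0:                                   # skip exact fits
                def deletion_resid(i):
                    leverage = 1 / m + (x[i] - xbar) ** 2 / sxx
                    return abs(resid[i]) / (s * math.sqrt(1 + leverage))

                outside = [i for i in units if i not in subset]
                nearest = min(outside, key=deletion_resid)
                trace[m] = (deletion_resid(nearest), nearest)
        # Next subset: the m + 1 units with the smallest squared residuals.
        subset = sorted(units, key=lambda i: resid[i] ** 2)[: m + 1]
    return trace


# Made-up data: ten units close to y = 2x + 1, two units shifted upwards.
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1]
x = list(range(1, 13))
y = [2 * xi + 1 + e for xi, e in zip(x, noise)] + [38.0, 43.0]

trace = forward_search(x, y)
for m in sorted(trace):
    value, unit = trace[m]
    print(f"m={m:2d}  min deletion residual={value:8.2f}  (unit {unit})")
```

With this toy data the monitored value at m = 10, the step just before the first shifted unit would enter the subset, is far larger than at m = 9 or m = 11. In the real procedure the monitored values are compared with confidence bands rather than inspected by eye.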

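Methodological issues (4)
Outliers are detected by monitoring the minimum deletion residual of the observations not in the outlier-free subset $S(m)$:

$$r_{\min}(m) = \min_{i \notin S(m)} \frac{\lvert y_i - x_i^{\top}\hat{\beta}(m)\rvert}{s(m)\sqrt{1 + h_i(m)}}, \qquad h_i(m) = x_i^{\top}\bigl(X(m)^{\top}X(m)\bigr)^{-1}x_i,$$

where $\hat{\beta}(m)$ is the least-squares estimate from the observations in $S(m)$, $X(m)$ is their design matrix, and $s(m)$ is the square root of the estimated residual variance computed from the observations in $S(m)$.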

Methodological issues (5)
The analysis of minimum deletion residuals highlights a peak corresponding to the iteration immediately preceding the inclusion of the first outlier value.

Methodological issues (6)
Regression line: Y = aX + b.
Estimation of the parameters a and b, with and without outliers.
Statistical significance and goodness of fit of the regression model (R²).
Plot axes: Census, Administrative Register.
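The effect of outliers on a, b and R² can be shown with a small Python sketch. The data are made up, and the choice of the administrative value as X and the census value as Y is an assumption, since the slide does not say which source is on which axis.

```python
def ols(x, y):
    """Least-squares fit of y = a*x + b; returns (a, b, R squared)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = ybar - a * xbar
    r2 = sxy ** 2 / (sxx * syy)   # for a straight line, R^2 = squared correlation
    return a, b, r2


# Made-up areas in hectares: the last census value is far from the register.
admin = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
census = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]

a_all, b_all, r2_all = ols(admin, census)
a_clean, b_clean, r2_clean = ols(admin[:-1], census[:-1])

print(f"with outlier:    a={a_all:.2f}  b={b_all:.2f}  R2={r2_all:.3f}")
print(f"without outlier: a={a_clean:.2f}  b={b_clean:.2f}  R2={r2_clean:.3f}")
```

A single discordant unit is enough here to double the slope and lower R², which is why the fit is compared with and without the detected outliers.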

Matlab output (1)
Outliers have been identified by setting a 99% confidence band. For normally distributed data, outliers are expected to be found in 1% of the processed subsets. The graphical approach implemented in Matlab highlights both model inadequacy and model adjustment.
Figure: example of results of the outlier detection procedure.

Matlab output (2)
Matlab outputs several txt files listing the estimated parameters, both for strata and for single observations. The whole procedure has been managed by CONCERT, a Java web console implemented for scheduling and monitoring processes and their outputs.
For ranking the detected outliers, a score function has been computed according to the main parameters of the procedure and the percent variation between administrative and census values.
Census units with a total area or UAA of less than 1 ha, or a vineyard or olive plantation area of less than 0.5 ha, have been excluded from the manual revision.
Units with fatal errors, identified according to the whole set of editing rules, have been added to the detected outliers in the reports sent to the regional census offices (only for the Regions which have adopted the High Level Participation Model and have recorded collected data).
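The exclusion thresholds and the ranking can be sketched in Python as follows. The actual score function is not given in the slides, so this sketch ranks by the absolute percent variation alone, which is only one of its stated ingredients. The field names and records are hypothetical, and the thresholds are assumed to apply to the census value.

```python
# Minimum areas (ha) below which a unit is not sent to manual revision.
MIN_AREA_HA = {"total_area": 1.0, "uaa": 1.0, "vineyards": 0.5, "olive": 0.5}


def percent_variation(census_ha, admin_ha):
    """Percent variation of the census value with respect to the register.

    Assumes admin_ha > 0.
    """
    return 100.0 * (census_ha - admin_ha) / admin_ha


def review_list(outliers):
    """Drop units below the area thresholds, then rank the remaining ones."""
    kept = [o for o in outliers
            if o["census_ha"] >= MIN_AREA_HA[o["variable"]]]
    # Hypothetical score: absolute percent variation, largest first.
    return sorted(
        kept,
        key=lambda o: abs(percent_variation(o["census_ha"], o["admin_ha"])),
        reverse=True,
    )


# Made-up detected outliers.
outliers = [
    {"id": "A", "variable": "uaa", "census_ha": 0.8, "admin_ha": 2.0},
    {"id": "B", "variable": "total_area", "census_ha": 12.0, "admin_ha": 10.0},
    {"id": "C", "variable": "vineyards", "census_ha": 3.0, "admin_ha": 1.5},
]

ranked = review_list(outliers)
print([o["id"] for o in ranked])
```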

Overview of outlier detection results (1)
* For the Regions with an integrative participation model, only the outlier detection has been performed during data collection, due to the limited number of variables recorded for provisional figures.

Overview of outlier detection results (2)
Distribution of outlier values by type of organizational model, variable and outcome of further investigation: absolute and percentage values (in brackets).

Conclusions
The selective editing, supported by the available information from administrative sources, has limited the respondent burden.
The review of outliers and fatal errors during data gathering has improved provisional data quality, thus reducing the gap between initial and final results.

References
Atkinson A. C., Riani M. (2000). Robust Diagnostic Regression Analysis. Springer, New York.
Atkinson A. C., Riani M., Cerioli A. (2004). Exploring Multivariate Data with the Forward Search. Springer, New York.
García-Escudero L. A., Gordaliza A., Mayo-Iscar A., San Martin R. (2010). Robust clusterwise linear regression through trimming. Computational Statistics and Data Analysis 54, doi: /j.csda
Maronna R. A., Martin D. R., Yohai V. J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.
Reale A., Torti F., Riani M. (2012). Robust methods for correction and control of Italian Agriculture Census data. 46th Scientific Meeting of the Italian Statistical Society, Sapienza University of Rome, Faculty of Economics, June 20 – 22,
Torti F., Perrotta D., Francescangeli P., Bianchi G. (2013). A robust procedure based on forward search to detect outliers. Conference New Techniques and Technologies for Statistics 2013, Brussels, 5-7 March
Riani M., Atkinson A. C. (2007). Fast calibrations of the forward search for testing multiple outliers in regression. Advances in Data Analysis and Classification 151,
Riani M., Perrotta D., Torti F. (2012). FSDA: A MATLAB toolbox for robust analysis and interactive data exploration. Chemometrics and Intelligent Laboratory Systems, in press, doi /j.chemolab

Thank you!!!