Outlier Treatment in HCSO: Present and future

Presentation transcript:

Outlier Treatment in HCSO Present and future

Outline
- Outlier detection: types, editing, estimation
- Description of the current method
- Alternatives
- Future work
- Introduction of a new tool: R and RStudio
UNECE Statistical Data Editing

Outlier detection and treatment
Purpose of outlier detection: identify errors (editing), estimation
Representative outliers, non-representative outliers
Decreasing weights, changing the values, using robust estimation
Source: MEMOBUST

Monthly Survey of Manufacturing
- Take-all part
- Survey part: enterprises with fewer than 50 employees (and more than 5, because the smallest businesses are outside the scope of the survey)
- The sampling frame is based on the Register of Enterprises (about 10 thousand units)
- The sampling ratio is about 15%
- Stratified sample: many NACE categories, categories of the number of employees, and two territorial strata (the capital and everything else) (Telegdi 2004)

Monthly Survey of Manufacturing: data
- Distribution of some variables
- Skewed distribution
- Visible outliers

Current method of outlier detection
The aim of the outlier treatment is to improve the estimation (Csereháti 2004).
Steps of the method:
1) Computing the outlier indicators
2) Manual outlier detection by the methodologist/expert
3) Transfer of the result to the subject matter statistician
4) Discussion of the result by the subject matter statistician (possible modifications)
The process resembles selective editing.

Outlier indicators
- LNSQRT: main indicator; Grubbs critical value; standardized value of the variables
- SQUARED: identifies the highest values
- MEANX: the ratio of the unit's observed value to the weighted mean of the stratum computed without that unit
- VALOUT: the difference between the estimate of the total with and without the given value in a given stratum
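
The MEANX and VALOUT definitions above can be sketched in a few lines of Python. This is only an illustration, not the HCSO implementation: the function names, the toy data, and the choice to re-estimate the total without the unit as (sum of all weights) x (weighted mean of the remaining units) are assumptions, since the slides do not give the exact formulas.

```python
def weighted_mean_without(values, weights, i):
    """Weighted mean of one stratum, leaving out unit i."""
    num = sum(w * y for j, (y, w) in enumerate(zip(values, weights)) if j != i)
    den = sum(w for j, w in enumerate(weights) if j != i)
    return num / den


def meanx(values, weights, i):
    """Ratio of unit i's value to the weighted mean of the other units."""
    return values[i] / weighted_mean_without(values, weights, i)


def valout(values, weights, i):
    """Estimated total with unit i minus the estimated total without it.

    Assumption: the 'without' estimate keeps the full weight sum and
    applies the weighted mean of the remaining units to it.
    """
    total_with = sum(w * y for y, w in zip(values, weights))
    total_without = sum(weights) * weighted_mean_without(values, weights, i)
    return total_with - total_without


# Hypothetical stratum: four sampled units, each with sampling weight 5.
stratum_values = [10.0, 12.0, 11.0, 100.0]
stratum_weights = [5.0, 5.0, 5.0, 5.0]

meanx_last = meanx(stratum_values, stratum_weights, 3)
valout_last = valout(stratum_values, stratum_weights, 3)
```

For the last unit, MEANX is 100 / 11 (the unit is about nine times the mean of the others), and VALOUT is 665 - 220 = 445.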

The main indicator: LNSQRT

Outlier treatment
- Weight trimming: the weights of the outliers are set to 1
- Number of outliers: about 2% of the cases on average
- Change in the estimates: mean: -15% on average; variance: substantial decrease
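
A minimal sketch of weight trimming as described above: flagged units keep their value but represent only themselves (weight 1). The data and function names are hypothetical, and the size of the change in this toy example says nothing about the -15% reported for the real survey.

```python
def trim_weights(weights, outlier_flags):
    """Set the weight of every flagged unit to 1; leave the others unchanged."""
    return [1.0 if flag else w for w, flag in zip(weights, outlier_flags)]


def weighted_total(values, weights):
    """Weighted estimate of the total."""
    return sum(w * y for y, w in zip(values, weights))


# Hypothetical stratum with one flagged unit.
values = [10.0, 12.0, 11.0, 100.0]
weights = [5.0, 5.0, 5.0, 5.0]
flags = [False, False, False, True]

total_before = weighted_total(values, weights)
total_after = weighted_total(values, trim_weights(weights, flags))
```

Here the estimated total drops from 665 to 265, because the extreme unit no longer stands in for four other units.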

Alternative methods
One-dimensional methods:
- Median absolute deviation (MAD)
- Custom indicator: share in total
- Quantile
- Disadvantage: they have to be applied to many variables
Multidimensional method:
- Mahalanobis distance based outlier detection
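
The two standard one-dimensional rules listed above can be sketched as follows. The threshold k = 3, the 1.4826 consistency factor, the linear-interpolation quantile, and the toy data are assumptions made for illustration; the slides do not state which variants were used.

```python
import statistics


def mad_outliers(values, k=3.0):
    """Flag values further than k scaled MADs from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: at least half of the values are identical.
        return [False] * len(values)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return [abs(v - med) > k * scale for v in values]


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def quantile_outliers(values, q=0.99):
    """Flag values above the q-th empirical quantile."""
    cutoff = quantile(values, q)
    return [v > cutoff for v in values]


# Hypothetical values of one variable.
values = [10.0, 12.0, 11.0, 13.0, 9.0, 100.0]
mad_flags = mad_outliers(values)
quantile_flags = quantile_outliers(values)
```

Both rules flag only the value 100 in this example. A smaller k makes the MAD rule flag more cases.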

Share in total, a custom indicator
- Considers the individual value and the size of the stratum in the same formula
- Inspired by the current indicators
- A possible outlier accounts for a considerably large share of the total in a big stratum
- The indicator is computed for each stratum
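
The slides do not give the formula of this indicator, so the following is only one possible reading: a unit is flagged when its share of the stratum total exceeds a threshold and the stratum is large enough. The minimum stratum size, the function name, and the data are hypothetical. The default threshold of 0.5 is the value reported in the results.

```python
from collections import defaultdict


def share_in_total_flags(records, threshold=0.5, min_size=5):
    """records: list of (stratum, value) pairs.

    Flags a unit when its value exceeds `threshold` of its stratum total
    and the stratum has at least `min_size` units (an assumed way of
    taking the stratum size into account).
    """
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for stratum, value in records:
        totals[stratum] += value
        sizes[stratum] += 1

    flags = []
    for stratum, value in records:
        total = totals[stratum]
        share = value / total if total > 0 else 0.0
        flags.append(sizes[stratum] >= min_size and share > threshold)
    return flags


# Hypothetical data: a five-unit stratum "A" and a two-unit stratum "B".
records = [("A", 10.0), ("A", 12.0), ("A", 11.0), ("A", 9.0), ("A", 100.0),
           ("B", 5.0), ("B", 50.0)]
flags = share_in_total_flags(records)
```

The unit with value 100 holds about 70% of the total of stratum A and is flagged. The unit with value 50 holds a larger share of stratum B but is not flagged, because that stratum is too small under the assumed size rule.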

Results
Quantile method:
- Threshold: 99%
- Identifies almost the same outliers as the current method
- Easy to implement
MAD:
- Choosing the threshold k is problematic
- Too many cases were selected

Results (2)
Share in total:
- Threshold value: 0.5
- Smaller number of outliers
Mahalanobis distance:
- We used the robust Mahalanobis distance
- 3 key variables (total revenue etc.)
- These are not involved in the current method, avoiding missing values
- Similar results (2/3 of the current outliers are detected)
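
To illustrate the idea behind the multidimensional method, here is a sketch of the classical squared Mahalanobis distance for two variables, written with the standard library only. It is not the robust variant used in the presentation: a robust version would replace the mean and covariance below with robust estimates, which is not shown here. The data and names are hypothetical.

```python
import statistics


def mahalanobis_sq_2d(points):
    """Squared classical Mahalanobis distance of each (x, y) point from the
    sample mean, using the sample covariance matrix (divisor n - 1)."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)

    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")

    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical units: the second variable is roughly twice the first,
# except for the last unit, which breaks the pattern.
points = [(10.0, 20.0), (12.0, 25.0), (11.0, 21.0),
          (13.0, 27.0), (9.0, 18.0), (12.0, 5.0)]
d2 = mahalanobis_sq_2d(points)
most_extreme = d2.index(max(d2))
```

The last unit has the largest distance even though neither of its two values is extreme on its own. With classical estimates a single extreme unit also inflates the covariance and so masks itself, which is one motivation for using robust estimates.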


Future plans
Development of methodology:
- More analysis of the effect on estimates
- Winsorization
Development of the process:
- Automation and reproducibility
- More informative report on the process, to help better understand and analyse the process steps
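
As a pointer to what winsorization would mean in practice, the sketch below caps values above a cutoff at the cutoff instead of changing their weights. It shows the simplest one-sided form with a cutoff chosen by hand. How the cutoff would be determined for the survey is not covered by the slides, and the data are hypothetical.

```python
def winsorize_upper(values, cutoff):
    """Replace every value above `cutoff` by `cutoff`; keep the rest unchanged."""
    return [min(v, cutoff) for v in values]


# Hypothetical values and a hand-picked cutoff.
values = [10.0, 12.0, 11.0, 13.0, 9.0, 100.0]
capped = winsorize_upper(values, cutoff=20.0)
```

Unlike weight trimming, the unit keeps its sampling weight but contributes a reduced value to the estimate.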

Experimental tools
- Outlier treatment is separate from the other steps of data processing and belongs to methodology
- Possible new tool: R (with RStudio)
- Advantages: ease of development, ready-to-use functions for outlier detection
- Disadvantage: needs an "expert" user; not a commonly used tool

Thank you for your attention!