Description, Characterization and Optimization of Drill-Down Methods for Outlier Detection and Treatment in Establishment Surveys J. L. Eltinge, U.S. Bureau of Labor Statistics.

Presentation transcript:

Description, Characterization and Optimization of Drill-Down Methods for Outlier Detection and Treatment in Establishment Surveys J. L. Eltinge, U.S. Bureau of Labor Statistics ICES III Session #66 – June 21, 2007

2 Acknowledgements and Disclaimer: The author thanks Jean-Francois Beaumont, Terry Burdette, Pat Cantwell, Larry Ernst, Julie Gershunskaya, Pat Getz, Howard Hogan, Erin Huband, Larry Huff, John Kovar, Mary Mulry and Susana Rubin-Bleuer for many helpful discussions. This paper expands on many of the ideas originally developed with Pat Cantwell in Eltinge and Cantwell (2006). The views expressed in this paper are those of the author and do not necessarily represent the policies of the U.S. Bureau of Labor Statistics.

3 Overview:
I. Drill-Down Procedures for Outlier Detection
II. Available Information
III. Costs of Drill-Down Procedures
IV. Risks of Drill-Down Procedures
V. Optimization of Drill-Down Procedures

4 I. Drill-Down Methods of Outlier Detection
A. Outliers: Extreme Values
   1. Usually (not always) large positive values
   2. Review article: Lee (1995)
   3. Variant on Chambers (1986):
      a. Representative outliers
      b. Non-representative outliers
      c. Gross measurement error

5 B. Predominant Literature Focuses on:
   1. Extreme values of:
      a. Unweighted individual observations
      b. Weighted individual observations
   2. The impact of (1.a) and (1.b) on estimators at fairly high levels of aggregation:
      a. Means, totals, other descriptive quantities
      b. Regression coefficients, other analytic parameters
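A minimal numerical sketch of the distinction in B.1: a response can be unremarkable on the unweighted scale yet dominate a weighted aggregate, so flags based on y and on w*y can disagree. The lognormal data, uniform weights, and the top-1% flagging rule below are illustrative assumptions, not a program's actual rule.

```python
# Contrast flags on unweighted values (y) with flags on weighted
# contributions (w*y); the data-generating choices are illustrative.
import numpy as np

rng = np.random.default_rng(1)
y = rng.lognormal(mean=2.0, sigma=0.8, size=500)   # reported values
w = rng.uniform(1.0, 200.0, size=500)              # survey weights
wy = w * y

unweighted_flags = y > np.quantile(y, 0.99)
weighted_flags = wy > np.quantile(wy, 0.99)
print("flagged on y only:   ", np.sum(unweighted_flags & ~weighted_flags))
print("flagged on w*y only: ", np.sum(weighted_flags & ~unweighted_flags))
print("flagged on both:     ", np.sum(unweighted_flags & weighted_flags))
```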

6 C. Drill-Down Methods
   1. (Implicit) assumptions in most outlier literature:
      a. Low or zero cost of data review, relative to other cost components
      b. Reference distribution(s) known or readily determined at relatively low cost
   2. Issues:
      a. For many surveys, the modal task is data review
         - Substantial overall expense
      b. Reference distributions are neither obvious nor readily obtained (especially for establishment surveys)

7 3. Some agency programs use drill-down procedures:
      a. Begin data review with examination of relatively fine estimation cells
      b. Identify estimation cells with extreme initial point estimates
      c. Examine microdata in the identified extreme cells
      d. Limited formal literature; exceptions: Luzi and Pallara (1999), Di Zio et al. (2005)
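A minimal sketch of the cell-first flow in steps (a)-(c), assuming pandas-style microdata with hypothetical columns `cell`, `y`, and `weight`; the robust z-score rule on cell-level estimates stands in for whatever flagging criterion a given program actually uses.

```python
# Sketch of the drill-down flow: estimate fine cells, flag cells with extreme
# estimates, and return only the microdata in those cells for analyst review.
# Column names and the robust z-score cutoff are illustrative assumptions.
import numpy as np
import pandas as pd

def drill_down_review_set(microdata: pd.DataFrame, z_cut: float = 3.5) -> pd.DataFrame:
    # (a) Initial point estimates (weighted means) at a relatively fine cell level
    d = microdata.assign(wy=microdata["weight"] * microdata["y"])
    g = d.groupby("cell")
    cell_est = (g["wy"].sum() / g["weight"].sum()).rename("estimate").reset_index()

    # (b) Identify cells with extreme initial estimates (median/MAD rule)
    med = cell_est["estimate"].median()
    mad = (cell_est["estimate"] - med).abs().median() + 1e-12
    cell_est["robust_z"] = 0.6745 * (cell_est["estimate"] - med) / mad
    extreme_cells = cell_est.loc[cell_est["robust_z"].abs() > z_cut, "cell"]

    # (c) Examine microdata only within the identified extreme cells
    return microdata[microdata["cell"].isin(extreme_cells)]
```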

8 D. Questions:
   1. Under what conditions are drill-down procedures preferable to standard methods of outlier detection and treatment, based on a balanced assessment of:
      a. Available information
      b. Costs
      c. Risks
   2. Does the characterization in (1) shed any light on possible approaches to the optimization of drill-down procedures?

9 II. Available Information
A. Usual Outlier Framework: Reference Distributions from:
   1. Internal reference distribution:
      a. Outliers defined with respect to quantiles or other functionals of the full set of sample responses
      b. Limitations: small subpopulations; time constraints
   2. External reference distributions:
      a. Observations from similar surveys in previous periods
      b. Related data from the frame or other administrative records
      c. Limitation: full comparability?
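A short sketch of the two reference-distribution framings above, assuming hypothetical arrays `y_current` (current sample responses) and `y_admin` (a related frame or administrative value for the same units); the 1%/99% quantile cutoffs and the ratio rule are illustrative only.

```python
# (1) Internal reference: flag against quantiles of the current full sample.
# (2) External reference: flag sharp departures from a related administrative
#     value; full comparability is not guaranteed (see II.A.2.c).
import numpy as np

def internal_flags(y_current: np.ndarray, lo_q: float = 0.01, hi_q: float = 0.99) -> np.ndarray:
    lo, hi = np.quantile(y_current, [lo_q, hi_q])
    return (y_current < lo) | (y_current > hi)

def external_flags(y_current: np.ndarray, y_admin: np.ndarray,
                   max_ratio: float = 3.0) -> np.ndarray:
    ratio = y_current / np.where(y_admin == 0, np.nan, y_admin)
    return (ratio > max_ratio) | (ratio < 1.0 / max_ratio)
```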

10 B. Information for Drill-Down Procedures
   1. Cell level:
      a. Generally an implicit prior distribution based on:
         - Historical and seasonal patterns for an individual cell and related cells; recent aggregate changes
         - Special information on, e.g., strikes, weather
      b. Consider formalization through modeling or a full Bayesian framework?
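One way to make the implicit cell-level prior in 1.a concrete is to compare a cell's current change with the distribution of its historical same-period changes. The inputs and the three-standard-deviation band below are illustrative assumptions, not an agency rule.

```python
# Screen a cell's current over-the-month change against its own history of
# same-month changes; the +/- 3 standard deviation band is illustrative.
import numpy as np

def cell_change_is_extreme(current_change: float,
                           historical_changes: np.ndarray,
                           n_sd: float = 3.0) -> bool:
    mu = historical_changes.mean()
    sd = historical_changes.std(ddof=1)
    return abs(current_change - mu) > n_sd * sd

# Example: a cell with a mild history of June-to-July changes but a large
# current change would be routed to microdata review.
history = np.array([0.2, -0.1, 0.4, 0.0, 0.3])
print(cell_change_is_extreme(2.5, history))   # True
```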

11 2. Microdata level:
      a. Individual observations from current or previous waves of the survey
      b. Again, one could consider formalization through a Bayesian approach
      c. For many cells, sample sizes are too small to make direct use of the tails of the empirical distribution alone
   3. For both cell-level and microdata-level reviews, the critical values (and corresponding tail probabilities) often remain implicit

12 III. Costs of Drill-Down Procedures
A. Review of fewer units at the microdata level should reduce costs
B. Quantification of (A) depends on fixed and variable components, e.g.:
   1. Fixed costs of training for a specific industry
   2. Incremental cost of reviewing:
      - one additional cell
      - one additional response within a cell
C. Evaluations in (B) are complicated by:
   1. Peak-load staffing constraints
   2. Limitations on available accounting information
   3. Non-monetary cost constraints, e.g., time
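The decomposition in III.B can be sketched as a fixed training cost plus incremental per-cell and per-response review costs. All cost parameters below are illustrative placeholders, not measured program costs.

```python
# Fixed-plus-incremental cost sketch for comparing a drill-down review with a
# full microdata review; all parameter values are illustrative.
def review_cost(n_cells_reviewed: int,
                n_responses_reviewed: int,
                fixed_training_cost: float = 500.0,
                cost_per_cell: float = 20.0,
                cost_per_response: float = 4.0) -> float:
    return (fixed_training_cost
            + cost_per_cell * n_cells_reviewed
            + cost_per_response * n_responses_reviewed)

# Drill-down review of 12 flagged cells (about 30 responses each) versus a
# full review of 400 cells x 30 responses:
print(review_cost(12, 12 * 30))     # drill-down
print(review_cost(400, 400 * 30))   # full review
```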

13 IV. Risks of Drill-Down Procedures
A. Context for Development and Evaluation: Six Cases (Eltinge and Cantwell, 2006)
   Case 1: Traditional randomization-based inference for aggregates of the finite population
   Cases 2 and 3: View the finite population as a realization of a superpopulation model; predict a function of the finite population values (Case 2) or estimate a superpopulation parameter (Case 3)

14 Cases 1-3 have dominated the literature to date
   - Primary results: bias-variance trade-offs; reduction of overall mean squared error
   - Explicitly or implicitly use some modeling conditions, e.g., a Weibull or other distributional assumption
   - Randomization performance is still of interest
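A small simulation can make the bias-variance trade-off concrete: compare the ordinary sample mean with a winsorized mean (a simple outlier treatment) on skewed synthetic data. The lognormal model and the 95% winsorization point are illustrative assumptions, not the distributional conditions used in the literature cited above.

```python
# Bias-variance / MSE comparison of an untreated mean versus a winsorized
# mean on skewed synthetic data; all simulation settings are illustrative.
import numpy as np

rng = np.random.default_rng(42)
true_mean = np.exp(0.5)                 # mean of a lognormal(0, 1) population
n_reps, n = 2000, 50

est_plain, est_wins = [], []
for _ in range(n_reps):
    y = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    est_plain.append(y.mean())
    cap = np.quantile(y, 0.95)
    est_wins.append(np.minimum(y, cap).mean())  # winsorize the upper tail

for name, est in [("plain mean", np.array(est_plain)),
                  ("winsorized mean", np.array(est_wins))]:
    bias = est.mean() - true_mean
    var = est.var()
    print(f"{name:16s} bias={bias:+.3f} var={var:.3f} mse={bias**2 + var:.3f}")
```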

15 Cases 4 and 5: View the true finite population values as the sum of a long-term smooth trend component and an irregular component, both generated by superpopulation models (cf. some discussion of outliers in time series, e.g., Galeano et al., 2006)
   - Prediction for functions of the finite population values (Case 4) or estimation of a superpopulation parameter (Case 5)
   - Detailed development depends heavily on model-identification issues and the available auxiliary information
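A minimal sketch of the trend-plus-irregular decomposition: estimate a smooth trend (here a simple centered moving average, purely an assumption) and flag periods whose irregular component is unusually large. This illustrates the decomposition only; it is not one of the time-series outlier models discussed in the literature cited above.

```python
# Decompose a series into a crude smooth trend and an irregular component,
# then flag large irregular components with a robust z-score; the moving
# average and the cutoff are illustrative choices.
import numpy as np

def flag_irregular(series: np.ndarray, window: int = 5, z_cut: float = 3.5):
    kernel = np.ones(window) / window
    trend = np.convolve(series, kernel, mode="same")   # crude smooth trend
    irregular = series - trend
    med = np.median(irregular)
    mad = np.median(np.abs(irregular - med)) + 1e-12
    robust_z = 0.6745 * (irregular - med) / mad
    return trend, irregular, np.abs(robust_z) > z_cut
```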

16 Case 6: Distinguish between the central portion and the fringes of the population
   - Multivariate normal example: the central portion consists of units within a central ellipsoid
   - Conceptual links with topcoding, disclosure limitation, and the core CPI
   - Need to explore: interest only in central quantiles (Rao et al., 1990; Francisco and Fuller, 1991), or in the core subpopulation as such?
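Under a multivariate normal working model, one concrete version of the central ellipsoid is the set of units whose squared Mahalanobis distance falls below a chi-square quantile. The 95% coverage level below is an illustrative choice.

```python
# Classify units as "core" (inside the central ellipsoid) versus "fringe"
# under a multivariate normal working model; the coverage level is illustrative.
import numpy as np
from scipy.stats import chi2

def core_membership(X: np.ndarray, coverage: float = 0.95) -> np.ndarray:
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    centered = X - mean
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)  # squared Mahalanobis distance
    return d2 <= chi2.ppf(coverage, df=X.shape[1])
```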

17 B. Cell-Level Risks of Type I and Type II Error
   1. Distinguish between:
      a. Primary estimands (examined directly in the drill-down procedure)
         - Risk of implicit overfitting within the selected cell
      b. Secondary estimands (not examined directly, but important for some subsequent publications)
         - Risk of masking outliers in dimensions orthogonal to the primary estimand
   2. Impact on MSE for the resulting primary and secondary estimators

18 3. Unit-level deletion within the extreme cells approximates the survey-weighted influence function for the cell-level estimand; cf. the standard literature on survey-weighted influence functions for aggregate-level estimands (Smith, 1987; Zaslavsky et al., p. 861)
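A minimal sketch of the unit-deletion idea: compute the change in a survey-weighted cell mean when a single unit is dropped, which approximates that unit's weighted influence on the cell-level estimand. Variable names are illustrative, and this is not the exact expression given in Smith (1987) or Zaslavsky et al.

```python
# Change in a weighted cell mean when each unit is deleted in turn; large
# values point to influential responses within a flagged cell.
import numpy as np

def deletion_effects(y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Change in the weighted mean of a cell when unit i is dropped."""
    total_w = w.sum()
    full_mean = np.dot(w, y) / total_w
    drop_means = (np.dot(w, y) - w * y) / (total_w - w)
    return drop_means - full_mean

y = np.array([10.0, 12.0, 11.0, 95.0])   # one suspicious response
w = np.array([3.0, 3.0, 3.0, 3.0])
print(deletion_effects(y, w))            # large effect for the 4th unit
```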

19 C. Evaluation and Reduction of Risks Not Fully Reflected in Mean Squared Error
   1. Squared error loss may not fully reflect the risk functions of program managers and other stakeholders
   2. Alternative: risks associated with the low-probability event that a published estimate differs markedly from:
      a. The true value
      b. The predicted value based on auxiliary information
   3. Consider application of other risk measures, e.g., the false discovery rate used in machine learning
D. Operational Risk: Will a given procedure for outlier detection and treatment be carried out as specified?
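The false-discovery-rate idea in C.3 can be illustrated with the Benjamini-Hochberg step-up procedure applied to cell-level p-values, however those p-values are derived from the program's reference distributions. The 10% FDR level is an illustrative choice.

```python
# Benjamini-Hochberg step-up procedure applied to cell-level p-values;
# returns a boolean "review this cell" indicator for each cell.
import numpy as np

def bh_reject(p_values, fdr: float = 0.10) -> np.ndarray:
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = fdr * (np.arange(1, m + 1) / m)
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])   # largest index meeting the BH bound
        reject[order[: k + 1]] = True       # reject all smaller p-values as well
    return reject
```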

20 VI. Closing Remarks
A. Summary: Drill-Down Procedures
   1. Contrast with standard approaches to outliers and influential observations
   2. Requires consideration of:
      a. Available information
      b. Costs
      c. Risks
   3. Optimization approaches

21 B. Alternatives to Current Drill-Down Procedures
   1. Apply adaptive sampling procedures (Thompson and Seber, 1996) to the selection of some cells for additional drill-down review
      a. Condition: the network structure is informative for the presence of outliers
      b. May be of special interest for outliers arising from gross errors from a common data-collection or administrative-record source
      c. Extend inference to account for cells that are not examined in depth
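A loose sketch of the adaptive-expansion idea in B.1: start from the cells flagged as extreme and add related cells (e.g., cells sharing a data-collection or administrative-record source) to the review set whenever they also meet the criterion, repeating until no new cells are added. This illustrates only the network condition in (a); it is not the adaptive cluster sampling design or estimator of Thompson and Seber, and the inputs are illustrative.

```python
# Breadth-first expansion of the review set over a cell network; the network
# definition and the review criterion are supplied by the caller.
from collections import deque

def adaptive_review_set(flagged, neighbors, is_extreme):
    """flagged: initially flagged cells; neighbors: dict cell -> related cells;
    is_extreme: callable cell -> bool, the review criterion."""
    review = set(flagged)
    queue = deque(flagged)
    while queue:
        cell = queue.popleft()
        for nbr in neighbors.get(cell, []):
            if nbr not in review and is_extreme(nbr):
                review.add(nbr)
                queue.append(nbr)
    return review
```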

22 2. Instead of cells defined a priori (e.g., by geography, industry and size class), consider cells generated through tree-based machine learning methods (e.g., Breiman et al., 1984)
      a. Resulting properties depend on the specific pruning method used for the trees
      b. Standard cross-validation methods have some limitations for complex survey data
      c. Screening to identify problems masked by the customary cell structure
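A minimal sketch of the tree-generated-cells idea: fit a regression tree of the review variable on frame covariates and use leaf membership as data-driven review cells. The use of scikit-learn, the synthetic covariates, and the pruning settings (min_samples_leaf, ccp_alpha, standing in for the pruning choices mentioned in (a)) are all illustrative assumptions.

```python
# Generate data-driven review cells from a pruned regression tree; leaf
# indices play the role of cell labels. All settings are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.integers(1, 10, size=n),      # e.g., coarse industry code
    rng.integers(1, 5, size=n),       # e.g., size class
    rng.uniform(0, 1, size=n),        # e.g., geographic index
])
y = 2.0 * X[:, 1] + rng.normal(scale=1.0, size=n)   # synthetic review variable

tree = DecisionTreeRegressor(min_samples_leaf=100, ccp_alpha=0.01, random_state=0)
tree.fit(X, y)
cells = tree.apply(X)           # leaf index = data-driven cell label
print(np.unique(cells).size)    # number of tree-generated cells
```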