Deliverable 2.8: Outliers Gary Brown Office for National Statistics UK.

1 Deliverable 2.8: Outliers Gary Brown Office for National Statistics UK

2 Outliers = Outlier detection and treatment aspects of combining data (survey/administrative) including options for various hierarchies

3 Overview: Introduction, Definitions, Identification, Treatment, Recommendations

4 Introduction
Deliverable 2.8 led by UK
– UK leader has worked in methodology for over 14 years
– Expert in Sample Design and Estimation for Business Surveys
– ... also expert in Small Area Estimation, Quality, Editing and Imputation, Time Series Analysis
QA by Italy

5 Definitions
– Outliers
– Errors
– Outliers in survey data
– Outliers in administrative data
– Outliers in modelling
... two glossaries considered: ONS and OECD

10 Definitions – outliers
OECD: “A data value that lies in the tail of the statistical distribution of a set of data values”
ONS: “A correct response, usually an extreme value isolated from the bulk of the responses, or has a large sample weight that would have an undue influence on the estimate”
Question 1: extreme (1) influential (2) both (3)

17 Definitions – errors
Errors are incorrect values identified by edit rules
Edit rule, OECD: “A logical condition or a restriction which must be met if the data is to be considered correct”
Edit rule, ONS: “A rule designed to detect specific errors in data for potential subsequent correction”
Errors are corrected before outliers are considered
Question 2: outliers = errors (1) outliers ≠ errors (2)

20 Definitions – survey outliers
In the survey context, an outlier is an unrepresentative value (influential)
A unit sampled with probability 1/n is assumed to represent n-1 unsampled units in the population
If the unit is unique, the assumption is invalid
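
The weighting assumption on this slide can be illustrated with a minimal sketch. The sample values and the design weight of 10 are invented for illustration, and the function names are hypothetical. Each sampled unit is scaled up by its weight, so one unique unit can dominate the estimated total.

```python
# Hypothetical sample of (value, design weight) pairs, invented for illustration.
# A weight of 10 means the unit stands for itself plus 9 unsampled units.
sample = [(120.0, 10), (95.0, 10), (110.0, 10), (4000.0, 10)]


def weighted_total(units):
    """Expansion estimate of the population total: sum of weight * value."""
    return sum(weight * value for value, weight in units)


def contribution_shares(units):
    """Share of the estimated total contributed by each sampled unit."""
    total = weighted_total(units)
    return [weight * value / total for value, weight in units]


total = weighted_total(sample)
shares = contribution_shares(sample)
for (value, weight), share in zip(sample, shares):
    print(f"value={value:8.1f} weight={weight:3d} share of total={share:.3f}")
```

If the last unit is in fact unique in the population, treating it as representative of nine others inflates the estimate, which is the sense in which the value is influential.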

23 Definitions – administrative outliers
In the administrative context, an outlier is an atypical value (extreme)
Administrative data represent a census, so each unit is treated as unique
No assumptions

26 Definitions – modelling outliers
In the modelling context, an outlier is an influential value
Influence, ONS: “The amount of effect a particular point has on the parameters of a regression equation”
Influence on processing and statistical modelling

29 Definitions – modelling outliers
Processing – editing: “fail if > 60% of maximum over past 5 years”
Processing – imputation: “uplift last return by average growth in domain”
Statistical modelling
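
The two processing rules quoted on this slide can be sketched as follows. The sketch takes one literal reading of each rule, which is an assumption: a return fails the edit when it exceeds 60% of the maximum over the past five years, and a missing return is imputed as the last return uplifted by the mean growth rate in the domain. All figures and function names are invented for illustration. The point is that a single outlying value changes what each rule does.

```python
def fails_edit(value, past_five_years, fraction=0.6):
    """Literal reading of the rule: fail if value > fraction * past maximum."""
    return value > fraction * max(past_five_years)


def impute(last_return, domain_growth_rates):
    """Uplift the last return by the mean growth rate across the domain."""
    average_growth = sum(domain_growth_rates) / len(domain_growth_rates)
    return last_return * (1 + average_growth)


# Hypothetical five-year histories: the second contains one outlying year.
typical_history = [100, 105, 98, 110, 102]
history_with_outlier = [100, 105, 98, 900, 102]

# The same new return of 80 fails against the typical history but passes
# once the outlying year has raised the threshold.
print(fails_edit(80, typical_history))
print(fails_edit(80, history_with_outlier))

# Hypothetical growth rates in a domain: one unit tripling in size drags
# the average growth, and therefore every imputed value, upwards.
typical_growth = [0.02, 0.03, 0.01, 0.02]
growth_with_outlier = [0.02, 0.03, 0.01, 2.00]
print(impute(100.0, typical_growth))
print(impute(100.0, growth_with_outlier))
```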

37 Identification – units
A data warehouse stores data once for repeated use
Each unit will have multiple values (variables/time periods), and whether any value is
– extreme depends on which other data are used
– influential depends on what process/model is estimated
Given repeated use, it is impossible to know how data domains will be defined or which models will be fitted
Therefore every unit in a data warehouse is a potential outlier
Question 3: yes (1) no (2) unsure (3)
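
The claim that "extreme depends on which other data are used" can be shown with a small sketch. The detection rule used here (distance from the domain median measured in median absolute deviations, with a cutoff k of 3) is an assumption chosen for illustration and is not taken from the presentation. The domain values are invented.

```python
import statistics


def is_extreme(value, domain_values, k=3.0):
    """Flag value if it lies more than k MADs from the domain median.

    k is a hypothetical tuning constant, not a recommended setting.
    """
    median = statistics.median(domain_values)
    mad = statistics.median([abs(v - median) for v in domain_values])
    return abs(value - median) > k * mad


# The same stored value assessed against two hypothetical domains.
value = 500
small_firms = [40, 55, 60, 45, 50, 65, 52]
all_firms = [40, 55, 60, 450, 520, 610, 480, 50]

print(is_extreme(value, small_firms))
print(is_extreme(value, all_firms))
```

The stored value is identical in both calls. Only the comparison domain changes, and with it the outlier status.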

43 Identification – uses
Assuming all units are potential outliers
– identification becomes use dependent
– outliers are recorded as part of the metadata of an output
– outliers are not otherwise recorded in the data warehouse
Expected data uses and examples of identification methods
– processing, e.g. comparing observed and expected edit failures
– updating the business register, e.g. comparing different sources
– survey (estimating variables and calibration weights), e.g. winsorisation and setting acceptable ranges
– survey/admin (modelling relationship and estimating survey), e.g. Cook’s distance and winsorisation
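
As an illustration of the survey/admin case, the following is a minimal sketch of Cook's distance for a simple linear regression of a survey variable on an administrative variable. It uses the standard textbook formula D_i = e_i^2 / (p s^2) * h_i / (1 - h_i)^2 with p = 2 parameters. The data and the function name are invented, and the choice of a one-predictor model is an assumption, not something specified in the presentation.

```python
def cooks_distances(x, y):
    """Cook's distance of each point for the least-squares line y = a + b*x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        e * e / (p * s2) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]


# Hypothetical administrative (x) and survey (y) values for six units.
admin = [10, 12, 14, 16, 18, 40]
survey = [11, 13, 14, 17, 19, 20]

distances = cooks_distances(admin, survey)
for unit, d in enumerate(distances):
    print(f"unit {unit}: Cook's distance = {d:.3f}")
```

In this invented data set the last unit has by far the largest distance because it combines high leverage with a departure from the trend of the other five. Any threshold for flagging (rules of thumb such as 1 or 4/n are often quoted) would need to be chosen per use.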

49 Treatment – units in uses
Identified outliers need to be treated during use
– to prevent distortion
– by adjusting the weight of the unit to P, where 0 < P < 100%
– balancing reduced variance against increased bias (i.e. MSE)
Expected data uses and examples of treatment methods
– processing, e.g. use medians rather than means
– updating the business register, e.g. delete one source
– survey (estimating variables and calibration weights), e.g. winsorisation and restrict to acceptable ranges
– survey/admin (modelling relationship and estimating survey), e.g. delete from modelling process and winsorisation
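
Three of the treatments named on this slide can be sketched in a few lines: using a median instead of a mean, winsorising at a cutoff, and reducing an outlier's weight to a fraction P of its original weight. The data, the cutoff of 500 and the fraction 0.1 are invented for illustration, and the simple one-sided capping shown here is only one form of winsorisation, not necessarily the one used in any particular survey.

```python
import statistics


def winsorise(values, cutoff):
    """One-sided winsorisation: values above the cutoff are set to the cutoff."""
    return [min(v, cutoff) for v in values]


def downweighted_total(values, weights, outlier_index, p):
    """Weighted total where one unit keeps only fraction p of its weight (0 < p < 1)."""
    total = 0.0
    for i, (value, weight) in enumerate(zip(values, weights)):
        if i == outlier_index:
            weight = weight * p
        total += weight * value
    return total


# Hypothetical returns with one outlying unit, all with design weight 10.
values = [120.0, 95.0, 110.0, 4000.0]
weights = [10, 10, 10, 10]

print("mean:", statistics.mean(values))
print("median:", statistics.median(values))
print("winsorised mean:", statistics.mean(winsorise(values, cutoff=500.0)))
print("untreated total:", downweighted_total(values, weights, 3, 1.0))
print("downweighted total:", downweighted_total(values, weights, 3, 0.1))
```

Each treatment pulls the estimate towards the bulk of the data, which lowers variance at the cost of some bias. Choosing the cutoff or P is the balancing act the slide refers to as MSE.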

53 Recommendations
1. Neither data units nor their entries in a data warehouse should be labelled as outliers
2. Identification and treatment of outliers should be unique to each instance data are used
3. Metadata on outliers should only be included in a data warehouse alongside outputs
Question 4: agree (1) disagree (2) discuss! (3)
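
One way the three recommendations could translate into a data structure is sketched below. Every class, field and function name here is hypothetical, and the treatment applied (capping at a use-specific cutoff) is only a stand-in. Warehouse records carry no outlier flag, each use identifies and treats outliers for itself, and the resulting metadata is stored with the output.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WarehouseRecord:
    """A stored value. Deliberately has no outlier flag (recommendation 1)."""
    unit_id: str
    variable: str
    period: str
    value: float


@dataclass
class Output:
    """A published estimate with its own outlier metadata (recommendation 3)."""
    name: str
    estimate: float
    outlier_metadata: list = field(default_factory=list)


def produce_output(name, records, cutoff):
    """Total after capping at a cutoff chosen for this use (recommendation 2)."""
    metadata = []
    total = 0.0
    for record in records:
        treated = min(record.value, cutoff)
        if treated != record.value:
            metadata.append({
                "unit_id": record.unit_id,
                "method": "winsorisation",
                "cutoff": cutoff,
                "original": record.value,
                "treated": treated,
            })
        total += treated
    return Output(name, total, metadata)


# Hypothetical warehouse content used by two different outputs.
records = [
    WarehouseRecord("A", "turnover", "2010", 120.0),
    WarehouseRecord("B", "turnover", "2010", 95.0),
    WarehouseRecord("C", "turnover", "2010", 4000.0),
]
domain_output = produce_output("domain total", records, cutoff=500.0)
national_output = produce_output("national total", records, cutoff=5000.0)
print(domain_output)
print(national_output)
```

Unit C is treated as an outlier in the first output and not in the second, while its warehouse record is the same in both cases.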
