Oversampling the capital cities in the EU SAfety SUrvey (EU-SASU) Task Force on Victimization Eurostat, February 2010 Guillaume Osier Service Central de la Statistique et des Etudes Economiques (STATEC) Social Statistics Division
Outline
I. Some theory
  1. Definitions and concepts
  2. How to over-sample?
  3. Why over-sample?
  4. Impact on national accuracy
II. Over-sampling the capital cities in the EU-SASU
  1. Is this proposal (statistically) relevant?
  2. How to determine the over-sampling rates?
  3. Impact on the national accuracy
III. Specific issues in relation to over-sampling
Definitions and concepts
(i) A sub-group (d) in the population is said to be over-sampled (or over-represented) when the proportion of units from the sub-group is, on average, higher in the sample than in the reference population: $E(n_d / n) > N_d / N$
(ii) Conversely, a sub-group is said to be under-sampled (or under-represented) when the proportion of units from the sub-group is, on average, lower in the sample than in the reference population: $E(n_d / n) < N_d / N$
(iii) When a sub-group is neither over-sampled nor under-sampled, it is said to be well-sampled (or well-represented)
Here $N_d / N$ is the proportion of units from (d) in the population and $E(n_d / n)$ is the average proportion of units from (d) in the sample.
How to over-sample?
Over-sampling can only be implemented if the units in the sub-group can be identified in advance of sampling (an issue with telephone surveys).
Two main techniques to over-sample:
* Stratification using unequal sampling fractions in the strata
* More general « proportional-to-size » sampling (πps, pps, …)
Over-sampling rate for (d): the expected sample size in (d) divided by the expected sample size in (d) under no over-sampling (i.e. under Simple Random Sampling), $E(n_d) / (n \, N_d / N)$
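The over-sampling rate defined above can be sketched in a few lines. This is only an illustration: the function name, its parameters, and the figures in the example are hypothetical, and the benchmark under simple random sampling is taken to be n × N_d / N.

```python
def oversampling_rate(expected_n_d, n, N_d, N):
    """Expected sample size in (d) relative to its expectation under SRS."""
    # Under simple random sampling the sub-group is expected to receive
    # a share of the sample equal to its population share N_d / N.
    expected_under_srs = n * N_d / N
    return expected_n_d / expected_under_srs


# Hypothetical figures (not from the survey): a capital city holding 10%
# of the population is allotted 1,000 of the 5,000 interviews.
rate = oversampling_rate(expected_n_d=1000, n=5000, N_d=1_000_000, N=10_000_000)
print(f"over-sampling rate: {rate:.2f}")
```

A rate above 1 means the sub-group is over-sampled, a rate below 1 that it is under-sampled.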
Why over-sample? 1/2 By selecting more people from certain groups than would typically be done if everyone in the sample had an equal chance of being selected, over-sampling leads to more accurate estimates for those groups. The technique has proven particularly suitable to: Small sub-populations; Sub-populations having severe non-response problems; Sub-populations with large internal variability on the key variables (e.g., household wealth)
Why over-sample? 2/2 More generally, one can resort to over-sampling whenever the sample size doesn’t allow us to reach specified precision targets over certain sub-populations. Besides, in cross-national surveys (like the EU-SASU), over-sampling is essential for precision and hypothesis testing in cross-country comparisons. The choice of the sub-groups to over-sample is policy-driven (political matter)
Impact on national accuracy 1/3
Optimal (Neyman) allocation: in order to maximize the precision of the national sample under stratified simple random sampling, the sample size in stratum h depends both on the stratum population size $N_h$ and the standard deviation $S_h$ of the study variable:
$n_h = n \, \dfrac{N_h S_h}{\sum_{k=1}^{H} N_k S_k}$
The total population aged 16+ is split into strata 1, 2, …, H with sizes $N_1, N_2, \dots, N_H$ and standard deviations $S_1, S_2, \dots, S_H$.
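As a rough illustration of Neyman allocation, the sketch below distributes a total sample of size n across strata in proportion to N_h × S_h. The function name and the two-stratum figures are hypothetical and serve only to show the mechanics.

```python
def neyman_allocation(n, sizes, std_devs):
    """Split a total sample n across strata proportionally to N_h * S_h."""
    weights = [N_h * S_h for N_h, S_h in zip(sizes, std_devs)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical two-stratum population: a large, homogeneous stratum and
# a smaller, more variable one.
sizes = [800_000, 200_000]
std_devs = [0.30, 0.45]
alloc = neyman_allocation(1000, sizes, std_devs)
for h, n_h in enumerate(alloc, start=1):
    print(f"stratum {h}: n_h = {n_h:.1f}")
```

In this example the second stratum receives more than its proportional share (200 of 1,000) because it is more variable internally, while the first stratum still receives the larger sample because it is larger.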
Impact on national accuracy 2/3
According to the previous formula, a larger sample should be taken if:
* the stratum is larger
* the stratum is more variable internally
These national considerations may conflict with more “local” considerations: as noted earlier, from a local point of view, over-sampling often focuses on small sub-populations, while national considerations lead to taking larger samples from the largest strata. Nevertheless, the loss in national accuracy is often limited:
Impact on national accuracy 3/3 Thus, if g = 20%, we have σ/σ(opt) ≈ 1.02, i.e. the standard error (the measure of accuracy used here) increases by 2% relative to the optimum. Similarly, if g = 30%, we have σ/σ(opt) ≈ 1.04, i.e. an increase of 4%. In this sense the optimum can be described as flat. As a result, the impact of over-sampling on national accuracy should be limited, provided the sample sizes are not “extremely” different from the optimal ones. The impact is all the more limited given that the national sample sizes are generally large (thousands of units). Besides, by using powerful auxiliary information at national level, one may hope to increase sample precision a posteriori.
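The 2% and 4% figures above are consistent with a classical bound for stratified sampling, V / V(opt) ≤ 1 + g², where g is assumed here to denote the largest relative departure of the stratum sample sizes from their Neyman-optimal values. That reading of g is an assumption, and the function name below is hypothetical. A minimal sketch under this assumption:

```python
import math


def se_ratio_bound(g):
    """Upper bound on SE / SE(opt), assuming Var / Var(opt) <= 1 + g**2."""
    return math.sqrt(1 + g ** 2)


for g in (0.2, 0.3):
    print(f"g = {g:.0%}: standard error ratio <= {se_ratio_bound(g):.3f}")
```

Because the bound grows with the square of g, moderate departures from the optimal allocation cost little in terms of standard error.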
Over-sampling the capital cities in the EU-SASU: is this proposal relevant?
Capital city = most populated city of the country. Always the same as the political capital (except for Switzerland).
Is the proposal (statistically) relevant?
* Sample size of individuals over the capital cities: is it enough to draw reliable conclusions?
* Victimization rates in the capital cities: are they generally higher than those for the rest of the country?
* Higher non-response in the capital cities? (often correct)
Minimum sample sizes for the capital cities
Victimization rates in capital cities
Victimization rates are higher in the capital cities than in the rest of the countries.
Source: International Crime and Victimization Survey (ICVS), 2005
How to determine the over-sampling rates? 1/4
Step 1: set a precision target for every capital city
Step 2: determine the minimum sample size needed to achieve the level of precision specified at Step 1
Precision target (1): under simple random sampling, a relative margin of error of α% in each capital city for any victimization rate higher than P%
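Step 2 can be sketched for the relative precision target as follows. The sketch assumes a 95% confidence level (z = 1.96), simple random sampling without finite population correction, and no design effect. The name `alpha` for the relative margin of error and the function name are assumptions made for illustration. Since the relative margin of error shrinks as the rate grows, the binding case for "any rate higher than P" is the rate P itself.

```python
import math


def min_n_relative(P, alpha, z=1.96):
    """Smallest n such that z * sqrt(P(1-P)/n) / P <= alpha under SRS."""
    # Solving the inequality for n gives n >= z^2 (1 - P) / (P alpha^2).
    return math.ceil(z ** 2 * (1 - P) / (P * alpha ** 2))


# Illustrative targets: 10% relative margin of error, rates of 10% to 50%.
for P in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"P = {P:.0%}: n >= {min_n_relative(P, alpha=0.10)}")
```

The required sample size falls quickly as P rises, which is why the choice of P matters as much as the choice of the margin itself.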
How to determine the over-sampling rates? 2/4 α = 10% (relative margin of error)
How to determine the over-sampling rates? 3/4 P = 20%
How to determine the over-sampling rates? 4/4 Precision target (2): under simple random sampling, an absolute margin of error of α percentage points in each capital city for any victimization rate higher than P%
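For the absolute precision target, a comparable sketch is given below, again assuming a 95% confidence level (z = 1.96) and simple random sampling without finite population correction. The function and parameter names are hypothetical. Note that p(1 − p) peaks at p = 0.5, so for an absolute margin the most demanding rate is the one closest to 50%.

```python
import math


def min_n_absolute(p, e, z=1.96):
    """Smallest n such that z * sqrt(p(1-p)/n) <= e under SRS."""
    # Solving the inequality for n gives n >= z^2 p (1 - p) / e^2.
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)


# Illustrative targets: a margin of 3 percentage points at several rates.
for p in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"p = {p:.0%}: n >= {min_n_absolute(p, e=0.03)}")
```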
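Impact on the national accuracy 1/8
Consider the national victimization rate for the 10 main crimes as used in the International Crime and Victimization Survey (ICVS):
$P = W_c P_c + (1 - W_c) P_r$
where $P_c$ is the victimization rate in the capital city, $P_r$ is the victimization rate in the rest of the country, and $W_c$ is the share of the capital city in the national population.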
Impact on the national accuracy 2/8 Variance: Relative margin of error: Absolute margin of error:
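One way to compute the three quantities named on this slide is sketched below. It assumes a two-stratum design (capital city versus rest of the country) with simple random sampling in each stratum, no finite population correction, and a 95% confidence level (z = 1.96). These assumptions, the function name, and the example figures are illustrative and are not taken from the presentation.

```python
import math


def national_margins(W_c, p_c, p_r, n_c, n_r, z=1.96):
    """Return (national rate, absolute margin, relative margin)."""
    # National rate as the population-weighted mean of the two strata.
    p = W_c * p_c + (1 - W_c) * p_r
    # Variance of the stratified estimator (no finite population correction).
    var = (W_c ** 2 * p_c * (1 - p_c) / n_c
           + (1 - W_c) ** 2 * p_r * (1 - p_r) / n_r)
    absolute = z * math.sqrt(var)
    relative = absolute / p
    return p, absolute, relative


# Hypothetical country: the capital holds 10% of the population, with
# rates of 30% (capital) and 20% (rest) and 1,000 / 4,000 interviews.
p, abs_moe, rel_moe = national_margins(0.10, 0.30, 0.20, 1000, 4000)
print(f"rate = {p:.3f}, absolute = {abs_moe:.4f}, relative = {rel_moe:.3%}")
```

Because the capital's contribution to the variance is weighted by the square of its population share, moving interviews into the capital mainly affects the national variance through the reduced sample in the rest of the country.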
Impact on the national accuracy 3/8 Case 1: fixed national sample size
Impact on the national accuracy 4/8
Table 3: Relative margin of error (%) for the national victimization rate – fixed sample size at national level (Case 1)
Columns: Country | Over-sampling (P=0.1, P=0.2, P=0.3, P=0.4, P=0.5) | No over-sampling
Rows: France, Germany, Switzerland, Italy, Poland, Netherlands, Portugal, Denmark, Greece, Spain, Sweden, Finland, Norway, Ireland, Belgium, United Kingdom, Hungary, Austria, Estonia
Impact on the national accuracy 5/8
Table 4: Absolute margin of error (% points) for the national victimization rate – fixed sample size at national level (Case 1)
Columns: Country | Over-sampling (P=0.1, P=0.2, P=0.3, P=0.4, P=0.5) | No over-sampling
Rows: France, Germany, Switzerland, Italy, Poland, Netherlands, Portugal, Denmark, Greece, Spain, Sweden, Finland, Norway, Ireland, Belgium, United Kingdom, Hungary, Austria, Estonia
Impact on the national accuracy 6/8 Case 2: national sample size not fixed
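The difference between Case 1 and Case 2 can be sketched as two allocation rules. The interpretation used here is an assumption: in Case 1 the extra interviews in the capital are taken from the rest of the country so that the national total stays at n, while in Case 2 the rest of the country keeps its proportional sample and the national total grows. The function name and the figures are hypothetical.

```python
def allocate(n, W_c, n_c_min, fixed_total=True):
    """Return (n_c, n_r): interviews in the capital and in the rest."""
    n_c_prop = n * W_c              # capital sample under proportional allocation
    n_c = max(n_c_prop, n_c_min)    # over-sample only if the minimum is not met
    if fixed_total:
        n_r = n - n_c               # Case 1 (assumed): national total stays at n
    else:
        n_r = n - n_c_prop          # Case 2 (assumed): the rest keeps its share
    return n_c, n_r


# Hypothetical country: n = 5,000, capital share 10%, minimum of 1,537
# interviews required in the capital.
case1 = allocate(5000, 0.10, 1537, fixed_total=True)
case2 = allocate(5000, 0.10, 1537, fixed_total=False)
print("Case 1:", case1, "total =", sum(case1))
print("Case 2:", case2, "total =", sum(case2))
```

Under this reading, Case 1 trades precision in the rest of the country for precision in the capital, whereas Case 2 buys the extra precision with a larger overall sample.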
Impact on the national accuracy 7/8
Table 5: Relative margin of error (%) for the national victimization rate – national sample size not fixed (Case 2)
Columns: Country | Over-sampling (P=0.1, P=0.2, P=0.3, P=0.4, P=0.5) | No over-sampling
Rows: France, Germany, Switzerland, Italy, Poland, Netherlands, Portugal, Denmark, Greece, Spain, Sweden, Finland, Norway, Ireland, Belgium, United Kingdom, Hungary, Austria, Estonia
Impact on the national accuracy 8/8
Table 6: Absolute margin of error (% points) for the national victimization rate – national sample size not fixed (Case 2)
Columns: Country | Over-sampling (P=0.1, P=0.2, P=0.3, P=0.4, P=0.5) | No over-sampling
Rows: France 0.7, Germany 0.7, Switzerland 0.9, Italy 0.7, Poland 0.8, Netherlands, Portugal, Denmark, Greece 0.7, Spain 0.6, Sweden 0.8, Finland, Norway, Ireland 1.0, Belgium 0.8, United Kingdom 0.8, Hungary 0.6, Austria, Estonia 0.8 1.0
Specific issues
The initial difficulty is in obtaining a sampling frame appropriate for over-sampling the inhabitants of the capital cities. For the countries conducting a face-to-face survey, this should not be a serious issue. On the other hand, the countries which plan to conduct the survey by telephone might be unable to do so, unless specific phone numbers are allocated to the households in the capital city (e.g., when the first digits of a phone number represent the city code).
Since individuals in capital cities are in general more difficult to contact, over-sampling them will necessitate more attempted contacts, which will likely imply higher costs and more time to reach the minimum sample size required for the survey.
Finally, over-sampling might make the problem of anonymisation of the data more acute.
Questions for the TF
1. Is over-sampling the inhabitants of the capital cities policy relevant? Which geographical areas might be over-sampled instead?
   * NUTS2 or NUTS3 regions
   * Groups of cities (like in Eurostat’s Urban Audit)
   * Densely populated areas (based on degree of urbanisation)
   * City areas…
2. What level of accuracy is needed for the capital cities/other geographical areas?
3. What about higher non-response?
4. What about telephone surveys?