Adjustments to the survey design: Sampling

Presentation on theme: "Adjustments to the survey design: Sampling"— Presentation transcript:

1

2 Adjustments to the survey design: Sampling

3 Introduction to DHS Survey Design and Sampling
Sampling frame -> EAs / clusters -> Households -> Individuals
Stratified two-stage cluster sampling design:
In the first stage, a Probability Proportional to Size (PPS) sample of Enumeration Areas (EAs) is selected.
In the second stage, a systematic sample of households is selected.
The Enumeration Area: a geographical statistical unit created for a census and containing a certain number of households.
Household listing process: the households in the selected clusters are listed to create a frame of HHs for the second stage of selection.
No within-HH selection: all eligible women, men and children are included, except for the DV module.

4 Introduction to DHS Survey Design and Sampling
Sampling strata: used for sample selection; region by urban/rural.
Survey domain: the reporting level for survey estimates; regions.
[Diagram: Region 1 and Region 2 are each split into Urban and Rural strata, and each of the four strata has its own sampling frame of EAs / clusters.]

5 Introduction to DHS Survey Design and Sampling
The sample size: To produce reliable TFR and MR, about completed women per survey domain. Other indicators may require smaller sample size. The total sample size should consider the number of domains. Sample allocation has to take this minimum sample size into account.

6 Introduction to DHS Survey Design and Sampling
Strata          HH population (millions)
Region1/Urban   4
Region1/Rural   3
Region2/Urban   2
Region2/Rural   1
Total           10

As an illustrative example, assume one woman per HH and 100% household and woman response rates.

7 Introduction to DHS Survey Design and Sampling
Strata          HH population   Selected HH
                (millions)      (proportional allocation)
Region1/Urban   4               1600
Region1/Rural   3               1200
Region2/Urban   2               800
Region2/Rural   1               400
Total           10              4000

Assuming a total sample size of 4000 HHs (women), a proportional allocation will not satisfy the domain analysis objective if the domain and the strata are the same.

8 Introduction to DHS Survey Design and Sampling
Strata          HH population   Selected HH                 Selected HH
                (millions)      (proportional allocation)   (power allocation)
Region1/Urban   4               1600                        1120
Region1/Rural   3               1200                        1020
Region2/Urban   2               800                         980
Region2/Rural   1               400                         880
Total           10              4000                        4000

Power allocation: the sample is allocated in proportion to a powered value of the measure of size (population here), with a power between 0 and 1.
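The power allocation rule can be sketched in a few lines of Python. This is only an illustration: the function name `power_allocation` is made up for this example, and the slide does not say which power produced its (apparently rounded) figures, so the sketch only shows the two limiting cases. A power of 1 reproduces the proportional allocation, and a power of 0 gives every stratum the same sample.

```python
def power_allocation(sizes, total_sample, power):
    """Allocate total_sample across strata in proportion to size ** power."""
    powered = [size ** power for size in sizes]
    total_powered = sum(powered)
    return [total_sample * p / total_powered for p in powered]


# Stratum populations in millions of households, taken from the slide.
sizes = [4, 3, 2, 1]

# power = 1 is the proportional allocation shown in the table.
proportional = power_allocation(sizes, 4000, 1.0)

# power = 0 ignores stratum size and splits the sample equally.
equal = power_allocation(sizes, 4000, 0.0)

print(proportional)
print(equal)
```

Any power strictly between 0 and 1 gives an allocation between these two extremes, which is how small strata end up over-sampled relative to their population share.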

9 Introduction to DHS Survey Design and Sampling
Strata          HH population   Selected HH                 Selected HH          Sample
                (millions)      (proportional allocation)   (power allocation)   clusters
Region1/Urban   4               1600                        1120                 56
Region1/Rural   3               1200                        1020                 51
Region2/Urban   2               800                         980                  49
Region2/Rural   1               400                         880                  44
Total           10              4000                        4000                 200

Assuming a sample take of 20 households per cluster.

10 Introduction to DHS Survey Design and Sampling
[Diagram:]
Sampling frame
-> EAs / clusters (1st stage of selection; 1st stage selection probability P1)
-> Households (2nd stage of selection; 2nd stage selection probability P2; household response rate HH RR)
-> Individuals (target population)

11 Introduction to DHS Survey Design and Sampling
Probability proportional to size (PPS) selection in the first stage, equal probability selection at the second stage.
All EAs/households have a chance (probability) to be selected.
Each selection stage is accompanied by selection probabilities.
The product of all the selection probabilities is the overall selection probability of the sample unit:
P = P1 * P2

12 Introduction to DHS Survey Design and Sampling
The inverse of the overall selection probability is the "design weight". The design weight reflects the sample selection mechanism in the analysis.
The design weights are adjusted for the household response rate. This adjustment reflects sample unit response in the analysis.
D = 1/P
W = D/RR
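As a worked illustration of P = P1 * P2, D = 1/P and W = D/RR, here is a minimal Python sketch. It assumes the usual textbook forms for the two stages (P1 = clusters selected x EA size / stratum size for PPS, P2 = households selected / households listed), which the slides do not spell out. The function names, the EA size of 250, the 260 listed households, and the 95% response rate are hypothetical; the 56 clusters, 20 households per cluster, and 4 million households come from the illustrative example.

```python
def pps_first_stage_prob(clusters_selected, ea_size, stratum_size):
    """P1 under PPS: chance that an EA of a given size is selected (assumed form)."""
    return clusters_selected * ea_size / stratum_size


def second_stage_prob(households_selected, households_listed):
    """P2: equal-probability selection of households within a listed EA."""
    return households_selected / households_listed


def adjusted_weight(p1, p2, response_rate):
    """W = D / RR, where D = 1 / (P1 * P2) is the design weight."""
    design_weight = 1.0 / (p1 * p2)
    return design_weight / response_rate


# Region1/Urban from the example: 56 clusters drawn from 4 million households.
p1 = pps_first_stage_prob(56, 250, 4_000_000)   # EA size 250 is hypothetical
p2 = second_stage_prob(20, 260)                 # 260 listed HHs is hypothetical
design = 1.0 / (p1 * p2)
weight = adjusted_weight(p1, p2, 0.95)          # 95% response is hypothetical

print(p1, p2, design, weight)
```

Dividing by a response rate below 1 makes the adjusted weight larger than the design weight, so each responding household also stands in for the non-respondents.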

13 Adjustments to the survey design: Analysis

14 Introduction to DHS Survey Design and Sampling
Weights are normalized so that the weighted sample size equals the actual sample size (a DHS tradition). Normalized weights can be used to produce any kind of indicator except totals. They should not be used for pooled data. Weights are case-dependent, not variable-dependent.
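The normalization step can be sketched as follows. The function name and the four raw weights are hypothetical. The point of the sketch is that normalized weights sum to the number of cases rather than to the population size, which is why they cannot be used to estimate totals.

```python
def normalize_weights(weights):
    """Rescale weights so that they sum to the number of cases (mean weight 1)."""
    n_cases = len(weights)
    total = sum(weights)
    return [w * n_cases / total for w in weights]


# Hypothetical raw (response-adjusted) weights for four cases.
raw = [3900.0, 2500.0, 1200.0, 400.0]
normalized = normalize_weights(raw)

print(normalized)
print(sum(normalized))
```

Ratios between cases are unchanged by the rescaling, so means and proportions come out the same with raw or normalized weights.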

15 Analyzing DHS data: How to take the survey design into account
Question: “In the ‘Guide to DHS Statistics’ it is not recommended to do weighted analysis if one is attempting regression analysis, but I have come across another point of view….”
This recommendation is out of date and will be removed from the next version.
We always adjust for weighting, clustering, and stratification during analysis and strongly recommend that others do so.
Some users may prefer not to weight but that is now a minority position among statisticians and social scientists.
When checking data quality, looking for outliers, confirming recode commands, etc., you do not need to make these adjustments.

16 How a DHS survey differs from a Simple Random Sample
Question: Several people have asked about how DHS data differ from simple random samples (SRS) and how the adjustments correct for that.
This is really the fundamental issue for the “analysis” part of this webinar.
All standard statistical estimation and testing procedures are based on the two SRS assumptions:
--All cases in the population have the same probability of appearing in the sample;
--Each case is sampled independently of all other cases.
DHS sampling procedures, described earlier, like all multi-stage cluster samples, violate both of these assumptions.

17 Three adjustments should be made during analysis
I will describe why each adjustment should be made, and what impact it has on the analysis, under these labels:
--The weight adjustment
--The cluster adjustment
--The stratification adjustment

18 The weight adjustment
As described earlier, at the beginning of the sampling design the country is geographically stratified. The strata are combinations of region and urban/rural. The population size for each stratum is estimated. Strata with smaller population sizes will then be over-sampled. Strata with larger population sizes will be under-sampled. The sample will be somewhat more uniformly distributed across the strata than the population is.

19 The weight adjustment
To compensate, during analysis the over-sampled strata must be “weighted down” and the under-sampled strata must be “weighted up”. The weights are also determined by the corrections to the sampling frame, non-response, etc., such that after weighting, the sample becomes representative and estimates will be unbiased. If you do not weight, the estimates will be biased toward the over-sampled strata and the households with the best response rates, etc. If you do not weight, it will be impossible, or at least very difficult, to make comparisons with previous surveys in the same country or other surveys from other countries.
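A small numeric sketch shows the bias that weighting removes. It reuses the stratum populations and power-allocated sample sizes from the illustrative example. The per-stratum prevalence values are hypothetical, and for simplicity every sampled case in a stratum is assumed to match the stratum prevalence exactly.

```python
# Stratum populations and power-allocated sample sizes from the example.
population = [4_000_000, 3_000_000, 2_000_000, 1_000_000]
sample = [1120, 1020, 980, 880]

# Hypothetical prevalence of some indicator in each stratum.
prevalence = [0.20, 0.30, 0.40, 0.50]

# Unweighted estimate: every sampled case counts equally.
unweighted = sum(n * p for n, p in zip(sample, prevalence)) / sum(sample)

# Weight for each case = households represented per sampled household.
weights = [big_n / n for big_n, n in zip(population, sample)]
weighted = (
    sum(w * n * p for w, n, p in zip(weights, sample, prevalence))
    / sum(w * n for w, n in zip(weights, sample))
)

# Population value the survey is trying to estimate.
true_value = sum(big_n * p for big_n, p in zip(population, prevalence)) / sum(population)

print(unweighted, weighted, true_value)
```

The unweighted estimate is pulled toward the over-sampled small strata, while the weighted estimate recovers the population value.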

20 Two basic questions about weights
Question: “Could you please talk about weighting for different types of outcomes? For example, for diarrhea or for malnutrition….”
Weights are specific to the cases. They have nothing to do with the outcomes or predictors. This is a common misunderstanding of the weights. The weight is not linked to the variables, except that certain variables apply only to certain units (household, woman, man, child, couple, etc.).
Question: “How to use the weights when sub-sampling for analysis, for example, only using women of a specific age range?”
The same weights are used for a subsample as for the full sample. In Stata, the weights are always re-normalized so that the sum of the weighted cases is the same as the sum of the observed cases. You do not have to do that step.

21 The cluster adjustment
Clusters are enumeration areas (EAs), generally equivalent to villages in rural areas or neighborhoods in urban areas. The reason for using clusters as the primary sampling units (PSUs) is to reduce the cost of data collection. However, when clusters are used, the SRS assumption of independent observations is violated. The clusters are independent but the households within the same cluster are NOT independent. Households, etc., within the same cluster, tend to be similar.

22 The cluster adjustment
There tends to be less variability in a cluster sample than in a simple random sample. The adjustment for clustering is equivalent to scaling the non-SRS sample size down to a comparable, but smaller, SRS sample size. The adjustment will not affect means, rates, coefficients, etc., at all.
After the adjustment, typically (there can be exceptions)
--standard errors will increase;
--test statistics will be closer to zero;
--p values will increase (become less significant);
--confidence intervals will get wider.
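One common way to picture "scaling the sample size down" is Kish's design-effect approximation, deff = 1 + (m - 1) * rho, where m is the number of cases per cluster and rho is the intra-cluster correlation. The sketch below uses that approximation, which is not taken from the slides and is not what survey software computes internally. The value rho = 0.05 is hypothetical, while the 4000 households and 20 households per cluster come from the illustrative example.

```python
import math


def design_effect(cluster_take, intra_cluster_corr):
    """Kish approximation: deff = 1 + (m - 1) * rho, assuming equal cluster sizes."""
    return 1.0 + (cluster_take - 1) * intra_cluster_corr


def effective_sample_size(n_cases, deff):
    """Size of the simple random sample that would give the same precision."""
    return n_cases / deff


deff = design_effect(20, 0.05)           # rho = 0.05 is a hypothetical value
n_eff = effective_sample_size(4000, deff)
se_inflation = math.sqrt(deff)           # factor applied to the SRS standard error

print(deff, n_eff, se_inflation)
```

Under these assumptions the clustered sample carries roughly the information of an SRS about half its size, which is why standard errors grow after the adjustment.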

23 The stratification adjustment
A stratified sample is actually better than a simple random sample, because you force the distribution of the sample, across the strata, to match the distribution of the population across the strata. You are effectively increasing the sample size, compared to SRS. The adjustment will not affect means, rates, coefficients, etc., at all.
After the adjustment, typically (there can be exceptions)
--standard errors will decrease;
--test statistics will be farther from zero;
--p values will decrease (become more significant);
--confidence intervals will get narrower.

24 Summary of the impact of the adjustments
If you weight,
--the means, coefficients, etc. will change and will become unbiased;
--the standard errors will tend to increase.
If you make the cluster adjustment,
--the means, coefficients, etc. will not change at all;
--the standard errors will tend to increase.
If you make the stratification adjustment,
--the means, coefficients, etc. will not change at all;
--the standard errors will tend to decrease.
Changes in the standard errors will be different for different variables.

25 Do your own investigation!
I encourage you to investigate the effect of these three adjustments, individually, in groups of two, and all together. First do a baseline or reference logit regression, for example, without any adjustments. Then repeat the run with the seven possible combinations of the adjustments. For each run, carefully compare the coefficients with the baseline coefficients, and compare the standard errors with the baseline standard errors. Ideally, repeat with other outcomes and covariates. This will give you a sense of the impact of the adjustments.
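To keep track of the runs, the baseline plus the seven combinations can be enumerated mechanically. This is only a bookkeeping sketch in Python with made-up labels; the regressions themselves would still be run in your statistics package.

```python
from itertools import combinations

# Labels for the three adjustments (hypothetical names for this sketch).
adjustments = ["weight", "cluster", "strata"]

# Start with the baseline run, which applies no adjustments at all.
runs = [()]

# Add every non-empty combination: 3 singles, 3 pairs, and all 3 together.
for k in range(1, len(adjustments) + 1):
    runs.extend(combinations(adjustments, k))

for run in runs:
    print(" + ".join(run) if run else "baseline (no adjustments)")
```

Comparing each of the seven adjusted runs against the baseline makes it easy to see which adjustment moves the coefficients and which only moves the standard errors.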

26 User Questions

27 Qs from Baremma and Alitasso Answer from DHS
Questions:
--We can link DHS data with external data. Are sampling weights to be dropped in this case?
--When combining surveys for a single country spanning multiple years, with the intention of making comparisons across years, how do you properly weight the data?
Answers from DHS:
--The sampling weight reflects the sample selection mechanism. Weights are necessary for any analysis that makes inferences about the total target population.
--De-normalizing the weights is necessary when pooling different surveys together.

28 Question Answer from DHS
Question: Several users have asked about which weight to use (for example, the household weight or the woman's weight) when there is more than one version of the weight variable.
Answer: Usually there is no ambiguity about which weight to use, but when you merge files you may get more than one weight. The general rule, in such cases, is to use the weight from the units that tend to have higher nonresponse rates. (see next slide)

29 Priority of the different weights
For households (HR file) and persons in households (PR file) use hv005.
For women (IR) and children of women (BR and KR) use v005.
For men (MR) and couples (CR) use mv005. Men tend to have more non-response than women, so you need to use the correction for men's non-response when studying couples.
When using any variables from the domestic violence module, use dv005. This is because in the DV module, only one woman per household is chosen. In addition, this module is not implemented when privacy is not possible.
When using any variables for individuals from the HIV testing (in the AR file), use hiv05. You need to use the correction for higher rates of non-response for testing.
When using any variables for couples from the HIV testing (using a merge of the CR and AR files, for example to study HIV discordance), use hiv05 from the man's record.

30 What to do if the weights must be integers?
You can use the weights as they are, with a factor of 1 million. This will not affect means, coefficients, etc. It will, however, mean that the standard errors are too small, by a factor of 1/1000 (standard errors are inversely proportional to the square root of the sample size). You need to multiply the standard errors by 1000 and similarly change the confidence intervals and test statistics.
Never, for any reason, divide these weights by the scale factor and then round to an integer. Many weights would then round to 0, and those cases would be dropped from any calculations that used weights.
To repeat: Stata always re-normalizes pweights to have a mean of 1, so that the total weighted sample size = the total unweighted sample size.
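The factor of 1000 and the rounding problem can both be checked with a short Python sketch. It treats the weights as frequency weights, which is the situation in which software insists on integers, and uses a simple large-sample standard error of a mean. The five outcome values and weights are hypothetical.

```python
import math


def se_of_mean(values, freq_weights):
    """Standard error of a weighted mean, treating weights as frequencies."""
    n = sum(freq_weights)
    mean = sum(w * x for w, x in zip(freq_weights, values)) / n
    variance = sum(w * (x - mean) ** 2 for w, x in zip(freq_weights, values)) / n
    return math.sqrt(variance / n)


# Hypothetical outcome values and normalized weights (mean weight 1).
values = [0, 1, 1, 0, 1]
weights = [0.35, 0.80, 1.20, 1.65, 1.00]

# Integer weights as stored, with the factor of 1 million.
stored = [round(w * 1_000_000) for w in weights]

se_normalized = se_of_mean(values, weights)
se_stored = se_of_mean(values, stored)
ratio = se_normalized / se_stored        # about sqrt(1,000,000) = 1000

# Dividing by 1 million and rounding destroys the small weights.
rounded = [round(s / 1_000_000) for s in stored]

print(ratio)
print(rounded)
```

In this sketch the first case gets a rounded weight of 0, so it would silently drop out of any weighted calculation.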

31 Using sampling weights is always recommended
Q from MBruederle: I would like to use DHS data to calculate development outcomes at the local level (clusters or sub-national administrative areas). ….. Should I still apply sampling weights?
Answer from DHS: Disaggregating DHS data below the regional level is not recommended, and the results are not guaranteed. Using sampling weights is always recommended.

32 Question Answer from DHS
Question: Several users have asked whether estimates at the district level, etc., are acceptable.
Answer: Estimates at any geographic level, or for any subpopulation, will be unbiased. We discourage estimates for small geographic areas because of small sample size, especially in terms of a small number of clusters.

33 Q from MBruederle Answer from DHS
Question: In general, where do I find information on which covariates underlie the sample selection for each survey?
Answer: In the sample design section of the survey's final report, usually Appendix A.
Question: Next to geographic distribution and urban/rural, are households selected (and weights assigned) based on other covariates like ethnicity, family size, ..., to ensure that the sample is nationally representative?
Answer: No, households are not selected based on any covariate. So the sample may not be representative for a specific ethnic group, especially when the group is small.

34 Question Answer from DHS
Question: A Stata question: several users report that svyset will not work properly; the run terminates with an error message.
Answer: Try one of the svyset versions on the next page. They give very similar results. I prefer “centered” or “certainty”.

35 Full svyset command, including the singleunit option
svyset v001 [pweight=v005], strata(strata) singleunit(centered)
svyset v001 [pweight=v005], strata(strata) singleunit(scaled)
svyset v001 [pweight=v005], strata(strata) singleunit(certainty)
The cluster id code is v001 or v021 (both will work).
The pweight will be v005, hv005, mv005, etc.
Before running svyset, create the strata variable, unless you are sure you already have it:
egen strata=group(v024 v025)

36 Thank you for joining us!
All materials from today’s webinar will be available on The DHS Program User Forum: userforum.dhsprogram.com
Coming Next: Webinar on GIS Data in July
Please complete the survey that will appear at the end of our session.

