Complexities of Complex Survey Design Analysis

Why worry about this?
Many government studies use these designs:
- CDC National Health Interview Survey (NHIS)
- National Health and Nutrition Examination Survey (NHANES), also CDC
- National Longitudinal Survey of Youth
- Medicare Current Beneficiary Survey (MCBS)
Almost any survey seeking a representative sample from a large population will have a complex multi-stage probability sampling methodology.

Why do we care?
- These studies make their data available to researchers at very minimal cost (sometimes even free).
- Getting free data seems great, but the analysis challenges are considerable as well.
- The studies do not always document the study design very well, so it can be difficult to understand how to deal with it.

Today's talk
- Will not deal with all of the issues.
- Will start at the basics and lead up to some of the complexities.
- Will talk about how various software packages deal with some of the complexities.

Usual assumptions
Infinite populations.
- Never true, but can be "true enough".
- Most methods work under the infinite population assumption.
- This will hold if N is very, very large and n is not too big relative to N (i.e. N >> n).
- Survey design people are sort of the statistical version of numerical analysts, i.e. they work out what to do when the analysis environment is not infinite.

Background: types of sampling
Simple random sampling with replacement
- Easiest to deal with.
- Population size N, sample size n.
- On each draw, every population unit has probability 1/N of being selected.
- Drawback: a population unit can be selected multiple times (i.e. repeat information).
- If N is large relative to n, the probability of any unit being selected twice is small.

More background
Simple random sampling without replacement
- The selection probability changes from draw to draw, because units already selected are removed from the pool.
- On the first draw, each unit has probability 1/N of being selected.
- On the second draw, each remaining unit has probability 1/(N-1).
- On the nth draw, each remaining unit has probability 1/(N-n+1).
- If N >> n and N is large, then 1/N ≈ 1/(N-1) ≈ ... ≈ 1/(N-n+1), so this is approximately the same as simple random sampling with replacement.
- Can use the FPC (finite population correction), sqrt((N-n)/(N-1)), as a multiplier on the standard error. Note that if N >> n, this is ≈ 1.
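To see how quickly the finite population correction approaches 1, the short Python sketch below evaluates sqrt((N-n)/(N-1)) for a fixed sample size and growing population sizes. The function name fpc and the example values are illustrative assumptions, not taken from the slides.

```python
import math


def fpc(N, n):
    """Finite population correction sqrt((N - n) / (N - 1)).

    N is the population size and n the sample size (sampling without replacement).
    """
    if N < 2 or not 1 <= n <= N:
        raise ValueError("need N >= 2 and 1 <= n <= N")
    return math.sqrt((N - n) / (N - 1))


# Illustrative values only: a sample of 100 from populations of increasing size.
for N in (200, 1_000, 100_000, 10_000_000):
    print(f"N={N:>10,}  n=100  FPC={fpc(N, 100):.4f}")
```

When the sample is half of the population the correction is substantial; once N is several orders of magnitude larger than n it is practically 1 and can be ignored.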

Why complex sampling?
- Cost (main reason): simpler and more cost effective.
- May differentially sample easy-to-sample units versus harder-to-sample units (e.g. homeless, minorities, rural).
- Want to account for the inclusion difficulty of certain types of population units.

Sampling strategies
- Strata
- Clusters
- Weights

Strata
- Fixed, known groups (e.g. regions, groups of countries, states).
- Strata themselves are not sampled; however, the sampling rate within strata need not be equal across strata.
- All strata are included.

Adjusting for strata
- Assume two strata with N1 = 100 and N2 = 10 elements.
- Take a sample of size 20 from stratum 1 and 8 from stratum 2. (Assume sampling with replacement to make the math easier.)
- So p = 0.2 in stratum 1 and p = 0.8 in stratum 2.
- Use the inverse probabilities to weight the analyses: w1 = 1/0.2 = 5 for stratum 1 and w2 = 1/0.8 = 1.25 for stratum 2.
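The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the dictionary layout and the function name sampling_weights are assumptions made for illustration, while the stratum and sample sizes are the ones given above.

```python
# Stratum sizes and sample sizes from the slide.
population = {"stratum 1": 100, "stratum 2": 10}
sample = {"stratum 1": 20, "stratum 2": 8}


def sampling_weights(population, sample):
    """Return inverse-probability weights w_h = 1 / (n_h / N_h) = N_h / n_h."""
    return {h: population[h] / sample[h] for h in population}


weights = sampling_weights(population, sample)
for h, w in weights.items():
    print(f"{h}: p = {sample[h] / population[h]:.2f}, weight = {w:.2f}")

# Each sampled unit "stands for" w_h population units, so the weights of
# all sampled units add up to the population size.
represented = sum(weights[h] * sample[h] for h in sample)
print("population represented:", represented)
```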

Example
- Want to estimate job openings in a town.
- Large businesses have more job openings than small businesses.
- Say that the town has 10 large businesses and 100 small businesses.
- Get a sample of 28 businesses: 20 small businesses and 8 large businesses.
- Use the probability weights from the previous slide.
- Let x be the number of job openings in each business.

Example continued
- Total job openings = Σ wi xi, where the weight is 5 for businesses in stratum 1 (small businesses) and 1.25 for businesses in stratum 2 (large businesses).
- Note that w1*n1 + w2*n2 = 5*20 + 1.25*8 = 110, the population size.
- So the idea is that each business sampled from stratum 1 looks like 5 businesses, while each business sampled from stratum 2 looks like 1.25 businesses.
- Complex survey design works on population totals and the resulting proportions.
- Note that in this case the PSU (primary sampling unit) is a business.
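A minimal Python sketch of the weighted total follows. The job-opening counts are made up for illustration, since the slides do not give the raw data; only the weights (5 and 1.25) and the sample sizes (20 and 8) come from the example. The unweighted estimate is included for contrast: it scales the sample total by 110/28 as if every business had been sampled at the same rate.

```python
# Hypothetical job-opening counts (not from the slides).
small = [1, 0, 2, 1, 3, 0, 1, 2, 1, 0, 2, 1, 1, 0, 3, 2, 1, 0, 1, 2]  # 20 small businesses
large = [20, 25, 18, 30, 22, 27, 19, 24]                              # 8 large businesses

W_SMALL, W_LARGE = 5.0, 1.25   # inverse-probability weights from the slides
POPULATION_SIZE = 110          # 100 small + 10 large businesses


def weighted_total(strata):
    """Estimate a population total as the sum of w_i * x_i over sampled units.

    `strata` is a list of (weight, values) pairs, one pair per stratum.
    """
    return sum(w * x for w, values in strata for x in values)


def unweighted_total(values, population_size):
    """Naive estimate: sample mean times population size (ignores the design)."""
    return sum(values) / len(values) * population_size


design_based = weighted_total([(W_SMALL, small), (W_LARGE, large)])
naive = unweighted_total(small + large, POPULATION_SIZE)

print(f"weighted total:   {design_based:.2f}")
print(f"unweighted total: {naive:.2f}")
```

With these made-up counts the unweighted figure is far larger than the weighted one, because the oversampled large businesses are treated as if they were as common in the population as in the sample.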

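With no weights (assumes equal weighting)
(Frequency table of the variable open, with columns Frequency, Percent, Cumulative Frequency and Cumulative Percent; values not recovered.)
- Total job openings: 202 * 3.93 ≈ 794, where 110/28 = 3.93 is the expansion factor. That is about 7.2 per company.
- This is an overestimate because it weights large companies the same as small companies.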

With weights (unequal sampling)
(Frequency table of the variable open, with columns Frequency, Percent, Cumulative Frequency and Cumulative Percent; values not recovered.)
- Total job openings: around 402 (3.6 per company).

Types of weights
- pweights: inverse probability weights, also known as sampling weights, wi = 1/pi.
- fweights: frequency weights. Used when one record represents a number of identical records.
- aweights: analytic weights, which are inversely proportional to the variance of an observation (e.g. meta-analysis).
- iweights: importance weights, which indicate the "importance" of the observation in some nonstatistical sense.

Replicate weights
- A series of weights used to correct standard errors.
- Used to more securely protect the identity of the respondents.
- Two common kinds:
  - Balanced Repeated Replicates (BRR)
  - Jackknife (JK-1)

Add clustering
- Strata are fixed groups that are all used and are mutually exclusive, e.g. big companies and small companies.
- Clusters are sampled. The unit sampled is the PSU.
- Example: the strata are regions (urban/rural); the clusters (PSUs) are zip codes sampled within each region; then persons residing in each sampled zip code area are sampled.
- There is unequal sampling of PSUs within strata, then unequal sampling of individuals within zip code areas.
- Use conditional probabilities to get weights at the various levels.
- Units within a cluster are likely to be more similar (i.e. smaller variability).
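The conditional-probability idea can be sketched in Python for a two-stage version of the zip code example. All counts below are hypothetical and chosen only to make the arithmetic easy to follow; the function name is likewise an assumption. A person's overall selection probability is the probability that their zip code is sampled times the probability that they are sampled given the zip code was, and the weight is its inverse.

```python
from fractions import Fraction


def two_stage_weight(psus_in_stratum, psus_sampled, persons_in_psu, persons_sampled):
    """Return (overall selection probability, weight) for a two-stage design.

    Stage 1: P(zip code selected) = psus_sampled / psus_in_stratum
    Stage 2: P(person selected | zip code selected) = persons_sampled / persons_in_psu
    Assumes equal-probability sampling at each stage.
    """
    p_psu = Fraction(psus_sampled, psus_in_stratum)
    p_person_given_psu = Fraction(persons_sampled, persons_in_psu)
    p_overall = p_psu * p_person_given_psu
    return p_overall, 1 / p_overall


# Hypothetical stratum: 50 zip codes, 5 sampled; 2,000 residents in a sampled
# zip code, 20 of them sampled.
p, w = two_stage_weight(50, 5, 2000, 20)
print(f"overall selection probability = {p} ({float(p):.4f})")
print(f"weight = {w}")
```

Exact fractions are used here only so the printed probability and weight are free of floating-point noise.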

NHANES sampling design (continuous)
The NHANES sample is designed to be nationally representative of the civilian, non-institutionalized U.S. population, in that it does not include persons residing in nursing homes, institutionalized persons, or U.S. nationals living abroad. Thus, for NHANES, each year's sample and any combination of samples from consecutive years comprise a nationally representative sample of the resident, non-institutionalized U.S. population.
- Stage 1: Primary sampling units (PSUs) are selected with probability proportional to a measure of size (PPS). These are mostly single counties or, in a few cases, groups of contiguous counties.
- Stage 2: The PSUs are divided up into segments (generally city blocks or their equivalent). As with each PSU, sample segments are selected with PPS.
- Stage 3: Households within each segment are listed, and a sample is randomly drawn. In geographic areas where the proportion of age, ethnic, or income groups selected for oversampling is high, the probability of selection for those groups is greater than in other areas.
- Stage 4: Individuals are chosen to participate in NHANES from a list of all persons residing in selected households. Individuals are drawn at random within designated age-sex-race/ethnicity screening subdomains. On average, 1.6 persons are selected per household.
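As a rough illustration of what "probability proportional to size" means, the Python sketch below computes first-order inclusion probabilities for a handful of PSUs. The county names and sizes are hypothetical, and the formula (number of PSUs drawn times the unit's share of total size) is the textbook simplification; it is not a reproduction of the actual NHANES selection procedure.

```python
def pps_inclusion_probabilities(sizes, n):
    """Inclusion probability of each unit when n units are drawn with PPS.

    pi_i = n * size_i / sum(sizes). A unit whose probability would exceed 1
    has to be treated as a certainty selection, which this sketch does not do.
    """
    total = sum(sizes.values())
    probs = {unit: n * size / total for unit, size in sizes.items()}
    if any(p > 1 for p in probs.values()):
        raise ValueError("a unit is too large for plain PPS; handle it as a certainty unit")
    return probs


# Hypothetical measures of size (e.g. population counts) for four counties.
counties = {"A": 400_000, "B": 300_000, "C": 200_000, "D": 100_000}
probs = pps_inclusion_probabilities(counties, n=2)

for county, p in probs.items():
    print(f"county {county}: inclusion probability = {p:.2f}, weight = {1 / p:.2f}")
```

Larger counties are more likely to be selected and therefore receive smaller stage 1 weights, which is how PPS keeps the eventual person-level weights from varying too wildly.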

Weight calculation

Implications of sampling design
- Strata: the documentation gives no definition of the strata, but it says two county PSUs are selected per stratum, so strata exist.
- Variables that sampling is based on:
  - Stage 1 (PSU, county): size of county (PPS, probability proportional to size, so larger counties have a greater probability of selection)
  - Stage 2 (segment): size of segment (PPS, see above)
  - Stage 3 (household): age, ethnic, and income group
  - Stage 4 (individual): age-sex-race/ethnicity

Sample weights
- A numerical sample weight is assigned to each participant.
- It is the number of people in the population represented by that particular sampled person.
- It includes adjustments for:
  - unequal selection
  - non-response
  - control totals (to make sure estimates of age, sex, and race/ethnicity categories match known population totals)
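The control-total adjustment can be sketched as a simple post-stratification step in Python. The records, group labels and control totals below are hypothetical, and real surveys typically adjust over several variables at once; this sketch uses a single grouping variable to show the idea of rescaling weights so that weighted counts match known population totals.

```python
from collections import defaultdict


def poststratify(records, control_totals):
    """Rescale weights so the weighted count in each group equals its control total.

    `records` is a list of (group, base_weight) pairs.
    `control_totals` maps each group to its known population count.
    """
    weighted_count = defaultdict(float)
    for group, weight in records:
        weighted_count[group] += weight
    return [
        (group, weight * control_totals[group] / weighted_count[group])
        for group, weight in records
    ]


# Hypothetical respondents with base weights, and hypothetical control totals.
records = [("female", 100.0), ("female", 150.0),
           ("male", 120.0), ("male", 80.0), ("male", 50.0)]
control_totals = {"female": 300.0, "male": 275.0}

adjusted = poststratify(records, control_totals)
for group, weight in adjusted:
    print(f"{group}: adjusted weight = {weight:.1f}")
```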

Variance estimates
Unequal weighting causes complications in variance estimation. Can use:
- Taylor series estimate
- BRR, Balanced Repeated Replicates (if replicate weights are provided): you get many sets of subsample weights, calculate the estimate once with each set, and take the variance of these estimates.
- Jackknife (if replicate weights are provided): same idea as BRR.
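The "calculate the estimate a bunch of times" recipe can be sketched in Python for the delete-one jackknife (JK-1). In practice the replicate weights come with the survey data file, and the multiplier in the variance formula depends on the replication scheme the survey documents; here the replicate weights are built by hand for a tiny made-up sample, and the (R-1)/R multiplier is the standard one for JK-1. All names and data are illustrative assumptions.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def jk1_replicate_weights(weights):
    """Build delete-one jackknife replicate weights.

    Replicate r sets unit r's weight to 0 and scales the rest by n / (n - 1).
    """
    n = len(weights)
    return [
        [0.0 if i == dropped else w * n / (n - 1) for i, w in enumerate(weights)]
        for dropped in range(n)
    ]


def jk1_variance(values, weights, replicate_weights):
    """JK-1 variance: (R - 1) / R * sum over replicates of (theta_r - theta)^2."""
    full_estimate = weighted_mean(values, weights)
    R = len(replicate_weights)
    replicate_estimates = [weighted_mean(values, rw) for rw in replicate_weights]
    return (R - 1) / R * sum((est - full_estimate) ** 2 for est in replicate_estimates)


# Made-up data with equal weights, so the result can be checked by hand.
values = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = [1.0] * len(values)

variance = jk1_variance(values, weights, jk1_replicate_weights(weights))
print(f"jackknife variance of the mean: {variance:.4f}")
print(f"jackknife standard error:       {variance ** 0.5:.4f}")
```

For an equally weighted simple random sample, this reproduces the familiar s^2/n variance of the mean, which is a convenient sanity check before applying the same loop to survey-supplied replicate weights.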

How?
- You can't do this on your calculator.
- SUDAAN (the original)
- Stata (says it is better)
- SAS (has come out with survey procedures)
- Getting variances always seems to be the issue (although unbiased estimates are usually a good thing).

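Example of SAS code

    /* Survey-weighted means with design-based standard errors */
    PROC SURVEYMEANS data=d.ncsdxdm3;
      strata str;          /* stratum identifier */
      cluster secu;        /* cluster (PSU) identifier */
      var deplt1 gadlt1;   /* analysis variables */
      weight p1fwt;        /* sampling weight */
    run;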

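Example of Stata code

    svyset county [pw = pwvar], strata(state) fpc(fpcvar) || school, fpc(fpcvar2)

- This sets up the design: counties are the first-stage units within state strata, and schools are the second-stage units, each stage with its own finite population correction.
- Then prefix estimation commands with svy:, e.g. svy: mean, svy: regress.
- The svy commands are listed in the Stata documentation.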