Complex Surveys STAT262@UCI.

Presentation transcript:

Complex Surveys STAT262@UCI

A Brief Review of the “Building Blocks”

Building Blocks of Surveys. Sampling methods: SRS, stratified sampling, cluster sampling, unequal-probability sampling, multistage sampling. Estimation methods: weighted mean/sum, ratio and regression estimators.

Simple Random Sampling (without replacement). There are (N choose n) possible samples, each selected with probability 1/(N choose n). The point estimate is based on the sample mean, and its standard error on the sample variance.
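The formulas for the point estimate and standard error did not survive in the transcript. As a reminder, and assuming the usual notation (sample S of size n, sample values y_i, sample variance s^2, none of which appears explicitly on the slide), the standard SRS results are:

```latex
\bar{y} = \frac{1}{n}\sum_{i \in S} y_i, \qquad
\hat{t} = N\bar{y}, \qquad
s^2 = \frac{1}{n-1}\sum_{i \in S}\left(y_i - \bar{y}\right)^2,
\qquad
\mathrm{SE}(\bar{y}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s^2}{n}}, \qquad
\mathrm{SE}(\hat{t}) = N\,\mathrm{SE}(\bar{y}).
```

The factor (1 - n/N) is the finite population correction, which matters only when the sample is a non-negligible fraction of the population.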

Stratified Sampling. The estimate of the population total and its variance. Sample size allocation: proportional or optimal. In general, stratified sampling with proportional allocation is more efficient than SRS. The more the stratum means differ, the greater the benefit.
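A minimal sketch of the stratified estimate of a total, its standard error, and the two allocation rules named on the slide. The function names and the toy data are hypothetical, and "optimal" is taken here to mean Neyman allocation with equal costs per stratum, which is an assumption.

```python
import math

def stratified_total(strata):
    """Estimate a population total and its SE from a stratified SRS.

    strata: list of dicts with "N" (stratum size) and "y" (sampled values).
    """
    t_hat = 0.0
    var_hat = 0.0
    for h in strata:
        N_h, y = h["N"], h["y"]
        n_h = len(y)
        ybar_h = sum(y) / n_h
        s2_h = sum((v - ybar_h) ** 2 for v in y) / (n_h - 1)
        t_hat += N_h * ybar_h                      # N_h * stratum sample mean
        var_hat += (1 - n_h / N_h) * N_h ** 2 * s2_h / n_h
    return t_hat, math.sqrt(var_hat)

def proportional_allocation(sizes, n):
    """n_h proportional to the stratum size N_h."""
    total = sum(sizes)
    return [n * N_h / total for N_h in sizes]

def neyman_allocation(sizes, sds, n):
    """n_h proportional to N_h * S_h (assumes equal sampling costs)."""
    products = [N_h * S_h for N_h, S_h in zip(sizes, sds)]
    total = sum(products)
    return [n * p / total for p in products]

# Hypothetical toy data: two strata of sizes 100 and 300.
strata = [{"N": 100, "y": [10, 12, 14]}, {"N": 300, "y": [20, 22, 24, 26]}]
t_hat, se = stratified_total(strata)
```

With equal stratum standard deviations the two allocation rules coincide; they diverge when some strata are much more variable than others.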

Cluster Sampling. Usually less efficient than other methods. Its efficiency relative to SRS depends on the intraclass correlation coefficient (ICC): the larger the ICC, the less efficient cluster sampling is. It can reduce cost and is administratively convenient. Topics: one-stage and two-stage designs, with equal or unequal probabilities; point estimates, variances, and confidence intervals; allocation of m and n for two-stage cluster sampling.

Cluster Sampling without Replacement. Select a sample of n clusters without replacement based on the inclusion probabilities. Estimate each cluster total and its variance. Estimate the population total. The variance can be estimated by the formulas in Ch. 5 and 6 or by resampling methods.

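Cluster Sampling with Replacement. Select a sample of n clusters with replacement based on the draw probabilities. Estimate each sampled cluster's total. Calculate each estimated cluster total divided by its draw probability. Estimate the population total and its variance.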
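The symbols on this slide were lost in the transcript. A sketch of the standard with-replacement (Hansen-Hurwitz) estimator follows, assuming the notation psi_i for the probability of drawing cluster i on each draw and t-hat_i for the estimated total of cluster i; both symbols are assumptions rather than the slide's own.

```latex
u_i = \frac{\hat{t}_i}{\psi_i}, \qquad
\hat{t}_{\psi} = \frac{1}{n}\sum_{i=1}^{n} u_i, \qquad
\hat{V}\!\left(\hat{t}_{\psi}\right)
  = \frac{1}{n}\cdot\frac{1}{n-1}\sum_{i=1}^{n}\left(u_i - \hat{t}_{\psi}\right)^2 .
```

Each u_i is by itself an unbiased estimate of the population total, so the variance estimate is simply the sample variance of the u_i divided by n. That simplicity is why with-replacement formulas are often used as an approximation.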

Ratio estimation. Biased, but may result in a smaller MSE. Useful when the variables are linearly correlated. Related: regression estimation.
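As an illustration, here is a minimal sketch of ratio estimation of a total under SRS, using the usual linearization standard error. The function name, the auxiliary total t_x, and the data are hypothetical.

```python
import math

def ratio_estimate_total(y, x, t_x, N):
    """Ratio estimate of the total of y under SRS, with its approximate SE.

    y, x : sampled values of the response and the auxiliary variable
    t_x  : known population total of x (assumed available)
    N    : population size
    """
    n = len(y)
    b_hat = sum(y) / sum(x)            # estimated ratio B = t_y / t_x
    t_hat = b_hat * t_x                # ratio estimate of t_y
    residuals = [yi - b_hat * xi for yi, xi in zip(y, x)]
    s2_e = sum(e ** 2 for e in residuals) / (n - 1)
    # SE(t_hat) = t_x * SE(B_hat), using the population mean of x, t_x / N
    se = N * math.sqrt((1 - n / N) * s2_e / n)
    return b_hat, t_hat, se

# Hypothetical sample of n = 4 from a population of N = 40 with t_x = 100.
b_hat, t_hat, se = ratio_estimate_total([2, 4, 6, 9], [1, 2, 3, 4], 100, 40)
```

The standard error depends on the residuals y_i - B-hat x_i rather than on the spread of y itself, which is why the estimator gains when y is roughly proportional to x.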

Complex Surveys Large surveys often involve several sampling strategies

Example 1: Malaria in Africa

An example: background. Malaria is a common public health problem in tropical and subtropical regions. It is infectious: people contract it when bitten by infected female mosquitoes of a particular kind. Without timely and proper treatment, the death rate can be very high. It can be prevented by using mosquito nets, but prevention is effective only if the nets are in widespread use.

Summary. Goal: to estimate the prevalence of bed net use in rural areas. Sampling frame: all rural villages of <3,000 people in The Gambia.

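The survey in Gambia (1991): sampling design, read top-down. Frame: rural villages of <3,000 people.
Stage 1. Sampling unit: district. Sampling method: stratified by region (eastern, central, western); 5 districts per region, selected with probability proportional to district size.
Stage 2. Sampling unit: village. Sampling method: stratified by PHC (primary health care) status (PHC, non-PHC); 4 villages per district, selected with probability proportional to village size.
Stage 3. Sampling unit: compound. Sampling method: SRS of 6 compounds per village.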

The survey in Gambia (1991): the same design, read top-down for sampling and bottom-up for estimation.
Regions (eastern, central, western): stratified sampling.
Districts within regions: sampling with unequal probabilities, two-stage cluster (Ch. 6).
PHC and non-PHC villages: stratified sampling.
Villages within districts: sampling with unequal probabilities, two-stage cluster (Ch. 6).
Compounds within villages: SRS (average number of nets per compound).

The survey in Gambia (1991). Calculating the estimated total and its variance analytically is complicated, and it becomes worse if ratio estimators are included. In practice, we can use sampling weights to obtain point estimates and computer-intensive methods, such as the jackknife or the bootstrap, to obtain standard errors (Ch. 9).
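To make the "computer-intensive" idea concrete, here is a minimal sketch of a delete-one-PSU jackknife for a weighted mean. It assumes a single stratum and treats PSUs as if sampled with replacement; the record layout (psu, weight, value) and the data are hypothetical and are not taken from the Gambia survey.

```python
def weighted_mean(records):
    """records: iterable of (psu_id, weight, y) tuples."""
    numerator = sum(w * y for _, w, y in records)
    denominator = sum(w for _, w, _ in records)
    return numerator / denominator

def jackknife_se(records):
    """Delete-one-PSU jackknife SE of the weighted mean (single stratum)."""
    psus = sorted({psu for psu, _, _ in records})
    n = len(psus)
    theta = weighted_mean(records)
    # Recompute the estimate with each PSU left out in turn. For a ratio-type
    # estimate the n/(n-1) rescaling of the remaining weights cancels.
    replicates = [
        weighted_mean([r for r in records if r[0] != dropped])
        for dropped in psus
    ]
    variance = (n - 1) / n * sum((t - theta) ** 2 for t in replicates)
    return theta, variance ** 0.5

# Hypothetical data: three PSUs with two observations each.
records = [
    (1, 10, 2.0), (1, 10, 4.0),
    (2, 20, 3.0), (2, 20, 5.0),
    (3, 15, 6.0), (3, 15, 8.0),
]
theta, se = jackknife_se(records)
```

Whole PSUs are deleted, not individual observations, so the clustering is reflected in the standard error. In a stratified design the deletion and rescaling would be done within each stratum.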

Sampling weights. The sampling weight is the reciprocal of the selection probability, Pr(being selected). Each sampled unit “represents” a certain number of units in the population, and the whole sample “represents” the whole population.

Sampling weights. Weights are used to account for the effects of stratification and clustering on point estimates. Stratified sampling

Sampling weights Cluster sampling with equal probabilities

Sampling weights. For three-stage sampling, the weight combines the selection probabilities at each stage (p: primary, s: secondary, t: tertiary). Very large weights are often truncated. Truncation biases the results but reduces the mean squared error.
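The weight formulas on the three preceding slides were lost in the transcript. The standard forms are sketched below under assumed notation: N_h and n_h for stratum population and sample sizes; N and n for the numbers of clusters in the population and sample; M_i and m_i for the numbers of units in cluster i and sampled from it; and pi for the selection probability at each stage.

```latex
\text{Stratified sampling:}\quad w_{hj} = \frac{N_h}{n_h}
\qquad
\text{Two-stage cluster sampling, equal probabilities:}\quad
w_{ij} = \frac{N}{n}\cdot\frac{M_i}{m_i}
\qquad
\text{Three-stage sampling:}\quad
w = \frac{1}{\pi_p\,\pi_{s\mid p}\,\pi_{t\mid p,s}}
```

In each case the weight is the reciprocal of the overall probability that the unit ends up in the sample, obtained by multiplying the conditional selection probabilities stage by stage.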

Sampling weights. Weights contain the information needed to construct point estimates. Weights do not contain enough information for computing variances, because calculating a variance requires the probabilities that pairs of units are selected together. Computer-intensive methods can be used to find variances.
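A minimal sketch of how weights alone yield point estimates. The selection probabilities and values are hypothetical; the function names are illustrative only.

```python
def weighted_total(weights, values):
    """Estimated population total: sum of w_i * y_i over the sample."""
    return sum(w * y for w, y in zip(weights, values))

def weighted_mean(weights, values):
    """Estimated population mean: weighted total over the sum of weights."""
    return weighted_total(weights, values) / sum(weights)

# Hypothetical selection probabilities for four sampled units.
selection_probs = [0.1, 0.1, 0.25, 0.5]
weights = [1 / p for p in selection_probs]   # each weight is 1 / Pr(selected)
values = [3, 5, 2, 8]

n_hat = sum(weights)                 # estimated population size
t_hat = weighted_total(weights, values)
y_bar = weighted_mean(weights, values)
```

Nothing in this calculation says which units were sampled together, which is why the same weights cannot produce a variance estimate without design information such as strata and PSU identifiers.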

Sampling weights: the malaria example Pr(a compound in central region PHC villages is selected)=
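The right-hand side of this probability was lost in the transcript. The sketch below shows only the general shape of such a calculation: the product of the conditional selection probabilities at the three stages. Every number is hypothetical, and the split of 2 PHC villages per district is an assumption (half of the 4 villages per district).

```python
def compound_selection_probability(
    districts_sampled, district_size, region_size,
    villages_sampled, village_size, stratum_size,
    compounds_sampled, compounds_in_village,
):
    """Pr(compound selected) as a product of three stage probabilities.

    Stages 1 and 2 use probability proportional to size; stage 3 is an SRS.
    """
    p_district = districts_sampled * district_size / region_size
    p_village = villages_sampled * village_size / stratum_size
    p_compound = compounds_sampled / compounds_in_village
    return p_district * p_village * p_compound

# Hypothetical sizes, NOT the Gambia survey's actual figures.
prob = compound_selection_probability(
    districts_sampled=5, district_size=20_000, region_size=200_000,
    villages_sampled=2, village_size=500, stratum_size=5_000,
    compounds_sampled=6, compounds_in_village=60,
)
weight = 1 / prob   # each sampled compound represents about this many compounds
```

With these made-up sizes the stage probabilities are 0.5, 0.2 and 0.1, so each sampled compound would stand for roughly 100 compounds.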

Self-weighting and Non-self-weighting. Self-weighting: the sampling weights for all observation units are equal. A self-weighting sample is representative of the population if nonsampling errors are ignored. Most large self-weighting samples are not SRSs. Standard software with the usual iid assumption gives correct estimates of means, proportions, and percentiles, but erroneous estimates of variances.

Ratio Estimation in Complex Surveys. Ratio estimation is part of the analysis, not the design. It can be used at any level, but is usually used near the top.

Ratio Estimation in the Malaria example Region level: Above the region level

Ratio Estimation in Complex Surveys. The bias of ratio estimation can be large when sample sizes are small. Separate ratio estimator for a population total: improves efficiency when the ratios vary from stratum to stratum; works poorly when stratum sample sizes are small. Combined ratio estimator for a population total: has less bias when stratum sample sizes are small; works poorly when the ratios vary from stratum to stratum.
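For reference, the two estimators contrasted on this slide can be written as follows, assuming stratum index h, known auxiliary totals t_{xh} (summing to t_x), and stratum-level estimated totals; this notation is an assumption, not the slide's.

```latex
\text{Separate:}\quad
\hat{t}_{yrs} = \sum_{h} t_{xh}\,\frac{\hat{t}_{yh}}{\hat{t}_{xh}}
\qquad\qquad
\text{Combined:}\quad
\hat{t}_{yrc} = t_x\,\frac{\sum_{h}\hat{t}_{yh}}{\sum_{h}\hat{t}_{xh}}
```

The separate estimator fits one ratio per stratum, so it adapts to differing ratios but accumulates the small-sample bias of each stratum. The combined estimator fits a single ratio from the pooled estimates, so its bias depends on the total sample size.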

Example 2: The National Health and Nutrition Examination Survey

Designed to assess the health and nutritional conditions of US residents. Combines interviews and physical examinations. A major program of the National Center for Health Statistics (NCHS), which is part of the Centers for Disease Control and Prevention (CDC). https://www.cdc.gov/nchs/nhanes/about_nhanes.htm

Started in the early 1960s. It examines a nationally representative sample of about 5,000 persons each year. Interview: demographic, socioeconomic, dietary, and health-related questions. Examinations: medical, dental, and physiological measurements; laboratory tests.


Stage 1: Primary sampling units (PSUs). Most PSUs are single counties; some are groups of contiguous counties. Stage 2: Secondary sampling units: city blocks or something similar. Stage 3: Tertiary sampling units: households. Stage 4: Individuals.

Stage 1. Select counties (or groups of contiguous counties). Cluster sampling. Sampling probabilities: probability proportional to a measure of size (PPS).

Stage 2. Select city blocks. Cluster sampling. Sampling probabilities: probability proportional to a measure of size (PPS).
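One common way to draw a PPS sample without replacement is systematic selection along the cumulated sizes. The sketch below illustrates that technique only; it is not a description of the actual NHANES selection procedure, and the function names, sizes, and starting point are hypothetical.

```python
from itertools import accumulate

def pps_systematic(sizes, n, start):
    """Systematic PPS selection of n units (returns 0-based indices).

    sizes : measure of size for each unit
    start : a number in [0, total/n), normally drawn at random
    """
    total = sum(sizes)
    step = total / n
    if not 0 <= start < step:
        raise ValueError("start must lie in [0, total / n)")
    cumulative = list(accumulate(sizes))
    selected = []
    index = 0
    for k in range(n):
        point = start + k * step
        # Advance to the unit whose cumulated-size interval contains the point.
        while cumulative[index] <= point:
            index += 1
        selected.append(index)
    return selected

def inclusion_probabilities(sizes, n):
    """pi_i = n * size_i / total, valid when no unit is larger than the step."""
    total = sum(sizes)
    return [n * s / total for s in sizes]

sizes = [10, 40, 20, 30]                       # hypothetical measures of size
sample = pps_systematic(sizes, n=2, start=15)
probs = inclusion_probabilities(sizes, n=2)
```

A unit larger than the sampling step would be hit more than once; in practice such units are usually taken with certainty and removed before the systematic pass.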

Stage 3. Select households. Stratified sampling: sampling from each selected city block. Sampling probabilities: oversample targeted groups defined by age, ethnicity, or income.

Stage 4. Select individuals. Stratified sampling: sampling from each selected household. Sampling probabilities: “individuals are drawn at random within designated age-sex-race/ethnicity screening subdomains”. On average, 1.6 persons per household.

Examples of Oversampling. 1999-2004: African Americans; Mexican Americans; low-income White Americans (from 2000); adolescents (12-19); seniors (60+). 1971-1974: very low income women of childbearing age.

Several R packages provide NHANES data. RNHANES. NHANES: data(NHANES), 10,000 rows, 76 variables; data(NHANESraw), 20,293 rows, 78 variables. survey: data(nhanes), 8,591 rows, 7 variables.

Design variables. SDMVPSU: Masked Variance Unit pseudo-PSU. These are not the “true” design PSUs. Each is a collection of secondary sampling units aggregated into groups (called Masked Variance Units) for the purpose of variance estimation. They produce variance estimates that closely approximate the variances that would have been estimated using the “true” design structure. In R, with the survey package: svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, nest = TRUE, data = NHANESraw)

Sampling weights NHANESraw: WTMEC2YR

WTMEC2YR vs WTINT2YR
> sum(NHANESraw$WTINT2YR)
[1] 608534400
> sum(NHANESraw$WTMEC2YR)
> sum(NHANESraw$WTMEC2YR==0)
[1] 702
702 subjects had interview data but not MEC data. The two totals are the same. The total weight is about twice the US population, as the weights are two-year weights and NHANESraw pools two survey cycles.

Estimating a Distribution Function. Historically, sampling theory was developed to find population means, totals, and ratios. Other quantities may also be of interest, such as Pr(a statistic exceeds a given mean or total), the median, the 95th percentile, or the probability mass function. Sampling weights can be used to construct an empirical distribution of the population.

Population quantities and functions Probability mass function (pmf) Distribution function

Empirical Functions. Empirical probability mass function. Empirical distribution function. Empirical functions can be used to estimate population quantities such as the mean, median, percentiles, variance, etc.
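A minimal sketch of the weighted empirical distribution function, in which the estimated F(y) is the sum of the weights of sampled units with value at most y, divided by the sum of all weights. Quantiles are then read off as the smallest value whose estimated F reaches the target. The function names and data are hypothetical.

```python
def weighted_ecdf(values, weights):
    """Return [(y, F_hat(y))] for each distinct sampled value y, ascending."""
    total = sum(weights)
    cdf = []
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if cdf and cdf[-1][0] == value:
            cdf[-1] = (value, running / total)   # merge tied values
        else:
            cdf.append((value, running / total))
    return cdf

def weighted_quantile(values, weights, q):
    """Smallest sampled value whose estimated F is at least q."""
    cdf = weighted_ecdf(values, weights)
    for value, prob in cdf:
        if prob >= q:
            return value
    return cdf[-1][0]   # guard against floating-point shortfall at q = 1

# Hypothetical sample: the smallest value carries half of the total weight.
values = [1, 2, 3, 10]
weights = [50, 30, 15, 5]
cdf = weighted_ecdf(values, weights)
q60 = weighted_quantile(values, weights, 0.6)
q90 = weighted_quantile(values, weights, 0.9)
```

Ignoring the weights would treat the four values as equally common and would place the 60th percentile at 3 instead of 2.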

Plotting data from a complex survey. For an SRS: histograms and smoothed density estimates; scatterplots and scatterplot matrices. In a complex sampling design, simple plots can be misleading.

[Figures: plots incorporating sampling weights, illustrated with NHANES data]

Design effects. Cornfield’s ratio (1951): measures the efficiency of a sampling plan by the ratio of the variance that would be obtained from an SRS of k observation units to the variance obtained from the complex sampling plan with k observation units. The design effect (deff, Kish 1965): the reciprocal of Cornfield’s ratio.

The design effects. The design effect measures the precision gained or lost by using the more complex design instead of SRS. For estimating a mean

The design effects Stratified Cluster
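The formulas on the design-effect slides were lost in the transcript. The following is a sketch of the standard textbook forms under assumed notation: S^2 for the population variance, S_h^2 for the variance within stratum h, M for a common cluster size, and ICC for the intraclass correlation coefficient. The stratified and cluster expressions are approximations.

```latex
\mathrm{deff}(\text{plan},\ \bar{y})
  = \frac{V(\bar{y}\ \text{under the complex plan})}
         {V(\bar{y}\ \text{under an SRS of the same size})}
  = \frac{V_{\text{plan}}(\bar{y})}{\left(1 - \frac{n}{N}\right)\frac{S^2}{n}}

\text{Stratified, proportional allocation:}\quad
\mathrm{deff} \approx \frac{\sum_h \frac{N_h}{N}\, S_h^2}{S^2}
\qquad
\text{One-stage cluster, equal cluster sizes:}\quad
\mathrm{deff} \approx 1 + (M - 1)\,\mathrm{ICC}
```

A stratified design with proportional allocation therefore typically has a design effect below 1, while a cluster design with positive ICC has a design effect above 1.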

Design Effects and Confidence Intervals
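A minimal sketch of how a known or assumed design effect can be used to adjust an SRS-style confidence interval: the SRS standard error is inflated by the square root of deff, which is equivalent to using an effective sample size of n / deff. The function name and the numbers are hypothetical, and the finite population correction is ignored.

```python
import math

def deff_confidence_interval(y_bar, s2, n, deff, z=1.96):
    """Approximate CI for a mean, adjusting the SRS standard error by deff.

    Returns (lower, upper, effective_sample_size).
    """
    se_srs = math.sqrt(s2 / n)             # SE as if the data were an SRS
    se_design = math.sqrt(deff) * se_srs   # SE inflated by the design effect
    return y_bar - z * se_design, y_bar + z * se_design, n / deff

# Hypothetical survey: n = 400 observations, sample variance 100, deff = 4.
lower, upper, n_eff = deff_confidence_interval(50.0, 100.0, 400, 4.0)
```

With a design effect of 4, the 400 clustered observations carry about as much information as an SRS of 100, so the interval is twice as wide as the naive one.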