Exploratory Analysis of Survey Data Lisa Cannon Luke Peterson.

Presentation Outline
- Density Estimation
  - Nonparametric kernel density estimates
  - Properties of kernel density estimators
  - Other methods
- Graphical Displays
  - NHANES data

Three features that distinguish survey data:
1. Individuals in the sample represent differing numbers of individuals in the population; sampling weights are used to account for this.
2. Some data are imputed because of item nonresponse.
3. Sample sizes can be quite large.

The Need for Nonparametric Methods
- We often study point estimation that assumes iid random variables.
- Stratification may violate the assumption of identically distributed random variables.
- Clustering may violate the assumption of independence.
- The methods we discuss rely on asymptotic properties that allow nonparametric estimation of the shape of a distribution.

Kernel Density Estimates
- Bellhouse and Stafford (1999) looked at kernel density estimation for:
  - The whole data set
  - Binned data (groups the data after it is smoothed)
  - Smoothing binned data (smooths the data after it is grouped)
- The asymptotic integrated MSE is derived under both the model-based and the design-based approach.
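As a rough illustration of the whole-data case, the sketch below evaluates a weighted kernel density estimate of the form f_hat(x) = sum_i w_i K((x - y_i)/h) / (h sum_i w_i). The Gaussian kernel, the bandwidth, and the toy data are assumptions made for the example; they are not taken from Bellhouse and Stafford (1999).

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def weighted_kde(x, values, weights, bandwidth):
    """Weighted kernel density estimate at a single point x.

    f_hat(x) = sum_i w_i * K((x - y_i) / h) / (h * sum_i w_i)
    """
    total_weight = sum(weights)
    smoothed = sum(
        w * gaussian_kernel((x - y) / bandwidth)
        for y, w in zip(values, weights)
    )
    return smoothed / (bandwidth * total_weight)

# Hypothetical toy data: five respondents with unequal sampling weights.
values = [4.8, 5.1, 5.4, 6.0, 9.2]
weights = [1200.0, 800.0, 1500.0, 400.0, 100.0]

# Evaluate the estimate on a grid from 3.0 to 13.0 in steps of 0.1.
grid = [3.0 + 0.1 * k for k in range(101)]
density = [weighted_kde(x, values, weights, bandwidth=0.5) for x in grid]
```

With equal weights this reduces to the usual unweighted kernel estimate, so the weights only change how much each observation contributes to the curve.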

Why Binning?
- To simplify estimation for large samples.
- The shape of the data can be distorted by binning.
- Smoothing helps to recover lost structure.
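The idea of binning first and smoothing afterwards can be sketched as follows. The data are reduced to weighted bin proportions, and a kernel is then applied to the bin midpoints instead of to every observation. The bin layout, kernel, and bandwidth are illustrative assumptions only.

```python
import math

def gaussian_kernel(u):
    """Standard normal density used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def bin_weighted_data(values, weights, start, width, n_bins):
    """Return bin midpoints and the weighted proportion falling in each bin.

    Observations outside [start, start + n_bins * width) are ignored.
    """
    totals = [0.0] * n_bins
    for y, w in zip(values, weights):
        k = int((y - start) // width)
        if 0 <= k < n_bins:
            totals[k] += w
    grand_total = sum(totals)
    midpoints = [start + (k + 0.5) * width for k in range(n_bins)]
    proportions = [t / grand_total for t in totals]
    return midpoints, proportions

def smooth_binned(x, midpoints, proportions, bandwidth):
    """Kernel-smooth the bin proportions: sum_k p_k * K((x - m_k) / h) / h."""
    return sum(
        p * gaussian_kernel((x - m) / bandwidth)
        for m, p in zip(midpoints, proportions)
    ) / bandwidth

# Hypothetical toy data with unequal sampling weights.
values = [4.8, 5.1, 5.4, 6.0, 9.2]
weights = [1200.0, 800.0, 1500.0, 400.0, 100.0]

midpoints, proportions = bin_weighted_data(values, weights, 3.0, 0.5, 24)
smoothed_at_5 = smooth_binned(5.0, midpoints, proportions, bandwidth=0.5)
```

The smoothing step costs one kernel evaluation per bin rather than one per respondent, which is where the saving for large samples comes from.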

Design-Based and Model-Based
- Two different ways to handle the asymptotics:
  - Model-based: the N finite population units are a sample of identically distributed units from an infinite superpopulation.
  - Design-based: a nested sequence of finite populations whose distribution functions converge to a limit.
- The weights do not affect the bias, but the variance of the estimator is inflated by the design effect.
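The slide does not say how the design effect is computed. As a back-of-the-envelope illustration only, the sketch below uses Kish's approximation for the design effect due to unequal weighting, deff = n * sum(w^2) / (sum(w))^2. This approximation ignores clustering and stratification, and it is an assumption of the example rather than the quantity used in the papers cited here.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect from unequal weights.

    deff = n * sum(w_i^2) / (sum(w_i))^2; equals 1 when all weights are equal.
    """
    n = len(weights)
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    return n * sum_w2 / (sum_w * sum_w)

# Hypothetical sampling weights for five respondents.
weights = [1200.0, 800.0, 1500.0, 400.0, 100.0]

deff = kish_design_effect(weights)
# Rough "effective" sample size after accounting for unequal weighting.
effective_n = len(weights) / deff
```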

Buskirk and Lohr (2005)
- Also addressed kernel density estimation.
- Considered use of the whole data set (no binning).
- Also considered a combination of design-based and model-based approaches.
- Explored conditions for consistency and asymptotic normality.
- Defined confidence bands for the density.

Applications
- Ontario Health Survey
- US National Crime Victimization Survey (NCVS)
- US National Health and Nutrition Examination Survey (NHANES)

Other Methods
- Bellhouse and Stafford (2001): polynomial regression methods
- Bellhouse, Chipman, and Stafford (2004): additive models for survey data via the penalized least squares method
- Korn et al. (1997): smoothing the empirical cumulative distribution function
- Graubard and Korn (2002): variance estimation
- Many others

Plotting Survey Data
- Common difficulties with plotting survey data:
  - Dealing with sampling weights
  - Plots of a large number of observations can be difficult to interpret
- See Korn and Graubard (1998).

National Health and Nutrition Examination Survey (NHANES)
- Has been conducted on a periodic basis since
- Completes about 7,000 individual interviews annually.
- Analyzes risk factors for selected diseases and conditions.
- The sample design is a stratified multistage design.
- Data available at

Glycohemoglobin Level (Ghb)
- A blood test that measures the amount of glucose bound to hemoglobin.
- Normal levels are about 4% to 6%.
- People with diabetes have more glycohemoglobin than normal.
- The test indicates how well diabetes has been controlled in the 2 to 3 months before the test.
- Source:

Histograms
- Histograms provide a nice summary of the distribution of large data sets.
- Suppose that we would like to assess the distribution of glycohemoglobin levels.
- Sampling weights must be considered before plotting a histogram.

SAS Code: Account for Weights

    /* Weighted histogram of glycohemoglobin levels */
    proc univariate data=explore.glyco noprint;
      var glyco;
      freq weight;   /* each record counts as "weight" population units */
      histogram / nrows=2 cfill=red midpoints=3 to 15 by 0.5 cgrid=grayDD;
    run;

The variable weight indicates the number of population units that the sample unit represents.
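To make the effect of the weights concrete, here is a small Python sketch (not a translation of the SAS procedure) that accumulates weights instead of counts in each bin. The bin midpoints mirror the 3 to 15 by 0.5 setting above; the function name and the toy data are hypothetical.

```python
def weighted_histogram(values, weights, first_mid, last_mid, step):
    """Return bin midpoints and the percent of total weight in each bin."""
    n_bins = int(round((last_mid - first_mid) / step)) + 1
    left_edge = first_mid - step / 2.0
    totals = [0.0] * n_bins
    for y, w in zip(values, weights):
        k = int((y - left_edge) // step)
        if 0 <= k < n_bins:  # values outside the plotted range are skipped
            totals[k] += w
    grand_total = sum(totals)
    midpoints = [first_mid + k * step for k in range(n_bins)]
    percents = [100.0 * t / grand_total for t in totals]
    return midpoints, percents

# Hypothetical toy data: four respondents with very different weights.
values = [5.0, 5.2, 5.9, 8.1]
weights = [1000.0, 1000.0, 500.0, 100.0]

mids, weighted_pct = weighted_histogram(values, weights, 3.0, 15.0, 0.5)
# Setting every weight to 1 gives the ordinary unweighted histogram.
_, unweighted_pct = weighted_histogram(values, [1.0] * 4, 3.0, 15.0, 0.5)
```

In this toy example the bin centred at 5.0 holds half of the respondents but roughly three quarters of the estimated population, which is the kind of shift the weighted histogram is meant to show.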

Histograms – Effect of Sampling Weights

Boxplots
- Boxplots show the location of important summary statistics along with the shape of the distribution.
- See Figures 7.8 and 7.10 in Lohr.
- The boxplot procedure in SAS will not accept any arguments to account for weights.
- The survey package in R will.
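A weighted boxplot needs weighted quartiles. One simple way to get them, sketched below, is to invert the weighted empirical distribution function: sort the values and return the first one whose cumulative share of the weight reaches q. This is an illustrative definition; survey software may interpolate between values and so give slightly different numbers.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share is at least q."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for y, w in pairs:
        cumulative += w
        if cumulative >= q * total:
            return y
    return pairs[-1][0]

# Hypothetical toy data with unequal sampling weights.
values = [4.8, 5.1, 5.4, 6.0, 9.2]
weights = [1200.0, 800.0, 1500.0, 400.0, 100.0]

# The three numbers that define the box of a weighted boxplot.
q1 = weighted_quantile(values, weights, 0.25)
median = weighted_quantile(values, weights, 0.50)
q3 = weighted_quantile(values, weights, 0.75)
```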

Graphs for Regression – Bubble Plots
- Scatterplots are inadequate for survey data because they fail to account for sampling weights.
- Bubble plots incorporate the weights by making the area of each circle proportional to the number of population observations at those coordinates (see Lohr, Chapter 11).
- The ordinary least squares regression line is then replaced by a weighted least squares line.
- See Figure 11.5 in Lohr.
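The two ingredients of a bubble plot can be sketched without any plotting library: marker areas proportional to the weights, and a weighted least squares line that minimizes sum_i w_i (y_i - a - b x_i)^2. The function names, the maximum marker area, and the toy data are assumptions for illustration.

```python
def bubble_areas(weights, max_area=400.0):
    """Marker areas proportional to the weights; the largest gets max_area."""
    largest = max(weights)
    return [max_area * w / largest for w in weights]

def weighted_least_squares_line(x, y, w):
    """Intercept and slope minimizing sum_i w_i * (y_i - a - b * x_i)^2."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical toy data: age, glycohemoglobin level, sampling weight.
age = [22.0, 35.0, 48.0, 61.0, 74.0]
glyco = [4.9, 5.1, 5.4, 5.8, 6.3]
weight = [1200.0, 800.0, 1500.0, 400.0, 100.0]

areas = bubble_areas(weight)
intercept, slope = weighted_least_squares_line(age, glyco, weight)
```

If a plotting library such as matplotlib is available, the areas can be passed as the marker-size argument of a scatter plot, since that argument is interpreted as an area rather than a radius.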

Bubble Plot for NHANES Data

Dealing with Large Samples
- Bubble plots are hard to interpret for large data sets because the bubbles overlap.
- Potential solutions:
  - Create a “sampled scatterplot” in which we sample from the original data with probability of selection proportional to the sampling weights.
  - “Jitter” the data by adding some random noise to the values before plotting.
- These and other solutions are discussed in Korn and Graubard (1998).
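Jittering is simple enough to show in a few lines. The sketch below adds uniform noise of a chosen half-width to each value; the uniform distribution, the half-width, and the seed are arbitrary choices for the example, not recommendations from Korn and Graubard (1998).

```python
import random

def jitter(values, amount, rng):
    """Add independent uniform noise in [-amount, amount] to each value."""
    return [v + rng.uniform(-amount, amount) for v in values]

# Hypothetical ages recorded in whole years, so many points would overlap.
ages = [35, 35, 35, 36, 36, 40]

rng = random.Random(2024)  # fixed seed so the plot is reproducible
jittered_ages = jitter(ages, 0.4, rng)
```

The noise should stay small relative to the spacing of the recorded values so that the plot separates overlapping points without suggesting precision the data do not have.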

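SAS Code: Plotting a representative subsample

    /* Draw 300 records with probability proportional to the sampling weight */
    proc surveyselect data=explore.glyco out=plotdata
                      method=pps sampsize=300 seed=3452;
      size weight;
    run;

    /* Scatterplot of the subsample with a fitted regression line (i=r) */
    symbol1 v=circle i=r c=black ci=green w=2;
    proc gplot data=plotdata;
      plot glyco*age;
    run;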
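For readers without SAS, the same idea can be sketched in Python using systematic sampling with probability proportional to size. This is one standard PPS scheme chosen for its simplicity; it is not claimed to be the algorithm that PROC SURVEYSELECT applies for method=pps, and the function name and toy data are hypothetical.

```python
import random
from itertools import accumulate

def pps_systematic_sample(weights, sample_size, rng):
    """Indices chosen by systematic sampling proportional to the weights.

    A unit whose weight exceeds total / sample_size can be selected more
    than once. The result depends on the order of the list, so the records
    are usually put in random order first.
    """
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    step = total / sample_size
    start = rng.random() * step  # random start in [0, step)
    chosen = []
    t = 0  # index of the next selection point start + t * step
    for i, upper in enumerate(cumulative):
        while t < sample_size and start + t * step < upper:
            chosen.append(i)
            t += 1
    return chosen

# Hypothetical sampling weights for five respondents.
weights = [1200.0, 800.0, 1500.0, 400.0, 100.0]

rng = random.Random(3452)
subsample = pps_systematic_sample(weights, 2, rng)
```

The selected records can then be drawn as an ordinary unweighted scatterplot, because the weights have already been used in the selection step.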

Subsample: Glycohemoglobin vs. Age

Plotting Recommendations
- For univariate displays, adjust for the sampling weights.
- For scatterplots, sampling weights can be accounted for by using bubble plots.
- If the sample is large, a subsampling procedure that incorporates the weights might be more appropriate.

References
Bellhouse, D.R. and Stafford, J.E. (1999). Density estimation from complex surveys. Statistica Sinica.
Bellhouse, D.R. and Stafford, J.E. (2001). Local polynomial regression in complex surveys. Survey Methodology.
Bellhouse, D.R. and Stafford, J.E. (2004). Additive models for survey data via penalized least squares. Technical Report.
Buskirk, T.D. and Lohr, S.L. (2005). Asymptotic properties of kernel density estimation with complex survey data. Journal of Statistical Planning and Inference.
Graubard, B.I. and Korn, E.L. (2002). Inference for superpopulation parameters using sample surveys. Statistical Science.
Korn, E.L., Midthune, D., and Graubard, B.I. (1997). Estimating interpolated percentiles from grouped data with large samples. Journal of Official Statistics.
Korn, E.L. and Graubard, B.I. (1998). Scatterplots with survey data. The American Statistician.