Understanding and Evaluating Survey Data Documentation: A User’s Guide

Chase H. Harrison, Ph.D. Harvard University Department of Government and IQSS

Overview of Talk
- What is a survey?
- Where might you look?
- What does it look like?
- What do you look for?
  - Variables and questions
  - Study information
- Special Topic: Survey Weights

Where might you look?

Survey Data
“Survey Research” can mean different things
Generally includes
- Structured, standardized data collection
- Consistent, comparable measures that can be easily quantified
- Samples from a well-defined sample frame
- Ability to understand a specific population of inference
- Samples that allow statistical projection with measurable or estimable precision

A Total Survey Error Perspective
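[Figure: total survey error diagram. Measurement side (internal validity): Concepts → Measures → Responses → Edited Responses. Representation side (external validity): Population → Sample Frame → Respondents → Adjusted Data. Both sides feed the Survey Statistic. Adapted from Groves et al., 2004.]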

Types of Surveys
- Cross-Sectional Survey
- Panel Survey
- Longitudinal Survey
- Multiple Populations
- Multiple modes for same respondent
- Different modes for different respondents
- Multiple Reporters
- Nested populations
- Repeated Cross-section

Caveats on Composite Datasets
- Sample frames, sources, etc., may have changed
- Question wording may not be identical
- Modes may be different
- Question order may be different

Descriptive Surveys Make population projections Use of point estimates Generally require probability-based sampling Stricter requirements for field methodology Analytic Surveys Often thought of as model-based inference Generally require probability samples More typical for social science Looking at relationships Inference is primarily based on the model used Are appropriate variables included and controlled for in analysis? Design-based inference May or may not require probability samples

Where might you look?

Survey Data Sources
Aggregate Data
- Question-level data
- Statistical reports
Micro-Level Sources
- Datasets
- Examples: General Social Survey, American National Election Studies, Pew surveys, others

Popular Sources for Micro-Level Data
- Start with the IQSS Dataverse……
- ICPSR
- Roper Center
- US Census and other federal sources
- Pew Research Center
- Many others

Identifiability and Usage
- Confidentiality and privacy are issues in survey research
- Some geographic or demographic information is often removed, collapsed or disguised
- Usage is typically conditional on not attempting to re-identify data or identify respondents

Restricted Use
- More detailed data may be available with special restricted-use agreements
- Federal Data: Census Data Center (at NBER) allows limited, negotiated access to identifiable federal data

What might you find?

Overview of Documentation
Typical Files
- Dataset
- Codebook
- Complete Questionnaire
- Description of Study methodology
Other files
- Sample frequencies
- Unit-level data
- Contextual-level data
- Linkages to other studies

Methodology Description
Statistical Dataset
- Raw data
- Contextual Data
- Sample or Frame Data
Study Documentation
- Methodology Description
- Full Questionnaire
- Other Appendices

What a dataset might look like…. (viewed in Stata)
Variables Respondents

Viewing the dataset with labels…..

If you don’t have a dataset….
Raw Data + Data Definition Code → Statistical File

Typical Dataset Formats
Less Processed
- *.txt: column delimited text
- *.dat: tab or column text
- *.csv: comma delimited
More Processed
- *.sav: SPSS format
- *.por: SPSS format
- *.dta: Stata format
- *.sas7bdat: SAS format
- *.xpt: SAS format

A raw dataset (in tab and comma delimited format)

Part of a data definition file (Stata commands to label variables)

Items that may be included in data definition files
- Variable names
- Variable labels
- Format (numeric, string, etc.)
- Length
- Missing values
- Level (e.g., nominal, continuous)
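To make the relationship between a raw file and a data definition file concrete, here is a minimal sketch in Python. The file contents, variable names (`id`, `pid1`, `age`), labels and missing codes are all hypothetical and only stand in for whatever a real codebook provides.

```python
import csv
import io

# Hypothetical raw data in comma delimited form; values are made up.
RAW = "id,pid1,age\n1,1,34\n2,2,51\n3,98,27\n"

# Hypothetical data definition: label, type, value labels, missing codes.
DATA_DEFINITION = {
    "id": {"label": "Respondent ID", "type": int},
    "pid1": {
        "label": "Party identification (first question)",
        "type": int,
        "values": {1: "Democrat", 2: "Republican", 3: "Something else"},
        "missing": {98, 99},
    },
    "age": {"label": "Age in years", "type": int},
}


def load_with_definition(text, definition):
    """Read delimited text and apply types and missing-value declarations."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, raw_value in record.items():
            spec = definition[name]
            value = spec["type"](raw_value)
            # Codes declared missing become None so they drop out of analysis.
            if value in spec.get("missing", set()):
                value = None
            row[name] = value
        rows.append(row)
    return rows


rows = load_with_definition(RAW, DATA_DEFINITION)
```

Statistical packages store this kind of metadata inside the processed file, which is why a .sav or .dta file usually carries more information than the raw text it was built from.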

Dataset Notes
- Generally, statistical files (e.g. .sav, .dta) contain more information than raw files
- Many programs can read or write other programs' formats
- Stat/Transfer converts between many packages
- IQSS / HMDC lab (1737 Cambridge St. basement) has computers with most statistical packages

General Dataset Principles (briefly)
- Always try to keep an ID that links your data file to the sample frame
- Always keep an “original” backup of files (with documentation of source, date, etc.)
- Better to put all recodes into new variables
  - Alternative: never overwrite a file
- Always pay attention to whether the analysis should use weights (more later)

Question Level Information
What to Look For Question Level Information

Levels of Generalization
Survey Question
- What actually is asked
- Instructions and probes not always documented
Raw variable
- May or may not have labels or codes
- May or may not be included in dataset
- Skipped respondents may not be documented
- Labels may or may not reflect actual question wording
Coded variable
- May be provided
- May not be provided
- May not be completely documented
- Are there missing values? [If so, why?]
Analyzed variable
- May be analyzed at a different level (e.g., interval, ordinal) than measurement
- Probably excludes missing values
- May have different labels

Dataset Notes Variable names may not reflect actual question wording

Common Triage on Variables
- Run summary statistics on all files
  - Variable names + variable labels
  - Codes and labels for values
- Compare output to documentation
- Observe patterns
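As a rough illustration of this triage step, the sketch below tabulates the codes that actually occur in a variable and flags any that the codebook does not document. The rows and the documented code set are invented for the example.

```python
from collections import Counter


def frequencies(rows, variable):
    """Count how often each code occurs for one variable."""
    return Counter(row.get(variable) for row in rows)


def undocumented_codes(rows, variable, documented):
    """Return observed codes that do not appear in the codebook."""
    observed = frequencies(rows, variable)
    return sorted(
        code for code in observed
        if code is not None and code not in documented
    )


# Hypothetical data and codebook entry for a variable named pid1.
rows = [{"pid1": 1}, {"pid1": 2}, {"pid1": 2}, {"pid1": 7}, {"pid1": None}]
codebook_codes = {1, 2, 3, 98, 99}

counts = frequencies(rows, "pid1")
surprises = undocumented_codes(rows, "pid1", codebook_codes)
```

Here the code 7 would be flagged, which is the kind of pattern worth checking against the questionnaire before any analysis.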

More triage….
- Compare cases with data to cases that should have data
- Run contingency tables (crosstabs) to validate skip logic
- Note any undocumented variables (what can they tell you?)
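A crosstab check of skip logic can be sketched as follows. The rule used here (a follow-up variable `pid2` should have data only when `pid1` is 1 or 2) and the sample rows are assumptions for illustration; the real rule comes from the codebook.

```python
from collections import Counter


def crosstab_presence(rows, filter_var, follow_var):
    """Cross-tabulate the filter question by whether the follow-up has data."""
    return Counter(
        (row.get(filter_var), row.get(follow_var) is not None) for row in rows
    )


def skip_violations(rows, filter_var, follow_var, asked_if):
    """List IDs where presence of follow-up data contradicts the skip rule."""
    bad = []
    for row in rows:
        should_have = row.get(filter_var) in asked_if
        has_data = row.get(follow_var) is not None
        if should_have != has_data:
            bad.append(row["id"])
    return bad


# Hypothetical cases: 2 lacks a follow-up it should have, 4 has one it should not.
rows = [
    {"id": 1, "pid1": 1, "pid2": 1},
    {"id": 2, "pid1": 2, "pid2": None},
    {"id": 3, "pid1": 3, "pid2": None},
    {"id": 4, "pid1": 98, "pid2": 2},
]

table = crosstab_presence(rows, "pid1", "pid2")
violations = skip_violations(rows, "pid1", "pid2", asked_if={1, 2})
```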

Missing Values
- May be defined with a code but declared missing for analysis
- May be simply missing or empty
- Reasons for missing data
  - Variable not relevant (skipped)
  - Information implied from previous response
  - Intentionally missing (e.g. ½ sample)
  - “Don’t know”, “Refused”, non-responsive
  - Programming errors
- For self-administered surveys, the meaning of missing values is less clear
- Level of processing can vary depending on data producer
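The difference between a code that exists in the file and a value that is declared missing matters as soon as you compute a statistic. A small sketch, assuming a hypothetical 1 to 5 scale where 98 and 99 stand for "Don't know" and "Refused":

```python
def mean_excluding_missing(values, missing_codes=(98, 99)):
    """Mean of valid responses, dropping empty cells and declared missing codes."""
    valid = [v for v in values if v is not None and v not in missing_codes]
    if not valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical responses on a 1-5 scale, with 98/99 and one empty cell.
responses = [1, 2, 5, 98, 99, None]

# Naive mean: only empty cells dropped, so 98 and 99 are treated as real values.
present = [v for v in responses if v is not None]
naive_mean = sum(present) / len(present)

# Mean after declaring 98 and 99 missing.
clean_mean = mean_excluding_missing(responses)
```

The naive mean lands far outside the 1 to 5 range, which is often the first visible sign that missing codes were not declared.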

Composite Variables
- Survey process doesn’t always match analytic process
- Single “variables” may be constructed from multiple questions
- Depending on producer, more or less processing may have been performed
- Processing implies both actions and decisions
  - Not always well documented
  - Not always consistent with assumptions you may want to make……

An Example: Party Identification
- Common political science variable
  - e.g. Democrats mostly voted for Clinton, while Republicans mostly voted for Trump…..
- Common measurement is seven points
- Common analysis treats as recoded nominal, ordinal or interval

How this might be asked in a web survey
Which best describes your political party?
- Democrat
- Independent
- Republican

In an interviewer-administered survey…….
PID1: In politics as of today, would you consider yourself a Democrat, a Republican, an Independent, or something else?
PID2: Would you say you are a strong [Democrat/Republican], or a not very strong [Democrat/Republican]?
PID3: Which way do you lean, Republican or Democrat?

Labels in Dataset (+ Skip Logic)
PID1
- Codes and labels in dataset: 1 Democrat; 2 Republican; 3 Something else; 98 Don’t know; 99 Refused
- Logic from codebook: {SKIP TO PID2}; [VOL]; {SKIP TO PID3}
PID2
- Codes and labels in dataset: 1 Strong; 2 Not very strong; 98 Don’t know [VOL]; 99 Refused [VOL]
- Logic from codebook: [VOL]
PID3
- Codes and labels in dataset: 1 Republican; 2 Democrat; 3 Neither/Both; 4 Other; 98; 99

Possible Recode

Code  Label              Logic
1     Strong Democrat    (PID1=1) and (PID2=1)
2     Weak Democrat      (PID1=1) and ((PID2=2) OR (PID2=98) OR (PID2=99))
3     Lean Democrat      (PID3=2)
4     Independent        (PID3=3)
5     Lean Republican    (PID3=1)
6     Weak Republican    (PID1=2) and ((PID2=2) OR (PID2=98) OR (PID2=99))
7     Strong Republican  (PID1=2) and (PID2=1)
97    Other              (PID3=4)
98    Don’t know         (PID3=98)
99    Refused            (PID3=99)
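The recode logic above can be sketched as a small Python function. The function name and the choice to return `None` when no rule applies are assumptions; the mapping itself follows the recode table, and it assumes PID3 is only consulted when PID1 is neither Democrat nor Republican, as the skip logic suggests.

```python
def recode_party_id(pid1, pid2, pid3):
    """Build a seven-point party ID (plus 97/98/99) from PID1, PID2 and PID3."""
    if pid1 in (1, 2):
        is_democrat = pid1 == 1
        if pid2 == 1:
            return 1 if is_democrat else 7  # strong partisans
        if pid2 in (2, 98, 99):
            return 2 if is_democrat else 6  # weak partisans, incl. DK/refused on strength
        return None  # PID2 unexpectedly empty: leave for manual review
    # Everyone else was routed to the "lean" question.
    lean_recode = {2: 3, 3: 4, 1: 5, 4: 97, 98: 98, 99: 99}
    return lean_recode.get(pid3)


# A new variable is created; the original three variables stay untouched.
example = recode_party_id(pid1=3, pid2=None, pid3=2)  # Lean Democrat
```

Note the decision buried in code 2 and code 6: respondents who did not say how strong their attachment is are treated as weak partisans. That is exactly the kind of producer decision you may or may not want to accept.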

Notes
- Dataset may or may not include the composite variable
- Composite variable will not capture all nuances of question
- Composite recode may not match your assumptions
- You may want to do further collapsing or recoding for analysis

Another example…….
INCOME: Is your annual household income before taxes, [ ] or more, or less than [ ]?
Thresholds asked across Inc1, Inc2 and Inc3: $15,000; $30,000; $50,000; $75,000; $85,000; $100,000; $150,000
Resulting ranges
- Less than $15K
- $15K - <$30K
- $30K - <$50K
- $50K - <$75K
- $75K - <$85K
- $85K - <$100K
- $100K - <$150K
- $150K +
- Don’t know ??
- Refused

Ways to include in dataset
- Keep as ordinal [1, 2, 3, 4, etc.]
- Recode to interval with median values as codes ($7,000, $22,500, $40,000, etc.)
- Break into nominal categories ($75K + or not)
- How to treat partially missing data?
- Do you impute missing values?
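The three treatments can be sketched side by side. The ordinal codes 1 to 8 are hypothetical, and the interval recode here uses the arithmetic midpoint of each range, which is only one possible choice; a producer may use different representative values (the $7,000 quoted for the lowest range is one such case). The open-ended top range is left as `None` rather than guessed.

```python
# Hypothetical ordinal codes mapped to (lower bound, upper bound) in dollars.
INCOME_RANGES = {
    1: (0, 15_000),
    2: (15_000, 30_000),
    3: (30_000, 50_000),
    4: (50_000, 75_000),
    5: (75_000, 85_000),
    6: (85_000, 100_000),
    7: (100_000, 150_000),
    8: (150_000, None),  # open-ended top category
}


def income_midpoint(code):
    """Interval recode: midpoint of the range, or None if it cannot be computed."""
    bounds = INCOME_RANGES.get(code)
    if bounds is None or bounds[1] is None:
        return None
    lower, upper = bounds
    return (lower + upper) / 2


def income_75k_plus(code):
    """Nominal recode: True if $75K or more, False if less, None if unknown."""
    bounds = INCOME_RANGES.get(code)
    if bounds is None:
        return None
    return bounds[0] >= 75_000
```

Don't know and Refused fall through to `None` in both recodes, which leaves the imputation question open instead of answering it silently.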

Study level information
What to Look For Study level information

A moderately detailed codebook

Basic Information

Data Collection… Five pages overall…..

Method or mode of data collection
A few brief notes
Method or mode of data collection
- Telephone, web, face-to-face
- In some cases, multiple modes are used
- Distinctions between aural and visual modes
Response Rates
- Percent of contacted sample records who completed survey
- Specific standards exist (e.g., RR4)
- Some journals require for publication
- Nonresponse bias is more complex than simple rates
Sampling Approach
- Probability samples generally require much more information than non-probability samples
- Key distinction is between descriptive surveys versus analytic surveys
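For readers who have not met the standardized response rates, the sketch below computes two of them in the form used by the AAPOR Standard Definitions: RR1 (complete interviews over all potentially eligible cases) and RR4 (completes plus partials, with only an estimated share `e` of unknown-eligibility cases counted). The disposition counts and the value of `e` are invented; check the current Standard Definitions before reporting a rate.

```python
def rr1(interviews, partials, refusals, noncontacts, other,
        unknown_household, unknown_other):
    """RR1: complete interviews divided by all potentially eligible cases."""
    denominator = (interviews + partials
                   + refusals + noncontacts + other
                   + unknown_household + unknown_other)
    return interviews / denominator


def rr4(interviews, partials, refusals, noncontacts, other,
        unknown_household, unknown_other, e):
    """RR4: completes plus partials, counting a share e of unknown cases."""
    denominator = (interviews + partials
                   + refusals + noncontacts + other
                   + e * (unknown_household + unknown_other))
    return (interviews + partials) / denominator


# Hypothetical final dispositions for one study.
counts = dict(interviews=600, partials=50, refusals=200, noncontacts=100,
              other=50, unknown_household=300, unknown_other=100)

rate_rr1 = rr1(**counts)
rate_rr4 = rr4(**counts, e=0.5)
```

With the same dispositions the two rates differ noticeably, so a codebook that reports "the response rate" without naming the formula leaves a real ambiguity.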

Probability Sampling and Populations
Target Population
- The population the survey was designed to cover
Sample Frame
- The list or set of procedures from which respondents are selected
- Multiple frames increasingly used
Sampling Methods
- Stratified
- Multi-stage sampling

Why do people talk about survey weights?

Non-probability Samples
Probability Samples
- Required for strict population generalization
- Require a well-defined and well-documented sample frame
- Each member of population has a known, non-zero chance of selection
- Must be tractable (i.e. well-documented and replicable)
Non-probability Samples
- Generally only able to generalize to itself
- Possibly useful for model-based inference or experiments
- Convenience samples
- Availability samples
- Opt-in internet samples

Sample from a codebook…..
three more pages……

Main reasons survey data is weighted
Sampling
- Disproportionate stratification
- Disproportionate probabilities of selection
Nonresponse
- Sometimes called post-stratification
- Adjusts observed data to population estimates
- Usually adjusts for coverage as well
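Both kinds of weight reduce to simple ratios, sketched below under simplifying assumptions: a design weight is the stratum population count divided by the stratum sample count (the inverse of the selection probability), and a post-stratification factor is a group's population share divided by its share of the sample. The strata, groups and numbers are invented, and real weighting schemes usually combine several such steps.

```python
def design_weights(population_counts, sample_counts):
    """Inverse-probability weight per stratum: N_h / n_h."""
    return {
        stratum: population_counts[stratum] / sample_counts[stratum]
        for stratum in sample_counts
    }


def poststratification_factors(population_shares, sample_shares):
    """Adjustment per group: population share divided by sample share."""
    return {
        group: population_shares[group] / sample_shares[group]
        for group in population_shares
    }


# Hypothetical disproportionate stratified design: rural areas oversampled.
base = design_weights(
    population_counts={"urban": 8000, "rural": 2000},
    sample_counts={"urban": 100, "rural": 100},
)

# Hypothetical nonresponse pattern: women overrepresented among respondents.
adjust = poststratification_factors(
    population_shares={"women": 0.5, "men": 0.5},
    sample_shares={"women": 0.6, "men": 0.4},
)
```

A respondent's final weight in a scheme like this would be the product of the base weight and the adjustment factor for their group.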

Other types of weights…
- Different units of observation versus analysis
- Different unit of measurement and analysis
- To expand data to a larger population
- Adjust analytic importance of different groups

How to analyze weights
- At a minimum, adjust for weights to make population estimates
- More complex approaches required to adjust variance
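A minimal sketch of both points, with invented values and weights: the weighted mean gives the population estimate, and Kish's approximate effective sample size, (Σw)² / Σw², gives a rough sense of how much unequal weights cost in precision. The Kish figure is only an approximation that ignores stratification and clustering, so it does not replace design-based variance estimation.

```python
def weighted_mean(values, weights):
    """Population estimate of a mean using survey weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def kish_effective_n(weights):
    """Kish's approximate effective sample size for unequal weights."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Hypothetical yes/no item (1 = yes) with two heavily weighted respondents.
answers = [1, 1, 0, 0]
weights = [80, 80, 20, 20]

unweighted = sum(answers) / len(answers)
weighted = weighted_mean(answers, weights)
effective_n = kish_effective_n(weights)
```

In this toy case the estimate moves from 0.5 to 0.8 once weights are applied, while the four interviews carry the information of roughly three equally weighted ones.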

Complex survey analysis
Procedures to adjust for weights, stratification, clustering, and other sample design effects
Procedures available in
- Stata: svy commands
- R: Package ‘survey’
- SPSS: Complex samples
- SUDAAN
- SAS: SURVEY procs

Harvard Program on Survey Research
For more information…. Survey Advising Harvard Program on Survey Research Chase H. Harrison, Ph.D.
