Understanding and Evaluating Survey Data Documentation: A User’s Guide

1 Understanding and Evaluating Survey Data Documentation: A User’s Guide
Chase H. Harrison, Ph.D. Harvard University Department of Government and IQSS

2 Overview of Talk What is a survey? Where might you look?
What does it look like? What do you look for? Variables and questions Study information Special Topic: Survey Weights

3 Where might you look?

4 Survey Data “Survey Research” can mean different things
Generally includes Structured, standardized data collection Consistent, comparable measures that can be easily compared and quantified Samples from well-defined sample frame Ability to understand specific population of inference Samples allow statistical projection with measurable or estimable precision

5 A Total Survey Error Perspective
Diagram labels: Concepts, Measures, Responses, Edited Responses; Population, Sample Frame, Respondents, Adjusted Data; Internal Validity; External Validity; Survey Statistic. Adapted from Groves et al., 2004

6

7 Types of Surveys Cross-Sectional Survey Panel Survey
Longitudinal Survey Multiple Populations Multiple Modes for Same respondent Different modes for different respondents Multiple Reporters Nested populations Repeated Cross-section

8 Caveats on Composite Datasets
Sample frames, sources, etc., may have changed Question wording may not be identical Modes may be different Question-order may be different

9 Descriptive Surveys: Make population projections; use of point estimates; generally require probability-based sampling; stricter requirements for field methodology
Analytic Surveys: Often thought of as model-based inference; generally require probability samples; more typical for social science; looking at relationships; inference is primarily based on the model used; are appropriate variables included and controlled for in analysis?
Design-based inference; may or may not require probability samples

10 Where might you look?

11

12 Survey Data Sources
Aggregate Data: Question-level data; statistical reports
Micro-Level Sources: Datasets. Examples: General Social Survey; American National Election Studies; Pew surveys; others

13 Popular Sources for Micro-Level Data
Start with the IQSS Dataverse…… ICPSR Roper Center US Census and other federal sources Pew Research Centers Many others

14 Identifiability and Usage
Confidentiality and privacy are issues in survey research Some geographic or demographic information often removed, collapsed or disguised Usage typically conditional on not attempting to re-identify data or identify respondents

15 Restricted Use More detailed data may be available with special restricted use agreements Federal Data: Census Data Center (at NBER) allows limited negotiated access of identifiable federal data.

16 What might you find?

17 Overview of Documentation
Typical Files Dataset Codebook Complete Questionnaire Description of Study methodology Other files Sample frequencies, Unit-level data, Contextual-level data Linkages to other studies

18 Methodology Description
Statistical Dataset Raw data Contextual Data Sample or Frame Data Study Documentation Methodology Description Full Questionnaire Other Appendices

19 What a dataset might look like…. (viewed in STATA)
Variables Respondents

20 Viewing the dataset with labels…..

21 If you don’t have a dataset….
Raw Data Data Definition Code Statistical File

22 Typical Dataset Formats
Less Processed: *.txt column delimited text; *.dat tab or column text; *.csv comma delimited
More Processed: *.sav SPSS format; *.por SPSS portable format; *.dta STATA; *.sas7bdat SAS format; *.xpt SAS transport format

23 A raw dataset (in tab and comma delimited format)

24 Part of a data definition file (STATA commands to label variables)

25 Items that may be included in data definition files
Variable names Variable labels Format (numeric, string, etc.) Length Missing values Level (i.e. nominal, continuous, etc.)

26 Dataset Notes
Generally, statistical files (e.g. .sav, .dta) contain more information than raw files
Many programs can read or write other programs’ file formats
Stat/Transfer converts between many packages
IQSS / HMDC lab (1737 Cambridge St. basement) has computers with most statistical packages.

27 General Dataset Principles (briefly)
Always try to keep an ID that links your data file to the sample frame Always keep an “original” backup of files (with documentation of source, date, etc.) Better to put all recodes into new variables Alternative: Never overwrite a file Always pay attention to whether the analysis should use weights. (More Later)

28 Question Level Information
What to Look For Question Level Information

29 A Total Survey Error Perspective
Diagram labels: Concepts, Measures, Responses, Edited Responses; Population, Sample Frame, Respondents, Adjusted Data; Internal Validity; External Validity; Survey Statistic. Adapted from Groves et al., 2004

30 Levels of Generalization
Survey Question: What actually is asked; instructions and probes not always documented
Raw variable: May or may not have labels or codes; may or may not be included in dataset; skipped respondents may not be documented; labels may or may not reflect actual question wording
Coded variable: May be provided; may not be provided; may not be completely documented; are there missing values? [If so, why?]
Analyzed variable: May be analyzed at different level (i.e. interval, ordinal) than measurement; probably excludes missing values; may have different labels

31 Dataset Notes Variable names may not reflect actual question wording

32 Common Triage on Variables
Run summary statistics on all files Variable names + variable labels Codes and labels for values Compare output to documentation Observe patterns

33 More triage…. Compare cases with data to cases that should have data
Run contingency tables (crosstabs) to validate skip logic Note any undocumented variables (what can they tell you?)
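
The skip-logic check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the variable names `q1` and `q2`, and the rule "ask q2 only when q1 equals 1" are all hypothetical, and a real check would read the rule from the codebook.

```python
from collections import Counter

# Hypothetical records: q1 is a filter question (1 = yes, 2 = no);
# q2 should be asked only when q1 == 1. None means the field is empty.
records = [
    {"id": 1, "q1": 1, "q2": 3},
    {"id": 2, "q1": 2, "q2": None},
    {"id": 3, "q1": 1, "q2": None},  # should have data but does not
    {"id": 4, "q1": 2, "q2": 5},     # has data but should have been skipped
    {"id": 5, "q1": 1, "q2": 2},
]


def crosstab(rows, filter_var, follow_var):
    """Count cases by filter value and by whether the follow-up has data."""
    table = Counter()
    for row in rows:
        has_data = row[follow_var] is not None
        table[(row[filter_var], has_data)] += 1
    return table


def skip_violations(rows, filter_var, follow_var, asked_if):
    """Return IDs where the presence of follow-up data contradicts the rule."""
    bad = []
    for row in rows:
        should_have = row[filter_var] in asked_if
        has_data = row[follow_var] is not None
        if should_have != has_data:
            bad.append(row["id"])
    return bad


table = crosstab(records, "q1", "q2")
violations = skip_violations(records, "q1", "q2", asked_if={1})

for key in sorted(table):
    print("q1 =", key[0], "| q2 has data:", key[1], "| n =", table[key])
print("Cases that contradict the skip rule:", violations)
```

Any non-empty cell in the "should be skipped but has data" or "should have data but is empty" position is worth comparing against the documentation before analysis.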

34 Missing Values
May be defined with code but declared missing for analysis
May be simply missing or empty
Reasons for missing data: Variable not relevant (skipped); information implied from previous response; intentionally missing (e.g. ½ sample); “Don’t know”, “Refused”, non-responsive; programming errors
For self-administered surveys, meaning of missing values less clear
Level of processing can vary depending on data producer
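
One way to keep the reason for missingness while still excluding those cases from analysis is sketched below. The codes 98 and 99 are an assumption of this example (they happen to match a common convention), and the variable names are hypothetical; the actual codes must come from the codebook.

```python
from collections import Counter

# Assumed missing-value codes; check the codebook for the real ones.
MISSING_CODES = {98: "dont_know", 99: "refused"}


def split_missing(value):
    """Return (analysis_value, missing_reason) for one raw value."""
    if value is None:
        return None, "empty"
    if value in MISSING_CODES:
        return None, MISSING_CODES[value]
    return value, None


raw = [1, 2, 98, None, 99, 1, 98]  # hypothetical raw responses

analysis = []        # new variable: substantive values only
reasons = Counter()  # tally of why values are missing
for v in raw:
    value, reason = split_missing(v)
    analysis.append(value)
    if reason is not None:
        reasons[reason] += 1

print("Analysis variable:", analysis)
print("Missing by reason:", dict(reasons))
```

The raw list is left untouched and the recode goes into a new variable, so the distinction between "empty", "Don't know" and "Refused" remains available later.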

35 Composite Variables Survey process doesn’t always match analytic process Single “variables” may be constructed from multiple questions Depending on producer, more or less processing may have been performed Processing implies both actions and decisions Not always well documented Not always consistent with assumptions you may want to make……

36 An Example: Party Identification
Common political science variable e.g. Democrats mostly voted for Clinton, while Republicans mostly voted for Trump….. Common measurement is seven points Common analysis treats as recoded nominal, ordinal or interval

37 How this might be asked in a web survey
Which best describes your political party? Democrat Independent Republican

38 In an interviewer-administered survey…….
PID1: In politics as of today, would you consider yourself a Democrat, a Republican, an Independent, or something else?
PID2: Would you say you are a strong [Democrat/Republican], or a not very strong [Democrat/Republican]?
PID3: Which way do you lean, Republican, or Democrat?

39 Labels in Dataset (+ Skip Logic)
Variable | Code and Label in Dataset | Logic from Codebook
PID1 | 1 Democrat; 2 Republican; 3 Something else; 98 Don’t know; 99 Refused | {SKIP TO PID2} [VOL] {SKIP TO PID3}
PID2 | 1 Strong; 2 Not very strong; 98 Don’t know [VOL]; 99 Refused [VOL] | [VOL]
PID3 | 1 Republican; 2 Democrat; 3 Neither/Both; 4 Other; 98; 99 |

40 Possible Recode
Code | Label | Logic
1 | Strong Democrat | (PID1=1) and (PID2=1)
2 | Weak Democrat | (PID1=1) and ((PID2=2) OR (PID2=98) OR (PID2=99))
3 | Lean Democrat | (PID3=2)
4 | Independent | (PID3=3)
5 | Lean Republican | (PID3=1)
6 | Weak Republican | (PID1=2) and ((PID2=2) OR (PID2=98) OR (PID2=99))
7 | Strong Republican | (PID1=2) and (PID2=1)
97 | Other | (PID3=4)
98 | Don’t know | (PID3=98)
99 | Refused | (PID3=99)
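
The recode logic on this slide translates fairly directly into code. The sketch below is one possible reading, not the producer's actual program: the function name `pid7` is hypothetical, `None` stands for a question that was skipped, and returning `None` when no rule matches is a choice made here so that undocumented combinations surface rather than being silently coded.

```python
# Labels for the composite codes listed on the slide.
PID7_LABELS = {
    1: "Strong Democrat", 2: "Weak Democrat", 3: "Lean Democrat",
    4: "Independent", 5: "Lean Republican", 6: "Weak Republican",
    7: "Strong Republican", 97: "Other", 98: "Don't know", 99: "Refused",
}


def pid7(pid1, pid2, pid3):
    """Combine PID1, PID2 and PID3 into the composite code from the slide.

    Returns None when no rule on the slide applies.
    """
    weak_codes = (2, 98, 99)  # "not very strong", plus DK/refused on PID2
    if pid1 == 1 and pid2 == 1:
        return 1
    if pid1 == 1 and pid2 in weak_codes:
        return 2
    if pid1 == 2 and pid2 in weak_codes:
        return 6
    if pid1 == 2 and pid2 == 1:
        return 7
    # Everyone else is classified from the "lean" question.
    lean_map = {2: 3, 3: 4, 1: 5, 4: 97, 98: 98, 99: 99}
    return lean_map.get(pid3)


# Hypothetical respondents: (PID1, PID2, PID3)
cases = [(1, 1, None), (2, 98, None), (3, None, 2), (99, None, 99)]
for pid1, pid2, pid3 in cases:
    code = pid7(pid1, pid2, pid3)
    print((pid1, pid2, pid3), "->", code, PID7_LABELS.get(code))
```

Note that a respondent who answers "Don't know" to the strength question is folded into the "weak" category, which is exactly the kind of decision a user may or may not agree with.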

41 Notes Dataset may or may not include the composite variable
Composite variable will not capture all nuances of question Composite recode may not match your assumptions You may want to do further collapsing or recoding for analysis

42 Another example…….
INCOME: Is your annual household income before taxes, [ ] or more, or less than [ ]?
Columns: Inc1, Inc2, Inc3, Range
Thresholds: $15,000; $30,000; $50,000; $75,000; $85,000; $100,000; $150,000
Ranges: Less than $15K; $15K - <$30K; $30K - <$50K; $50K - <$75K; $75K - <$85K; $85K - <$100K; $100K - <$150K; $150K +; Don’t know (??); Refused

43 Ways to include in dataset
Keep as ordinal [1, 2, 3, 4, etc.] Recode to Interval with median values as codes ($7,000, $22,500, $40,000, etc.) Break into nominal categories ($75K + or not) How to treat partially missing data? Do you impute missing values?
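
The three storage options above can be compared side by side in a short sketch. Everything here is illustrative: the ordinal codes 1 to 8 are hypothetical, and the interval values are plain bracket midpoints computed from the ranges, which need not match the values a data producer chose (the slide, for instance, lists $7,000 for the lowest bracket). The open-ended top bracket and the missing codes are left as `None` rather than imputed.

```python
# Hypothetical ordinal codes mapped to (lower bound, upper bound) in dollars.
# The top bracket has no upper bound.
BRACKETS = {
    1: (0, 15000),
    2: (15000, 30000),
    3: (30000, 50000),
    4: (50000, 75000),
    5: (75000, 85000),
    6: (85000, 100000),
    7: (100000, 150000),
    8: (150000, None),
}


def midpoint(code):
    """Interval recode: bracket midpoint, or None if it cannot be computed."""
    bounds = BRACKETS.get(code)
    if bounds is None:       # don't know, refused, or undocumented code
        return None
    low, high = bounds
    if high is None:         # open-ended top bracket: no midpoint
        return None
    return (low + high) / 2


def is_75k_plus(code):
    """Nominal recode: True if $75K or more, False if less, None if missing."""
    bounds = BRACKETS.get(code)
    if bounds is None:
        return None
    return bounds[0] >= 75000


for code in (1, 3, 5, 8, 98):
    print(code, midpoint(code), is_75k_plus(code))
```

How to handle the top bracket and partially missing responses is a substantive decision; the sketch only makes the undecided cases visible.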

44 Study level information
What to Look For Study level information

45 A Total Survey Error Perspective
Diagram labels: Concepts, Measures, Responses, Edited Responses; Population, Sample Frame, Respondents, Adjusted Data; Internal Validity; External Validity; Survey Statistic. Adapted from Groves et al., 2004

46

47

48

49 A moderately detailed codebook

50 Basic Information

51 Data Collection… Five pages overall…..

52 A few brief notes
Method or mode of data collection: Telephone, web, face-to-face; in some cases, multiple modes are used; distinctions between aural and visual modes
Response Rates: Percent of contacted sample records who completed survey; specific standards exist (i.e. RR4, etc.); some journals require for publication; nonresponse bias is more complex than simple rates
Sampling Approach: Probability samples generally require much more information than non-probability samples; key distinction is between descriptive surveys versus analytic surveys

53 Probability Sampling and Populations
Target Population: The population the survey was designed to cover
Sample Frame: The list or set of procedures from which respondents are selected; multiple frames increasingly used
Sampling Methods: Stratified; multi-stage sampling

54 Why do people talk about survey weights?

55 Non-probability Samples
Probability Samples: Required for strict population generalization; require a well-defined and well documented sample frame; each member of population has known non-zero chance of selection; must be tractable (i.e. well-documented and replicable)
Non-probability Samples: Generally only able to generalize to itself; possibly useful for model-based inference or experiments; convenience samples; availability samples; opt-in internet samples

56 Sample from a codebook…..
three more pages……

57 Main reasons survey data is weighted
Sampling: Disproportionate stratification; disproportionate probabilities of selection
Nonresponse: Sometimes called post-stratification; adjusts observed data to population estimates; usually adjusts for coverage as well
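
The adjustment of observed data to population estimates can be illustrated with a minimal post-stratification sketch. It assumes a single adjustment variable and known population shares; the group names and numbers are invented for the example, and real weighting schemes usually combine several variables and a base weight for selection probability.

```python
from collections import Counter


def poststrat_weights(sample_groups, population_shares):
    """Weight for each group = population share / share observed in sample."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return {
        group: population_shares[group] / (counts[group] / n)
        for group in counts
    }


# Hypothetical sample: young respondents are under-represented.
sample = ["young"] * 2 + ["old"] * 8
population = {"young": 0.4, "old": 0.6}  # assumed known population shares

weights = poststrat_weights(sample, population)
for group in sorted(weights):
    print(group, round(weights[group], 3))

# After weighting, the sample reproduces the population shares.
total = sum(weights[g] for g in sample)
young_share = sum(weights[g] for g in sample if g == "young") / total
print("Weighted share young:", round(young_share, 3))
```

Groups that are scarce in the sample relative to the population receive weights above 1, and over-represented groups receive weights below 1.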

58 Other types of weights…
Different units of observation versus analysis Different unit of measurement and analysis To expand data to a larger population Adjust analytic importance of different groups

59 How to analyze weights At a minimum, adjust for weights to make population estimates More complex approaches required to adjust variance
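
As a rough illustration of the first point, the sketch below compares an unweighted and a weighted estimate from the same hypothetical data. It also computes Kish's approximate effective sample size, included here only as a quick indicator of how much unequal weights cost in precision; it is a swapped-in approximation, not a substitute for variance estimation that accounts for the full sample design.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


def kish_effective_n(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Hypothetical data: 1 = holds some opinion, 0 = does not.
# The first two respondents belong to an under-represented group.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
w = [2.0, 2.0] + [0.75] * 8

unweighted = sum(y) / len(y)
weighted = weighted_mean(y, w)
n_eff = kish_effective_n(w)

print("Unweighted estimate:", unweighted)
print("Weighted estimate:", weighted)
print("Nominal n:", len(y), "| approximate effective n:", n_eff)
```

The point estimate moves once the weights are applied, and the effective sample size falls below the nominal one, which is why standard errors computed as if the data were a simple random sample tend to be too small.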

60 Complex survey analysis
Procedures to adjust for weights, stratification, clustering, and other sample design effects
Procedures available in: STATA: svy commands; R: Package ‘survey’; SPSS: Complex samples; SUDAAN; SAS: SURVEY procs

61 Harvard Program on Survey Research
For more information…. Survey Advising Harvard Program on Survey Research Chase H. Harrison, Ph.D.

