Deanna Kruszon-Moran, MS

NHANES Analytic Strategies
Deanna Kruszon-Moran, MS
Centers for Disease Control and Prevention
National Center for Health Statistics

Analyzing Data from NHANES 1999-2004: Preparing your data files
Downloading: Documentation files are now in Adobe PDF format and can be viewed or accessed directly via the web link. For the first two cycles, documentation, codebook and data frequencies were provided in 3 separate files; in later cycles they can be found in one file. Each survey cycle includes demographic, questionnaire, exam and lab files. Clicking on the data link will allow you to store the data file or open it directly with SAS. Data files are in SAS transport (.xpt) format.

Know your data
Read the documentation!! Read the documentation!!

Preparing your data files
Merging: Merge all files by sequence number to the demographic file. Verify the numbers of records merged and the final sample number against the published frequencies on the web. Be sure they are what you expected and all merges worked correctly.
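
The merge-and-verify step above can be sketched in Python. This is an illustration only, not the SAS merge the slides have in mind: it assumes each file has already been read into a list of dictionaries, that the sequence number variable is named SEQN, and that BMXBMI stands in for any exam variable. The function name merge_on_seqn and the toy records are hypothetical.

```python
def merge_on_seqn(demo, component):
    """Left-merge a component file onto the demographic file by SEQN.

    Returns the merged records and the number of demographic records
    that found a match, so the count can be checked against the
    published frequencies.
    """
    by_seqn = {row["SEQN"]: row for row in component}
    merged = []
    matched = 0
    for row in demo:
        extra = by_seqn.get(row["SEQN"])
        if extra is not None:
            matched += 1
        # Keep every demographic record, matched or not
        merged.append({**row, **(extra or {})})
    return merged, matched


# Toy data (hypothetical values)
demo = [
    {"SEQN": 1, "RIAGENDR": 2},
    {"SEQN": 2, "RIAGENDR": 1},
    {"SEQN": 3, "RIAGENDR": 2},
]
exam = [
    {"SEQN": 1, "BMXBMI": 24.5},
    {"SEQN": 3, "BMXBMI": 31.0},
]

merged, matched = merge_on_seqn(demo, exam)
print(len(merged), "records after merge;", matched, "matched the exam file")
```

Comparing the two printed counts with the published frequencies is the check the slide asks for: the merged file should keep every demographic record, and the matched count should equal the number of records in the component file.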

Know your data
Run basic frequencies and cross tabulations. Know your target population. Understand how each item was measured (how the item is defined, topcoded, recoded). Recode variables as necessary (for example: age groups, positive/negative lab tests, high/low BP, high/low cholesterol). Recode unknown/refused responses as missing data (for example, recode 77 and 99 to missing). Check your coding: run frequencies in SAS.
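
A small Python sketch of the recode-then-check pattern described above. The slides do this in SAS; the function names, the age cut points and the toy values here are assumptions for illustration. Which codes mean "refused" or "unknown" differs by item, so the codes must come from the documentation for each variable.

```python
from collections import Counter


def recode_missing(value, missing_codes=(77, 99)):
    """Return None when the value is one of the item's refused/unknown codes.

    Apply this only to items whose documentation lists these codes;
    for a variable such as age, 77 is a valid value.
    """
    return None if value in missing_codes else value


def age_group(age):
    """Assign an age group (cut points are hypothetical)."""
    if age is None:
        return None
    if age < 20:
        return "<20"
    if age < 40:
        return "20-39"
    if age < 60:
        return "40-59"
    return "60+"


# Toy questionnaire item with refused (77) and unknown (99) codes
raw_item = [1, 2, 77, 1, 99, 2, 2]
recoded = [recode_missing(v) for v in raw_item]

# Check the coding by comparing frequencies before and after
print(Counter(raw_item))
print(Counter(recoded))
print(Counter(age_group(a) for a in [12, 25, 47, 77, 60]))
```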

Know your data: Continuous outcome data
Look for outliers in your measure: run Proc Univariate. Look for outliers among the weights: use Proc Univariate on the weight variable. Outlying values, especially those with large weights, can strongly influence your estimates. Look at normality. Consider transformations: log, square root, power.
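
As a rough Python counterpart to the checks above, the sketch below flags extreme values and applies a log transformation. It is not what Proc Univariate reports: the 1.5 x IQR fence is a swapped-in rule of thumb, and the function names and toy weights are assumptions for illustration.

```python
import math
import statistics


def flag_outliers(values, k=1.5):
    """Return values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def log_transform(values):
    """Natural log of each value; values must be strictly positive."""
    return [math.log(v) for v in values]


# Toy sample weights with one very large weight (hypothetical values)
weights = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print("Flagged weights:", flag_outliers(weights))
print("Largest weight on the log scale:", max(log_transform(weights)))
```

The same two functions can be run on the outcome measure and on the weight variable, mirroring the two Proc Univariate runs the slide recommends.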

Analyzing within NHANES 1999-2004
Things to consider: Data are released in two-year cycles. We STRONGLY RECOMMEND using two or more cycles (4 or more years) to produce reliable estimates. Verify that the data items collected were comparable in wording and methods. When combining years, remember to use the correct combined weights.

Analyzing trends with NHANES NHANES III to NHANES 1999-2004
Things to consider: What is your target sample from each survey (for example, age range)? How differently was the question worded, and how different were the interview methods? How different were the lab or exam methodologies? Cutoffs used? Definitions? For current NHANES, sample sizes may be smaller depending on the number of years measured, especially in subdomains.

NHANES Sample Design
NHANES is a complex, multistage, probability cluster design of the civilian, noninstitutionalized US population.

Sample Weights
To analyze NHANES data you must use the sample weights to account for:

1. The base probability of selection
Stage 1: Counties
Stage 2: Segments
Stage 3: Households
Stage 4: Individuals

2. Oversampling
NHANES 1999-2004 oversampled:
African Americans
Mexican Americans
Persons with low income
Adolescents aged 12-19
Persons aged 60+

3. Non-response to the interview and exam
Sample persons age 20+:
Screening interview: N=13312
Household interview: N=10291 (78%); interview non-response 22%
MEC exam: N=9471 (71%); exam non-response 7%

Non-response issues for NHANES
Most components have some level of individual item or component non-response. ONLY non-response to the interview and exam has already been accounted for in the weights. All additional non-response to the outcome measure of interest should be examined against all possible predictors. Potential biases should be discussed. If non-response is “high”, re-weighting should be considered.

Why weight?
Sample subdomain: % US population / % sample unweighted / % sample weighted
Non-Hispanic Blacks: 13% / 25% / 12%
Mexican Americans: 9% / 28%
12-19 year olds: 24%

Sample weights: Which weights?
Weight variables to use (household interview data ONLY / ANY data from exam, lab or MEC interview):
Any 2 years of data: WTINT2YR / WTMEC2YR
4 years of data *: WTINT4YR / WTMEC4YR
4 or 6 years of data: combine the appropriate 2 or 4 year weights as follows:

Two, Four, Six, Eight - How can we estimate?
For 4 years of data:
    MEC4YR = 1/2 * WTMEC2YR ;
For 6 years of data:
    if sddsrvyr=1 or sddsrvyr=2 then MEC6YR = 2/3 * WTMEC4YR ;
    if sddsrvyr=3 then MEC6YR = 1/3 * WTMEC2YR ;
* When analyzing only the years covered by the 4 year weights provided, do not combine 2 year weights; use the 4 year weights provided.

Two, Four, Six, Eight - How can we estimate?
Future years of data will be combined similarly:
For 6 years of data:
    if sddsrvyr in (2,3,4) then MEC6YR = 1/3 * WTMEC2YR ;
For 8 years of data:
    if sddsrvyr=1 or sddsrvyr=2 then MEC8YR = 1/2 * WTMEC4YR ;
    if sddsrvyr=3 or sddsrvyr=4 then MEC8YR = 1/4 * WTMEC2YR ;
etc.
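
The combination rules on these two slides follow one pattern, sketched here in Python rather than SAS. With k two-year cycles combined, cycles 1 and 2 use 2/k times the 4 year weight when both are included, and every other cycle uses 1/k times its 2 year weight. The function name, the lowercase field names and the toy weights are assumptions for illustration; the generalization beyond the cases shown on the slides is an assumption as well.

```python
def combined_mec_weight(rec, cycles):
    """Combined MEC exam weight for one record.

    rec    -- dict with 'sddsrvyr', 'wtmec2yr' and (for cycles 1-2) 'wtmec4yr'
    cycles -- tuple of the sddsrvyr values being combined, e.g. (1, 2, 3)
    """
    k = len(cycles)
    cycle = rec["sddsrvyr"]
    if cycle not in cycles:
        raise ValueError("record is not in the cycles being combined")
    # Cycles 1 and 2 together are covered by the 4 year weight
    if 1 in cycles and 2 in cycles and cycle in (1, 2):
        return 2 / k * rec["wtmec4yr"]
    return 1 / k * rec["wtmec2yr"]


# Toy records (hypothetical weights)
first_cycle = {"sddsrvyr": 1, "wtmec2yr": 1800.0, "wtmec4yr": 900.0}
third_cycle = {"sddsrvyr": 3, "wtmec2yr": 900.0}

print(combined_mec_weight(first_cycle, (1, 2, 3)))     # 6 years: 2/3 * WTMEC4YR
print(combined_mec_weight(third_cycle, (1, 2, 3)))     # 6 years: 1/3 * WTMEC2YR
print(combined_mec_weight(first_cycle, (1, 2, 3, 4)))  # 8 years: 1/2 * WTMEC4YR
```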

Sample Weights - Subsamples
Subsamples and appropriate weights: Look at your primary variable of interest and the corresponding weight. Look at all other variables you want to combine with it. Are all from the interview? Exam? Subsample (e.g. fasting, audiometry, dioxin, VOCs)? Use the weight from the smallest subsample for your analysis. Be consistent!

Preparing for Analyses
Subsetting the data for SUDAAN: If using MEC exam weights - SUBSET the data on those MEC EXAMINED in SAS before using SUDAAN. If using other subsample weights – subset the data on those in the subsample corresponding to the weights you are using. Then use the SUBPOPN statement in the SUDAAN procedure to further subset your data by age, gender etc. to reflect the target population you are interested in analyzing.

NHANES 1999-2000 Variance Estimation
Why must you use the sample design to estimate the variance? NHANES is a cluster design. Individuals within a cluster are more similar to each other than to those in other clusters. This homogeneity, or clustering, reduces our effective sample size because we choose individuals within clusters rather than randomly throughout the population.

Why must you use the sample design to estimate the variance? Variance estimates that do not account for this intra-cluster correlation are too low and biased. USE survey software such as SUDAAN or the SAS survey procedures to account for the complex design and produce unbiased variance estimates. These procedures require information on the sample design (i.e. identification of the PSU and strata) for each sample person.

We recommend using the Taylor series (linearization) method, the same as that used in NHANES III. We provide “Masked Variance Units” (MVUs) in place of primary sampling units (PSUs) to maintain confidentiality. The design variables are called SDMVSTRA and SDMVPSU.

Design Variables SDMVSTRA and SDMVPSU
Found in the demographic file. Found in all two year data sets and can be combined for 4 or 6 or more year data sets. Can be used the same as the actual stratum and PSU variables. Produce variance estimates close to those using the “true” design. Data MUST be sorted by SDMVSTRA and SDMVPSU first, before using SUDAAN.
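
To show how the stratum and PSU variables enter the calculation, here is a minimal Python sketch of a Taylor series (linearization) standard error for a weighted mean, assuming a with-replacement design at the PSU level. It is an illustration of the idea, not the SUDAAN or SAS implementation; the function name, the record layout and the toy data are assumptions, and real analyses should use survey software.

```python
from collections import defaultdict


def weighted_mean_with_se(records):
    """Weighted mean and linearized standard error.

    Each record is a dict with 'stratum', 'psu', 'w' (weight) and 'y'.
    """
    total_w = sum(r["w"] for r in records)
    mean = sum(r["w"] * r["y"] for r in records) / total_w

    # Linearized value for each person, summed to PSU totals within strata
    psu_totals = defaultdict(lambda: defaultdict(float))
    for r in records:
        z = r["w"] * (r["y"] - mean) / total_w
        psu_totals[r["stratum"]][r["psu"]] += z

    # Between-PSU variance within each stratum, added across strata
    variance = 0.0
    for psus in psu_totals.values():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(psus.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in psus.values())
    return mean, variance ** 0.5


# Toy data: 2 strata with 2 PSUs each (hypothetical values)
sample = [
    {"stratum": 1, "psu": 1, "w": 1.0, "y": 1.0},
    {"stratum": 1, "psu": 2, "w": 1.0, "y": 0.0},
    {"stratum": 2, "psu": 1, "w": 1.0, "y": 1.0},
    {"stratum": 2, "psu": 2, "w": 1.0, "y": 1.0},
]
mean, se = weighted_mean_with_se(sample)
print(f"mean={mean:.3f} se={se:.3f}")
```

Only the spread between PSU totals within each stratum contributes to the variance, which is why every sample person needs both design variables.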

Using Sample Weights and Sample Design
Example: You are interested in examining the association of body mass index (BMI) with high triglycerides, stratifying on race/ethnicity, for females age 20-59, using 6 years of data.

Sample Weights: Step 1
Determine the smallest sample population for the analysis to determine the correct weight to use. Race/ethnicity, gender and age are in the interview. Weight used to calculate BMI comes from the MEC exam, a subset of those interviewed. Triglycerides were measured on a subsample of those MEC examined who fasted for 8 hours and came to the AM MEC exam. Therefore, the fasting subsample is the smallest subsample in the analysis and you would use the AM fasting weights (WTSAF2YR and WTSAF4YR).

Sample Weights: Step 2
Combine weights in SAS prior to the SUDAAN procedure for the 6 years of data:
    If sddsrvyr in (1,2) then WEIGHT6 = 2/3*WTSAF4YR ;
    If sddsrvyr=3 then WEIGHT6 = 1/3*WTSAF2YR ;

Sample Weights: Step 3
Subset your data set in SAS to reflect the weight being used (AM fasting weights WTSAF2YR or WTSAF4YR).
SAS code:
    IF WTSAF2YR ne . or WTSAF4YR ne . ;

Sample Weights: Step 4
Last, specify the correct weight using the WEIGHT statement in SUDAAN, and subset your data to obtain the subpopulation of interest (females age 20-59) using the SUBPOPN statement in SUDAAN:
    WEIGHT WEIGHT6 ;
    SUBPOPN riagendr=2 and ridageyr > 19 and ridageyr < 60 ;
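
Steps 2 to 4 can be sketched together in Python. This is an illustration of the logic, not the SAS and SUDAAN code: records without a fasting weight are dropped, because the weight defines the subsample, while the subpopulation of interest is only flagged, which is the role SUBPOPN plays. Function names, lowercase field names and the toy records are assumptions, and a missing weight is represented by None.

```python
def fasting_weight_6yr(rec):
    """Combined 6 year fasting weight, or None if no fasting weight exists."""
    cycle = rec["sddsrvyr"]
    if cycle in (1, 2):
        w4 = rec.get("wtsaf4yr")
        return None if w4 is None else 2 / 3 * w4
    if cycle == 3:
        w2 = rec.get("wtsaf2yr")
        return None if w2 is None else 1 / 3 * w2
    return None


def prepare(records):
    """Subset to the fasting subsample and flag females age 20-59."""
    prepared = []
    for rec in records:
        weight6 = fasting_weight_6yr(rec)
        if weight6 is None:
            continue  # not in the fasting subsample: drop before analysis
        in_subpop = rec["riagendr"] == 2 and 19 < rec["ridageyr"] < 60
        # Keep the record either way; the flag marks the subpopulation
        prepared.append({**rec, "weight6": weight6, "in_subpop": in_subpop})
    return prepared


# Toy records (hypothetical values)
people = [
    {"sddsrvyr": 1, "wtsaf4yr": 3000.0, "riagendr": 2, "ridageyr": 35},
    {"sddsrvyr": 3, "wtsaf2yr": 6000.0, "riagendr": 1, "ridageyr": 40},
    {"sddsrvyr": 2, "wtsaf4yr": None, "riagendr": 2, "ridageyr": 50},
]
for row in prepare(people):
    print(row["sddsrvyr"], round(row["weight6"], 1), row["in_subpop"])
```

Keeping the out-of-subpopulation records in the file, rather than deleting them, leaves the full set of strata and PSUs available for variance estimation.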

Sample Code
Step 1 in SAS:
    If WTSAF2YR NE . OR WTSAF4YR NE . ;   /* include only those with AM fasting weights */
    PROC SORT OUT=Datasort ;
    BY SDMVSTRA SDMVPSU ;                 /* sort on design variables */

Sample SUDAAN Code
Step 2 - SUDAAN code:
    PROC Descript DATA=Datasort DESIGN=WR ;
    NEST SDMVSTRA SDMVPSU ;
    WEIGHT WEIGHT6 ;
    VAR HI_TRI ;
    SUBPOPN RIAGENDR=2 AND RIDAGEYR > 19 AND RIDAGEYR < 60 ;
    SUBGROUP Raceth1 HI_BMI ;
    LEVELS 4 2 ;
    TABLES Raceth1*HI_BMI ;

Sample code for SAS SURVEYMEANS Procedure
SAS code (do not use WHERE or BY in the procedure):
    If RIDAGEYR > 19 AND RIDAGEYR < 50 AND RIAGENDR=2 then INCLSP=1 ;
    else INCLSP=2 ;
    PROC Surveymeans data=data ;
    Strata SDMVSTRA ;
    Cluster SDMVPSU ;
    Weight WTMEC2YR ;
    Var Hi_TRI ;
    Domain INCLSP INCLSP*RACETH1 INCLSP*HI_BMI INCLSP*RACETH1*HI_BMI ;

Analyzing data from NHANES 1999-2004
Crude versus Age Standardized Estimates: Age distributions within survey samples vary by racial/ethnic group. Age distributions also vary by survey (NHANES III vs. NHANES 1999-2004). When comparing estimates across racial/ethnic groups or between surveys you may need to age standardize. Also present all age specific estimates!

When Age Standardizing: Use the 2000 U.S. Census Population for consistency for both NHANES III and all current NHANES cycles. For guidelines and population proportions, see the Klein and Schoenborn HP2010 Statistical Notes on “Age Adjustment using the 2000 Projected U.S. Population”.

When Age Standardizing: In SUDAAN, use the STDVAR and STDWGT statements.
STDVAR: variable name for the age groups.
STDWGT: corresponding proportion of the 2000 U.S. Census population for each age subgroup.
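
The arithmetic behind direct age standardization can be sketched in Python: the standardized estimate is the sum of the age specific estimates, each multiplied by the standard population proportion for that age group. The age groups, prevalences and proportions below are made-up illustration values, not the 2000 U.S. standard population figures, and the function name is hypothetical.

```python
def age_standardized(rates_by_age, std_proportions):
    """Directly standardized rate: sum over age groups of rate * proportion."""
    if set(rates_by_age) != set(std_proportions):
        raise ValueError("age groups must match the standard population groups")
    total = sum(std_proportions.values())
    # Dividing by the total lets the proportions be given unnormalized
    return sum(rates_by_age[g] * std_proportions[g] for g in rates_by_age) / total


# Hypothetical age specific prevalences and age distributions
prevalence = {"20-39": 0.02, "40-59": 0.05, "60+": 0.10}
sample_age_dist = {"20-39": 0.2, "40-59": 0.3, "60+": 0.5}    # older sample
standard_age_dist = {"20-39": 0.5, "40-59": 0.3, "60+": 0.2}  # standard population

crude = age_standardized(prevalence, sample_age_dist)
standardized = age_standardized(prevalence, standard_age_dist)
print(f"crude={crude:.3f} standardized={standardized:.3f}")
```

In this toy case the crude estimate is higher only because the sample is older than the standard population, which is the kind of difference age standardization removes.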

Age standardization for NHANES
Crude vs. Age Standardized Estimates
Example: Hepatitis B, NHANES III
                     Non-Hispanic White   Non-Hispanic Black   Mexican American
Crude prevalence            3.1                 11.9                 3.6
Age standardized            2.6                 11.9                 4.4

Other data analysis issues from NHANES
Calculating Population Totals: Estimates of the number of persons in the U.S. population with a particular condition must be done carefully. The recommended procedure is to first estimate the proportion with the condition for each subdomain of interest (e.g. age, race, gender), then multiply that proportion by the population control total for that subdomain. Tables are available on the NCHS web site with the current March 2001 CPS control totals as part of the analytic guidelines.
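
The recommended procedure amounts to one multiplication per subdomain, sketched here in Python. The subdomain labels, proportions and control totals are invented for illustration; real control totals come from the NCHS tables mentioned above. The function name is hypothetical.

```python
def population_totals(proportions, control_totals):
    """Estimated number of persons with the condition in each subdomain."""
    missing = set(proportions) - set(control_totals)
    if missing:
        raise ValueError(f"no control total for: {sorted(missing)}")
    return {d: proportions[d] * control_totals[d] for d in proportions}


# Hypothetical weighted proportions and control totals by subdomain
proportion_with_condition = {"females 20-39": 0.10, "females 40-59": 0.05}
control_total = {"females 20-39": 1_000_000, "females 40-59": 2_000_000}

totals = population_totals(proportion_with_condition, control_total)
for domain, n in totals.items():
    print(domain, round(n))
print("all subdomains:", round(sum(totals.values())))
```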

Analyzing Data from NHANES 1999-2004
Analytic Guidelines: Detailed guidelines for working with NHANES data are available on the NCHS web site. A web based tutorial is also currently available for NHANES and NHANES III.
