Working with the ECLS-B Datasets Weights and other issues. Information is courtesy of the Institute of Educational Sciences, National Center for Education Statistics and is used in their training seminars.
Sampling Weights What are sampling weights and why are they important? How are weights used? What weights are on the ECLS-B data files and when should they be used?
What is a “Weight” ? A weight is used to indicate the relative strength of an observation. In the simplest case, each observation is counted equally. For example, if we have five observations, and wish to calculate the mean, we just add up the values and divide by 5.
How are Weights Used? Dataset with 5 cases. Value 4 2 1 5 2 Sample mean (4+2+1+5+2) = 2.8 Weighted mean (4*1) + (2*2) + (1*4) + (5*1) + (2*2)/sum of weights = (4 + 4 + 4 + 5 + 4)/10 = 2.1
What is the Difference Between Weighted and Unweighted Data? With unweighted data, each case is counted equally. Unweighted data represent only those in the sample who provide data. With weighted data, each case is counted relative to its representation in the population. Weights allow analyses that represent the target population.
ECLS-B and Weights The ECLS-B is a sample, i.e. the entire population was not surveyed. The ECLS-B is not a simple random sample (SRS). That is, not all children had an equal probability of selection. Not all sampled children, mothers, and fathers participated.
Why Use Weights in the ECLS-B? Weights compensate for collecting data from a sample rather than the entire population and for using a complex sample design. Weights adjust for differential selection probabilities and reduce bias associated with non-response by adjusting for differential nonresponse. Weights are used when estimating characteristics of the population.
Examples of Weighted vs. Unweighted Data Race/Ethnicity Unweighted % Weighted % White, non-Hispanic 42 54 Black, non-Hispanic 16 14 Hispanic 21 26 Asian/Pacific Islander 12 3 Native American < 1 Multiple race, non-Hispanic 7 4
Examples of Weighted vs. Unweighted Data Birth Weight Characteristics Unweighted % Weighted % Average, in pounds 6.46 7.31 Normal birth weight 74 93 Moderately low birth weight 15 6 Very low birth weight 11 1
2 Year Weights (No father data) Use W2R0 if using 2-year parent interview data (with or without 9-month parent interview data. Use W2C0 if using 2 year child assessment data (with or without 9-month child assessment data) with or without any parent interview data.
2 Year Weights (with father data) Use W2F0 if using 2-year father data AND 9-month father data (resident and/or non-resident) with or without any parent interview data. Use W2FC0 if using 2 year father data, resident and/or non-resident (with or without 9-month father data), with at least one round of child assessment data, with or without any parent interview data.
2 Year Weights (with father data) Use W22F0 if using 2-year father data (without 9-month father data) with or without any parent interview data. Use W2C1F0 if using 9-month father data in combination with any parent interview data and any child assessment data.
2 Year Weights (with CCP data) Use W2C2J0 if using 2-year Child Care Provider (CCP) data, with at least one round of child assessment data, with or without any parent interview data. Use W22J0 if using 2-year Child Care Provider (CCP) data alone, or with any parent interview data.
2 Year Weights (with CCO data) Use W2C2P0 if using 2-year Child Care Observation (CCO) data, with at least one round of child assessment data, with or without any parent interview data or CCP data. Use W22P0 if using 2-year Child Care Observation (CCO) data alone, or with any parent interview data or CCP data.
How to Use Weights In SAS, use the “WEIGHT” statement. In SPSS, use the “WEIGHT BY” statement. Key Fact: All ECLS-B weights sum to population totals.
Weights in SAS SAS uses the WEIGHT statement in various PROCedures. PROC FREQ data = test; Tables Age Gender Score; Weight weightvar; Run;
Weights in SPSS LIST VARIABLES = age to weightvar. Frequencies variables = age, score /sta=default. weight by weightvar. frequencies variables = age, score /sta=default.
Weights in STATA clear use “c:\temp\test1.dta" tabulate score age gender [pweight=weightvar]
Weights for HLM Users HLM users can use the base weight, W1BASEWT, for longitudinal analyses where occasions are nested within children. Twins can be nested within families in a two-level model. Each twin has their own weight to be used at level one.
Summary about Weights Weights should be used when analyzing data from the ECLS-B. Parent and father weights sum to the number of children, not the number of parents or fathers. The appropriate weight should be selected based on the research question (which determines the data source).
Variance, Calculating Standard Errors Why are standard errors important? Why not use standard errors that assume a simple random sample (SRS)? How to use “exact” methods for estimating standard errors. How to use approximation methods for estimating standard errors.
Why are Standard Errors Important? Standard errors are produced for estimates from sample surveys. They are a measure of the variance in the estimates associated with the selected sample being one of many possible samples. Standard errors are used to test hypotheses and to study group differences. Using inaccurate standard errors can lead to identification of statistically significant results where none are present and vice versa.
Important Considerations The ECLS-B sample is only one of many samples of infants born in 2001 that could have been selected. The ECLS-B has a complex sample design and is not a simple random sample. All weights on the ECLS-B data files sum to population total and not sample totals.
The ECLS-B Sample Design The ECLS-B has a complex sample design. Children are clustered within primary sample units (PSUs). Low birth weight and very low birth weight children are oversampled. Chinese American, Asian American, and American Indian/Native Alaskan children are oversampled. Twins are oversampled.
The ECLS-B Sample Design: Clustering Sample children were clustered within primary sampling units (PSUs) to reduce field costs. Children were in closer geographical proximity than would occur in a simple random sample. Children in a clustered sample tend to be more alike than those in a simple random sample.
Complex Samples and Standard Errors The usual standard error formula assumes a simple random sample. Standard errors for estimates from a complex sample must account for the within cluster/across cluster variation. Special software can make the adjustment, or this adjustment can be approximated using the design effect.
Options Exact Methods such as the TAYLOR series and REPLICATION techniques. Approximation Method
Exact Methods Taylor series Extract PSU and strata Ids from data file. Software available: SUDAAN, STATA (using SVY commands), and SAS (using PROC SURVEY commands).
Exact Methods Replication Techniques Extract replication weights (90 of them). ECLS-B replication weights use jackknife 2 (JK2) methods. Software: WESVAR replication series (JK2), AM (JK2), and SAS callable SUDAAN.
Approximation Method Two stages: First, normalize weights so standard error is based on actual sample size rather than population size. Then, use design effect (DEFF) to account for complex sampling design.
1) Normalizing Weights * Weights on the ECLS-B sum to the population totals. Calculate a new weight that sums to the sample size. Normalized weights = (ECLS-B weight) * (sample n/population N). *SAS users do not need this step since estimates are produced based on the actual sample size.
Example – Normalizing Weights Weight to be normalized: W1R0 Sum of weights: 3,997,169 Total number of cases with a positive weight: 10,688 Normalized weight = W1R0 * (10,688 / 3,997,169)
2) Adjusting for Complex Design The ECLS-B has a complex sample design; it is not a simple random sample. Software packages designed for simple random samples tend to underestimate the standard errors for complex sample designs. Special methods are required for complex designs.
Using Design Effects (DEFF) What is a design effect (DEFF)? It’s the ratio of the variance found in actual (complex) sample design to the variance expected in a simple random sample of the same sample size.
Using Design Effects (DEFF) DEFT = the square root of DEFF = (Design standard error/ simple random sample error). Example for X1NCATS = Total Child Score SE (SRS) = 0.029 SE (Design) = 0.048 DEFF = 0.0482/ 0.0292 = 2.707 DEFT = 0.048/ 0.029 = square root of 2.707 = 1.645
3 Ways of Using the DEFF Multiply the SRS (simple random sample) standard error produced by statistical software (when using normalized weights) by the square root of the DEFF (DEFT). Or Adjust the t-statistic by dividing it by the square root of the design effect (DEFT) or adjust the F-statistic by dividing it by the DEFF. Adjust the weight such that an adjusted standard error is produced.
Using a DEFF- Adjusted Weight First step, create a weight that sums to the sample size (normalized weight. Second, divide this normalized weight by the DEFF. Third, use this weight for analyses. The standard errors produced will approximate the standard errors obtained using “exact” methods.
Where to find ECLS-B DEFF’s Training material: “ECLS-B Specifications for Computing Standard Errors” 9-month User’s Manual: Exhibit 4-5 and table 4-6. 2 year User’s Manual: Exhibit 4-10 and tables 4-10 and 4-26.
For SAS Users SAS base procedures such as PROC REG, PROC FREQ, and PROC MEANS do account for the actual sample size but not for complex sampling. SAS procedures such as PROC SURVEYMEAN and PROC SURVEYREG (and other procedures that begin with “Survey”) use the Taylor series method to account for complex sampling and provide exact estimates of the standard errors.
PROC SURVEYREG Example Example using ECLS-K data, spring kindergarten and spring first grade variables. proc surveyreg data = fscores; model c4r3mscl = c2r3mscl lowkread t4learn; cluster c24cstr; strata c24cpsu; weight c24cw0; where lowkmath = 0; run;
PROC SURVEYLOGISTIC Example Example using ECLS-K data, spring kindergarten and spring first grade variables. proc surveylogistic data = fscores; model lowkread (desc) = c2r3mscl t4learn; cluster c24cstr; strata c24cpsu; weight c24cw0; where lowkmath = 0; run;
PROC SURVEYFREQ Example Example using ECLS-K data, spring kindergarten and spring first grade variables. proc surveyfreq data = fscores; tables lowkread c2r3mscl t4learn; cluster c24cstr; strata c24cpsu; weight c24cw0; run;
STATA Code for Complex Design Logistic Regression Example, 3rd Grade Data Svyset [pweight=C5CW0], strata (C5TCWSTR) psu (C5CWPSU) Svy, subpop (male) : logit highbmi white
STATA Code for Complex Design Regression Example, 3rd Grade Data Svyset [pweight=C5CW0], strata (C5TCWSTR) psu (C5CWPSU) Svy, subpop (male) : reg highbmi white
STATA Code for Complex Design Means Example, 3rd Grade Data Svyset [pweight=C5CW0], strata (C5TCWSTR) psu (C5CWPSU) Svy, subpop (male) : mean highbmi female
SPSS for Complex Sample Design Use add-on to SPSS called, SPSS Complex Samples™ Complex Samples Logistic Regression (CSLOGISTIC)—Performs binary logistic regression analysis, as well as multiple logistic regression (MLR) analysis, for samples drawn by complex sampling methods. The procedure estimates variances by taking into account the sample design used to select the sample, including equal probability and PPS methods, and WR and WOR sampling procedures. Optionally, CSLOGISTIC performs analyses for subpopulations. Courtesy of SPSS
Regression Analysis Use appropriate software such as AM, WESVAR, SUDAAN or SAS (SURVEYREG procedure). For SAS (PROC REG procedure), use DEFF-adjusted weights. For SPSS, use normalized, DEFF-adjusted weights.
Summary All statistical tests should be based on standard errors that are calculated to account for the complex sample design of the ECLS-B. Preferred: Use software that incorporates JK2 replication methods, or Use software that incorporates Taylor series method, or Last resort: Make approximate adjustments based on design effects.
ECLS-B Data Availability 4 Year (Pre-School) now available. Kindergarten wave will be split into 2 groups, as the children of the ECLS-B enter kindergarten over a 2 year period.