1
Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information
Donsig Jang, Xiaojing Lin, Amang Sukasih (Mathematica Policy Research, Inc.)
Steve Cohen, Kelly Kang (National Science Foundation)
ITSEW 2008, Research Triangle Park, NC, June 2, 2008
2
Disclaimer
The opinions and assertions are those of the authors and do not reflect the views or policies of the National Science Foundation.
3
Survey Data Collection
Involves many complex processes, including
–Sampling frame construction
–Sample selection
–Data collection
–Data processing
–Estimation
Each process is subject to error
Attempt to decompose the total survey error into the separate stages of the process
4
Total Survey Errors
[Diagram] Elements: Sampling Frame, Parameter, Estimator, Sample, Respondent, Data
Error sources: Misclassification error, Coverage error, Sampling error, Nonresponse error, Measurement error, Estimation error
5
Misclassification Error in Stratification
Focus of this talk
A part of non-sampling error
Important but often overlooked component
6
Stratification in Sampling
Enhance precision of survey estimates
Precision requirements for analytic domains
Often imperfect information on stratification variables
–Misclassification in stratification
–Trade-off: cost to gather stratification information at the frame construction vs. optimal sample allocation
–Loss of effective sample sizes for some analytic domains
7
Misclassification Matrix
True classification: A
Stratification classification: A*
Matrix entries: the proportion of units classified as category j among units in true category k [formulas not recovered from the slide]
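The formula on this slide is not legible in the transcript. One common way to write such a matrix, in the spirit of Kuha and Skinner (1997), is sketched below. The symbol names \theta_{jk} and \Theta are assumptions made here for illustration; only A, A*, j, and k appear in the slide text.

```latex
\theta_{jk} = \Pr(A^{*} = j \mid A = k),
\qquad
\Theta = [\theta_{jk}],
\qquad
\sum_{j} \theta_{jk} = 1 \quad \text{for each true category } k .
```

Under this convention each column of \Theta describes how the units of one true category are spread across the stratification categories, and \Theta is the identity matrix when there is no misclassification.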
8
Measures for Misclassification Effects
Bias
Effective sample size change
9
Bias Due to Misclassification
[Bias formula not recovered from the slide]
Terms defined on the slide:
–true population proportions
–sample proportions, where s denotes the sample, w_i the sampling weight for unit i, and I(.) the indicator function
–identity matrix
Kuha and Skinner 1997
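The bias expression itself is not legible in this transcript. A reconstruction that is consistent with the surviving definitions (weighted sample proportions, true population proportions, identity matrix) and with the matrix approach of Kuha and Skinner (1997) is given below. It is an assumption, not the authors' confirmed formula. Here \Theta denotes a misclassification matrix whose (j, k) entry is the proportion of true-category-k units classified as j, \mathbf{P} the true population proportions, and \hat{\mathbf{p}}^{*} the weighted sample proportions based on the stratification classification A*.

```latex
\hat{p}^{*}_{j} = \frac{\sum_{i \in s} w_i \, I(A^{*}_i = j)}{\sum_{i \in s} w_i},
\qquad
E(\hat{\mathbf{p}}^{*}) \approx \mathbf{P}^{*} = \Theta \mathbf{P},
\qquad
\mathrm{Bias}(\hat{\mathbf{p}}^{*}) \approx \Theta \mathbf{P} - \mathbf{P} = (\Theta - \mathbf{I})\,\mathbf{P}.
```

With no misclassification \Theta = \mathbf{I} and the bias vanishes; the components of the bias always sum to zero because both sets of proportions sum to one.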
10
Bias Estimation
If the true classification is available from the sample: [estimator formula not recovered from the slide]
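When both classifications are observed for every sampled unit, the misclassification proportions and the resulting bias can be estimated with weighted counts. The sketch below is a minimal illustration under that assumption. The function name, the integer coding of categories (0 to n_cat - 1), and the toy data are hypothetical, and the plug-in form (Theta_hat - I) p_hat is an assumed estimator rather than the formula shown on the slide.

```python
def estimate_misclassification(true_cat, strat_cat, weights, n_cat):
    """Weighted estimates of theta[j][k] = P(A* = j | A = k), the true
    proportions, and the plug-in bias (theta - I) p.

    Assumes every true category has at least one sampled unit.
    """
    # joint[j][k]: weighted count of units with A* = j and A = k
    joint = [[0.0] * n_cat for _ in range(n_cat)]
    for k, j, w in zip(true_cat, strat_cat, weights):
        joint[j][k] += w

    total = sum(weights)
    # Weighted count in each true category (column sums of joint)
    true_totals = [sum(joint[j][k] for j in range(n_cat)) for k in range(n_cat)]

    theta_hat = [[joint[j][k] / true_totals[k] for k in range(n_cat)]
                 for j in range(n_cat)]
    p_hat = [t / total for t in true_totals]

    # (theta - I) p, computed row by row as (theta p)_j - p_j
    bias_hat = [sum(theta_hat[j][k] * p_hat[k] for k in range(n_cat)) - p_hat[j]
                for j in range(n_cat)]
    return theta_hat, p_hat, bias_hat


# Hypothetical toy sample: one unit of true category 0 is classified as 1.
true_cat = [0, 0, 0, 1, 1, 1]
strat_cat = [0, 0, 1, 1, 1, 1]
weights = [1.0] * 6
theta_hat, p_hat, bias_hat = estimate_misclassification(
    true_cat, strat_cat, weights, n_cat=2)
```

In this toy sample the stratification classification understates category 0 by one sixth and overstates category 1 by the same amount.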
11
Effective Sample Sizes and Variance Inflation Factors
Variance inflation factor: measures the inflation of variance due to weight variation
[Formulas not recovered from the slide], defined
–for domain d constructed based on the true value
–for domain d constructed based on the misclassified value
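The slide's formulas are not legible in this transcript. A widely used measure of variance inflation due to weight variation is Kish's approximation, n * sum(w^2) / (sum(w))^2, with the effective sample size equal to n divided by that factor. The sketch below assumes this is the measure intended; the function names are hypothetical.

```python
def variance_inflation_factor(weights):
    """Kish's approximate variance inflation due to unequal weights:
    n * sum(w^2) / (sum(w))^2. Equals 1 when all weights are equal."""
    n = len(weights)
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    return n * sum_w2 / sum_w ** 2


def effective_sample_size(weights):
    """Nominal sample size divided by the variance inflation factor."""
    return len(weights) / variance_inflation_factor(weights)


# Equal weights give no inflation; mixed weights shrink the effective size.
vif_equal = variance_inflation_factor([1.0, 1.0, 1.0, 1.0])
vif_mixed = variance_inflation_factor([1.0, 1.0, 4.0, 4.0])
n_eff_mixed = effective_sample_size([1.0, 1.0, 4.0, 4.0])
```

Applying the same functions to the weights of the units in domain d, once with the domain defined by the true value and once by the misclassified value, gives the two versions contrasted on the slide.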
12
Example: National Survey of Recent College Graduates (NSRCG)
Sponsored by the National Science Foundation
Collects education, employment, and demographic information from recent graduates with Bachelor’s or Master’s degrees in science, engineering, or health fields
For details:
–http://www.nsf.gov/statistics/srvyrecentgrads
13
NSRCG (Continued)
Two-stage sample design: school sample at the first stage and graduate sample at the second stage
Crucial to collect key sampling variables (degree date, degree level, field of major, race/ethnicity, and gender) from schools for eligibility determination and stratification (frame variables)
Sample was designed to have moderate weight variation within domains while meeting certain sample size thresholds
Quality of sampling variables was compromised by schools’ reluctance to release student information, non-standard formats used by schools, and inaccurate or incomplete administrative data
Jang and Lin (2007 JSM)
14
NSRCG (Continued)
The same information (degree date, degree level, field of major, race/ethnicity, and gender) was also collected from sampled graduates
Able to measure the quality of school-provided information for stratification by assessing discrepancies between school-provided information and reported values
Looking at data from two survey cycles (2003 and 2006 NSRCG)
15
Misclassification for Gender
NSRCG 2003: ReBias for P_Male = -0.01%
NSRCG 2006: ReBias for P_Male = 0.50%
16
Misclassification for Race/Ethnicity
[Tables] Panels: NSRCG 2003, NSRCG 2006
17
Effective Sample Sizes and Variance Inflation Factors
What if reported values are used for discrepant cases?
Results in more weight variation within domains based on reported values, due to unequal selection probabilities across classes
Check domain-specific sample sizes and variance inflation factors
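The effect described on this slide can be illustrated with a small, entirely hypothetical sample: two domains sampled at different rates, with one unit whose reported domain differs from its frame domain. The sketch assumes Kish's approximation for the variance inflation factor; the domain labels D1 and D2, the weights, and the function names are made up for illustration.

```python
from collections import defaultdict


def kish_vif(weights):
    """Kish's approximation: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


def domain_summary(domains, weights):
    """Sample size, variance inflation factor, and effective sample size
    for each domain label."""
    by_domain = defaultdict(list)
    for d, w in zip(domains, weights):
        by_domain[d].append(w)
    summary = {}
    for d, ws in by_domain.items():
        vif = kish_vif(ws)
        summary[d] = {"n": len(ws), "vif": vif, "n_eff": len(ws) / vif}
    return summary


# D1 is sampled at a low rate (weight 10), D2 at a high rate (weight 2).
weights = [10.0] * 4 + [2.0] * 6
frame = ["D1"] * 4 + ["D2"] * 6
# One unit listed as D1 on the frame reports that it belongs to D2.
reported = ["D1"] * 3 + ["D2"] + ["D2"] * 6

by_frame = domain_summary(frame, weights)
by_reported = domain_summary(reported, weights)

# Ratios of reported-based to frame-based sizes for domain D2
ratio_n = by_reported["D2"]["n"] / by_frame["D2"]["n"]
ratio_n_eff = by_reported["D2"]["n_eff"] / by_frame["D2"]["n_eff"]
```

In this toy case the nominal sample size of D2 rises from 6 to 7 when reported values are used, yet its effective sample size falls from 6 to roughly 3.9, because a single large weight now sits among small ones.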
18
Variance Inflation Factors
[Figure] Panels: NSRCG 2003, NSRCG 2006
Domain: race/ethnicity by degree level by major field by gender
Legend: White, Asian, Minority
19
Ratio of Sample Size, n_R / n_F
[Figure] Panels: NSRCG 2003, NSRCG 2006
Domain: race/ethnicity by degree level by major field by gender
Legend: White, Asian, Minority
20
Ratio of Effective Sample Size, n_R / n_F
[Figure] Panels: NSRCG 2003, NSRCG 2006
Domain: race/ethnicity by degree level by major field by gender
Legend: White, Asian, Minority
21
Variance Inflation Factors
[Figure] Panels: NSRCG 2003, NSRCG 2006
Domain: race/ethnicity by degree level by major field
Legend: White, Asian, Minority
22
Ratio of Sample Size, n_R / n_F
[Figure] Panels: NSRCG 2003, NSRCG 2006
Domain: race/ethnicity by degree level by major field
Legend: White, Asian, Minority
23
Ratio of Effective Sample Size, n_R / n_F
[Figure] Panels: NSRCG 2003, NSRCG 2006
Domain: race/ethnicity by degree level by major field
Legend: White, Asian, Minority
24
Variance Inflation Factors
[Figure] Panels: NSRCG 2003, NSRCG 2006
Domain: race/ethnicity by gender
Legend: White, Asian, Minority
25
Ratio of Sample Size, n_R / n_F
[Figure] Panels: NSRCG 2003, NSRCG 2006
Domain: race/ethnicity by gender
Legend: White, Asian, Minority
26
Ratio of Effective Sample Size, n_R / n_F
[Figure] Panels: NSRCG 2003, NSRCG 2006
Domain: race/ethnicity by gender
Legend: White, Asian, Minority
27
Summary
Misclassification in stratification may reduce the effective sample sizes for domains that were sampled with high sampling rates
Crucial to have good classification in stratification, especially when substantially unequal selection probabilities are used
28
Next Steps
Population counts for key domains are available, but they are based on the misclassified stratification variables
Estimation of population counts:
–Weighted sums of correct classification from the sample
–Use of misclassification parameter estimates [formula not recovered from the slide], where [symbol not recovered] is the vector with population counts of domains defined by A*
Raking adjustments of the weights using [symbol not recovered]
Comparison of key estimates
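The two count-estimation options listed on this slide can be sketched as follows. The formula on the slide is not legible, so the second function is an assumption: it takes the frame counts N* (by A*) to satisfy N* = Theta N and solves for the true counts N, where Theta[j][k] is the proportion of true-category-k units classified as j. All function names and numbers are hypothetical, and the small solver is included only to keep the example free of external libraries.

```python
def counts_from_weighted_sums(true_cat, weights, n_cat):
    """Option 1: weighted sample totals by the correct (true) category.
    Categories are assumed to be coded 0 .. n_cat - 1."""
    counts = [0.0] * n_cat
    for k, w in zip(true_cat, weights):
        counts[k] += w
    return counts


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Assumes a is square and non-singular."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def counts_from_misclassification(theta_hat, frame_counts):
    """Option 2 (assumed form): solve theta_hat @ N = N* for N."""
    return solve_linear(theta_hat, frame_counts)


# Hypothetical check: true counts (600, 400) observed through theta.
theta = [[0.9, 0.2],
         [0.1, 0.8]]
frame_counts = [0.9 * 600 + 0.2 * 400, 0.1 * 600 + 0.8 * 400]
estimated = counts_from_misclassification(theta, frame_counts)
sample_counts = counts_from_weighted_sums([0, 1, 1], [10.0, 2.0, 2.0], n_cat=2)
```

Either set of estimated counts could then serve as control totals in a raking adjustment of the weights; the raking step itself is not shown here.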