Joint UNECE-Eurostat worksession on confidentiality, 2011, Tarragona
Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals
Maria Cristina Casciano, Laura Corallo, Daniela Ichim
Outline
Multiple releases: MFR and PUF
Subsampling
– allocation: reduce the risk of disclosure
– selection: pre-defined quality standards
Results
– Career of Doctorate Holders Survey
Further work
Multiple …
[Diagram: multiple countries (MS1, MS2, …, MS27); multiple surveys per country (SURVEY 1, SURVEY 2, …, SURVEY X); multiple releases per survey (TABLES, PUF, MFR, OTHER)]
Comparability
ESSnet on SDC harmonisation and common tools
– WP1: test the comparability concept
– Istat, Destatis, Statistics Austria
– multiple countries
How:
1. Assessment of the effects of different practices on predefined statistics
2. Definition of a threshold to determine when action is needed
3. Setting up a process for choosing acceptable practices
Multiple releases
[Diagram: SURVEY 1 with its releases TABLES 1, PUF 1, MFR 1, OTHER 1]
A particular harmonisation dimension
Hierarchical structure
– Utility
– Risk of disclosure
Multiple releases: hierarchical structure
MFR: more restrictive license, less aggregated information
PUF: less restrictive license, more aggregated information
Unique production process!
PUF-MFR
MFR
– definition of a disclosure scenario
– risk assessment R1
– risk limitation w.r.t. the adopted disclosure scenario
– some data utility requirements
PUF
– harmonized with the MFR (e.g. weighted totals)
– reduced risk of disclosure
– random sample
– internal consistency of records
– some (other) data utility requirements (CV and weighted totals: precision and accuracy)
Data description
Doctorate Holders: CDH 2009 Survey
[Diagram: timeline with Year t-5, Year t-3 and Year t]
Estimates by PhD scientific area, by gender and by region
Focus on the characterisation of the occupational status of the PhD holders:
– labour market entry
– usefulness of the PhD for obtaining a job
– type of contract
– type of work
– earnings
– job satisfaction
Data description
PhD Holders (Census): 72% respondents, 28% non-respondents
Adjustment for non-response via calibration: respondents' weights obtained by constraining on known marginal distributions:
– Citizenship (2 categories)
– PhD Scientific Area (14 categories)
– Gender
– Region
PUF-subsampling
Simple random sampling
– Utility: weighted totals may always be preserved by calibration
– Risk: how many units at risk are sampled?
Example (MFR-CDH): units, 24.7% of units at risk
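The risk question for simple random sampling can be made concrete with a small sketch. Under simple random sampling without replacement, the number of at-risk units drawn follows a hypergeometric distribution, so on average the subsample carries the same share of at-risk units as the file it is drawn from. The population size and sample size below are hypothetical; only the 24.7% share comes from the slide.

```python
def at_risk_in_srs(pop_size, n_at_risk, sample_size):
    """Mean and variance of the number of at-risk units drawn by simple
    random sampling without replacement (hypergeometric distribution)."""
    p = n_at_risk / pop_size
    mean = sample_size * p
    # Finite population correction for sampling without replacement.
    fpc = (pop_size - sample_size) / (pop_size - 1)
    variance = sample_size * p * (1 - p) * fpc
    return mean, variance


# Hypothetical file of 10,000 units, 24.7% of them at risk (share from the slide).
N = 10_000
M = 2_470
n = 2_000  # hypothetical subsample size

mean, variance = at_risk_in_srs(N, M, n)
expected_share = mean / n  # stays at the population share of at-risk units
```

Because the expected share does not drop, simple random sampling alone does little to lower the proportion of risky records, which is one way to read the motivation for a risk-aware allocation.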
Subsampling
[Diagram keywords: allocation, domains, utility, disclosure, sample size, stratification, dissemination, totals, scenario, calibration, key variables, quality, users, auxiliary]
PUF-subsampling: proposal
1. Optimal allocation of the units to be sampled in each domain according to Bethel's approach (risk minimization)
2. Selection of a fixed-size balanced sample (CUBE method) (data utility maximization)
1. Bethel's approach (1989)
● Cost function to minimize: $C(n_1, \dots, n_H) = \sum_{h=1}^{H} C_h n_h$
● Expected coefficient of variation (CV) of the estimate of the total of variable P in domain $j_d$ equal to or lower than a prefixed threshold: $CV(\hat{Y}_{P, j_d}) \le CV^{*}_{P, j_d}$
$n_h$ and $C_h$ are related to the risk to be reduced
Optimal allocation: $n_h^{*}$
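As a rough illustration of cost-minimizing allocation under a CV constraint, the sketch below solves the special case of a single constraint on one total under stratified simple random sampling, where a closed form exists. It is not Bethel's multivariate algorithm, which handles several constraints iteratively. The stratum sizes, standard deviations, per-unit costs (standing in for risk) and the population total are all hypothetical, and the caps $n_h \le N_h$ are ignored.

```python
import math


def cost_optimal_allocation(N, S, C, total, cv_target):
    """Real-valued n_h minimizing sum(C_h * n_h) subject to one CV constraint
    on an estimated total under stratified simple random sampling."""
    v_star = (cv_target * total) ** 2  # variance allowed by the CV threshold
    denom = v_star + sum(Nh * Sh ** 2 for Nh, Sh in zip(N, S))
    scale = sum(Nh * Sh * math.sqrt(Ch) for Nh, Sh, Ch in zip(N, S, C)) / denom
    return [scale * Nh * Sh / math.sqrt(Ch) for Nh, Sh, Ch in zip(N, S, C)]


def cv_of_total(N, S, n, total):
    """CV of the stratified estimator of the total for a given allocation."""
    var = sum(Nh ** 2 * Sh ** 2 * (1 / nh - 1 / Nh) for Nh, Sh, nh in zip(N, S, n))
    return math.sqrt(var) / total


# Hypothetical strata: sizes, standard deviations, and per-unit costs.
# A higher cost stands in for a riskier stratum, so fewer units are drawn there.
N = [500, 300, 200]
S = [10.0, 20.0, 40.0]
C = [1.0, 1.0, 4.0]
total = 50_000.0  # hypothetical population total

allocation = cost_optimal_allocation(N, S, C, total, cv_target=0.05)
achieved_cv = cv_of_total(N, S, allocation, total)
```

In practice the sizes would be rounded up and capped at the stratum sizes; raising the cost of a stratum pushes its sample size down while the CV constraint is still met.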
2. Balanced sampling
A sampling design is said to be balanced on the auxiliary variables $x_k$ if and only if the balancing equations
$\hat{X}_{HT} = \sum_{k \in s} x_k / \pi_k = \sum_{k \in U} x_k = X$
are satisfied, where s is the selected sample, U the population, $\pi_k$ the inclusion probability of unit k, X the vector of known population totals and $\hat{X}_{HT}$ the Horvitz-Thompson (H.-T.) estimator.
Consequence: exact estimates for pre-defined variables
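A minimal sketch of what the balancing equations require: for a given sample, the Horvitz-Thompson estimates of the auxiliary totals must reproduce the known totals. The four-unit population, its auxiliary values and the inclusion probabilities are invented for illustration.

```python
def ht_totals(x, pi, sample):
    """Horvitz-Thompson estimate of the total of each auxiliary variable."""
    n_vars = len(x[0])
    return [sum(x[k][j] / pi[k] for k in sample) for j in range(n_vars)]


def is_balanced(x, pi, sample, tol=1e-9):
    """True if the sample satisfies every balancing equation within tol."""
    n_vars = len(x[0])
    known_totals = [sum(row[j] for row in x) for j in range(n_vars)]
    estimates = ht_totals(x, pi, sample)
    return all(abs(e - t) <= tol for e, t in zip(estimates, known_totals))


# Hypothetical population of 4 units with two auxiliary variables:
# a constant 1 (its total is the population size) and a size measure.
x = [[1, 10], [1, 20], [1, 20], [1, 10]]
pi = [0.5, 0.5, 0.5, 0.5]

balanced = is_balanced(x, pi, sample=[0, 1])    # HT totals: [4, 60]
unbalanced = is_balanced(x, pi, sample=[0, 3])  # HT totals: [4, 40]
```

Both samples reproduce the population size, but only the first one also reproduces the total of the size measure.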
Balanced sampling: the CUBE method
Geometrically, each vertex of the N-cube is a sample.
[Figure: the cube for N = 3, with vertices (000), (100), (010), (110), (101), (011), (111), crossed by the subspace K]
The balancing equations define a subspace of $R^N$ named K. The problem is to choose a vertex (sample) of the N-cube that remains in the subspace of constraints K.
Cube method (Deville & Tillé, 2004):
1. Flight phase: a random walk starting from the vector of inclusion probabilities $\pi$ and moving in the intersection of the cube C and K. It stops at a vertex of the intersection of C and K, if such a vertex exists.
2. Landing phase: at the end of the flight phase, if a sample is not exactly determined in C ∩ K, a sample is selected as close as possible to the constraint space K.
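The geometric picture can be illustrated by brute force for a tiny population: enumerate every vertex of the cube and keep those that satisfy a single balancing equation. This only illustrates the set of admissible vertices; it is not the flight or landing phase, which avoid enumeration altogether. The inclusion probabilities below are hypothetical, and the balancing variable is taken as $x_k = \pi_k$, which amounts to fixing the sample size.

```python
from fractions import Fraction
from itertools import product


def balanced_vertices(x, pi):
    """Vertices s of the N-cube with sum_{k in s} x_k / pi_k equal to sum_k x_k."""
    N = len(pi)
    target = sum(x)
    return [
        s
        for s in product((0, 1), repeat=N)
        if sum(Fraction(x[k]) / pi[k] for k in range(N) if s[k]) == target
    ]


# Hypothetical population of N = 3 units, each with inclusion probability 2/3.
pi = [Fraction(2, 3)] * 3
x = pi  # balancing on x_k = pi_k forces a fixed sample size of sum(pi) = 2

vertices = balanced_vertices(x, pi)
# vertices == [(0, 1, 1), (1, 0, 1), (1, 1, 0)]
```

Of the eight vertices, only the three samples of size two lie in the constraint subspace, so a flight phase started at $\pi$ can end on one of them.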
Implementation
1. Determination of the optimal strata sizes in terms of reduction of the overall risk (cost function), keeping the CV of the estimates below a 5% threshold for three combinations of the allocation and domain variables
   Allocation variables: Occup, JobS, Contract, Work, Income
   Domain variables: Gender, Region, Scientific Area, Year of Completion
2. Six possible settings, corresponding to different choices of the parameters:
   a. Risk R1 used as the minimization cost of the algorithm
   b. Risk R1 used as a stratification variable
   c. Include all units of the strata containing no units at risk
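The six settings listed in the allocation and results tables (NYN, NYY, YYN, YYY, YNN, YNY) coincide with the yes/no combinations of choices a, b and c in which the risk R1 is used at least once, either as cost or as stratification variable. A small sketch that enumerates them; the function name and the string encoding are illustrative only.

```python
from itertools import product


def risk_settings():
    """Combinations of (Risk.cost, Risk.strat, Cens.no.risk), coded Y/N,
    keeping only those where the risk enters as cost or as stratifier."""
    settings = []
    for risk_cost, risk_strat, cens_no_risk in product("NY", repeat=3):
        if risk_cost == "Y" or risk_strat == "Y":
            settings.append(risk_cost + risk_strat + cens_no_risk)
    return settings


settings = risk_settings()
```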
Allocations (CV* = 5%)
[Table. Columns: C.S | Risk.cost | Risk.strat | Cens.no.risk | # Strata | # Cens.strata | # Cens.units | Size Bethel | Size Prop. | Size Equal | Max.Bethel-Prop | Max.Bethel-Equal. Rows: three blocks, each with the six settings (Risk.cost, Risk.strat, Cens.no.risk) = NYN, NYY, YYN, YYY, YNN, YNY]
Allocations
[Figure]
Balanced sample
Selection of samples of fixed size from the CDH survey
Utility constraints on:
– the population size N
– the optimal sample size n
– the marginal frequency distributions by Gender, Year of Doctorate Completion and Scientific Area
In total: 18 equations
CUBE algorithm:
I. The input vector is the optimal one determined by Bethel
II. The flight phase ends with no exact solution
III. The landing phase starts: selection of a sample which ensures a small deviation from balance, according to the distance between p* and p
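Constraints of this kind are usually expressed as columns of a balancing matrix: a constant 1 reproduces the population size N, the inclusion probability itself fixes the sample size n, and one indicator per category reproduces a marginal frequency. The sketch below builds such a matrix for a toy file; the records, category lists and probabilities are hypothetical and do not reproduce the 18 equations of the CDH application.

```python
def balancing_row(record, pi_k, categories):
    """Auxiliary values of one unit: [1, pi_k, indicator per category level]."""
    row = [1.0, pi_k]
    for variable, levels in categories.items():
        row.extend(1.0 if record[variable] == level else 0.0 for level in levels)
    return row


# Hypothetical records and category lists (not the CDH data).
categories = {"gender": ["F", "M"], "area": ["A1", "A2", "A3"]}
records = [
    {"gender": "F", "area": "A1"},
    {"gender": "M", "area": "A2"},
    {"gender": "F", "area": "A3"},
    {"gender": "M", "area": "A1"},
]
pi = [0.5, 0.5, 0.5, 0.5]  # e.g. n_h / N_h from the allocation step

X = [balancing_row(r, p, categories) for r, p in zip(records, pi)]
n_equations = len(X[0])                                     # one per column
column_totals = [sum(col) for col in zip(*X)]               # known totals to reproduce
```

The column totals are the right-hand sides of the balancing equations: the population size, the target sample size and the category counts.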
Results
[Figure: median of absolute relative errors]
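The quality measure named on this slide can be computed as follows. The sketch assumes that each estimate from the subsample is compared with a reference value (for instance the corresponding estimate from the full file) and that reference values are non-zero; the numbers are invented.

```python
from statistics import median


def median_abs_relative_error(estimates, references):
    """Median over all cells of |estimate - reference| / |reference|."""
    return median(abs(e - r) / abs(r) for e, r in zip(estimates, references))


# Hypothetical subsample estimates and reference totals for three domains.
estimates = [98.0, 210.0, 330.0]
references = [100.0, 200.0, 300.0]

mare = median_abs_relative_error(estimates, references)  # errors: 0.02, 0.05, 0.10
```

The median is less sensitive than the mean to a few small domains with very large relative errors.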
Results
[Table. Columns: C.S | Risk.cost | Risk.strat | Cens.no.risk | Risk | Occup | JobS | Contract | Work | Income. Rows: three blocks, each with the six settings (Risk.cost, Risk.strat, Cens.no.risk) = NYN, NYY, YYN, YYY, YNN, YNY; the YYY row of the third block is marked with *]
Further work
1. Investigate the relationship between coefficients of variation and disclosure risk, together with different options for including the risk of disclosure in the sampling design.
2. Introduce a utility-priority approach into the handling of the balancing equations.
3. Investigate the use of other data utility constraints.