Repeated anonymised samples of administrative records: an application to social security data in Brazil
Rigan A. C. Gonzalez (DATAPREV-Brazil)
Pedro L. N. Silva (University of Southampton-UK)

2 Outline
Introduction and motivation
Sample design and selection
Some results from the selected anonymised samples
Conclusions and discussion

3 Social Security Databases
The Brazilian Social Security Administration (SSA) maintains huge databases of contributors and beneficiaries enrolled in the social security system
Records held provide a rich source of information about participation in the formal labour market and the distribution of social security benefits
In particular, they provide a longitudinal perspective that is unavailable from other sources
– There are no major longitudinal surveys covering the working age population in Brazil

4 SSA databases – main issues
Confidentiality and security requirements mean that the databases are inaccessible for research purposes
Currently used only for the production of aggregate-level summaries, published on a regular basis
– Pre-defined cross-classified tables, at a high level of aggregation
– Broad indicators only
Not available for user-specific analysis
One idea: anonymised samples of records

5 Anonymised Samples of Records
Enable dissemination of individual anonymised microdata while protecting the confidentiality of individual records
Popularised by applications in population censuses
More recently, also applied to administrative records
– Drazga (2008) describes the US experience
– Examples from other countries, such as the UK

6 Anonymised Samples of Jobs Database
Goal: to design samples of SSA database records to be extracted and made available for analysis on a regular basis
Proposed sample design: stratified simple random sampling at each time point
Rotation strategy: use Permanent Random Numbers (PRNs; e.g. Ohlsson, 1995) to control sample overlap across time
– Enables longitudinal analysis
– Enables each sample to represent the updated survey population
– Simple but effective rotation control

7 Sample Design & Selection
Target population = all jobs held by workers affiliated to the General Social Security Regime (GSSR) in the reference period
Reference period = July 2001 to June 2002
Key domains of analysis defined as the cross-classification of states (27 levels) × SIC of employer (four ‘sectors’)
– 1=Manufacturing, 2=Trade and distribution services, 3=Other services, 4=Agriculture, construction and other productive activities
Main targets of inference: job status distribution
– 1=Active, 2=New admission, 3=Terminated in current month, 4=Terminated in previous periods, 5=Not reported

8 Stratification & Sample Size
57 explicit strata
– 40 strata = 10 states by 4 SIC groups
– 17 states with no further stratification (state-only strata)
Sample size in each stratum set to estimate proportions of at least 1.5% with a CV no larger than approximately 10%
– n_h = 6,300 records in each of the 40 state-by-SIC strata
– n_h = 12,600 records in each of the 17 state-only strata
Larger size in the 17 state-only strata to enable domain estimation by SIC with some confidence
Total sample size n = 466,200 job records (< 1.5% of total)
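
The stratum counts and the total sample size quoted on this slide can be checked with a few lines of arithmetic. The sketch below is only a consistency check; the variable names are illustrative and do not come from the original analysis.

```python
# Consistency check of the stratification arithmetic quoted on the slide.
# Variable names are illustrative assumptions, not from the original work.
n_state_sic_strata = 10 * 4    # 10 states crossed with 4 SIC groups
n_state_only_strata = 17       # remaining states, no SIC stratification
n_per_state_sic = 6_300        # records per state-by-SIC stratum
n_per_state_only = 12_600      # records per state-only stratum

total_strata = n_state_sic_strata + n_state_only_strata
total_sample = (n_state_sic_strata * n_per_state_sic
                + n_state_only_strata * n_per_state_only)

print(total_strata)   # 57
print(total_sample)   # 466200
```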

9 Maximum Relative Error for Sample Proportion under SRS with n=6,307 at 95% confidence
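
The quantity named in this slide title can be sketched with the usual normal approximation for a proportion under simple random sampling. This is a minimal illustration, assuming the finite population correction is ignored and z = 1.96 for 95% confidence; the function name is hypothetical and the original chart may have been produced differently.

```python
import math

def max_relative_error(p, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a
    proportion p, divided by p. Assumes SRS of size n and ignores the
    finite population correction (an assumption of this sketch)."""
    return z * math.sqrt((1.0 - p) / (n * p))

# Relative error shrinks as the proportion being estimated grows.
for p in (0.015, 0.05, 0.10, 0.50):
    err = max_relative_error(p, 6307)
    print(f"p = {p:.3f}: max relative error = {err:.3f}")
```

Under these assumptions, a proportion of 1.5% has a maximum relative error of about 20% at n = 6,307, which corresponds to a CV of about 10%.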

10 Rotation Scheme
Designed to rotate out approximately 1/12 of the sample at each new selection period
We used monthly samples, but this can easily be changed to other periods, such as quarters, semesters or years
Time in sample for each record is approximately 12 months (or periods)
Time in sample is not fixed, because rotation control under PRN sampling is stochastic
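
A rough sense of how much overlap this rotation rate leaves between two samples can be had from a simple calculation. The sketch below assumes a static population and a deterministic rotation of exactly 1/12 per period, which is an idealisation: with PRN sampling on a changing frame the realised overlap will vary. The function name is hypothetical.

```python
def expected_overlap(n, k, periods=12):
    """Expected number of records common to two samples taken k periods
    apart, assuming a static population and exactly 1/periods of the
    sample rotated out each period (idealised assumptions)."""
    return n * max(0.0, 1.0 - k / periods)

# Overlap for the total sample size of 466,200 at several lags.
for k in (1, 3, 6, 12):
    print(k, round(expected_overlap(466_200, k)))
```

At a lag of six months this gives 233,100 records, matching the figure quoted for longitudinal analyses in the conclusions.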

11 Sample Sizes for Alternative Analysis

12 Selected estimates of total and proportions of jobs by status – April 2002

13 Selected estimates of counts and proportions of new jobs by activity sector – April 2002

14 Scatter plot of estimated proportions of new admissions and their CVs – April 2002

15 Proportions of jobs terminated in month t+k, for jobs existing (Active) or started (New admissions) in January 2002 (k=0)

16 Conclusions and discussion (1)
The Brazilian SSA could improve its approach to releasing statistical information about the formal labour market by providing access to anonymised samples of jobs
This would meet the analytical needs of many specialised users, while still protecting the confidentiality of individual records
This would substantially enhance the capacity to study and evaluate the impact of public policies regarding the Social Security system in Brazil

17 Conclusions and discussion (2)
The proposed sample design worked well in our application
All the sample selection, estimation and analysis activities were carried out using a standard desktop microcomputer
Once the samples are made available, analysts should have no difficulty in exploring the data for their own estimation and analysis activities
The various analyses carried out with the selected samples illustrate the potential of such samples for analytical use

18 Conclusions and discussion (3)
For cross-sectional estimates in any given month, the sample of approximately 466,200 records delivers precise estimates for some fine domains of interest
For longitudinal analyses with samples six months apart, the sample would still have approximately 233,100 matched records available

19 Future Work
Improved weighting methods for longitudinal analyses (e.g. following Lavallée, 1995)
Detailed analysis of the disclosure risks associated with the proposed sampling strategy
Assess the impact of late reporting of new jobs (births) and terminated jobs (deaths), and introduce control measures to reduce the resulting bias

20 Thanks for your attention.

21 References
GONZALEZ, R. A. C. (2005). Amostragem longitudinal em registros administrativos: uma aplicação à previdência social. MSc dissertation. Rio de Janeiro: Escola Nacional de Ciências Estatísticas.
DRAZGA, L. (2008). Uses of Administrative Data at the U.S. Social Security Administration.
LAVALLÉE, P. (1995). Cross-sectional weighting of longitudinal surveys of individuals and households using the weight share method. Survey Methodology, v. 21, n. 1, p. 25-32.
OHLSSON, E. (1995). Coordination of samples using permanent random numbers. In: Cox, Binder, Chinnappa, Christianson, Colledge & Kott (eds.), Business Survey Methods. New York: Wiley, p. 153-169.

22 Synchronised sampling algorithm
Apply the steps below within each selection stratum h
Step 1 – Sort the records in the updated sampling frame in ascending order of their permanent random numbers (X_hi)
Step 2 – Calculate the rank P_hi of each record i in stratum h according to its permanent random number
– The smallest position in the stratum shall be 1 and the largest shall equal N_th, the total number of records in stratum h at time t
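
Steps 1 and 2 can be sketched as a sort followed by a rank assignment within each stratum. This is a minimal illustration, not the authors' implementation: the record layout (`id`, `stratum`, `prn`), the function name and the toy frame are all hypothetical.

```python
import random

def rank_by_prn(frame):
    """Return {record id: rank} with ranks 1..N_h inside each stratum,
    ordered by ascending permanent random number (PRN).
    `frame` is a list of dicts with hypothetical keys 'id', 'stratum', 'prn'."""
    by_stratum = {}
    for rec in frame:
        by_stratum.setdefault(rec["stratum"], []).append(rec)
    ranks = {}
    for recs in by_stratum.values():
        # Step 1: sort by PRN; Step 2: position in the sorted list is the rank.
        ordered = sorted(recs, key=lambda r: r["prn"])
        for pos, rec in enumerate(ordered, start=1):
            ranks[rec["id"]] = pos
    return ranks

# Toy frame: 5 records in stratum "A", 3 in stratum "B".
# Each record's PRN is drawn once and then kept for as long as it stays in the frame.
rng = random.Random(2002)  # arbitrary seed, for reproducibility only
frame = [{"id": i, "stratum": "A" if i < 5 else "B", "prn": rng.random()}
         for i in range(8)]
ranks = rank_by_prn(frame)
print(ranks)
```

Because the PRNs are permanent, re-ranking an updated frame moves existing records only as far as births and deaths require, which is what keeps consecutive samples coordinated.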

23 Synchronised sampling algorithm
Step 3 – Determine the start and end points for sample inclusion in stratum h at time t using equations (1) and (2), where
– t = 1 for July 2001
– n_th is the sample size in stratum h at time t
– T is the maximum number of rounds for which a record is expected to be included in the sample
– mod{a ; b} is the remainder of the division of a by b

24 Synchronised sampling algorithm
Step 4 – If the start point does not exceed the end point, include in the sample for time t the records with positions P_hi between the start and end points
Otherwise, include in the sample for time t the records with positions at or beyond the start point, or at or before the end point
Repeat for each new survey round as needed (increase t)
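
Steps 3 and 4 can be sketched as a selection window that slides along the PRN ranks and wraps around at the end of the stratum. Equations (1) and (2) are not reproduced here, so the start and end formulas in `selection_window` are an assumption made for illustration only (a window that advances by about n/T ranks per round); they should not be read as the authors' equations. All function names are hypothetical.

```python
def selection_window(t, n, N, T=12):
    """ASSUMED start/end ranks (1-based, inclusive) for round t in a stratum
    with N records and sample size n <= N. The window advances by roughly
    n/T ranks per round and wraps at N. Not the original equations (1)-(2)."""
    start = ((t - 1) * n // T) % N + 1
    end = (start + n - 2) % N + 1
    return start, end

def in_sample(rank, start, end):
    """Step 4: contiguous window if start <= end, wrapped window otherwise."""
    if start <= end:
        return start <= rank <= end
    return rank >= start or rank <= end

def select(ranks, t, n, T=12):
    """Return the ids whose rank falls inside the window for round t."""
    start, end = selection_window(t, n, len(ranks), T)
    return {rid for rid, r in ranks.items() if in_sample(r, start, end)}

# Toy stratum: 100 records whose rank equals their id, sample size 24.
demo_ranks = {i: i for i in range(1, 101)}
s1 = select(demo_ranks, t=1, n=24)
s2 = select(demo_ranks, t=2, n=24)
s50 = select(demo_ranks, t=50, n=24)   # window wraps past rank 100
print(len(s1), len(s1 & s2), sorted(s50)[:3], sorted(s50)[-2:])
```

In this toy case the frame is static, so consecutive samples share exactly 22 of 24 records (1/12 rotated out). On a real frame, births and deaths shift the ranks between rounds, which is why time in sample is not fixed.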
