Using synthetic data to improve the accessibility of the SLS
Susan Carsley, SLS Project Manager

Overview
  What is the SLS?
  How the SLS can currently be accessed
  How the SLS hopes to use synthetic data
What is the SLS?
The SLS is a large-scale, anonymised linkage study designed to capture 5.5% of the Scottish population.
The sample is based on 20 semi-random birthdays.
It is a joint project between the University of Edinburgh, the University of St Andrews and National Records of Scotland (NRS).
It is built using data available from:
  Census
  Vital Events
  NHSCR (migration into or out of Scotland)
  Education (School Census, Absences and SQA qualifications)
  NSS health data (linked on a project-by-project basis)
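As a rough illustration of birthday-based sampling: 20 birthdays out of 365 is about 5.5% of the calendar, which matches the stated sample size if births are spread evenly over the year. The sketch below is a toy example only. The birthdays are drawn at random here purely for illustration (the real SLS dates are not given in these slides), and the names pick_sample_birthdays and in_sample are hypothetical.

```python
import random
from datetime import date, timedelta

def pick_sample_birthdays(n=20, seed=0):
    """Draw n hypothetical (month, day) pairs; NOT the real SLS birthdays."""
    rng = random.Random(seed)
    start = date(2001, 1, 1)  # any non-leap year gives 365 calendar days
    all_days = [
        ((start + timedelta(days=i)).month, (start + timedelta(days=i)).day)
        for i in range(365)
    ]
    return set(rng.sample(all_days, n))

def in_sample(date_of_birth, birthdays):
    """A person is a sample member if their birthday is one of the chosen days."""
    return (date_of_birth.month, date_of_birth.day) in birthdays

birthdays = pick_sample_birthdays()

# Assumption: births are uniform over the year, so the expected share is n / 365.
expected_fraction = len(birthdays) / 365
```

Because membership depends only on day and month of birth, the same rule can be applied to each new data source (census, vital events and so on) to pick out the same people over time.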
Aims and scope
Aims:
  Continue building and developing the SLS;
  Support researchers who wish to undertake projects with the SLS data;
  Provide web-based resources that make the SLS easier to use;
  Provide training on the SLS and on longitudinal data handling, analysis and modelling.
Scope:
  Research into demographic, health and social questions in Scotland;
  Support is given primarily to academic researchers, and secondarily to non-academic researchers for non-commercial use.

Security & Confidentiality
  The dataset is held in a secure environment at NRS (access to the building is controlled, passes are worn at all times and visitors are escorted)
  Data are accessed in a keypad-secured environment
  Computers are on a password-protected, stand-alone network
  All relevant protocols on data sharing, access and security are followed
  Data access is strictly controlled
  All results of data analysis are disclosure checked before release
How the SLS can currently be accessed
Accessing the SLS
There are currently 2 ways to access the SLS:
  Remote access
  Safe Setting access
Types of data access: Remote Access
Analysis
  Researchers specify their analyses by writing syntax in SPSS, SAS or Stata and sending it to their SLS Support Officer.
  Researchers can use the web-based Data Dictionary to look up variable names and category names.
  Alternatively, the Support Officer will send the researcher an ‘empty shell’ containing variable labels and value labels to help with writing the syntax.
  The Support Officer then runs the analysis on the real dataset.
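To make the ‘empty shell’ idea concrete, here is a minimal sketch: a file that carries the variable names (plus their labels and value labels in a separate dictionary) but no records. The variables, labels and codes below are invented for illustration and are not taken from the SLS Data Dictionary. The real shell would be in SPSS, SAS or Stata format rather than CSV.

```python
import csv
import io

# Hypothetical data dictionary: variable name -> label and value labels.
data_dictionary = {
    "sex": {"label": "Sex", "values": {1: "Male", 2: "Female"}},
    "llti": {"label": "Limiting long-term illness", "values": {1: "Yes", 2: "No"}},
}

def make_empty_shell(dictionary):
    """Return CSV text with a header row of variable names and no data rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(dictionary.keys())
    return buffer.getvalue()

shell_csv = make_empty_shell(data_dictionary)

# Reading the shell back gives the structure only: one header row, zero records.
shell_rows = list(csv.reader(io.StringIO(shell_csv)))
```

A researcher can write and syntax-check code against such a shell, since every variable name resolves, without ever seeing a real record.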
Types of data access: Remote Access
Outputs
  The Support Officer checks the output of the analyses for confidentiality issues.
  If the output is disclosive, your Support Officer does one of two things:
    alters the output slightly so that it no longer contains disclosive elements, or
    informs you that the analyses you requested cannot be carried out because they breach the confidentiality rules.
  Cleared output is sent to researchers (in an encrypted attachment).
  Researchers never receive the real dataset.
  Remote access only provides you with cleared analysis outputs, such as frequency tables, cross tabulations, or regression model parameters.
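One common form of output checking is small-cell suppression in frequency tables. The sketch below shows the general idea only. The threshold of 10 and the function names are assumptions made for illustration, and the slides do not state the actual SLS disclosure rules.

```python
from collections import Counter

def frequency_table(values):
    """Count how often each category occurs."""
    return dict(Counter(values))

def suppress_small_cells(table, threshold=10):
    """Replace counts below the threshold with None (i.e. suppressed).

    The threshold is an illustrative assumption, not the SLS rule.
    """
    return {
        category: (count if count >= threshold else None)
        for category, count in table.items()
    }

# Toy data: category "B" is rare enough to be potentially disclosive.
toy_values = ["A"] * 12 + ["B"] * 3
raw_table = frequency_table(toy_values)
cleared_table = suppress_small_cells(raw_table, threshold=10)
```

In practice a checker also has to consider whether a suppressed cell can be recovered from row or column totals, which this sketch does not attempt.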
Types of data access: Remote Access
Pros:
  Can work from the comfort of own home/office
  Can access textbooks and internet whilst writing syntax
  Don’t need to travel to the Safe Setting in Edinburgh
Cons:
  Get no feel for the data
  Can be a long process if models need tweaking and rerunning
  Very reliant on Support Officer
Types of data access: Working in the safe setting room
  If you wish to analyse the data yourself (as most users do, especially at the initial stages of recoding variables and exploratory analysis), you will need to visit NRS in Edinburgh to work in the safe setting (safe haven) room.
  You will not have access to the entire SLS database, only to the subset of data extracted for your project.
  The computers for analysis are not connected to the outside world and are only equipped with a CD-ROM reader.
  You cannot take your outputs home immediately, because they first have to be cleared by the SLS Team (the encrypted outputs will be sent to you afterwards).

Types of data access: Working in the safe setting room
Pros:
  Work with the data hands on
  Can tweak and rerun models
  Support Officer on hand to provide advice
Cons:
  Must travel to the Safe Setting in Edinburgh
  No internet access within the SLS
  Strict rules within Safe Setting
How the SLS hopes to use synthetic data
Why use synthetic data?
  The sensitive nature of the information the SLS contains means that access to the microdata is highly restricted.
  Consequently, compared to other census data products, the SLS is used by a small number of researchers, a situation which limits its potential impact.
  Using synthetic data will facilitate access to the SLS while protecting confidentiality.
Synthetic data for the SLS - SYLLS
Synthetic SLS data spine (1991 & 2001)
  Age, sex, marital status, ethnicity, limiting long-term illness and geography
  Open access via CALLS Hub and SLS
Bespoke synthetic datasets
  Synthetic versions of data extracts to match individual user data requests
  Provided to approved researchers for preliminary analysis; final analysis will be run on the real data in safe settings

Synthetic SLS data spine
Aims:
  Provide web-based resources that make the SLS easier to use;
  Provide training on the SLS and on longitudinal data handling, analysis and modelling.
Benefits:
  Will allow a small subset of longitudinal data to be made available online.
Uses:
  Will allow potential users to access a small subset of data online and to consider and practise longitudinal analysis techniques
  Used in SLS training courses
  Freely available for others to use as a training dataset

Bespoke synthetic datasets
Aims:
  Support researchers who wish to undertake projects with the SLS data.
Benefits:
  A good compromise between the current access options.
  The synthetic dataset can be accessed at home and will look (structurally) and behave (statistically) like the original confidential data, but will contain artificial units only.
Uses:
  Allow researchers to access a synthetic version of their dataset at home
  Allow researchers to write syntax and develop models using synthetic data which should behave like the original data
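To give a feel for what "looks and behaves like the original but contains artificial units only" can mean, here is a deliberately simple sketch. It draws the first variable from its observed distribution and the second from its distribution given the first. This is a toy illustration under stated assumptions, not the method used by SYLLS. The records are invented and the function name synthesise is hypothetical.

```python
import random
from collections import defaultdict

def synthesise(records, n, seed=0):
    """Generate n artificial (first, second) pairs from observed pairs.

    The first variable is resampled from its marginal distribution and the
    second from its distribution conditional on the first, so the two-way
    relationship is roughly preserved without copying whole individuals
    across a wider set of variables.
    """
    rng = random.Random(seed)
    first_values = [first for first, _ in records]
    second_given_first = defaultdict(list)
    for first, second in records:
        second_given_first[first].append(second)

    synthetic = []
    for _ in range(n):
        first = rng.choice(first_values)
        second = rng.choice(second_given_first[first])
        synthetic.append((first, second))
    return synthetic

# Invented toy records of (sex, limiting long-term illness); not SLS data.
toy_records = [
    ("F", "no"), ("F", "no"), ("F", "yes"),
    ("M", "no"), ("M", "yes"), ("M", "yes"),
]
synthetic_records = synthesise(toy_records, n=100)
```

With only two variables every synthetic pair necessarily also occurs in the toy input; with many variables synthesised one after another, whole synthetic rows generally no longer correspond to any real individual. Code developed on such data can later be rerun unchanged on the real extract in the safe setting.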
Coming soon...
Access to SLS-like data on own computer:
  Spine datasets available soon via CALLS Hub and SLS website
  Following formal approval, bespoke synthetic data should be available for SLS users in 2015
For more information
  SLS Website – sls.lscs.ac.uk – Twitter
  SYLLS Website – data-estimation-for-uk-longitudinal-studies/