1
Anonymising quantitative data
Dr Sharon Bolton
UK Data Service
UK Data Archive, University of Essex
Anonymising Research Data workshop
Dublin, 22 June 2016
2
The UK Data Service
Single point of access to a wide range of social science data: ukdataservice.ac.uk
Funded by the ESRC to serve the academic community: training and guidance; UK Data Archive established 1967
Used by academic researchers and students; government analysts; charities; business; research centres; think tanks
Survey microdata; cohort studies; international macrodata; census data; qualitative/mixed methods data
Support and guide data creators, including disclosure review (anonymisation) and preparation for archiving
3
Protecting confidentiality: the ‘5 Safes’
Five guiding principles:
Safe people - educate researchers to use data safely
Safe projects - research projects for ‘public good’
Safe settings - SecureLab system for sensitive data
Safe outputs - SecureLab projects outputs screened
Safe data - treat the data to protect respondent confidentiality
For this session, we will concentrate (mostly) on Safe data
4
Data collection: planning
Explain to respondents what archiving entails and gain agreement for data sharing – informed consent
Think about disclosure risks before starting – what kind of information do you need to collect?
Direct identifiers include: names; addresses; telephone numbers; photos; (perhaps) IP addresses; do you really need them?
Unless explicit consent is obtained for sharing, direct identifiers should always be removed from data
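As a minimal sketch of removing direct identifiers before sharing, the Python below drops a set of identifier fields from each record. The field names and sample records are hypothetical and would need to match the variables in your own dataset.

```python
# Hypothetical field names; adjust to the variables in your own dataset.
DIRECT_IDENTIFIERS = {"name", "address", "telephone", "photo", "ip_address"}


def remove_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records with direct-identifier fields dropped."""
    return [
        {field: value for field, value in record.items() if field not in identifiers}
        for record in records
    ]


# Invented example records, not real respondents.
respondents = [
    {"name": "A. Example", "telephone": "000 0000", "age": 34, "region": "East"},
    {"name": "B. Example", "telephone": "000 0001", "age": 51, "region": "North"},
]

shareable = remove_direct_identifiers(respondents)
print(shareable)
```

The function returns new dictionaries rather than editing the originals, so the full records can stay in a secure location while only the stripped copy is prepared for archiving.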
5
Anonymising data: indirect identifiers
Indirect identifiers include:
Sensitive information: health information/medical conditions; crime victimisation/offending; drug/alcohol use etc.
‘Less sensitive’ information: age/birth date; educational characteristics; employment details; religious affiliation; household size; geographic area
Look at demographics in combination (e.g. demographics + geographies)
Text/string variables – too detailed?
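One way to look at demographics in combination is to count how many records share each combination of indirect identifiers and flag the rare ones. The sketch below assumes a threshold of 2 (that is, it flags unique combinations). The variable names, sample records and threshold are illustrative assumptions, not a recommended standard.

```python
from collections import Counter


def rare_combinations(records, keys, threshold=2):
    """Return key combinations shared by fewer than `threshold` records."""
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n < threshold}


# Invented example records with hypothetical variable names.
records = [
    {"age_band": "16-20", "religion": "Catholic", "area": "Colchester"},
    {"age_band": "16-20", "religion": "Catholic", "area": "Colchester"},
    {"age_band": "51+", "religion": "Moravian", "area": "Colchester"},
]

flagged = rare_combinations(records, keys=("age_band", "religion", "area"))
print(flagged)
```

A combination that appears only once is a candidate for further aggregation, since a single respondent could be picked out by those characteristics taken together.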
6
Anonymising indirect identifiers
Aggregate categories to reduce precision
Band ages, incomes, expenditure, etc. to disguise outliers
Use standard coding frames – e.g. SOC2010
Generalise meaning of detailed text
Document the changes you make
Talk to other researchers, archives, data services
Published guides:
UCD Research Data Management Guide
ONS Disclosure control guidance for microdata produced from social surveys
7
Anonymising data: new developments and tools
Statistical Disclosure Control (SDC) software is available:
mu-Argus standalone software package recommended by Eurostat for government statisticians
software and manual:
R tool - SDCMicro (GUI)
Software, manual:
new documentation being developed by UK Data Service, working with R developers
8
Quiz 1: disclosive text in job title
                                                       Frequency  Valid Percent
nurse                                                  73         73.0
carer for elderly man                                  1          1.0
hospital ward cleaner
social science researcher
head of dental practice                                2          2.0
cleaner in electronics factory
Financial Director, Sunnyview Care Home, Colchester
general manager
GP
Manager, Cotterill Village Stores
works in electronics factory
on benefits, not working
police officer
consultant, geriatric psychiatry
Reetired
retired
Retired
retirement
geography teacher
Teacher, music
Seondary school teeacher
unemployed
web designer
Total                                                  100        100.0
9
Quiz 1: jobs coded with SOC2010
Job title: SOC2010           Frequency  Valid Percent
1131: Director, financial    1          1.0
1171: Manager, general
1190: Manager, retail
2231: Nurse                  73         73.0
2426: Researcher
2215: Dentist                2          2.0
2211: Doctor, medical
3312: Officer, police
2314: Teacher, secondary     3          3.0
2137: Designer, web
6145: Carer
9139: Worker, factory
9233: Cleaner
Retired                      4          4.0
Unemployed
Total                        100        100.0
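To illustrate how free-text job titles might be mapped to a standard coding frame, here is a toy keyword lookup in Python. It uses a handful of the SOC2010 codes shown on the slide. The keyword list and the matching rule are assumptions made for illustration, and real coding would rely on the official coding index rather than substring matching.

```python
# Toy lookup: keyword -> (SOC2010 code, label), using codes shown on the slide.
# The keywords and first-match rule are illustrative assumptions.
SOC_KEYWORDS = [
    ("nurse", ("2231", "Nurse")),
    ("teacher", ("2314", "Teacher, secondary")),
    ("police", ("3312", "Officer, police")),
    ("web designer", ("2137", "Designer, web")),
    ("carer", ("6145", "Carer")),
    ("cleaner", ("9233", "Cleaner")),
]


def code_job_title(title):
    """Return (code, label) for the first keyword found, else (None, 'Uncoded')."""
    normalised = " ".join(title.lower().split())
    for keyword, coded in SOC_KEYWORDS:
        if keyword in normalised:
            return coded
    return (None, "Uncoded")


for raw in ["Teacher, music", "hospital ward cleaner", "Seondary school teeacher"]:
    print(raw, "->", code_job_title(raw))
```

The misspelt title comes back as uncoded, which is the useful behaviour here: entries the lookup cannot place are left for manual review instead of being guessed.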
10
Quiz 2: detailed religion categories
Religious affiliation    Frequency  Valid Percent
1 Protestant             41         41.4
2 Anglican               4          4.0
3 Catholic               26         26.3
4 Muslim                 8          8.1
5 Sikh                   5          5.1
6 Jehovah's Witness      6          6.1
7 Methodist              1          1.0
8 Mormon
9 Baptist
10 Buddhist              3          3.0
11 None
12 No religion
13 Moravian
Total                    99         100.0
11
Quiz 2: religion categories aggregated
Religious affiliation    Frequency  Valid Percent
1 Protestant             49         49.0
3 Catholic               26         26.0
4 Muslim                 8          8.0
5 Sikh                   5          5.0
6 Other religion         10         10.0
7 No religion            2          2.0
Total                    100        100.0
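The aggregation can be expressed as a recode table from detailed to broad categories. In the Python sketch below, the broad codes and labels come from the aggregated table. Which detailed category goes into which broad group is not stated on the slide, so that mapping is an assumption, and the sample responses are invented.

```python
from collections import Counter

# Detailed category -> (aggregated code, aggregated label).
# The grouping itself is an assumption; only the aggregated labels are from the slide.
RECODE = {
    "Protestant": (1, "Protestant"),
    "Anglican": (1, "Protestant"),
    "Methodist": (1, "Protestant"),
    "Baptist": (1, "Protestant"),
    "Moravian": (1, "Protestant"),
    "Catholic": (3, "Catholic"),
    "Muslim": (4, "Muslim"),
    "Sikh": (5, "Sikh"),
    "Jehovah's Witness": (6, "Other religion"),
    "Mormon": (6, "Other religion"),
    "Buddhist": (6, "Other religion"),
    "None": (7, "No religion"),
    "No religion": (7, "No religion"),
}


def aggregate_religion(values):
    """Recode detailed categories and return frequencies per aggregated label."""
    return Counter(RECODE[value][1] for value in values)


# Invented responses for illustration.
sample = ["Anglican", "Moravian", "Catholic", "Buddhist", "Mormon", "None", "No religion"]
frequencies = aggregate_religion(sample)
print(frequencies)
```

Keeping the recode table in one place also serves as documentation of the change, since it records exactly which detailed categories were merged.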
12
Quiz 3: age in years
Age in years   Frequency  Valid Percent
16             3          3.0
17
18             9          9.0
19
20             16         16.0
21             4          4.0
22             2          2.0
23
24
25
26
27
28
29
30
31             1          1.0
32
40             11         11.0
41
42
43
49
50             13         13.0
51
60
61
62
63
64
Total          100        100.0
13
Quiz 3: banded age
Age (banded)   Frequency  Valid Percent
1 16-20        40         40.0
               22         22.0
               13         13.0
               19         19.0
               6          6.0
Total          100        100.0
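Banding can be sketched in Python with a sorted list of band lower bounds. Only the 16-20 band label is visible on the slide. The remaining band boundaries, the open-ended top band and the sample ages below are assumptions for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Lower bound of each band. Only "16-20" appears on the slide;
# the other bands are hypothetical.
BAND_STARTS = [16, 21, 31, 41, 51]
BAND_LABELS = ["16-20", "21-30", "31-40", "41-50", "51+"]


def band_age(age):
    """Map an age in years to its band label."""
    if age < BAND_STARTS[0]:
        raise ValueError(f"age {age} is below the lowest band")
    # bisect_right counts how many band starts are <= age.
    return BAND_LABELS[bisect_right(BAND_STARTS, age) - 1]


# Invented ages for illustration.
ages = [16, 20, 21, 30, 31, 49, 64]
banded = Counter(band_age(age) for age in ages)
print(banded)
```

An open-ended top band such as "51+" means the oldest respondents, who are often the outliers, are not shown with an exact age.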
14
Access control
Don’t over-anonymise - find a balance between protecting respondents’ confidentiality and maintaining the research usability of the data
Can’t fully anonymise data without removing all the useful detail?
Go back to the 5 Safes – think about access control: Safe people, Safe settings, Safe outputs
15
Access control
At UK Data Service, data available under 3 access levels:
OPEN – open public access
SAFEGUARDED – downloadable, but use is traceable
  Registered users only (agree not to try to identify any individual respondents)
  Special agreements/licence: permission-only access; approved projects – usage agreed in advance
CONTROLLED – accredited users take a further training course
  Access via on-site safe setting or virtual secure environment (SecureLab)
  Outputs disclosure-checked before publication
16
Anonymising quantitative data: summary
Informed consent
Think about level of detail needed before data collection
Remove direct identifiers
Check and treat indirect identifiers to reduce disclosure risk
Document your changes
Balance anonymisation with access control to preserve data usability
17
Questions? Guidance on anonymisation:
UCD: UKDS: Managing and Sharing Research Data book