Statistical Disclosure Control for the 2011 UK Census
Keith Spicer
Office for National Statistics
Overview
Disclosure Risk
UK Census – context
Evaluation of methods
Proposed strategy
Further work
What is disclosure risk?
There is a disclosure risk when information is published that could allow an intruder to determine the identity or particulars of:
– an individual
– a household or family
– a business
– or another statistical unit
Statistical Disclosure Control
Statistical Disclosure Control (SDC) involves either:
– introducing sufficient ambiguity / damage into, or
– reducing level of detail of
published statistics so that the risk of disclosing confidential information is reduced to an acceptable level
and / or:
– controlling access to data
Risk – Utility balance
[Figure: Disclosure Risk (information about confidential units) against Data Utility (information about legitimate items), each running from Low to High; marked: Original Data, Released Data, No data, Maximum Tolerable Risk]
UK Census - Context (1)
2001 – random record swapping
Small cell adjustment (SCA) applied in E, W, NI, not in Scotland
Lack of harmonisation and late changes
SCA protected individual tables, but some remaining risk through differencing
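The differencing risk mentioned above can be illustrated with a minimal sketch. The two tables, their categories and all counts are invented for illustration and are not census figures; the point is only that two tables that each look safe can be subtracted to expose a small cell.

```python
# Hypothetical published counts for two nearly identical areas.
# All names and numbers are invented for illustration.
table_large_area = {"owned": 120, "rented": 45}  # e.g. a ward
table_sub_area = {"owned": 119, "rented": 43}    # e.g. the ward minus a few streets


def difference_tables(larger, smaller):
    """Subtract the smaller area's counts from the larger area's counts."""
    return {category: larger[category] - smaller[category] for category in larger}


# Counts for the 'sliver' of geography that lies between the two areas.
sliver = difference_tables(table_large_area, table_sub_area)
print(sliver)  # {'owned': 1, 'rented': 2}
```

Neither published table contains a small count, yet the difference isolates a cell of 1, which is why protecting tables one at a time can leave some residual risk.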
UK Census - Context (2)
RsG agreement November 2006
– Small cell counts permitted as long as there is ‘sufficient uncertainty’
– Main risk is attribute disclosure – finding out something new about an individual
Evaluation to short-list
– Qualitative – including user acceptability, additivity, consistency, feasibility
– 3 methods: record swapping; over imputation; IACP method (post-tabular), based on ABS
UK Census - Context (3)
Short-list of 3 methods evaluated
Quantitative assessment using 2001 Census data, using different measures of risk and utility
– Protection against disclosure (and differencing)
– Measures of association
– Effect on totals & sub-totals
– Variances
– Rankings
Revisit qualitative aspects
Proposed Strategy – Record Swapping
Proposed Strategy: Record Swapping
Swap the geographical location of a small number of households
Households are paired according to similar characteristics (to avoid too much data distortion)
Creates uncertainty in the data
Can target risky records
Record swapping
[Diagram: two records, one in Area A and one in Area B]
Treatment:
– Find a different geographical area
– Identify another individual in a different area with the same characteristics on matching variables
– Swap the two records
Characteristics (first record): Age: 22, Sex: Male, Marital Status: Single, Economic activity: Student, Tenure: Rented
Characteristics (second record): Age: 22, Sex: Male, Marital Status: Single, Economic activity: Active, Tenure: Owned
Matches all variables except economic activity and tenure
Swap records
Pre-tabular method protects underlying microdata
Protected tables will be additive and consistent
Minimise bias by use of matching variables
Vary swap rates by geographical level
Relatively simple to understand and implement
Some risks from population uniques at higher geographies (in microdata)
Need consideration for ‘special outputs’
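The swapping treatment described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the ONS implementation: the function name `swap_records`, the `profile` helper, the record layout, the matching variables and the swap rate are all hypothetical, and selection here is purely random rather than targeted at risky records.

```python
import random
from collections import Counter


def swap_records(households, match_vars, swap_rate, seed=0):
    """Swap the 'area' of a random sample of households with a matched partner.

    Hypothetical sketch: a partner must be in a different area and agree on
    every variable in match_vars. Each record is swapped at most once.
    """
    rng = random.Random(seed)
    result = [dict(h) for h in households]  # copies; the input is not modified
    n_select = round(swap_rate * len(result))
    used = set()
    for i in rng.sample(range(len(result)), n_select):
        if i in used:
            continue
        partners = [
            j for j, other in enumerate(result)
            if j not in used
            and other["area"] != result[i]["area"]
            and all(other[v] == result[i][v] for v in match_vars)
        ]
        if not partners:
            continue  # no match available; leave this record in place
        j = rng.choice(partners)
        result[i]["area"], result[j]["area"] = result[j]["area"], result[i]["area"]
        used.update((i, j))
    return result


def profile(households, keys):
    """Count households by the given combination of variables."""
    return Counter(tuple(h[k] for k in keys) for h in households)


# Invented example data: two areas, two matched pairs.
households = [
    {"area": "A", "size": 2, "age_band": "16-24", "tenure": "rented"},
    {"area": "B", "size": 2, "age_band": "16-24", "tenure": "owned"},
    {"area": "A", "size": 4, "age_band": "35-49", "tenure": "owned"},
    {"area": "B", "size": 4, "age_band": "35-49", "tenure": "owned"},
]
swapped = swap_records(households, ["size", "age_band"], swap_rate=1.0)
```

Because partners agree on the matching variables, area counts for those variables are unchanged after swapping, while counts for non-matching variables such as tenure can move between areas. That is the source of both the uncertainty and the limited distortion.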
Record swapping – further work
Determine swapping rates
– Set tolerable risk threshold
– Vary by geographical level
Targeted or random
– How to determine ‘risky’ records
Take into account levels of imputation
Interaction with output design
– Flexible table / hypercube solutions – how much detail can we have in a hypercube?
– Additional ‘rules’ around table design
– Geography – providing ‘exact fit’?
Record swapping – further work (continued)
Protecting outputs for special populations
– Workplace zones
– Communal establishments
Origin-destination tables
– Protection of most detailed via licensing
– Consideration of what can be ‘public use’
Microdata
– Suite of products
– Detailed content
Record swapping will be ‘smarter’ in 2011 – targeting risky records at low geographies
Summary
Extensive evaluation of SDC methods
Record swapping primary strategy for tabular outputs
– ‘Smarter’
Further work continues
Output Geography
Andy Tait/Ian Coady
ONS Geography
Overview
Background
– 2001 Output Geography - OAs
– Neighbourhood Geographies - SOAs
What has changed since 2001?
2011 Requirements
– 2007 Geography Consultation – what you said
– Resulting Policy
Work in progress
– OA/SOA Maintenance Research project
– Workplace Zones
2009 Geography Consultation
2001 Output Areas - why
Census output geography separated from data collection geography
A geography created from Census data
Consistent size in population/no. of households
Socially homogeneous
Meets confidentiality thresholds
Aligns with administrative boundaries
Consistent throughout UK
2001 Output Areas
175,000 output areas
Mean 297 persons; 123 households
Freely available digital boundary data
Building blocks for “neighbourhood” geographies: Super Output Areas (LSOAs, MSOAs)
Image courtesy of David Martin. This work is based on data provided through EDINA UKBORDERS with the support of the ESRC and JISC and uses boundary material which is copyright of the Crown.
2001 Output Areas – achieved size
[Figure: achieved size of 2001 Output Areas, for households (hhds) and population (Pop)]
Super Output Areas (SOAs)
Created 2004, for Neighbourhood Statistics
Groupings of Output Areas
Layered hierarchy – lower, middle, upper layers
Each layer with size thresholds and targets
Offer levels of statistical reporting
Lower SOAs: approx. 35,000; average population ≈ 1,500; created automatically
Middle SOAs: approx. 7,000; average population ≈ 7,200; created automatically, then modified locally
Upper SOAs not created
[Map: Index of Deprivation 1998, shown for Wards 1998]
[Map: Index of Deprivation 2004, shown for Lower Layer SOAs 2004]
Changes since 2001 – population
Population growth, especially migration
More and smaller households
Newly built properties
– Greenfield/new land
– Brownfield/in-filling
Sub-division of existing properties
Changing socio-economic characteristics of areas
Changes since 2001 – geography
Postcodes
Census address register
Ward/parish changes since 2003
Administrative re-organisation
How much change by 2011
Population thresholds:
OAs: lower threshold 100 people; upper threshold 625 people (2 * target); 2.5 * household thresholds
LSOAs: lower threshold 1000 people; upper threshold 3000 people (2 * target); 2.5 * household thresholds
MSOAs: lower threshold 5000 people; upper threshold [figure missing] people (2 * target); 2.5 * household thresholds
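A threshold check of this kind, and the cross-tabulation of 2001 status against 2005 status used on the following slides, can be sketched as below. This is an illustrative sketch only: the function names and sample populations are invented, MSOAs are left out because their upper threshold is not given here, and treating the thresholds as inclusive bounds is an assumption.

```python
from collections import Counter

# Population thresholds (people) quoted above for OAs and LSOAs.
# MSOAs are omitted because the upper figure is not available.
THRESHOLDS = {"OA": (100, 625), "LSOA": (1000, 3000)}


def classify(population, zone_type):
    """Return 'below', 'within' or 'above' (bounds assumed inclusive)."""
    lower, upper = THRESHOLDS[zone_type]
    if population < lower:
        return "below"
    if population > upper:
        return "above"
    return "within"


def breach_table(zones, zone_type):
    """Cross-tabulate 2001 status against 2005 status.

    zones: iterable of (population_2001, population_2005) pairs.
    """
    return Counter(
        (classify(p2001, zone_type), classify(p2005, zone_type))
        for p2001, p2005 in zones
    )


# Invented populations for four hypothetical output areas.
sample_oas = [(297, 310), (120, 95), (600, 700), (90, 130)]
table = breach_table(sample_oas, "OA")
for (status_2001, status_2005), count in sorted(table.items()):
    print(f"2001 {status_2001:6} -> 2005 {status_2005:6}: {count}")
```

With mid-year estimates substituted for the invented pairs, the same tabulation would show how many zones have moved outside their thresholds since 2001.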
How much change by 2011? Threshold breaches, based on mid-year population estimates. Output Areas: [Table: rows 2001 below / within / above, totals, %; columns 2005 below / 2005 within / 2005 above / 2001 totals; cell values missing]
How much change by 2011? Lower Layer Super Output Areas: [Table: rows 2001 below / within / above, totals, %; columns 2005 below / 2005 within / 2005 above / 2001 totals; cell values missing]
How much change by 2011? Middle Layer Super Output Areas: [Table: rows 2001 below / within / above, totals, %; columns 2005 below / 2005 within / 2005 above / 2001 totals; cell values missing]
Key messages
Most output areas (and LSOAs, MSOAs) unlikely to have breached thresholds by 2011
BUT, changes clustered geographically, so could breach badly in some areas
Some areas already known to be problematic in 2001
Small Area Geography Consultation 2007
Strong support for:
– Stability with 2001 (but reflect change!)
– Easy/free licensing of boundaries
– Mean high water boundary set
– England/Scotland alignment
Some support (in descending order) for:
– Aligning boundaries to real world features
– Separating communal establishments
– Retaining postcode blocks v street blocks
– Building a separate set of zones based on workplace
– Building separate OAs with no population
– Building an Upper layer of SOAs
Resulting in ONS policy for 2011 Geography
Change only for significant population change:
– split where population too big
– merge where population too small
No more than 5% overall change (could be well under)
Assess methods of splitting/merging
No real world alignment for its own sake
Consider redesign of extreme cases where unfit as statistical zone
No separate “empty” OAs
Align Scotland and England at the border
Mean high water boundaries as well
Investigate new workplace geography linked to OAs
Keep licensing free, get better deal for commercial use
Exact count outputs for OAs and other geographies, e.g. wards – a matter for disclosure control
OA/SOAs – some “not fit for purpose”?
OA/SOAs – “not fit for purpose”?
Challenges for 2011 output geography design
Stability at what level? OA, LSOA, MSOA?
Building blocks? Postcodes or street blocks?
Constrain within wards, LADs?
Same design criteria as 2001?
BUT: balance against licensing issues
Automation of processes
Census2011Geog project – Southampton University
ESRC funded project
Develop automated procedures for maintaining (splitting, merging, re-designing) 2001 output geographies to create 2011 output geographies for E&W
Assess implications of using different building blocks (e.g. postcodes, street blocks) for maintenance
Work extended to January 2010
Automated maintenance procedures
[Flowchart] For a 2001 LAD/UA:
Inputs: 2001 OAs, 2001 LSOAs, postcodes/street blocks
Threshold check: above upper threshold / within thresholds / below lower threshold
– Split (aggregate postcodes/street blocks)
– Merge (merge 2001 OAs)
2011 OAs
Append 2011 OAs
Merge all 2011 OAs from all LADs/UAs
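The split/merge logic in the procedure above can be sketched as follows. This is a hedged illustration, not the project's algorithm: the function names, the `target` parameter, the "keep" label for zones within thresholds and the greedy aggregation rule are all assumptions, and the sketch ignores contiguity, shape and social homogeneity, which a real zone design procedure would have to respect.

```python
def plan_actions(zone_pops, lower, upper):
    """Decide, per zone, whether to split, merge or keep it unchanged."""
    actions = {}
    for zone_id, pop in zone_pops.items():
        if pop > upper:
            actions[zone_id] = "split"
        elif pop < lower:
            actions[zone_id] = "merge"
        else:
            actions[zone_id] = "keep"
    return actions


def split_zone(blocks, lower, target):
    """Greedily aggregate building blocks (e.g. postcodes) into new zones.

    blocks: ordered list of (block_id, population) pairs.
    A new zone is closed once its population reaches the hypothetical target;
    a remainder smaller than `lower` is folded into the last zone.
    """
    zones, current, total = [], [], 0
    for block_id, pop in blocks:
        current.append(block_id)
        total += pop
        if total >= target:
            zones.append(current)
            current, total = [], 0
    if current:
        if total < lower and zones:
            zones[-1].extend(current)
        else:
            zones.append(current)
    return zones


# Invented example: three OAs checked against the OA thresholds (100, 625).
plan = plan_actions({"OA1": 750, "OA2": 80, "OA3": 300}, lower=100, upper=625)

# OA1 (750 people) is split using invented postcode populations.
oa1_blocks = [("P1", 200), ("P2", 150), ("P3", 180), ("P4", 160), ("P5", 60)]
new_zones = split_zone(oa1_blocks, lower=100, target=300)
print(plan)
print(new_zones)  # [['P1', 'P2'], ['P3', 'P4', 'P5']]
```

Here the oversized zone becomes two zones of 350 and 400 people, both within the thresholds; the undersized OA2 would then be merged with a neighbouring zone in a separate step.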
Absolute population change (mid-year estimates): Camden [Map; legend: Increase / Decrease]
Absolute population change (mid-year estimates): Liverpool [Map; legend: Increase / Decrease]
Absolute population change (mid-year estimates): Manchester [Map; legend: Increase / Decrease]
More information on OA Maintenance project at s.ac.uk
Workplace Zones
OAs based on where people live, not work – can be unsuitable for workplace statistics
Some OAs contain no/few businesses; some contain many businesses or a large employer, e.g. business parks, City of London
Workplace Zones project looking at splitting/merging OAs for a new geography nesting with OAs
User Group established
Pilot WZs to be created/evaluated 2010 Q2
2009 Output Geography consultation
Need for an Upper layer SOA
Workplace Zone requirements
Provide instances of OAs/SOAs that are unfit as a statistical geography
– Priority instances
– Not useful for analysis due to their design
– ONS panel to consider redesign
2009 Output Geography consultation (continued)
Census Geography consultation part of Census Outputs consultation
Runs for three months from November 2009
Follow up submissions January to May 2010
Conclusions contd
5. Greater flexibility in outputs
i. Hypercube research
6. Multiple population bases
7. Geography
i. Workplace zones
ii. Possible production of data on two geographical bases
8. Application Programme Interface (API)
i. Access to census data
ii. Functionality of census data
Conclusions contd
9. Increased user input in consultation process
i. Rounds of consultation
ii. Online survey / persona research
iii. Methods of engaging users
– Topic group experts
– Advisory groups
– Working groups
– Consulting users and distributors of census data
– Academic groups
– Direct consultation including output consultation events and internet