Anonymisation – a promise now broken?
Ross Anderson, Cambridge
Cathie Marsh Lecture, RSS, 13/11/2014
Outline of talk
- How the statistician's view of inference control differs from the security engineer's
- Trackers
- Context and threat evolution
- Anonymity sets and privacy sets
- Active attacks
- Regulatory failure
- Industry views and ways forward
Common ground
Over the past 30 years we've both used techniques like:
- Query set size controls
- Perturbation
- Trimming outliers (e.g. the one guy with HIV in Chichester in 1996)
- Using larger scales for rarer attributes (strokes by medical practice, liver transplants nationally)
The main difference: security engineers specialise in system-level adversarial thinking
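A query-set-size control can be sketched in a few lines. This is a minimal illustration, not any agency's actual implementation; the records, field names and threshold are all invented:

```python
# Minimal sketch of a query-set-size control: refuse to answer any
# statistical query whose query set is smaller than a threshold.
# Dataset, field names and threshold are illustrative assumptions.

THRESHOLD = 3  # minimum query set size before we answer

records = [
    {"rank": "professor", "sex": "F", "salary": 90000},
    {"rank": "professor", "sex": "M", "salary": 80000},
    {"rank": "professor", "sex": "M", "salary": 85000},
    {"rank": "lecturer",  "sex": "F", "salary": 50000},
]

def mean_salary(predicate):
    """Answer an average-salary query, or refuse if the query set is too small."""
    matches = [r["salary"] for r in records if predicate(r)]
    if len(matches) < THRESHOLD:
        return None  # refused: query set below threshold
    return sum(matches) / len(matches)
```

Here `mean_salary(lambda r: r["sex"] == "F" and r["rank"] == "professor")` is refused (query set of one), while a query over everyone is answered. As the next slide shows, this control on its own is not enough.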
The security researcher's view
- The 1980 US census moved from static totals and samples to online queries
- Dorothy Denning wondered whether she could work out her boss's salary
- Basic idea: to find the salary of the female professor of computer science, ask 'average salary, professors?' then 'average salary, male professors?' (or even non-professors)
- Such 'tracker attacks' are absolutely pervasive
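The tracker attack above can be sketched directly: two aggregate queries, each of which individually passes a query-set-size control, whose difference pins down one person. The numbers are hypothetical:

```python
# Sketch of a Denning-style tracker attack: infer one person's salary
# from two permitted aggregate queries. All figures are invented.

records = [
    {"rank": "professor", "sex": "F", "salary": 95000},  # the target
    {"rank": "professor", "sex": "M", "salary": 82000},
    {"rank": "professor", "sex": "M", "salary": 78000},
    {"rank": "professor", "sex": "M", "salary": 86000},
]

def query(pred):
    """A 'safe' interface: sums and counts only, query sets of size >= 3."""
    qs = [r["salary"] for r in records if pred(r)]
    if len(qs) < 3:
        raise ValueError("query set too small")
    return sum(qs), len(qs)

# Neither query names the target, and both pass the size control...
all_sum, all_n = query(lambda r: r["rank"] == "professor")  # 4 people
male_sum, male_n = query(lambda r: r["sex"] == "M")         # 3 people

# ...yet their difference is exactly the female professor's salary.
target_salary = all_sum - male_sum
print(target_salary)  # 95000
```

The direct query for the lone female professor would be refused, but the tracker recovers her salary anyway, which is why query-set-size controls alone fail.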
The security researcher's view (2)
Contextual knowledge is really hard to deal with! For example, in the key UK law case, Source Informatics (sanitised prescribing data):

             Week 1   Week 2   Week 3   Week 4
  Doctor 1     17       21       15       19
  Doctor 2     20       14        3       25
  Doctor 3     18       26
The security researcher's view (3)
- Context enables deductive re-identification
- Postcode sector plus year of birth plus almost anything else (e.g. date of discharge)
- 'Show me all patients who've had an umbilical hernia repair and a left leg fracture, tibia only'
- 'Show me all 42-year-old women with 9-year-old daughters where both have psoriasis'
- If you link episodes into longitudinal records, most patients can be re-identified
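Deductive re-identification is just a join on quasi-identifiers. A minimal sketch, with entirely invented records standing in for an anonymised release and public context such as an electoral roll:

```python
# Sketch of deductive re-identification: join an 'anonymised' release to
# outside context on quasi-identifiers. All records are invented.

anonymised = [  # names stripped, but quasi-identifiers left in
    {"postcode_sector": "CB3 0",  "yob": 1956, "diagnosis": "psoriasis"},
    {"postcode_sector": "SW1A 1", "yob": 1972, "diagnosis": "hernia repair"},
]

public_context = [  # e.g. electoral roll, social media profiles
    {"name": "Alice", "postcode_sector": "CB3 0",  "yob": 1956},
    {"name": "Bob",   "postcode_sector": "SW1A 1", "yob": 1972},
]

def reidentify(release, context):
    """Link each released record to context; a unique match re-identifies it."""
    hits = []
    for a in release:
        key = (a["postcode_sector"], a["yob"])
        matches = [c["name"] for c in context
                   if (c["postcode_sector"], c["yob"]) == key]
        if len(matches) == 1:  # unique match => re-identified
            hits.append((matches[0], a["diagnosis"]))
    return hits

print(reidentify(anonymised, public_context))
```

With longitudinal records the join key gets richer with every linked episode, which is why linkage makes most patients identifiable.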
The security researcher's view (4)
- The dynamic aspects are really important! While a census is a snapshot every ten years, a medical database is constantly updated
- If you see the anonymised version but have access to some real-world context, fine-grained timing provides a fat covert channel
- E.g. the deCODE database in Iceland in 1998 had near-real-time updates
The security researcher's view (5)
- The context is also dynamic! The scale of the data available for cross-referencing is growing by orders of magnitude
- Many websites are going 'social'
- Medical records are acquiring genomic data, as are 'hobby' websites like 23andMe
- The move from PCs/laptops to phones/tablets makes movement and location data pervasive
Some terminology
- My anonymity set is the set of data subjects from whom I cannot be distinguished
- Depending on the application I might want an anonymity set size of 2, 3, 6, 50, 10000…
- My privacy set is the set of people whose opinion I care about
- This might be the Dunbar number, my Facebook friends, or the whole population if I'm a celeb
Privacy for mathematicians
- All I need is that no-one in my privacy set has an anonymity set for me that's below the threshold for the sensitive attribute
- Problem 1: the privacy set is dynamic. If my kid gets kidnapped I'm an instant celeb…
- Problem 2: with every successive query, the anonymity set shrinks until all the headroom – the privacy budget – is spent (the idea behind 'differential privacy' from Microsoft)
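Problem 2 can be made concrete: each answered query tells an observer which partition of the population the target falls in, and intersecting those sets shrinks the anonymity set. A toy sketch with an invented population of 10,000 and three arbitrary attribute queries:

```python
# Sketch: each answered query shrinks the target's anonymity set.
# Population and query predicates are invented for illustration.

population = {f"person{i}" for i in range(10000)}

def observe(anonymity_set, query_set):
    """The attacker learns the target lies in query_set; intersect."""
    return anonymity_set & query_set

anon = set(population)
# Three successive queries, each standing in for an attribute the
# attacker learns (postcode parity, age band, a rare diagnosis):
anon = observe(anon, {p for p in population if int(p[6:]) % 2 == 0})
anon = observe(anon, {p for p in population if int(p[6:]) < 100})
anon = observe(anon, {p for p in population if int(p[6:]) % 10 == 4})

print(len(anon))  # anonymity set size after three queries: 10
```

Three queries take the anonymity set from 10,000 down to 10; a few more and it reaches 1. Differential privacy reframes this shrinkage as spending a fixed privacy budget.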
Privacy for lawyers
- For 35 years, we statisticians and security engineers have known that anonymity is hard (albeit with different terminologies and different research literatures)
- Lawyers and policymakers don't read maths or CS, so happy ignorance prevailed
- Then came Paul Ohm, 'Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization', UCLA Law Review 2010
Privacy for CEOs
- PCAST (the President's Council of Advisors on Science and Technology) report on big data
- The spread of audio and gesture interfaces means microphones and cameras in all inhabited spaces, reporting to 'the cloud'
- Anonymisation? 'PCAST does not see it as a useful basis for policy'
- It wanted usage controls only; the ECJ disagreed! The firms processing the data carry liability
Privacy for politicians
- First DP Act: 'Maggie doesn't approve of data privacy. It's a German idea, like ID cards. But we have to do some to keep the Euros happy'
- Second DP Act: failure to transpose Recital 26
- ICO Anonymisation Code of Practice
- Ever harder pushback from Europe:
  - Article 29 Working Party Opinion 05/2014
  - DP Regulation, currently in trilogue
  - Article 8 ECHR, precedents such as I v Finland
What goes wrong (1)
- Cameron policy, announced January 2011: make 'anonymised' data available to researchers, both academic and commercial, but with an opt-out
- We'd already had a laptop stolen in London with 8.63m people's 'anonymised' records on it
- In September 2012, CPRD went live – a gateway for making anonymised data available from (mostly) secondary care (now online in the USA!)
- GPES was to hoover up GP data into care.data
Advocating anonymisation
And 'transparency'
CPRD
- The Clinical Practice Research Datalink, run by the MHRA, makes some data available to researchers (even to guys like me :-)
- I made a freedom of information request for the anonymisation mechanisms
- Answer: sorry, that would undermine security
- Never heard of Kerckhoffs' principle?
- Search for me, cprd on whatdotheyknow.com
The row over HES
- Hospital Episode Statistics (HES) has a record of every finished consultant episode going back 15 years (about a billion in total)
- Mar 13: formal complaint to the ICO that PA Consulting put HES data in the Google cloud, despite many rules against moving identifiable NHS data offshore
- Apr 3: HSCIC reveals that HES data had been sold to 1,200 universities, firms and others since 2013
The row over HES (continued)
- Some HES records have name, address, …
- Some have only postcode, dob, …
- Some have these removed but still carry a 'HESID', which usually contains postcode and dob
- Even if the HESID were encrypted, what about 'cardioversion, Hammersmith, 19 Mar 2003'?
- Yet the DoH describes pseudonymised HES data as 'non-sensitive'!
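The trouble with a deterministic pseudonym is that it still links every episode belonging to one patient, so identifying a single episode from context unlocks the whole record. A sketch under invented assumptions (the real HESID is not a SHA-256 hash, and the NHS numbers are made up):

```python
# Sketch: a deterministic pseudonym links episodes, so identifying ONE
# episode from context identifies them all. Data and the hash-based
# pseudonym are illustrative assumptions, not the real HESID scheme.
import hashlib

def pseudonym(nhs_number):
    """A stand-in deterministic pseudonym (same input -> same output)."""
    return hashlib.sha256(nhs_number.encode()).hexdigest()[:12]

episodes = [
    {"hesid": pseudonym("943 476 5919"), "procedure": "cardioversion",
     "site": "Hammersmith", "date": "2003-03-19"},
    {"hesid": pseudonym("943 476 5919"), "procedure": "hip replacement",
     "site": "Charing Cross", "date": "2009-07-02"},
    {"hesid": pseudonym("123 456 7881"), "procedure": "appendectomy",
     "site": "Addenbrooke's", "date": "2005-01-11"},
]

# The attacker knows from context that the target had a cardioversion
# at Hammersmith on 19 Mar 2003 -- find that episode, then pull the rest.
known = next(e for e in episodes
             if (e["procedure"], e["site"], e["date"]) ==
                ("cardioversion", "Hammersmith", "2003-03-19"))
linked = [e for e in episodes if e["hesid"] == known["hesid"]]
print([e["procedure"] for e in linked])
```

Encrypting the identifier changes nothing here: linkability, not the identifier's readability, is what does the damage.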
ICO response to NGOs (29 Oct)
We asked: 'Precisely which patient data were stored outside the UK? Did they relate to single episodes or linked records? Did they contain postcode, date of birth, NHS number, or a pseudonym such as an encrypted NHS number?'
ICO: 'Having met with and considered material provided by the HSCIC we reached the conclusion that the information provided by them to PA Consulting did not constitute personal data because a living individual could not be identified from the data.'
ICO response to NGOs (continued)
We pointed out: 'Neither [HSCIC nor PA Consulting] has denied postcode, or year of birth, or the use of a pseudonym that would enable episode records to be linked.'
ICO: 'We also investigated whether this information had been linked to other data to produce "personal data" which was subject to the provisions of the Act. We asked PA Consulting whether they had done this. They confirmed that they had not.'
HES data bought by …
- Report items 40–42, 46–47, 62–66, 95–98, 159–162 …: selling data outside the NHS
- 191–2, 321, 329, 331, 362 …: drug marketing
- 329, 336, …: medical device marketing
- 408: Imperial College with HESID, date of birth, home address, GP practice – still marked 'non-sensitive'
- Many other uses for the data: market analysis, benchmarking, efficiency…
What works and doesn't
- DoH policy – selling gigabytes of lightly anonymised personal records – doesn't work
- Stable and well-designed applications like IMS Health can work, given adversarial testing
- Keeping big data in one place and being really careful, with big thresholds, can be less bad
- Privacy economics must become more sophisticated
- Obscurity doesn't work though. We need to know what's done with our data
To sum up …
- People used to assume that anonymisation and aggregation could protect privacy
- That assumption has been undermined since about 1980 by many technological changes
- First, engineers realised it didn't work. Then lawyers. Now the politicians have screwed up so royally it's just not credible
- Minimal fix: read the Art 29 WP guidance, human rights law, and Kerckhoffs' principle
An ethical way forward
Rather than trying to patch things up with 'risk management', go back to first principles:
- Respect for persons
- Respect for the law, including the ECHR
- Proper consultation with data subjects and others with morally relevant interests
- Proper reporting of uses, including breaches
See the Nuffield Council on Bioethics report on biodata, due out in February, for details.