Download presentation
Presentation is loading. Please wait.
1
Privacy Statistics and Data Linkage
Mark Elliot Confidentiality and Privacy Group University of Manchester
2
Overview The disclosure risk problem Some e-science possibilities
Monitored data access Grid based Data environment Analysis The meaning of privacy
3
Data Data Everywhere… Massive and exponential increase in data; Mackey and Purdam(2002); Purdam and Elliot(2002). These studies have led to the setting up of the data monitoring service. Singer(1999) noted three behavioural tendencies: Collect more information on each population unit Replace aggregate data with person specific databases Given the opportunity collect personal information Purdam and Elliot add: Link data whenever you can
4
Disclosure Risk I: Microdata
5
The Disclosure Risk Problem: Type I: Identification
Identification file Name Address Sex Age .. Sex Age .. Income .. .. Target file ID variables Key variables Target variables
6
Disclosure Risk II: Aggregate Tables of Counts
7
The Disclosure Risk Problem: Type II: Attribution
8
The Disclosure Risk Problem: Type II: Attribution
9
The Disclosure Risk Problem: Type II: Attribution
10
Multiple datasets Disclosure Risk assessment for single datasets is a reasonably understood problem. But what happens with multiple datasets?
11
Data Mining and the Grid
Traditional Data Mining examines and identifies patterns on single (if massive) datasets. But Data Mining is really a method/approach/technology that has been waiting for the grid to happen.
12
Smith and Elliot (2005,06,07) Increases in data availability lead inexorably to an increase in disclosure risk My ability to make linkages (disclosive or otherwise) between datasets X and Y is facilitated by the copresence of dataset Z. It’s all about information!
13
CLEF: Clinical e-Science Framework
A solution involving monitored access
14
CLEF Consortium Approximately 40 Staff from University of Manchester
University of Sheffield University College London University of Brighton Royal Marsden Hospital, London
15
Purpose To provide a system for allowing research access to patient data, whilst maintaining privacy. Patient records Database Texts such as referral letters and other clinical texts Text mining system convert to microdata
16
CLEF one possible architecture
Firewall Raw Data PRE-ACCESS DQI Monitor PRE-ACCESS SDRA/SDC Treated Data PRE-Output DQI Monitor PRE-OUTPUT SDRA/SDC Data Intrusion sentry Workbench
17
Data Sentry: an AI system
Monitors patterns of analytical requests 3 levels: users, institution, world. Looking for intrusive patterns. Numbers of requests Stores Analytical requests for future use.
18
CLEF Proposed Architecture
Firewall Raw Data PRE-ACCESS DQI Monitor PRE-ACCESS SDRA/SDC Treated Data PRE-Output DQI Monitor PRE-OUTPUT SDRA/SDC Data Intrusion sentry Workbench
19
Data Quality User analyses are run on both treated and untreated data.
Outputs are compared and assessed for difference. Major research area – Knowledge Engineering Analyses are stored and collectively run over pre and post SDC files for assessment of impact.
20
The Grid: the context for massive combining.
“Integrated infrastructure for high-performance distributed computation” Cannataro and Talia (2002) Grid middleware handles the technical issues communication, security, access/authentication etc… Cole et al (2002) Data grid Knowledge grid
21
Grid based Data Environment Analysis
22
What’s it about? Disclosure risk analysis is forever constrained by the fact that we tend to only look at the release object. This is a bit like evaluating the risk of a house being vulnerable to flooding without looking at where it is located! Data Environment Analysis aims to remedy that situation and complete change the face of disclosure control in so doing…..
23
What would it involve? Web Crawling Data Monitoring
Synthetic Data Generation Grid based disclosure risk analysis
24
Web crawling Untrained Screen scraping of all web sites that collect personal data. Generic info gathering of web published personal info (personal web pages, My space etc)
25
Data Monitoring The development of sophisticated metadatabases representing available info fields Combined Database of web available data. Involves intelligent interpretation of web data, record linkage and other AI crossover techniques.
26
Repository: Data & Metadata
Architecture Web Crawler Web Crawler Web Crawler Web Crawler Web Crawler SDRA system Synthesiser Data monitor Repository: Data & Metadata
27
What next? Decide on roles. Identify funder.
Develop grant application.
28
Synthetic Data Generation
Uses techniques like multiple imputation to generate artificial data from the metadata generated by the data monitors and from data stored and accessed through data repositories.
29
Closing thoughts
30
A Blurring of Concepts The boundaries between data and processes become less distinct. Cyberidenties I am my data? The distinction between informational and physical privacy becomes less distinct.
31
Data Growth There is no reason to suppose that data growth will not continue at the same break neck pace The data environment will become increasingly richer In this context the meaning of “privacy” will undoubtedly change. But how?
32
The meaning of Privacy Do people care about privacy in an orthodox, absolute sense? What does a blog mean? Private-public: Public Privacy Control and ownership are more important than the absolute right to secrecy.
33
From Data Subjects to Data Citizens
A data actualised individual in control and self aware of their own data. What would data citizens be concerned about? Ownership The use/abuse of their data Harm Permission/Consent This suggests that the law should focus on data abuse rather than privacy per se.
34
Summary Statistical Disclosure prevents a problem for the use of data
Multiple linkable datasets exacerbate that problem. E-science provides some tools for new modes of data access
35
But….. Assuming that the global culture continues to feed and be fed by the information explosion: Our view of ourselves/our data will/must change. The meaning of privacy must change with it. The key question is what sort of society we are constructing; the meaning of privacy will reflect this.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.