Privacy Statistics and Data Linkage

Presentation transcript:

Privacy Statistics and Data Linkage
Mark Elliot, Confidentiality and Privacy Group, University of Manchester

Overview
- The disclosure risk problem
- Some e-science possibilities
- Monitored data access
- Grid-based Data Environment Analysis
- The meaning of privacy

Data Data Everywhere…
- Massive and exponential increase in data: Mackey and Purdam (2002); Purdam and Elliot (2002). These studies have led to the setting up of the data monitoring service.
- Singer (1999) noted three behavioural tendencies:
  - Collect more information on each population unit
  - Replace aggregate data with person-specific databases
  - Given the opportunity, collect personal information
- Purdam and Elliot add: link data whenever you can

Disclosure Risk I: Microdata

The Disclosure Risk Problem: Type I: Identification
- Identification file: Name, Address (ID variables); Sex, Age, .. (key variables)
- Target file: Sex, Age, .. (key variables); Income, .. (target variables)
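The identification scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field names and the helper `link_unique_matches` are hypothetical and do not come from the presentation. The sketch links an identification file to a target file on the shared key variables and reports only those matches where the key combination is unique in both files.

```python
from collections import Counter

# Hypothetical toy records; none of these values come from the presentation.
identification_file = [
    {"name": "A. Smith", "address": "1 High St", "sex": "F", "age": 34},
    {"name": "B. Jones", "address": "2 Low Rd", "sex": "M", "age": 51},
    {"name": "C. Brown", "address": "3 Mid Ln", "sex": "M", "age": 51},
]
target_file = [
    {"sex": "F", "age": 34, "income": 41000},
    {"sex": "M", "age": 51, "income": 28000},
]

# Key variables are the fields present in both files.
KEY_VARIABLES = ("sex", "age")


def key_of(record):
    """Return the tuple of key-variable values for one record."""
    return tuple(record[v] for v in KEY_VARIABLES)


def link_unique_matches(id_records, target_records):
    """Pair names with target records whose key is unique in both files."""
    id_counts = Counter(key_of(r) for r in id_records)
    target_counts = Counter(key_of(r) for r in target_records)
    id_by_key = {key_of(r): r for r in id_records}
    matches = []
    for target in target_records:
        k = key_of(target)
        if id_counts[k] == 1 and target_counts[k] == 1:
            matches.append((id_by_key[k]["name"], target))
    return matches


matches = link_unique_matches(identification_file, target_file)
```

In this toy data only the (F, 34) key is unique in both files, so one name is attached to an income value; the (M, 51) key is shared by two people in the identification file and yields no claimed match.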

Disclosure Risk II: Aggregate Tables of Counts

The Disclosure Risk Problem: Type II: Attribution
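Attribution from a table of counts can be illustrated with a toy example. The sketch below assumes a hypothetical occupation-by-income table (the categories and counts are invented, not taken from the slides) and flags rows where every counted individual falls into a single column, so that knowing someone's row is enough to learn their column value.

```python
# Hypothetical table of counts: rows are occupations, columns are income bands.
table = {
    "teacher": {"low": 0, "medium": 12, "high": 3},
    "surgeon": {"low": 0, "medium": 0, "high": 4},
}


def attributable_rows(counts):
    """Return {row: column} for rows whose counts all sit in one column."""
    result = {}
    for row, cells in counts.items():
        nonzero = [column for column, n in cells.items() if n > 0]
        if len(nonzero) == 1:
            result[row] = nonzero[0]
    return result


disclosive = attributable_rows(table)
```

Here every surgeon in the table is in the "high" band, so the attribute can be inferred for any surgeon known to be in the population, without identifying a specific record.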

Multiple datasets
Disclosure risk assessment for single datasets is a reasonably well understood problem. But what happens with multiple datasets?

Data Mining and the Grid
Traditional Data Mining examines and identifies patterns on single (if massive) datasets. But Data Mining is really a method/approach/technology that has been waiting for the grid to happen.

Smith and Elliot (2005, 06, 07)
- Increases in data availability lead inexorably to an increase in disclosure risk.
- My ability to make linkages (disclosive or otherwise) between datasets X and Y is facilitated by the co-presence of dataset Z.
- It's all about information!
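The role of a third dataset can be made concrete with a small sketch. Everything here is an assumption for illustration: the three toy datasets, their field names and the helper `link_via` are hypothetical. Datasets X and Y share no field, so they cannot be joined directly, but Z shares one field with each and acts as a bridge.

```python
# Hypothetical toy datasets; X and Y have no field in common.
dataset_x = [
    {"postcode": "M13", "condition": "asthma"},
    {"postcode": "M14", "condition": "diabetes"},
]
dataset_y = [
    {"employer": "Acme", "name": "A. Smith"},
    {"employer": "Globex", "name": "B. Jones"},
]
# Z shares "postcode" with X and "employer" with Y.
dataset_z = [
    {"postcode": "M13", "employer": "Acme"},
    {"postcode": "M14", "employer": "Globex"},
]


def shared_fields(a, b):
    """Field names present in both datasets (taken from their first records)."""
    return set(a[0]) & set(b[0])


def link_via(x, y, z, x_key, y_key):
    """Join X to Y through Z, which maps x_key values to y_key values."""
    bridge = {row[x_key]: row[y_key] for row in z}
    y_by_key = {row[y_key]: row for row in y}
    linked = []
    for row in x:
        y_value = bridge.get(row[x_key])
        if y_value in y_by_key:
            linked.append({**row, **y_by_key[y_value]})
    return linked


direct = shared_fields(dataset_x, dataset_y)
linked = link_via(dataset_x, dataset_y, dataset_z, "postcode", "employer")
```

With Z absent there is no common field to join on; with Z present, each health condition in X ends up next to a name from Y.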

CLEF: Clinical e-Science Framework
A solution involving monitored access

CLEF Consortium
Approximately 40 staff from:
- University of Manchester
- University of Sheffield
- University College London
- University of Brighton
- Royal Marsden Hospital, London

Purpose
To provide a system for allowing research access to patient data whilst maintaining privacy.
- Patient records
- Database
- Texts such as referral letters and other clinical texts
- Text mining system converts texts to microdata

CLEF: one possible architecture
Diagram components: Firewall; Raw Data; PRE-ACCESS DQI Monitor; PRE-ACCESS SDRA/SDC; Treated Data; PRE-OUTPUT DQI Monitor; PRE-OUTPUT SDRA/SDC; Data Intrusion Sentry; Workbench

Data Sentry: an AI system
- Monitors patterns of analytical requests
- 3 levels: users, institution, world
- Looking for intrusive patterns
- Numbers of requests
- Stores analytical requests for future use
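The simplest part of such monitoring, counting requests at the three levels, can be sketched as follows. This is not the AI system described on the slide: the class `RequestSentry`, its thresholds and the request strings are all hypothetical, and real intrusion detection would look at the pattern of requests rather than just their number.

```python
from collections import Counter


class RequestSentry:
    """Stores analytical requests and counts them per user, institution and overall."""

    def __init__(self, user_limit, institution_limit, world_limit):
        # Thresholds are illustrative assumptions, not values from the source.
        self.limits = {
            "user": user_limit,
            "institution": institution_limit,
            "world": world_limit,
        }
        self.stored_requests = []
        self.user_counts = Counter()
        self.institution_counts = Counter()

    def record(self, user, institution, request):
        """Store one request and return the levels whose limit is now exceeded."""
        self.stored_requests.append((user, institution, request))
        self.user_counts[user] += 1
        self.institution_counts[institution] += 1
        raised = []
        if self.user_counts[user] > self.limits["user"]:
            raised.append("user")
        if self.institution_counts[institution] > self.limits["institution"]:
            raised.append("institution")
        if len(self.stored_requests) > self.limits["world"]:
            raised.append("world")
        return raised


sentry = RequestSentry(user_limit=2, institution_limit=3, world_limit=10)
first = sentry.record("u1", "inst_a", "table(age, sex)")
sentry.record("u1", "inst_a", "table(age, sex, ward)")
third = sentry.record("u1", "inst_a", "table(age, sex, ward, income)")
```

The stored requests are kept in full so that later analysis (for example, spotting a sequence of ever more detailed tables) could be run over them.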

CLEF Proposed Architecture
Diagram components: Firewall; Raw Data; PRE-ACCESS DQI Monitor; PRE-ACCESS SDRA/SDC; Treated Data; PRE-OUTPUT DQI Monitor; PRE-OUTPUT SDRA/SDC; Data Intrusion Sentry; Workbench

Data Quality
- User analyses are run on both treated and untreated data.
- Outputs are compared and assessed for difference.
- Major research area: Knowledge Engineering
- Analyses are stored and collectively run over pre- and post-SDC files for assessment of impact.
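The comparison step can be sketched with a single analysis run twice. The data, the top-coding treatment and the helper names are assumptions made for illustration; the source does not say which disclosure control methods or which difference measures CLEF uses.

```python
from statistics import mean

# Hypothetical untreated values.
raw_ages = [23, 37, 41, 58, 94]


def top_code(values, cap):
    """A simple disclosure control treatment: cap extreme values at `cap`."""
    return [min(v, cap) for v in values]


treated_ages = top_code(raw_ages, 80)


def compare_outputs(analysis, raw, treated):
    """Run the same analysis on raw and treated data and report the difference."""
    raw_out = analysis(raw)
    treated_out = analysis(treated)
    return {
        "raw": raw_out,
        "treated": treated_out,
        "abs_diff": abs(raw_out - treated_out),
    }


report = compare_outputs(mean, raw_ages, treated_ages)
```

Storing each `analysis` callable alongside its report would allow the whole set of analyses to be rerun whenever the treatment changes, in the spirit of the last bullet above.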

The Grid: the context for massive combining
- “Integrated infrastructure for high-performance distributed computation” Cannataro and Talia (2002)
- Grid middleware handles the technical issues: communication, security, access/authentication etc. Cole et al (2002)
- Data grid
- Knowledge grid

Grid based Data Environment Analysis

What’s it about?
Disclosure risk analysis is forever constrained by the fact that we tend to look only at the release object. This is a bit like evaluating a house's vulnerability to flooding without looking at where it is located! Data Environment Analysis aims to remedy that situation and, in so doing, completely change the face of disclosure control.

What would it involve?
- Web crawling
- Data monitoring
- Synthetic data generation
- Grid-based disclosure risk analysis

Web crawling
- Untrained screen scraping of all web sites that collect personal data
- Generic gathering of web-published personal information (personal web pages, MySpace etc.)

Data Monitoring
- The development of sophisticated metadatabases representing available info fields
- Combined database of web-available data
- Involves intelligent interpretation of web data, record linkage and other AI crossover techniques

Architecture
Diagram components: Web Crawler (five instances); SDRA system; Synthesiser; Data monitor; Repository: Data & Metadata

What next?
- Decide on roles.
- Identify funder.
- Develop grant application.

Synthetic Data Generation
Uses techniques like multiple imputation to generate artificial data from the metadata generated by the data monitors and from data stored and accessed through data repositories.
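As a rough illustration of generating artificial records from metadata alone, the sketch below samples each field independently from its published category proportions. This is a deliberately simpler technique than the multiple imputation the slide mentions, and it ignores relationships between fields. The metadata structure, the proportions and the function `synthesise` are hypothetical.

```python
import random

# Hypothetical metadata: for each field, the proportion of each category.
metadata = {
    "sex": {"F": 0.5, "M": 0.5},
    "age_band": {"16-34": 0.3, "35-64": 0.5, "65+": 0.2},
}


def synthesise(field_distributions, n, seed=0):
    """Draw n artificial records, sampling each field independently."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    records = []
    for _ in range(n):
        record = {}
        for field, distribution in field_distributions.items():
            categories = list(distribution)
            weights = list(distribution.values())
            record[field] = rng.choices(categories, weights=weights, k=1)[0]
        records.append(record)
    return records


synthetic = synthesise(metadata, n=5)
```

A synthetic file like this could then be passed to a disclosure risk assessment in place of data that the analyst cannot access directly.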

Closing thoughts

A Blurring of Concepts
- The boundaries between data and processes become less distinct.
- Cyberidentities
- I am my data?
- The distinction between informational and physical privacy becomes less distinct.

Data Growth
- There is no reason to suppose that data growth will not continue at the same breakneck pace.
- The data environment will become increasingly rich.
- In this context the meaning of “privacy” will undoubtedly change. But how?

The meaning of Privacy
- Do people care about privacy in an orthodox, absolute sense?
- What does a blog mean?
- Private-public: Public Privacy
- Control and ownership are more important than the absolute right to secrecy.

From Data Subjects to Data Citizens
- A data-actualised individual, in control and self-aware of their own data.
- What would data citizens be concerned about?
  - Ownership
  - The use/abuse of their data
  - Harm
  - Permission/Consent
- This suggests that the law should focus on data abuse rather than privacy per se.

Summary
- Statistical disclosure presents a problem for the use of data.
- Multiple linkable datasets exacerbate that problem.
- E-science provides some tools for new modes of data access.

But…
Assuming that the global culture continues to feed and be fed by the information explosion:
- Our view of ourselves/our data will/must change.
- The meaning of privacy must change with it.
- The key question is what sort of society we are constructing; the meaning of privacy will reflect this.