Differential Privacy on Linked Data: Theory and Implementation
Yotam Aron

Table of Contents Introduction Differential Privacy for Linked Data SPIM implementation Evaluation

Contributions Theory on how to apply differential privacy to linked data. Experimental implementation of differential privacy on linked data. Overall privacy module for SPARQL queries.

Introduction

Overview: Why Privacy Risk? Statistical data can leak privacy. Mosaic Theory: different data sources can be harmful when combined. Examples: Netflix Prize data set, GIC medical data set, AOL search logs. Linked data adds ontologies and metadata, making it even more vulnerable.

Current Solutions Accountability: Privacy ontologies Privacy policies and laws Problems: Requires agreement among parties. Does not actually prevent breaches, just a deterrent. Heterogeneous

Current Solutions (Cont’d) Anonymization Delete “private” data K-anonymity (strong privacy guarantee) Problems: Deletion provides no strong guarantees. Must be carried out for every data set. What data should be anonymized? High computational cost (k-anonymity is NP-hard).

Differential Privacy

How Achieved? Add noise to result. Simplest: Add Laplace noise
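
The mechanism above can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis code; numpy is assumed, and the function name is made up.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    # Laplace mechanism: perturb the true answer with noise of scale
    # b = sensitivity / epsilon. Smaller epsilon (more privacy) = more noise.
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a COUNT query whose sensitivity is 1, answered with epsilon = 0.5
noisy_count = laplace_mechanism(42, sensitivity=1.0, epsilon=0.5)
```

The noise is centered at zero, so repeated queries average out toward the true answer, which is exactly why each user needs a finite epsilon budget.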

Laplace Noise Parameters

Other Benefit of Laplace Noise

Benefits of Differential Privacy Strong privacy guarantee Mechanism-based, so no need to modify the data. Independent of the data set’s structure. Works well with statistical analysis algorithms.

Problems with Differential Privacy Potentially poor performance Complexity (especially for non-linear functions) Noise Only works with statistical data (though this has fixes) How to calculate sensitivity of arbitrary query?

Differential Privacy for Linked Data

Differential Privacy and Linked Data Want the same privacy guarantees for linked data, but there are no “records.” What should be the “unit of difference”? One triple All URIs related to a person’s URI All links going out from a person’s URI

“Records” for Linked Data Reduce links in the graph to attributes. Idea: Identify each individual’s contribution to the total answer. Find the contribution that affects the answer most.

“Records” for Linked Data Reduce links in the graph to attributes; this turns the graph into a record. Graph: P1 -Knows-> P2. Equivalent record:

Person  Knows
P1      P2

“Records” for Linked Data Repeated attributes and null values allowed. (Figure: a graph over P1, P2, P3, P4 with Knows and Loves edges.)

“Records” for Linked Data Repeated attributes and null values allowed (not good RDBMS form, but it makes definitions easier):

Person  Knows  Loves  Knows
P1      P2     Null   P4
P3      P2     P4     Null

Query Sensitivity in Practice Need to find the triples that “belong” to a person. Idea: Identify each individual’s contribution to the total answer. Find the contribution that affects the answer most. Done using sorting and limiting functions in SPARQL.
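
The idea above can be sketched in Python. This is an illustrative sketch, not the thesis code; the triples and the predicate name are made up for the example.

```python
from collections import defaultdict

def count_sensitivity(triples, predicate):
    # Count how many matching triples each subject (person) contributes,
    # then take the largest contribution. The SPARQL analogue is
    # GROUP BY ?p / ORDER BY DESC(...) / LIMIT 1.
    per_person = defaultdict(int)
    for s, p, o in triples:
        if p == predicate:
            per_person[s] += 1
    return max(per_person.values(), default=0)

# P1 contributes two :visited triples, so a COUNT over :visited has sensitivity 2
triples = [("P1", "visited", "S1"), ("P1", "visited", "S2"),
           ("P2", "visited", "S3")]
```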

Example COUNT of places visited. (Figure: persons P1 and P2, each with a State of Residence edge to MA and Visited edges to states S1, S2, S3.) Answer: Sensitivity of 2.

Using SPARQL Query:

SELECT (COUNT(?s) as ?num_places_visited)
WHERE { ?p :visited ?s }

Using SPARQL Sensitivity calculation query (ideally):

SELECT ?p (COUNT(?s) as ?num_places_visited)
WHERE {
  ?p :visited ?s .
  ?p foaf:name ?n
}
GROUP BY ?p
ORDER BY DESC(?num_places_visited)
LIMIT 1

In reality… LIMIT, ORDER BY, and GROUP BY don’t work together in 4store… For now: don’t use LIMIT; fetch the answers and take the top ones manually, i.e. simulate these keywords in Python. Would ideally like to keep this on the SPARQL side so less data is transmitted (e.g. on large data sets).
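
The client-side workaround can be sketched like this. Illustrative only; the dict-per-row binding format and function name are assumptions, not 4store’s actual result format.

```python
from collections import Counter

def top_contributors(bindings, var, limit=1):
    # Client-side stand-in for GROUP BY ?p / ORDER BY DESC(COUNT(...)) / LIMIT n,
    # applied to plain SELECT results pulled back from the endpoint.
    counts = Counter(row[var] for row in bindings)
    return counts.most_common(limit)

rows = [{"p": "P1"}, {"p": "P1"}, {"p": "P2"}]
```

The cost of this workaround is exactly the drawback named above: every binding has to cross the wire instead of one aggregated row.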

(Side rant) 4store limitations Many operations are not supported in unison E.g. cannot always combine FILTER with ORDER BY Severely limits the types of queries I could use for testing. May be desirable to work with a more up-to-date triplestore (e.g. Jena/ARQ). Didn’t, because I wanted to keep the code in Python and had already written all the code for 4store.

Problems with this Approach Need to identify “people” in the graph. Assume, for example, that a URI with a foaf:name is a person and use its triples in the privacy calculations. This imposes some constraints on the linked data format. For future work, maybe there is a way to automatically identify private data, perhaps by using ontologies. Complexity is tied to the speed of performing the query over a large data set.

…and on the Plus Side Model for sensitivity calculation can be expanded to arbitrary statistical functions. e.g. dot products, distance functions, etc. Relatively simple to implement using SPARQL 1.1

Differential Privacy Protocol Differential Privacy Module Client SPARQL Endpoint Scenario: Client wishes to make standard SPARQL 1.1 statistical query. Client has Ɛ “budget” of overall accuracy for all queries.

Differential Privacy Protocol Differential Privacy Module Client SPARQL Endpoint Step 1: Query and epsilon value sent to the endpoint and intercepted by the enforcement module. Query, Ɛ > 0

Differential Privacy Protocol Differential Privacy Module Client SPARQL Endpoint Step 2: The sensitivity of the query is calculated using a re-written, related query. Sens Query

Differential Privacy Protocol Differential Privacy Module Client SPARQL Endpoint Step 3: Actual query sent. Query

Differential Privacy Protocol Differential Privacy Module Client SPARQL Endpoint Step 4: Result with Laplace noise sent over. Result and Noise
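
The four protocol steps above might be sketched as follows. A sketch under stated assumptions: the endpoint object and its run_query / run_sensitivity_query methods are hypothetical stand-ins for the SPARQL endpoint calls, and numpy is assumed.

```python
import numpy as np

class DifferentialPrivacyModule:
    # Intercepts a statistical query plus an epsilon value, computes the
    # sensitivity via a rewritten query, runs the real query, and returns
    # the result with Laplace noise added.
    def __init__(self, endpoint, rng=None):
        self.endpoint = endpoint
        self.rng = rng or np.random.default_rng()

    def handle(self, query, epsilon):
        if epsilon <= 0:                                          # step 1: intercept
            raise ValueError("epsilon must be positive")
        sensitivity = self.endpoint.run_sensitivity_query(query)  # step 2
        result = self.endpoint.run_query(query)                   # step 3
        return result + self.rng.laplace(0.0, sensitivity / epsilon)  # step 4
```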

Design of Privacy System

SPARQL Privacy Insurance Module i.e. SPIM Use authentication, AIR, and differential privacy in one system. Authentication to manage Ɛ-budgets. AIR to control flow of information and non-statistical data. Differential privacy for statistics. Goal: Provide a module that can integrate into SPARQL 1.1 endpoints and provide privacy.

Design Triplestore User Data Privacy Policies SPIM Main Process AIR Reasoner Differential Privacy Module HTTP Server OpenID Authentication

HTTP Server and Authentication HTTP Server: Django server that handles http requests. OpenID Authentication: Django module. HTTP Server OpenID Authentication

SPIM Main Process Controls flow of information. First checks user’s budget, then uses AIR, then performs final differentially-private query. SPIM Main Process
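
The budget check could look like the following minimal sketch; SPIM’s actual bookkeeping is not shown in the slides, and the class and method names are illustrative.

```python
class EpsilonBudget:
    # Track each user's remaining epsilon and refuse queries once it is
    # exhausted; this is the first check SPIM performs before invoking AIR.
    def __init__(self, budgets):
        self.remaining = dict(budgets)  # user -> epsilon left

    def charge(self, user, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.remaining.get(user, 0.0) < epsilon:
            return False  # budget exhausted: reject the query
        self.remaining[user] -= epsilon
        return True
```

Each answered query permanently spends part of the user’s Ɛ budget, which is why the module must track users across sessions via authentication.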

AIR Reasoner Performs access control by translating SPARQL queries to n3 and checking against policies. Can potentially perform more complicated operations (e.g. check user credentials) Privacy Policies AIR Reasoner

Differential Privacy Works as discussed in previous slides. Contains users and their Ɛ- values. Differential Privacy Module User Data

Evaluation

Three things to evaluate: Correctness of operation Correctness of differential privacy Runtime Used an anonymized clinical database as the test data and added fake names, social security numbers, and addresses.

Correctness of Operation Can the system do what we want? Authentication provides access control AIR restricts information and types of queries Differential privacy gives strong privacy guarantees. Can we do better?

Use Case Used in Thesis Clinical database data protection HIPAA: Federal protection of private information fields, such as name and social security number, for patients. 3 users: Alice: Works at the CDC, needs unhindered access Bob: Researcher who needs access to private fields (e.g. addresses) Charlie: Amateur researcher to whom HIPAA should apply Assumptions: Django is secure enough to handle “clever attacks” Users do not collude, so individual epsilon values can be allocated.

Use Case Solution Overview What should happen: Dynamically apply different AIR policies at runtime. Give different epsilon-budgets. How allocated: Alice: No AIR Policy, no noise. Bob: Give access to addresses but hide all other private information fields. Epsilon budget: E1 Charlie: Hide all private information fields in accordance with HIPAA Epsilon budget: E2

Example: A Clinical Database Client accesses the triplestore via the HTTP server. OpenID Authentication verifies the user has access to the data and finds the user’s epsilon value. HTTP Server OpenID Authentication

Example: A Clinical Database AIR reasoner checks incoming queries for HIPAA violations. Privacy policies contain HIPAA rules. Privacy Policies AIR Reasoner

Example: A Clinical Database Differential Privacy applied to statistical queries. Statistical result + noise returned to client. Differential Privacy Module

Correctness of Differential Privacy Need to test how much noise is added. Too much noise = poor results. Too little noise = no guarantee. Test: Run queries and look at sensitivity calculated vs. actual sensitivity.

How to test sensitivity? Ideally: Test that the noise calculation is correct Test that the noise keeps the data useful (e.g. by applying machine learning algorithms). For this project, just tested the former Machine learning APIs are not as prevalent for linked data. What results to compare to?

Test suite 10 queries for each operation (COUNT, SUM, AVG, MIN, MAX) 10 different WHERE clauses Test: Sensitivity calculated from the original query Remove each personal URI using the MINUS keyword and see which removal is most sensitive
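
The brute-force ground truth can be sketched as follows. Illustrative only: run_query and the query strings are stand-ins for real calls to the SPARQL endpoint.

```python
def exact_sensitivity(run_query, full_query, minus_template, names):
    # Rerun the aggregate once per person with that person's triples
    # excluded via MINUS, and report the largest deviation from the
    # full answer. This is the value the rewritten sensitivity query
    # is supposed to approximate.
    full = run_query(full_query)
    return max(abs(full - run_query(minus_template % name)) for name in names)
```

This is O(number of people) extra queries, which is why it is only practical as a test oracle, not as the production sensitivity calculation.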

Example for Sens Test Query:

PREFIX rdf:
PREFIX rdfs:
PREFIX foaf:
PREFIX mimic:
SELECT (SUM(?o) as ?aggr)
WHERE {
  ?s foaf:name ?n .
  ?s mimic:event ?e .
  ?e mimic:m1 "Insulin" .
  ?e mimic:v1 ?o .
  FILTER(isNumeric(?o))
}

Example for Sens Test Sensitivity query:

PREFIX rdf:
PREFIX rdfs:
PREFIX foaf:
PREFIX mimic:
SELECT (SUM(?o) as ?aggr)
WHERE {
  ?s foaf:name ?n .
  ?s mimic:event ?e .
  ?e mimic:m1 "Insulin" .
  ?e mimic:v1 ?o .
  FILTER(isNumeric(?o))
  MINUS { ?s foaf:name "%s" }
} % (name)

Results Query 6 - Error

Runtime Queries were also tested for runtime. Bigger WHERE clauses More keywords Extra overhead of doing the calculations.

Results Query 6 - Runtime

Interpretation Sensitivity calculation time is on par with query time Might not be good for big data Find ways to reduce sensitivity calculation time? AVG does not do so well… Approximation yields too much noise vs. trying all possibilities Runs ~4x slower than simple querying Solution 1: Look at all the data manually (large data transfer) Solution 2: Can we use NOISY_SUM / NOISY_COUNT instead?

Conclusion

Contributions Theory on how to apply differential privacy to linked data. Experimental implementation of differential privacy. Verification that it is applied correctly. Overall privacy module for SPARQL queries. Limited, but a good start Other: Updated the SPARQL-to-N3 translation to SPARQL 1.1 Expanded upon an IARPA project to create policies against statistical queries.

Shortcomings and Future Work Triplestores need some structure for this to work Personal information must be explicitly defined in triples. Is there a way to automatically detect what triples would constitute private information? Complexity Lots of noise for sparse data. Can divide data into disjoint sets to reduce noise like PINQ does Use localized sensitivity measures? Third party software problems Would this work better using a different Triplestore implementation?

Other Work Other implementations: PINQ Airavat PDDP Some of the theoretical work out there: Differential privacy paper Exponential mechanism Noise calculation Differential privacy and machine learning

Appendix: Results Q1–Q10 For each query and operation (COUNT, SUM, AVG, MAX, MIN), the appendix tables report Error, Query_Time, and Sens_Calc_Time; the numeric values were not preserved in this transcript.