Privacy-safe Data Sharing

Why Share Data?
Hospitals share data with researchers
– Learn about disease causes, promising treatments, correlations between symptoms and outcomes
Merchants share data with advertisers/researchers/public
– Learn what people like to buy, when and why
Networks share data with public
– Learn about traffic trends
– Optimize network usage

How To Share Data?
Anonymize
– Remove names, aggregate addresses (e.g., ZIP), remove packet data (could contain identifying or sensitive information in the payload)
Anonymization does not fully protect privacy
– Patterns are still available in the data
– A pattern may be unique and thus identifying
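
A minimal sketch of this kind of naive anonymization in Python (the field names and record layout are assumptions for illustration):

```python
def anonymize(record):
    """Drop the direct identifier and coarsen quasi-identifiers.
    Field names here are hypothetical."""
    return {
        "zip": record["zip"][:3] + "**",    # aggregate address to a ZIP prefix
        "age": (record["age"] // 10) * 10,  # bucket age into decades
        "disease": record["disease"],       # keep the attribute of interest
    }                                       # "name" is dropped entirely

records = [{"name": "Jane Smith", "age": 34, "zip": "90210", "disease": "Flu"}]
print([anonymize(r) for r in records])  # [{'zip': '902**', 'age': 30, 'disease': 'Flu'}]
```

As the next slides show, even records scrubbed this way can be re-identified through unique patterns.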

Example: Netflix Prize
A user saw “Rise of Titans”, “Frozen”, “Obscure Movie A”, “Unpopular Movie B”, “Sensitive Movie C”
– There may be no other user who saw “Obscure Movie A” and “Unpopular Movie B” – identifying pattern
– What else can we learn:
User saw two popular movies that lots of people see
User saw the “Sensitive Movie C”, could reveal political, sexual, religious orientation

Example: Netflix Prize
But how do we learn the identity of the user?
– They could have reviewed “Obscure Movie A” and “Unpopular Movie B” on IMDB
– They could have talked about “Obscure Movie A” and “Unpopular Movie B” with their friends

Why Anonymization Doesn’t Work
Unique patterns in the data
– Movies seen, diseases had, dates of visits to hospitals, traffic patterns in network data
Auxiliary data
– Enables linking of identity with unique pattern
This attack doesn’t work on all users
– But works on a few, and that’s enough
– It may help deanonymize other records

Lots of Research in the DB Field
Sharing census data, medical records
– k-anonymity, l-diversity, t-closeness
Differential privacy
Sampling
These approaches are now being used in other fields for privacy-safe data sharing

Example: Health Care Data

Name       | Age | ZIP | Disease
Jane Smith | …   | …   | Heart disease
Jane Doe   | …   | …   | Flu
John Smith | …   | …   | Cancer
John Doe   | …   | …   | Cancer
Jack Smith | …   | …   | Heart disease
Jack Doe   | …   | …   | Flu

Anonymized
Ages and ZIPs are all unique
– Can be used for identification

Age | ZIP | Disease
…   | …   | Heart disease
…   | …   | Flu
…   | …   | Cancer
…   | …   | Cancer
…   | …   | Heart disease
…   | …   | Flu

k-anonymity
Hiding in a crowd
Remove or aggregate data that pertains to fewer than k individuals, e.g., k=2

Age | ZIP   | Disease
…   | …**   | Heart disease
…   | …**   | Flu
…   | 902** | Cancer
…   | 902** | Cancer
…   | …**   | Heart disease
…   | …**   | Flu

People from the 902** ZIP have cancer
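
A minimal sketch of a k-anonymity check in Python (quasi-identifier field names are assumed; a real generalizer would coarsen values rather than suppress whole records):

```python
from collections import Counter

def enforce_k_anonymity(records, quasi_ids=("age", "zip"), k=2):
    """Suppress any record whose quasi-identifier combination is shared
    by fewer than k records in the table."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return [r for r in records
            if counts[tuple(r[q] for q in quasi_ids)] >= k]
```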

l-diversity
Remove or aggregate data that pertains to fewer than k individuals
– But ensure that each group has at least l different values of the sensitive attribute, e.g., l=2

Age  | ZIP   | Disease
0-20 | 9**** | Heart disease
0-20 | 9**** | Flu
…    | 9**** | Cancer
…    | 9**** | Heart disease
…    | 9**** | Cancer
…    | 9**** | Flu

If one knows that Jane Doe is 12, they can learn from the table that she does not have cancer.
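
A sketch of the corresponding l-diversity check, building on the same grouping idea (field names again assumed):

```python
from collections import defaultdict

def is_l_diverse(records, quasi_ids=("age", "zip"), sensitive="disease", l=2):
    """Check that every quasi-identifier group contains at least
    l distinct values of the sensitive attribute."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_ids)].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())
```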

t-closeness
The distance between the distribution of a sensitive attribute in a group and in the whole table is at most t, e.g., t=0

Age | ZIP  | Disease
…   | **** | Heart disease
…   | **** | Flu
…   | **** | Cancer
…   | **** | Heart disease
…   | **** | Cancer
…   | **** | Flu

Notice the drastic loss of information! Age and ZIP are all the same now and not correlated with disease.
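
A sketch of the distance check behind t-closeness; total variation distance stands in here for the Earth Mover's Distance used in the original t-closeness work, purely to keep the example short:

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of attribute values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def within_t(group_values, table_values, t=0.0):
    """Compare the group's sensitive-attribute distribution to the
    whole table's, using total variation distance."""
    p, q = distribution(group_values), distribution(table_values)
    dist = 0.5 * sum(abs(p.get(v, 0) - q.get(v, 0)) for v in set(p) | set(q))
    return dist <= t
```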

Sampling
Don’t release the entire data set, just release sampled data
– The data would still need to be anonymized
– This protects an individual because any given record has a low chance of being sampled
The attacks from the previous slides do not work, since the attacker cannot be sure they saw all the data
The attacker doesn’t know if a given individual has been selected
– But if an individual with a rare pattern is selected, they have no privacy protection
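
A short sketch of release-by-sampling (the sampling rate is an arbitrary choice for illustration):

```python
import random

def sample_release(records, rate=0.05, seed=None):
    """Release each (already anonymized) record independently
    with probability `rate`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < rate]
```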

Differential Privacy
Definition 1: Differential privacy goal. Any given disclosure will be, within a small multiplicative factor, just as likely whether or not the individual participates in the database.
– Thus an individual does not lose anything by participating

Differential Privacy
Definition 2: Differential privacy. A randomized algorithm M gives ε-differential privacy if for all data sets D1 and D2 differing on at most one element, and all S ⊆ Range(M), it holds that Pr[M(D1) ∈ S] ≤ exp(ε) × Pr[M(D2) ∈ S].
So the probabilities of some query’s result on a dataset with and without any single record are within a multiplicative factor of exp(ε) of each other; for small ε, exp(ε) ≈ 1 + ε.

Differential Privacy
Definition 3: The Laplace mechanism. Given any function f: N^|X| → R^k, the Laplace mechanism is defined as M_L(x, f, ε) = f(x) + (Y_1, ..., Y_k), where the Y_i are i.i.d. random variables drawn from the Laplace distribution with mean 0 and scale ∆f/ε, and ∆f is the l1-sensitivity of f, i.e., the biggest change in f that can occur from the addition or removal of a single row in x. Note that ∆f is a property of the function f and not of the database x.
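
A minimal sketch of the Laplace mechanism in Python/NumPy (function and parameter names are my own):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Add Laplace noise with mean 0 and scale ∆f/ε to a query result."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon  # ∆f/ε: higher sensitivity or smaller ε means more noise
    return true_value + rng.laplace(loc=0.0, scale=scale)

# e.g., a count query (∆f = 1) answered with ε = 0.5
print(laplace_mechanism(1000, sensitivity=1, epsilon=0.5))
```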

What Can We Do With Diff. Privacy?
Run queries on data and return results that are differentially private
Use it with ML to learn rules that do not leak data about any individual
Generate synthetic data that is differentially private

What is Diff. Privacy Good For?
Single-Identity-Single-Record data
– E.g., census, population data, birth and death records, data on people undergoing a given treatment
Most frequent type of query – counts, histograms, e.g., SELECT count(*) FROM DB WHERE attribute=value
With ∆f=1 and ε=0.1, the added noise has scale ∆f/ε = 10, i.e., it is typically on the order of ±10
– This doesn’t change the distribution much
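
Plugging the slide's numbers into the Laplace sketch above (the count value is hypothetical):

```python
import numpy as np

true_count = 523                    # e.g., result of SELECT count(*) FROM DB WHERE attribute=value
rng = np.random.default_rng()
noise = rng.laplace(scale=1 / 0.1)  # ∆f/ε = 1/0.1 = 10
print(true_count + noise)           # typically within a few tens of the true count
```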

What is Diff. Privacy Bad For?
Single-Identity-Multiple-Record data
– E.g., hospital visits, movie rentals, network traffic
Heavy-tail data: a small number of individuals contribute a large share of the data
With ∆f ≫ 1 and ε=0.1, the noise scale ∆f/ε becomes huge
– No utility left in the data

Online k-anonymity
Avoids many pitfalls of k-anonymity by letting users run queries on the data and returning only the output
– Each data point is checked to verify that it pertains to at least k identities
– No need for l-diversity and t-closeness
The data provider can restrict which queries can be run on which data fields
– E.g., no direct comparison on the name field

Online k-anonymity
The system can keep track of data flow and identify cases where one tries to pinpoint a group of fewer than k identities (tracker attacks); see the sketch below
– E.g., count of people with salary<130K, count of people with salary<129K
Hiding in a group of k people
– A simple concept, easier to understand than how to properly set the ε value
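
One possible defense against such tracker attacks, sketched in Python: refuse any query whose result set differs from a previously answered one by fewer than k records (the set-of-IDs representation is an assumption for illustration):

```python
def audit(new_result_ids, answered_results, k=2):
    """Refuse to answer if this query's result set differs from any
    previously answered one by fewer than k records, since the
    difference of the two counts would pinpoint fewer than k people."""
    for old in answered_results:
        diff = new_result_ids ^ old        # symmetric difference of ID sets
        if 0 < len(diff) < k:
            return False                   # would isolate fewer than k identities
    return True

answered = [{1, 2, 3, 4, 5}]               # salary < 130K (hypothetical IDs)
print(audit({1, 2, 3, 4}, answered, k=2))  # salary < 129K -> False, refused
```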

What Is Still Difficult?
Aggregate queries that include heavy-tail data
– E.g., sum of visits for cancer treatment, sum of packets on a given service port
The heavy tail dominates the result and enables identification
– E.g., if 5 hosts receive packets on port 80, with counts 100, 200, 300, 400 and 50,000, the last value dominates the sum
We can solve this by testing for a heavy tail and enforcing k-anonymity on it, as sketched below
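
A sketch of such a dominance test (the thresholds are arbitrary choices for illustration, not values from the slides):

```python
def heavy_tail_safe_sum(counts, k=2, dominance=0.5):
    """Release the sum only if no single contributor dominates it,
    or if at least k contributors share the dominating magnitude."""
    total = sum(counts)
    top = max(counts)
    if top / total > dominance:
        # The top value dominates: releasing the sum would effectively
        # reveal that one contributor's count.
        top_like = [c for c in counts if c >= top * 0.5]
        if len(top_like) < k:
            return None  # suppress the answer
    return total

print(heavy_tail_safe_sum([100, 200, 300, 400, 50000]))  # None: suppressed
```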

What Is Still Difficult?
Should we omit or merge values?
– Omitting loses a lot of data
– Merging might reveal sensitive data, e.g., merging ports 22 and 30 results in the range 22-30, which leaks data about the edge values
It is better to merge within pre-defined ranges, though this also loses utility (see the sketch below)
Heavy-tail data should be merged with different-sized ranges than uniformly distributed data
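
A sketch of merging within pre-defined ranges (the bucket boundaries are assumptions; real deployments would pick ranges suited to the data's distribution):

```python
# Pre-defined port buckets, fixed ahead of time so that merged ranges
# never expose the raw edge values of the data.
PORT_RANGES = [(0, 1023), (1024, 49151), (49152, 65535)]

def merge_port(port):
    """Report a port only as its pre-defined range, never as a raw value."""
    for lo, hi in PORT_RANGES:
        if lo <= port <= hi:
            return f"{lo}-{hi}"

print(merge_port(22), merge_port(30))  # both map to 0-1023
```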

Critter
Content-Rich Traffic Repository from Real-time Anonymous User Contributions
– Users contribute their data to research
– Data is recorded and remains at the user's machine (no loss of ownership, no liability issues)
Researchers connect to the Critter server
– They ask queries in an SQL-like language
Users poll the server for queries and reply with yes/no or counts
– The server uses k-anonymity to aggregate query responses

Critter Architecture

Query Process

Current data: HTTP but not HTTPS
Python-based client uses libpcap to collect packets
– Can be easily extended to other applications
Data is organized into connection and session tables, with lots of metadata

Field Name          | Description
no                  | Auto-increment primary key
timestmp            | Timestamp of the last TCP packet assembled for HTTP content
page_id             | Identifier for a text/html page and its objects
tcp_session_id      | Identifier for all HTTP content in a TCP session
browsing_session_id | Identifier for TCP sessions linked together
source              | Source_IP:Port
destination         | Dest_IP:Port
http_type           | Request/Response
host                | Domain name
url                 | Relative path
referer             | Referred by
cookie              | Cookie field for Request and 'Set-Cookie' for Response
content_type        | text/html, image/jpeg, etc.
no_children         | Number of images
payload             | GZIP-decoded ASCII payload
hrefs               | hrefs list
iframes             | iframes list
images              | images list