1
Privacy-safe Data Sharing
2
Why Share Data?
Hospitals share data with researchers
– Learn about disease causes, promising treatments, correlations between symptoms and outcomes
Merchants share data with advertisers/researchers/public
– Learn what people like to buy, when, and why
Networks share data with the public
– Learn about traffic trends
– Optimize network usage
3
How To Share Data?
Anonymize
– Remove names, aggregate addresses (e.g., ZIP), remove packet data (the payload could contain identifying or sensitive information)
Anonymization does not fully protect privacy
– Patterns are still available in the data
– A pattern may be unique and thus identifying
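A minimal sketch of this anonymization step, assuming records are plain Python dicts (the field names and values are illustrative, not from any real dataset):

```python
# Naive anonymization: drop direct identifiers and coarsen quasi-identifiers.
# As the slide notes, the remaining pattern of attributes may still be unique
# and therefore identifying.

def anonymize(record):
    return {
        "zip": record["zip"][:3] + "**",   # aggregate ZIP to a 3-digit prefix
        "age": record["age"],              # kept as-is here; still a quasi-identifier
        "disease": record["disease"],      # sensitive attribute, kept for research value
    }                                      # the name is simply dropped

records = [
    {"name": "Jane Smith", "age": 10, "zip": "90110", "disease": "Heart disease"},
    {"name": "Jane Doe",   "age": 12, "zip": "90121", "disease": "Flu"},
]

print([anonymize(r) for r in records])
```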
4
Example: Netflix Prize
A user saw “Rise of Titans”, “Frozen”, “Obscure Movie A”, “Unpopular Movie B”, “Sensitive Movie C”
– There may be no other user who saw both “Obscure Movie A” and “Unpopular Movie B” – an identifying pattern
– What else can we learn?
The user saw two popular movies that lots of people see
The user saw “Sensitive Movie C”, which could reveal a political, sexual, or religious orientation
5
Example: Netflix Prize
But how do we learn the identity of the user?
– They could have reviewed “Obscure Movie A” and “Unpopular Movie B” on IMDB
– They could have talked about “Obscure Movie A” and “Unpopular Movie B” with friends
6
Why Anonymization Doesn’t Work
Unique patterns in the data
– Movies seen, diseases had, dates of hospital visits, traffic patterns in network data
Auxiliary data
– Enables linking an identity with a unique pattern
This attack doesn’t work on all users
– But it works on a few, and that’s enough
– It may help deanonymize other records
7
Lots of Research in the DB Field
Sharing census data, medical records
– k-anonymity, l-diversity, t-closeness
Differential privacy
Sampling
These approaches are now being used in other fields for privacy-safe data sharing
8
Example: Health Care Data

Name        Age  ZIP    Disease
Jane Smith  10   90110  Heart disease
Jane Doe    12   90121  Flu
John Smith  99   90222  Cancer
John Doe    22   90223  Cancer
Jack Smith  25   91211  Heart disease
Jack Doe    23   91222  Flu
9
Anonymized
Ages and ZIPs are all unique
– Can be used for identification

Age  ZIP    Disease
10   90110  Heart disease
12   90121  Flu
99   90222  Cancer
22   90223  Cancer
25   91211  Heart disease
23   91222  Flu
10
k-anonymity
Hiding in a crowd
Remove or aggregate data that pertains to fewer than k individuals, e.g., k=2

Age     ZIP    Disease
0-20    901**  Heart disease
0-20    901**  Flu
20-100  902**  Cancer
20-100  902**  Cancer
20-40   912**  Heart disease
20-40   912**  Flu

People from the 902** ZIP have cancer
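A small sketch of a k-anonymity check over the table above, assuming each row is an (age range, ZIP prefix, disease) tuple:

```python
from collections import Counter

def is_k_anonymous(rows, k=2):
    """Every combination of quasi-identifiers must occur in at least k rows."""
    groups = Counter((age, zipcode) for age, zipcode, _disease in rows)
    return all(count >= k for count in groups.values())

# The generalized table from the slide: (age range, ZIP prefix, disease)
table = [
    ("0-20",   "901**", "Heart disease"),
    ("0-20",   "901**", "Flu"),
    ("20-100", "902**", "Cancer"),
    ("20-100", "902**", "Cancer"),
    ("20-40",  "912**", "Heart disease"),
    ("20-40",  "912**", "Flu"),
]

print(is_k_anonymous(table, k=2))  # True: each quasi-identifier group has 2 rows
```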
11
l-diversity
Remove or aggregate data that pertains to fewer than k individuals
– But ensure that each group has at least l different values of the sensitive attribute, e.g., l=2

Age     ZIP    Disease
0-20    9****  Heart disease
0-20    9****  Flu
20-100  9****  Cancer
20-100  9****  Heart disease
20-40   9****  Cancer
20-40   9****  Flu

If one knows that Jane Doe is 12, they can learn from the table that she does not have cancer.
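The same grouping idea extends to an l-diversity check; a sketch over the same illustrative tuples:

```python
from collections import defaultdict

def is_l_diverse(rows, l=2):
    """Every quasi-identifier group must contain at least l distinct sensitive values."""
    groups = defaultdict(set)
    for age, zipcode, disease in rows:
        groups[(age, zipcode)].add(disease)
    return all(len(diseases) >= l for diseases in groups.values())

# The earlier k-anonymous table fails l-diversity: the 902** group only has "Cancer".
k_anon_table = [
    ("0-20",   "901**", "Heart disease"), ("0-20",   "901**", "Flu"),
    ("20-100", "902**", "Cancer"),        ("20-100", "902**", "Cancer"),
    ("20-40",  "912**", "Heart disease"), ("20-40",  "912**", "Flu"),
]
print(is_l_diverse(k_anon_table, l=2))  # False
```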
12
t-closeness
The distance between the distribution of a sensitive attribute in a group and in the whole table is at most t, e.g., t=0

Age     ZIP    Disease
0-100   9****  Heart disease
0-100   9****  Flu
0-100   9****  Cancer
20-100  9****  Heart disease
20-100  9****  Cancer
20-100  9****  Flu

Notice the drastic loss of information! ZIP is now identical for everyone and the age ranges are nearly so, so neither is correlated with disease.
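A sketch of the corresponding check; t-closeness is formally defined with Earth Mover's Distance, while this illustration uses plain total-variation distance as a simpler stand-in:

```python
from collections import Counter, defaultdict

def distribution(diseases):
    counts = Counter(diseases)
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

def max_group_distance(rows):
    """Largest total-variation distance between any group's disease distribution
    and the whole-table distribution (a simple stand-in for the t in t-closeness)."""
    overall = distribution(d for _, _, d in rows)
    groups = defaultdict(list)
    for age, zipcode, disease in rows:
        groups[(age, zipcode)].append(disease)
    worst = 0.0
    for diseases in groups.values():
        g = distribution(diseases)
        tv = 0.5 * sum(abs(g.get(d, 0) - overall.get(d, 0))
                       for d in set(g) | set(overall))
        worst = max(worst, tv)
    return worst

table = [
    ("0-100",  "9****", "Heart disease"), ("0-100",  "9****", "Flu"),
    ("0-100",  "9****", "Cancer"),        ("20-100", "9****", "Heart disease"),
    ("20-100", "9****", "Cancer"),        ("20-100", "9****", "Flu"),
]
print(max_group_distance(table))  # 0.0: every group matches the overall distribution
```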
13
Sampling
Don’t release the entire data set, just release sampled data
– The data would still need to be anonymized
– This way an individual is protected, because any given record has a low chance of being sampled
The attacks from the previous slides do not work, since the attacker cannot be sure they saw all the data
The attacker doesn’t know whether a given individual has been selected
– If an individual with a rare pattern is selected, they have no privacy protection
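A minimal sketch of a sampled release (the sampling rate is illustrative):

```python
import random

def sampled_release(records, rate=0.1, seed=None):
    """Release each (already anonymized) record independently with probability `rate`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < rate]

# Any individual's record has only a `rate` chance of appearing in the release,
# but a rare record that does get sampled is just as exposed as before.
```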
14
Differential Privacy
Definition 1: Differential privacy goal. Any given disclosure will be, within a small multiplicative factor, just as likely whether or not the individual participates in the database.
– Thus an individual does not lose anything by participating
15
Differential Privacy
Definition 2: Differential privacy. A randomized algorithm M gives ε-differential privacy if for all data sets D1 and D2 differing on at most one element, and for all S ⊆ Range(M), it holds that Pr[M(D1) ∈ S] ≤ exp(ε) × Pr[M(D2) ∈ S].
So the probability of any query outcome on a dataset with and without any single record changes by at most a multiplicative factor exp(ε)
16
Differential Privacy
Definition 3: The Laplace mechanism. Given any function f: ℕ^|X| → ℝ^k, the Laplace mechanism is defined as M_L(x, f, ε) = f(x) + (Y_1, ..., Y_k), where the Y_i are i.i.d. random variables drawn from the Laplace distribution with mean 0 and scale ∆f/ε, and ∆f is the l1-sensitivity of f, i.e., the biggest change in f that can occur from the addition or removal of a single row of x. Note that ∆f is a property of the function f and not of the database x.
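A minimal sketch of Definition 3 for a single numeric query, using only the Python standard library (the Laplace sample is built as the difference of two exponentials; the count value is illustrative):

```python
import random

def laplace_noise(scale):
    """Sample Laplace(mean 0, scale): the difference of two i.i.d. exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """M_L(x, f, eps) = f(x) + Y, with Y drawn from Laplace(0, sensitivity / epsilon)."""
    return true_answer + laplace_noise(sensitivity / epsilon)

# A counting query has sensitivity 1 (adding or removing one person changes the
# count by at most 1), so with eps = 0.1 the noise has scale 10.
print(laplace_mechanism(4213, sensitivity=1, epsilon=0.1))
```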
17
What Can We Do With Diff. Privacy?
Run queries on the data and return results that are differentially private
Use it with ML to learn rules that do not leak data about any individual
Generate synthetic data that is differentially private
18
What is Diff. Privacy Good For?
Single-Identity-Single-Record data
– E.g., census, population data, birth and death records, data on people undergoing a given treatment
Most frequent type of query – counts, histograms, e.g., SELECT count(*) FROM DB WHERE attribute=value
∆f=1, ε=0.1: the added Laplace noise has scale ∆f/ε = 10, so it is typically on the order of ±10
– Doesn’t change the distribution much
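A sketch of a differentially private histogram in this "good" regime, assuming one record per person so each bin has sensitivity 1 (the data is made up):

```python
import random
from collections import Counter

def laplace_noise(scale):
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_histogram(values, epsilon):
    """Noisy per-value counts. Adding or removing one person changes exactly one
    bin by 1, so each bin's sensitivity is 1 and gets Laplace(1/epsilon) noise."""
    counts = Counter(values)
    return {v: c + laplace_noise(1 / epsilon) for v, c in counts.items()}

diseases = ["Flu"] * 5200 + ["Cancer"] * 1800 + ["Heart disease"] * 3000
print(dp_histogram(diseases, epsilon=0.1))  # noise around +-10 on counts in the thousands
```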
19
What is Diff. Privacy Bad For?
Single-Identity-Multiple-Record data
– E.g., hospital visits, movie rentals, network traffic
Heavy-tail data: a small number of individuals make a large contribution to the data
∆f >> 1, ε=0.1: the added Laplace noise has scale ∆f/ε, which can be enormous
– No utility left in the data
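A tiny illustration of why the noise swamps the signal once one person can contribute many records (the numbers are made up):

```python
# Single-identity-multiple-record data: one person can contribute many records,
# so the sensitivity of even a simple sum is that person's entire contribution.
visits_per_patient = [2, 3, 1, 4, 2, 500]   # illustrative; one heavy-tail patient
delta_f = max(visits_per_patient)           # removing that patient changes the sum by 500
epsilon = 0.1
print("true sum =", sum(visits_per_patient))   # 512
print("noise scale =", delta_f / epsilon)      # 5000 -- the noise swamps the answer
```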
20
Online k-anonymity
Avoids many pitfalls of k-anonymity by letting users run queries on the data and returning only the output
– Each data point is checked to verify that it pertains to at least k identities
– No need for l-diversity and t-closeness
The data provider can restrict which queries can be run on which data fields
– E.g., no direct comparison on the name field
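A sketch of the idea, assuming records are dicts keyed by field name; the field names and the value of k are illustrative:

```python
def answer_count(records, predicate, k=5):
    """Answer a count query only if it is backed by at least k distinct identities.
    A real deployment would also restrict which fields a predicate may touch,
    e.g., forbid direct comparisons on the name field."""
    matching = [r for r in records if predicate(r)]
    if len({r["name"] for r in matching}) < k:
        return None                  # refuse: the answer would expose fewer than k people
    return len(matching)

records = [{"name": f"user{i}", "salary": 50_000 + 1_000 * i} for i in range(100)]
print(answer_count(records, lambda r: r["salary"] < 130_000))  # 80
```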
21
Online k-anonymity
The system can keep track of data flow and identify cases where someone tries to pinpoint a group of fewer than k identities (tracker attacks)
– E.g., count of people with salary < 130K, then count of people with salary < 129K
Hiding in a group of k people
– A simple concept, easier to understand than how to properly set the ε value
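A sketch of one way such a tracker check could work, refusing any query whose result set differs from an already-answered query by fewer than k identities (the auditing rule and parameters are illustrative, not the system's actual mechanism):

```python
def tracker_safe_count(records, predicate, answered, k=5):
    """Refuse a query whose result set differs from any previously answered query's
    result set by fewer than k identities (a sketch of tracker-attack detection)."""
    current = {r["name"] for r in records if predicate(r)}
    if len(current) < k:
        return None
    for previous in answered:
        if 0 < len(current ^ previous) < k:   # the difference pins down < k people
            return None
    answered.append(current)
    return len(current)

records = [{"name": f"user{i}", "salary": 50_000 + 1_000 * i} for i in range(100)]
answered = []
print(tracker_safe_count(records, lambda r: r["salary"] < 130_000, answered))  # 80
print(tracker_safe_count(records, lambda r: r["salary"] < 129_000, answered))  # None: isolates one person
```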
22
What Is Still Difficult?
Aggregate queries that include heavy-tail data
– E.g., sum of visits for cancer treatment, sum of packets on a given service port
The heavy tail dominates the result and enables identification
– E.g., if 5 hosts receive packets on port 80 with counts 100, 200, 300, 400 and 50,000, the last value dominates the sum
We can solve this by testing for a heavy tail and enforcing k-anonymity on the heavy tail (see the sketch below)
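A sketch of one possible heavy-tail guard along these lines; the thresholds are illustrative, not a prescribed rule:

```python
def safe_sum(counts, k=2, tail_share=0.5):
    """Answer a sum only if no group of fewer than k contributors accounts for more
    than `tail_share` of the total (a sketch of "k-anonymity on the heavy tail")."""
    total = sum(counts)
    top = sorted(counts, reverse=True)
    if sum(top[:k - 1]) > tail_share * total:
        return None      # a group of fewer than k hosts dominates the sum -> identifying
    return total

print(safe_sum([100, 200, 300, 400, 50_000]))  # None: one host contributes ~98% of the sum
print(safe_sum([100, 200, 300, 400, 500]))     # 1500: no single host dominates
```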
23
What Is Still Difficult?
Should we omit or merge values?
– Omitting loses a lot of data
– Merging might reveal sensitive data, e.g., merging ports 22 and 30 into the range 22-30 leaks data about the edge values
It is better to merge within pre-defined ranges, though this also loses utility
Heavy-tail data should be merged with different-sized ranges than uniformly distributed data
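A small sketch of merging into pre-defined ranges, using the standard IANA port categories as the fixed bins (the choice of bins is illustrative):

```python
# Merge values into pre-defined ranges rather than data-driven ones, so the range
# boundaries themselves cannot leak which edge values were actually observed.
PORT_RANGES = [(0, 1023, "well-known"), (1024, 49151, "registered"), (49152, 65535, "dynamic")]

def bin_port(port):
    for low, high, label in PORT_RANGES:
        if low <= port <= high:
            return f"{label} ({low}-{high})"
    return "invalid"

print(bin_port(22), "|", bin_port(30))  # both map to the same pre-defined range
```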
24
Critter@home
Content-Rich Traffic Repository from Real-time Anonymous User Contributions
– Users contribute their data to research
– Data is recorded and remains on the user's machine (no loss of ownership, no liability issues)
Researchers connect to the Critter server
– Ask queries in an SQL-like language
Users poll the server for queries and reply with yes/no answers or counts
– The server uses k-anonymity to aggregate query responses
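A hypothetical sketch of the client side of this loop; the server URL, endpoint names, and query format are invented for illustration and are not Critter's actual protocol:

```python
import json
import urllib.request

SERVER = "https://critter.example.org"       # placeholder URL

def poll_and_answer(local_records):
    """Poll for pending queries, evaluate them against local data only,
    and return just a count; the raw data never leaves the machine."""
    with urllib.request.urlopen(f"{SERVER}/queries") as resp:
        queries = json.load(resp)            # e.g. [{"id": 7, "field": "host", "equals": "example.com"}]
    for q in queries:
        count = sum(1 for r in local_records if r.get(q["field"]) == q["equals"])
        body = json.dumps({"query_id": q["id"], "count": count}).encode()
        urllib.request.urlopen(urllib.request.Request(
            f"{SERVER}/answers", data=body,
            headers={"Content-Type": "application/json"}))

# The server would then aggregate per-user counts and release a result only if
# it is backed by at least k contributors (k-anonymity on the server side).
```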
25
Critter Architecture
26
Query Process
27
Critter@home
Current data: HTTP but not HTTPS
Python-based client uses libpcap to collect packets
– Can be easily extended to other applications
Data is organized into connection and session tables, with lots of metadata
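A rough stand-in for the collection step using the scapy library; the real client is described as libpcap-based, so this is only an illustration (and it requires root privileges to sniff):

```python
from scapy.all import sniff, IP, TCP, Raw

def handle(pkt):
    # Keep only packets that look like the start of an HTTP request/response on port 80.
    if pkt.haslayer(IP) and pkt.haslayer(TCP) and pkt.haslayer(Raw):
        payload = bytes(pkt[Raw].load)
        if payload.startswith((b"GET ", b"POST ", b"HTTP/")):
            print(pkt.sprintf("%IP.src%:%TCP.sport% -> %IP.dst%:%TCP.dport%"),
                  payload.split(b"\r\n", 1)[0])

sniff(filter="tcp port 80", prn=handle, store=False)  # needs root/admin privileges
```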
28
Critter@home

Field Name           Description
no                   Auto-increment primary key
timestmp             Time stamp of the last TCP packet assembled for HTTP content
page_id              Identifier for a text/html page and its objects
tcp_session_id       Identifier for all HTTP content in a TCP session
browsing_session_id  Identifier for TCP sessions linked together
source               Source_IP:Port
destination          Dest_IP:Port
http_type            Request/Response
host                 Domain name
url                  Relative path
referer              Referred by
cookie               Cookie field for Request and 'Set-Cookie' for Response
content_type         text/html, image/jpeg, etc.
no_children          Number of images
payload              GZIP-decoded ASCII payload
hrefs                hrefs list
iframes              iframes list
images               images list
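A sketch of how this table might be declared, assuming SQLite as the storage backend (the backend and column types are assumptions; the columns follow the field list above):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS http_content (
    no                  INTEGER PRIMARY KEY AUTOINCREMENT,
    timestmp            REAL,    -- time of the last TCP packet assembled for this content
    page_id             INTEGER, -- a text/html page and its objects
    tcp_session_id      INTEGER, -- all HTTP content in one TCP session
    browsing_session_id INTEGER, -- TCP sessions linked together
    source              TEXT,    -- Source_IP:Port
    destination         TEXT,    -- Dest_IP:Port
    http_type           TEXT,    -- Request / Response
    host                TEXT,
    url                 TEXT,
    referer             TEXT,
    cookie              TEXT,
    content_type        TEXT,
    no_children         INTEGER, -- number of images
    payload             TEXT,    -- gzip-decoded ASCII payload
    hrefs               TEXT,
    iframes             TEXT,
    images              TEXT
);
"""

conn = sqlite3.connect("critter.db")
conn.executescript(SCHEMA)
```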