When small data is better data


1 When small data is better data
Paul Francis, MPI-SWS Ruichuan Chen, Ekin Akkus, Johannes Gehrke

2 When small data is better data
private When small data is better data Paul Francis, MPI-SWS Ruichuan Chen, Ekin Akkus, Johannes Gehrke

3 The user data “understanding”
There is a tacit understanding among users that if you send data to a company, they are free to use it how they wish
OK: Facebook “knowing” all kinds of personal information
LESS OK: Doubleclick monitoring your browsing behavior
NOT OK: Google gathering WLAN traffic during drive-by

4 Leads to data gathering services
Companies build (free) services designed to gather as much data about users as they can
And often secretly gather data about users when they can’t
Then try to monetize that data
Mainly through advertising
Though Jean Bolot had some interesting ideas

5 Leads to “big data” (mining)
Companies gather what they can, but don’t always get what they want
Google knows your searches, but not your relationship status
Facebook knows your relationship status, but not what you buy
Amazon knows what you buy, but not what you search for
So they use big data mining to infer what they don’t know

6 A new user data understanding
It is OK to monetize (or otherwise benefit from) user data if:
The user data is very expensive to collect in any identifiable form
Users can know what is going on, and users can opt out

7 Why is this interesting?
Keeping user data on the user device is the key to user privacy
Most user data is at, or has passed through, the user device
Search and browsing in browser history
Facebook user profile easily scraped
Amazon purchases easily scraped

8 Premise of “Private by Design”
If we can monetize user data, without collecting user data, then we have legitimate access to far more user data
Less need to deal with big data
Better monetization, less overhead

9 My group’s research agenda
“Private by Design” behavioral advertising
“Private by Design” aggregate analytics

10 My group’s research agenda
“Private by Design” behavioral advertising
“Private by Design” aggregate analytics

11 Aggregate Analytics
Web analytics: want to know demographics of user base, what other websites users visit, etc.
App analytics: want to know what other apps user runs (competitors)
Mobile analytics, general analytics, …

12 Typical database privacy settings: trusted component sees database
[Diagram labels: Analyst (untrusted), query, Database’ (anonymize), Query Module (add noise), Database (trusted)]
Traditional differential privacy assumes a centralized database front-ended by a trusted query module. In a distributed setting, however, there is no centralized database: individual users maintain their own data. Some form of distributed differential privacy is therefore required.

13 Our setting: nobody (except user) sees individual user data
[Diagram labels: Data, Analyst (untrusted), ? ? ? (untrusted)]

14 Previous work in our setting
Assumed differential privacy
Poor scaling characteristics, and/or
Could not tolerate user fraud
Our goal: Assume differential privacy, but fix scaling and user fraud problems.
[Diagram labels: Data, Analyst, ? ? ?]

15 Differential privacy
Differential privacy adds noise to the output of a computation (i.e., a query).
[Diagram labels: Database, Analyst, Query Module (add noise), DB1, DB2 (differs by one user)]
Notes: After a lot of effort in the past decade or two, an approach called […] Another way to protect users’ privacy is to add noise to the output of a computation, instead of adding noise to the original user data.
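As a rough illustration of "adding noise to the output of a query", here is a minimal sketch of a noisy counting query. It assumes the Laplace mechanism, which is one standard way to do this but is not named on the slide; the function name `noisy_count`, the sample records, and the `epsilon` value are all hypothetical.

```python
import random

def noisy_count(records, predicate, epsilon, rng=random):
    """Answer a counting query with Laplace noise of scale 1/epsilon.

    Assumption: Laplace noise is used here for illustration only;
    the slides do not say which noise distribution is meant.
    """
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two i.i.d. exponential samples with rate epsilon
    # is Laplace-distributed with scale 1/epsilon. A counting query changes
    # by at most 1 when one user is added or removed.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise

# DB1 and DB2 differ by one user, as in the slide's diagram (sample data is made up).
db1 = [{"age": a} for a in (15, 34, 18, 52, 12)]
db2 = db1 + [{"age": 19}]

def is_teen(record):
    return 10 <= record["age"] <= 20

answer_1 = noisy_count(db1, is_teen, epsilon=1.0)
answer_2 = noisy_count(db2, is_teen, epsilon=1.0)
```

The point of the noise is that `answer_1` and `answer_2` are drawn from overlapping distributions, so a single answer says little about whether the extra user is in the database.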

16 Components & assumptions
Analyst is potentially malicious (violating user privacy)
Proxy is honest but curious
1) Follows the specified protocol (does not collude)
2) Tries to exploit additional info that can be learned in so doing
Clients are user devices
Clients are potentially malicious (distorting the final results)
[Diagram labels: Analyst, Proxy (add DP noise blindly), Data, Data, Data]
Notes: The PDDP system consists of three components: analysts, clients, and proxy. Analysts make queries to the system, and collect answers. Clients locally maintain their own data, and answer queries. The proxy mediates between the analysts and clients, and adds differentially private noise to clients’ answers to preserve privacy.
Analysts are assumed to be potentially malicious, with a goal of violating individual users’ privacy. An analyst may collude with other analysts, or pretend to be multiple distinct analysts. An analyst may take control of clients, and attempt to use the PDDP protocol to reveal information about those clients. An analyst may deploy its own clients and manipulate their answers. An analyst may also publish its collected answers. Analysts can intercept and modify all messages (e.g., an ISP posing as an analyst).
Clients are also assumed to be potentially malicious, with a goal of distorting the statistical results learned by analysts. Clients may generate false or illegitimate answers under coordinated control (e.g., as a botnet), and may act as Sybils [11].
The proxy is assumed to be honest but curious (HbC). It will faithfully follow the specified protocol, but may try to exploit additional information that can be learned in so doing. The proxy does not collude with other components. We discuss how we may be able to relax the HbC assumption by using trusted hardware in §6.

17 Actually, two proxies! Honest-but-Curious proxy must not see user data
If one proxy, need expensive public key encryption between clients and analyst
If two proxies, can use much cheaper form of encryption (one time pad)
[Diagram labels: Analyst, Blind Proxy, Blind Proxy, Data, Data, Data]

18 Message XOR Random_String = Result
[Diagram: the Sender sends Result via Proxy 1 and Random_String via Proxy 2 to the Receiver]
Result XOR Random_String = Message
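The XOR split on this slide can be sketched in a few lines. This is an illustrative sketch only: the function names `split` and `combine` and the sample message are hypothetical, and the slide does not specify message framing or transport.

```python
import secrets

def split(message: bytes) -> tuple[bytes, bytes]:
    """Sender side: Message XOR Random_String = Result."""
    random_string = secrets.token_bytes(len(message))  # one-time pad, same length as the message
    result = bytes(m ^ r for m, r in zip(message, random_string))
    return result, random_string

def combine(result: bytes, random_string: bytes) -> bytes:
    """Receiver side: Result XOR Random_String = Message."""
    return bytes(a ^ b for a, b in zip(result, random_string))

message = b"yes"
to_proxy_1, to_proxy_2 = split(message)   # Result goes via Proxy 1, Random_String via Proxy 2
recovered = combine(to_proxy_1, to_proxy_2)
```

Each proxy sees only one of the two strings, and each string on its own is uniformly random, which is why neither proxy learns the message as long as the two do not collude.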

19 Queries are counting queries:
Ex: How many users … are male and between the ages of 10 and 20?

20 Clients answer ‘yes’ or ‘no’ only
[Diagram labels: Analyst, Blind Proxy, Blind Proxy, Data, Data, Data]

21 Proxies add N additional random yes/no answers (coins)
N = 2σ²
But a proxy must not know how many yes’s and no’s it added!

22 Each proxy independently adds N random coins
XOR at analyst will produce random result
But neither proxy knows what the result will be
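The blind coin addition can be sketched as follows. This is a toy model under stated assumptions: `N`, the sample answers, and all function names are hypothetical; `random.Random` is used only so the example is repeatable (a real deployment would need a cryptographic source); and the final "subtract N/2" step is an assumption about how the analyst would correct for the coins, not something the slide states.

```python
import random

def proxy_coins(n, rng):
    """One proxy's n independent random bits; it never sees the other proxy's bits."""
    return [rng.getrandbits(1) for _ in range(n)]

def analyst_tabulate(shares_1, shares_2):
    """XOR each pair of bits from the two proxies and count the 1s ('yes')."""
    return sum(a ^ b for a, b in zip(shares_1, shares_2))

N = 1000  # hypothetical number of coins
coins_1 = proxy_coins(N, random.Random(1))
coins_2 = proxy_coins(N, random.Random(2))

# Each client splits its yes/no bit into two XOR shares, one per proxy.
answers = [1, 0, 1, 1, 0]
client_rng = random.Random(3)
pads = [client_rng.getrandbits(1) for _ in answers]
shares_1 = [a ^ p for a, p in zip(answers, pads)] + coins_1
shares_2 = pads + coins_2

noisy_yes = analyst_tabulate(shares_1, shares_2)
coin_yes = analyst_tabulate(coins_1, coins_2)  # computed here only to check the sketch

# Assumption: the analyst removes the expected coin contribution, N/2.
estimate = noisy_yes - N / 2
```

The XOR of two independent fair bits is itself a fair bit, so the analyst sees N unbiased coins even though neither proxy can tell how many of them came out as "yes".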

23 Coins and answers
[Diagram labels: Analyst, Blind Proxy, Blind Proxy, Data, Data, Data]

24 Decrypt and tabulate
[Diagram labels: Analyst, Blind Proxy, Blind Proxy, Data, Data, Data]

25 Buckets
Not “is your age between 10-20?”, but “are you 1?”, “are you 2?”, “are you 3?” …
Query is generally a vector of yes/no questions
Answer is a vector of 1’s and 0’s
Vector can be big:
List of 20K websites
185K combinations of 10 of 20 attributes
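A small sketch of the bucket idea: a range question is replaced by one yes/no question per value, and the range count is recovered afterwards by summing buckets. The bucket range 1 to 100, the sample ages, and the function names are hypothetical. The last line checks that choosing 10 of 20 attributes gives 184,756 combinations, which is consistent with the "185K" figure on the slide.

```python
import math

def bucket_answer(age, buckets):
    """One client's answer: a vector of 1s and 0s, one entry per bucket."""
    return [1 if age == b else 0 for b in buckets]

def count_in_range(per_bucket_counts, buckets, low, high):
    """Recover a range count by summing the per-bucket counts in [low, high]."""
    return sum(c for b, c in zip(buckets, per_bucket_counts) if low <= b <= high)

buckets = list(range(1, 101))  # hypothetical: "are you 1?", ..., "are you 100?"
answer = bucket_answer(17, buckets)

# Per-bucket totals over a few made-up clients (noise omitted for clarity).
ages = [12, 17, 25]
totals = [sum(col) for col in zip(*(bucket_answer(a, buckets) for a in ages))]
teens = count_in_range(totals, buckets, 10, 20)

combinations = math.comb(20, 10)  # ways to pick 10 of 20 attributes
```

Asking per-value questions means one query can serve many later range questions, at the cost of a longer answer vector.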

26 Proxies add coins and shuffle user answers (per bucket)
[Diagram: clients send per-user answer lists to the two Blind Proxies; the proxies send shuffled per-bucket lists, with coins mixed in, to the Analyst]
Per user (at the proxies):
u1: b1, b2, b3, ……
u2: b1, b2, b3, ……
u3: b1, b2, b3, ……
…….
Per bucket (at the analyst):
b1: u4, u12, c2, ……
b2: u6, c3, u19, ……
b3: u12, c7, u6, ……

27 b1: u4, u12, c2, …… b2: u6, c3, u19, …… b3: u12, c7, u6, ……
[Diagram: each Blind Proxy holds the per-user lists (u1: b1, b2, b3, ……) and sends the same per-bucket lists to the Analyst]

28 The shuffling at each proxy must be identical (though random)
Because each bit must be paired with its XOR partner

29 But the proxies may have a (slightly) different set of answers.
[Diagram: Analyst and two Blind Proxies, each proxy with per-user lists (u1: b1, b2, b3, ……) and per-bucket lists (b1: u4, u12, c2, ……)]

30 Synchronize the list of answers.
Share a random seed for a random number generator, and use it to shuffle.
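The synchronize-then-shuffle step might look like the sketch below. It assumes that "synchronize" means keeping only the clients both proxies heard from, which the slides do not spell out; the client IDs, the seed value, and the function names are hypothetical, and `random.Random` stands in for whatever generator the real system would use.

```python
import random

def synchronize(ids_1, ids_2):
    """Keep only client IDs that both proxies received, in a canonical order.

    Assumption: intersection is one plausible way to reconcile the two lists.
    """
    return sorted(set(ids_1) & set(ids_2))

def shuffled(items, seed):
    """Return a shuffled copy; the same seed and input give the same order."""
    order = list(items)
    random.Random(seed).shuffle(order)
    return order

proxy_1_ids = ["u1", "u2", "u3", "u5"]  # proxy 1 missed u4
proxy_2_ids = ["u1", "u2", "u4", "u5"]  # proxy 2 missed u3
common = synchronize(proxy_1_ids, proxy_2_ids)

shared_seed = 12345  # hypothetical seed agreed between the proxies
order_1 = shuffled(common, shared_seed)  # computed at proxy 1
order_2 = shuffled(common, shared_seed)  # computed at proxy 2
```

Because both proxies start from the same list and the same seed, position i at one proxy holds the XOR partner of position i at the other, while the order itself reveals nothing about which client produced which bit.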

31 Time
Queries (unfortunately) take time:
There is a period of time during which a query is active
10s of minutes, hours, or days???
[Timeline labels: Start query; Clients pull in and answer queries; Synchronize and add coins]

32 Differential Privacy, good and bad
Adds noise
Lots of machinery being built
Bad:
Very pessimistic (measure of privacy loss is almost certainly way worse than actual privacy loss)
“Throwing away the database” not realistic

33 From INTIMATE workshop
Jean’s mobility a good application
Collaborative filtering (Bach, Aruna) looks hard to do
Serge’s social knowledge may be centered on user devices…
Query for people’s opinions…
Real-time analytics may be possible
Streamed coin addition???

34 Status and future
Building an application analytics tool
Initial focus is PC platforms
Hope to get real app developers to bundle our tool
Additional privacy mechanisms (beyond differential privacy)
Work on better understanding of privacy loss in a realistic setting

