When small data is better data Paul Francis, MPI-SWS Ruichuan Chen, Ekin Akkus, Johannes Gehrke
The user data “understanding”: There is a tacit understanding among users that if you send data to a company, the company is free to use it however it wishes. OK: Facebook “knowing” all kinds of personal information. Less OK: Doubleclick monitoring your browsing behavior. Not OK: Google gathering WLAN traffic during drive-by.
Leads to data gathering services: Companies build (free) services designed to gather as much data about users as they can, and often secretly gather data about users when they can’t. Then they try to monetize that data, mainly through advertising (though Jean Bolot had some interesting ideas).
Leads to “big data” (mining): Companies gather what they can, but don’t always get what they want. Google knows your searches, but not your relationship status. Facebook knows your relationship status, but not what you buy. Amazon knows what you buy, but not what you search for. So they use big data mining to infer what they don’t know.
A new user data understanding: It is OK to monetize (or otherwise benefit from) user data if the user data is very expensive to collect in any identifiable form, users can know what is going on, and users can opt out.
Why is this interesting? Keeping user data on the user device is the key to user privacy. Most user data is at, or has passed through, the user device: search and browsing are in the browser history; the Facebook user profile is easily scraped; Amazon purchases are easily scraped.
Premise of “Private by Design”: If we can monetize user data without collecting user data, then we have legitimate access to far more user data. Less need to deal with big data. Better monetization, less overhead.
My group’s research agenda: “Private by Design” behavioral advertising; “Private by Design” aggregate analytics.
Aggregate Analytics: Web analytics: want to know demographics of user base, what other websites users visit, etc. App analytics: want to know what other apps the user runs (competitors). Mobile analytics, general analytics, …
Typical database privacy settings: a trusted component sees the database. [Diagram: an untrusted analyst sends queries either to an anonymized copy of the database (Database’) or to a trusted query module that adds noise in front of the trusted database.] Traditional differential privacy assumes a centralized database front-ended by a trusted query module. There is, however, no centralized database in a distributed setting where individual users maintain their own data. Some form of distributed differential privacy is therefore required.
Our setting: nobody (except the user) sees individual user data. [Diagram: an untrusted data analyst; everything between the analyst and the users (“? ? ?”) is also untrusted.]
Previous work in our setting: Assumed differential privacy. Had poor scaling characteristics, and/or could not tolerate user fraud. Our goal: Assume differential privacy, but fix the scaling and user fraud problems.
Differential privacy: Differential privacy adds noise to the output of a computation (i.e., a query). It is another way to protect users’ privacy: add noise to the output of a computation, instead of adding noise to the original user data. [Diagram: the analyst queries the database through a query module that adds noise; DB1 and DB2 differ by one user.]
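As a minimal sketch of what "adding noise to the output" can look like, the snippet below perturbs a count with Laplace noise. This is a standard mechanism from the differential privacy literature, not something the slide specifies; the function name `noisy_count` and the parameter `epsilon` are assumptions made for illustration.

```python
import random


def noisy_count(true_count, epsilon, rng=random):
    """Return true_count plus Laplace noise of scale 1/epsilon.

    A counting query changes by at most 1 when a single user is added
    or removed (DB1 vs. DB2), so its sensitivity is taken to be 1.
    `epsilon` is a hypothetical privacy parameter, not from the slides.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # The difference of two i.i.d. exponential samples with rate epsilon
    # follows a Laplace distribution with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise
```

A smaller `epsilon` gives more noise and therefore a less accurate answer; the analyst only ever sees the noisy value.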
Components & assumptions: The analyst is potentially malicious (violating user privacy). The proxy is honest but curious: 1) it follows the specified protocol (does not collude); 2) it tries to exploit additional info that can be learned in so doing. The proxy adds DP noise blindly. Clients are user devices and are potentially malicious (distorting the final results).
Notes: The PDDP system consists of three components: analysts, clients, and proxy. Analysts make queries to the system, and collect answers. Clients locally maintain their own data, and answer queries. The proxy mediates between the analysts and clients, and adds differentially private noise to clients’ answers to preserve privacy. Analysts are assumed to be potentially malicious, with a goal of violating individual users’ privacy. An analyst may collude with other analysts, or pretend to be multiple distinct analysts. An analyst may take control of clients, and attempt to use the PDDP protocol to reveal information about those clients. An analyst may deploy its own clients and manipulate their answers. An analyst may also publish its collected answers. Analysts can intercept and modify all messages (e.g., an ISP posing as an analyst). Clients are also assumed to be potentially malicious, with a goal of distorting the statistical results learned by analysts. Clients may generate false or illegitimate answers under coordinated control (e.g., as a botnet), and may act as Sybils [11]. The proxy is assumed to be honest but curious (HbC). It will faithfully follow the specified protocol, but may try to exploit additional information that can be learned in so doing. The proxy does not collude with other components. We discuss how we may be able to relax the HbC assumption by using trusted hardware in §6.
Actually, two proxies! The honest-but-curious proxy must not see user data. With one proxy, we need expensive public-key encryption between clients and analyst. With two proxies, we can use a much cheaper form of encryption (one-time pad). [Diagram: analyst, two blind proxies, clients holding data.]
[Diagram] Sender: Message XOR Random_String = Result. Result travels via Proxy 1 and Random_String travels via Proxy 2. Receiver: Result XOR Random_String = Message.
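The XOR split in the diagram can be sketched in a few lines. This is an illustrative sketch only: the function names `split_message` and `recover_message` are hypothetical, and the transport through the two proxies is left out.

```python
import secrets


def split_message(message: bytes):
    """Sender side: return (result, random_string), one share per proxy."""
    # Fresh random pad of the same length as the message (one-time pad).
    random_string = secrets.token_bytes(len(message))
    result = bytes(m ^ r for m, r in zip(message, random_string))
    return result, random_string


def recover_message(result: bytes, random_string: bytes) -> bytes:
    """Receiver side: Result XOR Random_String = Message."""
    if len(result) != len(random_string):
        raise ValueError("shares must have the same length")
    return bytes(x ^ r for x, r in zip(result, random_string))
```

Each share on its own is uniformly random, so a proxy that sees only one of them learns nothing about the message, provided the pad is never reused and the proxies do not collude.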
Queries are counting queries. Ex: How many users … are male and between ages of 10-20?
Clients answer ‘yes’ or ‘no’ only.
Proxies add N additional random yes/no answers (coins), with N = 2σ². But a proxy must not know how many yeses and nos it added!
Each proxy independently adds N random coins. The XOR at the analyst will produce a random result, but neither proxy knows what the result will be.
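A rough sketch of the blind coin idea follows. Each proxy draws its own N random bits, and only the analyst, who XORs the two lists, sees the resulting coins. Subtracting N/2 to remove the expected contribution of the coins is an assumption made here to get an unbiased estimate; the slide does not state how the analyst corrects the total. All function names are hypothetical.

```python
import random


def proxy_coins(n, rng):
    """One proxy's contribution: n independent random bits."""
    return [rng.getrandbits(1) for _ in range(n)]


def xor_bits(a, b):
    """Analyst side: pair up the two proxies' bits and XOR them."""
    if len(a) != len(b):
        raise ValueError("both proxies must contribute the same number of bits")
    return [x ^ y for x, y in zip(a, b)]


def unbiased_count(all_bits, n_coins):
    """Sum every decoded bit (client answers and coins mixed together) and
    subtract the expected number of 'yes' coins, n_coins / 2 (assumed step)."""
    return sum(all_bits) - n_coins / 2
```

In use, `xor_bits(proxy_coins(N, rng1), proxy_coins(N, rng2))` yields N fair coins whose values neither proxy can predict, since each proxy knows only its own half of every XOR pair. A real deployment would presumably draw bits from a cryptographic source such as `random.SystemRandom()` rather than a seeded generator.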
Coins and answers [Diagram: analyst, two blind proxies, clients holding data.]
Decrypt and tabulate [Diagram: analyst, two blind proxies, clients holding data.]
Buckets: Not “is your age between 10-20?”, but “are you 1?”, “are you 2?”, “are you 3?”, … The query is generally a vector of yes/no questions, and the answer is a vector of 1’s and 0’s. The vector can be big: a list of 20K websites; 185K combinations of 10 of 20 attributes.
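To make the bucket idea concrete, here is a small sketch of a client turning an age into a vector of 1's and 0's, one element per "are you N?" question. The names `age_buckets` and `count_range` and the upper bound `max_age` are assumptions for illustration. Summing per-bucket totals to answer a range question is likewise an inference about how an analyst might use the buckets, not something the slide states.

```python
from math import comb


def age_buckets(age, max_age=100):
    """Answer vector: element i is 1 only for the question 'are you i+1?'."""
    if not 1 <= age <= max_age:
        raise ValueError("age outside the bucket range")
    return [1 if age == a else 0 for a in range(1, max_age + 1)]


def count_range(bucket_totals, low, high):
    """Combine per-bucket totals into a count for ages low..high inclusive."""
    return sum(bucket_totals[low - 1:high])


# Choosing 10 of 20 attributes gives roughly the 185K buckets on the slide.
N_COMBINATIONS = comb(20, 10)
```

`N_COMBINATIONS` evaluates to 184,756, which matches the "185K combinations of 10 of 20 attributes" figure.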
Proxies add coins and shuffle user answers (per bucket). [Diagram: each blind proxy holds per-user vectors (u1: b1, b2, b3, …; u2: b1, b2, b3, …; u3: b1, b2, b3, …); the analyst receives per-bucket lists mixing user answers and coins (b1: u4, u12, c2, …; b2: u6, c3, u19, …; b3: u12, c7, u6, …).]
The shuffling at each proxy must be identical (though random), because each bit must be paired with its XOR partner.
But the proxies may have a (slightly) different set of answers.
Synchronize the list of answers. Share a random seed for a random number generator, and use it to shuffle.
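The synchronize-then-shuffle step could look roughly like the sketch below. It assumes each proxy stores its shares in a dictionary keyed by a client identifier and that the proxies have already agreed on the common set of identifiers and on a seed; `synced_shuffle` and these data structures are hypothetical. Python's `random.Random` is used only to show the mechanics and is not cryptographically secure.

```python
import random


def synced_shuffle(answers, common_ids, seed):
    """Return this proxy's shares for the common clients in a shuffled order.

    answers:    dict mapping client id -> the share held by this proxy
    common_ids: ids for which BOTH proxies received an answer
    seed:       shared secret seed, identical at both proxies
    """
    # Start from the same canonical order at both proxies...
    ordered = [answers[cid] for cid in sorted(common_ids)]
    # ...then apply the same seeded permutation, so that position i at one
    # proxy still holds the XOR partner of position i at the other.
    random.Random(seed).shuffle(ordered)
    return ordered
```

Answers that reached only one proxy are dropped by taking the intersection of the two id sets (for example `proxy1.keys() & proxy2.keys()`). In a per-bucket setting, the same call would be repeated for each bucket, presumably with a different derived seed per bucket.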
Time: Queries (unfortunately) take time. There is a period of time during which a query is active: tens of minutes, hours, or days? [Timeline: start query; clients pull in and answer queries; synchronize and add coins.]
Differential Privacy, good and bad: Adds noise. Lots of machinery being built. Bad: very pessimistic (the measure of privacy loss is almost certainly way worse than the actual privacy loss); “throwing away the database” is not realistic.
From INTIMATE workshop: Jean’s mobility is a good application. Collaborative filtering (Bach, Aruna) looks hard to do. Serge’s social knowledge may be centered on user devices… Query for people’s opinions… Real-time analytics may be possible. Streamed coin addition?
Status and future: Building an application analytics tool. Initial focus is PC platforms. Hope to get real app developers to bundle our tool. Additional privacy mechanisms (beyond differential privacy). Work on better understanding of privacy loss in a realistic setting.