Published byまいえ かつま Modified over 5 years ago
1
Stochastic Privacy
Adish Singla, Eric Horvitz, Ece Kamar, Ryen White
AAAI'14 | July 30, 2014
--- Good morning, everyone. Today I am going to present our work on Stochastic Privacy, carried out during an internship at Microsoft Research with Eric Horvitz, Ece Kamar, and Ryen White.
2
Stochastic Privacy
Bounded, small level of privacy risk: the likelihood of data being accessed and harnessed in a service.
Procedures (and proofs) work within this bound to acquire the data from the population needed to provide quality of service for all users.
--- We introduce a probabilistic approach to accessing user data, which we refer to as Stochastic Privacy. Each user chooses a bounded, small level of privacy risk, which is an upper bound on the likelihood that their data will be accessed and harnessed by the service.
--- We present procedures, with proofs, that operate within these privacy bounds to acquire data from the population and provide quality of service for all users.
3
Stochastic Privacy
Probabilistic approach to sharing data
  Guaranteed upper bound on the likelihood that data is accessed
  Users are offered or choose a small privacy risk (e.g., r below one in a million)
  Small probabilities of sharing may be tolerable to users
    e.g., we all live with the possibility of a lightning strike
Different formulations of events/data shared
  e.g., over time or over volume of user data
Encoding user preferences
  Users choose r in some formulations
  System may offer incentives, e.g., a discount on software or a premium service
--- Within our framework, users are offered or choose a small privacy risk, such as a one-in-a-million chance. The key idea is that small probabilities of sharing may be tolerable to users, much like the rare event of a lightning strike.
--- Our notion of stochastic privacy can be used in different formulations, such as privacy risk over time or over volume of user data.
--- A privacy risk may be offered to users, or they can choose the desired risk r. The system may also offer incentives to users to take on a higher risk.
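To make the guarantee concrete, here is a minimal sketch of the simplest mechanism that respects a per-user risk bound: an independent coin flip for each user. The function names and the population size are illustrative assumptions, not taken from the talk; the one-in-a-million value is the example given in the notes.

```python
import random


def sample_users(risks, rng):
    """Return the indices of users whose data is accessed.

    Each user i is included independently with probability risks[i],
    so the chance that user i is accessed never exceeds their chosen risk.
    """
    return [i for i, r in enumerate(risks) if rng.random() < r]


def expected_accessed(risks):
    """Expected number of accessed users under independent coin flips."""
    return sum(risks)


# Hypothetical population: one million users, each with r = one in a million.
risks = [1e-6] * 1_000_000
print(expected_accessed(risks))  # about 1 user accessed in expectation
print(len(sample_users(risks, random.Random(0))))
```

A plain coin flip honors the bound but ignores utility, which is why the talk goes on to combine random and greedy selection.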
4
Background: Online Services & Data
Accessing User Data
  Online services often rely on user data
    Search engines, recommender systems, social networks
    Click logs, personal data, e.g., demographics and location
    Personalization of services, increased revenue
User Consent
  Request permission or declare policy at sign-up / in real time
  Typically: binary response sought (yes/no)
    Terms of service with opt-in or opt-out
    Sets of questions: “May I access location to enhance services?”
    Response often: “Hmm… sounds good. Yes, sure!”
--- Online services typically rely on collecting user data, such as click logs or demographic information, which is then used for personalization of services and increased revenue.
--- User consent is often sought by requesting permission or declaring a policy at sign-up or in real time. The response sought is typically binary: opt in or opt out.
--- The terms of service shown are often complicated and difficult to understand, and users often make an uninformed decision.
--- Many of you have probably been asked this question by your smartphone: “May I access location to enhance services?” How many of you have said yes without understanding the implications?
5
Privacy Concerns and Approaches
Privacy Concerns and Data Leakage
  Sharing or selling to third parties, or malicious attacks
  Unusual scenarios, e.g., AOL data release (2006)
  Privacy advocates and government institutions
    FTC charges against major online companies (2011, 2012)
Approaches
  Understand and design for user preferences as top priority
    Type of data that can be logged (Olson, Grudin, and Horvitz 2005)
    Privacy-utility tradeoff (Krause and Horvitz 2008)
  Granularity and level of identifiability: k-anonymity, differential privacy
Desirable Characteristics
  User preferences as top priority
  Controls, clarity for users
  Allow larger systems to optimize under the privacy preferences of users
--- There are growing privacy concerns related to data leakage and breaches of information. These can arise from sharing or selling data to third parties, or from malicious attacks. There are also unusual situations, such as the AOL data release and the de-anonymization of Netflix logs.
--- Different approaches to privacy have been proposed, such as understanding user preferences about what type of data can be logged, and trading off utility against privacy under the preferences of users. Frameworks such as k-anonymity and differential privacy have been proposed to characterize the level of identifiability of users in such systems.
--- As desirable characteristics of a framework for accessing user data, the preferences of users should be the top priority, and users should have control and clarity. The larger system should also be able to optimize its services under the constraints imposed by the privacy preferences of users.
6
Overall Architecture
  User preferences: choosing risk r, offered incentives
  System preferences: application utility
  Optimization: user sampling while managing privacy risk
    Explorative sampling (e.g., learning users' utilities)
    Selective sampling (e.g., utility-driven user selection)
    Engagement (optional) on incentive offers (e.g., engage users with an option/incentive to take on higher privacy risk)
--- We present one complete architecture for implementing stochastic privacy. The three main components we envision are a user-preference component, a system-preference (utility) component, and the main optimization component, which handles user sampling while managing privacy risk. In the paper, we focus specifically on designing procedures for utility-driven user selection under the constraints of stochastic privacy.
7
Optimization: Selective Sampling
Example: Location-based personalization
  Population of users, each with a privacy risk, a cost, and observed attributes
  Under a budget or cardinality constraint, select a set S
  Utility of the selected set
  Risk of sampling by the mechanism
Class of Utility Functions
  Consider submodular utility functions
  Capture the notion of diminishing returns
  Suitable for various applications, such as maximizing click coverage
  Near-optimal polynomial-time solutions (Nemhauser'78)
--- Let us consider the application of location-based personalization, where we are given a population of users, each associated with a risk, a cost, and some observed attributes such as geo-location.
--- There is a fixed budget on how many users we can select, and the goal is to select a small set of users S that maximizes the utility. The key challenge is to ensure that the likelihood of any user being sampled is no more than that user's chosen privacy risk.
--- We consider the class of submodular utility functions, which capture the notion of diminishing returns and are suitable for various applications such as maximizing click coverage. Furthermore, near-optimal polynomial-time solutions can be found for such functions when the privacy constraints are ignored.
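The greedy baseline for a submodular coverage utility can be sketched as follows. This is an illustrative sketch under simplifying assumptions: unit costs (a pure cardinality budget), a toy `covers` map from users to the items their data covers, and no privacy constraint at all. The names are hypothetical and not from the talk.

```python
def coverage(selected, covers):
    """Utility f(S): number of distinct items covered by the users in S."""
    covered = set()
    for user in selected:
        covered |= covers[user]
    return len(covered)


def greedy_select(covers, budget):
    """Pick up to `budget` users, each time taking the largest marginal gain."""
    selected = []
    remaining = set(covers)
    while remaining and len(selected) < budget:
        base = coverage(selected, covers)
        # Sort for a deterministic tie-break; max() keeps the first maximum.
        best = max(sorted(remaining),
                   key=lambda u: coverage(selected + [u], covers) - base)
        if coverage(selected + [best], covers) == base:
            break  # no remaining user adds any utility
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: which items (e.g., clicked results) each user's logs would cover.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}, "d": {5}}
chosen = greedy_select(covers, budget=2)
print(chosen, coverage(chosen, covers))  # ['a', 'b'] 4
```

Diminishing returns show up directly: user "c" is worth 2 items on its own but 0 once "a" has been selected.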
8
Sampling Procedures
Desirable Properties:
  Competitive utility
  Privacy guarantees
  Polynomial runtime
OPT: Even ignoring privacy risk constraints, the solution is intractable (Feige'98)
GREEDY: Selection based on maximal marginal utility by cost ratio (Nemhauser'78)
RANDOM: Selecting users randomly
[Table: rows OPT, GREEDY, RANDOM, RANDGREEDY, SPGREEDY; columns Competitive Utility, Privacy Guarantees, Polynomial Runtime]
--- The desirable characteristics of a sampling procedure include competitive utility, privacy guarantees, and polynomial runtime.
--- In light of this, if we consider the combinatorially optimal solution to the problem, it is intractable even when privacy risk constraints are ignored.
--- Greedy solutions based on maximizing the ratio of marginal utility to cost can obtain competitive utility, but they violate the privacy guarantees.
--- At the other extreme is random sampling.
--- We design two mechanisms, RandGreedy and SPGreedy, that combine the ideas of greedy selection and random selection to achieve all the desirable properties.
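The slides do not spell out how RandGreedy combines the two ideas. One plausible reading, offered here only as an assumption, is to thin the population first with a per-user coin flip at rate r and then run greedy selection on the survivors. The sketch below uses a toy coverage utility with unit costs; all names are hypothetical.

```python
import random


def rand_greedy(covers, r, budget, rng):
    """Coin-flip thinning at rate r, followed by greedy coverage selection.

    Assumed reading of RandGreedy, not confirmed by the slides. A user can be
    selected only after surviving the coin flip, so the probability that any
    given user is selected is at most r.
    """
    eligible = sorted(u for u in covers if rng.random() < r)
    selected, covered = [], set()
    while eligible and len(selected) < budget:
        # Marginal gain of a user = number of new items their data covers.
        best = max(eligible, key=lambda u: len(covers[u] - covered))
        if not covers[best] - covered:
            break  # nobody left adds utility
        selected.append(best)
        covered |= covers[best]
        eligible.remove(best)
    return selected
```

With r = 1 this reduces to plain greedy selection, and with r = 0 it selects nobody, which matches the two extremes described in the notes.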
9
Procedure II: SPGREEDY
Greedy selection followed by obfuscation in batches
Iteratively builds the solution as follows:
  Greedily selects a user
  Obfuscates with “similar” users to create a batch
  Randomly samples one user from the batch
  Removes the entire batch from further consideration
--- We now present the details of one of our procedures, SPGreedy. The idea is to combine greedy selection with obfuscation. It iteratively builds the solution as follows: greedily select one user, obfuscate that user with “similar” users to create a batch Si, randomly sample one user from this batch, and remove the entire batch from further consideration.
--- Then continue this loop.
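The loop described above can be sketched in a few lines. The sketch makes assumptions the slide does not state: batches contain ceil(1/r) users so that a uniform draw picks any member with probability at most r, "similar" means nearest under a caller-supplied `distance`, and all users share the same risk r. The function and parameter names are hypothetical.

```python
import math
import random


def sp_greedy(users, utility, distance, r, budget, rng):
    """Greedy pick, obfuscate in a batch, sample one, drop the batch."""
    batch_size = math.ceil(1 / r)  # assumed: uniform draw gives 1/batch_size <= r
    remaining = set(users)
    selected = []
    # Stop when a full batch can no longer be formed, to keep the risk bound.
    while len(selected) < budget and len(remaining) >= batch_size:
        base = utility(selected)
        pool = sorted(remaining)
        # Greedy step: user with the largest marginal utility.
        star = max(pool, key=lambda u: utility(selected + [u]) - base)
        # Obfuscation step: surround the greedy pick with its nearest neighbours.
        others = sorted((u for u in pool if u != star),
                        key=lambda u: distance(star, u))
        batch = [star] + others[:batch_size - 1]
        # Sample one user uniformly from the batch, then retire the whole batch.
        selected.append(rng.choice(batch))
        remaining -= set(batch)
    return selected
```

Because every user belongs to at most one batch and each batch yields exactly one uniform draw, no user is sampled with probability above 1/batch_size in this sketch.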
10
Analysis: General Case
Worst-case upper bound
Lower bound for any distribution
[Diagram: GREEDY, or a coin flip with probability r]
Can we achieve better performance bounds for realistic scenarios?
--- We analyze the problem of selective sampling with privacy constraints in the general setting and prove that, in general, no procedure can achieve competitive utility in the worst case.
--- The question we explore is whether we can perform better in more realistic scenarios.
11
Additional Structural Properties
Smoothness
Diversification
--- Motivated by this, we consider a utility function f with two additional characteristics. The first is the notion of smoothness or continuity, given by λ_f: perturbing the solution set does not drastically change the utility value.
--- The second is the notion of diversification, which captures that the value of adding a new user is higher if that user is more dissimilar to the users already selected.
12
Main Theoretical Results
RANDGREEDY
SPGREEDY
RANDGREEDY vs. SPGREEDY
  RANDGREEDY does not require the additional assumption of diversification
  SPGREEDY bounds always hold, whereas RANDGREEDY bounds are probabilistic
  SPGREEDY has smaller constants in the bounds compared to RANDGREEDY
--- To summarize these results: under these two structural properties, we prove utility bounds for our procedures that are competitive, with only additional additive factors.
--- Comparing RandGreedy and SPGreedy, we see that RandGreedy does not require the additional assumption of diversification. On the positive side for SPGreedy, its bounds always hold, whereas the RandGreedy bounds are only probabilistic.
--- For practical purposes, the constants that appear in the bounds are smaller for SPGreedy.
13
Case study of web search personalization
Location-based personalization for queries issued for a specific domain
  Business (e.g., real estate, financial services)
Search logs from October 2013, restricted to 10 US states (7 million users)
Observed attributes of users prior to sampling
  Metadata about geo-coordinates
  Metadata about the last 20 search-result clicks (to infer expert profile)
  ODP (Open Directory Project) used to assign expert profile (White, Dumais, and Teevan 2009)
--- We evaluated our procedures on a case study of web search personalization.
--- The goal is to provide a location-based personalization service to the population by using click logs from the experts.
--- We used one month of search logs, comprising 7 million users. Each user was associated with a set of observed attributes, including geo-location and the last 20 search-result clicks, which we used to infer the utility of that user.
--- We used ODP to assign each user an expert profile in the domain of interest.
14
Results: Varying Budget
Both RANDGREEDY and SPGREEDY are competitive w.r.t. GREEDY
Naïve baseline RANDOM performs poorly
--- As a first result, we vary the number of selected users and see that both of our procedures are competitive with respect to GREEDY, while the naïve baseline performs poorly.
15
Results: Varying Privacy Risk
Performance of both RANDGREEDY and SPGREEDY degrades smoothly with decreasing privacy risk (i.e., a tighter sampling constraint)
--- Then, to compare the robustness of the procedures, we decrease the privacy risk r of the population. Both procedures are robust, with SPGreedy performing better than RandGreedy.
16
Results: Analyzing Performance
Absolute obfuscation loss remains low
  Relative error of obfuscation increases
  However, marginal utilities decrease because of submodularity
--- Third, we analyze the workings of the SPGreedy procedure to understand its robustness. We find that it is robust because of the interplay of two factors. In the beginning, the noise added by obfuscation is low and hence does not lead to high error. Over time, the relative loss from obfuscation increases, but the marginal utilities are decreasing, which keeps the overall loss low.
17
Summary
  Introduced stochastic privacy: a probabilistic approach to data sharing
  Tractable end-to-end system for implementing a version of stochastic privacy in online services
  Procedures RANDGREEDY and SPGREEDY for sampling users under constraints on privacy risk
    Theoretical guarantees on acquired utility
    Results of independent interest for other applications
  Evaluation of the proposed procedures on a case study of personalization in web search
--- To summarize, we introduced Stochastic Privacy as a new probabilistic approach to data sharing.
--- We presented an end-to-end system for implementing a version of stochastic privacy.
--- We presented procedures for utility-driven sampling under privacy constraints, with strong theoretical bounds. These procedures are of independent interest for other applications as well.
--- Finally, we evaluated our procedures on a case study of personalization in web search, using logs from a search engine.
--- Thank you.