Foundations of Privacy Lecture 4 Lecturer: Moni Naor.


Recap of last week's lecture
Differential Privacy
Sensitivity:
– Global sensitivity of a query q: U^n → R^d: GS_q = max over adjacent D, D' of ||q(D) − q(D')||_1
– Local sensitivity of a query q at point D: LS_q(D) = max over D' adjacent to D of |q(D) − q(D')|
– Smooth sensitivity: S*_f(x) = max_y { LS_f(y) · e^{−ε·dist(x,y)} }
Histograms
Differential privacy of the median
Exponential Mechanism

Histograms
Inputs x_1, x_2, ..., x_n in domain U
Domain U is partitioned into d disjoint bins S_1, ..., S_d
q(x_1, x_2, ..., x_n) = (n_1, n_2, ..., n_d), where n_j = #{i : x_i in S_j}
Can view this as d queries: q_i counts the number of points in set S_i
For adjacent D, D', only one answer can change, and it changes by 1
Global sensitivity of the answer vector is 1
Sufficient to add Lap(1/ε) noise to each query and still get ε-privacy
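The slide's recipe is easy to make concrete. Below is a minimal sketch (not from the lecture; the bins, data, and ε are illustrative) of releasing a histogram with Lap(1/ε) noise per bin:

```python
# Sketch: epsilon-differentially private histogram via Lap(1/epsilon)
# noise per bin. Bins, data, and epsilon are made-up examples.
import numpy as np

def private_histogram(data, bins, epsilon, rng=None):
    """data: iterable of items; bins: disjoint sets covering the domain U."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.array([sum(1 for x in data if x in b) for b in bins], dtype=float)
    # Adjacent databases change a single bin count by 1, so the L1
    # sensitivity of the whole count vector is 1; Lap(1/epsilon) per
    # coordinate therefore suffices for epsilon-differential privacy.
    return counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(bins))

# Example: ages partitioned into three disjoint bins.
ages = [23, 31, 35, 44, 52, 52, 67]
bins = [set(range(0, 30)), set(range(30, 50)), set(range(50, 100))]
print(private_histogram(ages, bins, epsilon=0.5))
```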

The Exponential Mechanism [McSherry, Talwar]
A general mechanism that yields differential privacy
May yield utility/approximation
Is defined and evaluated by considering all possible answers
The definition does not yield an efficient way of evaluating it
Application/original motivation: approximate truthfulness of auctions
– Collusion resistance
– Compatibility

Side bar: Digital Goods Auction
Some product with 0 cost of production
n individuals with valuations v_1, v_2, ..., v_n
Auctioneer wants to maximize profit

Example of the Exponential Mechanism
Data: x_i = website visited by student i today
Range: Y = {website names}
For each name y, let q(y, X) = #{i : x_i = y} (the size of the subset of students visiting y)
Goal: output the most frequently visited site
Procedure: given X, output website y with probability proportional to e^{ε·q(y,X)}
Popular sites are exponentially more likely to be output than rare ones
Website scores don't change too quickly between adjacent datasets
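For concreteness, here is a minimal sketch of this procedure (my own toy data and ε; since one student changes each count by at most 1, sampling proportionally to e^{ε·q} gives 2ε-differential privacy by the claim two slides below):

```python
# Sketch of the exponential mechanism for "most visited website".
# Toy data and epsilon are illustrative, not from the lecture.
import numpy as np
from collections import Counter

def exp_mech_most_visited(visits, candidates, epsilon, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    counts = Counter(visits)
    scores = np.array([counts[y] for y in candidates], dtype=float)
    # Subtracting the max before exponentiating is for numerical
    # stability only; it does not change the sampling distribution.
    weights = np.exp(epsilon * (scores - scores.max()))
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

sites = ["a.com", "a.com", "b.com", "a.com", "c.com"]
print(exp_mech_most_visited(sites, ["a.com", "b.com", "c.com"], epsilon=1.0))
```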

Setting
For input D ∈ U^n we want to find r ∈ R
Base measure μ on R, usually uniform
Score function q': U^n × R → ℝ assigns a real value to any pair (D, r)
– We want to maximize it (approximately)
The exponential mechanism:
– Assign output r ∈ R with probability proportional to e^{ε·q'(D,r)} · μ(r)
Normalizing factor: ∑_r e^{ε·q'(D,r)} · μ(r)

The exponential mechanism is private
Let Δ = max over adjacent D, D' and all r of |q'(D,r) − q'(D',r)|
Claim: the exponential mechanism yields a 2·Δ·ε-differentially private solution
Prob[output = r on input D] = e^{ε·q'(D,r)} μ(r) / ∑_r e^{ε·q'(D,r)} μ(r)
Prob[output = r on input D'] = e^{ε·q'(D',r)} μ(r) / ∑_r e^{ε·q'(D',r)} μ(r)
The ratio of the two is bounded by e^{2εΔ}
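The one-line computation behind the claim, written out (a sketch, using the notation above):

```latex
\frac{\Pr[\text{output}=r \mid D]}{\Pr[\text{output}=r \mid D']}
= \frac{e^{\varepsilon q'(D,r)}\,\mu(r)}{e^{\varepsilon q'(D',r)}\,\mu(r)}
  \cdot
  \frac{\sum_{s} e^{\varepsilon q'(D',s)}\,\mu(s)}{\sum_{s} e^{\varepsilon q'(D,s)}\,\mu(s)}
\le e^{\varepsilon\Delta} \cdot e^{\varepsilon\Delta}
= e^{2\varepsilon\Delta}
```

since |q'(D,s) − q'(D',s)| ≤ Δ for every s, each term of one sum is within an e^{εΔ} factor of the matching term of the other.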

Laplace Noise as an Instance of the Exponential Mechanism
On query q: U^n → ℝ, let q'(D,r) = −|q(D) − r|
Prob[noise = y] = e^{−ε|y|} / ∫ e^{−ε|y'|} dy' = (ε/2) · e^{−ε|y|}
The Laplace distribution Y = Lap(b) has density function Pr[Y = y] = (1/2b) · e^{−|y|/b}
So this is exactly Lap(1/ε) noise centered at q(D)

Any Differentially Private Mechanism is an Instance of the Exponential Mechanism
Let M be a differentially private mechanism
Take q'(D,r) to be log Prob[M(D) = r]
Remaining issue: accuracy

Private Ranking
Each element i ∈ {1, ..., n} has a real-valued score S_D(i) based on a data set D
Goal: output the k elements with the highest scores
Privacy: the data set D consists of n entries in domain D
– Differential privacy protects the privacy of the entries in D
Condition: insensitive scores
– For any element i and any data sets D, D' that differ in one entry: |S_D(i) − S_{D'}(i)| ≤ 1

Approximate Ranking
Let S_k be the k-th highest score based on data set D
An output list is α-useful if:
– Soundness: no element in the output has score less than S_k − α
– Completeness: every element with score greater than S_k + α is in the output
(Elements with S_k − α ≤ score ≤ S_k + α may go either way.)

Two Approaches
Score perturbation
– Perturb the scores of the elements with noise
– Pick the top k elements in terms of noisy scores
– Fast and simple implementation
– Question: what sort of noise should be added? What sort of guarantees? (Note that each input affects all scores; details are a homework exercise.)
Exponential sampling (a sketch follows below)
– Run the exponential mechanism k times
– More complicated and slower implementation
– What sort of guarantees?
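A minimal sketch of the exponential-sampling approach (the per-round budget split ε/k and the peeling-without-replacement details are my own reasonable instantiation, not the lecture's prescription):

```python
# Sketch: private top-k by running the exponential mechanism k times
# ("peeling"), spending epsilon/k per round. With sensitivity-1 scores
# and weights exp(eps' * score / 2), each round is eps'-DP, so basic
# composition gives epsilon-DP overall. Budget split is illustrative.
import numpy as np

def private_top_k(scores, k, epsilon, rng=None):
    """scores: dict element -> score, each of sensitivity 1; k <= len(scores)."""
    rng = np.random.default_rng() if rng is None else rng
    remaining = dict(scores)
    eps_round = epsilon / k
    chosen = []
    for _ in range(k):
        elems = list(remaining)
        s = np.array([remaining[e] for e in elems], dtype=float)
        w = np.exp(eps_round * (s - s.max()) / 2)  # max-shift for stability
        pick = elems[rng.choice(len(elems), p=w / w.sum())]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen

print(private_top_k({"a": 10, "b": 9, "c": 2, "d": 1}, k=2, epsilon=1.0))
```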

Exponential Mechanism: Simple Example
An (almost free) private lunch:
Database of n individuals, lunch options {1, ..., k}; each individual likes or dislikes each option (1 or 0)
Goal: output a lunch option that many like
For each lunch option j ∈ [k], ℓ(j) is the number of individuals who like j
Exponential Mechanism: output j with probability proportional to e^{ε·ℓ(j)}
Actual probability: e^{ε·ℓ(j)} / (∑_i e^{ε·ℓ(i)}), where the denominator is the normalizer

Synthetic DB: Output is a DB
[Figure: users send query 1, query 2, ... to a sanitizer sitting in front of the database and receive answer 1, answer 2, answer 3, ...]
Synthetic DB: the output is also a DB (of entries from the same universe X); the user reconstructs answers by evaluating each query on the output DB
Software- and people-compatible
Consistent answers

Answering More Queries
Using the exponential mechanism: differential privacy for every set C of counting queries
Error is Õ(n^{2/3} log|C|)
Remarkable: hope for rich private analysis of small DBs!
– Quantitative: #queries >> DB size
– Qualitative: the output of the sanitizer is a synthetic DB, i.e. the output is itself a DB

Counting Queries
Queries with low sensitivity
C is a set of predicates c: U → {0,1}
Query: how many participants of D satisfy c?
Relaxed accuracy: answer each query within α additive error w.h.p.
– Not so bad: such error is anyway inherent in statistical analysis
Non-interactive setting: assume all queries are given in advance
[Figure: a database D of size n over universe U, with a query c posed against it.]

Utility and Privacy Can't Always Be Achieved Simultaneously
Impossibility results for counting queries: a DB with n participants can't have o(√n) error on O(n) queries [DiNi, DwMcTa07, DwYe08]
In all these cases there is a strong privacy violation: almost the entire DB is compromised
What can we do?

Huge DBs [Dwork Nissim]
DB of size n >> number of queries |C|: add independent noise to the answer of every query
Noise per query ~ #queries; for accuracy, need #queries ≤ n
May be reasonable for huge internet-scale DBs
Privacy "for free"
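A minimal sketch of this regime (predicate-style counting queries and the even split of ε across queries are illustrative assumptions):

```python
# Sketch: answer all |C| counting queries with independent Laplace noise.
# Basic composition: each sensitivity-1 query gets budget epsilon/|C|,
# i.e. Lap(|C|/epsilon) noise, so per-query error grows with |C| and the
# method is accurate only when n >> |C|.
import numpy as np

def answer_all_queries(db, queries, epsilon, rng=None):
    """db: list of rows; queries: list of predicates row -> bool."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(queries)
    return [sum(1 for x in db if c(x)) + rng.laplace(0.0, k / epsilon)
            for c in queries]
```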

What about smaller DBs?
DB of size n < number of queries |C|
Impossibility results: can't have o(√n) error; the error must be Ω(√n)

The BLR Algorithm [Blum Ligett Roth 08]
For DBs F and D: dist(F,D) = max_{q∈C} |q(F) − q(D)|
Intuition: far-away DBs get smaller probability
Algorithm on input DB D:
– Sample from a distribution on DBs of size m (m < n): DB F gets picked w.p. ∝ e^{−ε·dist(F,D)}
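A brute-force sketch of this sampler for a tiny universe (exponential-time enumeration, in line with the running-time slide below; I take the counting queries to be normalized to fractions so that DBs of sizes m and n are comparable, and the toy inputs are illustrative):

```python
# Sketch: BLR sampler by explicit enumeration of all size-m DBs.
# Feasible only for toy parameters. Queries return fractions so that
# dist(F, D) compares DBs of different sizes (an assumption on my part).
import itertools
import numpy as np

def blr_sample(D, universe, queries, m, epsilon, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    def answers(db):
        return np.array([sum(1 for x in db if c(x)) / len(db) for c in queries])
    target = answers(D)
    # All size-m multisets over the universe.
    cands = list(itertools.combinations_with_replacement(universe, m))
    # dist(F, D) = max over queries q of |q(F) - q(D)|.
    dists = np.array([np.abs(answers(F) - target).max() for F in cands])
    w = np.exp(-epsilon * (dists - dists.min()))  # min-shift for stability
    return cands[rng.choice(len(cands), p=w / w.sum())]

D = [0, 0, 1, 0]
qs = [lambda x: x == 0, lambda x: x == 1]
print(blr_sample(D, universe=[0, 1], queries=qs, m=2, epsilon=1.0))
```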

The BLR Algorithm
Idea: do not use a large DB
– Sample and answer accordingly
In general, a DB of size m suffices to hit each query with sufficient accuracy

The BLR Algorithm: 2ε-Privacy
For adjacent D, D' and every F: |dist(F,D) − dist(F,D')| ≤ 1
Probability of F under D: e^{−ε·dist(F,D)} / ∑_{G of size m} e^{−ε·dist(G,D)}
Probability of F under D': both the numerator and the denominator change by a factor of at most e^ε ⟹ 2ε-privacy

The BLR Algorithm: Error Õ(n^{2/3} log|C|)
There exists F_good of size m = Õ((n/α)² · log|C|) s.t. dist(F_good, D) ≤ α
Pr[F_good] ~ e^{−εα}
For any F_bad with dist ≥ 2α: Pr[F_bad] ~ e^{−2εα}
Union bound: ∑ over bad DBs F_bad of Pr[F_bad] ~ |U|^m · e^{−2εα}
For α = Õ(n^{2/3} log|C|), Pr[F_good] >> ∑ Pr[F_bad]
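Filling in the parameter setting behind the last line (a sketch; log factors handled loosely, as on the slide):

```latex
\Pr[F_{\mathrm{good}}] \gg \sum_{F_{\mathrm{bad}}} \Pr[F_{\mathrm{bad}}]
\quad\Longleftarrow\quad
e^{-\varepsilon\alpha} \gg |U|^{m}\, e^{-2\varepsilon\alpha}
\quad\Longleftrightarrow\quad
\varepsilon\alpha \gtrsim m \log|U| = \tilde{O}\!\big((n/\alpha)^{2}\log|C|\big)
```

so α³ = Õ(n² log|C| / ε), i.e. α = Õ(n^{2/3} log|C| / ε) up to the exact powers of the logarithmic and 1/ε terms.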

The BLR Algorithm: Running Time
Generating the distribution by enumeration: need to enumerate every size-m database, where m = Õ((n/α)² · log|C|)
Running time ≈ |U|^{Õ((n/α)²·log|C|)}

Conclusion
Offline algorithm, 2ε-differential privacy for any set C of counting queries
Error α is Õ(n^{2/3} log|C| / ε)
Super-polynomial running time: |U|^{Õ((n/α)²·log|C|)}

Can We Sanitize Efficiently?
The good news: if the universe U is small, we can sanitize EFFICIENTLY, in time poly(|C|, |U|)
The bad news: we cannot do much better, namely we cannot sanitize in time sub-poly(|C|) AND sub-poly(|U|)

How Efficiently Can We Sanitize?
[Table: running time by whether it is sub-polynomial or polynomial in |C| and in |U|]

                 |U| sub-poly    |U| poly
|C| sub-poly         ?               ?
|C| poly             ?           Good news!

The Good News: Can Sanitize When the Universe is Small
Efficient sanitizer for query set C:
– DB size n ≥ Õ(|C|^{o(1)} · log|U|)
– Error is ~ n^{2/3}
– Runtime poly(|C|, |U|)
– Output is a synthetic database
Compare to [Blum Ligett Roth]: n ≥ Õ(log|C| · log|U|), runtime super-poly(|C|, |U|)

Recursive Algorithm
Start with DB D and a large query set C = C_0
Repeatedly choose a random subset C_{i+1} of C_i: shrink the query set by a (small) factor, giving the chain C_0 = C ⊇ C_1 ⊇ C_2 ⊇ ... ⊇ C_b

Recursive Algorithm (continued)
Start with DB D and large query set C = C_0; repeatedly choose a random subset C_{i+1} of C_i, shrinking the query set by a (small) factor down to C_b
End of the recursion: sanitize D w.r.t. the small query set C_b
The output is good for all queries in the small set C_{i+1}
Extract utility on almost all queries in the larger set C_i
Fix the remaining "underprivileged" queries in the large set C_i