Foundations of Privacy Lecture 8 Lecturer: Moni Naor

Recap of last week's lecture: Counting Queries
–Online algorithm with good accuracy
–Started changing the data

What if the data is dynamic? Want to handle situations where the data keeps changing
–Not all data is available at the time of sanitization

Google Flu Trends "We've found that certain search terms are good indicators of flu activity. Google Flu Trends uses aggregated Google search data to estimate current flu activity around the world in near real-time."

Example of Utility: Google Flu Trends

What if the data is dynamic? Want to handle situations where the data keeps changing
–Not all data is available at the time of sanitization
Issues:
–When does the algorithm produce an output?
–What does the adversary get to examine?
–How do we define an individual whom we should protect? (D vs. D+Me)
–Efficiency measures of the sanitizer

Data Streams Data is a stream of items. The sanitizer sees each item and updates its internal state. It produces output either on-the-fly or at the end.

Three new issues/concepts
Continual Observation
–The adversary gets to examine the output of the sanitizer all the time
Pan Privacy
–The adversary gets to examine the internal state of the sanitizer. Once? Several times? All the time?
"User" vs. "Event" Level Protection
–Are the items "singletons" or are they related?

Randomized Response Randomized Response Technique [Warner 1965]
–Method for polling stigmatizing questions
–Idea: lie with known probability
Specific answers are deniable, yet aggregate results are still valid. The data is never stored "in the plain" – "trust no-one". Popular in the DB literature [Mishra and Sandler].
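To make the technique concrete, here is a minimal sketch of randomized response and its debiasing step (Python/numpy; the 3/4 truth probability and function names are illustrative choices, not from the slides):

```python
import numpy as np

def randomized_response(true_answer, p=0.75):
    """Warner-style randomized response: answer truthfully w.p. p,
    lie w.p. 1-p. Each individual answer is deniable."""
    return true_answer if np.random.rand() < p else 1 - true_answer

def debias(responses, p=0.75):
    """Recover the population frequency f from the noisy responses:
    E[mean] = (1-p) + f*(2p-1), so f = (mean - (1-p)) / (2p-1)."""
    m = np.mean(responses)
    return (m - (1 - p)) / (2 * p - 1)
```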

The Dynamic Privacy Zoo
–Differentially Private Continual Observation
–Pan Private
–User-Level Private
–User-Level Continual Observation Pan Private
–Randomized Response (the petting zoo)

Continual Output Observation Data is a stream of items. The sanitizer sees each item and updates its internal state. It produces an output observable to the adversary.

Continual Observation Alg – an algorithm working on a stream of data
–Mapping prefixes of data streams to outputs
–Step i produces output σ_i
Alg is ε-differentially private against continual observation if for all
–adjacent data streams S and S'
–all prefixes t and outputs σ_1 σ_2 … σ_t
e^{-ε} ≤ Pr[Alg(S) = σ_1 σ_2 … σ_t] / Pr[Alg(S') = σ_1 σ_2 … σ_t] ≤ e^{ε} ≈ 1 + ε
Adjacent data streams: one can be obtained from the other by changing one element
S = acgtbxcde
S' = acgtbycde

The Counter Problem 0/1 input stream. Goal: a publicly observable counter approximating the total number of 1's so far. Continual output: each time period, output the total number of 1's. Want to hide individual increments while providing reasonable accuracy.

Counters w. Continual Output Observation Data is a stream of 0/1. The sanitizer sees each x_i and updates its internal state. It produces a value observable to the adversary.

Counters w. Continual Output Observation Continual output: each time period, output the total number of 1's.
Initial idea: at each time period, on input x_i ∈ {0,1}
–Update the counter by x_i
–Add independent Laplace noise with magnitude 1/ε
Privacy: each increment is protected by Laplace noise – differentially private whether x_i is 0 or 1.
Accuracy: noise partially cancels out; error Õ(√T), where T is the total number of time periods. For sparse streams this error is too high.
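As a baseline, a minimal sketch of this initial idea (Python/numpy; names are our own) – every increment gets its own fresh Laplace noise, so the published value at time t carries the sum of t independent noise terms:

```python
import numpy as np

def naive_private_counter(stream, eps):
    """Add fresh Laplace(1/eps) noise to every increment; the output
    at time t sums t noisy increments, so the error grows like
    sqrt(T)/eps even when the stream is sparse."""
    outputs, total = [], 0.0
    for x in stream:
        total += x + np.random.laplace(scale=1.0 / eps)
        outputs.append(total)
    return outputs
```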

Why So Inaccurate? We operate essentially as in randomized response – no utilization of the state. Problem: we do the same operations when the stream is sparse as when it is dense
–Want to act differently when the stream is dense
The times at which the counter is updated are themselves potential leakage.

Delayed Updates Main idea: update the output value only when there is a large gap between the actual count and the output. We have a good way of outputting the value of the counter once: the actual counter + noise. Maintain:
–Actual count A_t (+ noise)
–Current output out_t (+ noise)
D – update threshold

Delayed Output Counter Out_t – current output; A_t – count since last update; D_t – noisy threshold.
If A_t – D_t > fresh noise then:
–Out_{t+1} ← Out_t + A_t + fresh noise
–A_{t+1} ← 0
–D_{t+1} ← D + fresh noise
Noise: independent Laplace noise with magnitude 1/ε.
Accuracy: for threshold D, w.h.p. we update about N/D times.
Total error: (N/D)^{1/2}·noise + D + noise + noise. Set D = N^{1/3} ⇒ accuracy ~ N^{1/3}.
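A sketch of the delayed-output counter mirroring the update rule above (Python/numpy; the exact noise calibration in the lecture may differ – this only illustrates the mechanism):

```python
import numpy as np

def delayed_output_counter(stream, eps, D):
    """Publish a new output only when the count since the last update
    crosses a noisy threshold; between updates the output is constant,
    so sparse streams incur far fewer noise terms."""
    lap = lambda: np.random.laplace(scale=1.0 / eps)
    out = 0.0            # current published output Out_t
    a = 0                # count since last update A_t
    d = D + lap()        # noisy threshold D_t
    outputs = []
    for x in stream:
        a += x
        if a - d > lap():            # gap exceeds fresh noise: update
            out = out + a + lap()    # Out_{t+1} <- Out_t + A_t + noise
            a = 0                    # A_{t+1} <- 0
            d = D + lap()            # D_{t+1} <- D + fresh noise
        outputs.append(out)
    return outputs
```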

Privacy of Delayed Output Protect: the update times and the update values. For any two adjacent sequences we can pair up noise vectors
(λ_1, λ_2, …, λ_{k-1}, λ_k, λ_{k+1}, …) and (λ_1, λ_2, …, λ_{k-1}, λ'_k, λ_{k+1}, …)
identical in all locations except one, λ'_k = λ_k + 1, where k is the first update after the difference occurred. The probability ratio is ≈ e^ε (recall the rule: if A_t – D_t > fresh noise then Out_{t+1} ← Out_t + A_t + fresh noise and D_{t+1} ← D + fresh noise).

Dynamic from Static Idea: apply the conversion of static algorithms into dynamic ones (Bentley–Saxe 1980). Run many accumulators in parallel:
–each accumulator counts the number of 1's in a fixed segment of time, plus noise
–the value of the output counter at any point in time is the sum of the accumulators of a few segments
Accuracy: depends on the number of segments in the summation and on the accuracy of the accumulators.
Privacy: depends on the number of accumulators that a single point influences.
An accumulator is measured while the stream is inside its time frame; only finished segments are used.

The Segment Construction Based on the bit representation: each point t is in ⌈log t⌉ segments, so Σ_{i=1}^{t} x_i is a sum of at most log t accumulators. By setting ε' ≈ ε / log T we get the desired privacy.
Accuracy: with all but negligible (in T) probability, the error at every step t is at most O((log^{1.5} T)/ε) – the independent noises partially cancel.
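A compact sketch of the segment construction (Python/numpy; the dyadic segments are driven by the binary representation of t as on the slide, with the standard ε' = ε/log T split – edge handling is illustrative):

```python
import numpy as np

def tree_counter(stream, eps):
    """Each dyadic segment gets one accumulator released with Laplace
    noise at eps' = eps / log T; the count at time t is the sum of the
    <= log t noisy accumulators given by the 1-bits of t."""
    T = len(stream)
    levels = max(1, int(np.ceil(np.log2(T + 1))))
    eps_prime = eps / levels
    lap = lambda: np.random.laplace(scale=1.0 / eps_prime)
    exact = [0] * levels    # exact sum of the open segment per level
    noisy = [0.0] * levels  # noisy value released per level
    outputs = []
    for t, x in enumerate(stream, start=1):
        i = (t & -t).bit_length() - 1         # least significant 1-bit of t
        exact[i] = sum(exact[j] for j in range(i)) + x  # merge lower segments
        for j in range(i):                    # lower segments are now closed
            exact[j], noisy[j] = 0, 0.0
        noisy[i] = exact[i] + lap()
        # running count = noisy sums of the segments given by 1-bits of t
        outputs.append(sum(noisy[j] for j in range(levels) if (t >> j) & 1))
    return outputs
```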

Synthetic Counter Can make the counter synthetic:
–Monotone
–Each round the counter goes up by at most 1
Applies to any monotone function.

Lower Bound on Accuracy Theorem: additive inaccuracy of log T is essential for ε-differential privacy, even for ε = 1.
Consider the stream 0^T compared to the collection of T/b streams of the form S_j = 0^{jb} 1^b 0^{T-(j+1)b}.
Call an output sequence correct if it is a b/3-approximation for all points in time.

…Lower Bound on Accuracy Set b = (1/2) log T, Δ = 1/2.
Important properties:
–For any output, the ratio of its probabilities under stream S_j and under 0^T is at least e^{-εb} (hybrid argument from differential privacy)
–Any output sequence is correct (a b/3-approximation at all points in time) for at most one of the S_j or 0^T
Say the probability of a good output sequence is at least Δ. Then for each S_j, the probability under 0^T of an output good for S_j is at least Δ·e^{-εb}. These events are disjoint, so (T/b)·Δ·e^{-εb} ≤ 1 – Δ: contradiction.

Hybrid Proof Let S_j^i = 0^{jb} 1^i 0^{T-jb-i}, so S_j^0 = 0^T and S_j^b = S_j. We want to show that for any event B:
Pr[A(S_j) ∈ B] ≥ e^{-εb} · Pr[A(0^T) ∈ B]
By ε-differential privacy, for each i:
Pr[A(S_j^{i+1}) ∈ B] / Pr[A(S_j^i) ∈ B] ≥ e^{-ε}
Multiplying the b ratios telescopes to Pr[A(S_j^b) ∈ B] ≥ e^{-εb} · Pr[A(S_j^0) ∈ B].
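Written out, the telescoping product behind the claim:

```latex
\frac{\Pr[A(S_j^{b})\in B]}{\Pr[A(S_j^{0})\in B]}
  \;=\; \prod_{i=0}^{b-1}\frac{\Pr[A(S_j^{i+1})\in B]}{\Pr[A(S_j^{i})\in B]}
  \;\ge\; \bigl(e^{-\varepsilon}\bigr)^{b} \;=\; e^{-\varepsilon b}.
```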

What shall we do with the counter? Privacy-preserving counting is a basic building block in more complex environments:
–General characterizations and transformations: an event-level pan-private continual-output algorithm for any low-sensitivity function
–Following expert advice privately: track experts over time and choose whom to follow; need to track how many times each expert was correct

Following Expert Advice [Hannan 1957; Littlestone–Warmuth 1989] n experts; in every time period each gives 0/1 advice. We pick which expert to follow, then learn the correct answer (also 0/1).
Goal: over time, be competitive with the best expert in hindsight – the number of mistakes of the chosen experts ≈ the number of mistakes made by the best expert in hindsight. Want a 1+o(1) approximation.

Following Expert Advice, Privately Same setting: n experts, each giving 0/1 advice per period; pick whom to follow, then learn the correct answer; compete with the best expert in hindsight. New concern: protect the privacy of the experts' opinions and of the outcomes – even whether an expert was consulted at all.
–User-level privacy: lower bound, no non-trivial algorithm
–Event-level privacy: counting gives a 1+o(1)-competitive algorithm

Algorithm for Following Expert Advice Follow the perturbed leader [Kalai–Vempala]: for each expert keep a perturbed number of mistakes and follow the expert with the lowest perturbed count.
Idea: use a counter to maintain a privacy-preserving number of mistakes.
Problem: not every perturbation works – we need a counter with a well-behaved noise distribution.
Theorem [Follow the Privacy-Perturbed Leader]: for n experts over T time periods, the number of mistakes is within ≈ poly(log n, log T, 1/ε) of the best expert.
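A toy sketch of the selection rule (Python/numpy; the lecture's well-behaved private counter is stood in for by a fresh Laplace perturbation per round, which illustrates the leader selection but not the full privacy accounting):

```python
import numpy as np

def private_follow_the_leader(advice_rounds, correct_answers, eps):
    """Each round, follow the expert whose (privately) perturbed
    mistake count is lowest; return our own mistake total."""
    n = len(advice_rounds[0])
    mistakes = np.zeros(n)      # true mistake counts (internal state)
    my_mistakes = 0
    for advice, truth in zip(advice_rounds, correct_answers):
        # stand-in for per-expert private counters of #mistakes
        perturbed = mistakes + np.random.laplace(scale=1.0 / eps, size=n)
        leader = int(np.argmin(perturbed))
        if advice[leader] != truth:
            my_mistakes += 1
        mistakes += np.array([a != truth for a in advice], dtype=float)
    return my_mistakes
```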

List Update Problem There are n distinct elements A = {a_1, a_2, …, a_n} that must be maintained in a list, i.e., some permutation.
–Given a request sequence r_1, r_2, …, where each r_i ∈ A
–For request r_i the cost is how far r_i is in the current permutation
–The list may be rearranged between requests
Want to minimize the total cost of the request sequence, which is not known in advance.
Our goal: do this while providing privacy for the request sequence, assuming the list order is public – for each request r_i, one cannot tell whether r_i is in the sequence or not.

List Update Problem In general the cost can be very high. This was the first problem analyzed in the competitive framework, by Sleator and Tarjan (1985): the online algorithm is compared to the best algorithm that knows the sequence in advance. Best known: 2-competitive deterministic, better randomized (~1.5), assuming free rearrangements between requests.
Bad news: we cannot be better than (1/ε)-competitive if we want to keep privacy – we cannot act until 1/ε requests to an element appear.

Lower bound for Deterministic Algorithms Bad schedule: always ask for the last element in the list.
–Cost of the online algorithm: n·t
–Cost of the best fixed list, sorted according to popularity: average cost ≤ n/2, total cost ≤ (n/2)·t

List Update Problem: Static Optimality A more modest performance goal: compete with the best algorithm that fixes the permutation in advance.
Blum–Chawla–Kalai: can be 1+o(1)-competitive w.r.t. the best static algorithm (probabilistic). The BCK algorithm is based on the number of times each element has been requested:
–Start with random weights r_i in the range [1, c]
–At all times w_i = r_i + c_i, where c_i is the number of times element a_i was requested
–At any point in time, arrange the elements according to their weights

Privacy with Static Optimality Run the same algorithm with a private counter:
–Start with random weights r_i in the range [1, c]
–At any point in time w_i = r_i + c_i, where c_i is the number of times element a_i was requested
–Arrange the elements according to their weights
Privacy: follows from the privacy of the counters – the list depends only on the counters plus the randomness.
Accuracy: the BCK proof can be modified to handle approximate counts as well.
What about efficiency?
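A minimal sketch of the private BCK-style list (Python/numpy; the per-element private counter is abstracted as the true count plus fresh Laplace noise – the lecture would plug in a continual-observation counter here):

```python
import numpy as np

class PrivateListUpdate:
    """Statically optimal list: weight w_i = r_i + c_i with random
    initial weight r_i in [1, c] and (approximately, privately
    maintained) request count c_i; the list is sorted by weight."""

    def __init__(self, elements, c, eps):
        self.eps = eps
        self.r = {a: np.random.uniform(1, c) for a in elements}
        self.count = {a: 0.0 for a in elements}  # stand-in private counter

    def request(self, a):
        # heavier (more requested) elements sit closer to the front
        order = sorted(self.r, key=lambda e: -(self.r[e] + self.count[e]))
        cost = order.index(a) + 1                # position = access cost
        # a real implementation updates a continual-observation counter
        self.count[a] += 1 + np.random.laplace(scale=1.0 / self.eps)
        return cost
```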

The multi-counter problem How to run n counters for T time steps, where in each round a few counters are incremented and the identity of the incremented counters is kept private. Want work per increment logarithmic in n and T.
Idea: arrange the n counters in a binary tree with n leaves –
–Output counters are associated with the leaves
–Each internal node maintains a counter corresponding to the sum of the leaves in its subtree

The multi-counter problem (internal, output) Idea: arrange the n counters in a binary tree with n leaves, with the output counters associated with the leaves. For each internal node maintain:
–a counter corresponding to the sum of the leaves in its subtree
–a register with the number of increments since the last output update (this determines when to update the subtree)
When a leaf counter is updated:
–all log n nodes on the path to the root are incremented
–the internal state of the root is updated
–if the output of a parent node is updated, the internal state of its children is updated

Tree of Counters [figure: output counters at the leaves; each internal node holds a (counter, register) pair]

The multi-counter problem Work per increment: log n increments plus the number of counters that need updating – amortized complexity O(n log n / k), where k is the number of times we expect to increment a counter until its output is updated.
Privacy: each increment of a leaf counter affects log n counters.
Accuracy: we have introduced some delay – after t ≥ k log n increments, all nodes on the path have been updated.
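A structural sketch of the counter tree (Python; all differential-privacy noise is omitted and the threshold rule is simplified – it shows only the (counter, register) bookkeeping and the lazy refresh of children):

```python
class CounterTree:
    """n leaf counters in a heap-layout binary tree. Each node keeps a
    counter (sum of leaves below) and a register (increments since its
    last output update). An increment touches the log n ancestors of a
    leaf; a node's output is refreshed, and its children examined, only
    when its register reaches the threshold k."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.count = [0] * (2 * n)    # node counters (index 1 = root)
        self.reg = [0] * (2 * n)      # increments since last update
        self.output = [0] * (2 * n)   # last published value per node

    def increment(self, leaf):
        node = self.n + leaf
        while node >= 1:              # walk the log n path to the root
            self.count[node] += 1
            self.reg[node] += 1
            node //= 2
        self._refresh(1)              # only the root is examined eagerly

    def _refresh(self, node):
        if self.reg[node] >= self.k:  # time to update this node's output
            self.output[node] = self.count[node]  # + Laplace noise in the real scheme
            self.reg[node] = 0
            if node < self.n:         # internal node: look at the children
                self._refresh(2 * node)
                self._refresh(2 * node + 1)
```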

Pan-Privacy In the privacy literature the data curator is trusted. In reality, even a well-intentioned curator is subject to mission creep, subpoena, security breach…
–Pro baseball's "anonymous" drug tests
–Facebook policies to protect users from application developers
–Google accounts hacked
Goal: the curator accumulates statistical information but never stores sensitive data about individuals.
Pan-privacy: the algorithm is private inside and out – the internal state is privacy-preserving.

Randomized Response [Warner 1965] Method for polling stigmatizing questions. Idea: participants lie with known probability.
–Specific answers are deniable
–Aggregate results are still valid
–Data is never stored "in the clear" (popular in the DB literature [MiSa06])
Strong guarantee: no trust in the curator. Makes sense when each user's data appears only once; otherwise utility is limited.
New idea: the curator aggregates statistical information, but never stores sensitive data about individuals.

Aggregation Without Storing Sensitive Data? Streaming algorithms use small storage, but:
–the information stored can still be sensitive
–"my data" has many appearances, arbitrarily interleaved with those of others
A pan-private algorithm is private "inside and out": even the internal state completely hides the appearance pattern of any individual – presence, absence, frequency, etc. ("user level").

Pan-Privacy Model Data is a stream of items; each item belongs to a user, and the data of different users is interleaved arbitrarily. The curator sees the items, updates its internal state, and produces an output at the end of the stream.
Pan-privacy: for every possible behavior of a user in the stream, the joint distribution of the internal state at any single point in time and the final output is differentially private.
(One can also consider multiple intrusions.)

Adjacency: User Level Universe U of users whose data is in the stream; x ∈ U.
Streams are x-adjacent if they have the same projections of users onto U \ {x}. Example: axbxcxdxxxex and abcdxe are x-adjacent – both project to abcde. This gives a notion of "corresponding locations" in x-adjacent streams.
U-adjacent: ∃ x ∈ U for which the streams are x-adjacent (simply "adjacent" if U is understood).
Note: streams of different lengths can be adjacent.

Example: Stream Density or # Distinct Elements Universe U of users; estimate how many distinct users in U appear in the data stream. Application: the number of distinct users who searched for "flu".
Ideas that don't work:
–Naïve: keep a list of the users that appeared (bad privacy and space)
–Streaming: track a random sub-sample of users (bad privacy), or hash each user and track the minimal hash (bad privacy)

Pan-Private Density Estimator Inspired by randomized response. Store for each user x ∈ U a single bit b_x.
–Initially each b_x is 0 w.p. ½ and 1 w.p. ½ (distribution D_0)
–When encountering x, redraw b_x: 0 w.p. ½ – ε and 1 w.p. ½ + ε (distribution D_1)
Final output: [(fraction of 1's in the table – ½)/ε] + noise.
Pan-privacy: if a user never appeared, the entry is drawn from D_0; if a user appeared any number of times, it is drawn from D_1; and D_0, D_1 are 4ε-differentially private.
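A direct sketch of the estimator (Python/numpy; the final noise scale is an illustrative choice – it should be calibrated so that the output, not just the state, is differentially private):

```python
import numpy as np

def pan_private_density(stream, universe, eps):
    """Keep one biased bit per user. The state is always a table of
    bits drawn from D_0 or D_1, which are close, so an intruder who
    sees the table learns little about any single user."""
    bits = {x: np.random.rand() < 0.5 for x in universe}   # D_0: fair bit
    for x in stream:
        bits[x] = np.random.rand() < 0.5 + eps             # D_1: biased redraw
    frac = sum(bits.values()) / len(universe)
    density = (frac - 0.5) / eps     # debias: E[frac] = 1/2 + eps * density
    return density + np.random.laplace(scale=1.0 / (eps * len(universe)))
```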

Pan-Private Density Estimator – improved accuracy and storage: multiplicative accuracy using hashing; small storage using sub-sampling.

Pan-Private Density Estimator Theorem [density estimation streaming algorithm]: ε pan-privacy, multiplicative error α, space poly(1/α, 1/ε).

The Dynamic Privacy Zoo
–Differentially Private Outputs
–Privacy under Continual Observation
–Pan Privacy
–User-Level Privacy
–Continual Pan Privacy
–Randomized Response (the petting zoo)
–Sketch vs. Stream