When Random Sampling Preserves Privacy Kamalika Chaudhuri U.C.Berkeley Nina Mishra U.Virginia.

Presentation transcript:

When Random Sampling Preserves Privacy
Kamalika Chaudhuri, U.C. Berkeley
Nina Mishra, U. Virginia

The Problem
• Setting:
    - Table: a set of rows
    - Sanitizer: releases each row with probability p
• Under what conditions does this sanitizer preserve privacy?
• Diagram: Database → Sanitizer → Sanitized Database
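
The sanitizer in this setting can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `sanitize`, the list-of-values table, and the assumption that rows are kept independently of one another are all choices made here.

```python
import random

def sanitize(table, p, rng=None):
    """Release each row of `table` with probability p.

    Assumption (not stated on the slide): rows are kept independently.
    """
    rng = rng or random.Random()
    # random() returns a float in [0, 1), so p = 0 keeps nothing and p = 1 keeps everything.
    return [row for row in table if rng.random() < p]

# Hypothetical toy table: nine rows with value 0 and one row with value 1.
table = [0] * 9 + [1]
sample = sanitize(table, p=0.3, rng=random.Random(0))
```

Passing a seeded `random.Random` makes a run repeatable, which is convenient when comparing what the same sanitizer releases for two neighboring tables.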

Search Data
• AOL released user search data, with usernames replaced by random IDs

Search Data
• Example queries shown on the slide: “Berkeley restaurants”, “Low degree spanning trees”, “Tickets to India”, “Privacy sampling”, “Airfare Santa Barbara”, “Traffic on 101N”, “Restaurants Mountain View”, “Rank Aggregation”, “Memory bound functions”, “Crypto registration”, “Falafel Charlottesville”, “Query Auditing”, “Clustering streaming”, “Tickets to SFO”, “Privacy sampling”
• User labels shown on the slide: Kamalika, Cynthia, Nina

U.S. Census Data
• Random sample of preprocessed data. Preprocessing includes:
    - Removing unique values
    - Merging cells with fewer than a threshold number of individuals

Privacy Definition [DMNS06, …]
• ε-Indistinguishability:
    - Two tables T and T’ differ by a single row
    - S: output of the sanitizer
    - Pr[S | T] ≤ (1 + ε) Pr[S | T’]

An Example
• ε-indistinguishability cannot always be achieved with random sampling:
    - T: n rows with value 0
    - T’: n − 1 rows with value 0 and 1 row with value 1
    - S: 1 row with value 1 and s − 1 rows with value 0
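
The failure in this example can be made explicit with a short calculation. It is a sketch under assumptions not spelled out on the slide: each row is kept independently with probability p, S is treated as a multiset of values with 1 ≤ s ≤ n, and the indistinguishability condition is required with T and T’ in either role.

```latex
\[
\Pr[S \mid T] = 0,
\qquad
\Pr[S \mid T'] = p \cdot \binom{n-1}{s-1}\, p^{\,s-1} (1-p)^{\,n-s} > 0
\quad \text{for } 0 < p < 1 .
\]
\[
\text{Hence } \Pr[S \mid T'] \le (1+\varepsilon)\,\Pr[S \mid T] = 0
\text{ fails for every finite } \varepsilon .
\]
```

The first probability is zero because T contains no row with value 1, so no sample drawn from T can contain one.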

Privacy Definition [DKMMiNa06, BDMN05]
• (ε, δ)-Indistinguishability:
    - Two tables T and T’ differ by a single row
    - S: output of the sanitizer
    - With probability at least 1 − δ, Pr[S | T] ≤ (1 + ε) Pr[S | T’]

An Example
• (ε, δ)-indistinguishability cannot always be achieved for all tables
    - Example: a table in which all rows have unique values

When does Random Sampling preserve Privacy?
• Parameters:
    - (ε, δ)-indistinguishability
    - k: number of distinct values in T
    - t: number of values that occur at most log(k/δ)/ε times in T
• Theorem: (ε, δ)-indistinguishability can be guaranteed if
    - p < ε (if t = 0)
    - p < Õ(εδ/t) (if t > 0)

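Classification of Values
• For (ε, δ)-indistinguishability, each value v is classified by the number of rows with value v:
    - Rare value: at most log(k/δ)/ε rows
    - Infrequent value: between log(k/δ)/ε and log(k/δ)/p rows
    - Common value: more than log(k/δ)/p rows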
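
The classification can be sketched as a small Python function using the two cutoffs log(k/δ)/ε and log(k/δ)/p. Several details are assumptions made for illustration: the natural logarithm, the treatment of counts that fall exactly on a cutoff, and the names `classify_values`, `eps` and `delta`.

```python
import math
from collections import Counter

def classify_values(table, eps, delta, p):
    """Label each distinct value in `table` as rare, infrequent or common."""
    counts = Counter(table)
    k = len(counts)                       # number of distinct values
    log_term = math.log(k / delta)        # natural log is an assumption
    rare_cutoff = log_term / eps
    common_cutoff = log_term / p
    labels = {}
    for value, count in counts.items():
        if count <= rare_cutoff:
            labels[value] = "rare"
        elif count <= common_cutoff:
            labels[value] = "infrequent"
        else:
            labels[value] = "common"
    return labels

# Hypothetical toy table with three distinct values.
table = ["a"] * 1000 + ["b"] * 20 + ["c"] * 2
labels = classify_values(table, eps=0.5, delta=0.1, p=0.1)
t = sum(1 for label in labels.values() if label == "rare")
```

Here the cutoffs are about 6.8 and 34.0 rows, so "c" is rare, "b" is infrequent and "a" is common, which gives t = 1.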

Rare Values
• If a rare value v is observed in a random sample, Pr[S|T’] > (1 + … log(k …) Pr[S|T]

Common Values
• For a common value v, Pr[S|T] ≈ Pr[S|T’]
• Typically, the number of sampled rows with a common value is close to its expectation
• (Figure: value ranges Rare, Infrequent, Common, split at log(k/δ)/ε and log(k/δ)/p rows)

Infrequent Values
• For an infrequent value v, Pr[S|T] ≈ Pr[S|T’]
• Typically, the number of sampled rows with an infrequent value is at most log(k/δ) away from its expected value
• (Figure: value ranges Rare, Infrequent, Common, split at log(k/δ)/ε and log(k/δ)/p rows)
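
A simplified calculation shows why the count of a value matters. The sketch below looks only at how many copies of a single value v appear in the sample. It compares a table with c rows holding v against a neighboring table with c − 1 such rows, and assumes rows are kept independently with probability p. The function names are hypothetical, and the full analysis on the slides also has to account for the row that replaces v.

```python
import math

def count_pmf(c, j, p):
    """Probability that exactly j of c rows survive when each is kept with probability p."""
    return math.comb(c, j) * p**j * (1 - p)**(c - j)

def likelihood_ratio(c, j, p):
    """Pr[j copies of v | c rows hold v] divided by Pr[j copies of v | c - 1 rows hold v]."""
    return count_pmf(c, j, p) / count_pmf(c - 1, j, p)

# A value held by only 2 rows, observed once in the sample.
rare_ratio = likelihood_ratio(c=2, j=1, p=0.1)
# A value held by 1000 rows, observed exactly as often as expected (100 times).
common_ratio = likelihood_ratio(c=1000, j=100, p=0.1)
```

Algebraically the ratio simplifies to (1 − p) · c / (c − j). For the rare value it is 1.8, far from 1. For the common value observed at its expected count it is 1, which matches the claim that the two probabilities are approximately equal when counts stay near their expectation.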

Properties of a Good Sample
• A sample S is ε-indistinguishable if:
    - It contains no rare values
    - For each common value v, the number of rows with value v is within a constant factor of its expectation
    - For each infrequent value v, the number of rows with value v exceeds its expected value by at most an additive O(log(k/δ))
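
The three conditions can be turned into a checker. This is only a sketch: the slides do not give the constant factor or the constant hidden in O(log(k/δ)), so `factor` and `add_const` below are hypothetical placeholders, as are the function name, the natural logarithm, and the cutoffs log(k/δ)/ε and log(k/δ)/p used to classify values.

```python
import math
from collections import Counter

def is_good_sample(table, sample, eps, delta, p, factor=2.0, add_const=1.0):
    """Check the three 'good sample' conditions against the original table."""
    table_counts = Counter(table)
    sample_counts = Counter(sample)
    log_term = math.log(len(table_counts) / delta)   # log(k / delta)
    for value, count in table_counts.items():
        observed = sample_counts.get(value, 0)
        expected = p * count
        if count <= log_term / eps:
            # Rare value: must not appear in the sample at all.
            if observed > 0:
                return False
        elif count <= log_term / p:
            # Infrequent value: at most an additive add_const * log(k/delta) above expectation.
            if observed > expected + add_const * log_term:
                return False
        else:
            # Common value: within a constant factor of expectation.
            if not (expected / factor <= observed <= expected * factor):
                return False
    return True

table = ["a"] * 1000 + ["b"] * 20 + ["c"] * 2
good = is_good_sample(table, ["a"] * 95 + ["b"] * 3, eps=0.5, delta=0.1, p=0.1)
leaky = is_good_sample(table, ["a"] * 95 + ["b"] * 3 + ["c"], eps=0.5, delta=0.1, p=0.1)
```

In this toy run `good` is True, while `leaky` is False because the rare value "c" appears in the sample.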

When does Random Sampling preserve Privacy?
• Such a sample occurs with probability at least 1 − δ if
    - p < ε (if t = 0)
    - p < Õ(εδ/t) (if t > 0)

Utility of Random Sampling
• Assuming no rare values:
    - Error in the frequency of each value: additive 1/√n
• [DMNS06] estimates the histogram with an additive error of 1/n in each frequency
• Sampling may give a compact representation of the histogram
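
Estimating a histogram from a released sample can be sketched as follows. The table, the sampling probability and the function name `estimate_frequencies` are illustrative assumptions. The snippet only shows the estimator and its error on one seeded run, not the bound stated on the slide.

```python
import random
from collections import Counter

def estimate_frequencies(sample):
    """Estimate each value's relative frequency as its share of the sample."""
    total = len(sample)
    if total == 0:
        return {}
    return {value: count / total for value, count in Counter(sample).items()}

rng = random.Random(0)                      # seeded so the run is repeatable
table = ["a"] * 3000 + ["b"] * 7000         # true frequency of "a" is 0.3
sample = [row for row in table if rng.random() < 0.1]
estimates = estimate_frequencies(sample)
error_a = abs(estimates.get("a", 0.0) - 0.3)
```

The sample holds roughly a tenth of the rows yet still approximates the histogram. By standard binomial behavior, the additive error shrinks roughly like one over the square root of the number of sampled rows.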

Conclusions
• Random sampling preserves privacy only when there are few rare values
• With rare values, the probability of failure can be high:
    - δ on the order of 1/n, as opposed to 1/2^n [DKMMiNa06, BDMN05]
• Error in estimating the frequency of each value can be high:
    - Additive 1/√n, as opposed to the 1/n of [DMNS06]

Thank You

The Problem
• Under what conditions does this sanitizer preserve privacy?