Yang, et al. Differentially Private Data Publication and Analysis. Tutorial at SIGMOD’12 Part 4: Data Dependent Query Processing Methods Yin “David” Yang.

Presentation transcript:

Part 4: Data Dependent Query Processing Methods
  Yin “David” Yang, Zhenjie Zhang, Gerome Miklau
  Previous session: Marianne Winslett, Xiaokui Xiao

What we talked about in the last session
  Privacy is a major concern in data publishing.
  Simple anonymization methods fail to provide sufficient privacy protection.
  Definition of differential privacy: it is hard to tell from the query results whether a record is in the DB (plausible deniability).
  Basic solutions:
    Laplace mechanism: inject Laplace noise into query results.
    Exponential mechanism: choose a result randomly; a “good” result has a higher probability.
  Data independent methods.
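
As a reminder of the Laplace mechanism named in this recap, here is a minimal Python sketch. The function names (`laplace_noise`, `laplace_mechanism`) and the example count of 120 are illustrative assumptions, not taken from the tutorial; the sketch assumes a numeric query whose sensitivity is known.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw one zero-mean Laplace sample with the given scale."""
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random):
    """Return the true answer plus Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical usage: a count query (sensitivity 1) answered with epsilon = 0.5.
rng = random.Random(0)
noisy_count = laplace_mechanism(true_answer=120, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller epsilon gives a larger noise scale, which is the privacy/accuracy trade-off the rest of this part tries to improve.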

Data independent vs. data dependent
                          Data independent methods   Data dependent methods
  Sensitive info          Query results              Query results + data dependent parameters
  Error source            Injected noise             Injected noise + information loss
  Noise type              Unbiased                   Often biased
  Asymptotic error bound  Higher                     Lower, with data dependent constants
  Practical accuracy      Higher                     Lower for some data

Types of data dependent methods
  Type 1: optimizing noisy results
    1. Inject noise.
    2. Optimize the noisy query results based on their values.
  Type 2: transforming original data
    1. Transform the data to reduce the amount of necessary noise.
    2. Inject noise.

Optimizing noisy results: Hierarchical Strategy
  Hierarchical strategy (presented in the last session): a tree with a count in each node.
  Data dependent optimization: if a node N has a noisy count close to 0, set the noisy count at N to 0.
  Example: noisy count 0.05, optimized count 0.
  Hay et al. Boosting the Accuracy of Differentially-Private Queries Through Consistency, VLDB’10.

Optimizing noisy results: iReduct
  Setting: answer a set of m queries.
  Goal: minimize their total relative error, where RelErr = |noisy result - actual result| / actual result.
  Example: two queries, q1 and q2, with actual results q1: 10, q2: 20.
  Observation: we should add less noise to q1 than to q2.
  Xiao et al. iReduct: Differential Privacy with Reduced Relative Errors, SIGMOD’11.

Answering queries differently leads to different total relative error
  Continuing the example: two queries, q1 and q2, with actual answers 10 and 20. Suppose each of q1 and q2 has sensitivity 1.
  Two strategies:
    Answer q1 with ε/2 and q2 with ε/2. Laplace noise scale: 2/ε on q1, 2/ε on q2.
    Answer q1 with 2ε/3 and q2 with ε/3. Laplace noise scale: 1.5/ε on q1, 3/ε on q2. Lower relative error overall.
  But we don’t know which strategy is better before comparing their actual answers!
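
The comparison between the two budget splits can be checked numerically. The sketch below is an illustration under stated assumptions: Laplace noise with scale sensitivity / (budget share × ε), the facts E|Lap(b)| = b and Var(Lap(b)) = 2b², and a hypothetical helper `expected_errors` that is not part of the tutorial. It reports the total error under two aggregations, since the outcome depends on which one is used.

```python
def expected_errors(true_answers, budget_shares, epsilon=1.0, sensitivity=1.0):
    """Return (sum of expected |relative error|, sum of expected squared relative error)."""
    abs_total, sq_total = 0.0, 0.0
    for answer, share in zip(true_answers, budget_shares):
        scale = sensitivity / (share * epsilon)    # Laplace scale b for this query
        abs_total += scale / answer                # E|Lap(b)| = b
        sq_total += 2 * scale ** 2 / answer ** 2   # Var(Lap(b)) = 2 b^2
    return abs_total, sq_total

even = expected_errors([10, 20], [1 / 2, 1 / 2])    # budgets eps/2 and eps/2
skewed = expected_errors([10, 20], [2 / 3, 1 / 3])  # budgets 2*eps/3 and eps/3
print(even, skewed)
```

Under these assumptions the 2ε/3 : ε/3 split has the smaller total squared relative error, while the two splits are level on total expected absolute relative error, so the choice of error measure matters when ranking strategies.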

Idea of iReduct
  1. Answer all queries with privacy budget ε/t.
  2. Refine the noisy results with budget ε/t, spending more budget on queries with smaller results.
     How to refine a noisy count?
       Method 1: obtain a new noisy version and compute its weighted average with the old version.
       Method 2: obtain a refined version directly from a complicated distribution.
  3. Repeat the last step t - 1 times.

Example of iReduct (two queries, q1 and q2)
  Iteration 1: budget ε/2t for q1 and ε/2t for q2; noisy results q1 = 16, q2 = 14. Refinement budgets: ε/t × 14/30 for q1, ε/t × 16/30 for q2.
  Iteration 2: budgets ε/t × 2/3 for q1, ε/t × 1/3 for q2; results q1 = 9, q2 = 22.
  ...
  Iteration 3: budgets ε/t × 22/31 for q1, ε/t × 9/31 for q2.
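
The budget fractions in this example (14/30 and 16/30 from results 16 and 14; 22/31 and 9/31 from results 9 and 22) are consistent with giving each query a share proportional to 1 / its current noisy result. The Python sketch below illustrates that rule together with "Method 1" (a weighted average of the old and new noisy versions). It is a simplified illustration, not the iReduct algorithm from the paper: the inverse-variance weights, the `floor` guard against non-positive noisy results, and all function names are assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Zero-mean Laplace sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def budget_shares(noisy_results, floor=1.0):
    """Split one round's budget in proportion to 1 / noisy result."""
    inverse = [1.0 / max(r, floor) for r in noisy_results]  # floor is an assumption
    total = sum(inverse)
    return [w / total for w in inverse]

def refine_iteratively(true_answers, epsilon, rounds, rng, sensitivity=1.0):
    """Spend epsilon / rounds per round; return (estimates, total budget spent)."""
    m = len(true_answers)
    per_round = epsilon / rounds
    # Round 1: split the round's budget equally among the m queries.
    eps_first = per_round / m
    estimates = [a + laplace_noise(sensitivity / eps_first, rng) for a in true_answers]
    precision = [eps_first ** 2] * m  # proportional to 1 / noise variance
    spent = per_round
    for _ in range(rounds - 1):
        # Shares depend only on already-released noisy values.
        shares = budget_shares(estimates)
        for i, share in enumerate(shares):
            eps_new = per_round * share
            fresh = true_answers[i] + laplace_noise(sensitivity / eps_new, rng)
            w_new = eps_new ** 2
            # Method 1: inverse-variance weighted average of old and new versions.
            estimates[i] = (precision[i] * estimates[i] + w_new * fresh) / (precision[i] + w_new)
            precision[i] += w_new
        spent += per_round
    return estimates, spent

print(budget_shares([16, 14]))  # shares for q1 and q2 after iteration 1
print(refine_iteratively([10, 20], epsilon=1.0, rounds=5, rng=random.Random(0)))
```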

Optimizing noisy results: MW
  Problem: publish a histogram under DP that is optimized for a given query set.
  Idea: start from a uniform histogram, then repeat the following t times:
    Evaluate all queries.
    Find the query q with the worst accuracy.
    Modify the histogram to improve the accuracy of q using a technique called multiplicative weights (MW).
  Hardt et al. A simple and practical algorithm for differentially private data release, arXiv.

Example of MW
  Setup: an exact histogram and two range count queries, q1 and q2.
  Initial histogram: q1 is less accurate. No privacy budget cost!
  Iteration 1: optimize q1; privacy cost ε/t. q1 is still less accurate.
  Iteration 2: optimize q1; privacy cost ε/t. Now q2 is less accurate.
  Iteration 3: optimize q2; privacy cost ε/t.
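
A single histogram update of the kind used in this example can be sketched as follows. This is a non-private illustration: in the algorithm described above, choosing the query and measuring its answer are the steps that cost ε/t, and here `measured_answer` is simply passed in. The exponential reweighting with a 2n scaling is the commonly stated form of the multiplicative weights step and should be read as an assumption, as should the helper names and the toy numbers.

```python
import math

def range_query(lo, hi, num_bins):
    """0/1 indicator vector selecting bins lo..hi (inclusive)."""
    return [1.0 if lo <= i <= hi else 0.0 for i in range(num_bins)]

def evaluate(query, histogram):
    """Answer a linear query on a histogram."""
    return sum(q * h for q, h in zip(query, histogram))

def mw_update(histogram, query, measured_answer):
    """One multiplicative-weights step toward measured_answer; total mass is preserved."""
    n = sum(histogram)
    error = measured_answer - evaluate(query, histogram)
    # Bins covered by the query are scaled up if the estimate is too low, down if too high.
    reweighted = [h * math.exp(q * error / (2.0 * n)) for h, q in zip(histogram, query)]
    total = sum(reweighted)
    return [n * v / total for v in reweighted]

# Toy run: 4 bins, 100 records, uniform start, one query over bins 1..2.
histogram = [25.0] * 4
q1 = range_query(1, 2, 4)
for _ in range(3):
    histogram = mw_update(histogram, q1, measured_answer=70.0)
print(evaluate(q1, histogram))
```

Each step moves the synthetic histogram's answer for the chosen query toward the measurement without changing the total count.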

Optimizing noisy results: NoiseFirst
  Problem: publish a histogram.
  Figure: original data in a medical statistical DB, and the histogram built from it.
  Xu et al. Differentially Private Histogram Publication, ICDE’12.

Reduce error by merging bins
  Figure: exact histogram, noisy histogram, and optimized histogram.
  The bin-merging scheme is computed through dynamic programming.
  Positive/negative noise cancels out!
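
The cancellation effect can be seen on a toy example. The sketch below only applies a given merging scheme to an already-noisy histogram; the dynamic program that chooses the scheme is not shown. The helper names and all numbers are made up for illustration.

```python
def merge_bins(noisy_counts, group_starts):
    """Replace each group of consecutive bins by the group's average.

    group_starts holds the start index of each group, in increasing order, beginning with 0.
    """
    if not group_starts or group_starts[0] != 0:
        raise ValueError("group_starts must begin with 0")
    ends = list(group_starts[1:]) + [len(noisy_counts)]
    merged = []
    for start, end in zip(group_starts, ends):
        group = noisy_counts[start:end]
        mean = sum(group) / len(group)
        merged.extend([mean] * len(group))
    return merged

def squared_error(estimate, exact):
    """Sum of squared differences between two histograms."""
    return sum((e - x) ** 2 for e, x in zip(estimate, exact))

exact = [10, 10, 10, 50, 50]            # illustrative exact histogram
noisy = [12.0, 9.0, 8.5, 53.0, 48.0]    # illustrative noisy version
merged = merge_bins(noisy, [0, 3])      # merge bins 0..2 and bins 3..4
print(squared_error(noisy, exact), squared_error(merged, exact))
```

Because merging is applied to noisy counts only, it is post-processing and consumes no additional privacy budget; it helps when the merged bins have similar exact counts.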

Next we focus on the second type.
  Type 1: optimizing noisy results
    1. Inject noise.
    2. Optimize the noisy query results based on their values.
  Type 2: transforming original data
    1. Transform the data to reduce the amount of necessary noise.
    2. Inject noise.

Transforming data: StructureFirst
  An alternative solution for histogram publication.
  Figure: original histogram and the histogram after merging bins; the merged bins have sensitivity ∆ = 1, ∆ = 1/3 and ∆ = 1/2.
  Lower sensitivity means less noise!
  Xu et al. Differentially Private Histogram Publication, ICDE’12.
  Related: Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM’10.
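
The sensitivities in the figure follow from averaging: if w bins are merged and their mean is published, adding or removing one record changes that mean by at most 1/w. The sketch below illustrates this under the assumption that the merging structure is fixed in advance (how to choose the structure privately is a separate problem). The function names and the sample counts are hypothetical.

```python
import random

def group_sensitivities(group_starts, num_bins):
    """Sensitivity 1 / width of the mean of each merged group."""
    ends = list(group_starts[1:]) + [num_bins]
    return [1.0 / (end - start) for start, end in zip(group_starts, ends)]

def noisy_merged_histogram(counts, group_starts, epsilon, rng):
    """Publish each group's mean with Laplace noise of scale (1 / width) / epsilon."""
    ends = list(group_starts[1:]) + [len(counts)]
    out = []
    for start, end in zip(group_starts, ends):
        width = end - start
        mean = sum(counts[start:end]) / width
        scale = (1.0 / width) / epsilon
        # Laplace sample as the difference of two exponentials with mean `scale`.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        out.extend([mean + noise] * width)
    return out

# Groups of width 1, 3 and 2, matching the sensitivities 1, 1/3 and 1/2 in the figure.
print(group_sensitivities([0, 1, 4], 6))
print(noisy_merged_histogram([40, 11, 9, 10, 25, 27], [0, 1, 4], 1.0, random.Random(0)))
```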

But the optimal structure is sensitive!
  Figure: the original histogram and the different optimal structures with and without Alice.
  Alice is an HIV+ patient!

StructureFirst uses the Exponential Mechanism to render its structure differentially private.
  Randomly perturb the optimal histogram structure: set each boundary using the exponential mechanism.
  Steps:
    1. Start from the original histogram and merge bins (k* = 3).
    2. Randomly adjust the boundaries. Consumes ε1.
    3. Add Lap(∆/ε2) noise. Consumes ε2 = ε - ε1.
  Satisfies ε-DP.

Observations on StructureFirst
  Merging bins essentially compresses the data.
  Trade-off: reduced sensitivity vs. information loss.
  Question: can we apply other compression algorithms? Yes!
    Method 1: perform a Fourier transformation, take the first few coefficients, and discard all others.
      Rastogi and Nath. Differentially Private Aggregation Of Distributed Time-series With Transformation And Encryption, SIGMOD’10.
    Method 2: apply the theory of sparse representation.
      Li et al. Compressive Mechanism: Utilizing Sparse Representation in Differential Privacy, WPES’11.
      Hardt and Roth. Beating Randomized Response on Incoherent Matrices. STOC’12.
    Your new paper?

Transforming original data: k-d-tree
  Problem: answer 2D range count queries.
  Solution: index the data with a k-d-tree.
  The k-d-tree structure is sensitive!
  Cormode et al. Differentially Private Space Decompositions. ICDE’12.
  Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM’10.

How to protect the k-d-tree structure?
  Core problem: differentially private median.
  Method 1: exponential mechanism (best). [1]
  Method 2: simply replace the median with the mean. [3]
  Method 3: cell-based method. [2]
    Partition the data with a grid.
    Compute differentially private counts using the grid.
  [1] Cormode et al. Differentially Private Space Decompositions. ICDE’12.
  [2] Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM’10.
  [3] Inan et al. Private Record Matching Using Differential Privacy. EDBT’10.
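
Method 1 can be sketched in a few lines. The version below is one common way to apply the exponential mechanism to the median and is offered as an assumption-laden illustration, not as the exact procedure of [1]: values are clamped to a public range [lo, hi], the utility of a candidate is minus the distance between its rank and n/2, that utility is treated as having sensitivity 1 (which holds when neighbouring datasets differ by replacing one record), and the function name `private_median` is hypothetical.

```python
import math
import random

def private_median(values, lo, hi, epsilon, rng):
    """Sample a median estimate from [lo, hi] with the exponential mechanism."""
    xs = sorted(min(max(v, lo), hi) for v in values)  # clamp to the public range
    n = len(xs)
    edges = [lo] + xs + [hi]
    weights = []
    for i in range(n + 1):
        # Every point in the interval (edges[i], edges[i + 1]) has rank i.
        length = edges[i + 1] - edges[i]
        utility = -abs(i - n / 2.0)
        weights.append(length * math.exp(epsilon * utility / 2.0))
    if sum(weights) <= 0:
        raise ValueError("no interval has positive weight; widen [lo, hi]")
    # Pick an interval proportionally to its weight, then a uniform point inside it.
    i = rng.choices(range(n + 1), weights=weights, k=1)[0]
    return rng.uniform(edges[i], edges[i + 1])

# Hypothetical usage: one split coordinate for a k-d-tree node.
rng = random.Random(0)
split = private_median(list(range(101)), lo=0, hi=100, epsilon=1.0, rng=rng)
```

Building a full tree would call this once per internal node, and the budget spent on the splits has to be accounted for along each root-to-leaf path.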

Transforming original data: S&A
  S&A: Sample and Aggregate.
  Goal: answer a query q whose result does not depend on the dataset cardinality, e.g., avg.
  Idea 1:
    Randomly partition the dataset into m blocks.
    Evaluate q on each block.
    Return the average over the m blocks + Laplace noise.
    Sensitivity: (max - min)/m.
  Idea 2: use the median instead of the average, with the exponential mechanism. Sensitivity is 1!
  Zhenjie has more on this.
  Mohan et al. GUPT: Privacy Preserving Data Analysis Made Easy. SIGMOD’12.
  Smith. Privacy-Preserving Statistical Estimation with Optimal Convergence Rates. STOC’11.
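
Idea 1 can be sketched as follows. This is a minimal illustration, not the GUPT implementation: it assumes the analyst supplies a public output range [lo, hi] (the "min" and "max" in the sensitivity formula), clamps each block's answer to it, and uses a hypothetical function name `sample_and_aggregate`.

```python
import random

def sample_and_aggregate(data, query, m, lo, hi, epsilon, rng):
    """Average the query over m random blocks and add Laplace noise."""
    if m <= 0 or len(data) < m:
        raise ValueError("need at least one record per block")
    shuffled = list(data)
    rng.shuffle(shuffled)
    blocks = [shuffled[i::m] for i in range(m)]  # random partition into m blocks
    # Clamp each block's answer so one record can move the average by at most (hi - lo) / m.
    answers = [min(max(query(block), lo), hi) for block in blocks]
    average = sum(answers) / m
    scale = (hi - lo) / m / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return average + noise

def mean(block):
    return sum(block) / len(block)

# Hypothetical usage: private average of 1000 values known to lie in [0, 999].
rng = random.Random(0)
estimate = sample_and_aggregate(list(range(1000)), mean, m=50, lo=0, hi=999, epsilon=1.0, rng=rng)
```

More blocks mean less noise but smaller blocks, so each block's answer is a rougher estimate of the query on the full dataset.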

Systems using Differential Privacy
  Privacy on the Map
  PINQ
  Airavat
  GUPT

Summary on data dependent methods
  Data dependent vs. data independent.
  Optimizing noisy results:
    Simple optimizations.
    Iterative methods.
  Transforming original data:
    Reduced sensitivity.
    Caution: parameters may reveal information.
  Next: Zhenjie on differentially private data mining.