CS573 Data Privacy and Security Statistical Databases

CS573 Data Privacy and Security Statistical Databases Li Xiong

Today Statistical databases: definitions; early query restriction methods; output perturbation and differential privacy

Statistical Data Release [Figure: an example patient table (Age, City, Diagnosis) with records such as (25, Lilburn, mantle cell lymphoma) and (35, Decatur, adult T-cell lymphoma), and the population-count histogram over Age and Diagnosis derived from it] Release a statistical summary of the data (vs. individual records). Useful for analysis and learning: medical statistics, query log statistics such as frequent search terms. Still need rigorous inference control.

Statistical Database A statistical database is a database that provides statistics on subsets of records. Statistics such as SUM, MEAN, MEDIAN, COUNT, MAX, and MIN are computed over the records in a query set. Inference control is needed to prevent inference from the released statistics to individual records.

Methods Data perturbation/anonymization; query restriction; output perturbation

Data Perturbation Perturbed data is raw data with noise added. Pro: if the perturbed database is accessed without authorization, the true values are not disclosed. Con: data perturbation runs the risk of presenting biased data.

Query Restriction

Output Perturbation [Figure: the user's query is evaluated on the database and the results are perturbed before being returned]

Methods Data perturbation/anonymization; query restriction (query set size control, query set overlap control, query auditing); output perturbation

Query Set Size Control A query-set size control limits the number of records that may appear in a result set. Query results are displayed only if the size of the query set |C| satisfies K <= |C| <= L - K, where L is the size of the database and K is a parameter satisfying 0 <= K <= L/2. Setting a minimum query-set size helps protect against the disclosure of individual data. Why do we need the upper bound? Because a query set close to the whole database has a complement smaller than K, which could otherwise be inferred.
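A minimal sketch of this check; the function and parameter names are illustrative, not taken from the lecture:

```python
# Query-set-size control: release only if K <= |C| <= L - K.
def size_control_allows(query_set_size: int, db_size: int, k: int) -> bool:
    assert 0 <= k <= db_size // 2, "K must satisfy 0 <= K <= L/2"
    return k <= query_set_size <= db_size - k

# With L = 100 and K = 5: a query matching 3 records is refused, and so is one
# matching 97 records, since its complement would isolate only 3 records.
print(size_control_allows(3, 100, 5))    # False
print(size_control_allows(97, 100, 5))   # False
print(size_control_allows(40, 100, 5))   # True
```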

Query Set Size Control

Tracker Q1: Count ( Sex = Female ) = A Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B What if B = A+1?

Tracker Q1: Count ( Sex = Female ) = A Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B If B = A+1, the predicate (Age = 42 & Sex = Male & Employer = ABC) matches exactly one individual. Q3: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC & Diagnosis = Schizophrenia) ) If Q3 = A+1 the individual has the diagnosis; if Q3 = A they do not. Positively or negatively compromised!
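A toy demonstration of this tracker in Python; the records and attribute values below are invented for illustration, not taken from any real dataset:

```python
records = [
    {"sex": "Female", "age": 30, "employer": "XYZ", "diagnosis": "Flu"},
    {"sex": "Female", "age": 25, "employer": "ABC", "diagnosis": "Asthma"},
    {"sex": "Male",   "age": 42, "employer": "ABC", "diagnosis": "Schizophrenia"},  # the target
    {"sex": "Male",   "age": 50, "employer": "DEF", "diagnosis": "Flu"},
]

def count(pred):
    return sum(1 for r in records if pred(r))

female = lambda r: r["sex"] == "Female"
tracker = lambda r: r["age"] == 42 and r["sex"] == "Male" and r["employer"] == "ABC"

A = count(female)                                   # Q1
B = count(lambda r: female(r) or tracker(r))        # Q2
Q3 = count(lambda r: female(r) or
           (tracker(r) and r["diagnosis"] == "Schizophrenia"))
print(B - A)    # 1  -> the tracker predicate matches exactly one person
print(Q3 - A)   # 1  -> that person has the diagnosis: positive compromise
```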

Query set size control If the threshold value K is large, the control restricts too many queries, and it still does not guarantee protection from compromise: the database can be compromised with as few as 4-5 queries.

Query Set Overlap Control Basic idea: successive queries must be checked against the number of common records. If the number of common records with any previous query exceeds a given threshold, the requested statistic is not released. A query q(C) is allowed only if |q(C) ∩ q(D)| <= r for every previously answered query q(D), where the threshold r > 0 is set by the administrator. The number of queries needed for a compromise has a lower bound of 1 + (K-1)/r.
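A hedged sketch of the overlap check; query sets are represented as sets of record identifiers and the names are illustrative:

```python
def overlap_allows(new_query_set, previous_query_sets, r):
    """Allow the new query only if it shares at most r records with every earlier query set."""
    return all(len(new_query_set & prev) <= r for prev in previous_query_sets)

history = [{1, 2, 3, 4, 5}]
print(overlap_allows({4, 5, 6, 7}, history, r=1))  # False: shares records 4 and 5
print(overlap_allows({6, 7, 8}, history, r=1))     # True: no overlap
```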

Query-set-overlap control Statistics for a set and its subset cannot be released – limiting usefulness High processing overhead – every new query compared with all previous ones Multiple users - need to keep user profile, need to consider collusion between users Still no formal privacy guarantee

Auditing Keep up-to-date logs of all queries made by each user and check for possible compromise when a new query is issued. Excessive computation and storage requirements; "efficient" methods exist only for special types of queries.

Audit Expert (Chin 1982) Query auditing method for SUM queries. A SUM query can be considered as a linear equation a1x1 + a2x2 + ... + aLxL = q, where ai ∈ {0,1} indicates whether record i belongs to the query set, xi is the sensitive value, and q is the query result. A set of SUM queries can then be thought of as a system of linear equations. Audit Expert maintains the binary matrix representing the linearly independent queries answered so far and updates it when a new query is issued. A row with all 0s except for the ith column indicates disclosure of xi.

Audit Expert Only stores linearly independent queries; not all queries are linearly independent. Q1: Sum(Sex=M) Q2: Sum(Sex=M AND Age>20) Q3: Sum(Sex=M AND Age<=20) Here Q1 = Q2 + Q3, so only two of the three need to be stored.
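A sketch of the core test behind Audit Expert, written with numpy rather than Chin's original row-reduction bookkeeping: a value xi is fully determined by the answered SUM queries exactly when the unit vector ei lies in the row space of the 0/1 query matrix. The toy data below is illustrative.

```python
import numpy as np

def disclosed_records(A):
    """Return the indices i whose value xi is fully determined by the query rows of A."""
    n = A.shape[1]
    base_rank = np.linalg.matrix_rank(A)
    leaked = []
    for i in range(n):
        e_i = np.zeros((1, n))
        e_i[0, i] = 1.0
        # e_i is in the row space of A  <=>  appending e_i does not raise the rank
        if np.linalg.matrix_rank(np.vstack([A, e_i])) == base_rank:
            leaked.append(i)
    return leaked

# Toy database with three males x0, x1, x2, where only x2 has Age <= 20:
# Q1 = x0+x1+x2, Q2 = x0+x1, Q3 = x2 (so Q1 is linearly dependent on Q2 and Q3).
A = np.array([[1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
print(disclosed_records(A))   # [2]: x2 is disclosed (via Q1 - Q2, or Q3 directly)
```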

Audit Expert O(L^2) time complexity; later work reduced this to O(L) time and space when the number of queries is less than L. Only handles SUM queries. Maximizing the amount of non-confidential information that can be released is NP-complete.

Auditing – recent developments Online auditing: "detect and deny" queries that would violate the privacy requirement; the denials themselves may implicitly disclose sensitive information. Offline auditing: check whether a privacy requirement has been violated after the queries have been executed; it detects rather than prevents compromise.

Methods Data perturbation/anonymization; query restriction; output perturbation (differential privacy)

Differential Privacy Differential privacy requires the outcome to be formally indistinguishable when the mechanism is run with and without any particular record in the data set. E.g.: Q = select count() where Age = [20,30] and Diagnosis = B. [Figure: output perturbation interface; D1 contains Bob's record, D2 is D1 with Bob removed; the user sees answers A(D1) and A(D2), which must be nearly indistinguishable]
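The formal definition, which the slide presumably carries as an image, is the standard one (Dwork): for all neighboring databases D1, D2 differing in one record and all output sets S,

```latex
\[
\Pr[\mathcal{A}(D_1) \in S] \;\le\; e^{\epsilon}\,\Pr[\mathcal{A}(D_2) \in S].
\]
```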

Differential Privacy Laplace mechanism: answer A(D) = Q(D) + Y, where Y is drawn from the Laplace distribution Lap(ΔQ/ε) and ΔQ is the query sensitivity (the maximum change in Q caused by adding or removing one record). [Figure: differentially private interface; on D1 (Bob in) the answer is A(D1) = Q(D1) + Y1, on D2 (Bob out) it is A(D2) = Q(D2) + Y2]
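A minimal sketch of the Laplace mechanism for a COUNT query; the function name and numbers are illustrative:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return Q(D) + Y with Y ~ Lap(sensitivity / epsilon)."""
    return true_answer + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# A COUNT query has sensitivity 1: adding or removing one record (e.g. Bob)
# changes the count by at most 1.
true_count = 42   # hypothetical result of count() where Age = [20,30] and Diagnosis = B
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.1))
```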

Composition of Differential Privacy Sequential composition [McSherry SIGMOD 09]: if each Mi provides εi-differential privacy, the sequence of the Mi provides (Σi εi)-differential privacy. Parallel composition: if the Di are disjoint subsets of the original database and each Mi provides εi-differential privacy on Di, then the sequence of the Mi provides (maxi εi)-differential privacy. [Figure: a user issues queries Q1, Q2, ... through the differentially private interface and receives answers A1, A2, ... on D1 (Bob in) and D2 (Bob out)] The interactive approach has drawbacks.

Differential Privacy Is unfettered access to raw data truly essential? Is the released data sufficient (does it provide a sufficient utility guarantee)? [Figure: raw data -> privacy mechanism -> released data -> user issuing count queries] Released data can be sanitized data, statistical data, or synthetic data. If the privacy mechanism is interactive, the sequence of query results can be viewed as the released data. It is impossible to guarantee sufficiency for all (any) data or all (any) applications.

Challenges Differential privacy cost accumulates quickly with number of queries Typical tasks require multiple queries or multiple steps Need to support multiple users Impossible to guarantee utility for all (any) data or all (any) applications
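An illustrative sketch of why the cost accumulates: under sequential composition, each answered query spends part of a fixed total budget. The class and parameter names below are made up for the example, not part of the lecture.

```python
class BudgetAccountant:
    """Track a total privacy budget consumed under sequential composition."""
    def __init__(self, epsilon_total):
        self.remaining = epsilon_total

    def charge(self, epsilon_query):
        """Deduct epsilon_query if the budget allows it; otherwise refuse the query."""
        if epsilon_query > self.remaining:
            return False
        self.remaining -= epsilon_query
        return True

acct = BudgetAccountant(epsilon_total=1.0)
answered = 0
while acct.charge(0.25):   # each query is answered with epsilon = 0.25
    answered += 1
print(answered)            # 4: after four queries the total budget of 1.0 is exhausted
```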

Possible Middle Ground Guaranteed utility for certain applications: counting queries, classification, logistic regression. Guaranteed utility for certain kinds of data: use prior or domain knowledge about the data (sample size small vs. large; density dense vs. sparse; distribution characteristics; ...), or use intermediate (differentially private) results while exploring and incrementally learning about the data. How to optimally allocate the privacy budget? [Figure: raw data -> privacy mechanism -> released data -> user, with target applications, prior or domain knowledge, and intermediate results feeding into the mechanism]

Our Research: Adaptive Differentially Private Data Release Data knowledge: dense and "smooth" data; high dimensional and sparse data; dynamic data. Application knowledge: query workload; specific tasks.

Histogram Example Histogram release for counting queries, e.g. population screening for clinical trials using predicates representing inclusion and exclusion criteria.

Strategy I: Baseline Cell Partitioning Goal: release a differentially private histogram that supports arbitrary predicate queries, e.g. Q: select count() where Age = [20,30] and Income = 40K. The baseline answers one counting query per cell (Q1: count() where Age = 20, Diagnosis = A; Q2: count() where Age = 20, Diagnosis = B; ...) with privacy budget alpha. [Figure: the original Age x Diagnosis cell counts (50, 10, 90, 20) and the perturbed counts (50', 10', 90', 20') released after adding noise] If a query predicate covers multiple cells or partitions, its answer aggregates the perturbation error of all of them.
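A hedged sketch of this baseline: add Laplace noise of scale 1/alpha to every cell of the full-domain histogram, then answer a predicate by summing the noisy cells it covers. The toy counts below are illustrative.

```python
import numpy as np

def noisy_cell_histogram(hist, alpha):
    """Each record falls in exactly one cell, so per-cell Lap(1/alpha) noise gives alpha-DP."""
    return hist + np.random.laplace(scale=1.0 / alpha, size=hist.shape)

# Toy 2 x 2 histogram over Age bucket x Diagnosis {A, B}.
hist = np.array([[50., 10.],
                 [90., 20.]])
noisy = noisy_cell_histogram(hist, alpha=0.1)

# A predicate covering several cells (e.g. both Age buckets with Diagnosis = A)
# sums the corresponding noisy cells, so its error grows with the number of cells covered.
print(noisy[:, 0].sum())
```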

Strategy II: Hierarchical Partitioning [Figure: a hierarchy over the Age x Diagnosis histogram; the overall count (200'), the intermediate partition counts (60', 140'), and the leaf cell counts (50', 10', 90', 20') are each perturbed with privacy budget alpha/3] Large perturbation error due to the small divided privacy budget at each level.

DPCube Strategy: Two phase partitioning [Figure: the original Age x Diagnosis cell counts vs. the released partition histogram, where merged cells share one noisy count (e.g. 100')] If a query predicate is contained within a published partition, the answer has to be estimated, typically under a uniform distribution assumption; this introduces an approximation error.

DPCube Strategy: Two phase partitioning 1. Cell partitioning: release a differentially private cell histogram (50', 10', 90', ...). 2. Multi-dimensional partitioning: based on the noisy cell histogram, compute a multi-dimensional partitioning and release a differentially private partition histogram (e.g. 100' for a merged partition).

Partitioning Algorithm Define a uniformity (randomness) measure H(Dt) for a partition Dt, e.g. information gain or variance. Recursive algorithm Partition(Dt): for a given partition Dt, find the best splitting point (e.g. the one with the largest information gain), partition the data into Dt1 and Dt2, then recurse with Partition(Dt1) and Partition(Dt2).
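A sketch of this recursion on a one-dimensional noisy cell histogram, using variance as the uniformity measure (the lecture also mentions information gain); the stopping threshold and data are illustrative:

```python
import numpy as np

def partition(cells, lo, hi, threshold, result):
    """Recursively split [lo, hi); stop when the block is uniform enough (low variance)."""
    if hi - lo <= 1 or np.var(cells[lo:hi]) <= threshold:
        result.append((lo, hi))
        return
    # best split = the one minimizing the total within-partition variance
    best = min(range(lo + 1, hi),
               key=lambda s: np.var(cells[lo:s]) * (s - lo) + np.var(cells[s:hi]) * (hi - s))
    partition(cells, lo, best, threshold, result)
    partition(cells, best, hi, threshold, result)

cells = np.array([5., 6., 5., 40., 42., 41.])   # two roughly uniform regions
parts = []
partition(cells, 0, len(cells), threshold=4.0, result=parts)
print(parts)   # [(0, 3), (3, 6)]
```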

Privacy and Utility of the Released Histogram The released data satisfies alpha-differential privacy. It supports count queries as well as other OLAP queries and learning tasks. Formal utility results: (epsilon, delta)-usefulness. Experimental results for the partition histogram: CENSUS dataset, 1M tuples, 4 attributes with domain sizes Age (79), Education (14), Occupation (23), and Income (100). Report absolute error and relative error for random count queries.
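The slide does not spell out the usefulness guarantee; the standard (epsilon, delta)-usefulness formulation (in the style of Blum, Ligett and Roth), with D the original data, D-hat the released histogram, and Q ranging over the supported query class, is:

```latex
\[
\Pr\Bigl[\,\forall Q \in \mathcal{Q}:\; \bigl|Q(\hat{D}) - Q(D)\bigr| \le \epsilon\,\Bigr] \;\ge\; 1 - \delta .
\]
```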

DPCube Result Example [Figure panels: original histogram; differentially private cell histogram; differentially private partition histogram; differentially private estimated cell histogram]

Experimental Results: Comparison with other partitioning strategies Higher alpha (lower privacy) results in lower error (higher utility). The kd-tree based approach outperforms the others. Cell partitioning is comparable in absolute error but suffers in relative error due to the sparsity of the data.

High dimensional sparse data Many real-world datasets are high dimensional and sparse: web search log data, web transactions, etc. A direct application of the 2-phase approach has problems: the cell histogram is highly inaccurate, and the computation does not scale.

Top-down recursive partitioning Recursively partition the spaces that have sufficient density Use a context free taxonomy tree Dynamically allocate and keep track of the budget

Adaptive Hierarchical Strategy For sparse, high dimensional data, use a recursive top-down approach: (1a) obtain an overall noisy count; (1b) partition only the non-sparse regions; (2a) obtain noisy counts for the resulting partitions; (2b) partition their non-sparse regions again; ... (n) obtain the final partition counts. Only dense regions are partitioned further and receive additional counts; for dense regions, preserve more of the budget for further partitioning.

Today Statistical databases: definitions; early query restriction methods; output perturbation and differential privacy