Security Control Methods for Statistical Database Li Xiong CS573 Data Privacy and Security.

Slides:



Advertisements
Similar presentations
Operating System Security
Advertisements

Set 6, Statistical Database Security
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Fast Algorithms For Hierarchical Range Histogram Constructions
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
SDC for continuous variables under edit restrictions Natalie Shlomo & Ton de Waal UN/ECE Work Session on Statistical Data Editing, Bonn, September 2006.
Statistical database security Special purpose: used only for statistical computations. General purpose: used with normal queries (and updates) as well.
UTEPComputer Science Dept.1 University of Texas at El Paso Privacy in Statistical Databases Dr. Luc Longpré Computer Science Department Spring 2006.
Integrating Bayesian Networks and Simpson’s Paradox in Data Mining Alex Freitas University of Kent Ken McGarry University of Sunderland.
CSCE 715 Ankur Jain 11/16/2010. Introduction Design Goals Framework SDT Protocol Achievements of Goals Overhead of SDT Conclusion.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Transaction Management and Concurrency Control.
Security in Databases. 2 Srini & Nandita (CSE2500)DB Security Outline review of databases reliability & integrity protection of sensitive data protection.
Intrusion detection Anomaly detection models: compare a user’s normal behavior statistically to parameters of the current session, in order to find significant.
16.5 Introduction to Cost- based plan selection Amith KC Student Id: 109.
Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation Chris Giannella cgiannel AT acm DOT org.
Security in Databases. 2 Outline review of databases reliability & integrity protection of sensitive data protection against inference multi-level security.
1 Introduction Introduction to database systems Database Management Systems (DBMS) Type of Databases Database Design Database Design Considerations.
PRIVACY CRITERIA. Roadmap Privacy in Data mining Mobile privacy (k-e) – anonymity (c-k) – safety Privacy skyline.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide 1- 1.
Last time Finish OTR Database Security Introduction to Databases
Present by Napasakorn Sukjay Poom Samaharn
Lecture 2 The Relational Model. Objectives Terminology of relational model. How tables are used to represent data. Connection between mathematical relations.
Chapter 4 The Relational Model.
Chapter 6 – Database Security  Integrity for databases: record integrity, data correctness, update integrity  Security for databases: access control,
Chapter 3 Single-Table Queries
Database Security And Audit. Databasics Data is stored in form of files Record : is a one related group of data (in a row) Schema : logical structure.
Switch off your Mobiles Phones or Change Profile to Silent Mode.
Statistical Databases – Query Auditing Li Xiong CS573 Data Privacy and Anonymity Partial slides credit: Vitaly Shmatikov, Univ Texas at Austin.
1 Oracle Database 11g – Flashback Data Archive. 2 Data History and Retention Data retention and change control requirements are growing Regulatory oversight.
Chapter 6: Foundations of Business Intelligence - Databases and Information Management Dr. Andrew P. Ciganek, Ph.D.
Database Security DBMS Features Statistical Database Security.
CS573 Data Privacy and Security Statistical Databases
Sensitive Data  Data that should not be made public  What if some but not all of the elements of a DB are sensitive Inherently sensitiveInherently sensitive.
Introduction to: 1.  Goal[DEN83]:  Provide frequency, average, other statistics of persons  Challenge:  Preserving privacy[DEN83]  Interaction between.
Copyright © 2007 Pearson Education Canada 1 Chapter 13: Audit of the Sales and Collection Cycle: Tests of Controls.
Computer Security: Principles and Practice
Differentially Private Data Release for Data Mining Noman Mohammed*, Rui Chen*, Benjamin C. M. Fung*, Philip S. Yu + *Concordia University, Montreal, Canada.
CS573 Data Privacy and Security Anonymization methods Li Xiong.
Mining Multiple Private Databases Topk Queries Across Multiple Private Databases (2005) Li Xiong (Emory University) Subramanyam Chitti (GA Tech) Ling Liu.
Disclosure risk when responding to queries with deterministic guarantees Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.
1 Single Table Queries. 2 Objectives  SELECT, WHERE  AND / OR / NOT conditions  Computed columns  LIKE, IN, BETWEEN operators  ORDER BY, GROUP BY,
C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.
Chapter No 4 Query optimization and Data Integrity & Security.
Data Anonymization – Introduction and k-anonymity Li Xiong CS573 Data Privacy and Security.
Database Security Outline.. Introduction Security requirement Reliability and Integrity Sensitive data Inference Multilevel databases Multilevel security.
Data Perturbation An Inference Control Method for Database Security Dissertation Defense Bob Nielson Oct 23, 2009.
Inference Problem Privacy Preserving Data Mining.
Computer Science and Engineering Computer System Security CSE 5339/7339 Session 21 November 2, 2004.
Privacy vs. Utility Xintao Wu University of North Carolina at Charlotte Nov 10, 2008.
Protection of frequency tables – current work at Statistics Sweden Karin Andersson Ingegerd Jansson Karin Kraft Joint UNECE/Eurostat.
Differential Privacy Some contents are borrowed from Adam Smith’s slides.
1 WP 10 On Risk Definitions and a Neighbourhood Regression Model for Sample Disclosure Risk Estimation Natalie Shlomo Hebrew University Southampton University.
Academic Year 2014 Spring Academic Year 2014 Spring.
Probabilistic km-anonymity (Efficient Anonymization of Large Set-valued Datasets) Gergely Acs (INRIA) Jagdish Achara (INRIA)
Inference Problem. Access Control Policies Direct access Information flow Not addressed: indirect data access CSCE Farkas 2 Lecture 19.
Inference Problem Privacy Preserving Data Mining.
Security Methods for Statistical Databases. Introduction  Statistical Databases containing medical information are often used for research  Some of.
Towards Robustness in Query Auditing Shubha U. Nabar Stanford University VLDB 2006 Joint Work With B. Marthi, K. Kenthapadi, N. Mishra, R. Motwani.
Yang, et al. Differentially Private Data Publication and Analysis. Tutorial at SIGMOD’12 Part 4: Data Dependent Query Processing Methods Yin “David” Yang.
Big Data Analytics Are we at risk? Dr. Csilla Farkas Director Center for Information Assurance Engineering (CIAE) Department of Computer Science and Engineering.
Synthetic Approaches to Data Linkage Mark Elliot, University of Manchester Jerry Reiter Duke University Cathie Marsh Centre.
Output Perturbation with Query Relaxation By: XIAO Xiaokui and TAO Yufei Presenter: CUI Yingjie.
Access control techniques Once an organization decides upon the access control model it will implement(DAC,MAC, or RBAC), then it needs to look at the.
University of Texas at El Paso
Chapter 3 Database Management
Differential Privacy (1)
Database management systems
Security in Computing, Fifth Edition
Presentation transcript:

Security Control Methods for Statistical Database Li Xiong CS573 Data Privacy and Security

A statistical database is a database which provides statistics on subsets of records OLAP vs. OLTP Statistics may be performed to compute SUM, MEAN, MEDIAN, COUNT, MAX AND MIN of records Statistical Database

Types of Statistical Databases  Static – a static database is made once and never changes  Example: U.S. Census  Dynamic – changes continuously to reflect real-time data  Example: most online research databases

Types of Statistical Databases  Centralized – one database  Decentralized – multiple decentralized databases  General purpose – like census  Special purpose – like bank, hospital, academia, etc

Access Restriction  Databases normally have different access levels for different types of users  User ID and passwords are the most common methods for restricting access  In a medical database:  Doctors/Healthcare Representative – full access to information  Researchers – only access to partial information (e.g. aggregate information) Statistical database: allow query access only to aggregate data, not individual records

Accuracy vs. Confidentiality Accuracy – Researchers want to extract accurate and meaningful data Confidentiality – Patients, laws and database administrators want to maintain the privacy of patients and the confidentiality of their information

Exact compromise – a user is able to determine the exact value of a sensitive attribute of an individual Partial compromise – a user is able to obtain an estimator for a sensitive attribute with a bounded variance Positive compromise – determine an attribute has a particular value Negative compromise – determine an attribute does not have a particular value Relative compromise – determine the ranking of some confidential values Data Compromise

Security Methods  Query restriction  Data perturbation/anonymization  Output perturbation

Comparison Query restriction techniques cannot avoid inference, but they accurate response to valid queries. Data perturbation techniques can prevent inference, but they cannot consistently provide useful query results. Output perturbation has low storage and computational overhead, however, is subject to the inference (averaging effect) and inaccurate results.

Statistical database vs. data anonymization Data anonymization is one technique that can be used to build statistical database Data anonymiztion can be used to release data for other purposes such as mining Other techniques such as query restriction and output purterbation can be used to build statistical database

Evaluation Criteria Security – level of protection Statistical quality of information – data utility Cost Suitability to numerical and/or categorical attributes Suitability to multiple confidential attributes Suitability to dynamic statistical DBs

Security Exact compromise – a user is able to determine the exact value of a sensitive attribute of an individual Partial compromise – a user is able to obtain an estimator for a sensitive attribute with a bounded variance Statistical disclosure control – require a large number of queries to obtain a small variance of the estimator

Statistical Quality of Information Bias – difference between the unperturbed statistic and the expected value of its perturbed estimate Precision – variance of the estimators obtained by users Consistency – lack of contradictions and paradoxes Contradictions: different responses to same query; average differs from sum/count Paradox: negative count

Cost Implementation cost Processing overhead Amount of education required to enable users to understand the method and make effective use of the SDB

Security Methods  Query set restriction  Query size control  Query set overlap control  Query auditing  Data perturbation/anonymization  Output perturbation

Query Set Size Control  A query-set size control limit the number of records that must be in the result set  Allows the query results to be displayed only if the size of the query set |C| satisfies the condition K <= |C| <= L – K where L is the size of the database and K is a parameter that satisfies 0 <= K <= L/2

Query Set Size Control

Tracker Q1: Count ( Sex = Female ) = A Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B If B = A+1 Q3: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) & Diagnosis = Schizophrenia) Positively or negatively compromised!

Query set size control With query set size control the database can be easily compromised within a frame of 4-5 queries For query set control, if the threshold value k is large, then it will restrict too many queries And still does not guarantee protection from compromise

Basic idea: successive queries must be checked against the number of common records. If the number of common records in any query exceeds a given threshold, the requested statistic is not released. A query q(C) is only allowed if: |X (C) X (D) | ≤ r, r > 0 Where α is set by the administrator Number of queries needed for a compromise has a lower bound 1 + (K-1)/r Query Set Overlap Control

Query-set-overlap control Ineffective for cooperation of several users Statistics for a set and its subset cannot be released – limiting usefulness Need to keep user profile High processing overhead – every new query compared with all previous ones

Auditing Keeping up-to-date logs of all queries made by each user and check for possible compromise when a new query is issued Excessive computation and storage requirements “Efficient” methods for special types of queries

Audit Expert (Chin 1982) Query auditing method for SUM queries A SUM query can be considered as a linear equation where is whether record i belongs to the query set, xi is the sensitive value, and q is the query result A set of SUM queries can be thought of as a system of linear equations Maintains the binary matrix representing linearly independent queries and update it when a new query is issued A row with all 0s except for ith column indicates disclosure

Audit Expert Only stores linearly independent queries Not all queries are linearly independent Q1: Sum(Sex=M) Q2: Sum(Sex=M AND Age>20) Q3: Sum(Sex=M AND Age<=20)

Audit Expert O(L 2 ) time complexity Further work reduced to O(L) time and space when number of queries < L Only for SUM queries No restrictions on query set size Maximizing non-confidential information is NP-complete

Auditing – recent developments Online auditing “Detect and deny” queries that violate privacy requirement Denial themselves may implicitly disclose sensitive information Offline auditing Check if a privacy requirement has been violated after the queries have been executed Not to prevent

Security Methods  Query set restriction  Data perturbation/anonymization  Partitioning  Cell suppression  Microaggregation  Data perturbation  Output perturbation

Partitioning Cluster individual entities into mutually exclusive subsets, called atomic populations The statistics of these atomic populations constitute the materials

Microaggregation

Data Perturbation

Security Methods  Query set restriction  Data perturbation/anonymization  Output perturbation Sampling Varying output perturbation Rounding

Output Perturbation  Instead of the raw data being transformed as in Data Perturbation, only the output or query results are perturbed  The bias problem is less severe than with data perturbation

Output Perturbation Query Results

Random Sampling  Only a sample of the query set (records meeting the requirements of the query) are used to compute and estimate the statistics  Must maintain consistency by giving exact same results to the same query  Weakness - Logical equivalent queries can result in a different query set – consistency issue

Varying output perturbation Apply perturbation on the query set Less bias than data perturbation

Some ComparisonsMethodSecurity Richness of Information Costs Query-set-size control Low Low 1 Low Query-set-overlap control LowLowHigh AuditingModerate-LowModerateHigh PartitioningModerate-lowModerate-lowLow MicroaggregationModerateModerateModerate Data Perturbation HighHigh-ModerateLow Varying Output Perturbation ModerateModerate-lowLow SamplingModerateModerate-LowModerate 1 Quality is low because a lot of information can be eliminated if the query does not meet the requirements

Sources  Partial slides:  Adam, Nabil R. ; Wortmann, John C.; Security- Control Methods for Statistical Databases: A Comparative Study; ACM Computing Surveys, Vol. 21, No. 4, December 1989  Fung et al. Privacy Preserving Data Publishing: A Survey of Recent Development, ACM Computing Surveys, in press, 2009