Disclosure risk when responding to queries with deterministic guarantees Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.



Query/Response Systems Renewed interest in query/response systems due to easy communication facilities Fits nicely in a remote access environment

Input versus Output Perturbation
– Input perturbation: the original data is modified, and all responses to queries are based on the modified data.
– Output perturbation: the response is computed using the original data and modified prior to release.
– Advantages of output perturbation: easier to implement; data updates are easy; fits nicely in a remote access environment.

Analytical Validity One key component of providing responses to queries is to assure the user that the response is meaningful. For ad hoc queries, it may be difficult to provide a priori assurances regarding analytical validity. Solution: interval responses.

Interval Response As the name implies, the response to every query is provided in the form of an interval instead of a single value Allows the users to directly assess the analytical accuracy of the response – For a given query, response (1000 – 2000) is much less accurate than the response ( ) The true value is guaranteed to be in the response interval

Deterministic Methods Determinism is often visualized only in terms of the masking method employed – Perturbed value = a + (b × True value), where a and b are constants. Knowledge of two true values is adequate to compromise the entire database. Providing the guarantee that the interval response will contain the true value is also deterministic.
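The claim that two known true values compromise a linearly masked database can be illustrated with a minimal sketch. The constants, the data, and the function names `recover_constants` and `unmask` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def recover_constants(t1, p1, t2, p2):
    """Solve p = a + b * t for (a, b) from two known (true, perturbed) pairs."""
    b = Fraction(p1 - p2, t1 - t2)
    a = p1 - b * t1
    return a, b

def unmask(perturbed, a, b):
    """Invert the masking for every released value."""
    return [(Fraction(p) - a) / b for p in perturbed]

# Hypothetical confidential values, masked with a = 7 and b = 3.
true_values = [55, 30, 15, 80]
masked = [7 + 3 * t for t in true_values]

# The intruder knows the true values of the first two records only.
a_hat, b_hat = recover_constants(true_values[0], masked[0],
                                 true_values[1], masked[1])
recovered = unmask(masked, a_hat, b_hat)
```

Exact rational arithmetic is used so that the recovered values match the originals without rounding error.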

Determinism versus Disclosure It is well known that data masking techniques that are purely deterministic are subject to complete, exact disclosure of the confidential values But what if the determinism occurs in terms of the response? Are methods which provide deterministic guarantees regarding the response subject to the same type of complete, exact disclosure?

Confidentiality via Camouflage (CVC) A procedure for providing interval responses to queries – Can be implemented for both binary and numerical data – Intervals computed using this procedure are guaranteed to contain the true response

CVC for Binary Data
Procedure:
– a represents a column of binary values (of length n) holding the confidential attribute
– Specify k (≥ 3)
– Let V = (V_1, V_2, …, V_k) represent k column vectors, also of length n
– Set V_i = a
– For each row in V: randomly select a column j (j ≠ i) and set v_j = (1 – a); set all other values randomly to 0 or 1
– For any query, select the appropriate rows in V and compute the value for each of the k vectors; Response = minimum and maximum of the computed values
Since V_i = a, the true response is guaranteed to be in the interval.
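The procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `build_camouflage` and `respond`, the seed, and the sample vector are assumptions, and queries are limited to counts over a chosen set of rows.

```python
import random

def build_camouflage(a, k=3, seed=0):
    """Build an n x k binary matrix in which one column equals a."""
    rng = random.Random(seed)
    true_col = rng.randrange(k)            # the column i with V_i = a
    V = []
    for value in a:
        row = [rng.randint(0, 1) for _ in range(k)]   # random filler values
        row[true_col] = value
        # One other column always holds the complement, so every row
        # contains at least one 0 and one 1.
        flip_col = rng.choice([j for j in range(k) if j != true_col])
        row[flip_col] = 1 - value
        V.append(row)
    return V, true_col

def respond(V, rows):
    """Interval response to a count query over the selected rows."""
    counts = [sum(V[r][j] for r in rows) for j in range(len(V[0]))]
    return min(counts), max(counts)

# Hypothetical confidential attribute.
a = [0, 1, 1, 0, 1, 0, 0, 1]
V, true_col = build_camouflage(a)
lo, hi = respond(V, [0, 2, 4])
```

Because one column of V is a itself, the true count for any query always lies between the minimum and maximum column counts.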

Example Every row consists of at least one “0” and one “1”. Every confidential value is “represented” by the interval (0, 1). A simple example is shown on the right – n = 14 – k = 3 – V_3 = a. Data is the same as that used by Garfinkel et al. (2002) in their paper.

Is CVC Deterministic? At first glance, CVC is not deterministic – Garfinkel, Gopal, Goes (2002, page 755). There is clearly a deterministic component, since V_3 = a. This deterministic component is necessary in order to satisfy the guarantee that every interval response will contain the true value.

Responding to Queries

Query Based Attack
Reconstructing V using brute force search:
– Select a small subset of the data of size m such that 2^m is within computational capability.
– Issue every possible query involving these records and store the corresponding responses. This results in a total of (2^m – 1) queries and responses.
– Evaluate all (2^m) possible combinations of values for a and identify the candidate solutions for a that satisfy all responses from the previous step.
For the given data set, m = 14 is within computational capability. Perform the search.
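The brute force search above can be sketched on a small example. The matrix V below is hypothetical (m = 6, k = 3) and is not the data set from the slides; the intruder is assumed to see only the interval responses.

```python
from itertools import combinations, product

# Hypothetical camouflage matrix: 6 records (rows), 3 vectors (columns).
# Every row contains at least one 0 and one 1.
V = [(0, 1, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0), (0, 1, 1)]
m, k = len(V), len(V[0])

def respond(rows):
    """Interval response to a count query over the selected rows."""
    counts = [sum(V[r][j] for r in rows) for j in range(k)]
    return min(counts), max(counts)

# Step 1: issue every non-empty query over the m records (2^m - 1 queries).
queries = [q for size in range(1, m + 1)
           for q in combinations(range(m), size)]
responses = {q: respond(q) for q in queries}

# Step 2: keep every 0/1 assignment consistent with all stored responses.
candidates = [
    c for c in product((0, 1), repeat=m)
    if all(lo <= sum(c[i] for i in q) <= hi
           for q, (lo, hi) in responses.items())
]
```

Every column of V necessarily survives the filter, so the true vector is always among the candidates; how many other vectors survive depends on the data.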

Search Result The search reproduces V (Candidate vector 1 = V_3 = a, Candidate vector 2 = V_1, and Candidate vector 3 = V_2)

But is it disclosure? Every record still has the interval (0, 1); so is it disclosure? Suppose the intruder knows a_2 = 0; the true value vector is then immediately identified as candidate vector 1. Knowledge of one (or at most two) records results in complete, exact disclosure.

What if … We increase k? – Small increases in k have no impact on the reconstruction of V – In order to prevent reconstruction of V, it is necessary that k is close to 2^m – Increasing k also reduces the analytical validity since the interval is larger – Increasing k also increases storage and computational requirements

Computational Complexity Note that the search procedure is computationally feasible even if n is very large. Since compromising m records is possible, we would then incrementally compromise the records in subsets of m. Once a subset of m records is revealed, the intruder can also compromise the remaining data using simple queries.

Disclosure via Simple Queries All records can be progressively compromised Any response which is not of the form (0, cardinality) results in disclosure. But the response (0, cardinality) is useless for analytical purposes!

Insider Threat Protection CVC suggests an insider threat protection scheme which involves subtracting 1 (2) from the lower limit and adding 2 (1) to the upper limit But this insider threat protection is easily defeated by the intruder by – Either adjusting the responses – Or by using a base set and issuing queries incrementally using this base set to eliminate the “noise”

Summary In order to ensure that the true value is always contained in the response interval, it is necessary that V_j = a for some j – Using simple search, it is possible to reconstruct V, unless k is very large, which creates other problems – Even if the search procedure fails, it is possible to compromise the data using responses to simple queries. Hence, if the CVC method is implemented to protect binary data, the true confidential value vector a is subject to complete, exact disclosure.

CVC for Numerical Data
The confidential value vector a is now hidden among k vectors in P. P does not contain the true value vector a. For any given record i:
– Σ_j (ϒ_j × P_ji) = a_i; for example, (0.2 × 60) + (0.3 × 53) + (0.5 × 54.2) = 55
– 0 ≤ ϒ_j ≤ 1
– Σ_j ϒ_j = 1
Data is the same as that used by Gopal et al. (2002) in their paper.

Responses to Queries For simple sum and difference queries, the response is computed exactly as with the binary CVC method For more complex queries, it is necessary to solve a system of equations (linear or non-linear depending on the query) to compute the interval response For more details see Gopal, Garfinkel, and Goes (2002) We limit our discussion to sum and difference queries

Deterministic Component For numerical CVC, the true confidential value vector a is not a part of P. However, the deterministic component of numerical CVC lies in the fact that Σ_j (ϒ_j × P_ji) = a_i. Does this deterministic component lead to disclosure?

Computational Complexity
We assume that the intruder knows that the true confidential values are integers. Ignore the last record since it is not protected. The intruder issues queries relating to individual records and receives responses.
– These responses provide the respective upper and lower bounds for individual records
– 53 ≤ a_1 ≤ 60; 29 ≤ a_2 ≤ 32; …; 91 ≤ a_13 ≤ 100
A total of 2,903,040,000 potential candidate solutions.

Modified Search Procedure
Select a subset of the data (m = 5):
– Identify candidate solutions
– One of these candidate solutions must be the true solution
Incrementally add one more observation:
– The number of candidate solutions to be evaluated equals (number of candidate solutions from the previous step × number of possible integer values for the current observation)
Repeat for all observations and identify candidate solutions.
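The incremental idea can be sketched as follows, restricted to sum queries. The matrix P and the vector a are hypothetical integers chosen so that a is a convex combination of the columns of P with weights (0.2, 0.3, 0.5); as a simplification, the sketch starts from a single record rather than from a subset of five.

```python
from itertools import combinations

# Hypothetical camouflage matrix P (4 records, k = 3) and true values a.
P = [(60, 50, 56), (30, 40, 24), (10, 20, 14), (80, 70, 86)]
a = (55, 30, 15, 80)

def respond(rows):
    """Interval response to a sum query over the selected records."""
    sums = [sum(P[i][j] for i in rows) for j in range(len(P[0]))]
    return min(sums), max(sums)

def incremental_search(n):
    """Extend integer candidate vectors one record at a time."""
    candidates = [()]
    for t in range(n):
        lo, hi = respond((t,))             # bounds for record t alone
        # Every query made of record t plus any subset of earlier records.
        queries = [s + (t,) for size in range(t + 1)
                   for s in combinations(range(t), size)]
        responses = {q: respond(q) for q in queries}
        extended = []
        for c in candidates:
            for v in range(lo, hi + 1):    # possible integer values
                cand = c + (v,)
                if all(responses[q][0] <= sum(cand[i] for i in q)
                       <= responses[q][1] for q in queries):
                    extended.append(cand)
        candidates = extended
    return candidates

solutions = incremental_search(len(P))
```

Filtering after each added record keeps the candidate list far smaller than the full product of the individual ranges, which is the point of the incremental procedure.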

Result of Search Procedure Only three candidate solutions – One of these candidate solutions must be the true solution. Assume the intruder knows the true value a_1 = 55. The true value vector is immediately identified as Candidate solution 3, resulting in complete, exact disclosure.

Compromise for Large Data Sets As with binary data, we can avoid the computational complexity by selecting small subsets However, for numerical CVC, knowledge of (k – 1) true values is adequate to compromise the entire data set since we can now solve a system of k equations and k unknowns resulting in knowledge of ϒ. With ϒ known, it is simple arithmetic to compute a

Assume that a_1 and a_2 are known. Reconstruct P using the above responses.

INFEASIBLE

Once P has been reconstructed, it is a simple matter of solving a set of equations to solve for ϒ. With this information, the remaining values can be compromised by issuing simple queries.
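A minimal sketch of this last step for k = 3, assuming P has already been reconstructed and two true values are known. The numbers are hypothetical, and `solve` is a small illustrative Gauss-Jordan routine over exact fractions, not part of any CVC implementation.

```python
from fractions import Fraction

def solve(A, b):
    """Solve the square linear system A x = b exactly (Gauss-Jordan)."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [x / pv for x in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[-1] for row in M]

# Hypothetical reconstructed P (rows = records) and two known true values.
P = [(60, 50, 56), (30, 40, 24), (10, 20, 14), (80, 70, 86)]
known = {0: 55, 1: 30}

# k - 1 equations from the known records plus the constraint sum(weights) = 1.
A = [list(P[i]) for i in known] + [[1, 1, 1]]
b = list(known.values()) + [1]
weights = solve(A, b)

# With the weights known, every remaining value follows by simple arithmetic.
recovered = [sum(w * p for w, p in zip(weights, row)) for row in P]
```

In this example the system has a unique solution, so the weights and every confidential value are recovered exactly.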

Conclusions Based on the “traditional definition” of deterministic, CVC would not be classified as a deterministic procedure. Deterministic guarantees always require that the masking approach have a deterministic component. Any masking approach with a deterministic component is susceptible to complete, exact disclosure with knowledge of just a few true confidential values. Remote access centers that contemplate the use of output perturbation approaches for answering ad hoc queries should consider the disclosure issue very carefully.

Takeaway
1. The definition of “deterministic procedures” should be expanded to include any procedure that attempts to provide deterministic guarantees regarding responses to ad hoc queries.
2. Just as procedures traditionally classified as deterministic are subject to complete, exact disclosure with knowledge of a few values, procedures that offer deterministic guarantees are also subject to the same disclosure.