Stochastic Protection of Confidential Information in SDB: A hybrid of Query Restriction and Data Perturbation ( to appear in Operations Research) Manuel.


Stochastic Protection of Confidential Information in SDB: A hybrid of Query Restriction and Data Perturbation (to appear in Operations Research)
Manuel Nunez, Robert Garfinkel, and Ram Gopal
JHU, October 5, 2006

Motivation
Two general goals for a statistical database:
  Protect confidential records
  Provide useful information
The goals are often in conflict, so there is a tradeoff
Problems faced by Census Bureaus, etc.

An Example
Record  Name        Job      Age  Co.  Salary
1       Robinson    Manager  27   A    55
2       Reese       Trainee  42   B    31
3       Furillo     Manager  63   C    107
4       Campanella  Trainee  28   B
5       Cox         Manager  55   B    63
6       Snider      Manager  57   A    82
7       Koufax      Trainee  21   D    29
8       Newcombe    Trainee  32   C    31
9       Hodges      Manager  35   D    60
10      Branca      Trainee  36   D    27
11      Loes        Manager  47   B    37
12      Roe         Trainee  28   D    42
13      Reiser      Manager  64   A    94
14      Gilliam     Manager  46   C    51

Types of Protection
Random data perturbation
  Add noise to data, answer all queries
Query restriction/inference control
  Provide exact answers to some queries, but refuse to answer others
  Keep track of answered queries (auditing)
Camouflage/interval methods
  Provide interval answers to queries
  Answer all queries

Exact Disclosure
DB is a = [2, 5, 8]; two target SUM queries:
  q_1 = [1, 1, 1], answer is 15
  q_2 = [1, 1, 0], answer is 7
User can solve the system a_1 + a_2 + a_3 = 15, a_1 + a_2 = 7
and learn that a_3 = 8

Another Perspective
Notice that a linear combination of q_1 and q_2 yields the canonical vector e_3 = [0, 0, 1]
  Namely, q_1 − q_2 = e_3
In general, a group of linear queries is “exactly safe” if none of the canonical vectors can be expressed as a linear combination of the queries in the group
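
The "exactly safe" condition can be checked mechanically: a canonical vector e_i lies in the span of the queries exactly when appending it to the query set leaves the rank unchanged. The following is a minimal sketch of that test, not the authors' implementation; the function names `rank` and `disclosed_subjects` are made up for this illustration, and subjects are indexed from 0.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a list of row vectors, using exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    r = 0  # number of pivot rows found so far
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c] / m[r][c]
                m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
    return r


def disclosed_subjects(queries, n):
    """0-based indices i whose canonical vector e_i is a linear
    combination of the given queries (hypothetical helper)."""
    base = rank(queries)
    found = []
    for i in range(n):
        e_i = [1 if j == i else 0 for j in range(n)]
        if rank(queries + [e_i]) == base:
            found.append(i)
    return found


q1, q2 = [1, 1, 1], [1, 1, 0]
print(disclosed_subjects([q1, q2], 3))  # [2]: the third subject is disclosed
print(disclosed_subjects([q1], 3))      # []: q1 alone is exactly safe
```

Exact rational arithmetic is used here so that the rank test is not affected by floating-point round-off.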

Degrees of Disclosure
Exact disclosure
  Protected if the user is unable to learn the exact confidential value of any subject
Interval disclosure
  Protected if the user is unable to learn that a confidential value lies within an interval pre-specified by the subject
Stochastic disclosure
  Protected if the user is unable to stochastically estimate a confidential value with high probability

Previous Work on Query Restriction: J.O.C.
Interval disclosure
SUM and MIN (MAX) queries
Determine a heuristic restriction of the polytope that describes the user’s knowledge
  Use it to decide whether to answer the next query
  Would it result in a “safe” polytope?
Collusion, but not auditing, is a problem
“Success” is a function of the number of answered queries

QR continued Queries arrive online Decisions are made without knowing what comes next If all queries were known, finding a maximum cardinality set to answer is NP-Hard (Chin & Ozsoyoglu).

Previous work on Camouflage (CVC): Operations Research
Hide the confidential vector in the interior of a “safe” polytope Π
Answer all queries q = f(x) with the interval [min f(x), max f(x)], where the min and max are taken over x ∈ Π
Answers are deterministically correct
Depending on the query type, finding polynomial, minimum access algorithms is not trivial!

CVC Continued A set of linear queries can be predetermined to safely yield exact answers via a network flow formulation Collusion is not a problem. The same cannot be said of “insider information”

Current extensions of CVC Dealing with insider threats “Data” vs. “Process” Finding the best (smallest) camouflaging set based on the threat type and level Is it necessarily a polytope?

Hybrid of Query Restriction and Data Perturbation Provide an algorithm to determine which queries (from a given set) can be exactly answered without compromising confidentiality (safe subset) Provide a protection mechanism to answer all other queries Maintain consistency of exact answers and protected answers

Our Approach
Given a target set of weighted queries, follow a 3-phase process:
1. Find the maximum weight query subset that can be safely answered exactly
2. For other target queries, answer safe approximate queries exactly (optional)
3. Answer all other non-target queries using a consistent perturbed DB

Importance of Consistency
Suppose q is answered exactly.
In the absence of consistency, a user who wants to determine a_i can ask a series of queries q′ = q + e_i to get a set of i.i.d. estimates of a_i.
As the number of such queries grows, the error in the resulting estimate of a_i goes to zero.
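
The averaging attack described here can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "inconsistent" mechanism is modeled as adding fresh Gaussian noise to every record each time a query is asked, and the function name `averaging_attack`, the noise level, and the example data are all hypothetical.

```python
import random


def averaging_attack(a, i, q, n_queries, noise_sd, seed=0):
    """Estimate a[i] by repeatedly asking q' = q + e_i against an
    inconsistent perturbation mechanism (hypothetical model)."""
    rng = random.Random(seed)
    # q is answered exactly, so the user knows q . a
    exact_q = sum(qj * aj for qj, aj in zip(q, a))
    q_prime = [qj + (1 if j == i else 0) for j, qj in enumerate(q)]
    estimates = []
    for _ in range(n_queries):
        # Assumed inconsistent mechanism: new noise for every query
        noisy = [aj + rng.gauss(0, noise_sd) for aj in a]
        answer = sum(qj * xj for qj, xj in zip(q_prime, noisy))
        # answer - exact_q equals a[i] plus zero-mean noise
        estimates.append(answer - exact_q)
    return sum(estimates) / len(estimates)


a = [2, 5, 8]   # confidential values from the exact disclosure example
q = [1, 1, 0]   # query assumed to be answered exactly
for n_queries in (10, 1000, 100000):
    print(n_queries, averaging_attack(a, 2, q, n_queries, noise_sd=5.0))
```

Because the noise terms are independent across queries, the average concentrates around a_i as the number of queries grows. A consistent perturbed DB fixes the noise once, so repeating the query yields no new information.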

What is Given to the User?
Guaranteed exact answers to safe target queries
  Public answers imply no threat from user collusion
Approximate answers to unsafe target queries
  This way, we ensure some degree of information for all target queries
Access to a perturbed DB for all other non-target queries

Model Assumptions
DB has n subjects
Only one confidential field: a ∈ R^n (could be a stacking of any number of such fields)
Every subject is identifiable by the record index
Set of subject indexes: N, |N| = n
Queries have nonnegative weights

Phase 1: Query Restriction
Set of target queries: T
Query weights: w
Index set to queries in T: M, |M| = m
Sum of weights for a subset K of M: $W(K) = \sum_{j \in K} w_j$

Phase 1 Optimization Problem Problem OPT: Where F is a family of “safe” subsets But before defining a safe set, let’s talk about matroids …

Matroids Modeling theory founded by H. Whitney, 1935 Many applications in combinatorial optimization: Maximal spanning tree Matroid intersection Maximal partition/matching Etc

Quick Definition
A matroid is a pair (M, F):
  M is a finite set, F is a family of subsets of M
  Elements of F are called “independent” sets
Two properties:
  If K is in F, then all subsets of K are in F
  If K and L are in F and |K| = |L| + 1, then some element of K \ L can be added to L to create a new independent set
Rank of K, r(K), is the cardinality of the largest independent set in K

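Example: MST
All sub-trees (more generally, all acyclic sets of links) are independent sets
The matroid is the collection of these link sets
The rank of a subgraph is the number of links of the largest acyclic link set in the subgraph (a spanning tree, if the subgraph is connected)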

Example: Sets of L.I. vectors Find a linear basis from a matrix The matroid consists of subsets of linearly independent columns A basis is an independent set of maximum cardinality Rank of a submatrix is the column-rank of the submatrix

Non example Consider an Assignment Problem A set of cells is independent if no row or column appears more than once. Seems to be almost a matroid but it’s not!

Main Matroid Result Given a set of non-negative weights assigned to the elements of M If (M, F) is a matroid, then the Greedy algorithm will find an independent set (i.e. a set in F) that maximizes the sum of the weights
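
The greedy algorithm referred to here can be written against any independence oracle. The sketch below is an illustration rather than the presenters' code: it sorts elements by decreasing weight and keeps each one that preserves independence, and it is instantiated with the linear-independence matroid from the earlier example. The names `greedy_max_weight`, `vectors`, and `weights`, and the sample data, are assumptions made for this example.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a list of vectors, using exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c] / m[r][c]
                m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
    return r


def greedy_max_weight(elements, weight, is_independent):
    """Greedy algorithm: scan elements by decreasing weight and keep
    each one whose addition leaves the chosen set independent."""
    chosen = []
    for e in sorted(elements, key=weight, reverse=True):
        if is_independent(chosen + [e]):
            chosen.append(e)
    return chosen


# Hypothetical data: four labeled vectors in the plane with weights.
vectors = {"a": (1, 0), "b": (2, 0), "c": (0, 1), "d": (1, 1)}
weights = {"a": 3, "b": 5, "c": 1, "d": 4}


def independent(labels):
    """Independence oracle for the linear matroid on `vectors`."""
    return rank([vectors[x] for x in labels]) == len(labels)


result = greedy_max_weight(vectors, weights.get, independent)
print(result)  # ['b', 'd'], total weight 9
```

Only the oracle changes from one matroid to another; replacing `independent` with an acyclicity test on graph links would give Kruskal-style spanning tree selection.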

Matroid Intersection
Given k matroids (M, F_1), …, (M, F_k) and weights for the elements of M, the goal is to find a common independent set that maximizes the sum of the weights
Problem: the intersection of matroids is not a matroid
For general k, the problem is NP-Hard
Yet, a modified greedy algorithm works for the intersection of 2 matroids

Matroid and Inference
Given target query set T, let M be the indexes to the queries
A subset K of M is safe w.r.t. subject i if the user cannot learn subject i’s confidential record using linear combinations of queries with index set K
Let F_i be the safe subsets of M w.r.t. subject i
Then, (M, F_i) is a matroid!
A safe set is safe w.r.t. all subjects, that is, it is in the matroid intersection

Examples of Safe Sets Four target queries:

Independent (Safe) Sets

Rank Evaluation

Approximate Solutions to OPT
Matroid intersection greedy (MIG) algorithm:
  Start with the full index set M_1 = M
  At iteration t+1, remove one index from M_t to create set M_{t+1}
  Remove the index that minimizes the ratios:
  Stop when M_t becomes a safe set

More About MIG
Denominator Δf_j roughly counts in how many additional matroids the set M_{t+1} will become safe
In other words, the best index to remove is one whose weight is low and whose removal makes the set M_{t+1} safe for many matroids
MIG will finish in no more than m iterations, and each iteration can be done in O(m^3 n^2) operations

Approximation Error
Set obtained from MIG: K, where M \ K is safe
Z is the optimal value of OPT
Nemhauser + Wolsey bounds:
H(d) is the harmonic number: $H(d) = \sum_{i=1}^{d} 1/i$

Example K = {2, 3}, M\K = {1, 4} K* = {2, 3, 4}, W(K*) = 40 Bounds: 20 < Z < 40.4

Phase 2: Additional Safe Answers
Set S is the chosen set of exact answer queries
What to do about a query q in T\S?
  Answer a query “close to” q
  Order queries in T\S according to weight
For instance, if q is a sum query, answer a safe query with smaller query size
Or, answer the closest query to q that is a linear combination of the queries in S

Phase 3: Constrained Perturbation Goal: Answer all queries with perturbed data a +  a making sure that answers are consistent with target queries Two almost equivalent methods: Perturb and project onto query hyperplane Perturb on the hyperplane direction
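
One way to read "perturb and project" is as an orthogonal projection: draw a random perturbation, then remove its components along the exactly answered queries, so that every exact answer is unchanged on the perturbed data. The sketch below follows that reading. It is an assumption about the method, not a reproduction of it, and the names `project_out` and `perturb_and_project`, the Gaussian noise, and the example data are hypothetical.

```python
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = [float(x) for x in v]
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(delta, queries):
    """Remove from delta its component in the span of the queries."""
    d = [float(x) for x in delta]
    for b in orthonormal_basis(queries):
        c = dot(d, b)
        d = [di - c * bi for di, bi in zip(d, b)]
    return d


def perturb_and_project(a, queries, noise_sd, seed=0):
    """Perturbed copy of a that leaves every exact query answer intact."""
    rng = random.Random(seed)
    delta = [rng.gauss(0, noise_sd) for _ in a]  # assumed noise model
    delta = project_out(delta, queries)
    return [ai + di for ai, di in zip(a, delta)]


a = [2.0, 5.0, 8.0, 4.0]                      # hypothetical confidential data
exact_queries = [[1, 1, 0, 0], [0, 0, 1, 1]]  # assumed exact-answer set
x = perturb_and_project(a, exact_queries, noise_sd=3.0)
for q in exact_queries:
    print(dot(q, a), round(dot(q, x), 9))     # the two values in each pair agree
```

Every other linear query is then answered from the single perturbed vector `x`, which is what keeps the exact and perturbed answers mutually consistent.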

Perturb & Project

Directional Perturbation

Extending Protection What to do to provide interval protection? What to do to provide stochastic protection from exact answers and from the perturbation?

Program G3LP
Let Q be a matrix whose columns are the exact answer queries
Consider linear program G3LP, for each i ∈ N:

Interval Disclosure
If z_i* = u_i* − l_i* is optimal to G3LP, then the user will know
Interval disclosure occurs when
Where  is chosen by subject i

Stochastic Disclosure
Let X_i be a random estimation of a_i
Let l and u be known bounds on a_i
For  and  > 0, a_i is protected if
That is, a_i cannot be randomly estimated in any interval of range  or smaller with probability  or higher

Protection against stochastic threat from deterministic answers Before perturbation phase, systematically remove queries from exact answer set until the following condition holds for all subjects

continued
The problem of choosing which queries to remove is also hard.
A greedy heuristic gives similar bounds to those of Phase 1.

Stochastic threat from Perturbation
Based on the perturbation, confidence intervals on a_i can be obtained from Chebyshev’s inequality.
Solution is to generate a sequence of i.i.d. perturbations until a safe one is found.
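
Chebyshev's inequality states that a random variable falls within k standard deviations of its mean with probability at least 1 − 1/k². The sketch below shows how a user could turn an observed perturbed value into such an interval. It assumes the noise has zero mean and a standard deviation known to the user; the function name `chebyshev_interval` and the numbers are hypothetical.

```python
import math


def chebyshev_interval(observed, sd, confidence):
    """Interval that contains the true value with probability at least
    `confidence`, assuming zero-mean noise with standard deviation sd."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Chebyshev: P(|X - mean| >= k * sd) <= 1 / k**2
    k = 1 / math.sqrt(1 - confidence)
    return observed - k * sd, observed + k * sd


# Observed perturbed value 60, noise sd 5, 75% confidence gives k = 2.
print(chebyshev_interval(60, 5, 0.75))  # (50.0, 70.0)
```

The bound holds for any noise distribution with finite variance, which is why it is loose compared with intervals based on a known distribution.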

Numerical results Results are very encouraging. Large numbers of queries answered exactly Development of a test bank was difficult because of the problem of finding optimal solutions. A class of interesting problems was found for which those solutions were easily determined.