Secure Data Outsourcing. Outline  Motivation  Background  Research issues  Summary.

Secure Data Outsourcing

Outline  Motivation  Background  Research issues  Summary

Motivation  Cost of maintaining/mining large data is 4-5 times the cost of data acquisition; DBAs are paid well  More and more data service providers Low cost: cloud computing  Maintain one database for one user -> multiple users Examples:  Alentus.com  Datapipe.com  Discountasp.net  …  Concerns about data security and privacy: untrusted service provider

Untrusted service provider  Lazy: incentives to perform less  Curious: incentives to acquire information  Malicious: denial of service, incorrect results  Possibly compromised

Challenges  Data confidentiality Data need to be encrypted (?) Utility of protected data?  Query utility  Mining utility  Access pattern privacy  Integrity Data integrity Query integrity  Correct  Complete  Fresh

Why is it hard for query services?  Arbitrary expressivity of SQL statements Often restricted to certain query types for simplicity (e.g. range query, kNN query)  Cost: communication, computation (server side vs client side)

Why is it hard for mining services?  Many data mining models Different utilities to preserve No one-size-fits-all solution

Data confidentiality  Bucketization method (crypto-index)  Order preserving encryption  Perturbations

Bucketization method  Hacigumus (SIGMOD02)

 Main steps: (1) Partition sensitive attributes  Order preserving: supports comparison  Random: query rewriting becomes hard (2) Build index on the partitions (3) Rewrite queries to target partitions  'john doe' -> 105 -> Select * from T' where name=105 (4) Execute queries and return results (5) Prune/post-process results on client

 Trade-off between confidentiality and overhead: larger partition -> increased privacy -> increased overhead
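
The main steps above can be sketched as follows. This is a minimal illustration, not the construction from the cited paper: the attribute, the bucket boundaries `BOUNDS`, the opaque ids `BUCKET_IDS`, and all function names are hypothetical, and the "cipher" field is a placeholder where a real system would store an encrypted tuple.

```python
import bisect

# Hypothetical partition of a numeric attribute; the client keeps it secret.
BOUNDS = [0, 1000, 2000, 3000, 4000]   # bucket i covers [BOUNDS[i], BOUNDS[i+1])
BUCKET_IDS = [105, 17, 88, 42]         # opaque ids, the only index the server sees


def bucket_id(value):
    """Client side: map a plaintext value to the opaque id of its partition."""
    i = bisect.bisect_right(BOUNDS, value) - 1
    if not 0 <= i < len(BUCKET_IDS):
        raise ValueError("value outside the partitioned domain")
    return BUCKET_IDS[i]


def rewrite_range(lo, hi):
    """Client side: rewrite lo <= attr <= hi into a set of target bucket ids."""
    first = bisect.bisect_right(BOUNDS, lo) - 1
    last = bisect.bisect_right(BOUNDS, hi) - 1
    return set(BUCKET_IDS[first:last + 1])


def server_select(table, ids):
    """Server side: match on bucket ids only, never on plaintext."""
    return [row for row in table if row["bucket"] in ids]


def client_filter(rows, decrypt, lo, hi):
    """Client side: decrypt candidates and prune the false positives."""
    return sorted(v for v in map(decrypt, rows) if lo <= v <= hi)


salaries = [250, 1200, 1800, 2100, 3999]
# "cipher" holds the plaintext here only to keep the sketch short.
table = [{"bucket": bucket_id(s), "cipher": s} for s in salaries]
ids = rewrite_range(1500, 2500)
candidates = server_select(table, ids)
answer = client_filter(candidates, lambda row: row["cipher"], 1500, 2500)
```

The server returns three candidates for a query whose true answer has two records; wider buckets would hide more about each value but return more junk for the client to prune.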

Order preserving encryption  Agrawal2004, Boldyreva2009  The set of data is securely transformed so that the order is preserved but the distribution and domain are changed  Benefits: indexing/searching on OPE encrypted data  Weakness: once the original distribution is known, OPE is broken

 Not attribute-wise order preserving  Order preserving encryption (OPE, Agrawal et al. 2004) is not resilient to distribution-based attacks  [Figure: the original Xi distribution is known; OPE yields the transformed Xi' distribution; bucket-based estimation maps it back]
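
A rough sketch of why a known distribution breaks order preservation. The order-preserving map below is a toy random monotone table, not the scheme of Agrawal et al. or Boldyreva et al., and the attack is a simple rank (quantile) matching chosen for illustration; all names are hypothetical.

```python
import random


def toy_ope_key(domain, rng):
    """Toy order-preserving map: a strictly increasing random table."""
    cipher, table = 0, {}
    for x in domain:
        cipher += rng.randint(1, 1000)   # positive random gap keeps the order
        table[x] = cipher
    return table


def rank_matching_attack(ciphertexts, known_values):
    """Guess plaintexts by aligning sorted ciphertexts with sorted known values."""
    c_sorted = sorted(ciphertexts)
    p_sorted = sorted(known_values)
    guess = {}
    for rank, c in enumerate(c_sorted):
        # Pick the known value sitting at the same quantile.
        guess[c] = p_sorted[rank * len(p_sorted) // len(c_sorted)]
    return guess


rng = random.Random(7)
key = toy_ope_key(range(0, 101), rng)
data = [rng.randint(0, 100) for _ in range(2000)]         # outsourced column
encrypted = [key[x] for x in data]
background = [rng.randint(0, 100) for _ in range(2000)]   # attacker's sample
guess = rank_matching_attack(encrypted, background)
errors = [abs(guess[c] - x) for c, x in zip(encrypted, data)]
mean_error = sum(errors) / len(errors)
```

The attacker never sees the key, only ciphertext order and a sample from the same distribution, yet the average estimation error stays small relative to the domain width of 100.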

Data perturbation  Definition 1. randomly change the original data 2. the attacker cannot effectively recover the original data 3. the desired properties are preserved  Techniques Single dimension: noise addition Multidimensional  Geometric perturbation  Random projection  RASP random space perturbation

Noise addition  Y = X+ R X: original data column, R: random noise (distribution published), Y: published data  Applications in data mining Reconstructing column distribution  Rakesh Agrawal SIGMOD 2000  Applied to privacy-preserving decision tree, naïve bayes classifier  Attacks Spectral filtering (Kargupta ICDM 2004) PCA reconstruction (Huang SIGMOD2005)

 Multiplicative perturbations: geometric data perturbation for outsourced data mining; random projection; RASP perturbation for query services (range query, kNN query).

Perturbation-based framework Mining service

Geometric data perturbation  Y=RX+T+D R: secret rotation matrix (preserves Euclidean distances) T: secret random translation matrix, D: secret random noise matrix Distances are only approximately preserved because of the noise D Resilient to most attacks on rotation perturbation  Applications Outsourced privacy preserving data mining, applicable to many classification and clustering algorithms  Attacks Population-based attacks (when the covariance matrix is revealed)

Random Projection  Y=AX+D A: random projection, e.g., entries from N(0,1) Distances are approximately preserved  Applications Many classification and clustering algorithms  Worse accuracy than geometric perturbation Good for sparse high-dimensional data (text data), i.e., sketch methods (A is randomly generated for EACH record)  Attacks Possibly more resilient than the other two perturbation methods But utility (distance) is not well preserved
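
A small sketch of Y=RX+T+D in two dimensions, assuming a single rotation angle, one translation vector shared by all records, and Gaussian noise; the function names and parameter values are illustrative only.

```python
import math
import random


def perturb(points, theta, t, noise_sd, rng):
    """Apply y = R x + t + d to 2-D points, where R rotates by theta."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for x0, x1 in points:
        d0 = rng.gauss(0, noise_sd)      # per-record noise, the D term
        d1 = rng.gauss(0, noise_sd)
        out.append((c * x0 - s * x1 + t[0] + d0,
                    s * x0 + c * x1 + t[1] + d1))
    return out


def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


rng = random.Random(1)
pts = [(0.0, 0.0), (3.0, 4.0), (-2.0, 1.5)]
exact = perturb(pts, 0.9, (10.0, -5.0), 0.0, rng)    # rotation + translation only
noisy = perturb(pts, 0.9, (10.0, -5.0), 0.01, rng)   # with the noise component
```

Without noise the distance between the first two points stays 5.0 up to rounding; with noise it is preserved only approximately, which is the utility cost paid for resilience.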

RASP perturbation  k-dimensional numeric data, n records, represented as a k x n matrix; x: a record (1) Extend x to k+2 dimensions: the (k+1)-th dimension is always 1 (homogeneous dimension); the (k+2)-th dimension v is a random real number (2) Encryption: $y = A\,(x^T, 1, v)^T$, where A is a (k+2)x(k+2) invertible real-valued matrix with at least two non-zero values in each row and all non-zero values in the last column; A is shared by all records

 Properties: Not an OPE  Preserves convexity of the dataset: convex dataset in $\mathbb{R}^k$ -> another convex dataset in $\mathbb{R}^{k+2}$  Good for range query: each range query in $\mathbb{R}^k$ -> hyperplane-based query -> range query in $\mathbb{R}^{k+2}$

RASP properties  Convexity preserving: the queried range (hypercube) is convex; RASP transforms the range to another convex set (polyhedron)  Hyperplane: $w^T x = a$; half-space: $w^T x \le a$  The intersection of convex sets is also convex.

[Figure: illustration of convexity preservation, original space vs. encrypted space]

Secure query transformation  A naive solution, based on the convexity-preserving property  Problems: (1) $A^{-1}$ can be probed (2) ... if a is known, the whole dimension i is breached.

Secure query transformation  Enhanced solution: $X_{k+2}$ is always positive, so $(X_i - a) \le 0 \Leftrightarrow (X_i - a)\,X_{k+2} \le 0$. Correspondingly, in the encrypted space: $y^T \Theta y \le 0$. Problems addressed: (1) $A^{-1}$ cannot be derived from $\Theta$ (2) $(X_i - a)\,X_{k+2} \le 0$ contains the random component $X_{k+2}$, which protects the condition $(X_i - a) \le 0$
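
A sketch of the enhanced transformation for k=2, under stated assumptions: the query matrix is written here as Θ = (A⁻¹)ᵀ u wᵀ A⁻¹, where uᵀz = X_i − a and wᵀz = X_{k+2} for the extended record z, so that yᵀΘy = (X_i − a)·X_{k+2}. The symbol Θ, the matrix `A`, and every function name are illustrative choices, and exact rational arithmetic is used only to keep the example deterministic.

```python
from fractions import Fraction


def invert(m):
    """Gauss-Jordan inverse over exact rationals."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def encrypt(A, x, v):
    """y = A (x, 1, v)^T with v > 0 as the random last dimension."""
    z = list(x) + [1, v]
    return [sum(a * b for a, b in zip(row, z)) for row in A]


def query_matrix(A_inv, i, a, upper=True):
    """Theta for x_i <= a (upper=True) or x_i >= a (upper=False)."""
    n = len(A_inv)
    u = [Fraction(0)] * n
    u[i], u[n - 2] = Fraction(1), Fraction(-a)   # u.z = x_i - a
    if not upper:
        u = [-c for c in u]                      # u.z = a - x_i
    w = [Fraction(0)] * n
    w[n - 1] = Fraction(1)                       # w.z = v
    p = [sum(A_inv[r][j] * u[r] for r in range(n)) for j in range(n)]
    q = [sum(A_inv[r][j] * w[r] for r in range(n)) for j in range(n)]
    return [[pj * ql for ql in q] for pj in p]   # Theta = p q^T


def server_test(theta, y):
    """Server side: evaluate y^T Theta y <= 0 without knowing A or a."""
    n = len(y)
    return sum(y[j] * theta[j][l] * y[l] for j in range(n) for l in range(n)) <= 0


# Hypothetical key: invertible, >= 2 non-zeros per row, last column non-zero.
A = [[2, 1, 0, 1], [0, 3, 1, 2], [1, 0, 2, 1], [1, 1, 1, 3]]
A_inv = invert(A)
records = [(3, 7), (5, 1), (8, 4)]
noise = [Fraction(7, 10), Fraction(23, 10), Fraction(11, 10)]   # positive v values
encrypted = [encrypt(A, x, v) for x, v in zip(records, noise)]
at_most_5 = [server_test(query_matrix(A_inv, 0, 5), y) for y in encrypted]
at_least_5 = [server_test(query_matrix(A_inv, 0, 5, upper=False), y) for y in encrypted]
```

Because every v is positive, the sign of the quadratic form equals the sign of X_i − a, so the server filters correctly while seeing only the encrypted records and Θ.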

Efficient two-stage query processing, illustrated  [Figure: original space vs. transformed space]  Stage 1: query the bounding box; a multidimensional tree index is built on the encrypted data (in the transformed space) on the server.  Stage 2: filter out the junk records

Stage 1: The client calculates the large bounding box; the server uses the index to find the results. Stage 2: filter the initial results with the conditions $y^T \Theta_i y \le 0$ for $i = 1 \dots 2k$. Note: the two-stage strategy works if the output of stage 1 is significantly smaller than the original database and fits in memory. Otherwise, use a linear scan with stage 2 filtering.

RASP-based data mining  Preserving range query -> linear classifier  Use the boosting framework to get strong classifiers (PerturBoost, in ICDM 2013)

Access pattern privacy  On database queries Problem is the same as PIR Attackers may use the access pattern to breach data confidentiality  Each of previous approaches should handle this problem!

PIR is impractical  Solutions based on private information retrieval (PIR) exist, but PIR is still impractical

For the bucketization approach  Based on the architecture of Hacigumus (SIGMOD02)  Hore VLDB04 For range query Privacy concern: reveals the distribution of values in each bucket "Diffusion": split buckets and combine parts of different buckets Trade-off: the server now needs to return noisier results -> larger size

For OPE  Use queries to find out the distributions, then break the encryption

For RASP  Secure query transformation  Attacks on transformed queries

Oblivious RAM  Access pattern: read/write data items  Setting: Client has a small secure memory Server has large insecure storage, semi-honest Data items are encrypted Client cannot hide the accessed locations  An active area

Existing Approaches  Inside a level: some real blocks (useful data), some dummy blocks (random data)  Randomly permuted: only the client knows the permutation  [Figure: a level of interleaved dummy and real blocks]

Existing Approaches  Reading: read a block from each level; one is the real block, the remaining are dummy blocks  [Figure: client reading real and dummy blocks from the server]

Existing Approaches  Writing: shuffle consecutively filled levels, write into the next unfilled level, clear the source levels  [Figure: server levels before and after; the client shuffles blocks]

Continuous Shuffling  … To write:

The Problem with Existing Approaches 

Integrity guarantee  Merkle hash tree: H(H(x1)+H(x2)), where + is string concatenation  Can be stored with tree-like structures: index, XML
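
A minimal sketch of a Merkle hash tree with membership proofs, assuming SHA-256 as H and byte concatenation as +. Duplicating the last node on odd-sized levels is a simplification made for this example, not something the slides specify, and the function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    """H in the slide's notation, instantiated here with SHA-256."""
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return all tree levels, leaf hashes first, root level last."""
    level = [h(x) for x in leaves]
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2:
            level = level + [level[-1]]   # pad odd levels (simplifying assumption)
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def proof(levels, index):
    """Sibling hashes from leaf to root, each flagged if it sits on the left."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify(leaf, path, root):
    """Recompute the root from one leaf and its sibling path."""
    digest = h(leaf)
    for sibling, sibling_is_left in path:
        digest = h(sibling + digest) if sibling_is_left else h(digest + sibling)
    return digest == root


leaves = [b"x1", b"x2", b"x3", b"x4", b"x5"]
levels = build_levels(leaves)
root = levels[-1][0]          # the owner signs and publishes only this value
path = proof(levels, 4)       # the server returns this with record x5
```

The client needs only the signed root: `verify(b"x5", path, root)` accepts the genuine record, while any altered leaf produces a different root and is rejected.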

 Hash chains

Query correctness with Merkle trees, by Devanbu et al.

Using Merkle tree  Example: 5<=q<=10  GLB(q) = 4 (greatest value below the range)  LUB(q) = 11 (least value above the range)

 Operations: Selections, projections, equijoins, set ops  Issues Works only on data with verification objects Query expressiveness Expensive  Related work Pang et al. (ICDE04, SIGMOD05), using ElGamal function Sion VLDB05: challenge token F. Li SIGMOD06: freshness

Secure keyword search  Simple information retrieval: for a keyword, find the documents containing the keyword  What if the documents are encrypted word by word, and the keyword is also encrypted?

Secure keyword search  Song 2000 Seed is random, different for each Wi Key idea: Li and Ri are self-verifiable Advantage of XOR

How to set K?

 Setting of ki: Ki = Fk'(Wi), k' is secret  User publishes W and k = Fk'(W)  Server computes Ci XOR W = <Li, Ri> and checks whether Ri == Fk(Li)  It reveals nothing if Ci is not the ciphertext for W, and Li is random for different Wi, so the server cannot learn anything from Li.

Hidden search  In previous schemes, W is revealed  Weakness: each search will have to release k for W  Easy to collect information  Solution: encrypt Wi with a private key, then xor with
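
The check described above can be sketched as follows, assuming HMAC-SHA256 stands in for both pseudorandom functions, words are padded to a fixed 16 bytes, and Li is derived from a secret seed and the word position. These instantiations and all names are assumptions made for the example, not details taken from the slides.

```python
import hashlib
import hmac

WORD_LEN, LEFT_LEN = 16, 8   # bytes per padded word, bytes in the L part


def prf(key: bytes, msg: bytes, n: int) -> bytes:
    """Pseudorandom function, instantiated with truncated HMAC-SHA256."""
    return hmac.new(key, msg, hashlib.sha256).digest()[:n]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def pad(word: str) -> bytes:
    raw = word.encode()
    if len(raw) > WORD_LEN:
        raise ValueError("word too long for this sketch")
    return raw.ljust(WORD_LEN, b"\0")


def encrypt_words(words, seed, k_prime):
    """Ci = Wi XOR <Li, Ri>, with Ki = F_k'(Wi) and Ri = F_Ki(Li)."""
    out = []
    for i, word in enumerate(words):
        w = pad(word)
        left = prf(seed, i.to_bytes(8, "big"), LEFT_LEN)   # Li, differs per position
        k_i = prf(k_prime, w, 32)
        right = prf(k_i, left, WORD_LEN - LEFT_LEN)
        out.append(xor(w, left + right))
    return out


def trapdoor(word, k_prime):
    """What the user releases to search for one word: W and k = F_k'(W)."""
    w = pad(word)
    return w, prf(k_prime, w, 32)


def server_search(ciphertexts, w, k):
    """Server side: Ci XOR W = <Li, Ri>; a match iff Ri == F_k(Li)."""
    hits = []
    for i, c in enumerate(ciphertexts):
        t = xor(c, w)
        left, right = t[:LEFT_LEN], t[LEFT_LEN:]
        if hmac.compare_digest(right, prf(k, left, WORD_LEN - LEFT_LEN)):
            hits.append(i)
    return hits


seed, k_prime = b"stream-seed", b"word-key-secret"
ciphertexts = encrypt_words(["alice", "bob", "alice", "carol"], seed, k_prime)
hits = server_search(ciphertexts, *trapdoor("alice", k_prime))
misses = server_search(ciphertexts, *trapdoor("dave", k_prime))
```

The two occurrences of the same word encrypt to different ciphertexts because Li differs by position, yet both are found once the trapdoor for that word is released.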

Recent developments  Curtmola et al. 2006, "Searchable symmetric encryption: improved definitions and efficient constructions"  Solves this problem with a construction that provides indistinguishability under chosen ciphertext attack (IND-CCA)

Trusted hardware

Possible benefits

Discussion  Data confidentiality/access pattern: strict cryptographic definition (keyword search) or relaxed definition (perturbation, bucketization, OPE, etc.)  It is very difficult to formulate and prove the security of non-traditional approaches  Do we need to reformulate the security model, and how?
