Published by Timothy Freeman. Modified over 8 years ago.
1
Secure Data Outsourcing
2
Outline
- Motivation
- Background
- Research issues
- Summary
3
Motivation
- Cost of maintaining/mining large data
  - 4-5 times the cost of data acquisition
  - DBAs are paid well
- More and more data service providers
  - Low cost: cloud computing
  - Maintain one database for one user → multiple users
  - Examples: Alentus.com, Datapipe.com, Discountasp.net, …
- Concerns about data security and privacy
  - Untrusted service provider
4
Untrusted service provider
- Lazy: incentives to perform less
- Curious: incentives to acquire information
- Malicious
  - Denial of service
  - Incorrect results
  - Possibly compromised
5
Challenges
- Data confidentiality
  - Data need to be encrypted (?)
  - Utility of protected data? Query utility, mining utility
- Access pattern privacy
- Integrity
  - Data integrity
  - Query integrity: correct, complete, fresh
6
Why is it hard for query services?
- Arbitrary expressivity
  - SQL statements
  - Often restricted to certain types of query for simplicity (e.g. range query, kNN query)
- Cost
  - Communication
  - Computation (server side vs. client side)
7
Why is it hard for mining services?
- Many data mining models
- Different utilities to preserve
- No one-size-fits-all solution
8
Data confidentiality
- Bucketization method (crypto-index)
- Order preserving encryption
- Perturbations
9
Bucketization method Hacigumus (SIGMOD02)
10
Main steps
- Partition sensitive attributes
  - Order preserving: supports comparison
  - Random: query rewriting becomes hard
- Build index on the partitions
- Rewrite queries to target partitions, e.g. 'john doe' → 105: SELECT * FROM T' WHERE name = 105
- Execute queries and return results
- Prune/post-process results on client
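The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the construction from Hacigumus et al.: the bucket boundaries, the bucket ids, and every function name are hypothetical, and the "encryption" is a placeholder that merely tags the value so the sketch runs.

```python
import bisect

# Hypothetical partition of a salary attribute into intervals [lo, hi).
BOUNDARIES = [0, 20_000, 40_000, 60_000, 80_000, 100_001]
# Opaque bucket ids; a random assignment hides the order of the buckets.
BUCKET_IDS = [17, 4, 29, 8, 12]


def bucket_index(value):
    """Position of the interval that contains value."""
    idx = bisect.bisect_right(BOUNDARIES, value) - 1
    if not 0 <= idx < len(BUCKET_IDS):
        raise ValueError("value outside the partitioned domain")
    return idx


def outsource(values, encrypt):
    """Client: store (bucket id, ciphertext) pairs on the server."""
    return [(BUCKET_IDS[bucket_index(v)], encrypt(v)) for v in values]


def rewrite_range(lo, hi):
    """Client: rewrite lo <= value <= hi into the bucket ids it overlaps."""
    return set(BUCKET_IDS[bucket_index(lo):bucket_index(hi) + 1])


def server_select(table, bucket_ids):
    """Server: filter on bucket ids only; it never sees plaintext."""
    return [cipher for bucket, cipher in table if bucket in bucket_ids]


def client_postprocess(ciphers, decrypt, lo, hi):
    """Client: decrypt and prune the false positives."""
    return [v for v in map(decrypt, ciphers) if lo <= v <= hi]


def encrypt(value):
    """Placeholder so the sketch runs; NOT real encryption."""
    return ("enc", value)


def decrypt(cipher):
    return cipher[1]


table = outsource([15_000, 25_000, 39_000, 45_000, 90_000], encrypt)
buckets = rewrite_range(30_000, 50_000)
candidates = server_select(table, buckets)
result = client_postprocess(candidates, decrypt, 30_000, 50_000)
```

In this run the server returns three candidates for the two touched buckets and the client keeps two of them, which is the pruning overhead that wider buckets make worse.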
11
Trade-off between confidentiality and overhead
- Larger partitions → increased privacy, but also increased overhead
12
Order preserving encryption
- Agrawal et al. 2004, Boldyreva et al. 2009
- The set of data is securely transformed so that the order is preserved but the distribution and domain are changed
- Benefits: indexing/searching on OPE-encrypted data
- Weakness: once the original distribution is known, OPE is broken
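To make the order-preserving property concrete, here is a toy sketch. It is not the scheme of Agrawal et al. or Boldyreva et al.: it simply builds a secret, strictly increasing random table over a small integer domain, and all names (`make_ope_table`, `max_gap`, and so on) are hypothetical.

```python
import bisect
import random


def make_ope_table(domain_size, key, max_gap=1000):
    """Secret strictly increasing map from {0, ..., domain_size - 1} to integers."""
    rng = random.Random(key)
    table, current = [], 0
    for _ in range(domain_size):
        current += rng.randint(1, max_gap)  # positive gaps keep the order
        table.append(current)
    return table


def ope_encrypt(table, x):
    return table[x]


def ope_decrypt(table, c):
    # The table is sorted, so a binary search recovers the plaintext.
    return bisect.bisect_left(table, c)


def server_range_query(ciphertexts, c_lo, c_hi):
    """Server: compares ciphertexts directly, with no key needed."""
    return [c for c in ciphertexts if c_lo <= c <= c_hi]


table = make_ope_table(100, key=42)
data = [5, 17, 17, 42, 63, 99]
ciphertexts = [ope_encrypt(table, x) for x in data]
hits = server_range_query(ciphertexts,
                          ope_encrypt(table, 10), ope_encrypt(table, 50))
plain_hits = [ope_decrypt(table, c) for c in hits]
```

Because equal plaintexts map to equal ciphertexts and ranks are unchanged, anyone who knows the plaintext distribution can line it up against the ciphertext ranks, which is the weakness noted above.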
13
Not attribute-wise order preserving
- Order preserving encryption (OPE, Agrawal et al. 2004) is not resilient to distribution-based attacks
[Figure: known distribution of the original Xi next to the OPE-transformed distribution of Xi'; bucket-based estimation]
14
Data perturbation
- Definition
  1. Randomly change the original data
  2. The attacker cannot effectively recover the original data
  3. The desired properties are preserved
- Techniques
  - Single dimension: noise addition
  - Multidimensional
    - Geometric perturbation
    - Random projection
    - RASP (random space perturbation)
15
Noise addition: Y = X + R
- X: original data column; R: random noise (distribution published); Y: published data
- Applications in data mining
  - Reconstructing the column distribution (Agrawal and Srikant, SIGMOD 2000)
  - Applied to privacy-preserving decision trees and naïve Bayes classifiers
- Attacks
  - Spectral filtering (Kargupta ICDM 2004)
  - PCA reconstruction (Huang SIGMOD 2005)
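A minimal sketch of Y = X + R, assuming Gaussian noise with a published standard deviation. It only shows that aggregate statistics (mean and variance) remain recoverable from Y; it is not the distribution-reconstruction algorithm cited above, and the column, seed, and `sigma` are made-up illustration values.

```python
import random
import statistics


def add_noise(column, sigma, rng):
    """Publish Y = X + R with R drawn from N(0, sigma^2)."""
    return [x + rng.gauss(0.0, sigma) for x in column]


rng = random.Random(0)
SIGMA = 10.0                                       # published noise parameter
X = [rng.uniform(20, 80) for _ in range(10_000)]   # hypothetical private column
Y = add_noise(X, SIGMA, rng)

# Individual values are hidden by the noise, but aggregates survive:
# E[Y] = E[X] and Var(Y) = Var(X) + sigma^2 when R is independent of X.
mean_gap = abs(statistics.fmean(Y) - statistics.fmean(X))
var_estimate = statistics.pvariance(Y) - SIGMA ** 2
var_gap = abs(var_estimate - statistics.pvariance(X))
```

The same independence between X and R is what the spectral-filtering and PCA attacks exploit when columns are correlated.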
16
Multiplicative perturbations
- Geometric data perturbation for outsourced data mining
- Random projection
- RASP perturbation for query services (range query, kNN query)
17
Perturbation-based framework Mining service
18
Geometric data perturbation: Y = RX + T + D
- R: secret rotation matrix (preserves Euclidean distances)
- T: secret random translation matrix; D: secret random noise matrix
- Distances are only approximately preserved because of the noise D
- Resilient to most attacks on rotation perturbation
- Applications: outsourced privacy-preserving data mining; applicable to many classification and clustering algorithms
- Attacks: population-based attacks (when the covariance matrix is revealed)
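The effect of each term in Y = RX + T + D can be checked on a 2-D toy example. This is a sketch under assumptions: the rotation angle, the translation, and the noise level are arbitrary illustration values, and real use would involve a random orthogonal matrix in higher dimensions.

```python
import math
import random


def perturb(points, angle, translation, sigma, rng):
    """Rotate by angle, shift by translation, then add N(0, sigma^2) noise."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for x1, x2 in points:
        y1 = c * x1 - s * x2 + translation[0] + rng.gauss(0.0, sigma)
        y2 = s * x1 + c * x2 + translation[1] + rng.gauss(0.0, sigma)
        out.append((y1, y2))
    return out


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


rng = random.Random(1)
points = [(1.0, 2.0), (4.0, 6.0), (-3.0, 0.5)]
exact = perturb(points, angle=0.7, translation=(5.0, -2.0), sigma=0.0, rng=rng)
noisy = perturb(points, angle=0.7, translation=(5.0, -2.0), sigma=0.01, rng=rng)

# With D = 0 the pairwise distance is unchanged; with small D it is close.
d_orig = dist(points[0], points[1])     # 5.0 for these two points
d_exact = dist(exact[0], exact[1])
d_noisy = dist(noisy[0], noisy[1])
```

Distance-based classifiers and clustering algorithms therefore behave almost the same on Y as on X, which is the utility this perturbation aims to keep.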
19
Random projection: Y = AX + D
- A: random projection matrix, e.g., entries drawn from N(0,1)
- Distances are approximately preserved
- Applications
  - Many classification and clustering algorithms
  - Worse accuracy than geometric perturbation
  - Good for sparse high-dimensional data (text data), i.e., sketch methods (A is randomly generated for EACH record)
- Attacks
  - Possibly more resilient than the other two perturbation methods
  - But utility (distance) is not well preserved
20
RASP perturbation
k-dimensional numeric data, n records, represented as a k x n matrix; x: a record
(1) Extend x to k+2 dimensions
  - The (k+1)-th dimension is always 1 (the homogeneous dimension)
  - The (k+2)-th dimension v is a randomly drawn real number
(2) Encryption
  - A is a (k+2) x (k+2) invertible real-valued matrix with at least two non-zero values in each row; the last column of A has all non-zero values
  - A is shared by all records
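The two steps can be sketched as follows. The sketch assumes the encryption is y = A (x, 1, v)^T, which the slide does not spell out, and it draws v from a positive uniform range purely for illustration; the helper names and the way A is sampled are hypothetical.

```python
import random


def mat_vec(m, z):
    return [sum(a * b for a, b in zip(row, z)) for row in m]


def solve(m, y):
    """Solve m z = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [row[:] + [yi] for row, yi in zip(m, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is not invertible")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def make_key(k, rng):
    """Random (k+2) x (k+2) matrix with all entries non-zero; retry if singular."""
    n = k + 2
    while True:
        a = [[rng.choice([-1, 1]) * rng.uniform(1.0, 2.0) for _ in range(n)]
             for _ in range(n)]
        try:
            solve(a, [0.0] * n)
            return a
        except ValueError:
            continue


def encrypt(x, a, rng):
    """Step (1): extend x with 1 and a random v; step (2): multiply by A."""
    return mat_vec(a, list(x) + [1.0, rng.uniform(0.5, 2.0)])


def decrypt(y, a):
    """Undo A and drop the two extra dimensions."""
    return solve(a, y)[:-2]


rng = random.Random(3)
key = make_key(2, rng)
x = [3.5, -1.25]
y_first = encrypt(x, key, rng)
y_second = encrypt(x, key, rng)
recovered = decrypt(y_first, key)
```

Encrypting the same record twice gives different ciphertexts because v is redrawn each time, while decryption with A still recovers x.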
21
Properties
- Not an OPE
- Preserves convexity of the dataset
  - Convex dataset in R^k → another convex dataset in R^(k+2)
- Good for range queries
  - Each range query in R^k → a hyperplane-based range query in R^(k+2)
22
RASP properties
- Convexity preserving
  - The queried range (hypercube) is convex
  - RASP transforms the range into another convex set (a polyhedron)
  - Hyperplane: w^T x = a; half space: w^T x <= a
  - The intersection of convex sets is also convex
23
Illustration of convexity preservation
[Figure: original space vs. encrypted space]
24
Secure query transformation
- A naïve solution, based on the convexity-preserving property
- Problems
  (1) A^(-1) can be probed
  (2) If a is known, the whole dimension i is breached
25
Secure query transformation
- Enhanced solution
  - X_(k+2) is always positive, so (X_i - a) <= 0 is equivalent to (X_i - a) X_(k+2) <= 0
  - Correspondingly, in the encrypted space: y^T Θ y <= 0, where Θ denotes the transformed query matrix
- Problems addressed
  (1) A^(-1) cannot be derived from Θ
  (2) (X_i - a) X_(k+2) <= 0 contains the random component X_(k+2), which protects the condition (X_i - a) <= 0
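The step from the original-space condition to the encrypted-space condition can be written out as below. This is a reconstruction under assumptions, not a formula taken from the slide: it assumes the extended record is z = (x^T, 1, v)^T with v = X_(k+2) > 0, that encryption is y = Az, and it uses e_j for the j-th standard basis vector of R^(k+2) and Θ for the resulting query matrix.

```latex
% Assumed notation: z = (x^T, 1, v)^T, y = A z, so z = A^{-1} y.
\begin{align*}
u^{T} z &= x_i - a, \qquad u = e_i - a\,e_{k+1},\\
w^{T} z &= v > 0, \qquad w = e_{k+2},\\
(x_i - a)\,v &= (u^{T} z)(w^{T} z) = z^{T} u\,w^{T} z
             = y^{T}\,\Theta\,y, \qquad \Theta = (A^{-1})^{T} u\,w^{T} A^{-1}.
\end{align*}
```

Since v > 0, the sign of y^T Θ y equals the sign of x_i - a, so the server can test the condition on y alone while only ever seeing the product matrix Θ.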
26
Efficient two-stage query processing, illustrated
[Figure: original space vs. transformed space]
- Stage 1: query the bounding box. A multidimensional tree index is built on the encrypted data (in the transformed space) on the server.
- Stage 2: filter out the junk records
27
- Stage 1: the client calculates the large bounding box; the server uses the index to find the results.
- Stage 2: filter the initial results with the conditions y^T Θ_i y <= 0 for i = 1 … 2k, where each Θ_i is a transformed query matrix.
Note: the two-stage strategy works if the output of stage 1 is significantly smaller than the original database and fits in memory. Otherwise, use a linear scan with stage 2 filtering.
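A toy end-to-end sketch of the two stages for k = 1 (each record is a single number). Everything concrete here is an assumption made for illustration: the 3 x 3 key A and its hand-computed inverse, the range of the random component v, the form Θ = (A^-1)^T u w^T A^-1 for the query matrices, the `max_candidates` threshold standing in for "fits in memory", and a plain list scan standing in for the server's tree index.

```python
import itertools
import random

# Toy key for k = 1. A_INV is the inverse of A, computed by hand.
A = [[1.0, 1.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
A_INV = [[1.0, -1.0, 0.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 1.0]]
V_MIN, V_MAX = 0.5, 2.0  # assumed range of the positive random component v


def mat_vec(m, z):
    return [sum(a * b for a, b in zip(row, z)) for row in m]


def encrypt(x, rng):
    return mat_vec(A, [x, 1.0, rng.uniform(V_MIN, V_MAX)])


def make_theta(u):
    """Theta = (A^-1)^T u w^T A^-1 with w = e_3, so y^T Theta y = (u^T z) * v."""
    p = [sum(A_INV[r][c] * u[r] for r in range(3)) for c in range(3)]
    q = A_INV[2]
    return [[p[r] * q[c] for c in range(3)] for r in range(3)]


def client_query(lo, hi):
    """Client: 2k = 2 quadratic conditions plus the transformed bounding box."""
    thetas = [make_theta([1.0, -hi, 0.0]),   # (x - hi) * v <= 0
              make_theta([-1.0, lo, 0.0])]   # (lo - x) * v <= 0
    corners = [mat_vec(A, [x, 1.0, v])
               for x, v in itertools.product((lo, hi), (V_MIN, V_MAX))]
    lows = [min(c[d] for c in corners) for d in range(3)]
    highs = [max(c[d] for c in corners) for d in range(3)]
    return lows, highs, thetas


def quad_form(y, theta):
    return sum(y[r] * theta[r][c] * y[c] for r in range(3) for c in range(3))


def server_query(records, lows, highs, thetas, max_candidates):
    # Stage 1: bounding-box lookup (a real server would use its tree index).
    stage1 = [y for y in records
              if all(l <= c <= h for c, l, h in zip(y, lows, highs))]
    # Fall back to a linear scan when stage 1 does not prune enough.
    pool = stage1 if len(stage1) <= max_candidates else records
    # Stage 2: exact filtering with the quadratic conditions.
    return [y for y in pool if all(quad_form(y, t) <= 0 for t in thetas)]


rng = random.Random(7)
records = [encrypt(float(x), rng) for x in [1, 2, 4, 5, 6, 9, 10]]
lows, highs, thetas = client_query(3.0, 7.0)
hits = server_query(records, lows, highs, thetas, max_candidates=5)
scan_hits = server_query(records, lows, highs, thetas, max_candidates=0)
decrypted = sorted(round(mat_vec(A_INV, y)[0]) for y in hits)
```

Both paths return the same records here; the bounding box only reduces how many records the quadratic test has to touch.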
28
RASP-based data mining
- Preserving range queries → linear classifiers
- Use the boosting framework to get strong classifiers (PerturBoost, ICDM 2013)
29
Access pattern privacy
- On database queries
  - The problem is the same as PIR
  - Attackers may use the access pattern to breach data confidentiality
  - Each of the previous approaches should handle this problem!
30
PIR is impractical
- Solutions based on private information retrieval (PIR): PIR is still impractical
31
For the bucketization approach
- Based on the architecture of Hacigumus (SIGMOD02)
- Hore VLDB04
  - For range queries
  - Privacy concern: reveals the distribution of values in each bucket
  - "Diffusion": split buckets and combine parts of different buckets
  - Trade-off: the server now needs to return noisier results of larger size
32
For OPE Use queries to find out the distributions, then break the encryption
33
For RASP
- Secure query transformation
- Attacks on transformed queries
34
Oblivious RAM
- Access pattern: read/write data items
- Setting
  - Client has a small secure memory
  - Server has large insecure storage and is semi-honest
  - Data items are encrypted
  - Client cannot hide the accessed locations
- An active area
35
Existing approaches: inside a level
- Some real blocks (useful data)
- Some dummy blocks (random data)
- Randomly permuted; only the client knows the permutation
[Figure: a level of interleaved dummy and real blocks]
36
Existing approaches: reading
- Read a block from each level
- One is the real block; the remaining ones are dummy blocks
[Figure: client reading real and dummy blocks from the server]
37
Existing approaches: writing
- Shuffle consecutively filled levels
- Write into the next unfilled level
- Clear the source levels
[Figure: server levels before and after; the client shuffles the blocks]
38
Continuous Shuffling … To write:
39
The Problem with Existing Approaches
40
Integrity guarantee
- Merkle hash tree: H(H(x1)+H(x2)), where + is string concatenation
- Can be stored with tree-like structures: index, XML
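A minimal sketch of the construction H(H(x1)+H(x2)), extended to any non-empty list of items, with a membership proof. The choice of SHA-256 and the rule of repeating the last hash on odd-sized levels are assumptions made here for illustration; the function names are hypothetical.

```python
import hashlib


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def next_level(level):
    """Hash adjacent pairs; an odd-sized level repeats its last hash."""
    if len(level) % 2:
        level = level + [level[-1]]
    return [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(items):
    """Root hash over a non-empty list of byte strings."""
    level = [H(x) for x in items]
    while len(level) > 1:
        level = next_level(level)
    return level[0]


def merkle_proof(items, index):
    """Sibling hashes from leaf to root, tagged with 'sibling is on the left'."""
    level = [H(x) for x in items]
    proof = []
    while len(level) > 1:
        sibling = index ^ 1
        sibling_hash = level[sibling] if sibling < len(level) else level[index]
        proof.append((sibling_hash, sibling < index))
        level = next_level(level)
        index //= 2
    return proof


def verify(item, proof, root):
    """Recompute the root from one item and its proof."""
    node = H(item)
    for sibling_hash, sibling_is_left in proof:
        node = H(sibling_hash + node) if sibling_is_left else H(node + sibling_hash)
    return node == root


items = [b"x1", b"x2", b"x3", b"x4", b"x5"]
root = merkle_root(items)
proof_for_x3 = merkle_proof(items, 2)
```

A client that holds only `root` (for example, signed by the data owner) can check any returned item with a proof of logarithmic size.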
41
Hash chains
42
Query correctness with Merkle trees (Devanbu et al.)
43
Using a Merkle tree
- Example: 5 <= q <= 10
- GLB(q) = 4 (greatest value below the range), LUB(q) = 11 (least value above the range)
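The role of the two boundary values can be sketched as follows: the server returns the matching values plus the nearest value on each side, each with a Merkle proof, and the client checks that they occupy consecutive leaf positions. This is a simplified illustration, not the verification-object format of Devanbu et al.; it assumes sorted values, SHA-256, a non-empty answer that does not touch either end of the data, and hypothetical function names.

```python
import hashlib


def H(data):
    return hashlib.sha256(data).digest()


def leaf_bytes(value):
    return str(value).encode()


def build_levels(values):
    """All tree levels, leaves first; odd-sized levels repeat their last hash."""
    levels = [[H(leaf_bytes(v)) for v in values]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]
        levels.append([H(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def prove(levels, index):
    proof = []
    for cur in levels[:-1]:
        sib = index ^ 1
        sib_hash = cur[sib] if sib < len(cur) else cur[index]
        proof.append((sib_hash, sib < index))
        index //= 2
    return proof


def leaf_position(value, proof, root):
    """Leaf index implied by a valid proof, or None if the proof fails."""
    node, index = H(leaf_bytes(value)), 0
    for depth, (sib_hash, sib_is_left) in enumerate(proof):
        node = H(sib_hash + node) if sib_is_left else H(node + sib_hash)
        index |= int(sib_is_left) << depth
    return index if node == root else None


def server_answer(values, levels, lo, hi):
    """Matching values plus one boundary value on each side, with proofs."""
    idx = [i for i, v in enumerate(values) if lo <= v <= hi]
    return [(values[i], prove(levels, i)) for i in range(idx[0] - 1, idx[-1] + 2)]


def client_verify(answer, lo, hi, root):
    positions = [leaf_position(v, p, root) for v, p in answer]
    if None in positions:
        return False                       # some value is not in the tree
    vals = [v for v, _ in answer]
    consecutive = all(b == a + 1 for a, b in zip(positions, positions[1:]))
    return (consecutive and vals[0] < lo and vals[-1] > hi
            and all(lo <= v <= hi for v in vals[1:-1]))


values = [1, 4, 5, 7, 10, 11, 15]          # sorted data held by the server
levels = build_levels(values)
root = levels[-1][0]                       # published by the data owner
answer = server_answer(values, levels, 5, 10)
incomplete = answer[:2] + answer[3:]       # server silently drops the value 7
```

The boundary values 4 and 11 prove that nothing was cut off at either end, and the consecutive positions prove that nothing was dropped in between.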
44
- Operations: selections, projections, equijoins, set operations
- Issues
  - Works only on data with verification objects
  - Query expressiveness
  - Expensive
- Related work
  - Pang et al. (ICDE04, SIGMOD05), using the ElGamal function
  - Sion VLDB05: challenge token
  - F. Li SIGMOD06: freshness
45
Secure keyword search
- Simple information retrieval: for a keyword, find the documents containing the keyword
- What if the documents are encrypted word by word and the keyword is also encrypted?
46
Secure keyword search (Song et al. 2000)
- The seed is random and different for each Wi
- Key idea: Li and Ri are self-verifiable
- Advantage of XOR
48
How to set K?
49
Setting of ki
- ki = F_k'(Wi), where k' is secret
- The user publishes W and k = F_k'(W)
- For each Ci, the server checks whether Ci XOR W has the self-verifiable form (L, F_k(L))
- It reveals nothing if Ci is not the ciphertext for W
- Li is random for different Wi, so the server cannot find any information from Li
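The check the server performs can be sketched as below. This follows the scheme only as summarized on these slides and makes several assumptions: HMAC-SHA256 (truncated) stands in for the pseudo-random function F, words are zero-padded to a fixed 16 bytes, each Li is derived from a separate seed key and the word position, and all function names are hypothetical.

```python
import hashlib
import hmac

HALF = 8                 # bytes in each of the L and R parts
WORD_LEN = 2 * HALF      # fixed word length after padding


def prf(key, msg):
    """Stand-in for F: HMAC-SHA256 truncated to HALF bytes."""
    return hmac.new(key, msg, hashlib.sha256).digest()[:HALF]


def pad(word):
    raw = word.encode()
    if len(raw) > WORD_LEN:
        raise ValueError("word too long for this sketch")
    return raw.ljust(WORD_LEN, b"\0")


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def word_key(k_prime, word):
    """ki = F_k'(Wi): a per-word key derived from the secret k'."""
    return hmac.new(k_prime, pad(word), hashlib.sha256).digest()


def encrypt_words(words, k_prime, seed_key):
    """Ci = Wi XOR (Li, Ri) with Ri = F_ki(Li)."""
    out = []
    for i, w in enumerate(words):
        left = prf(seed_key, i.to_bytes(8, "big"))   # Li, different per position
        right = prf(word_key(k_prime, w), left)      # Ri, verifiable given ki
        out.append(xor(pad(w), left + right))
    return out


def trapdoor(k_prime, word):
    """What the user releases to search for word: W and k = F_k'(W)."""
    return pad(word), word_key(k_prime, word)


def server_search(ciphertexts, w, k):
    """Positions i where Ci XOR W has the form (L, F_k(L))."""
    hits = []
    for i, c in enumerate(ciphertexts):
        t = xor(c, w)
        if hmac.compare_digest(prf(k, t[:HALF]), t[HALF:]):
            hits.append(i)
    return hits


K_PRIME, SEED_KEY = b"secret k-prime", b"secret seed key"
doc = ["secure", "data", "outsourcing", "data"]
ciphertexts = encrypt_words(doc, K_PRIME, SEED_KEY)
found = server_search(ciphertexts, *trapdoor(K_PRIME, "data"))
missing = server_search(ciphertexts, *trapdoor(K_PRIME, "cloud"))
```

Note that the two occurrences of "data" encrypt to different ciphertexts because Li depends on the position, yet both match the same trapdoor.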
50
Hidden search
- In the previous schemes, W is revealed
- Weakness: each search has to release k for W, so it is easy to collect information
- Solution: encrypt Wi with a private key, then XOR with …
51
Recent developments
- Curtmola et al. 2006, "Searchable symmetric encryption: improved definitions and efficient constructions"
- Gives stronger security definitions for this problem (security against chosen-keyword attacks) together with efficient constructions that meet them
52
Trusted hardware
53
Possible benefits
54
Discussion
- Data confidentiality/access pattern
  - Strict cryptographic definition (keyword search), or
  - Relaxed definition (perturbation, bucketization, OPE, etc.)
- It is very difficult to formulate and prove the security of non-traditional approaches
- Do we need to reformulate the security model, and how?