Data Security and Privacy Keke Chen

Data Security and Privacy Keke Chen
Cloud Computing Data Security and Privacy Keke Chen

Outline Data confidentiality Data privacy access pattern privacy
Integrity of cloud processing

Private data in cloud computing
To use cloud services, the users may have to outsource data Except for public services, such as maps, search engine Data can be relational databases, multidimensional vector data, graph data, text data, images, video…

Problem with outsourced data
Data confidentiality Threat model External attackers encryption is sufficient Internal attackers Cannot work on data encrypted with existing techniques New methods are needed

Problem Access pattern
queries, query results, and how the query is processed the user’s query might be private The returned data, the access path may reveal data distributions – used to compromise data confidentiality Threat model Internal attackers

Problem Query correctness Is the returned result the correct result?
Example: Outsourced database services content distribution network Threat models Malicious service providers Honest but lazy service providers

Normal Assumption on the attackers
Honest but curious service providers Honest – do what you want it to do Curious – want to peek at your information Applies to data confidentiality and access pattern Query correctness assumes dishonest providers To reduce the service cost Intentionally change content

Data confidentiality The ultimate solution: fully homomorphic encryption Computation on encrypted data directly, without the need to decrypt the data Example E(X) ◊ E(Y) = E(X+Y) : Pailier E(X) ◊ E(Y) = E(X*Y) : ElGamal, RSA Both ‘+’ and ‘*’: Gentry’s method (Gentry 2009) Problem: efficiency google search with encrypted keywords and FHE would increase the cost by about a trillion times

Data confidentiality keyword search for encrypted documents (Song 2000) Basic idea Linear scan through the text and try to verify whether each encrypted word is matched Only the searched word can be correctly decrypted. Other words will generate random code Problem Linear scan, not efficient for large data Privacy of search is not preserved – the service provider knows Special documents might be identified

Secure keyword search Song 2000 (paper 186)
Seed is random, different for each Wi Key idea: Li and Ri are self- verifiable Advantage of XOR

How to set K?

Data confidentiality Public-key keyword search and range search (Shi 2007) Scenario: store encrypted data in public storage. Only authorized users can access the authorized part of the data Example Audit log: authorized users access certain days of the log (range query) Problem: key management The range of a searchable dimension is discretized Each value needs a pair of public-private keys Still linear scan

Data confidentiality For many data intensive applications Problem
Linear scan is not sufficient Need indices to make search faster, for example OLAP k nearest search in geological/multimedia databases Problem Indexable encrypted data

Data confidentiality Indexable encryption on relational data
Order preserving encryption Map a column of data to arbitrary distribution, while preserving the value order Able to build index on the column (order is preserved) Easy to translate queries to the encrypted data Crypto-index Map a column of data to buckets Give random ID to the buckets Values are represented with the bucket IDs Range search becomes a search on a list of bucket IDs

Problems with OPE and crypto-index
Weak threat model: assume attackers have no knowledge about data distribution Distributional attack Assume the attacker knows the column distribution: easily map OPE encrypted column back to the original column Crypto-index: Attacking queried ranges  derive the mapping between bucket-id and real value range Return junk records (crypto-index) to hide the column distribution, have to make each bucket equi-height (i.e., same number records in each bucket)

Weaker data confidentiality
Privacy protection Do not hide the complete information Attackers may estimate but cannot get exact values Data utility is easier to preserve Techniques Data perturbation Data anonymization

Data perturbation Definition Techniques
1. randomly change the original data 2. the attacker cannot effectively recover the original data 3. the desired properties are preserved Techniques Single dimension: noise addition Multidimensional Geometric perturbation Random projection

Noise addition Y = X+ R Applications in data mining Attacks
X: original data column, R: random noise (distribution published), Y: published data Applications in data mining Reconstructing column distribution Rakesh Agrawal SIGMOD 2000 Applied to privacy-preserving decision tree, naïve bayes classifier Attacks Spectral filtering (Kargupta ICDM 2004) PCA reconstruction (Huang SIGMOD2005)

Geometric data perturbation
Y=RX+T+D R: secret rotation matrix (preserve Euclidean distances) T: secret random translation matrix, D: secret random noise matrix Distances are approximately preserved (D) Resilient to most attacks to rotation perturbation Applications Outsourced privacy preserving data mining, applicable for many classification and clustering algorithms Attacks Population based attacks (when covariance matrix is revealed)

Random Projection Y=AX+D Applications Attacks
A: random projection, e.g., entries from N(0,1) Distances are approximately preserved Applications Many classification and clustering algorithms Worse accuracy than geometric perturbation Good for sparse high-dimensional data (text data), i.e., sketch methods (A is randomly generated for EACH record) Attacks Possibly more resilient than other two perturbation methods

Random Space Perturbation
y = A( E(x), 1, v)T x is a k-D vector, E(x): order preserving encryption, v: random value A : (k+2)x(k+2) random invertible matrix Properties Preserve half space queries, e.g., f(x)<0 Does not preserve distance, orders, (resilient to the known attacks to OPE, GDP, etc.) Applications: range query, data mining (linear classifier + boosting)

Data anonymization Quasi Identifier anonymization problem
Governor of MA uniquely identified using ZipCode, Birth Date, and Sex. Name linked to Diagnosis 87 % of US population Name SSN Visit Date Diagnosis Procedure Medication Total Charge Name Address Date Registered Party affiliation Date last voted Zip Birth date Sex Zip Birth date Sex Quasi Identifier Voter List Medical Data Sweeney, IJUFKS 2002 23

Basic problems in anonymization
Linkage between Records Attributes Tables Attacker’s prior knowledge and published data Current research address possible data linkage attacks Propose techniques to prevent them A nice survey paper

Differential privacy Statistical database Perturb the query results
Allow users to submit aggregate queries But need to protect from query inference attacks Perturb the query results Cynthia Dwork ( proposed differential privacy A method for designing the random noise for user-specific “privacy budget” and specific aggregate functions.

Privacy of access pattern
Do not want the attacker know what items the user is accessing Known access pattern can be used to derive a lot of private information Example: AOL search log  identify real person  what sensitive keywords he/she used naive solution: download the entire database 

Privacy of access pattern
Private information retrieval (PIR) Basic idea: hide the real access pattern among a lot of accesses Multi-server information theoretic PIR (Chor 98) Computational PIR – same assumption as computationally secure encryption (Kushilevitz 97) Problem: High communication cost: need to transfer a lot of contents Most efficient solution still needs O(n1/2) communication cost

Query correctness check whether server faithfully returns the results
Challenge token (Sion 2005) put a challenge query (that the user knows the answer) among a batch of b queries This solution only provides a probabilistic guarantee on query correctness

Use Merkle hash tree (Li 2006)
Integrate Merkle hash tree with B-tree index For each search the service provider need to provide a hash value to prove the right access path was used Client can verify the hash with its own merkle hash tree

Proof of retrievability
Problem Archived or backup data in the cloud, sporadically accessed Make sure the data is still there (not discarded by cloud provider for saving money) Make sure the data is not modified Approaches Similar to challenge based methods

Data Security and Privacy Keke Chen

Similar presentations

Presentation on theme: "Data Security and Privacy Keke Chen"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data Security and Privacy Keke Chen

Similar presentations

Presentation on theme: "Data Security and Privacy Keke Chen"— Presentation transcript:

Similar presentations

About project

Feedback