Big Data Analytics over Encrypted Datasets with Seabed

Presentation transcript:

Big Data Analytics over Encrypted Datasets with Seabed
Antonis Papadimitriou✳, Ranjita Bhagwan✩, Nishanth Chandran✩, Ramachandran Ramjee✩, Andreas Haeberlen✳, Harmeet Singh✩, Abhishek Modi✩, Saikrishna Badrinarayanan✯
✳University of Pennsylvania, ✩Microsoft Research India, ✯UCLA

Motivation: Big data analytics on sensitive data
[Figure: analyst, retail business, and cloud provider. Query: "What is the total revenue in the USA?" Speech bubble: "Ah, now I know everyone's purchases!"]
  customer   gender   country   payment
  Alice      female   CAN       12
  Bob        male     USA       4
  Charlie                       1
  Deborah                       15
- Goal: Outsource big data analytics
  - Store database at a cloud provider
  - Perform analytical queries remotely
- Problem:
  - Rogue cloud admins or hackers could have access to data
  - Sensitive information can be exposed

Prior work: Encrypted databases
[Figure: analyst, retail business, and cloud provider; the provider stores the table in encrypted form.]
  customer   gender   country   payment
  %Th6j      h4$89    548yvg    439856
  Fjg893     sfbg43   a3vbt9a   582650
  %gTHR                         143759
  34%^d                         874563
[A second example table also appears on the slide.]
  advertiser   isFraud      country   revenue
  Alpha Co.    legit        CAN       12
  Beta Inc.    suspicious   USA       4
  Psi Int'l                           1
  Omega Ltd                           15
- What can we do? Use encryption!
  - Examples: CryptDB/Monomi [SOSP11, VLDB13], MS SQL Server [SQL16]
  - These support SQL queries on encrypted data

Encrypted databases – Challenges
[Figure: analyst, retail business, and cloud provider holding the encrypted table (customer, gender, country, payment). Bar chart, cost of addition on a single core: 3800 ns encrypted vs. 1 ns plaintext. Speech bubble: "More coffee breaks!"]
- Challenge 1: Performance
  - Aggregations on encrypted data are slower
  - Ciphertext addition is > 3000x slower than plaintext addition
  - Adding 100 million values takes 6 minutes instead of 100 ms
  - Not good for big data!
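
The 6-minute and 100 ms figures follow directly from the per-addition costs on the slide. A quick check, assuming a single core and ignoring memory and scheduling effects:

```latex
\begin{align*}
10^{8} \times 3800\,\mathrm{ns} &= 3.8 \times 10^{11}\,\mathrm{ns} = 380\,\mathrm{s} \approx 6.3\,\mathrm{min} \\
10^{8} \times 1\,\mathrm{ns}    &= 10^{8}\,\mathrm{ns} = 0.1\,\mathrm{s} = 100\,\mathrm{ms}
\end{align*}
```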

Encrypted databases – Challenges
[Figure: analyst, retail business, and cloud provider holding the encrypted table (customer, gender, country, payment).]
- Challenge 2: Security
  - Encrypted databases use cryptographic schemes with weaker guarantees
  - Example: deterministic encryption (DET) reveals equality
  - Recent attack [CCS15] recovered > 60% from certain DET columns

Our approach
[Figure: the encrypted table (customer, gender, payment) before and after applying ASHE and SPLASHE; the transformed table shows separate "gender male", "gender female", "payment male", and "payment female" columns.]
- Goal 1: Improve performance
  - ASHE: New cryptographic scheme that allows fast aggregation on encrypted data
- Goal 2: Improve security
  - SPLASHE: DB transformation that enables more queries without using weaker crypto

Seabed: Big data analytics for encrypted datasets
[Figure: the analyst queries Seabed, which builds on ASHE and SPLASHE.]
- We built Seabed on top of Spark
  - Seabed leverages ASHE and SPLASHE
- Seabed runs SQL queries on encrypted data
  - Examples: Group-by queries and aggregations (sum, average, variance)
- Seabed is fast enough for big data
  - Up to 100x faster than previous systems

Outline
- Motivation & prior work
- Approach
- Improving performance: ASHE
- Improving security: SPLASHE
- System design
- Evaluation

Why is aggregation slow in encrypted databases?
[Figure: the plaintext DB sums the payment column (12, 4, 1, 15) with integer addition (+), giving Sum = 32; the encrypted DB sums the ciphertexts (439856, 582650, 143759, 874563) with homomorphic addition (⊕), giving Sum = Enc(32).]
- We need to sum up encrypted data
  - This is impossible with schemes like AES
- We need an additively homomorphic cryptosystem
  - Example: Paillier encryption [EUROCRYPT99]
  - $\mathrm{Enc}(x_1) \oplus \mathrm{Enc}(x_2) = \mathrm{Enc}(x_1 + x_2)$

Why is aggregation slow in encrypted databases?
[Figure: as on the previous slide, integer addition over the plaintext payment column (Sum = 32) next to homomorphic addition over the encrypted column (Sum = Enc(32)).]
- Most homomorphic cryptosystems are expensive!
  - Example: Paillier ciphertexts need to be 2048-bit
  - Homomorphic addition: $\mathrm{Enc}(x_1) \oplus \mathrm{Enc}(x_2) = \mathrm{Enc}(x_1) * \mathrm{Enc}(x_2)$
  - > 3000x slower than plain addition
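
To make the homomorphic addition concrete, here is a minimal toy sketch of Paillier in Python. It is not code from Seabed or any library: the primes `P` and `Q` are tiny demo values chosen for illustration (and completely insecure), and the function names are hypothetical. It uses the common simplification g = n + 1. The point is only that adding plaintexts corresponds to multiplying ciphertexts modulo n², which for real key sizes means multiplying 2048-bit numbers.

```python
import math
import secrets

# Toy parameters (assumption): real Paillier uses primes of about 1024 bits.
P, Q = 1789, 1861
N = P * Q
N2 = N * N
G = N + 1                                          # common simple generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1), private
MU = pow(LAM, -1, N)                               # inverse of LAM mod N


def encrypt(m: int) -> int:
    """Enc(m) = G^m * r^N mod N^2 for a fresh random r coprime to N."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N2) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    """Dec(c) = L(c^LAM mod N^2) * MU mod N, where L(u) = (u - 1) // N."""
    u = pow(c, LAM, N2)
    return ((u - 1) // N * MU) % N


def add(c1: int, c2: int) -> int:
    """Homomorphic addition: multiply the ciphertexts modulo N^2."""
    return (c1 * c2) % N2


# The payment column from the slides, summed without decrypting any row.
ciphertexts = [encrypt(m) for m in (12, 4, 1, 15)]
encrypted_sum = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_sum = add(encrypted_sum, c)
print(decrypt(encrypted_sum))
```

Each `add` is a big-integer modular multiplication, which is where the cost gap relative to a native integer addition comes from.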

Can we have faster homomorphic cryptosystems?
[Figure: retail business and analyst, with the labels "Public key", "Private key", and "Symmetric key".]
- But why does Paillier have such large ciphertexts?
  - Because it is an asymmetric scheme based on large integers
  - Encrypt with public key, decrypt with private key
- Do we need asymmetric crypto in outsourced databases?
  - Analysts and data collectors usually work for the same organization
  - We could use fast symmetric crypto!

ASHE – Additive Symmetric Homomorphic Encryption
[Figure: the payment column in plaintext, masked with random numbers, and masked with a pseudorandom function of the row ID.]
  ID   plaintext   random mask   mask from F(ID)
  1    12          12 + 439      12 + F(1)
  2    4           4 - 56        4 + F(2)
  3    1           1 + 379       1 + F(3)
  4    15          15 + 763      15 + F(4)
  Encrypted sum = 32 + 1525, with ID list {1, 2, 3, 4}; subtracting 1525 recovers Sum = 32.
- Encrypt by masking values with random numbers
  - ASHE is semantically secure (IND-CPA)
- No need to remember random numbers
  - Use pseudorandom function F(ID)
- ASHE ciphertexts are 32/64-bit integers
  - Homomorphic addition only takes a few nanoseconds!
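
The masking idea on this slide can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Seabed's implementation: HMAC-SHA256 stands in for the pseudorandom function F (the talk mentions AES-NI for that role), arithmetic is done modulo 2^64, and all function names and the demo key are hypothetical.

```python
import hashlib
import hmac

MOD = 2 ** 64  # ciphertexts are treated as 64-bit integers


def prf(key: bytes, row_id: int) -> int:
    """Pseudorandom mask F(ID). HMAC-SHA256 is a stand-in (assumption)."""
    digest = hmac.new(key, row_id.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def encrypt(key: bytes, row_id: int, value: int) -> int:
    """Mask the value with F(ID); the row ID is stored alongside it."""
    return (value + prf(key, row_id)) % MOD


def aggregate(ciphertexts):
    """Server side: plain modular additions, no key needed."""
    total = 0
    for c in ciphertexts:
        total = (total + c) % MOD
    return total


def decrypt_sum(key: bytes, total: int, ids) -> int:
    """Client side: subtract the masks of every row that was summed."""
    mask = sum(prf(key, i) for i in ids) % MOD
    return (total - mask) % MOD


key = b"demo key (hypothetical)"
payments = {1: 12, 2: 4, 3: 1, 4: 15}  # ID -> payment, as on the slide
encrypted = {i: encrypt(key, i, v) for i, v in payments.items()}
encrypted_sum = aggregate(encrypted.values())
print(decrypt_sum(key, encrypted_sum, payments.keys()))
```

The server only ever performs machine-word additions; the price is that the client must know which row IDs went into the sum, which is the cost the next slide's optimizations address.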

ASHE – Optimizations
[Figure: aggregation at the encrypted DB yields $\mathrm{Sum} = 32 + \sum_i F(i)$ with ID list {1, 2, 3, 4}; decryption computes F(1), F(2), F(3), F(4) and evaluates $\mathrm{Sum} - \sum_i F(i)$. F is evaluated with AES-NI.]
  ID   basic ciphertext   optimized ciphertext
  1    12 + F(1)          12 + F(1) - F(0)
  2    4 + F(2)           4 + F(2) - F(1)
  3    1 + F(3)           1 + F(3) - F(2)
  4    15 + F(4)          15 + F(4) - F(3)
- Challenge: Aggregation and decryption cost depends on ID list length
- Optimizations:
  - Optimize encryption so that the randomness cancels out for consecutive IDs
  - Fast evaluation of pseudorandom function via AES-NI
  - Compression techniques to make ID list as small as possible
- Outcome: ASHE enables fast aggregation even when the DB is very large
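
The cancellation for consecutive IDs can be illustrated as follows. With ciphertexts of the form m + F(i) - F(i-1), summing rows a through b leaves only F(b) - F(a-1), so each contiguous ID range costs two PRF evaluations at decryption time regardless of its length. The sketch below is an assumption-laden illustration, not Seabed's code: HMAC-SHA256 again stands in for F, IDs are assumed to start at 1, rows 5 and 6 are made-up values, and the range compression is the simplest possible one.

```python
import hashlib
import hmac

MOD = 2 ** 64


def prf(key: bytes, row_id: int) -> int:
    """Pseudorandom function F(ID); HMAC-SHA256 is a stand-in (assumption)."""
    digest = hmac.new(key, row_id.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def encrypt(key: bytes, row_id: int, value: int) -> int:
    """Ciphertext m + F(i) - F(i-1), as in the optimized column on the slide."""
    return (value + prf(key, row_id) - prf(key, row_id - 1)) % MOD


def compress_ids(ids):
    """Collapse IDs into inclusive (first, last) ranges of consecutive IDs."""
    ranges = []
    for i in sorted(ids):
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], i)
        else:
            ranges.append((i, i))
    return ranges


def decrypt_sum(key: bytes, total: int, ids) -> int:
    """Masks telescope, so each range needs only F(last) and F(first - 1)."""
    for first, last in compress_ids(ids):
        total -= prf(key, last) - prf(key, first - 1)
    return total % MOD


key = b"demo key (hypothetical)"
# Rows 1-4 follow the slide; rows 5 and 6 are hypothetical extra values.
payments = {1: 12, 2: 4, 3: 1, 4: 15, 5: 7, 6: 9}
encrypted = {i: encrypt(key, i, v) for i, v in payments.items()}

selected = [1, 2, 3, 4, 6]
encrypted_sum = sum(encrypted[i] for i in selected) % MOD
print(compress_ids(selected))
print(decrypt_sum(key, encrypted_sum, selected))
```

Here five selected rows compress to two ranges, so decryption needs four PRF calls instead of five; for a scan over millions of consecutive rows the saving is correspondingly larger.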

Why are encrypted databases vulnerable?
[Figure: frequency attack on the DET-encrypted gender column.]
  Plaintext table:
    customer   gender
    Alice      female
    Bob        male
    Charlie
    Deborah
  Encrypted table:
    customer   gender
    %Th6j&     h4$89
    Fjg893n    sfbg43
    %gTHR3
    34%^db
  Ciphertext counts: h4$89h appears 3 times, sfbg43q appears once.
  Auxiliary information: 69% female, 31% male.
  Inference: h4$89h = female, sfbg43q = male.
- Some columns use deterministic encryption (DET)
  - This leaks the distribution of values
- An adversary with auxiliary information can do a frequency attack [CCS15]
  - In the example, the gender is revealed
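
The inference on this slide amounts to matching frequency ranks. The sketch below shows that simplest variant in Python; it is not the exact procedure of [CCS15], and the function name and the row order of the ciphertext column are illustrative assumptions (only the 3-to-1 counts and the 69%/31% auxiliary distribution come from the slide).

```python
from collections import Counter


def frequency_attack(det_column, aux_distribution):
    """Guess plaintexts by pairing ciphertexts and values by frequency rank."""
    ciphertexts_by_freq = [ct for ct, _ in Counter(det_column).most_common()]
    plaintexts_by_freq = sorted(
        aux_distribution, key=aux_distribution.get, reverse=True
    )
    return dict(zip(ciphertexts_by_freq, plaintexts_by_freq))


# DET maps equal plaintexts to equal ciphertexts, so counts survive encryption.
det_gender_column = ["h4$89h", "sfbg43q", "h4$89h", "h4$89h"]
auxiliary = {"female": 0.69, "male": 0.31}  # public statistics, per the slide

print(frequency_attack(det_gender_column, auxiliary))
```

The attacker never touches the key: the ciphertext histogram plus public statistics is enough, which is why avoiding DET altogether is attractive.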

How can we avoid deterministic encryption?
Query: SELECT sum (revenue) FROM purchases WHERE gender = “female”
  Original table:
    customer   gender   payment
    Alice      female   12
    Bob        male     4
    Charlie             1
    Deborah             15
[Figure: the SPLASHE-transformed table, shown in plaintext and in encrypted form, replaces the gender column with per-value ("female", "male") columns; the individual cells are not recoverable from the transcript. Query result: Sum = 28.]
- Support single-table aggregation queries without DET
- SPLASHE: Transform DB schema to avoid DET
  - Answer single-table aggregation queries using additions only
- Some storage overhead
  - Reduced by Enhanced SPLASHE (see paper)
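
A rough sketch of the schema transformation, assuming it works by splitting the measure column into one column per value of the filtered column (this matches the "payment female" / "payment male" columns shown earlier in the talk, but the details here are an illustration, not the paper's exact construction). The rows, names, and helper functions are hypothetical, and the sketch operates on plaintext; in Seabed each resulting column would additionally be encrypted with ASHE so the server sees only masked integers.

```python
def splashe_transform(rows, dim, measure, domain):
    """Replace (dim, measure) with one measure column per value in domain.

    A row contributes its measure to the column matching its own value of
    `dim` and 0 to the others, so no per-row equality test is needed later.
    """
    transformed = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in (dim, measure)}
        for value in domain:
            new_row[f"{measure}_{value}"] = row[measure] if row[dim] == value else 0
        transformed.append(new_row)
    return transformed


def sum_where(transformed, measure, value):
    """SUM(measure) WHERE dim = value becomes a plain sum over one column."""
    return sum(row[f"{measure}_{value}"] for row in transformed)


# Hypothetical rows (not the slide's data).
purchases = [
    {"customer": "c1", "gender": "female", "payment": 10},
    {"customer": "c2", "gender": "male", "payment": 3},
    {"customer": "c3", "gender": "female", "payment": 5},
]
table = splashe_transform(purchases, "gender", "payment", ["female", "male"])
print(table[0])
print(sum_where(table, "payment", "female"))
```

Because the filter is folded into the choice of column, the server only adds values, which ASHE supports; the cost is one extra column per distinct value, which is the storage overhead the slide mentions.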

Seabed – System design
[Figure: analyst, client proxy, and Seabed.]
- We implemented Seabed on top of unmodified Spark
  - ASHE and SPLASHE implemented in Scala
- Seabed's high-level design is similar to CryptDB's
  - Accepts SQL queries; transparently answers them on encrypted data
  - Client proxy handles encryption/decryption

Evaluation: Questions
- End-to-end latency of aggregation?
- Storage overhead of SPLASHE?
- End-to-end latency in Bing Ads analytics?
- How scalable is aggregation?
- How effective are Seabed's optimizations?
- Latency of group-by queries?
- Latency of batch queries (Big Data Benchmark)?
Experimental setup:
- Spark with 100 cores
- On MS Azure
- Memory-resident data

How efficient is ASHE aggregation?
[Figure: end-to-end time (s) vs. dataset size (millions of rows, up to 2000) for Paillier, Seabed (Worst), Seabed (Best), and No Enc.; annotations on the plot: 16.6 min, 10 sec, 1 sec.]
- Synthetic data: up to 1.75 billion rows
- Query: single column aggregation
- Results
  - Paillier: up to 16.6 minutes
  - No encryption: <1 second
- How does Seabed do?
  - Seabed is 100x faster than Paillier, even in the worst case!

How much storage does SPLASHE need?
[Figure: size increase relative to plaintext (log scale, 10x to 1000x) vs. number of DET columns replaced with SPLASHE (1, 5, 10 columns), for SPLASHE and Enhanced SPLASHE.]
- Dataset: 760M rows, real ad-analytics application from Microsoft
  - We replaced 10 DET columns with SPLASHE, one by one
  - Measured: Relative size increase vs. plaintext dataset
- Results
  - SPLASHE has substantial storage cost
  - Enhanced SPLASHE reduces this cost by up to 10x
- With 10x more storage, we avoid DET entirely!
  - Reduces risk of information leaks

How efficient is Seabed for real-world applications?
[Figure: end-to-end latency (s, up to 400) of queries Q1 to Q15 for Paillier, Seabed, and No Enc.]
- Same ad-analytics application from Microsoft
  - Measured: End-to-end latency of 15 queries
- Results
  - No encryption is about 10x faster than Paillier across all queries
  - Seabed is almost as fast as no encryption (within 15-44%)
- It is possible to do analytics on encrypted big data!

Summary
- Big-data analytics on encrypted data is difficult
  - Key challenges: Performance, security
- We introduce additive symmetric homomorphic encryption (ASHE)
  - Result: much better performance when analyst and data owner trust each other
- We present a schema transformation called SPLASHE
  - Result: Often avoids the need for weaker encryption, which means better security
- Seabed: an extension of Spark that uses ASHE and SPLASHE
  - Up to 100x faster than previous systems
  - Seabed is fast enough for real-world big data applications
Any Questions?

References
[EUROCRYPT99] P. Paillier. "Public-key cryptosystems based on composite degree residuosity classes." In Proc. EUROCRYPT, 1999.
[SOSP11] R. A. Popa et al. "CryptDB: protecting confidentiality with encrypted query processing." In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles. ACM, 2011.
[VLDB13] S. Tu, M. F. Kaashoek, S. Madden, and N. Zeldovich. "Processing analytical queries over encrypted data." In Proceedings of the VLDB Endowment, Vol. 6, No. 5, pp. 289-300. VLDB Endowment, 2013.
[SQL16] https://www.microsoft.com/en-us/cloud-platform/sql-server
[CCS15] M. Naveed, S. Kamara, and C. V. Wright. "Inference attacks on property-preserving encrypted databases." In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015.