How to Protect Big Data in a Containerized Environment Thomas Phelan Chief Architect, BlueData @tapbluedata
Outline Securing a Big Data Environment Data Protection Transparent Data Encryption Transparent Data Encryption in a Containerized Environment Takeaways
In the Beginning … Hadoop was used to process public web data No compelling need for security No user or service authentication No data security
Then Hadoop Became Popular Security is important.
Layers of Security in Hadoop
Access
Authentication
Authorization
Data Protection
Auditing
Policy (protect from human error)
Hadoop Security: Data Protection Reference: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/sg_edh_overview.html
Focus on Data Security
Confidentiality: lost when data is accessed by someone not authorized to do so
Integrity: lost when data is modified in unexpected ways
Availability: lost when data is erased or becomes inaccessible
Reference: https://www.us-cert.gov/sites/default/files/publications/infosecuritybasics.pdf
Hadoop Distributed File System (HDFS) Data Security Features Access Control Data Encryption Data Replication
Access Control
Simple: identity determined by the host operating system
Kerberos: identity determined by Kerberos credentials
One realm for both compute and storage
Required for HDFS Transparent Data Encryption
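A minimal sketch of Kerberized access (the principal, realm, and path here are illustrative): the user authenticates once, and subsequent HDFS commands carry that identity.
$ kinit alice@CORP.ENTERPRISE.COM   # obtain a Kerberos ticket-granting ticket
$ klist                             # confirm the ticket was issued
$ hdfs dfs -ls /user/alice          # HDFS now authenticates the caller as alice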
Data Encryption Transforming data into a form that is unreadable without the decryption key
Data Replication
3-way replication: can survive any 2 failures
Erasure coding: can survive more than 2 failures, depending on the parity block configuration
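For illustration (the paths are made up; the hdfs ec subcommand requires Hadoop 3.x), both schemes are set per path:
$ hdfs dfs -setrep -w 3 /data/hot                            # classic 3-way replication
$ hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k   # Reed-Solomon, 6 data + 3 parity blocks: survives any 3 lost blocks per stripe
$ hdfs ec -getPolicy -path /data/cold                        # verify the policy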
HDFS with End-to-End Encryption
Confidentiality: Data Access
Integrity: Data Access + Data Encryption
Availability: Data Access + Data Replication
Data Encryption How to transform the data? [Diagram: a cleartext bit string is transformed into unreadable ciphertext.]
Data Encryption – At Rest Data is encrypted while on persistent media (disk)
Data Encryption – In Transit Data is encrypted while traveling over the network
The Whole Process [Diagram: data remains ciphertext across the whole path, at rest and in transit.]
HDFS Transparent Data Encryption (TDE)
End-to-end encryption: data is encrypted/decrypted at the client, so it is protected both at rest and in transit
Transparent: no application-level code changes required
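A sketch of what "transparent" means in practice (paths are illustrative; assumes an encryption zone /encrypted_dir already exists): ordinary HDFS commands work unchanged, while the raw namespace exposes only ciphertext.
$ hdfs dfs -put report.csv /encrypted_dir/report.csv      # encrypted at the client before the data leaves it
$ hdfs dfs -cat /encrypted_dir/report.csv                 # decrypted transparently for an authorized user
$ hdfs dfs -cat /.reserved/raw/encrypted_dir/report.csv   # superuser view: the stored ciphertext only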
HDFS TDE – Design Goals:
Only an authorized client/user can access cleartext
HDFS never stores cleartext or unencrypted data encryption keys
HDFS TDE – Terminology
Encryption Zone: a directory whose file contents will be encrypted upon write and decrypted upon read
An EZKEY is generated for each zone
HDFS TDE – Terminology
EZKEY – encryption zone key
DEK – data encryption key
EDEK – encrypted data encryption key
HDFS TDE - Data Encryption
The same key is used to encrypt and decrypt data (symmetric encryption)
The size of the ciphertext is exactly the same as the size of the original cleartext
EZKEY + DEK => EDEK (the DEK is encrypted under the EZKEY)
EDEK + EZKEY => DEK (the EDEK is decrypted back to the DEK)
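A quick aside to show the size property (OpenSSL stands in for the HDFS client here; the key and IV are throwaway hex values): AES in CTR mode, the cipher suite HDFS TDE uses by default, is a stream mode with no padding, so the ciphertext length equals the cleartext length.
$ openssl enc -aes-128-ctr \
    -K 000102030405060708090a0b0c0d0e0f \
    -iv 0f0e0d0c0b0a09080706050403020100 \
    -in clear.txt -out cipher.bin
$ ls -l clear.txt cipher.bin   # identical byte counts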
HDFS TDE - Services
HDFS NameNode (NN)
Kerberos Key Distribution Center (KDC)
Hadoop Key Management Server (KMS)
Key Trustee Server
HDFS TDE – Security Concepts
Division of labor:
The KMS creates the EZKEY & DEK
The KMS encrypts/decrypts the DEK/EDEK using the EZKEY
The HDFS NN communicates with the KMS to create EZKEYs & EDEKs, which are stored in the extended attributes in the encryption zone
The HDFS client communicates with the KMS to get the DEK using the EZKEY and EDEK
HDFS TDE – Security Concepts
The name of the EZKEY is stored in the HDFS extended attributes of the directory associated with the encryption zone
The EDEK is stored in the HDFS extended attributes of the file in the encryption zone
$ hadoop key …
$ hdfs crypto …
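A minimal sketch of those two CLIs (the key name and path are illustrative; assumes the caller has the required KMS and HDFS permissions):
$ hadoop key create ezkey1 -size 256                             # create an EZKEY in the KMS
$ hdfs crypto -createZone -keyName ezkey1 -path /encrypted_dir   # directory must already exist and be empty
$ hdfs crypto -listZones                                         # confirm the zone and its key name
$ hdfs crypto -getFileEncryptionInfo -path /encrypted_dir/file   # inspect a file's EDEK metadata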
HDFS Examples Simplified for the sake of clarity: Kerberos actions not shown NameNode EDEK cache not shown
HDFS – Create Encryption Zone
[Diagram: the client asks the NameNode to create the zone; 3. the NameNode has the KMS create an EZKEY and records its name in the directory's extended attributes: /encrypted_dir xattr: EZKEYNAME = KEY]
HDFS – Create Encrypted File
[Diagram: 1. the client creates the file; 2.–3. the NameNode has the KMS create an EDEK; 4. the NameNode stores the EDEK in the file's extended attributes (/encrypted_dir/file xattr: EDEK); 5. success is returned to the client.]
HDFS TDE – File Write Work Flow
[Diagram: the client reads unencrypted data from the application and obtains the file's EDEK and EZKEYNAME from the NameNode; 3. it requests the DEK from the KMS using the EDEK & EZKEYNAME; 4. the KMS decrypts the DEK from the EDEK and 5. returns it; the client then writes encrypted data to /encrypted_dir/file.]
HDFS TDE – File Read Work Flow
[Diagram: the client reads encrypted data from /encrypted_dir/file and obtains the file's EDEK and EZKEYNAME from the NameNode; 3. it requests the DEK from the KMS using the EDEK & EZKEYNAME; 4. the KMS decrypts the DEK from the EDEK and 5. returns it; the client then delivers unencrypted data to the application.]
Bring in the Containers (e.g. Docker)
Issues with containers are the same as for any virtualization platform:
Multiple compute clusters
Multiple HDFS file systems
Multiple Kerberos realms
Cross-realm trust configuration
Containers as Virtual Machines Note – this is not about using containers to run individual Big Data tasks.
Containers as Virtual Machines This is about running Hadoop / Big Data clusters in containers.
Containers as Virtual Machines A true containerized Big Data environment.
KDC Cross-Realm Trust
Different KDC realms for corporate, data, and compute must interact correctly in order for the Big Data cluster to function:
CORP.ENTERPRISE.COM – End Users
COMPUTE.ENTERPRISE.COM – Hadoop/Spark Service Principals
DATALAKE.ENTERPRISE.COM – HDFS Service Principals
KDC Cross-Realm Trust
Different KDC realms for corporate, data, and compute, connected by one-way trusts (see the sketch below):
The compute realm trusts the corporate realm
The data realm trusts the corporate realm
The data realm trusts the compute realm
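A minimal sketch of establishing one of these one-way trusts with MIT Kerberos (realm names are from the slides; kadmin prompts for the principal's password, which must be identical in both KDCs):
# Run in BOTH the CORP.ENTERPRISE.COM KDC and the COMPUTE.ENTERPRISE.COM KDC
$ kadmin.local -q "addprinc krbtgt/COMPUTE.ENTERPRISE.COM@CORP.ENTERPRISE.COM"
# Repeat per trust, e.g. for the data realm trusting the compute realm:
$ kadmin.local -q "addprinc krbtgt/DATALAKE.ENTERPRISE.COM@COMPUTE.ENTERPRISE.COM"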
KDC Cross-Realm Trust
[Diagram: user@CORP.ENTERPRISE.COM (CORP.ENTERPRISE.COM realm) submits work to the Hadoop cluster running as rm@COMPUTE.ENTERPRISE.COM (KDC: COMPUTE.ENTERPRISE.COM), which reaches the remote HDFS at hdfs://remotedata/ and the Hadoop Key Management Service in the DATALAKE.ENTERPRISE.COM realm (KDC: DATALAKE.ENTERPRISE.COM).]
Key Management Service
Must be enterprise quality
Examples: Java KeyStore KMS, Cloudera Navigator Key Trustee Server
Containers as Virtual Machines
A true containerized Big Data environment:
[Diagram: multiple containerized compute clusters, each with its own COMPUTE.ENTERPRISE.COM Hadoop/Spark service principals, sharing CORP.ENTERPRISE.COM end users and DATALAKE.ENTERPRISE.COM HDFS service principals across several DataLakes.]
Key Takeaways
Hadoop has many security layers
HDFS Transparent Data Encryption (TDE) is best of breed
Security is hard (complex)
Virtualization / containerization only makes it potentially harder
Compute and storage separation with virtualization / containerization can make it harder still
Key Takeaways
Be careful with the build vs. buy decision for containerized Big Data
Recommendation: buy one already built; turnkey solutions exist (e.g. BlueData EPIC)
Reference: www.bluedata.com/blog/2017/08/hadoop-spark-docker-ten-things-to-know
@tapbluedata www.bluedata.com BlueData Booth #1508 in Strata Expo Hall