Apache Spark & Apache Zeppelin: Enterprise Security for production deployments Vinay Shukla Director, Product Management Dec 8, 2016 Twitter: @neomythos Hortonworks: Powering the Future of Data
Thank You Thank you to all the users of Hadoop & Spark, to everyone developing and contributing to Hadoop & Spark, and to all of you for coming to this session.
whoami Recovering programmer, now in product management: Spark for 2.5+ years, Hadoop for 3+ years. Blog at www.vinayshukla.com. Twitter: @neomythos. Addicted to yoga, hiking, & coffee. Smallest contributor to Apache Zeppelin.
What are the enterprise security requirements?
- Spark users should be authenticated; integrate with corporate LDAP/AD
- Allow only authorized users access
- Audit all access
- Protect data both in motion & at rest
- Make security easy to manage
…
Security: Rings of Defense
- Perimeter-level security: network security (e.g., firewalls)
- Authentication: Kerberos, Knox (and other gateways)
- OS security
- Authorization: Apache Ranger/Sentry, HDFS permissions, HDFS ACLs, YARN ACLs
- Data protection: wire encryption, HDFS TDE/DARE, others
Interacting with Spark (Diagram: the Spark Thrift Server, Zeppelin with its REST server, and the Spark shell each run a Spark driver, which launches executors via Spark on YARN.)
Context: Spark Deployment Modes
Spark on YARN:
- yarn-cluster: the Spark driver (SparkContext) runs in the YARN Application Master
- yarn-client: the Spark driver (SparkContext) runs locally on the client; the Spark shell & Spark Thrift Server run in yarn-client mode only
(Diagram: client, App Master, Spark driver, and executors in YARN-client vs. YARN-cluster mode.)
Spark on YARN
John Doe first authenticates to Kerberos before launching the Spark shell or submitting a job:
kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
(Diagram: spark-submit contacts the YARN RM, which launches the Spark AM; NodeManagers run executors against HDFS in the Hadoop cluster.)
Spark – Security – Four Pillars: Authentication, Authorization, Audit, Encryption
Ensure the network is secure: the first step of security is network security; the second step is authentication. Most Hadoop ecosystem projects rely on Kerberos for authentication, and Spark leverages Kerberos on YARN. (Kerberos, the three-headed guard dog: https://en.wikipedia.org/wiki/Cerberus)
Authenticate users with AD/LDAP
1. John Doe first authenticates to Kerberos before launching the Spark shell: kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM (the KDC is backed by AD/LDAP)
2. The client gets a service ticket for Spark from the KDC
3. John Doe submits the Spark job using the Spark service ticket: ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
4. Spark gets a Namenode (NN) service ticket
5. YARN launches Spark executors using John Doe's identity
6. The executors read from HDFS using John Doe's delegation token
Spark – Kerberos – Example
kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
Allow only authorized users access to Spark jobs
Flow: John Doe gets a service ticket for Spark from the KDC, gets a Namenode (NN) service ticket, submits the Spark job to the YARN cluster using the Spark service ticket, and the executors read from HDFS.
Authorization checks (Ranger/Sentry): Can John launch this job? Can John read this file?
Controlling HDFS authorization is easy/done; controlling Hive row/column-level authorization in Spark is WIP.
Secure data in motion: Wire Encryption with Spark
Channels: Control/RPC, Shuffle BlockTransfer, FS (broadcast, file download), NM shuffle data, and read/write to the data source.
- For HDFS as the data source, use RPC encryption, or SSL with WebHDFS
- For NM shuffle data, use YARN SSL
- Spark supports SSL for FS (broadcast or file download)
- Shuffle block transfer supports SASL-based encryption; SSL support is coming
Spark Communication Encryption Settings
- Shuffle data (NM -> executor): leverages YARN-based SSL
- Control/RPC: spark.authenticate = true; leverage YARN to distribute keys
- Shuffle BlockTransfer: spark.authenticate.enableSaslEncryption = true
- Read/write data: depends on the data source; for HDFS, RPC encryption (RC4 | 3DES) or SSL for WebHDFS
- FS (broadcast, file download): spark.ssl.enabled = true
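Pulled together, the properties above can be sketched in spark-defaults.conf. This is a hedged example: the keystore/truststore paths are illustrative placeholders, and property names should be verified against the documentation for the Spark version in use:

```
# spark-defaults.conf (sketch; paths are illustrative)
spark.authenticate                       true
spark.authenticate.enableSaslEncryption  true
spark.ssl.enabled                        true
spark.ssl.keyStore           /etc/spark/conf/keystore.jks
spark.ssl.keyStorePassword   ********
spark.ssl.trustStore         /etc/spark/conf/truststore.jks
spark.ssl.trustStorePassword ********
```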
Sharp Edges with Spark Security
- SparkSQL: only coarse-grained access control today
- Client -> Spark Thrift Server -> Spark executors: no identity propagation on the second hop. This lowers security and forces STS to run as the Hive user to read all data. Workaround: use SparkSQL via the shell or the programmatic API. https://issues.apache.org/jira/browse/SPARK-5159
- Spark Streaming + Kafka + Kerberos: no SSL support yet
- Spark shuffle: only SASL, no SSL support; no encryption for spill-to-disk or intermediate data
SparkSQL: Fine grained security
Key Features: Spark Column Security with LLAP
- Fine-grained column-level access control for SparkSQL
- Fully dynamic policies per user; doesn't require views
- Uses standard Ranger policies and tools to control access and masking policies
Flow:
1. SparkSQL gets data locations known as "splits" from HiveServer and plans the query
2. HiveServer2 authorizes access using Ranger; per-user policies like row filtering are applied
3. Spark gets a modified query plan based on the dynamic security policy
4. Spark reads data from LLAP; filtering/masking is guaranteed by the LLAP server
(Diagram: Spark client, HiveServer2 authorization backed by Ranger dynamic policies, Hive Metastore with data locations and view definitions, and LLAP data read with filter pushdown.)
Example: Per-User Row Filtering by Region in SparkSQL
Original query: SELECT * FROM CUSTOMERS WHERE total_spend > 10000
Query rewrites based on dynamic Ranger policies:
- Spark User 1 (West Region), dynamic rewrite: SELECT * FROM CUSTOMERS WHERE total_spend > 10000 AND region = 'west'
- Spark User 2 (East Region), dynamic rewrite: SELECT * FROM CUSTOMERS WHERE total_spend > 10000 AND region = 'east'
LLAP data access (sample CUSTOMERS data):
User ID | Region | Total Spend
1       | East   | 5,131
2       |        | 27,828
3       | West   | 55,493
4       |        | 7,193
5       |        | 18,193
Fine-grained security for SparkSQL: http://bit.ly/2bLghGz http://bit.ly/2bTX7Pm
Apache Zeppelin Security
Apache Zeppelin: Authentication + SSL (Diagram: John Doe connects to Zeppelin over SSL through the firewall; Zeppelin authenticates against LDAP and runs executors via Spark on YARN.)
Security in Apache Zeppelin? Zeppelin leverages Apache Shiro for authentication/authorization
Example shiro.ini (edit with Ambari or your favorite text editor):
# =======================
# Shiro INI configuration
[main]
## LDAP/AD configuration
[users]
# The 'users' section is for simple deployments
# when you only need a small, statically-defined set of user accounts
[urls]
# The 'urls' section is used for URL-based security
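For a quick test before wiring up LDAP/AD, the [users] section can hold static accounts in Shiro's standard "username = password, role1, role2" syntax. A minimal sketch; the account names, passwords, and the admin role here are illustrative only:

```
[users]
# username = password, role1, role2  (illustrative accounts, never for production)
admin = secret, admin
john  = changeme, user
```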
Apache Zeppelin: AD Authentication
Configure Zeppelin to use AD:
[main]
activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
activeDirectoryRealm.systemUsername = XXXXX
activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX
activeDirectoryRealm.searchBase = DC=hdpqa,DC=Example,DC=com
activeDirectoryRealm.url = ldap://hdpqa.example.com:389
activeDirectoryRealm.principalSuffix = @hdpqa.example.com
activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin"
activeDirectoryRealm.authorizationCachingEnabled = true
sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager
cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager
securityManager.cacheManager = $cacheManager
securityManager.sessionManager = $sessionManager
securityManager.sessionManager.globalSessionTimeout = 86400000
shiro.loginUrl = /api/login
Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
Apache Zeppelin: LDAP Authentication
Configure Zeppelin to use LDAP (two realm implementations are shown; pick one, since in an INI file the later assignment wins):
[main]
ldapRealm = org.apache.zeppelin.server.LdapGroupRealm
ldapRealm = org.apache.shiro.realm.ldap.JndiLdapRealm
ldapRealm.contextFactory.environment[ldap.searchBase] = DC=hdpqa,DC=example,DC=com
ldapRealm.userDnTemplate = uid={0},OU=Accounts,DC=hdpqa,DC=example,DC=com
ldapRealm.contextFactory.url = ldaps://hdpqa.example.com:636
ldapRealm.contextFactory.authenticationMechanism = SIMPLE
sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager
securityManager.sessionManager = $sessionManager
# 86,400,000 milliseconds = 24 hours
securityManager.sessionManager.globalSessionTimeout = 86400000
shiro.loginUrl = /api/login
Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
Don’t want passwords in the clear in shiro.ini?
Zeppelin leverages the Hadoop Credential API. Create an entry for the AD credential:
hadoop credential create activeDirectoryRealm.systemPassword -provider jceks://file/etc/zeppelin/conf/credentials.jceks
Enter password:
Enter password again:
activeDirectoryRealm.systemPassword has been successfully created.
org.apache.hadoop.security.alias.JavaKeyStoreProvider has been updated.
Make credentials.jceks readable only by the Zeppelin user: chmod 400, so only the Zeppelin process has r/w access and no other user can access the credentials.
Then edit shiro.ini to point the realm at the credential store instead of a clear-text password:
activeDirectoryRealm.hadoopSecurityCredentialPath = jceks://file/etc/zeppelin/conf/credentials.jceks
Want to connect to LDAP over SSL?
Change the protocol to ldaps in shiro.ini:
ldapRealm.contextFactory.url = ldaps://hdpqa.example.com:636
If LDAP is using a self-signed certificate, import the certificate into the truststore of the JVM running Zeppelin:
echo -n | openssl s_client -connect ldap.example.com:636 | \
  sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > /tmp/examplecert.crt
keytool -import -keystore $JAVA_HOME/jre/lib/security/cacerts \
  -storepass changeit -noprompt -alias mycert -file /tmp/examplecert.crt
Zeppelin + Livy End-to-End Security (Diagram: John Doe authenticates to Zeppelin against LDAP over LDAP/LDAPS; Zeppelin's Spark group interpreter calls the Livy APIs using SPNEGO (Kerberos); Livy submits to YARN over Kerberos/RPC, and the Livy job runs as John Doe.)
Apache Zeppelin: Authorization
Note-level authorization: grant permissions (Owner, Reader, Writer) to users/groups on notes; LDAP group integration.
Zeppelin UI authorization: allow only admins to configure interpreters; configured in shiro.ini:
[urls]
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
Authorization at the data level (identity propagation):
- For Spark with Zeppelin > Livy > Spark: jobs run as the end user
- For Hive with Zeppelin: via the JDBC interpreter
- The shell interpreter runs as the end user
Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
Map the admin role to an AD group
Allows the mapped AD group access to configure interpreters:
[main]
activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
activeDirectoryRealm.systemUsername = XXXXX
activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX
activeDirectoryRealm.searchBase = DC=hdpqa,DC=Example,DC=com
activeDirectoryRealm.url = ldap://hdpqa.example.com:389
activeDirectoryRealm.principalSuffix = @hdpqa.example.com
activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin"
activeDirectoryRealm.authorizationCachingEnabled = true
sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager
cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager
securityManager.cacheManager = $cacheManager
securityManager.sessionManager = $sessionManager
securityManager.sessionManager.globalSessionTimeout = 86400000
User reports: Can’t see the Interpreter page
Zeppelin has URL-based access control enabled; either the user does not have the role, or the role is incorrectly mapped. Check the [main] section:
[main]
activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
activeDirectoryRealm.systemUsername = XXXXX
activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX
activeDirectoryRealm.searchBase = DC=hdpqa,DC=Example,DC=com
activeDirectoryRealm.url = ldap://hdpqa.example.com:389
activeDirectoryRealm.principalSuffix = @hdpqa.example.com
activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin"
activeDirectoryRealm.authorizationCachingEnabled = true
sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager
cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager
securityManager.cacheManager = $cacheManager
securityManager.sessionManager = $sessionManager
securityManager.sessionManager.globalSessionTimeout = 86400000
User reports: Livy interpreter fails to run with an access error
Ensure Livy has impersonation enabled and the ability to proxy users.
In /etc/livy/conf/livy.conf:
livy.impersonation.enabled true
Edit HDFS core-site.xml via Ambari:
<property>
  <name>hadoop.proxyuser.livy_qa.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.livy_qa.hosts</name>
  <value>*</value>
</property>
Apache Zeppelin: Credentials
Credentials in Zeppelin:
- LDAP/AD account: Zeppelin leverages the Hadoop Credential API
- Interpreter credentials: not solved yet; this is still an open issue
Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
Thank You Vinay Shukla @neomythos All images from Flickr Commons