
Vinay Shukla Director, Product Management Dec 8, 2016


Presentation on theme: "Vinay Shukla Director, Product Management Dec 8, 2016"— Presentation transcript:

1 Apache Spark & Apache Zeppelin: Enterprise Security for production deployments
Vinay Shukla Director, Product Management Dec 8, 2016 Hortonworks: Powering the Future of Data

2 Thank You
- Thank you to all the users of Hadoop & Spark
- Thank you if you are developing, contributing to Hadoop & Spark
- Thank you for coming to this session

3 whoami
- Recovering Programmer, Product Management
- Product Management: Spark for years, Hadoop for 3+ years
- Blog at
- Addicted to Yoga, Hiking, & Coffee
- Smallest contributor to Apache Zeppelin

4 What are the enterprise security requirements?
- Spark user should be authenticated
- Integrate with corporate LDAP/AD
- Allow only authorized users access
- Audit all access
- Protect data both in motion & at rest
- Easily manage all security
- Make security easy to manage

5 Security: Rings of Defense
- Data Protection: Wire encryption, HDFS TDE/DARE, Others
- Perimeter Level Security: Network Security (i.e. Firewalls)
- Authentication: Kerberos, Knox (Other Gateways)
- OS Security
- Authorization: Apache Ranger/Sentry, HDFS Permissions, HDFS ACLs, YARN ACL

6 Interacting with Spark
[Diagram: Spark-Shell, Spark Thrift Server, Zeppelin and REST Server; labels: Driver, Ex (executor), Spark on YARN]

7 Context: Spark Deployment Modes
Spark on YARN
- Spark driver (SparkContext) in YARN AM (yarn-cluster)
- Spark driver (SparkContext) in local (yarn-client): Spark Shell & Spark Thrift Server run in yarn-client only
[Diagram: YARN-Client and YARN-Cluster panels; labels: Client, App Master, Spark Driver, Executor]

8 Spark on YARN
[Diagram: John Doe, Spark Submit, YARN RM, Spark AM, Node Manager, Executor and HDFS in the Hadoop cluster; steps numbered 1-4]
John Doe first authenticates to Kerberos before launching Spark Shell:

    kinit -kt /etc/security/keytabs/johndoe.keytab
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-cluster --num-executors 3 --driver-memory 512m \
        --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

9 Spark – Security – Four Pillars
- Authentication, Authorization, Audit, Encryption
- The first step of security is network security
- The second step of security is authentication
- Most Hadoop ecosystem projects rely on Kerberos for authentication
- Kerberos: 3 Headed Guard Dog
- Ensure network is secure
- Spark leverages Kerberos on YARN

10 Authenticate users with AD/LDAP
[Diagram: John Doe, AD/LDAP, KDC, Spark AM, NN and the Hadoop cluster, with steps numbered 1-7. Step labels:]
- kinit
- Get service ticket for Spark
- Use Spark ST, submit Spark Job
- Spark gets Namenode (NN) service ticket
- YARN launches Spark Executors using John Doe's identity
- Executor reads from HDFS using John Doe's delegation token

John Doe first authenticates to Kerberos before launching Spark Shell:

    kinit -kt /etc/security/keytabs/johndoe.keytab
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-cluster --num-executors 3 --driver-memory 512m \
        --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

11 Spark – Kerberos - Example
    # Obtain a Kerberos ticket from the user's keytab, then submit the job to YARN
    kinit -kt /etc/security/keytabs/johndoe.keytab
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-cluster --num-executors 3 --driver-memory 512m \
        --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

12 Allow only authorized users access to Spark jobs
- Controlling HDFS authorization is easy/done
- Controlling Hive row/column level authorization in Spark is WIP
[Diagram: John Doe, KDC, YARN Cluster (A, B, C), HDFS and Ranger/Sentry. Labels: "Client gets service ticket for Spark", "Get Namenode (NN) service ticket", "Use Spark ST, submit Spark Job", "Can John launch this job?", "Executors read from HDFS", "Can John read this file"]

13 Secure data in motion: Wire Encryption with Spark
- For HDFS as data source, can use RPC or use SSL with WebHDFS
- For NM Shuffle Data: use YARN SSL
- Spark supports SSL for FS (Broadcast or File download)
- Shuffle Block Transfer supports SASL based encryption; SSL coming
[Diagram: Spark Submit, RM, NM, Shuffle Service, AM, Driver, Ex 1 ... Ex N and Data Source; channels: Control/RPC, Shuffle Data, Shuffle BlockTransfer, Read/Write Data, FS (Broadcast, File Download)]

14 Spark Communication Encryption Settings
- Shuffle Data: NM > Ex leverages YARN based SSL
- Control/RPC: spark.authenticate = true. Leverage YARN to distribute keys
- Shuffle BlockTransfer: spark.authenticate.enableSaslEncryption = true
- Read/Write Data: depends on data source. For HDFS: RPC (RC4 | 3DES) or SSL for WebHDFS
- FS (Broadcast, File Download): spark.ssl.enabled = true
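The three Spark-side settings on this slide can be checked mechanically against a spark-defaults.conf. The Python sketch below is an added example, not code from the presentation or from Spark; the function names and the sample configuration are constructed for illustration.

```python
import re

# Channel -> Spark property that must be "true", as listed on the settings slide.
CHANNEL_SETTINGS = {
    "Control/RPC": "spark.authenticate",
    "Shuffle BlockTransfer": "spark.authenticate.enableSaslEncryption",
    "FS (broadcast, file download)": "spark.ssl.enabled",
}

# Constructed sample spark-defaults.conf (not from the presentation).
SAMPLE_CONF = """
spark.master                              yarn
spark.authenticate                        true
spark.authenticate.enableSaslEncryption = true
spark.ssl.enabled=false
"""

def parse_spark_conf(text):
    """Parse 'key value', 'key=value' and 'key: value' lines; skip comments."""
    conf = {}
    for line in map(str.strip, text.splitlines()):
        if line and not line.startswith("#"):
            parts = re.split(r"\s*[=:]\s*|\s+", line, maxsplit=1)
            if len(parts) == 2:
                conf[parts[0]] = parts[1].strip()
    return conf

def audit_wire_encryption(text):
    """Return {channel: status}; all three properties default to false."""
    conf = parse_spark_conf(text)
    def enabled(key):
        return conf.get(key, "false").lower() == "true"
    report = {}
    for channel, key in CHANNEL_SETTINGS.items():
        report[channel] = "ok" if enabled(key) else f"UNPROTECTED: set {key} = true"
    # SASL encryption only takes effect when authentication is enabled.
    if enabled("spark.authenticate.enableSaslEncryption") and not enabled("spark.authenticate"):
        report["Shuffle BlockTransfer"] = "INEFFECTIVE: requires spark.authenticate = true"
    # These two channels are secured outside Spark's own configuration.
    report["Shuffle Data (NM > Ex)"] = "external: check YARN SSL"
    report["Read/Write Data"] = "external: check data source (HDFS RPC or WebHDFS SSL)"
    return report
```

The two "external" entries are reminders only: YARN and HDFS settings live in their own configuration files and are not inspected here.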

15 Sharp Edges with Spark Security
- SparkSQL: only coarse grain access control today
- Client -> Spark Thrift Server > Spark Executors: no identity propagation on 2nd hop
  - Lowers security, forces STS to run as Hive user to read all data
  - Use SparkSQL via shell or programmatic API
- Spark Stream + Kafka + Kerberos
  - No SSL support yet
- Spark Shuffle > Only SASL, no SSL support
- Spark Shuffle > No encryption for spill to disk or intermediate data

16 SparkSQL: Fine grained security

17 Key Features: Spark Column Security with LLAP
- Fine-grained column level access control for SparkSQL.
- Fully dynamic policies per user. Doesn't require views.
- Use standard Ranger policies and tools to control access and masking policies.

Flow:
- SparkSQL gets data locations known as "splits" from HiveServer and plans query.
- HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied.
- Spark gets a modified query plan based on dynamic security policy.
- Spark reads data from LLAP. Filtering / masking guaranteed by LLAP server.

[Diagram: Spark Client, HiveServer2 (Authorization), Hive Metastore (Data Locations, View Definitions), Ranger Server (Dynamic Policies), LLAP (Data Read, Filter Pushdown); steps numbered 1-4]

18 Example: Per-User Row Filtering by Region in SparkSQL
Original query:

    SELECT * from CUSTOMERS WHERE total_spend > 10000

Query rewrites based on dynamic Ranger policies:

    Spark User 1 (West Region), dynamic rewrite:
        SELECT * from CUSTOMERS WHERE total_spend > 10000 AND region = "west"

    Spark User 2 (East Region), dynamic rewrite:
        SELECT * from CUSTOMERS WHERE total_spend > 10000 AND region = "east"

LLAP Data Access:

    User ID   Region   Total Spend
    1         East     5,131
    2                  27,828
    3         West     55,493
    4                  7,193
    5                  18,193

Fine grained Security to SparkSQL

19 Apache Zeppelin Security

20 Apache Zeppelin: Authentication + SSL
[Diagram: John Doe, Firewall, SSL, LDAP, Spark on YARN, Ex (executor); steps numbered 1-3]

21 Security in Apache Zeppelin?
Zeppelin leverages Apache Shiro for authentication/authorization

22 Edit with Ambari or your favorite text editor
Example shiro.ini:

    # =======================
    # Shiro INI configuration

    [main]
    ## LDAP/AD configuration

    [users]
    # The 'users' section is for simple deployments
    # when you only need a small number of statically-defined
    # set of User accounts.

    [urls]
    # The 'urls' section is used for url-based security

23 Apache Zeppelin: AD Authentication
Configure Zeppelin to use AD:

    [main]
    activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
    activeDirectoryRealm.systemUsername = XXXXX
    activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX
    activeDirectoryRealm.searchBase = DC=hdpqa,DC=Example,DC=com
    activeDirectoryRealm.url = ldap://hdpqa.example.com:389
    activeDirectoryRealm.principalSuffix
    activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin"
    activeDirectoryRealm.authorizationCachingEnabled = true
    sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager
    cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager
    securityManager.cacheManager = $cacheManager
    securityManager.sessionManager = $sessionManager
    securityManager.sessionManager.globalSessionTimeout =
    shiro.loginUrl = /api/login

Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)

24 Apache Zeppelin: LDAP Authentication
Configure Zeppelin to use LDAP:

    [main]
    ldapRealm = org.apache.zeppelin.server.LdapGroupRealm
    ldapRealm = org.apache.shiro.realm.ldap.JndiLdapRealm
    ldapRealm.contextFactory.environment[ldap.searchBase] = DC=hdpqa,DC=example,DC=com
    ldapRealm.userDnTemplate = uid={0},OU=Accounts,DC=hdpqa,DC=example,DC=com
    ldapRealm.contextFactory.url = ldaps://hdpqa.example.com:636
    ldapRealm.contextFactory.authenticationMechanism = SIMPLE
    sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager
    securityManager.sessionManager = $sessionManager
    # 86,400,000 milliseconds = 24 hour
    securityManager.sessionManager.globalSessionTimeout = 86400000
    shiro.loginUrl = /api/login

25 Don’t want passwords in clear in shiro.ini?
Create an entry for the AD credential. Zeppelin leverages the Hadoop Credential API:

    hadoop credential create activeDirectoryRealm.systemPassword -provider jceks:///etc/zeppelin/conf/credentials.jceks
    Enter password:
    Enter password again:
    activeDirectoryRealm.systemPassword has been successfully created.
    org.apache.hadoop.security.alias.JavaKeyStoreProvider has been updated.

Make credentials.jceks only Zeppelin user readable: chmod 400 with only Zeppelin process r/w access, no other user allowed access to credentials.

Edit shiro.ini:

    activeDirectoryRealm.systemPassword -provider jceks://etc/zeppelin/conf/credentials.jceks

26 Want to connect to LDAP over SSL?
Change the protocol to ldaps in shiro.ini:

    ldapRealm.contextFactory.url = ldaps://hdpqa.example.com:636

If LDAP is using a self-signed certificate, import the certificate into the truststore of the JVM running Zeppelin:

    # Fetch the server certificate from the LDAPS port (636); port 389 is plain
    # LDAP and does not present a certificate without StartTLS.
    echo -n | openssl s_client -connect ldap.example.com:636 | \
        sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > /tmp/examplecert.crt
    # Confirm the certificate fingerprint out of band before trusting it.
    keytool -import -keystore $JAVA_HOME/jre/lib/security/cacerts \
        -storepass changeit -noprompt -alias mycert -file /tmp/examplecert.crt

27 Zeppelin + Livy E2E Security
[Diagram: John Doe authenticates to Zeppelin, which checks LDAP over LDAP/LDAPS; Zeppelin's Spark Group Interpreter (Ispark) calls the Livy APIs using SPNego: Kerberos; Livy talks to Yarn over Kerberos/RPC; the Spark job runs as John Doe]

28 Apache Zeppelin: Authorization
Authorization in Zeppelin
- Note level authorization
  - Grant permissions (Owner, Reader, Writer) to users/groups on Notes
  - LDAP group integration
- Zeppelin UI authorization
  - Allow only admins to configure interpreter
  - Configured in shiro.ini

Authorization at Data Level
- For Spark with Zeppelin > Livy > Spark
  - Identity propagation: jobs run as end-user
- For Hive with Zeppelin > JDBC interpreter
- Shell interpreter
  - Runs as end-user

    [urls]
    /api/interpreter/** = authc, roles[admin]
    /api/configurations/** = authc, roles[admin]
    /api/credential/** = authc, roles[admin]

29 Map admin role to AD Group
Allows the mapped AD group access to configure interpreters:

    [main]
    activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
    activeDirectoryRealm.systemUsername = XXXXX
    activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX
    activeDirectoryRealm.searchBase = DC=hdpqa,DC=Example,DC=com
    activeDirectoryRealm.url = ldap://hdpqa.example.com:389
    activeDirectoryRealm.principalSuffix
    activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin"
    activeDirectoryRealm.authorizationCachingEnabled = true
    sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager
    cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager
    securityManager.cacheManager = $cacheManager
    securityManager.sessionManager = $sessionManager
    securityManager.sessionManager.globalSessionTimeout =

30 User reports: Can’t see interpreter Page
- Zeppelin has URL based access control enabled
- User does not have the role
- Or role incorrectly mapped

    [main]
    activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
    activeDirectoryRealm.systemUsername = XXXXX
    activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX
    activeDirectoryRealm.searchBase = DC=hdpqa,DC=Example,DC=com
    activeDirectoryRealm.url = ldap://hdpqa.example.com:389
    activeDirectoryRealm.principalSuffix
    activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin"
    activeDirectoryRealm.authorizationCachingEnabled = true
    sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager
    cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager
    securityManager.cacheManager = $cacheManager
    securityManager.sessionManager = $sessionManager
    securityManager.sessionManager.globalSessionTimeout =
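The shiro.ini points from slides 23 to 30 (clear-text bind password, ldap:// versus ldaps://, admin-only [urls] rules, and roles that must appear in groupRolesMap) can be checked in one pass. The Python sketch below is an added example, not code from the presentation or from Zeppelin; the function names and the sample file are constructed, with two rules deliberately weakened to show the findings.

```python
import re

# Constructed sample that combines settings shown on the slides (not a real deployment).
SAMPLE_SHIRO_INI = """
[main]
activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX
activeDirectoryRealm.url = ldap://hdpqa.example.com:389
activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin"
[urls]
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[ops]
/api/credential/** = authc
"""

ADMIN_URLS = ("/api/interpreter/**", "/api/configurations/**", "/api/credential/**")

def parse_ini(text):
    """Minimal section-aware parser: returns {section: {key: value}}."""
    sections, current = {}, None
    for line in map(str.strip, text.splitlines()):
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
        elif current is not None and "=" in line and not line.startswith("#"):
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return sections

def audit_shiro(text):
    """Return a list of findings for the checks the slides describe."""
    ini = parse_ini(text)
    main, urls = ini.get("main", {}), ini.get("urls", {})
    findings, mapped_roles = [], set()
    for key, value in main.items():
        if key.endswith(".systemPassword") and value:
            findings.append(f"{key}: password stored in clear text")
        if key.endswith(".url") and value.startswith("ldap://"):
            findings.append(f"{key}: bind traffic is not TLS-protected, use ldaps://")
        if key.endswith(".groupRolesMap"):
            mapped_roles.update(re.findall(r'":"([^"]+)"', value))
    for path in ADMIN_URLS:
        rule = urls.get(path, "")
        roles = [r.strip() for grp in re.findall(r"roles\[([^\]]*)\]", rule) for r in grp.split(",")]
        if "authc" not in rule or not roles:
            findings.append(f"{path}: not restricted to an authenticated role")
        for role in roles:
            if role not in mapped_roles:
                findings.append(f"{path}: role '{role}' is not in any groupRolesMap")
    return findings
```

A role named in [urls] but absent from groupRolesMap is the "role incorrectly mapped" case: no AD group can ever satisfy that rule, so the page stays hidden for everyone.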

31 User reports: Livy interpreter fails to run with access error
- Ensure Livy has the ability to proxy the user
- Ensure Livy has impersonation enabled

In /etc/livy/conf/livy.conf:

    livy.impersonation.enabled true

Edit HDFS core-site.xml via Ambari:

    <property>
      <name>hadoop.proxyuser.livy_qa.groups</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.livy_qa.hosts</name>
      <!-- value not shown in the slide transcript -->
    </property>

32 Apache Zeppelin: Credentials
Credentials in Zeppelin
- LDAP/AD account
  - Zeppelin leverages Hadoop Credential API
- Interpreter credentials
  - Not solved yet
  - This is still an open issue

33 Thank You
Vinay Shukla @neomythos
All Images from Flickr Commons

