Vinay Shukla Director, Product Management Dec 8, 2016

Slides:



Advertisements
Similar presentations
Authenticating Users. Objectives Explain why authentication is a critical aspect of network security Explain why firewalls authenticate and how they identify.
Advertisements

Securing the Hadoop Ecosystem
FIREWALLS & NETWORK SECURITY with Intrusion Detection and VPNs, 2 nd ed. 10 Authenticating Users By Whitman, Mattord, & Austin© 2008 Course Technology.
Hortonworks. We do Hadoop.
MongoDB Sharding and its Threats
Developing and Deploying Apache Hadoop Security Owen O’Malley - Hortonworks Co-founder and © Hortonworks Inc.
Making Apache Hadoop Secure Devaraj Das Yahoo’s Hadoop Team.
Course 201 – Administration, Content Inspection and SSL VPN
Edwin Sarmiento Microsoft MVP – Windows Server System Senior Systems Engineer/Database Administrator Fujitsu Asia Pte Ltd
TAM STE Series 2008 © 2008 IBM Corporation WebSEAL SSO, Session 108/2008 TAM STE Series WebSEAL SSO, Session 1 Presented by: Andrew Quap.
Slide Master Layout Useful for revisions and projector test  First-level bullet  Second levels  Third level  Fourth level  Fifth level  Drop body.
Introduction to SQL Server 2000 Security Dave Watts CTO, Fig Leaf Software
MCSE Guide to Microsoft Exchange Server 2003 Administration Chapter Four Configuring Outlook and Outlook Web Access.
Survey of Identity Repository Security Models JSR 351, Sep 2012.
Windows Security. Security Windows 2000/XP Professional security oriented Authentication Authorization Internet Connection Firewall.
CAS Lightning Talk Jasig-Sakai 2012 Tuesday June 12th 2012 Atlanta, GA Andrew Petro - Unicon, Inc.
ArcGIS Server for Administrators
Planning a Microsoft Windows 2000 Administrative Structure Designing default administrative group membership Designing custom administrative groups local.
Ins and Outs of Authenticating Users Requests to IIS 6.0 and ASP.NET Chris Adams Program Manager IIS Product Unit Microsoft Corporation.
Asia Pacific SharePoint Conference 2007 May 15th to 16th, 2007 Hilton Hotel Sydney.
© Hortonworks Inc Hadoop and Kerberos: The madness beyond the gate Steve 2015.
Monitoring Hive: Metrics and WebUI
Lens Server REST API for querying and schema update JDBC Client Java Client CLI Applications – Reporting, Ad Hoc Queries OLAP Cube Metastore Hive (MR)
JN0-561 Juniper Juniper Networks Certified Internet Associate, J-series Visit:
Spark and Jupyter 1 IT - Analytics Working Group - Luca Menichetti.
Agenda  Microsoft Directory Synchronization Tool  Active Directory Federation Server  ADFS Proxy  Hybrid Features – LAB.
Learn. Hadoop Online training course is designed to enhance your knowledge and skills to become a successful Hadoop developer and In-depth knowledge of.
Page 1 © Hortonworks Inc – All Rights Reserved Apache Hadoop - Virtualization Winter 2015 Version 1.4 Hortonworks. We do Hadoop.
ArcGIS for Server Security: Advanced
BUILD BIG DATA ENTERPRISE SOLUTIONS FASTER ON AZURE HDINSIGHT
Secure Connected Infrastructure
Jun Rao co-founder at Confluent, Inc
Data Virtualization Tutorial… SSL with CIS Web Data Sources
Enterprise grade security in your Hadoop clusters on Azure
Flink Security Enhancements
Consuming OAuth Services in Alfresco Share
Daniel Templeton, Cloudera, Inc.
Introduction to Distributed Platforms
ITCS-3190.
Spark and YARN: Better Together
Institute for Cyber Security
6/16/2018 © 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks.
How to Solve BigData Security Puzzle?
BBMRI Competence Centre Status Report
Spark Presentation.
Active Directory Fundamentals
Radius, LDAP, Radius used in Authenticating Users
Introduction to SQL Server 2000 Security
Power BI Security Best Practices
Common Security Mistakes
Excel Services Deployment and Administration
IBM Certified WAS 8.5 Administrator
Enterprise security for big data solutions on Azure HDInsight
IIS.
HDFS on Kubernetes -- Lessons Learned
How to Protect Big Data in a Containerized Environment
SharePoint Online Hybrid – Configure Outbound Search
AWS Cloud Computing Masaki.
HDFS on Kubernetes -- Lessons Learned
Spark and Scala.
Designing IIS Security (IIS – Internet Information Service)
IBM C IBM Big Data Engineer. You want to train yourself to do better in exam or you want to test your preparation in either situation Dumpspedia’s.
Oracle 1z0-928 Oracle Cloud Platform Big Data Management 2018 Associate.
Azure AD Simon May Technical Evangelist.
Securing web applications Externally
Introduction to Azure Data Lake
9/8/ :03 PM © 2006 Microsoft Corporation. All rights reserved.
Pig Hive HBase Zookeeper
Presentation transcript:

Apache Spark & Apache Zeppelin: Enterprise Security for production deployments Vinay Shukla Director, Product Management Dec 8, 2016 Twitter: @neomythos Hortonworks: Powering the Future of Data

Thank You Thank you all the users of Hadoop & Spark Thank you if you are developing, contributing to Hadoop & Spark Thank you for coming to this session. Hortonworks: Powering the Future of Data

whoami Recovering Programmer, Product Management Product Management Spark for 2.5 + years, Hadoop for 3+ years Blog at www.vinayshukla.com Twitter: @neomythos Addicted to Yoga, Hiking, & Coffee Smallest contributor to Apache Zeppelin

What are the enterprise security requirements? Spark user should be authenticated Integrate with corporate LDAP/AD Allow only authorized users access Audit all access Protect data both in motion & at rest Easily manage all security Make security easy to manage …

Security: Rings of Defense Data Protection Wire encryption HDFS TDE/DARE Others Perimeter Level Security Network Security (i.e. Firewalls) Authentication Kerberos Knox (Other Gateways) OS Security Authorization Apache Ranger/Sentry HDFS Permissions HDFS ACLs YARN ACL

Interacting with Spark Spark Thrift Server Driver Zeppelin REST Server Ex Ex Driver Driver Spark on YARN Spark- Shell Driver

Context: Spark Deployment Modes Spark on YARN Spark driver (SparkContext) in YARN AM(yarn-cluster) Spark driver (SparkContext) in local (yarn-client): Spark Shell & Spark Thrift Server runs in yarn-client only Client App Master Client App Master Spark Driver Spark Driver Executor Executor YARN-Client YARN-Cluster

Spark on YARN Hadoop Cluster Spark Submit 2 Spark AM 3 YARN RM 1 4 John Doe John Doe first authenticates to Kerberos before launching Spark Shell kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10 Hadoop Cluster HDFS Node Manager Executor

Spark – Security – Four Pillars Authentication Authorization Audit Encryption The first step of security is network security The second step of security is Authentication Most Hadoop echo system projects rely on Kerberos for Authentication Kerberos – 3 Headed Guard Dog : https://en.wikipedia.org/wiki/Cerberus Ensure network is secure Spark leverages Kerberos on YARN

Authenticate users with AD/LDAP kinit YARN launches Spark Executors using John Doe’s identity 1 6 Spark AM 3 7 Use Spark ST, submit Spark Job Executor reads from HDFS using John Doe’s delegation token NN 5 John Doe John Doe first authenticates to Kerberos before launching Spark Shell kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10 2 Spark gets Namenode (NN) service ticket Hadoop Cluster 4 Get service ticket for Spark, AD/LDAP KDC

Spark – Kerberos - Example kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m - -executor-memory 512m --executor-cores 1 lib/spark- examples*.jar 10

Allow only authorized users access to Spark jobs Can John read this file Can John launch this job? Ranger/Sentry HDFS Executors read from HDFS Use Spark ST, submit Spark Job YARN Cluster A B C Controlling HDFS Authorization is easy/Done Controlling Hive row/column level authorization in Spark is WIP John Doe Get Namenode (NN) service ticket Client gets service ticket for Spark KDC

Secure data in motion: Wire Encryption with Spark Shuffle Data Spark Submit Shuffle Service Control/RPC RM NM Data Source Shuffle BlockTransfer For HDFS as Data Source can use RPC or use SSL with WebHDFS For NM Shuffle Data – Use YARN SSL Spark support SSL for FS (Broadcast or File download) Shuffle Block Transfer supports SASL based encryption – SSL coming Read/Write Data AM Driver Ex 1 Ex N FS – Broadcast, File Download

Spark Communication Encryption Settings Shuffle Data NM > Ex leverages YARN based SSL Control/RPC spark.authenticate = true. Leverage YARN to distribute keys Shuffle BlockTransfer spark.authenticate.enableSaslEncryption= true Read/Write Data Depends on Data Source, For HDFS RPC (RC4 | 3DES) or SSL for WebHDFS FS – Broadcast, File Download spark.ssl.enabled = true

Sharp Edges with Spark Security SparkSQL – Only coarse grain access control today Client -> Spark Thrift Server > Spark Executors – No identity propagation on 2nd hop Lowers security, forces STS to run as Hive user to read all data Use SparkSQL via shell or programmatic API https://issues.apache.org/jira/browse/SPARK-5159 Spark Stream + Kafka + Kerberos No SSL support yet Spark Shuffle > Only SASL, no SSL support Spark Shuffle > No encryption for spill to disk or intermediate data

SparkSQL: Fine grained security

Key Features: Spark Column Security with LLAP Fine-Grained Column Level Access Control for SparkSQL. Fully dynamic policies per user. Doesn’t require views. Use Standard Ranger policies and tools to control access and masking policies. Flow: SparkSQL gets data locations known as “splits” from HiveServer and plans query. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied. Spark gets a modified query plan based on dynamic security policy. Spark reads data from LLAP. Filtering / masking guaranteed by LLAP server. HiveServer2 Authorization Hive Metastore Data Locations View Definitions LLAP Data Read Filter Pushdown Ranger Server Dynamic Policies Spark Client 1 2 4 3

Example: Per-User Row Filtering by Region in SparkSQL Spark User 1 (West Region) Original Query: SELECT * from CUSTOMERS WHERE total_spend > 10000 Query Rewrites based on Dynamic Ranger Policies Spark User 2 (East Region) Dynamic Rewrite: SELECT * from CUSTOMERS WHERE total_spend > 10000 AND region = “east” AND region = “west” LLAP Data Access User ID Region Total Spend 1 East 5,131 2 27,828 3 West 55,493 4 7,193 5 18,193 Fine grained Security to SparkSQL http://bit.ly/2bLghGz http://bit.ly/2bTX7Pm

Apache Zeppelin Security Hortonworks: Powering the Future of Data

Apache Zeppelin: Authentication + SSL Ex 3 1 SSL John Doe Spark on YARN 2 Firewall LDAP

Security in Apache Zeppelin? Zeppelin leverages Apache Shiro for authentication/authorization

Edit with Ambari or your favorite text editor Example Shiro.ini # ======================= # Shiro INI configuration [main] ## LDAP/AD configuration [users] # The 'users' section is for simple deployments # when you only need a small number of statically-defined # set of User accounts. [urls] # The 'urls' section is used for url-based security # Edit with Ambari or your favorite text editor

Apache Zeppelin: AD Authentication Configure Zeppelin to use AD [main] activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm activeDirectoryRealm.systemUsername = XXXXX activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX activeDirectoryRealm.searchBase = DC=hdpqa,DC=Example,DC=com activeDirectoryRealm.url = ldap://hdpqa.example.com:389 activeDirectoryRealm.principalSuffix = @hdpqa.example.com activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin" activeDirectoryRealm.authorizationCachingEnabled = true sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager securityManager.cacheManager = $cacheManager securityManager.sessionManager = $sessionManager securityManager.sessionManager.globalSessionTimeout = 86400000 shiro.loginUrl = /api/login Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)

Apache Zeppelin: LDAP Authentication Configure Zeppelin to use LDAP [main] ldapRealm = org.apache.zeppelin.server.LdapGroupRealm ldapRealm = org.apache.shiro.realm.ldap.JndiLdapRealm ldapRealm.contextFactory.environment[ldap.searchBase] = DC=hdpqa,DC=example,DC=com ldapRealm.userDnTemplate = uid={0},OU=Accounts,DC=hdpqa,DC=example,DC=com ldapRealm.contextFactory.url = ldaps://hdpqa.example.com:636 ldapRealm.contextFactory.authenticationMechanism = SIMPLE sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager securityManager.sessionManager = $sessionManager # 86,400,000 milliseconds = 24 hour securityManager.sessionManager.globalSessionTimeout = 86400000 shiro.loginUrl = /api/login Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)

Don’t want passwords in clear in shiro.ini? Create an entry for AD credential Zeppelin leverages Hadoop Credential API hadoop credential create activeDirectoryRealm.systemPassword -provider jceks:///etc/zeppelin/conf/credentials.jceks Enter password: Enter password again: activeDirectoryRealm.systemPassword has been successfully created. org.apache.hadoop.security.alias.JavaKeyStoreProvider has been updated. Make credentials.jceks only Zeppelin user readable chmod 400 with only Zeppelin process r/w access, no other user allowed access to Credentials Edit shiro.in activeDirectoryRealm.systemPassword -provider jceks://etc/zeppelin/conf/credentials.jceks

Want to connect to LDAP over SSL? Change protocol to ldaps in shiro.ini ldapRealm.contextFactory.url = ldaps://hdpqa.example.com:636 If LDAP is using self signed certificate, import the certificate into truststore of JVM running Zeppelin echo -n | openssl s_client –connect ldap.example.com:389 | \ sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > /tmp/examplecert.crt keytool –import -keystore $JAVA_HOME/jre/lib/security/cacerts \ -storepass changeit -noprompt -alias mycert -file /tmp/examplecert.crt

Zeppelin + Livy E2E Security Spark Ispark Group Interpreter Livy APIs Livy Job runs as John Doe SPNego: Kerberos Kerberos/RPC John Doe Yarn Zeppelin LDAP/LDAPS LDAP Hortonworks: Powering the Future of Data

Apache Zeppelin: Authorization Authorization in Zeppelin Authorization at Data Level Note level authorization Grant Permissions (Owner, Reader, Writer) to users/groups on Notes LDAP Group integration Zeppelin UI Authorization Allow only admins to configure interpreter Configured in shiro.ini For Spark with Zeppelin > Livy > Spark Identity Propagation Jobs run as End-User For Hive with Zeppelin > JDBC interpreter Shell Interpreter Runs as end-user Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks) [urls] /api/interpreter/** = authc, roles[admin] /api/configurations/** = authc, roles[admin] /api/credential/** = authc, roles[admin]

Map admin role to AD Group Allows mapped AD group access to Configure Interpreters [main] activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm activeDirectoryRealm.systemUsername = XXXXX activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX activeDirectoryRealm.searchBase = DC=hdpqa,DC=Example,DC=com activeDirectoryRealm.url = ldap://hdpqa.example.com:389 activeDirectoryRealm.principalSuffix = @hdpqa.example.com activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin" activeDirectoryRealm.authorizationCachingEnabled = true sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager securityManager.cacheManager = $cacheManager securityManager.sessionManager = $sessionManager securityManager.sessionManager.globalSessionTimeout = 86400000

User reports: Can’t see interpreter Page Zeppelin has URL based access control enabled User does not have the role Or Role incorrectly mapped [main] activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm activeDirectoryRealm.systemUsername = XXXXX activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX activeDirectoryRealm.searchBase = DC=hdpqa,DC=Example,DC=com activeDirectoryRealm.url = ldap://hdpqa.example.com:389 activeDirectoryRealm.principalSuffix = @hdpqa.example.com activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin" activeDirectoryRealm.authorizationCachingEnabled = true sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager securityManager.cacheManager = $cacheManager securityManager.sessionManager = $sessionManager securityManager.sessionManager.globalSessionTimeout = 86400000

User reports: Livy interpreter fails to run with access error Ensure Livy has ability to proxy user Ensure Livy has Impersonation enabled In /etc/livy/conf/livy.conf livy.impersonation.enabled true Edit HDFS core-site.xml via Ambari: <property> <name>hadoop.proxyuser.livy_qa.groups</name> <value>*</value> </property> <name>hadoop.proxyuser.livy_qa.hosts</name>

Apache Zeppelin: Credentials Credentials in Zeppelin LDAP/AD account Zeppelin leverages Hadoop Credential API Interpreter Credentials Not solved yet This is still an open issue Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)

Thank You Vinay Shukla @neomythos All Images from Flicker Commons