Developing and Deploying Apache Hadoop Security
Owen O’Malley, Hortonworks Co-founder and Architect
July 25, 2011
Who am I?
An architect working on Hadoop full time since the beginning of the project (Jan ’06)
− Primarily focused on MapReduce
Tech lead on adding security to Hadoop
Co-founded Hortonworks this month
Before Hadoop: Yahoo! Search WebMap
Before Yahoo!: NASA, Sun
PhD from UC Irvine
What is Hadoop?
A framework for storing and processing big data on lots of commodity machines.
− Up to 4,500 machines in a cluster
− Up to 20 PB in a cluster
Open source Apache project
High reliability done in software
− Automated failover for data and computation
Implemented in Java
Primary data analysis platform at Yahoo!
− 40,000+ machines running Hadoop
− More than 1,000,000 jobs every month
Case Study: Yahoo! Front Page
Personalized for each visitor
Result: twice the engagement
− +160% clicks vs. one-size-fits-all
− +79% clicks vs. randomly selected
− +43% clicks vs. editor selected
[Screenshot: front page modules — Recommended Links, News Interests, Top Searches]
Problem
Yahoo! has more yahoos than clusters.
− Hundreds of yahoos using Hadoop each month
− 40,000 computers in ~20 Hadoop clusters
Sharing requires isolation or trust.
Different users need access to different data.
− Not all yahoos should have access to sensitive data, such as financial data and PII
In Hadoop 0.20, it was easy to impersonate another user.
− Only workaround: segregate sensitive data onto separate clusters
Solution
Prevent unauthorized HDFS access.
− All HDFS clients must be authenticated, including tasks running as part of MapReduce jobs and jobs submitted through Oozie.
Users must also authenticate the servers.
− Otherwise fraudulent servers could steal credentials.
Integrate Hadoop with Kerberos.
− Provides a well-tested, open source distributed authentication system.
Requirements
Security must be optional.
− Not all clusters are shared between users.
Hadoop commands must not prompt for passwords.
− Must have single sign-on.
− Otherwise trojan-horse versions that harvest passwords are easy to write.
Must support backwards compatibility.
− HFTP must be secure, but still allow reading from insecure clusters.
Primary Communication Paths
[Diagram: primary communication paths between Hadoop clients, NameNode, JobTracker, DataNodes, and TaskTrackers]
Definitions
Authentication: determining who the user is
− Hadoop 0.20 completely trusted the user, who passed their username and groups over the wire.
− We need authentication on both RPC and the Web UI.
Authorization: what can that user do?
− HDFS has had owners, groups, and permissions since 0.16.
− Map/Reduce had no authorization before security was added.
Auditing: who did what?
− Available since 0.20
Authentication
Changes the low-level transport (see the sketch below).
RPC authentication using SASL
− Kerberos (GSSAPI)
− Token (DIGEST-MD5)
− Simple
Browser HTTP secured via plugin
Tool HTTP (e.g. fsck) via SSL/Kerberos
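A client picks up the authentication mode from its configuration before making any RPC. A minimal sketch in Java; setting the property programmatically here stands in for the usual core-site.xml entry:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureClientSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "kerberos" selects SASL/GSSAPI on the RPC transport; the default is "simple".
    conf.set("hadoop.security.authentication", "kerberos");
    // Tell the security layer which mode to use before any RPC is made.
    UserGroupInformation.setConfiguration(conf);
    System.out.println("Current user: " + UserGroupInformation.getCurrentUser());
  }
}
```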
Authorization
HDFS
− Command line unchanged
− Web UI now enforces authentication
MapReduce added Access Control Lists (see the sketch below)
− Lists of users and groups that have access
− mapreduce.job.acl-view-job: who may view the job
− mapreduce.job.acl-modify-job: who may kill or modify the job
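A job sets these ACLs directly on its configuration. A minimal sketch; the user and group names are made up for illustration. An ACL value is a comma-separated user list, a space, then a comma-separated group list:

```java
import org.apache.hadoop.mapred.JobConf;

public class JobAclExample {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // alice and bob, plus anyone in group "analysts", may view this
    // job's details (counters, logs, etc.).
    job.set("mapreduce.job.acl-view-job", "alice,bob analysts");
    // Only members of group "ops" may kill the job or change its
    // priority. (The leading space means "no individual users".)
    job.set("mapreduce.job.acl-modify-job", " ops");
  }
}
```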
Auditing
A critical part of security is an accurate record of who did what.
− Almost useless until you have strong authentication
The HDFS audit log tracks:
− Reading or writing of files
The MapReduce audit log tracks:
− Launching jobs or modifying job properties
Kerberos and Single Sign-On
Kerberos allows the user to sign in once.
− Obtains a Ticket Granting Ticket (TGT)
− kinit: get a new Kerberos ticket
− klist: list your Kerberos tickets
− kdestroy: destroy your Kerberos tickets
− TGTs last for 10 hours, renewable for 7 days by default
Once you have a TGT, Hadoop commands just work:
− hadoop fs -ls /
− hadoop jar wordcount.jar in-dir out-dir
API Changes
Very minimal API changes
− Most applications work unchanged
− UserGroupInformation *completely* changed
MapReduce added secret credentials (see the sketch below)
− Available from JobConf and JobContext
− Never displayed via the Web UI
Jobs automatically get tokens for HDFS
− Covers the primary HDFS, File{In,Out}putFormat, and DistCp
− Additional NameNodes can be listed via mapreduce.job.hdfs-servers
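A sketch of the two pieces an application is most likely to touch: logging in from a keytab with the reworked UserGroupInformation API, and stashing a secret in the job's credentials. The principal, keytab path, and secret name are made up for illustration:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.security.UserGroupInformation;

public class CredentialsExample {
  public static void main(String[] args) throws Exception {
    // Long-running services log in from a keytab instead of a kinit
    // ticket cache (principal and path are hypothetical).
    UserGroupInformation.loginUserFromKeytab(
        "svc/host@EXAMPLE.COM", "/etc/security/svc.keytab");

    JobConf job = new JobConf();
    // Secrets placed here are shipped to the tasks with the job but are
    // never displayed on the web UI.
    job.getCredentials().addSecretKey(
        new Text("db.password"), "s3cret".getBytes("UTF-8"));
    // Inside a task, the same secret is read back via
    // context.getCredentials().getSecretKey(new Text("db.password")).
  }
}
```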
MapReduce Task-Level Security
MapReduce tasks now run as the submitting user.
− No more accidentally killing TaskTrackers!
− Implemented with a setuid C program
Task output logs aren’t globally visible.
Task work directories aren’t globally visible.
The distributed cache is split (see the sketch below):
− Public: shared between all users
− Private: shared between jobs of the same user
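Whether a cached file lands in the public or the private cache is driven by its permissions in HDFS: a world-readable file (under world-executable directories) is eligible to be localized once and shared by everyone. A sketch, with a hypothetical path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class CacheVisibilityExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical shared lookup file.
    Path dict = new Path("/apps/shared/dictionary.txt");
    // World-readable => eligible for the public cache, shared across all
    // users' jobs. Anything stricter stays in the submitter's private cache.
    fs.setPermission(dict, new FsPermission((short) 0644));

    DistributedCache.addCacheFile(dict.toUri(), conf);
  }
}
```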
Web UIs
Hadoop relies on web user interfaces served from embedded Jetty.
− These need to be authenticated too.
Web UI authentication is pluggable (a rough sketch follows).
− SPNEGO and static-user plug-ins are available.
− Companies may need or want their own systems.
All servlets enforce permissions based on the authenticated user.
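The slides don't spell out the plugin interface, so as a rough illustration of what a static-user style plug-in does, here is a plain servlet filter that presents every request as one fixed user. This is a sketch of the idea only, not Hadoop's actual plugin API, and the "static.user" init parameter is hypothetical:

```java
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletRequestWrapper;

// Illustrative only: treats every web UI request as one configured user.
// A real plug-in (e.g. SPNEGO) would negotiate with the browser instead.
public class StaticUserFilter implements Filter {
  private String staticUser;

  public void init(FilterConfig conf) {
    staticUser = conf.getInitParameter("static.user");
  }

  public void doFilter(ServletRequest req, ServletResponse resp,
                       FilterChain chain) throws IOException, ServletException {
    HttpServletRequest http = (HttpServletRequest) req;
    // Downstream servlets see getRemoteUser() == staticUser and apply
    // their permission checks against that identity.
    chain.doFilter(new HttpServletRequestWrapper(http) {
      @Override public String getRemoteUser() { return staticUser; }
    }, resp);
  }

  public void destroy() {}
}
```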
Proxy Users
Some services (such as Oozie) access HDFS and MapReduce on behalf of other users.
Configure the service masters (NameNode and JobTracker) with the proxy user (see the sketch below).
− For each proxy user, the configuration defines:
  − Who the proxy service can impersonate
  − Which hosts it can impersonate from
New admin commands refresh this configuration.
− No need to bounce the cluster
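A sketch of both halves, assuming a service named "oozie"; the hadoop.proxyuser.&lt;name&gt;.groups/hosts property names are the real knobs, while the group, host, and user values are made up:

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserExample {
  public static void main(String[] args) throws Exception {
    // Server side (core-site.xml on the NameNode/JobTracker):
    //   hadoop.proxyuser.oozie.groups = staff           <- who "oozie" may impersonate
    //   hadoop.proxyuser.oozie.hosts  = gw.example.com  <- from which hosts

    // Client side: the service, already logged in as "oozie", acts as alice.
    final Configuration conf = new Configuration();
    UserGroupInformation proxy = UserGroupInformation.createProxyUser(
        "alice", UserGroupInformation.getLoginUser());
    proxy.doAs(new PrivilegedExceptionAction<Void>() {
      public Void run() throws Exception {
        FileSystem fs = FileSystem.get(conf); // all accesses happen as alice
        return null;
      }
    });
  }
}
```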
Out of Scope
Encryption
− RPC transport
− Block transport protocol
− On-disk data
File Access Control Lists
− HDFS still uses Unix-style owner, group, other permissions
Non-Kerberos authentication
− Much easier now that the framework is available
Deployment
The security team worked hard to get security added to Hadoop on schedule.
− The roll-out was the smoothest major Hadoop version in a long time.
− Included in the current and upcoming releases.
− Measured performance degradation < 3%
Security development team:
− Devaraj Das, Ravi Gummadi, Jakob Homan, Owen O’Malley, Jitendra Pandey, Boris Shkolnik, Vinod Vavilapalli, Kan Zhang
Currently deployed on all shared clusters (alpha, science, and production) at Yahoo!
Incident after Deployment
The only tense incident involved one cluster where 1/3 of the machines dropped out after a day.
− We had to diagnose what had gone wrong.
− The dropped machines had newer keytab files!
− An operator had regenerated the keys on 1/3 of the cluster after it was already running.
− The servers failed when they tried to renew their tickets, because the keys had changed underneath them.
Hadoop Ecosystem
Security percolates upward: you can only be as secure as the lower levels.
− Pig has finished integrating with security.
− Oozie supports security.
− HBase is being updated for security.
  − All backing data files are owned by the HBase user.
  − Doesn’t support applications reading/writing the files directly.
− Hive is also being updated.
  − Doesn’t support column-level permissions.
Questions?
Questions should be sent to:
Security holes should be sent to:
Thanks!