Developing and Deploying Apache Hadoop Security Owen O’Malley - Hortonworks Co-founder and © Hortonworks Inc.

Presentation transcript:

Developing and Deploying Apache Hadoop Security
Owen O’Malley - Hortonworks Co-founder and © Hortonworks Inc. 2011
July 25, 2011

Who am I
An architect working on Hadoop full time since the beginning of the project (Jan ‘06)
  − Primarily focused on MapReduce
Tech-lead on adding security to Hadoop
Co-founded Hortonworks this month
Before Hadoop – Yahoo Search WebMap
Before Yahoo – NASA, Sun
PhD from UC Irvine

What is Hadoop?
A framework for storing and processing big data on lots of commodity machines.
  − Up to 4,500 machines in a cluster
  − Up to 20 PB in a cluster
Open Source Apache project
High reliability done in software
  − Automated failover for data and computation
Implemented in Java
Primary data analysis platform at Yahoo!
  − 40,000+ machines running Hadoop
  − More than 1,000,000 jobs every month

Case Study: Yahoo Front Page
Personalized for each visitor
Result: twice the engagement
  − +160% clicks vs. one size fits all
  − +79% clicks vs. randomly selected
  − +43% clicks vs. editor selected
[Figure labels: Recommended links, News Interests, Top Searches]

Problem
Yahoo! has more yahoos than clusters.
Hundreds of yahoos using Hadoop each month
40,000 computers in ~20 Hadoop clusters.
Sharing requires isolation or trust.
Different users need different data.
Not all yahoos should have access to sensitive data
  − financial data and PII
In Hadoop 0.20, easy to impersonate.
  − Segregate different data on separate clusters

Solution
Prevent unauthorized HDFS access
All HDFS clients must be authenticated.
  − Including tasks running as part of MapReduce jobs
  − And jobs submitted through Oozie.
Users must also authenticate servers
  − Otherwise fraudulent servers could steal credentials
Integrate Hadoop with Kerberos
  − Provides a well-tested, open source distributed authentication system.

Requirements
Security must be optional.
  − Not all clusters are shared between users.
Hadoop commands must not prompt for passwords
  − Must have single sign on.
  − Otherwise trojan horse versions are easy to write.
Must support backwards compatibility
  − HFTP must be secure, but allow reading from insecure clusters

Primary Communication Paths

Definitions
Authentication – Determining the user
  − Hadoop 0.20 completely trusted the user
      User passes their username and groups over wire
  − We need it on both RPC and Web UI.
Authorization – What can that user do?
  − HDFS had owners, groups and permissions since
  − Map/Reduce had nothing in
Auditing – Who did what?
  − Available since 0.20

Authentication
Changes low-level transport
RPC authentication using SASL
  − Kerberos (GSSAPI)
  − Token (Digest-MD5)
  − Simple
Browser HTTP secured via plugin
Tool HTTP (e.g. fsck) via SSL/Kerberos
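
The Token mechanism lets a client authenticate with a shared secret instead of a Kerberos ticket. The sketch below illustrates one way such a secret can work, assuming the token password is an HMAC of the token identifier keyed by a server-side master key, so the server can re-derive the password instead of storing it. The function names, the identifier layout, and MASTER_KEY are hypothetical and are not Hadoop API. In a real Digest-MD5 exchange the password is never sent over the wire; the direct comparison here only stands in for that proof of knowledge.

```python
import hashlib
import hmac
import os

# Hypothetical server-side master key; a real service would manage and rotate it.
MASTER_KEY = os.urandom(32)


def issue_token(identifier: bytes, master_key: bytes) -> bytes:
    """Derive the token password as an HMAC of the token identifier."""
    return hmac.new(master_key, identifier, hashlib.sha1).digest()


def verify_token(identifier: bytes, password: bytes, master_key: bytes) -> bool:
    """Recompute the password from the identifier and compare in constant time."""
    expected = issue_token(identifier, master_key)
    return hmac.compare_digest(expected, password)


# Illustrative identifier layout only; not the real wire format.
identifier = b"owner=alice,renewer=jobtracker,seq=1"
password = issue_token(identifier, MASTER_KEY)
```

Because the password depends on the identifier, a client cannot reuse it with a different owner name, and a server holding a different master key will reject it.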

Authorization
HDFS
  − Command line unchanged
  − Web UI enforces authentication
MapReduce added Access Control Lists
  − Lists of users and groups that have access.
  − mapreduce.job.acl-view-job – view job
  − mapreduce.job.acl-modify-job – kill or modify job
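
A job ACL is a list of users and groups. The sketch below models a check against such a list, assuming the ACL string is written as comma-separated users, a space, then comma-separated groups, with "*" meaning everyone. The helper names are hypothetical, and the sketch ignores any implicit access for the job owner or cluster administrators.

```python
def parse_acl(acl: str):
    """Split an ACL string into (allow_all, users, groups)."""
    if acl.strip() == "*":
        return True, set(), set()
    # Users come before the first space, groups after it.
    users_part, _, groups_part = acl.partition(" ")
    users = {u for u in users_part.split(",") if u}
    groups = {g for g in groups_part.strip().split(",") if g}
    return False, users, groups


def is_allowed(acl: str, user: str, user_groups) -> bool:
    """Return True if the user, or one of the user's groups, is listed in the ACL."""
    allow_all, users, groups = parse_acl(acl)
    return allow_all or user in users or bool(groups & set(user_groups))


# Example value one might set for mapreduce.job.acl-view-job.
view_acl = "alice,bob analysts"
```

With this value, alice and bob can view the job, as can any member of the analysts group; an empty string grants access to nobody in this model.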

Auditing
A critical part of security is an accurate method for determining who did what.
  − Almost useless until you have strong authentication
HDFS audit log tracks
  − Reading or writing of files
MapReduce audit log tracks
  − Launching or modifying job properties

Kerberos and Single Sign-on
Kerberos allows a user to sign in once
  − Obtains Ticket Granting Ticket (TGT)
      kinit – get a new Kerberos ticket
      klist – list your Kerberos tickets
      kdestroy – destroy your Kerberos ticket
      TGTs last for 10 hours, renewable for 7 days by default
  − Once you have a TGT, Hadoop commands just work
      hadoop fs -ls /
      hadoop jar wordcount.jar in-dir out-dir
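
The default lifetimes above (10 hours, renewable for 7 days) determine when a ticket must be renewed and when a fresh kinit is needed. The sketch below works through that arithmetic, assuming a renewal grants a fresh 10-hour lifetime capped at the renew-until time, and that a ticket which has already expired cannot be renewed. The function names are hypothetical and nothing here talks to a real KDC.

```python
from datetime import datetime, timedelta

TICKET_LIFETIME = timedelta(hours=10)   # default lifetime from the slide
RENEWABLE_LIFETIME = timedelta(days=7)  # default renewable window from the slide


def initial_expiry(issued: datetime) -> datetime:
    """Expiry time of a ticket obtained at `issued`."""
    return issued + TICKET_LIFETIME


def renew(issued: datetime, expires: datetime, now: datetime) -> datetime:
    """Return the new expiry after a renewal at `now`, or raise if not renewable."""
    renew_until = issued + RENEWABLE_LIFETIME
    if now >= expires or now >= renew_until:
        raise ValueError("ticket can no longer be renewed; run kinit again")
    # A renewal never extends the ticket past the renew-until time.
    return min(now + TICKET_LIFETIME, renew_until)
```

For a ticket obtained at 09:00, the first expiry is 19:00 the same day; renewing at 18:00 moves the expiry to 04:00 the next day, and no renewal can push it past 09:00 seven days after issue.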

API Changes
Very minimal API changes
  − Most applications work unchanged
  − UserGroupInformation *completely* changed.
MapReduce added secret credentials
  − Available from JobConf and JobContext
  − Never displayed via Web UI
Automatically get tokens for HDFS
  − Primary HDFS, File{In,Out}putFormat, and DistCp
  − Can set mapreduce.job.hdfs-servers

MapReduce task-level security
MapReduce tasks run as submitting user.
  − No more accidentally killing TaskTrackers!
  − Use a setuid C program.
Task output logs aren’t globally visible.
Task work directories aren’t globally visible.
Distributed cache is split
  − Public – shared between all users
  − Private – shared between jobs of same user

Web UIs
Hadoop relies on Web User Interfaces served from embedded Jetty.
  − These need to be authenticated also…
Web UI authentication is pluggable.
  − SPNEGO or static user plug-ins are available
  − Companies may need or want their own systems
All servlets enforce permissions based on the authenticated user.

Proxy-Users
Some services access HDFS and MapReduce as other users.
Configure service masters (NameNode and JobTracker) with the proxy user:
  − For each proxy user, configuration defines:
      Who the proxy service can impersonate
      Which hosts they can impersonate from
New admin commands to refresh
  − Don’t need to bounce cluster
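
The two per-proxy-user settings (who can be impersonated, and from which hosts) combine into a single check on the master. The sketch below models that check, assuming the "who" is expressed as a set of groups and that "*" acts as a wildcard. PROXY_RULES, the host name, and may_impersonate are hypothetical; in Hadoop's configuration files the corresponding settings are properties of the form hadoop.proxyuser.<service>.groups and hadoop.proxyuser.<service>.hosts.

```python
# Hypothetical in-memory view of the proxy-user settings, keyed by service user.
PROXY_RULES = {
    "oozie": {"groups": {"analysts"}, "hosts": {"oozie1.example.com"}},
}


def may_impersonate(service, target_groups, remote_host, rules=PROXY_RULES) -> bool:
    """Allow impersonation only if both the host and the target user's groups match."""
    rule = rules.get(service)
    if rule is None:
        return False  # services without a rule cannot impersonate anyone
    host_ok = "*" in rule["hosts"] or remote_host in rule["hosts"]
    group_ok = "*" in rule["groups"] or bool(rule["groups"] & set(target_groups))
    return host_ok and group_ok
```

Requiring both conditions means a stolen service credential is of little use from an unlisted host, and a legitimate service still cannot act for users outside the listed groups.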

Out of Scope
Encryption
  − RPC transport
  − Block transport protocol
  − On disk
File Access Control Lists
  − Still use Unix-style owner, group, other permissions
Non-Kerberos Authentication
  − Much easier now that framework is available

Deployment
The security team worked hard to get security added to Hadoop on schedule.
  − Roll out was smoothest major Hadoop version in a long time.
  − In the and upcoming release.
  − Measured performance degradation < 3%
Security Development team:
  − Devaraj Das, Ravi Gummadi, Jakob Homan, Owen O’Malley, Jitendra Pandey, Boris Shkolnik, Vinod Vavilapalli, Kan Zhang
Currently deployed on all shared clusters (alpha, science, and production) at Yahoo!

Incident after Deployment
Only tense incident involved one cluster where 1/3 of the machines dropped out of the cluster after a day.
Had to diagnose what had gone wrong.
The dropped machines had newer keytab files!
An operator had regenerated the keys on 1/3 of the cluster after it was running.
Servers failed when they tried to renew their tickets.

Hadoop Eco-system
Security percolates upward…
  − You can only be as secure as the lower levels
  − Pig finished integrating with security
  − Oozie supports security
  − HBase is being updated for security
      All backing data files are owned by HBase user.
      Doesn’t support reading/writing files directly by application
  − Hive is also being updated
      Doesn’t support column level permissions

Questions?
Questions should be sent to: −
Security holes should be sent to: −
Thanks!