Hardening Hadoop for the Enterprise: Managing Diverse Workloads, Securing and Governing your Big Data Platform

Presentation transcript:

How does IT balance the tension between “one glorious cluster that serves them all” and “one cluster, one purpose: dedicated to the particular task and not to be interfered with by anything”? To contain cluster sprawl, folks need help allocating a mixed workload across a shared cluster (beyond the JobTracker assigning map and reduce slots), and they want to be sure the cluster is as secure as can be. Kerberos, cgroups and YARN to the rescue! This talk describes current practices and speculates on how things get better under YARN.

Agenda
1. Basics
2. Cluster Evolution
   - Vanilla Cluster
   - Foreign Workload Introduced
   - Node Specialization
   - Cluster Specialization
   - Datacenter Integration
3. YARN
4. Security

Hadoop, and her 2 beautiful things
• I will spread your data out over many servers to keep it safe.
• I will facilitate a new idea: that you should send the work to the data, not the other way around.
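The two ideas can be illustrated with a toy sketch in Python. It is not how HDFS or the JobTracker are implemented: real HDFS placement is rack-aware, and the function names `place_blocks` and `schedule_local`, the node names, and the round-robin policy are all assumptions made for illustration. A replication factor of 3 is used because that is the HDFS default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Toy placement: put each block on `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def schedule_local(placement, busy=frozenset()):
    """Toy scheduling: run each block's task on a node that already holds a replica.

    Returns None for a block when every replica holder is busy.
    """
    tasks = {}
    for block, holders in placement.items():
        candidates = [node for node in holders if node not in busy]
        tasks[block] = candidates[0] if candidates else None
    return tasks


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical host names
    placement = place_blocks(4, nodes)
    print(placement)
    print(schedule_local(placement, busy={"node1"}))
```

Because every block lives on three nodes, losing or overloading one node still leaves a local replica to read from, which is what lets the work travel to the data.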

WHY DO THIS? BECAUSE IT GETS THE ANSWERS SOOOO MUCH FASTER
[Diagram: NameNode, Client]
Copyright © 2013, SAS Institute Inc. All rights reserved.

WOW, that’s awesome. Can we join your cluster?

We’ll be very very good. Really.

Agenda
1. Basics
2. Cluster Evolution
   - Vanilla Cluster
   - Foreign Workload Introduced
   - Node Specialization
   - Cluster Specialization
   - Datacenter Integration
3. YARN
4. Security

2012 :: Have → Want

Vanilla Cluster
[Diagram: NameNode, Secondary NameNode (SecNmNode), DataNodes]

Vanilla Cluster (with foreign workload)
[Diagram: NameNode, Secondary NameNode (SecNmNode), DataNodes]

Foreign != MapReduce (and not only SAS)
• SAS High Performance Analytics
• SAS Visual Analytics
• Impala
• BDAS Spark
• Giraph
• Solr
• HBase

Vanilla Cluster (with foreign workload)
[Diagram: NameNode, Secondary NameNode (SecNmNode), DataNodes]
1. Add work across the entire cluster
2. Add memory to accommodate
3. Derate MapReduce to accommodate
4. Time slice?
5. No extra copy of data
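As a rough illustration of "add memory" and "derate MapReduce", the arithmetic below estimates how many map and reduce slots a node can keep once memory is set aside for a foreign workload. This is a sketch under stated assumptions, not a sizing rule from the talk: the function name `derated_slots`, the 4 GB daemon reserve, the 2 GB per-slot heap, and the 2:1 map-to-reduce split are all hypothetical. In MRv1 the resulting numbers would typically be applied through `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum`.

```python
def derated_slots(node_ram_gb, foreign_gb, daemons_gb=4, slot_heap_gb=2):
    """Estimate (map_slots, reduce_slots) left after reserving memory.

    All defaults are illustrative assumptions, not recommended values.
    """
    available = node_ram_gb - foreign_gb - daemons_gb
    total = max(0, available // slot_heap_gb)
    reduces = total // 3      # assume roughly one reduce slot per two map slots
    maps = total - reduces
    return maps, reduces


if __name__ == "__main__":
    print(derated_slots(64, 0))    # no foreign workload: (20, 10)
    print(derated_slots(64, 24))   # 24 GB reserved for the foreign workload: (12, 6)
```

Running the same calculation with a larger `node_ram_gb` shows the alternative on the slide: adding memory instead of giving up slots.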

Node Specialization (for foreign workload)
[Diagram: NameNode, Secondary NameNode (SecNmNode), DataNodes]

Node Specialization (for foreign workload)
[Diagram: NameNode, Secondary NameNode (SecNmNode), DataNodes]
1. Add workload to some … “SASnodes”
2. Add memory to SASnodes
3. Derate MapReduce on SASnodes?
4. Cgroups to make ’em play nice
5. Still no extra copy of data
6. SAS writes data to SASnodes only (balancer)

Node Specialization (for foreign workload)
[Diagram: NameNode, Secondary NameNode (SecNmNode), DataNodes]
1. Add workload to some … “SASnodes”
2. Add memory to SASnodes
3. Derate MapReduce on SASnodes?
4. Cgroups to make ’em play nice
5. Still no extra copy of data
6. SAS writes data to SASnodes only (balancer)
Callout: CDH4 Best Practice
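One way cgroups could make the workloads "play nice" on a SASnode is to give each workload its own control group with a CPU weight and a memory ceiling. The Python sketch below only renders a group stanza in the libcgroup `cgconfig.conf` style, assuming cgroups v1 (`cpu.shares`, `memory.limit_in_bytes`). The helper `render_cgroup`, the group names, and the share and limit values are hypothetical and would need tuning for a real cluster.

```python
def render_cgroup(name, cpu_shares, memory_limit_bytes):
    """Render one cgconfig.conf-style group stanza (cgroups v1 parameters)."""
    return (
        f"group {name} {{\n"
        f"    cpu {{\n"
        f'        cpu.shares = "{cpu_shares}";\n'
        f"    }}\n"
        f"    memory {{\n"
        f'        memory.limit_in_bytes = "{memory_limit_bytes}";\n'
        f"    }}\n"
        f"}}\n"
    )


if __name__ == "__main__":
    gib = 1024 ** 3
    # Hypothetical split: weight MapReduce above the foreign workload on CPU,
    # and cap each workload's memory so neither can starve the other.
    print(render_cgroup("mapreduce", 1024, 24 * gib))
    print(render_cgroup("sas", 512, 32 * gib))
```

CPU shares are relative weights that only matter under contention, while the memory limit is a hard cap, so the two knobs address different failure modes.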

Specialty Cluster
[Diagram: NameNode, DataNode, Secondary NameNode (SecNmNode), DataNode; a second NameNode and DataNode]

Specialty Cluster
[Diagram: NameNode, DataNode, Secondary NameNode (SecNmNode), DataNode; a second NameNode and DataNode]
1. Create new “Odd Shape” cluster
2. Optimize hardware to fit the task
3. Oops! Extra copy of data
4. Easier to contain variation

EXAMPLE: ASYMMETRIC AS AN OPTION
[Diagram: NameNode, Client, Controller]

Datacenter Integration
[Diagram: Teradata, Client, Oracle, Hadoop]

Agenda
1. Basics
2. Cluster Evolution
   - Vanilla Cluster
   - Foreign Workload Introduced
   - Node Specialization
   - Cluster Specialization
   - Datacenter Integration
3. YARN
4. Security

2013q4? 2014?

Node Specialization (for foreign workload)
[Diagram: NameNode, Secondary NameNode (SecNmNode), DataNodes]

Agenda
1. Basics
2. Cluster Evolution
   - Vanilla Cluster
   - Foreign Workload Introduced
   - Node Specialization
   - Cluster Specialization
   - Datacenter Integration
3. YARN
4. Security

Security is Hard. Better Start Right Away.
• Add Kerberos to your environment ASAP, right after the first POC.
• Integrate with the identity management on site.
  - Don’t add Unix users to the cluster by hand!
  - Automate.
  - Engage SAS Technical Resources.
  - Security settings can be hard to get right. Error messages get obfuscated, and tracking the true source is difficult.
  - Easier to start with a small working system and add projects.
• Resist “Oh, we will add the security later”. Your users will have gotten so used to no security that they’ll scream!
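Since security settings are easy to get wrong, a small automated check can help catch a cluster that is still running without Kerberos. The Python sketch below inspects the text of a `core-site.xml` for the two standard Hadoop properties `hadoop.security.authentication` and `hadoop.security.authorization`. The helper name `security_gaps` and the sample XML are assumptions for illustration, and a real Kerberos rollout involves far more than these two settings (principals, keytabs, per-service configuration).

```python
import xml.etree.ElementTree as ET

# Values expected on a Kerberos-secured cluster; Hadoop defaults to "simple".
REQUIRED = {
    "hadoop.security.authentication": "kerberos",
    "hadoop.security.authorization": "true",
}


def security_gaps(core_site_xml):
    """Return {property: actual_value} for each required setting that is wrong or missing."""
    root = ET.fromstring(core_site_xml)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        found[name] = (prop.findtext("value") or "").strip()
    return {key: found.get(key) for key, want in REQUIRED.items() if found.get(key) != want}


if __name__ == "__main__":
    sample = """
    <configuration>
      <property>
        <name>hadoop.security.authentication</name>
        <value>simple</value>
      </property>
    </configuration>
    """
    for key, actual in security_gaps(sample).items():
        print(f"{key}: expected {REQUIRED[key]!r}, found {actual!r}")
```

A check like this fits the "automate" advice: it can run after every configuration push, starting with the small working system and growing as projects are added.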

Thank You! paulmkent