Daniel Templeton, Cloudera, Inc.

Slides:

Advertisements

Similar presentations

Can’t We All Just Get Along? Sandy Ryza. Introductions Software engineer at Cloudera MapReduce, YARN, Resource management Hadoop committer.

Advertisements

© 2013 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Jim Donahue | Principal Scientist Adobe Systems Technology Lab Flint: Making.

Resource Management with YARN: YARN Past, Present and Future

Reproducible Environment for Scientific Applications (Lab session) Tak-Lon (Stephen) Wu.

Installing and Setting up mongoDB replica set PREPARED BY SUDHEER KONDLA SOLUTIONS ARCHITECT.

Making Apache Hadoop Secure Devaraj Das Yahoo’s Hadoop Team.

What’s New in Kinetic Task 3.0 Ben Christenson 3 About Me  Ben Christenson  Employee at Kinetic Data for 13 years and a member of the Product Development.

M1G Introduction to Programming 2 4. Enhancing a class:Room.

MSc. Miriel Martín Mesa, DIC, UCLV. The idea Installing a High Performance Cluster in the UCLV, using professional servers with open source operating.

Towards Establishing a Local ORCA Instance Shade EL-Hadik Deniz Gurkan University of Houston 7th GENI Engineering Conference 03/16/2010 GEC7 – ORCA-D.

Introduction to AFS IMSA Intersession 2003 AFS Servers and Clients Brian Sebby, IMSA ‘96 Copyright 2003 by Brian Sebby, Copies of these.

National Energy Research Scientific Computing Center (NERSC) CHOS - CHROOT OS Shane Canon NERSC Center Division, LBNL SC 2004 November 2004.

SQL SERVER 2008 Installation Guide A Step by Step Guide Prepared by Hassan Tariq.

Learn. Hadoop Online training course is designed to enhance your knowledge and skills to become a successful Hadoop developer and In-depth knowledge of.

Apache Tez : Accelerating Hadoop Query Processing Page 1.

Hadoop Introduction. Audience Introduction of students – Name – Years of experience – Background – Do you know Java? – Do you know linux? – Any exposure.

CompTIA Server+ Certification (Exam SK0-004)

Getting & Running EdgeX Docker Containers

Bentley Systems, Incorporated

How Alluxio (formerly Tachyon) brings a 300x performance improvement to Qunar’s streaming processing Xueyan Li (Qunar) & Chunming Li (Garena)

Hadoop Architecture Mr. Sriram

Machine Learning Library for Apache Ignite

Dockerize OpenEdge Srinivasa Rao Nalla.

IBM Workload Scheduler 2015 Take the Complexity Out of Workload Automation, while Keeping the Technology Up-to-Date IEM fixlets and Centralized Agent Update.

Introduction to Distributed Platforms

By Chris immanuel, Heym Kumar, Sai janani, Susmitha

HTCondor Security Basics

Unit 2 Hadoop and big data

Spark and YARN: Better Together

High Availability Linux (HA Linux)

How to download, configure and run a mapReduce program In a cloudera VM Presented By: Mehakdeep Singh Amrit Singh Chaggar Ranjodh Singh.

Apache Bigtop Practical Workshop.

UBUNTU INSTALLATION

Spark Presentation.

Neelesh Srinivas Salian

Containers and Virtualisation

Pyspark 최 현 영 컴퓨터학부.

Apache Hadoop YARN: Yet Another Resource Manager

Containers in HPC By Raja.

Software Engineering Introduction to Apache Hadoop Map Reduce

Beijing S3P test strategy Eric Debeau, Sylvain Desbureaux, Morgan Richomme December 12, 2017.

Integration of Singularity With Makeflow

Windows Server & Hyper-V Containers Vaggelis Kappas

Overview Introduction VPS Understanding VPS Architecture

Tech Inside Extended Document Management System (EDMS)

Introduction to Introduction to Singularity

HDFS on Kubernetes -- Lessons Learned

IS3440 Linux Security Unit 4 Securing the Linux Filesystem

HTCondor Security Basics HTCondor Week, Madison 2016

Data Security for Microsoft Azure

Sahara Project Onboarding Telles Nobrega

Hadoop Technopoints.

Introduction to Apache

Presented By - Avinash Pawar

Dovetail & CVP Tutorial/Demo

HDFS on Kubernetes -- Lessons Learned

Container cluster management solutions

Step by step installation of a Domino server on Docker

Lecture 16 (Intro to MapReduce and Hadoop)

Quattor Advanced Tutorial, LAL

Setting up your dev environment

Containerized Spark at RBC

Architecture Agnostic Docker Build Systems

Hadoop Installation Fully Distributed Mode

Azure Container Service

David Cleverly – Development Lead

Container technology, Microservices, and DevOps

CS639: Data Management for Data Science

Docker and Kubernetes Security in ONAP Pawel Pawlak Amy Zwarico

Presentation transcript:

Daniel Templeton, Cloudera, Inc. Docker on Hadoop Daniel Templeton, Cloudera, Inc. daniel@cloudera.com @templedf

Me

One Slide on Docker Same general idea as a VM BUT there’s only one OS image Partitioned process space Layered images Image repo

One Slide on Hadoop HDFS HDFS HDFS YARN YARN MapReduce v1 Three core components HDFS YARN MapReduce YARN HDFS MapReduce v2 YARN HDFS MRv2 Spark ... HDFS MapReduce v1

Why Docker on Hadoop? Process isolation CGroups for resource isolation Adds process & environment isolation Image isolation Control execution environment Libraries JVM OS Unsafe operations

Launching Jobs YARN Resource Manager Node Manager Container Executor Process

Container Executor DefaultContainerExecutor LinuxContainerExecutor Write a launch script ProcessBuilder.start() LinuxContainerExecutor Write a launch script Launch native handler Set UID CGroups Fork & exec Required for secure

Container Executor DefaultContainerExecutor Write a launch script ProcessBuilder.start() LinuxContainerExecutor Write a launch script Launch native handler Set UID CGroups Fork & exec Required for secure DockerContainerExecutor Write a launch script ProcessBuilder.start() docker run

Container Executor DefaultContainerExecutor Write a launch script ProcessBuilder.start() LinuxContainerExecutor Write a launch script Launch native handler OR Launch Docker handler docker run Required for secure DockerContainerExecutor Write a launch script ProcessBuilder.start() docker run

2.6.0 2.8.0 Container Executor DefaultContainerExecutor Write a launch script ProcessBuilder.start() LinuxContainerExecutor Write a launch script Launch native handler OR Launch Docker handler docker run Required for secure DockerContainerExecutor Write a launch script ProcessBuilder.start() Docker run BORN 2.6.0 DIED 2.8.0

Secret Formula How to run a Docker container through YARN Setup LCE Setup Docker Configure yarn-site.xml Configure container-executor.cfg Prepare Docker image Launch job

Setup LCE LCE uses container-executor binary Must be owned by root Group must be same as node manager's group Must have setuid and setgid bits set Must be r+x only by the node manager's group Owner: root, Group: hadoop, Mode: 6050 Which relies on container-executor.cfg Must not be writable by any other than root

Setup Docker Docker must be installed on all node manager nodes Node labels can be used to label those nodes Only capacity scheduler Only one label per host May be a good idea to pre-cache images that will be used

Configure yarn-site.xml yarn.nodemanager.container-executor.class = org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor yarn.nodemanager.linux-container-executor.group = hadoop (or whatever group the node manager uses) yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users = false (typically) yarn.nodemanager.runtime.linux.docker.allowed-container-networks yarn.nodemanager.runtime.linux.docker.default-container-network yarn.nodemanager.runtime.linux.docker.privileged-containers.allowed yarn.nodemanager.runtime.linux.docker.privileged-containers.acl ...

Configure container-executor.cfg yarn.nodemanager.linux-container-executor.group = hadoop (or whatever group the node manager uses) feature.docker.enabled = true min.user.id banned.users allowed.system.users docker.binary ...

Prepare the Docker Image Application owner (UID) must exist Execution requirements Hadoop → JRE, Hadoop libraries, env vars Must be compatible with cluster and other images No entry point, no command

Launch the Job Do whatever you normally do Use of Docker containers managed through env vars YARN_CONTAINER_RUNTIME_TYPE YARN_CONTAINER_RUNTIME_DOCKER_IMAGE YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK YARN_CONTAINER_RUNTIME_DOCKER_RUN_PRIVILEGED_CONTAINER YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS

Example: MapReduce $ vars="YARN_CONTAINER_RUNTIME_TYPE=docker” $ vars=”$vars,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop" $ hadoop jar hadoop-examples.jar pi \ -Dyarn.app.mapreduce.am.env=$vars \ -Dmapreduce.map.env=$vars \ -Dmapreduce.reduce.env=$vars \ 10 100

Example: Spark $ spark-shell --master yarn \ --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \ --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop \ --conf spark.yarn.AppMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop \ --conf spark.yarn.AppMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker

Caveats

Caveats Application owner must exist in Docker container Limits flexibility of containers Automatically mounts in /etc/passwd Bad solution Broken Removed in Hadoop 2.9/3.0 (YARN-5394) Discussion on YARN-5360 and YARN-4266

Caveats Application owner must exist in Docker container Hadoop artifacts must exist in Docker containers Docker containers must be self-contained HDFS access, deserializing tokens, etc. Versions must be compatible Complicates cluster upgrades YARN-5534 will allow whitelisted volume mounts

Caveats Application owner must exist in Docker container Hadoop artifacts must exist in Docker containers Large images may fail Images that aren't cached are implicitly pulled Large images may take a while MapReduce and Spark time out after 10 minutes YARN-3854 is a step towards a solution

Caveats Application owner must exist in Docker container Hadoop artifacts must exist in Docker containers Large images may fail No real support for secure image repos Docker stores credentials in client config Always set to $HOME/.docker/config.json YARN-5428 will make the client config configurable

Caveats Application owner must exist in Docker container Hadoop artifacts must exist in Docker containers Large images may fail No real support for secure image repos Not really useful before Hadoop 2.9/3.0 YARN-5298: Mounts localized file directories as volumes YARN-4553: CGroups support YARN-4007: Support different networking options YARN-5258: Documentation

Apache Slider YARN is traditionally a job scheduler What about services? Slider simplifies running a service on YARN Is itself a YARN application Declarative Docker support as of Slider 0.80 Slider agent calls docker run Unrelated to YARN Docker support

Slider in YARN Slider core moving into YARN YARN-5079: Native YARN framework layer for services and beyond Slider agent is not being integrated Using YARN instead Docker support through YARN Currently only in yarn-native-services branch Merge date not set yet “Classic” Slider will continue to be available

So Many Choices! YES YES NO Hadoop 2.8.0 YES NO NO stable Classic Slider on YARN YES YES services Native Services Slider NO stable Hadoop 2.8.0 YES NO Hadoop branch-2 or trunk NO

Summary Docker adds good things to YARN There are a few thorns YARN natively supports Docker Limited use until Hadoop 2.9/3.0 Slider natively supports Docker Slider is moving into YARN and adopting YARN's Docker support

Daniel Templeton, Cloudera, Inc. Thank you! Daniel Templeton, Cloudera, Inc. daniel@cloudera.com @templedf