Interactive Jobs in DLWorkspace
Cloud Computing and Storage Group
July 10th, 2017
Interactive Job Types
Training jobs are only a part of a researcher's daily work. Most of their time is spent on exploration and model debugging, and users want to work in the environment they are most familiar with. Goals:
- Reduce the gap between the cloud runtime environment and the user's own machine.
- Give users the flexibility to run any type of interactive job.
- Make it convenient to use by providing pre-defined job templates.
Interactive jobs: IPython, SSH, (TensorBoard), etc.
Networking
Container networking: flannel
Container ports in Kubernetes:
- Service IP and ports
- NodePort
- NIC mapping
Networking: Flannel
Flannel is a virtual network that gives a subnet to each host for use with container runtimes.
- One virtual IP per container.
- Supports cross-machine container communication.
- Pros: easy to use; the cleanest way to handle port allocation.
- Cons: performance overhead.
Kubernetes Networking Support: Service IP and ports (flannel is used)
In the service spec, include the container selector and the container ports that need to be exposed. Kubernetes will provide a cluster-only virtual IP and port that can be used to access the container on the designated port.

kind: Service
apiVersion: v1
metadata:
  name: {{ svc["serviceId"] }}
  labels:
    run: {{ svc["jobId"] }}
spec:
  selector:
  ports:
  - name: {{ svc["port-name"] }}
    protocol: {{ svc["port-type"] }}
    port: {{ svc["port"] }}
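A minimal sketch of how such a spec could be rendered and submitted from the cluster manager, assuming a Jinja2 template file named service.yaml.template and kubectl available on the manager node (the file name and svc values below are illustrative, not the DLWorkspace implementation):

  # Minimal sketch: render the service template above and submit it with kubectl.
  import subprocess
  from jinja2 import Template

  svc = {
      "serviceId": "job-12345-ipython",   # hypothetical values
      "jobId": "job-12345",
      "port-name": "ipython",
      "port-type": "TCP",
      "port": 8888,
  }

  with open("service.yaml.template") as f:
      spec = Template(f.read()).render(svc=svc)

  # "kubectl create -f -" reads the rendered spec from stdin.
  subprocess.run(["kubectl", "create", "-f", "-"], input=spec.encode(), check=True)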
Kubernetes Networking Support: NodePort (flannel is not required)
In the service spec, include the container selector and the container ports. Kubernetes will automatically select a usable port on the host machine and map that host port to the container port.

kind: Service
apiVersion: v1
metadata:
  name: {{ svc["serviceId"] }}
  labels:
    run: {{ svc["jobId"] }}
spec:
  type: NodePort
  selector:
  ports:
  - name: {{ svc["port-name"] }}
    protocol: {{ svc["port-type"] }}
    port: {{ svc["port"] }}
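Once the service is created, the allocated host port has to be reported back to the user (e.g. for the SSH or Jupyter URL). A minimal sketch of looking it up, assuming kubectl access; the service name is hypothetical:

  # Minimal sketch: query the NodePort that Kubernetes allocated for a service.
  import json
  import subprocess

  def get_node_ports(service_name):
      out = subprocess.check_output(
          ["kubectl", "get", "svc", service_name, "-o", "json"])
      svc = json.loads(out)
      # Each exposed port gets a "nodePort" field once the service is created.
      return [p["nodePort"] for p in svc["spec"]["ports"]]

  print(get_node_ports("job-12345-ipython"))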
Kubernetes Networking Support: NIC mapping
Best performance for distributed training jobs: map the host NIC to the container directly (hostNetwork).

apiVersion: v1
kind: Pod
metadata:
  name: {{ job["jobId"] }}-{{ job["distId"] }}
  labels:
    run: {{ job["jobId"] }}
    jobName: {{ job["jobNameLabel"] }}
    distRole: {{ job["distRole"] }}
    distPort: "{{ job["containerPort"] }}"
spec:
  hostNetwork: true
  {% if job["nodeSelector"]|length > 0 %}
  nodeSelector:
    {% for key, value in job["nodeSelector"].items() %}
    {{ key }}: {{ value }}
    {% endfor %}
  {% endif %}
  containers:
  - name: {{ job["jobId"] }}
    image: {{ job["image"] }}
    command: {{ job["LaunchCMD"] }}
    # With hostNetwork, the container port and host port must be the same.
    ports:
    - containerPort: {{ job["containerPort"] }}
      hostPort: {{ job["containerPort"] }}
    {% if job["distRole"] == "worker" %}
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: {{ job["resourcegpu"] }}
    {% endif %}
    volumeMounts:
    {% for mp in job["mountPoints"] %}
    - mountPath: {{ mp.containerPath }}
      name: {{ mp.name }}
    {% endfor %}
  restartPolicy: Never
  volumes:
  {% for mp in job["mountPoints"] %}
  - name: {{ mp.name }}
    hostPath:
      path: {{ mp.hostPath }}
  {% endfor %}
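With hostNetwork the container shares the node's port space, so the manager has to hand out non-conflicting ports for job["containerPort"]. A minimal sketch of one way to pick a free port (the helper name is an assumption, not the DLWorkspace scheduler's logic):

  # Minimal sketch: let the OS pick a free TCP port on the host, release it,
  # and pass the number to the pod template as job["containerPort"].
  # Note: there is a small race window between releasing and later binding it.
  import socket

  def pick_free_host_port():
      with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
          s.bind(("", 0))          # port 0 asks the kernel for any free port
          return s.getsockname()[1]

  job = {"containerPort": pick_free_host_port()}
  print(job["containerPort"])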
Networking: exposing ports
Training jobs:
- Map host NICs to the container.
- Provide usable ports as command-line parameters and environment variables (see the sketch below).
- Open question: how to force users to use the designated ports?
Interactive jobs:
- Expose ports for HTTP access, SSH access, etc. (lightweight traffic).
- Use Kubernetes NodePort (equivalent to Docker port mapping).
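A minimal sketch of how the assigned ports could be handed to the job as environment variables; the variable names and the build_env helper are assumptions, not the DLWorkspace implementation:

  # Minimal sketch: expose the assigned ports to the user's process via
  # environment variables in the container spec. Names are hypothetical.
  def build_env(job):
      return [
          {"name": "DLWS_CONTAINER_PORT", "value": str(job["containerPort"])},
          {"name": "DLWS_SSH_PORT", "value": str(job.get("sshPort", 22))},
      ]

  # The list above would be rendered into the pod template under
  # containers[0].env, so the launch command can read e.g. $DLWS_CONTAINER_PORT.
  print(build_env({"containerPort": 40001}))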
Launching interactive jobs
Job templates: pre-configured job parameters (Docker image, command line), e.g.:

TensorFlow IPython:
- Docker image: tensorflow/tensorflow:latest
- Command line: export HOME=/job && jupyter notebook --no-browser --port=8888 --ip=0.0.0.0 --notebook-dir=/

TensorFlow SSH:
- Docker image: tensorflow/tensorflow:latest-gpu
- Command line: apt-get update && apt-get install -y openssh-server sudo && addgroup --force-badname --gid 500000513 domainusers && adduser --force-badname --home /home/hongzl --shell /bin/bash --uid 522318884 --gecos '' hongzl --disabled-password --gid 500000513 && adduser hongzl sudo && echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers && mkdir -p /root/.ssh && cat /work/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys && mkdir -p /home/hongzl/.ssh && cat /work/.ssh/id_rsa.pub >> /home/hongzl/.ssh/authorized_keys && service ssh restart && sleep infinity
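A minimal sketch of how such templates could be stored and applied when a user launches an interactive job; the registry structure and helper name are assumptions, not the DLWorkspace schema:

  # Minimal sketch: a registry of pre-defined interactive job templates.
  JOB_TEMPLATES = {
      "tensorflow-ipython": {
          "image": "tensorflow/tensorflow:latest",
          "cmd": "export HOME=/job && jupyter notebook --no-browser "
                 "--port=8888 --ip=0.0.0.0 --notebook-dir=/",
      },
      "tensorflow-ssh": {
          "image": "tensorflow/tensorflow:latest-gpu",
          "cmd": "apt-get update && apt-get install -y openssh-server && "
                 "service ssh restart && sleep infinity",  # abbreviated
      },
  }

  def fill_job(job, template_name):
      # Copy the template's image and launch command into the job description.
      tpl = JOB_TEMPLATES[template_name]
      job["image"] = tpl["image"]
      job["LaunchCMD"] = tpl["cmd"]
      return job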
Policy (open discussion)
Interactive jobs could be expensive; we need to design an efficient policy.
Job Scheduling
Discussion
- How to configure a GPU resource quota for each team?
- How to implement preemption?
How to configure a GPU resource quota for each team?
https://github.com/MSRCCS/DLWorkspace/blob/alpha.v1.0/src/ClusterManager/job_manager.py

if check_quota(job):
    SubmitJob(job)
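A minimal sketch of what check_quota could look like, assuming a per-team GPU quota table and a count of GPUs currently in use; these structures are assumptions, not the job_manager.py implementation:

  # Minimal sketch of a per-team GPU quota check. The quota table and the
  # gpus_in_use helper are hypothetical.
  TEAM_GPU_QUOTA = {"vision": 64, "speech": 32, "nlp": 32}

  def gpus_in_use(team):
      # Placeholder: in practice this would sum resourcegpu over the team's
      # running jobs (e.g. from the cluster manager's database).
      return 0

  def check_quota(job):
      team = job["team"]
      requested = int(job.get("resourcegpu", 0))
      return gpus_in_use(team) + requested <= TEAM_GPU_QUOTA.get(team, 0)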
Support Job Priority?
https://github.com/MSRCCS/DLWorkspace/blob/alpha.v1.0/src/ClusterManager/job_manager.py

pendingJobs = get_job_priority(pendingJobs)
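A minimal sketch of get_job_priority, assuming each job record carries a numeric priority field and a queue timestamp (both field names are assumptions):

  # Minimal sketch: order pending jobs so that higher-priority (and then older)
  # jobs are submitted first.
  def get_job_priority(pending_jobs):
      return sorted(
          pending_jobs,
          key=lambda job: (-job.get("priority", 0), job.get("queueTime", 0)))

  pendingJobs = [
      {"jobId": "a", "priority": 1, "queueTime": 10},
      {"jobId": "b", "priority": 5, "queueTime": 20},
  ]
  pendingJobs = get_job_priority(pendingJobs)
  print([j["jobId"] for j in pendingJobs])  # "b" comes first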
How to implement preemption?
Jobs need to be labeled as "allow preemption".
Find the jobs that can be preempted:
  kubectl get pod -o yaml --show-all -l preemption=allow
To preempt a job:
- Kill the job's pods in Kubernetes.
- Set the job status back to "queued" to allow rescheduling.

apiVersion: v1
kind: Pod
metadata:
  name: {{ job["jobId"] }}
  labels:
    run: {{ job["jobId"] }}
    jobName: {{ job["jobNameLabel"] }}
    userName: {{ job["userNameLabel"] }}
    preemption: allow
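A minimal sketch of the preemption step, assuming kubectl access from the cluster manager; the set_job_status helper is hypothetical (in practice it would update the job database so the scheduler re-queues the job):

  # Minimal sketch: preempt all jobs whose pods carry the preemption=allow label.
  import json
  import subprocess

  def set_job_status(job_id, status):
      print("job %s -> %s" % (job_id, status))  # placeholder

  def preempt_jobs():
      out = subprocess.check_output(
          ["kubectl", "get", "pod", "-o", "json", "-l", "preemption=allow"])
      for pod in json.loads(out)["items"]:
          job_id = pod["metadata"]["labels"]["run"]
          # Kill the pod in Kubernetes, then requeue the job for rescheduling.
          subprocess.run(["kubectl", "delete", "pod",
                          pod["metadata"]["name"]], check=True)
          set_job_status(job_id, "queued")

  preempt_jobs()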