Interactive Job in DLWorkspace

Interactive Job in DLWorkspace
Cloud Computing and Storage Group, July 10th, 2017

Interactive Job Type

Training jobs are only part of a researcher's daily work; most of their time is spent on exploration and on debugging models.

- Users want to work in the environment they are most familiar with, so we want to reduce the gap between the cloud runtime environment and their own machines.
- Give users the flexibility to run any type of interactive job.
- Make it convenient to use by providing pre-defined job templates.

Interactive jobs: IPython, SSH, (TensorBoard), etc.

Networking

- Container networking: Flannel
- Container ports in Kubernetes:
  - Service IP and ports
  - NodePort
  - NIC mapping

Networking: Flannel

Flannel is a virtual network that gives a subnet to each host for use with container runtimes.

- One virtual IP per container.
- Supports cross-machine container communication.
- Pros: easy to use; the cleanest way to handle port allocation.
- Cons: performance overhead.
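As an illustration (not on the original slide), flannel reads its cluster-wide network configuration from etcd; a minimal config that carves a /24 subnet per host out of a /16 overlay using the VXLAN backend looks like this:

{
  "Network": "10.1.0.0/16",
  "SubnetLen": 24,
  "Backend": {
    "Type": "vxlan"
  }
}

Each container then gets an IP inside its host's /24, so containers can reach each other across machines without any port mapping; the encapsulation done by the backend is also where the performance cost mentioned above comes from.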

Kubernetes Networking Support

Service IP and ports (flannel is used):

- In the service spec, include the container selector and the container ports that need to be exposed.
- Kubernetes will provide a cluster-only virtual IP and port that can be used to access the container at the designated port.

kind: Service
apiVersion: v1
metadata:
  name: {{ svc["serviceId"] }}
  labels:
    run: {{ svc["jobId"] }}
spec:
  selector:   # selector body not shown on the slide
  ports:
  - name: {{ svc["port-name"] }}
    protocol: {{ svc["port-type"] }}
    port: {{ svc["port"] }}
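Once the rendered Service is created, the cluster-only virtual IP can be read back with kubectl (the service name here is illustrative):

kubectl get service myjob-svc -o jsonpath='{.spec.clusterIP}'

The IP is reachable only from inside the cluster network, which is what "cluster-only" means above.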

Kubernetes Networking Support

NodePort (flannel is not required):

- In the service spec, include the container selector and the container ports.
- Kubernetes will automatically select a usable port on the host machine and map the host port to the container port.

kind: Service
apiVersion: v1
metadata:
  name: {{ svc["serviceId"] }}
  labels:
    run: {{ svc["jobId"] }}
spec:
  type: NodePort
  selector:   # selector body not shown on the slide
  ports:
  - name: {{ svc["port-name"] }}
    protocol: {{ svc["port-type"] }}
    port: {{ svc["port"] }}
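The host port Kubernetes allocates (by default from the 30000-32767 range) can be read back once the service exists; the container is then reachable at <node-ip>:<nodePort> from outside the cluster (service name again illustrative):

kubectl get service myjob-svc -o jsonpath='{.spec.ports[0].nodePort}'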

Kubernetes Networking Support

NIC mapping (best performance for distributed training jobs): map the NIC to the container directly by running the pod on the host network.

apiVersion: v1
kind: Pod
metadata:
  name: {{ job["jobId"] }}-{{ job["distId"] }}
  labels:
    run: {{ job["jobId"] }}
    jobName: {{ job["jobNameLabel"] }}
    distRole: {{ job["distRole"] }}
    distPort: "{{ job["containerPort"] }}"
spec:
  hostNetwork: true
  {% if job["nodeSelector"]|length > 0 %}
  nodeSelector:
    {% for key, value in job["nodeSelector"].items() %}
    {{ key }}: {{ value }}
    {% endfor %}
  {% endif %}
  containers:
  - name: {{ job["jobId"] }}
    image: {{ job["image"] }}
    command: {{ job["LaunchCMD"] }}
    # Container port and host port must be the same under hostNetwork.
    ports:
    - containerPort: {{ job["containerPort"] }}
      hostPort: {{ job["containerPort"] }}
    {% if job["distRole"] == "worker" %}
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: {{ job["resourcegpu"] }}
    {% endif %}
    volumeMounts:
    {% for mp in job["mountPoints"] %}
    - mountPath: {{ mp.containerPath }}
      name: {{ mp.name }}
    {% endfor %}
  restartPolicy: Never
  volumes:
  {% for mp in job["mountPoints"] %}
  - name: {{ mp.name }}
    hostPath:
      path: {{ mp.hostPath }}
  {% endfor %}
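The {{ ... }} and {% ... %} placeholders are Jinja-style. As a sketch (not the actual DLWorkspace code), the job manager could render this template with a job record like the one below before handing the result to Kubernetes; all parameter values here are made up:

import yaml
from jinja2 import Template

# Hypothetical job record; keys mirror the placeholders in the template above.
job = {
    "jobId": "job-1234",
    "distId": "worker0",
    "jobNameLabel": "mnist-dist",
    "distRole": "worker",
    "containerPort": 2222,
    "nodeSelector": {},
    "image": "tensorflow/tensorflow:latest-gpu",
    "LaunchCMD": '["python", "train.py"]',
    "resourcegpu": 1,
    "mountPoints": [
        {"name": "job", "containerPath": "/job", "hostPath": "/dlws/job-1234"},
    ],
}

with open("pod_template.yaml.j2") as f:          # the template shown above
    rendered = Template(f.read()).render(job=job)

pod_spec = yaml.safe_load(rendered)              # dict ready to submit to k8s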

Networking – Expose Ports

Training jobs:
- Map NICs to the container.
- Provide usable ports via command-line parameters and environment variables (see the sketch after this list).
- Open question: how do we force users to use the designated ports?

Interactive jobs:
- Expose ports for HTTP access, SSH access, etc. (lightweight traffic).
- Use Kubernetes NodePort (equivalent to Docker port mapping).
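As an illustration of the environment-variable convention (the variable name is hypothetical, not from the slides): the launcher exports the port it reserved, and the training script binds to it instead of hard-coding one.

import os

# Hypothetical variable exported by the job launcher with the reserved port.
port = int(os.environ.get("DLWS_CONTAINER_PORT", "2222"))
print(f"binding distributed-training endpoint to port {port}")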

Launch the Interactive Jobs

Job templates pre-configure the job parameters (Docker image, command line). For example:

TensorFlow IPython:
- Docker image: tensorflow/tensorflow:latest
- Command line:
  export HOME=/job && jupyter notebook --no-browser --port=8888 --ip=0.0.0.0 --notebook-dir=/

TensorFlow SSH:
- Docker image: tensorflow/tensorflow:latest-gpu
- Command line:
  apt-get update && apt-get install -y openssh-server sudo && addgroup --force-badname --gid 500000513 domainusers && adduser --force-badname --home /home/hongzl --shell /bin/bash --uid 522318884 --gecos '' hongzl --disabled-password --gid 500000513 && adduser hongzl sudo && echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers && mkdir -p /root/.ssh && cat /work/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys && mkdir -p /home/hongzl/.ssh && cat /work/.ssh/id_rsa.pub >> /home/hongzl/.ssh/authorized_keys && service ssh restart && sleep infinity
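A minimal sketch of how such templates could be stored as data (the registry structure and names are assumptions, not the DLWorkspace schema; the SSH command is abbreviated):

# Hypothetical template registry: each entry pre-fills image, command, ports.
JOB_TEMPLATES = {
    "tensorflow-ipython": {
        "image": "tensorflow/tensorflow:latest",
        "cmd": "export HOME=/job && jupyter notebook --no-browser "
               "--port=8888 --ip=0.0.0.0 --notebook-dir=/",
        "ports": [8888],
    },
    "tensorflow-ssh": {
        "image": "tensorflow/tensorflow:latest-gpu",
        "cmd": "apt-get update && apt-get install -y openssh-server sudo "
               "&& service ssh restart && sleep infinity",  # abbreviated
        "ports": [22],
    },
}

def instantiate(template_name, user):
    """Turn a template plus a user name into a concrete job spec (sketch)."""
    t = JOB_TEMPLATES[template_name]
    return {"user": user, "image": t["image"], "cmd": t["cmd"], "ports": t["ports"]}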

Policy (Open Discussion…)

Interactive jobs can be expensive: they hold GPUs even while the user is idle. We need to design an efficient policy.
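Purely as a discussion starter, one candidate policy is an idle timeout; everything below (threshold, field names, helpers) is hypothetical:

from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(hours=4)  # hypothetical threshold

def reap_idle_interactive_jobs(jobs, last_activity, kill_job):
    """Kill interactive jobs whose last user activity is too old (sketch).

    jobs: job dicts with "jobId" and "jobType".
    last_activity: jobId -> datetime of last SSH/notebook activity.
    kill_job: callback that stops the job and releases its GPUs.
    """
    now = datetime.utcnow()
    for job in jobs:
        if job["jobType"] != "interactive":
            continue
        seen = last_activity.get(job["jobId"], now)
        if now - seen > IDLE_LIMIT:
            kill_job(job["jobId"])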

Job Scheduling

Discussion

- How to configure GPU resource quota for each team?
- How to implement preemption?

How to configure GPU resource quota for each team?

https://github.com/MSRCCS/DLWorkspace/blob/alpha.v1.0/src/ClusterManager/job_manager.py

if check_quota(job):
    SubmitJob(job)
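A sketch of what check_quota could look like (the quota table and field names are assumptions; see job_manager.py in the repo for the real logic):

# Hypothetical per-team GPU quotas.
TEAM_GPU_QUOTA = {"vision": 64, "speech": 32, "nlp": 32}

def check_quota(job, running_jobs):
    """True if the job's team stays within its GPU quota after submission (sketch)."""
    team = job["team"]
    in_use = sum(j["gpus"] for j in running_jobs if j["team"] == team)
    return in_use + job["gpus"] <= TEAM_GPU_QUOTA.get(team, 0)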

Support Job Priority?

https://github.com/MSRCCS/DLWorkspace/blob/alpha.v1.0/src/ClusterManager/job_manager.py

pendingJobs = get_job_priority(pendingJobs)
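One way get_job_priority could order the queue (the priority field and FIFO tie-breaking are assumptions):

def get_job_priority(pending_jobs):
    """Sort pending jobs: higher priority first, FIFO within a level (sketch)."""
    return sorted(
        pending_jobs,
        key=lambda j: (-j.get("priority", 0), j["submitTime"]),
    )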

How to implement preemption?

Jobs need to be labeled "allow preemption". Find the jobs that can be preempted:

kubectl get pod -o yaml --show-all -l preemption=allow

To preempt a job:
- Kill the job in Kubernetes.
- Set the job status back to "queued" to allow rescheduling.

apiVersion: v1
kind: Pod
metadata:
  name: {{ job["jobId"] }}
  labels:
    run: {{ job["jobId"] }}
    jobName: {{ job["jobNameLabel"] }}
    userName: {{ job["userNameLabel"] }}
    preemption: allow
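A minimal sketch of the preemption loop built around the kubectl query above (JSON output is used instead of YAML for easy parsing; the requeue callback is hypothetical):

import json
import subprocess

def preempt_all(requeue_job):
    """Kill preemptible pods and mark their jobs as queued again (sketch)."""
    out = subprocess.check_output(
        ["kubectl", "get", "pod", "-l", "preemption=allow", "-o", "json"]
    )
    for pod in json.loads(out)["items"]:
        name = pod["metadata"]["name"]
        subprocess.check_call(["kubectl", "delete", "pod", name])
        # The "run" label carries the jobId in the pod template above.
        requeue_job(pod["metadata"]["labels"]["run"])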