TensorFlow on Kubernetes with GPU Enabled


TensorFlow on Kubernetes with GPU Enabled
Zeyu Zheng, Chief Data Scientist, Caicloud
Huizhi Zhao, Director of Engineering, Caicloud

Agenda
- TensorFlow
- Distributed TensorFlow
- GPU capabilities in Kubernetes
- TensorFlow on Kubernetes with GPU enabled
- TensorFlow as a Service (TaaS)

Deep Learning

Deep Learning Before After

TensorFlow Introduction

Deploy TensorFlow on 1 server
TensorFlow already supports Nvidia GPUs, and it is easy to deploy TensorFlow on one server (or on your laptop).
Nvidia GPU requirements:
- GPU card with CUDA Compute Capability 3.0 or higher
- CUDA toolkit 7.0 or greater
- cuDNN v3 or greater
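The three minimum versions above can be checked mechanically before installing anything. The following is a minimal sketch, not part of the original slides: the helper names `parse_version` and `meets_requirements` are hypothetical, and the version strings are assumed to be supplied by the user rather than probed from the driver.

```python
from typing import Tuple

# Minimum versions as stated on the slide, padded to three components.
MIN_COMPUTE_CAPABILITY = (3, 0, 0)
MIN_CUDA_TOOLKIT = (7, 0, 0)
MIN_CUDNN = (3, 0, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '7.5' or 'v3' into a zero-padded tuple such as (7, 5, 0) or (3, 0, 0)."""
    parts = [int(part) for part in text.lstrip("vV").split(".")]
    # Pad short versions so that '3' compares equal to '3.0.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_requirements(compute_capability: str, cuda_toolkit: str, cudnn: str) -> bool:
    """Return True when all three versions reach the slide's minimums (hypothetical helper)."""
    return (
        parse_version(compute_capability) >= MIN_COMPUTE_CAPABILITY
        and parse_version(cuda_toolkit) >= MIN_CUDA_TOOLKIT
        and parse_version(cudnn) >= MIN_CUDNN
    )
```

For example, `meets_requirements("6.1", "8.0", "5.1")` passes, while a card with compute capability 2.1 fails regardless of the toolkit installed.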

Distributed TensorFlow environment Reference: https://www.tensorflow.org/extend/architecture

Distributed TensorFlow with GPU enabled

Distributed TensorFlow causes management mayhem

GPU capabilities in Kubernetes
- Device mapping in Docker
- Discover GPUs in kubelet
- Assign/free GPUs in kubelet
- Schedule GPU resource in kube-scheduler

Device mapping in Docker

    docker run -it --device /dev/nvidia0:/dev/nvidia0 \
        --device /dev/nvidia1:/dev/nvidia1 \
        --device /dev/nvidiactl:/dev/nvidiactl \
        --device /dev/nvidia-uvm:/dev/nvidia-uvm \
        tf-cuda:v1.1beta /bin/bash

Docker inspect
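The command on this slide follows a regular pattern: one `--device` flag per GPU card plus the two control devices. As an illustrative sketch only (the function name `docker_gpu_command` is hypothetical and not from the slides), the argument list could be generated for any set of GPU indices like this:

```python
from typing import List

# Control devices that appear alongside the per-card devices on the slide.
CONTROL_DEVICES = ["/dev/nvidiactl", "/dev/nvidia-uvm"]


def docker_gpu_command(gpu_indices: List[int], image: str,
                       command: str = "/bin/bash") -> List[str]:
    """Build a `docker run` argument list that maps the given GPUs into a container."""
    devices = [f"/dev/nvidia{index}" for index in gpu_indices] + CONTROL_DEVICES
    args = ["docker", "run", "-it"]
    for device in devices:
        # Map each host device to the same path inside the container.
        args += ["--device", f"{device}:{device}"]
    return args + [image, command]
```

Calling `docker_gpu_command([0, 1], "tf-cuda:v1.1beta")` reproduces the arguments of the command shown on the slide; the list can be passed to `subprocess.run` on a host that has Docker and the Nvidia driver installed.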

Discover GPUs in kubelet

Assign/Free GPUs in kubelet
The kubelet manages which GPUs are assigned to a new container, and marks those GPUs as free again once the container is killed or dead.
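The bookkeeping described here amounts to a free set plus a per-container assignment map. The sketch below illustrates that idea in Python; it is not the kubelet's actual implementation, and the class `GpuAllocator` and its method names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


class GpuAllocator:
    """Tracks which GPU devices are free and which container holds the rest."""

    def __init__(self, device_paths: Iterable[str]) -> None:
        self._free: Set[str] = set(device_paths)
        self._assigned: Dict[str, List[str]] = {}

    def assign(self, container_id: str, count: int) -> List[str]:
        """Reserve `count` free GPUs for a new container and return their paths."""
        if container_id in self._assigned:
            raise ValueError(f"container {container_id} already has GPUs")
        if count > len(self._free):
            raise RuntimeError("not enough free GPUs on this node")
        chosen = sorted(self._free)[:count]
        self._free.difference_update(chosen)
        self._assigned[container_id] = chosen
        return chosen

    def release(self, container_id: str) -> None:
        """Return a dead container's GPUs to the free set (no-op if unknown)."""
        self._free.update(self._assigned.pop(container_id, []))

    def free_count(self) -> int:
        return len(self._free)
```

A container that is reported dead would trigger `release`, after which its devices become available to the next `assign` call.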

Schedule GPU resource in kube-scheduler
1. Kube-scheduler knows how many free GPUs each kubelet has.
2. Only dedicated GPUs are supported.
3. For now, one GPU can be assigned to only one container, but one container can have more than one GPU.
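Given a per-node count of free GPUs, the scheduling decision reduces to filtering out nodes that cannot satisfy the request. The following is a simplified sketch of that filter step, assuming whole (dedicated) GPUs only; the function `nodes_that_fit` and the node names are hypothetical, and the real kube-scheduler weighs many other resources and policies.

```python
from typing import Dict, List


def nodes_that_fit(free_gpus_by_node: Dict[str, int], requested_gpus: int) -> List[str]:
    """Return the nodes with enough free dedicated GPUs for one container."""
    if requested_gpus < 0:
        raise ValueError("requested_gpus must be non-negative")
    # GPUs are not shared between containers, so a node fits only if it has
    # at least as many completely free GPUs as the container asks for.
    return sorted(
        node for node, free in free_gpus_by_node.items() if free >= requested_gpus
    )
```

With `{"node-a": 0, "node-b": 2, "node-c": 4}` and a request for 2 GPUs, only `node-b` and `node-c` remain candidates.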

What should we do next?
- CRI support
- NVML support
- GPU driver volume support
- NCCL support

TensorFlow on Kubernetes

TensorFlow on Kubernetes with GPU enabled

Best practice
- Reduce network communication to ease bandwidth shortages.
- Use high-frequency GPUs rather than GPUs designed for servers.
- Always save your training and serving data on a volume rather than inside the container. This preserves the trained and served model, and also lets an interrupted training job recover and resume from its saved training step.
- Deploy multiple parameter servers and place them on different servers to balance the network load.
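The point about saving training state on a volume can be illustrated without TensorFlow. The sketch below is an assumption-laden stand-in: it stores only the step counter in a JSON file under a directory that is assumed to be a mounted volume, whereas a real job would use TensorFlow's own checkpoint files. The names `load_step`, `save_step`, `train` and `state.json` are hypothetical.

```python
import json
import os

STATE_FILE = "state.json"  # hypothetical file name inside the mounted volume


def load_step(checkpoint_dir: str) -> int:
    """Return the last saved training step, or 0 when no checkpoint exists."""
    path = os.path.join(checkpoint_dir, STATE_FILE)
    if not os.path.exists(path):
        return 0
    with open(path) as handle:
        return json.load(handle)["step"]


def save_step(checkpoint_dir: str, step: int) -> None:
    """Write the step to a temporary file, then rename it over the old state."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, STATE_FILE)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump({"step": step}, handle)
    os.replace(tmp_path, path)  # avoids leaving a half-written state file


def train(checkpoint_dir: str, total_steps: int, save_every: int = 10) -> int:
    """Resume from the saved step and run until total_steps is reached."""
    step = load_step(checkpoint_dir)
    while step < total_steps:
        step += 1  # one real training step would run here
        if step % save_every == 0 or step == total_steps:
            save_step(checkpoint_dir, step)
    return step
```

Because the state lives outside the container, a replacement container that mounts the same volume picks up at the saved step instead of starting from zero.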

What is TensorFlow as a Service (TaaS)
(web page screenshot)
TaaS = hosted, managed, and optimized TensorFlow with multiple developed models for real-world industrial solutions

Comparison with the original TensorFlow environment

Operation | Original TensorFlow | Caicloud TaaS
Environment Setup: single server | pip or Docker image | Integrated Caicloud TaaS image
Environment Setup: distributed environment | Set up servers one by one | No need to do anything
Resource management | TensorFlow usually occupies all the resources | Based on Kubernetes, so resources can be isolated
Model Training: training | The user needs to configure every parameter on each server | Upload your model file and configure the parameters in the web UI
Monitor Management: monitor | Save logs and configure TensorBoard manually | Logs are saved and TensorBoard is configured automatically
Model Serving: model API serving | The user needs to implement it themselves | The TensorFlow model is exported automatically, with RESTful and gRPC online model serving supported

Main features – Training Configure

Main features – Training Monitor

Main features - Model Serving Host

Main features - Storage

TaaS Training Resource Queue

TaaS Training Resource Pool

AI general case

What our company produces

Contact Us
Facebook: https://www.facebook.com/Caicloud-108906236310777/
Twitter: https://twitter.com/Caicloud_io

Q & A Thank you