TensorFlow on Kubernetes with GPU Enabled

Presentation transcript:

TensorFlow on Kubernetes with GPU Enabled
Zeyu Zheng, Chief Data Scientist, Caicloud
Huizhi Zhao, Director of Engineering, Caicloud

Agenda
TensorFlow
Distributed TensorFlow
GPU capabilities in Kubernetes
TensorFlow on Kubernetes with GPU enabled
TensorFlow as a Service (TaaS)

Deep Learning

Deep Learning Before After

TensorFlow Introduction

Deploy TensorFlow on 1 server
Please be aware that TensorFlow already supports Nvidia GPUs.
It is easy to deploy TensorFlow on 1 server (or on your laptop).
Nvidia GPU requirements:
GPU card with CUDA Compute Capability 3.0 or higher
CUDA toolkit 7.0 or greater
cuDNN v3 or greater
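The three minimums on this slide can be checked mechanically before installing anything. The sketch below is only an illustration: the function names are hypothetical, and it assumes the caller has already obtained the version strings (for example from the GPU vendor's tools); it does not query the hardware itself.

```python
from typing import Dict, Tuple

# Minimum versions quoted on the slide.
MIN_COMPUTE_CAPABILITY = (3, 0)
MIN_CUDA = (7, 0)
MIN_CUDNN = (3,)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a string such as '7.5' or 'v3' into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("vV").split("."))


def _pad(version: Tuple[int, ...], width: int) -> Tuple[int, ...]:
    """Right-pad with zeros so (3,) and (3, 0) compare as equal."""
    return tuple(version) + (0,) * (width - len(version))


def meets_minimum(found: Tuple[int, ...], minimum: Tuple[int, ...]) -> bool:
    width = max(len(found), len(minimum))
    return _pad(found, width) >= _pad(minimum, width)


def check_gpu_stack(compute_capability: str, cuda: str, cudnn: str) -> Dict[str, bool]:
    """Report, per component, whether the slide's minimum is satisfied."""
    return {
        "compute_capability": meets_minimum(
            parse_version(compute_capability), MIN_COMPUTE_CAPABILITY
        ),
        "cuda": meets_minimum(parse_version(cuda), MIN_CUDA),
        "cudnn": meets_minimum(parse_version(cudnn), MIN_CUDNN),
    }
```

Calling `check_gpu_stack("2.1", "8.0", "v5")` would flag only the compute capability as too low.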

Distributed TensorFlow environment Reference: https://www.tensorflow.org/extend/architecture

Distributed TensorFlow with GPU enabled

Distributed TensorFlow causes management mayhem

GPU capabilities in Kubernetes
Device mapping in Docker
Discover GPUs in kubelet
Assign/free GPUs in kubelet
Schedule GPU resource in kube-scheduler

Device mapping in Docker

    docker run -it --device /dev/nvidia0:/dev/nvidia0 \
        --device /dev/nvidia1:/dev/nvidia1 \
        --device /dev/nvidiactl:/dev/nvidiactl \
        --device /dev/nvidia-uvm:/dev/nvidia-uvm \
        tf-cuda:v1.1beta /bin/bash

Docker inspect
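The command on this slide follows a regular pattern: one `--device` flag per GPU, plus the two shared control devices. A minimal sketch of generating that argument list is shown here; the helper name `docker_gpu_command` is hypothetical, and the code only builds the argument list without running Docker.

```python
from typing import Iterable, List

# Control devices that appear alongside the per-GPU devices in the slide's command.
CONTROL_DEVICES = ["/dev/nvidiactl", "/dev/nvidia-uvm"]


def docker_gpu_command(gpu_indices: Iterable[int], image: str,
                       command: str = "/bin/bash") -> List[str]:
    """Build a `docker run` argv that maps the given GPUs into the container."""
    devices = [f"/dev/nvidia{index}" for index in gpu_indices] + CONTROL_DEVICES
    argv = ["docker", "run", "-it"]
    for path in devices:
        # Same path on host and in container, as on the slide.
        argv += ["--device", f"{path}:{path}"]
    argv += [image, command]
    return argv


if __name__ == "__main__":
    print(" ".join(docker_gpu_command([0, 1], "tf-cuda:v1.1beta")))
```

The resulting list can be passed to `subprocess.run` on a host where Docker and the Nvidia driver are installed.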

Discover GPUs in kubelet
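The slide does not say how discovery works. One plausible approach, offered here purely as an assumption, is to look for device nodes named `nvidia0`, `nvidia1`, and so on under `/dev`, skipping control devices such as `nvidiactl`. The function below is a hypothetical sketch of that idea, not kubelet code.

```python
import re
from typing import Iterable, List

# Assumption: GPU device nodes are named "nvidia" followed by digits.
GPU_DEVICE_PATTERN = re.compile(r"^nvidia([0-9]+)$")


def discover_gpus(dev_entries: Iterable[str]) -> List[str]:
    """Return GPU device paths, ordered by GPU index, from a /dev listing."""
    indexed = []
    for name in dev_entries:
        match = GPU_DEVICE_PATTERN.match(name)
        if match:
            indexed.append((int(match.group(1)), name))
    return [f"/dev/{name}" for _, name in sorted(indexed)]
```

On a real node the listing could come from `os.listdir("/dev")`; passing the names in as an argument keeps the sketch testable on machines without GPUs.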

Assign/Free GPUs in kubelet
The kubelet manages which GPUs are assigned to a new container, and marks a GPU as free again once the container is killed or dead.
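The bookkeeping described above can be pictured as a small allocator that tracks free devices and the devices held by each container. This is a simplified sketch with hypothetical names (`GpuAllocator`, `assign`, `release`), not the kubelet's actual implementation.

```python
from typing import Dict, Iterable, List


class GpuAllocator:
    """Tracks which GPU devices are free and which container holds which."""

    def __init__(self, device_paths: Iterable[str]) -> None:
        self._free: List[str] = list(device_paths)
        self._assigned: Dict[str, List[str]] = {}  # container id -> devices

    def assign(self, container_id: str, count: int) -> List[str]:
        """Give `count` whole GPUs to a new container."""
        if container_id in self._assigned:
            raise ValueError(f"{container_id} already has GPUs assigned")
        if count > len(self._free):
            raise RuntimeError("not enough free GPUs on this node")
        granted, self._free = self._free[:count], self._free[count:]
        self._assigned[container_id] = granted
        return granted

    def release(self, container_id: str) -> List[str]:
        """Return a dead container's GPUs to the free pool."""
        freed = self._assigned.pop(container_id, [])
        self._free.extend(freed)
        return freed

    def free_count(self) -> int:
        return len(self._free)
```

Each device is either in the free list or held by exactly one container, which matches the one-GPU-per-container rule stated later in the talk.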

Schedule GPU resource in kube-scheduler
1. Kube-scheduler knows how many free GPUs each kubelet has.
2. Only dedicated GPUs are supported.
3. For now, 1 GPU can be assigned to only 1 container, but 1 container can have more than 1 GPU.
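In scheduling terms, the three points above amount to a filter: a node is a candidate only if its count of free whole GPUs covers the request. The sketch below illustrates that filter with a hypothetical function and made-up node names; it is not kube-scheduler code.

```python
from typing import Dict, List


def schedulable_nodes(free_gpus: Dict[str, int], requested: int) -> List[str]:
    """Return the nodes whose free GPU count can satisfy the request.

    GPUs are dedicated, so the request is a whole number of devices and a
    node qualifies only if it has at least that many free.
    """
    if requested < 0:
        raise ValueError("requested GPU count must be non-negative")
    return sorted(node for node, free in free_gpus.items() if free >= requested)


if __name__ == "__main__":
    # Hypothetical cluster state: node name -> free GPUs reported by its kubelet.
    cluster = {"node-a": 0, "node-b": 2, "node-c": 4}
    print(schedulable_nodes(cluster, 2))
```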

What should we do next?
CRI support
NVML support
GPU driver volume support
NCCL support

TensorFlow on Kubernetes

TensorFlow on Kubernetes with GPU enabled

Best practice
Reduce network communication to ease bandwidth shortages.
Use high-frequency GPUs rather than server GPUs.
Always save your training and serving data on a volume rather than inside the container. This protects the trained and served model, and it also lets training recover and resume from the saved training step.
Deploy multiple parameter servers and place them on different servers to balance the network load.
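The last recommendation can be made concrete with a cluster layout in which every parameter server sits on its own host. The sketch below builds a dictionary in the `{"ps": [...], "worker": [...]}` shape that distributed TensorFlow uses for cluster definitions. The function name, the host names, and port 2222 are assumptions for illustration, as is the choice to put workers on the remaining hosts.

```python
from typing import Dict, List, Sequence


def build_cluster(servers: Sequence[str], num_ps: int,
                  port: int = 2222) -> Dict[str, List[str]]:
    """Place each parameter server on a different host; the rest run workers."""
    if num_ps < 1:
        raise ValueError("at least one parameter server is required")
    if num_ps >= len(servers):
        raise ValueError("need more servers than parameter servers")
    ps_hosts = servers[:num_ps]       # one parameter server per distinct host
    worker_hosts = servers[num_ps:]   # remaining hosts run workers
    return {
        "ps": [f"{host}:{port}" for host in ps_hosts],
        "worker": [f"{host}:{port}" for host in worker_hosts],
    }


if __name__ == "__main__":
    # Hypothetical host names.
    print(build_cluster(["gpu-1", "gpu-2", "gpu-3", "gpu-4"], num_ps=2))
```

Spreading the parameter servers this way means gradient traffic from the workers is split across several network interfaces instead of converging on one.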

What is TensorFlow as a Service (TaaS)
[web page screenshot]
TaaS = hosted, managed, and optimized TensorFlow with multiple developed models for real-world industrial solutions

Comparison with the original TensorFlow environment

Environment Setup
  Single server
    Original TensorFlow: pip or docker image
    Caicloud TaaS: Integrated Caicloud TaaS image
  Distributed environment
    Original TensorFlow: Set up servers one by one
    Caicloud TaaS: No need to do anything
  Resource management
    Original TensorFlow: Usually, TensorFlow occupies all the resources
    Caicloud TaaS: Based on Kubernetes, resources can be isolated

Model Training
  Training
    Original TensorFlow: The user needs to configure every parameter on each server
    Caicloud TaaS: Upload your model file and configure the parameters in the web UI

Monitor Management
  Monitor
    Original TensorFlow: Save logs and configure TensorBoard manually
    Caicloud TaaS: Saves logs and configures TensorBoard automatically

Model Serving
  Model API serving
    Original TensorFlow: The user needs to implement it themselves
    Caicloud TaaS: Exports the TensorFlow model automatically, and supports RESTful and gRPC online model serving

Main features – Training Configure

Main features – Training Monitor

Main features - Model Serving Host

Main features - Storage

TaaS Training Resource Queue

TaaS Training Resource Pool

AI general case

What our company produces

Contact us
Facebook: https://www.facebook.com/Caicloud-108906236310777/
Twitter: https://twitter.com/Caicloud_io

Q & A
Thank you