Deploying Deep Learning Models on GPU Enabled Kubernetes Cluster


Deploying Deep Learning Models on GPU Enabled Kubernetes Cluster
Mathew Salvaris (@MSalvaris) & Fidan Boylu Uz (@FidanBoyluUz), Senior Data Scientists at Microsoft

Basics of Deployment - payload, batching, HTTP, Kubernetes; GPU/CPU comparison for inference
Deployment on Kubernetes - using Kubectl, using AzureML, using Kubeflow and TF Serving

Deploying
[Chart: ease of use vs. performance, with "Simple Tutorial" at the bottom right, "Performant deployment" at the top left, and this talk's examples in between]
There are many ways to carry out deployments of deep learning models, so we want to set expectations as to what these examples represent. If we were to plot ease of use against performance, it might look something like this. In the bottom far right are the simple examples that might just use Flask and don't cover putting models in containers, using a proper web server, or deploying on scalable infrastructure. In the top left are the deployments carried out by organisations that expect their models to handle millions of requests and have very strict latency and throughput requirements; these will often not be deployed in Python but in more performant languages. The examples presented here sit in between: they are faster and more performant than basic tutorials, while sacrificing a little ease of use.

[Diagram: RGB channels of input image → convolution layer with kernels → pooling layer → fully connected (penultimate) layer → class scores for cat, dog, mouse]
This should be familiar to everyone: a simple illustration of a CNN. It is an oversimplification, but it is here to get everyone on the same page.

Payload
Two common ways to transfer an image to the service:

Encode into Base64 (JSON):
decoded_img = base64.b64decode(request.json["input"])
img_buffer = BytesIO(decoded_img)
pil_img = Image.open(img_buffer).convert("RGB")

Transfer as file (multipart/form-data):
pil_img = Image.open(request.files['image']).convert("RGB")

If you have followed tutorials online on how to deploy image classification models, you have probably come up against two alternatives for how the data is transferred. A common approach is to encode the image as a Base64 string and send it as JSON; on the server side, you grab the JSON, decode it, then open it with PIL or whatever library you use for loading images. This is nice and simple, since you are just passing a string in the JSON object. The other alternative is not to encode the file at all but to create a POST request with a file in it (multipart/form-data), which is just as easy to handle, as shown above. Note that encoding the image as a string increases the payload size. Our benchmarking showed that this does not significantly affect performance at standard image sizes such as 224x224, but with larger images such as 1280x1280 the performance penalty becomes quite noticeable. A runnable sketch of both endpoints follows.
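A minimal Flask sketch of both payload styles; the route names and the predict() stub are illustrative assumptions, not the talk's actual code.

```python
# Sketch of the two payload styles described above; predict() is a placeholder.
import base64
from io import BytesIO

from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)

def predict(pil_img):
    # Placeholder for the real model scoring function.
    return {"label": "cat", "probability": 0.9}

@app.route("/score/base64", methods=["POST"])
def score_base64():
    # Client sends {"input": "<base64-encoded image bytes>"} as JSON.
    decoded_img = base64.b64decode(request.json["input"])
    pil_img = Image.open(BytesIO(decoded_img)).convert("RGB")
    return jsonify(predict(pil_img))

@app.route("/score/file", methods=["POST"])
def score_file():
    # Client sends the raw image as multipart/form-data under the key "image".
    pil_img = Image.open(request.files["image"]).convert("RGB")
    return jsonify(predict(pil_img))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)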

Batching Requests
[Diagram: request timelines for non-batched GPU (top) and batched GPU (bottom)]
Another thing to consider when operationalising models is that GPUs benefit a lot from batching. What this means in practice is that your endpoint should allow multiple images to be sent in one request. A client-side sketch follows.
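A sketch of sending a batch of images in a single request; the /score/batch endpoint and the JSON shape are illustrative assumptions, not a fixed API.

```python
# Client-side batching: one request carrying several images lets the service
# score them as a single batch on the GPU.
import base64

import requests

def encode(path):
    # Hypothetical helper: read an image file and Base64-encode it.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

payload = {"inputs": [encode(p) for p in ["cat.jpg", "dog.jpg", "mouse.jpg"]]}
response = requests.post("http://localhost:5000/score/batch", json=payload)
print(response.json())
```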

HTTP
[Diagram: client/server timelines over time comparing multiple connections (open/close per request), persistent connections, pipelining with head-of-line blocking, and multiplexing]
Next we talk about the communication between your deployed model and the service calling it. The standard way is with multiple connections, opening and closing one for each request; if you are carrying out multiple requests, this can be quite slow. The second option is persistent connections (keep-alive), which keep the connection open between requests; this is also what we use when we do throughput testing with asynchronous calls. There is also the possibility of pipelining, which allows you to send multiple requests without waiting for a response; the responses, however, have to be returned in order, which leads to the head-of-line blocking problem illustrated here. Finally, HTTP/2 adds multiplexing, which gives the ability of pipelining without the issue of head-of-line blocking.

Use HTTP/2 if possible and try to avoid head-of-line blocking; gRPC can provide better performance than a REST API. So whenever possible you should use HTTP/2 and multiplexing, which is common in gRPC, and gRPC can also offer better performance due to Protobuf. REST messages typically contain JSON. This is not a strict requirement, and in theory you can send anything as a response, but in practice the whole REST ecosystem - including tooling, best practices, and tutorials - is focused on JSON; it is safe to say that, with very few exceptions, REST APIs accept and return JSON. gRPC, on the other hand, accepts and returns Protobuf messages. Just from a performance point of view, Protobuf is a very efficient, packed binary format, while JSON is a textual format; you can compress JSON, but then you lose the benefit of a textual format that you can easily inspect. REST, as mentioned earlier, depends heavily on HTTP (usually HTTP 1.1) and the request-response model, whereas gRPC uses the newer HTTP/2 protocol. A sketch of asynchronous calls over a persistent connection follows.
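A minimal sketch of throughput testing with asynchronous calls over persistent (keep-alive) connections, as described above; the endpoint URL and field name are illustrative assumptions matching the earlier Flask sketch.

```python
# aiohttp reuses connections within a ClientSession, so repeated requests
# avoid the per-request open/close cost of a fresh connection.
import asyncio

import aiohttp

URL = "http://localhost:5000/score/file"

async def score(session, path):
    with open(path, "rb") as f:
        img_bytes = f.read()
    data = aiohttp.FormData()
    data.add_field("image", img_bytes, filename=path, content_type="image/jpeg")
    async with session.post(URL, data=data) as resp:
        return await resp.json()

async def main(paths):
    # One ClientSession keeps connections alive across all requests.
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(score(session, p) for p in paths))
        print(results)

asyncio.run(main(["cat.jpg"] * 10))
```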

Kubernetes
[Diagram: a Kubernetes cluster of three nodes, each node running one or more pods]
Kubernetes is an open-source platform for managing containerized workloads and services. It has a large, rapidly growing ecosystem, and Kubernetes services, support, and tools are widely available. In the diagram above, Kubernetes is managing a cluster of three nodes, each with one or more pods; in the examples later we will see that in some cases one pod equates to one deployed model. Kubernetes has a number of features. It can be thought of as: a container platform, a microservices platform, a portable cloud platform, and a lot more.

CPU vs GPU
A common question is whether CPUs are performant enough for deploying deep learning models. The answer is: it depends. We conducted a set of experiments to test throughput across a number of models and across Keras and TensorFlow. The cluster configurations were 1-3 nodes with K80 GPUs, and 5 nodes with 8 CPUs each, for a total of 40 CPUs in the CPU configuration; price-wise, the 3-node GPU cluster is equivalent to the 5-node CPU cluster. As we can see, CPU performance is dwarfed by GPU performance even with a single GPU, and, as you might expect, smaller models are quicker than larger ones. The requests were sent one by one rather than batched, using persistent connections and asynchronous calls; if they had been sent in batches, we would probably see an even greater gap in performance.

Framework: Keras. Model: ResNet152
In the previous tests we didn't use MKL-DNN for the CPU, so we wanted to see whether things improved by using it. As you can see, even with MKL-DNN, throughput is still nowhere near GPU performance. We also found that GPU performance was more stable. We believe the reason is that with GPUs there is no contention for resources; in the CPU case, the CPU has to carry out the CNN computations as well as handle web requests and everything else.

3 Approaches…
- Deploy on Kubernetes with Kubectl
- Deploy on Kubernetes with Azure Machine Learning (AzureML/AML)
- Deploy on Kubernetes with Kubeflow
The first is deployment using the Kubernetes command line interface, kubectl, with a YAML (or, in our case, JSON) manifest.

Common steps
- Develop model
- Develop model API
- Prepare Docker container for web service
- Deploy on Kubernetes

Deploy on Kubernetes with Kubectl
- Develop model, model API, Flask app
- Build container image with model, model API and Flask app
- Test locally and push image to Docker Hub
- Provision Kubernetes cluster
- Connect to Kubernetes with kubectl
- Deploy application using a manifest (.yaml), as in the sketch below
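A minimal sketch of such a manifest; the image name, labels, and resource limits are illustrative assumptions, not the tutorial's exact values.

```yaml
# Deployment: run 3 replicas of the scoring container on GPU nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dl-model
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dl-model
  template:
    metadata:
      labels:
        app: dl-model
    spec:
      containers:
      - name: dl-model
        image: myrepo/dl-model:latest   # image pushed to Docker Hub
        ports:
        - containerPort: 80
        resources:
          limits:
            nvidia.com/gpu: 1           # schedule each pod on a GPU
---
# Service: a load balancer that distributes scoring requests to the pods.
apiVersion: v1
kind: Service
metadata:
  name: dl-model
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: dl-model
```

Applied with kubectl apply -f manifest.yaml once kubectl is connected to the cluster.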

Deploy on Kubernetes with Kubectl
[Architecture diagram: a client sends an HTTP request with the input data payload to Azure Kubernetes Service; a load balancer distributes scoring requests to pods, each running the Docker image (pulled from Docker Hub) containing the Flask app, model, and scoring function, which returns the prediction in the HTTP response]
https://github.com/Microsoft/AKSDeploymentTutorial.git

Deploy on Kubernetes with AzureML
- Set up AzureML, develop model, register the model to your AzureML workspace, develop model API (scoring script)
- Build image with AzureML using conda dependencies, pip requirements and other dependencies
- Pull image from Azure Container Registry (ACR) and test locally
- Provision Kubernetes cluster and deploy web service with AzureML (see the sketch below)
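A condensed sketch of this flow with the AzureML v1 Python SDK (azureml-core); the workspace config, file names, model name, and VM size are illustrative assumptions.

```python
# AzureML v1 SDK sketch: register model, define image contents, provision AKS,
# deploy the web service. score.py is the scoring script described above.
from azureml.core import Model, Workspace
from azureml.core.compute import AksCompute, ComputeTarget
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AksWebservice

ws = Workspace.from_config()  # reads config.json with subscription details

# Register the trained model in the workspace's model registry.
model = Model.register(workspace=ws, model_path="resnet152.h5",
                       model_name="resnet152")

# Conda/pip dependencies drive the Docker image build; AzureML stores the
# image in the workspace's Azure Container Registry.
env = Environment.from_conda_specification("scoring-env", "environment.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# Provision a GPU-enabled AKS cluster as the deployment target.
aks_target = ComputeTarget.create(
    ws, "gpu-aks", AksCompute.provisioning_configuration(vm_size="Standard_NC6"))
aks_target.wait_for_completion(show_output=True)

# Deploy the web service onto the cluster.
service = Model.deploy(ws, "resnet-service", [model], inference_config,
                       AksWebservice.deploy_configuration(), aks_target)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```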

Deploy on Kubernetes with AzureML
[Architecture diagram: from a Microsoft Data Science Virtual Machine, Azure Machine Learning registers the model in the model registry, builds the Docker image, and pushes it to Azure Container Registry; the image runs as pods on Azure Kubernetes Service, where the web service (model + scoring script) serves scoring requests; AzureML authenticates and distributes the HTTP requests carrying the input data payload to the pods and returns the HTTP response]
https://github.com/Microsoft/AKSDeploymentTutorialAML.git

Deploy on Kubernetes with Kubeflow
- Develop model and save it as a TensorFlow servable
- Pull the TensorFlow Serving image from Docker Hub, mount the model path, open the REST API port, test locally (see the sketch below)
- Create Kubernetes cluster, attach blob storage on AKS, copy the model servable
- Install ksonnet and Kubeflow, and deploy the web service using the Kubeflow TensorFlow Serving component with ksonnet templates
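A sketch of the export and local-test steps, assuming TF 2.x and default TensorFlow Serving conventions; the model name, paths, and port are illustrative, and the cluster deployment itself goes through the Kubeflow component as described above.

```python
# Export a servable and query TF Serving's REST predict endpoint.
import json

import numpy as np
import requests
import tensorflow as tf

# Save in SavedModel format under a numeric version directory, the layout
# TF Serving expects ("/models/resnet/1").
model = tf.keras.applications.ResNet152(weights="imagenet")
tf.saved_model.save(model, "/models/resnet/1")

# With the serving container running locally, e.g.:
#   docker run -p 8501:8501 -v /models/resnet:/models/resnet \
#       -e MODEL_NAME=resnet -t tensorflow/serving
# score an input via the REST API:
img = np.random.rand(1, 224, 224, 3).tolist()  # stand-in for a real image
resp = requests.post("http://localhost:8501/v1/models/resnet:predict",
                     data=json.dumps({"instances": img}))
print(resp.json())
```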

Deploy on Kubernetes with Kubeflow
[Architecture diagram: the TensorFlow Serving Docker image is pulled from Docker Hub and runs as pods on Azure Kubernetes Service, with the model servable mounted from Azure Blob Storage; the TensorFlow Serving component serves scoring requests, distributing the HTTP requests carrying the input image payload to the pods and returning the HTTP response]
https://github.com/Microsoft/DeployDLKubeflowAKS.git

Thank You & Questions
http://aka.ms/aireferencearchitectures
@MSalvaris @FidanBoyluUz