DL (Deep Learning) Workspace

Slides:



Advertisements
Similar presentations
Technical Architectures
Advertisements

Object Oriented Databases by Adam Stevenson. Object Databases Became commercially popular in mid 1990’s Became commercially popular in mid 1990’s You.
VARONIS OVERVIEW DATA GOVERNANCE & SECURE FILE SHARING JUNE 5, 2013 Presented By: Dietrich Benjes VARONIS SYSTEMS. PROPRIETARY & CONFIDENTIAL.
Production Data Grids SRB - iRODS Storage Resource Broker Reagan W. Moore
Learningcomputer.com SQL Server 2008 – Administration, Maintenance and Job Automation.
Microsoft Azure SoftUni Team Technical Trainers Software University
0 SharePoint Search 2013 Rafael de la Cruz SharePoint Developer Seneca Resources twitter.com/delacruz_rafael
Chapter 8 Configuring and Managing Shared Folder Security.
Integrated Mobile Marketing Platform Emergic mConnector Integrated -Mobile Marketing Platform Presented By: Sales Person Name ID: Mobile:
SQL Server 2012 Session: 1 Session: 4 SQL Azure Data Management Using Microsoft SQL Server.
Tutorial on Science Gateways, Roma, Catania Science Gateway Framework Motivations, architecture, features Riccardo Rotondo.
AA202: Performance Enhancers for Laserfiche Connie Anderson, Technical Writer.
Tool Support for Testing Classify different types of test tools according to their purpose Explain the benefits of using test tools.
The Holmes Platform and Applications
Md Baitul Al Sadi, Isaac J. Cushman, Lei Chen, Rami J. Haddad
Top 10 Entity Framework Features Every Developer Should Know
Architecture Review 10/11/2004
Job Scheduling and Runtime in DLWorkspace
CLOUD ARCHITECTURE Many organizations and researchers have defined the architecture for cloud computing. Basically the whole system can be divided into.
Jean-Philippe Baud, IT-GD, CERN November 2007
Business System Development
Fan Engagement Solution
Project Advisor: Dr. Jerry Gao
Build /26/2018 6:17 AM Building Resilient, Scalable Services with Microsoft Azure Service Fabric Érsek © 2015 Microsoft Corporation.
DL (Deep Learning) Workspace
WWW and HTTP King Fahd University of Petroleum & Minerals
Connected Maintenance Solution
AlwaysOn Mirroring, Clustering
GWE Core Grid Wizard Enterprise (
Open Source Toolkit for Turn-Key AI Cluster (Introduction)
Consulting Services JobScheduler Architecture Decision Template
UI-Performance Optimization by Identifying its Bottlenecks
Connected Maintenance Solution
Server Concepts Dr. Charles W. Kann.
Microsoft SharePoint Server 2016
Lesson 11: Web Services & API's
Did your feature got in, out or planned?
Dmytro Mykhailov How HashiCorp platform tools can make the difference in development and deployment Target and goal of HashiCorp.
Processes The most important processes used in Web-based systems and their internal organization.
DL (Deep Learning) Workspace
IBM Data Server Gateway for OData
Cloud based Open Source Backup/Restore Tool
#01 Client/Server Computing
New Mexico State University
Getting Started with LANGuardian
Using docker containers
NASA Technical Report Server (NTRS) Project Overview April 2, 2003
02 | Hosting Services in Windows Azure
Open Source Toolkit for Turn-Key AI Cluster (Introduction)
Microsoft Connect /17/ :34 AM
VOLTHA Lock-In January 10 & 11, 2018.
Developing for the cloud with Visual Studio
Clouds & Containers: Case Studies for Big Data
Learn. Imagine. Build. .NET Conf
A 5-minute overview of ADAudit Plus
Chapter 7 –Implementation Issues
Internet and Web Simple client-server model
PROJECT PROGRESS PRESENTATION
Why Threads Are A Bad Idea (for most purposes)
5 Azure Services Every .NET Developer Needs to Know
Agenda Need of Cloud Computing What is Cloud Computing
Why Threads Are A Bad Idea (for most purposes)
Why Threads Are A Bad Idea (for most purposes)
Message Passing Systems Version 2
Mark Quirk Head of Technology Developer & Platform Group
Johan Lindberg, inRiver
Microservices – What Exactly Am I Securing Again?
Research on edge computing system based on Linux EdgeX Foundry
#01 Client/Server Computing
Presentation transcript:

DL (Deep Learning) Workspace Engineering Practice and Lessons Learnt Hongzhi Li, Jin Li, Sanjeev Mehrotra

Key Engineering Practice Separate Configuration & code Microservice Architecture with Mostly Stateless Module High Quality Modules

Separate Configuration & code Benefit: Significant code reuse and stability of the code base A default configuration (in deploy.py), checked into repo Cluster specific configuration (in config.yaml) is not checked in Backup/restore operation: Backup/restore cluster related configuration + keys from/to a blob Code + configuration will be rendered to a location that is gitignored, and executed there State of the cluster managed by database (SQL Azure) User database (who is authenticated, who is admin, etc..) Scheduling database (what job is being scheduled, deleted, etc..)

Microservice Architecture With mostly Stateless Modules Build DL workspace as a collection of Microservice Minimum dependency among services (so that cost of switching a module is low) OpenID authentication, secured etcd/kubernete clusters MySQL vs SQL CoreOS vs Ubuntu Asp .Net core vs flask API service File share: NFS, HDFS, GlusterFS, CIFS (Azure File Share) Minimal/no change to other module when a module need to be changed/updated Stateless microservice is strongly preferred States are preserved in either SQL server or etcd servers

High Quality Module Evaluate the quality of a module before using it Docker is of good quality, and has been stable Kubernete is of good quality, and has been stable Nvidia-docker (preferred platform for DL workload) Zombie process (Nvidia driver) Try best not to hack a module Use docker/nvidia-docker/kubernete/glusterfs/hdfs as is Minimal code, and most of the issues we have encountered have also been encountered by the community

Backup

Modern microservice architecture Embrace micro-services Hundreds/thousands of independent services forms an ecosystem Each microservice should evolve by its own (created/justified/deprecated through usage, not top-down design) Each service is single purpose, with simple and well-defined API, modular and independent Goals of service owner: meet the needs of my clients, at minimum cost and effort Standardize communication (network protocols, data formats, schema between services), rather than service themselves A service became standard by being better than its alternative Standardize infrastructure (cluster management, monitoring, diagnostic, alerting, etc..) No need to standardize internals (e.g., programming language, framework, persistence) Encourage open-source like practice: Good documentation from the get go Searchable code/documentation and discussion forum Microservices ecosystem highly encourage open source development model. Each microservice should be well documented, with searchable source code and ease to try out examples to encourage the adoption of each service independently. ----- From Abhi -------------  Let me elaborate with a few examples... In case of Skype real-time workloads, the engineers initially had 3 micro-services: audio service, video service and transport service, with real-time requirement of 150-250ms from client connecting over Internet to three micro-services, the latencies of synchronizing across these micro services did not leave enough budget for the delay over Internet, there was a lot of basic problems and complex synchronization code as well. That resulted in a full and a very costly rearchitecture to run these three micro services into one consolidated service called audio video conferencing service. If the engineers had assessed need for consolidation of these micro-services upfront based on real-time requirements, that would have resulted in the right consolidation from the beginning. They got carried away with embracing micro-services without understanding the scenario requirements fully.  I already gave example of speech, where initially there were three micro services: Bing front door, lobby and speech recognition service, and that lobby got consolidated into Bing Front door. In case of retail, we have scenarios needing both real-time responsiveness and non real-time. For example, recognizing a product as it's put in a smart cart, that needs to happen quickly, as items can be put in the cart back to back or taken out back to back, we can lose tracking of these if the detection is too slow. For this scenario, we plan to be deliberate about consolidation as much as needed including doing edge compute of relevant services. So that's not just consolidation but also running it outside of cloud, right on the edge. But there are other scenarios where we don't have real-time requirement like business analytics based on demographic info of people coming into the store and their behaviors for example what they  like/touch, for these we'll have micro services model to stitch info from different signals/micro services for the analytics. Thx, Abhi