DL (Deep Learning) Workspace


1 DL (Deep Learning) Workspace
Engineering Practice and Lessons Learnt
Hongzhi Li, Jin Li, Sanjeev Mehrotra

2 Key Engineering Practices
- Separate configuration & code
- Microservice architecture with mostly stateless modules
- High-quality modules

3 Separate Configuration & Code
Benefit: significant code reuse and stability of the code base
- A default configuration (in deploy.py) is checked into the repo
- Cluster-specific configuration (in config.yaml) is not checked in
- Backup/restore operation: back up/restore cluster-related configuration and keys from/to a blob
- Code and configuration are rendered to a location that is gitignored, and executed there
- State of the cluster is managed by a database (SQL Azure):
  - User database (who is authenticated, who is an admin, etc.)
  - Scheduling database (which jobs are being scheduled, deleted, etc.)
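The render step above can be sketched as follows. This is a minimal illustration, not DL Workspace code: the real system keeps defaults in deploy.py and cluster settings in a YAML config.yaml, while this sketch uses JSON and invented field names to stay self-contained.

```python
# Sketch of "separate configuration & code": defaults live in the repo,
# cluster-specific overrides do not, and the merged result is rendered
# into a gitignored directory before execution.
import json
from pathlib import Path

# Default configuration, checked into the repo (hypothetical fields).
DEFAULT_CONFIG = {
    "cluster_name": "example",
    "admins": [],
    "sql": {"server": "placeholder", "database": "dlws"},
}

def render_config(cluster_config: dict, out_dir: str = "deploy") -> dict:
    """Overlay cluster-specific settings (never checked in) on the defaults,
    then write the merged result to a gitignored location."""
    merged = dict(DEFAULT_CONFIG)
    for key, value in cluster_config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Shallow merge of nested sections so defaults survive partial overrides.
            merged[key] = {**merged[key], **value}
        else:
            merged[key] = value
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / "config.json").write_text(json.dumps(merged, indent=2))
    return merged
```

Because only the default configuration is versioned, the code base stays identical across clusters; a backup of a cluster is just its override file plus keys, which matches the blob-based backup/restore operation described above.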

4 Microservice Architecture with Mostly Stateless Modules
Build DL Workspace as a collection of microservices
- Minimal dependency among services (so that the cost of switching a module is low):
  - OpenID authentication; secured etcd/Kubernetes clusters
  - MySQL vs. SQL
  - CoreOS vs. Ubuntu
  - ASP.NET Core vs. Flask API service
  - File share: NFS, HDFS, GlusterFS, CIFS (Azure File Share)
- Minimal/no change to other modules when one module needs to be changed or updated
- Stateless microservices are strongly preferred
  - State is preserved in either SQL Server or etcd servers
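The "stateless service, external state" preference can be sketched as follows. The store class and handler names are illustrative assumptions standing in for SQL Server/etcd, not actual DL Workspace APIs.

```python
# Sketch of a stateless microservice: the handlers keep no state between
# calls, so any replica can serve any request. All job records live in an
# external store (SQL/etcd in the real system; a dict-backed stub here).
import uuid

class StateStore:
    """Stand-in for SQL Azure / etcd: the only place state lives."""
    def __init__(self):
        self._jobs = {}

    def put(self, job_id: str, record: dict) -> None:
        self._jobs[job_id] = record

    def get(self, job_id: str):
        return self._jobs.get(job_id)

def submit_job(store: StateStore, user: str, image: str) -> str:
    """Stateless handler: everything it needs is in the request
    or the external store, never in the process itself."""
    job_id = str(uuid.uuid4())
    store.put(job_id, {"user": user, "image": image, "status": "queued"})
    return job_id

def job_status(store: StateStore, job_id: str) -> str:
    record = store.get(job_id)
    return record["status"] if record else "unknown"
```

Because the handlers hold no state, the service process can be replaced (Flask swapped for ASP.NET Core, say) or restarted without losing any jobs, which is what keeps the cost of switching a module low.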

5 High Quality Modules
Evaluate the quality of a module before using it:
- Docker is of good quality and has been stable
- Kubernetes is of good quality and has been stable
- nvidia-docker (the preferred platform for DL workloads)
  - Zombie-process issue (NVIDIA driver)
- Try your best not to hack a module:
  - Use Docker/nvidia-docker/Kubernetes/GlusterFS/HDFS as-is
  - Keep custom code minimal; most of the issues we have encountered have also been encountered by the community
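The zombie-process issue mentioned above comes down to exited children that are never wait()ed on. A minimal sketch (POSIX-only, illustrative rather than DL Workspace or nvidia-docker code) of the non-blocking reaping that a container's init process must perform:

```python
# Sketch of zombie reaping: whatever runs as PID 1 in a container must
# collect exit statuses of orphaned children, or they linger as zombies.
import errno
import os

def reap_zombies() -> int:
    """Reap all exited children without blocking; return how many were reaped."""
    reaped = 0
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except OSError as e:
            if e.errno == errno.ECHILD:  # no children at all
                break
            raise
        if pid == 0:  # children exist, but none have exited yet
            break
        reaped += 1
    return reaped
```

In practice this is why using the module as-is matters: running containers with a proper init (rather than patching the runtime) sidesteps the problem in a way the community has already debugged.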

6 Backup

7 Modern microservice architecture
Embrace microservices
- Hundreds or thousands of independent services form an ecosystem
- Each microservice should evolve on its own (created/justified/deprecated through usage, not top-down design)
- Each service is single-purpose, with a simple and well-defined API, modular and independent
- Goal of a service owner: meet the needs of my clients at minimum cost and effort
- Standardize communication (network protocols, data formats, schemas between services), rather than the services themselves
  - A service becomes a standard by being better than its alternatives
- Standardize infrastructure (cluster management, monitoring, diagnostics, alerting, etc.)
- No need to standardize internals (e.g., programming language, framework, persistence)
- Encourage open-source-like practice:
  - Good documentation from the get-go
  - Searchable code/documentation and a discussion forum

A microservices ecosystem highly encourages an open-source development model. Each microservice should be well documented, with searchable source code and easy-to-try examples, to encourage the adoption of each service independently.

----- From Abhi: Let me elaborate with a few examples. In the case of Skype real-time workloads, the engineers initially had three microservices: an audio service, a video service, and a transport service. With a real-time requirement of milliseconds from a client connecting over the Internet to three microservices, the latency of synchronizing across them did not leave enough budget for the delay over the Internet, and there were many basic problems as well as complex synchronization code. That resulted in a full and very costly rearchitecture to consolidate the three microservices into a single audio/video conferencing service. If the engineers had assessed the need for consolidation upfront, based on the real-time requirements, they would have arrived at the right consolidation from the beginning. They got carried away with embracing microservices without fully understanding the scenario requirements. I already gave the example of speech, where there were initially three microservices, Bing front door, lobby, and the speech recognition service, and the lobby got consolidated into the Bing front door. In retail, we have scenarios needing both real-time responsiveness and non-real-time processing. For example, recognizing a product as it is put into a smart cart needs to happen quickly; items can be put into or taken out of the cart back to back, and we can lose track of them if detection is too slow. For this scenario, we plan to be deliberate about consolidating as much as needed, including running the relevant services as edge compute, so not just consolidation but also running them outside the cloud, right on the edge. But there are other scenarios with no real-time requirement, such as business analytics based on the demographics of people coming into the store and their behaviors (for example, what they like or touch); for these, we will use a microservices model to stitch together information from different signals/microservices for the analytics. Thx, Abhi
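The principle above of standardizing communication rather than the services themselves can be sketched as a shared, versioned message envelope. The envelope fields here are assumptions for illustration, not an actual DL Workspace schema.

```python
# Sketch of "standardize communication, not services": every service exchanges
# a small versioned JSON envelope, while its internals (language, framework,
# persistence) remain its own business.
import json

ENVELOPE_VERSION = 1
REQUIRED_FIELDS = {"version", "service", "kind", "payload"}

def make_envelope(service: str, kind: str, payload: dict) -> str:
    """Wrap a payload in the envelope all services agree on."""
    return json.dumps({
        "version": ENVELOPE_VERSION,
        "service": service,
        "kind": kind,
        "payload": payload,
    })

def parse_envelope(raw: str) -> dict:
    """Validate the shared schema; reject anything a peer cannot rely on."""
    msg = json.loads(raw)
    missing = REQUIRED_FIELDS - msg.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if msg["version"] != ENVELOPE_VERSION:
        raise ValueError(f"unsupported version: {msg['version']}")
    return msg
```

With the wire format fixed, a service written in Flask can be replaced by one written in ASP.NET Core without its peers noticing, which is exactly the independence the ecosystem model asks for.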

