Systems for ML Clipper Gaia Training (TensorFlow)

Presentation transcript:

Systems for ML: Clipper, Gaia, and TensorFlow. Training (TensorFlow, Gaia): runs on a single CPU/GPU, within a data center, or across data centers; techniques used: data parallelism, model parallelism, parameter server; goals: fast training, accuracy. Inference (TensorFlow Serving, Clipper): runs on a single CPU/GPU or within a data center; techniques used: batching, caching; goals: inference query latency, throughput, accuracy.

Clipper, NSDI 2017. D. Crankshaw, X. Wang, G. Zhou, M. Franklin, J. Gonzalez, Ion Stoica. Presented by Vipul Harsh (vharsh2@illinois.edu), CS 525, UIUC.

The need for fast inference: modern large-scale machine learning systems must serve predictions in real time (latency), under heavy query load (throughput), and accurately.

The need for accuracy: immediate feedback about prediction quality can be incorporated into the model, which is not possible with a single static model. Applications: recommender systems, vision applications, speech recognition.

Problem statement: a system for real-time ML inference, running within the datacenter, with latency SLOs and immediate feedback about predictions. Goals: latency, throughput, accuracy; incorporate user feedback; make the system general enough for all frameworks.

Models galore: many models to choose from (SVM, neural net, regression) … from different frameworks (TensorFlow, MLlib, Caffe, etc.). A static model is not enough for a diverse set of clients, and there is room to improve accuracy by using multiple models.

Clipper: a unified prediction interface across models and frameworks, online model selection for accuracy, adaptive batching to meet latency and throughput targets, and caching of predictions.
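To make the unified-interface idea concrete, here is a minimal sketch of a framework-agnostic prediction interface in Python. The names (ModelAdapter, CallableAdapter, predict_batch) are illustrative stand-ins, not Clipper's actual API.

```python
# Minimal sketch of a framework-agnostic prediction interface
# (illustrative names, not Clipper's actual API).
from abc import ABC, abstractmethod
from typing import List, Sequence


class ModelAdapter(ABC):
    """Hides framework differences behind one batch-prediction call."""

    @abstractmethod
    def predict_batch(self, inputs: Sequence) -> List:
        """Return one prediction per input."""


class CallableAdapter(ModelAdapter):
    """Wraps any framework model exposed as a callable, e.g. an sklearn
    estimator's predict method or a TensorFlow SavedModel signature."""

    def __init__(self, predict_fn):
        self.predict_fn = predict_fn

    def predict_batch(self, inputs):
        return list(self.predict_fn(inputs))
```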

Clipper architecture (diagram): user requests enter Clipper's model selection layer and are dispatched to models 1 through n; replies return to the user, and feedback with true labels flows back into Clipper.

Adaptive model selection, two approaches. Single model selection: keep a weight for each model, pick the model with the highest weight, and update its weight based on feedback. Multiple model selection: run inference on all models and ensemble their predictions. Think about the tradeoffs (a sketch of both approaches follows).
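A minimal sketch of the two selection strategies, assuming models expose the predict_batch interface from the earlier sketch; the multiplicative weight update shown is an illustrative rule, not Clipper's exact Exp3/Exp4 algorithms.

```python
import math


class ModelSelector:
    """Sketch of single-model vs. ensemble selection (illustrative only)."""

    def __init__(self, models, eta=0.1):
        self.models = models
        self.eta = eta                      # learning rate (illustrative value)
        self.weights = [1.0] * len(models)

    # --- Single-model selection ---
    def select_single(self):
        """Pick the model with the highest weight."""
        i = max(range(len(self.models)), key=lambda j: self.weights[j])
        return i, self.models[i]

    def update(self, i, loss):
        """Down-weight model i based on observed feedback (loss in [0, 1])."""
        self.weights[i] *= math.exp(-self.eta * loss)

    # --- Multiple-model selection (ensemble) ---
    def predict_ensemble(self, x):
        """Query every model and return the weight-averaged prediction
        (assumes numeric predictions)."""
        total = sum(self.weights)
        return sum((w / total) * m.predict_batch([x])[0]
                   for w, m in zip(self.weights, self.models))
```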

Adaptive model selection, tradeoffs: single-model vs. multiple-model selection trades performance (latency, throughput, sensitivity to stragglers) against accuracy. Think about the tradeoffs.

Model abstraction layer, three components: prediction caching, adaptive batching, and model containers.

Prediction caching: given a model and an example, cache the predicted label; evict with an LRU policy.
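A minimal sketch of such a cache, assuming hashable inputs (the real Clipper cache hashes query inputs and versions models); the names and capacity value are illustrative.

```python
from collections import OrderedDict


class PredictionCache:
    """LRU cache keyed by (model id, input) -> predicted label (sketch)."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, model_id, x):
        key = (model_id, x)
        if key not in self.entries:
            return None                        # cache miss
        self.entries.move_to_end(key)          # mark as most recently used
        return self.entries[key]

    def put(self, model_id, x, prediction):
        key = (model_id, x)
        self.entries[key] = prediction
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
```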

Adaptive batching per model: use a different batch size for each model, chosen to meet the latency SLO. How to determine the correct batch size? AIMD: additively increase the batch size until the latency objective is exceeded, then back off by a small factor (10%). Quantile regression: model latency as a function of batch size and choose the batch size based on the 99th-percentile latency. (A sketch of the AIMD policy follows.)
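A minimal sketch of the AIMD policy under stated assumptions: the 10% backoff factor comes from the slide, while the additive step and starting batch size are illustrative.

```python
class AIMDBatchSizer:
    """Additive-increase / multiplicative-decrease batch sizing against a
    latency SLO (sketch of the policy described on the slide)."""

    def __init__(self, slo_ms, step=1, backoff=0.9):
        self.slo_ms = slo_ms
        self.step = step          # additive increase per observation (illustrative)
        self.backoff = backoff    # multiplicative decrease: 10% backoff
        self.batch_size = 1       # starting size (illustrative)

    def observe(self, batch_latency_ms):
        """Update the batch size after measuring one batch's latency."""
        if batch_latency_ms <= self.slo_ms:
            self.batch_size += self.step                        # probe upward
        else:
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        return self.batch_size
```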

Adaptive Batching per Model

Model containers: each model sits in its own model container, connected to Clipper via RPCs; container replicas can be dynamically scaled up or down. (A sketch of a container's serving loop follows.)
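A minimal sketch of a container serving loop, with a pair of multiprocessing queues standing in for the RPC channel (Clipper's actual transport differs); all names here are illustrative.

```python
import multiprocessing as mp


def model_container(model, requests, replies):
    """Serving loop of one model container. `model` is any object with a
    predict_batch() method (see the interface sketch above); `requests`
    and `replies` are queues that stand in for the RPC connection."""
    while True:
        req = requests.get()
        if req is None:            # shutdown signal
            break
        req_id, batch = req
        replies.put((req_id, model.predict_batch(batch)))


# Scaling replicas up is just starting more container processes that share
# the same request queue (illustrative; the model must be picklable here):
#
#   requests, replies = mp.Queue(), mp.Queue()
#   for _ in range(n_replicas):
#       mp.Process(target=model_container,
#                  args=(model, requests, replies)).start()
```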

Clipper architecture with the model abstraction layer (diagram): user requests pass through the model selection layer and then the model abstraction layer (prediction cache, batch queues) before reaching models 1 through n; replies, and feedback with true labels, flow back to Clipper.

Takeaways: multiple models beat a single model, and incorporating user feedback through online model selection gives better accuracy; real-time queries require low latency and high throughput, addressed with dynamic batching and caching.

Discussion: room for improvements? Model selection; incorporating feedback.

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds. Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, and Phillip B. Gibbons, CMU; Onur Mutlu, ETH Zurich and CMU. Symposium on Networked Systems Design and Implementation (NSDI '17). Presenter: Ashwini Raina

Large-scale ML training. Within a data center: data is centrally located and training happens over the LAN (fast). Across data centers: data is geo-distributed; copying it to a central location is difficult, WAN bandwidth is scarce, and privacy and data sovereignty laws of countries apply; training over the WAN is slow (1.8-53.7X slower). Image reference: https://www.usenix.org/sites/default/files/conference/protected-files/nsdi17_slides_hsieh.pdf

Training with Parameter Servers
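As background for the figure on this slide, here is a minimal data-parallel parameter-server sketch: workers pull the current parameters, compute gradients on their data shard, and push updates back. The linear model and learning rate are illustrative, not Gaia's setup.

```python
import numpy as np


class ParameterServer:
    """Holds the model parameters and applies pushed gradients (sketch)."""

    def __init__(self, dim, lr=0.01):
        self.params = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.params.copy()

    def push(self, grad):
        self.params -= self.lr * grad


def worker_step(server, shard_x, shard_y):
    """One data-parallel step: pull, compute a gradient on the local shard
    (squared loss for a linear model, for illustration), then push."""
    w = server.pull()
    pred = shard_x @ w
    grad = shard_x.T @ (pred - shard_y) / len(shard_y)
    server.push(grad)
```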

Key questions asked: How slow is ML training in geo-distributed data centers (training over WAN compared to LAN)? Are all parameter updates in ML training "significant", and how do we quantify "significant" updates? Are BSP and SSP the best ML synchronization models? How do we design a new synchronization model that shares only significant updates? Will the ML algorithm converge? What training-time speedups can we expect?

Parameter synchronization models. Bulk Synchronous Parallel (BSP): all workers are synchronized after each iteration. Stale Synchronous Parallel (SSP): the fastest worker may be ahead of the slowest worker by at most a bounded number of iterations. Total Asynchronous Parallel (TAP): no synchronization between workers. BSP and SSP guarantee convergence; TAP does not. (An SSP sketch follows.)
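A minimal sketch of the SSP staleness bound, just to make the condition concrete; the class and method names are illustrative, and a real implementation would block workers rather than poll.

```python
class SSPClock:
    """Stale Synchronous Parallel sketch: a worker may advance only if it
    is within `staleness` iterations of the slowest worker."""

    def __init__(self, n_workers, staleness):
        self.clocks = [0] * n_workers   # per-worker iteration counters
        self.staleness = staleness

    def can_proceed(self, worker_id):
        """True if this worker is at most `staleness` ahead of the slowest."""
        return self.clocks[worker_id] - min(self.clocks) <= self.staleness

    def tick(self, worker_id):
        """Record that the worker finished one iteration."""
        self.clocks[worker_id] += 1
```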

Design goal: develop a geo-distributed ML system that minimizes communication over WANs and is applicable to a wide variety of ML algorithms. Key intuition: Stale Synchronous Parallel (SSP) bounds how stale a parameter can be; ASP bounds how inaccurate a parameter can be. The vast majority of parameter updates are insignificant: 95% of updates produce less than a 1% change to the parameter value.

WAN bandwidth measurements: WAN bandwidth is 15X lower than LAN bandwidth on average and 60X lower in the worst case (Singapore <-> Sao Paulo); the WAN bandwidth between close regions is 12X that between distant regions (Oregon <-> California vs. Singapore <-> Sao Paulo).

Training time over LAN vs. WAN. IterStore and Bosen are parameter-server-based ML frameworks; the ML application is matrix factorization. V/C WAN (Virginia <-> California) is closer than S/S WAN (Singapore <-> Sao Paulo).

Gaia Challenge 1 - How to effectively communicate over WANs while retaining algorithm convergence and accuracy? Challenge 2 - How to make the system generic and work for ML algorithms without requiring modification?

Gaia System Overview

Update Significance

Approximate Synchronous Parallel (ASP). Significance filter: a significance function (|update/value|) and an initial significance threshold (1%); to guarantee convergence, the threshold is reduced by the square root of the number of iterations. ASP selective barrier. Mirror clock. (A sketch of the significance filter follows.)
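A minimal sketch of the significance filter under stated assumptions: the |update/value| test, the 1% initial threshold, and the square-root decay come from the slide, while the local accumulation of unsent updates and the epsilon guard are illustrative details.

```python
import math
import numpy as np


class SignificanceFilter:
    """Accumulate local updates and only send a parameter's aggregate
    update over the WAN when |update/value| exceeds a shrinking threshold."""

    def __init__(self, dim, initial_threshold=0.01):
        self.initial_threshold = initial_threshold   # 1% from the slide
        self.pending = np.zeros(dim)                 # accumulated unsent updates

    def filter(self, updates, values, iteration):
        """Return the updates to send over the WAN this iteration."""
        # Threshold decays with the square root of the iteration count.
        threshold = self.initial_threshold / math.sqrt(max(iteration, 1))
        self.pending += updates
        # Relative change of each parameter; epsilon avoids divide-by-zero.
        ratio = np.abs(self.pending) / (np.abs(values) + 1e-12)
        significant = ratio > threshold
        to_send = np.where(significant, self.pending, 0.0)
        self.pending[significant] = 0.0              # sent updates are cleared
        return to_send
```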

Experimental setup. Amazon EC2: 22 machines spread across 11 regions. Emulation EC2: 22 machines on a local cluster emulating the WAN. Emulation Full Speed: 22 machines on a local cluster (no slowdowns). ML applications: matrix factorization (Netflix dataset), topic modeling (NYTimes dataset), image classification (ImageNet 2012 dataset).

Convergence Time Improvement

Convergence Time and WAN Bandwidth Virginia<->California WAN (close by) Singapore<->Sao Paulo WAN (far apart)

Gaia vs Centralized

Gaia vs Gaia_Async

Key takeaways: ML training is 1.8-53.7X slower across geo-distributed data centers; the vast majority of parameter updates are insignificant; the BSP and SSP synchronization models are WAN bandwidth-heavy; the ASP model shares only "significant" updates and has proven convergence properties.

Thoughts: Gaia expects the ML programmer to provide a significance function, which may not be straightforward for non-linear models. The paper mentions that a threshold of 1-2% should work in most cases; ML training is a vast space, and it is not very intuitive to me why this claim would hold for most applications. Google introduced Federated Learning for training ML models on mobile devices without copying the data to the server, with similar motivations (privacy, bandwidth, power, etc.); it would be good to understand the similarities and differences in design.

Clipper and Gaia Discussion Zack Kimberg

Clipper Questions In machine learning, training is usually performed offline and the final model is then used for production inference. Clipper is designed solely to improve production inference performance, but how could it be modified to also allow online training while in production?

Clipper Questions What are the pros and cons of implementing the production system with black-box models, as Clipper does, vs. integrated models like TensorFlow Serving?

Clipper Questions Clipper contains several pieces of functionality necessary for a production system such as batching, caching, and model selection. What additional pieces of functionality could be added for faster or more robust inference?

Gaia Questions When would you use Gaia vs. a one-time centralization by shipping all the data to one data center?

Gaia Questions What potential issues could arise with the Gaia system?

Gaia Questions How could Gaia be improved?

Thank You !