Clipper: A Low Latency Online Prediction Serving System


Clipper: A Low Latency Online Prediction Serving System
Presented by Shreya

The Problem
Machine learning applications require real-time, accurate, and robust predictions under heavy query load. Most machine learning frameworks focus on optimizing model training, not deployment. Current solutions for online prediction are tied to a single framework.

Clipper
An online prediction serving system with a modular, two-layer architecture, designed with attention to latency, throughput, and accuracy. Written in Rust. Supports multiple ML frameworks, including Spark, scikit-learn, Caffe (computer vision), TensorFlow, and HTK (speech recognition).

Architecture
Two layers, the model selection layer and the model abstraction layer, sit above individual model containers and communicate through RPC. Remote Procedure Call (RPC) is a protocol that one program can use to request a service from a program located on another computer on a network without having to understand the network's details. A procedure call is also sometimes known as a function call or a subroutine call.
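As an illustration of the idea (not Clipper's actual RPC system), here is a minimal sketch using Python's standard xmlrpc module, where a "model container" exposes predict() and the serving layer calls it like a local function:

```python
# Minimal RPC illustration (hypothetical; Clipper's RPC system is different).
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def predict(inputs):
    # Stand-in model: score each input by its length.
    return [len(x) for x in inputs]

# "Model container" side: expose predict() over the network.
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(predict, "predict")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Serving-layer side: the proxy hides sockets and serialization entirely.
model = ServerProxy("http://localhost:8000")
print(model.predict(["cat", "horse"]))  # [3, 5]
```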

Model Selection Layer
Selects and combines predictions across competing models for higher accuracy. Most ML frameworks are optimized for offline batch processing, not single-input prediction latency; Clipper's answer is batching under latency limits. When a query arrives, the layer dispatches it to particular models based on previous feedback; applications submit queries through a REST API. Straggler mitigation: render predictions without waiting for slow models. Requires no modifications to the frameworks themselves.

Selection Strategies
A/B online testing is costly: the number of configurations to test grows exponentially with the number of candidate models. Clipper instead follows a select, combine, observe loop. Selecting the best single model: Exp3. Ensemble selection: Exp4.

Selection Strategies
Exp3: associate a weight s_i with each model and select model i with probability p_i = s_i / Σ_j s_j. After observing the prediction's accuracy, update the selected model's weight, s_i ← s_i · exp(−η · L / p_i), where the loss L is the absolute loss in [0, 1] and η determines how quickly Clipper responds to feedback. This is an instance of the multi-armed bandit problem; example applications include speech recognition and computer vision for automatic captioning. Exp3 is a lightweight algorithm that requires only a single model evaluation per prediction, so it performs well under heavy load, and it comes with strong theoretical guarantees.
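A minimal sketch of this selection-and-update loop, assuming the rule above (weights start at 1, loss in [0, 1]; the class name and η value are illustrative):

```python
import math
import random

class Exp3:
    def __init__(self, n_models, eta=0.1):
        self.weights = [1.0] * n_models
        self.eta = eta  # how quickly we respond to feedback

    def select(self):
        # Select model i with probability s_i / sum of all weights.
        total = sum(self.weights)
        probs = [w / total for w in self.weights]
        i = random.choices(range(len(self.weights)), weights=probs)[0]
        return i, probs[i]

    def update(self, i, p_i, loss):
        # Importance-weighted exponential update: low loss makes the
        # selected model's weight grow relative to the others.
        self.weights[i] *= math.exp(-self.eta * loss / p_i)

# Usage: pick a model, serve the query, then feed back the absolute loss.
bandit = Exp3(n_models=3)
i, p = bandit.select()
bandit.update(i, p, loss=0.2)
```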

Selection Strategy
Exp4 uses linear ensemble methods, which compute a weighted average of the base models' predictions. Additional model containers increase the chance of stragglers; the solution is to wait only as long as the latency requirement allows and to report a confidence score that reflects the remaining uncertainty. If confidence is too low, Clipper returns a default prediction. Bootstrap aggregation (bagging) is used to reduce variance and thereby improve generalization performance. While Exp3's error is bounded by the best single model, Exp4 can improve performance as models are added.
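A hedged sketch of the combining step: a weighted average over the models that responded before the deadline, falling back to a default when confidence is too low. The confidence heuristic here (fraction of ensemble weight that reported in) is illustrative, not the paper's exact estimator:

```python
def ensemble_predict(predictions, weights, min_confidence=0.6, default=None):
    """predictions: {model_id: score} from models that met the deadline;
    weights: {model_id: ensemble weight} for all models."""
    if not predictions:
        return default
    # Weighted average of the base-model predictions that arrived in time.
    total_w = sum(weights[m] for m in predictions)
    combined = sum(weights[m] * p for m, p in predictions.items()) / total_w
    # Crude confidence: how much of the ensemble's weight responded.
    confidence = total_w / sum(weights.values())
    return combined if confidence >= min_confidence else default
```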

Exp3 and Exp4 Recovery
In the experiment, after 5K queries the performance of the lowest-error model is severely degraded, and after 10K queries its performance recovers. Exp3 and Exp4 quickly compensate for the failure and achieve lower error than any static model selection.

Model Abstraction Layer
Caches predictions on a per-model basis and implements adaptive batching to maximize throughput under a query latency target. Each model's queue has its own latency target. The optimal batch size maximizes throughput under the latency constraint; Clipper finds it with AIMD (additive increase, multiplicative decrease), backing off by 10% since the optimal batch size does not fluctuate much. Batches are dispatched via the RPC system. Batching amortizes the cost of RPC calls and internal framework overheads such as copying inputs to GPU memory, and it enables the frameworks' data-parallel optimizations by performing batch inference on many inputs simultaneously. TensorFlow uses fixed batch sizes to fully exploit GPU parallelism. With batching, an SVM can serve around 30K queries per second, versus roughly 200 QPS without it. Without Clipper, application developers must manually tune the batch size for each model.
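A minimal sketch of the AIMD rule described above, assuming additive increase while latency stays under the SLO and a 10% multiplicative backoff when it is exceeded (function and parameter names are illustrative):

```python
def adapt_batch_size(batch_size, measured_latency, latency_slo,
                     increase=1, backoff=0.10):
    """Called after each batch with the latency it actually took."""
    if measured_latency <= latency_slo:
        return batch_size + increase  # additive increase: probe for more throughput
    # Multiplicative decrease: back off 10% when we overshoot the SLO.
    return max(1, int(batch_size * (1 - backoff)))
```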

Model Abstraction Layer
The prediction cache is keyed on (query, model) and uses LRU eviction. Bursty workloads often produce batches smaller than the maximum batch size, so it can be beneficial to wait slightly longer to queue up more queries (delayed batching). Delayed batching does nothing for the Spark SVM, which is fast and already handles individual queries well, but the scikit-learn SVM benefits from it.
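A small sketch of a per-model LRU prediction cache keyed on (model, query), assuming hashable queries; this is illustrative, not Clipper's actual cache implementation:

```python
from collections import OrderedDict

class PredictionCache:
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.cache = OrderedDict()  # insertion order doubles as recency order

    def get(self, model, query):
        key = (model, query)
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def put(self, model, query, prediction):
        key = (model, query)
        self.cache[key] = prediction
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
```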

Machine Learning Frameworks
Each model runs in its own Docker container, so working on an immature framework does not interfere with the performance of the others. Replication: resource-intensive frameworks can be given more than one container, though replicas can show very different performance across a cluster. Conclusion: the network is the bottleneck.

Machine Learning Frameworks
Adding support for a new framework takes fewer than 25 lines of code. To support context-specific model selection (for example, by dialect), the model selection layer can instantiate a unique model selection state for each user, context, or session, managed in an external database (Redis).
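To illustrate why integration is cheap, a hypothetical container wrapper for a scikit-learn model: the only framework-specific work is adapting one batch predict() call. The class and method names are assumptions for this sketch, not Clipper's actual container API:

```python
class SklearnContainer:
    """Hypothetical adapter: wraps any fitted scikit-learn estimator
    behind the single batch-predict interface the serving layer expects."""

    def __init__(self, model):
        self.model = model  # any fitted scikit-learn estimator

    def predict(self, inputs):
        # Adapt the framework's native batch call to a plain list of outputs.
        return self.model.predict(inputs).tolist()
```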

Testing
Clipper was evaluated against TensorFlow Serving, which employs statically sized batching to optimize parallelization. Two serving containers were compared, one using the Python API and one the C++ API; the Python container was slow, but the C++ container showed near-identical performance to Clipper. The image workloads use three TensorFlow object recognition deep networks of varying computational cost: a 4-layer convolutional neural network trained on the MNIST dataset [42] (a common baseline), the 8-layer AlexNet [33] architecture trained on CIFAR-10 [32], and Google's 22-layer Inception-v3 network [58] trained on ImageNet. The speech workload uses the TIMIT speech corpus [24] and the HTK [63] machine learning framework, drawing randomly from TIMIT's 8 dialects. Overall, Clipper adds little overhead.

Limitations
The paper does not discuss system requirements. When containers replicate the same algorithm, Clipper could exploit this to minimize latency, for example sending half of a batch to each replica to reduce the effective batch size. The paper also does not mention avoiding double counting the same algorithm within an ensemble.

Related Work
Amazon Redshift: shared-nothing architecture, can work with semi-structured data, supports time travel. Google BigQuery: JSON and nested data support, its own SQL dialect, append-only tables. Microsoft SQL Data Warehouse: separates storage and compute behind a similar abstraction (Data Warehouse Units), imposes a concurrency cap, and has no support for semi-structured data. Pricing: Amazon charges per node per hour (higher compute per dollar), Snowflake charges per virtual warehouse per hour, and BigQuery charges per query and per data stored. Redshift is not ACID. Snowflake is good at stopping and starting warehouses, does so automatically, is faster than Redshift, and decouples compute from storage, unlike Redshift. BigQuery can simply run your query without you allocating a specific cluster; it scales "under the hood" and integrates with many of Google's tools (Analytics, G Suite, etc.).

Future Work
Make Snowflake a fully self-service model, without developer involvement. If a query fails, it is rerun in its entirety, which can be costly for a long query. Each worker node keeps a cache of table data that currently uses LRU, but the policy could be improved. Snowflake also does not handle the case where an availability zone becomes unavailable; currently that requires reallocating the query to another virtual warehouse. Performance isolation might not be necessary, so sharing a query among worker nodes could increase utilization.