Dan Crankshaw UCB RISE Lab Seminar 10/3/2015


Prediction Systems
Dan Crankshaw, UCB RISE Lab Seminar, 10/3/2015

Learning: Big Data → Training → Big Model
- Timescale: minutes to days
- Systems: offline and batch optimized
- Heavily studied ... a major focus of the AMPLab

[Diagram] Learning and inference side by side: Big Data → Training → Big Model on the learning side; Application → Query → Big Model → Decision on the inference side.

Inference: Application → Query → Big Model → Decision
- Timescale: ~20 milliseconds
- Systems: online and latency optimized
- Less studied ...

[Diagram] Closing the loop: Decisions generate Feedback, which flows back into Training alongside the Big Data.

Feedback: Decision → Feedback → Training
- Timescale: hours to weeks
- Systems: a combination of systems
- Less studied ...

[Diagram] The full loop must be both Responsive (~10 ms, serving queries) and Adaptive (~1 second, incorporating feedback).

Prediction Serving Challenges
Complexity of deploying new models
- New applications or products (0 → 1 models)
- New data, features, or model family (N → N+1 models)
- Why it is hard: frameworks are not designed for low-latency serving, and they have different APIs, different resource requirements, and different costs
System performance
- Need to ensure low-latency predictions and scalable throughput
- Deploying a new model can't degrade system performance
Model (statistical) performance
- Model selection: which models to use? When to deploy a new model? How to adapt to feedback?
- At a meta-level: what are the right metrics for measuring model performance?

LASER: A Scalable Response Prediction Platform for Online Advertising (Agarwal et al., 2014)

LASER Overview
- Top-down system design, enforced by the company's organizational structure
- Picked a model (logistic regression) and built the system around that choice
- Data scientists are required to use this model and to express features in a specialized configuration language
- Result: the system and the model family are tightly coupled
- Still more general-purpose than application-specific deployments, because the system can serve many such models for different applications

Addressing Deployment Complexity
Fixed model choice: can be hardcoded into the system; no API is needed to specify the model
Configuration language: specify feature construction in a JSON-based configuration language
- Restricts feature transformations to those built from a component library
- Allows pipeline changes without service restarts or code modification
- Allows easy reuse of common features across an organization
- Similar to PMML, PFA
Language details:
- Source: translates data into numeric feature vectors
- Transformer: vector-to-vector transformations (transform, aggregate)
- Assembler: concatenates all feature pipelines into a single vector
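The Source → Transformer → Assembler pipeline above can be sketched as a small interpreter over a config object. This is a minimal illustration of the idea, not LASER's actual syntax; the field names and component library here are invented for the example:

```python
import numpy as np

# Hypothetical feature-pipeline config in the spirit of LASER's
# JSON-based language (field names are illustrative, not LASER's).
CONFIG = [
    {"source": "age",    "transformers": ["log1p"]},
    {"source": "clicks", "transformers": ["log1p", "clip10"]},
]

# Component library: the only transformations pipelines may use.
TRANSFORMERS = {
    "log1p":  np.log1p,
    "clip10": lambda x: np.clip(x, 0.0, 10.0),
}

def assemble(record, config):
    """Source -> Transformer(s) -> Assembler, as described on the slide."""
    parts = []
    for spec in config:
        x = np.atleast_1d(float(record[spec["source"]]))  # Source
        for name in spec["transformers"]:                 # Transformers
            x = TRANSFORMERS[name](x)
        parts.append(x)
    return np.concatenate(parts)                          # Assembler

vec = assemble({"age": 30, "clicks": 200000}, CONFIG)
```

Because pipelines are data, not code, a new pipeline can be deployed by shipping a new config, with no service restart, which is the deployment win the slide describes.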

Addressing System Performance
Precompute second-order interaction terms
- The LASER logistic regression model includes second-order interaction terms between user and campaign features
Don't wait for delayed features
- Features can be delayed by slow DB lookups or expensive computation
- Solution: substitute the expected value for missing features, degrading accuracy rather than latency
- Solution: cache precomputed scalar products, saving the overhead of re-computing features and dot products, which are lazily evaluated
Combination of eager (precomputed second-order terms) and lazy (cached, on-demand) approaches
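The second-order model referenced above has roughly the following bilinear form. This is my sketch of the structure described in the LASER paper, with my own notation, not an equation copied from the slides:

```latex
P(\text{click} \mid u, c)
  = \sigma\!\left( \omega_0
      + x_u^{\top}\omega_u
      + x_c^{\top}\omega_c
      + x_u^{\top}\,\Omega\,x_c \right),
\qquad \sigma(t) = \frac{1}{1 + e^{-t}}
```

Here \(x_u\) are user features, \(x_c\) are campaign features, and \(\Omega\) holds the second-order interaction weights. Since \(\Omega x_c\) is fixed per campaign, it can be precomputed eagerly, which is exactly the "precompute second-order terms" trick.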

Addressing Model Performance
Decompose the model into slowly-changing and quickly-changing components
- Fast retraining of the warm-start (quickly-changing) component without the cost of a full retraining
- Warm start: trained online. Cold start: trained offline
Explore/exploit with Thompson sampling
- Sometimes serve ads with a low empirical mean but high variance
- Draw a sample from the posterior distribution over the parameters and use that sample to predict CTR, instead of using the mode
- In practice, hold the cold-start component fixed and sample only the warm-start component
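The Thompson sampling step can be sketched in a few lines. This is a minimal illustration assuming an independent (diagonal) Gaussian posterior over each campaign's weights; the numbers and function names are illustrative, not LASER's:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def thompson_select(x, posterior_means, posterior_vars):
    """Pick a campaign by sampling weights from each campaign's posterior
    and scoring with the sampled weights instead of the posterior mean.
    High-variance (uncertain) campaigns occasionally win the argmax,
    which is the 'explore' side of explore/exploit."""
    scores = []
    for mean, var in zip(posterior_means, posterior_vars):
        w = rng.normal(mean, np.sqrt(var))   # one posterior draw per campaign
        scores.append(sigmoid(w @ x))        # predicted CTR under the draw
    return int(np.argmax(scores))

# Toy setup (illustrative numbers): 3 campaigns, 2 features each.
# Campaign 1 has a low mean but high variance, so it still gets explored.
x = np.array([1.0, 0.5])
means = [np.array([0.2, 0.1]), np.array([0.0, 0.0]), np.array([-0.1, 0.3])]
vars_ = [np.array([0.01, 0.01]), np.array([1.0, 1.0]), np.array([0.05, 0.05])]
choice = thompson_select(x, means, vars_)
```

Serving the sampled score instead of the mode is what lets low-empirical-mean, high-variance ads get shown often enough to resolve their uncertainty.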

Some Takeaways from LASER
- System performance is paramount in the broader application context: a slow page load has a much larger impact on revenue than a poor ad recommendation
- AUC/accuracy is not always the most useful model-performance metric
- The more assumptions you can make about your tools (software, models), the more tricks you can play (config language, shared features, warm-start/cold-start decomposition)
- It is safe for LASER to make these assumptions because they are enforced through extra-technological (organizational) means
- Similar to some of the design choices we saw in Borg last week

Clipper: A Low-Latency Online Prediction Serving System
Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael Franklin, Joseph E. Gonzalez, Ion Stoica

Goals of Clipper
Design choice: a general-purpose, easy-to-use prediction serving system
- Generalize to many different ML applications (in contrast to LASER, which was designed for LinkedIn's ad-targeting needs)
- Generalize to many frameworks/tools for a single application: don't tie the hands of data scientists developing models
- Make it simple for a data scientist to deploy a new model into production
Given these design choices, maximize system and model performance using model-agnostic techniques

[Diagram] Clipper generalizes models across ML frameworks: applications (fraud detection, content recommendation, personal assistants, robotic control, machine translation) sit above Clipper, which sits above frameworks such as Create, VW, and Caffe.

[Diagram] Clipper architecture: applications issue Predict and Observe calls over an RPC/REST interface to Clipper, which sits above the ML frameworks (Create, VW, Caffe).

[Diagram] Each framework runs behind a model wrapper (MW) and communicates with Clipper over RPC.

[Diagram] Clipper has two layers.
- Model Selection Layer: improve accuracy through ensembles, online learning, and personalization
- Model Abstraction Layer: provide a common interface to models while bounding latency and maximizing throughput

[Diagram] The Model Selection Layer implements a selection policy; the Model Abstraction Layer provides caching and adaptive batching above the per-model wrappers, which connect over RPC.


[Diagram] Model Abstraction Layer detail (this slide labels the upper layer the Correction Layer, with a correction policy, approximate caching, and adaptive batching). The common interface simplifies deployment:
- Evaluate models using their original code and systems
- Models run in separate processes (Docker containers)
- Resource isolation
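The "common interface" idea above amounts to putting every framework, whatever its native API, behind one batch-predict call. Here is a minimal sketch of that wrapper pattern; the class and method names are illustrative, not Clipper's actual RPC schema:

```python
from abc import ABC, abstractmethod
from typing import List, Sequence

class ModelWrapper(ABC):
    """Common interface every deployed model is served through.
    In Clipper the wrapper runs in its own process (Docker container)
    and talks to the abstraction layer over RPC; here we just show
    the interface boundary itself."""

    @abstractmethod
    def predict_batch(self, inputs: Sequence[Sequence[float]]) -> List[float]:
        ...

class PredictMethodWrapper(ModelWrapper):
    """Adapts any object exposing a scikit-learn-style .predict()
    to the common batch interface."""

    def __init__(self, model):
        self.model = model

    def predict_batch(self, inputs):
        return [float(y) for y in self.model.predict(inputs)]

# A stand-in "framework model": always predicts 0.5.
class ConstantModel:
    def predict(self, xs):
        return [0.5 for _ in xs]

wrapper = PredictMethodWrapper(ConstantModel())
preds = wrapper.predict_batch([[1.0], [2.0]])
```

Because Clipper only sees this interface, the model's original code, dependencies, and framework stay untouched, which is what keeps deployment simple.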

[Diagram] The common interface simplifies deployment:
- Evaluate models using their original code and systems
- Models run in separate processes: resource isolation, scale-out
Problem: frameworks are optimized for batch processing, not latency

Adaptive Batching to Improve Throughput
Why batching helps:
- A single page load may generate many queries
- Hardware acceleration favors batches
- Batching amortizes system overhead
The optimal batch size depends on: hardware configuration, model and framework, and system load
Clipper's solution: be as slow as allowed
- Increase the batch size until the latency objective is exceeded (additive increase)
- If latency exceeds the SLO, cut the batch size by a fraction (multiplicative decrease)
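The additive-increase / multiplicative-decrease (AIMD) rule above can be sketched as a small controller replayed over a trace of observed batch latencies. The constants (increment, decrease factor, cap) are illustrative, not Clipper's tuned values:

```python
def aimd_batch_controller(latencies_ms, slo_ms=20.0, start=1,
                          increase=1, decrease=0.8, max_batch=512):
    """Replay observed per-batch latencies through an AIMD batch-size
    controller: grow the batch additively while under the SLO, cut it
    multiplicatively on a violation. Returns the batch size chosen
    after each observation."""
    batch = start
    history = []
    for lat in latencies_ms:
        if lat > slo_ms:
            batch = max(1, int(batch * decrease))      # multiplicative decrease
        else:
            batch = min(max_batch, batch + increase)   # additive increase
        history.append(batch)
    return history

# Three under-SLO batches grow the size; one violation cuts it sharply.
sizes = aimd_batch_controller([5, 5, 5, 30, 5], slo_ms=20.0, start=10)
```

The asymmetry (slow growth, sharp cuts) is what lets the controller probe for the largest batch the SLO allows while reacting quickly when load or model cost changes.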

[Figure] Adaptive batching to improve throughput: a 25.5x throughput increase.


Model Selection Layer
Goal: maximize accuracy through bandits and ensembles, online learning, and personalization
Incorporate feedback in real time to achieve:
- Robust predictions, by combining multiple models and frameworks
- Online learning and personalization, by selecting and personalizing predictions in response to feedback

[Diagram] Clipper in the loop: per-user model selection adapts on a fast timescale from feedback, while the underlying models change slowly through offline training.

Model Selection Policy
Improves prediction accuracy by:
- Incorporating real-time feedback
- Estimating the confidence of predictions
- Determining how to combine multiple predictions (e.g., choose the best, average, ...), which lets frameworks compete
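One simple way a selection policy can use feedback to combine predictions is a multiplicative-weights update over models: models whose recent predictions were far from the observed label lose weight. This is a generic online-learning sketch of the idea, not Clipper's exact policy:

```python
import math

def update_weights(weights, preds, label, eta=0.5):
    """Multiplicative-weights update on squared error: each model's
    weight is scaled by exp(-eta * loss), then renormalized."""
    new = [w * math.exp(-eta * (p - label) ** 2)
           for w, p in zip(weights, preds)]
    total = sum(new)
    return [w / total for w in new]

def combine(weights, preds):
    """Weighted-average ensemble prediction."""
    return sum(w * p for w, p in zip(weights, preds))

# Three models start with equal weight; model 0 tracks the labels best.
weights = [1 / 3] * 3
for preds, label in [([0.9, 0.2, 0.5], 1.0),
                     ([0.8, 0.1, 0.5], 1.0)]:
    weights = update_weights(weights, preds, label)
```

Run per user, an update like this is also how the layer personalizes: each user's feedback shifts the weights toward whichever framework serves that user best.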

Cost of Ensembles
Increased load
- Solutions: caching and batching; the model selection layer prioritizes frameworks for load-shedding
Stragglers (e.g., a framework fails to meet the SLO)
- Solution: anytime predictions; the selection policy selects/combines from whatever predictions are available
- e.g., the built-in ensemble policy substitutes the expected value for a missing prediction
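The expected-value substitution for stragglers can be sketched directly. This is an illustrative implementation of the idea described above, not Clipper's actual code; `None` marks a model that missed the deadline:

```python
def anytime_ensemble(preds, weights, expected):
    """Anytime prediction: combine whichever model predictions arrived
    before the deadline; for stragglers (preds[i] is None), substitute
    that model's running expected value. This trades a little accuracy
    for a bounded response time."""
    filled = [p if p is not None else e
              for p, e in zip(preds, expected)]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, filled)) / total

# Model 1 missed the deadline; its long-run average (0.5) stands in.
score = anytime_ensemble([0.9, None, 0.7],
                         weights=[1.0, 1.0, 2.0],
                         expected=[0.5, 0.5, 0.5])
```

Because the substitution is per-model, one slow framework degrades the ensemble only slightly instead of stalling the whole prediction.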

Limitations of Clipper
- Does not address offline model retraining
- By treating deployed models as black boxes, Clipper forgoes the opportunity to optimize execution of the models themselves or to share computation between models
- Only performs coarse-grained tradeoffs among accuracy, robustness, and performance

TensorFlow Serving
- Recently released open-source prediction-serving system from Google; a companion to the TensorFlow deep-learning framework
- Easy deployment of TensorFlow models
- The system automatically manages the lifetime of deployed models: it watches for new versions, loads them, and transfers requests to them automatically
- Does not address model performance, only system performance (through batching)
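The version-management behavior described above can be sketched with the common numbered-subdirectory layout, where each model version lives under its own directory and the newest becomes servable. This is an illustration of the idea only; real TensorFlow Serving has its own source/loader/manager machinery:

```python
import os

def pick_servable_version(model_dir):
    """Scan a model directory for numeric version subdirectories,
    make the newest one servable, and mark the rest as retired.
    (Illustrative sketch, not TensorFlow Serving's implementation.)"""
    versions = sorted(int(d) for d in os.listdir(model_dir) if d.isdigit())
    if not versions:
        return None, []
    newest = versions[-1]
    retired = [v for v in versions if v != newest]
    return newest, retired
```

A serving loop would call this periodically: when a new version directory appears, requests shift to it and the old versions are unloaded, giving zero-downtime model upgrades.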

[Diagram] TensorFlow Serving architecture: applications call Predict over an RPC/REST interface; TensorFlow-Serving batches predictions; when a new model version is trained, it is loaded (V3) while old versions (V1) are retired.


Other Prediction-Serving Systems
Turi
- Company co-founded by Joey, Carlos Guestrin, and others to serve predictions from models (primarily) trained in the GraphLab Create framework
- Not open source; recently acquired by Apple
Oryx
- Developed by Cloudera for serving Apache Spark models
- Implements the Lambda Architecture with Spark and Spark Streaming to incrementally maintain models
- Open source
PredictionIO
- Open-source Apache Incubating project; the company behind it was recently acquired by Salesforce
- Built on Apache Spark, HBase, Spray, and Elasticsearch