Prediction Serving Joseph E. Gonzalez Asst. Professor, UC Berkeley


Prediction Serving Joseph E. Gonzalez Asst. Professor, UC Berkeley jegonzal@cs.berkeley.edu

Systems for Machine Learning: Big Data → Training → Big Model. Timescale: minutes to days. Systems: offline and batch optimized. Heavily studied; this has been the primary focus of ML systems research.

Learning: Big Data → Training → Big Model.

Learning and Inference:
Learning: Big Data → Training → Big Model
Inference: Application → Query → Big Model → Decision

Inference: Application → Query → Big Model → Decision. Timescale: ~10 milliseconds. Systems: online and latency optimized. Much less studied…

Why is inference challenging? We need to render low-latency (<10 ms) predictions for complex models and queries (e.g., top-K results, or feature joins such as SELECT * FROM users JOIN items, click_logs, pages WHERE …) under heavy load and in the presence of system failures.

Basic Linear Models (often high dimensional). Common for click prediction and text filtering models (e.g., spam). The query x is encoded as a sparse bag-of-words: x = “The quick brown” = {(”brown”, 1), (”the”, 1), (“quick”, 1)}. Rendering a prediction computes a sparse dot product with θ, a large vector holding a weight for each possible word or word combination (n-gram models) … McMahan et al. report billions of coefficients.
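The sparse dot product above can be sketched in a few lines. This is a minimal illustration with hypothetical weights, not a production serving path: only the handful of nonzero features in the query are touched, so a prediction stays cheap even when θ has billions of coefficients.

```python
import math

# Hypothetical sparse weight table theta; real click models hold billions of entries.
theta = {"brown": 0.3, "quick": -1.2, "the": 0.05}

def encode(text):
    """Bag-of-words encoding: map each lowercased token to its count."""
    counts = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def predict(x):
    """Sigmoid of the sparse dot product theta . x; missing features weigh 0."""
    z = sum(theta.get(f, 0.0) * v for f, v in x.items())
    return 1.0 / (1.0 + math.exp(-z))

p = predict(encode("The quick brown"))  # touches only 3 of theta's entries
```

The cost per query is proportional to the number of nonzero features in x, not to the dimensionality of θ.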

Computer Vision and Speech Recognition. Deep neural networks (covered in more detail later): hundreds of millions of parameters, plus convolutions and unrolling. Requires hardware acceleration.
http://www.nvidia.com/content/tegra/embedded-systems/pdf/jetson_tx1_whitepaper.pdf

“Using Google's fleet of TPUs, we can find all the text in the Street View database in less than five days. In Google Photos, each TPU can process [more than] 100 million photos a day.” -- Norm Jouppi (Google)
Titan can process 863 photos a second. Snapchat claims to receive 5,000 photos every second; YouTube uploads 5 hours of new video every second. What about bursts of traffic? >1,000 photos a second on a cluster of ASICs.
http://www.techradar.com/news/computing-components/processors/google-s-tensor-processing-unit-explained-this-is-what-the-future-of-computing-looks-like-1326915

Robust Predictions. We often want to quantify prediction accuracy (uncertainty). Several common techniques:
Bayesian inference: requires maintaining more statistics about each parameter; often requires matrix inversion, sampling, or numeric integration.
Bagging: multiple copies of the same model trained on different subsets of the data; linearly increases complexity.
Quantile methods: relatively lightweight but conservative.
In general, robust predictions → additional computation.

Two Approaches to Inference:
Eager: pre-materialize predictions.
Lazy: compute predictions on the fly.

Eager: Pre-materialize Predictions.
Examples: Zillow might pre-compute popularity scores or house categories for all active listings; Netflix might pre-compute the top-k movies for each user daily.
Advantages: can use offline training frameworks for efficient batch prediction; serving is done with traditional data serving systems.
Disadvantages: frequent updates to models force substantial computation; cannot be applied when the set of possible queries is large (e.g., speech recognition, image tagging, …).
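The eager approach can be sketched as an offline batch job plus an online lookup. The model, users, and catalog below are all hypothetical stand-ins; the point is that no model code runs on the request path.

```python
def model_top_k(user, catalog, k=3):
    """Stand-in for an expensive recommender; scores via a hash-based pseudo-affinity."""
    scored = sorted(catalog, key=lambda item: hash((user, item)))
    return scored[:k]

users = ["alice", "bob"]
catalog = ["movie_a", "movie_b", "movie_c", "movie_d"]

# Offline batch job: rerun whenever the model or catalog changes.
materialized = {u: model_top_k(u, catalog) for u in users}

def serve(user):
    """Online serving is just a key-value lookup, with a default for unseen users."""
    return materialized.get(user, catalog[:3])
```

This is why eager serving breaks down when the query set is unbounded: there is no finite table of `(user, prediction)` pairs to materialize for, say, arbitrary spoken audio.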

Lazy: Compute Predictions at Query Time.
Examples: speech recognition, image tagging; ad targeting based on search terms, available ads, and user features.
Advantages: computes only the queries that actually arrive; enables models to be changed rapidly and supports bandit exploration; queries need not come from a small ground set.
Disadvantages: increases the complexity and computational overhead of the serving system; requires low and predictable latency from the models.
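One advantage listed above, rapid model changes, falls out naturally in the lazy setting: because every prediction is computed at query time, a new model version takes effect on the very next request, with no re-materialization step. A minimal sketch with a hypothetical in-process model registry:

```python
# Hypothetical model registry: each version is just a callable here.
models = {
    "v1": lambda query: len(query),
    "v2": lambda query: len(query) * 2,
}
active = "v1"

def serve(query):
    """Lazy serving: run the currently active model on the request path."""
    return models[active](query)

a = serve("abc")   # served by v1
active = "v2"      # hot-swap the model between requests
b = serve("abc")   # the next request already uses v2
```

The same mechanism supports bandit exploration: the dispatch line can pick among candidate models per query instead of reading a single `active` pointer.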

Vertical Solutions to Real-time Prediction Serving.
Ad click prediction and targeting: a multi-billion-dollar industry; latency-sensitive, contextualized, high-dimensional models → ranking.
Content recommendation (optional reading): typically simple models trained and materialized offline; moving toward more online learning and adaptation.
Face detection (optional reading): an example of early work in accelerated inference with substantial impact; the widely used Viola-Jones face detection algorithm (prediction cascades).
Automatic speech recognition (ASR) (optional reading): typically cloud-based, with limited literature; Baidu paper: deep learning + traditional beam search techniques; heavy use of hardware acceleration to reach ”real-time” 40 ms latency.

Presentations Today.
Giulio Zhou: challenges of deployed ML from the perspective of Google and Facebook.
Noah Golmat: eager prediction serving from within a traditional RDBMS using Hazy.
Dan Crankshaw: the LASER lazy prediction serving system at LinkedIn, and his ongoing work on the Clipper prediction serving system.

Future Directions

Research in Faster Inference.
Caching (pre-materialization): generalize Hazy-style Hölder's-inequality bounds; cache warming, prefetching, and approximate caching.
Batching → better tuning of batch sizes.
Parallel hardware acceleration: GPU → FPGA → ASIC acceleration; leveraging heterogeneous hardware with low-bit precision; secure hardware.
Model compression: distillation (covered later); context-specific models.
Cascading models: a fast path for easy queries.
Inference on the edge: utilize client resources during inference.
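The batching item deserves a sketch: accelerators like GPUs amortize fixed per-call overhead across a batch, so a serving system gathers queued queries (up to some maximum batch size) and invokes the model once per batch. The queue contents and "model" below are hypothetical placeholders.

```python
from collections import deque

def model_batch(queries):
    """Stand-in for one vectorized model invocation over a whole batch."""
    return [len(q) for q in queries]

queue = deque(["a", "bb", "ccc", "dddd", "e"])

def drain(queue, max_batch=4):
    """Serve the queue in batches of up to max_batch queries per model call."""
    results = []
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch, len(queue)))]
        results.extend(model_batch(batch))  # 5 queries -> 2 model calls here
    return results

out = drain(queue)
```

Tuning `max_batch` trades latency for throughput: larger batches keep the hardware busy, but the first query in a batch waits for the last one to arrive.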

Research in Model Life-cycle Management.
Performance monitoring: detect potential model failure with limited or no feedback.
Incremental model updates: incorporate feedback in real time to update entire pipelines.
Tracking model dependencies: ensure features are not corrupted and models are updated in response to changes in upstream models.
Automatic model selection: choosing between many candidate models for a given prediction task.