Parameter Servers (slides courtesy of Aurick Qiao, Joseph Gonzalez, Wei Dai, and Jinliang Wei) 1. Introduce self. 2. Parameter servers are very popular – Us, Alex Smola, Google, Yahoo, Microsoft… 3. Talk about latest developments.

RECAP: Regret analysis for on-line optimization

RECAP 2009

RECAP. f is the loss function, x the parameters. Take a gradient step: x' = x_t − η_t g_t. If you've restricted the parameters to a subspace X (e.g., they must be positive, …), find the closest thing in X to x': x_{t+1} = argmin_{x in X} dist(x, x'). But… you might be using a "stale" gradient g (from τ steps ago).
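Written out cleanly (the subscripts were garbled in the transcript), the delayed projected SGD step described above is:

```latex
% Delayed projected SGD step (reconstruction of the slide's text).
% g_{t-\tau} is a possibly stale gradient of the loss f, \eta_t the step size,
% and \Pi_X denotes projection onto the feasible set X.
x' = x_t - \eta_t \, g_{t-\tau},
\qquad
x_{t+1} = \Pi_X(x') = \arg\min_{x \in X} \; \mathrm{dist}(x, x')
```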

RECAP. Regret: how much loss was incurred during learning, over and above the loss incurred with an optimal choice of x. Special case: f_t is 1 if a mistake was made and 0 otherwise, and f_t(x*) = 0 for the optimal x*; then regret = # mistakes made during learning.

RECAP Theorem: you can find a learning rate so that the regret of delayed SGD is bounded in terms of T (the number of timesteps) and the staleness τ > 0.

RECAP Theorem 8: you can do better if you assume (1) the examples are i.i.d. and (2) the gradients are smooth, analogous to the assumption about L. Then you can show a bound on expected regret in which the no-delay loss is the dominant term.

RECAP Experiments

Parameter Servers (slides courtesy of Aurick Qiao, Joseph Gonzalez, Wei Dai, and Jinliang Wei) 1. Introduce self. 2. Parameter servers are very popular – Us, Alex Smola, Google, Yahoo, Microsoft… 3. Talk about latest developments.

ML Systems: scalable machine learning algorithms, abstractions, and scalable systems.

ML Systems Landscape: Dataflow Systems (Hadoop, Spark), Graph Systems (GraphLab, Tensorflow), Shared Memory Systems (Bosen, DMTK, ParameterServer.org). A very high-level overview of the current landscape in ML systems. First, dataflow systems such as Hadoop and Spark: data is independent records pushed through a processing pipeline. Graph systems like GraphLab: the problem is modeled as a graph, and each node communicates with its neighbors. Distributed shared memory systems like Bosen: the model is globally accessible and changed by external workers.

ML Systems Landscape: Algorithms. Each of these systems aims to support certain types of algorithms: Dataflow Systems (Hadoop, Spark), Graph Systems (GraphLab, Tensorflow), Shared Memory Systems (Bosen, DMTK, ParameterServer.org).

ML Systems Landscape: algorithms by system type. Dataflow systems (Hadoop, Spark): Naïve Bayes, Rocchio. Graph systems (GraphLab, Tensorflow): graph algorithms and graphical models such as HMMs. Shared memory systems (Bosen, DMTK, ParameterServer.org) [NIPS'09, NIPS'13]: SGD (e.g., logistic regression) and sampling (e.g., Gibbs sampling for LDA).

ML Systems Landscape: Abstractions. Each of these types of systems offers a different abstraction as well: dataflow systems (Hadoop & Spark), graph systems (GraphLab, Tensorflow), shared memory systems (Bosen, DMTK, ParameterServer.org) [NIPS'09, NIPS'13].

ML Systems Landscape: abstractions by system type. Dataflow systems (Hadoop & Spark): PIG, GuineaPig, …. Graph systems (GraphLab, Tensorflow) [UAI'10]: vertex programs — code runs on a vertex of the graph and communicates with its neighbors. Shared memory systems (Bosen, DMTK, ParameterServer.org) [VLDB'10]: parameter server — the model is stored on a globally accessible parameter server.

ML Systems Landscape: Parameter Server. In shared memory systems, the parameters of the ML model are stored in a distributed hash table that is accessible through the network [NIPS'09, NIPS'13]; the model lives on a globally accessible parameter server [VLDB'10]. Parameter servers are used at Google, Yahoo, and elsewhere, with academic work by Smola, Xing, and others.

Parameter Servers Are Flexible. [Plot: many systems in the landscape are implemented with a parameter server.] Looking at the landscape, parameter servers are everywhere; many companies use them, including Google, Yahoo, and Microsoft.

Parameter Server (PS). [Diagram: worker machines and server machines.] Data parallelism naturally leads to the parameter server abstraction: data is distributed among worker machines, while model parameters are stored on PS machines and accessed via a key-value interface (distributed shared memory). Extensions: multiple keys (for a matrix); multiple "channels" (for multiple sparse vectors, multiple clients for the same servers, …). [Smola et al 2010, Ho et al 2013, Li et al 2014]

Parameter Server (PS), continued. Further extensions: a push/pull interface to send/receive the most recent copy of (a subset of) the parameters, with blocking optional; a worker can also block until all push/pulls with clock < (t − τ) have completed. [Smola et al 2010, Ho et al 2013, Li et al 2014]

Data parallel. [Diagram: model parameters w1–w9 sharded across three parameter servers (w1–w3, w4–w6, w7–w9); the data is split across worker machines.] Different parts of the model can be stored on different parameter servers, and each worker retrieves just the parts it needs from the corresponding machines.

Data parallel, continued. [Diagram: same sharding; workers retrieve the parts of the model they need, as they need them.]

Parameter Server Abstraction. Key-value API for workers: get(key) → value; add(key, delta). Parameter servers use a simple key-value store API: get retrieves values from the servers (the worker then computes a delta), and add sends the delta to the servers, producing a new model. Key advantage: algorithms look like single-threaded code — no need to create a complicated pipeline of joins, maps, etc. as in MapReduce.
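To make the key-value abstraction concrete, here is a minimal sketch in Python. Only get(key) → value and add(key, delta) come from the slide; the class, the in-process dictionary, and the tiny worker snippet are illustrative assumptions.

```python
# Minimal sketch of the parameter-server key-value API from the slide.
# A real PS shards keys across server machines and talks over the network;
# here a local dict stands in for the distributed store.
from collections import defaultdict

class ParameterServer:
    def __init__(self):
        self.store = defaultdict(float)   # key -> parameter value

    def get(self, key):
        """get(key) -> value: read the current parameter value."""
        return self.store[key]

    def add(self, key, delta):
        """add(key, delta): apply an additive update to the parameter."""
        self.store[key] += delta

# A worker's code looks like ordinary single-threaded ML code:
ps = ParameterServer()
w = ps.get("w_3")            # read a parameter
grad = 0.1                   # ... compute a gradient using local data ...
ps.add("w_3", -0.05 * grad)  # push the update back
```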

PS vs Hadoop MapReduce. You have all used MapReduce — it is easy to program, but is it an efficient system for ML?

Iteration in MapReduce (IPM). [Diagram: training data is mapped and reduced repeatedly, producing models w(0), w(1), w(2), w(3), ….] The suspicion is that MapReduce is not ideal — otherwise this presentation would not be given. Take a closer look at what an iterative ML algorithm running on MapReduce looks like: load a large amount of data from disk into the mappers, reduce, and save the output back to disk. And that's not all — you need to iterate again and again to converge.

Cost of Iteration in MapReduce: repeatedly loading the same data. [Diagram: the same training data is read from disk on every iteration — Read 1, Read 2, Read 3, ….] Disk is really slow, and we're loading the same data repeatedly.

Cost of Iteration in MapReduce: redundantly saving output between stages. [Diagram: each iteration writes its model back to disk.] We save the output to disk even though it will be loaded again immediately, and we don't even care about that intermediate value.

Parameter Servers: Stale Synchronous Parallel Model (slides courtesy of Aurick Qiao, Joseph Gonzalez, Wei Dai, and Jinliang Wei) 1. Introduce self. 2. Parameter servers are very popular – Us, Alex Smola, Google, Yahoo, Microsoft… 3. Talk about latest developments.

Parameter Server (PS). [Diagram: worker machines and server machines.] Data parallelism naturally leads to the parameter server abstraction: data is distributed among worker machines, and model parameters are stored on PS machines and accessed via a key-value interface (distributed shared memory). [Smola et al 2010, Ho et al 2013, Li et al 2014]

Iterative ML Algorithms. [Diagram: a worker takes data and the current model parameters, computes an update, and applies it to obtain a better model.] Before going any further, look at some properties of machine learning algorithms: take data and the current parameters, compute an update to the model, and apply the update to obtain a better model. Iterative: do it over and over again until convergence. Examples: topic models, matrix factorization, SVMs, deep neural networks…

MapReduce vs. Parameter Server. Data model: independent records (MapReduce) vs. independent data (PS). Programming abstraction: Map & Reduce (MapReduce) vs. a key-value store / distributed shared memory (PS). Execution semantics: Bulk Synchronous Parallel (BSP) for MapReduce vs. ? for the parameter server. Ask for questions here; summarize the slide. The only remaining question: what are the execution semantics of the parameter server?

The Problem: Networks Are Slow! [Diagram: workers call get(key) and add(key, delta) on the server machines.] The big problem in distributed systems is always the network: it is slow compared to local memory access. To avoid a network bottleneck, parameters need to be cached locally on the workers, and the synchronization mechanism is key in determining performance, as we will see later. We want to explore options for handling this. [Smola et al 2010, Ho et al 2013, Li et al 2014]

Solution 1: Cache Synchronization. [Diagram: workers, each with local data, synchronize parameter caches with the server.] An example of worker cache synchronization.

Parameter Cache Synchronization. [Diagram: workers send sparse changes to the model to the server.]

Parameter Cache Synchronization (aka IPM). [Diagram: cache synchronization between workers and server, continued.]

Solution 2: Asynchronous Execution. [Diagram: under barriers, machines 1–3 alternate compute and communicate phases and waste time waiting at each barrier.] The previous animation is an example of bulk synchronous execution. The other end of the spectrum is asynchronous execution: remove the barriers and let each worker communicate and compute whenever it likes. This enables more frequent coordination on parameter values.

Asynchronous Execution. [Diagram: a logical parameter server holding w1–w9; machines 1–3 iterate independently.] Complete overlap of communication and computation, no idle resources — maximize throughput. [Smola et al 2010]

Asynchronous Execution. Problem: async lacks a theoretical guarantee, since a distributed environment can have arbitrary delays from the network and stragglers. But…

RECAP. f is the loss function, x the parameters. Take a gradient step: x' = x_t − η_t g_t. If you've restricted the parameters to a subspace X (e.g., they must be positive, …), find the closest thing in X to x': x_{t+1} = argmin_{x in X} dist(x, x'). But… you might be using a "stale" gradient g (from τ steps ago).

MapReduce vs. Parameter Server, revisited. Data model: independent records (MapReduce) vs. independent data (PS). Programming abstraction: Map & Reduce (MapReduce) vs. a key-value store / distributed shared memory (PS). Execution semantics: Bulk Synchronous Parallel (BSP) for MapReduce vs. Bounded Asynchronous for the parameter server. We propose bounded asynchronous execution.

Bounded Asynchronous Parameter Server: stale synchronous parallel (SSP). There is a global clock time t; the parameters a worker "gets" can be out of date, but cannot be older than t − τ. τ controls the "staleness"; this is also known as stale synchronous parallel (SSP). We propose bounded asynchronous execution.

Stale Synchronous Parallel (SSP). Allow workers to proceed at their own pace as long as they stay within an acceptable threshold of each other. [Diagram: black updates are seen by all workers; blue updates are unspecified — workers may or may not see them; worker 1 is blocked.] SSP is straggler-tolerant up to a certain threshold, allows communication and computation to happen concurrently, and can be implemented efficiently by caching parameters. It interpolates between BSP and async and subsumes both: workers usually run at their own pace, but the fastest and slowest threads are not allowed to drift more than s clocks apart. Ask for questions here. [Ho et al 2013]
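A minimal sketch of the SSP rule just described. Only the blocking condition (the fastest and slowest workers may not drift more than the staleness bound apart) comes from the slide; the clock table, class, and method names are illustrative assumptions.

```python
# Sketch of the stale synchronous parallel (SSP) blocking rule.
# `clocks` tracks the last clock each worker has committed; a worker at
# clock `my_clock` blocks until every worker has reached my_clock - tau.
import threading

class SSPClock:
    def __init__(self, num_workers, tau):
        self.tau = tau
        self.clocks = [0] * num_workers
        self.cond = threading.Condition()

    def clock(self, worker_id):
        """Worker finished one iteration: advance its clock."""
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()

    def wait_until_fresh_enough(self, my_clock):
        """Block while the slowest worker is more than tau clocks behind."""
        with self.cond:
            while min(self.clocks) < my_clock - self.tau:
                self.cond.wait()
```

Setting tau = 0 recovers BSP (everyone waits at every clock), while a very large tau approaches fully asynchronous execution.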

Consistency Matters. Consistency is how recent the workers' views of the shared variables are: BSP is consistent, async is not, and consistency really matters. [Plot: LDA experiment — relaxed consistency reduces network time compared to strong consistency.] A suitable delay (SSP) gives a big speed-up. [Ho et al 2013]

Stale Synchronous Parallel (SSP) LDA experiment. Log likelihood vs time. SSP with correct staleness value is better than Async and BSP. SSP retains theoretical convergence guarantees. [Ho et al 2013]

Beyond the PS/SSP Abstraction…

Managed Communications. [Diagram: under BSP, computation stalls at every barrier sync; under SSP, a worker stalls only if a certain previous sync did not complete.] A further piece of work: managed communications. BSP stalls during communication; SSP prevents computation from stalling and overlaps communication with computation — but the network can be idle, and therefore underused, much of the time. [Wei et al 2015]

Managed Communications. Problem: limited spare network bandwidth exists, and fresher values make computation more effective — so how do we use the spare bandwidth (but no more) and use it well? [Diagram: existing systems show bursts of traffic followed by an idle network.] Knowing these facts, how can we make more effective use of bandwidth to speed up the system? The solution is to actively manage communication in an intelligent manner: a system framework for managing communication that automatically improves performance with spare bandwidth. [Wei et al 2015]

Bosen: choosing the model partition. [Diagram: servers 1 and 2 each hold a model partition; workers 1 and 2 each hold a data partition and run the ML app on top of a client library.] First, a bit on Bosen's architecture: servers store and serve partitions of the model; workers store partitions of the data, and the ML app does its computations using the data and the model. The client library takes care of synchronizing the parameter caches according to SSP — a coherent shared-memory abstraction for the application; let the library worry about consistency, communication, etc. [Power'10] [Ahmed'12] [Ho'13] [Li'14] [Wei et al 2015]

Bosen: Other Ways To Manage Communication. Model parameters are not equally important — e.g., the majority of the parameters may converge within a few iterations. So communicate the more important parameter values or updates; the magnitude of the changes indicates importance. This leads to magnitude-based prioritization strategies, for example relative-magnitude prioritization. Intuition: only communicate the important parameters, i.e., those that are changing quickly, and prioritize based on update magnitude. [Wei et al 2015]
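A sketch of what magnitude-based prioritization could look like. The scoring rule (|accumulated delta| / |current value|) follows the relative-magnitude idea named on the slide, but the function, its arguments, and the budget mechanism here are hypothetical, not Bosen's actual code.

```python
# Hypothetical sketch of relative-magnitude prioritization: send the
# parameters whose pending updates are largest relative to their current
# values, up to a per-round communication budget.
def pick_updates_to_send(pending_deltas, current_values, budget):
    """pending_deltas, current_values: dicts mapping key -> float."""
    def priority(key):
        delta = abs(pending_deltas[key])
        value = abs(current_values.get(key, 0.0)) + 1e-8  # avoid divide-by-zero
        return delta / value

    ranked = sorted(pending_deltas, key=priority, reverse=True)
    return ranked[:budget]   # communicate only the highest-priority keys this round
```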

Recap: What Problems Are We Solving?

Iterative ML Algorithms. Notation: L is the loss, Δ the gradient/update, A the parameters at time t, D the data, and F the update rule. Before going any further, look at some properties of machine learning algorithms: take data and the current parameters, compute an update to the model, and apply it to obtain a better model — over and over again until convergence. Many ML algorithms are iterative-convergent. Examples: optimization and sampling methods — topic models, matrix factorization, SVMs, deep neural networks…
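Using the notation on the slide (and hedging, since the equation itself did not survive the transcript), the iterative-convergent update has the general form:

```latex
% General iterative-convergent update, in the slide's notation:
% A: model parameters, D: data, \Delta: update computed from the loss L,
% F: rule that applies the update to the previous parameters.
A^{(t)} = F\!\left(A^{(t-1)},\; \Delta_L\!\left(A^{(t-1)}, D\right)\right)
```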

Iterative ML with a Parameter Server: (1) Data Parallel. Δ is the gradient of L, and D is data shard p; the update is usually applied with an add here. This is a natural way to parallelize ML algorithms: split up the data, and have each worker use its subset of the data to generate updates to the globally shared parameters (assuming the data is i.i.d.). Each worker is assigned a data partition; the model parameters are shared by the workers, who read and update them.
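A minimal sketch of the data-parallel pattern just described, reusing the hypothetical ParameterServer class from earlier. compute_gradient is a placeholder for whatever the ML app does on its shard.

```python
# Data-parallel worker loop: each worker p owns a data shard D_p and
# repeatedly (1) reads the shared parameters, (2) computes a gradient
# on its shard, (3) adds the update back to the parameter server.
def data_parallel_worker(ps, shard, keys, num_iters, lr=0.1):
    for _ in range(num_iters):
        # (1) read the current model (possibly slightly stale under SSP)
        w = {k: ps.get(k) for k in keys}
        # (2) compute a gradient on the local shard (hypothetical helper)
        grad = compute_gradient(w, shard)
        # (3) push additive updates; the server sums updates from all workers
        for k, g in grad.items():
            ps.add(k, -lr * g)
```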

(2) Model parallel. Different parts of the model can be stored on different parameter servers, and workers retrieve just the parts they need from the corresponding machines. Here we ignore D as well as L; S_p is a scheduler for processor p that selects the parameters p will work on.

Parameter Server – scheduling… Optional scheduling interface, split between worker machines and scheduler machines: schedule(key) → parameter keys svars; push(p=workerId, svars) → changed keys; pull(svars, updates=(push1, …, pushn)).
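A hedged sketch of how this scheduling interface might be used for model parallelism. The method names follow the slide (schedule, pull; push happens on the worker side), but the bodies and the round-robin example policy are invented for illustration.

```python
# Sketch of the model-parallel scheduling interface from the slide:
#   schedule()             -> parameter keys each worker should work on
#   push(worker_id, svars) -> issued by workers to report changed keys
#   pull(svars, updates)   -> scheduler aggregates the pushed updates
class Scheduler:
    def __init__(self, all_keys, num_workers):
        self.all_keys = list(all_keys)
        self.num_workers = num_workers

    def schedule(self, worker_id):
        """Pick the subset of parameters worker `worker_id` should update."""
        # Toy policy: round-robin partition of the keys (real schedulers
        # prefer non-converged, nearly-independent parameters).
        return self.all_keys[worker_id::self.num_workers]

    def pull(self, svars, updates):
        """Aggregate the (key, delta) dicts pushed by the workers."""
        merged = {}
        for push in updates:
            for key, delta in push.items():
                merged[key] = merged.get(key, 0.0) + delta
        return merged
```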

Support for model-parallel programs

Similar to signal-collect: schedule() defines graph, workers push params to scheduler, scheduler pulls to aggregate, and makes params available via get() and inc()

A Data Parallel Example

About: distance metric learning. Instance: a pair (x1, x2). Label: similar or dissimilar. Model: scale x1 and x2 with a matrix L, and try to minimize the distance ||Lx1 − Lx2||² for similar pairs and max(0, 1 − ||Lx1 − Lx2||²) for dissimilar pairs. (Below we use x, y instead of x1, x2.)
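Written out in the slide's notation (x, y for the pair, L the learned scaling matrix), the per-pair losses are:

```latex
% Distance metric learning objective from the slide, for a pair (x, y):
%   similar pair:    minimize the scaled distance
%   dissimilar pair: hinge loss that pushes the scaled distance above 1
\ell_{\text{sim}}(x, y) = \lVert Lx - Ly \rVert^2,
\qquad
\ell_{\text{dis}}(x, y) = \max\!\left(0,\; 1 - \lVert Lx - Ly \rVert^2\right)
```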

Example: Data parallel SGD Could also get only keys I need

A Model Parallel Example: Lasso

Regularized logistic regression: replace the log conditional likelihood LCL with LCL plus a penalty for large weights, e.g. a squared (L2) penalty; an alternative penalty is the L1 norm.
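The penalty formulas themselves were images on the slide; as a hedged reconstruction in the usual form (if LCL is being maximized, the penalty is subtracted), the two objectives are:

```latex
% Reconstruction of the regularized objectives (shown as images on the slide).
% L2 ("ridge") penalty:
\text{LCL}(w) \;-\; \mu \sum_j w_j^2
% Alternative L1 ("lasso") penalty, which pushes weights to exactly zero:
\text{LCL}(w) \;-\; \mu \sum_j \lvert w_j \rvert
```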

Regularized logistic regression. [Plot: the L2 penalty has a shallow gradient near 0, while the L1 penalty has a steep gradient near 0.] L1 regularization pushes parameters to exactly zero: sparse models.

SGD Repeat for t=1,….,T For each example Compute gradient of regularized loss (for that example) Move all parameters in that direction (a little)

Coordinate descent Repeat for t=1,….,T For each parameter j Compute gradient of regularized loss (for that parameter j) Move that parameter j (a good ways, sometimes to its minimal value relative to the others)

Stochastic coordinate descent Repeat for t=1,….,T Pick a random parameter j Compute gradient of regularized loss (for that parameter j) Move that parameter j (a good ways, sometimes to its minimal value relative to the others)

Parallel stochastic coordinate descent (shotgun) Repeat for t=1,….,T Pick several coordinates j1,…,jp in parallel Compute gradient of regularized loss (for each parameter jk) Move each parameter jk
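A minimal sketch of the shotgun pattern just described (stochastic parallel coordinate updates). The slide only gives the outer loop; regularized_grad_j and the simple fixed-step update are placeholder assumptions (a real lasso solver would move each coordinate toward its conditional minimum, e.g. by soft-thresholding).

```python
# Sketch of parallel stochastic coordinate descent ("shotgun"):
# each round, pick P coordinates at random and update them in parallel.
import random
from concurrent.futures import ThreadPoolExecutor

def shotgun(w, regularized_grad_j, num_rounds, P, lr=0.1):
    """w: list of parameters; regularized_grad_j(w, j) -> partial derivative
    of the regularized loss w.r.t. coordinate j (user-supplied)."""
    with ThreadPoolExecutor(max_workers=P) as pool:
        for _ in range(num_rounds):
            coords = random.sample(range(len(w)), P)     # pick P coordinates
            grads = list(pool.map(lambda j: regularized_grad_j(w, j), coords))
            for j, g in zip(coords, grads):
                w[j] -= lr * g                           # move each coordinate
    return w
```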

Parallel coordinate descent (shotgun)

Parallel coordinate descent (shotgun)

Example: Model parallel SGD Basic ideas: Pick parameters stochastically Prefer large parameter values (i.e., ones that haven’t converged) Prefer nearly-independent parameters

Example: Model parallel SGD

Example: Model parallel SGD

Case Study: Topic Modeling with LDA Summarize with a case study. Last homework will be to implement LDA on a parameter server.

Example: Topic Modeling with LDA. [Diagram: the word-topic distributions are maintained by the parameter server; the local variables — documents, tokens, and their topic assignments — are maintained by the worker nodes.]

Gibbs Sampling for LDA. [Diagram: example document "Oh, The Places You'll Go!" — "You have brains in your head. You have feet in your shoes. You can steer yourself any direction you choose." — with latent topic assignments z1…z7 for the content words (brains, head, feet, shoes, steer, direction, choose), a per-document doc-topic distribution θd, and a global word-topic distribution.] Gibbs sampling for LDA — you all saw this last week. Given a document, remove the stop words and associate a latent topic with each remaining word; this gives us a doc-topic distribution. In addition, we have a global word-topic distribution shared among all documents. In Gibbs sampling, cycle through the latent states, sample a new state for each, and update the corresponding entry in the global table.
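For reference, this is the standard conditional the collapsed sampler computes when resampling the topic of a token with word w in document d (this formula is not on the slide itself):

```latex
% Collapsed Gibbs sampling conditional for LDA (standard form).
% n_{dk}: tokens in document d assigned to topic k (excluding this token)
% n_{wk}: count of word w assigned to topic k; n_k: total tokens in topic k
% \alpha, \beta: Dirichlet hyperparameters; V: vocabulary size
P(z = k \mid \text{rest}) \;\propto\; (n_{dk} + \alpha)\,
\frac{n_{wk} + \beta}{n_{k} + V\beta}
```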

Ex: Collapsed Gibbs Sampler for LDA — partitioning the model and data. [Diagram: three parameter servers hold the word-topic counts for words 1–10K, 10K–20K, and 20K–30K.]

Ex: Collapsed Gibbs Sampler for LDA — get model parameters and compute updates. [Diagram: a worker whose documents contain "car", "cat", "tire", and "mouse" issues get("car"), get("cat"), get("tire"), and get("mouse") to the servers holding those words.] This algorithm is naturally adapted to the parameter server abstraction: split the word-topic table between different servers, have workers request the parameters they need from the corresponding servers, and compute updates to the global word-topic table.

Ex: Collapsed Gibbs Sampler for LDA — send changes back to the parameter server. [Diagram: the worker issues add("car", δ), add("cat", δ), add("tire", δ), and add("mouse", δ); the servers apply the deltas to the global parameters.]
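Putting the last two slides together, a worker's inner loop might look like the sketch below. Only the get("word") / add("word", δ) pattern comes from the slides; resample_topic, the document structure, and using per-word topic-count vectors as values are illustrative assumptions.

```python
# Sketch of one worker's pass for PS-based collapsed Gibbs sampling:
# fetch the word-topic counts it needs, resample topics locally, and
# send the count changes back to the servers as deltas.
def lda_worker_pass(ps, docs, num_topics):
    for doc in docs:
        for i, (word, old_topic) in enumerate(doc.tokens):
            counts = ps.get(word)                  # per-topic counts for this word
            new_topic = resample_topic(counts, doc, old_topic)   # hypothetical helper
            if new_topic != old_topic:
                delta = [0] * num_topics           # sparse in practice; dense here
                delta[old_topic] -= 1
                delta[new_topic] += 1
                ps.add(word, delta)                # push the count change
                doc.tokens[i] = (word, new_topic)
```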

Ex: Collapsed Gibbs Sampler for LDA — adding a caching layer to collect updates. [Diagram: each worker keeps a parameter cache of the words it touches (e.g., car, dog, pig, bat; cat, gas, zoo, VW; car, rim, bmw, $$; mac, iOS, iPod, cat).] To avoid a network bottleneck, cache parameters on each worker. How to synchronize the cache is an area of interesting research — e.g., ESSP and managed communications.

Experiment: Topic Model (LDA). [Plot: higher is better.] Experiment using the NYTimes dataset (100M tokens, 100K-word vocabulary, 100 topics) with collapsed Gibbs sampling. Compute cluster: 8 nodes, each with 64 cores (512 cores total) and 128GB of memory. The point is that a simple implementation like the one above can already scale up to hundreds of cores. ESSP converges faster and is robust to the staleness s. [Dai et al 2015]

LDA Samplers Comparison Emphasize that innovations in big ML are the combination of algorithmic and systems innovations. Graphs show different samplers for LDA. LightLDA employs a constant-time sampler. This is an algorithmic innovation that pushes the limit by itself. [Yuan et al 2015]

Big LDA on Parameter Server. Combine algorithmic innovations with systems innovations like parameter servers. Collapsed Gibbs sampler; size: 50B tokens, 2000 topics, a 5M-word vocabulary; 1k–6k nodes. [Li et al 2014]

LDA Scale Comparison.
- YahooLDA (SparseLDA) [1]: 20M documents; 1000 topics; vocabulary est. 100K [2]; time to converge: N/A; 400 machines.
- Parameter Server (SparseLDA) [2]: 50B tokens; 2000 topics; 5M-word vocabulary; 20 hrs to converge; 6000 machines (60k cores).
- Tencent Peacock (SparseLDA) [3]: 4.5B tokens; 100K topics; 6.6 hrs/iteration; 500 cores.
- AliasLDA [4]: 100M tokens; 1024 topics; 2 hrs to converge; 1 machine (1 core).
- PetuumLDA (LightLDA) [5]: 200B tokens; 1M topics; 60 hrs to converge; 24 machines (480 cores).
Other specs listed on the slide: a 210K-word vocabulary; machines with 10 cores and 128GB RAM, 4 cores and 12GB RAM, and 20 cores and 256GB RAM.
The result is some of the largest LDA experiments ever done; notice that three of the five were done using parameter servers.
[1] Ahmed, Amr, et al. "Scalable inference in latent variable models." WSDM (2012).
[2] Li, Mu, et al. "Scaling distributed machine learning with the parameter server." OSDI (2014).
[3] Wang, Yi, et al. "Towards Topic Modeling for Big Data." arXiv:1405.4402 (2014).
[4] Li, Aaron Q., et al. "Reducing the sampling complexity of topic models." KDD (2014).
[5] Yuan, Jinhui, et al. "LightLDA: Big Topic Models on Modest Compute Clusters." arXiv:1412.1576 (2014).