
1 Parameter Servers (slides courtesy of Aurick Qiao, Joseph Gonzalez, Wei Dai, and Jinliang Wei). Parameter servers are very popular: they are used by us, by Alex Smola, and at Google, Yahoo, Microsoft, and elsewhere. This lecture covers the latest developments.

2 RECAP: Regret analysis for on-line optimization

3 RECAP 2009

4 RECAP f is the loss function and x the parameters. Take a gradient step: x' = x_t - η_t g_t. If you've restricted the parameters to a subspace X (e.g., they must be positive), find the closest point in X to x': x_{t+1} = argmin_{x in X} dist(x, x'). But… you might be using a "stale" gradient g (from τ steps ago).
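As a concrete illustration of this update, here is a minimal Python sketch of one delayed, projected SGD step. The function and argument names are ours, not from the slides, and the projection is assumed to be a Euclidean projection onto X.

```python
import numpy as np

def delayed_projected_sgd_step(x_t, stale_grad, eta_t, project=None):
    """One step of projected SGD using a possibly stale gradient.

    x_t        : current parameter vector
    stale_grad : gradient g computed tau steps ago (tau >= 0)
    eta_t      : learning rate for this step
    project    : optional function mapping a point to the closest
                 point in the feasible set X (Euclidean projection)
    """
    x_prime = x_t - eta_t * stale_grad      # gradient step: x' = x_t - eta_t * g
    if project is not None:
        return project(x_prime)             # x_{t+1} = argmin_{x in X} dist(x, x')
    return x_prime

# Example: constrain parameters to be non-negative.
x_next = delayed_projected_sgd_step(
    np.array([0.2, -0.1, 0.5]),
    np.array([1.0, -2.0, 0.3]),
    eta_t=0.05,
    project=lambda x: np.maximum(x, 0.0),
)
```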

5 RECAP Regret: how much loss was incurred during learning, over and above the loss incurred with an optimal choice of x. Special case: f_t is 1 if a mistake was made and 0 otherwise, and f_t(x*) = 0 for the optimal x*; then regret = # mistakes made during learning.
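In symbols (a standard formulation consistent with the slide's description; the formula on the original slide was an image):

```latex
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x^{\ast} \in X} \sum_{t=1}^{T} f_t(x^{\ast})
```

In the mistake-bound special case, each f_t(x_t) is 0 or 1 and the second sum vanishes, so the regret is exactly the number of mistakes.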

6 RECAP Theorem: you can choose a learning rate so that the regret of delayed SGD is bounded in terms of T = # timesteps and the staleness τ > 0.

7 RECAP Theorem 8: you can do better if you assume (1) the examples are i.i.d. and (2) the gradients are smooth (analogous to the assumption about L). Then you can show a bound on expected regret in which the no-delay loss is the dominant term.

8 RECAP Experiments

9 Parameter Servers (slides courtesy of Aurick Qiao, Joseph Gonzalez, Wei Dai, and Jinliang Wei). Parameter servers are very popular: used by us, Alex Smola, Google, Yahoo, Microsoft, and others. The rest of the lecture covers the latest developments.

10 ML Systems: scalable machine learning algorithms, abstractions, and scalable systems.

11 ML Systems Landscape. A very high-level overview of the current landscape in ML systems: dataflow systems such as Hadoop and Spark, where data is a set of independent records pushed through a processing pipeline; graph systems such as GraphLab and Tensorflow, where the problem is modeled as a graph and each node communicates with its neighbors; and distributed shared-memory systems such as Bosen, DMTK, and ParameterServer.org, where the model is globally accessible and updated by external workers.

12 ML Systems Landscape: Algorithms. Each of these types of systems aims to support certain types of algorithms (dataflow: Hadoop, Spark; graph: GraphLab, Tensorflow; shared memory: Bosen, DMTK, ParameterServer.org).

13 ML Systems Landscape: Algorithms. Dataflow systems: Naïve Bayes, Rocchio. Graph systems: graph algorithms and graphical models such as HMMs. Shared-memory systems: SGD (e.g., SGD logistic regression) and sampling (e.g., Gibbs sampling for LDA) [NIPS'09, NIPS'13].

14 ML Systems Landscape: Abstractions. Each of these types of systems also offers a different abstraction (dataflow: Hadoop & Spark; graph: GraphLab, Tensorflow; shared memory: Bosen, DMTK, ParameterServer.org).

15 ML Systems Landscape: Abstractions. Dataflow systems: PIG, GuineaPig, and similar languages. Graph systems: vertex programs [UAI'10], where code runs on a vertex of the graph and communicates with its neighbors. Shared-memory systems: the parameter server [VLDB'10], where the model is stored on a globally accessible parameter server.

16 ML Systems Landscape: Parameter Server. The parameters of the ML model are stored in a distributed hash table that is accessible through the network [NIPS'09, NIPS'13]; the model lives on a globally accessible parameter server. Parameter servers are used at Google, Yahoo, and elsewhere, with academic work by Smola, Xing, and others [VLDB'10].

17 Parameter Servers Are Flexible. Looking at the landscape, parameter servers are everywhere: many algorithms are implemented with a parameter server, and many companies use them, including Google, Yahoo, and Microsoft.

18 Parameter Server (PS). Data parallelism naturally leads to the parameter server abstraction: data is distributed among worker machines, while model parameters are distributed across server machines and served to the workers via a key-value interface (distributed shared memory). Extensions: multiple keys (for a matrix); multiple "channels" (for multiple sparse vectors, multiple clients for the same servers, …). [Smola et al 2010, Ho et al 2013, Li et al 2014]

19 Parameter Server (PS). Further extensions: a push/pull interface to send/receive the most recent copy of (a subset of) the parameters, where blocking is optional; a worker can also block until all push/pulls with clock < (t - τ) have completed. [Smola et al 2010, Ho et al 2013, Li et al 2014]

20 Data parallel. Split the data across machines. Different parts of the model (e.g., parameters w1…w3, w4…w6, w7…w9) are stored on different parameter servers, and each worker retrieves just the parts it needs from the corresponding machines.

21 Data parallel (continued). Different parts of the model live on different servers; workers retrieve the parts they need, as they need them, while the data stays split across the worker machines.

22 Parameter Server Abstraction. Key-value API for workers: get(key) → value, and add(key, delta). Parameter servers use a simple key-value store API: get retrieves values from the servers (the worker then computes a delta), and add sends the delta back to the servers, which apply it to produce the new model. Key advantage: algorithms look like single-threaded code; there is no need to build a complicated pipeline of joins, maps, etc. as in MapReduce.
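To make the "looks single-threaded" point concrete, here is a minimal sketch of the get/add abstraction. The class and function names are illustrative, not any particular system's API, and the shard here is an in-process stand-in for what would really be a networked, sharded server.

```python
class ParameterShard:
    """One server's shard of the model: a simple key -> value store."""
    def __init__(self):
        self.table = {}

    def get(self, key):
        return self.table.get(key, 0.0)

    def add(self, key, delta):
        # Updates are additive, so they can be applied in any order.
        self.table[key] = self.table.get(key, 0.0) + delta


def worker_loop(shard, data, grad_fn, eta=0.1, epochs=5):
    """Worker code reads like a single-threaded program:
    get the parameters it needs, compute a delta, add it back."""
    for _ in range(epochs):
        for x, y in data:
            w = shard.get("w")
            shard.add("w", -eta * grad_fn(w, x, y))


# Toy usage with a 1-D squared loss: gradient of (w*x - y)^2 with respect to w.
shard = ParameterShard()
worker_loop(shard, [(1.0, 2.0), (2.0, 4.0)],
            grad_fn=lambda w, x, y: 2 * (w * x - y) * x)
```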

23 PS vs. Hadoop MapReduce. You have all used MapReduce; it is easy to program, but is it an efficient system for ML?

24 Iteration in Map-Reduce (IPM). The suspicion is that MapReduce is not ideal (after all, this presentation is being given), so take a closer look at what an iterative ML algorithm running on MapReduce looks like: start from an initial model w(0), load a large amount of training data from disk into the mappers, reduce, and save the output model w(1) back to disk. And that's not all: you need to iterate again and again (w(2), w(3), …) to converge.

25 Cost of Iteration in Map-Reduce: repeatedly loading the same data. Disk is really slow, and each iteration (read 1, read 2, read 3, …) loads the same training data all over again.

26 Cost of Iteration in Map-Reduce: redundantly saving output between stages. The intermediate models w(1), w(2), … are written to disk even though they are loaded again immediately and we don't even care about those intermediate values.

27 Parameter Servers: the Stale Synchronous Parallel Model (slides courtesy of Aurick Qiao, Joseph Gonzalez, Wei Dai, and Jinliang Wei).

28 Parameter Server (PS). Recall: data parallelism naturally leads to the parameter server abstraction. Data is distributed among worker machines; model parameters are stored on PS machines and accessed via a key-value interface (distributed shared memory). [Smola et al 2010, Ho et al 2013, Li et al 2014]

29 Iterative ML Algorithms. Before going any further, look at some properties of machine learning algorithms: a worker takes data and the current model parameters, computes an update, and applies it to obtain a better model, iteratively, over and over again until convergence. Examples: topic models, matrix factorization, SVMs, deep neural networks, …

30 Map-Reduce vs. Parameter Server.
Data model: independent records (Map-Reduce) vs. independent data (Parameter Server).
Programming abstraction: Map & Reduce vs. a key-value store (distributed shared memory).
Execution semantics: Bulk Synchronous Parallel (BSP) vs. an open question. The only remaining question is the execution semantics for the parameter server.

31 The Problem: Networks Are Slow! Every get(key) and add(key, delta) crosses the network between worker machines and server machines, and the big problem in distributed systems is always the network: it is slow compared to local memory access. To avoid a network bottleneck, we need to cache parameters locally on the workers; the synchronization mechanism is then key in determining performance, as we will see. We want to explore options for handling this. [Smola et al 2010, Ho et al 2013, Li et al 2014]

32 Solution 1: Cache Synchronization. An example of worker cache synchronization: workers holding data partitions synchronize their cached parameters with the server.

33 Parameter Cache Synchronization. Only sparse changes to the model need to be communicated between the workers and the server.

34 Parameter Cache Synchronization (aka IPM). Workers compute against their cached parameters and periodically synchronize with the server.

35 Solution 2: Asynchronous Execution. The previous animation is an example of bulk synchronous execution: machines alternate compute and communicate phases separated by barriers, and time is wasted at each barrier waiting for the slowest machine. The other end of the spectrum is asynchronous execution: remove the barriers and let each worker communicate and compute whenever it likes, enabling more frequent coordination on parameter values.

36 Asynchronous Execution. With a (logical) parameter server holding the model, communication and computation overlap completely: no idle resources, maximizing throughput. [Smola et al 2010]

37 Asynchronous Execution. Problem: pure asynchrony lacks theoretical guarantees, since a distributed environment can have arbitrary delays from the network and from stragglers. But…

38 RECAP f is the loss function and x the parameters. Take a gradient step: x' = x_t - η_t g_t. If you've restricted the parameters to a subspace X (e.g., they must be positive), find the closest point in X to x': x_{t+1} = argmin_{x in X} dist(x, x'). But… you might be using a "stale" g (from τ steps ago), and the regret bounds above still hold.

39 Map-Reduce vs. Parameter Server, revisited.
Data model: independent records (Map-Reduce) vs. independent data (Parameter Server).
Programming abstraction: Map & Reduce vs. a key-value store (distributed shared memory).
Execution semantics: Bulk Synchronous Parallel (BSP) vs. the proposal here, Bounded Asynchronous.

40 Bounded Asynchronous Parameter Server, also known as stale synchronous parallel (SSP). We propose bounded asynchrony: there is a global clock time t, and the parameters a worker "gets" can be out of date, but cannot be older than t - τ; τ controls the "staleness".

41 Stale Synchronous Parallel (SSP). Allow workers to proceed at their own pace as long as they stay within an acceptable threshold of each other: the fastest and slowest threads are not allowed to drift more than s clocks apart. In the figure, black updates are seen by all workers, blue updates are unspecified (workers may or may not see them), and worker 1 is blocked. SSP interpolates between BSP and async and subsumes both; it is straggler-tolerant up to the threshold, allows communication and computation to happen concurrently, and can be implemented efficiently by caching parameters. [Ho et al 2013]
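A minimal sketch of the SSP rule itself, assuming each worker calls clock_tick() at the end of an iteration. The class and method names are ours, and a real implementation would also manage the cached parameter values, not just the clocks.

```python
import threading

class SSPCoordinator:
    """Tracks per-worker clocks and enforces the staleness bound s:
    a worker at clock c may proceed only if the slowest worker's
    clock is at least c - s."""
    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness
        self.cond = threading.Condition()

    def clock_tick(self, worker_id):
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()
            # Block while this worker is more than s clocks ahead
            # of the slowest worker.
            while self.clocks[worker_id] > min(self.clocks) + self.staleness:
                self.cond.wait()
```

With staleness 0 this degenerates to BSP (everyone waits at every clock), and with a very large staleness it behaves like fully asynchronous execution.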

42 Consistency Matters. Consistency describes how recent the workers' views of the shared variables are: BSP is consistent, async is not. Consistency really matters: in an LDA experiment, relaxed consistency reduces the time spent on the network compared with strong consistency, and a suitable delay (SSP) gives a big speed-up. [Ho et al 2013]

43 Stale Synchronous Parallel (SSP). LDA experiment, log likelihood vs. time: SSP with the correct staleness value beats both async and BSP, and SSP retains theoretical convergence guarantees. [Ho et al 2013]

44 Beyond the PS/SSP Abstraction…

45 Managed Communications. A further piece of work. Under BSP, computation stalls at every barrier sync while communication happens; under SSP, a worker stalls only if a certain previous sync did not complete, so communication and computation overlap and computation rarely stalls. However, the network can then be idle, and underused, for much of the time. [Wei et al 2015]

46 Managed Communications. Problem: spare network bandwidth exists, but it is limited, and we know that fresher parameter values make computation more effective. How can we use the spare bandwidth, but no more, and use it well? The solution is a system framework that actively manages communication in an intelligent manner and automatically improves performance when spare bandwidth is available, instead of alternating bursts of traffic with an idle network. [Wei et al 2015]

47 Bosen: choosing the model partition. A bit on Bosen's architecture: servers store and serve partitions of the model; workers store partitions of the data, and the ML app does its computation using the data and the model; the client library takes care of synchronizing the parameter caches according to SSP. The result is a coherent shared-memory abstraction for the application: let the library worry about consistency, communication, and so on. [Power'10] [Ahmed'12] [Ho'13] [Li'14] [Wei et al 2015]

48 Bosen: Other Ways To Manage Communication. Model parameters are not all equally important; e.g., the majority of the parameters may converge within a few iterations. So communicate the more important parameter values or updates: the magnitude of a change indicates its importance, which leads to magnitude-based prioritization strategies such as relative-magnitude prioritization. [Wei et al 2015]
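A sketch of what relative-magnitude prioritization under a per-round bandwidth budget could look like; this illustrates the idea and is not Bosen's actual implementation, and all names are ours.

```python
def prioritize_updates(pending_deltas, current_values, budget, eps=1e-8):
    """Send the updates with the largest relative magnitude |delta| / |value|,
    up to `budget` keys per communication round; the rest stay buffered."""
    ranked = sorted(
        pending_deltas.items(),
        key=lambda kv: abs(kv[1]) / (abs(current_values.get(kv[0], 0.0)) + eps),
        reverse=True,
    )
    to_send = dict(ranked[:budget])
    deferred = dict(ranked[budget:])
    return to_send, deferred
```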

49 Recap: What Problems Are We Solving?

50 Iterative ML Algorithms. Notation: L is the loss, Δ the gradient/update function, A the parameters at time t, D the data, and F the update rule. Many ML algorithms are iterative-convergent; examples include optimization and sampling methods for topic models, matrix factorization, SVMs, deep neural networks, …
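The formula on the original slide was shown as an image; in the slide's notation, the standard iterative-convergent update it describes is:

```latex
A^{(t)} \;=\; F\!\left( A^{(t-1)},\; \Delta_{L}\!\left( A^{(t-1)},\, D \right) \right)
```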

51 Iterative ML with a Parameter Server: (1) Data Parallel. Each worker is assigned a data partition (shard D_p, assumed i.i.d.); the model parameters are shared by all workers, and each worker reads and updates the shared parameters, with Δ computing the gradient of L on its shard and the per-shard updates usually combined additively. This is a natural way to parallelize ML algorithms.
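The data-parallel form (again, the slide's formula was an image; this is the standard additive aggregation under the i.i.d. assumption, with D_p the shard assigned to worker p):

```latex
A^{(t)} \;=\; F\!\left( A^{(t-1)},\; \sum_{p=1}^{P} \Delta_{L}\!\left( A^{(t-1)},\, D_{p} \right) \right)
```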

52 (2) Model parallel. Different parts of the model can be stored on different parameter servers, and workers retrieve just the parts they need from the corresponding machines. Here the update can ignore D as well as L, and S_p is a scheduler for processor p that selects which parameters p will work on.

53 Parameter Server: scheduling. An optional scheduling interface, split between worker machines and scheduler machines: schedule(key) → param keys (svars); push(p=workerId, svars) → changed keys; pull(svars, updates=(push1, …, pushn)). As before, the worker-facing side is a simple key-value store API, and the key advantage is that algorithms look like single-threaded code, with no complicated pipeline of joins, maps, etc. as in MapReduce.
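A toy, self-contained sketch of the schedule()/push()/pull() pattern. All class and function names are ours; a real scheduler would prefer large (unconverged) and nearly-independent parameters rather than random ones, and the worker update here is only a stand-in.

```python
import random

class ToyScheduler:
    """Toy version of the schedule()/push()/pull() pattern."""
    def __init__(self, num_workers, keys_per_worker, seed=0):
        self.p = num_workers
        self.k = keys_per_worker
        self.rng = random.Random(seed)

    def schedule(self, params):
        # Assign each worker a disjoint subset of parameter keys.
        keys = list(params)
        self.rng.shuffle(keys)
        return [keys[i * self.k:(i + 1) * self.k] for i in range(self.p)]

    def pull(self, pushes):
        # Aggregate the changes proposed by all workers.
        merged = {}
        for changes in pushes:
            for key, delta in changes.items():
                merged[key] = merged.get(key, 0.0) + delta
        return merged

def worker_push(assigned_keys, params):
    """Each worker proposes changes only for its assigned keys
    (here: shrink each parameter toward zero, as a stand-in
    for a real coordinate update)."""
    return {k: -0.1 * params[k] for k in assigned_keys}

# One model-parallel round.
params = {f"w{i}": float(i) for i in range(6)}
sched = ToyScheduler(num_workers=2, keys_per_worker=3)
pushes = [worker_push(keys, params) for keys in sched.schedule(params)]
for key, delta in sched.pull(pushes).items():
    params[key] += delta
```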

54 Support for model-parallel programs

55 Similar to signal-collect: schedule() defines the graph, workers push parameters to the scheduler, and the scheduler pulls to aggregate and makes the parameters available via get() and inc().

56 A Data Parallel Example

57 About: Distance metric learning. Instance: a pair (x1, x2). Label: similar or dissimilar. Model: scale x1 and x2 with a matrix L, and try to minimize the distance ||Lx1 - Lx2||^2 for similar pairs and the hinge max(0, 1 - ||Lx1 - Lx2||^2) for dissimilar pairs. (Later slides use x, y instead of x1, x2.)

58 Example: Data parallel SGD. A worker could also get only the keys it needs; a sketch of one such step is below.
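A hedged sketch of one data-parallel SGD step for the distance-metric loss, written against the get/add interface. The LocalPS stub and all names are ours; the gradient follows from differentiating the losses on the previous slide (for ||Ld||^2 with d = x1 - x2, the gradient with respect to L is 2(Ld)d^T).

```python
import numpy as np

class LocalPS:
    """In-process stand-in for the key-value interface (get/add)."""
    def __init__(self, init):
        self.table = dict(init)
    def get(self, key):
        return self.table[key]
    def add(self, key, delta):
        self.table[key] = self.table[key] + delta

def metric_sgd_step(ps, pair, similar, eta=0.01):
    """One SGD step on one training pair.
      similar pair:    loss = ||L x1 - L x2||^2
      dissimilar pair: loss = max(0, 1 - ||L x1 - L x2||^2)
    """
    x1, x2 = pair
    L = ps.get("L")                     # fetch the current scaling matrix
    d = x1 - x2
    Ld = L @ d
    if similar:
        grad = 2.0 * np.outer(Ld, d)    # d/dL ||L d||^2 = 2 (L d) d^T
    elif Ld @ Ld < 1.0:                 # hinge is active for dissimilar pairs
        grad = -2.0 * np.outer(Ld, d)
    else:
        grad = np.zeros_like(L)
    ps.add("L", -eta * grad)            # send only the delta back to the servers

# Toy usage with a 2-D similar pair.
ps = LocalPS({"L": np.eye(2)})
metric_sgd_step(ps, (np.array([1.0, 0.0]), np.array([0.9, 0.1])), similar=True)
```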

59

60 A Model Parallel Example: Lasso

61 Regularized logistic regression. Replace the log conditional likelihood LCL with LCL plus a penalty for large weights, e.g., a squared-norm penalty; an alternative penalty gives L1 regularization.
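The penalties on the original slide were shown as images; the standard choices they refer to are the squared (L2) penalty and, as the alternative, the L1 penalty, written here as minimizing the penalized negative log-likelihood:

```latex
\min_{w} \; -\mathrm{LCL}(w) + \frac{\mu}{2}\,\lVert w \rVert_2^2
\qquad\text{vs.}\qquad
\min_{w} \; -\mathrm{LCL}(w) + \mu\,\lVert w \rVert_1
```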

62 Regularized logistic regression. The squared (L2) penalty has a shallow gradient near 0, while the L1 penalty has a steep gradient near 0; that is why L1-regularization pushes parameters to exactly zero, giving sparse models.

63 SGD. Repeat for t = 1, …, T: for each example, compute the gradient of the regularized loss (for that example) and move all parameters in that direction (a little).

64 Coordinate descent. Repeat for t = 1, …, T: for each parameter j, compute the gradient of the regularized loss with respect to that parameter and move that parameter j (a good ways, sometimes to its minimal value relative to the others).

65 Stochastic coordinate descent. Repeat for t = 1, …, T: pick a random parameter j, compute the gradient of the regularized loss with respect to that parameter, and move that parameter j (a good ways, sometimes to its minimal value relative to the others).

66 Parallel stochastic coordinate descent (shotgun). Repeat for t = 1, …, T: pick several coordinates j1, …, jp in parallel, compute the gradient of the regularized loss for each parameter jk, and move each parameter jk; a sketch follows below.
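A self-contained sketch of shotgun for the Lasso objective (1/2)||Xw - y||^2 + λ||w||_1. The coordinate update is the usual soft-thresholding step, and the parallel workers are simulated by computing all chosen updates from the same stale residual before applying them. This is an illustration, not the original shotgun code; it can diverge if too many correlated coordinates are updated at once, which is why shotgun bounds the degree of parallelism.

```python
import numpy as np

def soft_threshold(z, lam):
    """Closed-form solution of min_w 0.5*(w - z)^2 + lam*|w|."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def shotgun_round(X, y, w, lam, num_parallel, rng):
    """One shotgun round for Lasso: several randomly chosen coordinates
    compute their closed-form update from the SAME (stale) residual,
    then all updates are applied, mimicking parallel workers that
    do not see each other's changes within the round."""
    residual = y - X @ w
    coords = rng.choice(X.shape[1], size=num_parallel, replace=False)
    new_vals = {}
    for j in coords:
        xj = X[:, j]
        rho = xj @ (residual + xj * w[j])     # correlation with the partial residual
        new_vals[j] = soft_threshold(rho, lam) / (xj @ xj)
    for j, wj in new_vals.items():            # apply all chosen updates at once
        w[j] = wj
    return w

# Toy usage: recover a sparse weight vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10); w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.01 * rng.standard_normal(50)
w = np.zeros(10)
for _ in range(200):
    w = shotgun_round(X, y, w, lam=0.1, num_parallel=3, rng=rng)
```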

67 Parallel coordinate descent (shotgun)

68 Parallel coordinate descent (shotgun)

69 Example: Model parallel SGD. Basic ideas: pick parameters stochastically; prefer large parameter values (i.e., ones that haven't converged); prefer nearly-independent parameters.

70 Example: Model parallel SGD

71 Example: Model parallel SGD

72

73 Case Study: Topic Modeling with LDA. To summarize, a case study; the last homework will be to implement LDA on a parameter server.

74 Example: Topic Modeling with LDA. The word-topic distributions are maintained by the parameter server; the documents, their tokens, and the local variables (e.g., per-token topic assignments) are maintained by the worker nodes.

75 Gibbs Sampling for LDA. You all saw this last week. The example document is the title "Oh, The Places You'll Go!": "You have brains in your head. You have feet in your shoes. You can steer yourself any direction you choose." Given a document, remove the stop words (leaving brains, head, feet, shoes, steer, direction, choose) and associate a latent topic z1, …, z7 with each remaining word; this gives us a doc-topic distribution θd. In addition, we have a global word-topic distribution shared among all documents. In Gibbs sampling, we cycle through the latent states, sample a new state for each, and update the corresponding entry in the global table.

76 Ex: Collapsed Gibbs Sampler for LDA. Partitioning the model and the data: the word-topic table is split across parameter servers, e.g., words 1:10K on one server, 10K:20K on another, and 20K:30K on a third.

77 Ex: Collapsed Gibbs Sampler for LDA. Get model parameters and compute an update: this algorithm is naturally adapted to the parameter server abstraction. Each worker requests only the parameters it needs, e.g., get("car"), get("cat"), get("tire"), get("mouse"), from the corresponding servers, and then computes updates to the global word-topic table.

78 Ex: Collapsed Gibbs Sampler for LDA. Send the changes back to the parameter server: each worker sends its deltas, e.g., add("car", δ), add("cat", δ), add("tire", δ), add("mouse", δ), and the servers apply them to the global parameters.
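A hedged sketch of the get → sample → add pattern for one token of the collapsed Gibbs sampler. The DictPS stub and all names are ours; the doc-topic counts stay local to the worker, and only deltas to the global word-topic counts and topic totals are sent back.

```python
import numpy as np

def resample_token(ps, doc_topic, d, w, z_old, alpha, beta, V, K, rng):
    """Resample one token's topic with collapsed Gibbs sampling,
    using a key-value parameter-server client `ps` (get/add)."""
    word_topic = ps.get(("word_topic", w)).copy()   # K-vector of counts for word w
    topic_sum = ps.get("topic_sum").copy()          # K-vector of per-topic totals

    # Remove the token's current assignment from the local and cached views.
    doc_topic[d, z_old] -= 1
    word_topic[z_old] -= 1
    topic_sum[z_old] -= 1

    # Sample a new topic proportional to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta).
    p = (doc_topic[d] + alpha) * (word_topic + beta) / (topic_sum + V * beta)
    z_new = rng.choice(K, p=p / p.sum())

    # Apply the change locally and send only the delta to the servers.
    doc_topic[d, z_new] += 1
    delta = np.zeros(K); delta[z_old] -= 1; delta[z_new] += 1
    ps.add(("word_topic", w), delta)
    ps.add("topic_sum", delta)
    return z_new

class DictPS:
    """Tiny in-process stand-in for the get/add interface."""
    def __init__(self, table): self.table = table
    def get(self, key): return self.table[key]
    def add(self, key, delta): self.table[key] = self.table[key] + delta

# Toy usage: one token of word 3 in doc 0, currently assigned topic 1.
K, V, D = 4, 5, 2
rng = np.random.default_rng(0)
ps = DictPS({("word_topic", w): np.zeros(K) for w in range(V)})
ps.table["topic_sum"] = np.zeros(K)
doc_topic = np.zeros((D, K))
doc_topic[0, 1] += 1; ps.add(("word_topic", 3), np.eye(K)[1]); ps.add("topic_sum", np.eye(K)[1])
z = resample_token(ps, doc_topic, d=0, w=3, z_old=1, alpha=0.1, beta=0.01, V=V, K=K, rng=rng)
```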

79 Ex: Collapsed Gibbs Sampler for LDA. Adding a caching layer to collect updates: to avoid a network bottleneck, each worker keeps a parameter cache of the words it touches (e.g., car, dog, pig, bat on one worker; cat, gas, zoo, VW on another). How to synchronize the cache is an area of interesting research, e.g., ESSP and managed communications.

80 Experiment: Topic Model (LDA); higher is better. Dataset: NYTimes (100M tokens, 100K-word vocabulary, 100 topics), collapsed Gibbs sampling. Compute cluster: 8 nodes, each with 64 cores (512 cores total) and 128GB memory. The point is that a simple implementation like the one above can already scale to hundreds of cores, and ESSP converges faster and is robust to the staleness s. [Dai et al 2015]

81 LDA Samplers Comparison. The graphs compare different samplers for LDA; LightLDA employs a constant-time sampler, an algorithmic innovation that pushes the limit by itself. The point to emphasize is that innovations in big ML come from combining algorithmic and systems innovations. [Yuan et al 2015]

82 Big LDA on a Parameter Server. Combine algorithmic innovations with systems innovations like parameter servers: a collapsed Gibbs sampler at the scale of 50B tokens, 2000 topics, a 5M-word vocabulary, and 1k to 6k nodes. [Li et al 2014]

83 LDA Scale Comparison, across five systems: YahooLDA (SparseLDA) [1], Parameter Server (SparseLDA) [2], Tencent Peacock (SparseLDA) [3], AliasLDA [4], and PetuumLDA (LightLDA) [5].
Dataset size (# of words): 20M documents / 50B / 4.5B / 100M / 200B.
# of topics: 1000 / 2000 / 100K / 1024 / 1M.
Vocabulary size (where given): est. 100K [2], 5M, 210K.
Time to converge: N/A / 20 hrs / 6.6 hrs per iteration / 2 hrs / 60 hrs.
# of machines: 400 / 6000 (60k cores) / 500 cores / 1 (1 core) / 24 (480 cores).
Machine specs (where given): 10 cores, 128GB RAM; 4 cores, 12GB RAM; 20 cores, 256GB RAM.
These are some of the largest LDA experiments ever done; notice that three of the five were done using parameter servers.
[1] Ahmed, Amr, et al. "Scalable inference in latent variable models." WSDM (2012).
[2] Li, Mu, et al. "Scaling distributed machine learning with the parameter server." OSDI (2014).
[3] Wang, Yi, et al. "Towards Topic Modeling for Big Data." arXiv (2014).
[4] Li, Aaron Q., et al. "Reducing the sampling complexity of topic models." KDD (2014).
[5] Yuan, Jinhui, et al. "LightLDA: Big Topic Models on Modest Compute Clusters." arXiv (2014).

