1 MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
John Canny

2 Distributed NN training with “The” Parameter Server
Computational Models:
Imperative (Numpy, Matlab, C++, Java): (a+b)*(c+d) means do "a+b", then "c+d", then multiply.
Declarative (SQL, Spark, Caffe…): take the formula as a *formal* (e.g. mathematical) description of what is to be done. The system is free to choose when and how to compute the result; e.g. it may be easier to do ac + ad + bc + bd some time later.

3 Distributed NN training with “The” Parameter Server
Computational Models:
Concrete (Numpy, Matlab, C++, Java): (a+b)*(c+d) means do "a+b", then "c+d", then multiply.
Asynchronous (SQL, Spark, Caffe…): take the formula as a *formal* (e.g. mathematical) description of what is to be done. The system is free to choose when and how to compute the result; e.g. it may be easier to do ac + ad + bc + bd some time later.
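
To make the contrast concrete, here is a minimal sketch of both styles of computing (a+b)*(c+d); it assumes the classic MXNet 1.x Python API (mx.nd for imperative arrays, mx.sym plus bind/forward for the declarative graph), and the shapes and variable names are illustrative only.

```python
import mxnet as mx

# Imperative: every line runs immediately on concrete arrays.
a, b = mx.nd.ones((2, 2)), mx.nd.ones((2, 2))
c, d = mx.nd.ones((2, 2)), mx.nd.ones((2, 2))
eager = (a + b) * (c + d)                      # computed right here

# Declarative: first describe the computation as a symbolic graph...
sa, sb = mx.sym.Variable('a'), mx.sym.Variable('b')
sc, sd = mx.sym.Variable('c'), mx.sym.Variable('d')
expr = (sa + sb) * (sc + sd)                   # no arithmetic has happened yet

# ...then let the engine decide when and how to evaluate it.
executor = expr.bind(mx.cpu(), {'a': a, 'b': b, 'c': c, 'd': d})
lazy = executor.forward()[0]                   # same result as `eager`
```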

4 Declarative vs. Imperative Languages
Declarative systems use structures to represent the computation: "dataflow" or "computation graphs".
Optimizations (via transformations that preserve the result) can be applied to the graphs before they are evaluated, as in the sketch below.
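
As a toy illustration of such a transformation (this is not MXNet's actual graph optimizer, just a self-contained sketch): the computation is held as a small expression tree, a result-preserving rewrite turns (a+b)*(c+d) into ac + ad + bc + bd, and evaluation happens only afterwards.

```python
# Toy computation graph as nested tuples: ('*', ('+', 'a', 'b'), ('+', 'c', 'd'))
def distribute(expr):
    """Result-preserving rewrite: (a+b)*(c+d) -> ac + ad + bc + bd."""
    if (isinstance(expr, tuple) and expr[0] == '*'
            and isinstance(expr[1], tuple) and expr[1][0] == '+'
            and isinstance(expr[2], tuple) and expr[2][0] == '+'):
        (_, a, b), (_, c, d) = expr[1], expr[2]
        return ('+', ('+', ('*', a, c), ('*', a, d)),
                     ('+', ('*', b, c), ('*', b, d)))
    return expr

def evaluate(expr, env):
    """Evaluate the graph only when asked, long after it was built."""
    if isinstance(expr, str):
        return env[expr]
    op, lhs, rhs = expr
    l, r = evaluate(lhs, env), evaluate(rhs, env)
    return l + r if op == '+' else l * r

graph = ('*', ('+', 'a', 'b'), ('+', 'c', 'd'))
env = dict(a=1.0, b=2.0, c=3.0, d=4.0)
assert evaluate(graph, env) == evaluate(distribute(graph), env) == 21.0
```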

5 Declarative vs. Imperative Languages
Declarative representations are generally preferable. Difficulties:
Can be expensive for small data blocks.
Can be harder to represent recurrent or dynamic structures; loops need to be "unrolled", etc.
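
The unrolling point can be sketched as follows: the loop below builds a fixed three-step graph for a toy recurrence h = h + x_t using MXNet's 1.x symbolic API (the recurrence and the variable names are illustrative only).

```python
import mxnet as mx

# "Recurrent" update expressed declaratively by unrolling T = 3 time steps
# into one static graph.
h = mx.sym.Variable('h0')
for t in range(3):
    x_t = mx.sym.Variable('x%d' % t)
    h = h + x_t          # each iteration adds another node to the graph

# `h` is now a fixed three-step graph; a different sequence length means
# rebuilding (re-unrolling) the graph.
```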

6 Consistency Models
Sequential: execution is equivalent to some sequential execution of the program, and each machine's instructions are executed in an order consistent with the program.
Eventual: after a value is updated, the new value is not available to other nodes immediately. But if there are no other updates, all nodes eventually get the updated value.
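
A toy sketch of the eventual model in a parameter-server setting (hypothetical class and method names, not MXNet's KVStore API): pushes return immediately, reads may be stale, and updates become visible to everyone once no further updates arrive.

```python
from collections import deque

class EventuallyConsistentStore:
    """Updates are queued and applied later; readers may see stale values."""
    def __init__(self, w=0.0):
        self.w = w
        self.pending = deque()

    def push(self, delta):   # a worker sends an update and returns immediately
        self.pending.append(delta)

    def pull(self):          # may not reflect recent pushes yet
        return self.w

    def sync(self):          # with no new updates, all pending ones get applied
        while self.pending:
            self.w += self.pending.popleft()

store = EventuallyConsistentStore()
store.push(0.5)
print(store.pull())   # 0.0 -- stale read: the update is not visible yet
store.sync()
print(store.pull())   # 0.5 -- eventually every reader sees the new value
```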

7 Forward-backward gradients
All forward activations are computed sequentially before the backward gradients.
[Diagram: five layers, forward steps L1 fwd … L5 fwd and backward steps L1 bwd … L5 bwd]

8 Forward-backward gradients
All forward activations are computed sequentially before the backward gradients. But backward gradients need only the previous gradient and the layer's own forward activation:
[Diagram: the same five layers, L1 fwd … L5 fwd and L1 bwd … L5 bwd]

9–12 Forward-backward gradients
(Build slides repeating the text above: each successive frame drops the highest remaining layer, first L5, then L4, L3 and L2, until only L1 fwd / L1 bwd remain. Once a layer's gradient has been computed, its activation and gradient no longer need to be kept.)
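
A small sketch of the memory consequence (illustration only, not MXNet internals): during the backward pass, each layer's stored activation can be dropped as soon as that layer's gradient has been computed, which is exactly what the shrinking diagrams show.

```python
# Which activations stay live during the forward and backward passes
# of a toy 5-layer network.
activations = {}

def forward(n_layers=5):
    x = 'input'
    for i in range(1, n_layers + 1):
        x = 'act%d' % i            # stand-in for computing layer i's output
        activations[i] = x         # kept around for the backward pass
    return x

def backward(n_layers=5):
    grad = 'dLoss'
    for i in range(n_layers, 0, -1):
        grad = 'grad%d' % i        # needs only `grad` and activations[i]
        del activations[i]         # own activation can be freed immediately
        print('after L%d bwd, live activations: %s' % (i, sorted(activations)))

forward()
backward()   # prints [1, 2, 3, 4], then [1, 2, 3], ... and finally []
```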

13 Performance
Memory management:
In-place: reference-count based memory reuse.
Co-share: nodes share storage only if they cannot be run in parallel.
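
A minimal sketch of reference-count based reuse (illustrative, not MXNet's actual allocator): when a buffer's reference count drops to zero it returns to a free pool, and a later node with a matching shape writes into it instead of allocating fresh memory.

```python
class Pool:
    def __init__(self):
        self.free = []            # buffers whose refcount reached zero
        self.allocated = 0        # number of physical buffers ever created

    def alloc(self, shape):
        for buf in self.free:
            if buf['shape'] == shape:   # reuse a dead buffer of matching shape
                self.free.remove(buf)
                buf['refs'] = 1
                return buf
        self.allocated += 1
        return {'shape': shape, 'refs': 1}

    def retain(self, buf):
        buf['refs'] += 1

    def release(self, buf):
        buf['refs'] -= 1
        if buf['refs'] == 0:      # no consumers left: storage becomes reusable
            self.free.append(buf)

pool = Pool()
a = pool.alloc((128, 128))        # output of node A
pool.release(a)                   # last consumer of A has run
b = pool.alloc((128, 128))        # node B reuses A's storage in place
print(pool.allocated)             # 1 -- only one physical buffer was created
```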

14 Performance
Scalability

