1
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
John Canny
2
Distributed NN training with “The” Parameter Server
Computational Models:
Imperative (Numpy, Matlab, C++, Java): (a+b)*(c+d) means do "a+b", then "c+d", then multiply.
Declarative (SQL, Spark, Caffe, ...): take the formula as a *formal* (e.g. mathematical) description of what is to be computed. The system is free to choose when and how to compute the result; e.g., it may be easier to do ac + ad + bc + bd some time later.
3
Distributed NN training with “The” Parameter Server
Computational Models:
Concrete (Numpy, Matlab, C++, Java): (a+b)*(c+d) means do "a+b", then "c+d", then multiply.
Asynchronous (SQL, Spark, Caffe, ...): take the formula as a *formal* (e.g. mathematical) description of what is to be computed. The system is free to choose when and how to compute the result; e.g., it may be easier to do ac + ad + bc + bd some time later.
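A minimal sketch of the two models in plain Python/NumPy (illustrative only, not MXNet's actual API): the imperative version executes each step immediately, while the declarative version only records an expression that the system can rewrite and evaluate later.

```python
import numpy as np

a, b, c, d = (np.array(v) for v in (1.0, 2.0, 3.0, 4.0))

# Imperative / concrete: every step executes immediately, in program order.
t1 = a + b                       # computed now
t2 = c + d                       # computed now
imperative_result = t1 * t2      # computed now

# Declarative: first build a description of the computation ...
expr = ("mul", ("add", "a", "b"), ("add", "c", "d"))

def evaluate(node, env):
    """Walk the expression tree; the system decides when and how to run it
    (and could first rewrite it, e.g. into a*c + a*d + b*c + b*d)."""
    if isinstance(node, str):
        return env[node]
    op, lhs, rhs = node
    x, y = evaluate(lhs, env), evaluate(rhs, env)
    return x * y if op == "mul" else x + y

# ... and evaluate it some time later.
declarative_result = evaluate(expr, {"a": a, "b": b, "c": c, "d": d})
assert np.allclose(imperative_result, declarative_result)
```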
4
Declarative vs. Imperative Languages
Declarative systems use structures to represent the computation: "dataflow" or "computation" graphs.
Optimizations (via transformations that preserve the result) can be applied to the graphs before they are evaluated.
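To make "optimize the graph before evaluating it" concrete, here is a toy rewrite pass (illustrative only, not MXNet's implementation): common-subexpression elimination merges structurally identical nodes so shared work is done once, without changing the result.

```python
import numpy as np

# A node is ("input", name) or (op, child_index, child_index); the graph is a
# list of nodes in topological order.
graph = [
    ("input", "x"),          # 0
    ("add", 0, 0),           # 1: x + x
    ("add", 0, 0),           # 2: x + x   (duplicate of node 1)
    ("mul", 1, 2),           # 3: (x + x) * (x + x)
]

def eliminate_common_subexpressions(nodes):
    """Merge structurally identical nodes so shared work is done once."""
    seen, remap, out = {}, {}, []
    for i, node in enumerate(nodes):
        if node[0] == "input":
            key = node
        else:
            key = (node[0],) + tuple(remap[c] for c in node[1:])
        if key not in seen:
            seen[key] = len(out)
            out.append(key)
        remap[i] = seen[key]
    return out, remap

def run(nodes, feed):
    """Evaluate the graph node by node."""
    vals = []
    for node in nodes:
        if node[0] == "input":
            vals.append(feed[node[1]])
        elif node[0] == "add":
            vals.append(vals[node[1]] + vals[node[2]])
        else:  # "mul"
            vals.append(vals[node[1]] * vals[node[2]])
    return vals[-1]

optimized, _ = eliminate_common_subexpressions(graph)
print(len(graph), "->", len(optimized), "nodes")      # 4 -> 3 nodes
print(run(optimized, {"x": np.arange(3.0)}))          # same result as the original graph
```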
5
Declarative vs. Imperative Languages
Declarative representations are generally preferable. Difficulties:
They can be expensive for small data blocks.
It can be harder to represent recurrent or dynamic structures; loops need to be "unrolled", etc.
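The unrolling difficulty, sketched under the assumption of a fixed sequence length T (names are illustrative, not MXNet's API): imperatively the recurrence is just a loop, but a static graph must contain one copy of the loop body per time step.

```python
import numpy as np

T, H = 4, 8                                   # fixed sequence length, hidden size
rng = np.random.default_rng(0)
W, U = rng.normal(size=(H, H)), rng.normal(size=(H, H))
xs = rng.normal(size=(T, H))

# Imperative: the recurrence is an ordinary loop; T could even vary per example.
h = np.zeros(H)
for t in range(T):
    h = np.tanh(W @ h + U @ xs[t])

# Declarative / static graph: the loop body is replicated T times up front
# ("unrolling"); a different T would require building a different graph.
graph_nodes = []
state = ("zeros",)
for t in range(T):
    state = ("rnn_step", t, state)            # depends on x_t and the previous state
    graph_nodes.append(state)
print(len(graph_nodes), "unrolled graph nodes for one recurrence")   # == T
```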
6
Consistency Models
Sequential: execution is equivalent to some sequential execution of the program, and each machine's instructions are executed in an order consistent with the program.
Eventual: after a value is updated, the new value is not available to other nodes immediately. But if there are no further updates, all nodes eventually see the updated value.
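A single-process toy sketch of the eventual-consistency case (hypothetical, not MXNet's KVStore API): a worker pulls a snapshot of the weights, another worker pushes an update, and the first worker keeps reading its stale copy until it pulls again.

```python
import copy

class ToyParameterServer:
    """Single-process stand-in for a distributed key-value parameter store."""
    def __init__(self, weights):
        self.weights = dict(weights)

    def pull(self):
        # Each worker gets its own snapshot; it can go stale immediately.
        return copy.deepcopy(self.weights)

    def push(self, key, grad, lr=0.1):
        # Updates are applied whenever they arrive (no global ordering).
        self.weights[key] -= lr * grad

server = ToyParameterServer({"w": 1.0})

# Two workers pull the same version of "w" ...
w_a = server.pull()
w_b = server.pull()

# ... worker A pushes an update; worker B's copy is now stale. Under eventual
# consistency, B only observes the new value on its next pull.
server.push("w", grad=2.0)
print(w_b["w"], server.weights["w"])   # 1.0 (stale) vs 0.8 (updated)

# Under sequential consistency, a read issued after A's update would have to
# reflect it, as if all operations ran in one interleaving consistent with
# each machine's program order.
```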
7
Forward-backward gradients
All forward activations are computed sequentially before the backward gradients.
[Diagram: forward and backward passes for layers L1-L5]
8
Forward-backward gradients
All forward activations are computed sequentially before the backward gradients. But each backward gradient needs only the previous gradient and its own forward activation:
[Diagram: forward and backward passes for layers L1-L5]
9
Forward-backward gradients
(Slides 9-12 repeat the same text and animate the diagram, removing the last remaining layer's forward/backward pair at each step: L5, then L4, L3, and L2.)
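The animation's point, written as code: in a manual backward pass, once layer k's gradient has been produced, its stored forward activation is no longer needed and can be released. A minimal sketch with small ReLU layers (illustrative, not MXNet's dependency engine):

```python
import numpy as np

def forward(layers, x):
    """Forward pass, keeping each layer's input activation for the backward pass."""
    activations = []
    for W in layers:
        activations.append(x)
        x = np.maximum(W @ x, 0.0)        # ReLU layer
    return x, activations

def backward(layers, activations, grad_out):
    """Backward pass: each layer needs only the incoming gradient and its own
    stored forward activation, so the activation can be freed right after use."""
    grads = [None] * len(layers)
    g = grad_out
    for k in reversed(range(len(layers))):
        a = activations[k]
        pre = layers[k] @ a               # pre-activation of layer k
        g = g * (pre > 0)                 # back through the ReLU
        grads[k] = np.outer(g, a)         # dL/dW_k
        g = layers[k].T @ g               # gradient passed down to layer k-1
        activations[k] = None             # activation no longer needed: free it
    return grads

rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 4)) for _ in range(5)]     # L1 .. L5
y, acts = forward(layers, rng.normal(size=4))
grads = backward(layers, acts, grad_out=np.ones(4))
print([a is None for a in acts])          # all activations released
```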
13
Performance
Memory management:
In-place: reference-count based memory reuse.
Co-share: nodes share storage iff (?) they cannot be run in parallel.
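A sketch of the reference-counting idea behind in-place reuse (a simplified stand-in, not MXNet's actual planner): when a value has been consumed by its last reader, its buffer returns to a free pool, and a later node's output can reuse it.

```python
def plan_memory(nodes, readers):
    """nodes: (name, inputs) in execution order; readers[n] = number of nodes
    that read n's output. Assign a buffer id to each node, reusing buffers
    whose reference count has dropped to zero."""
    free_pool, next_buf = [], 0
    refcount = dict(readers)
    assignment = {}
    for node, inputs in nodes:
        # An input consumed for the last time releases its buffer ...
        for inp in inputs:
            refcount[inp] -= 1
            if refcount[inp] == 0:
                free_pool.append(assignment[inp])
        # ... which this node's output may then reuse in place.
        if free_pool:
            buf = free_pool.pop()
        else:
            buf = next_buf
            next_buf += 1
        assignment[node] = buf
    return assignment

# (a+b)*(c+d): t1 and t2 are both live until the multiply, so they get distinct
# buffers; the multiply's output can reuse one of them once both are consumed.
nodes = [("t1", []), ("t2", []), ("out", ["t1", "t2"])]
readers = {"t1": 1, "t2": 1, "out": 1}
print(plan_memory(nodes, readers))   # {'t1': 0, 't2': 1, 'out': 1}
```

Co-share is the complementary rule from the slide: two nodes may be assigned the same storage only when they cannot run in parallel, so the sharing can never create a race.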
14
Performance Scalability