MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
John Canny
Distributed NN training with “The” Parameter Server
Computational Models:
- Imperative (Numpy, Matlab, C++, Java): (a+b)*(c+d) means do “a+b”, then “c+d”, then multiply.
- Declarative (SQL, Spark, Caffe, …): take a formula as a *formal* (e.g. mathematical) description of what is to be computed. The system is free to choose when and how to compute the result, e.g. it may be easier to do ac + ad + bc + bd some time later. A toy sketch of the two models follows.
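A minimal sketch of the contrast in plain Python/NumPy (this is a toy expression class, not MXNet's NDArray or Symbol API): the imperative version computes each operation as soon as it is written, while the declarative version only builds a graph that is evaluated later.

```python
import numpy as np

# Imperative: every operation executes immediately, in program order.
a, b, c, d = np.array(1.0), np.array(2.0), np.array(3.0), np.array(4.0)
imperative_result = (a + b) * (c + d)     # "a+b" runs, then "c+d", then the multiply

# Declarative: operations only build a description of the computation;
# evaluation is deferred until the full expression graph is known.
class Expr:
    def __init__(self, op, args):
        self.op, self.args = op, args
    def __add__(self, other):
        return Expr("add", [self, other])
    def __mul__(self, other):
        return Expr("mul", [self, other])

class Var(Expr):
    def __init__(self, name):
        super().__init__("var", [])
        self.name = name

def evaluate(expr, env):
    """Walk the deferred expression graph with concrete input values."""
    if expr.op == "var":
        return env[expr.name]
    lhs, rhs = (evaluate(arg, env) for arg in expr.args)
    return lhs + rhs if expr.op == "add" else lhs * rhs

A, B, C, D = Var("a"), Var("b"), Var("c"), Var("d")
graph = (A + B) * (C + D)                 # nothing is computed yet
declarative_result = evaluate(graph, {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0})

assert float(imperative_result) == declarative_result == 21.0
```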
Declarative vs. Imperative Languages
- Declarative systems use structures to represent the computation: “dataflow” or “computation graphs”.
- Optimizations (via transformations that preserve the result) can be applied to the graphs before they are evaluated, as in the sketch below.
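As one concrete example of a result-preserving transformation, here is a sketch of constant folding over a toy tuple-based graph representation (not any real framework's IR): subgraphs whose inputs are already known are collapsed before the graph is ever executed.

```python
def fold_constants(node):
    """Recursively replace subtrees whose inputs are all constants."""
    if isinstance(node, (int, float)):              # leaf constant
        return node
    op, lhs, rhs = node                             # node = (op, left_child, right_child)
    lhs, rhs = fold_constants(lhs), fold_constants(rhs)
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return lhs + rhs if op == "add" else lhs * rhs
    return (op, lhs, rhs)

# (a+b)*(c+d) with all four inputs known at graph-construction time collapses to a
# single constant, so no arithmetic is left to do at run time.
graph = ("mul", ("add", 1.0, 2.0), ("add", 3.0, 4.0))
print(fold_constants(graph))                        # 21.0
```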
Declarative vs. Imperative Languages
- Declarative representations are generally preferable.
- Difficulties:
  - Can be expensive for small data blocks.
  - Can be harder to represent recurrent or dynamic structures; loops need to be “unrolled”, etc. (see the sketch below).
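A toy illustration of the unrolling point (the node format here is invented for illustration): a recurrent computation that is a simple loop in an imperative program has to be expanded into a fixed chain of nodes before a declarative system can represent it as a static graph.

```python
def build_unrolled_rnn(num_steps):
    """Return one symbolic cell node per time step (toy node format)."""
    nodes, state = [], "h0"                      # h0 is the initial hidden state
    for t in range(num_steps):
        out = f"h{t + 1}"
        nodes.append(("rnn_cell", {"input": f"x{t}", "prev_state": state, "output": out}))
        state = out
    return nodes

# A sequence of length 3 becomes three explicit cell nodes in the static graph;
# a different sequence length needs a different (re-unrolled) graph.
for node in build_unrolled_rnn(3):
    print(node)
```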
Consistency Models
- Sequential: execution is equivalent to some sequential execution of the program, and each machine’s instructions are executed in an order consistent with the program.
- Eventual: after a value is updated, the new value is not available to other nodes immediately. But if there are no other updates, all nodes eventually get the updated value.
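A toy parameter-server sketch contrasting the two models (the class and method names below are illustrative, not a real parameter-server API): under the sequential view a worker always reads the latest weights; under the eventual view it may read a stale replica, which converges once updates stop and the replicas sync.

```python
import copy

class ParamServer:
    def __init__(self, weights, num_workers=2):
        self.weights = weights                                   # authoritative copy
        self.replicas = [copy.deepcopy(weights) for _ in range(num_workers)]

    def push(self, grad, lr=0.1):
        """Apply a gradient update to the authoritative copy immediately."""
        for k in self.weights:
            self.weights[k] -= lr * grad[k]

    def pull(self, worker, eventual=True):
        """Sequential: always read the latest weights.
        Eventual: a worker may read a stale replica until it is synced."""
        source = self.replicas[worker] if eventual else self.weights
        return copy.deepcopy(source)

    def sync(self):
        """With no further updates, every replica eventually converges."""
        self.replicas = [copy.deepcopy(self.weights) for _ in self.replicas]

ps = ParamServer({"w": 1.0})
ps.push({"w": 1.0})                        # authoritative copy becomes w = 0.9
print(ps.pull(0, eventual=False))          # {'w': 0.9}  sequential view
print(ps.pull(0, eventual=True))           # {'w': 1.0}  stale view under eventual consistency
ps.sync()
print(ps.pull(0, eventual=True))           # {'w': 0.9}  all nodes eventually get the update
```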
Forward-backward gradients
- All forward activations are computed sequentially before the backward gradients.
- But each backward gradient needs only the previous (incoming) gradient and its own layer’s forward activation, as illustrated in the sketch below.
[Diagram: forward chain L1 fwd → L2 fwd → L3 fwd → L4 fwd → L5 fwd and backward chain L5 bwd → L4 bwd → L3 bwd → L2 bwd → L1 bwd; successive animation frames repeat the diagram with the chain shortened one layer at a time, from L5 down to L1.]
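A minimal backprop sketch for a chain of linear layers (toy code, not MXNet) showing the dependency structure: layer i's backward step consumes only the incoming gradient and layer i's own cached forward input, so each weight gradient can be pushed (and the cached activation freed) as soon as that layer's backward step finishes.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 4)) for _ in range(5)]   # layers L1..L5

# Forward pass: sequential, caching each layer's input for the backward pass.
x = rng.standard_normal((1, 4))
cached_inputs = []
for W in weights:
    cached_inputs.append(x)
    x = x @ W

# Backward pass: walk the layers in reverse, using only (incoming grad, own input).
grad = np.ones_like(x)                    # dLoss/dOutput of the last layer
for i in reversed(range(len(weights))):
    x_i = cached_inputs[i]
    grad_W = x_i.T @ grad                 # gradient w.r.t. this layer's weights
    grad = grad @ weights[i].T            # gradient handed to the previous layer
    cached_inputs[i] = None               # this activation is no longer needed
    # grad_W could be pushed to the parameter server right here, overlapping
    # communication with the backward computation of the remaining layers.
```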
Performance: Memory management
- In-place: reference-count based memory reuse (see the sketch below).
- Co-share: two nodes share storage if and only if they cannot be run in parallel.
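A toy sketch of reference-count based memory reuse (illustrative only, not MXNet's actual allocator): each intermediate result tracks how many downstream ops still need it, and when that count reaches zero its buffer returns to a pool so a later result of the same shape can reuse the storage.

```python
class BufferPool:
    def __init__(self):
        self.free, self.allocations = [], 0

    def acquire(self, shape):
        for buf in self.free:
            if buf["shape"] == shape:     # reuse a released buffer of the same shape
                self.free.remove(buf)
                return buf
        self.allocations += 1             # otherwise allocate fresh storage
        return {"shape": shape}

    def release(self, buf):
        self.free.append(buf)

class Tensor:
    """Intermediate result with a consumer reference count."""
    def __init__(self, pool, shape, consumers):
        self.pool, self.buf, self.refcount = pool, pool.acquire(shape), consumers

    def consume(self):
        self.refcount -= 1
        if self.refcount == 0:            # no remaining reader:
            self.pool.release(self.buf)   # recycle its storage

pool = BufferPool()
t1 = Tensor(pool, (128, 128), consumers=1)   # output of op A, read once by op B
t1.consume()                                 # op B has read t1; buffer is released
t2 = Tensor(pool, (128, 128), consumers=1)   # op B's output reuses t1's storage
assert t2.buf is t1.buf and pool.allocations == 1
```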
Performance: Scalability