Presentation transcript:

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
John Canny

Distributed NN training with “The” Parameter Server
Computational Models:
Imperative (Numpy, Matlab, C++, Java): (a+b)*(c+d) means do “a+b”, then “c+d”, then multiply.
Declarative (SQL, Spark, Caffe…): take a formula as a *formal* (e.g. mathematical) description of what is to be computed; the system is free to choose when and how to compute the result, e.g. it may be easier to compute ac + ad + bc + bd some time later.
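To make the two models concrete, here is a minimal sketch (assuming NumPy and the MXNet 1.x symbolic API; the variable names are illustrative): the imperative version materializes every intermediate value immediately, while the declarative version only records a graph that the system is free to rewrite and schedule before evaluation.

import numpy as np
import mxnet as mx  # assumes the MXNet 1.x symbolic API

# Imperative: each line executes immediately and materializes its result.
a, b, c, d = (np.float32(x) for x in (1.0, 2.0, 3.0, 4.0))
t1 = a + b            # computed now
t2 = c + d            # computed now
out = t1 * t2         # computed now

# Declarative: these lines only build a symbolic graph; nothing runs yet.
sa, sb = mx.sym.Variable('a'), mx.sym.Variable('b')
sc, sd = mx.sym.Variable('c'), mx.sym.Variable('d')
expr = (sa + sb) * (sc + sd)   # a description of the computation

# MXNet chooses when and how to evaluate, and may rewrite the graph first.
(result,) = expr.eval(ctx=mx.cpu(),
                      a=mx.nd.array([1.0]), b=mx.nd.array([2.0]),
                      c=mx.nd.array([3.0]), d=mx.nd.array([4.0]))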

Declarative vs. Imperative Languages
Declarative systems use structures to represent the computation: “dataflow” or “computation” graphs.
Optimization (via transformations that preserve the result) can be applied to the graphs before they are evaluated.
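As a toy illustration of such a transformation (hand-rolled expression tuples, not MXNet internals): a constant-folding pass rewrites the graph before anything is evaluated, and the rewrite preserves the result.

# Toy expression graph: ('add', l, r), ('mul', l, r), ('const', v), ('var', name).
def fold(node):
    # Constant folding: collapse any subtree whose inputs are all constants,
    # leaving subtrees that depend on variables untouched.
    op = node[0]
    if op in ('const', 'var'):
        return node
    l, r = fold(node[1]), fold(node[2])
    if l[0] == 'const' and r[0] == 'const':
        return ('const', l[1] + r[1] if op == 'add' else l[1] * r[1])
    return (op, l, r)

# (a + 2) * (3 + 4): folding turns (3 + 4) into 7 without touching 'a'.
g = ('mul', ('add', ('var', 'a'), ('const', 2.0)),
            ('add', ('const', 3.0), ('const', 4.0)))
print(fold(g))  # ('mul', ('add', ('var', 'a'), ('const', 2.0)), ('const', 7.0))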

Declarative vs. Imperative Languages
Declarative representations are generally preferable. Difficulties:
They can be expensive for small data blocks.
It can be harder to represent recurrent or dynamic structures; loops need to be “unrolled”, etc.
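A sketch of what “unrolling” means (plain Python, with a hypothetical cell function): the loop runs at graph-construction time, so the resulting graph is static, with one copy of the cell per time step and the sequence length baked in.

def unroll(cell, state, inputs):
    # The Python loop runs while the graph is being *built*, so the result
    # is a static graph containing len(inputs) copies of `cell`.
    outputs = []
    for x in inputs:
        state, y = cell(state, x)
        outputs.append(y)
    return state, outputs

# Toy cell: running sum. unroll(...) fixes the sequence length at 3.
final, ys = unroll(lambda s, x: (s + x, s + x), 0, [1, 2, 3])
print(ys)  # [1, 3, 6]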

Consistency Models
Sequential: execution is equivalent to some sequential execution of the program, and each machine’s instructions are executed in an order consistent with the program.
Eventual: after a value is updated, the new value is not immediately visible to other nodes; but if there are no further updates, all nodes eventually see the updated value.
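A toy model of the eventual case (illustrative Python, not any real parameter-server API): a write is visible on the primary copy immediately, replicas see it only after propagation runs, and with no further writes every replica converges to the same value.

class EventuallyConsistentStore:
    def __init__(self, n_replicas):
        self.primary = {}
        self.replicas = [{} for _ in range(n_replicas)]
    def write(self, key, value):
        self.primary[key] = value               # visible here immediately
    def read(self, replica, key):
        return self.replicas[replica].get(key)  # may return a stale value
    def sync(self):
        for r in self.replicas:                 # propagation: with no further
            r.update(self.primary)              # writes, all replicas converge

store = EventuallyConsistentStore(2)
store.write('w', 1.0)
print(store.read(0, 'w'))  # None: the update is not yet visible on replica 0
store.sync()
print(store.read(0, 'w'))  # 1.0: eventually, every node sees the update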

Forward-backward gradients
All forward activations are computed sequentially before the backward gradients:
[diagram: L1 fwd → L2 fwd → L3 fwd → L4 fwd → L5 fwd, then L5 bwd → L4 bwd → L3 bwd → L2 bwd → L1 bwd]

Forward-backward gradients
But each backward gradient needs only the previous gradient and its own layer’s forward activation:
[animation: once Li bwd has run, Li drops out of the diagram, shrinking from L1..L5 down to L1 alone]
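The shrinking frames correspond to this dependency structure; a sketch with toy tanh layers in NumPy (illustrative, not MXNet code): each backward step consumes only the incoming gradient and its own layer's activation, so that activation can be freed as soon as the step has run.

import numpy as np

def forward(ws, x):
    acts = [x]
    for w in ws:
        acts.append(np.tanh(acts[-1] @ w))   # keep activations for backward
    return acts

def backward(ws, acts, g):
    grads = []
    for i in reversed(range(len(ws))):       # L5 bwd, L4 bwd, ..., L1 bwd
        g = g * (1 - acts[i + 1] ** 2)       # through tanh: needs own activation
        grads.append(acts[i].T @ g)          # dL/dW_i
        g = g @ ws[i].T                      # gradient passed to the layer below
        acts[i + 1] = None                   # no longer needed: free the buffer
    return grads[::-1]

ws = [0.1 * np.random.randn(4, 4) for _ in range(5)]
acts = forward(ws, np.random.randn(2, 4))
grads = backward(ws, acts, np.ones((2, 4)))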

Performance: Memory management
In-place: reference-count based memory reuse.
Co-share: nodes share storage iff they cannot be run in parallel.
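A toy version of the reference-count scheme (illustrative Python, not MXNet's actual allocator): walking a topologically sorted graph, a value's buffer returns to a free pool after its last consumer has run, and later nodes reuse pooled buffers instead of allocating fresh ones.

from collections import defaultdict

def plan_memory(nodes):
    # nodes: topologically sorted list of (name, inputs).
    refcount = defaultdict(int)
    for _, inputs in nodes:
        for i in inputs:
            refcount[i] += 1
    pool, buf_of, fresh = [], {}, 0
    for name, inputs in nodes:
        if pool:                      # reuse a recycled buffer
            buf_of[name] = pool.pop()
        else:                         # instead of allocating a fresh one
            fresh += 1
            buf_of[name] = fresh
        for i in inputs:
            refcount[i] -= 1
            if refcount[i] == 0:      # last consumer has run: recycle
                pool.append(buf_of[i])
    return buf_of

chain = [('a', []), ('b', []), ('c', ['a', 'b']), ('d', ['c']), ('e', ['d'])]
print(plan_memory(chain))  # 5 values fit in 3 buffers: 'd' and 'e' reuse freed ones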

Performance: Scalability