MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
John Canny
Distributed NN training with “The” Parameter Server
Computational Models:
- Imperative (Numpy, Matlab, C++, Java): (a+b)*(c+d) means do “a+b”, then “c+d”, then multiply.
- Declarative (SQL, Spark, Caffe, …): take a formula as a *formal* (e.g. mathematical) description of what is to be computed. The system is free to choose when and how to compute the result, e.g. it may be easier to do ac + ad + bc + bd some time later. A toy sketch of the two models follows.
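A minimal sketch of the contrast in plain Python/NumPy (this is a toy expression class, not MXNet's NDArray or Symbol API): the imperative version computes each operation as soon as it is written, while the declarative version only builds a graph that is evaluated later.

```python
import numpy as np

# Imperative: every operation executes immediately, in program order.
a, b, c, d = np.array(1.0), np.array(2.0), np.array(3.0), np.array(4.0)
imperative_result = (a + b) * (c + d)     # "a+b" runs, then "c+d", then the multiply

# Declarative: operations only build a description of the computation;
# evaluation is deferred until the full expression graph is known.
class Expr:
    def __init__(self, op, args):
        self.op, self.args = op, args
    def __add__(self, other):
        return Expr("add", [self, other])
    def __mul__(self, other):
        return Expr("mul", [self, other])

class Var(Expr):
    def __init__(self, name):
        super().__init__("var", [])
        self.name = name

def evaluate(expr, env):
    """Walk the deferred expression graph with concrete input values."""
    if expr.op == "var":
        return env[expr.name]
    lhs, rhs = (evaluate(arg, env) for arg in expr.args)
    return lhs + rhs if expr.op == "add" else lhs * rhs

A, B, C, D = Var("a"), Var("b"), Var("c"), Var("d")
graph = (A + B) * (C + D)                 # nothing is computed yet
declarative_result = evaluate(graph, {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0})

assert float(imperative_result) == declarative_result == 21.0
```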
Declarative vs. Imperative Languages
- Declarative systems use structures to represent the computation: “dataflow” or “computation graphs”.
- Optimizations (via transformations that preserve the result) can be applied to the graphs before they are evaluated, as in the sketch below.
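As one concrete example of a result-preserving transformation, here is a sketch of constant folding over a toy tuple-based graph representation (not any real framework's IR): subgraphs whose inputs are already known are collapsed before the graph is ever executed.

```python
def fold_constants(node):
    """Recursively replace subtrees whose inputs are all constants."""
    if isinstance(node, (int, float)):              # leaf constant
        return node
    op, lhs, rhs = node                             # node = (op, left_child, right_child)
    lhs, rhs = fold_constants(lhs), fold_constants(rhs)
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return lhs + rhs if op == "add" else lhs * rhs
    return (op, lhs, rhs)

# (a+b)*(c+d) with all four inputs known at graph-construction time collapses to a
# single constant, so no arithmetic is left to do at run time.
graph = ("mul", ("add", 1.0, 2.0), ("add", 3.0, 4.0))
print(fold_constants(graph))                        # 21.0
```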
Declarative vs. Imperative Languages
- Declarative representations are generally preferable.
- Difficulties:
  - Can be expensive for small data blocks.
  - Can be harder to represent recurrent or dynamic structures; loops need to be “unrolled”, etc. (see the sketch below).
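A toy illustration of the unrolling point (the node format here is invented for illustration): a recurrent computation that is a simple loop in an imperative program has to be expanded into a fixed chain of nodes before a declarative system can represent it as a static graph.

```python
def build_unrolled_rnn(num_steps):
    """Return one symbolic cell node per time step (toy node format)."""
    nodes, state = [], "h0"                      # h0 is the initial hidden state
    for t in range(num_steps):
        out = f"h{t + 1}"
        nodes.append(("rnn_cell", {"input": f"x{t}", "prev_state": state, "output": out}))
        state = out
    return nodes

# A sequence of length 3 becomes three explicit cell nodes in the static graph;
# a different sequence length needs a different (re-unrolled) graph.
for node in build_unrolled_rnn(3):
    print(node)
```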
Consistency Models
- Sequential: execution is equivalent to some sequential execution of the program, and each machine’s instructions are executed in an order consistent with the program.
- Eventual: after a value is updated, the new value is not available to other nodes immediately. But if there are no other updates, all nodes eventually get the updated value.
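A toy parameter-server sketch contrasting the two models (the class and method names below are illustrative, not a real parameter-server API): under the sequential view a worker always reads the latest weights; under the eventual view it may read a stale replica, which converges once updates stop and the replicas sync.

```python
import copy

class ParamServer:
    def __init__(self, weights, num_workers=2):
        self.weights = weights                                   # authoritative copy
        self.replicas = [copy.deepcopy(weights) for _ in range(num_workers)]

    def push(self, grad, lr=0.1):
        """Apply a gradient update to the authoritative copy immediately."""
        for k in self.weights:
            self.weights[k] -= lr * grad[k]

    def pull(self, worker, eventual=True):
        """Sequential: always read the latest weights.
        Eventual: a worker may read a stale replica until it is synced."""
        source = self.replicas[worker] if eventual else self.weights
        return copy.deepcopy(source)

    def sync(self):
        """With no further updates, every replica eventually converges."""
        self.replicas = [copy.deepcopy(self.weights) for _ in self.replicas]

ps = ParamServer({"w": 1.0})
ps.push({"w": 1.0})                        # authoritative copy becomes w = 0.9
print(ps.pull(0, eventual=False))          # {'w': 0.9}  sequential view
print(ps.pull(0, eventual=True))           # {'w': 1.0}  stale view under eventual consistency
ps.sync()
print(ps.pull(0, eventual=True))           # {'w': 0.9}  all nodes eventually get the update
```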
Forward-backward gradients
- All forward activations are computed sequentially before the backward gradients.
- But each backward gradient needs only the previous (incoming) gradient and its own layer’s forward activation, as illustrated in the sketch below.
[Diagram: forward chain L1 fwd → L2 fwd → L3 fwd → L4 fwd → L5 fwd and backward chain L5 bwd → L4 bwd → L3 bwd → L2 bwd → L1 bwd; successive animation frames repeat the diagram with the chain shortened one layer at a time, from L5 down to L1.]
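A minimal backprop sketch for a chain of linear layers (toy code, not MXNet) showing the dependency structure: layer i's backward step consumes only the incoming gradient and layer i's own cached forward input, so each weight gradient can be pushed (and the cached activation freed) as soon as that layer's backward step finishes.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 4)) for _ in range(5)]   # layers L1..L5

# Forward pass: sequential, caching each layer's input for the backward pass.
x = rng.standard_normal((1, 4))
cached_inputs = []
for W in weights:
    cached_inputs.append(x)
    x = x @ W

# Backward pass: walk the layers in reverse, using only (incoming grad, own input).
grad = np.ones_like(x)                    # dLoss/dOutput of the last layer
for i in reversed(range(len(weights))):
    x_i = cached_inputs[i]
    grad_W = x_i.T @ grad                 # gradient w.r.t. this layer's weights
    grad = grad @ weights[i].T            # gradient handed to the previous layer
    cached_inputs[i] = None               # this activation is no longer needed
    # grad_W could be pushed to the parameter server right here, overlapping
    # communication with the backward computation of the remaining layers.
```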
Performance: Memory management
- In-place: reference-count based memory reuse (see the sketch below).
- Co-share: two nodes share storage if and only if they cannot be run in parallel.
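A toy sketch of reference-count based memory reuse (illustrative only, not MXNet's actual allocator): each intermediate result tracks how many downstream ops still need it, and when that count reaches zero its buffer returns to a pool so a later result of the same shape can reuse the storage.

```python
class BufferPool:
    def __init__(self):
        self.free, self.allocations = [], 0

    def acquire(self, shape):
        for buf in self.free:
            if buf["shape"] == shape:     # reuse a released buffer of the same shape
                self.free.remove(buf)
                return buf
        self.allocations += 1             # otherwise allocate fresh storage
        return {"shape": shape}

    def release(self, buf):
        self.free.append(buf)

class Tensor:
    """Intermediate result with a consumer reference count."""
    def __init__(self, pool, shape, consumers):
        self.pool, self.buf, self.refcount = pool, pool.acquire(shape), consumers

    def consume(self):
        self.refcount -= 1
        if self.refcount == 0:            # no remaining reader:
            self.pool.release(self.buf)   # recycle its storage

pool = BufferPool()
t1 = Tensor(pool, (128, 128), consumers=1)   # output of op A, read once by op B
t1.consume()                                 # op B has read t1; buffer is released
t2 = Tensor(pool, (128, 128), consumers=1)   # op B's output reuses t1's storage
assert t2.buf is t1.buf and pool.allocations == 1
```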
Performance: Scalability