TensorFlow: A System for Large-Scale Machine Learning

Presentation transcript:

TensorFlow: A System for Large-Scale Machine Learning

Background: Training deep neural networks
Limited by GPU memory on the Nvidia GTX 580 (3 GB RAM):
- 60M parameters ~ 240 MB
- Activation maps must be cached for backpropagation
- With batch size = 128:
  128 * (227*227*3 + 55*55*96*2 + 96*27*27 + 256*27*27*2 + 256*13*13 + 13*13*384 + 384*13*13 + 256*13*13 + 4096 + 4096 + 1000) values ~ 718 MB of activations
- That assumes no overhead and single-precision values
- Splitting the model across GPUs had to be hand-tuned to balance communication and computation
Too much manual effort!
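
The slide's numbers can be reproduced with a few lines of Python (a back-of-the-envelope sketch; the layer sizes are copied from the slide and float32 storage is assumed):

```python
# Back-of-the-envelope check of the slide's memory figures (float32, batch 128).
BYTES_PER_FLOAT = 4
params = 60e6                                   # ~60M model parameters
acts_per_example = (                            # activation values per example
    227*227*3 + 55*55*96*2 + 96*27*27 + 256*27*27*2 + 256*13*13
    + 13*13*384 + 384*13*13 + 256*13*13 + 4096 + 4096 + 1000
)
batch = 128

print("parameters : %.0f MB" % (params * BYTES_PER_FLOAT / 1e6))                    # ~240 MB
print("activations: %.0f MB" % (batch * acts_per_example * BYTES_PER_FLOAT / 1e6))  # ~719 MB
```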

Background: Training deep neural networks
Trend towards distributed training for large-scale models
Parameter server: a shared key-value storage abstraction for distributed training (e.g., DistBelief, Project Adam)
- Hides the details of distribution
- But still difficult to reason about end-to-end structure
- Inflexible update mechanisms and state management
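
To make the abstraction concrete, below is a minimal, hypothetical sketch of the pull/push interface a parameter server exposes (not actual DistBelief or Project Adam code); note that the update rule is baked into the server, which is exactly the inflexibility the following slides address:

```python
# Hypothetical sketch of the parameter-server key-value abstraction:
# workers pull current parameter values by key and push gradients;
# the server applies a fixed, built-in update rule (plain SGD here).
import numpy as np

class ParameterServer:
    def __init__(self, learning_rate=0.01):
        self.store = {}                  # key -> parameter array
        self.learning_rate = learning_rate

    def pull(self, key):
        return self.store[key]

    def push(self, key, grad):
        # Update logic lives on the server, so changing it (e.g., to momentum)
        # means changing the server itself.
        self.store[key] -= self.learning_rate * grad

ps = ParameterServer()
ps.store["w"] = np.zeros(4)
ps.push("w", grad=np.ones(4))
print(ps.pull("w"))                      # [-0.01 -0.01 -0.01 -0.01]
```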

TensorFlow
Flexible dataflow-based programming model for machine learning
Dataflow captures the natural structure of the computation in both training and inference
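
As a small illustration (a sketch in the TF 1.x graph API the paper describes; under TF 2.x the same code lives under tf.compat.v1), the inference path and the training ops derived from it are nodes in one and the same dataflow graph:

```python
import tensorflow as tf   # TF 1.x graph API assumed (tf.compat.v1 in TF 2.x)

x = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.float32, [None, 10])

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

logits = tf.matmul(x, W) + b                       # inference subgraph
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)  # training subgraph
```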

What is the problem being solved?
Lack of a flexible programming model for building machine learning models
Prior approaches restricted innovation due to their inflexibility
- E.g., parameter updates in parameter-server-based approaches

Dataflow-based programming model
Computation is structured as a dataflow graph
Nodes can be stateful
- Captures state accumulated during training, e.g., parameter values
Graph elements:
- Tensors flow across edges between nodes
- Operations are expressions over tensors (e.g., constants, matrix multiplication, add)
- Variables accumulate state
- Queues provide explicit, advanced coordination
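
A minimal sketch (TF 1.x API) touching each of these elements; the values and shapes are made up for illustration:

```python
import tensorflow as tf   # TF 1.x graph API assumed

a = tf.constant([[1.0, 2.0]])              # constant operation producing a tensor
w = tf.Variable([[3.0], [4.0]])            # stateful node that persists across steps
y = tf.matmul(a, w)                        # tensors flow along the edge a -> matmul

q = tf.FIFOQueue(capacity=10, dtypes=[tf.float32])   # explicit coordination primitive
enqueue_op = q.enqueue(y)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(enqueue_op)
    print(sess.run(q.dequeue()))           # [[11.]]
```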

What are the metrics of success?
- Variety of specialized extensions built over the framework as "user-level" code
- Acceptable performance with respect to the state of the art

Extensibility: optimization algorithms
- Momentum, AdaGrad, AdaDelta, Adam
- E.g., parameter updates in Momentum are based on state accumulated over multiple iterations
- Difficult to implement extensible optimization algorithms in parameter servers
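
A hand-rolled sketch of why this is natural in TensorFlow: the momentum accumulator is simply another Variable in the graph, updated with ordinary ops (tf.train.MomentumOptimizer does this internally; the toy loss and constants here are made up):

```python
import tensorflow as tf   # TF 1.x graph API assumed

w = tf.Variable(5.0)                      # model parameter
accum = tf.Variable(0.0)                  # per-parameter state carried across steps
grad = 2.0 * w                            # gradient of the toy loss w**2
lr, momentum = 0.1, 0.9

new_accum = accum.assign(momentum * accum + grad)   # update the accumulated state
train_op = w.assign_sub(lr * new_accum)             # then apply it to the parameter

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        print(sess.run(train_op))         # w shrinks toward 0 step by step
```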

Extensibility: sharding very large models
- E.g., sparse embedding layers
- Shard the embedding layer across parameter-server tasks
- Encode incoming indices as tensors and ship them to the appropriate shard
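
A sketch of what this looks like in the TF 1.x API (shard count and sizes are made up; in a distributed job each shard variable would additionally be placed on its own parameter-server task with tf.device):

```python
import tensorflow as tf   # TF 1.x graph API assumed

num_shards, vocab_size, dim = 4, 1000000, 64
shards = [
    # In a real job: with tf.device("/job:ps/task:%d" % i): ...
    tf.get_variable("embedding_shard_%d" % i,
                    shape=[vocab_size // num_shards, dim])
    for i in range(num_shards)
]

ids = tf.placeholder(tf.int64, [None])    # incoming sparse indices, shipped as a tensor
embedded = tf.nn.embedding_lookup(shards, ids, partition_strategy="div")
```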

Extensibility
- Use queues to coordinate the execution of workers
- Synchronous replication
- Straggler mitigation with backup workers
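
TF 1.x ships a wrapper that builds this scheme out of queues and shared accumulators; a hedged sketch (the replica counts are made up, and loss/global_step are assumed to be defined elsewhere):

```python
import tensorflow as tf   # TF 1.x graph API assumed

base_opt = tf.train.GradientDescentOptimizer(0.1)
sync_opt = tf.train.SyncReplicasOptimizer(
    base_opt,
    replicas_to_aggregate=4,    # apply an update once 4 gradients have arrived
    total_num_replicas=5)       # 5 workers -> 1 backup worker absorbs stragglers

# train_op = sync_opt.minimize(loss, global_step=global_step)
# hooks = [sync_opt.make_session_run_hook(is_chief=True)]  # for MonitoredTrainingSession
```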

Competitive training times on a single node

Key results: Extensibility matters!

Limitations and scope for improvement
- TF's high-level programming model is tightly coupled with its execution model
  - Compiling TF programs into more efficient executables could hide this translation
- TF dataflow graphs are static
- Key runtime decisions, such as the number of parameter-server shards, seem to require manual specification
  - Could these be deduced automatically from workload characteristics?