TensorFlow: A System for Large-Scale Machine Learning

Presentation transcript:

TensorFlow: A System for Large-Scale Machine Learning

Background: Training deep neural networks
Limited by GPU memory on the Nvidia GTX 580 (3 GB RAM):
- 60M parameters ~ 240 MB
- Activation maps must be cached for backpropagation
- With batch size = 128:
  128 * (227*227*3 + 55*55*96*2 + 96*27*27 + 256*27*27*2 + 256*13*13 + 13*13*384 + 384*13*13 + 256*13*13 + 4096 + 4096 + 1000) values ~ 718 MB of activations
- That assumes no overhead and single-precision values
- Splitting the model across GPUs had to be hand-tuned to balance communication and computation
Too much manual effort!
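
The slide's numbers can be reproduced with a few lines of Python (a back-of-the-envelope sketch; the layer sizes are copied from the slide and float32 storage is assumed):

```python
# Back-of-the-envelope check of the slide's memory figures (float32, batch 128).
BYTES_PER_FLOAT = 4
params = 60e6                                   # ~60M model parameters
acts_per_example = (                            # activation values per example
    227*227*3 + 55*55*96*2 + 96*27*27 + 256*27*27*2 + 256*13*13
    + 13*13*384 + 384*13*13 + 256*13*13 + 4096 + 4096 + 1000
)
batch = 128

print("parameters : %.0f MB" % (params * BYTES_PER_FLOAT / 1e6))                    # ~240 MB
print("activations: %.0f MB" % (batch * acts_per_example * BYTES_PER_FLOAT / 1e6))  # ~719 MB
```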

Background: Training deep neural networks
Trend towards distributed training for large-scale models
Parameter server: a shared key-value storage abstraction for distributed training (e.g., DistBelief, Project Adam)
- Hides the details of distribution
- But still difficult to reason about end-to-end structure
- Inflexible update mechanisms and state management
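
To make the abstraction concrete, below is a minimal, hypothetical sketch of the pull/push interface a parameter server exposes (not actual DistBelief or Project Adam code); note that the update rule is baked into the server, which is exactly the inflexibility the following slides address:

```python
# Hypothetical sketch of the parameter-server key-value abstraction:
# workers pull current parameter values by key and push gradients;
# the server applies a fixed, built-in update rule (plain SGD here).
import numpy as np

class ParameterServer:
    def __init__(self, learning_rate=0.01):
        self.store = {}                  # key -> parameter array
        self.learning_rate = learning_rate

    def pull(self, key):
        return self.store[key]

    def push(self, key, grad):
        # Update logic lives on the server, so changing it (e.g., to momentum)
        # means changing the server itself.
        self.store[key] -= self.learning_rate * grad

ps = ParameterServer()
ps.store["w"] = np.zeros(4)
ps.push("w", grad=np.ones(4))
print(ps.pull("w"))                      # [-0.01 -0.01 -0.01 -0.01]
```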

TensorFlow
Flexible dataflow-based programming model for machine learning
Dataflow captures the natural structure of the computation in both training and inference
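
As a small illustration (a sketch in the TF 1.x graph API the paper describes; under TF 2.x the same code lives under tf.compat.v1), the inference path and the training ops derived from it are nodes in one and the same dataflow graph:

```python
import tensorflow as tf   # TF 1.x graph API assumed (tf.compat.v1 in TF 2.x)

x = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.float32, [None, 10])

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

logits = tf.matmul(x, W) + b                       # inference subgraph
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)  # training subgraph
```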

What is the problem being solved?
Lack of a flexible programming model for building machine learning models
Prior approaches restricted innovation due to their inflexibility
- E.g., parameter updates in parameter-server-based approaches

Dataflow-based programming model
Computation is structured as a dataflow graph
Nodes can be stateful
- Captures state accumulated during training, e.g., parameter values
Graph elements:
- Tensors flow across edges between nodes
- Operations are expressions over tensors (e.g., constants, matrix multiplication, add)
- Variables accumulate state
- Queues provide explicit, advanced coordination
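
A minimal sketch (TF 1.x API) touching each of these elements; the values and shapes are made up for illustration:

```python
import tensorflow as tf   # TF 1.x graph API assumed

a = tf.constant([[1.0, 2.0]])              # constant operation producing a tensor
w = tf.Variable([[3.0], [4.0]])            # stateful node that persists across steps
y = tf.matmul(a, w)                        # tensors flow along the edge a -> matmul

q = tf.FIFOQueue(capacity=10, dtypes=[tf.float32])   # explicit coordination primitive
enqueue_op = q.enqueue(y)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(enqueue_op)
    print(sess.run(q.dequeue()))           # [[11.]]
```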

What are the metrics of success?
- Variety of specialized extensions built over the framework as "user-level" code
- Acceptable performance with respect to the state of the art

Extensibility: optimization algorithms
- Momentum, AdaGrad, AdaDelta, Adam
- E.g., parameter updates in Momentum are based on state accumulated over multiple iterations
- Difficult to implement extensible optimization algorithms in parameter servers
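
A hand-rolled sketch of why this is natural in TensorFlow: the momentum accumulator is simply another Variable in the graph, updated with ordinary ops (tf.train.MomentumOptimizer does this internally; the toy loss and constants here are made up):

```python
import tensorflow as tf   # TF 1.x graph API assumed

w = tf.Variable(5.0)                      # model parameter
accum = tf.Variable(0.0)                  # per-parameter state carried across steps
grad = 2.0 * w                            # gradient of the toy loss w**2
lr, momentum = 0.1, 0.9

new_accum = accum.assign(momentum * accum + grad)   # update the accumulated state
train_op = w.assign_sub(lr * new_accum)             # then apply it to the parameter

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        print(sess.run(train_op))         # w shrinks toward 0 step by step
```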

Extensibility: sharding very large models
- E.g., sparse embedding layers
- Shard the embedding layer across parameter-server tasks
- Encode incoming indices as tensors and ship them to the appropriate shard
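
A sketch of what this looks like in the TF 1.x API (shard count and sizes are made up; in a distributed job each shard variable would additionally be placed on its own parameter-server task with tf.device):

```python
import tensorflow as tf   # TF 1.x graph API assumed

num_shards, vocab_size, dim = 4, 1000000, 64
shards = [
    # In a real job: with tf.device("/job:ps/task:%d" % i): ...
    tf.get_variable("embedding_shard_%d" % i,
                    shape=[vocab_size // num_shards, dim])
    for i in range(num_shards)
]

ids = tf.placeholder(tf.int64, [None])    # incoming sparse indices, shipped as a tensor
embedded = tf.nn.embedding_lookup(shards, ids, partition_strategy="div")
```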

Extensibility
- Use queues to coordinate the execution of workers
- Synchronous replication
- Straggler mitigation with backup workers
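
TF 1.x ships a wrapper that builds this scheme out of queues and shared accumulators; a hedged sketch (the replica counts are made up, and loss/global_step are assumed to be defined elsewhere):

```python
import tensorflow as tf   # TF 1.x graph API assumed

base_opt = tf.train.GradientDescentOptimizer(0.1)
sync_opt = tf.train.SyncReplicasOptimizer(
    base_opt,
    replicas_to_aggregate=4,    # apply an update once 4 gradients have arrived
    total_num_replicas=5)       # 5 workers -> 1 backup worker absorbs stragglers

# train_op = sync_opt.minimize(loss, global_step=global_step)
# hooks = [sync_opt.make_session_run_hook(is_chief=True)]  # for MonitoredTrainingSession
```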

Competitive training times on a single node

Key results: Extensibility matters!

Limitations and scope for improvement
- TF's high-level programming model is tightly coupled with its execution model
  - Compiling TF programs into more efficient executables could hide this translation
- TF dataflow graphs are static
- Key runtime decisions, such as the number of parameter-server shards, seem to require manual specification
  - Could these be deduced automatically from workload characteristics?