Presentation transcript:

Project Adam: Building an Efficient and Scalable Deep Learning Training System
Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman, Microsoft Research
Published in the Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2014)
Presented by Alex Zahdeh
Some figures adapted from the OSDI 2014 presentation

Traditional Machine Learning

Deep Learning [Figure labels: Humans, Objective Function, Data, Deep Learning, Prediction]

Deep Learning

Problem with Deep Learning

Problem with Deep Learning Current computational needs on the order of petaFLOPS!

Accuracy scales with data and model size

Neural Networks (figure: http://neuralnetworksanddeeplearning.com/images/tikz11.png)

Convolutional Neural Networks (figure: http://colah.github.io/posts/2014-07-Conv-Nets-Modular/img/Conv2-9x5-Conv2Conv2.png)

Convolutional Neural Networks with Max Pooling (figure: http://colah.github.io/posts/2014-07-Conv-Nets-Modular/img/Conv-9-Conv2Max2Conv2.png)

Neural Network Training (with Stochastic Gradient Descent)
Inputs are processed one at a time, in random order, with three steps:
Feed-forward evaluation
Back-propagation
Weight updates
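
To make the three steps concrete, below is a minimal sketch of single-example SGD on a tiny two-layer network. The NumPy code, layer sizes, ReLU/squared-error choices, and learning rate are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 784)).astype(np.float32) * 0.01   # hidden layer weights
W2 = rng.standard_normal((10, 64)).astype(np.float32) * 0.01    # output layer weights
lr = 0.01

def train_one(x, y_true):
    """Process a single input with the three SGD steps."""
    global W1, W2
    # 1. Feed-forward evaluation
    h = np.maximum(0.0, W1 @ x)       # hidden activations (ReLU)
    y = W2 @ h                        # output scores
    # 2. Back-propagation of the error gradient
    dy = y - y_true                   # gradient of squared error w.r.t. the output
    dW2 = np.outer(dy, h)
    dh = (W2.T @ dy) * (h > 0)        # push the gradient back through the ReLU
    dW1 = np.outer(dh, x)
    # 3. Weight updates
    W2 -= lr * dW2
    W1 -= lr * dW1

# Inputs are processed one at a time, in random order:
images = rng.standard_normal((100, 784)).astype(np.float32)
labels = np.eye(10, dtype=np.float32)[rng.integers(0, 10, size=100)]
for x, y in zip(images, labels):
    train_one(x, y)
```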

Project Adam
Optimizing and balancing both computation and communication for this application through whole-system co-design
Achieving high performance and scalability by exploiting the ability of machine learning training to tolerate inconsistencies well
Demonstrating that system efficiency, scaling, and asynchrony all contribute to improvements in trained model accuracy

Adam System Architecture
Fast Data Serving
Model Training
Global Parameter Server

Fast Data Serving
Large quantities of data needed (10-100 TB)
Data requires transformation to prevent over-fitting
A small set of machines is configured separately to perform transformations and serve data
Data servers pre-cache images, using nearly all of system memory as a cache
Model training machines fetch data in advance, in batches, in the background
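
The background fetching could look like the sketch below; the bounded queue depth and the fetch_batch callback are hypothetical names used only to show how training can overlap with data transfer.

```python
import queue
import threading

def start_prefetcher(fetch_batch, depth=4):
    """Continuously pull batches from the data servers on a background thread.

    fetch_batch() is assumed to return one transformed image batch; the bounded
    queue caps how much prefetched data is held in memory at once.
    """
    batches = queue.Queue(maxsize=depth)
    def worker():
        while True:
            batches.put(fetch_batch())   # blocks when the queue is already full
    threading.Thread(target=worker, daemon=True).start()
    return batches

# Training loop consumes batches that are already in memory:
#   batches = start_prefetcher(data_server_client.next_batch)
#   while training:
#       train_on(batches.get())
```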

Model Training
Models are partitioned vertically to reduce cross-machine communication

Multi-Threaded Training
Multiple threads on a single machine
Different images assigned to threads that share model weights
Per-thread training context stores activations and weight-update values
Training contexts are pre-allocated to avoid heap locks
NUMA-aware

Fast Weight Updates
Weights are updated locally without locks
Race conditions are permitted
Weight updates are commutative and associative
Deep neural networks are resilient to small amounts of noise
Important for good scaling
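
Below is a minimal sketch of this lock-free update style, in the spirit of Hogwild-style asynchronous SGD; the shared weight vector, fake gradients, and thread count are illustrative, and the real system does this in optimized C++ rather than Python.

```python
import threading
import numpy as np

shared_weights = np.zeros(100_000, dtype=np.float32)   # weights shared by all threads

def worker(seed, steps=50):
    global shared_weights
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        grad = (rng.standard_normal(shared_weights.shape) * 1e-3).astype(np.float32)
        # No lock: concurrent "-=" may occasionally lose an update, but the
        # updates are commutative/associative and the lost noise is tolerated.
        shared_weights -= 0.01 * grad

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```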

Reducing Memory Copies
Pass pointers rather than copying data for local communication
Custom network library for non-local communication
Exploits knowledge of the static model partitioning to optimize communication
Reference counting ensures safety under asynchronous network I/O

Memory System Optimizations
Partition so that model layers fit in the L3 cache
Optimize computation for cache locality
Forward and back propagation have different row-major/column-major preferences
Custom assembly kernels appropriately pack blocks of data so that vector units are fully utilized
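
The row-major/column-major tension can be illustrated with NumPy (an assumption for clarity; the paper uses hand-written assembly kernels): the forward pass reads the weight matrix row by row, while back-propagation effectively reads it column by column, so one simple remedy is to keep a contiguous transposed copy for the backward direction.

```python
import numpy as np

W = np.random.standard_normal((4096, 4096)).astype(np.float32)  # stored row-major (C order)
W_T = np.ascontiguousarray(W.T)                                  # contiguous copy for backprop

x = np.random.standard_normal(4096).astype(np.float32)          # forward input
delta = np.random.standard_normal(4096).astype(np.float32)      # back-propagated error

y = W @ x          # forward: each row of W is read with unit stride
dx = W_T @ delta   # backward: same unit-stride access instead of striding down columns
```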

Mitigating the Impact of Slow Machines
Allow threads to process multiple images in parallel
Use a dataflow framework to trigger progress on individual images based on the arrival of data from remote machines
At the end of an epoch, wait for only 75% of the model replicas to complete
Threshold arrived at through empirical observation
No impact on accuracy
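
A minimal sketch of the end-of-epoch rule, with hypothetical function names: instead of waiting for every replica, the system proceeds once roughly 75% of them have reported completion, so a few straggling machines cannot stall the epoch.

```python
import time

def wait_for_epoch_end(completed_replicas, total_replicas, fraction=0.75):
    """completed_replicas() is assumed to return how many replicas finished the epoch."""
    needed = int(total_replicas * fraction)
    while completed_replicas() < needed:
        time.sleep(0.1)   # poll; stragglers beyond the threshold are simply not waited for
```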

Parameter Server Communication
Two protocols for communicating parameter weight updates:
Locally compute and accumulate weight updates, and periodically send them to the server
Works well for convolutional layers, since the volume of weights is low due to weight sharing
Send the activation and error gradient vectors to the parameter servers so that weight updates can be computed there
Needed for fully connected layers due to the volume of weights; this reduces traffic volume from M*N to K*(M+N)
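
The reduction is easy to see with a small calculation. A fully connected layer with M outputs and N inputs has an M*N weight-update matrix, while the activations and error gradients for K images hold only K*(M+N) values; the parameter server reconstructs the update as a sum of outer products. The layer sizes below are hypothetical, chosen only to show the arithmetic.

```python
import numpy as np

M, N, K = 2048, 9216, 32            # outputs, inputs, images in the batch (illustrative)
print(M * N)                        # 18,874,368 values to ship the weight update itself
print(K * (M + N))                  # 360,448 values to ship activations + error gradients

# Server-side reconstruction of the same weight update from the shipped vectors:
acts = np.random.standard_normal((K, N)).astype(np.float32)   # activations, one row per image
errs = np.random.standard_normal((K, M)).astype(np.float32)   # error gradients, one row per image
delta_W = errs.T @ acts             # equals sum_k outer(errs[k], acts[k]); shape (M, N)
```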

Global Parameter Server
Rate of updates too high for a conventional key-value store
Model parameters divided into 1 MB shards
Improves spatial locality of update processing
Shards hashed into storage buckets distributed equally among the parameter servers
Helps with load balancing
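
One way such shard-to-bucket-to-server placement could work is sketched below; the hash function and round-robin assignment are assumptions for illustration, not the paper's exact scheme.

```python
import hashlib

def bucket_for_shard(shard_id: int, num_buckets: int) -> int:
    """Hash a 1 MB shard's id into a storage bucket."""
    digest = hashlib.md5(str(shard_id).encode()).digest()
    return int.from_bytes(digest[:4], "little") % num_buckets

def server_for_bucket(bucket_id: int, num_servers: int) -> int:
    """Distribute buckets equally among the parameter servers."""
    return bucket_id % num_servers
```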

Global Parameter Server: Throughput Optimizations
Takes advantage of processor vector instructions
Processing is NUMA-aware
Lock-free data structures
Speed up I/O processing
Lock-free memory allocation
Buffers allocated from pools of fixed sizes (powers of 2, from 4 KB to 32 MB)
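
The pooled allocation can be sketched as follows; this is a simplified, single-threaded illustration (the real allocator is lock-free and in C++), showing only the idea of rounding requests up to power-of-two pool sizes between 4 KB and 32 MB.

```python
class BufferPools:
    SIZES = [4096 << i for i in range(14)]   # 4 KB, 8 KB, ..., 32 MB

    def __init__(self):
        self.free = {size: [] for size in self.SIZES}

    def acquire(self, nbytes: int) -> bytearray:
        size = next(s for s in self.SIZES if s >= nbytes)   # round up to a pool size
        pool = self.free[size]
        return pool.pop() if pool else bytearray(size)      # reuse a freed buffer if possible

    def release(self, buf: bytearray) -> None:
        self.free[len(buf)].append(buf)                     # return the buffer to its pool
```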

Delayed Persistence
Parameter storage modelled as a write-back cache
Dirty chunks flushed asynchronously
Potential data loss is tolerable for deep neural networks due to their inherent resilience to noise
Updates can be recovered if needed by retraining the model
Allows compression of writes due to the additive nature of weight updates
Store the sum, not the summands
Many updates can be folded in before flushing to storage
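
A sketch of the write-back idea under assumed names: because updates are additive, the cache keeps only the accumulated sum of unflushed updates per shard, and a background thread flushes it asynchronously.

```python
import numpy as np

class WriteBackShard:
    def __init__(self, size):
        self.values = np.zeros(size, dtype=np.float32)    # current parameter values
        self.pending = np.zeros(size, dtype=np.float32)   # accumulated, unflushed updates

    def apply_update(self, delta):
        self.values += delta
        self.pending += delta                 # store the sum, not the individual summands

    def flush(self, write_to_storage):
        delta, self.pending = self.pending, np.zeros_like(self.pending)
        write_to_storage(delta)               # runs on a background thread, off the serving path
```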

Fault Tolerance
Three copies of each parameter shard
One primary, two secondaries
Parameter servers are controlled by a set of controller machines
Controller machines form a Paxos cluster
The controller stores the mapping of roles to parameter servers
Clients contact the controller to determine request routing
The controller hands out bucket assignments
Lease to the primary; primary lease information to the secondaries

Fault Tolerance
The primary accepts requests for parameter updates for all chunks in a bucket
The primary replicates changes to the secondaries using two-phase commit (2PC)
Secondaries check lease information before committing
The primary parameter server sends heartbeats to the secondaries
In the absence of a heartbeat, a secondary initiates a role-change proposal
The controller elects a secondary as the new primary
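
A heavily simplified sketch of the commit path, with a hypothetical API: the primary forwards an update to both secondaries, and each secondary checks that the primary's bucket lease is still current before voting to commit; a rejected lease is what eventually triggers the role-change proposal to the controller.

```python
def replicate_update(update, primary_lease, secondaries):
    # Phase 1: each secondary validates the lease and stages the update.
    votes = [s.prepare(update, primary_lease) for s in secondaries]
    if all(votes):
        # Phase 2: everyone voted yes, so the update is committed.
        for s in secondaries:
            s.commit(update)
        return True
    # Any rejection (e.g. an expired lease) aborts the update.
    for s in secondaries:
        s.abort(update)
    return False
```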

Communication Isolation
Update processing and durability are decoupled
Separate 10 Gb NICs are used for each of the two paths
Maximizes bandwidth, minimizes interference

Evaluation
Visual Object Recognition Benchmarks
System Hardware
Baseline Performance and Accuracy
System Scaling and Accuracy

Visual Object Recognition Benchmarks
MNIST digit recognition (image: http://cs.nyu.edu/~roweis/data/mnist_train1.jpg)

Visual Object Recognition Benchmarks
ImageNet 22K image classification: American Foxhound vs. English Foxhound
(images: http://www.exoticdogs.com/breeds/english-fh/4.jpg, http://www.juvomi.de/hunde/bilder/m/FOXEN01M.jpg)

System Hardware
120 HP ProLiant servers
Each server has an Intel Xeon E5-2450L processor (16 cores, 1.8 GHz)
Each server has 98 GB of main memory, two 10 Gb NICs, and one 1 Gb NIC
90 model training machines, 20 parameter servers, 10 image servers
3 racks of 40 servers each, connected by IBM G8264 switches

Baseline Performance and Accuracy
Single model training machine, single parameter server
Small model on the MNIST digit classification task

Model Training System Baseline

Parameter Server Baseline

Model Accuracy Baseline

System Scaling and Accuracy
Scaling with Model Workers
Scaling with Model Replicas
Trained Model Accuracy

Scaling with Model Workers

Scaling with Model Replicas

Trained Model Accuracy at Scale

Trained Model Accuracy at Scale

Summary
Pros:
World-record accuracy on large-scale benchmarks
Highly optimized and scalable
Fault tolerant
Cons:
Thoroughly optimized for deep neural networks; unclear whether it can be applied to other models

Questions?