Distributed Parameter Synchronization in DNN Hucheng Zhou (MSRA) Zheng Zhang (MSRA) Minjie Wang (SJTU)
Model: several GBs in size; several layers; millions of edges between two layers; thousands of neurons per layer (in the ImageNet model, the neuron count varies from 56x56x96 to 6x6x256 in the convolutional layers and is 4096 in the fully-connected layers). Training data: TBs of data (2TB of ImageNet data for 22K-class classification).
DNN model training can take weeks or even longer
What if we could train the DNN model in one day? It is still a dream if you wish to get the same error rate: "we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days." Fast training needs parallelism, even in a distributed fashion
Model Parallelism: the model is partitioned across machines and trained in parallel
Model Parallelism: network traffic is bounded and speedup is non-linear, but training is still slow with large data sets
Another dimension of parallelism, data parallelism, is required
Data Parallelism 1. The training data is partitioned, and multiple model replicas are trained in parallel (1) Downpour: asynchronous distributed SGD (2) Sandblaster: distributed L-BFGS 2. The intermediate training results (model parameters) are synchronized
Outline Problem statement Design goals Design Evaluation
It is not a good idea to combine model training and model synchronization
Separate model training from model synchronization: build a dedicated system, PS (Parameter Server), to synchronize the intermediate model parameters, as in DistBelief (NIPS 2012)
Outline Problem statement Design goals Design Evaluation
How to build a scalable, reliable and still efficient parameter server?
A Centralized Approach: model workers asynchronously push updates ∆p to the parameter server, which applies p' = p + ∆p and then p'' = p' + ∆p' (Asynchronous Stochastic Gradient Descent, A-SGD). [Jinliang Wei, Wei Dai, Abhimanu Kumar, Xun Zheng, Qirong Ho and E. P. Xing. Consistent Bounded-Asynchronous Parameter Servers for Distributed ML. arXiv:1312.7869, 2013]
A Centralized Approach: ∆p is a vector or matrix of floats rather than a set of key-value pairs, and p' = p + ∆p is commutative and associative, which makes bulk synchronization possible. [Wei et al., arXiv:1312.7869]
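To make the centralized approach concrete, here is a minimal sketch, assuming in-memory NumPy vectors; the class and the push/pull method names are illustrative, not the API of the cited system.

import numpy as np

class CentralParameterServer:
    def __init__(self, dim):
        self.p = np.zeros(dim, dtype=np.float32)  # global model parameters

    def push(self, dp):
        # Updates are plain float vectors; p' = p + dp is commutative and
        # associative, so deltas can be applied (or pre-summed) in any order.
        self.p += dp

    def pull(self):
        # Workers fetch the current (possibly stale) parameters.
        return self.p.copy()

# Usage: two asynchronous workers push deltas; arrival order does not matter.
server = CentralParameterServer(dim=4)
server.push(np.array([0.1, 0.0, -0.2, 0.3], dtype=np.float32))
server.push(np.array([0.0, 0.5, 0.1, -0.1], dtype=np.float32))
print(server.pull())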
However, this does not scale when there are many model workers
The load on the parameter server depends on: the size of the model parameters (240MB); the model update rate (3 times per second, thus 720MB/s per worker); and the number of model workers n (the server is overloaded if n is large, especially in the GPU scenario)
Partitioning the model parameters into shards on the server side helps, since each shard handles only part of every update. [Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho and E. P. Xing. Petuum: A Framework for Iterative-Convergent Distributed ML. arXiv:1312.7651, 2013]
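A minimal sketch of such partitioning, assuming the model is one dense vector split into contiguous slices, one per shard; all names are illustrative.

import numpy as np

class ShardedParameterServer:
    def __init__(self, dim, num_shards):
        # Each shard owns one contiguous slice of the parameter vector.
        self.shards = np.array_split(np.zeros(dim, dtype=np.float32), num_shards)

    def push(self, dp):
        # Each shard absorbs only its slice of an incoming update, so
        # per-shard bandwidth and compute shrink by a factor of num_shards.
        for shard, dp_slice in zip(self.shards, np.array_split(dp, len(self.shards))):
            shard += dp_slice

    def pull(self):
        return np.concatenate(self.shards)

server = ShardedParameterServer(dim=6, num_shards=3)
server.push(np.arange(6, dtype=np.float32))
print(server.pull())  # [0. 1. 2. 3. 4. 5.]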
A local cache of the model parameters (parameter slaves below a parameter master) also helps
However, the parameter master may still be a bottleneck. This motivates a decentralized (peer-to-peer) system design
And what if faults happen?
Possible faults: 1. network delay or outage; 2. machine crash and restart; 3. software crash, data loss, or job preemption
Again, the system is not reliable without fault-tolerance support. This motivates a fault-tolerant system design
What about performance when staleness (consistency) control is required?
Staleness must be handled: worker 1 pushes ∆p1 and the parameter server computes p1 = p + ∆p1 while the other workers are still training on the old p
Workers proceed at different paces (fast, slower, slowest): a fast worker pushes ∆p2 computed against p1 while slower workers are still working on older parameters
Bounded staleness (coordination) is required for fast model convergence. With coordination, the updates from worker 1 and worker 2 move the model from initialization toward the global optimum; without coordination (worker 2 works on an over-stale model), training can end up at a local optimum
The working pace of each worker should be coordinated by a coordinator (as in the L-BFGS case)
However, a centralized coordinator is costly, and system performance (parallelism) is not fully exploited. This motivates balancing system performance against the model convergence rate
Outline Problem statement Design goals Design Evaluation
1. Each worker machine has a local parameter server (model replica), and the system is responsible for parameter synchronization
System Architecture: network traffic is reduced by exchanging only the accumulated updates (which are commutative and associative), and training is non-blocking (asynchronous)
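A small illustration (not the authors' code) of why exchanging only accumulated updates saves traffic: many local mini-batch deltas are folded into a single sum before being sent, which is valid precisely because the updates are commutative and associative.

import numpy as np

rng = np.random.default_rng(0)
local_accum = np.zeros(4, dtype=np.float32)
for _ in range(100):                   # 100 local mini-batch steps
    dp = rng.standard_normal(4).astype(np.float32) * 0.01
    local_accum += dp                  # fold each delta into the local sum
message_to_peer = local_accum          # one message instead of 100
print(message_to_peer)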
2. How do two connected local parameter servers mutually exchange parameter updates, with tolerance to network delay or even outage?
A pairwise fault-tolerant update exchange protocol
Pairwise Protocol Invariants: node p's "belief" of the model, Θp, equals its own contribution xp plus the contributions received from its neighbors q ∈ Np: Θp = xp + Σ_{q∈Np} φqp  (1)
A node p propagates to a neighbor q its own contribution plus the accumulated updates from its other neighbors (e.g., r), i.e., everything it knows except what q itself contributed: φpq = Θp − φqp  (2)
Pairwise Protocol Details
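Below is a minimal sketch of invariants (1) and (2), assuming in-memory "nodes" and no real networking; the LocalPS class and its send_to/receive methods are illustrative, not the authors' implementation.

import numpy as np

class LocalPS:
    def __init__(self, name, dim, neighbors=()):
        self.name = name
        self.x = np.zeros(dim, dtype=np.float32)  # own contribution x_p
        # phi_in[q] holds the latest contribution received from neighbor q (φ_qp)
        self.phi_in = {q: np.zeros(dim, dtype=np.float32) for q in neighbors}

    def theta(self):
        # Invariant (1): Θp = xp + Σ_{q∈Np} φqp
        return self.x + sum(self.phi_in.values())

    def local_update(self, dp):
        self.x += dp  # local training adds to the node's own contribution

    def send_to(self, q):
        # Invariant (2): φpq = Θp − φqp, i.e. everything p knows except what
        # q itself contributed, which avoids double counting.
        return self.theta() - self.phi_in[q.name]

    def receive(self, sender_name, phi):
        # Overwrite (do not add) the stored contribution from that neighbor.
        self.phi_in[sender_name] = phi

# Usage: two nodes train locally, then exchange once; their beliefs agree.
a = LocalPS("a", dim=2, neighbors=["b"])
b = LocalPS("b", dim=2, neighbors=["a"])
a.local_update(np.array([1.0, 0.0], dtype=np.float32))
b.local_update(np.array([0.0, 2.0], dtype=np.float32))
b.receive("a", a.send_to(b))
a.receive("b", b.send_to(a))
print(a.theta(), b.theta())  # both print [1. 2.]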
3. How about flow control?
Straightforward: control the timing of synchronization, e.g., via a timer, a bound on the version gap, or even dynamic adjustment
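A minimal sketch of two such policies, assuming illustrative names (FlowControl, may_push) rather than the system's actual API: pushes are allowed only after a minimum interval and only while the unacknowledged version gap stays under a bound.

import time

class FlowControl:
    def __init__(self, min_interval_s=1.0, max_version_gap=4):
        self.min_interval_s = min_interval_s
        self.max_version_gap = max_version_gap
        self.last_push = 0.0
        self.local_version = 0   # update versions produced locally
        self.acked_version = 0   # versions the neighbor has acknowledged

    def may_push(self):
        timer_ok = (time.monotonic() - self.last_push) >= self.min_interval_s
        gap_ok = (self.local_version - self.acked_version) < self.max_version_gap
        return timer_ok and gap_ok

    def on_push(self):
        self.last_push = time.monotonic()
        self.local_version += 1

    def on_ack(self, version):
        self.acked_version = max(self.acked_version, version)

Dynamic adjustment would simply change min_interval_s or max_version_gap at run time.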
4. What about fault tolerance?
NOT based on redundancy (multiple copies) [cf. Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, Dave Andersen and Alex Smola. Parameter Server for Distributed Machine Learning. Big Learning Workshop, NIPS 2013]
Instead, the recovering node gets the history from its neighbors (each neighbor p resends Θp − φqp), or simply keeps its accumulated local updates in a persistent store (a recovery sketch follows below)
Failure cases: temporary outage, scheduled failure, permanent failure
Dynamically adding or removing model replicas follows the same logic as fault tolerance
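A minimal recovery sketch, building on the LocalPS class sketched earlier (illustrative only): after a crash, the recovering node restores its own accumulated updates from a persistent store and asks each neighbor to resend the same message it would send during normal exchange.

import numpy as np

def recover(name, dim, neighbors, persistent_store):
    node = LocalPS(name, dim, neighbors=[q.name for q in neighbors])
    # (a) own contribution, checkpointed to a persistent store before the failure
    node.x = persistent_store.get(name, np.zeros(dim, dtype=np.float32)).copy()
    # (b) each live neighbor resends its normal message Θ − φ; no extra
    #     replication of the model is needed
    for q in neighbors:
        node.receive(q.name, q.send_to(node))
    return node

Removing a replica is the reverse bookkeeping: its neighbors simply drop the stored φ contribution of the departed node, and a joining replica starts from a fresh zero entry.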
5. How are the local parameter servers connected (topology)?
The right topology is hard for the system to determine: it depends on the application, e.g., model size, update rate, network bandwidth, and the number of neighbors. Therefore, a configurable topology is motivated
Furthermore, as workers leave and join, the topology should be adjusted accordingly; for example, incrementally adding model replicas is helpful for DNN training. Therefore, topology re-configuration is necessary
Master-slave topology: shortest propagation delay (one hop), but high workload on the master
Tree-based topology: decentralized, with longer propagation delay (multiple hops) but without a bottleneck
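A small illustration of the propagation-delay trade-off between the two topologies (hop counts only; the function names and the balanced-binary-tree assumption are ours, not from the system):

import math

def star_hops(n):         # master-slave: one hop from the master to any slave
    return 1 if n > 1 else 0

def binary_tree_hops(n):  # tree-based: updates travel level by level
    return math.ceil(math.log2(n)) if n > 1 else 0

for n in (8, 64, 512):
    print(n, star_hops(n), binary_tree_hops(n))  # 1 hop vs 3, 6, 9 hops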
Scalability is sensitive to topology (experiment setup: the parameter size is 12MB and each worker pushes updates of the same size once per second)
Topology affects staleness
6. How do we set the right staleness bound to balance system performance against the model convergence rate?
Application-defined staleness is supported, for example: best effort (no extra requirement); maximal delayed time (block a push if the previous n pushes have not completed); user-defined filters (only push significant updates); SSP* (bound the maximum gap between the fastest and slowest worker), either by bounding the update version gap or by bounding the parameter value gap (a minimal SSP-style check is sketched below). [* Q. Ho, J. Cipar, H. Cui, J.-K. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger and E. P. Xing. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. NIPS 2013]
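A minimal sketch of an SSP-style check, bounding the clock gap between the fastest and the slowest worker; the names are illustrative and this is not the implementation from the cited paper.

class StalenessController:
    def __init__(self, num_workers, max_staleness=3):
        self.clock = [0] * num_workers   # per-worker iteration (clock) counters
        self.max_staleness = max_staleness

    def tick(self, worker_id):
        self.clock[worker_id] += 1

    def may_proceed(self, worker_id):
        # A worker may advance only if it is at most max_staleness iterations
        # ahead of the slowest worker; otherwise it waits for the stragglers.
        return self.clock[worker_id] - min(self.clock) <= self.max_staleness

# Usage: worker 0 races ahead; once the gap exceeds the bound it must wait.
ctl = StalenessController(num_workers=2, max_staleness=3)
for _ in range(5):
    ctl.tick(0)
print(ctl.may_proceed(0))  # False: worker 0 is 5 clocks ahead of worker 1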
Outline Problem statement Design goals Design Evaluation
Learning speed can be accelerated, but there is still a long journey toward a better error rate
Recap: re-configurability is king in system design, and the layered design is beautiful: a pure P2P design, the pairwise protocol, flow control, fault tolerance, node join/leave, configurable topology, and configurable staleness
Future work: the parameter server design is not only for DNNs but also for general inference problems, e.g., generalized linear models with a single massive vector, topic models with sparse vectors, and graphical models with plates. The design also works for areas other than machine learning: any scenario with structured data whose aggregation is commutative and associative, such as a sensor network computing aggregated data
Related work
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large Scale Distributed Deep Networks. NIPS 2012.
Q. Ho, J. Cipar, H. Cui, J.-K. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger and E. P. Xing. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. NIPS 2013.
Jinliang Wei, Wei Dai, Abhimanu Kumar, Xun Zheng, Qirong Ho and E. P. Xing. Consistent Bounded-Asynchronous Parameter Servers for Distributed ML. arXiv:1312.7869, 2013.
Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho and E. P. Xing. Petuum: A Framework for Iterative-Convergent Distributed ML. arXiv:1312.7651, 2013.
Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, Dave Andersen and Alex Smola. Parameter Server for Distributed Machine Learning. Big Learning Workshop, NIPS 2013.
Thanks! and Questions?
Backup