Presentation transcript:

Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics
Jinliang Wei, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing
CMU; Infocomm Research, A*STAR; Intel Labs
Presenter: Xuewei Zhang

Iteratively Refined Machine Learning Models
◦ Start from an expert-suggested model structure
◦ Iteratively refine the model parameters until convergence
◦ Example: stochastic gradient descent (see the sketch below)
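To make the refinement loop concrete, here is a minimal SGD sketch; the least-squares objective and synthetic data are illustrative assumptions, not from the talk.

```python
import numpy as np

# Minimal SGD sketch: iteratively refine parameters w to fit y ~ X @ w.
# The objective and synthetic data are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(10)            # starting point for iterative refinement
step_size = 0.01
for it in range(100):       # each pass refines the parameters a little
    for i in rng.permutation(len(y)):
        grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5*(x_i.w - y_i)^2
        w -= step_size * grad
```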

Data Parallelism and Consistency
Data parallelism:
◦ Partition training data across machines
◦ Shared access to model parameters
Bulk Synchronous Parallelization (BSP):
◦ MapReduce-like; barrier before each iteration
Total Asynchronous Parallelization (TAP):
◦ Workers run fully out of sync; no consistency guarantee
Bounded Staleness Parallelization:
◦ Staleness between workers bounded by a threshold (see the sketch below)
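A minimal sketch of the bounded-staleness condition these slides contrast with BSP and TAP; the names are illustrative, not the system's API.

```python
STALENESS = 3  # illustrative staleness threshold s

def may_proceed(my_clock: int, slowest_clock: int) -> bool:
    """Bounded staleness: a worker at clock my_clock may start its next
    iteration only if it is at most STALENESS clocks ahead of the slowest
    worker. STALENESS = 0 degenerates to BSP; an unbounded threshold
    behaves like TAP."""
    return my_clock - slowest_clock <= STALENESS
```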

Parameter Server and Bosen API

System Architecture
[Diagram: on each client, worker threads access a parameter cache and an update buffer; communication threads connect clients to the server, which holds the master copy of the parameters in concurrent hash tables, partitioned across server machines]

Get()
[Diagram: a worker thread reads a row through the client parameter cache; on a miss, the request is pushed onto a callback queue, a communication thread fetches the row from the server's master copy, and the callback is popped to resume the worker]

Inc()
[Diagram: a worker thread applies its delta to the client parameter cache and accumulates it in the update buffer; communication threads later flush buffered updates to the server's master copy]

Clock()
[Diagram: on Clock(), a client notifies the server that its updates for iteration t have been uploaded to the master copy]

Clock()
[Diagram: the server tracks, per (iteration, partition), when each client has downloaded the refreshed parameters]
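Putting the three calls together, here is a hedged sketch of a data-parallel worker using a Bosen-style client API. Get/Inc/Clock mirror the calls named on the slides, but the exact signatures, the table object, and the compute_deltas callback are assumptions for illustration.

```python
def worker_loop(table, my_partition, num_iterations, compute_deltas):
    """Sketch of a data-parallel worker against a Bosen-style table with
    get/inc/clock methods. Signatures and the compute_deltas callback are
    illustrative assumptions, not the system's real interface."""
    for t in range(num_iterations):
        for row_id, example in my_partition:
            row = table.get(row_id)            # Get(): read through the client parameter cache
            for col, delta in compute_deltas(row, example):
                table.inc(row_id, col, delta)  # Inc(): update local cache, buffer for the server
        table.clock()                          # Clock(): end of iteration t; may block until
                                               # stragglers are within the staleness bound
```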

Fault Tolerance
Checkpointing at the server:
◦ Checkpoints are slow, so take one only after a sufficient interval
◦ Estimate with checkpoint time Ts ≈ 20 s, per-machine MTBF Tf ≈ 1 year, N = 50 machines
◦ Optimal interval ≈ sqrt(2 · Ts · Tf / N) ≈ 1.4 h (see the check below)
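A quick check of that 1.4 h figure, assuming the standard first-order rule (Young's approximation) for the optimal checkpoint interval; that the talk uses exactly this rule is an assumption here.

```python
import math

T_s = 20.0               # time to save one checkpoint, seconds
T_f = 365 * 24 * 3600.0  # per-machine mean time between failures: ~1 year, in seconds
N = 50                   # machines; the cluster fails N times as often, so MTBF = T_f / N

# Young's approximation: checkpoint every sqrt(2 * T_s * MTBF) seconds.
interval = math.sqrt(2 * T_s * T_f / N)
print(f"optimal checkpoint interval: {interval / 3600:.1f} h")  # -> 1.4 h
```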

Managed Communication
Bandwidth-driven communication (leaky bucket model):
◦ Adapts to network bandwidth fluctuation
◦ 16-bit floats to reduce message size
Update prioritization for rows (see the sketch below):
◦ Randomized
◦ Round-robin
◦ Absolute magnitude
◦ Relative magnitude
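A minimal sketch of the relative-magnitude policy, the most involved of the four: rank buffered rows by how large their pending update is relative to the current parameter value, and send the top rows that fit the bandwidth budget. The dict-based buffer and row-count budget are simplifying assumptions.

```python
def pick_rows_to_send(update_buffer, param_cache, budget_rows):
    """Relative-magnitude prioritization sketch. update_buffer maps
    row_id -> pending delta and param_cache maps row_id -> current value;
    both are illustrative simplifications of per-row vectors."""
    def priority(row_id):
        delta = update_buffer[row_id]
        value = param_cache.get(row_id, 0.0)
        return abs(delta) / (abs(value) + 1e-12)  # guard against divide-by-zero

    ranked = sorted(update_buffer, key=priority, reverse=True)
    return ranked[:budget_rows]  # rows chosen for this communication round
```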

Adaptive Step Size Tuning
Requirements:
◦ Iteration ↑ ⇒ step size ↓
◦ Delay (staleness) ↑ ⇒ step size ↓
Adaptive Revision:
◦ Gradient change ↑ ⇒ step size ↓
◦ Delay ↑ ⇒ step size ↓
◦ Iteration ↑ ⇒ step size ↓
Parameter versioning:
◦ Lets the server measure each update's delay while reducing message load
User-defined stored procedure:
◦ Applies a delay-dependent weight to each update (see the sketch below)
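For intuition, here is a toy step-size rule with the required monotonic behavior; it is a deliberately simple stand-in, not the Adaptive Revision algorithm itself.

```python
import math

def step_size(base, iteration, delay):
    """Toy rule, not Adaptive Revision: the step shrinks as iterations
    accumulate (1/sqrt decay) and shrinks further the staler the update
    (1/(1+delay) discount)."""
    return base / (math.sqrt(1.0 + iteration) * (1.0 + delay))
```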

Evaluation
ML programs:
◦ Matrix Factorization
◦ Multiclass Logistic Regression
◦ Topic Modeling (LDA)
Cluster setup:
◦ PRObE Nome: 16 cores and 32 GB RAM per machine, 1 Gb Ethernet, 20 Gb/s InfiniBand
◦ PRObE Susitna: 64 cores and 128 GB RAM per machine, 1 GbE, 40 GbE

Evaluation
[Results figures: (1) effect of the bandwidth budget; (2) effect of update prioritization]

Evaluation
[Results figures: (1) how prioritization changes update frequency; (2) comparison against TAP]

Evaluation
[Results figures: (1) time per iteration vs. configuration; (2) bandwidth vs. configuration]

Evaluation
[Results figure: time vs. configuration]
◦ Randomized prioritization: convergence time improved 2.5x and 2.8x; per-iteration convergence improved 5.3x and 6.1x
◦ Magnitude-based prioritization: a further ~25% improvement

Evaluation
[Results figure: mini-batch experiments]