Solving the straggler problem with bounded staleness Jim Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton*,

Slides:



Advertisements
Similar presentations
IT253: Computer Organization
Advertisements

Operating System.
Artificial Intelligence 13. Multi-Layer ANNs Course V231 Department of Computing Imperial College © Simon Colton.
Lecture 11: Operating System Services. What is an Operating System? An operating system is an event driven program which acts as an interface between.
Distributed Parameter Synchronization in DNN
Microprocessors VLIW Very Long Instruction Word Computing April 18th, 2002.
Introduction CSCI 444/544 Operating Systems Fall 2008.
Parallel System Performance CS 524 – High-Performance Computing.
Introduction to Analysis of Algorithms
OS Fall ’ 02 Introduction Operating Systems Fall 2002.
Concurrency CS 510: Programming Languages David Walker.
CS533 - Concepts of Operating Systems
Device Management.
Bellevue University CIS 205: Introduction to Programming Using C++ Lecture 1: Getting Started by George Lamperti & BU Faculty.
Implications for Programming Models Todd C. Mowry CS 495 September 12, 2002.
K-means Clustering. What is clustering? Why would we want to cluster? How would you determine clusters? How can you do this efficiently?
1 Today I/O Systems Storage. 2 I/O Devices Many different kinds of I/O devices Software that controls them: device drivers.
Rensselaer Polytechnic Institute CSC 432 – Operating Systems David Goldschmidt, Ph.D.
Chapter 1: Introduction To Computer | SCP1103 Programming Technique C | Jumail, FSKSM, UTM, 2005 | Last Updated: July 2005 Slide 1 Introduction To Computers.
Switching, routing, and flow control in interconnection networks.
Memory/Storage Architecture Lab Computer Architecture Lecture Storage and Other I/O Topics.
Algorithms Describing what you know. Contents What are they and were do we find them? Why show the algorithm? What formalisms are used for presenting.
Introduction and Overview Questions answered in this lecture: What is an operating system? How have operating systems evolved? Why study operating systems?
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
SGD ON HADOOP FOR BIG DATA & HUGE MODELS Alex Beutel Based on work done with Abhimanu Kumar, Vagelis Papalexakis, Partha Talukdar, Qirong Ho, Christos.
Introduction to Networked Graphics Part 3 of 5: Latency.
SU YUXIN JAN 20, 2014 Petuum: An Iterative-Convergent Distributed Machine Learning Framework.
Rensselaer Polytechnic Institute CSCI-4210 – Operating Systems CSCI-6140 – Computer Operating Systems David Goldschmidt, Ph.D.
LazyBase: Trading freshness and performance in a scalable database Jim Cipar, Greg Ganger, *Kimberly Keeton, *Craig A. N. Soules, *Brad Morrey, *Alistair.
CSE 486/586 CSE 486/586 Distributed Systems Graph Processing Steve Ko Computer Sciences and Engineering University at Buffalo.
1 CS 501 Spring 2006 CS 501: Software Engineering Lecture 22 Performance of Computer Systems.
CS 478 – Tools for Machine Learning and Data Mining Backpropagation.
1 Blue Gene Simulator Gengbin Zheng Gunavardhan Kakulapati Parallel Programming Laboratory Department of Computer Science.
Java Threads. What is a Thread? A thread can be loosely defined as a separate stream of execution that takes place simultaneously with and independently.
Lecture 35: Chapter 6 Today’s topic –I/O Overview 1.
LogP and BSP models. LogP model Common MPP organization: complete machine connected by a network. LogP attempts to capture the characteristics of such.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Mahapatra-A&M-Fall'001 Co-design Finite State Machines Many slides of this lecture are borrowed from Margarida Jacome.
COMP 111 Threads and concurrency Sept 28, Tufts University Computer Science2 Who is this guy? I am not Prof. Couch Obvious? Sam Guyer New assistant.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Mutual Exclusion.
Operating Systems David Goldschmidt, Ph.D. Computer Science The College of Saint Rose CIS 432.
West Virginia University Slide 1 Copyright © K.Goseva 2010 CS 736 Software Performance Engineering Comments on Homework #1  Please revise the solution.
1 Client-Server Interaction. 2 Functionality Transport layer and layers below –Basic communication –Reliability Application layer –Abstractions Files.
Precomputation- based Prefetching By James Schatz and Bashar Gharaibeh.
Optimizing Charm++ Messaging for the Grid Gregory A. Koenig Parallel Programming Laboratory Department of Computer.
O PERATING S YSTEM. What is an Operating System? An operating system is an event driven program which acts as an interface between a user of a computer,
Computer and Programming. Computer Basics: Outline Hardware and Memory Programs Programming Languages and Compilers.
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
Specifying Multithreaded Java semantics for Program Verification Abhik Roychoudhury National University of Singapore (Joint work with Tulika Mitra)
 Kim  Allen  Kenneth. Chapter 1 Computer Fundamentals.
Given a set of data points as input Randomly assign each point to one of the k clusters Repeat until convergence – Calculate model of each of the k clusters.
Agenda  Quick Review  Finish Introduction  Java Threads.
Managed Communication and Consistency for Fast Data- Parallel Iterative Analytics Jinliang WeiWei DaiAurick QiaoQirong HoHenggang Cui Gregory R. GangerPhillip.
Multi-Grid Esteban Pauli 4/25/06. Overview Problem Description Problem Description Implementation Implementation –Shared Memory –Distributed Memory –Other.
Pattern Recognition Lecture 20: Neural Networks 3 Dr. Richard Spillman Pacific Lutheran University.
Identify internal hardware devices (e. g
Fall 2004 Backpropagation CS478 - Machine Learning.
Addressing the straggler problem for iterative convergent parallel ML
Real-time Software Design
Addressing the straggler problem for iterative convergent parallel ML
Exploiting Bounded Staleness to Speed up Big Data Analytics
PipeDream: Pipeline Parallelism for DNN Training
湖南大学-信息科学与工程学院-计算机与科学系
Chapter 1: How are computers organized?
Background and Motivation
Parallel and Distributed Simulation
CS 6290 Many-core & Interconnect
Artificial Neural Networks
Presentation transcript:

Solving the straggler problem with bounded staleness Jim Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton*, Eric Xing PARALLEL DATA LABORATORY Carnegie Mellon University * HP Labs

Overview It’s time for all applications (and systems) to worry about data freshness Current focus: parallel machine learning Often limited by synchronization overhead What if we explicitly allow stale data?

A typical ML algorithm Input data

A typical ML algorithm Input data (1) initialization Intermediate Program state

A typical ML algorithm Input data (1) initialization Intermediate Program state (2) Iterate, many small updates

A typical ML algorithm Input data Output results (1) initialization Intermediate Program state (2) Iterate, many small updates (3) Output results after many iterations

A typical ML algorithm Input data Output results (1) initialization Intermediate Program state (2) Iterate, many small updates (3) Output results after many iterations

Parallel ML Generally follows bulk synchronous parallel model Many iterations of 1.Computation: compute new values 2.Synchronization: wait for all other threads 3.Communication: send new values to other threads 4.Synchronization: wait for all other threads… again

BSP Visualized All threads must be on the same iteration to continue Thread

Stragglers in BSP Slow thread(s) will hold up entire application Predictable stragglers – Slow/old machine – Bad network card – More data assigned to some threds

Stragglers in BSP Slow thread(s) will hold up entire application Predictable stragglers  Easy case – Slow/old machine – Bad network card – More data assigned to some threads

Stragglers in BSP Slow thread(s) will hold up entire application Predictable stragglers  Easy case Unpredictable stragglers  ???

Stragglers in BSP Slow thread(s) will hold up entire application Predictable stragglers  Easy case Unpredictable stragglers  ??? – Hardware: disk seeks, network, CPU interrupts – Software: garbage collection, virtualization – Algorithmic: Calculating objectives and stopping conditions

Stragglers in BSP Slow thread(s) will hold up entire application Predictable stragglers  Easy case Unpredictable stragglers  ??? – Hardware: disk seeks, network latency, CPU interrupts – Software: garbage collection, virtualization – Algorithmic: Calculating objectives, stopping conditions Don’t synchronize

Don’t synchronize? Well, don’t synchronize much – Read old (stale) results from other threads – Application controls how stale the data can be Machine learning can get away with that Algorithms are convergent – Given (almost) any state, will find correct solution – Errors introduced by staleness are usually ok

Trajectories of points in 2d Points are initialized randomly, Always settle to correct locations

Freshness and convergence: the sweet spot Fresh reads/writes Stale reads/writes

Freshness and convergence: the sweet spot Fresh reads/writes Stale reads/writes Iterations per second

Freshness and convergence: the sweet spot Fresh reads/writes Stale reads/writes Iterations per second Improvement per iteration

Freshness and convergence: the sweet spot Fresh reads/writes Stale reads/writes Iterations per second Improvement per iteration Improvement per second

Freshness and convergence: the sweet spot Fresh reads/writes Stale reads/writes Iterations per second Improvement per second The sweet spot

Stale synchronous parallel Allow threads to continue ahead of others – Avoids temporary straggler effects Application can limit allowed staleness – Ensure convergence – E.g. “threads may not be more than 3 iters ahead”

SSP Visualized Threads proceed, possibly using stale data Thread

Total convergence time I Increased staleness can mask the effects of occasional delays

Ongoing work Characterizing “staleness-tolerant” algorithms – Properties of algorithms, rules of thumb – Convergence proof Automatically tune freshness requirement Specify freshness by error bounds – “Read X with no more than 5% error”

Summary Introducing staleness, but not too much staleness, can improve performance of machine learning algorithms.