Aalo: Efficient Coflow Scheduling Without Prior Knowledge
Mosharaf Chowdhury, Ion Stoica (UC Berkeley)

Communication is Crucial for Performance
Facebook jobs spend ~25% of their runtime on average in intermediate communication [1]. As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck.
[1] Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.
[Figure: a MapReduce job with a Map Stage and a Reduce Stage]

Flow-Based Solutions
Decades of flow-based solutions, from per-flow fairness (1990s) to flow completion time (2000s): GPS, RED, WFQ, CSFQ, ECN, XCP, RCP, DCTCP, D2TCP, D3, PDQ, FCP, DeTail, pFabric.
Independent flows cannot capture the collective communication patterns (e.g., shuffle) common in data-parallel applications.

Coflow
A communication abstraction for data-parallel applications to express their performance goals:
1. Minimize completion times,
2. Meet deadlines, or
3. Perform fair allocation.
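To make the abstraction concrete, here is a minimal illustrative sketch (not Aalo's actual data structures; the class and field names are invented) that models a coflow as a named group of flows whose completion time is that of its slowest flow:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Flow:
        src: str             # sending machine
        dst: str             # receiving machine
        size_bytes: int      # not assumed to be known to the scheduler in advance
        sent_bytes: int = 0

    @dataclass
    class Coflow:
        coflow_id: str
        flows: List[Flow] = field(default_factory=list)

        def bytes_sent(self) -> int:
            # A coflow's progress is the total sent across all of its flows.
            return sum(f.sent_bytes for f in self.flows)

        def completed(self) -> bool:
            # A coflow finishes only when its last (slowest) flow finishes.
            return all(f.sent_bytes >= f.size_bytes for f in self.flows)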

Benefits of Inter-Coflow Scheduling
[Figure: flows of two coflows sharing Link 1 and Link 2]

Benefits of Inter-Coflow Scheduling
[Example: Coflow 1 and Coflow 2 share Link 1 and Link 2, with flows of 3, 6, and 2 units.]
Fair Sharing: Coflow 1 comp. time = 5, Coflow 2 comp. time = 6.
Smallest-Flow First: Coflow 1 comp. time = 5, Coflow 2 comp. time = 6.
Smallest-Coflow First: Coflow 1 comp. time = 3, Coflow 2 comp. time = 6.
Benefits increase with the number of coexisting coflows.
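The slide's numbers can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes a particular reading of the figure (Link 1 carries Coflow 1's 3-unit flow and Coflow 2's 2-unit flow; Link 2 carries Coflow 2's 6-unit flow; each link moves one unit per time step); under that assumption it prints exactly the completion times shown above.

    def serial(order):
        # Completion times when a link serves its flows one after another.
        t, done = 0.0, {}
        for coflow, size in order:
            t += size
            done[coflow] = t
        return done

    # Smallest-Flow First on Link 1: the 2-unit flow (Coflow 2) goes before the 3-unit flow.
    sff = serial([("c2", 2), ("c1", 3)])      # c2 done at 2, c1 done at 5
    # Smallest-Coflow First on Link 1: Coflow 1 (3 units in total) < Coflow 2 (8 units in total).
    scf = serial([("c1", 3), ("c2", 2)])      # c1 done at 3, c2 done at 5
    # Fair sharing on Link 1: both flows at rate 1/2 until the 2-unit flow ends at t = 4,
    # then the 3-unit flow gets the whole link and ends at t = 5.
    fair = {"c1": 5.0, "c2": 4.0}

    link2_c2 = 6.0                            # Coflow 2's 6-unit flow runs alone on Link 2
    for name, link1 in [("Fair Sharing", fair),
                        ("Smallest-Flow First", sff),
                        ("Smallest-Coflow First", scf)]:
        cct1 = link1["c1"]                    # Coflow 1's only flow is on Link 1
        cct2 = max(link1["c2"], link2_c2)     # a coflow completes when its last flow does
        print(f"{name}: Coflow 1 = {cct1:g}, Coflow 2 = {cct2:g}")

Under these assumptions, only the coflow-aware ordering improves Coflow 1 (from 5 to 3 time units) without delaying Coflow 2.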

Varys [1]
Efficiently schedules coflows leveraging complete and future information:
1. The size of each flow,
2. The total number of flows, and
3. The endpoints of individual flows.
In practice, this information is often unavailable because of:
- Pipelining between stages
- Speculative executions
- Task failures and restarts
[1] Efficient Coflow Scheduling with Varys, SIGCOMM'2014.

Aalo
Efficiently schedules coflows without complete and future information, i.e., without knowing:
1. The size of each flow,
2. The total number of flows, or
3. The endpoints of individual flows,
all of which may be unavailable because of:
- Pipelining between stages
- Speculative executions
- Task failures and restarts

Coflow Scheduling (Minimize Avg. Completion Time)
- Flows on a single link: Smallest-Flow-First with complete knowledge; Least-Attained Service (LAS) without complete knowledge.
- Coflows in the entire network: Varys / Smallest-Coflow-First [1] with complete knowledge; ? without complete knowledge.
LAS: prioritize the flow that has sent the least amount of data.
[1] Efficient Coflow Scheduling with Varys, SIGCOMM'2014.

Coflow-Aware LAS (CLAS)
Prioritize the coflow that has sent the least total number of bytes:
- The more a coflow has sent, the lower its priority.
- Smaller coflows finish faster.
Challenges (also shared by LAS):
- Can lead to starvation
- Suboptimal for similar-size coflows
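As a rough, self-contained sketch (not Aalo's code; names are illustrative), CLAS amounts to always serving the coflow with the smallest running byte count:

    def clas_pick(sent_bytes):
        # sent_bytes: {coflow_id: total bytes the coflow has sent so far,
        #              summed over all of its flows across the network}.
        # Serve the least-served coflow, so small coflows finish quickly.
        # A long-running coflow keeps losing priority and, under a steady
        # stream of small arrivals, can starve (one motivation for D-CLAS).
        return min(sent_bytes, key=sent_bytes.get)

    # Example: coflow "a" has sent 5 MB in total, "b" only 1 MB -> "b" is served first.
    print(clas_pick({"a": 5_000_000, "b": 1_000_000}))   # -> b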

Suboptimal for Similar Coflows
CLAS reduces to fair sharing for similar coflows and doesn't minimize average completion time.
FIFO works well for similar coflows; it is optimal when coflows are identical.
[Example: two identical coflows. Fair sharing: Coflow 1 comp. time = 6, Coflow 2 comp. time = 6. FIFO: Coflow 1 comp. time = 3, Coflow 2 comp. time = 6.]

Between a “Rock” and a “Hard Place”
Prioritize across dissimilar coflows; FIFO-schedule similar coflows.

Discretized Coflow-Aware LAS (D-CLAS)
Priority discretization: change a coflow's priority when its total # of bytes sent exceeds predefined thresholds.
Scheduling policies:
- FIFO within the same queue
- Prioritization across queues
- Weighted sharing across queues guarantees starvation avoidance
[Figure: K priority queues Q1 ... QK, with a highest-priority queue at one end and a lowest-priority queue at the other; FIFO within each queue]

How to Discretize Priorities?
Exponentially spaced thresholds A×E^i, where A and E are constants, 1 ≤ i ≤ K, and K is the number of queues.
[Figure: queue boundaries at 0, A, A×E, A×E^2, ..., A×E^(K-1), ∞; the highest-priority queue holds the coflows that have sent the fewest bytes, the lowest-priority queue those that have sent the most; FIFO within each queue]
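A minimal, self-contained sketch of this discretization and of the FIFO-within / priority-across rule, assuming illustrative constants (A = 10 MB, E = 10, K = 10 are placeholders, not Aalo's tuned values) and ignoring the weighted sharing across queues:

    def queue_index(bytes_sent, A=10_000_000, E=10, K=10):
        # Queue boundaries: 0, A, A*E, A*E**2, ..., A*E**(K-1), infinity.
        # Fewer bytes sent -> lower index -> higher priority (the CLAS rule);
        # coflows inside one queue are treated as similar and served FIFO.
        i, threshold = 0, A
        while i < K - 1 and bytes_sent >= threshold:
            threshold *= E
            i += 1
        return i

    def dclas_pick(coflows):
        # coflows: list of (coflow_id, total_bytes_sent) in arrival order.
        # Strict priority across queues, FIFO within a queue. (Aalo additionally
        # uses weighted sharing across queues to avoid starvation; that is
        # omitted here to keep the sketch short.)
        best_id, best_key = None, None
        for arrival, (coflow_id, sent) in enumerate(coflows):
            key = (queue_index(sent), arrival)
            if best_key is None or key < best_key:
                best_id, best_key = coflow_id, key
        return best_id

    # A coflow that has sent 0 bytes maps to queue 0; one that has sent 50 MB maps to queue 1.
    print(queue_index(0), queue_index(50_000_000))   # -> 0 1

Because the thresholds grow exponentially, a handful of queues covers coflows from kilobytes to terabytes, and queue assignment needs nothing but the bytes already sent.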

Computing the Total # of Bytes Sent
D-CLAS requires knowing the total # of bytes sent over all flows of a coflow, and computing this in a distributed fashion over small time scales is challenging.
How much do we lose if we don't compute the total # of bytes sent?
D-LAS: make decisions based only on the number of bytes sent locally.

D-LAS Far From Optimal!
[Same two-link, two-coflow example as before: flows of 3, 6, and 2 units over Link 1 and Link 2.]
D-LAS (decisions based on # of bytes sent locally): Coflow 1 comp. time = 6, Coflow 2 comp. time = 6.
D-CLAS: Coflow 1 comp. time = 3, Coflow 2 comp. time = 6.

Aalo
Efficiently schedules coflows without complete and future information:
1. Implement D-CLAS using a centralized architecture
2. Expose a non-blocking coflow API
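As a purely hypothetical illustration of what a non-blocking coflow API could look like to an application (the class and method names below are invented for this sketch and are not Aalo's actual API), the key property is that a sender can start transmitting immediately, without first declaring every flow of the coflow:

    class CoflowClient:
        # Hypothetical client-side sketch: register() returns an ID that tags
        # subsequent transfers; send() starts transmitting right away instead of
        # blocking until all flows of the coflow are known.
        def register(self) -> str: ...
        def send(self, coflow_id: str, dst: str, data: bytes) -> None: ...
        def unregister(self, coflow_id: str) -> None: ...

    # Usage sketch: a mapper registers once per shuffle and streams partitions
    # to reducers as they are produced, never waiting on flow sizes or counts.
    #
    #   client = CoflowClient()
    #   cid = client.register()
    #   for reducer, partition in produced_partitions():   # hypothetical helper
    #       client.send(cid, reducer, partition)
    #   client.unregister(cid)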

Aalo Architecture
[Figure: a central Coordinator runs D-CLAS and performs global scheduling at millisecond timescales; a Worker at each sender (Sender 1, Sender 2, ...) performs local scheduling over its network interface at microsecond (μs) timescales.]

Details
Non-blocking: when a new coflow arrives at an output port, put its flow(s) in the lowest-priority queue and schedule them immediately; there is no need to sync all flows of a coflow as in Varys.
Computing the total number of bytes sent:
- Workers send info about active coflows to the coordinator periodically.
- The coordinator computes the total # of bytes sent and relays this info back to the workers.
- Workers use this info to move coflows across queues.
- Minimal overhead for small flows.
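A minimal sketch of that coordination loop, with invented names (Coordinator, report, totals): each worker periodically reports how many bytes it has sent locally for each active coflow, and the coordinator aggregates the reports into global totals that workers then use to place coflows into queues.

    from collections import defaultdict

    class Coordinator:
        def __init__(self):
            # coflow_id -> {worker_id: bytes that worker has sent for the coflow}
            self.reports = defaultdict(dict)

        def report(self, worker_id, local_bytes_sent):
            # Called by each worker every few milliseconds with its local view,
            # e.g. {"coflow-7": 1_500_000, "coflow-9": 40_000}.
            for coflow_id, sent in local_bytes_sent.items():
                self.reports[coflow_id][worker_id] = sent

        def totals(self):
            # Relayed back to workers, which use these global totals (instead of
            # purely local counts, as D-LAS would) to move coflows across queues.
            return {cid: sum(per_worker.values())
                    for cid, per_worker in self.reports.items()}

Between updates, workers keep scheduling with their latest (possibly stale) totals, which keeps the overhead minimal for coflows that finish within a single update interval.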

Evaluation
A 3000-machine trace-driven simulation matched against a 100-machine EC2 deployment.
1. Can it approach clairvoyant solutions?
2. Can it scale gracefully?
YES.

On Par with Clairvoyant Approaches [EC2]
vs. per-flow fairness: 1.93x communication improvement, 1.18x job improvement.
vs. Varys: 0.89x communication improvement, 0.91x job improvement.

Performance Breakdown [EC2] (Aalo vs. Varys)
- Improvements for small coflows by avoiding coordination.
- Performance loss for medium coflows by misscheduling them.
- Similar performance for large coflows because they are in slow-moving queues.

What About Scalability? [EC2]

Aalo
- Efficiently schedules coflows without complete information
- Makes coflows practical in the presence of failures and DAGs
- Improved performance over flow-based approaches
- Provides a simple, non-blocking API
Mosharaf Chowdhury –