Chris Cai, Shayan Saeed, Indranil Gupta, Roy Campbell, Franck Le


Phurti: Application and Network-Aware Flow Scheduling for Multi-Tenant MapReduce Clusters Chris Cai, Shayan Saeed, Indranil Gupta, Roy Campbell, Franck Le Systems Research Group, Distributed Protocols Research Group

Outline Introduction System Architecture Scheduling Algorithm Evaluation Summary

Multi-tenancy in MapReduce Clusters Many users submit jobs to a shared MapReduce cluster: better ROI, high utilization. How to share resources? The network is the primary bottleneck.

Problem Statement How to schedule network traffic to improve completion time for MapReduce jobs?

Application-Awareness in Scheduling
[Figure: Job 1 and Job 2 traffic on Link 1 (6 units) and Link 2 (3 units, 2 units), with completion timelines under each policy]
Fair Sharing: Job 1 completion time = 5, Job 2 completion time = 6
Shortest Flow First: Job 1 completion time = 5, Job 2 completion time = 6
Application-Aware: Job 1 completion time = 3, Job 2 completion time = 6

Network-Awareness in Scheduling
[Figure: nodes N1-N4 connected through switches S1 and S2; Job 1 traffic takes Path 1 (3 units), Job 2 traffic takes Path 2 (3 units)]

Network-Awareness in Scheduling
Network-Agnostic: Job 1 completion time = 6, Job 2 completion time = 6
Network-Aware: Job 1 completion time = 3, Job 2 completion time = 6
Takeaway: Do not schedule interfering flows of concurrent jobs together
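The takeaway hinges on detecting when two flows interfere, i.e. share at least one link. A minimal sketch of such a check (the path representation and the small topology below are illustrative assumptions, not taken from the paper):

```python
# Illustrative sketch: two flows interfere when their network paths share
# at least one link; scheduling them together splits link bandwidth and
# delays both jobs.

def links(path):
    """Break a node-level path like ["N1", "S1", "N3"] into directed links."""
    return {(path[i], path[i + 1]) for i in range(len(path) - 1)}

def interfere(path_a, path_b):
    """True when the two paths contend for at least one common link."""
    return bool(links(path_a) & links(path_b))

# Assumed topology: N1/N2 attach to switch S1, N3/N4 to switch S2.
job1_path = ["N1", "S1", "S2", "N3"]  # crosses the S1-S2 core link
job2_path = ["N2", "S1", "S2", "N4"]  # also crosses S1-S2
print(interfere(job1_path, job2_path))  # True: run these one at a time
print(interfere(["N1", "S1", "N2"], ["N3", "S2", "N4"]))  # False: can overlap
```

A network-aware scheduler can run this check before admitting a new flow alongside the currently active ones.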

Related Work
Traditional flow scheduling: PDQ [SIGCOMM '12], Hedera [NSDI '10] (only improve network-level metrics)
Application- and network-aware task schedulers: Cross-Layer Scheduling [IC2E '15], Tetris [SIGCOMM '14] (schedule tasks instead of network traffic)
Application-aware traffic schedulers: Baraat [SIGCOMM '14], Varys [SIGCOMM '14] (unaware of network topology)

Phurti: Contributions Improves Job Completion Time Fairness and Starvation Protection Scalable API Compatibility Hardware Compatibility

Outline Introduction System Architecture Scheduling Algorithm Evaluation Summary

Phurti Framework
[Architecture: Hadoop nodes N1-N6 communicate with the Phurti scheduling framework through a northbound API; Phurti programs the SDN switches S1 and S2 through a southbound API]

Outline Introduction System Architecture Scheduling Algorithm Evaluation Summary

Phurti Algorithm – Intuition
[Figure: Job 1's flows place at most 4 units of traffic on any single path (max. sequential traffic: 4 units); Job 2's place at most 5 (max. sequential traffic: 5 units); the timelines show Job 1 completing at time 4 and Job 2 at time 5]
Takeaway: Job completion time is determined by maximum sequential traffic.
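Maximum sequential traffic can be computed per job by summing flow sizes on each path and taking the largest sum. A sketch with made-up flow sizes, chosen only so the per-path totals reproduce the slide's 4- and 5-unit numbers:

```python
from collections import defaultdict

def max_sequential_traffic(flows):
    """flows: list of (path_id, size) pairs belonging to one job.
    Flows on the same path must be sent one after another, so the
    busiest path determines the job's completion time."""
    per_path = defaultdict(int)
    for path, size in flows:
        per_path[path] += size
    return max(per_path.values())

# Flow sizes are illustrative; only the per-path totals match the slide.
job1 = [("P1", 2), ("P1", 2), ("P2", 1)]  # busiest path P1: 4 units
job2 = [("P1", 1), ("P2", 3), ("P2", 2)]  # busiest path P2: 5 units
print(max_sequential_traffic(job1))  # 4
print(max_sequential_traffic(job2))  # 5
```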

Phurti Algorithm – Intuition (cont.)
Job 1: max. sequential traffic 4 units; Job 2: max. sequential traffic 5 units
If Job 1 is scheduled first: Job 1 completion time = 4, Job 2 completion time = 8
If Job 2 is scheduled first: Job 1 completion time = 8, Job 2 completion time = 5
Observation: It is better to schedule the job with smaller maximum sequential traffic first.
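The observation is shortest-job-first applied at the job level, with maximum sequential traffic as the job length. A back-to-back sketch of the arithmetic (it serializes the jobs completely, so the second job finishes at time 9 rather than the slide's 8, where non-interfering flows overlap; the ordering conclusion is the same):

```python
def completion_times(order):
    """order: (job, max_sequential_traffic) pairs, run strictly one
    after another on the shared paths."""
    finished, elapsed = {}, 0
    for job, traffic in order:
        elapsed += traffic
        finished[job] = elapsed
    return finished

# Scheduling the smaller job first gives the lower average completion time.
print(completion_times([("J1", 4), ("J2", 5)]))  # {'J1': 4, 'J2': 9}, avg 6.5
print(completion_times([("J2", 5), ("J1", 4)]))  # {'J2': 5, 'J1': 9}, avg 7.0
```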

Phurti Algorithm
1. Assign priorities to jobs based on max. sequential traffic: smaller traffic gets higher priority (latency improvement)
2. Let flows of the highest-priority job transfer
3. Let non-interfering flows of the lower-priority jobs transfer (throughput maximization)
4. Let other lower-priority flows transfer at a small rate (starvation protection)
Example (nodes N1-N4, switches s1-s3): Job J1's flows N1→N4 and N4→N1 give it max. sequential traffic 2 (priority LOW); Job J2's single flow N2→N3 gives it max. sequential traffic 1 (priority HIGH).
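The four steps can be sketched as a single scheduling pass. All names, data structures, and the per-flow link sets below are my own illustration; Phurti itself enforces rates through the SDN southbound API rather than returning a dict:

```python
def schedule(jobs, interfere):
    """jobs: {name: {"flows": [flow_ids], "max_seq": int}};
    interfere(f, g): True when flows f and g share a link.
    Returns a rate class per flow: "full" or "trickle"."""
    order = sorted(jobs, key=lambda j: jobs[j]["max_seq"])  # smallest first
    rates, active = {}, []
    for f in jobs[order[0]]["flows"]:      # highest-priority job: full rate
        rates[f] = "full"
        active.append(f)
    for name in order[1:]:                 # lower-priority jobs
        for f in jobs[name]["flows"]:
            if all(not interfere(f, g) for g in active):
                rates[f] = "full"          # throughput maximization
                active.append(f)
            else:
                rates[f] = "trickle"       # starvation protection
    return rates

# The slide's example, with assumed link sets for each flow's path.
paths = {"N1->N4": {"s1-s2", "s2-s3"}, "N4->N1": {"s1-s2", "s2-s3"},
         "N2->N3": {"s1-s2"}}
def interfere(f, g):
    return bool(paths[f] & paths[g])

jobs = {"J1": {"flows": ["N1->N4", "N4->N1"], "max_seq": 2},
        "J2": {"flows": ["N2->N3"], "max_seq": 1}}
print(schedule(jobs, interfere))
# J2's flow runs at full rate; J1's interfering flows get the trickle rate.
```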

Evaluation
Baseline: Fair Sharing (default in MapReduce)
Testbed: 6 nodes, 2 SDN switches
SWIM workload: workload generated from a Facebook Hadoop trace
Job size bin | % of total jobs | % of total bytes in shuffled data
Small | 62% | 5.5%
Medium | 16% | 10.3%
Large | 22% | 84.2%

Job Completion Time 95% of jobs have a better completion time under Phurti (negative values in the plot mean Phurti performs better).

Job Completion Time Phurti improves the 95th-percentile job completion time by 13%, demonstrating starvation protection. The improvement is much larger for small jobs, since they typically receive higher priority.

Flow Scheduling Overhead Simulated a fat-tree topology with 128 hosts. Even in the unlikely event of 100 simultaneous incoming flows, scheduling takes only 4.5 ms, a negligible overhead.

Flow Scheduling Overhead Scheduling time for a new flow with 10 ongoing flows in the network. The overhead grows sub-linearly with the number of hosts, showing that Phurti scales.

Phurti vs. Varys Simulated a 128-host fat-tree topology with core network capacity 1x, 5x, and 10x that of the access links. Phurti outperforms Varys significantly when the core network is oversubscribed (much less capacity than the access links), and is better than Varys in every case.

Phurti: Contributions Improves completion time for 95% of jobs and decreases average completion time by 20% across all jobs. Fairness and starvation protection: improves tail (95th-percentile) job completion time by 13%. Scalable: shown to scale to 1024 hosts and 100 simultaneous flow arrivals. API compatibility. Hardware compatibility.

Backup slides

Effective Transmit Rate 80% of jobs have an effective transmit rate above 0.9, showing that Phurti's throttling of low-priority flows is minimal.