1
Varys: Efficient Coflow Scheduling
Mosharaf Chowdhury, Yuan Zhong, Ion Stoica (UC Berkeley)
Good morning. I'm Mosharaf from the AMPLab. Today, I'm going to talk about how to perform application-aware network scheduling in data-parallel clusters using coflows. This is joint work with …
2
Communication is Crucial for Performance
Facebook analytics jobs spend 33% of their runtime in communication1
As in-memory systems proliferate, the network is likely to become the primary bottleneck
Communication is crucial for analytics at scale. For example, in our earlier work, we found that typical data-parallel jobs at Facebook spend up to a third of their running time in shuffle or intermediate data transfers. As in-memory systems proliferate and disks are removed from the I/O pipeline, data-parallel applications will spend more and more of their time communicating data over the network. These are, of course, well-known concerns. The real question is: how do we go about optimizing communication performance?
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011
3
“Let systems figure it out”
Optimizing Communication Performance: Networking Approach. “Let systems figure it out”
Flow: a sequence of packets between two endpoints; the independent unit of allocation, sharing, load balancing, and/or prioritization
So far, most of the research in the networking community has been done in the context of the flow abstraction. A flow between two endpoints is just a sequence of packets, and it is the unit of allocation, load balancing, and traffic engineering. While this abstraction has served us well for client-server and peer-to-peer communication, it is a poor fit for data-parallel applications.
4
“Let users figure it out”
Optimizing Communication Performance: Systems Approach. “Let users figure it out”

Framework       # Comm. Params*
Spark 1.0.1     6
Hadoop 1.0.4    10
YARN 2.3.0      20

Because we only have the application-agnostic flow abstraction, system designers expose a myriad of parameters in an attempt to optimize the performance of data-parallel applications. Examples include the number of parallel flows, the size of send/receive buffers, bytes in flight, and many others. To get a sense of how many parameters you might want to tune, a few weeks ago we looked at three of the most commonly used data-parallel frameworks. And this is only a lower bound: there are many other parameters that indirectly impact the communication performance of data-parallel jobs. These parameters are hard to understand, difficult to use, and they don't always produce the expected results.
*Lower bound. Does not include many parameters that can indirectly impact communication (e.g., number of reducers). Also excludes control-plane communication/RPC parameters.
5
Optimizing Communication Performance
Networking approach: “Let systems figure it out.” Systems approach: “Let users figure it out.”
A collection of parallel flows. Distributed endpoints. Each flow is independent. Completion time depends on the last flow to complete.
Obviously, there is a massive gap between how the networking and data-parallel application communities think about optimizing the communication performance of data-parallel applications. Our goal is to bridge this gap. A data-parallel application is a sequence of computation and communication stages. If we focus on one of these communication stages, a shuffle in this case, we see that communication is highly structured. Each communication stage consists of a collection of parallel flows whose endpoints are located on different machines. Furthermore, each flow is independent, in that the input of one flow does not depend on another flow of the same communication stage. Finally, a communication stage cannot complete until all its flows have completed. As a result, completing individual flows quickly and minimizing flow completion times might not result in any application-level improvement.
6
Completion time depends on the last flow to complete
A collection of parallel flows. Distributed endpoints. Each flow is independent. Completion time depends on the last flow to complete. Coflow1
We refer to such collections of parallel flows as coflows. Any data-parallel DAG typically consists of a sequence of coflows of different patterns, including many-to-many, many-to-one, one-to-many, all-to-all, and even collections of parallel flows where each flow has distinct endpoints. Note that each individual flow is also a coflow.
1. Coflow: A Networking Abstraction for Cluster Applications, HotNets’2012
7
Completion time depends on the last flow to complete
How to schedule coflows…
#1 …for faster completion of coflows?
#2 …to meet more deadlines?
(Figure: a DC fabric with ingress and egress ports 1, 2, …, N.)
Now, of course, a data-parallel cluster is a shared pool of resources. Many coflows, from many different jobs and frameworks, can coexist on the shared network. If we consider all the machines in the cluster to be connected by a shared network fabric with N input and output ports, the question we want to answer is: how do we schedule all these coflows? We want to schedule coflows to minimize coflow completion times, which are more directly correlated with job completion times. We also want to schedule coflows to maximize the number of coflows that complete within their deadlines. In this talk, I'll focus on making coflows complete faster.
8
Varys: Enables coflows in data-intensive clusters
Simpler frameworks: zero user-side configuration using a simple coflow API
Better performance: faster and more predictable transfers through coflow scheduling
In this work, we present Varys, a system that allows any data-parallel framework to take advantage of coflows, enabling simpler configuration for users and better performance of user jobs. Varys requires minor changes to cluster computing frameworks, but no changes to the applications. While we addressed the problem of scheduling individual coflows in the past, Varys is about efficiently scheduling multiple coflows.
9
Inter-Coflow Scheduling
Benefits of Inter-Coflow Scheduling
Coflow 1 (one 3-unit flow on Link 1) and Coflow 2 (a (3-ε)-unit flow on Link 1 and a 6-unit flow on Link 2).
Fair Sharing: Coflow1 comp. time = 6, Coflow2 comp. time = 6
Flow-level Prioritization1,2: Coflow1 comp. time = 6, Coflow2 comp. time = 6
The Optimal: Coflow1 comp. time = 3, Coflow2 comp. time = 6
(Figure: three schedules with time on the X-axis and links L1, L2 on the Y-axis.)
But first, let us see the potential of inter-coflow scheduling through a simple example. We have two coflows: coflow1 in black, with one flow on link1, and coflow2, with two flows. Each block represents a unit of data, and assume it takes one unit of time to send each unit of data. Let's start by considering what happens today. Link1 will be almost equally shared between the two flows from the two different coflows, and the other link will be used entirely by coflow2's flow. After 6 time units, both coflows finish. Recently, there has been a lot of focus on minimizing flow completion times by prioritizing flows of smaller size. In that case, the orange flow on link1 is prioritized over the black flow and finishes after 3 time units. Note that coflow2 hasn't finished yet, because it still has 3 more data units to send on link2. Eventually, when all flows finish, the coflow completion times remain the same even though a flow completion time has improved. The optimal solution for application-level performance, in this case, is to let the black coflow finish first. As a result, we see an application-level improvement for coflow1 without any impact on the other coflow. In fact, it is quite easy to show that significantly decreasing flow completion times might still not result in any improvement in user experience.
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013
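To make the arithmetic above concrete, here is a minimal sketch in Scala (illustrative only; the representation and names are my assumptions, not Varys code). Running coflows back to back with strict per-link priority reproduces both the flow-prioritization and the optimal outcomes; fair sharing is stated directly, since under it every flow drains at t = 6.

```scala
// Illustrative sketch (not Varys code): completion times for the two-coflow
// example on two unit-capacity links; epsilon is dropped, so both link-1
// flows are treated as 3 units.
object InterCoflowExample {
  // Each coflow is a list of (link, size) pairs.
  val flows: Map[Int, List[(Int, Double)]] = Map(
    1 -> List((1, 3.0)),           // Coflow 1: one 3-unit flow on link 1
    2 -> List((1, 3.0), (2, 6.0))  // Coflow 2: flows on link 1 and link 2
  )

  // Run coflows back to back in `order` (strict priority per link); a
  // coflow completes when its last flow completes.
  def cct(order: List[Int]): Map[Int, Double] = {
    val busy = scala.collection.mutable.Map[Int, Double]().withDefaultValue(0.0)
    order.map { c =>
      val end = flows(c).map { case (link, size) =>
        busy(link) += size
        busy(link)
      }.max
      c -> end
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Fair sharing splits link 1, so every flow drains at t = 6:
    println(Map(1 -> 6.0, 2 -> 6.0))   // CCTs: C1 = 6, C2 = 6
    // Flow-level prioritization = smallest flow first on link 1:
    println(cct(List(2, 1)))           // CCTs: C1 = 6, C2 = 6
    // Coflow-aware (optimal here): finish Coflow 1 first:
    println(cct(List(1, 2)))           // CCTs: C1 = 3, C2 = 6
  }
}
```

The sums tell the story: total CCT is 12 under both fair sharing and flow-level prioritization, but 9 under the coflow-aware order.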
10
Inter-Coflow Scheduling
Benefits of Inter-Coflow Scheduling
(Recap of the example above: Fair Sharing and Flow-level Prioritization both give Coflow1 comp. time = 6 and Coflow2 comp. time = 6; The Optimal gives Coflow1 comp. time = 3 and Coflow2 comp. time = 6.)
Concurrent Open Shop Scheduling1: tasks on independent machines. Examples include job scheduling and caching blocks. Uses an ordering heuristic.
The inter-coflow scheduling problem has its roots in the concurrent open shop scheduling problem; data-parallel job scheduling and memory caching both fall under the umbrella of this problem. Like many other scheduling problems, it is NP-hard, and the solution approach is similar to other scheduling heuristics: one must come up with an ordering of coflows, and the effectiveness of the ordering heuristic determines the goodness of the solution. Now, unlike this simplistic example, flows do not run on independent links in a cluster.
1. A note on the complexity of the concurrent open shop problem, Journal of Scheduling, 9(4):389–396, 2006
11
Inter-Coflow Scheduling
Concurrent Open Shop Scheduling1: tasks on independent machines. Examples include job scheduling and caching blocks. Uses an ordering heuristic.
(Figure: a 3-machine DC fabric with ingress ports (machine uplinks) and egress ports (machine downlinks); the 6-, 3-, and (3-ε)-unit flows from the example mapped onto it.)
So, we must bring the network fabric into the picture. In this example, we have a 3-machine datacenter fabric. Each machine's incoming and outgoing links are shown as the ingress and egress ports of the fabric, and each input port has virtual output queues for each output port. The flows on link1 in our example may be going from input port 1 to output port 1, while the flow on link2 goes from input port 2 to output port 2. In essence, the performance of a coflow depends on its allocation at both the input and output ports of the fabric. In fact, we have discovered a brand-new scheduling problem, which we call concurrent open shop scheduling with coupled resources. We have provided the first characterization of this problem and proven that, due to the unique matching constraints observed in the network, this is one of the rare scheduling problems where Graham's list scheduling method won't work.
1. A note on the complexity of the concurrent open shop problem, Journal of Scheduling, 9(4):389–396, 2006
12
Inter-Coflow Scheduling
Concurrent Open Shop Scheduling with coupled resources is NP-Hard
Flows on dependent links. Consider ordering and matching constraints.
Characterized COSS-CR. Proved that list scheduling might not result in an optimal solution.
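What "coupled resources" means can be shown in a few lines. The sketch below (names and representation are my assumptions) checks the matching-style constraint the slide describes: a rate allocation is feasible only if the rates sum to at most the capacity at every ingress and every egress port, so every flow is constrained through both of its endpoints at once.

```scala
// Illustrative sketch of the coupling constraint (names are assumptions):
// rates must respect capacity at BOTH endpoints of every flow.
object CoupledResources {
  case class Alloc(srcPort: Int, dstPort: Int, rate: Double)

  def feasible(allocs: Seq[Alloc], capacity: Double = 1.0): Boolean = {
    val ingressOk = allocs.groupBy(_.srcPort).values
      .forall(_.map(_.rate).sum <= capacity)
    val egressOk = allocs.groupBy(_.dstPort).values
      .forall(_.map(_.rate).sum <= capacity)
    ingressOk && egressOk
  }

  def main(args: Array[String]): Unit = {
    // Two flows into the same egress port can use at most 1.0 in total.
    println(feasible(Seq(Alloc(1, 1, 0.6), Alloc(2, 1, 0.6)))) // false
    println(feasible(Seq(Alloc(1, 1, 0.6), Alloc(2, 1, 0.4)))) // true
  }
}
```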
13
Varys Employs a two-step algorithm to minimize coflow completion times
Ordering heuristic: keeps an ordered list of coflows to be scheduled, preempting if needed.
Allocation algorithm: allocates the minimum required resources to each coflow to finish in minimum time.
We propose a simple two-step greedy scheduling algorithm: first we order the coflows, and then we allocate rates to their individual flows. In the next few slides, I'll go over this two-step algorithm.
14
Ordering Heuristic: SEBF (Smallest-Effective-Bottleneck-First)
Two coflows across ports P1, P2, P3: C1 (grey, two flows) and C2 (orange, three flows).

            C1   C2
Length      3    4
Width       2    3
Size        5    12
Bottleneck  5    4

Ordering criteria: Shortest-First (by length), Narrowest-First (by width), Smallest-First (by size), Smallest-Effective-Bottleneck-First (by bottleneck).
(Figure: two schedules. Shortest-First: C1 ends at time 5, C2 ends at time 9. Smallest-Effective-Bottleneck-First: C2 ends at time 4, C1 ends at time 9.)
Let's take another example to demonstrate the pros and cons of different ordering schemes for the first step of our algorithm. In this example, we have two coflows: coflow1 in grey with two flows, and coflow2 in orange with three flows. If we apply the typical shortest-first heuristic to coflows, defining the length of a coflow as the length of its longest flow, coflow1 has length 3 and coflow2 has length 4. Let's bring the fabric back into the picture and see what happens when we choose the order coflow1 followed by coflow2. Both grey flows are going to output port 1; as a result, they compete with each other, and it takes at least 5 units of time for coflow1 to finish. The orange flows make some progress in the meantime, but the one going to output port 1 must wait until the grey flows have finished. All in all, coflow2 finishes after 9 time units, for a total coflow completion time of 14. Instead of length, we could also order by the width of a coflow, i.e., the number of its flows, or by the total size of a coflow. <EXPLAIN why these could be good choices.> In both cases, coflow1 would be scheduled first. However, observe that the completion of a coflow depends only on its bottleneck: coflow1's bottleneck, output port 1, must receive 5 units of data, whereas coflow2's bottlenecks have to receive 4 data units each. Hence, we propose the smallest-effective-bottleneck-first heuristic. In this case, coflow2 is scheduled first and finishes after 4 time units; coflow1 is scheduled next, for a total CCT of 13 time units. Although the improvement seems small for this particular example, as the number of concurrent jobs increases we get more opportunities for better ordering, and our competitive advantage grows. Note that in the single-link case this reduces to classic SRTF, just as a coflow reduces to a flow.
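The four criteria in the table are straightforward to compute. Here is a sketch under an assumed representation (a coflow as a list of (ingress port, egress port, size) flows, sized to match the example above); sorting by effective bottleneck yields the SEBF order, C2 before C1.

```scala
// Sketch of the four ordering criteria (the representation is an
// assumption, not the Varys data structures).
object OrderingHeuristics {
  case class Flow(src: Int, dst: Int, size: Double)

  def length(c: Seq[Flow]): Double    = c.map(_.size).max   // longest flow
  def width(c: Seq[Flow]): Int        = c.size              // number of flows
  def totalSize(c: Seq[Flow]): Double = c.map(_.size).sum   // total bytes

  // Effective bottleneck: the most heavily loaded ingress or egress port.
  def bottleneck(c: Seq[Flow]): Double = {
    val perIn  = c.groupBy(_.src).values.map(_.map(_.size).sum)
    val perOut = c.groupBy(_.dst).values.map(_.map(_.size).sum)
    (perIn ++ perOut).max
  }

  def main(args: Array[String]): Unit = {
    val c1 = Seq(Flow(1, 1, 2.0), Flow(2, 1, 3.0))                  // len 3, width 2, size 5, bottleneck 5
    val c2 = Seq(Flow(1, 2, 4.0), Flow(2, 3, 4.0), Flow(3, 1, 4.0)) // len 4, width 3, size 12, bottleneck 4
    // SEBF schedules the smallest effective bottleneck first:
    val order = Seq("C1" -> c1, "C2" -> c2).sortBy { case (_, c) => bottleneck(c) }
    println(order.map(_._1)) // List(C2, C1)
  }
}
```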
15
Ordering Heuristic: SEBF
(Animation of the previous slide's two schedules: Shortest-First with C1 ending at 5 and C2 at 9, versus Smallest-Effective-Bottleneck-First with C2 ending at 4 and C1 at 9.)
16
Allocation Algorithm: MADD
Ensure the minimum allocation to each flow for it to finish at the desired duration; for example, at the bottleneck's completion, or at the deadline.
Finishing flows faster than the bottleneck cannot decrease a coflow's completion time. A coflow cannot finish before its very last flow.
Once we have determined an order of coflows, the next step is to set the rates of their individual flows. Recall that a coflow cannot finish until all its flows have finished, which means that finishing flows earlier than a coflow's bottleneck cannot decrease its completion time. Hence, we propose the Minimum Allocation for Desired Duration (MADD) algorithm, which allocates the minimum rate to each flow of a coflow so that the flows either finish together with the bottleneck or all finish at the deadline, if a deadline is provided. Note that we must ensure that capacity isn't wasted between the ordering and rate-allocation steps; I'll skip the details of work-conserving backfilling in this talk. The bigger question is: how can you use coflows?
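The core of MADD fits in one line: given the time the bottleneck needs, every other flow of the coflow is slowed down to finish exactly then. A minimal sketch, with assumed names and without the work-conserving backfilling mentioned above:

```scala
// Minimal MADD sketch (assumed names; omits work-conserving backfilling):
// give each flow just enough rate to finish when the bottleneck does.
object Madd {
  // `sizes` are the remaining volumes of one coflow's flows; `duration` is
  // the desired finish time, e.g. the bottleneck's completion time or a
  // deadline.
  def minAllocation(sizes: Seq[Double], duration: Double): Seq[Double] =
    sizes.map(_ / duration)

  def main(args: Array[String]): Unit = {
    // The bottleneck needs 4 time units at rate 1.0; the other flows are
    // throttled so everything ends at t = 4, freeing capacity for others.
    println(minAllocation(Seq(4.0, 2.0, 1.0), duration = 4.0))
    // -> List(1.0, 0.5, 0.25)
  }
}
```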
17
Varys Enables frameworks to take advantage of coflow scheduling
Exposes the coflow API. Enforces through a centralized scheduler.
Varys provides a simple API that allows any data-parallel framework to express its communication requirements, and Varys enforces coflow scheduling through a centralized architecture. The details are in the paper, but I want to stress that the API requires changes only in the framework: ALL user jobs, already written or new, do not require ANY changes at all. Currently, we assume that flow sizes are known, which is true for many common frameworks that write their intermediate data to disk.
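To give a feel for the framework-side change, here is a hypothetical client in the spirit of Varys's register/put/get/unregister API; the trait and signatures below are illustrative assumptions, not the exact Varys interface.

```scala
// Hypothetical coflow client (signatures are assumptions, not the exact
// Varys interface). Only the framework calls this; user jobs are unchanged.
trait CoflowClient {
  def register(width: Int): String                                   // -> coflowId
  def put(coflowId: String, dataId: String, data: Array[Byte]): Unit
  def get(coflowId: String, dataId: String): Array[Byte]
  def unregister(coflowId: String): Unit
}

object ShuffleExample {
  // One coflow wraps the whole shuffle stage: senders put partitions,
  // receivers get them, and the central scheduler paces all the flows.
  def shuffle(client: CoflowClient, partitions: Map[String, Array[Byte]]): Unit = {
    val id = client.register(width = partitions.size)
    partitions.foreach { case (dataId, bytes) => client.put(id, dataId, bytes) }
    partitions.keys.foreach(dataId => client.get(id, dataId))
    client.unregister(id)
  }
}
```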
18
A 3000-node trace-driven simulation matched against a 100-node EC2 deployment
Evaluation. Does it improve performance? Can it beat non-preemptive solutions? YES
Varys is written in about 4500 lines of Scala, and we deployed it on 100 large-memory EC2 instances for performance evaluation. Each of these machines had 1Gbps NICs. You can find more details in the paper, but I want to touch on some high-level points in the next two slides. The obvious questions are: Does it improve performance? Can it beat non-preemptive solutions? And at what cost, meaning, will it starve coflows? The short answer to the first two is "yes, it can."
19
Faster Jobs

        Comm. Improv.              Job Improv.
        Comm.-Heavy1   All Jobs    Comm.-Heavy1   All Jobs
Avg.    3.16X          1.85X       2.50X          1.25X
95th    3.84X          1.74X       2.94X          1.15X

In our EC2 experiments we observed that, in comparison to per-flow fair sharing, Varys improved the communication performance of data-intensive jobs by 1.85X on average and the corresponding job completion times by 1.25X. This is because many jobs are not communication-intensive; for communication-intensive jobs, the improvements are even higher. (A second table on the slide broke jobs down by the fraction of their time spent in communication, from <25% to >74%; most of its values are not recoverable.) We also found that recent techniques for optimizing flow completion times can beat Varys on flow-level metrics, but they are significantly worse than Varys at optimizing application-level performance; in fact, they are worse than TCP at minimizing coflow completion times.
1. 26% of jobs spend at least 50% of their duration in communication stages.
20
Better than Non-Preemptive Solutions
Better than non-preemptive solutions w.r.t. FIFO1: Avg. 5.65X, 95th 7.70X
What about perpetual starvation? NO
Because Varys preempts large coflows with smaller ones to minimize the average CCT, there is a risk of starvation. One easy way to avoid starvation is to use FIFO scheduling; however, due to head-of-line blocking, FIFO-based schemes can be excessively bad, and Varys easily outperformed the FIFO-based schemes proposed several years ago. What about starvation? We observed no perpetual starvation in our experiments, for two reasons. First, we make sure that every coflow makes progress by giving it a small share; interestingly, this mechanism rarely kicked in during our experiments. The second is a conjecture: it has been shown that SRTF doesn't cause significant starvation when task sizes follow heavy-tailed distributions. While we haven't yet been able to prove a similar result for coflows, we do observe heavy-tailed distributions in coflow sizes and bottlenecks, and empirically we rarely had to invoke our starvation-prevention mechanism.
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011
21
Four Challenges
#1 Coflow Dependencies: multi-stage jobs; multi-wave stages
#2 Unknown Flow Information: pipelining between stages; task failures and restarts
#3 Decentralized Varys: master failure; low-latency analytics
…in the context of multipoint-to-multipoint coflows
General-purpose coflow scheduling raises many challenges, and at least four remain unanswered in this work. First, how do we handle dependencies between the multiple stages of a data-parallel job, and the multiple waves within a stage? Second, how do we emulate smallest-first scheduling without knowing flow sizes or the number of flows? Third, how do we handle master failure and support low-latency coflows? Baraat addresses all three in the context of tasks, i.e., in the context of point-to-multipoint or multipoint-to-point coflows. For data-parallel jobs, the more general scenario is multipoint-to-multipoint, where matching plays an important role in addition to ordering.
22
“Concurrent Open Shop Scheduling with Coupled Resources”
#4 The Theory Behind “Concurrent Open Shop Scheduling with Coupled Resources”
We introduced the COSS-CR problem, showed it to be strongly NP-hard, and proved that, unlike for many scheduling problems, there exist algorithms better than Graham's list scheduling approach for this problem. Just last month, a group led by Prof. Yuan Zhong in Columbia University's operations research department found the first polynomial-time constant-factor approximation algorithm for COSS-CR. While the algorithm has high time complexity and the bound is loose (64/3), it gives an upper bound on how well we can do, and it introduces new techniques for analyzing this new scheduling problem. As the theory community picks it up, more results might follow.
23
Varys
Greedily schedules coflows without worrying about flow-level metrics
Consolidates network optimization of data-intensive frameworks
Improves job performance by addressing the COSS-CR problem
Increases predictability through informed admission control
In conclusion, Varys enables general-purpose data-parallel frameworks to transparently take advantage of coflows, improving job-level performance and predictability. We have discovered a new scheduling problem and opened up a new avenue of scheduling research, with some exciting early results.
Mosharaf Chowdhury