1
An Elegant Sufficiency: Load-Aware Differentiated Scheduling of Data Transfers
Raj Kettimuthu, Gayane Vardoyan, Gagan Agrawal, P. Sadayappan and Ian Foster
2
Data Deluge Light Source Facilities Cosmology Genomics Climate
Data deluge is happening in almost all science domains. For example, in cosmology the DES telescope in Chile captures terabytes of data per night, and another cosmology project, SKA, will generate an exabyte every 13 days once it becomes operational. DOE light source facilities generate tens of terabytes of data per day today, a rate poised to increase by two orders of magnitude in the next few years. Genomic data sets have been growing at roughly 5X per year for the past several years. Climate data is projected to exceed hundreds of exabytes by 2020.
3
Data Need to be Moved
Experimental or observational facility may not have large-scale storage (e.g., Dark Energy Survey)
Computing power required for data analysis not available locally (e.g., light source facilities)
Specialized computing systems required for analysis (e.g., visualization of large data sets)
Data collected from multiple sources for analysis (a common requirement in genomics)
Data replicated for efficient distribution, disaster recovery, and other policy reasons (e.g., climate data, data produced at the Large Hadron Collider)
Data collected in these and other science domains often need to be moved over the wide-area network for a variety of reasons. In the case of DES, data collected by the telescope in Chile are moved to the NCSA supercomputing facility in Illinois. As data volumes increase, local compute resources become insufficient and the data need to be moved to a remote computing facility for analysis. In some cases, simulation results have to be moved to a remote cluster for visualization. In some domains, data need to be collected from multiple sources for analysis, a common requirement in genomics. Or the data are replicated for efficient distribution, disaster recovery, or other reasons; for example, key climate data sets are mirrored at multiple sites for efficient distribution.
4
State of the Art
Concurrent transfers often required to achieve high aggregate throughput
Current practice: schedule each request immediately with fixed concurrency
This approach has disadvantages:
Under heavy load, completion times of all transfer tasks can suffer
Low utilization when the number of transfers is small
Best-effort service for all transfers
Need for efficient scheduling of multiple transfers
Improve aggregate performance and performance of individual flows
Differentiated service for different transfer types
Concurrent file transfers are often required to achieve higher aggregate file transfer throughput on the network as well as the parallel storage system. State-of-the-art transfer tools and transfer services make use of concurrency, but they schedule each request immediately with fixed concurrency. First, under heavy load the completion times of all transfer tasks can suffer. Second, not increasing concurrency when the number of pending transfers is small can reduce overall utilization. Furthermore, these tools provide best-effort service for all transfers. There is a need for efficient scheduling of multiple transfers to improve both aggregate performance and the performance of individual flows, as well as to provide differentiated service for different transfer types; that is what we focus on in this dissertation.
5
SchEduler Aware of Load: SEAL
Our Contributions
SchEduler Aware of Load (SEAL): controls scheduled load to maximize performance
Scheduler TypE Aware and Load aware (STEAL): differential treatment of best-effort & batch jobs
Specifically, our contributions are three new file transfer scheduling algorithms that address the different needs and scenarios in science environments that we looked at in the previous slides. The first algorithm, SEAL, is a load-aware scheduling algorithm that adapts transfer schedules and concurrency based on system load to maximize performance. Our second algorithm, STEAL, supports differential treatment of best-effort and batch transfers. Our third algorithm, RESEAL, supports differential treatment of response-critical and best-effort transfers. To schedule file transfers efficiently, we need a mechanism to estimate and control transfer throughput, so we also develop models for transfer throughput in terms of a few key parameters, using a data-driven approach.
6
SEAL Motivation – Shared Resources
The shared resources involved in an end-to-end transfer are often not managed explicitly as schedulable resources. Thus, alternative enforcement methods are required, based for example on admission control of requests rather than explicit resource allocation. Second, resources may be subject to arbitrary external load that is neither under our control nor directly visible to us; thus we must use different methods for determining resource availability, such as monitoring of historical and recent performance. (Diagram: source and destination Data Transfer Nodes, each connected to its storage through a SAN.)
7
SEAL Motivation – Concurrency Trends
Third, we already noted that the aggregate bandwidth achieved over a network or at a parallel storage system is typically greater when multiple transfers occur at the same time. Thus, it can be advantageous to schedule multiple transfers at once, or, if a system is not saturated, to divide large files into multiple chunks and transfer them concurrently. However, the benefits of such increased concurrency do not grow beyond a threshold; indeed, all transfers ultimately experience slowdown when concurrency is too high. So we cannot blindly increase concurrency.
8
SEAL Motivation – Load Varies Greatly
Analysis of wide-area transfer logs reveals that traffic is typically nonuniform, with bursts saturating system resources. So any effective scheduling algorithm must be able to deal with widely varying load levels. We leverage all these properties of transfers in SEAL for efficient scheduling of a set of transfers.
9
SEAL
Monitors external load and controls scheduled load to minimize average slowdown across all transfers
Preempts and/or delays transfers to reduce average slowdown (for example, under heavy load)
Increases concurrency for a file transfer if it can increase aggregate performance (for example, under low load)
Metrics:
Turnaround time = completion time − arrival time
Job slowdown = factor by which a job is slowed relative to the time on an unloaded system: turnaround time / processing time
TTideal = estimated transfer time under zero load, computed using the throughput models
Next, we define metrics for data transfers. In (compute) job scheduling, response time or turnaround time has long been used as a measure of scheduler quality. More recently, job slowdown has emerged as a more suitable measure. Job slowdown is the factor by which a job is slowed relative to the time it would take on an unloaded system. To limit the influence of extremely short jobs on the slowdown metric, the slowdown of such jobs is measured relative to an interactive threshold or bound, rather than the actual runtime. In traditional parallel job scheduling, nodes are used in a dedicated fashion, so the runtime of a job on a given number of nodes can typically be treated as fixed. In contrast, the time it takes to move a file from source to destination can vary according to other loads on the system, as all resources involved are shared. (Formulas shown on the slide: bounded slowdown in parallel job scheduling; bounded slowdown for wide-area transfers, with TTideal the estimated transfer time under zero load.)
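For reference, a hedged reconstruction of the two bounded-slowdown formulas referred to on this slide: the first is the standard definition from parallel job scheduling, and the second is the transfer analogue implied by the bullets. The notation (Twait, Trun, the interactive threshold τ, and TTideal) is ours, not taken verbatim from the slides.

```latex
% Bounded slowdown as commonly used in parallel job scheduling,
% and the analogous quantity for wide-area transfers implied by this slide.
% Notation is ours: T_wait = wait time, T_run = runtime, \tau = interactive
% threshold, TT_ideal = estimated transfer time under zero load.
\[
  \mathrm{BSLD}_{\text{job}} \;=\; \max\!\left(1,\;
      \frac{T_{\text{wait}} + T_{\text{run}}}{\max(T_{\text{run}},\,\tau)}\right)
\]
\[
  \mathrm{BSLD}_{\text{transfer}} \;=\; \max\!\left(1,\;
      \frac{\text{turnaround time}}{\max(TT_{\text{ideal}},\,\tau)}\right)
\]
```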
10
SEAL – Problem definition
Stream of file transfer requests: <source host, source file path, destination host, destination file path, file size, arrival time>
Future requests not known a priori
Hosts have different capabilities (CPU, memory, disk, SAN, network interfaces, WAN connection)
Each host has a maximum concurrency limit
Maximum achievable throughput differs for each <source host, destination host> pair
Load at a source, destination, and intervening network varies over time
Goal: schedule transfers to minimize average transfer slowdown
We consider a stream of file transfer requests, each defined by a six-tuple: <source host, source file path, destination host, destination file path, file size, arrival time>. Requests arrive in an online fashion, i.e., future transfer requests are not known a priori. Hosts may have different capabilities (CPU, memory, disk speed, storage area network, network interfaces, WAN connection), and thus the maximum achievable end-to-end throughput may differ for each <source host, destination host> pair. Load at a source, destination, and intervening network may also vary over time, as may the achievable transfer rates between a source and destination. Each host (source or destination) has a limit on the number of concurrent transfers that it can support. The problem is to schedule transfers so as to minimize average transfer slowdown.
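A minimal sketch of how one might represent a request from this stream in code; the field names and types are illustrative choices, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class TransferRequest:
    """One entry in the online stream of file transfer requests.

    The paper specifies only the six-tuple contents:
    <source host, source file path, destination host, destination file path,
     file size, arrival time>. Field names here are our own.
    """
    src_host: str
    src_path: str
    dst_host: str
    dst_path: str
    size_bytes: int
    arrival_time: float  # seconds since the start of the trace
```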
11
SEAL Algorithm – Main Ideas
Queues transfers to bound concurrency in high-load situations
Increases transfer concurrency during low-load situations
Prioritizes transfers based on their expected slowdown
Four key decisions:
Should a new transfer be scheduled or queued?
If scheduled, what concurrency should be used?
When should a transfer be preempted?
When should the concurrency of an active transfer be changed?
Uses both the models and the observed performance of current transfers
xfactorFT = estimated transfer time under current load / estimated transfer time under zero load
SEAL schedules transfer requests adaptively based on the load. It queues transfers so as to bound concurrency during high-load situations and increases transfer concurrency during low-load situations. Because SEAL aims to reduce the slowdown of jobs, it uses the expected slowdown, or xfactor, of a job to prioritize jobs. We define the expansion factor for a file transfer job, xfactorFT, as the ratio of the estimated transfer time under current load to the estimated transfer time under zero load. Overall, SEAL involves four decisions. In making these decisions, it uses both the models described earlier and the observed performance of current transfers.
12
Data Driven Models
Combines historical data with a correction term for current external load
Takes three pieces of input:
Signature for a given transfer: concurrency level, total known concurrency at source ("known load at source"), total known concurrency at destination ("known load at destination")
File size
Historical data: transfer concurrency, known loads, and observed throughput for the source-destination pair
Signatures and observed throughputs from the most recent transfers for the source-destination pair
Produces an estimated throughput as output
Our model combines extensive historical data with a correction term that accounts for current external load. It takes three pieces of input. First, a signature for a given transfer, encompassing its concurrency level, total known concurrency at source ("known load at source"), and total known concurrency at destination ("known load at destination"). Second, historical data (transfer concurrency, known loads, and observed throughput) of past transfers for the source-destination pair corresponding to the given transfer. Third, information (signatures and observed throughputs) from the most recent transfers for the source-destination pair. It produces an estimated throughput as output.
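A rough sketch of how the inputs described above could be combined into a throughput estimate. The paper does not specify the model form here, so the choice of a linear regression over historical signatures and the ratio-based correction term for recent observations are assumptions, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def estimate_throughput(signature, history, recent):
    """Estimate throughput for a transfer on a given source-destination pair.

    signature: (concurrency, known_load_src, known_load_dst) for the new transfer
    history:   list of (concurrency, known_load_src, known_load_dst, throughput)
               tuples for past transfers on this pair
    recent:    list of (signature, observed_throughput) for the most recent
               transfers on this pair, used as a proxy for current external load
    """
    X = np.array([h[:3] for h in history], dtype=float)
    y = np.array([h[3] for h in history], dtype=float)
    model = LinearRegression().fit(X, y)          # fit on historical data

    base = float(model.predict(np.array([signature], dtype=float))[0])

    # Correction term: how recently observed throughput compares with what the
    # historical model predicted for those same signatures.
    if recent:
        preds = model.predict(np.array([s for s, _ in recent], dtype=float))
        obs = np.array([t for _, t in recent], dtype=float)
        correction = float(np.mean(obs / np.maximum(preds, 1e-9)))
    else:
        correction = 1.0

    return max(base * correction, 0.0)
```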
13
SEAL Algorithm – Main ideas
for task = W.peek() do
    if !saturated(src) and !saturated(dst) then
        Schedule task
    else
        if saturated(src) then
            CLsrc = FindTasksToPreempt(src, task)
        end if
        if saturated(dst) then
            CLdst = FindTasksToPreempt(dst, task)
        end if
        Preempt tasks in CLsrc ∪ CLdst and schedule task
    end if
end for
SEAL always schedules waiting transfers if neither source nor destination is saturated, using our models to determine the concurrency for any transfer that is scheduled. It determines endpoint saturation using both the observed performance and the models. If the source or destination is saturated, SEAL interrupts one or more active transfers to service waiting requests, if doing so can reduce overall average slowdown. Third, the algorithm dynamically increases the concurrency of ongoing transfers if two conditions are met: there are no queued transfers, and bandwidth has become available due to the completion of other transfers. We maintain a moving five-second average of observed throughput for each transfer for this purpose. Concurrency is not increased when it would yield only a proportionately insignificant increase in estimated throughput on the active flows involving that endpoint.
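Below is a Python rendering of the dispatch step in the pseudocode above, with the helper routines left abstract. The saturation checks, preemption-candidate search, and concurrency choice are the mechanisms described in the notes; the early exit when no useful preemption exists is our reading of "if it can reduce overall average slowdown", not something spelled out on the slide.

```python
def schedule_waiting(waiting, saturated, find_tasks_to_preempt, schedule, preempt):
    """Sketch of SEAL's dispatch step over the priority-ordered waiting queue.

    All callables are placeholders: saturation checks use both observed
    performance and the throughput models; schedule() picks the concurrency.
    """
    while waiting:
        task = waiting[0]                      # peek highest-priority (by xfactor)
        src, dst = task.src_host, task.dst_host

        if not saturated(src) and not saturated(dst):
            schedule(waiting.pop(0))           # neither endpoint saturated
            continue

        victims = set()
        if saturated(src):
            victims |= set(find_tasks_to_preempt(src, task))
        if saturated(dst):
            victims |= set(find_tasks_to_preempt(dst, task))

        if victims:                            # preempt only if it lowers avg slowdown
            for v in victims:
                preempt(v)
            schedule(waiting.pop(0))
        else:
            break                              # no useful preemption; keep waiting
```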
14
SEAL Algorithm - Illustration
One source and one destination
Total bandwidth: 5 Gbps
Width = expected runtime; height = aggregate throughput
Task T1 arrives at time t = 0 and is scheduled immediately, since no other tasks are running. Suppose that of the available 5 Gbps bandwidth, a single task can utilize 4 Gbps. At time t = 1, task T2 arrives. T2 is immediately scheduled since the source and destination are not saturated, but the total available bandwidth is now split between the two tasks, 2.5 Gbps each, which results in higher TTloads and xfactors. At time t = 3, task T3 arrives, but is made to wait because the system is saturated. At time t = 5, T3's xfactor becomes large enough to preempt T1. At time t = 10.76, T2 completes and T1 is re-scheduled. At time t = 11.4, T3 completes; T1 has 1.1 GB left to transfer, but can now go back to a 4 Gbps transfer rate. T1 completes at time t = 13.6. The average turnaround time for the adaptive scheme is then 10.92. A default baseline scheme that starts jobs as soon as they arrive would behave the same as the adaptive scheme until T3 arrives. In the baseline, T3 is scheduled upon arrival, and all three tasks transfer at a rate of 1.67 Gbps. The baseline's average turnaround is 12.04.
Average turnaround time is 10.68; average turnaround time for baseline is 12.03.
15
SEAL Evaluation - Experimental setup
(Map of endpoints: TACC, SDSC, NCAR, PSC, NICS, Indiana.) For our experiments, we use Stampede, a supercomputer at the Texas Advanced Computing Center (TACC), as the source and five other major supercomputing centers as the destinations. Specifically, the destinations are Blacklight, a compute cluster at the Pittsburgh Supercomputing Center (PSC); Kraken, a supercomputer at the University of Tennessee (NICS); Gordon, a compute cluster at the San Diego Supercomputer Center (SDSC); Mason, a compute cluster at Indiana University; and Yellowstone, a supercomputer at the National Center for Atmospheric Research (NCAR).
16
SEAL Evaluation - Workload traces
Traces from actual executions
Anonymized GridFTP usage statistics
Top 10 busy servers in a one-month period; the day the most bytes were transferred by those servers; the busiest (among the 10) server's log on that day
Length of logs limited due to the production environment
Three 15-minute logs: 25%, 45%, and 60% load traces (load = total bytes transferred / maximum that can be transferred)
Endpoints anonymized in logs; weighted random split based on capacities
To allow repeatable experiments, we used real traces as workloads. We obtained these traces from the anonymized usage statistics that Globus GridFTP servers send to a usage collector. We obtained our traces by first selecting the 10 servers that transferred the most bytes in a one-month period, then picking the day in that month on which the most bytes were transferred by those servers, and finally using the log from the one server among those 10 that transferred the most bytes on that day. Since our execution environment is a production infrastructure in continuous use, we were limited in the length of our experiments. Thus, we selected from the chosen 24-hour log three 15-minute traces with different loads. We define load as the total volume of all file transfers in the 15-minute trace, divided by the maximum amount of data that the source can transfer in a 15-minute period. We looked at all non-overlapping 15-minute windows in the 24-hour period and picked one with the same average load as the entire workload (25%); the coefficient of variation of 1-minute average concurrent transfers was approximately the same, too. We also picked the window with the highest load (~60%), and one with ~45% load, which is in between 25% and 60%. The Globus usage collector does not record destination identifiers; to address this, we randomly split transfers among the five destinations, with weighted probability based on their capacities.
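The load definition above, written out; the notation is ours, with R_max the source's maximum achievable transfer rate and 900 s simply the length of the 15-minute window.

```latex
% Load of a 15-minute trace window (notation ours).
\[
  \text{load} \;=\; \frac{\sum_{i \in \text{window}} \text{bytes}_i}
                         {R_{\max} \times 900\,\text{s}}
\]
```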
17
SEAL Evaluation – Turnaround Time 60% Load
We compare the performance of SEAL with that of three baseline algorithms. The first, BaseCC1, replicates the common current practice of scheduling each file transfer as it is submitted and using only parallel TCP streams to improve performance. The other two baseline versions, BaseCC2 and BaseCC4, use both parallel storage operations and parallel TCP streams for individual transfers, with static per-file concurrency settings of 2 and 4, respectively. For 10MB files, we use a concurrency (CC) of 1 for all algorithms, since splitting such small files can only hurt performance. We compared average slowdown and turnaround time (TAT), as well as worst-case slowdown and TATs, for different file size categories. This figure shows the average TAT values for the log with 60% load. SEAL performs better than all baseline algorithms, both overall and for each individual file size category and trace. SEAL even performs better for 10MB tasks, for which all algorithms use the same parameters, because SEAL's load awareness postpones larger tasks when load is high. Also, blindly using higher CC for >10MB tasks (along with scheduling all tasks upon arrival) hurts 10MB tasks (BaseCC4 is almost always worst there). Even though the priority for larger tasks (1-10G) grows relatively slowly in SEAL, those tasks still benefit from load-aware scheduling and dynamic adjustment of concurrency. The 10-100M and 100M-1G tasks benefit from relatively higher prioritization than 1-10G tasks, from load-aware scheduling, and from dynamic scaling of concurrency; but the range of concurrency values for 10-100M tasks is limited, as splitting these relatively small files into too many pieces can hurt performance. Thus, 100M-1G tasks get the most benefit from SEAL. SEAL outperforms the best baseline algorithm by 25%.
18
SEAL Evaluation – Worst Case Performance 60% Load
In terms of worst-case slowdown and turnaround time, SEAL reduces them by as much as 70% for some categories. The factors that help SEAL perform consistently better include load awareness, prioritization of tasks based on their slowdowns, and sufficient but not excessive concurrency.
19
SEAL Evaluation - SEAL vs Improved Baseline – 60% Load
Looking at the results of the 60% trace, one might argue that a baseline algorithm that simply uses a different concurrency based on file size might perform close to SEAL. So we compared SEAL with an improved baseline, called BaseVary, that uses different concurrency values for different file sizes. SEAL also outperforms BaseVary for all task categories: by >= 20% overall and >= 40% for 100M-1G.
20
STEAL Motivation - Transfers Have Different Needs
Certain transfer requests require best-effort service; some requests tolerate larger delays
Science network requirements reports provide several use cases
Replication is often relatively time-insensitive: science data does not change rapidly; an order-of-magnitude longer response time than the average transfer time is acceptable
Migration due to changes in storage system availability
Batch transfers
STEAL: differential treatment of best-effort and batch transfers
First, I will discuss the motivation for STEAL. While certain transfer requests must be processed rapidly, others can tolerate larger delays. Recent reports on science network requirements give replication as a common, often relatively time-insensitive, reason for moving large quantities of data, whether for performance, fault tolerance, and/or preservation. For example, one of the reports describes replication use cases where a terabyte dataset must be delivered overnight. Because subsequent processing involves manual steps, there is no advantage in completing the transfer earlier. A terabyte of data can be transferred in under 45 minutes at 3 Gbps, a disk-to-disk WAN transfer rate that is commonly achieved between endpoint pairs in today's HPC environment. Thus, transfer times can vary by at least an order of magnitude without compromising science goals. Another motivator for delay-tolerant large-scale data movement can be changes in storage system availability, such as a storage system reaching capacity or shutting down, or a storage allocation expiring. We classify those transfers for which it is acceptable for transfer times to be significantly (an order of magnitude) longer than average as batch transfers, and all other transfers as best-effort transfers. We will leverage the flexibility of batch transfers to improve the performance of interactive transfers, while also maximizing spare bandwidth utilization for batch transfers.
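As a quick sanity check on the "under 45 minutes at 3 Gbps" figure above, assuming a decimal terabyte:

```latex
% Time to move 1 TB at a sustained 3 Gbps (decimal units assumed).
\[
  \frac{1\,\text{TB}}{3\,\text{Gb/s}}
  = \frac{8 \times 10^{12}\ \text{bits}}{3 \times 10^{9}\ \text{bits/s}}
  \approx 2{,}667\ \text{s} \approx 44.4\ \text{minutes}
\]
```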
21
STEAL - Metrics Batch transfers Bi-objective problem
Acceptable to delay batch transfers; slowdown is not a suitable metric; they should use as much unused bandwidth as possible
Bi-objective problem:
Maximize BB / (BT − BI) for batch jobs, where BT = total bandwidth, BI = bandwidth used by best-effort jobs, BB = bandwidth used by batch jobs
Maximize SDI / SDI+B for best-effort jobs, where SDI = average slowdown of best-effort jobs with no batch jobs, SDI+B = average slowdown of best-effort jobs with batch jobs
Next, I will define metrics for evaluating STEAL. Since batch transfers can tolerate longer delays, it is acceptable to delay them relative to interactive transfers. Thus, bounded slowdown by itself is not a suitable metric for batch transfers. Instead, we define a bi-objective scheduling problem. First, as we want batch transfers to use as much unused bandwidth as possible, we focus on the fraction of the spare bandwidth used by batch transfers. More specifically, if BT is the total bandwidth available, BI is the bandwidth consumed by interactive jobs, and BB is the bandwidth used for batch jobs, we aim to maximize BB / (BT − BI). A scheduler could maximize BB / (BT − BI) just by prioritizing batch jobs, so we must introduce a second objective. Suppose that there are no batch jobs in the system, and the average slowdown for interactive jobs is SDI. Next, the scheduler adds a set of batch jobs, but still prioritizes interactive jobs. Let the average slowdown for interactive jobs now be SDI+B. Our second objective is to maximize SDI / SDI+B, i.e., to achieve a value as close to 1 as possible.
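The two objectives written out with subscripts, using the same symbols defined on this slide:

```latex
% STEAL's bi-objective formulation, using the symbols defined on this slide.
\[
  \max \; \frac{B_B}{B_T - B_I}
  \quad\text{(batch jobs use as much of the spare bandwidth as possible)}
\]
\[
  \max \; \frac{SD_I}{SD_{I+B}}
  \quad\text{(best-effort slowdown disturbed as little as possible; ideal value is 1)}
\]
```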
22
STEAL Algorithm – Main Ideas
Priority for best-effort transfers; lower priority for batch tasks: xfactorFT × a small fraction
Best-effort tasks preempt batch tasks before low-priority best-effort tasks
A batch task can switch to best-effort when its xfactor goes above a certain value
No preemption of batch tasks by other batch tasks
Improves bandwidth utilization for batch tasks
Preemption by best-effort tasks helps higher-priority batch tasks
We still prioritize batch tasks by xfactor, but multiply their xfactor values by a small fraction to lower their priority relative to interactive tasks. Best-effort tasks preempt batch tasks before low-priority best-effort tasks. To avoid indefinite delay, a batch task can be set to switch to interactive after a certain amount of time. To maximize the use of excess bandwidth by batch tasks, STEAL eliminates preemption of batch tasks by other batch tasks. Thus, a waiting batch task B2 cannot directly preempt a running batch task B1 even when B2's priority becomes higher than B1's. (Assuming no other spare capacity, B2's next chance to run will thus be when B1 either completes or is preempted by an interactive task.)
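A rough sketch of the prioritization and preemption ordering described above. The discount factor value, the promotion rule, and all names are illustrative assumptions; the slide says only that batch xfactors are scaled by a small fraction and that batch tasks never preempt other batch tasks.

```python
BEST_EFFORT = "best_effort"
BATCH = "batch"

def steal_priority(task, xfactor, batch_discount=0.1, promote_threshold=None):
    """Priority of a waiting task under STEAL (sketch).

    xfactor is SEAL's expansion factor for the task. Batch tasks have it
    scaled down by a small fraction; optionally, a batch task whose xfactor
    exceeds a threshold is promoted and treated as best-effort.
    The 0.1 discount and the promotion rule are illustrative assumptions.
    """
    if task.kind == BATCH:
        if promote_threshold is not None and xfactor >= promote_threshold:
            return xfactor                  # promoted: compete as best-effort
        return batch_discount * xfactor
    return xfactor

def preemption_candidates(waiting, running, prio):
    """Which running tasks a waiting task may preempt (sketch).

    Best-effort tasks preempt batch tasks first, then lower-priority
    best-effort tasks. Batch tasks preempt nothing; they run only on
    spare capacity.
    """
    if waiting.kind == BATCH:
        return []
    batch = [t for t in running if t.kind == BATCH]
    lower_be = [t for t in running
                if t.kind == BEST_EFFORT and prio(t) < prio(waiting)]
    # batch victims first, then best-effort victims from lowest priority up
    return sorted(batch, key=prio) + sorted(lower_be, key=prio)
```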
23
STEAL Evaluation - Workload traces
Three 15-minute logs (25%, 45%, and 60% load traces) plus a 60% high-variation (60%-HV) trace with greater variation in the best-effort load
Original tasks treated as best-effort; 50GB batch tasks added to consume unused bandwidth
Batch tasks available at the start of the schedule
SEAL2: batch tasks' xfactors increase as if they had arrived an hour before the start of the schedule
We use four traces to evaluate STEAL: the 25%, 45%, and 60% traces described for SEAL, plus a 60% high-variation (60%-HV) trace with greater variation in the load due to interactive tasks. For each trace, we defined the original tasks to be interactive and added enough 50GB batch tasks to consume the bandwidth unused by interactive tasks over the 15-minute duration. We make the batch tasks available to be scheduled right from the beginning of the 15-minute period. Because SEAL does not distinguish between interactive and batch tasks, the larger size of the batch tasks in our experiments means that their priorities (xfactors) increase at a slower rate than those of interactive tasks (for the same or similar endpoints) that arrive at the same time. Therefore, even under SEAL, batch jobs may often be preempted and/or queued until their wait time becomes high.
24
STEAL Evaluation – 60% Load
We evaluate STEAL for α = {1, 0.9, 0.8}, indicating that batch tasks can use up to {100%, 90%, 80%}, respectively, of the bandwidth unused by interactive jobs. We also evaluate BaseVary and two SEAL variants, SEAL1 and SEAL2, with the following motivation for the latter. Given that we cannot perform long experiments, we define SEAL1 and SEAL2 as follows. In SEAL1, batch tasks arrive at the start of the schedule for each 15-minute trace. In SEAL2, we increase xfactors as if the batch tasks had arrived an hour before the start of the schedule for the 15-minute traces under consideration. Thus interactive tasks have an advantage in SEAL1 and batch tasks have an advantage in SEAL2. Note that the best performance is in the upper-right corner in these graphs. Figure 6 shows that STEAL (for the different α values) performs significantly better for interactive tasks, in terms of slowdown (y-axis), than SEAL and BaseVary, as it explicitly prioritizes interactive over batch tasks. STEAL with α = 1 is also better than BaseVary, and comparable to SEAL1 and SEAL2, in its use of spare bandwidth for batch tasks (x-axis).
25
Related work and Summary
Parallel job scheduling extensively studied; file transfer scheduling has received less attention and has significant differences
Adaptive replica selection, algorithms to utilize multiple paths
Ability to control the network path, overlay networks
GARA: differentiated service (DS) to schedule file transfers of differing priority
These works do not consider concurrency
Two new scheduling algorithms for efficient online scheduling of wide-area file transfers
SEAL improves slowdown for transfers by adjusting concurrency (CC) based on load
STEAL gives unused bandwidth to batch transfers while remaining nonintrusive to best-effort transfers
Evaluated using real traces on a production system, with significant improvements over the state of the art
The major difference between those studies and our work is that they focus on the optimal number of streams to use to get the maximum throughput on the network. In comparison, we attempt to model GridFTP transfer throughput in terms of a few key parameters.
26
Questions