Modeling and Adaptive Scheduling of Large-Scale Wide-Area Data Transfers. Raj Kettimuthu. Advisors: Gagan Agrawal, P. Sadayappan.

Exploding data volumes
- Astronomy: MACHO et al. 1 TB, Palomar 3 TB, 2MASS 10 TB, GALEX 30 TB, Sloan 40 TB, Pan-STARRS 40,000 TB (chart scale up to 100,000 TB)
- Climate: 36 TB in 2004, 3,300 TB in 2014
- Genomics: 10^5 increase in data volumes in 6 years

Data movement (diagram: Data Transfer Node in front of Storage)

Current work
- Understand characteristics of, control, and optimize transfers
- Efficient scheduling of wide-area transfers
- Model to predict and control throughput
  - Characterize transfers and identify key features
  - Data-driven modeling using experimental data
- Adaptive scheduling
  - Algorithm to minimize slowdown
  - Experimental evaluation using real transfer logs

GridFTP
- High-performance, secure data transfer protocol optimized for high-bandwidth wide-area networks
- Parallel TCP streams; PKI security for authentication, integrity, and encryption; checkpointing for transfer restarts
- Based on the FTP protocol; defines extensions for high-performance operation and security
- The Globus implementation of GridFTP is widely used
- Globus GridFTP servers support usage-statistics collection
  - Transfer type, size in bytes, start time of the transfer, transfer duration, etc. are collected for each transfer

GridFTP usage log

Parallelism vs concurrency in GridFTP (diagram: a GridFTP client opens control channels on port 2811 to the GridFTP daemons at Site A and Site B; each Data Transfer Node sits in front of a parallel file system; concurrency = 2 concurrent transfers, each handled by its own pair of GridFTP server processes; parallelism = 3 TCP connections per transfer)

Parallelism vs concurrency

Model throughput and control bandwidth allocation
- Objective: control bandwidth allocation for transfer(s) from a source to its destination(s)
- Most large transfers are between supercomputers, which have the ability to both store and process large amounts of data
- When a site is heavily loaded, most of its bandwidth is consumed by a small number of sites
- Goal: develop a simple model for GridFTP
  - Source concurrency (SC): total number of ongoing transfers between endpoint A and all of its major transfer endpoints
  - Destination concurrency (DC): total number of ongoing transfers between endpoint A and endpoint B
  - External load (EL): all other activities on the endpoints, including transfers to other sites
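
As a concrete reading of these definitions, here is a minimal Python sketch (function names and endpoint labels are illustrative, not from the talk) that computes SC and DC from a list of ongoing transfers:

```python
# Illustrative only: compute the two controllable model inputs from a list of
# ongoing transfers, each given as a (source, destination) pair.

def source_concurrency(ongoing, src):
    """SC: total ongoing transfers from endpoint `src` to all endpoints."""
    return sum(1 for s, _ in ongoing if s == src)

def destination_concurrency(ongoing, src, dst):
    """DC: ongoing transfers between endpoint `src` and endpoint `dst`."""
    return sum(1 for s, d in ongoing if s == src and d == dst)

# Example: three transfers from A, two of them to B.
ongoing = [("A", "B"), ("A", "B"), ("A", "C")]
print(source_concurrency(ongoing, "A"))             # 3
print(destination_concurrency(ongoing, "A", "B"))   # 2
```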

Modeling throughput
- Model destination throughput (DT) using source and destination concurrency
- Data to train and validate the models comes from load-variation experiments
- Linear models: Y' = a1*X1 + a2*X2 + … + ak*Xk + b
  - DT = a1*DC + a2*SC + b1
  - DT = a3*(DC/SC) + b2
  - Errors >15% for most cases
- Log model: log(DT) = a4*log(SC) + a5*log(DC) + b3
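
A minimal sketch of how models of this form could be fit with ordinary least squares (numpy only; the column layout and example values are assumptions, not the talk's actual training data or code):

```python
import numpy as np

def fit_linear(sc, dc, dt):
    """Fit DT = a1*DC + a2*SC + b1 by least squares; returns [a1, a2, b1]."""
    X = np.column_stack([dc, sc, np.ones_like(dc)])
    coeffs, *_ = np.linalg.lstsq(X, dt, rcond=None)
    return coeffs

def fit_log(sc, dc, dt):
    """Fit log(DT) = a4*log(SC) + a5*log(DC) + b3; returns [a4, a5, b3]."""
    X = np.column_stack([np.log(sc), np.log(dc), np.ones_like(sc)])
    coeffs, *_ = np.linalg.lstsq(X, np.log(dt), rcond=None)
    return coeffs

def predict_log(coeffs, sc, dc):
    a4, a5, b3 = coeffs
    return np.exp(a4 * np.log(sc) + a5 * np.log(dc) + b3)

# Example usage with made-up numbers (throughputs in Gbps):
sc = np.array([1.0, 2.0, 4.0, 8.0])
dc = np.array([1.0, 1.0, 2.0, 4.0])
dt = np.array([0.9, 1.6, 2.8, 4.1])
print(predict_log(fit_log(sc, dc, dt), sc, dc))
```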

Modeling throughput
- The log model is better than the linear models, but errors are still high
- A model based on just SC and DC is too simplistic
- Incorporate external load
  - External load: network, disk, and CPU activities outside the transfers
  - How to measure the external load?
  - How to include the external load in the model(s)?

External load
- Multiple training data points with the same SC and DC, collected on different days and at different times
- EL accounts for throughput differences seen at the same SC, DC
- Three different functions for external load (EL):
  - EL1 = T − AT, where T is the throughput of transfer t and AT is the average throughput of all transfers with the same SC, DC as t
  - EL2 = T − MT, where MT is the maximum throughput with the same SC, DC as t
  - EL3 = T / MT
- Models with external load:
  - Linear: DT = a6*DC + a7*SC + a8*EL + b4
  - Log: DT = SC^a9 * DC^a10 * AEL{a11} * 2^b5, where AEL{a11} = EL^a11 if EL > 0, and |EL|^(−a11) otherwise
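
A small sketch of the three EL definitions and the AEL{a11} term as reconstructed above (purely illustrative; coefficient values and the handling of EL = 0 are left open, since the slide does not specify them):

```python
def external_load(T, avg_T, max_T, kind="EL1"):
    """EL1 = T - AT, EL2 = T - MT, EL3 = T / MT."""
    if kind == "EL1":
        return T - avg_T
    if kind == "EL2":
        return T - max_T
    if kind == "EL3":
        return T / max_T
    raise ValueError(kind)

def ael(el, a11):
    """AEL{a11} = EL^a11 if EL > 0, |EL|^(-a11) otherwise.
    (EL == 0 would need special handling, not specified on the slide.)"""
    return el ** a11 if el > 0 else abs(el) ** (-a11)

def predict_dt_log(sc, dc, el, a9, a10, a11, b5):
    """DT = SC^a9 * DC^a10 * AEL{a11} * 2^b5."""
    return (sc ** a9) * (dc ** a10) * ael(el, a11) * (2 ** b5)
```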

Models with external load
- DT = a6*DC + a7*SC + a8*EL + b4, where DT is predicted, DC and SC are controllable, and EL is uncontrollable
- Unlike SC and DC, the external load is uncontrollable
- Training the models requires multiple data points with the same SC, DC
- In practice, some recent transfers are available, but all combinations of SC, DC are unlikely

Calculating external load in practice
- In DT = a6*DC + a7*SC + a8*EL + b4, DT, DC, and SC are known for completed transfers, so EL can be computed
- Previous Transfer (PT) method: compute EL from the most recent transfer
- Recent Transfers (RT) method: compute EL from transfers in the past 30 minutes
- Recent Transfers with Error Correction (RTEC) method: DT = a6*DC + a7*SC + a8*EL + b4 + e, with the error term e derived from historic transfers
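
One way the three estimation methods could look in code, shown only as a sketch under assumptions (this is not the authors' implementation): `transfers` holds (DT, SC, DC) tuples for completed transfers in the relevant window, `coeffs` = (a6, a7, a8, b4) comes from the trained linear model, and `predicted` holds the model's predicted DT for those same transfers.

```python
def infer_el(dt, sc, dc, a6, a7, a8, b4):
    """Invert DT = a6*DC + a7*SC + a8*EL + b4 for a completed transfer."""
    return (dt - a6 * dc - a7 * sc - b4) / a8

def el_previous_transfer(transfers, coeffs):
    """PT: use only the most recent completed transfer."""
    dt, sc, dc = transfers[-1]
    return infer_el(dt, sc, dc, *coeffs)

def el_recent_transfers(transfers, coeffs):
    """RT: average EL over transfers completed in the past 30 minutes."""
    els = [infer_el(dt, sc, dc, *coeffs) for dt, sc, dc in transfers]
    return sum(els) / len(els)

def el_recent_with_error_correction(transfers, coeffs, predicted):
    """RTEC: the RT estimate plus an error term e (mean gap between predicted
    and observed throughput), as in DT = a6*DC + a7*SC + a8*EL + b4 + e."""
    e = sum(dt - p for (dt, _, _), p in zip(transfers, predicted)) / len(predicted)
    return el_recent_transfers(transfers, coeffs), e
```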

Applying models to control bandwidth
- Find DC and SC that achieve a target throughput
- In the predict direction, DC and SC are given and EL is known (computed with PT, RT, or RTEC), so DT is predicted; in the control direction, the target DT is given and EL is known, so DC and SC are computed
- Limit DC to 20 to narrow the search space
  - Even then, there is a large number of possible DC combinations (20^n for n destinations)
- SCmax (the maximum source concurrency allowed) is the number of possible values for SC
  - Heuristics limit the search space to SCmax * #destinations
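
A sketch of one possible heuristic in this spirit: hold the concurrencies of the other destinations fixed and sweep only this destination's DC, so each destination costs at most SCmax candidate evaluations. The function names, sweep strategy, defaults, and example values are assumptions, not the talk's actual algorithm.

```python
def predict_dt(dc, sc, el, coeffs):
    """Linear model: DT = a6*DC + a7*SC + a8*EL + b4."""
    a6, a7, a8, b4 = coeffs
    return a6 * dc + a7 * sc + a8 * el + b4

def pick_concurrency(target_dt, other_dc, el, coeffs, dc_max=20, sc_max=40):
    """Return the (DC, SC) pair whose predicted throughput is closest to
    target_dt, sweeping only this destination's DC while the concurrencies
    of the other destinations (other_dc, a dict) are held fixed."""
    best, best_gap = None, float("inf")
    fixed = sum(other_dc.values())
    for dc in range(1, dc_max + 1):
        sc = fixed + dc
        if sc > sc_max:              # respect the maximum source concurrency
            break
        gap = abs(predict_dt(dc, sc, el, coeffs) - target_dt)
        if gap < best_gap:
            best, best_gap = (dc, sc), gap
    return best

# Example with made-up coefficients and load:
print(pick_concurrency(3.0, {"Kraken": 2, "Gordon": 3}, el=0.1,
                       coeffs=(0.4, -0.05, 0.8, 0.5)))
```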

Experimental setup (map of sites: TACC, NCAR, SDSC, Indiana, NICS, PSC)

Experiments
- Ratio experiments: allocate the available bandwidth at the source to its destinations using a predefined ratio
  - Available bandwidth at Stampede is 9 Gbps
  - Example ratio 2:1:2:3:3 for Kraken, Mason, Blacklight, Gordon, Yellowstone
  - Kraken = 2 * 9 Gbps / 9 = 2 Gbps; Mason = 1 Gbps, Blacklight = 2 Gbps, Gordon = 3 Gbps, Yellowstone = 3 Gbps
  - A second example: Kraken = 3 Gbps, Mason = X1 Gbps, Blacklight = X2 Gbps, Gordon = X3 Gbps, Yellowstone = X4 Gbps
- Factoring experiments: increase a destination's throughput by a given factor when the source is saturated

Results – Ratio experiments
- Ratios are 4:5:6:8:9 for Kraken, Mason, Blacklight, Gordon, and Yellowstone. Concurrencies picked by the algorithm were {1,3,3,1,1}. Model: log with EL1. Method: RTEC.
- Ratios are 4:5:6:8:9 for Kraken, Mason, Blacklight, Gordon, and Yellowstone. Concurrencies picked by the algorithm were {1,4,3,1,1}. Model: log with EL3. Method: RT.

Results – Factoring experiments
- Increasing Gordon's baseline throughput by 2x. Concurrency picked by the algorithm for Gordon was 5.
- Increasing Yellowstone's baseline throughput by 1.5x. Concurrency picked by the algorithm for Yellowstone was 3.

Adaptive scheduling of data transfers (diagram: Data Transfer Node in front of Storage)

Adaptive scheduling of data transfers

Adaptive scheduling of data transfers
- Bursty transfers present an opportunity for adaptive scheduling
- Goals: optimize throughput and improve response times
- Challenge: adaptive concurrency
  - Under low load, increase concurrency on unsaturated destinations to maximize utilization
  - When new requests arrive, either queue them or adjust the concurrency of ongoing transfers
- Is data transfer scheduling analogous to parallel job scheduling?
  - Data transfers ≅ compute jobs, wide-area bandwidth ≅ compute resources, transfer concurrency ≅ job parallelism
  - But CPU, storage, and network differ between source and destination, and the wide-area network is shared
- Scheduling wide-area data transfers is challenging
  - Heterogeneous resources, shared network, dynamic nature of load
  - Scheduling decisions cannot be based on resource availability at one site alone

Metrics
- Turnaround time: time a job spends in the system (completion time - arrival time)
- Job slowdown: factor by which a job is slowed relative to its time on an unloaded system (turnaround time / processing time)
- Bounded slowdown, as used in parallel job scheduling
- Bounded slowdown adapted for wide-area transfers
- Job priority for wide-area transfers
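
The slide's formulas are not reproduced in the transcript. For reference, the standard parallel-job-scheduling definitions are written out below; the last line is only one plausible (unconfirmed) adaptation of processing time for a wide-area transfer of size S_j.

```latex
% Standard definitions; \tau is the usual bounded-slowdown threshold.
\begin{align*}
\text{turnaround}(j)       &= t^{\mathrm{complete}}_j - t^{\mathrm{arrive}}_j\\
\text{slowdown}(j)         &= \frac{\text{turnaround}(j)}{\text{processing}(j)}\\
\text{bounded slowdown}(j) &= \max\!\left(1,\ \frac{\text{turnaround}(j)}{\max\bigl(\text{processing}(j),\ \tau\bigr)}\right)\\
\text{processing}(j)       &\approx \frac{S_j}{\text{throughput of the transfer on an unloaded system}}
\end{align*}
```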

Scheduling algorithm
- Maximize resource utilization and reduce slowdown
  - Adaptively queue requests and adjust concurrency based on load
- Preemption/restart
  - The only state required is missing-block information; no migration
  - Still has overhead (authentication, checkpoint restart); a p-factor limits preemption
- Four key decision points:
  - Upon task arrival, schedule or queue the task
  - If scheduled, what concurrency value to use
  - When to preempt (and schedule a waiting job)
  - When to change the concurrency of a running job
- Use both models and recently observed behavior
  - Models to predict throughput and determine the concurrency value
  - 5-second averages of observed throughput to determine saturation
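
A compact, self-contained sketch of those four decision points (not the authors' algorithm; the saturation threshold, concurrency cap, and the initial concurrency of 2 are placeholder values, and the model-based concurrency choice is stubbed out):

```python
from dataclasses import dataclass, field
from typing import Dict, List

CC_MAX, P_FACTOR = 10, 2   # assumed caps on concurrency and preemptions

@dataclass
class Transfer:
    name: str
    dest: str
    cc: int = 0
    preemptions: int = 0

@dataclass
class Scheduler:
    capacity: Dict[str, float]    # per-destination capacity (Gbps)
    observed: Dict[str, float]    # 5-second average throughput (Gbps)
    running: List[Transfer] = field(default_factory=list)
    queue: List[Transfer] = field(default_factory=list)

    def saturated(self, dest: str) -> bool:
        # Treat a destination as saturated once observed throughput is close
        # to its capacity (the 0.9 threshold is an assumption).
        return self.observed.get(dest, 0.0) >= 0.9 * self.capacity[dest]

    def on_arrival(self, t: Transfer) -> None:
        # Decisions 1 and 2: schedule or queue, and pick an initial concurrency.
        if self.saturated(t.dest):
            self.queue.append(t)
        else:
            t.cc = 2   # the real algorithm would use the throughput model here
            self.running.append(t)

    def maybe_preempt(self) -> None:
        # Decision 3: preempt a running job (checkpoint/restart keeps only
        # missing-block state) to admit a waiting job, limited by the p-factor.
        if not self.queue:
            return
        for t in self.running:
            if t.preemptions < P_FACTOR:
                t.preemptions += 1
                self.running.remove(t)
                self.queue.append(t)
                nxt = self.queue.pop(0)
                nxt.cc = 2
                self.running.append(nxt)
                return

    def adjust_concurrency(self) -> None:
        # Decision 4: under low load, grow concurrency on unsaturated destinations.
        for t in self.running:
            if not self.saturated(t.dest) and t.cc < CC_MAX:
                t.cc += 1
```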

Illustrative example – average turnaround time is …; average turnaround time for the baseline is 12.04

Workload traces
- Traces from actual executions (anonymized GridFTP usage statistics)
- Busiest day from a 1-month period; busiest server log on that day
- Length of the logs limited because the evaluation runs in a production environment
- Three 15-minute logs: 25%, 45%, and 60% load traces
  - "Load" is total bytes transferred / maximum that can be transferred
- Destinations are anonymized in the logs
  - Weighted random split based on destination capacities
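
A sketch of how such a weighted random split might be done (the function name and example capacities are illustrative assumptions, not the talk's actual procedure):

```python
import random

def assign_destinations(transfers, capacities, seed=0):
    """Assign each logged transfer to a destination with probability
    proportional to that destination's capacity."""
    rng = random.Random(seed)
    dests = list(capacities)
    weights = [capacities[d] for d in dests]
    return [(t, rng.choices(dests, weights=weights, k=1)[0]) for t in transfers]

# Example with made-up capacities (Gbps):
print(assign_destinations(["t1", "t2", "t3"],
                          {"Kraken": 2, "Gordon": 3, "Yellowstone": 3}))
```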

Experimental results – turnaround time, 60% load

Experimental results – worst case, 60% load

Experimental results – 60% load, improved baseline

Related work
- Several models for predicting behavior and finding the optimal number of parallel TCP streams
  - Uncongested networks, simulations
- Many studies on bandwidth allocation at the router
  - Our focus is application-level control
- Adaptive replica selection and algorithms that utilize multiple paths
  - Assume the ability to control the network path, or rely on overlay networks
- Workflow schedulers handle dependencies between computation and data movement
- Adaptive file transfer scheduling with preemption in production environments has not been studied

Summary of current work
- Models for wide-area data transfer throughput in terms of a few key parameters
- Log models that combine total source concurrency, destination concurrency, and a measure of external load are effective
- Methods that use both recent and historical experimental data are better at estimating external load
- Adaptive scheduling algorithm to improve the overall user experience
- Evaluated using real traces on a production system
- Significant improvements over the current state of the art

Proposed work
- File transfers have different time constraints, from near real time to highly flexible
- Objective: account for time requirements to improve the overall user experience
- Consider two job types: batch and interactive
  - First, exploit the relaxed deadlines of batch jobs
  - Next, exploit knowledge about future arrival times
  - Finally, maximize the utility value of jobs, where each job has a utility function

Batch jobs
- If the deadline is close, batch jobs get the highest priority
  - Scheduled with a concurrency of 2, no preemption
- Otherwise, batch jobs get the lowest priority
- Interactive jobs are measured by turnaround time and slowdown; batch jobs are measured by deadline satisfaction rate
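
A tiny sketch of that priority rule (the "deadline is close" window and the defaults for the low-priority case are assumed parameters, not taken from the talk):

```python
# Illustrative only: batch jobs stay at the lowest priority until the deadline
# is near, then jump to the highest priority and run with concurrency 2 and
# no preemption.

def batch_priority(now_s: float, deadline_s: float, near_window_s: float = 600.0) -> str:
    return "highest" if deadline_s - now_s <= near_window_s else "lowest"

def batch_schedule_params(priority: str) -> dict:
    if priority == "highest":
        return {"concurrency": 2, "preemptable": False}
    return {"concurrency": 1, "preemptable": True}   # assumed defaults
```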

Knowledge about future jobs (diagram: throughput in GB/s vs. time in seconds for two schedules, with a wait queue)
- T1: 1 GB to d2, T2: 1 GB to d1, T3: 0.5 GB to d2; source 1 GB/s, destination d1 1 GB/s, destination d2 0.5 GB/s
- Schedule A (no knowledge of future jobs): average slowdown is 1.5
- Schedule B (with knowledge of future jobs): average slowdown is (1+2+1)/3 = 1.33

Utility-based scheduling
- Both interactive and batch jobs have a deadline and an associated utility function that captures the impact of missing the deadline
- Decay can be linear, exponential, step, or a combination
- Each transfer request R is defined by the tuple R = (d, A, S, D, U):
  - d = destination
  - A = arrival time of R
  - S = size of the file to be transferred
  - D = deadline of R
  - U = utility function of R
- Objective: maximize the aggregate utility value of jobs

Utility-based scheduling
- The inverse of the instantaneous utility value is used as the priority
- Instantaneous utility value calculated as follows
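
The slide's formula is not in the transcript. Under a linear-decay assumption, one plausible form of the instantaneous utility, shown only as an illustration and not as the talk's actual definition, would be:

```latex
% Illustrative only: linear decay after the deadline D, with maximum utility
% U_max and decay rate \lambda; priority is the inverse of U(t) per the slide.
U(t) =
\begin{cases}
U_{\max}, & t \le D,\\[2pt]
\max\bigl(0,\; U_{\max} - \lambda\,(t - D)\bigr), & t > D,
\end{cases}
\qquad
\text{priority}(R) \propto \frac{1}{U(t)}.
```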

Questions