Rescheduling Sathish Vadhiyar

Rescheduling Motivation
Heterogeneity and contention can cause an application's performance to vary over time, so rescheduling decisions must be made in response to changes in resource performance. Two triggers:
- Performance degradation of the running application
- Availability of "better" resources

Modeling the Cost of Redistribution
C_threshold depends on:
- Model accuracy
- Load dynamics of the system

Modeling the Cost of Redistribution
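The model itself appeared on this slide as an image and is not in the transcript. A rough sketch of the decision rule implied by C_threshold, with assumed notation (not the paper's): let $T_{old}$ be the predicted remaining time under the current schedule, $T_{new}$ the predicted remaining time under the candidate schedule, and $C_{redist}$ the predicted redistribution cost. Rescheduling pays off when

$$T_{old} - (T_{new} + C_{redist}) > C_{threshold}$$

so C_threshold sets how much predicted gain is demanded before acting; an accurate model and stable loads justify a smaller threshold, which is why the slide above ties C_threshold to model accuracy and load dynamics.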

Redistribution Cost Model for Jacobi 2D
- E_max – average iteration time of the processor that is farthest behind
- C_dev – processor performance deviation variable

Redistribution Cost Model for Jacobi 2D

Experiments
- 8 processors were used
- A loading event consisting of a parallel program was introduced 3 minutes after Jacobi started
- The number of tasks in the loading event was varied
- C_threshold – 15 seconds

Results

Malleable Jobs
Parallel jobs fall into three classes:
- Rigid – run on only one set of processors
- Moldable – flexible when the job starts, but cannot be reconfigured during execution
- Malleable – flexible at job start as well as during execution
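For concreteness, the taxonomy as a tag one might attach to a job record (a hypothetical illustration, not from the slides):

    typedef enum {
        JOB_RIGID,     /* fixed processor set for the whole run */
        JOB_MOLDABLE,  /* processor count chosen at start, then fixed */
        JOB_MALLEABLE  /* processor count may change during execution */
    } job_flexibility_t;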

Rescheduling in GrADS
A performance-oriented migration framework with tightly coupled policies for suspension and migration; it takes into account load characteristics and remaining execution times. Migration of an application depends on:
- The amount of increase or decrease in load on the system
- How far into its execution the application is when the load is introduced
- The performance benefit that can be obtained through migration
Components:
1. Migrator
2. Contract Monitor
3. Rescheduler

SRS Checkpointing Library
- End application instrumented with a user-level checkpointing library
- Enables reconfiguration of executing applications across distinct domains
- Allows fault tolerance
- Uses IBP (Internet Backplane Protocol) for storage and retrieval of checkpoints
- Needs the Runtime Support System (RSS) – an auxiliary daemon that is started with the parallel application
- Simple API:
  SRS_Init()
  SRS_Restart_Value()
  SRS_Register()
  SRS_Check_Stop()
  SRS_Read()
  SRS_Finish()
  SRS_StoreMap(), SRS_DistributeFunc_Create(), SRS_DistributeMap_Create()

SRS Internals
[Diagram: the MPI application, instrumented with SRS, writes checkpoints to IBP and polls the Runtime Support System (RSS) for a STOP signal; on restart it reads the checkpoints back, with possible redistribution.]

SRS API

Original code:

    /* begin code */
    MPI_Init()
    /* initialize data */
    loop { }
    MPI_Finalize()

SRS instrumented code:

    /* begin code */
    MPI_Init()
    SRS_Init()
    restart_value = SRS_Restart_Value()
    if(restart_value == 0){
        /* initialize data */
    }else{
        SRS_Read("data", data, BLOCK, NULL)
    }
    SRS_Register("data", data, SRS_INT, data_size, BLOCK, NULL)
    loop{
        stop_value = SRS_Check_Stop()
        if(stop_value == 1){
            exit();
        }
    }
    SRS_Finish()
    MPI_Finalize()

SRS Example – Original Code

    MPI_Init(&argc, &argv);
    local_size = global_size/size;
    /* rank 0 initializes the global array */
    if(rank == 0){
        for(i=0; i<global_size; i++){
            global_A[i] = i;
        }
    }
    /* distribute one block of the array to each process */
    MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
    iter_start = 0;
    for(i=iter_start; i<global_size; i++){
        proc_number = i/local_size;
        local_index = i%local_size;
        if(rank == proc_number){
            local_A[local_index] += 10;
        }
    }
    MPI_Finalize();

SRS Example – Modified Code

    MPI_Init(&argc, &argv);
    SRS_Init();
    local_size = global_size/size;
    restart_value = SRS_Restart_Value();
    if(restart_value == 0){
        /* fresh start: initialize and distribute the data as before */
        if(rank == 0){
            for(i=0; i<global_size; i++){
                global_A[i] = i;
            }
        }
        MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
        iter_start = 0;
    }else{
        /* restart: recover the data and the loop iterator from the checkpoint */
        SRS_Read("A", local_A, BLOCK, NULL);
        SRS_Read("iterator", &iter_start, SAME, NULL);
    }
    SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
    SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);

SRS Example – Modified Code (Contd..)

    for(i=iter_start; i<global_size; i++){
        /* exit cleanly if a STOP signal has been posted by the rescheduler */
        stop_value = SRS_Check_Stop();
        if(stop_value == 1){
            MPI_Finalize();
            exit(0);
        }
        proc_number = i/local_size;
        local_index = i%local_size;
        if(rank == proc_number){
            local_A[local_index] += 10;
        }
    }
    SRS_Finish();
    MPI_Finalize();

Components (Continued..)
Contract Monitor:
» Monitors the progress of the end application
» Tolerance limits are specified to the contract monitor: upper contract limit – 2.0, lower contract limit – 0.7
» When it receives the actual execution time for an iteration from the application, it calculates the ratio between the actual and predicted times, adds it to the average ratio, and adds it to last_5_avg

Contract Monitor
If average ratio > upper contract limit:
- Contact the rescheduler, request rescheduling, and receive the reply
- If the reply is "SORRY. CANNOT RESCHEDULE":
  - Calculate new_predicted_time based on last_5_avg and orig_predicted_time
  - Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time and prev_upper_contract_limit
  - Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time and prev_lower_contract_limit
  - prev_predicted_time = new_predicted_time

Contract Monitor
If average ratio < lower contract limit:
- Calculate new_predicted_time based on last_5_avg and orig_predicted_time
- Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time and prev_upper_contract_limit
- Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time and prev_lower_contract_limit
- prev_predicted_time = new_predicted_time
A sketch of this adjustment logic in C follows.
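Taking the two slides together, a minimal C sketch of the limit-adjustment step. The slides do not spell out the formulas; here new_predicted_time is last_5_avg * orig_predicted_time, and both limits are rescaled by new_predicted_time / prev_predicted_time – one plausible reading, not necessarily the GrADS implementation.

    typedef struct {
        double orig_predicted_time;  /* per-iteration prediction at launch */
        double prev_predicted_time;
        double upper_limit;          /* 2.0 initially, per the slides */
        double lower_limit;          /* 0.7 initially, per the slides */
        double avg_ratio;            /* running mean of actual/predicted */
        double last_5_avg;           /* mean of the last five ratios */
    } contract_monitor_t;

    /* Recompute the prediction from recent behaviour and rescale both
     * limits proportionally (the scaling rule is an assumption). Called
     * when a rescheduling request is refused, or when avg_ratio drops
     * below lower_limit. */
    void adjust_limits(contract_monitor_t *m)
    {
        double new_predicted = m->last_5_avg * m->orig_predicted_time;
        double scale = new_predicted / m->prev_predicted_time;
        m->upper_limit *= scale;
        m->lower_limit *= scale;
        m->prev_predicted_time = new_predicted;
    }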

Rescheduler
A metascheduling service that operates in 2 modes:
- On request – when the contract monitor asks for rescheduling, i.e. during performance degradation
- Opportunistic – periodically queries the database manager for recently completed GrADS applications and migrates executing applications to make use of the freed resources

Rescheduler Pseudo Code

Rescheduler Pseudo Code (Contd..)
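The two pseudo-code slides above were images and are not in the transcript. A hedged reconstruction of the core migration test in C, using the roughly 900-second static rescheduling cost from the table below; the 30% minimum-benefit margin is an assumption, not a figure from the slides.

    #define RESCHED_COST      900.0  /* static rescheduling cost, seconds */
    #define MIN_GAIN_FRACTION 0.30   /* assumed minimum relative benefit */

    /* remaining_current: predicted remaining time on the current resources;
     * remaining_new: predicted remaining time on the candidate resources,
     * both derived from the performance model and current NWS data. */
    int should_migrate(double remaining_current, double remaining_new)
    {
        double gain = remaining_current - (remaining_new + RESCHED_COST);
        return gain > MIN_GAIN_FRACTION * remaining_current;
    }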

Application and Metascheduler Interactions
[Flowchart: the user supplies problem parameters and an initial list of machines; resource selection is followed by a permission request to the Permission Service (abort if refused); application-specific scheduling produces a schedule, and contract development sends it to the Contract Negotiator; if the contract is not approved, new resource information is fetched and the cycle repeats, otherwise the application is launched. On completion the application exits; if it was stopped instead, it waits for a restart signal and the cycle resumes with fresh resource information.]

Rescheduler Architecture
[Diagram: the Application Manager launches the application; the application reports per-iteration execution times to the Contract Monitor and queries the Runtime Support System (RSS) for STOP signals; the Contract Monitor sends migration requests to the Rescheduler, which stores STOP and RESUME signals via the Database Manager; a stopped application waits for the restart signal.]

Static Rescheduling Cost

Rescheduling Phase                             Time (seconds)
Writing checkpoints                                  40
Waiting for NWS update                               90
NWS retrieval time                                  120
Application-level scheduling                         80
Other Grid overhead                                  10
Starting application                                 60
Reading checkpoints and data redistribution         500
Total                                               900

Experiments and Results – Rescheduling on Request
- Different problem sizes of ScaLAPACK QR (msc – fast machines; opus – slow machines)
- The initial set of resources consisted of 4 msc and 8 opus machines; the performance model always chose the 4 msc machines for the application run
- 5 minutes into the application run, an artificial load was introduced on the 4 msc machines
- The application migrated from UT to UIUC
- [Plot: execution times with and without rescheduling]
The rescheduler decided not to reschedule for size 8000. Wrong decision!

Rescheduling Depending on Amount of Load
- ScaLAPACK QR, problem size –
- Load introduced 20 minutes after application start
- The amount of load was varied
- [Plot: execution times with and without rescheduling]
The rescheduler decided not to reschedule. Wrong decision!

Rescheduling Depending on Load Introduction Time
- ScaLAPACK QR, problem size –
- The same load introduced at different points of application execution
- [Plot: execution times with and without rescheduling]
The rescheduler decided not to reschedule. Wrong decision!

Experiments and Results – Opportunistic Rescheduling
- Two problems: the 1st executing on 6 msc machines, the 2nd of varying sizes
- The 2nd problem was introduced 2 minutes after the start of the 1st problem
- The initial set of resources for the 2nd problem consisted of 6 msc machines and 2 opus machines
- Due to the presence of the 1st problem, the 2nd problem had to use both the msc and opus machines, and hence involved Internet bandwidth
- After the 1st problem completes, the 2nd problem can be rescheduled to use only the msc machines
- [Plot: large problem, with and without rescheduling]

Dynamic Prediction of Rescheduling Cost
- When making a rescheduling decision, the rescheduler contacts the RSS and obtains the current distribution of the application's data
- It forms the old and new data maps
- Based on the maps and current NWS information, it predicts the redistribution cost, as sketched below
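A minimal sketch of that prediction in C: for every source/destination pair implied by the old and new data maps, divide the bytes to be moved by the NWS bandwidth estimate for that link, and sum. The array interfaces and the no-overlap assumption are illustrative, not the GrADS implementation.

    #define MAXP 64  /* hypothetical upper bound on processor count */

    /* bytes_to_move[s][d]: bytes travelling from old-map processor s to
     * new-map processor d; bandwidth[s][d]: NWS estimate in bytes/sec */
    double predict_redistribution_cost(int nold, int nnew,
                                       double bytes_to_move[MAXP][MAXP],
                                       double bandwidth[MAXP][MAXP])
    {
        double cost = 0.0;
        for (int s = 0; s < nold; s++)
            for (int d = 0; d < nnew; d++)
                if (bytes_to_move[s][d] > 0.0)
                    cost += bytes_to_move[s][d] / bandwidth[s][d];
        return cost;  /* upper bound: assumes transfers are serialized */
    }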

Dynamic Prediction of Rescheduling Cost
Application started on: 4 mscs
Application restarted on: 8 opus

References / Sources / Credits
Gary Shao, Rich Wolski and Fran Berman. "Predicting the Cost of Redistribution in Scheduling". Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing.
Vadhiyar, S. and Dongarra, J. "Performance Oriented Migration Framework for the Grid". Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), May 2003, Tokyo, Japan.
L. V. Kale, Sameer Kumar, and J. DeSouza. "A Malleable-Job System for Timeshared Parallel Machines". 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002), May 21-24, 2002, Berlin, Germany.
See the Cactus migration thorn.
See opportunistic migration by Huedo.

JUNK !

GridWay
Migration happens:
- When performance degradation happens
- When "better" resources are discovered
- When requirements change
- On owner decision
- On remote resource failure
Rescheduling is done at the discovery interval; the performance degradation evaluator program is executed at the monitoring interval.

Components
- Request manager
- Dispatch manager
- Submission manager – prologing, submitting, canceling, epiloging
- Performance monitor
Application-specific components:
- Resource selector
- Performance degradation evaluator (a sketch follows below)
- Prolog
- Wrapper
- Epilog
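A minimal sketch of a performance degradation evaluator of the kind GridWay executes at each monitoring interval. The file name, its format, and the 1.5 tolerance factor are assumptions for illustration; the framework only needs the program to report whether degradation occurred.

    #include <stdio.h>

    #define DEGRADATION_FACTOR 1.5  /* assumed tolerance */

    int main(void)
    {
        double base, current;
        /* the wrapper is assumed to record the reference and most recent
         * iteration times in perf.dat (hypothetical file and format) */
        FILE *f = fopen("perf.dat", "r");
        if (!f)
            return 0;                /* no data: report no degradation */
        int ok = (fscanf(f, "%lf %lf", &base, &current) == 2);
        fclose(f);
        if (!ok)
            return 0;
        return current > DEGRADATION_FACTOR * base;  /* nonzero = degraded */
    }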

Opportunistic Job Migration
Factors:
- Performance of the new host
- Remaining execution time of the application
- Proximity of the new resource to the needed data

Dynamic Space Sharing on Clusters of Non-Dedicated Workstations (Chowdhury et al.)
Dynamic reconfiguration – an application-level approach for dynamically reconfiguring grid-based iterative applications

SRS Overhead
- Worst-case overhead – 15%
- Worst-case SRS overhead across all results – 36%

SRS Data Redistribution Cost
Started on – 8 MSCs
Restarted on – 8 OPUS, 2 MSCs

Modified GrADS Architecture
[Diagram: the user invokes the Grid Routine / Application Manager, which uses the Resource Selector (backed by MDS and NWS), the Performance Modeler, the Permission Service, the Contract Developer and the Contract Negotiator; the App Launcher starts the application, whose progress the Contract Monitor tracks via the RSS; the Rescheduler and Database Manager complete the loop.]

Another approach: AMPI
- AMPI – an MPI implementation on top of Charm++
- Processes are implemented as user-level threads
- Charm++ provides the load balancing framework and migrates threads
- The load balancing framework accepts a processor map
- The parallel job is started on all processors in the system, but work is allocated only to processors in the processor map, i.e. threads/objects are assigned to processors in the map

Rescheduling
When the processor map changes:
- Threads are migrated to the new set of processors in the processor map
- Skeleton processes are left behind on the vacated processors
- A skeleton forwards messages to the threads/objects previously housed on that processor
The new processor map is conveyed to the load balancer framework by the adaptive job scheduler.

Overhead
Shrink or expand time depends on:
- The per-process data that has to be transferred
- The number of processors involved

Cost of skeleton process

CPU utilization by 2 Jobs

Adaptive Job Scheduler
- A variant of the dynamic equipartitioning strategy (sketched below)
- Each job specifies the minimum and maximum number of procs. it can run on
- The scheduler recalculates the number of procs. assigned to each running job
- Running jobs and the new job are first assigned their minimum requirement
- The leftover procs. are divided equally among all the jobs
- The new job is placed in a queue if it cannot be allocated its minimum requirement
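A sketch of the equipartitioning step described above, in C. The job fields and the handling of leftovers follow the slide's description; redistributing processors freed by maxpe caps is omitted for brevity.

    typedef struct { int minpe, maxpe, alloc; } job_t;

    /* Returns 0 on success; -1 means the new job cannot receive its
     * minimum requirement and must be queued. */
    int equipartition(job_t jobs[], int njobs, int total_procs)
    {
        int used = 0;
        for (int i = 0; i < njobs; i++) {   /* minimum first */
            jobs[i].alloc = jobs[i].minpe;
            used += jobs[i].minpe;
        }
        if (used > total_procs)
            return -1;                      /* queue the new job */
        int share = (total_procs - used) / njobs;  /* equal leftovers */
        for (int i = 0; i < njobs; i++) {
            int extra = share;
            if (jobs[i].alloc + extra > jobs[i].maxpe)
                extra = jobs[i].maxpe - jobs[i].alloc;  /* cap at maxpe */
            jobs[i].alloc += extra;
        }
        return 0;
    }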

Scheduling
- The same strategy is followed when jobs complete
- The scheduler conveys its decision to the jobs as a bit-vector (see the sketch below)
- The jobs then perform thread migration
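A sketch of how a job might turn the scheduler's bit-vector into the processor map handed to the load balancing framework; the names and the packed bit-vector layout are assumptions, not the Charm++ API.

    /* bitvec: one bit per processor, set if the processor is allocated;
     * proc_map receives the indices of the allocated processors.
     * Returns the number of processors in the new map. */
    int build_proc_map(const unsigned char *bitvec, int nprocs, int *proc_map)
    {
        int n = 0;
        for (int p = 0; p < nprocs; p++)
            if (bitvec[p / 8] & (1u << (p % 8)))
                proc_map[n++] = p;
        return n;
    }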

Experiments
- 32-processor Linux cluster
- Job arrivals follow a Poisson process
- Each job is a molecular dynamics (MD) program with 50,000 atoms, run for a different number of iterations
- The number of iterations is exponentially distributed
- The minimum number of procs., minpe, is uniformly distributed between 1 and 64; maxpe is 64
- Each experiment consists of 50 job arrivals

Results
Load factor – mean arrival rate × (execution time on 64 processors)

Dynamic Reconfiguration
- The ability to change the number of processors during execution
- A Condor-like environment:
  - Respect ownership of workstations
  - Provide high performance for parallel applications
- Dynamic reconfiguration also provides high throughput for the system