Rescheduling Sathish Vadhiyar
Rescheduling Motivation
- Heterogeneity and contention can cause an application's performance to vary over time.
- Rescheduling decisions in response to changes in resource performance are therefore necessary, for example on:
  - Performance degradation of the running applications
  - Availability of "better" resources
Modeling the Cost of Redistribution
- C_threshold depends on:
  - Model accuracy
  - Load dynamics of the system
Modeling the Cost of Redistribution
Redistribution Cost Model for Jacobi 2D
- E_max: average iteration time of the processor that is farthest behind
- C_dev: processor performance deviation variable
Redistribution Cost Model for Jacobi 2D
Experiments
- 8 processors were used.
- A loading event consisting of a parallel program was introduced 3 minutes after Jacobi started.
- The number of tasks in the loading event was varied.
- C_threshold: 15 seconds
Results
Malleable Jobs
- Parallel jobs can be:
  - Rigid: only one set of processors
  - Moldable: flexible at job start, but cannot be reconfigured during execution
  - Malleable: flexible at job start as well as during execution
Rescheduling in GrADS
- Performance-oriented migration framework
- Tightly coupled policies for suspension and migration
- Takes into account load characteristics and remaining execution times
- Migration of an application depends on:
  - The amount of increase or decrease in loads on the system
  - The point in the application's execution at which load is introduced into the system
  - The performance benefits that can be obtained from migration
- Components:
  1. Migrator
  2. Contract Monitor
  3. Rescheduler
SRS Checkpointing Library
- The end application is instrumented with a user-level checkpointing library.
- Enables reconfiguration of executing applications across distinct domains
- Allows fault tolerance
- Uses IBP (Internet Backplane Protocol) for storage and retrieval of checkpoints
- Needs the Runtime Support System (RSS), an auxiliary daemon that is started with the parallel application
- Simple API:
  - SRS_Init()
  - SRS_Restart_Value()
  - SRS_Register()
  - SRS_Check_Stop()
  - SRS_Read()
  - SRS_Finish()
  - SRS_StoreMap(), SRS_DistributeFunc_Create(), SRS_DistributeMap_Create()
SRS Internals
- Diagram components: MPI Application, SRS, IBP, Runtime Support System (RSS)
- Diagram interactions: Start, Poll, STOP, Read with possible redistribution, ReStart
SRS API
- Original code:

    /* begin code */
    MPI_Init()
    /* initialize data */
    loop{
    }
    MPI_Finalize()

- SRS instrumented code:

    /* begin code */
    MPI_Init()
    SRS_Init()
    restart_value = SRS_Restart_Value()
    if(restart_value == 0){
        /* initialize data */
    }else{
        SRS_Read("data", data, BLOCK, NULL)
    }
    SRS_Register("data", data, SRS_INT, data_size, BLOCK, NULL)
    loop{
        stop_value = SRS_Check_Stop()
        if(stop_value == 1){
            exit();
        }
    }
    SRS_Finish()
    MPI_Finalize()
SRS Example – Original Code

    MPI_Init(&argc, &argv);
    local_size = global_size/size;
    /* rank 0 fills the global array */
    if(rank == 0){
        for(i=0; i<global_size; i++){
            global_A[i] = i;
        }
    }
    /* block-distribute the array over all processes */
    MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
    iter_start = 0;
    for(i=iter_start; i<global_size; i++){
        proc_number = i/local_size;   /* owner of global element i */
        local_index = i%local_size;   /* its position in the owner's block */
        if(rank == proc_number){
            local_A[local_index] += 10;
        }
    }
    MPI_Finalize();
SRS Example – Modified Code

    MPI_Init(&argc, &argv);
    SRS_Init();
    local_size = global_size/size;
    restart_value = SRS_Restart_Value();
    if(restart_value == 0){
        /* fresh start: initialize and distribute the data */
        if(rank == 0){
            for(i=0; i<global_size; i++){
                global_A[i] = i;
            }
        }
        MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
        iter_start = 0;
    }
    else{
        /* restart: read checkpointed data, possibly redistributed */
        SRS_Read("A", local_A, BLOCK, NULL);
        SRS_Read("iterator", &iter_start, SAME, NULL);
    }
    /* register the data to be checkpointed */
    SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
    SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);
SRS Example – Modified Code (Contd.)

    for(i=iter_start; i<global_size; i++){
        /* poll for a STOP request at the start of each iteration */
        stop_value = SRS_Check_Stop();
        if(stop_value == 1){
            MPI_Finalize();
            exit(0);
        }
        proc_number = i/local_size;
        local_index = i%local_size;
        if(rank == proc_number){
            local_A[local_index] += 10;
        }
    }
    SRS_Finish();
    MPI_Finalize();
Components (Continued)
- Contract Monitor:
  - Monitors the progress of the end application
  - Tolerance limits are specified to the contract monitor:
    - Upper contract limit: 2.0
    - Lower contract limit: 0.7
  - When it receives the actual execution time for an iteration from the application, it:
    - Calculates the ratio between actual and predicted times
    - Adds it to the average ratio
    - Adds it to last_5_avg
Contract Monitor
- If average ratio > upper contract limit:
  - Contact rescheduler
  - Request rescheduling
  - Receive reply
  - If reply is "SORRY. CANNOT RESCHEDULE":
    - Calculate new_predicted_time based on last_5_avg and orig_predicted_time
    - Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time, prev_upper_contract_limit
    - Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time, prev_lower_contract_limit
    - prev_predicted_time = new_predicted_time
Contract Monitor
- If average ratio < lower contract limit:
  - Calculate new_predicted_time based on last_5_avg and orig_predicted_time
  - Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time, prev_upper_contract_limit
  - Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time, prev_lower_contract_limit
  - prev_predicted_time = new_predicted_time
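
The contract monitor's bookkeeping (per-iteration ratio, running average, last_5_avg, and the comparison against the two limits) can be sketched in C. This is a minimal sketch under stated assumptions, not the GrADS implementation: the type and function names (`contract_monitor`, `cm_record`, `cm_check`, and so on) are hypothetical, and the limit-adjustment step is left out because the slides do not give its formula.

```c
#include <assert.h>
#include <stddef.h>

#define CM_WINDOW 5 /* number of recent ratios kept for last_5_avg */

enum cm_status { CM_OK, CM_ABOVE_UPPER, CM_BELOW_LOWER };

/* Hypothetical state for one monitored application. */
struct contract_monitor {
    double upper_limit;       /* e.g. 2.0 */
    double lower_limit;       /* e.g. 0.7 */
    double ratio_sum;         /* sum of all actual/predicted ratios */
    size_t n_ratios;          /* number of iterations reported so far */
    double window[CM_WINDOW]; /* ring buffer of the most recent ratios */
};

void cm_init(struct contract_monitor *cm, double upper, double lower)
{
    assert(lower < upper);
    cm->upper_limit = upper;
    cm->lower_limit = lower;
    cm->ratio_sum = 0.0;
    cm->n_ratios = 0;
    for (size_t i = 0; i < CM_WINDOW; i++)
        cm->window[i] = 0.0;
}

/* Record the actual execution time of one iteration. */
void cm_record(struct contract_monitor *cm, double actual, double predicted)
{
    assert(predicted > 0.0);
    double ratio = actual / predicted;
    cm->ratio_sum += ratio;
    cm->window[cm->n_ratios % CM_WINDOW] = ratio;
    cm->n_ratios++;
}

/* Average ratio over all iterations seen so far. */
double cm_average(const struct contract_monitor *cm)
{
    assert(cm->n_ratios > 0);
    return cm->ratio_sum / (double)cm->n_ratios;
}

/* Average ratio over the last (up to) five iterations. */
double cm_last5_avg(const struct contract_monitor *cm)
{
    assert(cm->n_ratios > 0);
    size_t n = cm->n_ratios < CM_WINDOW ? cm->n_ratios : CM_WINDOW;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += cm->window[i];
    return sum / (double)n;
}

/* Compare the average ratio against the contract limits. */
enum cm_status cm_check(const struct contract_monitor *cm)
{
    if (cm->n_ratios == 0)
        return CM_OK;
    double avg = cm_average(cm);
    if (avg > cm->upper_limit)
        return CM_ABOVE_UPPER; /* would trigger a rescheduling request */
    if (avg < cm->lower_limit)
        return CM_BELOW_LOWER; /* would trigger a limit adjustment */
    return CM_OK;
}
```

A caller would invoke `cm_record` once per reported iteration and act on the result of `cm_check`, for example by contacting the rescheduler on `CM_ABOVE_UPPER`.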
Rescheduler
- A metascheduling service
- Operates in 2 modes:
  - On request: when the contract monitor requests rescheduling, i.e. during performance degradation
  - Opportunistic rescheduling: periodically queries the Database Manager for recently completed GrADS applications and migrates executing applications to make use of the freed resources
Rescheduler Pseudo Code
Rescheduler pseudo Code
Application and Metascheduler Interactions
- Flowchart stages: User; Resource Selection; Requesting Permission (Permission Service); Application Specific Scheduling; Contract Development; Contract Negotiator; Application Launching; Wait for restart signal; Exit
- Decision points (each with YES/NO branches): Permission?; Contract Approved?; Application Completion?
- Edge labels: problem parameters; initial list of machines; permission; application specific schedule; problem parameters, final schedule; get new resource information; abort; application completed; application was stopped
Rescheduler Architecture
- Diagram components: Application Manager; Application; Contract Monitor; Runtime Support System (RSS); Database Manager; Rescheduler
- Application Manager flow: Application Launching; Application Completion?; Application Completed (Exit); Application was stopped (Wait for restart signal; Get new resource information)
- Diagram interactions: execution time; query for STOP signal; request for migration; store STOP; send STOP signal; store RESUME
Static Rescheduling Cost

    Rescheduling phase                             Time (seconds)
    Writing checkpoints                                        40
    Waiting for NWS update                                     90
    NWS retrieval time                                        120
    Application-level scheduling                               80
    Other Grid overhead                                        10
    Starting application                                       60
    Reading checkpoints and data redistribution               500
    Total                                                     900
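
The static cost is the sum of the phases listed above, and a rescheduler could weigh it against the expected gain from moving. The C sketch below illustrates that arithmetic only. The function names and the decision rule (migrate when the time saved exceeds a minimum benefit) are assumptions for illustration; the slides state that the decision depends on remaining execution time and performance benefit but do not give the exact rule.

```c
#include <assert.h>
#include <stddef.h>

struct resched_phase {
    const char *name;
    double seconds;
};

/* Phase times taken from the static rescheduling cost table. */
static const struct resched_phase STATIC_PHASES[] = {
    {"Writing checkpoints", 40.0},
    {"Waiting for NWS update", 90.0},
    {"NWS retrieval time", 120.0},
    {"Application-level scheduling", 80.0},
    {"Other Grid overhead", 10.0},
    {"Starting application", 60.0},
    {"Reading checkpoints and data redistribution", 500.0},
};

/* Total static rescheduling cost in seconds. */
double static_rescheduling_cost(void)
{
    double total = 0.0;
    size_t n = sizeof STATIC_PHASES / sizeof STATIC_PHASES[0];
    for (size_t i = 0; i < n; i++)
        total += STATIC_PHASES[i].seconds;
    return total;
}

/* Time saved by migrating: remaining time on the current resources minus
 * (remaining time on the new resources plus the rescheduling cost). */
double migration_benefit(double remaining_current, double remaining_new,
                         double cost)
{
    assert(remaining_current >= 0.0 && remaining_new >= 0.0 && cost >= 0.0);
    return remaining_current - (remaining_new + cost);
}

/* Assumed rule: migrate only if the benefit exceeds min_benefit seconds. */
int should_migrate(double remaining_current, double remaining_new,
                   double cost, double min_benefit)
{
    return migration_benefit(remaining_current, remaining_new, cost)
           > min_benefit;
}
```

With a 900-second cost, an application with 3000 seconds left on loaded machines and 1000 seconds on the new ones would gain 1100 seconds, while one with only 1500 seconds left would lose 400 seconds by moving.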
Experiments and Results: Rescheduling on Request
- Different problem sizes of ScaLAPACK QR
- msc: fast machines; opus: slow machines
- The initial set of resources consisted of 4 msc and 8 opus machines.
- The performance model always chose the 4 msc machines for the application run.
- 5 minutes into the application run, artificial load was introduced on the 4 msc machines.
- The application migrated from UT to UIUC.
- Figure: no rescheduling vs. rescheduling. The rescheduler decided not to reschedule for size 8000, which was the wrong decision.
Rescheduling Depending on Amount of Load
- ScaLAPACK QR problem size –
- Load introduced 20 minutes after application start
- The amount of load was varied.
- Figure: no rescheduling vs. rescheduling. In one case the rescheduler decided not to reschedule, which was the wrong decision.
Rescheduling Depending on Load Introduction Time
- ScaLAPACK QR problem size –
- The same load was introduced at different points of application execution.
- Figure: no rescheduling vs. rescheduling. In one case the rescheduler decided not to reschedule, which was the wrong decision.
Experiments and Results: Opportunistic Rescheduling
- Two problems:
  - 1st problem, size executing on 6 msc machines
  - 2nd problem of varying sizes
- The 2nd problem was introduced 2 minutes after the start of the 1st problem.
- The initial set of resources for the 2nd problem consisted of 6 msc machines and 2 opus machines.
- Because the 1st problem was present, the 2nd problem had to use both the msc and opus machines and therefore communicated over the Internet.
- After the 1st problem completes, the 2nd problem can be rescheduled to use only the msc machines.
- Figure: large problem, no rescheduling vs. rescheduling
Dynamic Prediction of Rescheduling Cost
- While making a rescheduling decision, the rescheduler contacts the RSS and obtains the current data distributions.
- It forms the old and new data maps.
- Based on the maps and current NWS information, it predicts the redistribution cost.
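
One way to picture the old and new data maps is to compute, for a block-distributed array, how many elements each old processor must send to each new processor. The C sketch below does this and applies a simple latency-plus-bandwidth estimate per transfer. It is an illustration under assumptions, not the rescheduler's actual model: the function names are hypothetical, blocks are assumed to hold ceil(n/p) elements each, and the latency and bandwidth values stand in for whatever NWS would report.

```c
#include <assert.h>
#include <stddef.h>

/* First global index owned by rank r under a block distribution of
 * n elements over p processors (block size = ceil(n / p)). */
static size_t block_start(size_t n, size_t p, size_t r)
{
    size_t bs = (n + p - 1) / p;
    size_t s = r * bs;
    return s < n ? s : n;
}

/* One past the last global index owned by rank r. */
static size_t block_end(size_t n, size_t p, size_t r)
{
    size_t bs = (n + p - 1) / p;
    size_t e = (r + 1) * bs;
    return e < n ? e : n;
}

/* Number of elements that old rank a must send to new rank b when the
 * array moves from p_old to p_new processors. */
size_t block_overlap(size_t n, size_t p_old, size_t a,
                     size_t p_new, size_t b)
{
    assert(p_old > 0 && p_new > 0 && a < p_old && b < p_new);
    size_t lo_old = block_start(n, p_old, a), hi_old = block_end(n, p_old, a);
    size_t lo_new = block_start(n, p_new, b), hi_new = block_end(n, p_new, b);
    size_t lo = lo_old > lo_new ? lo_old : lo_new;
    size_t hi = hi_old < hi_new ? hi_old : hi_new;
    return hi > lo ? hi - lo : 0;
}

/* Assumed cost of one transfer: latency plus bytes over bandwidth. */
double transfer_time(size_t elements, size_t elem_size,
                     double bandwidth_bytes_per_s, double latency_s)
{
    assert(bandwidth_bytes_per_s > 0.0 && latency_s >= 0.0);
    if (elements == 0)
        return 0.0;
    return latency_s
           + (double)(elements * elem_size) / bandwidth_bytes_per_s;
}
```

For 16 elements moving from 4 to 8 processors, old rank 0 owns indices 0 to 3 and sends two elements each to new ranks 0 and 1 and none to the others.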
Dynamic Prediction of Rescheduling Cost
- Application started on: 4 msc machines
- Application restarted on: 8 opus machines
References / Sources / Credits
- Gary Shao, Rich Wolski, and Fran Berman. "Predicting the Cost of Redistribution in Scheduling". Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing.
- S. Vadhiyar and J. Dongarra. "Performance Oriented Migration Framework for the Grid". Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), May 2003, Tokyo, Japan.
- L. V. Kale, Sameer Kumar, and J. DeSouza. "A Malleable-Job System for Timeshared Parallel Machines". 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002), May 21-24, 2002, Berlin, Germany.
- See also: the Cactus migration thorn; opportunistic migration by Huedo.
GridWay
- Migration is triggered:
  - When performance degradation happens
  - When "better" resources are discovered
  - When requirements change
  - By owner decision
  - On remote resource failure
- Rescheduling is done at each discovery interval.
- The performance degradation evaluator program is executed at each monitoring interval.
Components
- Request manager
- Dispatch manager
- Submission manager: prologing, submitting, canceling, epiloging
- Performance monitor
- Application-specific components:
  - Resource selector
  - Performance degradation evaluator
  - Prolog
  - Wrapper
  - Epilog
Opportunistic Job Migration
- Factors:
  - Performance of the new host
  - Remaining execution time of the application
  - Proximity of the new resource to the needed data
Related Work
- Dynamic space sharing on clusters of non-dedicated workstations (Chowdhury et al.)
- Dynamic reconfiguration: an application-level approach for dynamic reconfiguration of grid-based iterative applications
SRS Overhead
- Worst-case overhead: 15%
- Worst-case SRS overhead across all results: 36%
SRS Data Redistribution Cost
- Started on: 8 msc machines
- Restarted on: 8 opus and 2 msc machines
Modified GrADS Architecture
- Diagram components: User; Grid Routine / Application Manager; Resource Selector; Performance Modeler; Contract Developer; App Launcher; Contract Monitor; Application; MDS; NWS; Permission Service; RSS; Contract Negotiator; Rescheduler; Database Manager
Another Approach: AMPI
- AMPI: an MPI implementation on top of Charm++
- Processes are implemented as user-level threads.
- Charm++ provides a load balancing framework that migrates threads.
- The load balancing framework accepts a processor map.
- The parallel job is started on all processors in the system.
- Work is allocated only to processors in the processor map, i.e. threads/objects are assigned to processors in the processor map.
Rescheduling
- When the processor map changes:
  - Threads are migrated to the new set of processors in the processor map.
  - Skeleton processes are left behind on the vacated processors.
  - A skeleton forwards messages to the threads/objects previously housed on that processor.
- The new processor map is conveyed to the load balancer framework by the adaptive job scheduler.
Overhead
- Shrink or expand time depends on:
  - The per-process data that has to be transferred
  - The number of processors involved
Cost of skeleton process
CPU utilization by 2 Jobs
Adaptive Job Scheduler
- A variant of the dynamic equipartitioning strategy
- Each job specifies the minimum and maximum number of processors it can run on.
- When a new job arrives, the scheduler recalculates the number of processors assigned to each running job:
  - The running jobs and the new job are first assigned their minimum requirements.
  - The leftover processors are divided equally among all the jobs.
  - The new job is placed in a queue if it cannot be allocated its minimum requirement.
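
The allocation rule above can be sketched in C as follows. This is a minimal sketch, not the scheduler from the cited paper: the function name `equipartition` is hypothetical, the leftover processors are handed out one at a time in round-robin order as one simple way of dividing them equally, and each job is capped at its stated maximum.

```c
#include <assert.h>
#include <stddef.h>

/* Assign processors to jobs: every job first gets its minimum, then the
 * leftover processors are handed out one at a time in round-robin order,
 * never exceeding a job's maximum.
 * Returns 1 on success, or 0 if the minimum requirements do not fit
 * (in which case the newest job would have to wait in a queue). */
int equipartition(size_t njobs, const int *minpe, const int *maxpe,
                  int total_procs, int *alloc)
{
    int needed = 0;
    for (size_t i = 0; i < njobs; i++) {
        assert(minpe[i] >= 1 && minpe[i] <= maxpe[i]);
        needed += minpe[i];
    }
    if (needed > total_procs)
        return 0;

    for (size_t i = 0; i < njobs; i++)
        alloc[i] = minpe[i];

    int leftover = total_procs - needed;
    while (leftover > 0) {
        int given = 0;
        for (size_t i = 0; i < njobs && leftover > 0; i++) {
            if (alloc[i] < maxpe[i]) {
                alloc[i]++;
                leftover--;
                given = 1;
            }
        }
        if (!given)
            break; /* every job is at its maximum; processors stay idle */
    }
    return 1;
}
```

For two jobs with minimums 2 and 4 on 32 processors, each job receives 13 of the 26 leftover processors, giving allocations of 15 and 17.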
Scheduling
- The same strategy is followed when jobs complete.
- The scheduler conveys its decision to the jobs as a bit-vector.
- The jobs then perform thread migration.
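
A bit-vector processor map can be represented compactly when the machine has at most 64 processors. The C sketch below is an illustration only: the encoding (bit i set means processor i belongs to the job), the `procmap` type, and all function names are assumptions, since the slides do not describe the actual format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding: bit i is set if processor i is in the job's map. */
typedef uint64_t procmap;

/* Map containing processors first .. first+count-1. */
procmap procmap_from_range(unsigned first, unsigned count)
{
    assert(first + count <= 64);
    procmap bits = count >= 64 ? ~(procmap)0
                               : (((procmap)1 << count) - 1);
    return count == 0 ? 0 : bits << first;
}

/* 1 if processor p is in the map, 0 otherwise. */
int procmap_contains(procmap m, unsigned p)
{
    assert(p < 64);
    return (int)((m >> p) & 1u);
}

/* Number of processors in the map. */
unsigned procmap_count(procmap m)
{
    unsigned c = 0;
    while (m) {
        m &= m - 1; /* clear the lowest set bit */
        c++;
    }
    return c;
}

/* Processors the job must vacate when its map changes from old to new;
 * threads on these processors would have to migrate. */
procmap procmap_vacated(procmap old_map, procmap new_map)
{
    return old_map & ~new_map;
}
```

Shrinking a job from processors 0 to 7 down to processors 0 to 3 yields a vacated set of processors 4 to 7.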
Experiments
- 32 processor Linux cluster
- Job arrivals follow a Poisson process.
- Each job is a molecular dynamics (MD) program with 50,000 atoms and a different number of iterations.
- The number of iterations is exponentially distributed.
- Minimum number of processors, minpe: uniformly distributed between 1 and 64
- maxpe: 64
- Each experiment: 50 job arrivals
Results
- Load factor = mean arrival rate × (execution time on 64 processors)
Dynamic Reconfiguration
- The ability to change the number of processors during execution
- Condor-like environment:
  - Respect ownership of workstations
  - Provide high performance for parallel applications
- Dynamic reconfiguration also provides high throughput for the system.