Updates on “job checkpointing and partitioning” Massimo Sgaravatto INFN Padova.


1 Updates on “job checkpointing and partitioning” Massimo Sgaravatto INFN Padova

2 Changes in the doc. (wrt. prev. release)
  - Removed files from the job state
      - The state is defined just by <var, value> pairs
  - Not possible to move files from sub-jobs to the job aggregator with job partitioning
      - They must be saved to a SE, and their identifiers specified as <var, value> pairs in the sub-jobs' final job states
  - LB server used to persistently save the job states
      - Removed chkpt-server
  - Possibility to specify a pre-job (besides the job aggregator) in job partitioning

3 Changes in the doc. (wrt. prev. release)
  - Two new functions added to the API
      - set_final_state: specifies that the state is the last one
      - is_final_state: is this state the last one (i.e. was it "marked" using the set_final_state method)?
  - The check that all the sub-jobs have saved their final states is done by the job aggregator
      - The job aggregator is responsible for deciding the policy (e.g. all sub-jobs must have saved their final states, at least one sub-job must have saved its final state, at least x% of the sub-jobs must have saved their final states, ...)
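
The policies listed on this slide could be sketched as follows. This is a minimal illustration, not the WP1 implementation: the names `all_final`, `any_final` and `at_least_fraction` are hypothetical, and the input is assumed to be a mapping from each sub-job identifier to the result that is_final_state would report for it.

```python
from typing import Callable, Dict

# Assumption: sub-job identifier -> True if that sub-job saved a final state.
FinalFlags = Dict[str, bool]
Policy = Callable[[FinalFlags], bool]


def all_final(flags: FinalFlags) -> bool:
    """Policy: every sub-job must have saved its final state."""
    return bool(flags) and all(flags.values())


def any_final(flags: FinalFlags) -> bool:
    """Policy: at least one sub-job must have saved its final state."""
    return any(flags.values())


def at_least_fraction(fraction: float) -> Policy:
    """Policy factory: at least `fraction` (0.0 to 1.0) of the sub-jobs
    must have saved their final states."""
    def policy(flags: FinalFlags) -> bool:
        if not flags:
            return False
        return sum(flags.values()) / len(flags) >= fraction
    return policy


# Example: two of three sub-jobs have saved a final state.
flags = {"sub1": True, "sub2": False, "sub3": True}
can_aggregate = at_least_fraction(0.5)(flags)
```

Keeping each policy as a plain function lets the job aggregator choose one without changing the code that collects the flags.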

4 APIs

    Object State: {
        // Data Members
        Label_t     state_id = "label";
        VarValueSet var_value_pairs[] = {"var1"="value1", "var2"="value2", ... };
        StepsSet    main_stepper = {"element1", "element2", "element3", ... };
        Label_t     current_step;

        // Methods
        int     save_value(Pair);
        int     save_state();
        string  get_string_value(string);
        int     get_int_value(string);
        double  get_double_value(string);
        State   load_state(Label_t);
        Label_t get_next_step();
        int     set_final_state();
        bool    is_final_state(Label_t);
    }
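
To make the intended use of this interface more concrete, here is a rough Python sketch. It mirrors the member and method names from the slide, but everything else is an assumption: the in-memory `_STORE` dictionary is a hypothetical stand-in for the LB server, `save_value` takes a variable and a value instead of a Pair object, return codes are simply 0, and `get_next_step` assumes that step names are unique.

```python
from typing import Dict, List, Optional, Union

Value = Union[str, int, float]

# Hypothetical stand-in for the LB server that persists job states.
_STORE: Dict[str, dict] = {}


class State:
    def __init__(self, state_id: str, steps: List[str]):
        self.state_id = state_id
        self.var_value_pairs: Dict[str, Value] = {}
        self.main_stepper = list(steps)
        self.current_step: Optional[str] = None
        self._final = False

    def save_value(self, var: str, value: Value) -> int:
        self.var_value_pairs[var] = value
        return 0

    def set_final_state(self) -> int:
        # Mark this state as the last one; persisted by save_state().
        self._final = True
        return 0

    def save_state(self) -> int:
        _STORE[self.state_id] = {
            "pairs": dict(self.var_value_pairs),
            "steps": list(self.main_stepper),
            "current_step": self.current_step,
            "final": self._final,
        }
        return 0

    def get_string_value(self, var: str) -> str:
        return str(self.var_value_pairs[var])

    def get_int_value(self, var: str) -> int:
        return int(self.var_value_pairs[var])

    def get_double_value(self, var: str) -> float:
        return float(self.var_value_pairs[var])

    def get_next_step(self) -> Optional[str]:
        # Advance to the next step; return None when no steps remain.
        if self.current_step is None:
            idx = 0
        else:
            idx = self.main_stepper.index(self.current_step) + 1
        if idx >= len(self.main_stepper):
            return None
        self.current_step = self.main_stepper[idx]
        return self.current_step

    @staticmethod
    def load_state(state_id: str) -> "State":
        saved = _STORE[state_id]
        state = State(state_id, saved["steps"])
        state.var_value_pairs = dict(saved["pairs"])
        state.current_step = saved["current_step"]
        state._final = saved["final"]
        return state

    @staticmethod
    def is_final_state(state_id: str) -> bool:
        return _STORE[state_id]["final"]
```

A job would typically call get_next_step, do the work for that step, record its progress with save_value and save_state, and after a restart resume from load_state instead of starting over.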

5 Issues
  - Specification of JobSteps for the job aggregator
      - Should be the identifiers of the final states of the sub-jobs
      - Possible approach: sub-jobs' state ids represented by the sub-jobs' dg-job-ids
          -> Necessary to know the dg-job-ids of the sub-jobs given the dg-job-id of the original "partitionable" job (the dg-job-id associated with the DAG)
      - Needed also to allow dg-get-job-chkpt for a partitionable job (dg-job-id of the partitionable job given as argument)
          - Should return the states of its various sub-jobs
  - Avoid that all sub-jobs are submitted to the same CE
      - Same problem also when a bunch of jobs with the same Requirements and Rank are submitted together (EstimatedTraversalTime not promptly updated)
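
The lookup that the first issue calls for might look like the sketch below. It is purely illustrative: the function name `get_job_chkpt`, the two dictionaries and the placeholder identifiers are all hypothetical, and a real implementation would query the LB server instead of in-memory tables.

```python
from typing import Dict, List

# Hypothetical table: dg-job-id of the DAG -> dg-job-ids of its sub-jobs.
SUB_JOBS: Dict[str, List[str]] = {
    "dag-job-0": ["sub-job-1", "sub-job-2"],
}

# Hypothetical table of saved states, keyed by state id. Under the proposed
# approach the state id of a sub-job is that sub-job's dg-job-id.
SAVED_STATES: Dict[str, Dict[str, object]] = {
    "sub-job-1": {"events_done": 100},
}


def get_job_chkpt(dag_job_id: str) -> Dict[str, Dict[str, object]]:
    """Return the saved states of all sub-jobs of a partitionable job.

    Sub-jobs that have not saved any state yet are left out of the result.
    """
    return {
        sub_id: SAVED_STATES[sub_id]
        for sub_id in SUB_JOBS.get(dag_job_id, [])
        if sub_id in SAVED_STATES
    }
```

The key point is that the result can only be built if the mapping from the partitionable job to its sub-jobs is available, which is exactly the dependency raised on this slide.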

6 Next steps
  - Some time (10 days?) for other WP1 internal comments, and then submit to WP8 TWG?
  - Definition of the architecture in much more detail
  - Coordination with other teams, in particular CESNET (LB) and CNAF (DAGMAN)

