EU 2nd Year Review – 04-05 Feb. 2003 – WP1 Demo – n° 1 WP1 demo Grid “logical” checkpointing Fabrizio Pacini (Datamat SpA, WP1 )


1 EU 2nd Year Review – 04-05 Feb. 2003 – WP1 Demo
WP1 demo: Grid “logical” checkpointing
Fabrizio Pacini (Datamat SpA, WP1), fabrizio.pacini@datamat.it

2 Middleware Demo Roadmap
- Data Management (WP2)
- Storage Element (WP5)
- Fabric Management (WP4)
- Networking (WP7)
- Information Service (WP3)
- Workload Management (WP1)

3 Job checkpointing
- Checkpointing: saving the job state from time to time
  - Useful to prevent data loss due to unexpected failures
  - Allows job preemption
  - Also exploited in the job partitioning framework (see D1.4 for details)
- Approach: provide users with a “trivial” logical job checkpointing service
  - The user can save the state of the job (as defined by the application) from time to time
  - A job can be restarted from an intermediate (i.e. previously saved) job state
- Different from “classical” checkpointing (i.e. saving all the information related to a process: the process’s data and stack segments, open files, etc.)
  - Classical checkpointing is very difficult to apply (e.g. it is hard to save the state of open network connections)
  - It is not necessary for all the DataGrid reference applications
    - Sequential processing cases
    - The state of the application is represented by a small amount of information defined by the application itself

4 Job checkpointing example
Example of application (e.g. HEP Monte Carlo simulation):

    int main () {
      …
      for (int i=event; i < EVMAX; i++) { ;}
      ...
      exit(0);
    }

5 Job checkpointing example
User code must be easily instrumented in order to exploit the checkpointing framework:

    #include "checkpointing.h"
    int main () {
      JobState state(JobState::job);
      event = state.getIntValue("first_event");
      PFN_of_file_on_SE = state.getStringValue("filename");
      ….
      var_n = state.getBoolValue("var_n");
      ;
      …
      for (int i=event; i < EVMAX; i++) {
        ;
        ...
        state.saveValue("first_event", i+1);
        ;
        state.saveValue("filename", PFN_of_file_on_SE);
        ...
        state.saveValue("var_n", value_n);
        state.saveState();
      }
      …
      exit(0);
    }

6 Job checkpointing example (code as on slide 5)
- The user defines what a state is
- A state is defined as a set of name/value pairs (the arguments of the state.saveValue calls)
- It must be “enough” to restart a computation from a previously saved state

7 Job checkpointing example (code as on slide 5)
- The user can save the state of the job from time to time: the state.saveValue calls followed by state.saveState() inside the event loop

8 Job checkpointing example (code as on slide 5)
- Retrieval of the last saved state: the state.getIntValue, state.getStringValue and state.getBoolValue calls at the start of main
- The job can restart from that point
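
The save/retrieve pattern of the example slides can be tried without the WP1 library. The following is a minimal sketch under stated assumptions, not the real checkpointing.h API: MockJobState, Store and run_job are hypothetical names, a std::map stands in for the place where committed states are kept, and getIntValue takes a default-value argument that the slides do not show. Only the method names mirror the slides.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for the server that keeps the last committed name/value pairs.
using Store = std::map<std::string, std::string>;

// Hypothetical mock, NOT the WP1 JobState class from "checkpointing.h".
class MockJobState {
public:
    explicit MockJobState(Store& backend) : backend_(backend), pending_(backend) {}

    // Stage a name/value pair; nothing is committed until saveState().
    void saveValue(const std::string& name, int value) {
        pending_[name] = std::to_string(value);
    }
    void saveValue(const std::string& name, const std::string& value) {
        pending_[name] = value;
    }

    // Commit every staged pair as the new checkpoint.
    void saveState() { backend_ = pending_; }

    // Read from the last committed checkpoint; use the fallback on a first run.
    int getIntValue(const std::string& name, int fallback) const {
        auto it = backend_.find(name);
        return it == backend_.end() ? fallback : std::stoi(it->second);
    }
    std::string getStringValue(const std::string& name) const {
        auto it = backend_.find(name);
        return it == backend_.end() ? std::string() : it->second;
    }

private:
    Store& backend_;  // committed state, outlives the job
    Store pending_;   // staged state, lost if the job dies before saveState()
};

// Processes events up to (not including) stop_before, checkpointing after
// each one. Returns the index of the first event handled in this run.
int run_job(Store& store, int stop_before) {
    assert(stop_before >= 0);
    MockJobState state(store);
    const int first = state.getIntValue("first_event", 0);
    for (int i = first; i < stop_before; ++i) {
        // ... process event i (application-specific) ...
        state.saveValue("first_event", i + 1);
        state.saveState();
    }
    return first;
}
```

Calling run_job twice on the same Store, first with a small stop_before and then with the full event count, shows the second run starting where the first one stopped. Staging values separately from committing them keeps a half-updated state from ever being the one a restarted job reads.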

9 Job checkpointing
- Saving of job checkpoint state: the job calls state.saveState()
- Job checkpoint states are saved in the Logging & Bookkeeping (LB) server, which also serves the retrieval of job checkpoints
- The LB server:
  - Is also used (even in rel. 1) as the repository of job status information
  - Has already proved to be robust and reliable
  - Can share its load among multiple LB servers, to address scalability problems

10 Demo
- Purpose
  - To show how job checkpointing helps to address and manage failures
- Application used for the demo
  - HEP application which fills a histogram
  - Application instrumented with the WP1 checkpointing library
    - To save the intermediate state from time to time (number of events processed so far and pathname of the intermediate histogram file)
    - To be able to restart its computation from a previously saved state
- Scenario
  - Job submitted to a CE
  - While the job runs, it saves its state from time to time
  - Job failure (triggered by simulating a CE problem by hand)
  - Job resubmitted by the WMS, possibly to a different CE
  - Job restarts its computation from the last saved state
    - No need to restart from the beginning
    - The computation done up to that moment is not lost
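
The demo scenario can be mimicked in a few lines. This is only a sketch under stated assumptions: Checkpoint, Storage, run_on_ce, the 4-bin histogram, the file names and the checkpoint interval of 5 events are all invented for illustration, and an in-memory map stands in for wherever the intermediate histogram file is really kept. What it keeps from the slides is the shape of the saved state (events processed so far plus the pathname of the intermediate histogram) and the failure-then-restart sequence.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// All names below are illustrative; none come from the WP1 library.
struct Checkpoint {
    int events_done = 0;         // number of events processed so far
    std::string histogram_path;  // pathname of the intermediate histogram
};

using Histogram = std::vector<int>;
using Storage = std::map<std::string, Histogram>;  // stands in for file storage

constexpr int kBins = 4;

// Runs the job on one CE. Returns true if every event was processed, false
// if the simulated failure hit first. A negative fail_at means "never fail".
bool run_on_ce(Checkpoint& saved, Storage& storage, int total_events, int fail_at) {
    assert(total_events >= 0);
    // Restart from the last saved state, if there is one.
    int event = saved.events_done;
    Histogram hist = saved.histogram_path.empty()
                         ? Histogram(kBins, 0)
                         : storage.at(saved.histogram_path);
    for (; event < total_events; ++event) {
        if (event == fail_at) return false;  // CE problem: unsaved work is lost
        ++hist[event % kBins];               // "process" the event
        if ((event + 1) % 5 == 0) {          // checkpoint every 5 events
            storage["hist.partial"] = hist;
            saved.events_done = event + 1;
            saved.histogram_path = "hist.partial";
        }
    }
    storage["hist.final"] = hist;
    return true;
}
```

A first call with fail_at set to some event index plays the failing CE; a second call with the same Checkpoint and Storage and a negative fail_at plays the resubmitted job. Only the events after the last checkpoint are repeated, and the final histogram matches that of an uninterrupted run.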

11 Testbed for this demo
- UI
  - Running on a notebook here (Linux 6.2)
- Other WMS services (NS, WM, JC, LB)
  - Running on a machine at INFN-CNAF, Bologna (Linux RH 6.2)
- CEs
  - A notebook here (Linux RH 6.2): the one which will have a problem …
  - An LSF farm at INFN-Padova (Linux RH 6.2)
  - A PBS farm at INFN-Milano (Linux RH 7.3)
  - A PBS farm at CESNET-Prague (Debian 2.2)

12 Job checkpointing scenario
[Diagram components: UI, Network Server, Job Contr. - CondorG, Workload Manager, RB node, Computing Element X, Computing Element Y, Logging & Bookkeeping Server]

13 Job checkpointing scenario
[Diagram: components as on slide 12, plus Replica Catalog]
- Job status: submitted
- UI: allows users to access the functionalities of the WMS
- Command issued from the UI: edg-job-submit jobchkpt.jdl
- The Job Description Language (JDL) is used to specify job characteristics and requirements
- jobchkpt.jdl:

    [
      JobType = "Checkpointable";
      Executable = "hsum.exe";
      StdOutput = "Outfile";
      InputSandbox = "/home/user/hsum.exe";
      OutputSandbox = "Outfile";
      Requirements = member("ROOT", other.GlueHostApplicationSoftwareRunTimeEnvironment) && member("CHKPT", other.GlueHostApplicationSoftwareRunTimeEnvironment);
      Rank = -other.GlueCEStateEstimatedResponseTime;
    ]

14 Job checkpointing scenario
[Diagram: components as on slide 12, plus RB storage, Match-maker and Job Adapter; the arrows, numbered 1 to 6, are labelled “Job” and “Job Input Sandbox files”]
- Job status: submitted, waiting, ready, scheduled, running

15 Job checkpointing scenario
[Diagram: components as on slide 12, plus RB storage]
- Job status: submitted, waiting, ready, scheduled, running
- From time to time the user’s job asks to save the intermediate state:

    …;
    state.saveValue("var1", value1);
    …
    state.saveValue("varn", valuen);
    state.saveState();
    …

16 Job checkpointing scenario
[Diagram: components as on slide 12, plus RB storage]
- Job status: submitted, waiting, ready, scheduled, running
- Saving of job state
- Saving of intermediate files

17 Job checkpointing scenario
[Diagram: components as on slide 12, plus RB storage]
- Job status: submitted, waiting, ready, scheduled, running, done (failed)
- The job fails (e.g. because of a CE problem)

18 Job checkpointing scenario
[Diagram: components as on slide 12, plus RB storage and Match-maker]
- Job status: submitted, waiting, ready, scheduled, running, done (failed), waiting
- Match-maker: reschedule and resubmit the job
- Where must this job be executed? Possibly on a CE different from the one where the job was previously submitted …

19 Job checkpointing scenario
[Diagram: components as on slide 12, plus RB storage and Match-maker]
- Job status: submitted, waiting, ready, scheduled, running, done (failed), waiting
- Match-maker CE choice: CEy
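
One way to picture the match-maker's choice is as a filter followed by a ranking, mirroring the Requirements and Rank expressions in jobchkpt.jdl (slide 13). The sketch below is an illustrative model, not the WMS implementation: ComputingElement, its fields and choose_ce are hypothetical names, and skipping the CE where the job just failed is a policy assumed here (the slides only say the job is "possibly" sent to a different CE).

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Illustrative view of a CE as seen by the match-maker; field names are assumptions.
struct ComputingElement {
    std::string id;
    std::set<std::string> runtime_env;  // published software tags
    int estimated_response_time;        // lower is better
};

// Returns the best matching CE, or nullptr if none qualifies.
// The CE named by failed_ce is skipped (assumed policy for this sketch).
const ComputingElement* choose_ce(const std::vector<ComputingElement>& ces,
                                  const std::string& failed_ce) {
    const ComputingElement* best = nullptr;
    for (const ComputingElement& ce : ces) {
        // Requirements: the CE must publish both "ROOT" and "CHKPT".
        const bool matches = ce.runtime_env.count("ROOT") > 0 &&
                             ce.runtime_env.count("CHKPT") > 0;
        if (!matches || ce.id == failed_ce) continue;
        assert(ce.estimated_response_time >= 0);
        // Rank is the negated response time, so the smallest time wins.
        if (best == nullptr ||
            ce.estimated_response_time < best->estimated_response_time) {
            best = &ce;
        }
    }
    return best;
}
```

With two qualifying CEs, passing the identifier of the one that just failed makes the function return the other, which corresponds to the "CE choice: CEy" step of the scenario.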

20 Job checkpointing scenario
[Diagram: components as on slide 12, plus RB storage and Job Adapter; an arrow is labelled “CE characts & status”]
- Job status: ready, scheduled, running, done (failed), waiting, ready

21 Job checkpointing scenario
[Diagram: components as on slide 12, plus RB storage; an arrow is labelled “Job Input Sandbox files”]
- Job status: ready, scheduled, running, done (failed), waiting, ready, scheduled

22 Job checkpointing scenario
[Diagram: components as on slide 12, plus RB storage]
- Job status: scheduled, running, done (failed), waiting, ready, scheduled, running
- Retrieval of the last saved state when the job starts
- Retrieval of intermediate files (previously saved)

23 Job checkpointing scenario
[Diagram: components as on slide 12, plus RB storage]
- Job status: scheduled, running, done (failed), waiting, ready, scheduled, running
- The job keeps running from the point corresponding to the retrieved state (it does not need to start from the beginning)

24 Summary
- The Workload Management System was re-factored to streamline the flow of job information, thereby addressing problems and shortcomings found with release 1.x.
- The re-factored components also provide hooks and features to support new functionality.
- Among these, we chose to demonstrate Grid “logical” checkpointing, as it allows applications to achieve one very important degree of freedom over the Grid and is minimally intrusive.
- The implemented Checkpointing API has been discussed with the DataGrid reference applications since June 2002, and was presented and well received in the GGF Grid Checkpointing WG.

