Published by Eugene Horton. Modified over 8 years ago.
EU 2nd Year Review – 04-05 Feb. 2003 – WP1 Demo
WP1 demo: Grid “logical” checkpointing
Fabrizio Pacini (Datamat SpA, WP1), fabrizio.pacini@datamat.it
Middleware Demo Roadmap
- Data Management (WP2)
- Storage Element (WP5)
- Fabric Management (WP4)
- Networking (WP7)
- Information Service (WP3)
- Workload Management (WP1)
Job checkpointing
- Checkpointing: saving the job state from time to time
  - Useful to prevent data loss due to unexpected failures
  - Allows job preemption
  - Also exploited in the job partitioning framework (see D1.4 for details)
- Approach: provide users with a “trivial” logical job checkpointing service
  - The user can save the state of the job (as defined by the application) from time to time
  - A job can be restarted from an intermediate (i.e. previously saved) job state
- Different from “classical” checkpointing (i.e. saving all the information related to a process: the process’s data and stack segments, open files, etc.)
  - Classical checkpointing is very difficult to apply (e.g. problems saving the state of open network connections)
  - It is not necessary for all the DataGrid reference applications
    - Sequential processing cases
    - The state of the application is represented by a small amount of information defined by the application itself
Job checkpointing example
Example of an application (e.g. HEP Monte Carlo simulation):

    int main ()
    {
        …
        for (int i=event; i < EVMAX; i++) {
            ;
        }
        ...
        exit(0);
    }
Job checkpointing example
User code must be easily instrumented in order to exploit the checkpointing framework:

    #include "checkpointing.h"

    int main ()
    {
        JobState state(JobState::job);
        event = state.getIntValue("first_event");
        PFN_of_file_on_SE = state.getStringValue("filename");
        ….
        var_n = state.getBoolValue("var_n");
        ;
        …
        for (int i=event; i < EVMAX; i++) {
            ;
            ...
            state.saveValue("first_event", i+1);
            ;
            state.saveValue("filename", PFN_of_file_on_SE);
            ...
            state.saveValue("var_n", value_n);
            state.saveState();
        }
        …
        exit(0);
    }

- The user defines what a state is
  - Defined as (variable, value) pairs
  - Must be “enough” to restart a computation from a previously saved state
- The user can save the state of the job from time to time (the saveValue and saveState calls inside the loop)
- Retrieval of the last saved state (the getIntValue, getStringValue and getBoolValue calls at the start of main)
  - The job can restart from that point
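The save-and-restart pattern on these slides can be tried without the WP1 library. The sketch below is only an illustration under stated assumptions: `MockJobState`, `run_job`, `reset_state` and the one-pair-per-line file format are hypothetical names and choices, not part of the real checkpointing API, and a local file stands in for the remote state store. It only mirrors the `getIntValue` / `saveValue` / `saveState` call sequence shown above.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <map>
#include <string>

// Hypothetical stand-in for the JobState class on the slides.
// Assumption: state is kept in a local text file, one "key value" pair per line.
class MockJobState {
public:
    explicit MockJobState(const std::string& path) : path_(path) {
        std::ifstream in(path_);
        std::string key, value;
        // If the file does not exist yet, the loop body never runs (first run).
        while (in >> key >> value) saved_[key] = value;
        pending_ = saved_;
    }
    // Returns the last saved value, or `fallback` when nothing was saved yet.
    int getIntValue(const std::string& key, int fallback = 0) const {
        auto it = saved_.find(key);
        return it == saved_.end() ? fallback : std::stoi(it->second);
    }
    // Records a pair in memory; nothing is persisted until saveState().
    void saveValue(const std::string& key, int value) {
        pending_[key] = std::to_string(value);
    }
    // Persists all recorded pairs as one checkpoint.
    void saveState() {
        std::ofstream out(path_, std::ios::trunc);
        for (const auto& kv : pending_) out << kv.first << ' ' << kv.second << '\n';
        saved_ = pending_;
    }
private:
    std::string path_;
    std::map<std::string, std::string> saved_, pending_;
};

// Deletes a checkpoint file so that the next run starts from event 0.
void reset_state(const std::string& path) { std::remove(path.c_str()); }

// Processes events from the last checkpoint up to min(ev_max, stop_at),
// checkpointing after each event. `stop_at` simulates an interrupted run.
// Returns the number of events processed in this run.
int run_job(const std::string& state_file, int ev_max, int stop_at) {
    assert(ev_max >= 0);
    MockJobState state(state_file);
    int first = state.getIntValue("first_event", 0);
    int processed = 0;
    for (int i = first; i < ev_max && i < stop_at; ++i) {
        ++processed;                           // placeholder for real event processing
        state.saveValue("first_event", i + 1); // next event to process on restart
        state.saveState();
    }
    return processed;
}
```

Calling `run_job` twice on the same file, first with a small `stop_at` and then with `stop_at` equal to `ev_max`, shows the second run picking up at the first unprocessed event instead of at event 0.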
Job checkpointing
[Diagram: the job saves its checkpoint state to the Logging & Bookkeeping (LB) Server with state.saveState() and retrieves its checkpoint from the same server.]
Job checkpoint states are saved in the LB server:
- Also used (even in rel. 1) as the repository of job status info
- Already proved to be robust and reliable
- The load can be distributed between multiple LB servers to address scalability problems
Demo
- Purpose
  - To show how job checkpointing helps address and manage failures
- Application used for the demo
  - HEP application which fills a histogram
  - Application instrumented with the WP1 checkpointing library
    - To save the intermediate state from time to time (number of events processed so far and pathname of the intermediate histogram file)
    - To be able to restart its computation from a previously saved state
- Scenario
  - Job submitted to a CE
  - While the job runs, it saves its state from time to time
  - Job failure (triggered by manually simulating a CE problem)
  - Job resubmitted by the WMS, possibly to a different CE
  - Job restarts its computation from the last saved state
    - No need to restart from the beginning
    - The computation done up to that moment is not lost
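The demo scenario (failure on one CE, resubmission to another, restart from the last saved state) can be simulated in a few lines. This is a sketch under assumptions, not the WMS logic: `CheckpointStore`, `ComputingElement`, `run_on_ce` and `submit_with_resubmission` are hypothetical names, the store is an in-memory map rather than an LB server, and "resubmission" simply means trying the next CE in a list.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for the place where checkpoints are kept.
struct CheckpointStore {
    std::map<std::string, int> next_event; // job id -> first event still to process
};

// Hypothetical CE description: fails_after < 0 means the CE never fails.
struct ComputingElement {
    std::string name;
    int fails_after; // events processed on this CE before the simulated failure
};

// Runs the job on one CE, resuming from the stored checkpoint.
// Appends the index of every processed event to `log`.
// Throws std::runtime_error to simulate a CE problem.
void run_on_ce(const ComputingElement& ce, const std::string& job_id,
               int ev_max, CheckpointStore& store, std::vector<int>& log) {
    int done_here = 0;
    // operator[] yields 0 for a job that has no checkpoint yet.
    for (int i = store.next_event[job_id]; i < ev_max; ++i) {
        if (ce.fails_after >= 0 && done_here == ce.fails_after)
            throw std::runtime_error("simulated failure on " + ce.name);
        log.push_back(i);                 // placeholder for filling the histogram
        ++done_here;
        store.next_event[job_id] = i + 1; // checkpoint after each event
    }
}

// Tries the CEs in order until the job completes.
// Returns the name of the CE where the job finished, or "" if all CEs failed.
std::string submit_with_resubmission(const std::vector<ComputingElement>& ces,
                                     const std::string& job_id, int ev_max,
                                     CheckpointStore& store, std::vector<int>& log) {
    assert(ev_max >= 0);
    for (const auto& ce : ces) {
        try {
            run_on_ce(ce, job_id, ev_max, store, log);
            return ce.name;
        } catch (const std::runtime_error&) {
            // Job failed on this CE; fall through and resubmit to the next one.
        }
    }
    return "";
}
```

With a first CE that fails after three events and a second one that never fails, the log ends up containing each event index exactly once: the work done before the failure is not repeated on the second CE.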
Testbed for this demo
- UI
  - Running on a notebook here (Linux 6.2)
- Other WMS services (NS, WM, JC, LB)
  - Running on a machine at INFN-CNAF, Bologna (Linux RH 6.2)
- CEs
  - A notebook here (Linux RH 6.2): the one which will have a problem …
  - An LSF farm at INFN-Padova (Linux RH 6.2)
  - A PBS farm at INFN-Milano (Linux RH 7.3)
  - A PBS farm at CESNET-Prague (Debian 2.2)
Job checkpointing scenario
[Diagram components: UI, Network Server, Job Contr. - CondorG, Workload Manager, RB node, Computing Element X, Computing Element Y, Logging & Bookkeeping Server]
Job checkpointing scenario
[Diagram: UI, Network Server, Job Contr. - CondorG, Workload Manager, Replica Catalog, RB node, Computing Element X, Computing Element Y, Logging & Bookkeeping Server]
Job status: submitted
UI: allows users to access the functionalities of the WMS

    edg-job-submit jobchkpt.jdl

Job Description Language (JDL) is used to specify job characteristics and requirements.
jobchkpt.jdl:

    [
      JobType = "Checkpointable";
      Executable = "hsum.exe";
      StdOutput = "Outfile";
      InputSandbox = "/home/user/hsum.exe";
      OutputSandbox = "Outfile";
      Requirements = member("ROOT", other.GlueHostApplicationSoftwareRunTimeEnvironment) && member("CHKPT", other.GlueHostApplicationSoftwareRunTimeEnvironment);
      Rank = -other.GlueCEStateEstimatedResponseTime;
    ]
Job checkpointing scenario
[Diagram: numbered steps 1 to 6 involving the Job, the Job Input Sandbox files, RB storage, the Matchmaker and the Job Adapter, between the UI, the RB node, the Computing Elements and the Logging & Bookkeeping Server]
Job status: submitted -> waiting -> ready -> scheduled -> running
Job checkpointing scenario
[Diagram: job running on a Computing Element]
Job status: submitted -> waiting -> ready -> scheduled -> running
From time to time the user’s job asks to save the intermediate state:

    …
    ;
    state.saveValue("var1", value1);
    …
    state.saveValue("varn", valuen);
    state.saveState();
    …
Job checkpointing scenario
[Diagram: the running job saves its state and its intermediate files]
Job status: submitted -> waiting -> ready -> scheduled -> running
- Saving of job state
- Saving of intermediate files
Job checkpointing scenario
Job status: submitted -> waiting -> ready -> scheduled -> running -> done (failed)
The job fails (e.g. because of a CE problem).
Job checkpointing scenario
Job status: submitted -> waiting -> ready -> scheduled -> running -> done (failed) -> waiting
Matchmaker: reschedule and resubmit the job.
Where must this job be executed? Possibly on a CE different from the one where the job was previously submitted …
Job checkpointing scenario
Job status: submitted -> waiting -> ready -> scheduled -> running -> done (failed) -> waiting
Matchmaker CE choice: CEy
Job checkpointing scenario
[Diagram: the Job Adapter prepares the job using the CE characteristics and status]
Job status: ready -> scheduled -> running -> done (failed) -> waiting -> ready
Job checkpointing scenario
[Diagram: the job and its Input Sandbox files are transferred to a Computing Element]
Job status: ready -> scheduled -> running -> done (failed) -> waiting -> ready -> scheduled
Job checkpointing scenario
Job status: scheduled -> running -> done (failed) -> waiting -> ready -> scheduled -> running
- Retrieval of the last saved state when the job starts
- Retrieval of the intermediate files (previously saved)
Job checkpointing scenario
Job status: scheduled -> running -> done (failed) -> waiting -> ready -> scheduled -> running
The job keeps running from the point corresponding to the retrieved state (it does not need to start from the beginning).
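The job status sequence shown across the scenario slides can be written down as a small state machine. The sketch below is an assumption-laden illustration: the `Status` enum, `status_name`, `next_status` and `status_history` are hypothetical names, and the transitions are only those visible on the slides (no successful "done" state is modelled because the slides do not show one).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Statuses as they appear on the scenario slides.
enum class Status { Submitted, Waiting, Ready, Scheduled, Running, DoneFailed };

std::string status_name(Status s) {
    switch (s) {
        case Status::Submitted:  return "submitted";
        case Status::Waiting:    return "waiting";
        case Status::Ready:      return "ready";
        case Status::Scheduled:  return "scheduled";
        case Status::Running:    return "running";
        case Status::DoneFailed: return "done (failed)";
    }
    return "unknown";
}

// Status that follows `s`. `failed` only matters while the job is running.
Status next_status(Status s, bool failed) {
    switch (s) {
        case Status::Submitted:  return Status::Waiting;
        case Status::Waiting:    return Status::Ready;
        case Status::Ready:      return Status::Scheduled;
        case Status::Scheduled:  return Status::Running;
        case Status::Running:    return failed ? Status::DoneFailed : Status::Running;
        case Status::DoneFailed: return Status::Waiting; // resubmission by the WMS
    }
    return s;
}

// Status history of a job that fails `failures` times while running
// and is resubmitted after each failure. Ends in "running".
std::vector<std::string> status_history(int failures) {
    assert(failures >= 0);
    std::vector<std::string> out;
    Status s = Status::Submitted;
    out.push_back(status_name(s));
    int remaining = failures;
    while (true) {
        if (s == Status::Running) {
            if (remaining == 0) break; // no more simulated failures
            --remaining;
            s = next_status(s, true);
        } else {
            s = next_status(s, false);
        }
        out.push_back(status_name(s));
    }
    return out;
}
```

For one failure, `status_history(1)` yields the ten-entry sequence seen on the last scenario slides: the five statuses up to running, then done (failed), then waiting through running again.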
Summary
- The Workload Management System was re-factored to streamline the flow of job information, thereby addressing problems and shortcomings found with release 1.x.
- The re-factored components also provide hooks and features to support new functionality.
- Among these, we chose to demonstrate Grid “logical” checkpointing, as it allows applications to achieve one very important degree of freedom over the Grid and is minimally intrusive.
- The implemented Checkpointing API has been discussed with the DataGrid reference applications since June 2002, and was presented and well received in the GGF Grid Checkpointing WG.