Published by Delilah Shelton. Modified over 9 years ago.
1
Grid checkpointing in the European DataGrid Project
Alessio Gianelle, INFN Padova
Rosario Peluso, INFN Padova
Francesco Prelz, INFN Milano
Massimo Sgaravatto, INFN Padova
massimo.sgaravatto@pd.infn.it
2
DataGrid Project (EDG)
- DataGrid goal: Grid software projects meet real-life scientific applications (High Energy Physics, Earth Observation, Biology) and their deadlines, with mutual benefit
- Bring the issues of data identification, location, transfer and access into the picture
- Middleware development and integration of existing middleware
- Large-scale testbed
- Production-quality demonstration
- Project started in January 2001, duration 3 years
- 6 main partners: CERN, INFN (Italy), CNRS (France), NIKHEF (The Netherlands), PPARC (UK), ESA/ESRIN (Italy)
- 15 associated partners (including industrial ones) spread across Europe
3
EDG WP1 (Grid Workload Management)
- Objective of the first DataGrid workpackage (according to the project “Technical Annex”): to define and implement a suitable architecture for distributed scheduling and resource management in a Grid environment
- Implemented a first workload management system
  - “Super scheduling” component (Resource Broker, RB) using application data and computing element requirements
  - Deployed in the EDG testbed and used for real activities
- Towards the second major release of the workload management system
  - Increased reliability
  - New functionalities
4
First WMS
dg-job-submit myjob.jdl

myjob.jdl:
    Executable = "$(CMS)/exe/sum.exe";
    InputData = "LF:testbed0-00019";
    ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
    DataAccessProtocol = "gridftp";
    InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
    OutputSandbox = {"sim.err", "test.out", "sim.log"};
    Requirements = other.Architecture == "INTEL" && other.OpSys == "LINUX Red Hat 6.2";
    Rank = other.FreeCPUs;
5
Grid checkpointing
- Approach: provide users with a “trivial” logical job checkpointing service API
- The user defines what the state of a job is
  - It represents what the job has done until that moment
  - It is a set of <var, value> pairs
  - It must be “enough” to restart a computation from a previously saved state
- The user can save the state of the job from time to time
- A job can be restarted from an intermediate (i.e. previously saved) job state
- The first instruction of the code should be the retrieval of the last saved state (if any), so that the job can restart from that point
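The restart pattern described on this slide can be sketched as follows. This is only an illustration under stated assumptions: `SketchState`, `SketchStore` and `runSum` are hypothetical names, not part of the EDG API, and the in-memory store stands in for whatever persistent service actually holds the states.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a job state: a set of <var, value> pairs.
using SketchState = std::map<std::string, long>;

// Hypothetical stand-in for persistent storage of the last saved state.
struct SketchStore {
    bool hasState = false;
    SketchState last;
};

// Sums 1..n, saving a state after every step. If stopAfter >= 0, the run is
// "aborted" after that many steps of this invocation. Returns true once the
// sum is complete and written to result.
bool runSum(SketchStore &store, long n, long stopAfter, long &result) {
    assert(n >= 0);
    // First instruction: retrieve the last saved state, if any.
    long next = 1;
    long partial = 0;
    if (store.hasState) {
        next = store.last.at("next");
        partial = store.last.at("partial");
    }
    long steps = 0;
    for (long i = next; i <= n; ++i) {
        if (stopAfter >= 0 && steps == stopAfter) {
            return false;  // simulated abort; the last saved state survives
        }
        partial += i;
        ++steps;
        // Save "enough" to restart: where to continue and the sum so far.
        store.last["next"] = i + 1;
        store.last["partial"] = partial;
        store.hasState = true;
    }
    result = partial;
    return true;
}
```

Calling `runSum` a second time on the same store continues from the saved `next` value instead of starting again from 1.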
6
Grid checkpointing API
- Job state represented as an object
  - Data members are essentially the <var, value> pairs
- Setting a <var, value> pair
  - error_t saveValue (const std::string &name, TYPED value)
    Sets a <var, value> pair
  - error_t appendValue (const std::string &name, TYPED value)
    Appends a TYPED value to an already set pair, or defines a new pair
- Resetting a job state
  - void clearPairs (void)
    All <var, value> pairs of the job state are deleted
- Saving a job state
  - error_t saveState (void)
    Persistently saves the job state
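A minimal in-memory mock can make the difference between the calls above concrete. It is a sketch, not the EDG implementation: `MockJobState`, `SketchError`, `pairs`, `savedCount` and `saved` are hypothetical, TYPED is fixed to `std::string` for brevity, "persistent" saving is just a copy into a vector, and rejecting an empty name is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical error type; the slides only name an error_t return type.
enum class SketchError { Ok, EmptyName };

// Hypothetical mock of a job state holding <var, value> pairs.
class MockJobState {
public:
    typedef std::map<std::string, std::vector<std::string> > Pairs;

    // Sets a pair, replacing any values already stored under name.
    SketchError saveValue(const std::string &name, const std::string &value) {
        if (name.empty()) return SketchError::EmptyName;  // assumed behaviour
        pairs_[name].assign(1, value);
        return SketchError::Ok;
    }

    // Appends a value to an existing pair, or defines a new pair.
    SketchError appendValue(const std::string &name, const std::string &value) {
        if (name.empty()) return SketchError::EmptyName;  // assumed behaviour
        pairs_[name].push_back(value);
        return SketchError::Ok;
    }

    // Deletes all pairs of the current state.
    void clearPairs() { pairs_.clear(); }

    // Stand-in for persistent saving: keeps a snapshot of the current pairs.
    SketchError saveState() {
        saved_.push_back(pairs_);
        return SketchError::Ok;
    }

    // Inspection helpers for the sketch (not on the slides).
    const Pairs &pairs() const { return pairs_; }
    std::size_t savedCount() const { return saved_.size(); }
    const Pairs &saved(std::size_t index) const {
        assert(index < saved_.size());
        return saved_[index];
    }

private:
    Pairs pairs_;
    std::vector<Pairs> saved_;
};
```

In this mock, `clearPairs` after `saveState` empties the working state while the saved snapshot stays intact, which is the behaviour a job would rely on when it starts filling the next state.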
7
Grid checkpointing API
- Retrieving <var, value> pairs from a job state
  - std::vector<TYPED> getTYPEDValue (const std::string &name)
    Retrieves the TYPED value(s) of a pair, given the var
  - bool isTYPEDValue (const std::string &name)
    Checks whether the specified attribute is of TYPED type
- Retrieving a job state
  - JobState *loadState (const std::string &stateID)
    Retrieves a previously saved job state, given its identifier
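The typed retrieval calls can be illustrated with a small mock in which TYPED is expanded to Int and String. Everything here is hypothetical (`SketchValue`, `MockTypedState`, `MockRepository`, `store`), and two behaviours are assumptions rather than documented facts: an unknown or wrongly typed name yields an empty vector, and an unknown state identifier yields a null pointer.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical tagged value: either an integer or a string.
struct SketchValue {
    enum Kind { Int, String } kind;
    long i;
    std::string s;
};

// Hypothetical mock of a job state with typed retrieval.
class MockTypedState {
public:
    void appendInt(const std::string &name, long v) {
        SketchValue x = {SketchValue::Int, v, std::string()};
        pairs_[name].push_back(x);
    }
    void appendString(const std::string &name, const std::string &v) {
        SketchValue x = {SketchValue::String, 0, v};
        pairs_[name].push_back(x);
    }
    // isTYPEDValue with TYPED = Int / String.
    bool isIntValue(const std::string &name) const { return hasKind(name, SketchValue::Int); }
    bool isStringValue(const std::string &name) const { return hasKind(name, SketchValue::String); }

    // getTYPEDValue: empty result if the name is unknown or of another type (assumed).
    std::vector<long> getIntValue(const std::string &name) const {
        std::vector<long> out;
        if (!isIntValue(name)) return out;
        for (const SketchValue &v : pairs_.at(name)) out.push_back(v.i);
        return out;
    }
    std::vector<std::string> getStringValue(const std::string &name) const {
        std::vector<std::string> out;
        if (!isStringValue(name)) return out;
        for (const SketchValue &v : pairs_.at(name)) out.push_back(v.s);
        return out;
    }

private:
    bool hasKind(const std::string &name, SketchValue::Kind kind) const {
        std::map<std::string, std::vector<SketchValue> >::const_iterator it = pairs_.find(name);
        return it != pairs_.end() && !it->second.empty() && it->second.front().kind == kind;
    }
    std::map<std::string, std::vector<SketchValue> > pairs_;
};

// Hypothetical repository of saved states, keyed by state identifier.
class MockRepository {
public:
    void store(const std::string &stateID, const MockTypedState &state) {
        assert(!stateID.empty());
        states_[stateID] = state;
    }
    // Returns the saved state, or a null pointer if the identifier is unknown (assumed).
    const MockTypedState *loadState(const std::string &stateID) const {
        std::map<std::string, MockTypedState>::const_iterator it = states_.find(stateID);
        return it == states_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, MockTypedState> states_;
};
```

A job would typically check `isIntValue` before calling `getIntValue`, since the slides do not say what the real API does on a type mismatch.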
8
How checkpointing is exploited
- A job is aborted due to a “Grid problem”
  - The job is automatically rescheduled (possibly on a different resource) and resubmitted; the last saved job state is automatically retrieved
- The user wants to resubmit her job starting from a previously saved state (not necessarily the last one), for example because it didn’t finish as expected
  - Possibility to retrieve a previously saved state, and to submit the job specifying that this must be considered the initial job state
- Job partitioning
  - Job “decomposed” into sub-jobs, which can be executed in parallel
  - “Job aggregator” responsible for collecting and “merging” the results of the sub-jobs (represented by their final states) to provide the overall results
- Job preemption/migration (e.g. higher-priority jobs to be submitted first, etc.)
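The job partitioning case can be sketched with a toy computation: a sum over 1..n is split into sub-jobs, each sub-job leaves a final state, and an aggregator merges those states. The names `FinalState`, `partitionRange`, `runSubJob` and `aggregate` are hypothetical, the sub-jobs run sequentially here rather than in parallel, and the merge rule (adding values key by key) is only one possible choice.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical final state of a sub-job: <var, value> pairs with integer values.
using FinalState = std::map<std::string, long>;

// Splits 1..n into at most `parts` contiguous (first, last) ranges.
std::vector<std::pair<long, long> > partitionRange(long n, long parts) {
    assert(n >= 0 && parts > 0);
    std::vector<std::pair<long, long> > ranges;
    long first = 1;
    for (long p = 0; p < parts; ++p) {
        long size = n / parts + (p < n % parts ? 1 : 0);
        if (size == 0) continue;  // more parts than items
        ranges.push_back(std::make_pair(first, first + size - 1));
        first += size;
    }
    return ranges;
}

// One sub-job: sums its range and reports the result as its final state.
FinalState runSubJob(long first, long last) {
    long sum = 0;
    for (long i = first; i <= last; ++i) sum += i;
    FinalState state;
    state["partial_sum"] = sum;
    state["items"] = last - first + 1;
    return state;
}

// The "job aggregator": merges the final states of all sub-jobs.
FinalState aggregate(const std::vector<FinalState> &states) {
    FinalState merged;
    merged["partial_sum"] = 0;
    merged["items"] = 0;
    for (const FinalState &s : states) {
        merged["partial_sum"] += s.at("partial_sum");
        merged["items"] += s.at("items");
    }
    return merged;
}
```

Because the aggregator only reads final states, it does not need to know where or in which order the sub-jobs ran.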
9
Implementation and status
- Job states are saved in the EDG Logging & Bookkeeping Server
  - Already in place and used as the job information repository
- Implementation of job checkpointing is ongoing
- Deployment of job checkpointing is scheduled by the end of the year
10
Other Info
- The European DataGrid Project: http://www.edg.org
- DataGrid WP1: http://www.infn.it/workload-grid
- Job checkpointing (and partitioning) within EDG:
  http://edms.cern.ch/document/347730
  http://www.pd.infn.it/~gianelle/datagrid