Grid checkpointing in the European DataGrid Project Alessio Gianelle – INFN Padova Rosario Peluso – INFN Padova Francesco Prelz – INFN Milano Massimo Sgaravatto.

Grid checkpointing in the European DataGrid Project Alessio Gianelle – INFN Padova Rosario Peluso – INFN Padova Francesco Prelz – INFN Milano Massimo Sgaravatto – INFN Padova

DataGrid Project (EDG)
- DataGrid goal: Grid software projects meet real-life scientific applications (High Energy Physics, Earth Observation, Biology) and their deadlines, with mutual benefit
  - Bring the issues of data identification, location, transfer and access into the picture
- Middleware development and integration of existing middleware
- Large-scale testbed
- Production-quality demonstration
- Project started in January 2001, with a duration of 3 years
- 6 main partners: CERN, INFN (Italy), CNRS (France), NIKHEF (The Netherlands), PPARC (UK), ESA/ESRIN (Italy), plus 15 associated partners (including industrial ones) spread across Europe

EDG WP1 (Grid Workload Management)
- Objective of the first DataGrid workpackage (according to the project "Technical Annex"): to define and implement a suitable architecture for distributed scheduling and resource management in a Grid environment
- Implemented a first workload management system
  - "Super scheduling" component (Resource Broker, RB) using application data and computing element requirements
  - Deployed in the EDG testbed and used for real activities
- Working towards the second major release of the workload management system
  - Increased reliability
  - New functionalities

First WMS
Submitting a job:
    dg-job-submit myjob.jdl
where myjob.jdl contains:
    Executable = "$(CMS)/exe/sum.exe";
    InputData = "LF:testbed ";
    ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
    DataAccessProtocol = "gridftp";
    InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
    OutputSandbox = {"sim.err", "test.out", "sim.log"};
    Requirements = other.Architecture == "INTEL" && other.OpSys == "LINUX Red Hat 6.2";
    Rank = other.FreeCPUs;
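The Requirements and Rank expressions above drive the Resource Broker's choice of computing element. The sketch below is a simplified illustration under stated assumptions, not the RB's actual ClassAd-based matchmaking: computing elements are modelled by a hypothetical `ComputingElement` struct carrying only the three attributes the JDL refers to, the Requirements expression is hard-coded as the hypothetical predicate `meetsRequirements`, and `selectCE` (also hypothetical) picks the matching candidate with the highest Rank, here `FreeCPUs`.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified view of a computing element (CE): only the
// attributes referenced by the JDL example are modelled.
struct ComputingElement {
    std::string id;
    std::string architecture;  // other.Architecture
    std::string opSys;         // other.OpSys
    int freeCPUs;              // other.FreeCPUs
};

// Requirements = other.Architecture == "INTEL" && other.OpSys == "LINUX Red Hat 6.2";
bool meetsRequirements(const ComputingElement& ce) {
    return ce.architecture == "INTEL" && ce.opSys == "LINUX Red Hat 6.2";
}

// Rank = other.FreeCPUs; a higher rank is preferred.
int rank(const ComputingElement& ce) { return ce.freeCPUs; }

// Returns the matching CE with the highest rank, or std::nullopt if no CE
// satisfies the requirements.
std::optional<ComputingElement> selectCE(const std::vector<ComputingElement>& ces) {
    std::optional<ComputingElement> best;
    for (const auto& ce : ces) {
        if (!meetsRequirements(ce)) continue;      // filter on Requirements
        if (!best || rank(ce) > rank(*best)) best = ce;  // keep highest Rank
    }
    return best;
}
```

With three candidate CEs, two matching the requirements with 4 and 12 free CPUs and one non-matching with 32, `selectCE` returns the matching one with 12: a resource that fails Requirements is never chosen, however high its Rank.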

Grid checkpointing
Approach: providing users with a "trivial" logical job checkpointing service API
- The user defines what the state of a job is: a set of <var, value> pairs
  - It represents what the job has done up to that moment
  - It must be "enough" to restart the computation from a previously saved state
- The user can save the state of the job from time to time
- A job can be restarted from an intermediate (i.e. "previously" saved) job state
- The "first" instruction of the code should be the retrieval of the last saved state (if any), so that the job can restart from that point
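The restart discipline described above (retrieve the last saved state first, then save progress from time to time) can be sketched as follows. This is an illustration under assumptions, not the EDG API: the hypothetical `CheckpointStore` is a plain in-memory stand-in for the checkpointing service, the job state is a map of <var, value> pairs restricted to `long` values, and the `crashAfter` parameter of the hypothetical `runSum` job is only a way to simulate a job aborted by a "Grid problem".

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// A job state: <var, value> pairs (values restricted to long for brevity).
using State = std::map<std::string, long>;

// Hypothetical stand-in for the persistent checkpoint service: it keeps
// only the most recently saved state of the job.
struct CheckpointStore {
    std::optional<State> last;
    void save(const State& s) { last = s; }
    std::optional<State> loadLast() const { return last; }
};

// Hypothetical job: sums 1..n, saving <i, sum> after every step. If
// crashAfter is set, the run aborts once that many steps were executed in
// this run. Returns true (and sets result) only if the job completed.
bool runSum(CheckpointStore& store, long n, std::optional<long> crashAfter,
            long& result) {
    // "First" instruction: retrieve the last saved state, if any.
    State st = store.loadLast().value_or(State{{"i", 0}, {"sum", 0}});
    long steps = 0;
    while (st["i"] < n) {
        if (crashAfter && steps == *crashAfter) return false;  // simulated abort
        st["i"] += 1;
        st["sum"] += st["i"];
        store.save(st);  // checkpoint after each step
        ++steps;
    }
    result = st["sum"];
    return true;
}
```

If a first run of `runSum(store, 10, 4L, result)` aborts after four steps, the saved state is `i = 4, sum = 10`; a second run without a crash resumes from there instead of starting over and still produces 55. Saving after every step is the simplest policy; a real job would checkpoint less often to limit overhead.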

Grid checkpointing API
- Job state represented as an object; its data members are essentially the <var, value> pairs
- Setting a pair:
  - error_t saveValue(const std::string &name, TYPED value): sets a pair
  - error_t appendValue(const std::string &name, TYPED value): appends a TYPED value to an already set pair, or defines a new pair
- Resetting a job state:
  - void clearPairs(void): deletes all the pairs of the job state
- Saving a job state:
  - error_t saveState(void): saves the job state persistently

Grid checkpointing API
- Retrieving pairs from a job state:
  - std::vector<TYPED> getTYPEDValue(const std::string &name): retrieves the TYPED value(s) of a pair, given the var name
  - bool isTYPEDValue(const std::string &name): checks whether the specified attribute is of TYPED type
- Retrieving a job state:
  - JobState *loadState(const std::string &stateID): retrieves a previously saved job state, given its identifier
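To make the calls on the two API slides concrete, here is a self-contained mock of their shape. It is a hypothetical stand-in, not the EDG library: TYPED is modelled as a template parameter (so `getTYPEDValue` becomes `getValue<T>` and `isTYPEDValue` becomes `isValue<T>`), the supported types are limited to `int`, `double` and `std::string`, `error_t` is assumed to be an `int` with 0 meaning success, the `JobState(stateID)` constructor and the `ckpt` namespace are invented for the sketch, and persistence is simulated with an in-process map from state identifiers to saved pairs rather than the Logging & Bookkeeping server.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <variant>
#include <vector>

namespace ckpt {  // hypothetical namespace, not the EDG one

using error_t = int;  // assumption: 0 means success
using Value = std::variant<int, double, std::string>;  // stand-in for TYPED
using Pairs = std::map<std::string, std::vector<Value>>;

// Simulated persistent storage: state ID -> saved pairs (in-process only).
std::map<std::string, Pairs>& store() {
    static std::map<std::string, Pairs> s;
    return s;
}

class JobState {
public:
    explicit JobState(std::string id) : id_(std::move(id)) {}

    // Sets a pair, replacing any previous value(s) stored under name.
    template <typename T> error_t saveValue(const std::string& name, T value) {
        pairs_[name] = {Value(std::move(value))};
        return 0;
    }
    // Appends a value to an already set pair, or defines a new pair.
    template <typename T> error_t appendValue(const std::string& name, T value) {
        pairs_[name].push_back(Value(std::move(value)));
        return 0;
    }
    // Deletes all pairs of this in-memory state (saved copies are untouched).
    void clearPairs() { pairs_.clear(); }
    // "Persists" the current pairs under this state's identifier.
    error_t saveState() { store()[id_] = pairs_; return 0; }

    // Returns the values of pair name that hold a T (empty if none).
    template <typename T> std::vector<T> getValue(const std::string& name) const {
        std::vector<T> out;
        auto it = pairs_.find(name);
        if (it == pairs_.end()) return out;
        for (const auto& v : it->second)
            if (const T* p = std::get_if<T>(&v)) out.push_back(*p);
        return out;
    }
    // True if pair name exists and its first value is of type T.
    template <typename T> bool isValue(const std::string& name) const {
        auto it = pairs_.find(name);
        return it != pairs_.end() && !it->second.empty() &&
               std::holds_alternative<T>(it->second.front());
    }

    friend JobState* loadState(const std::string& stateID);

private:
    std::string id_;
    Pairs pairs_;
};

// Retrieves a previously saved state, or nullptr if the ID is unknown.
// Assumption: the caller owns (and must delete) the returned object.
JobState* loadState(const std::string& stateID) {
    auto it = store().find(stateID);
    if (it == store().end()) return nullptr;
    auto* js = new JobState(stateID);
    js->pairs_ = it->second;
    return js;
}

}  // namespace ckpt
```

A typical sequence is `saveValue`/`appendValue` during the run, `saveState()` at a safe point, and on restart `loadState(id)` followed by the typed getters. Keeping every pair as a vector of values is what lets `appendValue` and `saveValue` share one representation.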

How checkpointing is exploited
- A job is aborted due to a "Grid problem"
  - The job is automatically rescheduled (possibly on a different resource) and resubmitted; the last saved job state is automatically retrieved
- The user wants to resubmit her job starting from a previously saved state (not necessarily the last one), for example because it didn't finish as expected
  - It is possible to retrieve a previously saved state and submit the job specifying that this state must be considered the initial job state
- Job partitioning
  - The job is "decomposed" into sub-jobs, which can be executed in parallel
  - A "job aggregator" is responsible for collecting and "merging" the results of the sub-jobs (represented by their final states) to provide the overall results
- Job preemption/migration (e.g. higher priority jobs to be submitted first, etc.)
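Job partitioning relies on a job aggregator that merges the final states of the sub-jobs. A minimal sketch under stated assumptions: each hypothetical sub-job (`runSubJob`) processes a slice [first, last) of an event range, its final state is a set of <var, value> pairs, and the hypothetical `aggregate` function merges states by summing each variable. Summing is only appropriate for additive quantities such as counts or partial sums; the slides do not specify how EDG itself merges states.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Final state of a sub-job: <var, value> pairs (numeric values only here).
using State = std::map<std::string, double>;

// Hypothetical sub-job: processes events in [first, last) and reports how
// many it processed plus a partial sum of per-event values (value = index).
State runSubJob(long first, long last) {
    State s{{"processed", 0.0}, {"sum", 0.0}};
    for (long e = first; e < last; ++e) {
        s["processed"] += 1;
        s["sum"] += static_cast<double>(e);
    }
    return s;
}

// Splits [0, total) into n contiguous sub-ranges of near-equal size.
std::vector<std::pair<long, long>> partition(long total, int n) {
    std::vector<std::pair<long, long>> parts;
    for (int k = 0; k < n; ++k)
        parts.emplace_back(total * k / n, total * (k + 1) / n);
    return parts;
}

// Job aggregator: merges sub-job final states by summing each variable.
State aggregate(const std::vector<State>& finals) {
    State merged;
    for (const auto& s : finals)
        for (const auto& [var, value] : s) merged[var] += value;
    return merged;
}
```

Partitioning 100 events into 3 sub-jobs and aggregating their final states yields the same state as a single job over all 100 events (100 processed, sum 4950), which is the property a merge rule must preserve for partitioning to be transparent to the user.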

Implementation and status
- Job states are saved in the EDG Logging & Bookkeeping Server, which is already in place and used as the job information repository
- Implementation of job checkpointing is ongoing
- Deployment of job checkpointing is scheduled for the end of the year

Other info
- The European DataGrid Project
- DataGrid WP1
- Job checkpointing (and partitioning) within EDG