1 An unattended, fault-tolerant approach for the execution of distributed applications Manuel Rodríguez-Pascual, Rafael Mayo-García CIEMAT Madrid, Spain.

Presentation transcript:

1 An unattended, fault-tolerant approach for the execution of distributed applications Manuel Rodríguez-Pascual, Rafael Mayo-García CIEMAT Madrid, Spain

2 Outline: Problem, Solution, Architecture, Implementation, Examples

3 Application porting for distributed platforms

4 Problem
Application should be highly portable.
Grid: schedulers (WMS, GridWay, pilot jobs, ...); libraries (DRMAA, SAGA, ...)
Cluster: schedulers (SGE, PBS/Torque, ...); libraries (DRMAA, MPI)

5 A new standard? (author: xkcd.com)

6 Solution: distributedToolbox

7 High level description
Distributed tasks are defined with a reduced set of parameters (executable, arguments, input/output/error files) and exported as XML files.
The XML files are parsed and the tasks are executed on the distributed infrastructures.
Depending on the infrastructure, this can be done in very different ways.
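
As an illustration of such a task definition, the following Python sketch writes a task to XML and reads it back. The element names (`task`, `executable`, `argument`, `input`, `output`, `error`), the `Task` class and the example arguments are assumptions made for this example; the slides do not show the actual distributedToolbox schema or API.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """One distributed task. Field names are hypothetical, not the toolbox's."""
    executable: str
    arguments: List[str] = field(default_factory=list)
    input_files: List[str] = field(default_factory=list)
    output_files: List[str] = field(default_factory=list)
    error_file: str = "stderr.txt"


def task_to_xml(task: Task) -> str:
    """Serialize a task to an XML string (assumed, illustrative schema)."""
    root = ET.Element("task")
    ET.SubElement(root, "executable").text = task.executable
    groups = (
        ("argument", task.arguments),
        ("input", task.input_files),
        ("output", task.output_files),
    )
    for tag, values in groups:
        for value in values:
            ET.SubElement(root, tag).text = value
    ET.SubElement(root, "error").text = task.error_file
    return ET.tostring(root, encoding="unicode")


def task_from_xml(text: str) -> Task:
    """Rebuild a task from the XML produced by task_to_xml."""
    root = ET.fromstring(text)
    return Task(
        executable=root.findtext("executable", default=""),
        arguments=[e.text or "" for e in root.findall("argument")],
        input_files=[e.text or "" for e in root.findall("input")],
        output_files=[e.text or "" for e in root.findall("output")],
        error_file=root.findtext("error", default=""),
    )
```

Keeping the definition this small is what lets each infrastructure-specific executor decide for itself how to stage the files and launch the executable.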

8 High level description (2)
The basic idea is NOT to define a new standard, library, or API, BUT to create a simple specification that anyone can implement according to their specific needs.
An implementation can be extremely simple or rather complex.

9 Application developer's point of view
Java and Python APIs are included to create distributed task definition files (XML) and to load information from them.
If needed, APIs for other languages can be implemented easily.

10 DistributedToolbox
Set of tools to execute distributed tasks.
Implementations for cluster and Grid.
Can be modified or adapted to new platforms in a very simple way.

11 Proposed solution for clusters

12 Proposed solution for the Grid

13 Execution workflow
The local application creates the task XML files.
TaskLoader reads these files and stores them in a database.
GridController reads this database and executes the tasks using GridWay.
A task is considered finished when the desired output files exist and are not empty.
The local application loads the results and finishes its execution.
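
The completion rule above can be sketched in a few lines of Python. The function name `task_finished` is hypothetical, and the sketch assumes the output files have already been transferred back to the local file system.

```python
from pathlib import Path
from typing import Iterable


def task_finished(output_files: Iterable[str]) -> bool:
    """Return True when every expected output file exists and is non-empty.

    Note: with no declared outputs this is vacuously True; a real
    controller might want to treat that case separately.
    """
    return all(
        Path(name).is_file() and Path(name).stat().st_size > 0
        for name in output_files
    )
```

Checking the results rather than the scheduler's reported status means a task whose output never arrived is treated the same as one that never ran.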

14 Robustness
Certification problems. If the user cannot properly identify themselves with a valid Grid certificate, GridWay detects it and aborts the task submission, notifying the problem.
Communication failures. If any problem occurs during the transmission of the input data or the task executable, GridWay detects it on the remote site and the task is cancelled. If any problem occurs during the transmission of the output data and this data is not returned to the local host, the task is considered to have failed.

15 Robustness (2)
Local resource failures.
If the specified input files are not present on the system, the job is considered finished.
If communication with GridWay is broken, task submission is stopped. When communication is restored, the status of the running tasks is checked.
If GridController fails, no information is lost, because a database is used for persistence. When it is restarted, the previous state is recovered and the status of the tasks that were running is checked.
If the database fails, the execution of GridController is considered unsafe and it stops automatically.
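
A minimal sketch of the persistence idea, using Python's standard `sqlite3` module. The table layout, the state names (`PENDING`, `RUNNING`) and the function names are assumptions for illustration; the slides do not specify which database or schema GridController uses.

```python
import sqlite3
from typing import List


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the task database; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY,"
        " xml_path TEXT NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'PENDING')"
    )
    conn.commit()
    return conn


def add_task(conn: sqlite3.Connection, xml_path: str) -> int:
    """Register a task definition file and return its database id."""
    cur = conn.execute("INSERT INTO tasks (xml_path) VALUES (?)", (xml_path,))
    conn.commit()
    return cur.lastrowid


def set_state(conn: sqlite3.Connection, task_id: int, state: str) -> None:
    """Persist a state change immediately so a crash cannot lose it."""
    conn.execute("UPDATE tasks SET state = ? WHERE id = ?", (state, task_id))
    conn.commit()


def tasks_to_recheck(conn: sqlite3.Connection) -> List[int]:
    """After a restart, list the tasks that were running before the failure."""
    rows = conn.execute(
        "SELECT id FROM tasks WHERE state = 'RUNNING' ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]
```

On restart, a controller built this way would call `open_store` on the same file and query the scheduler for each id returned by `tasks_to_recheck`.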

16 Robustness (3)
Remote resource failures.
If the remote task does not start, GridWay detects it.
If the remote task remains in a queue for longer than a given threshold, it is resubmitted.
If there is any problem with the Grid certificates on the remote site, GridWay detects it.
Some failures at remote sites lead to a state where the master node thinks the task is still running even though it has finished on the worker node. To detect this, tasks with an extremely long execution time are considered to have failed.
To avoid performance slowdowns, a small replication factor is applied to every group of tasks.
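
The two time-based rules (resubmit after too long in the queue, fail after too long running) can be sketched as follows. The names `TaskTimes` and `next_action` and the threshold parameters are hypothetical; the slides give no concrete threshold values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskTimes:
    """Timestamps in seconds; started_at is None while the task is queued."""
    submitted_at: float
    started_at: Optional[float] = None


def next_action(task: TaskTimes, now: float,
                max_queue_s: float, max_run_s: float) -> str:
    """Decide what the controller should do with a task that is not finished."""
    if task.started_at is None:
        # Still queued: resubmit once the queue-time threshold is exceeded.
        return "RESUBMIT" if now - task.submitted_at > max_queue_s else "WAIT"
    # Reported as running: treat an extremely long run as a failure.
    return "FAILED" if now - task.started_at > max_run_s else "WAIT"
```

For example, with a one-hour queue threshold, a task submitted at `t = 0` that has not started by `t = 4000` would be resubmitted.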

17 Use Cases

18 ProtTest3 [1] & jModelTest2 [2]
Java applications, designed to run on local workstations.
Wrappers of a serial application, PhyML, that takes 99% of the computational effort.
Large cases take days to weeks.
Porting to HPC and Grid is necessary to improve throughput.
[1] D. Darriba, G. L. Taboada, R. Doallo, and D. Posada. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics, 2011.
[2] D. Darriba, G. L. Taboada, R. Doallo, and D. Posada. jModelTest 2: more models, new heuristics and parallel computing. Nature Methods, 9(8):772–772, July

19 Architecture of the solution

20 Results: reliability tests
Tests for certificate management:
Submit jobs with no certificate.
Submit jobs with a certificate of a different VO.
Submit jobs that finish after the certificate has expired.
Manually destroy the certificate.

21 Results: reliability tests (2)
Tests on local resources:
Kill GridWay, or any number of GridWay tasks.
Kill GridController.
Kill the database.
Shut down the machine, both in a controlled way and with a "hard reset".

22 Results: reliability tests (3)
Tests on remote sites:
Jobs not creating the desired output data.
Many tasks submitted to the Fusion and Biomed VOs to test the proposal in production environments.

23 Results
Tasks executed:
Cluster: about
Grid: more than
Not a single one was lost or malfunctioned.

24 Thanks for your attention. Questions?