
Computer and Automation Research Institute, Hungarian Academy of Sciences
Automatic checkpoint of CONDOR-PVM applications by P-GRADE
Jozsef Kovacs, Peter Kacsuk
Laboratory of Parallel and Distributed Systems, MTA SZTAKI, Budapest, Hungary
Paradyn/Condor Week, 14-16th April 2004, Madison, USA

Background: The Hungarian ClusterGrid
- Hungarian Ministry of Education, NIIF – a procurement project to equip universities, high schools and public libraries with PC labs. More than 2000 PCs, an enormous computational resource at the time, were spread over the country.
- Grid Technical Board – the goal was to build a minimal but functional grid system.
- Dual-boot PC labs are connected throughout the country: day-time operation in Windows desktop mode, night-time operation in grid mode.
- 24-hour operational "grid backbone" infrastructure.
- Around 800 PCs are interconnected, delivering about 400 Gflops, via a private networking solution (MPLS VPN) over the academic network.
- 1st generation ClusterGrid – a single large Condor pool.
- 2nd generation ClusterGrid – a Condor-based, web-service connected, transaction-based grid.

Hungarian ClusterGrid Infrastructure
- Condor pools are connected by a global Grid Resource Broker, which uses dynamic UID/GID mapping for user jobs and a "one job – one directory" job format.
- Scalable, easy-to-manage system.
- In production since July 2003, with a large number of real user jobs executed.
- Applications range from fundamental research (mathematics, physics) to applied research (biology, chemistry):
  - investigation of the C60 molecule in electromagnetic fields
  - simulation of protein molecules
  - fractal calculation
  - investigation of imbalanced phase transitions
  - etc.
- Two classes of applications are currently supported: parameter scanning and master-worker jobs parallelized by PVM.
For more info:

Hungarian ClusterGrid Infrastructure (infrastructure overview figure)

Motivation
Checkpointing and migration support is necessary:
- to enable load balancing,
- to support fault tolerance,
- to support the day-night working mode of the Hungarian ClusterGrid, etc.
Condor provides automatic checkpointing for sequential jobs in the standard universe. Fault-tolerant execution of master-worker style parallel jobs is supported, but without automatic checkpointing. With the P-GRADE environment, Condor becomes able to checkpoint PVM jobs automatically, enabling load balancing and making long-running worker processes fault tolerant.

P-GRADE environment: Parallel Grid Run-time and Application Development Environment

Using P-GRADE job mode for the whole range of parallel/distributed systems: P-GRADE programs run over PVM, MPI, or as workflows, on supercomputers, clusters, and Grids (Condor Grid, GT2 Grid, OGSA Grid).

P-GRADE and Condor (architecture figure)

Current prototype for migration framework
The first prototype is currently based on:
- P-GRADE
- Condor
- PVM
Requirements:
- no manual code preparation is required
- no user interaction during execution
- no PVM modification
- no extra requirements from schedulers
- just build your application using P-GRADE

Structure of a P-GRADE application
A built-in server process coordinates clients A, B, C and D, which communicate by message passing and reach the terminal and files through the server.
Server process:
- spawn/terminate
- identification/topology
- access to terminal/files
Clients:
- identification of neighbors by the server
- access to files/terminal through the server
- primitives for communication (message passing)
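To make this server/client structure concrete, here is a minimal, server-side-only sketch in C with PVM of how such a built-in server might spawn its clients and answer identification/topology requests. The executable name "client", the message tags and the request protocol are illustrative assumptions, not the actual P-GRADE implementation.

/* Hypothetical sketch of the built-in server: spawn clients, then serve
 * identification/topology requests over PVM. */
#include <stdio.h>
#include <pvm3.h>

#define TAG_WHO_AM_I  100   /* client -> server: "who am I, how many of us?"     */
#define TAG_TOPOLOGY  101   /* server -> client: logical id + number of clients  */
#define NCLIENTS 4

int main(void)
{
    int tids[NCLIENTS];
    int i, n, bufid, bytes, tag, sender, dummy;

    /* The server spawns all client processes (spawn/terminate role). */
    if (pvm_spawn("client", (char **)0, PvmTaskDefault, "", NCLIENTS, tids) != NCLIENTS) {
        fprintf(stderr, "failed to spawn all clients\n");
        pvm_exit();
        return 1;
    }

    /* Identification/topology role: each client learns its logical index and
     * the total number of clients from the server.  File and terminal access
     * would be forwarded through the server in a similar request/reply way. */
    n = NCLIENTS;
    for (i = 0; i < NCLIENTS; i++) {
        bufid = pvm_recv(-1, TAG_WHO_AM_I);
        pvm_upkint(&dummy, 1, 1);                  /* request payload (unused here) */
        pvm_bufinfo(bufid, &bytes, &tag, &sender);

        pvm_initsend(PvmDataDefault);
        pvm_pkint(&i, 1, 1);                       /* logical id of this client     */
        pvm_pkint(&n, 1, 1);                       /* size of the topology          */
        pvm_send(sender, TAG_TOPOLOGY);
    }

    pvm_exit();
    return 0;
}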

Checkpointing a single process
1. Initiate a checkpoint
2. Synchronize transit messages and disconnect from the message-passing (MP) layer
3. Collect address-space information
4. Send the checkpoint
5. Store the checkpoint on the checkpoint server
6. Reconnect to the MP layer
Components: user process (with the ckpt lib and MP handling), checkpoint server, storage.
The single-process checkpointer is Vic Zandy's, © University of Wisconsin, Madison (former member of the Paradyn group).
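A minimal, compilable sketch of how the six steps above fit together inside one process follows. Every helper is an illustrative stub; none of them correspond to the real APIs of Zandy's ckpt library or the P-GRADE message-passing layer.

/* Hypothetical outline of the per-process checkpoint sequence (steps 1-6). */
#include <stdio.h>

static void mp_flush_transit_messages(void) { /* drain in-transit messages      */ }
static void mp_disconnect(void)             { /* detach from the MP layer       */ }
static void mp_reconnect(void)              { /* re-attach after the checkpoint */ }

typedef struct { size_t size; void *data; } ckpt_image_t;

static ckpt_image_t collect_address_space(void)
{
    /* In reality: data, heap, stack and register state of the process. */
    ckpt_image_t img = { 0, 0 };
    return img;
}

static void send_image_to_server(const ckpt_image_t *img)
{
    /* In reality: stream the image to the checkpoint server, which stores
     * it on stable storage (steps 4 and 5). */
    printf("checkpoint image of %zu bytes sent\n", img->size);
}

void checkpoint_process(void)
{
    /* 1. A checkpoint is initiated (e.g. by the coordination server).  */
    /* 2. Synchronize transit messages, disconnect from the MP layer.   */
    mp_flush_transit_messages();
    mp_disconnect();
    /* 3. Collect address-space information.                            */
    ckpt_image_t img = collect_address_space();
    /* 4.+5. Send the checkpoint to the server, which stores it.        */
    send_image_to_server(&img);
    /* 6. Reconnect to the MP layer and continue.                       */
    mp_reconnect();
}

int main(void) { checkpoint_process(); return 0; }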

Modified structure to checkpoint processes
The built-in server becomes a server/coordination module. Each client (A, B, C, D) is built from the user code, the communication library, the message-passing library, and the ckpt lib; the checkpoint server writes the checkpoint images to storage. Terminal and file access still go through the server.

Migration among friendly Condor pools
Step 1: Starting the application
(Legend: S – server, CS – checkpoint server, P – PVM daemon, A, B, C – user processes)

Migration among friendly Condor pools
Step 2: Condor is vacating a node

Migration among friendly Condor pools
Step 3: Checkpointing the processes

Migration among friendly Condor pools
Step 4: Processes resumed on the friendly Condor pool
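Read together, steps 1-4 amount to a small coordination loop inside the application: when Condor vacates a node, the affected processes are checkpointed and later resumed from their images in the friendly pool. The sketch below is only a schematic illustration of that loop; the event source and helpers are stand-ins, not the real P-GRADE or Condor interfaces.

/* Schematic coordination loop for migration among friendly Condor pools.
 * Every helper is a stub standing in for the real P-GRADE/Condor machinery. */
#include <stdio.h>

enum event { EV_DONE, EV_VACATE, EV_RESUMED };

static enum event wait_for_event(void)         { return EV_DONE; }        /* stub */
static void checkpoint_vacated_processes(void) { puts("checkpointing");  }
static void restart_in_friendly_pool(void)     { puts("restarting");     }
static void reconnect_resumed_processes(void)  { puts("reconnecting");   }

void coordinate_migration(void)
{
    for (;;) {
        switch (wait_for_event()) {            /* step 1: application is running    */
        case EV_VACATE:                        /* step 2: Condor vacates a node     */
            checkpoint_vacated_processes();    /* step 3: checkpoint via ckpt lib   */
            restart_in_friendly_pool();        /* step 4: resume from the images    */
            break;
        case EV_RESUMED:
            reconnect_resumed_processes();     /* re-join the MP layer after resume */
            break;
        case EV_DONE:
            return;
        }
    }
}

int main(void) { coordinate_migration(); return 0; }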

Live demonstrations
The prototype has been demonstrated at various conferences and workshops:
- EuroPar'03, Klagenfurt, Austria
- Hungarian Grid Day, Budapest, Hungary
- SuperComputing 2003, Phoenix, USA
- Cluster 2003, Hong Kong, China

Possible scenario for checkpointing and migration of P-GRADE programs between clusters
Sites: P-GRADE GUI; Budapest – SZTAKI; Budapest – BUTE; London – UoW.
1. The P-GRADE program is submitted to Budapest as a Condor job.
2. The P-GRADE program runs on the SZTAKI cluster.
3. The SZTAKI and BUTE clusters become overloaded → checkpointing; the P-GRADE program migrates to London as a Condor job.
4. The P-GRADE program runs on the UoW cluster.

Integrated checkpointing and monitoring
The checkpoint system cooperates with the GRM-Mercury-PROVE monitoring and visualisation system:
- it logs the user process out of the monitoring layer before termination,
- it logs the user process back into the monitoring layer after resumption,
- the user can trace the machines to which the process has migrated.
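A small sketch of how a checkpoint/resume could be bracketed by the monitoring layer follows. The monitor_* hook names and the host names are hypothetical placeholders standing in for the GRM/Mercury calls, which the slides do not show.

/* Hypothetical monitoring hooks around a checkpoint/resume cycle. */
#include <stdio.h>

static void monitor_logout(const char *host) { printf("logged out of monitor on %s\n", host); }
static void monitor_login(const char *host)  { printf("logged in to monitor on %s\n", host);  }
static void take_checkpoint(void)            { /* see the single-process sequence above   */  }
static void resume_from_checkpoint(void)     { /* restore address space, reconnect MP     */  }

/* Before the process terminates for migration it leaves the monitoring
 * layer; after resumption on the new host it registers again, so the user
 * can trace in PROVE on which machines the process has run. */
void migrate_with_monitoring(const char *old_host, const char *new_host)
{
    monitor_logout(old_host);
    take_checkpoint();
    /* ...the process terminates here and is resumed on new_host... */
    resume_from_checkpoint();
    monitor_login(new_host);
}

int main(void)
{
    migrate_with_monitoring("node-a.example.hu", "node-b.example.hu");  /* hypothetical hosts */
    return 0;
}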

Migration among non-friendly Condor pools (under development)
Actors: P-GRADE environment, Grid Application Manager, Condor pool A, Condor pool B.
1. Detection of low resources on the cluster
2. Removal of the application from the queue
3. Transfer of binaries, checkpoint files and work files
4. Submission of the application to the queue of the other pool
5. Automatic self-recovery of the P-GRADE application
It requires consultation with the Condor developers…
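Since this migration path is described as under development, the following is only a speculative sketch of how the five steps might be driven by a Grid Application Manager using the standard Condor command-line tools condor_rm and condor_submit. The resource check, the file-transfer helper, the job id and the submit-file name are all assumptions.

/* Speculative sketch of a Grid Application Manager driving migration between
 * non-friendly Condor pools via the Condor CLI.  Helpers are placeholders. */
#include <stdio.h>
#include <stdlib.h>

static int  cluster_resources_low(void) { return 1; }   /* step 1: stubbed check   */
static void transfer_job_files(const char *pool)        /* step 3                  */
{
    /* Move binaries, checkpoint files and work files to the target pool,
     * e.g. over the grid's file transfer mechanism. */
    printf("transferring files to %s\n", pool);
}

static void migrate_application(const char *job_id, const char *target_pool)
{
    char cmd[256];

    if (!cluster_resources_low())                        /* step 1: low resources?  */
        return;

    snprintf(cmd, sizeof cmd, "condor_rm %s", job_id);
    system(cmd);                                         /* step 2: remove from queue */

    transfer_job_files(target_pool);                     /* step 3: transfer files    */

    system("condor_submit application.submit");          /* step 4: submit to pool B  */

    /* step 5: on start-up in pool B, the P-GRADE application finds its
     * checkpoint files and performs automatic self-recovery. */
}

int main(void)
{
    migrate_application("123.0", "condor-pool-b");       /* hypothetical job id/pool  */
    return 0;
}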

Summary of advantages and limitations
Advantages:
- no modification of the grid execution environment is required, since all checkpointing/migration capability is built into the application
- supports the day-night working mode of the Hungarian ClusterGrid environment
- adaptivity and automation come from Condor
- Condor-PVM applications, with any topology, can now be dynamically migrated just like sequential jobs (note: Condor itself does not checkpoint PVM applications; only fault-tolerant execution of master-worker type applications is supported)
- migrating jobs can be monitored online and visualised
Limitations:
- currently only P-GRADE generated PVM jobs are supported

Conclusion
- A parallel program checkpointing mechanism that can be applied to generic PVM programs.
- A checkpointing mechanism that can be connected to Condor in order to realize migration of PVM jobs among Condor pools.
- By integrating the P-GRADE migration framework with the Mercury Grid monitor, PVM applications can be performance monitored and visualized even during their migration.
- Condor-PVM, through our checkpointing algorithm, is enhanced to checkpoint PVM applications the same way it is done for sequential jobs.

Thank you for your attention!
Jozsef Kovacs
Information about P-GRADE: (next release is coming at the end of April…)
Information about the Hungarian ClusterGrid: