Published by Irene Waters. Modified over 9 years ago.
1
Automatic checkpoint of CONDOR-PVM applications by P-GRADE
Jozsef Kovacs, Peter Kacsuk
Laboratory of Parallel and Distributed Systems, MTA SZTAKI, Budapest, Hungary
Computer and Automation Research Institute, Hungarian Academy of Sciences
{smith, kacsuk}@sztaki.hu
http://www.lpds.sztaki.hu
2
14-16th April 2004 Paradyn/Condor week, Madison, USA
Automatic checkpoint of Condor-PVM applications by P-GRADE

Background: The Hungarian ClusterGrid
2002: Hungarian Ministry of Education, NIIF – procurement project to equip universities, high schools and public libraries with PC labs.
More than 2000 PCs, considered an enormous computational resource, were spread over the country.
Grid Technical Board – the goal was to build a minimal but functional grid system.
Dual-boot PC labs are connected throughout the country.
Day-time operation – Windows desktop use; night-time operation – grid mode use.
24-hour operational “grid backbone” infrastructure.
Around 800 PCs are interconnected, with a performance of 400 Gflops, via a private networking solution (MPLS VPN) over the academic network.
1st generation ClusterGrid – a single large Condor pool.
2nd generation ClusterGrid – a Condor-based grid, connected by web services and transaction-based.
3
Hungarian ClusterGrid Infrastructure
Condor pools are connected by a global Grid Resource Broker, which uses dynamic UID/GID mapping for user jobs and a “one job – one directory structure” job format.
Scalable, easy-to-manage system.
In production since July 2003, with more than 30000 real user jobs executed.
Applications range from fundamental research (mathematics, physics) to applied research (biology, chemistry):
– investigation of the C60 molecule in electromagnetic fields
– simulation of protein molecules
– fractal calculation
– investigation of imbalanced phase transitions
– etc.
Two classes of applications are currently supported: parameter scanning, and master-worker jobs parallelized by PVM.
For more info: http://www.clustergrid.iif.hu
4
Hungarian ClusterGrid Infrastructure
5
Motivation
Checkpointing and migration support is necessary
– to enable load balancing
– to support fault tolerance
– to support the day-night working mode of the Hungarian ClusterGrid
– etc.
Automatic checkpointing for sequential jobs in the standard universe is provided by Condor.
Fault-tolerant execution of Master-Worker style parallel jobs is supported without automatic checkpointing.
With the P-GRADE environment, Condor is able to checkpoint PVM jobs automatically, which enables load balancing and makes long-running worker processes fault-tolerant.
6
P-GRADE environment
Parallel Grid Run-time and Application Development Environment
7
Using P-GRADE job mode for the whole range of parallel/distributed systems
[Figure labels: P-GRADE; PVM, MPI, Workflow; Supercomputers, Clusters, Grids; Condor, Grid (GT2), Grid (OGSA)]
8
P-GRADE and Condor
9
Current prototype for migration framework
First prototype is currently based on
– P-GRADE
– Condor
– PVM
Requirements
– No manual code preparation is required
– No user interaction during execution
– No PVM modification
– No extra requirements from schedulers
– Just build your application using P-GRADE
10
Structure of P-GRADE application
Server (built-in)
– process spawn/terminate
– identification/topology
– access to terminal/files
Clients
– identification of neighbors by the server
– access to files/terminal through the server
– primitives for communication
[Figure labels: Built-in server; client A, client B, client C, client D; message passing; Terminal; Files]
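The division of roles on this slide can be illustrated with a small in-memory sketch. It is not P-GRADE code: the class names `Server` and `Client`, their methods, and the file name `out.txt` are all hypothetical, and message passing is reduced to appending to a peer's inbox.

```python
class Server:
    """Hypothetical built-in server: spawns clients, knows the topology,
    and is the only component that touches files."""

    def __init__(self, topology):
        self.topology = topology   # client name -> list of neighbor names
        self.clients = {}          # currently running clients
        self.files = {}            # file name -> list of written lines

    def spawn(self, name):
        client = Client(name, self)
        self.clients[name] = client
        return client

    def terminate(self, name):
        del self.clients[name]

    def neighbors_of(self, name):
        # Only neighbors that are actually running are reported.
        return [n for n in self.topology.get(name, []) if n in self.clients]

    def write_file(self, file_name, line):
        self.files.setdefault(file_name, []).append(line)


class Client:
    """Hypothetical client: asks the server who its neighbors are and
    reaches files only through the server."""

    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.inbox = []

    def neighbors(self):
        return self.server.neighbors_of(self.name)

    def send(self, target, message):
        # Stand-in for a message-passing primitive between clients.
        self.server.clients[target].inbox.append((self.name, message))

    def log(self, line):
        self.server.write_file("out.txt", f"{self.name}: {line}")
```

Keeping file and terminal access on the server side means a client holds no open file handles of its own, which is one plausible reason such a structure is convenient when clients later have to be checkpointed.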
11
Checkpointing a single process
1. Initiate a checkpoint
2. Synchronize transit messages and disconnect MP
3. Collect address-space information
4. Send checkpoint
5. Store checkpoint onto server
6. Reconnect to MP
[Figure labels: User process; ckpt lib; handle MP; Checkpoint Server; Storage; arrows numbered 1 to 6]
Vic Zandy’s single process checkpointer: www.cs.wisc.edu/~zandy/ckpt © University of Wisconsin, Madison (former member of the Paradyn group)
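The order of the six steps can be traced with a toy simulation. This is only a sketch under strong assumptions: `MessagePassingLink`, `CheckpointServer`, `UserProcess` and `restore` are hypothetical names, the "address space" is reduced to a Python dictionary serialized with `pickle`, and the real checkpointer (Zandy's ckpt library) works at a much lower level.

```python
import pickle


class MessagePassingLink:
    """Hypothetical stand-in for the message-passing (MP) layer."""

    def __init__(self):
        self.connected = True
        self.in_transit = []   # messages sent but not yet delivered

    def drain(self):
        """Deliver all in-transit messages and return them."""
        delivered, self.in_transit = self.in_transit, []
        return delivered

    def disconnect(self):
        self.connected = False

    def reconnect(self):
        self.connected = True


class CheckpointServer:
    """Hypothetical checkpoint server that keeps images in memory."""

    def __init__(self):
        self.storage = {}

    def store(self, process_id, image):
        self.storage[process_id] = image


class UserProcess:
    def __init__(self, process_id, link, server):
        self.process_id = process_id
        self.link = link
        self.server = server
        self.state = {"iteration": 0, "inbox": []}
        self.log = []

    def checkpoint(self):
        self.log.append("1 initiate")
        # Step 2: no message may be lost, so drain before disconnecting.
        self.state["inbox"].extend(self.link.drain())
        self.link.disconnect()
        self.log.append("2 synchronize and disconnect")
        # Step 3: the dictionary stands in for address-space information.
        image = pickle.dumps(self.state)
        self.log.append("3 collect state")
        # Steps 4 and 5: send the image; the server stores it.
        self.log.append("4 send checkpoint")
        self.server.store(self.process_id, image)
        self.log.append("5 store checkpoint")
        self.link.reconnect()
        self.log.append("6 reconnect")
        return image


def restore(server, process_id):
    """Rebuild the saved state from the stored image."""
    return pickle.loads(server.storage[process_id])
```

The point the sketch tries to make is the ordering: in-transit messages are absorbed into the process state before the image is taken, so the saved image and the message-passing layer agree on what has been delivered.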
12
Modified structure to checkpoint processes
[Figure labels: Server/coordination module; Client A, Client B, Client C, Client D; message passing library; Files; Terminal; Checkpoint Server; Storage; ckpt lib; user code; comm lib; mp lib]
13
Migration among friendly Condor pools
Step 1: Starting the application
S: Server, CS: Checkpoint Server, P: PVM daemon, A, B, C: User processes
14
Migration among friendly Condor pools
Step 2: Condor is vacating a node
S: Server, CS: Checkpoint Server, P: PVM daemon, A, B, C: User processes
15
Migration among friendly Condor pools
Step 3: Checkpointing processes
S: Server, CS: Checkpoint Server, P: PVM daemon, A, B, C: User processes
16
Migration among friendly Condor pools
Step 4: Process resumed on friendly Condor pool
S: Server, CS: Checkpoint Server, P: PVM daemon, A, B, C: User processes
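The four migration steps can be walked through with a minimal simulation. Everything here is an illustrative assumption: `Node`, `start_application`, `vacate_and_checkpoint` and `resume` are hypothetical names, process state is a plain dictionary, and placement is simple round-robin rather than anything Condor actually does.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    pool: str
    processes: dict = field(default_factory=dict)   # process name -> state


def start_application(nodes, process_names):
    """Step 1: place the user processes on the nodes, round-robin."""
    for i, name in enumerate(process_names):
        nodes[i % len(nodes)].processes[name] = {"progress": 0}


def vacate_and_checkpoint(node, checkpoint_store):
    """Steps 2 and 3: the node is vacated, so every process on it is
    checkpointed to the store and then removed from the node."""
    for name, state in node.processes.items():
        checkpoint_store[name] = dict(state)
    vacated = list(node.processes)
    node.processes.clear()
    return vacated


def resume(process_names, target, checkpoint_store):
    """Step 4: resume the processes on a node of a friendly pool."""
    for name in process_names:
        target.processes[name] = dict(checkpoint_store[name])
```

A typical use is to call `vacate_and_checkpoint` on the node being reclaimed and pass its return value straight to `resume` with a node from the other pool; processes on untouched nodes keep running where they are.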
17
Live demonstrations
The prototype has been demonstrated at various conferences and workshops:
– EuroPar’03, Klagenfurt, Austria
– Hungarian Grid Day, Budapest, Hungary
– SuperComputing 2003, Phoenix, USA
– Cluster 2003, Hong Kong, China
18
Possible scenario on checkpointing and migration of P-GRADE programs between clusters
1. P-GRADE program submitted to Budapest as a Condor job
2. P-GRADE program runs at SZTAKI cluster
3. P-GRADE program migrates to London as a Condor job
4. P-GRADE program runs at UoW cluster
[Figure labels: P-GRADE GUI; London - UoW; Budapest - SZTAKI; Budapest - BUTE; SZTAKI & BUTE clusters overloaded; checkpointing]
19
Integrated checkpoint and monitor
The checkpoint system cooperates with the GRM-Mercury-PROVE monitoring and visualisation system:
– logs the user process out of the monitoring layer before termination
– logs the user process back into the monitoring layer after resumption
– the user can trace the machines to which a process migrated
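The log-out/log-in handshake around a migration can be sketched as follows. This is not the GRM, Mercury or PROVE API: `MonitoringLayer`, `log_in`, `log_out` and `migrate` are hypothetical names chosen only to show how a per-process host trace could be kept.

```python
class MonitoringLayer:
    """Hypothetical monitoring layer that records where each process runs."""

    def __init__(self):
        self.active = {}   # process name -> current host
        self.trace = {}    # process name -> hosts visited, in order

    def log_in(self, process, host):
        self.active[process] = host
        self.trace.setdefault(process, []).append(host)

    def log_out(self, process):
        self.active.pop(process, None)


def migrate(monitor, process, new_host):
    """Log out before the process terminates, log in after it resumes."""
    monitor.log_out(process)
    # The checkpoint, termination and resumption would happen here.
    monitor.log_in(process, new_host)
```

After `log_in("A", "hostA")` followed by `migrate(monitor, "A", "hostB")`, `monitor.trace["A"]` holds both hosts in visiting order, which corresponds to the "trace the machines" item above.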
20
Migration among non-friendly Condor pools (under development)
1. Detection of low resources on cluster
2. Removal of application from the queue
3. Transfer binaries, checkpoint files, work files
4. Submit application to the queue
5. Auto self-recovery of P-GRADE application
[Figure labels: P-GRADE environment; GRID Application Manager; CONDOR pool A; CONDOR pool B]
It requires consultation with CONDOR developers…
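Since this part is described as under development, the following is only a guess at how an application manager might sequence the five steps. `Pool`, `migrate_if_starved` and the `min_free_nodes` threshold are hypothetical, and queues and file transfer are modelled as plain Python containers rather than real Condor operations.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    name: str
    free_nodes: int
    queue: list = field(default_factory=list)    # queued application names
    files: dict = field(default_factory=dict)    # application -> file names


def migrate_if_starved(app, source, target, min_free_nodes=1):
    """Move `app` from `source` to `target` when `source` runs low on nodes.

    Returns True if the application was moved, False otherwise.
    """
    # 1. Detection of low resources on the source cluster.
    if source.free_nodes >= min_free_nodes:
        return False
    # 2. Removal of the application from the source queue.
    source.queue.remove(app)
    # 3. Transfer of binaries, checkpoint files and work files.
    target.files[app] = source.files.pop(app, [])
    # 4. Submission of the application to the target queue.
    target.queue.append(app)
    # 5. Self-recovery is left to the application, which restarts from
    #    the transferred checkpoint files once it is scheduled.
    return True
```

The sketch keeps step 5 outside the manager on purpose: the slides say the checkpointing capability is built into the application, so the manager only has to move files and resubmit.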
21
Summary of advantages/disadvantages
Advantages
– no modification of the grid execution environment is required, since all checkpointing/migration capability is built into the application
– supports the day-night working mode of the Hungarian ClusterGrid environment
– adaptivity and automation come from Condor
– Condor-PVM applications, with topology of any kind, can now be dynamically migrated like sequential jobs (Note: Condor itself does not checkpoint PVM applications; it only supports fault-tolerant execution for Master-Worker type applications)
– migrating jobs can be monitored online and visualised
Limitations
– currently only P-GRADE generated PVM jobs are supported
22
Conclusion
A parallel program checkpointing mechanism that can be applied to generic PVM programs.
A checkpointing mechanism that can be connected to Condor in order to realize migration of PVM jobs among Condor pools.
By integrating the P-GRADE migration framework and the Mercury Grid monitor, the performance of PVM applications can be monitored and visualized even during their migration.
Condor-PVM, through our checkpointing algorithm, is enhanced to checkpoint PVM applications as Condor does for sequential jobs.
23
Thank you for your attention!
Jozsef Kovacs
Information about P-GRADE: pgrade@sztaki.hu, http://www.lpds.sztaki.hu/pgrade
Next release is coming at the end of April…
Information about Hungarian ClusterGrid: http://www.clustergrid.iif.hu