Grid Checkpointing Architecture
Radosław Januszewski
CoreGrid Summer School 2007

European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies

motivation
- Grids are complex and therefore prone to errors.
- The distributed nature of the Grid makes system maintenance hard to schedule.
- Each uncoordinated power-down or failure results in the loss of the applications running at that moment.
- Lost computation time means additional cost!

goal
To enhance the reliability, fault tolerance and robustness of the Grid computing environment.

the solution
Grid Checkpoint Architecture (GCA): a proposal for the placement, functionality and interaction schemes of the checkpointing service in the Grid environment

grid - model

GCA in the Grid

Proof of concept – the goals
- check whether the GCA survives contact with reality
- build the PoC on a real-life installation
- the Grid with the GCA should provide additional value compared with the traditional approach

GCA proof of concept installation

involved elements
- GUI: command line, GridSphere, Migrating Desktop
- Broker: GRMS
- Local Resource Manager: Globus + TORQUE
- Core service: SGIckpt

Bottom-up approach
How to make the checkpointer work with the local resource manager?

pbs/torque special features
- action checkpoint
- action restart
- action checkpoint_abort

config
$action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh %globid %jobid %sid %taskid %path
$action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid
$restart_transmogrify true
$action checkpoint_abort 0 !/usr/pbs/bin/pbs-mom-checkpoint-and-stop.sh %globid %jobid %sid %taskid %path
Detailed description accessible on the
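
A minimal Python sketch of how an `$action` line like the ones in the config could be turned into a concrete command once the `%` placeholders are filled in. This is an illustration only, not TORQUE's actual implementation: the function name `expand_action`, the reading of the numeric field as a timeout, and all sample values are assumptions.

```python
def expand_action(line, values):
    """Split an `$action` config line and substitute its %name placeholders.

    Returns (action_name, timeout, argv). The numeric field is assumed
    here to be a timeout; `values` maps placeholder names to job data.
    """
    parts = line.split()
    if len(parts) < 4 or parts[0] != "$action":
        raise ValueError("not an $action line: %r" % line)

    name, timeout, command = parts[1], int(parts[2]), parts[3]
    if not command.startswith("!"):
        raise ValueError("expected a script prefixed with '!'")

    argv = [command[1:]]  # script path without the leading '!'
    for token in parts[4:]:
        if token.startswith("%"):
            # Placeholder such as %jobid: look up the hypothetical job value.
            argv.append(str(values[token[1:]]))
        else:
            argv.append(token)
    return name, timeout, argv


# Hypothetical job data, invented for this example.
sample = {"globid": "g1", "jobid": "42.node-03", "sid": 1234,
          "taskid": 1, "path": "/tmp/ckpt"}
line = ("$action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh "
        "%globid %jobid %sid %taskid %path")
print(expand_action(line, sample))
```

The same function covers the `restart` and `checkpoint_abort` lines, since they differ only in the script path and in which placeholders they pass.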

Broker – local RM connectivity

problem
The checkpointer: a service or a resource?

job description with checkpointing
pbs gsiftp://xxx.xxx.xxx.xxxl//home/user/povray ${JOB_ID} true 1

the end-user point of view

manual scenario

manual scenario - restart

node-03.checkpointing.psnc.pl pbs gsiftp://xxx.xxx.xxx.xxx//home/xxxxxx/test_apps/matrix_long ${JOB_ID} true _matrix_demo_submit_0459 true 1

failure – end-user view

problem
This semi-automatic solution is not optimal.
How can automatic job failure handling be introduced without adding new functionality to the Broker?
Use workflows!

the workflow
Problem: this broker cannot model loops.
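
One common way around a broker that cannot model loops is to unroll the retry loop into a fixed-length chain of tasks, each triggered only when its predecessor fails. The Python sketch below illustrates that idea under stated assumptions: the task dictionary layout, the `FAILED`/`FINISHED` status names and both function names are hypothetical and do not reflect the GRMS workflow format or the scenario actually used in the PoC.

```python
def unroll_retries(job_name, max_restarts):
    """Build a loop-free chain: one submit task plus up to
    `max_restarts` restart-from-checkpoint tasks."""
    tasks = [{"id": job_name + "_submit", "action": "submit", "run_if": None}]
    for i in range(1, max_restarts + 1):
        tasks.append({
            "id": "%s_restart_%d" % (job_name, i),
            "action": "restart_from_checkpoint",
            # Run only if the previous task in the chain ended as FAILED.
            "run_if": (tasks[-1]["id"], "FAILED"),
        })
    return tasks


def simulate(tasks, outcomes):
    """Walk the chain with made-up outcomes per task id.

    Tasks missing from `outcomes` are treated as FINISHED.
    Returns (executed task ids, final status).
    """
    executed, status = [], None
    for task in tasks:
        if task["run_if"] is not None and status != task["run_if"][1]:
            break  # predecessor did not fail, so no restart is needed
        status = outcomes.get(task["id"], "FINISHED")
        executed.append(task["id"])
    return executed, status


chain = unroll_retries("matrix", 2)
print(simulate(chain, {"matrix_submit": "FAILED"}))
```

The cost of unrolling is a hard upper bound on the number of restarts: once the chain is exhausted, the job stays failed and has to be handled manually.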

automatic scenario

end-user point of view

the benefits
- user: a more robust and fault-tolerant Grid environment
- sysadmin: much easier system management thanks to the automatic checkpoint and recovery mechanism

Thank you!