Grid Checkpointing Architecture Radosław Januszewski CoreGrid Summer School 2007.

Presentation transcript:

1 Grid Checkpointing Architecture Radosław Januszewski CoreGrid Summer School 2007

2 motivation
European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies
- Grids are complex and therefore prone to errors.
- The distributed nature of the Grid makes scheduling of system maintenance hard.
- Each uncoordinated power-down or failure results in the loss of the currently running applications.
- Loss of computation time means additional cost!

3 goal
To enhance the reliability, fault tolerance and robustness of the Grid computing environment.

4 the solution
Grid Checkpoint Architecture (GCA): a proposal for the placement, functionality and interaction schemes of the checkpointing service in the Grid environment.

5 grid - model

6 GCA in the Grid

7 Proof of concept – the goals
- check whether the GCA survives contact with reality
- prepare the PoC on the basis of a real-life installation
- the Grid with the GCA should provide additional value compared with the traditional approach

8 GCA proof of concept installation

9 involved elements
- GUI: command line, GridSphere, Migrating Desktop
- Broker: GRMS
- Local Resource Manager: Globus + TORQUE
- Core service: SGIckpt

10 Bottom-up approach
How to make the checkpointer work with the local resource manager?

11 pbs/torque special features
- action checkpoint
- action restart
- action checkpoint_abort

12 config
    $action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh %globid %jobid %sid %taskid %path
    $action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid
    $restart_transmogrify true
    $action checkpoint_abort 0 !/usr/pbs/bin/pbs-mom-checkpoint-and-stop.sh %globid %jobid %sid %taskid %path
A detailed description is available at http://checkpointing.psnc.pl
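Each `$action` line in the config on this slide has the same shape: an event name, a numeric field, a script path prefixed with `!`, and a list of `%` placeholders. The sketch below only illustrates that shape. It is not how pbs_mom parses its config. The helper names `parse_action` and `expand`, the key `second_field`, and the sample values are hypothetical.

```python
def parse_action(line):
    """Split one '$action' config line into its parts (illustrative only)."""
    tokens = line.split()
    if len(tokens) < 4 or tokens[0] != "$action":
        raise ValueError(f"not an $action line: {line!r}")
    script = tokens[3]
    if not script.startswith("!"):
        raise ValueError("expected the script path to start with '!'")
    return {
        "event": tokens[1],              # checkpoint, restart, checkpoint_abort
        "second_field": int(tokens[2]),  # numeric field; its meaning is not stated on the slide
        "script": script[1:],            # path without the leading '!'
        "args": tokens[4:],              # %placeholders passed to the script
    }


def expand(action, values):
    """Build an argv list by replacing each %name with values[name]."""
    argv = [action["script"]]
    for arg in action["args"]:
        if arg.startswith("%"):
            argv.append(str(values[arg[1:]]))
        else:
            argv.append(arg)
    return argv


# Hypothetical values, only to show the substitution.
restart = parse_action(
    "$action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid"
)
print(expand(restart, {"path": "/tmp/ckpt", "taskid": 1}))
```

Under these assumptions, the restart script would be called with the checkpoint path and the task id as its two arguments, in that order.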

13 Broker – local RM connectivity

14 problem
The checkpointer: a service or a resource?

15 job description with checkpointing
pbs gsiftp://xxx.xxx.xxx.xxxl//home/user/povray ${JOB_ID} true 1

16 the end-user point of view

17 manual scenario

18 manual scenario - restart

19 node-03.checkpointing.psnc.pl pbs gsiftp://xxx.xxx.xxx.xxx//home/xxxxxx/test_apps/matrix_long ${JOB_ID} true 1179315947518_matrix_demo_submit_0459 true 1

20 failure – end-user view

21 problem
This semi-automatic solution is not optimal.
How to introduce automatic job failure handling without introducing new functionality in the Broker?
Use workflows!

22 the workflow
Problem: with this broker we are not able to model loops.
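If a workflow cannot contain loops, one possible workaround is to unroll "retry until the job finishes" into a fixed-length chain of tasks, each started only if the previous one fails. The slides do not say this is what the automatic scenario does, so treat the sketch below as an assumption. The task records are generic Python dictionaries, not the GRMS workflow format, and the name `unrolled_workflow` and every field name are hypothetical.

```python
def unrolled_workflow(job_name, max_restarts):
    """Build a loop-free chain: one initial run plus max_restarts restart tasks.

    Each restart task is meant to start only if the task named in
    'start_if_failed' has failed, so the chain replaces a retry loop
    with a bounded number of attempts.
    """
    tasks = [{"id": f"{job_name}_run", "kind": "run", "start_if_failed": None}]
    for attempt in range(1, max_restarts + 1):
        tasks.append({
            "id": f"{job_name}_restart_{attempt}",
            "kind": "restart_from_checkpoint",
            "start_if_failed": tasks[-1]["id"],  # depends on the previous attempt
        })
    return tasks


for task in unrolled_workflow("matrix", 2):
    print(task["id"], "<- on failure of", task["start_if_failed"])
```

The cost of unrolling is a hard limit on recoveries: once the last restart task fails, the workflow ends, and the user has to step in as in the manual scenario.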

23 automatic scenario

24 end-user point of view

25 the benefits
- user: a more robust and fault-tolerant Grid environment
- sysadmin: much easier system management thanks to the automatic checkpoint and recovery mechanism

26 Thank you!

