Published by Devin Robinson. Modified over 11 years ago.
1
Grid Checkpointing Architecture
Radosław Januszewski
CoreGrid Summer School 2007
2
European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies
motivation
- Grids are complex and therefore prone to errors.
- The distributed nature of the Grid makes system maintenance hard to schedule.
- Each uncoordinated power-down or failure results in the loss of currently running applications.
- Lost computation time means additional cost!
3
goal
To enhance the reliability, fault tolerance and robustness of the Grid computing environment.
4
the solution
Grid Checkpoint Architecture (GCA): a proposal for the placement, functionality and interaction schemes of the checkpointing service in the Grid environment
5
grid - model
6
GCA in the Grid
7
Proof of concept – the goals
- check whether the GCA survives contact with reality
- prepare the PoC on the basis of a real-life installation
- the Grid with the GCA should provide added value compared with the traditional approach
8
GCA proof of concept installation
9
involved elements
- GUI: command line, GridSphere, Migrating Desktop
- Broker: GRMS
- Local Resource Manager: Globus + TORQUE
- Core service: SGIckpt
10
Bottom-up approach
How to make the checkpointer work with the local resource manager?
11
pbs/torque special features
- action checkpoint
- action restart
- action checkpoint_abort
12
config
$action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh %globid %jobid %sid %taskid %path
$action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid
$restart_transmogrify true
$action checkpoint_abort 0 !/usr/pbs/bin/pbs-mom-checkpoint-and-stop.sh %globid %jobid %sid %taskid %path
A detailed description is available at http://checkpointing.psnc.pl
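To make the structure of the $action lines on this slide easier to see, here is a minimal Python sketch that splits each line into an event name, a numeric field, a script path and a list of % placeholders, then fills the placeholders in for one call. It is an illustration only, not part of TORQUE or the GCA: the names Action, parse_actions and expand are hypothetical, reading the numeric field as a timeout is an assumption, and the sample values "/tmp/ckpt" and "1" are made up.

```python
from dataclasses import dataclass


@dataclass
class Action:
    event: str    # checkpoint, restart or checkpoint_abort
    timeout: int  # numeric field after the event; read as a timeout (assumption)
    script: str   # script path, with the leading "!" removed
    args: list    # placeholders such as %jobid or %path


def parse_actions(config_text):
    """Collect the $action lines of the config into a dict keyed by event."""
    actions = {}
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "$action":
            continue  # skip other directives, e.g. $restart_transmogrify
        event, timeout, script = fields[1], int(fields[2]), fields[3]
        actions[event] = Action(event, timeout, script.lstrip("!"), fields[4:])
    return actions


def expand(action, values):
    """Build the argument vector by replacing each %name placeholder."""
    return [action.script] + [values[a.lstrip("%")] for a in action.args]


CONFIG = """\
$action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh %globid %jobid %sid %taskid %path
$action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid
$restart_transmogrify true
$action checkpoint_abort 0 !/usr/pbs/bin/pbs-mom-checkpoint-and-stop.sh %globid %jobid %sid %taskid %path
"""

actions = parse_actions(CONFIG)
# Hypothetical values, chosen only to show the substitution.
restart_argv = expand(actions["restart"], {"path": "/tmp/ckpt", "taskid": "1"})
print(restart_argv)
```

The sketch shows that the restart script receives only the checkpoint path and task id, while both checkpoint scripts receive the full set of five identifiers.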
13
Broker – local RM connectivity
14
problem
The checkpointer: a service or resource?
15
job description with checkpointing
pbs gsiftp://xxx.xxx.xxx.xxxl//home/user/povray ${JOB_ID} true 1
16
the end-user point of view
17
manual scenario
18
manual scenario - restart
19
node-03.checkpointing.psnc.pl pbs gsiftp://xxx.xxx.xxx.xxx//home/xxxxxx/test_apps/matrix_long ${JOB_ID} true 1179315947518_matrix_demo_submit_0459 true 1
20
failure – end-user view
21
problem
This semi-automatic solution is not optimal.
How to introduce automatic job failure handling without adding new functionality to the Broker?
Use workflows!
22
the workflow
Problem: with this broker we cannot model loops
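The slide does not show how the workflow was built around the missing loop construct. One plausible reading, sketched below in Python, is to unroll the "restart on failure" loop into a fixed-length chain of tasks, each triggered only when the previous attempt fails. Everything here is an assumption for illustration: unrolled_workflow, the task names and the "on_failure" trigger label are hypothetical and do not correspond to any GRMS interface.

```python
def unrolled_workflow(job_name, max_restarts):
    """Return (task, parent, trigger) edges for a loop-free restart chain.

    Each restart task depends on the previous attempt and is meant to run
    only when that attempt ends in failure.
    """
    first = f"{job_name}_run"
    edges = [(first, None, "start")]
    previous = first
    for attempt in range(1, max_restarts + 1):
        task = f"{job_name}_restart_{attempt}"
        edges.append((task, previous, "on_failure"))
        previous = task
    return edges


# Hypothetical job name and restart budget.
workflow = unrolled_workflow("matrix_demo", max_restarts=3)
for task, parent, trigger in workflow:
    print(task, "<-", parent, f"[{trigger}]")
```

The trade-off of unrolling is that the number of recovery attempts is fixed when the workflow is submitted. A job that fails more often than max_restarts allows would still need manual intervention.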
23
automatic scenario
24
end-user point of view
25
the benefits
- user: a more robust and fault-tolerant Grid environment
- sysadmin: much easier system management thanks to the automatic checkpoint and recovery mechanism
26
Thank you!