Published by Cecil Reeves. Modified over 8 years ago.
1
Herramientas para una ejecución adaptativa, eficiente y tolerante a fallos de tareas en entornos dinámicos distribuidos M. Rodríguez-Pascual, A.J. Rubio-Montero y R. Mayo-García CIEMAT CLCAR 2013 – San José, 26-30 de agosto de 2013
2
Tools for an adaptive, efficient and fault-tolerant execution of tasks in dynamic distributed environments
M. Rodríguez-Pascual, A.J. Rubio-Montero and R. Mayo-García
CIEMAT
CLCAR 2013 – San José, August 26-30, 2013
3
Index
– Problem and justification
– General architecture
– Montera
– GWpilot
– Results
– Conclusions
4
Justification
Scientists' computational requirements usually evolve over time
– Peaks in the demand for resources versus periods of underuse
– Energy cost
– Amortization
Grid computing emerged as a complement to the solutions available at the time
5
Justification
Nevertheless, the resources that form Grid infrastructures are highly dynamic
– They belong to different administrative domains
– They are in different geographic locations
– Their stability and accessibility vary
    – Software updates and upgrades
    – Hardware problems
    – Maintenance work in the data centers
    – Etc.
6
Justification
Figure: EGI infrastructure availability, March 2013
7
Justification
Therefore, any application that is to run efficiently on a Grid infrastructure must be prepared for the variations in service that it will inevitably face
– Fault tolerance techniques
– Adaptive scheduling of tasks
    – Number and characteristics of the available resources
Two potential solutions will be presented
– Montera
– GWpilot
8
Justification
Scheduling of tasks on the Grid…
– Has been, and still is, a well-established line of research
– Techniques in
    – Task replication (M/N, bag of tasks)
    – Task grouping/chunking (chunk ≡ number of samples)
But, in general…
– It has not been specifically applied to Monte Carlo codes, parameter sweep applications…
– The dynamic nature of the Grid has not been taken into account
– The overhead has not been properly estimated
– The proposed solutions have been tested only on simulators or in controlled environments
9
Justification
The proposed solutions are of interest for codes in the fields of
– Radiophysics
– Economics
– Finance
– Environment
– High Energy Physics
– Engineering
– Statistics
– Plasma physics
– Chemistry
– Industrial optimization techniques
– Etc.
10
Architecture
The solutions proposed in this work are based on GridWay [1]
– A metascheduler that allows unattended execution in a dynamic Grid environment
– It checks that all the stages of an execution are performed correctly
– It delegates to services such as GIS, GT4, CREAM/GRAM
It has already demonstrated better capabilities than other tools such as WMS [2]
[1] E. Huedo et al., Future Generation Computer Systems 23, 252 (2007)
[2] J.L. Vázquez-Poletti et al., Multiagent Grid Systems 3, 249 (2007)
11
Architecture
But there are still many aspects of Grid resources that have not been successfully characterized
– Bandwidth
– Connection shortcuts
– Static information
– Thresholds in the nodes that are not published
– Etc.
Based on these, it is possible to improve the efficiency of code execution:
– Reducing the final execution time
– Making appropriate use of the available resources
12
Architecture
Figure: Execution of distributed applications on the Grid with a solution based on GridWay
13
Montera
Montera (Monte Carlo rápido, "fast Monte Carlo") is a tool initially implemented for the execution of Monte Carlo codes
– Java, Python
Codes of this kind are very useful for modeling complex problems
– (Many) simulations + mathematics
– Random seeds for initializing the simulations
– Simple and independent tasks
– Statistics for combining/joining the results
– Their execution time can be approximated by a straight line, $T_{\mathrm{exec}} \approx a \cdot N + b$, where $N$ is the number of samples
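A short derivation makes explicit why the straight-line model matters when deciding how many samples to pack into one task. It uses only the symbols on the slide ($a$, $b$, $N$); reading $a$ as the per-sample cost and $b$ as a fixed per-task overhead is an assumption that is consistent with the linear form, not something the slides state.

```latex
\begin{align}
T_{\mathrm{exec}}(N) &\approx a N + b \\
\frac{T_{\mathrm{exec}}(N)}{N} &\approx a + \frac{b}{N}
  \;\xrightarrow{\,N \to \infty\,}\; a \\
\text{overhead fraction} &= \frac{b}{a N + b} \\
T_{\mathrm{exec}}(N) &\approx \frac{N + n_{1/2}}{r_\infty},
  \qquad r_\infty = \frac{1}{a}, \quad n_{1/2} = \frac{b}{a}
\end{align}
```

The last line is the same straight line rewritten with an asymptotic rate $r_\infty$ and a half-performance length $n_{1/2}$: a task with $N = n_{1/2}$ samples achieves half of the asymptotic rate, so larger chunks amortize the fixed cost better but make each task longer and more exposed to failures.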
14
Montera
Montera dynamically modifies the number of tasks and where they are executed
– DRMAA [3]
– DyTSS (Dynamic Trapezoidal Self Scheduling)
    – Asymptotic performance and half-performance length
It takes into account
– The infrastructure status
– The size of the I/O files
– The bandwidth
– The site performance
To do so, it profiles both the code and the site
[3] P. Tröger et al., Proc. CCGrid 2007, 619 (2007)
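The slides do not describe how DyTSS adapts its chunk sizes at run time, so the following is only a sketch of the classic, static Trapezoid Self-Scheduling scheme (Tzen and Ni, 1993) that the name refers to: chunks start large and shrink linearly towards a minimum size. The function name `tss_chunks` and its parameters are hypothetical, chosen for illustration.

```python
import math


def tss_chunks(total_samples, workers, last=1):
    """Chunk sizes for classic (static) Trapezoid Self-Scheduling.

    Assumption: this is the textbook TSS rule, not Montera's DyTSS.
    The first chunk is total/(2*workers); sizes then decrease linearly
    down to `last`.
    """
    if total_samples <= 0 or workers <= 0 or last <= 0:
        raise ValueError("total_samples, workers and last must be positive")
    first = max(last, math.ceil(total_samples / (2 * workers)))
    # Number of chunks of a trapezoid with parallel sides `first` and `last`.
    n_chunks = math.ceil(2 * total_samples / (first + last))
    step = (first - last) / (n_chunks - 1) if n_chunks > 1 else 0.0

    chunks, remaining, size = [], total_samples, float(first)
    while remaining > 0:
        # Never hand out more than what is left, never less than `last`.
        chunk = min(remaining, max(last, round(size)))
        chunks.append(chunk)
        remaining -= chunk
        size = max(float(last), size - step)
    return chunks


if __name__ == "__main__":
    print(tss_chunks(1000, 4))
```

Large early chunks keep the fixed per-task overhead small, while the small late chunks limit the work lost if a slow or failing site receives one of the last tasks. A dynamic variant would presumably recompute the sizes from updated site profiles, but that step is not shown here.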
15
Montera
Code profiling
– Runs 1, 10, 100, 1000… simulations (or any other series) the first time the code is executed
– Obtains $a$ and $b$ (since $T_{\mathrm{exec}} \approx a \cdot N + b$)
– Executes the Whetstone benchmark on the site
– Normalization
– Repeats this process on several sites to avoid misleading results
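The profiling steps can be sketched as a least-squares fit followed by a normalization. The fit itself is standard; the normalization rule (multiplying times by the site's Whetstone score to obtain site-independent constants) is an assumption, since the slides only say "Normalization". All function names and the sample timings are hypothetical.

```python
def fit_line(samples, times):
    """Ordinary least-squares fit of times ~ a * samples + b."""
    n = len(samples)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (N, T) pairs of equal length")
    mean_n = sum(samples) / n
    mean_t = sum(times) / n
    var_n = sum((x - mean_n) ** 2 for x in samples)
    if var_n == 0:
        raise ValueError("sample counts must not all be equal")
    cov = sum((x - mean_n) * (t - mean_t) for x, t in zip(samples, times))
    a = cov / var_n
    b = mean_t - a * mean_n
    return a, b


def normalize(a, b, site_score):
    """Assumed rule: time * benchmark score gives site-independent work."""
    return a * site_score, b * site_score


def estimate_time(a_norm, b_norm, n_samples, site_score):
    """Predicted run time of n_samples on a site with the given score."""
    return (a_norm * n_samples + b_norm) / site_score


if __name__ == "__main__":
    # Hypothetical profiling run: made-up timings in seconds.
    a, b = fit_line([1, 10, 100, 1000], [5.2, 7.0, 25.0, 205.0])
    a_norm, b_norm = normalize(a, b, site_score=2.0)
    print(a, b, estimate_time(a_norm, b_norm, 500, site_score=4.0))
```

Repeating the fit on several sites, as the slide suggests, and averaging the normalized constants would reduce the effect of one atypical machine.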
16
Montera
Site profiling
– When
    – Whenever a new site is discovered or a change is detected in an already known one (memory, CPU...)
– How
    – Whetstone, because of its focus on floating-point performance
    – Globus MDS
– Why
    – The codes are also normalized, so estimating the execution time is simple and accurate
17
Montera
Site profiling
– After every execution, these parameters are calculated again
    – Performance
    – Efficiency, failed tasks, available slots
    – Queue time
    – Bandwidth
– Weighted calculations
    – Biessel Formula
    – The newest executions weigh more than the older ones
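The idea that recent executions weigh more than older ones can be illustrated with an exponentially weighted average. This is a stand-in chosen for illustration: the formula named on the slide is not spelled out, and the sketch does not attempt to reproduce it. The function name and the `weight_new` parameter are hypothetical.

```python
def update_estimate(previous, observed, weight_new=0.5):
    """Blend a new observation into a running site estimate.

    Assumption: exponential weighting, so each older observation
    contributes a geometrically smaller share.
    """
    if not 0.0 < weight_new <= 1.0:
        raise ValueError("weight_new must be in (0, 1]")
    if previous is None:  # first observation for this site
        return observed
    return weight_new * observed + (1.0 - weight_new) * previous


if __name__ == "__main__":
    queue_time = None
    for observed in [120.0, 100.0, 400.0]:  # made-up queue times in seconds
        queue_time = update_estimate(queue_time, observed)
    print(queue_time)
```

The same update could be applied independently to each profiled parameter (performance, failure rate, queue time, bandwidth) after every execution.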
18
Montera
Figure: Information retrieval for adaptive scheduling in Montera
19
GWpilot
Benefits of pilot jobs
– Reduce the Grid complexity
    – Direct use and characterization of the assigned WNs
    – Direct monitoring of user tasks
– Fix task dispatching overheads
    – Remove the waiting time in remote queues
    – Remove middleware overheads and errors (CREAM, GRAM)
– Reduce the task error rate: middleware, hardware or connectivity
– Increase compatibility
    – Creating special configurations
    – Implementing legacy communication protocols
– Allow the implementation of advanced scheduling techniques
20
GWpilot
Pilot job systems already exist:
– Centralized (AliEn, PanDa, DIRAC, glideinWMS, etc.)
– Application-oriented (DIANE, SAGA-BigJob)
    – Single-user, single-application
In general:
– They do not exploit all the scheduling advantages provided by pilot jobs
– They lack compatibility or adaptability
    – Between middleware and legacy applications
21
GWpilot
GWpilot main features:
– Easy to install and standalone from remote middleware
– Highly customizable and tunable, even by non-expert users
– Multi-user with fair-share policies
– Compatible with previously ported applications
– Interoperable with diverse Grid infrastructures
– Lightweight and scalable, achieving nearly optimal performance
– Advanced scheduling policies for any kind of task
22
GWpilot
Figure: GWpilot component diagram. Labels: Applications; CLI, DRMAA, BES, JSDL; GridWay Core; Scheduler; CREAM, GRAM, GWpilot Server, MDS2, GLUE; GWpilot Factory; CREAM CE, GLOBUS CE, site-BDII, BDII; pilots pulling tasks over HTTPS.
– Allows submitting a percentage more pilots than the estimated free slots
– More accurate estimation of free slots
23
GWpilot
Figure: GWpilot architecture
24
Results
Tests have been carried out on the EGI infrastructure in production status
– fusion VO
Environment
– GridWay 5.6.0
– Java Virtual Machine 1.5.0.09
– Scheduling interval = 30 s
– Dispatch chunk = 15 jobs
– Maximum number of simultaneous jobs per user = 100
25
Results
Montera results with FAFNER2 [4] (tasks lasting on the order of seconds)
The slowdown is normalized to that of Montera
Table columns: Scheduler, Walltime [mm:ss], Slowdown (relative), Fault rate [%]
[4] M. Rodríguez-Pascual et al., IEEE Trans. Plasma Sci. 38, 2102 (2010)
26
Results
Montera results with ISDEP [5] (tasks lasting on the order of hours)
The slowdown is normalized to that of Montera
Table columns: Scheduler, Walltime [mm:ss], Slowdown (relative), Fault rate [%]
[5] A. Bustos et al., Nucl. Fusion 50, 125007 (2010)
27
Results
Montera results with FastDEP [6] (workflow of codes; tasks lasting on the order of hours)
The slowdown is normalized to that of Montera
Table columns: Scheduler, Walltime [mm:ss], Slowdown (relative)
[6] M. Rodríguez-Pascual et al., Plasma Phys. Contr. Fusion 55, 085014 (2013)
28
Results
Results of GWpilot with DKEsG [7]
– Tasks lasting on the order of seconds
Some configuration parameters
– Pilot pulling interval: 45 seconds with 20 retries
– DKEsG polling time: 15 seconds
– Resources are ranked/prioritized based on CPU speed
[7] A. Rubio-Montero et al., IEEE Trans. Plasma Sci. 38, 2093 (2010)
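A pilot that pulls work at a fixed interval and gives up after a number of retries can be sketched as below. This is not GWpilot's code: `fetch_task` and `run_task` are hypothetical callables standing in for the pilot's communication with the server, and reading "20 retries" as "exit after 20 consecutive empty pulls" is an assumption.

```python
import time


def run_pilot(fetch_task, run_task, interval_s=45, max_empty_pulls=20,
              sleep=time.sleep):
    """Pull and run tasks until `max_empty_pulls` pulls in a row are empty.

    fetch_task() returns a task or None (hypothetical interface).
    Returns the number of tasks executed.
    """
    executed = 0
    empty_pulls = 0
    while empty_pulls < max_empty_pulls:
        task = fetch_task()
        if task is None:
            empty_pulls += 1
            if empty_pulls < max_empty_pulls:
                sleep(interval_s)  # wait before asking again
            continue
        empty_pulls = 0  # any real task resets the retry counter
        run_task(task)
        executed += 1
    return executed


if __name__ == "__main__":
    pending = iter(["task-1", "task-2"])
    done = run_pilot(lambda: next(pending, None), print,
                     sleep=lambda s: None)  # skip real waiting in the demo
    print("executed", done)
```

Because the pilot already holds a worker node, each pulled task starts without waiting in a remote queue; the interval and retry count bound how long an idle pilot keeps that slot.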
29
Results
Figure annotations:
– Used for submitting pilots
– DKEsG only submits 500 BoTs
– BoT suspension timeout: 60 seconds
– Limited to 100 jobs per resource
– Queues overloaded with 15% more pilots
– Suspension timeout: 5 hours
– Max pilots submitted by GWpilot
– GISELA (prod.vo.eu-eela.eu)
– Discarded 32-bit, duplicate and CERN resources
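The overloading rule can be written as a one-line calculation. The sketch below assumes that the 15% applies to the estimated free slots of each resource and that the limit of 100 jobs per resource also caps the pilots; both readings are assumptions, and the function name and parameters are hypothetical.

```python
def pilots_to_submit(free_slots, overload_pct=15, per_resource_cap=100,
                     already_queued=0):
    """How many more pilots to send to one resource.

    Integer arithmetic avoids floating-point surprises when rounding up.
    """
    if free_slots < 0 or overload_pct < 0 or already_queued < 0:
        raise ValueError("arguments must be non-negative")
    # Ceiling of free_slots * (1 + overload_pct / 100).
    target = -(-free_slots * (100 + overload_pct) // 100)
    target = min(target, per_resource_cap)
    return max(0, target - already_queued)


if __name__ == "__main__":
    for slots in (0, 40, 41, 200):
        print(slots, pilots_to_submit(slots))
```

Submitting slightly more pilots than the published free slots compensates for stale or pessimistic slot information, at the cost of some pilots waiting in remote queues.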
30
Results
– Many overloaded sites
– Pilots die while a BoT is running inside them
– BoTs are suspended because they have been assigned to dead pilots
– Only 0.4% failing BoTs
– 62% failing grid jobs
– Makespan: 94 h 27 min 31 s
– 610 times faster than sequential
– Total time wasted at remote queues: 1 year and 20 days
– Not appreciable by the user
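The figures on this slide can be put on a common scale with a little arithmetic. The sketch below only combines the numbers reported above; the sequential time is implied by the stated speed-up rather than reported, and a 365-day year is assumed.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: non-leap year

# Reported on the slide.
makespan_s = 94 * 3600 + 27 * 60 + 31
speedup = 610
queue_wait_s = (365 + 20) * SECONDS_PER_DAY  # "1 year and 20 days"

# Derived quantities (implied, not reported).
sequential_s = makespan_s * speedup
sequential_years = sequential_s / SECONDS_PER_YEAR
queue_wait_in_makespans = queue_wait_s / makespan_s

if __name__ == "__main__":
    print(f"makespan: {makespan_s} s")
    print(f"implied sequential time: {sequential_years:.2f} years")
    print(f"summed queue wait: {queue_wait_in_makespans:.1f} makespans")
```

Under these assumptions the implied sequential run would take roughly six and a half years, and the queue waiting time summed over all pilots is close to a hundred times the makespan, which is consistent with the slide's point that the pilots absorb the waiting instead of the user.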
31
Conclusions
Task scheduling has not sufficiently taken into account the dynamic nature of the Grid
– Adaptive scheduling improves the performance of distributed codes
Pilot jobs had clear room for improvement
Montera and GWpilot allow distributed applications to be executed on Grid infrastructures in a more robust, efficient and unattended way
32
Conclusions
They could be a great asset in your computational challenges!
http://www.ciemat.es/portal.do?IDR=343&TR=C
http://www.gridway.org
33
THANKS A LOT