Tools for an adaptive, efficient and fault-tolerant execution of tasks in dynamic distributed environments. M. Rodríguez-Pascual, A.J. Rubio-Montero.

1 Tools for an adaptive, efficient and fault-tolerant execution of tasks in dynamic distributed environments
M. Rodríguez-Pascual, A.J. Rubio-Montero and R. Mayo-García, CIEMAT
CLCAR 2013 – San José, 26-30 August 2013

2 Tools for an adaptive, efficient and fault-tolerant execution of tasks in dynamic distributed environments
M. Rodríguez-Pascual, A.J. Rubio-Montero and R. Mayo-García, CIEMAT
CLCAR 2013 – San José, August 26th-30th, 2013

3 Index
– Problem and justification
– General architecture
– Montera
– GWpilot
– Results
– Conclusions

4 Justification
Scientists' computational requirements usually evolve over time
– Peaks in the demand for resources versus periods of underuse
– Energy cost
– Amortization
Grid computing emerged as a complement to the solutions available at the time

5 Justification
Nevertheless, the resources that make up Grid infrastructures are highly dynamic
– They belong to different administrative domains
– They are in different geographic locations
– Their stability and accessibility vary
  • Software updates and upgrades
  • Hardware problems
  • Maintenance work in the data centers
  • Etc.

6 Justification
[Figure: EGI infrastructure availability, March 2013]

7 Justification
Therefore, any application that is to run efficiently on a Grid infrastructure must be prepared for the variations in service that it will inevitably face
– Fault tolerance techniques
– Adaptive scheduling of tasks
  • Number and characteristics of the available resources
Two potential solutions will be presented
– Montera
– GWpilot

8 Justification
Scheduling of tasks on the Grid…
– Has been, and still is, a well-established line of research
– Techniques include
  • Task replication (M/N, bag of tasks)
  • Task grouping/chunking (chunk ≡ number of samples)
But, in general…
– It has not been specifically applied to Monte Carlo codes, parameter sweep applications…
– The dynamic nature of the Grid has not been taken into account
– The overhead has not been properly estimated
– The proposed solutions have been tested only on simulators or in controlled environments

9 Justification
The proposed solutions are of interest for codes in the fields of
– Radiophysics
– Economics
– Finance
– Environment
– High Energy Physics
– Engineering
– Statistics
– Plasma physics
– Chemistry
– Industrial optimization techniques
– Etc.

10 Architecture
The solutions proposed in this work are based on GridWay [1]
– A metascheduler that allows unattended execution in a dynamic Grid environment
– It checks that every stage of an execution is performed correctly
– It delegates to services such as GIS, GT4, CREAM/GRAM
It has already demonstrated better capabilities than other tools such as WMS [2]
[1] E. Huedo et al., Future Generation Computer Systems 23, 252 (2007)
[2] J.L. Vázquez-Poletti et al., Multiagent Grid Systems 3, 249 (2007)

11 Architecture
But many aspects of Grid resources have still not been successfully characterized
– Bandwidth
– Connection shortcuts
– Static information
– Thresholds in the nodes that are not published
– Etc.
Based on these, it is possible to improve the efficiency of code execution:
– Reducing the final execution time
– Making appropriate use of the available resources

12 Architecture
[Figure: Execution of distributed applications on the Grid with a solution based on GridWay]

13 Montera
Montera (Monte Carlo rápido, "fast Monte Carlo") is a tool initially implemented for the execution of Monte Carlo codes
– Java → Python
This kind of code is very useful for modeling complex problems
– (Many) simulations + mathematics
– Random seeds for initializing the simulations
– Simple and independent tasks
– Statistics for combining/joining the results
– Its execution time can be approximated by a straight line: T_exec ≈ a·N + b (where N is the number of samples)
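The properties listed on this slide (independent tasks, one random seed per task, statistics to merge the results) can be illustrated with a toy Monte Carlo estimate of pi. This is only a sketch: the function names `run_task` and `combine`, the seeds and the sample counts are assumptions made for the example and are not taken from Montera.

```python
import random


def run_task(seed, samples):
    """One independent task: count random points inside the unit quarter circle."""
    rng = random.Random(seed)  # private generator, so tasks never share state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits, samples


def combine(results):
    """Merge the partial counts returned by all tasks into one estimate of pi."""
    total_hits = sum(hits for hits, _ in results)
    total_samples = sum(count for _, count in results)
    return 4.0 * total_hits / total_samples


# Hypothetical seeds and sizes; on a Grid each call would run on a different node.
seeds = [101, 102, 103, 104]
results = [run_task(seed, 20_000) for seed in seeds]
estimate = combine(results)
print(f"pi is approximately {estimate:.4f}")
```

Because each task depends only on its seed and its sample count, a failed task can simply be resubmitted elsewhere, and the number of samples per task can be changed freely without affecting how the results are merged.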

14 Montera
Montera dynamically modifies the number of tasks and where they are executed
– DRMAA [3]
– DyTSS (Dynamic Trapezoidal Self Scheduling)
  • Asymptotic performance and half-performance length
It takes into account
– The infrastructure status
– The size of the I/O files
– The bandwidth
– The site performance
To do so, it profiles both the code and the site
[3] P. Tröger et al., Proc. CCGrid 2007, 619 (2007)
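The slide names DyTSS without giving its formulas. As a rough illustration only, the sketch below implements the classic, static trapezoid self-scheduling rule (Tzen and Ni), in which chunk sizes shrink linearly from a large first chunk to a small last one. It is not Montera's dynamic variant; the function name, the default last-chunk size and the rounding are assumptions.

```python
import math


def trapezoid_chunks(total, workers, last=1):
    """Split `total` samples into chunks whose sizes decrease linearly."""
    if total <= 0 or workers <= 0 or last <= 0:
        raise ValueError("total, workers and last must be positive")
    # Classic choice: the first chunk is half of an even share per worker.
    first = max(last, math.ceil(total / (2 * workers)))
    # Planned number of chunks and the constant decrement between them.
    steps = math.ceil(2 * total / (first + last))
    delta = (first - last) / (steps - 1) if steps > 1 else 0.0

    chunks = []
    remaining = total
    size = float(first)
    while remaining > 0:
        # Never hand out more than what is left.
        chunk = min(remaining, max(last, round(size)))
        chunks.append(chunk)
        remaining -= chunk
        size = max(float(last), size - delta)
    return chunks


chunks = trapezoid_chunks(total=1000, workers=4)
print(len(chunks), "chunks, first:", chunks[0], "last:", chunks[-1])
```

Large early chunks keep the per-task overhead low, while small late chunks reduce the risk that one slow site delays the end of the whole run.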

15 Montera
Code profiling
– Performs 1, 10, 100, 1000… simulations (or any other series) the first time the code is executed
– Obtains a and b (since T_exec ≈ a·N + b)
– Executes the Whetstone benchmark on the site
– Normalization
– Repeats this process on several sites to avoid misleading results
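Obtaining a and b from the profiling runs amounts to fitting a straight line to (N, time) pairs. The sketch below uses ordinary least squares, which is an assumption, since the slides do not say which fitting method Montera uses. The measured times are invented for illustration.

```python
def fit_line(samples, times):
    """Least-squares fit of times ~ a * samples + b; returns (a, b)."""
    n = len(samples)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (samples, time) pairs")
    mean_x = sum(samples) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in samples)
    if sxx == 0:
        raise ValueError("sample counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(samples, times))
    a = sxy / sxx            # seconds per sample
    b = mean_y - a * mean_x  # fixed overhead in seconds
    return a, b


def predict(a, b, samples):
    """Estimated execution time for a task of `samples` simulations."""
    return a * samples + b


# Hypothetical profiling series: 1, 10, 100 and 1000 simulations.
runs = [1, 10, 100, 1000]
measured = [20.6, 25.1, 69.8, 521.0]  # invented wall times in seconds
a, b = fit_line(runs, measured)
print(f"a = {a:.3f} s/sample, b = {b:.1f} s, T(500) = {predict(a, b, 500):.1f} s")
```

Here a is the marginal cost of one more sample and b is the constant cost of starting a task, so a scheduler can estimate how large a task must be for the overhead to become negligible.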

16 Montera
Site profiling
– When
  • Whenever a new site is discovered or a change is detected in an already known one (memory, CPU...)
– How
  • Whetstone, because it measures floating-point performance
  • Globus MDS
– Why
  • The codes are also normalized, so estimating the execution time is simple and accurate
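One way to read the normalization step is sketched below: the coefficients measured on the profiled site are rescaled to a reference machine, and the time on any other site is then obtained from that site's Whetstone score. The assumption that run time scales inversely with the Whetstone score, the reference score of 1000 and all function names are hypothetical; the slides do not give the actual formula.

```python
REFERENCE_SCORE = 1000.0  # hypothetical Whetstone score of the reference machine


def normalize(a, b, profiled_site_score, reference_score=REFERENCE_SCORE):
    """Rescale coefficients measured on one site to reference-machine units."""
    factor = profiled_site_score / reference_score
    return a * factor, b * factor


def estimate_on_site(a_norm, b_norm, samples, site_score,
                     reference_score=REFERENCE_SCORE):
    """Estimated time of `samples` simulations on a site with the given score."""
    reference_time = a_norm * samples + b_norm
    return reference_time * reference_score / site_score


# Coefficients measured on a site that scores 2000 (twice the reference speed).
a_norm, b_norm = normalize(0.5, 20.0, profiled_site_score=2000.0)
for score in (500.0, 1000.0, 2000.0):
    seconds = estimate_on_site(a_norm, b_norm, 100, score)
    print(f"site score {score:.0f}: about {seconds:.0f} s for 100 samples")
```

With both the code and the sites expressed in the same units, comparing candidate sites reduces to one multiplication per site.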

17 Montera
Site profiling
– After every execution, these parameters are recalculated
  • Performance
  • Efficiency, failed tasks, available slots
  • Queue time
  • Bandwidth
– Weighted calculations
  • Bessel formula
  • The newest executions weigh more than the older ones
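The slide states that newer executions weigh more than older ones but does not give the weights. The sketch below uses exponentially decaying weights as a stand-in; the decay factor, the function name and the sample queue times are assumptions, and this is not the formula named on the slide.

```python
def weighted_recent_mean(values, decay=0.5):
    """Weighted mean of `values` (oldest first) that favours recent entries.

    The newest value has weight 1, the one before it `decay`, then decay**2...
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < decay <= 1:
        raise ValueError("decay must be in (0, 1]")
    n = len(values)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical queue times (seconds) observed on one site, oldest first.
queue_times = [600.0, 580.0, 90.0, 75.0]
print(f"plain mean:    {weighted_recent_mean(queue_times, decay=1.0):.1f} s")
print(f"weighted mean: {weighted_recent_mean(queue_times, decay=0.5):.1f} s")
```

When a site has recently become less loaded, the weighted value follows the change much faster than the plain mean, which is the behaviour an adaptive scheduler needs.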

18 Montera
[Figure: Information retrieval for adaptive scheduling in Montera]

19 GWpilot
Pilot job benefits
– Reduce the Grid complexity
  • Direct use and characterization of the assigned WNs
  • Direct monitoring of user tasks
– Fix task dispatching overheads
  • Remove the waiting time in remote queues
  • Remove middleware overheads and errors (CREAM, GRAM)
– Reduce the task error rate: middleware, hardware or connectivity
– Increase compatibility
  • Creating special configurations
  • Implementing legacy communication protocols
– Allow the implementation of advanced scheduling techniques

20 GWpilot
Several pilot job systems already exist:
– Centralized (AliEn, PanDA, DIRAC, glideinWMS, etc.)
– Application-oriented (DIANE, SAGA-BigJob)
  • Single-user, single-application
In general:
– They do not exploit all the scheduling advantages provided by pilot jobs
– They lack compatibility or adaptability
  • Between middleware and legacy applications

21 GWpilot
GWpilot main features:
– Easy to install and standalone from remote middleware
– Highly customizable and tunable, even by unskilled users
– Multi-user with fair-share policies
– Compatible with previously ported applications
– Interoperable with diverse Grid infrastructures
– Lightweight and scalable, achieving nearly optimal performance
– Advanced scheduling policies for any kind of tasks

22 GWpilot
[Diagram labels: GridWay Core, Scheduler, Applications; CLI, DRMAA, BES, JSDL; CREAM, GRAM, GWpilot Server, MDS2, GLUE; GWpilot Factory; CREAM CE, Globus CE, site-BDII, BDII; pilots, tasks, HTTPS pull]
– Allows submitting a percentage more pilots than the estimated free slots
– More accurate estimation of free slots

23 GWpilot
[Figure: GWpilot architecture]

24 Results
Tests have been carried out on the EGI production infrastructure
– fusion VO
Environment
– GridWay 5.6.0
– Java Virtual Machine 1.5.0.09
– Scheduling interval = 30 s
– Dispatch chunk = 15 jobs
– Maximum number of simultaneous jobs per user = 100

25 Results
Montera results with FAFNER2 [4] (task duration on the order of seconds)
The slowdown is normalized to that of Montera
[Table columns: Scheduler | Walltime [mm:ss] | Slowdown (relative) | Fault rate [%]]
[4] M. Rodríguez-Pascual et al., IEEE Trans. Plasma Sci. 38, 2102 (2010)

26 Results
Montera results with ISDEP [5] (task duration on the order of hours)
The slowdown is normalized to that of Montera
[Table columns: Scheduler | Walltime [mm:ss] | Slowdown (relative) | Fault rate [%]]
[5] A. Bustos et al., Nucl. Fusion 50, 125007 (2010)

27 Results
Montera results with FastDEP [6] (workflow of codes; task duration on the order of hours)
The slowdown is normalized to that of Montera
[Table columns: Scheduler | Walltime [mm:ss] | Slowdown (relative)]
[6] M. Rodríguez-Pascual et al., Plasma Phys. Contr. Fusion 55, 085014 (2013)

28 Results
Results of GWpilot with DKEsG [7]
– Task duration on the order of seconds
Some configuration parameters
– Pilot pulling interval: 45 seconds with 20 retries
– DKEsG polling time: 15 s
– Resources are ranked/prioritized based on CPU speed
[7] A. Rubio-Montero et al., IEEE Trans. Plasma Sci. 38, 2093 (2010)
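The pull interval and retry count above suggest a loop in which each pilot asks the server for work, sleeps when there is none, and gives up after a number of consecutive empty polls. That reading is an assumption, and the sketch below is not GWpilot code: `fetch_task`, `run_task` and the injected `sleep` are hypothetical callables so that the loop can be exercised without a network.

```python
import time


def pilot_loop(fetch_task, run_task, interval_s=45, max_retries=20,
               sleep=time.sleep):
    """Pull and run tasks until `max_retries` consecutive polls return nothing.

    Returns the number of tasks executed by this pilot.
    """
    executed = 0
    empty_polls = 0
    while empty_polls < max_retries:
        task = fetch_task()       # in a real pilot this would be an HTTPS request
        if task is None:
            empty_polls += 1
            sleep(interval_s)     # wait before asking again
            continue
        empty_polls = 0           # any task resets the retry counter
        run_task(task)
        executed += 1
    return executed


# Dry run with an in-memory queue and a sleep that returns immediately.
pending = ["task-1", None, "task-2"]
finished = []
count = pilot_loop(
    fetch_task=lambda: pending.pop(0) if pending else None,
    run_task=finished.append,
    max_retries=2,
    sleep=lambda seconds: None,
)
print(count, "tasks executed:", finished)
```

With the values on the slide, a pilot following this logic would stay idle for at most 20 × 45 s = 15 minutes before releasing its slot.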

29 Results
Used for submitting pilots
– DKEsG only submits 500 BoTs
– BoT suspension timeout: 60 s
– Limited to 100 jobs per resource
– Queues overloaded with 15% more pilots
– Suspension timeout: 5 hours
Max pilots submitted by GWpilot
GISELA (prod.vo.eu-eela.eu)
Discarded 32-bit, duplicate and CERN resources.
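The 15% overload and the limit of 100 jobs per resource can be combined into a simple rule for how many pilots to send to one resource. The sketch below is an interpretation, not GWpilot's scheduler: the function name, the rounding up and the treatment of already submitted pilots are assumptions.

```python
def pilots_to_submit(estimated_free_slots, overload_percent=15,
                     per_resource_limit=100, already_submitted=0):
    """Number of new pilots to submit to one resource (all arguments are ints)."""
    if estimated_free_slots < 0 or overload_percent < 0:
        raise ValueError("slots and overload percentage must not be negative")
    # Integer ceiling of slots * (1 + overload/100), avoiding float rounding.
    target = (estimated_free_slots * (100 + overload_percent) + 99) // 100
    # Respect the cap on jobs per resource.
    target = min(target, per_resource_limit)
    # Pilots already queued or running count against the target.
    return max(0, target - already_submitted)


for slots in (10, 20, 200):
    print(slots, "free slots ->", pilots_to_submit(slots), "pilots")
```

Submitting slightly more pilots than the published free slots compensates for pilots that fail or wait in the queue, while the cap prevents a single resource from being flooded.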

30 Results
Many overloaded sites
– Pilots die while a BoT is running inside them
– BoTs are suspended because they were assigned to dead pilots
Only 0.4% failing BoTs
62% failing grid jobs
Makespan: 94 h 27 min 31 s
– 610 times faster than sequential
Total time wasted in remote queues: 1 year and 20 days
– Not noticeable to the user
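The figures on this slide can be put on a common scale with a few lines of arithmetic. The sketch below only restates the reported makespan, speedup and queue time; the 365-day year is an assumption.

```python
SECONDS_PER_DAY = 24 * 3600
DAYS_PER_YEAR = 365  # assumption: calendar year without leap day

# Reported makespan: 94 h 27 min 31 s.
makespan_s = 94 * 3600 + 27 * 60 + 31

# Reported speedup with respect to a sequential run.
speedup = 610
sequential_s = makespan_s * speedup
sequential_years = sequential_s / (DAYS_PER_YEAR * SECONDS_PER_DAY)

# Reported accumulated waiting time in remote queues: 1 year and 20 days.
queue_wait_days = DAYS_PER_YEAR + 20
makespan_days = makespan_s / SECONDS_PER_DAY

print(f"makespan: {makespan_s} s ({makespan_days:.2f} days)")
print(f"equivalent sequential time: {sequential_years:.2f} years")
print(f"queue wait: {queue_wait_days} days in total")
```

The accumulated queue wait is far longer than the makespan itself, which is only possible because it is spread over many pilots waiting in parallel; this is consistent with the slide's remark that the user does not notice it.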

31 Conclusions
Task scheduling has not sufficiently taken into account the dynamic nature of the Grid
– Adaptive scheduling improves the performance of distributed codes
Pilot jobs had clear room for improvement
Montera and GWpilot allow distributed applications to be executed on Grid infrastructures in a more robust, efficient and unattended way

32 Conclusions
They could be a great asset in your computational challenges!
http://www.ciemat.es/portal.do?IDR=343&TR=C
http://www.gridway.org

33 THANKS A LOT

