Tools for an adaptive, efficient and fault-tolerant execution of tasks on dynamic distributed environments. M. Rodríguez-Pascual, A.J. Rubio-Montero.

Presentation transcript:

Tools for an adaptive, efficient and fault-tolerant execution of tasks on dynamic distributed environments
M. Rodríguez-Pascual, A.J. Rubio-Montero and R. Mayo-García (CIEMAT)
CLCAR 2013 – San José, August 26th–30th, 2013

Index
– Problem / Justification
– General Architecture
– Montera
– GWpilot
– Results
– Conclusions

Justification
– Scientists' computational requirements usually evolve over time
  – Peaks in resource demand versus underuse of those resources
  – Energy cost
  – Amortization
– Grid computing emerged as a complementary solution to those available at that time

Justification
– Nevertheless, the resources that form Grid infrastructures are highly dynamic
  – They belong to different administrative domains
  – Different geographic locations
  – Variable stability and accessibility: software updates and upgrades, hardware problems, maintenance work in the data centres, etc.

Justification
– EGI infrastructure availability, March 2013 (figure)

Justification
– Therefore, every application that is to be executed efficiently on a Grid infrastructure must be able to cope with the service variations it will inevitably face
  – Fault-tolerance techniques
  – Adaptive scheduling of tasks, according to the number and characteristics of the available resources
– Two potential solutions will be presented
  – Montera
  – GWpilot

Justification
– Scheduling of tasks on the Grid…
  – Has been, and still is, a well-established line of research
  – Techniques include task replication (M/N, bag of tasks) and task grouping/chunking (chunk ≡ number of samples)
– But, in general…
  – It has not been specifically applied to Monte Carlo codes or parameter-sweep applications
  – The dynamic nature of the Grid has not been taken into account
  – The overhead has not been properly estimated
  – The proposed solutions have only been tested on simulators or controlled environments

Justification
– The proposed solutions are of interest for codes in the fields of radiophysics, economics, finance, environment, high energy physics, engineering, statistics, plasma physics, chemistry, industrial optimization techniques, etc.

Architecture
– The solutions proposed in this work are based on GridWay [1]
  – A metascheduler that allows unattended execution on a dynamic Grid environment
  – It checks that all stages of an execution are performed correctly
  – It delegates to services such as GIS, GT4, CREAM/GRAM
– It has already demonstrated better capabilities than other tools such as WMS [2]
[1] E. Huedo et al., Future Generation Computer Systems 23, 252 (2007)
[2] J.L. Vázquez-Poletti et al., Multiagent Grid Systems 3, 249 (2007)

Architecture
– There are still many aspects of Grid resources that have not been successfully characterized
  – Bandwidth
  – Connection shortcuts
  – Static information
  – Thresholds in the nodes that are not published
  – Etc.
– Based on these, it is possible to improve the efficiency of code execution by
  – Reducing the final execution time
  – Making appropriate use of the available resources

Architecture
– Execution of distributed applications on the Grid with a solution based on GridWay (figure)
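GridWay's programmatic interface is the standard DRMAA API (it appears again in the Montera slides). Below is a minimal sketch of submitting one Grid task through the Python DRMAA bindings; the wrapper script and its arguments are hypothetical placeholders, and it assumes a DRMAA implementation backed by a configured GridWay installation.

```python
# Minimal DRMAA submission sketch. Assumes the Python `drmaa` bindings are
# backed by a GridWay installation; the executable and arguments are
# hypothetical placeholders for a real Monte Carlo task.
import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "mc_wrapper.sh"               # hypothetical task wrapper
    jt.args = ["--samples", "10000", "--seed", "42"]

    job_id = session.runJob(jt)
    info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print(f"task {job_id} finished with exit status {info.exitStatus}")

    session.deleteJobTemplate(jt)
```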

Montera
– Montera (from the Spanish "Monte Carlo rápido", fast Monte Carlo) is a tool initially implemented for the execution of Monte Carlo codes
  – Java, Python
– These codes are very useful for modelling complex problems
  – (Many) simulations + mathematics
  – Random seeds to initialize the simulations
  – Simple and independent tasks
  – Statistics to combine/join the results
  – Their execution time can be approximated by a straight line, T_exec ≈ a·N + b (with N the number of samples)
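Because the model is linear, estimating the walltime of a chunk on a given site, or inverting it to size a chunk for a target walltime, is immediate. A minimal sketch with illustrative names; the per-site performance factor comes from the site profiling described below.

```python
def predict_walltime(n_samples: int, a: float, b: float, site_perf: float = 1.0) -> float:
    """T_exec ~ a*N + b on the reference machine, scaled by the site's
    relative performance (site_perf > 1 means a faster site)."""
    return (a * n_samples + b) / site_perf

def samples_for_walltime(target_s: float, a: float, b: float, site_perf: float = 1.0) -> int:
    """Invert the linear model: the largest N whose predicted walltime
    still fits within target_s seconds on that site."""
    return max(0, int((target_s * site_perf - b) / a))
```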

Montera
– Montera dynamically modifies the number of tasks and where they are executed
  – DRMAA [3]
  – DyTSS (Dynamic Trapezoidal Self-Scheduling): asymptotic performance & half-performance length
– It takes into account
  – The infrastructure status
  – The size of the I/O files
  – The bandwidth
  – The site performance
– To do so, it profiles both the code and the sites (see the sketch below)
[3] P. Tröger et al., Proc. CCGrid 2007, 619 (2007)
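For orientation, the sketch below implements the classic trapezoidal self-scheduling chunk sequence: chunk sizes shrink linearly, so large early chunks keep sites busy while the small final chunks balance the load. It only illustrates the family of techniques DyTSS belongs to; the actual DyTSS algorithm additionally adapts chunk sizes per site using the asymptotic performance and half-performance length, which is not reproduced here.

```python
def tss_chunks(total_samples: int, n_sites: int, last_chunk: int = 1) -> list[int]:
    """Classic trapezoidal self-scheduling: chunk sizes decrease linearly
    from roughly N/(2p) down to `last_chunk`."""
    first = max(last_chunk, -(-total_samples // (2 * n_sites)))        # ceil(N / 2p)
    n_chunks = max(1, -(-2 * total_samples // (first + last_chunk)))   # ceil(2N / (F+L))
    step = (first - last_chunk) / max(1, n_chunks - 1)

    chunks, remaining, size = [], total_samples, float(first)
    while remaining > 0:
        c = min(remaining, max(last_chunk, round(size)))
        chunks.append(c)
        remaining -= c
        size -= step
    return chunks

# Example: 10,000 samples spread over 20 sites -> chunks shrinking from ~250 towards 1.
print(tss_chunks(10_000, 20)[:5])
```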

Montera
– Code profiling
  – Performs 1, 10, 100, 1000… simulations (or any other series) the first time the code is executed
  – Obtains a and b (since T_exec ≈ a·N + b)
  – Executes the Whetstone benchmark on the site
  – Normalization
  – Iterates this process over several sites to avoid misleading results
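A minimal sketch of that profiling step, assuming the measured (samples, seconds) pairs from the series are at hand: a and b come from an ordinary least-squares fit, and the timing values below are made up for illustration. How exactly Montera folds the Whetstone score into the normalization is not detailed in the slides, so the closing comment only indicates one plausible convention.

```python
import numpy as np

def fit_runtime_model(samples, seconds):
    """Least-squares fit of T_exec ~ a*N + b over the profiling series
    (e.g. runs with 1, 10, 100 and 1000 samples on one site)."""
    a, b = np.polyfit(np.asarray(samples, dtype=float),
                      np.asarray(seconds, dtype=float), deg=1)
    return a, b

# Hypothetical timings of the profiling runs on a single site:
a, b = fit_runtime_model([1, 10, 100, 1000], [2.1, 3.0, 12.5, 108.0])
# A site_perf factor for the model above could then be derived from the
# Whetstone scores, e.g. site_perf = whetstone(target) / whetstone(profiled).
```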

Montera
– Site profiling
  – When: whenever a new site is discovered, or a change is found in an already known one (memory, CPU, …)
  – How: Whetstone, because of its focus on floating-point performance; Globus MDS
  – Why: the codes are also normalized, so the time estimation is simple and accurate

Montera
– Site profiling
  – After every execution, these parameters are recalculated: performance; efficiency, failed tasks, available slots; queue time; bandwidth
  – Weighted calculations (Bessel formula): the newest executions weigh more than the older ones (see the sketch below)
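The slides only name the weighting scheme; as a rough illustration of "newer executions weigh more", here is a simple exponentially weighted update. The actual formula used by Montera may differ.

```python
def update_site_metric(previous: float, measured: float, alpha: float = 0.3) -> float:
    """Exponentially weighted update of a per-site metric (performance, queue
    time, bandwidth, failure rate...): the newest measurement weighs most and
    the influence of older executions decays geometrically."""
    return alpha * measured + (1.0 - alpha) * previous

# Example: refresh the estimated queue time of a site after a new execution.
queue_time_est = 120.0                                  # seconds, from previous history
queue_time_est = update_site_metric(queue_time_est, measured=45.0)
```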

Montera
– Information retrieval for adaptive scheduling in Montera (figure)

GWpilot
– Benefits of pilot jobs (see the generic sketch below)
  – Reduce the Grid complexity: direct use and characterization of the assigned WNs; direct monitoring of user tasks
  – Fix task-dispatching overheads: remove the waiting time in remote queues; remove middleware overheads and errors (CREAM, GRAM)
  – Reduce the task error rate: middleware, hardware or connectivity
  – Increase compatibility: creating special configurations; implementing legacy communication protocols
  – Allow the implementation of advanced scheduling techniques
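To make the pull model concrete, below is a minimal, generic sketch of what a pilot does once it starts on a worker node: poll a pilot server for a task over HTTPS, run it, report the result, and exit after a period of inactivity. The endpoint paths, payload fields and server name are illustrative assumptions, not GWpilot's actual protocol; the timings mirror the values quoted later in the results.

```python
import subprocess
import time

import requests  # assumed to be available on the worker node

SERVER = "https://pilot-server.example.org"   # hypothetical endpoint
PULL_INTERVAL_S, MAX_IDLE_PULLS = 45, 20      # mirrors the intervals quoted in the results

def run_pilot() -> None:
    idle = 0
    while idle < MAX_IDLE_PULLS:
        reply = requests.get(f"{SERVER}/task", timeout=30)      # pull a task description
        if reply.status_code == 204:                            # nothing to run yet
            idle += 1
            time.sleep(PULL_INTERVAL_S)
            continue
        idle = 0
        task = reply.json()                                     # e.g. {"id": ..., "cmd": [...]}
        proc = subprocess.run(task["cmd"], capture_output=True, text=True)
        requests.post(f"{SERVER}/task/{task['id']}/result",     # report the outcome back
                      json={"exit_code": proc.returncode, "stdout": proc.stdout})

if __name__ == "__main__":
    run_pilot()
```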

GWpilot
– Several pilot-job systems already exist
  – Centralized (AliEn, PanDA, DIRAC, glideinWMS, etc.)
  – Application-oriented (DIANE, SAGA-BigJob): mono-user, mono-application
– In general
  – They do not exploit all the scheduling advantages provided by pilot jobs
  – They lack compatibility or adaptability between the middleware and legacy applications

GWpilot
– Main features
  – Easy to install and independent of the remote middleware
  – Highly customizable and tunable, even by unskilled users
  – Multi-user, with fair-share policies
  – Compatible with previously ported applications
  – Interoperable with diverse Grid infrastructures
  – Lightweight and scalable, achieving nearly optimal performance
  – Advanced scheduling policies for any kind of task

GWpilot
– Architecture diagram (figure): applications reach the GridWay core through its CLI, DRMAA and BES/JSDL interfaces; the GWpilot Factory submits pilots through the CREAM/GRAM drivers to the sites (CREAM CE / Globus CE, discovered via the site BDIIs), and the running pilots pull the users' tasks from the GWpilot Server over HTTPS
– The scheduler allows submitting a percentage more pilots than the estimated number of free slots, and obtains a more accurate estimation of the free slots

GWpilot
– GWpilot architecture (figure)

Results
– Tests have been carried out on the EGI infrastructure, in production status (fusion VO)
– Environment: GridWay, Java Virtual Machine; scheduling interval = 30 s; dispatch chunk = 15 jobs; maximum number of simultaneous jobs per user = 100

Results
– Montera results with FAFNER2 [4] (tasks lasting on the order of seconds)
– The slowdown is normalized to that of Montera
– Table columns: Scheduler | Walltime [mm:ss] | Slowdown (relative) | Fault rate [%] (values shown in the original slide)
[4] M. Rodríguez-Pascual et al., IEEE Trans. Plasma Sci. 38, 2102 (2010)

Results
– Montera results with ISDEP [5] (tasks lasting on the order of hours)
– The slowdown is normalized to that of Montera
– Table columns: Scheduler | Walltime [mm:ss] | Slowdown (relative) | Fault rate [%] (values shown in the original slide)
[5] A. Bustos et al., Nucl. Fusion 50, (2010)

Results
– Montera results with FastDEP [6] (workflow of codes; tasks lasting on the order of hours)
– The slowdown is normalized to that of Montera
– Table columns: Scheduler | Walltime [mm:ss] | Slowdown (relative) (values shown in the original slide)
[6] M. Rodríguez-Pascual et al., Plasma Phys. Control. Fusion 55, (2013)

Results
– Results of GWpilot with DKEsG [7] (tasks lasting on the order of seconds)
– Some configuration parameters
  – Pilot pulling interval: 45 seconds, with 20 retries
  – DKEsG polling time: 15 s
  – Resources are ranked/prioritized based on CPU speed
[7] A. Rubio-Montero et al., IEEE Trans. Plasma Sci. 38, 2093 (2010)

Results
– Configuration used for submitting pilots on GISELA (prod.vo.eu-eela.eu); 32-bit, duplicate and CERN resources were discarded
  – DKEsG only submits 500 BoTs; BoT suspension timeout: 60 s
  – Limited to 100 jobs per resource; queues overloaded with 15% more pilots; pilot suspension timeout: 5 hours
– Maximum number of pilots submitted by GWpilot (shown in the figure)

Results
– Many overloaded sites: pilots die while a BoT is running inside them, and BoTs get suspended because they were assigned to dead pilots
– Only 0.4% of BoTs failed, versus 62% of grid jobs failing
– Makespan: 94 h 27 m 31 s, i.e. 610 times faster than sequential execution (the equivalent sequential run would take roughly 94.5 h × 610 ≈ 57,600 hours, about 6.6 years)
– Total time wasted at remote queues: 1 year and 20 days, not noticeable by the user

Conclusions
– Task scheduling has not sufficiently taken into account the dynamic nature of the Grid
  – Adaptive scheduling improves the performance of distributed codes
– Pilot-job systems had clear room for improvement
– Montera and GWpilot allow distributed applications to be executed on Grid infrastructures in a more robust, efficient and unattended way

Conclusions
– They could be a great asset for your computational challenges!

THANKS A LOT