GWpilot: a personal pilot system
A.J. Rubio-Montero, E. Huedo and R. Mayo-García
EGI Technical Forum 2012, Prague – 20 Sep 2012


EGI TF 2012 – Prague, 20 Sep

Outline
• Common problems in Grid computation
• Pilot Jobs
• GWpilot
  - Advantages
  - Utilisation
  - Design and improvements
  - Suitability
• DKEsG
  - Revision
  - Description of the calculation performed
  - Performance measurements
  - Some results
• Conclusions

Common problems in Grid computation
• Variable overheads: queue waiting times, overload...
• Variable error rate: connection cuts, jobs arbitrarily aborted...
• Diverse configurations: complexity, unexpected types of WN...
• To increase performance by means of scheduling, the resources must be completely characterised, but this is impossible with generic middleware:
  - Design defects: the GLUE specification is incomplete (no bandwidth, latency, queue policy, average waiting time, resource profile...)
  - Misconfigurations and lack of maintenance
• Solutions
  - Self-scheduling
  - Models
  - Heuristics
  - Statistics
  - Pilot jobs

Pilot Jobs: basics
[Diagram labels: slot appropriation; Pilot Factory; pilots; CE; LRMS queue; Coordinator Server; tasks; task monitoring]
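The pull model in this diagram can be sketched in a few lines of Python. This is only an illustrative in-process simulation, not GWpilot code: the class names `Coordinator` and `Pilot` and the methods `next_task`, `report` and `pull_once` are hypothetical. Once a pilot holds a slot it asks the coordinator for work, so a task is bound to a worker node only when that node is known to be running.

```python
from collections import deque


class Coordinator:
    """Hypothetical coordinator server holding the queue of user tasks."""

    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.done = {}  # task -> name of the pilot that ran it

    def next_task(self):
        # Late binding: a task is handed out only when a live pilot asks.
        return self.pending.popleft() if self.pending else None

    def report(self, task, pilot_name):
        self.done[task] = pilot_name


class Pilot:
    """Hypothetical pilot that already runs in an appropriated slot."""

    def __init__(self, name, coordinator):
        self.name = name
        self.coordinator = coordinator

    def pull_once(self):
        task = self.coordinator.next_task()
        if task is None:
            return False
        # A real pilot would execute the task here before reporting back.
        self.coordinator.report(task, self.name)
        return True


coordinator = Coordinator(tasks=[f"task-{i}" for i in range(5)])
pilots = [Pilot("pilot-A", coordinator), Pilot("pilot-B", coordinator)]

# Let every pilot pull once per round until no pilot receives work.
while any([pilot.pull_once() for pilot in pilots]):
    pass
```

In this sketch the coordinator never pushes work to a pilot; tasks leave the queue only in response to a pull, which is the property that removes the remote queue wait from the task's own turnaround.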

Pilot jobs: benefits
• Reduce the Grid complexity:
  - direct use and characterization of assigned WNs
  - direct monitoring of user tasks
• Fix task dispatching overheads:
  - remove the waiting time in remote queues
  - remove middleware overheads and errors (CREAM, GRAM)
• Reduce the task error rate: middleware, hardware or connectivity
• Increase compatibility:
  - creating special configurations
  - implementing legacy communication protocols
• Allow the implementation of advanced scheduling techniques

Pilot Jobs: implementations
• Centralized frameworks: daunting maintenance, deployment and customization.
  - AliEn and PanDA (suitable for HEP users)
  - DIRAC
  - glideinWMS
  - EDGeS (XtremWeb) and GridBot (BOINC)
• Application-oriented: mono-user, mono-application.
  - DIANE
• These systems do not exploit all the scheduling advantages provided by pilot jobs, or they lack compatibility or adaptability.
• Alternative: GWpilot

GWpilot: features
• Easy to install and standalone from remote middleware
• Highly customizable and tuneable, even by unskilled users
• Multi-user with fair-share policies
• Compatible with previously ported applications
• Interoperable with diverse Grid infrastructures
• Lightweight and scalable, achieving nearly optimal performance
• Advanced scheduling policies for any kind of task

GWpilot: simplicity of use and configuration
• GWpilot makes the use of pilot jobs automatic and unattended for both users and developers. Users only have to set the REQUIREMENTS line in their tasks:

    # cat ls_template.jt
    EXECUTABLE = /bin/ls
    STDOUT_FILE = logs/ls.out.${ARCH}.${JOB_ID}
    STDERR_FILE = logs/ls.err.${ARCH}.${JOB_ID}
    REQUIREMENTS = LRMS_NAME = "jobmanager-pilot"
    RANK = CPU_MHZ
    # gwsubmit -t ls_template.jt

• Usual configuration parameters:
  - maximum number of submitted pilots
  - dispatching suspension timeout (maximum time spent at the remote LRMS)
  - pilot pulling interval against GWpilot and number of retries

    # cat gwd.conf
    …
    IM_MAD = pilot_im:gw_im_mad_pilot::dummy:pilot_em
    EM_MAD = pilot_em:gw_em_mad_pilot:-m 550 –t i 45 -f 20 :rsl_nsh
    …

GWpilot: integrated into GridWay metascheduler
[Architecture diagram labels: Applications; CLI, DRMAA, BES, JSDL; GridWay Core; Scheduler; CREAM, GRAM; MDS2, GLUE; GWpilot Server; GWpilot Factory; CREAM CE and GLOBUS CE, each with a site-BDII; BDII; pilots; tasks; HTTPS pull]
• Allows submitting a percentage more pilots than the estimated free slots
• More accurate estimation of free slots

GWpilot: suitability for distributed applications
• Could give a boost to your computational challenges!
• Legacy applications previously ported to GridWay or to the DRMAA/BES/JSDL standards can directly benefit from GWpilot.
• Examples:
  - Truba/MaRaTra
  - VMEC
  - ISDEP
  - FAFNER-2
  - gGEM
  - DKEsG: Drift Kinetic Equation solver for Grid

DKEsG: calculating NC transport of Fusion devices
• Fluxes through the surfaces generated by the magnetic field lines (equation shown on the slide)
• NC transport coefficients
• The DKEsG Workflow *1, updated with Spong’s DKEs code *2
• DRMAA-enabled producer-consumer workflow: chunking of tasks and polling time for BoT states are customizable

*1 A. J. Rubio-Montero et al., “Drift Kinetic Equation Solver for Grid (DKEsG),” IEEE Trans. Plasma Sci., 38(9)
*2 D. A. Spong, “Generation and damping of neoclassical plasma flows in stellarators,” Phys. Plasmas, 12(5)
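The "chunking" step of the producer side can be illustrated with a short Python sketch. It is not taken from DKEsG: the helper name `chunk_into_bots` and the parameter `bot_size` are hypothetical, and the sketch only shows how a stream of independent tasks could be grouped into bags of tasks (BoTs) of a configurable size.

```python
from itertools import islice


def chunk_into_bots(tasks, bot_size):
    """Yield consecutive lists of at most bot_size tasks (hypothetical helper)."""
    if bot_size < 1:
        raise ValueError("bot_size must be at least 1")
    iterator = iter(tasks)
    while True:
        bot = list(islice(iterator, bot_size))
        if not bot:
            return
        yield bot


# Twelve placeholder tasks grouped into BoTs of five; the last BoT is partial.
bots = list(chunk_into_bots(range(12), bot_size=5))
```

Because the helper is a generator, a producer could hand BoTs to the consumer one at a time instead of materialising the whole parameter scan first.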

Experiment: DKEsG parameter scan with the TJ-II standard configuration
• r[2…141] × EFIELD[-250…250:10] × CMUL[(1…9)·10^(-7…0)] = 514,080 independent tasks (1 to 12 min each, proportional to radius)
• 420 tasks
• 103,236 independent BoTs
• 5 tasks per BoT
• 6.58 years on an Intel Xeon X5365 3 GHz (64-bit)
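The task count can be checked by enumerating the scan. The sketch below assumes that the ranges in the slide notation are inclusive, that r takes integer values, and that CMUL takes every value m·10^e with m in 1…9 and e in -7…0; the variable names are illustrative only.

```python
from itertools import product

radii = range(2, 142)            # r = 2..141 inclusive: 140 values
efields = range(-250, 251, 10)   # EFIELD = -250..250 in steps of 10: 51 values
# CMUL = m * 10**e, kept as (mantissa, exponent) pairs: 9 * 8 = 72 values
cmuls = [(m, e) for e in range(-7, 1) for m in range(1, 10)]

total_tasks = len(radii) * len(efields) * len(cmuls)

# Enumerating the Cartesian product explicitly gives the same count.
enumerated = sum(1 for _ in product(radii, efields, cmuls))
```

Under these assumptions 140 × 51 × 72 = 514,080, which matches the number of independent tasks stated on the slide.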

Resources used and configuration bounds
• GISELA infrastructure (prod.vo.eu-eela.eu). Discarded 32-bit, duplicate and CERN resources.
• [Chart labels: used for submitting pilots; max pilots submitted by GWpilot]
• DKEsG only submits 500 BoTs
• BoT suspension timeout: 60 secs
• Limited to 100 jobs per resource
• Queues overloaded with 15% more pilots
• Suspension timeout: 5 hours
• Other configuration parameters:
  - Pilot pulling interval: 45 seconds with 20 retries.
  - DKEsG polling time: 15 secs.
  - Resources are ranked/prioritised based on CPU speed
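One plausible reading of the two bounds "limited to 100 jobs per resource" and "queues overloaded with 15% more pilots" is sketched below. The function `pilots_for_resource` and its parameters are hypothetical, and the way GWpilot actually combines these limits may differ; the sketch only shows the arithmetic of overbooking an estimated number of free slots and then applying a per-resource cap.

```python
def pilots_for_resource(estimated_free_slots, overload_percent=15,
                        per_resource_limit=100):
    """Return how many pilots to aim for at one resource (illustrative only).

    Overbook the estimated free slots by overload_percent, rounding the extra
    pilots up, then cap the result at per_resource_limit.
    """
    # Integer ceiling of free_slots * overload_percent / 100, avoiding floats.
    extra = -(-estimated_free_slots * overload_percent // 100)
    return min(estimated_free_slots + extra, per_resource_limit)


small_site = pilots_for_resource(20)    # 20 free slots plus 3 extra pilots
large_site = pilots_for_resource(200)   # capped by the per-resource limit
```

Rounding the extra pilots up is a design choice of this sketch, so that even a small site gets at least one surplus pilot when it reports free slots.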

Experiment: measured computational results
• Many overloaded sites
• Pilots die while a BoT is running inside
• BoTs are suspended because they have been assigned to dead pilots
• Only 0.4% failing BoTs
• 62% failing grid jobs
• Makespan: 94 h 27 m 31 s, 610 times faster than sequential
• Total time wasted at remote queues: 1 year and 20 days. Not appreciable by the user.
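The quoted speedup can be reproduced from the makespan and the 6.58 years of sequential time given for the parameter scan. The sketch assumes 365-day years; the variable names are illustrative.

```python
from datetime import timedelta

makespan = timedelta(hours=94, minutes=27, seconds=31)
sequential = timedelta(days=6.58 * 365)  # 6.58 years, assuming 365-day years

# Dividing two timedeltas yields a plain float ratio.
speedup = sequential / makespan
```

Under this assumption the ratio comes out slightly above 610, consistent with the "610 times faster than sequential" figure.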

Experiment: turnaround measurements
• DKEsG cannot supply enough BoTs
• The number of available pilots is lower than 500
• Pilot overhead is always between secs
• Scalability of GWpilot: accumulated turnaround overhead is only 6.21%. If only GridWay were used, the resulting overhead* would be %

* A.J. Rubio-Montero et al., “Executions of a Drift Kinetic Equation solver on Grid,” in PDP 2010, Pisa, Italy.

Plasma results: bootstrap current (D13) of the outer radial plasma position in TJ-II (negative polarization)
• This surface needs 3672 DKEsG-Mono tasks = 663 CPU hours consumed from the Grid
• Bootstrap current tends to zero
• The collisionless asymptotic value (which depends on the configuration) is recovered.
• Larger uncertainties appear in the long mean free path regime.
• As expected, the coefficients are even in the electric field.

Plasma results: normalized NC transport coefficients (L33) from the resistivity enhancement (positive polarization)
• CMUL and EFIELD parameters decrease monotonically as K increases.
• By solving the integration in K, there is a continuous reutilization of the data included in the database.

Conclusions
• Summary
  - GWpilot can easily improve the performance of several kinds of fusion codes.
  - New features have been implemented in GridWay and DKEsG.
  - DKEsG execution shows impressive improvements in terms of makespan and turnaround.
• Future work
  - Continue the DKEsG calculations in order to build an extensive database for several fusion devices, allowing users to read the monoenergetic coefficients and obtain the final fluxes without repeating the calculations.
  - We are evaluating GWpilot with applications from other scientific areas.

Thanks for your attention