Pre-GDB on Batch Systems (Bologna), 11th March 2014
Torque/Maui: PIC and NIKHEF experience
C. Acosta-Silva, J. Flix, A. Pérez-Calero (PIC), J. Templon (NIKHEF)



Outline
‣ System overview
‣ Successful experience (NIKHEF and PIC)
‣ Torque/Maui current situation
‣ Torque overview
‣ Maui overview
‣ Outlook

System overview
‣ TORQUE is a community and commercial effort based on the OpenPBS project. It improves scalability, adds fault tolerance, and provides many other features
  ‣ http://www.adaptivecomputing.com/products/open-source/torque/
‣ Maui Cluster Scheduler is a job scheduler capable of supporting multiple scheduling policies. It is free and open-source software
  ‣ http://www.adaptivecomputing.com/products/open-source/maui/

System overview
‣ The TORQUE/Maui system has the usual batch system capabilities:
  ‣ Queue definition (routing queues)
  ‣ Accounting
  ‣ Reservation/QOS/Partition
  ‣ FairShare
  ‣ Backfilling
  ‣ Handling of SMP and MPI jobs
‣ Multicore allocation and job backfilling ensure that Torque/Maui is capable of supporting multicore jobs

Successful experience
‣ NIKHEF and PIC are multi-VO sites with local and Grid users
‣ Successful experience with the Torque/Maui system during the first LHC run
‣ Currently, both are running Torque-2.5.13 + Maui-3.3.4
‣ NIKHEF: 30% non-HEP, 55% WLCG, rest non-WLCG HEP or local jobs. Highly non-uniform workload
  ‣ 3800 job slots
  ‣ 97.5% utilization (last 12 months)
  ‣ 2000 waiting jobs (average)

Successful experience
[Figures: NIKHEF running jobs (last year); NIKHEF queued jobs (last year)]

Successful experience
‣ PIC: 3% non-HEP, 83% Tier-1 WLCG, 12% ATLAS Tier-2, rest local jobs (ATLAS Tier-3, T2K, MAGIC, …)
  ‣ 3500 job slots
  ‣ Approx. 95% utilization (last 12 months)
  ‣ 2500 waiting jobs (average)

Successful experience
[Figure: PIC running jobs (last year)]

Successful experience
[Figure: PIC queued jobs (last year)]

Torque overview
‣ Torque has a very active community:
  ‣ Mailing list: torqueusers@supercluster.org
  ‣ Free support from Adaptive Computing
‣ New releases roughly once a year or more often, with frequent patches
  ‣ 2.5.13 is the last release of the 2.5.X branch


Torque overview
‣ Torque is well integrated with EMI middleware
‣ Widely used at WLCG Grid sites (~75% of sites in the BDII -pbs-)
‣ Not complex to install, configure and manage:
  ‣ via the qmgr tool
  ‣ plain text accounting
  ‣ …
‣ Torque scalability issues
  ‣ Reported for branch 2.5.X
  ‣ Not detected at our scale
  ‣ Branch 4.2.X brings significant enhancements to scalability for large environments, responsiveness, reliability, …
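As an illustration of why plain text accounting keeps management simple, the following minimal Python sketch parses a single accounting record. It assumes the commonly seen layout `date time;record type;job id;key=value key=value ...`; the sample line, host name, user and queue are hypothetical, and the layout should be checked against the site's own accounting files. Values containing spaces are not handled.

```python
def parse_accounting_line(line):
    """Split one record into (timestamp, record_type, job_id, attrs).

    Assumes the layout 'timestamp;type;jobid;key=value key=value ...'.
    """
    timestamp, record_type, job_id, message = line.rstrip("\n").split(";", 3)
    attrs = {}
    for token in message.split():
        # Tokens without '=' are ignored in this sketch.
        if "=" in token:
            key, value = token.split("=", 1)
            attrs[key] = value
    return timestamp, record_type, job_id, attrs


def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Hypothetical end-of-job ('E') record, invented for illustration only.
SAMPLE = (
    "03/11/2014 10:15:00;E;12345.ce.example.org;"
    "user=alice group=atlas queue=long "
    "start=1394528400 end=1394532000 Exit_status=0 "
    "resources_used.walltime=01:00:00"
)

timestamp, record_type, job_id, attrs = parse_accounting_line(SAMPLE)
walltime_s = hms_to_seconds(attrs["resources_used.walltime"])
print(job_id, attrs["user"], attrs["queue"], walltime_s)
```

Because each record is one line of text, per-VO or per-queue usage summaries can be built with a short loop over the file rather than a database query.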

Maui overview
‣ Support: Maui is no longer supported by Adaptive Computing
‣ Documentation:
  ‣ Poor documentation makes the initial installation complex
  ‣ Things do not always work as the documentation suggests
‣ Scalability issues:
  ‣ At ~8000 queued jobs, Maui hangs
  ‣ The MAXIJOBS parameter can be adjusted to limit the number of jobs considered for scheduling
  ‣ This solves the issue (currently in production at NIKHEF)
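The idea behind capping the jobs considered for scheduling can be sketched in a few lines of Python. This is a conceptual illustration only, not Maui's implementation: it assumes the cap is applied per submitting user (the slide does not say whether the limit is per user, per group, or global), and the function name, job dictionaries and cap value are hypothetical.

```python
def select_candidates(idle_jobs, max_idle_per_user):
    """Return the idle jobs one scheduling pass would look at.

    Jobs are assumed to arrive in submission order; at most
    max_idle_per_user jobs are kept for each user, the rest are
    left in the queue for a later pass.
    """
    kept_per_user = {}
    candidates = []
    for job in idle_jobs:
        user = job["user"]
        if kept_per_user.get(user, 0) < max_idle_per_user:
            candidates.append(job)
            kept_per_user[user] = kept_per_user.get(user, 0) + 1
    return candidates


# Hypothetical queue of 8000 idle jobs from two users.
queue = [{"id": i, "user": "prod" if i % 4 else "local"} for i in range(8000)]
considered = select_candidates(queue, max_idle_per_user=100)
print(len(queue), "queued,", len(considered), "considered per pass")
```

The scheduler's work per pass then depends on the cap and the number of active users rather than on the total queue length, which is why such a limit can keep a scheduler responsive under a deep backlog.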

Maui overview
‣ Moab is the non-free scheduler supported by Adaptive Computing and based on Maui
  ‣ Aims to increase scalability
  ‣ It has continued commercial support
  ‣ Configuration files are very similar to the ones in Maui: http://docs.adaptivecomputing.com/mwm/help.htm#a.kmauimigrate.html
‣ Feedback from sites running Torque/Moab would be a good complement to this review

Outlook
‣ Torque/Maui scalability issues
  ‣ Only relevant for larger sites
  ‣ Torque/Maui remains a feasible option for small and medium-size sites
  ‣ Might well be solved by the 4.2.X branch and by tuning Maui options
‣ Actually, multicore jobs reduce the number of jobs to be handled by the system
  ‣ For sites that are predominantly WLCG (e.g. PIC at 95%), switching to a pure multicore load would further reduce scheduling issues at the site level
  ‣ For sites that are much less WLCG-dominated (e.g. NIKHEF at 55%), a switch to a pure multicore load might actually increase scheduling issues at the site level, as this move would remove much of the entropy that allows reaching 97% utilization
‣ Another concern is support for the systems, Maui being the weakest link in the Torque/Maui combination
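The effect of multicore jobs on the number of jobs the system has to handle can be estimated with simple arithmetic. The sketch below uses the slot counts and WLCG fractions quoted in the slides, and assumes, purely for illustration, that all WLCG work runs as 8-core jobs while everything else stays single-core. It only counts running jobs and does not model the utilization effect mentioned for mixed workloads.

```python
def jobs_to_fill(slots, wlcg_fraction, cores_per_job):
    """Running jobs needed to fill the farm.

    The WLCG share of the slots runs jobs of cores_per_job cores,
    the remaining slots run single-core jobs. Slots left over by the
    integer division are simply ignored in this estimate.
    """
    wlcg_slots = round(slots * wlcg_fraction)
    other_slots = slots - wlcg_slots
    return wlcg_slots // cores_per_job + other_slots


CORES = 8  # assumed multicore job size, not taken from the slides

pic_single = jobs_to_fill(3500, 0.95, 1)
pic_multi = jobs_to_fill(3500, 0.95, CORES)
nikhef_single = jobs_to_fill(3800, 0.55, 1)
nikhef_multi = jobs_to_fill(3800, 0.55, CORES)

print("PIC:   ", pic_single, "->", pic_multi)
print("NIKHEF:", nikhef_single, "->", nikhef_multi)
```

Under these assumptions the running job count at PIC drops from 3500 to 590, while at NIKHEF it only drops from 3800 to 1971, since the single-core non-WLCG share dominates there.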

Outlook
‣ Some future options
  ‣ Change from Maui to Moab (but it is not free!)
  ‣ Set up a kind of “OpenMaui” project among WLCG sites as a community effort to provide support and improvements to Maui
  ‣ Integrate with another scheduler. Which one?
  ‣ Complete change to another system (SLURM, HTCondor, …)
  ‣ “Do nothing” until a real problem arrives
    ‣ Currently just a worry; no real problem detected so far at PIC/NIKHEF
    ‣ Improvements from migrating to another system are unclear

Outlook
‣ Questions:
  ‣ If it is decided that WLCG sites should move away from Torque/Maui, would that be feasible before LHC Run 2?
    ‣ Migration to a new batch system requires time and effort, and thus manpower and expertise, in order to reach adequate performance for a Grid site
    ‣ Not clear if it is needed before Run 2
  ‣ What happens with sites shared with non-WLCG VOs?
    ‣ Impact on other users (NIKHEF 45%)
    ‣ For PIC, several disciplines rely on local job submission. A change of batch system affects many users, and requires re-education, changes, and tests of their submission tools to adapt to the new system

