Batch Scheduling at CERN (LSF) Hepix Spring Meeting 2005 Tim Bell IT/FIO Fabric Services

End User Clusters
[Diagram] End users reach the 44 interactive servers ("lxplus", e.g. lxplus001) via DNS-like load balancing and submit jobs through the LSF CLI or a Grid UI to the batch servers ("lxbatch", e.g. lxbatch001, lxb0001). Batch jobs access the 400 disk servers and 80 tape servers over rfio (example node names in the diagram: disk001, tape001, lxfsrk123, tpsrv).

Hard/software of clusters
- Typically we make two acquisitions per year.
- Interactive login:
  - 44 lxplus: dual 2.8 GHz, 2 GB memory, 80 GB disk, SLC3
- Batch worker nodes (/pool is the working space):
  - 85 lxbatch: dual 800 MHz, 512 MB memory, 8 GB /pool, Red Hat 7
  - 86 lxbatch: dual 1 GHz, 1 GB memory, 8 GB /pool, SLC3
  - 538 lxbatch: dual 2.4 GHz, 1 GB memory, 40 GB /pool, SLC3
  - 226 lxbatch: dual 2.8 GHz, 2 GB memory, 40 GB /pool, SLC3
  - On order: 225 lxbatch, dual 2.8 GHz, 2 GB memory, 120 GB disk, SLC3

Current CERN LSF queues
- View queue information: bqueues [options] [queue-name]

% bqueues
QUEUE_NAME    PRIO  STATUS       MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
grid_dteam     20   Open:Active
grid_cms       20   Open:Active
grid_atlas     20   Open:Active
grid_lhcb      20   Open:Active
grid_alice     20   Open:Active
lcgtest        20   Open:Active
system_test    10   Open:Active
8nm             7   Open:Active
1nh             6   Open:Active
8nh             5   Open:Active
cmsprs          5   Open:Active
1nd             4   Open:Active
2nd             4   Open:Active
1nw             3   Open:Active
prod100         2   Open:Active
cmsdc04         2   Open:Active
prod200         1   Open:Active
prod400         1   Open:Active
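The per-queue limits behind this summary can be shown in long format, for example for one of the queues above:

  % bqueues -l 8nh

The long listing includes, among other things, the queue's cpu and run time limits, its priority, and any user or host restrictions.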

Queues are based on cputime requirements
- 8nm: 8 normalised minutes
- 1nh: 1 normalised hour
- 8nh: 8 normalised hours
- 1nd: 1 normalised day
- 1nw: 1 normalised week
- The normalisation changes every few years as machines get faster. We recently converted to kilo-SpecInt2000 (KSI2K) units so that one cpu hour on a 2.8 GHz farm PC is almost exactly one normalised hour (those machines are rated at roughly 1 KSI2K).
- Additional low-priority production queues, for reserved users with high job concurrency, allow 1nw of cpu time.
- Grid queues use mapped team accounts and allow 1nw of cpu time. Further refinements will be investigated.
- All queues can have resource or user group restrictions.
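For illustration (the script name here is just a placeholder), a job is submitted to one of these queues with bsub:

  % bsub -q 8nh -o myjob.log ./myjob.sh

A job that exceeds the queue's normalised cpu time limit is terminated by LSF.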

Queue issues
- The queue definitions are essentially empirical, matched to users' requirements for turnaround time: a user can expect many short jobs per day and a few long jobs overnight.
- Production queues are for low-priority work where specific turnaround is not an issue. The three queues allow higher numbers of concurrent jobs at decreasing priority.
- There is a grid queue for each VO, but without any cpu-time granularity.
- There is a local queue, funded by an experiment, for fast turnaround of analysis jobs.
- Many experiments define subgroups of privileged users which have higher priority within the group and can run higher numbers of concurrent jobs.
- We allow up to 3 shorter jobs to be scheduled per worker node, but stop at 2 if the cpu load exceeds 90% (a configuration sketch follows below).
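As an illustration only (the actual CERN settings are not shown in these slides, and the numbers below are placeholders), per-host slot limits and load-based dispatch thresholds of this kind are normally expressed in LSF's lsb.hosts file:

  Begin Host
  HOST_NAME    MXJ   ut    DISPATCH_WINDOW   # MXJ = job slots per host, ut = CPU utilisation threshold
  default       3    0.9   ()                # stop dispatching further jobs once utilisation exceeds ~90%
  End Host

This approximates the policy above; the exact "drop to 2 jobs" behaviour would need additional site-specific configuration.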

Resource sharing
- CERN experiments and major user groups apply annually for part of the centrally funded shared resources, and can fund extra capacity giving them a guaranteed share. Any unused shares are available to all. LSF schedules jobs firstly on a per-user-group basis to deliver the defined shares of cpu time to each group over a rolling 12-hour period.
- We have currently defined (arbitrarily) 1800 shares to match our 1800 KSI2K of capacity.
- There is an allocatable capacity reserve of 300 shares (which is used in practice) to satisfy short-term requests.
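The slides do not show the underlying tuning; as a hedged sketch, the weighting and history window of LSF's fairshare priority calculation are normally set in lsb.params, roughly along these lines (all values here are assumptions, not CERN's settings):

  Begin Parameters
  HIST_HOURS      = 12     # decay window for the accumulated cpu time used in fairshare
  CPU_TIME_FACTOR = 0.7    # weight of historical cpu time in the dynamic priority
  RUN_TIME_FACTOR = 0.7    # weight of the run time of currently running jobs
  RUN_JOB_FACTOR  = 3.0    # weight of the number of job slots already in use
  End Parameters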

Scheduling policies
- LSF supports hierarchies of user groups. Each level of the hierarchy can have a scheduling policy:
- Fairshare is used at the highest group level and by most experiments among their users. Jobs are scheduled to give equal shares on average over a certain period, currently 12+ hours. Queue priorities are ignored, so more long jobs may be started. (A group-definition sketch follows after this list.)
- FCFS (First Come First Served) is the default policy at CERN. Shorter queues are scheduled first. Within a group and queue, a single user can block others.
- Preemptive/Preemptable is used at CERN on an engineering cluster only (for parallel jobs).
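As an illustrative sketch only (the group and user names below are hypothetical, not CERN's), fairshare among the members of a user group is normally declared in LSF's lsb.users file:

  Begin UserGroup
  GROUP_NAME   GROUP_MEMBER          USER_SHARES
  u_expA       (alice bob carol)     ([alice, 4] [default, 1])   # a privileged member gets a larger share
  End UserGroup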

Current CERN shares per user group

[lxplus058.cern.ch] > 75 ~ > bhpart SHARE
HOST_PARTITION_NAME: SHARE
HOSTS: g_share/

SHARE_INFO_FOR: SHARE/
USER/GROUP    SHARES  PRIORITY  STARTED  RESERVED  CPU_TIME  RUN_TIME
u_prod
u_z
zp_prod
u_vp
u_vo
harp_dydak
u_slap
u_za
u_DELPHI
u_wf
u_LHCB
u_HARP
u_c
zp_burst
u_xu
u_CMS
u_vl
u_NA
u_coll
others
u_yt
u_l3c
u_vk
u_OPAL
u_COMPASS
u_zp
u_z
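The host partition behind this bhpart output is defined in LSF's lsb.hosts file. A minimal sketch of the shape of such a definition, with placeholder share values rather than CERN's actual 1800-share allocation:

  Begin HostPartition
  HPART_NAME = SHARE
  HOSTS = g_share                       # host group containing the lxbatch worker nodes
  USER_SHARES = [u_CMS, 200] [u_LHCB, 150] [others, 10]
  End HostPartition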

Resource Requirement String
- Most LSF commands accept a resource requirement string.
- It describes the resources required by a job.
- It is used to map the job onto execution hosts which satisfy the resource request.
- Queue names imply a cpu time (and real time) limit and are enough for most users.
- The real time limit is hardwired per queue at a multiple (3 to 4) of the cputime limit.
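A resource requirement string is built from optional sections; as an example with placeholder thresholds, passed to bsub via -R:

  % bsub -q 1nd -R "select[type==SLC3 && mem>500] order[ut]" ./myprog

The select[] section picks eligible hosts and order[] ranks them (here by lowest CPU utilisation); rusage[] and span[] sections also exist, for resource reservation and parallel-job layout.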

Users can request specific static resources

Resource  Describes                                  Value
type      host type (SLC3 or LINUX7)                 string
model     host model (SEIL_2400 etc.)                string
hname     hostname (lxb0001 etc.)                    string
cpuf      CPU factor (0.852 etc.)                    relative
server    host can run remote jobs                   boolean
rexpri    execution priority                         nice(2) argument
ncpus     number of processors                       integer
ndisks    number of local disks                      integer
maxmem    maximum RAM memory available to users      megabytes
maxswp    maximum available swap space               megabytes
maxtmp    maximum available space in /tmp directory  megabytes
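These static resources can be inspected with lshosts; an illustrative example (the memory and swap figures are placeholders, not real lxbatch values):

  % lshosts lxb0001
  HOST_NAME  type  model      cpuf  ncpus  maxmem  maxswp  server  RESOURCES
  lxb0001    SLC3  SEIL_2400  0.852     2   1024M   2048M     Yes  ()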

Or dynamic resources

Index   Measures          Units               Averaged over  Update interval
status  host status       string              -              15 sec
r15s    run queue length  processes           15 sec         15 sec
r1m     run queue length  processes           1 min          15 sec
r15m    run queue length  processes           15 min         15 sec
ut      CPU utilization   %                   1 min          15 sec
pg      paging activity   pgin+pgout per sec  1 min          15 sec
ls      logins            users               -              30 sec
it      idle time         min                 -              30 sec

Other dynamic resources

Index  Measures                       Units   Averaged over  Update interval
swp    available swap space           MB      -              15 sec
mem    available memory               MB      -              15 sec
tmp    available space in /tmp        MB      -              120 sec
io     disk i/o to local disks        kB/sec  1 min          15 sec
lftm   seconds until next shutdown    sec     -              15 sec
pool   available space on /pool disk  MB      -              15 sec
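The standard indices can be viewed with lsload (the figures below are illustrative, not measured values); site-defined indices such as pool and lftm appear in the long format, lsload -l:

  % lsload lxb0001
  HOST_NAME  status  r15s  r1m   r15m  ut   pg   ls  it  tmp    swp    mem
  lxb0001    ok      0.2   0.3   0.4   15%  0.0   1  12  3500M  1800M  600M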

Typical batch job requirements
- Beyond the queue name, the most commonly used resource requests are the following (a combined example is sketched after this list):
- "cpuf > factor", where current values run from 0.2 to 1.0. This avoids slower machines with long execution real times.
- "mem > MB", requesting free memory at job start. This avoids jobs running on hosts where they would be killed by our monitors or would swap excessively.
- "pool > MB", requesting current working directory space at job start. We kill jobs that use more than 75% of a node's work space.
- The current grid UI mapping for CPU factor and pool space is unclear.
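Putting these together, a submission combining the typical requests might look like this (the thresholds and script name are placeholders chosen for illustration):

  % bsub -q 1nd -R "select[cpuf>0.5 && mem>500 && pool>2000]" ./analysis_job.sh

Here pool is the site-defined index for free space on the /pool work disk listed above.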

Further information
- User-level information is on the web, with links to live status, accounting data and basic user guides.
- Live monitoring information, starting at the cluster level but able to descend to individual nodes, is also available on the web (internal access only).