Friday 27 April 2007
Management of ATLAS CC-IN2P3
Specificities, issues and advice

27/04/2007 ATLAS production - J.Devemy / N.Lajili

Summary
A few metrics
Overview of local monitoring tools
Concrete actions taken
Issues
Advice
BQS point of view
Questions

A few metrics - March 2007
Jobs:
– Total submitted jobs:
– Total submitted jobs class Long:
Separate farms:
– Pistoo: 56 CPUs (for parallel jobs)
– Anastasie: 1616 CPUs
Farm usage:
– 62 groups (experiment or laboratory) and 384 users

A few metrics
Jobs
– Total submitted jobs:
– Total submitted jobs class Long: = 47 %
Memory use for jobs in class Long
– Memory consistently requested: 2 GB
– 69 % used less than 1.5 GB
– 29 % used more than 1.5 GB and less than 2 GB
CPU time use for jobs in class Long (in IN2P3 units)
– CPU time consistently requested: s
– 97 % of jobs used less than s
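
The memory figures above allow a rough lower bound on how much requested memory goes unused per class Long job. The sketch below is only an illustration of that arithmetic: the function name and the band representation are hypothetical, and the 2 % of jobs not covered by the two bands is assumed to waste nothing, which keeps the result a lower bound.

```python
def min_unused_memory_gb(requested_gb, usage_bands):
    """Lower bound on the mean unused memory per job, in GB.

    usage_bands is a list of (fraction_of_jobs, upper_bound_of_usage_gb).
    Each job is assumed to use the top of its band, so the real waste
    can only be larger than the value returned here.
    """
    unused = 0.0
    for fraction, upper_gb in usage_bands:
        unused += fraction * max(requested_gb - upper_gb, 0.0)
    return unused


# Figures from the slide: 2 GB requested, 69 % of jobs below 1.5 GB,
# 29 % between 1.5 GB and 2 GB.
bands_long = [(0.69, 1.5), (0.29, 2.0)]
lower_bound_long = min_unused_memory_gb(2.0, bands_long)

print(f"At least {lower_bound_long:.3f} GB unused per job on average")
```

Under these assumptions, at least 0.69 × 0.5 GB of each 2 GB request is left unused on average, before counting jobs that stay far below 1.5 GB.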

A few metrics
Jobs
– Total submitted jobs:
– Total submitted jobs class T: = 59 % of total jobs
Memory use for jobs in class Long
– Memory consistently requested: 2 GB
– 86 % used less than 1.5 GB
– 14 % used more than 1.5 GB and less than 2 GB
CPU time use for jobs in class Long (in IN2P3 units)
– CPU time consistently requested: s
– 98 % of jobs used less than s

Monitoring CC: MRTG
Real-time production status
– Green: all ATLAS running jobs at CC-IN2P3
– Blue: all LHC running jobs at CC-IN2P3
– Orange: ((all LHC running jobs at CC-IN2P3) / (all ATLAS running jobs at CC-IN2P3)) * 100

Local job monitoring tool
Tools to detect problematic job behaviour:
1. "Slow" jobs: running jobs which do not consume CPU time.
2. "Early ended" jobs: batches of jobs using much less CPU time than requested.
3. Alert mails: BQS sends mail to the grid site admin in case of job failure.
4. Manual check: by running scripts.
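
The first two checks above can be sketched as simple predicates over job accounting samples. This is a minimal illustration, not the CC-IN2P3 tool: the `JobSample` record, the state names and both thresholds (`min_cpu_delta_s`, `min_fraction`) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class JobSample:
    job_id: str
    state: str              # "RUNNING" or "ENDED" (hypothetical state names)
    cpu_used_s: float       # CPU seconds consumed so far
    cpu_used_prev_s: float  # CPU seconds at the previous sample
    cpu_requested_s: float  # CPU seconds requested at submission


def is_slow(job, min_cpu_delta_s=1.0):
    """Running job whose CPU time did not advance between two samples."""
    delta = job.cpu_used_s - job.cpu_used_prev_s
    return job.state == "RUNNING" and delta < min_cpu_delta_s


def is_early_ended(job, min_fraction=0.05):
    """Ended job that used much less CPU time than it requested."""
    return (job.state == "ENDED"
            and job.cpu_used_s < min_fraction * job.cpu_requested_s)


def classify(jobs):
    """Return the job IDs flagged by each check."""
    return {
        "slow": [j.job_id for j in jobs if is_slow(j)],
        "early_ended": [j.job_id for j in jobs if is_early_ended(j)],
    }


samples = [
    JobSample("a", "RUNNING", 5000.0, 4400.0, 100000.0),  # progressing
    JobSample("b", "RUNNING", 1200.0, 1200.0, 100000.0),  # no CPU consumed
    JobSample("c", "ENDED", 30.0, 30.0, 100000.0),        # ended almost at once
    JobSample("d", "ENDED", 80000.0, 80000.0, 100000.0),  # normal end
]
report = classify(samples)
print(report)
```

In practice the "early ended" check matters most when many jobs from the same submission are flagged together, which usually points at an environment or service problem rather than at a single job.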

Local job monitoring tool

Concrete actions taken
Establish a detailed diagnosis of job failures, e.g.:
– Lack of resources, expired proxy, pending transfers, core LCG services unavailable
– Job environment settings for a given VO
Find the job identity
– LCG job IDs, BQS job IDs, Globus job IDs
Inform the users or the VO admin
Notify the administrators of the services involved
– By mail or GGUS ticket
Various tasks for managing the production, including:
– Jobs can be deleted or held in the queue in case of problems

Concrete actions taken
Increasing the VO's quota
– To cope with intensive computing: DC, MC production...
Creating BQS resources
– To cope with unavailability of internal services (HPSS, dCache)
Setting up VO agents
– To automatically regulate job priorities and resources according to the VO requirements

Issues
Sometimes it is hard to find the user
Sometimes users are very slow to react
Users are not well informed about the LCG service status
Recurrent problems with file access or copy: remote SRM SE unavailable, LFC not responding, transfers failing...
Hard to trace jobs which are not submitted through Resource Brokers
Lack of visibility of the status of core LCG services

Issues
Zombie processes left by ended jobs on the worker nodes
– Solved in the next BQS version
Lack of tools to manage VO priorities
– Solved in a future version (autumn 2007)
Memory wasted by jobs submitted to class Long

Advice
To have more running jobs:
– Always keep jobs queued, to sustain a high number of running jobs
– Limit the memory request for jobs submitted to class Long
Keep us informed as early as possible about critical production periods

BQS
BQS in a few words...
– Home-built batch system (10 years old)
– Works on all UNIX systems (GNU/Linux, Solaris, AIX...)
– Under continuous evolution: new functionalities are added, enabling scalability, robustness, reliability, functionality required by users, GRID compliance
– Very rich scheduling policy, including: quotas, resource status, number of queued and running jobs...

BQS (2)
BQS philosophy:
– Dispatch of heterogeneous jobs on a worker node
– Usage of BQS resources (a kind of semaphore)
Current developments:
– Addition of GRID functionalities: managing VOMS groups and roles, storing more GRID information in BQS
– New BQS servers (to easily absorb the growth of activity)
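
The idea of resources acting as a kind of semaphore can be illustrated with a toy dispatcher. This is not BQS code or its interface: the `ResourcePool` class, the resource names and the rule that each job takes one unit of every resource it declares are all assumptions made for the sketch. It shows how setting a resource's capacity to zero while a service such as HPSS is unavailable keeps dependent jobs queued instead of letting them start and fail.

```python
class ResourcePool:
    """Counting semaphores keyed by resource name (toy model)."""

    def __init__(self, capacities):
        self.capacity = dict(capacities)
        self.in_use = {name: 0 for name in capacities}

    def try_acquire(self, needs):
        """Take one unit of every needed resource, or none at all."""
        if any(self.in_use[r] >= self.capacity[r] for r in needs):
            return False
        for r in needs:
            self.in_use[r] += 1
        return True

    def release(self, needs):
        """Give back the units held by a finished job."""
        for r in needs:
            self.in_use[r] -= 1


def dispatch(queue, pool):
    """Start every queued job whose resources are all available."""
    started, held = [], []
    for job_id, needs in queue:
        if pool.try_acquire(needs):
            started.append(job_id)
        else:
            held.append(job_id)
    return started, held


# Hypothetical situation: HPSS is down (capacity 0), dCache allows 2 jobs.
pool = ResourcePool({"hpss": 0, "dcache": 2})
queue = [
    ("j1", ["dcache"]),
    ("j2", ["hpss"]),
    ("j3", ["dcache", "hpss"]),
    ("j4", ["dcache"]),
    ("j5", ["dcache"]),
]
started, held = dispatch(queue, pool)
print("started:", started)
print("held:", held)
```

Checking all resources before taking any of them avoids a job holding a dCache unit while it waits for HPSS, which would otherwise block jobs that only need dCache.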

Comments / Questions