Published by Laurence Lawrence. Modified over 8 years ago.
1
Friday 27 April 2007
Management of ATLAS jobs @ CC-IN2P3: Specificities, issues and advice
2
27/04/2007 - ATLAS production - J.Devemy / N.Lajili
Summary
A few metrics
Overview of local monitoring tools
Concrete actions taken
Issues
Advice
BQS point of view
Questions
3
A few metrics - March 2007
Jobs:
  – Total submitted jobs: 400 000
  – Total submitted jobs, class Long: 200 000
2 separate farms:
  – Pistoo: 56 CPUs (for parallel jobs)
  – Anastasie: 1616 CPUs
Farm usage:
  – 62 groups (experiment or laboratory) and 384 users
4
A few metrics - ATLAS @ March 2007
Jobs:
  – Total submitted jobs: 52 000
  – Total submitted jobs, class Long: 28 000 = 47 %
Memory use for jobs in class Long:
  – Memory consistently requested: 2 GB
  – 69 % used less than 1.5 GB
  – 29 % used more than 1.5 GB and less than 2 GB
CPU time use for jobs in class Long (in IN2P3 units):
  – CPU time consistently requested: 2 000 000 s
  – 97 % of jobs used less than 800 000 s
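The percentages on this slide compare what jobs requested with what they actually used. A minimal sketch of that kind of summary follows, assuming per-job accounting records are available. The `JobRecord` type, its field names and the sample values are hypothetical and are not taken from BQS or from CC-IN2P3 data.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical accounting record for one finished job."""
    requested_mem_gb: float
    used_mem_gb: float
    requested_cpu_s: int
    used_cpu_s: int


def share_below(values, threshold):
    """Return the percentage of values strictly below the threshold."""
    values = list(values)
    if not values:
        return 0.0
    return 100.0 * sum(v < threshold for v in values) / len(values)


# Illustrative records only; every job requests 2 GB and 2 000 000 s.
jobs = [
    JobRecord(2.0, 0.9, 2_000_000, 300_000),
    JobRecord(2.0, 1.2, 2_000_000, 650_000),
    JobRecord(2.0, 1.7, 2_000_000, 500_000),
    JobRecord(2.0, 1.4, 2_000_000, 900_000),
]

mem_share = share_below((j.used_mem_gb for j in jobs), 1.5)
cpu_share = share_below((j.used_cpu_s for j in jobs), 800_000)

print(f"{mem_share:.0f} % of jobs used less than 1.5 GB")
print(f"{cpu_share:.0f} % of jobs used less than 800 000 s")
```

With the four sample records, three jobs fall under each threshold, so both lines report 75 %.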
5
A few metrics - ATLAS @ April 2007
Jobs:
  – Total submitted jobs: 62 000
  – Total submitted jobs, class T: 36 000 = 59 % of total jobs
Memory use for jobs in class Long:
  – Memory consistently requested: 2 GB
  – 86 % used less than 1.5 GB
  – 14 % used more than 1.5 GB and less than 2 GB
CPU time use for jobs in class Long (in IN2P3 units):
  – CPU time consistently requested: 2 000 000 s
  – 98 % of jobs used less than 800 000 s
6
Monitoring production @ CC: MRTG
Real-time production status:
  – Green: all ATLAS running jobs at CC-IN2P3
  – Blue: all LHC running jobs at CC-IN2P3
  – Orange: ((all LHC running jobs at CC-IN2P3) / (all ATLAS running jobs at CC-IN2P3)) * 100
7
Local job monitoring tool
Tools to detect problematic job behaviour:
  1. "Slow" jobs: running jobs which do not consume CPU time.
  2. "Early ended" jobs: sets of jobs using much less CPU time than requested.
  3. Alert mails: BQS sends mail to the grid site admin in case of job failure.
  4. Manual checks: by running scripts.
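The first two checks can be sketched as simple filters over job records. This is only an illustration of the idea. The record types, the field names and both thresholds (`min_cpu_delta_s`, `max_fraction`) are assumptions, since the slides do not describe how the local tool is implemented.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    """Hypothetical snapshot of a running job at two sampling times."""
    job_id: str
    cpu_s_previous: float  # CPU time consumed at the previous sample
    cpu_s_current: float   # CPU time consumed at the current sample


@dataclass
class FinishedJob:
    """Hypothetical record of a finished job."""
    job_id: str
    requested_cpu_s: float
    used_cpu_s: float


def find_slow_jobs(jobs, min_cpu_delta_s=1.0):
    """Return IDs of running jobs that consumed almost no CPU between samples."""
    return [j.job_id for j in jobs
            if j.cpu_s_current - j.cpu_s_previous < min_cpu_delta_s]


def find_early_ended(jobs, max_fraction=0.01):
    """Return IDs of finished jobs that used under max_fraction of their request."""
    return [j.job_id for j in jobs
            if j.used_cpu_s < max_fraction * j.requested_cpu_s]


# Illustrative data only.
running = [RunningJob("a", 1000.0, 1000.2), RunningJob("b", 1000.0, 1600.0)]
finished = [FinishedJob("c", 2_000_000, 120), FinishedJob("d", 2_000_000, 450_000)]

slow = find_slow_jobs(running)
early = find_early_ended(finished)
print("slow:", slow)
print("early ended:", early)
```

Here job `a` is flagged as slow and job `c` as early ended. In practice the thresholds would have to be tuned to the sampling interval and to the job class.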
8
Local job monitoring tool
9
Concrete actions taken
Establish a detailed diagnosis of job failures, e.g.:
  – Lack of resources, expired proxy, transfers pending, core LCG services unavailable
  – Job environment settings for a given VO
Find the job identity:
  – LCG job IDs, BQS job IDs, Globus job IDs
Inform the users or the VO admin
Notify the administrators of the services involved:
  – by mail or GGUS ticket
Various tasks for managing the production, including:
  – Jobs can be deleted or locked in the queue in case of problems
10
Concrete actions taken
Increasing the VO's quota
  – To cope with intensive computing: DC, MC production...
Creating BQS resources
  – To cope with internal services unavailability (HPSS, dCache)
Setting up VO agents
  – To automatically regulate job priorities and resources according to the VO requirements
11
Issues
Sometimes it is hard to find the user's email address
Sometimes users are very slow to react
Users are not well informed about the LCG service status
Recurrent problems with file access or copy: remote SRM SE unavailable, LFC not responding, transfers failing…
Hard to trace jobs which are not submitted through Resource Brokers
Lack of visibility into the status of core LCG services
12
Issues
Zombie processes left by finished jobs on the worker nodes
  – solved in the next BQS version
Lack of tools to manage VO priorities
  – solved in a future version (autumn 2007)
Memory wasted by jobs submitted to class Long
13
Advice
To have more running jobs:
  – Always keep jobs queued, to sustain a high number of running jobs
  – Limit the memory request for jobs submitted to class Long
  – Keep us informed as early as possible about critical production periods
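Why a smaller memory request can mean more running jobs is easiest to see with a little arithmetic. The sketch below assumes that a worker node accepts jobs until either its memory or its cores are exhausted. This is a simplification, because the slides do not describe the actual BQS placement rules, and the node size used (6 GB, 4 cores) is a made-up example.

```python
def jobs_per_node(node_mem_gb, node_cores, mem_request_gb):
    """Number of jobs one node can host, limited by memory and by cores."""
    if mem_request_gb <= 0:
        raise ValueError("memory request must be positive")
    by_memory = int(node_mem_gb // mem_request_gb)
    return min(by_memory, node_cores)


# Hypothetical worker node: 6 GB of memory and 4 cores.
with_2gb = jobs_per_node(6, 4, 2.0)
with_1_5gb = jobs_per_node(6, 4, 1.5)

print(f"2 GB request:   {with_2gb} jobs per node")
print(f"1.5 GB request: {with_1_5gb} jobs per node")
```

On this hypothetical node, requesting 2 GB allows 3 jobs while requesting 1.5 GB allows 4, which is the kind of gain the advice is aiming at when most jobs use less than 1.5 GB.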
14
BQS
BQS in a few words…
  – Home-built batch system (10 years old)
  – Works on all UNIX systems (GNU/Linux, Solaris, AIX...)
  – Under continuous evolution; new functionalities are added, enabling:
      scalability, robustness, reliability
      functionality required by users
      GRID compliance
  – Very rich scheduling policy, including: quotas, resource status, number of queued and running jobs...
15
BQS (2)
BQS philosophy:
  – Dispatch of heterogeneous jobs on a worker node
  – Usage of BQS resources (a kind of semaphore)
Current developments:
  – Addition of GRID functionalities:
      Managing VOMS groups and roles
      Storing more GRID information in BQS
  – New BQS servers (to easily absorb the growth in activity)
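The comparison of BQS resources with semaphores can be illustrated with a small counting gate: a job declares the services it needs and starts only if each one has a free slot. This is a sketch of the general idea, not of BQS itself. The class, the function and the resource names `hpss` and `dcache` used as dictionary keys are hypothetical, and setting a capacity to 0 is just one assumed way to model an unavailable service.

```python
class ResourceCounter:
    """Counting-semaphore-like gate for a named service (hypothetical)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.in_use = 0

    def try_acquire(self):
        """Take one slot if available; return True on success."""
        if self.in_use < self.capacity:
            self.in_use += 1
            return True
        return False

    def release(self):
        """Give one slot back."""
        if self.in_use > 0:
            self.in_use -= 1


def dispatch(queued, resources):
    """Start queued jobs whose declared resources all have a free slot."""
    started, held = [], []
    for job_id, needed in queued:
        acquired = []
        for name in needed:
            if not resources[name].try_acquire():
                break
            acquired.append(name)
        if len(acquired) == len(needed):
            started.append(job_id)
        else:
            # Roll back partial acquisitions so other jobs can use the slots.
            for name in acquired:
                resources[name].release()
            held.append(job_id)
    return started, held


# Capacity 0 models a service that is currently unavailable.
resources = {
    "hpss": ResourceCounter("hpss", 0),
    "dcache": ResourceCounter("dcache", 2),
}
queued = [
    ("job1", ["dcache"]),
    ("job2", ["hpss", "dcache"]),
    ("job3", ["dcache"]),
    ("job4", ["dcache"]),
    ("job5", []),
]

started, held = dispatch(queued, resources)
print("started:", started)
print("held:", held)
```

In this run `job2` stays queued because `hpss` has no free slot, and `job4` stays queued because both `dcache` slots are taken, while jobs that need no gated service start normally.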
16
Comments / Questions