Current status: WMS and CREAM-CE deployment
Patricia Mendez Lorenzo
ALICE TF Meeting (CERN, 02/04/09)
WMS: some highlights
- In December 2008 ALICE finished the migration of all sites to a WMS submission approach
- The instabilities found in the system have forced the experiment and the support team to babysit the system and the production continuously
  - This procedure does not scale to real data taking (in a few months)
- ALICE has not changed the submission procedure, which was defined even before the 2006 DC
  - IMHO it is not for the experiment to change the submission procedure because a new service does not provide the required stability
  - It is the service that must cope with the experiment requirements and computing model, not the opposite
- Let's stop:
  - Saying that this issue affects ALICE only: it is simply NOT TRUE. I see similar issues daily with Geant4, Lattice QCD, sixT.
  - Asking ALICE to change the submission procedure: it is not realistic at this point. In addition, I do not see the point of changing one workload management system due to (not well understood) instabilities in a service
ALICE approach
- ALICE requires deployment of the CREAM-CE at all sites
  - This is the highest priority
  - Sites might be excluded from the production if the service is not provided
- The experiment therefore will not maintain a new submission procedure for some months
  - Transition period from WMS to CREAM: in addition, both systems must be maintained together
  - Bulk submission is not yet supported at the CLI level by CREAM
  - It is not realistic for ANY application to have 2 submission approaches at this time
Status of the WMS in production
Distribution of WMS in the ALICE production
- For the T0 site
  - Optimal situation: 3 WMS covering the production and the Pass 1 reconstruction at the T0 only
  - The reality: each node has reached a limit of 13K jobs/day (confirmed by the WMS operation experts). In addition, these nodes have to cope with the instabilities of external WMS
- For T1 sites
  - Optimal situation: each T1 site should provide at least 2 WMS, which should be dedicated when many T2 sites in the country depend on it
  - The reality: this basically affects Italy and France, and it is ensured by Italy
- For T2 sites
  - Optimal situation: large federations WITHOUT a regional T1 should follow the structure asked for the T2 sites (case of Russia)
  - The reality: the available T1 WMS must be moved from one T2 to another depending on the daily overload status
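The 13K jobs/day figure above lends itself to a quick capacity estimate. The following is a minimal sketch, not an ALICE tool: the function name, the `headroom` parameter and the example daily loads are hypothetical, and only the per-node limit comes from the slide.

```python
import math

# Per-node limit quoted on the slide (jobs/day per WMS node).
WMS_DAILY_LIMIT = 13_000


def wms_nodes_needed(jobs_per_day, per_node_limit=WMS_DAILY_LIMIT, headroom=0.0):
    """Return the smallest number of WMS nodes able to absorb jobs_per_day.

    headroom is the (hypothetical) fraction of each node's capacity kept
    free, e.g. to absorb instabilities of external WMS.
    """
    if jobs_per_day < 0 or per_node_limit <= 0 or not 0.0 <= headroom < 1.0:
        raise ValueError("invalid capacity parameters")
    usable = per_node_limit * (1.0 - headroom)
    # At least one node is always needed to keep the service available.
    return max(1, math.ceil(jobs_per_day / usable))


if __name__ == "__main__":
    # Illustrative loads only; these are not measured ALICE numbers.
    for load in (10_000, 39_000, 60_000):
        print(load, wms_nodes_needed(load), wms_nodes_needed(load, headroom=0.2))
```

With no headroom, three nodes cover at most 39K jobs/day, which is why any extra load from unstable external WMS immediately pushes the T0 setup past its optimal configuration.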
Some truths and some lies about the ALICE submission procedure and the WMS
- The latest WMS mega-patch solves the overloading issues observed in gLite3.0: FALSE
  - We have not seen huge backlogs anymore: TRUE
- The ALICE submission procedure has changed recently, producing the instabilities observed in some WMS: FALSE
  - The experiment tried to accommodate the submission procedure to the WMS as much as possible within its own computing model limits: TRUE
  - Same WMS configuration file as in
  - Proxy renewal triggered only once per hour
  - RESUBMISSION FEATURE OF THE WMS DISCARDED BY THE EXPERIMENT AT THE JDL LEVEL SINCE FEB 2009
- ALICE is therefore using the WMS at tree level (RB mode)
  - All the rest of the features are simply not used and not required
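As an illustration of what discarding resubmission at the JDL level can look like, here is a minimal sketch that writes a JDL with the gLite retry attributes `RetryCount` (deep resubmission) and `ShallowRetryCount` (shallow resubmission) set to 0. It is an assumption that these are the attributes ALICE sets; the helper name and the executable `agent.sh` are hypothetical and do not come from the ALICE submission code.

```python
def build_jdl(executable, arguments="", retry_count=0, shallow_retry_count=0):
    """Build a simple JDL description with WMS resubmission switched off.

    RetryCount and ShallowRetryCount are standard gLite JDL attributes;
    a value of 0 asks the WMS not to resubmit a failed job.
    """
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        'OutputSandbox = {"std.out", "std.err"};',
        f"RetryCount = {retry_count};",
        f"ShallowRetryCount = {shallow_retry_count};",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "agent.sh" is a placeholder name for the job payload.
    print(build_jdl("agent.sh"))
```

Keeping both counters at 0 leaves failure handling entirely to the experiment's own framework, consistent with using the WMS in plain RB mode.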
WHAT WAS HAPPENING IN FRANCE?
Issues in GRIF and CCIN2P3 are totally uncorrelated
- GRIF
  - grid33.lal.in2p3.fr got overloaded yesterday
  - In addition, it was reported that ALICE was overloading the CE
  - The resubmission approach was discarded
  - The number of jobs was visible neither in the IS nor in the LB (later on)
- CCIN2P3
  - This is the only VO supporting CEs in both the T1 and the T2, with different ranks
  - This situation was filling up one CE (best ranking) while leaving the rest of the CEs empty
  - The query to the information system was returning 0 waiting jobs for those (worse ranking) CEs, and therefore the system kept on submitting jobs
  - T1 and T2 clusters will be separated into different VOBOXES
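The CCIN2P3 feedback problem can be reproduced with a toy model. This is a sketch under stated assumptions, not the real ALICE submission logic: the CE names, the threshold, the batch size and the rule "submit while any watched CE reports fewer waiting jobs than the threshold" are all hypothetical, and jobs never start running in the model.

```python
def simulate(rounds, threshold, batch, shared_vobox):
    """Return waiting jobs per CE after a number of submission rounds.

    shared_vobox=True : one VOBOX watches both CEs, and ranking sends every
                        job to the best-ranked CE.
    shared_vobox=False: one VOBOX per cluster, each watching and feeding
                        only its own CE.
    """
    waiting = {"T1-CE": 0, "T2-CE": 0}
    best_ranked = "T1-CE"
    for _ in range(rounds):
        if shared_vobox:
            # The worse-ranked CE always reports 0 waiting jobs, so the
            # condition stays true and the best-ranked CE keeps filling up.
            if any(count < threshold for count in waiting.values()):
                waiting[best_ranked] += batch
        else:
            # Each VOBOX stops as soon as its own CE has enough waiting jobs.
            for ce_name in waiting:
                if waiting[ce_name] < threshold:
                    waiting[ce_name] += batch
    return waiting


if __name__ == "__main__":
    print("shared VOBOX:  ", simulate(5, threshold=10, batch=10, shared_vobox=True))
    print("separate VOBOX:", simulate(5, threshold=10, batch=10, shared_vobox=False))
```

In the shared case the best-ranked CE accumulates jobs without bound while the other stays empty; with one VOBOX per cluster, submission stops once each CE reaches its threshold.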
Status of the CREAM-CE
- New sites providing CREAM-CE:
  - RU-SPbSU (under testing)
  - Prague (still to be tested)
  - Subatech (still to be tested)
- Already existing sites with production infrastructures:
  - FZK (just upgraded to the next version)
  - Kolkata (performing fine)
  - KISTI (no issues)
  - GSI (pending the setup in production)
  - RAL (no issues)
  - CNAF (no issues)
  - CERN (moving the system from SLC5 to SLC4 to increase the number of resources)
  - Torino (no issues)
  - SARA (no issues)