K. Harrison
CERN, 23rd October 2002

HOW TO COMMISSION A NEW CENTRE FOR LHCb PRODUCTION

- Overview of LHCb distributed production system
- Configuration of access machine
- Job handling
- Setting up Cambridge as a (small-scale) production centre:
    Configuration for summer 2002
    Problems encountered
    Future plans

LHCb distributed production system

- Production manager stores details of participating sites in two places:
    in a Java servlet that produces job scripts
    in the PVSS system used for job management
- Each production site must define and configure an access machine
    Access machine deals with requests from PVSS, and distributes jobs between all machines available at a site
    In EDG terms, the access machine acts as a Computing Element, and the machines where jobs are run act as Worker Nodes
- Job scripts are produced by a Servlet Runner, which must have write access to the area where a site’s job scripts are created
    May be able to use CERN Servlet Runner (AFS access), or may need Servlet Runner installed at remote site

Configuration of access machine

- Main steps for configuring the access machine are as follows:
    Install PVSS tools
    Define environment variable LHCBPRODROOT to point to root directory of production area
    Download and run mcsetup installation script
    Customise site-specific scripts
      Customisation basically defines site identity, command for job submission, and what to do with output
    Set up Servlet Runner if not using CERN Servlet Runner
- More details available at: /datachallenges/slice.doc

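As an illustration of the LHCBPRODROOT step, the following is a minimal sketch of a sanity check that could be run on the access machine before the installation script. It is not part of mcsetup; the function name `check_production_root` is hypothetical, and the only thing taken from the slides is the variable name LHCBPRODROOT.

```python
import os
from pathlib import Path


def check_production_root(env=None):
    """Return the production root as a Path, or raise if it is unusable.

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    root = env.get("LHCBPRODROOT")
    if not root:
        raise RuntimeError("LHCBPRODROOT is not set")
    path = Path(root)
    if not path.is_dir():
        raise RuntimeError(f"LHCBPRODROOT={root} is not an existing directory")
    return path
```

Calling `check_production_root()` with no arguments inspects the current environment and fails early, with a readable message, if the variable is missing or points somewhere that does not exist.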
Job handling

- Basic job handling is as follows (using CERN Servlet Runner):
    Specify job request by filling in web form at: /mcbrunel.htm
    Parameters passed to Servlet Runner, which produces job scripts
    Submit jobs either through PVSS or locally using script submit-all-scripts installed by mcsetup
    When jobs are completed, update central database and transfer data to CASTOR using script transfer-all installed by mcsetup

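The local submission step can be pictured as a loop over the generated job scripts, applying the site's own submission command to each. The sketch below is only an illustration of that idea: the slides do not show the contents of submit-all-scripts, and the function name `build_submit_commands`, the directory layout, and the example command are all assumptions.

```python
from pathlib import Path


def build_submit_commands(script_dir, submit_command):
    """Return one command line (as a list of strings) per job script.

    `submit_command` is the site-specific submission command, given as a
    list, e.g. the batch-system command configured for the site.
    Scripts are taken in name order so that the result is reproducible.
    """
    scripts = sorted(p for p in Path(script_dir).iterdir() if p.is_file())
    return [list(submit_command) + [str(p)] for p in scripts]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Building the command lines separately from running them makes it possible to print and inspect them before anything is submitted.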
Cambridge: Summer 2002 (1)

- Jobs for summer production were run on 10 desktop machines with Red Hat Linux 7.1 installed:
    5 x P3 ( GHz, MB)
    5 x P4 ( GHz, MB)
- Desktop machines are used by people who work interactively, and may submit other jobs; production jobs were run on low-priority batch queues
    Made use of otherwise-idle CPU cycles
- Each machine used has GB local scratch space; additionally had 20 GB for LHCb production on central file server
- LHCb production tools and software were installed only on the access machine
- Access machine submitted jobs to an NQS pipe queue, for distribution among all production nodes

Cambridge: Summer 2002 (2)

- A script executed at job startup determined where to run the applications:
    If the local scratch area had at least 5 GB free, the LHCb software was copied to a new directory in this area, and run there
    If there was insufficient free space locally, the LHCb software was copied to a new directory in the LHCb area of the central file server, and run there
- When a job completed, its output was stored on the file server, then the directory where the job was run was deleted
- Log files and DSTs were copied to CERN, using bbftp and locally written tools

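The startup logic described above can be sketched as follows. This is a reconstruction from the slide text, not the script used at Cambridge: the function names, the `lhcbjob_` directory prefix, and the `run` callable that stands in for the LHCb applications are hypothetical, and the 5 GB threshold is read here as 5 GiB. The sketch also deletes the run directory when a job fails, which is an assumption; the slides only describe the case where the job completes.

```python
import shutil
import tempfile
from pathlib import Path

MIN_FREE_BYTES = 5 * 1024**3  # "at least 5 GB free", taken as 5 GiB


def choose_run_base(local_scratch, server_area, free_bytes=None):
    """Pick local scratch if it has enough free space, else the server area."""
    if free_bytes is None:
        free_bytes = shutil.disk_usage(local_scratch).free
    if free_bytes >= MIN_FREE_BYTES:
        return Path(local_scratch)
    return Path(server_area)


def run_job(software_dir, local_scratch, server_area, output_dir, run,
            free_bytes=None):
    """Copy the software to a new directory, run there, store output, clean up.

    `run` is a callable taking the job directory and returning the list of
    output files it produced (a stand-in for the real applications).
    """
    base = choose_run_base(local_scratch, server_area, free_bytes)
    work = Path(tempfile.mkdtemp(prefix="lhcbjob_", dir=base))
    try:
        job_dir = work / "software"
        shutil.copytree(software_dir, job_dir)
        produced = run(job_dir)
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        for output_file in produced:
            shutil.copy2(output_file, output_dir)
    finally:
        # Remove the run directory whether or not the job succeeded.
        shutil.rmtree(work, ignore_errors=True)
    return base
```

The optional `free_bytes` argument exists only so that both branches can be exercised without filling a disk; in normal use it is left unset and the free space is measured with `shutil.disk_usage`.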
Cambridge: Problems encountered (1)

- Configuration process was very drawn out, as all changes had to be made centrally
    With new installation tools, site configuration is simpler and almost everything is done locally
- Information concerning production not always communicated quickly to sites outside CERN
    Situation improved now that lhcb-production mailing list has been set up

Cambridge: Problems encountered (2)

- Had problems during production when AFS was unavailable, with sequence as follows:
    Job fails to retrieve parameter files needed by SICBMC
    SICBMC complains, but runs anyway
    Job fails to retrieve options files needed by Brunel
    Brunel core dumps
  Large amounts of CPU time wasted (SICBMC producing unusable events); human intervention needed after job crash
  Problem solved with new system, where reliance on AFS is removed
- Brunel v13r1 used a lot of memory (around 200 MB)
    Some jobs had to be killed as they prevented other users from working
    Improvements with newer versions of Brunel?

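The wasted CPU time in the AFS failure sequence came from jobs continuing after their inputs could not be retrieved. The new system removes the dependence on AFS altogether. As a separate, generic illustration, a job wrapper can also guard against this kind of failure by checking its inputs before starting any application. The sketch below is an assumption of what such a check might look like and is not taken from the production tools; `MissingInputError` and `check_inputs` are hypothetical names.

```python
from pathlib import Path


class MissingInputError(RuntimeError):
    """Raised when a required input file cannot be read before the job starts."""


def check_inputs(paths):
    """Raise MissingInputError naming every input that is missing or unreadable."""
    unavailable = []
    for path in map(Path, paths):
        try:
            with path.open("rb") as handle:
                handle.read(1)  # force a real read, not just a name lookup
        except OSError:
            unavailable.append(str(path))
    if unavailable:
        raise MissingInputError("unavailable inputs: " + ", ".join(unavailable))
```

A wrapper would call `check_inputs` on the parameter and options files first and exit with an error if it raises, so that no events are generated from a job that cannot be reconstructed afterwards.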
Cambridge: Future plans

- Participation in summer 2002 production has been a positive experience
    Gained experience with production tools, and with running simulation and reconstruction jobs using the latest versions of the software
    Produced 37k events that have been copied to CASTOR, and are being used locally in physics studies
- Aim to maintain participation in data challenges at least at current (low) level
- Additional 20 x P3 (1.1 GHz, 256 MB) are available in Cambridge HEP Group if we are able to use Grid tools (Globus or EDG)
    Will be exploring possibilities in the coming months