Analysis in LHCb
Angelo Carbone, INFN Bologna
Introduction
Analysis in LHCb is handled by GANGA, an ATLAS/LHCb project that enables a user to perform the complete life cycle of a job: Build, Configure, Prepare, Monitor, Submit, Merge, Plot.
It allows jobs to run:
on the local machine, either interactively or in the background
on batch systems (LSF, PBS, …)
on the Grid
Jobs look the same whether they run locally or on the Grid.
Workshop CCR e INFN-GRID 2009, 13th May 2009 - Angelo Carbone
LHCb jobs
For LHCb, the main use of Ganga is running Gaudi jobs. This means:
Configure the analysis applications
Specify the datasets
Split and submit the jobs
Manage the output data
Merge n-tuples and histogram files
The Ganga job object
Application
There is a specific application handler for each Gaudi application: ['Brunel', 'Moore', 'DaVinci', 'Gauss', 'Boole', 'Root', …]
# Define a DaVinci application object
d = DaVinci()
d.optsfile = d.user_release_area + 'myopts.py'
ApplicationMgr().EvtMax = 1000
HistogramPersistencySvc().OutputFile = "DVHistos_1.root"
myopts.py includes the configuration of the user analysis: algorithms, variable cuts, input datasets, etc.
The Ganga job object
Backend
There are 4 backends of interest for running LHCb jobs:
Interactive: in the foreground on the client
Local: in the background on the client
LSF: on the LSF batch system (SGE/PBS/Condor systems supported as well)
Dirac: on the Grid
# Define a Dirac backend object
d = Dirac()
print d
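To illustrate why a job can look the same whichever backend runs it, here is a minimal sketch in plain Python. It is not Ganga code: the classes `ToyJob`, `LocalBackend` and `GridBackend` are hypothetical stand-ins, and the only idea taken from the slides is that the backend is a swappable part of the job object.

```python
from dataclasses import dataclass


class LocalBackend:
    """Hypothetical stand-in for a backend that runs on the client."""
    name = "Local"

    def submit(self, command):
        return f"[{self.name}] running in the background on the client: {command}"


class GridBackend:
    """Hypothetical stand-in for a backend that sends the job to the Grid."""
    name = "Dirac"

    def submit(self, command):
        return f"[{self.name}] handed to the Grid workload manager: {command}"


@dataclass
class ToyJob:
    """A toy job: the same description, with a pluggable backend."""
    command: str
    backend: object

    def submit(self):
        # The job does not care where it runs; it delegates to its backend.
        return self.backend.submit(self.command)


job = ToyJob("DaVinci myopts.py", LocalBackend())
local_msg = job.submit()

# Switching to the Grid only changes the backend attribute.
job.backend = GridBackend()
grid_msg = job.submit()
```

In this sketch, moving from a local test to a Grid run is a one-attribute change, which is the property the slides describe for real Ganga jobs.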
Access to the Grid
User sends job to DIRAC
WMS sends a pilot agent as a WLCG job
When the pilot agent runs safely on a worker node, it fetches a job from DIRAC
Small data files are returned in the sandbox
Large files are registered in the LFC file catalogue
User queries DIRAC for the status and finally retrieves the output
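The pull model described here (a pilot first checks that it runs safely, and only then fetches a user job) can be sketched with a simple queue. This is an illustrative toy, not DIRAC code: `run_pilots`, the node names and the `healthy` flag are all assumptions made for the example.

```python
from collections import deque


def run_pilots(task_queue, worker_nodes):
    """Assign queued jobs to worker nodes whose pilot passed its checks.

    worker_nodes is a list of (node_name, healthy) pairs; healthy stands in
    for whatever environment checks a real pilot would perform.
    """
    assignments = {}
    for node, healthy in worker_nodes:
        if not healthy:
            # The pilot fails its checks, so no user job is ever sent here.
            continue
        if not task_queue:
            break  # nothing left to fetch
        assignments[node] = task_queue.popleft()
    return assignments


queue = deque(["job-1", "job-2"])
nodes = [("wn-a", True), ("wn-b", False), ("wn-c", True)]
assignments = run_pilots(queue, nodes)
```

The point of the design is that a broken worker node costs only a pilot, not a user job: in the toy above, `wn-b` never receives any work.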
The Ganga job object
Input dataset
Use the LHCb bookkeeping to get a list of files to run over:
j.inputdata = browseBK() # opens BK browser
Only LFNs are accessible.
The Ganga job object
Output Dataset
When a job is finished, the output directory contains the stdout and stderr of the job and your output sandbox files.
Output data files are stored in a storage element on the Grid.
Large files are uploaded to a storage element; download them with j.backend.getOutputData
You can build a list of the LFNs of these files with j.backend.getOutputDataLFNs
The Ganga job object
Job splitting and data-driven submission
[Diagram labels: main job, list of LFNs, Splitter (GANGA), LFC catalogue, list of PFNs, sub-jobs; sites: CNAF, RAL, CERN, IN2P3, GRIDKA, PIC, NIKHEF]
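A minimal sketch of what data-driven splitting could look like, assuming a replica catalogue that maps each LFN to the sites holding a copy. The function `split_by_replica_sites`, the file names and the catalogue content are hypothetical; this is not the Ganga splitter itself, only an illustration of grouping input files by location before cutting them into sub-jobs.

```python
from collections import defaultdict


def split_by_replica_sites(replicas, files_per_job):
    """Group LFNs by the set of sites holding a replica, then chunk each group.

    replicas maps an LFN to the list of sites where a replica exists.
    Returns a list of sub-job descriptions, each with its candidate sites.
    """
    groups = defaultdict(list)
    for lfn, sites in replicas.items():
        groups[frozenset(sites)].append(lfn)

    subjobs = []
    for sites, lfns in groups.items():
        for start in range(0, len(lfns), files_per_job):
            subjobs.append({
                "sites": sorted(sites),
                "lfns": lfns[start:start + files_per_job],
            })
    return subjobs


# Hypothetical catalogue content, for illustration only
replicas = {
    "/lhcb/example/file_1.dst": ["CNAF", "CERN"],
    "/lhcb/example/file_2.dst": ["CNAF", "CERN"],
    "/lhcb/example/file_3.dst": ["CNAF", "CERN"],
    "/lhcb/example/file_4.dst": ["RAL"],
}
subjobs = split_by_replica_sites(replicas, files_per_job=2)
```

Each sub-job then only needs to run at a site that already holds all of its input files, so no input data has to be moved at run time.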
Merging
Jobs produce lots of output files that need to be merged together to obtain the final results.
Different mergers for different file types:
ROOT files: RootMerger
text files: TextMerger
DST files: DSTMerger
Want something really special? CustomMerger
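As an illustration of the simplest case, merging text output, here is a plain-Python sketch that concatenates one file per sub-job into a single file. It is an assumption about what such a merge amounts to, not the code of Ganga's TextMerger; `merge_text_files`, the directory layout and the file contents are invented for the example.

```python
import os
import tempfile


def merge_text_files(input_paths, output_path):
    """Concatenate text files, prefixing each with a header naming its source directory."""
    with open(output_path, "w") as out:
        for path in input_paths:
            source = os.path.basename(os.path.dirname(path))
            out.write(f"# --- {source} ---\n")
            with open(path) as src:
                out.write(src.read())


# Demo with three hypothetical sub-job output directories
workdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    subdir = os.path.join(workdir, f"subjob_{i}")
    os.makedirs(subdir)
    path = os.path.join(subdir, "stdout.txt")
    with open(path, "w") as f:
        f.write(f"events processed: {1000 * (i + 1)}\n")
    paths.append(path)

merged_path = os.path.join(workdir, "merged.txt")
merge_text_files(paths, merged_path)
with open(merged_path) as f:
    merged = f.read()
```

ROOT and DST files need format-aware merging, which is why the slides list a dedicated merger for each file type.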
Monitoring
Ganga End-Users
Over 1000 unique users in the past 6 months
Generally 50% ATLAS (blue), 25% LHCb (green), 25% other
Monthly: ~500 unique users
~2000 unique users since January 2007
[Plot annotation: dip caused by monitoring outage]
Job efficiency
Failure
Data access failure (19%)
There are two main causes for the 19% of jobs failing to access input data from the WN.
The first is instability in the site SRM layer at the Tier-1 sites, which makes it impossible to construct TURLs for the software application to access input datasets.
The other is zero-size or incorrectly registered dataset replicas, for which it is impossible to obtain a correct TURL.
Stalled
Stalled (8%)
A job is 'stalled' if the Job Monitoring Service stops receiving its signs of life.
One of the main causes is user proxy expiration on the WN. Submitted Pilot Agents may wait in a site batch queue for several hours, which is a significant portion of the default (12 hour) proxy validity.
Other causes include application failures, loss of open data connections at sites, and user code crashes, all of which can result in expending the available wall-clock time of the resource.
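The proxy-expiration problem is simple arithmetic: the time spent waiting in the queue and the time spent running both count against the same proxy lifetime. The sketch below makes this explicit. The 12 hour default comes from the slide; the function name and the queue and run times are hypothetical values chosen for illustration.

```python
def proxy_outlives_job(queue_wait_h, run_time_h, proxy_validity_h=12.0):
    """Return True if the proxy is still valid when the job finishes.

    The clock starts at submission, so the queue wait and the run time
    both consume the proxy validity.
    """
    return queue_wait_h + run_time_h <= proxy_validity_h


# An 8 hour job survives a 3 hour queue wait with the default proxy...
short_wait_ok = proxy_outlives_job(queue_wait_h=3, run_time_h=8)

# ...but not a 5 hour wait: 5 + 8 = 13 hours exceeds the 12 hour validity.
long_wait_ok = proxy_outlives_job(queue_wait_h=5, run_time_h=8)
```

Under these assumed numbers, a couple of extra hours in the batch queue is enough to turn a job that would have finished into a stalled one.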
Other minor failures
Failed to upload output data (1%)
This is caused by the transfer-and-register operation to the LFC failing. It can happen due to network outages, power cuts, site misconfigurations, and also during LFC downtime.
Application failure (1%)
The Job Wrapper can identify the exit state of the software applications running on the Grid. A common cause of this type of failure is corrupted software shared areas at the sites.
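As a consistency check, the failure categories quoted in these slides can be added up. The short sketch below assumes the four categories are disjoint and cover all failures, which the slides do not state explicitly; the dictionary keys are just labels for the example.

```python
# Failure fractions quoted in the slides, in percent of all jobs
failure_rates = {
    "data access": 19,
    "stalled": 8,
    "output upload": 1,
    "application": 1,
}

# Assumption: the categories do not overlap and nothing else fails
total_failed = sum(failure_rates.values())
success_rate = 100 - total_failed
```

Under that assumption, 29% of jobs fail and about 71% succeed, in line with the job efficiency of roughly 70% quoted in the conclusion.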
Conclusion
The LHCb distributed analysis framework allows users to transparently submit jobs to the Grid.
Real job efficiency measured so far: ~70%
Main sources of failures:
data inconsistencies
service instabilities
Although usable (and used), Grid analysis for LHCb is not yet at production quality.
Still far from 99.99999999999%...