1 LHCb on the Grid
Raja Nandakumar (with contributions from Greig Cowan)
GridPP21, 3rd September 2008
2 LHCb computing model
➟ CERN (Tier-0) is the hub of all activity
  Full copy at CERN of all raw data and DSTs
  All T1s have a full copy of the DSTs
➟ Simulation at all possible sites (CERN, T1, T2)
  LHCb has used about 120 sites on 5 continents so far
➟ Reconstruction, stripping and analysis at T0 / T1 sites only
  Some analysis may be possible at "large" T2 sites in the future
➟ Almost all the computing (except for development / tests) will be run on the grid
  Large productions: production team
  Ganga (DIRAC) grid user interface
(A small sketch of this placement and activity policy follows below.)
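A minimal sketch, assuming hypothetical names and structures, of the placement and activity policy described on this slide. It is illustrative only and is not LHCb or DIRAC code.

```python
# Hypothetical sketch of the data placement / activity policy described on
# this slide; names and structure are illustrative, not actual LHCb/DIRAC code.

RAW, DST = "RAW", "DST"

PLACEMENT = {
    "CERN": {RAW, DST},        # Tier-0 keeps a full copy of raw data and DSTs
    "Tier-1": {DST},           # every T1 holds a full DST copy
    "Tier-2": set(),           # T2s hold no permanent copies in this model
}

ACTIVITIES = {
    "simulation":     {"CERN", "Tier-1", "Tier-2"},   # run wherever possible
    "reconstruction": {"CERN", "Tier-1"},
    "stripping":      {"CERN", "Tier-1"},
    "analysis":       {"CERN", "Tier-1"},             # large T2s maybe later
}

def can_run(activity: str, tier: str) -> bool:
    """Return True if the computing model allows this activity at this tier."""
    return tier in ACTIVITIES.get(activity, set())

if __name__ == "__main__":
    print(can_run("simulation", "Tier-2"))      # True
    print(can_run("reconstruction", "Tier-2"))  # False
```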
3 LHCb on the grid
Small amount of activity over the past year
▓ DIRAC3 has been under development
▓ Physics groups have not asked for new productions
▓ The situation has changed recently...
4 LHCb on the grid
➟ DIRAC3
  Nearing a stable production release
  ▓ Extensive experience with CCRC08 and follow-up exercises
  ▓ Used as THE production system for LHCb
  The interfaces are now being tested by the Ganga developers
➟ Generic pilot agent framework (a minimal sketch of the pilot pattern follows below)
  Critical problems found with gLite WMS 3.0 and 3.1
  ▓ Mixing of VOMS roles under certain reasonably common conditions
    Cannot have people with different VOMS roles!
  ▓ Savannah bug #39641
  ▓ Being worked on by the developers
  Waiting for this to be solved before restarting tests
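A minimal sketch of the generic pilot-agent pattern mentioned above: a pilot lands on a worker node, describes its environment, then pulls and runs real payloads from a central queue. The TaskQueue class, the matching logic and the payloads are hypothetical stand-ins, not the DIRAC3 implementation.

```python
# Toy pilot-agent loop: the pilot job itself carries no physics payload;
# it fetches matched work from a central task queue once it is running.
import platform
import queue
import time


class TaskQueue:
    """Toy central task queue: payloads are plain callables."""

    def __init__(self):
        self._q = queue.Queue()

    def submit(self, payload):
        self._q.put(payload)

    def match(self, resource_description):
        # A real matcher would compare job requirements against the
        # resource description reported by the pilot; here we just pop.
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


def run_pilot(task_queue: TaskQueue, lifetime_s: float = 1.0) -> None:
    """Pilot loop: describe the worker node, then fetch and run payloads."""
    resource = {"hostname": platform.node(), "arch": platform.machine()}
    deadline = time.time() + lifetime_s
    while time.time() < deadline:
        payload = task_queue.match(resource)
        if payload is None:
            time.sleep(0.1)          # nothing matched; idle briefly
            continue
        payload()                    # execute the real job on this node


if __name__ == "__main__":
    tq = TaskQueue()
    tq.submit(lambda: print("running simulation payload"))
    run_pilot(tq)
```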
5 DIRAC3 production
More than 90,000 jobs in the past 2 months
Real production activity and testing of the gLite WMS
6 DIRAC3 Job Monitor
https://lhcbweb.pic.es/DIRAC/jobs/JobMonitor/display
7 LHCb storage at RAL
➟ LHCb storage is primarily at the Tier-1s and CERN
➟ CASTOR used as the storage system at RAL
  Fully migrated off dCache in May 2008
  ▓ One tape damaged and the file on it marked lost
  Was stable (more or less) until 20 Aug 2008
  ▓ Has not been able to take a great load on the servers
    Low upper limit (8) on LSF job slots on various CASTOR disk servers (illustrated below)
    Too many jobs (>500) can come into the batch system; the affected service class then hangs
    Temporarily fixed for now; needs to be monitored (probably by the shifter on duty?)
    » Increase the limit to >100 rfio jobs per server
    » Not all hardware can handle a limit of 200 jobs (they start using swap space)
  Problem seen many times over the last few months
  ▓ CASTOR is now in downtime
  ▓ This is worrying given how close we are to data taking
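A toy illustration of the slot-limit issue above: with only 8 slots per disk server, a burst of more than 500 jobs leaves a long backlog, which is the situation in which the service class was seen to hang. The numbers come from the slide; the model itself is a deliberate simplification, not CASTOR behaviour.

```python
# Simplified backlog model for the CASTOR disk-server slot limit discussed above.
def backlog(incoming_jobs: int, slots_per_server: int, servers: int = 1) -> int:
    """Jobs that cannot start immediately with the given slot limits."""
    capacity = slots_per_server * servers
    return max(0, incoming_jobs - capacity)


if __name__ == "__main__":
    print(backlog(500, 8))     # old limit: 492 jobs queued behind 8 slots
    print(backlog(500, 100))   # proposed limit of >100 rfio jobs per server
    print(backlog(500, 200))   # 200 may push some servers into swap (per slide)
```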
8 LHCb at RAL
➟ Move to srm-v2 by LHCb
  Needed so that RAL can retire its srm-v1 endpoints and hardware
  Will happen when DIRAC3 becomes the baseline for user analysis
  ▓ Already used for almost all production
  ▓ Ganga working on submitting through DIRAC3
  ▓ Also needs LHCb to rename files in the LFC (see the sketch below)
  All space tokens, etc. have been set up
  Target: turn off srm-v1 access by the end of September
➟ Currently using srm-v1 for user analysis
  ▓ DIRAC2 does not support srm-v2
➟ Batch system: pausing of jobs during downtime?
  ▓ Not clear about the status of this
  For now, stop the batch system from accepting LHCb jobs a few hours before scheduled downtimes
  ▓ No LHCb job should run for more than 24 hours
  Announce the beginning and end of downtimes
  ▓ Problems with broadcast tools
  ▓ GGUS ticket opened by Derek Ross
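A hedged sketch of the kind of bulk catalogue renaming step mentioned above. LFCClient, its rename method and the path mapping are invented placeholders for illustration; they are not the real LFC API or the actual LHCb migration script.

```python
# Hypothetical sketch of a bulk file-catalogue rename, illustrating the
# "rename files in the LFC" step on this slide. LFCClient is a stand-in.
from typing import Callable, Iterable, List


class LFCClient:
    """Stand-in for a file-catalogue client; replace with the real tool."""

    def rename(self, old_lfn: str, new_lfn: str) -> bool:
        print(f"rename {old_lfn} -> {new_lfn}")
        return True


def migrate_lfns(client: LFCClient,
                 lfns: Iterable[str],
                 mapper: Callable[[str], str]) -> List[str]:
    """Rename every LFN via mapper; return the ones that failed."""
    failures = []
    for old in lfns:
        new = mapper(old)
        if new == old:
            continue                      # already in the target layout
        if not client.rename(old, new):
            failures.append(old)
    return failures


if __name__ == "__main__":
    # Illustrative mapping only: move files under a new top-level directory.
    def mapper(lfn: str) -> str:
        return lfn.replace("/lhcb/oldlayout/", "/lhcb/newlayout/", 1)

    failed = migrate_lfns(LFCClient(),
                          ["/lhcb/oldlayout/data/file1.dst"],
                          mapper)
    print("failed:", failed)
```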
9 LHCb and CCRC08
➟ Planned tasks: test the LHCb computing model (summarised in the sketch below)
  Raw data distribution from the pit to the T0 centre
  ▓ Use of rfcp into CASTOR from the pit - T1D0
  Raw data distribution from T0 to the T1 centres
  ▓ Use of FTS - T1D0
  Reconstruction of raw data at CERN and the T1 centres
  ▓ Production of rDST data - T1D0
  ▓ Use of SRM 2.2
  Stripping of data at CERN and the T1 centres
  ▓ Input data: RAW and rDST - T1D0
  ▓ Output data: DST - T1D1
  ▓ Use of SRM 2.2
  Distribution of DST data to all other centres
  ▓ Use of FTS
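A small sketch tying the storage classes on this slide to their meaning (TnDm means n custodial tape copies and m permanent disk copies), with the planned data-flow steps restated as a plain data structure. The field names and the structure are illustrative, not an LHCb configuration format.

```python
# Restating the CCRC08 data-flow bullets above, plus a helper that decodes
# the TnDm storage-class notation (T1D0 = tape only, T1D1 = tape + disk).
import re


def parse_storage_class(sc: str) -> dict:
    """Parse 'T1D0' / 'T1D1' into tape and disk copy counts."""
    m = re.fullmatch(r"T(\d)D(\d)", sc)
    if not m:
        raise ValueError(f"unrecognised storage class: {sc}")
    return {"tape_copies": int(m.group(1)), "disk_copies": int(m.group(2))}


CCRC08_FLOW = [
    {"step": "pit -> T0 (RAW)",        "via": "rfcp into CASTOR", "class": "T1D0"},
    {"step": "T0 -> T1s (RAW)",        "via": "FTS",              "class": "T1D0"},
    {"step": "reconstruction -> rDST", "via": "SRM 2.2",          "class": "T1D0"},
    {"step": "stripping -> DST",       "via": "SRM 2.2",          "class": "T1D1"},
    {"step": "DST -> other centres",   "via": "FTS",              "class": None},
]

if __name__ == "__main__":
    for entry in CCRC08_FLOW:
        sc = entry["class"]
        copies = parse_storage_class(sc) if sc else {}
        print(entry["step"], "via", entry["via"], copies)
```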
10 LHCb and CCRC08
[Plots of reconstruction and stripping activity during CCRC08]
11 LHCb CCRC08 problems
➟ CCRC08 highlighted areas to be improved
  File access problems
  ▓ Random or permanent failures to open files using gsidcap
    Requested IN2P3 and NL-T1 to allow the dcap protocol for local read access
    Now using xroot at IN2P3, which appears to be successful
  ▓ Wrong file status returned by the dCache SRM after a put
    bringOnline was not doing anything
  Software area access problems
  ▓ Site banned for a while until the problem is fixed
  Application crashes
  ▓ Fixed with a new software release and deployment
  Major issues with the LHCb bookkeeping
  ▓ Especially for stripping
➟ Lessons learned
  Better error reporting in pilot logs and workflow
  Alternative forms of data access needed in emergencies (see the sketch below)
  ▓ Downloading of files to the WN (used at IN2P3, RAL)
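A sketch of the fallback data-access lesson above: try direct protocol access first, and copy the file to the worker node if that fails. open_remote and copy_to_worker_node are hypothetical placeholders, not actual LHCb, dCache or xroot client calls.

```python
# Toy "direct access with local-copy fallback" pattern for worker-node jobs.
import tempfile
from pathlib import Path


def open_remote(url: str) -> str:
    """Placeholder for protocol-based access; raise to simulate a failure."""
    raise IOError(f"protocol access failed for {url}")


def copy_to_worker_node(url: str) -> str:
    """Placeholder download: a real grid copy tool would fetch the file."""
    local = Path(tempfile.gettempdir()) / Path(url).name
    local.write_bytes(b"")            # stand-in for the real transfer
    return str(local)


def get_input(url: str) -> str:
    """Prefer direct access; fall back to a local copy on the worker node."""
    try:
        return open_remote(url)
    except IOError:
        return copy_to_worker_node(url)


if __name__ == "__main__":
    print(get_input("gsidcap://example-se.example.org/lhcb/data/file.dst"))
```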
12 LHCb grid operations
➟ A Grid Operations and Production team has been created
13 Communications
➟ LHCb sites
  The grid operations team keeps track of problems
  Reports to sites via GGUS and the eLogger
  ▓ All posts are reported on lhcb-production@cern.ch
  ▓ Please subscribe if you want to know what is going on
➟ LHCb users
  Mailing lists
  ▓ lhcb-distributed-analysis@cern.ch
    All problems directed here
  ▓ Specific lists for each LHCb application and for Ganga
  Ticketing systems (Savannah, GGUS) for DIRAC, Ganga and the applications
  ▓ Used by developers and "power" users
  Software weeks provide training sessions for using the Grid tools
  Weekly distributed analysis meetings (starting Friday)
  ▓ DIRAC, Ganga and core software developers along with some users
  ▓ Aims to identify needs and coordinate release plans
➟ eLogger: http://lblogbook.cern.ch/Operations (RSS feed available)
14 Summary
➟ Concerned about CASTOR stability so close to data taking
➟ The DIRAC3 workload and data management system is now online
  Has been extensively tested running LHCb productions
  Now being moved into the user analysis system
  ▓ Ganga needs some additional development
➟ The grid operations team is working with sites, users and developers to identify and resolve problems quickly and efficiently
➟ LHCb is looking forward to the imminent switch-on of the LHC!
15 Backup: CCRC08 throughput