1 LHCb on the Grid
Raja Nandakumar (with contributions from Greig Cowan)
GridPP21, 3rd September 2008
2 LHCb computing model
➟ CERN (Tier-0) is the hub of all activity
  Full copy at CERN of all raw data and DSTs
  All T1s have a full copy of the DSTs
➟ Simulation at all possible sites (CERN, T1, T2)
  LHCb has used about 120 sites on 5 continents so far
➟ Reconstruction, stripping and analysis at T0 / T1 sites only
  Some analysis may be possible at "large" T2 sites in the future
➟ Almost all the computing (except for development / tests) will be run on the grid
  Large productions: production team
  Ganga (DIRAC) grid user interface
(A small sketch of this placement and activity policy follows below.)
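A minimal sketch, assuming hypothetical names and structures, of the placement and activity policy described on this slide. It is illustrative only and is not LHCb or DIRAC code.

```python
# Hypothetical sketch of the data placement / activity policy described on
# this slide; names and structure are illustrative, not actual LHCb/DIRAC code.

RAW, DST = "RAW", "DST"

PLACEMENT = {
    "CERN": {RAW, DST},        # Tier-0 keeps a full copy of raw data and DSTs
    "Tier-1": {DST},           # every T1 holds a full DST copy
    "Tier-2": set(),           # T2s hold no permanent copies in this model
}

ACTIVITIES = {
    "simulation":     {"CERN", "Tier-1", "Tier-2"},   # run wherever possible
    "reconstruction": {"CERN", "Tier-1"},
    "stripping":      {"CERN", "Tier-1"},
    "analysis":       {"CERN", "Tier-1"},             # large T2s maybe later
}

def can_run(activity: str, tier: str) -> bool:
    """Return True if the computing model allows this activity at this tier."""
    return tier in ACTIVITIES.get(activity, set())

if __name__ == "__main__":
    print(can_run("simulation", "Tier-2"))      # True
    print(can_run("reconstruction", "Tier-2"))  # False
```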
3 LHCb on the grid
Small amount of activity over the past year
▓ DIRAC3 has been under development
▓ Physics groups have not asked for new productions
▓ The situation has changed recently...
4 LHCb on the grid
➟ DIRAC3
  Nearing a stable production release
  ▓ Extensive experience with CCRC08 and follow-up exercises
  ▓ Used as THE production system for LHCb
  The interfaces are now being tested by the Ganga developers
➟ Generic pilot agent framework (a minimal sketch of the pilot pattern follows below)
  Critical problems found with gLite WMS 3.0 and 3.1
  ▓ Mixing of VOMS roles under certain reasonably common conditions
    Cannot have people with different VOMS roles!
  ▓ Savannah bug #39641
  ▓ Being worked on by the developers
  Waiting for this to be solved before restarting tests
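A minimal sketch of the generic pilot-agent pattern mentioned above: a pilot lands on a worker node, describes its environment, then pulls and runs real payloads from a central queue. The TaskQueue class, the matching logic and the payloads are hypothetical stand-ins, not the DIRAC3 implementation.

```python
# Toy pilot-agent loop: the pilot job itself carries no physics payload;
# it fetches matched work from a central task queue once it is running.
import platform
import queue
import time


class TaskQueue:
    """Toy central task queue: payloads are plain callables."""

    def __init__(self):
        self._q = queue.Queue()

    def submit(self, payload):
        self._q.put(payload)

    def match(self, resource_description):
        # A real matcher would compare job requirements against the
        # resource description reported by the pilot; here we just pop.
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


def run_pilot(task_queue: TaskQueue, lifetime_s: float = 1.0) -> None:
    """Pilot loop: describe the worker node, then fetch and run payloads."""
    resource = {"hostname": platform.node(), "arch": platform.machine()}
    deadline = time.time() + lifetime_s
    while time.time() < deadline:
        payload = task_queue.match(resource)
        if payload is None:
            time.sleep(0.1)          # nothing matched; idle briefly
            continue
        payload()                    # execute the real job on this node


if __name__ == "__main__":
    tq = TaskQueue()
    tq.submit(lambda: print("running simulation payload"))
    run_pilot(tq)
```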
5 DIRAC3 production
More than 90,000 jobs in the past 2 months
Real production activity and testing of the gLite WMS
6 DIRAC3 Job Monitor
https://lhcbweb.pic.es/DIRAC/jobs/JobMonitor/display
7 LHCb storage at RAL
➟ LHCb storage is primarily at the Tier-1s and CERN
➟ CASTOR used as the storage system at RAL
  Fully migrated off dCache in May 2008
  ▓ One tape damaged and the file on it marked lost
  Was stable (more or less) until 20 Aug 2008
  ▓ Has not been able to take a great load on the servers
    Low upper limit (8) on LSF job slots on various CASTOR disk servers (illustrated below)
    Too many jobs (>500) can come into the batch system; the affected service class then hangs
    Temporarily fixed for now; needs to be monitored (probably by the shifter on duty?)
    » Increase the limit to >100 rfio jobs per server
    » Not all hardware can handle a limit of 200 jobs (they start using swap space)
  Problem seen many times over the last few months
  ▓ CASTOR is now in downtime
  ▓ This is worrying given how close we are to data taking
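A toy illustration of the slot-limit issue above: with only 8 slots per disk server, a burst of more than 500 jobs leaves a long backlog, which is the situation in which the service class was seen to hang. The numbers come from the slide; the model itself is a deliberate simplification, not CASTOR behaviour.

```python
# Simplified backlog model for the CASTOR disk-server slot limit discussed above.
def backlog(incoming_jobs: int, slots_per_server: int, servers: int = 1) -> int:
    """Jobs that cannot start immediately with the given slot limits."""
    capacity = slots_per_server * servers
    return max(0, incoming_jobs - capacity)


if __name__ == "__main__":
    print(backlog(500, 8))     # old limit: 492 jobs queued behind 8 slots
    print(backlog(500, 100))   # proposed limit of >100 rfio jobs per server
    print(backlog(500, 200))   # 200 may push some servers into swap (per slide)
```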
8 LHCb at RAL
➟ Move to srm-v2 by LHCb
  Needed so that RAL can retire its srm-v1 endpoints and hardware
  Will happen when DIRAC3 becomes the baseline for user analysis
  ▓ Already used for almost all production
  ▓ Ganga working on submitting through DIRAC3
  ▓ Also needs LHCb to rename files in the LFC (see the sketch below)
  All space tokens, etc. have been set up
  Target: turn off srm-v1 access by the end of September
➟ Currently using srm-v1 for user analysis
  ▓ DIRAC2 does not support srm-v2
➟ Batch system: pausing of jobs during downtime?
  ▓ Not clear about the status of this
  For now, stop the batch system from accepting LHCb jobs a few hours before scheduled downtimes
  ▓ No LHCb job should run for more than 24 hours
  Announce the beginning and end of downtimes
  ▓ Problems with broadcast tools
  ▓ GGUS ticket opened by Derek Ross
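A hedged sketch of the kind of bulk catalogue renaming step mentioned above. LFCClient, its rename method and the path mapping are invented placeholders for illustration; they are not the real LFC API or the actual LHCb migration script.

```python
# Hypothetical sketch of a bulk file-catalogue rename, illustrating the
# "rename files in the LFC" step on this slide. LFCClient is a stand-in.
from typing import Callable, Iterable, List


class LFCClient:
    """Stand-in for a file-catalogue client; replace with the real tool."""

    def rename(self, old_lfn: str, new_lfn: str) -> bool:
        print(f"rename {old_lfn} -> {new_lfn}")
        return True


def migrate_lfns(client: LFCClient,
                 lfns: Iterable[str],
                 mapper: Callable[[str], str]) -> List[str]:
    """Rename every LFN via mapper; return the ones that failed."""
    failures = []
    for old in lfns:
        new = mapper(old)
        if new == old:
            continue                      # already in the target layout
        if not client.rename(old, new):
            failures.append(old)
    return failures


if __name__ == "__main__":
    # Illustrative mapping only: move files under a new top-level directory.
    def mapper(lfn: str) -> str:
        return lfn.replace("/lhcb/oldlayout/", "/lhcb/newlayout/", 1)

    failed = migrate_lfns(LFCClient(),
                          ["/lhcb/oldlayout/data/file1.dst"],
                          mapper)
    print("failed:", failed)
```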
9 LHCb and CCRC08
➟ Planned tasks: test the LHCb computing model (summarised in the sketch below)
  Raw data distribution from the pit to the T0 centre
  ▓ Use of rfcp into CASTOR from the pit - T1D0
  Raw data distribution from T0 to the T1 centres
  ▓ Use of FTS - T1D0
  Reconstruction of raw data at CERN and the T1 centres
  ▓ Production of rDST data - T1D0
  ▓ Use of SRM 2.2
  Stripping of data at CERN and the T1 centres
  ▓ Input data: RAW and rDST - T1D0
  ▓ Output data: DST - T1D1
  ▓ Use of SRM 2.2
  Distribution of DST data to all other centres
  ▓ Use of FTS
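A small sketch tying the storage classes on this slide to their meaning (TnDm means n custodial tape copies and m permanent disk copies), with the planned data-flow steps restated as a plain data structure. The field names and the structure are illustrative, not an LHCb configuration format.

```python
# Restating the CCRC08 data-flow bullets above, plus a helper that decodes
# the TnDm storage-class notation (T1D0 = tape only, T1D1 = tape + disk).
import re


def parse_storage_class(sc: str) -> dict:
    """Parse 'T1D0' / 'T1D1' into tape and disk copy counts."""
    m = re.fullmatch(r"T(\d)D(\d)", sc)
    if not m:
        raise ValueError(f"unrecognised storage class: {sc}")
    return {"tape_copies": int(m.group(1)), "disk_copies": int(m.group(2))}


CCRC08_FLOW = [
    {"step": "pit -> T0 (RAW)",        "via": "rfcp into CASTOR", "class": "T1D0"},
    {"step": "T0 -> T1s (RAW)",        "via": "FTS",              "class": "T1D0"},
    {"step": "reconstruction -> rDST", "via": "SRM 2.2",          "class": "T1D0"},
    {"step": "stripping -> DST",       "via": "SRM 2.2",          "class": "T1D1"},
    {"step": "DST -> other centres",   "via": "FTS",              "class": None},
]

if __name__ == "__main__":
    for entry in CCRC08_FLOW:
        sc = entry["class"]
        copies = parse_storage_class(sc) if sc else {}
        print(entry["step"], "via", entry["via"], copies)
```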
10 LHCb and CCRC08
[Plots of reconstruction and stripping activity during CCRC08]
11 LHCb CCRC08 problems
➟ CCRC08 highlighted areas to be improved
  File access problems
  ▓ Random or permanent failures to open files using gsidcap
    Requested IN2P3 and NL-T1 to allow the dcap protocol for local read access
    Now using xroot at IN2P3, which appears to be successful
  ▓ Wrong file status returned by the dCache SRM after a put
    bringOnline was not doing anything
  Software area access problems
  ▓ Site banned for a while until the problem is fixed
  Application crashes
  ▓ Fixed with a new software release and deployment
  Major issues with the LHCb bookkeeping
  ▓ Especially for stripping
➟ Lessons learned
  Better error reporting in pilot logs and workflow
  Alternative forms of data access needed in emergencies (see the sketch below)
  ▓ Downloading of files to the WN (used at IN2P3, RAL)
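A sketch of the fallback data-access lesson above: try direct protocol access first, and copy the file to the worker node if that fails. open_remote and copy_to_worker_node are hypothetical placeholders, not actual LHCb, dCache or xroot client calls.

```python
# Toy "direct access with local-copy fallback" pattern for worker-node jobs.
import tempfile
from pathlib import Path


def open_remote(url: str) -> str:
    """Placeholder for protocol-based access; raise to simulate a failure."""
    raise IOError(f"protocol access failed for {url}")


def copy_to_worker_node(url: str) -> str:
    """Placeholder download: a real grid copy tool would fetch the file."""
    local = Path(tempfile.gettempdir()) / Path(url).name
    local.write_bytes(b"")            # stand-in for the real transfer
    return str(local)


def get_input(url: str) -> str:
    """Prefer direct access; fall back to a local copy on the worker node."""
    try:
        return open_remote(url)
    except IOError:
        return copy_to_worker_node(url)


if __name__ == "__main__":
    print(get_input("gsidcap://example-se.example.org/lhcb/data/file.dst"))
```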
12 LHCb grid operations
➟ A Grid Operations and Production team has been created
13 Communications
➟ LHCb sites
  The grid operations team keeps track of problems
  Reports to sites via GGUS and the eLogger
  ▓ All posts are reported on lhcb-production@cern.ch
  ▓ Please subscribe if you want to know what is going on
➟ LHCb users
  Mailing lists
  ▓ lhcb-distributed-analysis@cern.ch
    All problems directed here
  ▓ Specific lists for each LHCb application and for Ganga
  Ticketing systems (Savannah, GGUS) for DIRAC, Ganga and the applications
  ▓ Used by developers and "power" users
  Software weeks provide training sessions for using the Grid tools
  Weekly distributed analysis meetings (starting Friday)
  ▓ DIRAC, Ganga and core software developers along with some users
  ▓ Aims to identify needs and coordinate release plans
➟ eLogger: http://lblogbook.cern.ch/Operations (RSS feed available)
14 Summary
➟ Concerned about CASTOR stability so close to data taking
➟ The DIRAC3 workload and data management system is now online
  Has been extensively tested running LHCb productions
  Now being moved into the user analysis system
  ▓ Ganga needs some additional development
➟ The grid operations team is working with sites, users and developers to identify and resolve problems quickly and efficiently
➟ LHCb is looking forward to the imminent switch-on of the LHC!
15 Backup: CCRC08 throughput