1
Feedback from BaBar
Fabrizio Bianchi, Università di Torino and INFN – Torino
CNAF Peer Review, March 2, 2006
2
Outline
Introduction
Requested Services
Hardware
Manpower
Status and Usage
Difficulties and Comments
Conclusions
3
Introduction BaBar computing is now a distributed system:
Real data processing is done at SLAC and Padova. Monte Carlo production is done at ~20 sites. The CNAF LCG farm will be one of them. Physics analysis is done in 5 Tier1 centers: CNAF, GridKa, IN2P3, RAL and SLAC. Real and MC data are divided into skims defined by analysts. Analysis jobs run on skims and produce ROOT ntuples that are used as input to the final stage of the analysis (fit, plotting, etc.). Physics analyses are grouped into Analysis Working Groups (AWGs). Each AWG is assigned to a Tier1. The Charm AWG has been assigned to CNAF.
4
Requested Services Provide adequate CPU resources.
Host the skims requested by the Charm AWG on disk or on a tape library with high-performance dynamic access (something like HPSS). Expected data size for Run1-5: ~150 TB. Import new data with a latency of a few days. Provide user disk space for ntuple storage. Install the appropriate version(s) of the BaBar software and of the third-party software. Mirror the BaBar databases (bookkeeping, conditions, etc.).
5
Hardware Batch nodes: BaBar uses the CNAF batch farm
Servers (mix of P4-Xeon 2.4 and PIII):
  Objectivity: 5 AMS servers, 2 lockservers
  Data import: 3
  Shared storage and data servers: 15 (~100 TB of disks)
  XRootd redirector: 3
  NFS server: 1
  Bookkeeping: 2
  Front-end: 2
Tape units:
  ¼ of data copied on Castor
  10 LTO cassettes used for backup of AWG user disk space
6
Manpower Alexis Pompili and Armando Fella work full time for BaBar. They should take care of: installation of the software required by BaBar, and data import. System support should come from CNAF personnel. However: Alexis and Armando do the troubleshooting of the BaBar servers. Alexis has supervised the external technicians coming to CNAF for hardware installation, maintenance and repair. OS installation and maintenance is done by Armando. Alexis has been the de facto administrator of the BaBar batch queues since 01/06. Some developments (user policy on autologout, locking, X11 forwarding, additional monitoring) have been pending for months. My opinion: this is an indication of a lack of manpower.
7
Status: hardware Batch capacity is adequate.
Networking is OK for the time being. Event Store (skims): Castor currently cannot offer dynamic access to data through XRootd at a useful rate. Skims must therefore be resident on disk. Currently ~100 TB are available out of an expected need of 150 TB. There is a plan to acquire the missing disks. Disk space for ntuples is adequate.
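The disk figures quoted on this slide imply a simple shortfall calculation. The sketch below only restates that arithmetic; the function and constant names are illustrative assumptions and do not correspond to any BaBar or CNAF tool.

```python
# Figures quoted on the slide (approximate, in TB).
EXPECTED_SKIM_TB = 150   # expected need for skims
AVAILABLE_DISK_TB = 100  # disk currently available


def disk_shortfall(expected_tb: float, available_tb: float) -> float:
    """Return how many TB are still missing (never negative)."""
    return max(expected_tb - available_tb, 0.0)


def coverage_fraction(expected_tb: float, available_tb: float) -> float:
    """Return the fraction of the expected volume that fits on disk, capped at 1."""
    if expected_tb <= 0:
        raise ValueError("expected_tb must be positive")
    return min(available_tb / expected_tb, 1.0)


shortfall = disk_shortfall(EXPECTED_SKIM_TB, AVAILABLE_DISK_TB)
coverage = coverage_fraction(EXPECTED_SKIM_TB, AVAILABLE_DISK_TB)
print(f"Missing disk: ~{shortfall:.0f} TB")   # Missing disk: ~50 TB
print(f"Coverage: ~{coverage:.0%}")           # Coverage: ~67%
```

With the quoted numbers, about one third of the expected skim volume cannot yet be kept resident on disk.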
8
Status: Data and Software
The software required by BaBar is installed and up to date. Real and MC data are imported within a few days. Users can communicate with the support team via the BaBar HyperNews system (password protected). Work is in progress to use the CNAF LCG farm for MC production (D. Andreotti, E. Luppi, A. Fella).
9
Usage 130 BaBar users; 50% are not Italian, many of them from the USA.
[Plots: batch queue usage over the last year; batch queue usage over the last month]
10
Difficulties I Downtimes.
Almost a whole month in July 05 due to a cooling problem. Power failures and electrical work (Aug. 05 and 6-12 Jan. 06). Relevant hardware failures: disks and a disk controller (one week in Nov. 05); Fibre Channel Host Bus Adapter failure on a data server on 12/29/05. Data lost or corrupted due to hardware failures.
11
Difficulties II Human errors:
RAID disk monitoring switched off. Production server shut down without notice. Batch queues not reopened after outages (before Alexis took charge as administrator of the batch queues). Inadequate communication with users: ports closed and backup policy changed without discussion. There is a weekly CNAF-BaBar meeting. For some time, attendance of CNAF personnel has been limited to Alexis and Armando. Disclaimer: this is not intended as criticism of CNAF personnel, but as evidence of the need for additional manpower.
12
Comments I'm convinced that CNAF personnel are constantly facing a very high workload and that this is the cause of the human errors and miscommunications. CNAF support to users is basically nonexistent after 17:30 and over the weekends. This slows the response time, especially for the many BaBar users located in the USA. Monitoring and automated warnings should be improved: failures have often been discovered by users.
13
Conclusions CNAF is successfully used by the Charm AWG. At the winter conferences, BaBar is going to present 7 new physics results produced at CNAF. The overall usability of CNAF has clearly improved in the past 6 months. Better hardware monitoring and better communication with Alexis, Armando and the BaBar computing management are highly desirable.