WLCG ‘Weekly’ Service Report ~~~ WLCG Management Board, 19th August 2008


1 WLCG ‘Weekly’ Service Report
Harry.Renshall@cern.ch ~~~ WLCG Management Board, 19th August 2008

2 Introduction
This ‘weekly’ report covers two weeks (MB summer schedule): last week (4 to 10 August) and this week (11 to 17 August).
Notes from the daily meetings can be found at https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings (with some additional information from CERN C5 reports and other sources).
Systematic remote participation by BNL, RAL and PIC (and increasingly NL-T1). More sites will be encouraged to join as the holiday season ends and startup approaches.

3 Site Reports (1/2)
CERN: Following the migration of the VOMS registration service to new hardware on 4 August, new registrations were not synchronised properly until late on 5 August due to a configuration error (the synchronisation was pointing to a test instance). Only one dteam user was affected, however. CASTOR upgrades to level 2.1.7-14 were completed for all the LHC experiments. Installation of the Oracle July critical patch update has been completed on most production databases with only minor problems (transparent to the end users).
RAL: Many separate problems and activities have led to periods of non-participation in experiment testing (migration from dCache to CASTOR, upgrade to CASTOR 2.1.7). Some delays were due to experts being on vacation (SRM). Several disk-full problems (stager LSF and a backend database).
BNL: The long-standing problem of unsuccessful automatic failover of the primary network links from BNL to TRIUMF and PIC onto the secondary route over the OPN is thought to be understood and resolved; CERN will participate in some further testing. On 14 August user analysis jobs doing excessive metadata lookups were found to be slowing down pnfs, causing SRM put/get timeouts and hence low transfer efficiencies. This could be a potential problem at many dCache sites.

4 Site Reports (2/2)
PIC: On Friday 8 August PIC reported overload problems on their PBS master node, with timeout errors affecting the stability of their batch system. The node had been replaced on the Tuesday, so they decided to revert to the previous master node (done Thursday evening), but this showed the same behaviour. They suspect issues in the configuration of the CE that forwards jobs to PBS and have scheduled a 2-hour downtime on the following Monday morning to investigate.
FZK: There was a problem with LFC replication to GridKa for LHCb from 31 July until 6 August. Replication stopped shortly after the upgrade to Oracle 10.2.0.4: propagation from CERN to GridKa would end with an error (a TNS “connection dropped” error). At the same time the GridKa DBAs reported several software crashes on cluster node 1, and since 6 August cluster node 2 has been used as the destination. The GridKa DBAs are investigating whether the network/communication issues with node 1 are still present.
General: LFC 1.6.11 (which fixes the periodic LFC crashes) has passed through the PPS and will be released to production today or tomorrow. We have received a partial fix for the VDT limitation of 10 proxy delegations that makes WMS proxy renewal usable again by LHCb and should also help ALICE.

5 Experiment reports (1/3)
ALICE: Nothing special to report.
LHCb: Have published to EGEE the requirement on sites to support multi-user pilot jobs, essentially support for a new role=pilot. Several CERN VOboxes suffered occasional (every few days) kernel panic crashes from a known kernel issue; now fixed with this week's kernel upgrades. An issue has arisen (ongoing at NL-T1) over the use of pool accounts for the software group management (sgm) function: these require the VO software directory to be group-writeable, and if the sgm pool accounts are in a different unix group from the other LHCb accounts the directory would also have to be world-readable. This is currently affecting some SAM tests.
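To make the permission trade-off above concrete, here is a minimal, hypothetical sketch (not part of the report): the path is a placeholder, and only the group-writable/world-readable reasoning reflects the LHCb point.

```python
# Hypothetical sketch of the shared-software-directory permission question above.
# The path is a placeholder; only the mode-bit reasoning mirrors the report.
import os
import stat

def software_dir_access(path):
    """Report which classes of account can use a shared VO software directory."""
    mode = os.stat(path).st_mode
    print(f"{path}: {stat.filemode(mode)}")
    # The sgm pool accounts install software, so the directory must be
    # group-writable (assuming they share the directory's unix group).
    print("  group-writable:", bool(mode & stat.S_IWGRP))
    # If the ordinary VO accounts are in a *different* unix group, they only
    # reach the directory through the 'other' bits, so it would also have to be
    # world-readable (and world-executable for traversal) for jobs to use it.
    print("  world-readable:", bool(mode & stat.S_IROTH and mode & stat.S_IXOTH))

if __name__ == "__main__":
    software_dir_access("/tmp")  # substitute a site's VO software area
```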

6 Experiment reports (2/3)
CMS: During the last two weeks the pattern of a 1.5-day global cosmics run from Wednesday to Thursday continued. A failure of the network switch to building 613 at 17:00 on 12 August stopped the AFS-cron service, which stopped some SAM test submissions and the flow of monitoring information into SLS; not fixed until 10:00 the next day. Have been preparing for the CRUZET4 cosmics run which, despite the name (ZEro Tesla), is in fact run with the magnet on; the run started 18 August and will last a week. A new CERN elog instance was requested for this, which exposed a dependency of that service on a single staff member currently on holiday. Preparing 21-hour computing support shifts coordinated across CERN and FNAL, with the remaining period to be covered by ASGC.

7 Experiment reports (3/3)
ATLAS: Over the weekend of 9/10 August ran a 12-hour cosmics run with only one terminating luminosity block. This splits into 16 streams, of which 5-6 were particularly big, resulting in unusually high rates to the receiving sites (handled successfully). The AFS-cron service failure at 17:00 on 12 August stopped functional tests until 10:00 the next day (an unexpected dependence). During the weekend of 16/17 August the embedded cosmics data-type name in raw data files was changed to magnet-on without warning sites, resulting in some data going into the wrong directories. Now performing 12-hour throughput tests (full nominal rate) each Thursday from 10:00 to 22:00, typically with a few hours to ramp up to and down from the full rates. Outside of this, functional tests run at 10% of nominal rate, plus cosmics at weekends and as/when scheduled by the detector teams. Will start 24x7 computing support shifts at the end of August.

8 Summary
Detector-driven stress tests are increasing. Many miscellaneous failures, but we seem to have enough ‘elastic’ in the system to recover from them.

