LCG Issues from GDB
John Gordon, STFC
WLCG MB meeting, September 28th 2010
Topics at September GDB
– OPN Monitoring
– APEL
– CERNVMFS
– Experiments' Operational Issues (Quarterly)
– Others
Monitoring
– A central view of the LHCOPN is missing; HADES data exists (at DFN?)
– Prototype dashboard status rules:
  a) Site status is up when one-way delay (OWD) is within ±15% of baseline and packet loss is below 0.1% per five minutes.
  b) Site status is down when packet loss is 100% per five minutes.
  c) Site status is degraded when measurement values lie between a) and b).
J. Shade/GDB, LHCOPN Update – Monitoring, 08-SEP-2010
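The three dashboard rules above can be sketched as a small classifier. The thresholds come from the slide; the function and argument names are our own invention for illustration.

```python
# Sketch of the prototype dashboard's per-site status rule (thresholds
# from the slide; names and units are assumptions for this example).
def site_status(owd_ms, baseline_owd_ms, packet_loss_pct):
    """Classify one site over a five-minute measurement window.

    up:       OWD within +/-15% of baseline AND packet loss < 0.1%
    down:     packet loss == 100%
    degraded: anything in between
    """
    if packet_loss_pct == 100:
        return "down"
    owd_ok = abs(owd_ms - baseline_owd_ms) <= 0.15 * baseline_owd_ms
    if owd_ok and packet_loss_pct < 0.1:
        return "up"
    return "degraded"
```

A site with nominal delay and no loss classifies as up; total loss is down; everything else (e.g. OWD drifting 30% above baseline) is degraded.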
Prototype Dashboard (two screenshot slides)
Monitoring (cont.)
– DANTE baulked at the idea of developing their prototype further and supporting it; SARA and CERN have picked up the gauntlet.
– A historical view was requested and is foreseen.
– Questions were raised about problem-solving procedures.
APEL
– Update on latest status: the version using ActiveMQ message passing has been in production since June.
  – New node type glite-apel replaces glite-MON.
  – Performant and reliable.
– Sites are encouraged to migrate.
– Anticipate switching off the central R-GMA registry at the end of …
– Requested WLCG input for EGI/EMI development plans.
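ActiveMQ accepts messages over the simple text-based STOMP protocol, which is one way to picture the message passing mentioned above. The sketch below only composes a STOMP 1.0 SEND frame as bytes; the queue name and record payload are invented for illustration and are not APEL's actual schema or publisher.

```python
# Minimal sketch of a STOMP 1.0 SEND frame, the kind of wire message an
# ActiveMQ broker consumes. Destination and payload are hypothetical.
def stomp_send_frame(destination: str, body: str) -> bytes:
    """Build a SEND frame: command, headers, blank line, body, NUL byte."""
    payload = body.encode()
    headers = (
        f"SEND\n"
        f"destination:{destination}\n"
        f"content-length:{len(payload)}\n"
    )
    return headers.encode() + b"\n" + payload + b"\x00"

frame = stomp_send_frame("/queue/apel.accounting",
                         "Site: EXAMPLE-SITE\nCpuDuration: 3600")
```

In a real deployment such a frame would be written to a TCP connection with the broker (or sent via a STOMP client library) rather than constructed by hand.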
CERNVMFS for Software Servers
– The stress on shared software servers has been an issue for experiment and site operations over the summer.
– PIC and RAL have tested CERNVMFS as a mechanism for distributing experiment software from CERN to worker nodes.
– CERNVMFS was developed in OpenLab and has been used to build virtual machine images on demand with experiment software.
– It uses squid caches to bring software to a site on demand, and also caches on the worker node, relieving pressure on site servers.
– Removes the need to run jobs to install software at a site. Only caches the versions used at that site. Removes duplicate files between and within releases.
– Initial feedback is encouraging. Tests will be scaled up to a full site in cooperation with the experiments: ATLAS for now, but others are interested.
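As an illustration of the squid-plus-local-cache setup described above, a minimal CVMFS client configuration on a worker node might look like the following sketch; the repository name, proxy host, and cache values are assumptions for this example, not settings from the talk.

```sh
# /etc/cvmfs/default.local — illustrative worker-node settings (hypothetical values)
CVMFS_REPOSITORIES=atlas.cern.ch                   # experiment repositories to mount
CVMFS_HTTP_PROXY="http://squid.example.org:3128"   # site squid cache, filled on demand
CVMFS_QUOTA_LIMIT=10000                            # local WN cache limit in MB
CVMFS_CACHE_BASE=/var/lib/cvmfs                    # on-node cache directory
```

With this in place, only the software versions actually used by jobs on that node are pulled through the squid and cached locally.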
Experiment Operations Feedback
– ALICE were happy.
– ATLAS raised the issue of disk server reliability. What they measured was the number of incidents where a server was out for more than 24 hours. This combines hardware/software reliability with the promptness of the site in restoring the service. There is scope for standardising responses across Tier-1s.
  – Concerns about ASGC performance.
– CMS are interested in the CernVMFS work for their Tier-3s.
  – Discussion around information publishing (related to the L. Field proposal on a WLCG Information Officer).
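The ATLAS metric above is simple to state precisely: count incidents whose outage exceeded 24 hours. The record layout and field names below are hypothetical, purely to make the definition concrete.

```python
from datetime import datetime, timedelta

# Sketch of the ATLAS disk-server metric: incidents where a server was
# out for more than 24 hours. Incident records here are invented.
def long_outages(incidents, threshold=timedelta(hours=24)):
    """Return the incidents whose downtime exceeds the threshold."""
    return [i for i in incidents if i["restored"] - i["failed"] > threshold]

incidents = [
    {"server": "disk-01", "failed": datetime(2010, 8, 1, 9),
     "restored": datetime(2010, 8, 1, 15)},   # 6 hours: not counted
    {"server": "disk-02", "failed": datetime(2010, 8, 3, 0),
     "restored": datetime(2010, 8, 4, 12)},   # 36 hours: counted
]
print(len(long_outages(incidents)))  # → 1
```

Note that the metric deliberately mixes two causes, as the slide says: a count rises both when hardware fails often and when a site is slow to restore service.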
Experiment Operations Feedback (cont.)
– LHCb have problems with differing configurations at sites. They believe they can adapt their use if only they have enough information.
– One suggestion was a Site Card (cf. the VO Card) which would specify enough information about the site to enable LHCb to automate optimisation of their use.
– Discussion in the meeting doubted whether this could be automated and suggested one-to-one discussion with the site as a better route.
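To make the Site Card proposal concrete, one could imagine it as a small machine-readable record that a VO's workload system consults before submitting. Every field name below is an assumption invented for this sketch; no schema was agreed at the meeting.

```python
# Hypothetical "Site Card" (cf. the VO Card): all fields are invented
# examples of the kind of configuration data LHCb said they lack.
site_card = {
    "site": "EXAMPLE-TIER2",
    "batch_system": "torque",
    "shared_software_area": "/opt/exp_soft",
    "max_wallclock_hours": 48,
    "cores_per_node": 8,
}

def job_fits(card, wallclock_hours):
    """Toy automated check: would this job run within the site's limits?"""
    return wallclock_hours <= card["max_wallclock_hours"]
```

The doubt raised in the meeting is visible even in this toy: the hard part is not reading such a card but keeping it accurate for every site.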
gLite 3.1 Support
– Further work on retiring some gLite 3.1 services. The gLite developers have proposed end of life for some services; WLCG was asked for comment.
  – EGI Operations will plan with the NGIs and their sites, taking WLCG views on board.
– A potential gap in EMI support has been filled: specific sites have agreed to continue middleware support of the batch systems required by WLCG. This covers support of the CE information providers, blahd, and the APEL parser.
Misc.
– Gstat
  – Announced the new WLCG Gstat, to be checked by sites.
  – Gave Ian's timeline.
– glexec
  – A new Condor release over the summer should address the concerns of ATLAS. ATLAS and CMS were asked to run tests again with the latest Condor.
October GDB
– Feedback from the DAaMonstrators
  – What can they show now?
  – What will they deliver for the end of the year?
  – Review by panel early in the new year.
– Security incident response
– gLite 3.1 retiral
– Installed capacity
– glexec testing