GDB - February 2014 Summary (Jeremy's notes)

Agenda:

Slides:



Introduction (M Jouvin)
- Please check the 2014 meeting dates. March 12th – CNAF, Bologna (register). WLCG workshop (1st/2nd week of July), Barcelona; possibly 8th-9th July.
- GDB actions: future (pre-)GDB topics welcome.
- Upcoming: by introducing a pay-per-usage scheme as part of the funding model, the funding agencies will have the information to measure the level of usage of a service and whether it justifies their investment. In addition, if the pay-per-usage model is implemented by giving some of the financial control to the users, they will favour those services which offer better value propositions.
- Site Nagios testing – any feedback?
- OSG Federation workshop; HEPiX, 19th-23rd May, Annecy; EGI CF, 19th-23rd May, Helsinki.

HEP SW Collaboration (I Bird)
- Performance is now a limiting factor. CPU technology trends: more transistors, but not easy to use them. Most software is designed for sequential processing; migrating to multi-threaded code is not easy. Targets: Geant and ROOT.
- Concurrency Forum established 2 years ago. Towards an Open Scientific Software Initiative: components such as Geant and ROOT should be part of a modular infrastructure.
- HEP S/W Collaboration: goal to build/maintain libraries… Establish a formal collaboration to develop open scientific software packages guaranteed to work together (including frameworks to assemble applications).
- Workshop 3rd-4th April 2014.

IPv6 Update (D Kelsey)
- WG meeting 23/24 Jan 2014 (included CERN cloud and OpenStack). Progress in various areas. CERN campus-wide deployment in March (some DHCPv6 issues).
- perfSONAR very useful… works at IC. Run dual stack?
- IPv6 file transfer test bed has decayed a bit. ATLAS testing (Alastair): AGIS, simple tests, then HC. Squid 2.8 is not IPv6 compatible. Plan to get the mesh working again.
- Site deployments: move to use SRM/FTS… Define use-cases. Barrier to move for some sites if availability is affected when going to dual stack etc.
- Software survey shows 15/66 'services' known to be fully compliant (a minimal readiness check is sketched below). Pre-release of dCache has IPv6 fixes.
- Want to survey sites – when will they run out of IPv4 and be capable of IPv6? Pre-GDB meeting in June.
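To make the dual-stack point concrete, here is a minimal sketch (not from the talk) of the kind of check a readiness survey might automate: confirm that a service endpoint resolves and accepts connections over IPv6 as well as IPv4. The host and port below are placeholders.

```python
# Minimal sketch: probe whether an endpoint is reachable over IPv4 and IPv6.
# Host and port are placeholders, not a real WLCG service.
import socket

def reachable(host, port, family):
    """Return True if a TCP connection over the given address family succeeds."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False          # no A (or AAAA) record published for this family
    for af, socktype, proto, _, addr in infos:
        try:
            with socket.socket(af, socktype, proto) as s:
                s.settimeout(5)
                s.connect(addr)
                return True
        except OSError:
            continue
    return False

host, port = "example.org", 443   # placeholder endpoint
print("IPv4:", reachable(host, port, socket.AF_INET))
print("IPv6:", reachable(host, port, socket.AF_INET6))
```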

Future of SLC (J Polok)
- The CentOS team is joining Red Hat in the open standards team, not RHEL. The CentOS Linux platform is not changing.
- Impact for SL5/6: source packages may have to be generated from git repositories. No other changes – releases stay as now.
- SL(C)7 options being discussed: rebuild from source as for 5 and 6, OR create a Scientific CentOS variant, OR adopt the CentOS core. Approaches: 1. keep the current process – build from source with our current tool chain; 2. create a SIG for our variant; 3. SL becomes an add-on repository to the CentOS core.
- CentOS 7 Beta in preparation. RHEL7 production due in summer. Source RPMs not guaranteed after summer. Need to ensure the risks for 5 and 6 are covered.

Ops coordination report (S. Campana)
- Input based on the pre-GDB Ops Coordination meeting.
- gLexec: CMS SAM test not yet critical; 20 sites still have not deployed it.
- perfSONAR: it is a service. Sites without it, or at an old release, will feature in report(s) to the MB.
- Tracking tools evolution – Savannah to JIRA. JIRA still lacks some GGUS functionality.
- SHA-2 migration: progress with VOMS-Admin but a manual process is needed. New host certificates soon.
- Machine/Job features: prototype ready (a reading sketch follows below). Options for clouds being looked at.
- Middleware Readiness: model will rely on experiments & frameworks + sites deploying test instances + monitoring. The MB will discuss a process for 'rewarding' site participation.
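As an illustration of what the machine/job features prototype exposes to payloads, here is a minimal sketch under the assumption of a directory-of-key-files layout ($MACHINEFEATURES and $JOBFEATURES pointing at directories with one file per key); the key names used (hs06, jobslots, wall_limit_secs) are illustrative and may differ from the actual prototype.

```python
# Minimal sketch (assumed layout, not the prototype itself): a payload reading
# machine/job features published by the site as one file per key.
import os

def read_feature(env_var, key, default=None):
    """Read a single feature value, if the site publishes it."""
    base = os.environ.get(env_var)
    if not base:
        return default
    try:
        with open(os.path.join(base, key)) as f:
            return f.read().strip()
    except OSError:
        return default

print("HS06 per slot :", read_feature("MACHINEFEATURES", "hs06"))
print("Job slots     :", read_feature("MACHINEFEATURES", "jobslots"))
print("Wall limit (s):", read_feature("JOBFEATURES", "wall_limit_secs"))
```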

Ops Coordination – cont.
- Baseline enforcement: looking at options to monitor and then automate for campaigns.
- WMS decommissioning: shared/CMS instances end in April; SAM will use them until June.
- Multi-core deployment: ATLAS & CMS have different usage patterns. Trying prototypes. Torque/Maui is a concern.
- FTS3 deployment: FTS3 works well. Few instances needed – 3 or 4 for resilience.
- Experiment Computing Commissioning: experiment plans for 2014 discussed. Conclusion: no need for a common commissioning exercise.
- Overall conclusion – some deployment areas are being escalated.

High memory jobs (J Templon)
- NIKHEF observations: which high-mem problem?! Virtual memory usage in GB; pvmem = 4096 MB. User jobs and some production jobs show high usage, and these don't 'ask' for the memory. Link between multi-core and high memory.
- pvmem – a ulimit on the process – allows handling of the out-of-memory condition (rather than a kill); see the sketch below.
- There are different ways to ask for more memory in a job… few work, and inconsistencies arise. Situation being reviewed.
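A minimal sketch (not from the talk) of what "pvmem as a ulimit" means for a payload: with an address-space limit in place, an over-allocating process sees a failed allocation it can handle, instead of being killed by the batch system. The 4096 MB figure mirrors the example above; the rest is illustrative (Unix-only, uses the standard resource module).

```python
# Minimal sketch: an address-space ulimit turns an over-allocation into a
# catchable error (MemoryError in Python) rather than an outright kill.
import resource

limit = 4096 * 1024 * 1024                      # 4096 MB, as in the pvmem example
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

try:
    data = bytearray(8 * 1024 * 1024 * 1024)    # deliberately exceed the limit
except MemoryError:
    print("Allocation refused by the ulimit; the job can clean up and exit gracefully")
```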

SAM test scheduling (Luca Magnoni)
- SAM: a framework to schedule checks (Nagios) via dedicated plug-ins (probes = scripts/executables following the Nagios plug-in contract; see the sketch below). Categories: public grid/cloud services (custom probes); job submission (via WMS); WNs (via job payloads). Job submission to include direct CREAM and Condor-G.
- Remote testing assumes deterministic execution. There are granularity issues (CE vs site) and not always agreement between site and experiment views. Can test with different credentials. Jobs can time out when the VO is out of its share. Site availability is determined by the experiment critical profiles. Most timeouts looked to be on the WMS side!
- New Condor-G and CREAM probes for job submission are coming. Aim to provide a web UI/API for users. Looking at options to replace Nagios as the scheduler.
- Test submission via other frameworks (e.g. HC) is being investigated – ATLAS want a hybrid approach; CMS do not support the framework approach.
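For context, a minimal sketch (not a real SAM probe) of the Nagios plug-in contract that probes follow: print one status line and exit with 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. The host and port checked here are placeholders.

```python
# Minimal sketch of a Nagios-style probe: one status line, conventional exit code.
import socket
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def main(host="example-ce.example.org", port=8443, timeout=10):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print(f"OK - {host}:{port} is accepting connections")
            return OK
    except socket.timeout:
        print(f"WARNING - {host}:{port} timed out after {timeout}s")
        return WARNING
    except OSError as exc:
        print(f"CRITICAL - cannot connect to {host}:{port}: {exc}")
        return CRITICAL

if __name__ == "__main__":
    sys.exit(main())
```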

New transfer dashboard (A Beche)
- Reviewed the history of data transfer monitoring. Separate web API/UI for FTS, FAX and AAA; ALICE and EOS added. Plan to federate.
- Data split into schemas: FTS, XRootD and high optimization. Data retention policies differ for raw data and statistics.
- The dashboard now aggregates over the APIs (idea sketched below). Plan for a map view.
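A minimal sketch of the aggregation idea, with hypothetical endpoints and response shape; a real client would use the dashboard's actual URLs and query parameters.

```python
# Minimal sketch (hypothetical endpoints): federate transfer statistics from
# several monitoring back-ends and report combined totals.
import json
import urllib.request

SOURCES = {
    "fts":    "https://dashboard.example.org/api/fts/transfers?window=1h",
    "xrootd": "https://dashboard.example.org/api/xrootd/transfers?window=1h",
}

def fetch(url):
    """Fetch a JSON document; the shape assumed here is {'bytes': ..., 'files': ...}."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

totals = {"bytes": 0, "files": 0}
for name, url in SOURCES.items():
    try:
        stats = fetch(url)
    except OSError as exc:
        print(f"skipping {name}: {exc}")
        continue
    totals["bytes"] += stats.get("bytes", 0)
    totals["files"] += stats.get("files", 0)

print("Aggregated over sources:", totals)
```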

WLCG monitoring coordination (Pablo Saiz)
- Consolidation group: reduce complexity; modular design; simplify operations and support; common development and core. Need more site input.
- Timeline – starting to deploy. Survey & tasks. Tasks in JIRA: 1. application support (for jobs, transfers, infrastructure…); 2. running the services (moving to AI, Koji, SL6, Puppet…); 3. merging applications (SSB+SAM; SSB+REBUS; HC+Nagios…). The idea is to reduce the number of applications to make maintenance easier.
- Many infrastructure monitoring tools – the schema copes with several use-cases. Technology evaluation. Nagios plug-in for sites developed by PIC. SAM/SUM -> SAM3 (for SUM background see ).
- Next steps:

Data Preservation Update (J Shiers)
- Things are going well. Workshop. Increasing archive growth. Annual cost of WLCG is 100M euro.
- Need 4 staff: documentation; standards; … DPHEP portal core. Digital library. Sustainable software + virtualisation technology + validation frameworks. Sustainable funding. Open data.

LHCOPN/LHCONE evolution workshop (E. Martelli)
- Networking is stable, and key. Growth with technology evolution is OK. New sites appearing in areas where the network is under-developed.
- ATLAS: expect bursty traffic; US sites moving to 40/100 Gb/s. CMS: the mesh will increase traffic. LHCb: no specific concerns.
- More bandwidth needed at T2s. Connectivity to Asia needs to improve – capacity and RTTs.
- Demands for better network monitoring & LHCONE operations. P2P link on demand (over-provisioning vs complexity (L3VPN)).

perfSONAR (Shawn McKee)
- Sites to use the "mesh" configuration; metrics will adjust over time. 85% of sites with perfSONAR have issues to resolve (firewalls, versions…).
- Likely to go with MaDDash (Monitoring and Debugging Dashboard) for the mesh view (idea sketched below). Checking of primitive services – OMD (Open Monitoring Distribution).
- For test instances…. WLC*** WLC*** Context between all sites.
- The … release will mean only one machine is needed. Alerting – high-priority but complicated.
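A minimal sketch (placeholder site names and numbers, not real measurements) of the MaDDash-style mesh idea: every site pair gets a cell whose colour reflects a metric threshold, here packet loss between perfSONAR hosts. Thresholds are chosen for illustration only.

```python
# Minimal sketch: render a mesh of pairwise packet-loss measurements as a
# dashboard-style grid (OK/WARN/CRIT per cell).
SITES = ["SITE-A", "SITE-B", "SITE-C"]          # placeholder site names

# (source, destination) -> packet loss fraction; values are placeholders
loss = {
    ("SITE-A", "SITE-B"): 0.0,
    ("SITE-A", "SITE-C"): 0.02,
    ("SITE-B", "SITE-A"): 0.0,
    ("SITE-B", "SITE-C"): 0.10,
    ("SITE-C", "SITE-A"): 0.0,
    ("SITE-C", "SITE-B"): 0.01,
}

def cell(value):
    """Map a loss fraction to a dashboard colour; thresholds are illustrative."""
    if value is None:
        return "GREY"
    if value < 0.01:
        return "OK"
    if value < 0.05:
        return "WARN"
    return "CRIT"

for src in SITES:
    row = [cell(loss.get((src, dst))) if src != dst else "----" for dst in SITES]
    print(f"{src:8s} " + "  ".join(f"{c:4s}" for c in row))
```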