The WLCG Service from a Tier1 Viewpoint
Gareth Smith, 7th July 2010

Overview - 1
From a Tier1 perspective: it works! Things have generally worked well; we have not been continuously fixing problems. There are specific issues, but overall it works.

Overview - 1a

Overview - 2
The workload has been uneven, with peaks (e.g. in rates of data transfers) but also quiet periods. There has been plenty of time to 'catch up' with any data flows when there have been problems.

Overview - 3
We are still waiting for the continuous data flow we expected. Is it going to turn up? We have not really experienced contention between VOs yet.

RAL: new building, summer '09

Performance & Operations
(Timeline chart, Sep to Mar: change freeze; security updates; LHC running - stability; updates; database infrastructure problems; restore databases to original disk arrays.)

Our Experience at RAL
Move from a more 'development' oriented facility to a more 'production' service, e.g.:
- Formal out-of-hours cover
- Weekly review of callouts
- 'Production' Team
- Incident Management process
  – Four levels of escalation; aims to focus attention (and resources) on problems to minimise their adverse impact.
- Change Control Process

Our Experience at RAL
- Change Control Process (illustrative sketch below):
  – Improve quality of changes;
  – Assess the likelihood of problems;
  – Identify hazards; get input from other teams;
  – Not be too onerous.
- Delegate responsibility to teams and individuals.
- Some (many) changes are below threshold.
- Challenge of making big, higher-risk changes.
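As an illustration of the kind of lightweight gate this implies, here is a hypothetical sketch in Python; the field names, risk categories and the below-threshold rule are invented for the example and do not describe RAL's actual process.

```python
from dataclasses import dataclass, field

# Hypothetical change record; field names and the "below threshold"
# rule are assumptions for illustration, not RAL's actual process.
@dataclass
class ChangeRequest:
    summary: str
    likelihood_of_problems: str                   # "low" / "medium" / "high"
    service_impact: str                           # "none" / "degraded" / "outage"
    hazards: list = field(default_factory=list)   # input gathered from other teams
    rollback_plan: bool = False

def needs_full_review(change: ChangeRequest) -> bool:
    """Routine changes stay with the team; bigger ones go to change control."""
    routine = (change.likelihood_of_problems == "low"
               and change.service_impact == "none"
               and change.rollback_plan)
    return not routine

cr = ChangeRequest("Kernel update on disk servers",
                   likelihood_of_problems="medium",
                   service_impact="degraded",
                   hazards=["Castor downtime", "RAID driver regression"],
                   rollback_plan=True)
print(needs_full_review(cr))   # True: a higher-risk change gets the formal assessment
```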

Changes per Month

RAL - Experiences
- Castor has worked well.
  – Stable, good performance.
- Problems around the Oracle / Fibre Channel SAN infrastructure
  – Fragile, sensitive to changes.
- Issues relating to the new building
  – Nothing has impacted service availability since October.
  – Noise on power from the UPS.
  – Dust from abrasion of lagging on cold pipes.
- Disk servers
  – A significant cause of operational problems.
  – Crash rate of ~1 per server per lifetime => ~2 per week across the farm (rough check below).
- We have good communications with the LHC VOs.
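The crash-rate figure can be sanity-checked with a back-of-the-envelope calculation. The fleet size and server lifetime below are assumptions chosen for illustration; the slide gives only the per-server rate and the ~2-per-week outcome.

```python
# Rough check of the disk-server crash rate quoted above.
# Fleet size and lifetime are assumed values; the slide states only
# ~1 crash per server per lifetime and ~2 crashes per week.
crashes_per_server_lifetime = 1
fleet_size = 300          # assumed number of disk servers in the farm
lifetime_years = 3        # assumed service lifetime of a disk server

weeks_in_lifetime = lifetime_years * 52
crashes_per_week = crashes_per_server_lifetime * fleet_size / weeks_in_lifetime
print(f"~{crashes_per_week:.1f} crashes per week across the farm")   # ~1.9
```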

Issues
- Some sites not fully loaded
  – Fair shares quickly used up by a single VO.
- Low CPU efficiencies effectively reduce available capacity.
- Occupancy of batch system at RAL for the last 6 months:
  – Idle batch worker nodes use ~18 tonnes CO2/month (~£3K electricity); rough estimate below.
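The carbon and electricity figures can be reproduced roughly as follows. The idle node count, per-node power draw, grid carbon intensity and electricity price are all assumed values for illustration; only the ~18 tonnes and ~£3K outputs come from the slide.

```python
# Rough reproduction of the idle-capacity cost quoted above.
# All four inputs are assumptions for illustration; only the ~18 t CO2/month
# and ~£3K/month figures come from the slide.
idle_nodes = 200          # assumed average number of idle worker nodes
watts_per_node = 250      # assumed draw per idle node, incl. cooling share
kg_co2_per_kwh = 0.5      # assumed UK grid carbon intensity (2010-era)
gbp_per_kwh = 0.08        # assumed electricity price

hours_per_month = 24 * 30
kwh = idle_nodes * watts_per_node / 1000 * hours_per_month   # ~36,000 kWh
print(f"{kwh * kg_co2_per_kwh / 1000:.0f} t CO2/month")      # ~18 t
print(f"£{kwh * gbp_per_kwh:.0f} electricity/month")         # ~£2,900
```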

Monitoring
Access to other monitoring:
- Access to FTS logs at other sites (e.g. BNL log file viewer)
- Restricted access to experiment dashboards (LHCb)
- Maybe some integration of monitoring?
- Need better monitoring tools for network flows (RAL)
Issues:
- Reliability of VO-specific (SAM) tests
- CE tests queuing behind production work.
- Do the SAM tests reflect real availability of sites?
  – Sometimes a site is 'degraded'. (Illustration below.)
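As a toy illustration (not the real SAM availability calculation) of why CE tests queuing behind production work muddy the picture: if a test that merely timed out in the batch queue is counted as a failure, the reported availability falls even though the site was busy serving real jobs. The test names and statuses are invented for the example.

```python
# Illustrative only: not the actual SAM availability algorithm.
# Shows how counting "timed out while queued behind production work"
# as a failure changes the reported availability figure.
results = [
    ("CE-job-submit", "ok"), ("CE-job-submit", "timeout_in_queue"),
    ("SRM-put", "ok"), ("SRM-get", "ok"), ("CE-job-submit", "ok"),
]

def availability(results, timeout_counts_as_failure):
    ok = sum(1 for _, status in results
             if status == "ok"
             or (status == "timeout_in_queue" and not timeout_counts_as_failure))
    return ok / len(results)

print(availability(results, timeout_counts_as_failure=True))   # 0.8: site looks degraded
print(availability(results, timeout_counts_as_failure=False))  # 1.0: site was busy, not broken
```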

Middleware etc.
- BDII issues of late.
  – Is everything in there required? (Example query below.)
- Access to VO software areas: high loads
  – High loads lead to timeouts.
  – IN2P3 reported they make three replicas of more than 5.8 million files for the ATLAS software server.
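One way to explore "is everything in there required?" is simply to look at what a BDII publishes. Below is a minimal sketch assuming the ldap3 Python library and a placeholder site BDII hostname; port 2170 and the o=grid base are the usual BDII conventions, and the GlueCE filter is just one example of a published object class.

```python
# Minimal sketch of inspecting what a BDII publishes (ldap3 library assumed).
# Hostname is a placeholder; port 2170 and base "o=grid" are the usual
# BDII conventions; GlueCE is one example of a published object class.
from ldap3 import Server, Connection

server = Server("site-bdii.example.org", port=2170)   # placeholder host
conn = Connection(server, auto_bind=True)             # anonymous bind

conn.search(search_base="o=grid",
            search_filter="(objectClass=GlueCE)",
            attributes=["GlueCEUniqueID", "GlueCEStateWaitingJobs"])

print(f"{len(conn.entries)} GlueCE entries published")
for entry in conn.entries[:5]:
    print(entry.GlueCEUniqueID, entry.GlueCEStateWaitingJobs)
```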

Organizational
- Daily WLCG meetings seem effective.
- Fortnightly Tier1 Co-ordination meetings
  – Sometimes deal with CERN-specific things(?)
- Alarm Tickets
  – Seem to need a lot of testing.
  – Are used responsibly.
- Response to questions from sites to VOs may be patchy.
- EGEE -> EGI transition.
- SAM -> Nagios transition
  – At RAL we call out on failing OPS VO tests; there was uncertainty as to whether our method of obtaining results would continue.

M. Ernst: U.S. ATLAS Facility view of wLCG/OSG Services
- For U.S. ATLAS, with its distributed facilities comprising the Tier-1 center at BNL, 5 multi-institution Tier-2 centers and ~40 Tier-3 centers, OSG is the right forum for building a community, establishing best practices and sharing knowledge about building and evolving the computing infrastructure on the campuses of U.S. institutions participating in LHC data analysis.
- OSG serves as an integration and delivery point for interoperable core middleware components, including compute and storage elements (VDT), in support of the Tier-1, Tier-2s and Tier-3s in the U.S.
- Cyber security operations support within OSG and across Grids in case of security incidents.
- Cyber security infrastructure, including a site-level authorization service and an operational service for updating certificates and revocation lists.
- Support for integration and extension of security services in the PanDA workload management system and the GUMS grid identity mapping service, for compliance with wLCG/OSG security policies and requirements.
- Service availability monitoring of critical site infrastructure services, i.e. Computing and Storage Elements (RSV).
- Service availability monitoring and forwarding of results to wLCG.
- Site-level accounting services and forwarding of accumulated results to wLCG.
- Consolidation of Grid client utilities, including incorporation of the LCG client suite.
- Integration, packaging and technical support for distributed data management and storage services: LFC server and client packaging; storage management systems (dCache, BeStMan/xrootd).
- Integration testbed for new releases of the OSG software; pre-production deployment testing with PanDA.
- OSG provides continual support for the evolving distributed U.S. ATLAS Computing Facilities and their participation in worldwide production services through weekly OSG production meetings.
- Excellent response from OSG to specific (U.S.) ATLAS requirements.

Worries...
- Disk server procurement:
  – Big commitment; long lead times.
  – Split into 2 tranches to spread risk; hold a forward buffer.
  – Problems in the RAID controller to disk interface.
- Changes in workload:
  – Hot files.
  – 'Accidental' denial-of-service attack.
  – Loading of networks / storage.
  – Loading of software servers.
- We have not proven we can work with a continually high workload / data throughput.
- Future funding: economic conditions in many countries...

Summary
- It works (so far)
- Variable workload
- Not yet seen the continuous high data rates
- List of specific problems