WLCG: Experience from Real Data Taking at the LHC
Jamie.Shiers@cern.ch
WLCG Service Coordination & EGI InSPIRE SA3 / WP6
TNC2010, Vilnius, Lithuania
More information: LHC Physics Centre at CERN
Overview
Summarize the Worldwide LHC Computing Grid:
– In terms of scale & scope;
– In terms of service.
Highlight the main obstacles faced in arriving at today's situation.
Issues, concerns & directions for the future.
As this presentation will be given remotely, I have deliberately kept it as free as possible of graphics and animations.
The Worldwide LHC Computing Grid (WLCG)
Simply put, this is the distributed processing and storage system deployed to handle the data from the world's largest scientific machine – the Large Hadron Collider (LHC).
Based today on grid technology – including the former EGEE infrastructure in Europe plus the Open Science Grid in the US.
WLCG is more than simply a customer of EGEE: it has been, and continues to be, a driving force not only in the grid domain but also in others, such as storage and data management.
WLCG has always been about a production service – one that is needed 24 x 7 on most (362) days of the year.
– Much activity – particularly at the Tier0 and in Tier0-Tier1 transfers – takes place at night and over weekends (following the accelerator cycle).
The WLCG Deployment Model
WLCG is the convergence of grid technology with a specific deployment model, elaborated in the late 1990s by the "Models of Networked Analysis at Regional Centres" (MONARC) project.
This defined the well-known Tier0/Tier1/Tier2 hierarchy that is now common to several disciplines and maps well onto International Centre / National Centres / Local Institutes.
MONARC originally foresaw limited networking between the Tier0, Tier1s and Tier2s – with air freight as a possible backup to (best case) 622 Mbps links (cost!) – as well as a smaller number of centres than we have today.
– We now have redundant 10 Gbps links, T0-T1 and also T1-T1, some of which are on occasion maxed out! These base assumptions are currently being re-discussed (see the sketch below).
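The scale of that shift is easy to quantify. Below is a minimal back-of-the-envelope sketch (my own illustration, not from the talk; the 70% sustained-efficiency factor is an assumption) of how long moving 1 PB takes over the two link speeds mentioned above:

```python
# Rough comparison of petabyte transfer times over the MONARC-era
# 622 Mbps best-case links vs. today's 10 Gbps links.

PETABYTE_BITS = 1e15 * 8  # 1 PB expressed in bits (decimal units)

def transfer_days(volume_bits: float, link_bps: float, efficiency: float = 0.7) -> float:
    """Days needed to move volume_bits over one link, assuming a
    (guessed) 70% sustained efficiency on the nominal rate."""
    return volume_bits / (link_bps * efficiency) / 86_400

for name, bps in [("622 Mbps (MONARC-era best case)", 622e6),
                  ("10 Gbps (today)", 10e9)]:
    print(f"{name}: ~{transfer_days(PETABYTE_BITS, bps):.0f} days per PB")
```

At 622 Mbps a petabyte is a months-long transfer – which is why air freight was once a credible backup – whereas a single 10 Gbps link brings it down to roughly two weeks, and WLCG runs several such links in parallel.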
The WLCG Scale
Deployed worldwide: Americas, Europe & Asia-Pacific.
Computational requirements: O(10^5) cores.
Networking requirements: routinely move 1 PB of data per day between grid sites – with significant intra-site requirements on top.
A single VO transfers >4 GB/s from CERN to the Tier1s over sustained periods (~days) – see the sanity check after this slide.
Annual growth in stored data: 15 PB.
– An old calculation: the number of copies & location(s) of data may well be revised in the coming months, as may trigger rates & event sizes.
The sum of resources at each tier is approximately equal – 1 Tier0, ~10 Tier1s, ~100 Tier2s.
The sum of tickets at each tier (a service metric) is also ~equal!
– A few: rarely as many as 5 per VO per day (OPS meeting).
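As a quick sanity check on these figures (my own arithmetic, not from the talk): 1 PB/day translates into a sustained aggregate rate of roughly 12 GB/s, and a single VO's 4 GB/s stream corresponds to about 350 TB/day:

```python
# Converting the quoted daily volume and VO rate into comparable units.

SECONDS_PER_DAY = 86_400

aggregate_rate = 1e15 / SECONDS_PER_DAY       # bytes/s needed for 1 PB/day
print(f"1 PB/day ~= {aggregate_rate / 1e9:.1f} GB/s sustained aggregate")

vo_rate = 4e9                                  # 4 GB/s single-VO CERN -> Tier1s
print(f"4 GB/s   ~= {vo_rate * SECONDS_PER_DAY / 1e12:.0f} TB/day for one VO")
```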
The WLCG Service
The WLCG Service is "held together" through (week)daily operations calls / meetings – remote participation is a requirement in all meetings!
– Attended by representatives from all main LHC experiments, Tier1 sites and (typically) OSG & GridPP, and chaired by a WLCG Service Coordinator on Duty.
– Follow-up on key service issues, site problems & accelerator / experiment news.
– Calls typically last 30' with notes – based on pre-minutes – circulated shortly after.
Longer-term issues, including the roll-out of new releases, are handled at fortnightly WLCG Service Coordination meetings.
In addition, there are regular "Collaboration" and topical workshops.
– Collaboration workshops are attended by 200 – 300 people, many of whom (e.g. from Tier2s) do not attend the more frequent events.
Experiments also hold regular meetings with their sites, as well as "Jamborees".
– WLCG adds value in helping to identify / provide common solutions and/or when dealing with sites supporting multiple VOs.
– This is typically the case for Tier1 sites outside of North America.
WLCG: Service Reporting
Regular service reports are made to the WLCG Management Board, based on Key Performance Indicators, together with reports on any major service incident ("SIRs").
The KPIs: Site Usability based on VO tests; GGUS summaries (user, team & alarm tickets); # of SIRs & their nature.
This can be summarized in one (colour) slide: a drill-down is provided when not A-OK (usually…).
WLCG Operations Report – Summary

KPI                       | Status                                                          | Comment
GGUS tickets              | 1 "real" alarm, 4 test alarms; normal # of team & user tickets | Drill-down on real alarms; comment on tests
Site Usability            | Minor issues                                                    | Drill-down (hidden)
SIRs & Change assessments | Four incident reports                                           | Drill-down on each

VO     | User | Team | Alarm | Total
ALICE  |    6 |    1 |     1 |     8
ATLAS  |   21 |   67 |     1 |    89
CMS    |    2 |    3 |     1 |     6
LHCb   |    0 |   25 |     2 |    27
Totals |   29 |   96 |     5 |   130
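For illustration, a per-VO summary like the one above could be aggregated from a flat ticket export along the following lines (a minimal sketch with a hypothetical data model – this is not the GGUS API):

```python
# Aggregating (vo, ticket_type) pairs into the per-VO summary table.
from collections import Counter

# Hypothetical weekly ticket export: one (vo, type) pair per ticket
tickets = [("ATLAS", "team"), ("ATLAS", "user"), ("LHCb", "team"),
           ("CMS", "alarm"), ("ALICE", "user")]

counts = Counter(tickets)  # (vo, type) -> number of tickets

for vo in ("ALICE", "ATLAS", "CMS", "LHCb"):
    row = [counts[(vo, t)] for t in ("user", "team", "alarm")]
    print(f"{vo:6} user={row[0]} team={row[1]} alarm={row[2]} total={sum(row)}")
```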
WLCG: Team & Alarm Tickets
These are two features – along with the weekly ticket summaries – that were introduced into GGUS for WLCG.
Team tickets – shared by "shifters" and passed on from one shift to the next.
Alarm tickets – for when you need to get someone out of bed: only named, authorized people may raise them, and only for agreed "critical services" (a check sketched below).
There are targets for expert intervention & problem resolution: the statistics are fortunately rather low.
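The alarm-ticket rule can be stated compactly in code. The sketch below is purely illustrative (made-up user names and resource lists; GGUS implements this differently), showing the two conditions just described – a named, authorized person and an agreed critical service:

```python
# Illustrative authorization rule for alarm tickets (hypothetical data).

AUTHORIZED_ALARMERS = {"atlas": {"alice.smith"}, "cms": {"bob.jones"}}  # made-up names
CRITICAL_SERVICES = {"CASTOR", "FTS", "LFC"}                             # example services

def may_raise_alarm(user: str, vo: str, service: str) -> bool:
    """True only for an authorized user of the VO targeting a critical service."""
    return user in AUTHORIZED_ALARMERS.get(vo, set()) and service in CRITICAL_SERVICES

print(may_raise_alarm("alice.smith", "atlas", "FTS"))   # True
print(may_raise_alarm("eve.intruder", "atlas", "FTS"))  # False
```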
WLCG: Recent Problems
Three real problems from the most recent MB report: 2 network-related and 1 in data / storage management.
1. Loss of DNS in Germany(!) – a GGUS black-out.
2. Network performance on the CERN – Taiwan link: delays in problem resolution (the problem turned out to be near Frankfurt) due to a holiday weekend.
3. A major data issue: a combination of events led to data loss, which IMHO cannot be tolerated.
WLCG Networking Issues
The network works well most of the time: adequate bandwidth, backup links.
Occasional problems are often lengthy (days) to resolve: the future direction is to be more network-centric (breaking the MONARC model, a more flexible "hierarchy", remote data access).
Network problems seen so far: lines cut by motorway construction, fishing trawlers & tsunamis.
Tighter integration of network operations with WLCG operations is already needed.
WLCG Service: Challenges
The "Service Challenge" programme was initiated in 2004 to deliver a full production service capable of handling data-taking needs well in advance of real data taking.
Whilst we have recently been congratulated on the status and stability of the service, it has been much more difficult, and has taken much longer, than anyone anticipated.
– We should have been where we are today 2-3 years ago!
The service works, with most problems being addressed sufficiently quickly – but many improvements are still needed and the operations cost is high (sustainable?).
These include adapting to constantly evolving technology (multi-core, virtualization, changes in the data & storage management landscape, cloud computing, …).
As well as the move from EGEE to EGI…
(Some) Data-Related Issues
Data Preservation
Storage Management
Data Management
Data Access
WLCG & Data: Futures
Just as our MONARC+grid choice was made some time ago and is now being reconsidered, so too were our data management strategies.
Whilst for a long time HEP's data volumes and rates were special, they no longer are; nor are our requirements along other axes, such as Data Preservation.
"Standard building blocks" now look close to production readiness for at least some of our needs (tape access & management layer, cluster/parallel filesystems, data access…), still requiring some "integration glue" which can hopefully be shared with other disciplines.
"Watch this space" – data & storage management in HEP looks about to change (but not too much: it takes a long time to roll out new services and/or migrate multiple PB of data…).
WLCG & EGEE
The hardening of the WLCG service took place throughout the entire lifetime of the EGEE project series: 2004 – 2010.
The service continues to evolve: one recent change was in the scheduling of site / service interventions with respect to LHC machine stops. It was agreed that no more than 3 sites, or 30% of the total Tier1 resources available to a given VO, can be (scheduled) down at the same time – a rule sketched in code below.
Another was the introduction of Change / Risk assessments prior to service interventions: these are still not sufficiently rigorous.
Constant service evolution is clearly measurable over a time scale of months, e.g. via the Quarterly Reports.
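Read as two simultaneous caps, the agreed scheduling rule is simple to check. The following sketch is my own illustration with hypothetical resource shares, not an actual WLCG tool:

```python
# Downtime rule for one VO: at most 3 Tier1 sites down at once, and at most
# 30% of that VO's total Tier1 resources.

def downtime_allowed(down_sites: list[str], resources: dict[str, float]) -> bool:
    """resources maps each Tier1 site to the share of the VO's resources it provides."""
    down_share = sum(resources.get(site, 0.0) for site in down_sites)
    return len(down_sites) <= 3 and down_share <= 0.30

# Hypothetical resource shares for one VO's Tier1s (real site names, made-up numbers)
shares = {"IN2P3": 0.15, "FZK": 0.20, "RAL": 0.12, "CNAF": 0.13, "PIC": 0.05}

print(downtime_allowed(["RAL", "PIC"], shares))    # True:  2 sites, 17% of resources
print(downtime_allowed(["FZK", "IN2P3"], shares))  # False: 35% exceeds the 30% cap
```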
EGI InSPIRE SA3 / WP6
Another significant change, or transition, that we are now facing is the move from EGEE to EGI.
WP6, "Services for Heavy Users", is the only Work Package in InSPIRE led by CERN, and it addresses a variety of disciplines: HEP, LS, ES, A&A, F, CCMST.
Intent: continue to seek commonalities and synergies, not only within but also across disciplines.
Also "bind" in related projects in which CERN is involved in the LS & ES areas, plus links with A&A + F.
SA3 – Long Term Goals
One of the key issues that needs to be addressed by this Work Package is a model for long-term sustainability.
This could have multiple elements:
– Some services moving to standard infrastructure;
– Some of the current effort coming from the VOs themselves: this has been the standard model in HEP for many years!
– A nucleus of expertise to assist with migrations to new service versions and/or larger changes should nevertheless be foreseen: for HEP this needs to be higher than it is today.
Expect to elaborate on these issues through regular workshops and conferences, including the EGI Technical and User Fora plus other multi-disciplinary events.
– IEEE NSS & MIC brings together a number of those in SA3.
WLCG, the LHC & Experiments
Metric for WLCG's success:
– WLCG is seen and acknowledged as delivering a service that is an essential part of the LHC physics programme.
Metric for EGI InSPIRE SA3's success:
– SA3 is seen and acknowledged not only within and across the disciplines that it supports but also beyond: the longer-term goal of return on investment (RoI) to science and society.
Investment in Research & Education is essential: the motivation clearly goes beyond direct returns.
Summary
WLCG delivers a production service at the tera & peta scale; it permits local investment and exploitation, solving the "brain-drain" and related problems of previous solutions.
It has taken longer & been a lot harder than foreseen.
Many of the basic ingredients work well and are applicable beyond grids; the complexity has not always been justified.
Expect some changes in the coming years, e.g. in Data Management plus a more network-centric deployment model: these changes need to be adiabatic.
LHCC Referees…
"First experience of the World-wide LHC Computing Grid (WLCG) with the LHC has been positive. This is very much due to the substantial effort invested over several years during the intensive testing phase and all Tier centres must take credit for this success. The LHCC congratulates the WLCG on the achievements." – May 2010
BACKUP
Intervention & Resolution Targets
Targets (not commitments) proposed for Tier0 services – see the illustrative check after this slide:
– Similar targets requested for Tier1s / Tier2s.
– Experience from the first week of CCRC'08 suggests targets for problem resolution should not be set too high (if they are to be ~achievable).
The MoU lists targets for responding to problems (12 hours for Tier1s).
¿ Tier1s: 95% of problems resolved in < 1 working day ?
¿ Tier2s: 90% of problems resolved in < 1 working day ?
A post-mortem is triggered when targets are not met!

Time Interval | Issue (Tier0 Services)                       | Target
End 2008      | Consistent use of all WLCG Service Standards | 100%
30'           | Operator response to alarm / phone call      | 99%
1 hour        | Operator response to alarm / phone call      | 100%
4 hours       | Expert intervention in response to above     | 95%
8 hours       | Problem resolved                             | 90%
24 hours      | Problem resolved                             | 99%
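To illustrate how the proposed resolution targets could be evaluated (hypothetical numbers throughout; the 8-hour working day is my assumption, not from the slide):

```python
# Do 95% (Tier1) / 90% (Tier2) of problems get resolved within one working day?

WORKING_DAY_HOURS = 8  # assumption: one working day ~ 8 hours

def meets_target(resolution_hours: list[float], target_fraction: float) -> bool:
    within = sum(1 for h in resolution_hours if h < WORKING_DAY_HOURS)
    return within / len(resolution_hours) >= target_fraction

tier1_times = [0.5, 2, 3, 4, 6, 7, 7.5, 5, 1, 30]  # made-up hours-to-resolution
print(meets_target(tier1_times, 0.95))  # False: 9/10 = 90% < 95%
print(meets_target(tier1_times, 0.90))  # True:  90% >= 90%
```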