Grid operations in 2016
T1/T2 workshop – 7th edition – Strasbourg, 3 May 2017
Latchezar Betev
T1/T2 workshops – 7th edition in Strasbourg
CERN 2009, Karlsruhe 2012, Lyon 2013, Tsukuba 2014, Torino 2015, Bergen 2016
*towers are shorter than they appear
The ALICE Grid today
54 sites in Europe (including UPB, Romania)
9 in Asia
3 in North America
2 in South America
1 in Africa
New sites
Basically unchanged since last year
One more returning site – UPB in Romania
Altaria is now a Cloud catch-all (see Maarten’s talk)
Still no volunteering sites in Australia/New Zealand
New Supercomputers
Set of network-free site services ready
Substantial amount of work to rid the application software of network calls
Successful test with the full chain (Monte Carlo)
Several issues remain:
Access necessitates hand-holding (token generation)
Backfill yields many short (minutes-long) slots, while MC jobs are longer – see the sketch below
The AMD CPUs are almost 2x slower than the average Grid CPU
We can’t use the abundant GPUs
To use effectively, we will need a time allocation
To be continued…
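A minimal sketch of the backfill mismatch, assuming invented slot and job durations – an illustration only, not the actual AliEn scheduling logic:

```python
# Hypothetical numbers: backfill windows of minutes vs. a multi-hour MC job.
backfill_slots_min = [5, 12, 30, 45]   # assumed backfill windows (minutes)
mc_job_min = 8 * 60                    # an assumed ~8-hour MC job

# A slot is usable only if the whole job finishes before the slot is reclaimed.
usable = [slot >= mc_job_min for slot in backfill_slots_min]
print(f"usable slots: {sum(usable)}/{len(usable)}")  # -> usable slots: 0/4
```

Unless MC payloads can be split into much shorter chunks, essentially none of the short backfill slots are usable.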
Job record – broken again in April 2017 (previous record: April 2016)
*This is not the highest we had, but you get the idea
CPU resources evolution
12300 → 24230 → 29160 → 36600 → 43000 → 61360 → 74380 (LS1 spans two of these years)
Year-on-year increase: +97%, +20%, +26%, +17%, +43%, +21%
Disk resources evolution
2016: T2 deficit 4PB
2017: T1 deficit 4PB, T2 deficit 9PB
Total 16% disk deficit (if the 2016 T2 deficit is recuperated)
More details in Yves’ talk on Friday
Reminder on replicas
Tapes – only RAW data (2 copies)
Derived data of active datasets: ESDs – 1 copy, AODs – 2 copies
Inactive datasets – 1 copy
Regular cleanups (unpopular and dark data) are done, and there is no more freedom to reduce replicas
=> The disk deficit will be increasingly difficult to compensate with ‘technical’ means (worked sketch below)
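A worked sketch of what this policy means for disk accounting; the copy counts come from the slide, while the dataset sizes are invented placeholders:

```python
# Copy counts from the replica policy above; the dataset sizes are invented.
replicas = {"ESD_active": 1, "AOD_active": 2, "inactive": 1}
sizes_pb = {"ESD_active": 6.0, "AOD_active": 4.0, "inactive": 10.0}

footprint = sum(sizes_pb[k] * replicas[k] for k in replicas)
print(f"disk footprint: {footprint} PB")  # 6*1 + 4*2 + 10*1 = 24.0 PB
# Every category is already at its minimum copy count, so the footprint can
# only shrink by deleting data, not by dropping further replicas.
```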
2015+2016 RAW data collection
Status of RAW data processing
Substantial IR-induced distortions in the TPC
Offline correction algorithms (embedded in CPass0/CPass1) developed and validated
Reconstruction of 2016 RAW data at ~90%, of 2015 data at ~60%; completion expected by June 2017
~15% of the Grid since November
Status of MC processing
Numerous general-purpose and specialized MC cycles anchored to 2015 and 2016 data
Total of 135 cycles (~same as in 2015), 2,897,020,717 events
RAW and MC productions are now handled by the Data Preparation Group (DPG)
Resources sharing
MC productions: 72%, @ all centres
RAW data processing: 10%, @ T0/T1s only
Organized analysis: 13%, @ all centres
Individual analysis: 4%, @ all centres, 480 users
Catalogue stats
New files per year: 2014 Δ = 270 Mio., 2015 Δ = 950 Mio., 2016 Δ = 900 Mio.
A more dramatic increase, mitigated by cleanup
That said, January–April 2017 alone: +500 Mio. new files; RAW data reconstruction is a big generator
New catalogue in the works (Miguel’s presentation)
Organized analysis
Jobs: 3000 → 4400 → 5800 → 9100 → 10000
Year-on-year increase: +47%, +32%, +57%, +10%
Individual analysis
4600 jobs / 446 individual users → 4700 jobs / 432 individual users
Year-on-year change, individual analysis: -50%, +3%, -17%, 0%
For comparison, organized analysis: +47%, +32%, +57%, +10%
Analysis evolution
In the past 2 years: stable number of users; resources used by individual analysis are also stable
The organized analysis share keeps increasing – not surprising, with more data and more analysis topics
This also brings increased load and demand on the storage
Grid efficiency
76% → 84% → 86% → 82% → 82%
Year-on-year change: +8%, +2%, -4%, 0%
Grid efficiency evolution
Efficiency is ~flat
Specific effort to increase the efficiency of calibration tasks, as part of the new calibration suite development
Despite the substantial increase of analysis, the efficiency is not degraded!
In general, ~85% efficiency is perhaps the maximum we can expect
No manpower to work on this (increasingly difficult to improve) topic
Storage availability (over one year)
99–100%: 28 SEs (40%) – up to ~4 days downtime
95–99%: 20 SEs (30%) – up to ~18 days downtime
90–95%: 7 SEs (10%), 7.4PB (15% of capacity) – up to ~37 days downtime
<90%: 13 SEs (20%), 2.6PB (5% of capacity)
Total number of SEs = 68 (including tape); disk only = 48PB total available space (downtime equivalents checked below)
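The downtime figures follow directly from the availability thresholds; a one-line check, assuming a 365-day year:

```python
# Availability threshold -> allowed downtime per 365-day year.
for availability in (0.99, 0.95, 0.90):
    downtime_days = (1 - availability) * 365
    print(f"{availability:.0%} available -> ~{downtime_days:.1f} days down")
# 99% -> 3.7, 95% -> 18.2, 90% -> 36.5 days: the ~4/18/37-day figures above.
```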
Storage availability evolution
Significant capacity is still below the target 95% availability – 20% of the current 48PB
Tools exist for the sysadmins to follow up on the problems; despite that, we still have to send ‘look at your storage’ messages
Question on what to do with the <90% SEs – should these be ‘transferred’ to other sites?
Reminder – the replica model is 1 ESD copy only; thanks to the fully distributed model, we have ~5% data loss
This is at about the ‘pain’ threshold for analyzers
New storage capacity in 2017
43/48 PB disk used (90%)
Cleanup ongoing, but productions are filling the space faster…
We need the 2017 capacity installed, especially on SEs which are ≥90% full
Presently these are 24 of 58, including many large (>1PB) SEs
Storage use
Read volume: 195PB → 240PB → 433PB (+23%, +80% year on year)
Write volume: 20PB → 26PB → 37PB (+30%, +42% year on year)
Read/write ratio: 9.2 → 12 (checked in the sketch below)
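A quick arithmetic check of the quoted year-on-year changes and read/write ratios; the PB volumes are the ones on the slide:

```python
read_pb  = [195, 240, 433]   # yearly read volume, PB
write_pb = [20, 26, 37]      # yearly write volume, PB

for prev, cur in zip(read_pb, read_pb[1:]):
    print(f"read  {prev}->{cur} PB: {100*(cur/prev - 1):+.0f}%")  # +23%, +80%
for prev, cur in zip(write_pb, write_pb[1:]):
    print(f"write {prev}->{cur} PB: {100*(cur/prev - 1):+.0f}%")  # +30%, +42%
for r, w in zip(read_pb[1:], write_pb[1:]):
    print(f"read/write ratio: {r/w:.1f}")  # 9.2, 11.7 (~12 on the slide)
```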
Storage use evolution
Read volume increase: preparation for QM’2017 (passed) and now SQM’2017; increased number of analysis trains + read data volume
Positive aspects and points to worry about:
No perceptible degradation of the overall efficiency
The storage is not (yet) saturated, but there will be an inflection point
Large-capacity disks => lower i/o per unit of space – something to worry about; the effect is already seen on loaded SE units (tape buffer)
Cleanup: ‘orphan files’ < 1%, verified periodically; deletion of files and replica reduction is a constant activity
The Fifth Element – Network
Still one step ahead (Europe/USA) within the present computing model
Large growth of computing capacities in Asia; regional network routing to be improved – now followed at the Asian Tier Center Forum (ATCF)
South Africa – same issue and active follow-up
Summary
2016 was a great year for Grid operations
The impressive data taking continues, and so does the pressure on the computing resources
Efficiency remains high; site stability remains high, though storage stability at a few sites must be improved
The ALICE upgrade is one year closer; development of the new software is accelerating
On track for the third year of LHC Run 2
Thank you to all who contributed to 2016 being another great Grid year!
Thanks to all speakers
For those who still did not do it – please upload your presentations