LHCbComputing
Lessons learnt from Run I
Growing pains of a lively teenager

The toddler years
- Gaudi: 1998
  - Object oriented, C++9x, STL, (CERNLIB)
    - NO vector instructions (pre-SSE2)
  - Architecture done from scratch
    - But needed backward compatibility from the start:
      - Existing simulation/reconstruction from the TP (SICB)
      - TDR studies had to be supported throughout
- DIRAC: 2002
  - Supporting major production and user analysis uninterrupted since DC04
- Computing TDR: 2005
  - Computing model designed for a 2 kHz HLT rate

Growing up
- An incremental approach to deployment of new features
  - Adapting to changing requirements and environment
    - E.g. one order of magnitude increase in HLT output rate
  - Learn from the past: throw away what works badly, keep and improve what works well
- Development in parallel with the running production system
  - Physics software in production since 1996
    - Detector design, detector optimisation, HLT + physics preparation, physics exploitation
  - Production system continuously supporting major productions and analysis since 2004
- Strong constraint also for the future
  - Continue to support the running experiment
  - Continue to support analysis of legacy data
    - Minimise the pain of maintenance by supporting legacy data in new software
  - Do not underestimate the training effort needed to keep users on board


[Figure slide: Andrei Tsaregorodtsev – Sept 2004]

Reconstruction
- Experience from Run 1:
  - Reco14 vs. Reco12 (reprocessing of 2011 data)
    - Significant differences in signal selection
  - Reco14 (first reconstruction of 2012 data):
    - If we can provide calibration "fast enough", we do not need reprocessing
- Run 2 strategy for optimal selection efficiency:
  - Online calibration
    - Use it both online and offline; reprocessing becomes unnecessary
  - Online reconstruction
    - Sufficiently fast to run identical code online and offline (see the sketch below)
- Given fast enough code and/or sufficient resources in the HLT farm, could skip "offline" reconstruction altogether
  - Opens up new possibilities
    - Reduced need for RAW data offline (cf. TURBO stream)
    - Optimise reconstruction for HLT farm hardware
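The online/offline symmetry above can be shown as a minimal sketch: one reconstruction function driven by one calibration source, called both from the HLT farm and from the offline production, so the two reconstructions agree by construction. All names here (reconstruct, fetch_calibration, hlt2_process, offline_process) are hypothetical illustrations, not the actual LHCb/Gaudi API.

```python
# Minimal sketch of the Run 2 idea: one reconstruction function and one
# calibration source, used both by the HLT farm and by offline production.
# All names are hypothetical.

from dataclasses import dataclass, field


@dataclass
class Calibration:
    """Alignment/calibration constants produced by the online calibration tasks."""
    version: str
    constants: dict = field(default_factory=dict)


def fetch_calibration(run_number: int) -> Calibration:
    # In reality this would query the conditions database filled online;
    # here it just returns a dummy object.
    return Calibration(version=f"online-run{run_number}")


def reconstruct(raw_event: dict, calib: Calibration) -> dict:
    # Single code path: whatever runs here in HLT2 is exactly what offline gets.
    return {"tracks": [], "calib_version": calib.version, "run": raw_event["run"]}


def hlt2_process(raw_event: dict) -> dict:
    return reconstruct(raw_event, fetch_calibration(raw_event["run"]))


def offline_process(raw_event: dict) -> dict:
    # No reprocessing needed: offline simply reuses the online calibration.
    return reconstruct(raw_event, fetch_calibration(raw_event["run"]))


event = {"run": 163456, "raw_banks": b"..."}
assert hlt2_process(event) == offline_process(event)
```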

Validation is key
- Long history of code improvements
  - But fixes mostly done as a result of "crises" in production
    - "St. Petersburg crisis" (July 2010)
      - Explosion of CPU time due to combinatorics at high (>2) mu
        - Was not foreseen, had never been properly tested with simulation
        - Introduction of GECs (global event cuts, see the sketch below)
    - "Easter crisis" (April 2012)
      - Memory leaks in new code, insufficiently tested on a large enough scale
    - pA reconstruction (February 2013)
      - Events killed by the GECs introduced in 2010...
        - Big effort to understand tails; bugs found and optimisations done so that the cuts could be relaxed
- Better strategy in place for the 2015 startup
  - Turbo-validation stream
    - Allowed online-offline differences to be studied
    - Huge validation effort to understand (and fix) the differences
- But still fire-fighting
  - PbPb / PbAr reconstruction
    - Starting now... experience from 2013 should help
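A global event cut is easy to illustrate: reject events so busy that the combinatorial stages would explode in CPU time, before those stages run. The sketch below is purely illustrative; the observable and threshold are invented and are not the cuts actually used by LHCb.

```python
# Minimal sketch of a Global Event Cut (GEC): drop very busy events before
# the combinatorial (and therefore CPU-hungry) pattern recognition runs.
# The observable and threshold below are invented for illustration.

from typing import Optional

MAX_HITS = 6000  # hypothetical multiplicity threshold


def passes_gec(event: dict, max_hits: int = MAX_HITS) -> bool:
    """Return False for events whose multiplicity would make combinatorics explode."""
    return event.get("n_hits", 0) <= max_hits


def reconstruct_tracks(event: dict) -> list:
    # Stand-in for the expensive pattern recognition.
    return []


def process(event: dict) -> Optional[list]:
    if not passes_gec(event):
        # The event is simply dropped -- the origin of the pA "killed events"
        # in 2013, when cuts tuned for pp running proved too tight.
        return None
    return reconstruct_tracks(event)


print(process({"n_hits": 1200}))    # -> []
print(process({"n_hits": 250000}))  # -> None (rejected by the GEC)
```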

Validation Challenges
- Clearly we do not do enough validation
  - We don't give ourselves enough time
  - The tools are difficult to use
  - It has little visibility
    - Fire-fighters are sexier than building safety officers!
- We need to be better organised and have formal goals
  - The online-offline differences work was a good example
    - Well recognised, accepted metric, with regular reporting
- We have a huge army of "validators": the physics analysts
  - Get them more involved in defining and testing functionality
  - Use current data-taking and analysis to test, commission and deploy new concepts for the future
- Where software has to be written from scratch, be more formal about software quality and validation
  - Needs a cultural change

Event selection (a.k.a. Trigger and Stripping)
- Event selection has moved closer to the detector
  - Computing TDR (2005):
    - 2 kHz HLT rate (of which 200 Hz b-exclusive)
    - Factor 9.6 reduction in Stripping
  - Run 1:
    - 5 kHz HLT rate
    - ~2.0 reduction factor in Stripping (S21)
  - Run 2 (1866 colliding bunches):
    - 22.6 kHz HLT rate!! (18 kHz FULL, 4 kHz Turbo, 0.6 kHz Turcal)
    - <2.0 reduction in Stripping
- Rethink the role of "Stripping"
  - Most of the event rate reduction is now done in the HLT
  - Event size reduction
    - Removes RAW selectively
      - If not needed offline, why do we even write it out of the HLT?
    - Reduces DST size selectively (various MDST streams)
      - TURBO does something similar, already in the HLT...
  - Streaming reduces the number of events to handle in an analysis
    - Could we use an index instead? (see the sketch below)
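The index idea in the last bullet can be sketched as follows: instead of copying selected events into per-stream files, record where each selected event lives and read it back on demand. Everything here (class names, the selection-line name, the storage layout) is invented for illustration; it is not an existing LHCb service.

```python
# Minimal sketch of an event index as an alternative to streaming: record the
# location of each selected event rather than duplicating it into a stream.
# Class names, the selection-line name and the storage layout are invented.

from collections import defaultdict


class EventIndex:
    def __init__(self):
        # selection line -> list of (file name, entry number)
        self._index = defaultdict(list)

    def add(self, line: str, filename: str, entry: int) -> None:
        self._index[line].append((filename, entry))

    def lookup(self, line: str):
        """Locations of all events that fired a given selection line."""
        return self._index[line]


# Indexing pass (would run centrally, e.g. at stripping time):
index = EventIndex()
index.add("Hypothetical_B2XMuMu_line", "stream.dst", 42)
index.add("Hypothetical_B2XMuMu_line", "stream.dst", 137)

# Analysis pass: open only the entries that matter, no dedicated stream needed.
for filename, entry in index.lookup("Hypothetical_B2XMuMu_line"):
    print(f"read event {entry} from {filename}")
```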

Analysis model
- One size does not fit all
  - DST, MDST, Turbo
  - Compromise between event rate and the ability to redo (parts of) the reconstruction
- What about N-tuples?
  - Completely outside the current computing model, but these are what analysts mostly access
- Should access to the stripping output be reduced, in favour of centralised N-tuple production?
  - Greater role for working group productions
  - cf. the ALICE analysis trains (see the sketch below)
- One size does not fit all for simulation either
  - Fast simulation options
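Schematically, an analysis train bundles many working-group jobs into a single pass over the data, so the expensive I/O is paid once. The sketch below illustrates only the pattern; the Wagon interface and the example selections are invented, not the ALICE or LHCb implementation.

```python
# Minimal sketch of an "analysis train": one central pass over the stripping
# output, with several working-group "wagons" each filling their own N-tuple.
# The Wagon interface and the example selections are invented.


class Wagon:
    """One working-group task attached to the train."""

    def __init__(self, name, selection, variables):
        self.name = name
        self.selection = selection   # callable: event -> bool
        self.variables = variables   # callable: event -> dict of tuple branches
        self.ntuple = []             # stand-in for a real N-tuple / TTree

    def process(self, event):
        if self.selection(event):
            self.ntuple.append(self.variables(event))


def run_train(events, wagons):
    # The expensive part (reading and decoding events) is paid once for everybody.
    for event in events:
        for wagon in wagons:
            wagon.process(event)


# Two hypothetical wagons sharing a single pass over the data.
wagons = [
    Wagon("charm", lambda e: e["pt"] > 2.0, lambda e: {"pt": e["pt"]}),
    Wagon("b2cc",  lambda e: e["pt"] > 5.0, lambda e: {"pt": e["pt"], "m": e["m"]}),
]
run_train([{"pt": 3.1, "m": 5.28}, {"pt": 6.4, "m": 5.37}], wagons)
print([len(w.ntuple) for w in wagons])   # -> [2, 1]
```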

Data popularity, disk copies
- 25% of disk space is occupied by files not accessed for over a year
- Can we be more clever? (see the toy policy below)
  - Optimal number of copies?
  - More active use of tape?
  - Introduce working group N-tuples into the model?
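A popularity-driven replica policy amounts to something like the toy function below, which derives a suggested number of disk copies from the last access time and the recent access count. The thresholds and the policy itself are invented, purely to illustrate the idea.

```python
# Toy data-popularity policy: suggest how many disk replicas a dataset should
# keep, based on when it was last accessed and how often it has been used
# recently.  All thresholds are invented for illustration.

from datetime import datetime, timedelta
from typing import Optional


def suggested_replicas(last_access: datetime,
                       accesses_last_quarter: int,
                       now: Optional[datetime] = None) -> int:
    now = now or datetime.utcnow()
    if now - last_access > timedelta(days=365):
        return 0   # untouched for a year: candidate for tape-only archiving
    if accesses_last_quarter > 100:
        return 3   # hot dataset: keep several disk copies
    return 1       # default: a single disk copy


# Example: a dataset last read in early 2014, evaluated in autumn 2015.
print(suggested_replicas(datetime(2014, 1, 1), 0, now=datetime(2015, 9, 1)))  # -> 0
```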

Software preservation
- We have big difficulties keeping old software operational
  - e.g. Reco14 and Stripping21 cannot run as-is with Sim09
    - Due to ROOT5 / ROOT6 incompatibilities
  - e.g. "swimming" the Run 1 trigger on Reco14 data
    - Due to new classes on Reco14 DSTs, not known to Moore in 2012
- Similar problems with the CondDB
  - Increasingly complex dependencies between:
    - Software versions (new functionality, new calibration methods)
    - Real detector geometry changes
    - Different calibration or simulation versions
- Need much improved continuous validation of important workflows (see the sketch below)
  - To catch backward incompatibilities when they happen
  - To make the introduction of new concepts less error-prone
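In principle such continuous validation is just a nightly job of the following shape: rerun a frozen legacy workflow on a fixed reference input and compare a handful of counters against a stored reference. The script, command line and counter files below are placeholders, not LHCb's actual test infrastructure.

```python
# Minimal sketch of a continuous-validation job: rerun a frozen legacy
# workflow on a fixed reference input and compare a few counters against a
# stored reference, so backward incompatibilities are caught when they
# appear.  The command line and the counter files are placeholders.

import json
import subprocess


def run_workflow(cmd: list) -> dict:
    """Run a (placeholder) workflow command expected to write counters.json."""
    subprocess.run(cmd, check=True)
    with open("counters.json") as f:
        return json.load(f)


def validate(counters: dict, reference: dict, tolerance: float = 0.0) -> list:
    """Return a list of human-readable mismatches between counters and reference."""
    problems = []
    for name, ref_value in reference.items():
        value = counters.get(name)
        if value is None or abs(value - ref_value) > tolerance:
            problems.append(f"{name}: got {value}, expected {ref_value}")
    return problems


if __name__ == "__main__":
    with open("reference_counters.json") as f:
        reference = json.load(f)
    counters = run_workflow(["./run_legacy_stripping.sh", "reference_input.dst"])
    failures = validate(counters, reference)
    if failures:
        raise SystemExit("Validation failed:\n" + "\n".join(failures))
```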

Infrastructure changes
- Sometimes necessary, sometimes desirable, always long
  - e.g. CMT to CMake
    - Basic tools have been ready for a long time
    - Fully supported by the most recent releases
    - Not transparent to users; requires training and convincing
    - Large body of legacy scripts difficult to migrate
      - e.g. SetupProject for Run 1 versions of Moore
  - e.g. CVS to SVN
    - Huge amount of preparatory work
    - Relatively fast and painless user migration
  - e.g. Job Options (see the sketch below)
    - Still plenty of .opts files in production releases:
        Lbglimpse opts DaVinci v38r0 | grep ".opts:" | grep -v GAUDI | wc -l
        238
- Supporting two systems for too long
  - How can we go faster?
    - And keep everyone on board?
  - What about old software stacks?
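For the job-options migration, the target format looks schematically like the Python fragment below, with the old .opts equivalents shown as comments. This is only a generic illustration of the Gaudi Python options style under the assumption that these standard properties are available; it does not reproduce any specific LHCb application configuration.

```python
# Schematic Gaudi-style Python options file replacing an old text .opts
# fragment (the .opts equivalents are shown as comments).  Illustrative only:
# no specific LHCb application configuration is implied.

from Gaudi.Configuration import ApplicationMgr, MessageSvc

ApplicationMgr().EvtMax = 1000    # was: ApplicationMgr.EvtMax = 1000;
MessageSvc().OutputLevel = 3      # was: MessageSvc.OutputLevel = 3;
```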

External collaboration
- History of successful examples
  - ROOT, GEANT4, CERNLIB, Gaudi, generators...
  - And less successful ones
    - LCG middleware
  - Sometimes it is not obvious that the pain is worth the gain
- We cannot ignore reality
  - Funding agencies are increasingly asking questions about our software sharing
  - We are too few to do everything ourselves
- Development is fun, maintenance less so
  - Including a (long-term) maintenance strategy might make compromises with third parties more attractive

LHCbComputing
Ready to become a (young) adult?