1 LHCbComputing Lessons learnt from Run I

2 LHCbComputing Lessons learnt from Run I Growing pains of a lively teenager

3 The toddler years
- Gaudi: 1998
  - Object oriented, C++9x, STL, (CERNLIB)
    - NO vector instructions (pre-SSE2)
  - Architecture done from scratch
    - But needed backward compatibility from the start:
      - Existing simulation/reconstruction from the TP (SICB)
      - TDR studies had to be supported throughout
- DIRAC: 2002
  - Supporting major production and user analysis uninterrupted since DC04
- Computing TDR: 2005
  - Computing model designed for a 2 kHz HLT rate

4 Growing up
- An incremental approach to deploying new features
  - Adapting to changing requirements and environment
    - e.g. an order of magnitude increase in the HLT output rate
  - Learn from the past: throw away what works badly, keep and improve what works well
- Development in parallel with the running production system
  - Physics software in production since 1996
    - detector design, detector optimisation, HLT and physics preparation, physics exploitation
  - Production system continuously supporting major productions and analysis since 2004
- Strong constraints also for the future
  - Continue to support the running experiment
  - Continue to support analysis of legacy data
    - Minimise the pain of maintenance by supporting legacy data in the new software
  - Do not underestimate the training effort needed to keep users on board

5 (image-only slide)

6 (image slide, from Andrei Tsaregorodtsev, Sept 2004)

7 Reconstruction
- Experience from Run 1:
  - Reco14 vs. Reco12 (reprocessing of 2011 data)
    - Significant differences in signal selection
  - Reco14 (first reconstruction of 2012 data):
    - If we can provide calibration "fast enough", we do not need reprocessing
- Run 2 strategy for optimal selection efficiency:
  - Online calibration
    - Use it both online and offline: reprocessing becomes unnecessary
  - Online reconstruction
    - Sufficiently fast to run identical code online and offline (see the sketch below)
- Given fast enough code and/or sufficient resources in the HLT farm, we could skip the "offline" reconstruction altogether
  - Opens up new possibilities
    - Reduced need for RAW data offline (c.f. the TURBO stream)
    - Optimise the reconstruction for the HLT farm hardware
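A minimal sketch of the "identical code online and offline" idea, in Python for brevity; all names here (Calibration, make_reco_sequence, the fill tag) are illustrative assumptions, not the actual Gaudi/LHCb API:

    # Both the HLT farm and the offline production build their
    # reconstruction from the SAME function; only the calibration
    # passed in can differ. If the online calibration is already
    # final, the outputs are identical and reprocessing is unneeded.
    from dataclasses import dataclass

    @dataclass
    class Calibration:
        tag: str         # e.g. an alignment/calibration tag computed online
        constants: dict  # detector constants keyed by subsystem

    def make_reco_sequence(calib: Calibration):
        def reconstruct(raw_event: dict) -> dict:
            # Placeholder for the real tracking/PID/calorimeter steps.
            return {"tracks": raw_event.get("hits", []),
                    "calib_tag": calib.tag}
        return reconstruct

    # Online: the HLT uses the calibration computed at the start of fill.
    online_reco = make_reco_sequence(Calibration("fill-4205-v1", {}))
    # Offline: the SAME function, with the SAME (now archived) calibration.
    offline_reco = make_reco_sequence(Calibration("fill-4205-v1", {}))

    event = {"hits": [1, 2, 3]}
    assert online_reco(event) == offline_reco(event)  # identical by construction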

8 Validation is key
- Long history of code improvements
  - But fixes were mostly done as a result of "crises" in production
    - "St. Petersburg crisis" (July 2010)
      - Explosion of CPU time due to combinatorics at high pile-up (mu > 2)
        - Was not foreseen, had never been properly tested with simulation
        - Led to the introduction of GECs (Global Event Cuts; see the sketch below)
    - "Easter crisis" (April 2012)
      - Memory leaks in new code, insufficiently tested at a large enough scale
    - pA reconstruction (February 2013)
      - Events killed by the GECs introduced in 2010...
        - Big effort to understand the tails; bugs were found and optimisations made so the cuts could be relaxed
- Better strategy in place for the 2015 startup
  - Turbo-validation stream
    - Allowed online-offline differences to be studied
    - Huge validation effort to understand (and fix) the differences
- But still fighting fires
  - PbPb / PbAr reconstruction
    - Starting now... experience from 2013 should help
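A minimal sketch of a Global Event Cut, in Python with a toy event model; the threshold and field names are invented for illustration, not the real LHCb values:

    def passes_gec(event: dict, max_hits: int = 6000) -> bool:
        """Reject very busy events BEFORE combinatorial reconstruction.

        Track-finding cost grows much faster than linearly with hit
        multiplicity (seed pairs alone grow as n*(n-1)/2), so a few
        extreme events can dominate the CPU time of a whole production.
        """
        return len(event["velo_hits"]) <= max_hits

    def reconstruct(event: dict):
        if not passes_gec(event):
            # Event skipped: exactly the mechanism that killed pA events
            # in 2013, when heavy-ion-like events exceeded pp thresholds.
            return None
        hits = event["velo_hits"]
        # Combinatorial seeding: the O(n^2) term that the GEC bounds.
        return [(a, b) for i, a in enumerate(hits) for b in hits[i + 1:]]

    print(reconstruct({"velo_hits": [1, 2, 3]}))           # small event: reconstructed
    print(reconstruct({"velo_hits": list(range(10**4))}))  # busy event: None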

9 Validation challenges
- Clearly we do not do enough validation
  - We don't give ourselves enough time
  - The tools are difficult to use
  - It has little visibility
    - Fire-fighters are sexier than building safety officers!
- We need to be better organised and to have formal goals
  - The online-offline differences work was a good example
    - A well recognised, accepted metric, with regular reporting
- We have a huge army of "validators": the physics analysts
  - Get them more involved in defining and testing functionality
  - Use current data-taking and analysis to test, commission and deploy new concepts for the future
- Where software has to be written from scratch, be more formal about software quality and validation
  - This needs a cultural change

10 Event selection (a.k.a. Trigger and Stripping)
- Event selection has moved closer to the detector
  - Computing TDR (2005):
    - 2 kHz HLT rate (of which 200 Hz b-exclusive)
    - Factor 9.6 rate reduction in the Stripping
  - Run 1:
    - 5 kHz HLT rate
    - ~2.0 reduction factor in the Stripping (S21)
  - Run 2 (1866 colliding bunches):
    - 22.6 kHz HLT rate!! (18 kHz FULL, 4 kHz Turbo, 0.6 kHz Turcal; see the arithmetic below)
    - <2.0 reduction factor in the Stripping
- Rethink the role of the "Stripping"
  - Most of the event rate reduction is now done in the HLT
  - Event size reduction
    - Removes RAW selectively
      - If it is not needed offline, why do we even write it out of the HLT?
    - Reduces DST size selectively (various MDST streams)
      - TURBO does something similar, already in the HLT...
  - Streaming reduces the number of events to handle in an analysis
    - Could we use an index instead?
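Checking the slide's rates (Python used only as a calculator, with the numbers quoted above):

    full, turbo, turcal = 18.0, 4.0, 0.6   # kHz, the three Run 2 HLT streams
    run2_hlt = full + turbo + turcal
    assert round(run2_hlt, 1) == 22.6      # kHz, as quoted above

    tdr_hlt = 2.0                          # kHz, Computing TDR (2005) design
    print(run2_hlt / tdr_hlt)              # ~11.3x: the "order of magnitude"
                                           # increase noted on slide 4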

11 Analysis model
- One size does not fit all
  - DST, MDST, Turbo
  - A compromise between event rate and the ability to redo (parts of) the reconstruction
- What about N-tuples?
  - Completely outside the current computing model, yet they are what analysts mostly access
- Should access to the stripping output be reduced, in favour of centralised N-tuple production?
  - A greater role for working group productions
  - c.f. the ALICE analysis trains (see the sketch below)
- One size does not fit all for simulation either
  - Fast simulation options
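A minimal sketch of the analysis-train pattern referred to above (c.f. ALICE): many working-group N-tuple producers ride one pass over the data, so the data is read once centrally rather than once per analyst. Wagon names and selections are invented for illustration:

    from typing import Callable, Dict, Iterable, List, Optional

    TupleMaker = Callable[[dict], Optional[dict]]  # event -> N-tuple row, or None

    def run_train(events: Iterable[dict],
                  wagons: Dict[str, TupleMaker]) -> Dict[str, List[dict]]:
        """One I/O pass; each 'wagon' fills its own working-group N-tuple."""
        ntuples: Dict[str, List[dict]] = {name: [] for name in wagons}
        for event in events:                  # the single, shared read
            for name, make_row in wagons.items():
                row = make_row(event)
                if row is not None:
                    ntuples[name].append(row)
        return ntuples

    wagons = {
        "B2CC":  lambda ev: {"mass": ev["mass"]} if ev["mass"] > 5.2 else None,
        "Charm": lambda ev: {"pt": ev["pt"]} if ev["pt"] > 2.0 else None,
    }
    print(run_train([{"mass": 5.3, "pt": 1.0}, {"mass": 5.0, "pt": 3.0}], wagons))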

12 Data popularity, disk copies
- 25% of disk space is occupied by files that have not been accessed for over a year
- Can we be cleverer? (see the sketch below)
  - Optimal number of copies?
  - More active use of tape?
  - Introduce working group N-tuples into the model?
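One way to "be cleverer", sketched in Python: drive the number of disk replicas from data popularity, so cold datasets fall back to tape. The thresholds and copy counts are invented for illustration, not LHCb policy:

    from datetime import datetime, timedelta
    from typing import Optional

    def target_disk_copies(last_access: datetime,
                           now: Optional[datetime] = None) -> int:
        """More recently used -> more disk replicas; cold data -> tape only."""
        age = (now or datetime.utcnow()) - last_access
        if age < timedelta(days=90):
            return 3   # hot: replicate for analysis throughput
        if age < timedelta(days=365):
            return 1   # warm: a single disk copy
        return 0       # cold (the 25% above): recall from tape on demand

    print(target_disk_copies(datetime.utcnow() - timedelta(days=400)))  # -> 0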

13 Software preservation
- We have big difficulties keeping old software operational
  - e.g. Reco14 and Stripping21 cannot run as-is with Sim09
    - Due to ROOT5 / ROOT6 incompatibilities
  - e.g. "swimming" the Run 1 trigger on Reco14 data
    - Due to new classes on the Reco14 DSTs, not known to Moore in 2012
- Similar problems with the CondDB
  - Increasingly complex dependencies between:
    - software versions (new functionality, new calibration methods)
    - real detector geometry changes
    - different calibration or simulation versions
- We need much improved continuous validation of the important workflows (see the sketch below)
  - To catch backward incompatibilities when they happen
  - To make the introduction of new concepts less error-prone
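A minimal sketch of such continuous validation: periodically re-run each preserved workflow and compare a summary of its output against a stored reference, so a backward incompatibility (a new ROOT version, new DST classes, a new CondDB tag) is caught the night it appears. The workflow scripts, JSON summaries and file names are all invented for illustration:

    import json
    import subprocess
    import sys

    WORKFLOWS = {
        # workflow name -> (command that runs it, reference summary file)
        "Reco14+Stripping21 on Sim09": (["./run_reco14_s21.sh"], "ref_reco14.json"),
        "Run 1 trigger swimming":      (["./run_swimming.sh"],   "ref_swim.json"),
    }

    def validate(name: str, cmd: list, ref_file: str) -> bool:
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            print(f"[FAIL] {name}: workflow can no longer be launched")
            return False
        if proc.returncode != 0:
            print(f"[FAIL] {name}: workflow no longer runs")
            return False
        summary = json.loads(proc.stdout)     # e.g. event/candidate counts
        with open(ref_file) as f:
            reference = json.load(f)
        if summary != reference:
            print(f"[FAIL] {name}: output changed w.r.t. the reference")
            return False
        print(f"[OK]   {name}")
        return True

    if __name__ == "__main__":
        # Run ALL workflows (no short-circuit), then report overall status.
        results = [validate(n, cmd, ref) for n, (cmd, ref) in WORKFLOWS.items()]
        sys.exit(0 if all(results) else 1)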

14 Infrastructure changes
- Sometimes necessary, sometimes desirable, always long
  - e.g. CMT to CMake
    - Basic tools have been ready for a long time
    - Fully supported by the most recent releases
    - Not transparent to users; requires training and convincing
    - A large body of legacy scripts is difficult to migrate
      - e.g. SetupProject for Run 1 versions of Moore
  - e.g. CVS to SVN
    - Huge amount of preparatory work
    - Relatively fast and painless user migration
  - e.g. Job Options
    - Still plenty of .opts files in production releases:
        Lbglimpse opts DaVinci v38r0 | grep ".opts:" | grep -v GAUDI | wc -l
        238
- Supporting two systems for too long
  - How can we go faster?
    - And keep everyone on board?
  - What about the old software stacks?

15 External collaboration
- History of successful examples
  - ROOT, GEANT4, CERNLIB, Gaudi, generators...
  - And less successful ones
    - LCG middleware
  - Sometimes it is not obvious that the pain is worth the gain
- We cannot ignore reality
  - Funding agencies are increasingly asking questions about our software sharing
  - We are too few to do everything ourselves
- Development is fun, maintenance less so
  - Including a (long-term) maintenance strategy might make compromises with third parties more attractive

16 LHCbComputing Ready to become a (young) adult?

