
1 Software for HL-LHC
David Rousseau, LAL-Orsay, for the TDOC preparatory group, 3 Oct 2013

2 TDOC Membership
- ALICE: Pierre Vande Vyvre, Thorsten Kollegger, Predrag Buncic
- ATLAS: David Rousseau, Benedetto Gorini, Nikos Konstantinidis
- CMS: Wesley Smith, Christoph Schwick, Ian Fisk, Peter Elmer
- LHCb: Renaud Legac, Niko Neufeld

3 LHC Context
- HL-LHC events:
  - More complex: pile-up ~150 rather than ~25 in Run 1, and also ~2x the energy. No reliable estimate today of the impact on CPU, as existing code shows a non-linear divergence; indicatively, multiplicity increases by a factor of 8.
  - Higher read-out rates (a factor of ~10).
- Flat resources (in euros) and Moore's law give us a factor of 10 in CPU power (if and only if we can use the processors as efficiently as today!)
- Handling the added complexity of HL-LHC events, and maintaining/improving processor efficiency, both rely on software improvements.
- Run 2 already has some of these issues, in a milder form.

4 Major crisis?
Ariane V maiden flight: the same software used on a faster rocket…

5 CPU Context
- More cores per CPU (and possibly specialised cores)
  - So far largely ignored: treat one core as one full CPU. Fine for O(10) cores.
  - Will break down for O(100) cores → calls for macro-parallelism, handled at the framework level.
  - Relatively small impact on software developers if the framework is smart enough.
  - However, it does not improve throughput where I/O and memory are already sufficient.
- Micro-parallelism
  - Modern cores are not used efficiently by HEP software: many cache misses. SIMD (Single Instruction Multiple Data), ILP (Instruction Level Parallelism), etc. to be used (see the sketch below).
  - GPU.
  - An expert task: focus on hot spots.
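
A minimal sketch (not from the talk) of why micro-parallelism is tied to data layout; the hit type and calibration loop are invented for the illustration. With contiguous floats, a compiler at -O3 -march=native can auto-vectorise the loop into SIMD instructions, while pointer-chasing through scattered heap objects causes the cache misses mentioned above and defeats vectorisation.

```cpp
#include <cstddef>
#include <vector>

// Vectorisation-friendly: one tight loop over contiguous memory; the
// compiler can emit SIMD multiplies over whole lanes of floats.
void scale_contiguous(std::vector<float>& energies, float calib) {
    for (std::size_t i = 0; i < energies.size(); ++i)
        energies[i] *= calib;
}

// Cache-unfriendly: each hit lives elsewhere on the heap, so every
// iteration risks a cache miss and the loop does not vectorise.
struct Hit { float energy; };
void scale_scattered(std::vector<Hit*>& hits, float calib) {
    for (Hit* h : hits)
        h->energy *= calib;
}

int main() {
    std::vector<float> energies(1000, 1.0f);
    scale_contiguous(energies, 1.02f);   // toy calibration factor
}
```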

6 LHC experiments code base
- ~5 million lines of code per experiment
- Written by up to 1000 people per experiment since 2000
- Up to 300 people active per experiment (but dropping over 2011-2013) (new blood needed)
- Who are they?
  - Very few software engineers
  - A few physicists with very strong software expertise
  - Many physicists with ad-hoc software experience
- All these people need to take part in the transition

7 One core, one job
- Today, the typical grid workhorse is an 8-core CPU with 16 GB of memory (2 GB/core)
- Each core is addressed by the batch system as a separate processor
- Each job processes events one by one, running a sequence of algorithms on each
- One processor may simultaneously handle e.g. one ATLAS reco job, 3 CMS simulation jobs, and 4 LHCb analysis jobs
- This works (today), but leads to disorganised competition for resources such as memory and I/O

8 One processor, one job
- Available today (GaudiMP, AthenaMP) but not used in production
- One job goes to one processor (which it gets entirely to itself)
- The framework distributes event processing to all cores, while sharing common memory (code, conditions, …); see the fork sketch below
- No change to algorithmic code required (in principle)
- ~50% reduction in memory achieved (w.r.t. independent jobs)
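
The pattern behind this mode can be sketched with plain fork(): this is not GaudiMP/AthenaMP code, just the copy-on-write trick it relies on. The parent initialises once (code, geometry, conditions), then forks one worker per core; Linux shares the parent's read-only pages between workers, which is where the memory saving comes from. initialise_framework and process_events are stand-ins.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Stand-ins for framework initialisation and the per-worker event loop.
void initialise_framework() { /* load libraries, geometry, conditions... */ }
void process_events(int worker, int nworkers) {
    for (int evt = worker; evt < 100; evt += nworkers)   // round-robin split
        std::printf("worker %d processes event %d\n", worker, evt);
}

int main() {
    initialise_framework();          // large read-only allocations happen here
    const int nworkers = 8;          // e.g. one worker per core
    for (int w = 0; w < nworkers; ++w) {
        if (fork() == 0) {           // child shares parent pages copy-on-write
            process_events(w, nworkers);
            _exit(0);
        }
    }
    while (wait(nullptr) > 0) {}     // parent reaps all workers
}
```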

9 Real life
- Directed acyclic graph extracted from a real reco job
- Today, algorithms run sequentially

10 Event-level parallelism
- The framework schedules algorithms intelligently from their dependency graph (see the sketch below)
- E.g. run tracking in parallel with the calorimeter, then electron ID
- In practice, too few algorithms can run in parallel → most cores remain idle
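
A sketch of dependency-driven scheduling, assuming Intel TBB's flow graph (one of the technologies explored by framework prototypes in this area); the three "algorithms" are placeholders. tracking and calo share no edge, so the runtime is free to run them in parallel; electron_id fires only once both have completed.

```cpp
#include <tbb/flow_graph.h>
#include <cstdio>

int main() {
    using namespace tbb::flow;
    graph g;

    continue_node<continue_msg> tracking(g,
        [](const continue_msg&) { std::puts("tracking"); });
    continue_node<continue_msg> calo(g,
        [](const continue_msg&) { std::puts("calorimeter"); });
    continue_node<continue_msg> electron_id(g,
        [](const continue_msg&) { std::puts("electron ID"); });

    // Electron ID needs both tracks and calorimeter clusters; tracking and
    // calo are independent and may execute concurrently.
    make_edge(tracking, electron_id);
    make_edge(calo, electron_id);

    tracking.try_put(continue_msg());   // kick off the two independent roots
    calo.try_put(continue_msg());
    g.wait_for_all();                   // electron_id runs after both finish
}
```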

11 Concurrent event processing
- The framework processes several events simultaneously…
- …and distributes algorithms intelligently to the cores
  - Can allocate more cores to the slowest algorithms
  - Can optimise the use of specialised cores
- In addition to algorithm scheduling, the framework provides services to pipeline access to resources (I/O, conditions, message logging, …)
- Algorithms should be thread-safe: no global objects (except through the framework), only thread-safe services and libraries (see the example below)
- Algorithms do not need to handle threads themselves → a regular software physicist can write algorithms with ad-hoc training
- a.k.a. the Holy Grail
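
What "thread-safe algorithm" means in practice, as a toy illustration (not any experiment's code): per-event state stays local to the execute method, so any number of events can be in flight, and the only shared state is updated atomically.

```cpp
#include <atomic>
#include <vector>

// DON'T: a file-scope scratch buffer is corrupted as soon as two events
// are processed concurrently.
static std::vector<double> g_scratch;               // shared mutable state

// DO: keep per-event state local; make shared counters atomic.
class ClusterCounter {
    std::atomic<long> n_clusters_{0};               // safe concurrent updates
public:
    void execute(const std::vector<double>& cells) {
        std::vector<double> accepted;               // local: one per event in flight
        for (double e : cells)
            if (e > 0.1) accepted.push_back(e);     // toy "clustering" cut
        n_clusters_ += static_cast<long>(accepted.size());
    }
    long total() const { return n_clusters_.load(); }
};

int main() {
    ClusterCounter counter;
    counter.execute({0.05, 0.3, 1.2});              // would be one task per event
    return counter.total() == 2 ? 0 : 1;
}
```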

12 Processing steps
[Diagram: Generation → Full Simulation / Fast Simulation (of the Detector) → Reconstruction → Analysis]

13 Generators
- Each generator is a different beast, developed by a very small team (usually theorists)
- Some commonalities: HepData, the HepMC event record
- Many generators (N…NLO) are needed for many channels (even if "just" for systematics); e.g. ~40 generators are interfaced in ATLAS
- Significant work in each experiment to interface each generator to the experiment framework
- Overall CPU consumption is not negligible (up to 10% of the total grid)
- Little input, little output, no database connection: a candidate for parasitic use of HPC

14 Simulation
- Dominates CPU consumption on the grid
- HL-LHC: 10x read-out rate → 10x the number of simulated events? Even more, due to increased requirements on precision
- Continue the effort on G4 optimisation:
  - G4 10.0, multi-threaded, to be released Dec 2013 (see the skeleton below)
  - Re-thinking core algorithms with vectorisation in mind
- Rely on a blend of G4 / fast sim / parametric simulation. Challenge: the optimal blend is very analysis-dependent, but there is only one pot of resources.
[Diagram: speed spectrum from full Geant4 (~1000 s/evt) to 4-vector smearing (~ms/evt)]
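
A skeleton (not a complete program) of what Geant4 10.0's event-level multi-threading looks like from user code: G4MTRunManager replaces G4RunManager and farms events out to worker threads. FTFP_BERT is a standard physics list; MyDetector and MyActions stand in for the user's mandatory detector-construction and action-initialisation classes, which are assumed here.

```cpp
#include "G4MTRunManager.hh"
#include "FTFP_BERT.hh"
// MyDetector : public G4VUserDetectorConstruction   (assumed, user-defined)
// MyActions  : public G4VUserActionInitialization   (assumed, user-defined)

int main() {
    G4MTRunManager* runManager = new G4MTRunManager;
    runManager->SetNumberOfThreads(8);                 // e.g. one per core

    runManager->SetUserInitialization(new MyDetector); // geometry
    runManager->SetUserInitialization(new FTFP_BERT);  // physics list
    runManager->SetUserInitialization(new MyActions);  // primaries, user actions

    runManager->Initialize();
    runManager->BeamOn(1000);                          // events shared by threads
    delete runManager;
}
```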

15 Reconstruction
- Reconstructs analysis objects from raw data
- Trigger code looks more and more like (offline) reconstruction code:
  - Access to more information on the event
  - Desire to have exactly the same algorithms online and offline, for better control of systematics
  - Still one major difference w.r.t. CPU optimisation: the trigger is optimised to reject bad events fast, rather than to reconstruct good events

16 Impact of pile-up on reco
[Plot: ATLAS reconstruction CPU time (s/event) at Tier0 vs pile-up, 2011 and 2012]
- Fight non-linearities
- Improve everywhere

17 CPU per domain (reco)
- Tracking is dominant, both in absolute terms and in its non-linear behaviour
- However, the sum of the other domains (calorimeter, jets, muons, …) is similar → need to improve across the board
- (CMS is more tracking-dominated)
[ATLAS plot: reconstruction CPU per domain]

18 CPU per algorithm (reco)
- 20% of the algorithms are responsible for 80% of the CPU time: Pareto at work…
- However, at high pile-up and before optimisation, a handful of algorithms dominate

19 Analysis software
- Two levels of analysis (often mixed up):
  - Event combinatorics (e.g. compute a Higgs candidate mass per event):
    - Needs an explicit event loop
    - ROOT, or dedicated frameworks: CMSSW, Gaudi/Athena
  - Final analysis:
    - No explicit event loop: TTree::Draw, histogram manipulations, limit/signal setting with RooFit/RooStats (see the toy macro below)
    - ROOT is definitely the framework here
- Order of kB per event, a small fraction of events used, not CPU-intensive → I/O-bound jobs
- In the future: 10x larger disks, 10x larger bandwidth, but disk access rates unchanged at a few 100 Hz → sophisticated data organisation/access methods will be needed
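
A toy ROOT macro for the no-explicit-event-loop style (the file name "ntuple.root", tree name "events" and branch names are invented for the example): TTree::Draw runs the loop internally, which is why a plot with a new cut or binning turns around in seconds.

```cpp
// Run with: root -l plot.C
#include "TFile.h"
#include "TTree.h"

void plot() {
    TFile* f = TFile::Open("ntuple.root");       // hypothetical ntuple file
    TTree* t = nullptr;
    f->GetObject("events", t);                   // hypothetical tree name

    // Fill a 50-bin histogram of the four-lepton mass for events passing a
    // cut; changing the cut or the binning is a one-line edit.
    t->Draw("m_llll>>h(50,100.,150.)", "pt1>20");
}
```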

20 Analysis models
[Diagram: analysis cycle: RAW (petabytes) → analysis objects (petabytes) → dedicated ROOT ntuple (giga/megabytes) → plot (kilobytes); turnaround ranges from 1-3 months and every 4-12 months at the RAW end, down to every minute, a few seconds each, for plots]
- How much time does it take to redo a plot (binning, cut, new variable, new calibration, new algorithm, …)?
- How much time does it take to redo a full analysis (properly reweighted plots and values after all corrections)?
- One or more dedicated intermediate datasets, with event selection and information reduction: a balance between completeness, resource usage, speed to use, and speed to redo
- Difficulty: ~100 analysis streams/teams sharing the same resources

21 Software for GPU
- Graphics co-processors are massively parallel: up to 100x speed-up on paper
- In practice, the task must be prepared by a traditional CPU and transferred to the GPU
- Successfully used in HEP for very focussed applications, e.g. ALICE HLT tracking (a factor 3 gain in farm size)
- Code needs to be written from scratch using libraries such as CUDA, etc. …
- …and largely rewritten/retuned again for a different processor generation
- A second, traditional version of the code needs to be maintained for simulation
- Usage on the grid is unlikely/difficult due to the variety of hardware
- In the future, expect progress in generic libraries (e.g. OpenCL), which would ease maintenance (one code for all processors) at an acceptable loss in performance

22 Common software
- Can we have more common software?
- One monster software with if (ATLAS) do_this(); if (LHCb) do_that();? Certainly not…
- Still, we can most likely do more than we are doing right now
- Note that we largely run on the same grid (the same processor can even run, at one time, some cores with ATLAS jobs, some with CMS, some with LHCb)
- Three angles:
  - Frameworks: introducing parallelism at different levels
  - Foundation libraries
  - Highly optimised HEP toolboxes
- Even if we do not share software, it is essential that we share experience, developer tools and infrastructure (already happening, e.g. in the concurrency forum)
- We should share tutorials, e.g. do's and don'ts for writing thread-safe code, proper use of vector libraries, etc.

23 Frameworks
- Hard to argue that different HEP experiments have really different needs w.r.t. their frameworks
- In practice, the LHC experiments use different frameworks (except ATLAS and LHCb, with different flavours of Gaudi)
- This is not the place to dwell on the many reasons for that
- Also, changing to a new framework is a huge step (compared to a drop-in replacement of e.g. a minimisation algorithm)
- Still, if not code, many lessons can be shared

24 Toolboxes: the FastJet case
- FastJet's success is due to the simultaneous occurrence of:
  - A very clever sequential-recombination algorithm (order n·log(n) vs n³): "just" a speed-up, but a huge one at LHC multiplicities
  - A new physics algorithm, anti-kt, which became a de facto standard
  - Many algorithms and plug-ins added later on
  - Ease of interfacing as a drop-in replacement for previous in-house jet algorithms (see the usage sketch below)
- No attempt to oversell: the FastJet library is still interfaced with experiment-specific code for:
  1. Mapping the EDM back and forth
  2. Preparing the input: calibration, noisy-channel masking
  3. Massaging the output: more calibration
- Steps 2 and 3 are physics-loaded, and depend on detector peculiarities and high-level choices (e.g. calorimeter-based jets vs particle-flow jets)
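
A usage sketch of the drop-in interface (real FastJet API; the input four-momenta are toy values). The experiment-specific steps 1-3 above wrap around exactly this call.

```cpp
#include "fastjet/ClusterSequence.hh"
#include <vector>
#include <cstdio>

int main() {
    using namespace fastjet;
    // Step 1: map the experiment EDM to PseudoJets (px, py, pz, E).
    std::vector<PseudoJet> particles = {
        {40.0, 10.0, 5.0, 42.0}, {38.0, 12.0, 4.0, 40.5}, {-25.0, 3.0, 1.0, 25.3}
    };

    // Anti-kt with R = 0.4: the de facto standard mentioned above.
    JetDefinition jet_def(antikt_algorithm, 0.4);
    ClusterSequence cs(particles, jet_def);
    std::vector<PseudoJet> jets = sorted_by_pt(cs.inclusive_jets(20.0)); // pt > 20 GeV

    for (const PseudoJet& j : jets)
        std::printf("jet pt=%.1f rapidity=%.2f\n", j.pt(), j.rap());
}
```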

25 Toolboxes (2)
- The algorithm should be sufficiently CPU-intensive
- Interfacing to and fro (between the experiment-specific representation and the toolbox representation) should use negligible resources (CPU and memory) and little manpower
- The most efficient data representation should be used, e.g. adjacent in memory (see the layout sketch below)
- Vectorisation and specific processor instructions used as much as possible (as few experts can do it)
- Optimisation for the latest compilers and processor instruction sets
- Possible examples:
  - Track extrapolation in an inhomogeneous magnetic field, and track fitting (on the other hand, pattern recognition most often requires an experiment-specific implementation to maximise both CPU and physics performance)
  - Calorimeter clustering, in different flavours, e.g. sliding window, topological clustering
- Benefits:
  - Not necessarily a breakthrough algorithm like FastJet
  - For the LHC experiments, where these algorithms already exist: share the CPU-optimisation manpower (the experts)
  - For a new experiment: no need to reinvent the wheel
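
The "adjacent in memory" point is the classic array-of-structures vs structure-of-arrays choice; a toy illustration (not any experiment's EDM). In the SoA form, a pass that touches only x and y streams through contiguous memory and is easy for the compiler to vectorise.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Array-of-structures: the fields of one hit are adjacent, so a loop that
// only needs x and y strides over unused z and energy -> wasted cache lines.
struct HitAoS { double x, y, z, energy; };

// Structure-of-arrays: each field is contiguous across all hits.
struct HitsSoA {
    std::vector<double> x, y, z, energy;
};

double sum_radius(const HitsSoA& h) {
    double s = 0.0;
    for (std::size_t i = 0; i < h.x.size(); ++i)
        s += std::sqrt(h.x[i] * h.x[i] + h.y[i] * h.y[i]);  // contiguous reads
    return s;
}

int main() {
    HitsSoA hits;
    hits.x = {3.0, 6.0};  hits.y = {4.0, 8.0};
    hits.z = {0.0, 0.0};  hits.energy = {1.0, 2.0};
    return sum_radius(hits) == 15.0 ? 0 : 1;       // 5 + 10
}
```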

26 Foundation libraries
- Can study drop-in replacements of low-level libraries, for example:
  - Arithmetic functions
  - Memory management
  - Random number generators
  - Geometry (e.g. CLHEP) and 4-vectors (e.g. TLorentzVector); see the example below
- Sometimes even the right set of compiler/linker options can bring a few percent for free
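
As a sketch of such a drop-in study (assuming ROOT's GenVector classes from Math/Vector4D.h as the candidate replacement; the kinematic values are arbitrary): the two 4-vector types carry the same physics content, so call sites can be migrated one by one and the results compared.

```cpp
#include "Math/Vector4D.h"
#include "TLorentzVector.h"
#include <cstdio>

int main() {
    // Legacy interface.
    TLorentzVector a;
    a.SetPtEtaPhiM(50.0, 1.2, 0.3, 0.105);

    // Candidate replacement: template-based GenVector, no virtual table.
    ROOT::Math::PtEtaPhiMVector b(50.0, 1.2, 0.3, 0.105);

    std::printf("mass: %.3f vs %.3f\n", a.M(), b.M());  // same physics result
}
```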

27 Conclusion
- HL-LHC: high pile-up and high read-out rate → a large increase in processing needs
- With flat resources (in euros), and even with Moore's law holding true (likely, provided we maintain/improve the efficient use of processors), this is not enough
- Future evolution of processors: many cores with less memory per core, more sophisticated processor instructions (micro-parallelism), possibly specialised cores →
  - Parallel frameworks to distribute algorithms to cores, semi-transparently for the regular physicist software developer
  - Optimisation of the software to use high-level processor instructions, especially in identified hot spots (an expert task)
- The LHC experiments' code base: more than 10 million lines of code, written by more than 2000 people → a whole community to embark, starting essentially now, with new blood to inject
- We are already sharing effort and software. We can do much more: http://concurrency.web.cern.ch

28 Back-up slides

29 Notes
- Mention Aida (see Laurent's mail)
- Revisit the Edinburgh talk
- Locality
- Multithreading
- Revisit the analysis slide
- Red Queen
- Plot of one CPU
- Not enough images at the end

