1 Software for HL-LHC
David Rousseau, LAL-Orsay, for the TDOC preparatory group, 3 Oct 2013
2 TDOC Membership
- ALICE: Pierre Vande Vyvre, Thorsten Kollegger, Predrag Buncic
- ATLAS: David Rousseau, Benedetto Gorini, Nikos Konstantinidis
- CMS: Wesley Smith, Christoph Schwick, Ian Fisk, Peter Elmer
- LHCb: Renaud Legac, Niko Neufeld
3 LHC Context
HL-LHC events are:
- More complex (pile-up ~150 rather than ~25 in Run 1, plus roughly twice the energy). There is no reliable estimate today of the impact on CPU, as existing code shows non-linear divergence; indicatively, multiplicity increases by a factor 8.
- Read out at higher rates (factor ~10).
Flat resources (in euros) and Moore's law give us a factor 10 in CPU power (if and only if we can use the processors as efficiently as today!).
Handling the added complexity of HL-LHC events, and maintaining/improving processor efficiency, will rely on software improvements.
Run 2 already has some of these issues in a milder form.
4 Major crisis?
Ariane V maiden flight: use of the same software on a faster rocket…
5 CPU Context
More cores per CPU (and possibly specialised cores):
- So far largely ignored: treat one core as one full CPU. Fine for O(10) cores.
- Will break down for O(100) cores, calling for macro-parallelism handled at framework level.
- Relatively small impact on software developers if the framework is smart enough.
- However, this does not by itself increase throughput if there is already enough I/O and memory.
Micro-parallelism:
- Modern cores are not used efficiently by HEP software: many cache misses. SIMD (Single Instruction Multiple Data), ILP (Instruction Level Parallelism), etc. to be exploited (see the sketch below).
- GPUs.
- An expert task: focus on hot spots.
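The SIMD point can be made concrete with a minimal C++ sketch (not from the talk; function and variable names are invented): a calibration loop written so that the compiler can auto-vectorise it, with contiguous data and a branch-free body.

```cpp
#include <vector>
#include <cstddef>

// Toy energy calibration applied to a flat array of cell energies.
// Contiguous storage and a branch-free loop body let the compiler emit
// SIMD instructions (e.g. with -O2 -ftree-vectorize or -O3).
void calibrate(std::vector<float>& energies, float scale, float offset) {
    float* e = energies.data();
    const std::size_t n = energies.size();
    for (std::size_t i = 0; i < n; ++i) {
        e[i] = e[i] * scale + offset;   // multiply-add, vectorisable
    }
}
```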
6 LHC experiments code base
- ~5 million lines of code per experiment
- written by up to 1000 people per experiment since 2000
- up to 300 people active per experiment (but dropping in 2011, 2012, 2013): new blood needed
Who are they?
- Very few software engineers
- A few physicists with very strong software expertise
- Many physicists with ad-hoc software experience
All these people need to take part in the transition.
7 One core, one job
- Today, the typical grid workhorse is an 8-core CPU with 16 GB of memory (2 GB/core).
- Each core is addressed by the batch system as a separate processor.
- Each job processes events one by one, running a finite sequence of algorithms on each.
- One processor may simultaneously handle, e.g., one ATLAS reco job, 3 CMS simulation jobs and 4 LHCb analysis jobs.
- This works (today), but leads to disorganised competition for resources such as memory and I/O.
8 One processor, one job
- Available today (GaudiMP, AthenaMP) but not used in production.
- One job goes to one processor (which is entirely dedicated to it).
- The framework distributes event processing to all cores while sharing common memory (code, conditions, …).
- No change to algorithmic code required (in principle).
- ~50% reduction in memory achieved (w.r.t. independent jobs).
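As an illustration of the multi-process idea, here is a toy sketch (not the actual GaudiMP/AthenaMP implementation; all names are invented): initialise the heavy state once in the parent, then fork workers so that code and conditions pages are shared copy-on-write while each worker processes its own slice of events.

```cpp
#include <unistd.h>
#include <sys/wait.h>
#include <cstdio>
#include <vector>

// Stand-in for the memory-heavy state (code, geometry, conditions):
// set up once in the parent, then shared copy-on-write with the workers.
struct EventProcessor {
    void process(int event) const {
        std::printf("pid %d: event %d\n", (int)getpid(), event);
    }
};

int main() {
    const int nWorkers = 4;
    const int nEvents  = 100;
    EventProcessor proc;                       // initialise before forking

    std::vector<pid_t> children;
    for (int w = 0; w < nWorkers; ++w) {
        pid_t pid = fork();
        if (pid == 0) {                        // child: process its slice of events
            for (int e = w; e < nEvents; e += nWorkers) proc.process(e);
            _exit(0);
        }
        children.push_back(pid);
    }
    for (pid_t pid : children) waitpid(pid, nullptr, 0);   // parent waits for all workers
    return 0;
}
```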
9 Real life
Directed acyclic graph of algorithm dependencies, extracted from a real reconstruction job. Today, the algorithms run sequentially.
10 Event-level parallelism
- The framework schedules the algorithms intelligently from their dependency graph, e.g. run tracking in parallel with the calorimeter, then electron ID (see the sketch below).
- In practice, too few algorithms can run in parallel: most cores remain idle.
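A minimal sketch of dependency-driven scheduling with plain C++ futures (the algorithm names are invented stand-ins): the independent branches of the graph run concurrently, and the dependent algorithm starts only once both have finished.

```cpp
#include <future>
#include <cstdio>

// Hypothetical algorithm stubs standing in for real reconstruction code.
void tracking()    { std::puts("tracking done"); }
void calorimeter() { std::puts("calorimeter done"); }
void electronID()  { std::puts("electron ID done"); }

int main() {
    // Independent branches of the dependency graph run concurrently...
    auto trk  = std::async(std::launch::async, tracking);
    auto calo = std::async(std::launch::async, calorimeter);
    trk.get();
    calo.get();
    // ...and an algorithm that needs both runs only once they have completed.
    electronID();
    return 0;
}
```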
11 Event-level concurrent event processing
- The framework processes several events simultaneously and distributes the algorithms intelligently to the cores: it can allocate more cores to the slowest algorithms and optimise the use of specialised cores.
- In addition to algorithm scheduling, the framework provides services to pipeline access to resources (I/O, conditions, message logging, …).
- Algorithms should be thread-safe: no global objects (except through the framework), only thread-safe services and libraries.
- Algorithms do not need to handle threads themselves: a regular software physicist can write algorithms with ad-hoc training.
- a.k.a. the Holy Grail
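What "thread-safe algorithm" means in practice can be sketched as follows (an illustrative interface, not the Gaudi one; names are invented): configuration is fixed at construction, the per-event execute method is const and keeps all per-event state on the stack, so the framework may call it concurrently for several events.

```cpp
#include <vector>

struct EventData { std::vector<float> hits; };   // hypothetical event payload
struct Result    { float sumE = 0.f; };

// Thread-safe algorithm: stateless apart from configuration fixed at
// initialisation, no globals, const execute() that only reads the event.
class SumEnergyAlg {
public:
    explicit SumEnergyAlg(float threshold) : m_threshold(threshold) {}

    Result execute(const EventData& event) const {
        Result r;
        for (float e : event.hits)
            if (e > m_threshold) r.sumE += e;
        return r;                 // all per-event state lives on the stack
    }

private:
    const float m_threshold;      // configuration, immutable after construction
};
```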
12 Processing steps
[Diagram of the processing chain: Generation, Full Simulation / Fast Simulation, Detector, Reconstruction, Analysis]
13 Generators
- Each generator is a different beast, developed by a very small team (usually theorists).
- Some commonalities: HepData, HepMC event record.
- Many generators, up to N…NLO, are needed for many channels (even if "just" for systematics); e.g. ~40 generators interfaced in ATLAS.
- Significant work in each experiment to interface each generator to the experiment framework.
- Overall CPU consumption is not negligible (up to 10% of the total grid).
- Little input, little output, no database connection: a candidate for parasitic use of HPC.
14 Simulation
- Dominates CPU consumption on the grid.
- HL-LHC: 10x read-out rate means 10x more events to simulate? Even more, due to increased requirements on precision.
- Continue the effort on Geant4 optimisation:
  - Geant4 10.0, multi-threaded, to be released in Dec 2013.
  - Re-thinking core algorithms with vectorisation in mind.
- Rely on a blend of Geant4 / fast simulation / parametric simulation. Challenge: the optimal blend is very analysis-dependent, but there is only one pot of resources.
[Figure: CPU per event, from full Geant4 (~1000 s/evt) down to 4-vector smearing (~ms/evt)]
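A toy illustration of the "4-vector smearing" end of that spectrum (not any experiment's fast simulation; the resolution parameters are invented): smear the generated energy with a parametrised Gaussian resolution and rescale the momentum accordingly.

```cpp
#include <random>
#include <cmath>

struct FourVector { double px, py, pz, e; };

// Toy parametric fast simulation: smear the energy of a generated particle
// with a calorimeter-like resolution sigma(E)/E = a/sqrt(E) + b
// (a and b are hypothetical values for illustration).
FourVector smear(const FourVector& gen, std::mt19937& rng) {
    const double a = 0.10, b = 0.01;
    const double sigma = gen.e * (a / std::sqrt(gen.e) + b);
    std::normal_distribution<double> gauss(gen.e, sigma);
    const double eSmeared = gauss(rng);
    const double k = eSmeared / gen.e;     // rescale the momentum with the energy
    return { gen.px * k, gen.py * k, gen.pz * k, eSmeared };
}
```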
15 Reconstruction
- Reconstruct analysis objects from raw data.
- Trigger code is looking more and more like (offline) reconstruction code:
  - access to more information on the event;
  - desire to have exactly the same algorithms online and offline, for better control of systematics;
  - still one major difference w.r.t. CPU optimisation: the trigger is optimised to reject bad events quickly, rather than to reconstruct good events.
16 Impact of pile-up on reco
[Figure: ATLAS reconstruction CPU time (s/event) at Tier-0 vs pile-up, 2011 and 2012.]
Fight non-linearities; improve everywhere.
17 CPU per domain (reco)
[ATLAS figure: reconstruction CPU time per domain.]
Tracking is dominant in absolute terms and also shows the most non-linear behaviour. However, the sum of the other domains (calorimeter, jets, muons, …) is of similar size: improvements are needed across the board. (CMS is more tracking-dominated.)
18 CPU per algorithm (reco)
20% of the algorithms are responsible for 80% of the CPU time: Pareto at work… However, at high pile-up and before optimisation, a handful of algorithms dominate.
19 Analysis software
Two levels of analysis (often mixed up):
- Event combinatorics (e.g. compute the Higgs candidate mass per event): needs an explicit event loop; ROOT or dedicated frameworks (CMSSW, Gaudi/Athena).
- Final analysis: no explicit event loop; TTree::Draw, histogram manipulations, limit/signal setting with RooFit/RooStats. ROOT is definitely the framework here (see the sketch below).
Order of kB per event, a small fraction of events used, not CPU-intensive: I/O-bound jobs.
In the future: 10x larger disks and 10x larger bandwidth, but disk access rates unchanged at a few 100 Hz, so sophisticated data organisation/access methods will be needed.
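The "no explicit event loop" style of final analysis can be sketched with a short ROOT macro (file, tree and branch names are invented for the example):

```cpp
// Minimal ROOT macro illustrating final analysis without a hand-written
// event loop. "ntuple.root", "events", "higgs_mass" and "lead_pt" are
// hypothetical names.
#include "TFile.h"
#include "TTree.h"

void plot_mass() {
    TFile* f = TFile::Open("ntuple.root");
    TTree* t = nullptr;
    f->GetObject("events", t);
    // One line: fill a 100-bin histogram of the candidate mass for events
    // passing a transverse-momentum cut.
    t->Draw("higgs_mass >> h_mass(100, 0, 200)", "lead_pt > 20");
}
```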
20 Analysis cycle / analysis models
[Diagram: RAW (petabytes) → analysis objects (petabytes) → dedicated ntuples (giga/megabytes) → ROOT plots (kilobytes).]
- How much time does it take to redo a plot (binning, cut, new variable, new calibration, new algorithm, …)? A few seconds, done every minute.
- How much time does it take to redo a full analysis (properly reweighted plots and values after all corrections)? 1-3 months, done every 4-12 months.
- One or more dedicated intermediate datasets with event selection and information reduction. Balance between completeness, resource usage, speed to use and speed to redo.
- Difficulty: ~100 analysis streams/teams sharing the same resources.
21 Software for GPUs
- Graphics co-processors are massively parallel: up to x100 speed-up on paper.
- In practice, tasks must be prepared by a traditional CPU and transferred to the GPU.
- Successfully used in HEP for very focussed usage, e.g. ALICE HLT tracking (gain of a factor 3 in farm size).
- Code needs to be written from scratch using libraries such as CUDA, etc., and largely rewritten/retuned again for different processors and generations.
- Need to maintain a second, traditional version of the code for simulation.
- Usage on the grid is unlikely/difficult due to the variety of hardware.
- In the future, expect progress in generic libraries (e.g. OpenCL), which would ease maintenance (one code for all processors) at an acceptable loss in performance.
22 Common software
Can we have more common software? One monster software with if (ATLAS) do_this(); if (LHCb) do_that();? Certainly not… Still, we can most likely do more than we are doing right now.
Note that we are largely running on the same grid (even a single processor can at one time run some cores with ATLAS jobs, some with CMS, some with LHCb).
Three angles:
- Frameworks: introducing parallelism at different levels
- Foundation libraries
- Highly optimised HEP toolboxes
Even if we do not share software, it is essential that we share experience, developer tools and infrastructure (already happening, e.g. in the concurrency forum). We should share tutorials, e.g. do's and don'ts for writing thread-safe code, proper use of vector libraries, etc.
23 Frameworks
- Hard to argue that different HEP experiments have really different needs w.r.t. their framework.
- In practice, the LHC experiments are using different frameworks (except ATLAS and LHCb, with different flavours of Gaudi).
- This is not the place to dwell on the many reasons for this.
- Also, changing to a new framework is a huge change (compared to a drop-in replacement of, e.g., a minimisation algorithm).
- Still, if not code, many lessons can be shared.
24 Toolboxes: the FastJet case
FastJet's success is due to the simultaneous occurrence of:
- A very clever sequential-recombination algorithm (O(N log N) vs O(N^3)): "just" a speed-up, but a huge one at LHC multiplicities.
- A new physics algorithm: anti-kt became a de facto standard.
- Many algorithms and plug-ins added later on.
- Easy to interface as a drop-in replacement of previous in-house jet algorithms.
No attempt to oversell: the FastJet library is still interfaced with experiment-specific code:
1) mapping the EDM back and forth;
2) preparing the input: calibration, noisy-channel masking;
3) massaging the output: more calibration.
Steps 2 and 3 are physics-loaded and depend on the detector peculiarities and high-level choices (e.g. calorimeter-based jets vs particle-flow jets).
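A minimal usage sketch of the FastJet interface described above (the anti-kt radius and pT threshold are arbitrary choices for illustration; in a real job the input PseudoJets would be built from the experiment's EDM):

```cpp
#include "fastjet/ClusterSequence.hh"
#include <cstdio>
#include <vector>

// Cluster a list of four-momenta with the anti-kt algorithm and print the jets.
void clusterAndPrint(const std::vector<fastjet::PseudoJet>& particles) {
    fastjet::JetDefinition jetDef(fastjet::antikt_algorithm, 0.4);   // anti-kt, R = 0.4
    fastjet::ClusterSequence cs(particles, jetDef);
    std::vector<fastjet::PseudoJet> jets =
        fastjet::sorted_by_pt(cs.inclusive_jets(20.0));              // jets with pT > 20 GeV
    for (const auto& j : jets)
        std::printf("jet pt=%.1f eta=%.2f phi=%.2f\n", j.pt(), j.eta(), j.phi());
}
```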
25 Toolboxes (2)
- The algorithm should be sufficiently CPU-intensive.
- Interfacing to and fro (from the experiment-specific representation to the toolbox representation) should use negligible resources (CPU and memory) and little manpower.
- The most efficient data representation should be used (e.g. adjacent in memory; see the sketch after this list).
- Vectorisation and specific processor instructions used as much as possible (as few experts can do it); optimisation for the latest compilers and processor instruction sets.
Possible examples:
- Track extrapolation in an inhomogeneous magnetic field, track fitting (on the other hand, pattern recognition most often requires an experiment-specific implementation to maximise both CPU and physics performance).
- Calorimeter clustering, in different flavours, e.g. sliding window, topological clustering.
Benefits:
- Not necessarily a breakthrough algorithm like FastJet.
- For LHC experiments where these algorithms already exist: share the CPU-optimisation manpower (experts).
- For new experiments: no need to reinvent the wheel.
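The data-representation bullet can be illustrated with a generic structure-of-arrays sketch (not any experiment's EDM; names are invented): each field lives in its own contiguous array, so loops over one or two quantities stay cache-friendly and vectorisable.

```cpp
#include <vector>
#include <cstddef>

// Structure-of-arrays layout: one contiguous array per field, instead of an
// array of hit objects. Loops touching only a couple of fields then stream
// through memory linearly.
struct HitsSoA {
    std::vector<float> x, y, z, energy;
};

float totalEnergyAboveZ(const HitsSoA& hits, float zMin) {
    float sum = 0.f;
    for (std::size_t i = 0; i < hits.energy.size(); ++i)
        if (hits.z[i] > zMin) sum += hits.energy[i];   // reads only two arrays
    return sum;
}
```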
26 Foundation libraries
Can study drop-in replacements of low-level libraries such as (examples):
- Arithmetic functions
- Memory management
- Random number generators
- Geometry (e.g. CLHEP) and 4-vectors (e.g. TLorentzVector); see the sketch below
Sometimes even the right set of compiler/linker options can bring a few percent for free.
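A minimal sketch of what a drop-in replacement can look like for the 4-vector example (the alias, macro and wrapper names are invented; it assumes either CLHEP or ROOT is available at build time): client code sees only a thin alias and wrapper, so the backend can be swapped and the two versions benchmarked without touching the clients.

```cpp
// Switch the 4-vector backend at build time via a (hypothetical) macro.
#ifdef USE_CLHEP
  #include "CLHEP/Vector/LorentzVector.h"
  using LorentzVec = CLHEP::HepLorentzVector;
  inline double mass(const LorentzVec& v) { return v.m(); }
#else
  #include "TLorentzVector.h"
  using LorentzVec = TLorentzVector;
  inline double mass(const LorentzVec& v) { return v.M(); }
#endif

// Client code uses only the alias and the thin wrapper, so it is unchanged
// when the underlying library is replaced.
double invariantMass(const LorentzVec& a, const LorentzVec& b) {
    return mass(a + b);
}
```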
27 Conclusion
- HL-LHC: high pile-up and high read-out rate mean a large increase in processing needs.
- With flat resources (in euros), and even with Moore's law holding true (likely, provided we maintain/improve the efficient use of processors), this is not enough.
- Future evolution of processors: many cores with less memory per core, more sophisticated processor instructions (micro-parallelism), possibility of specialised cores.
  - Parallel frameworks to distribute algorithms to cores, in a way that is semi-transparent to the regular physicist software developer.
  - Optimisation of the software to use high-level processor instructions, especially in identified hot spots (an expert task).
- The LHC experiments' code base is more than 10 million lines of code, written by more than 2000 people: a whole community to bring on board, starting essentially now, with new blood to inject.
- We are already sharing effort and software. We can do much more: http://concurrency.web.cern.ch
28 Back-up slides
29 Notes
- Mention Aida (see Laurent's mail)
- Review the Edinburgh talk
- Locality
- Multithreading
- Review the analysis slide
- Red Queen
- Plot of one CPU
- Not enough images at the end