Software for HL-LHC
David Rousseau, LAL-Orsay, for the TDOC preparatory group, 3 Oct 2013

2 TDOC Membership
- ALICE: Pierre Vande Vyvre, Thorsten Kollegger, Predrag Buncic
- ATLAS: David Rousseau, Benedetto Gorini, Nikos Konstantinidis
- CMS: Wesley Smith, Christoph Schwick, Ian Fisk, Peter Elmer
- LHCb: Renaud Legac, Niko Neufeld

3 LHC Context
- HL-LHC events:
  - more complex (pile-up ~150 rather than ~25 in Run 1, also energy x ~2)
    - No reliable estimate today of the impact on CPU, as existing code shows non-linear divergence; indicatively, multiplicity increases by a factor 8
  - higher read-out rates (factor ~10)
- Flat resources (in euros) and Moore's law give us a factor 10 in CPU power (if and only if we can use the processors as efficiently as today!)
- Handling the added complexity of HL-LHC events, and maintaining/improving processor efficiency, will rely on software improvements
- Run 2 already has some of these issues in a milder form

4 Major crisis?
- Ariane V maiden flight: use of the same software on a faster rocket...

5 CPU Context
- More cores per CPU (and possibly specialised cores)
  - So far largely ignored: treat one core as one full CPU. Fine for O(10) cores
  - Will break down for O(100) cores, calling for macro-parallelism handled at framework level
  - Relatively small impact on software developers if the framework is smart enough
  - However, it does not help throughput unless there is enough I/O bandwidth and memory
- Micro-parallelism (see the sketch below)
  - Modern cores are not used efficiently by HEP software: many cache misses. SIMD (Single Instruction Multiple Data), ILP (Instruction Level Parallelism), etc. to be exploited
  - GPUs
  - Expert task: focus on hot spots
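As an illustration of micro-parallelism (my sketch, not from the slides): a minimal C++ loop written so the compiler can apply SIMD auto-vectorisation; the function and data names are hypothetical.

```cpp
#include <cmath>
#include <vector>

// Hypothetical hot spot: transverse momentum from separate px/py arrays.
// Keeping the data in contiguous arrays (rather than scattered objects)
// lets the compiler emit SIMD instructions when built with e.g. -O3 -march=native.
void compute_pt(const std::vector<float>& px,
                const std::vector<float>& py,
                std::vector<float>& pt)
{
    const std::size_t n = pt.size();
    for (std::size_t i = 0; i < n; ++i) {
        // straight-line arithmetic, no branches, no calls: vectorises well
        pt[i] = std::sqrt(px[i] * px[i] + py[i] * py[i]);
    }
}
```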

6 LHC experiments code base
- LHC experiments code base:
  - ~5 million lines of code per experiment
  - written by up to 1000 people per experiment since 2000
  - up to 300 people active per experiment (but dropping; new blood needed)
- Who are they?
  - Very few software engineers
  - A few physicists with very strong software expertise
  - Many physicists with ad-hoc software experience
- All these people need to take part in the transition

7 One core, one job
- Today, the typical grid workhorse is an 8-core CPU with 16 GB of memory (2 GB/core)
- Each core is addressed by the batch system as a separate processor
- Each job processes events one by one, running a finite set of algorithms in sequence
- One processor may simultaneously handle e.g. one ATLAS reconstruction job, 3 CMS simulation jobs and 4 LHCb analysis jobs
- This works (today), but with disorganised competition for resources such as memory and I/O

8 One processor, one job
- Available today (GaudiMP, AthenaMP) but not used in production
- One job goes to one processor (which is completely free)
- The framework distributes event processing to all cores, while sharing common memory (code, conditions, ...); see the sketch below
- No change to algorithmic code required (in principle)
- ~50% reduction in memory achieved (w.r.t. independent jobs)
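Not the actual GaudiMP/AthenaMP code, just a minimal sketch of the underlying idea: fork worker processes after initialisation so that read-only memory (code, geometry, conditions) is shared copy-on-write. The event-assignment scheme and sizes are invented for illustration.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

// Pretend this is the large, read-only state loaded at initialisation
// (detector geometry, conditions, magnetic field map, ...).
std::vector<double> conditions(10'000'000, 1.0);

void process_events(int worker, int n_workers) {
    // Hypothetical round-robin event assignment; a real framework would
    // pull event ranges from a shared queue instead.
    for (int evt = worker; evt < 1000; evt += n_workers)
        std::printf("worker %d processing event %d\n", worker, evt);
}

int main() {
    const int n_workers = 8;            // one worker per core
    // conditions is loaded once above; after fork() its pages are shared
    // copy-on-write, which is where the ~50% memory saving comes from,
    // as long as the workers do not write to them.
    for (int w = 0; w < n_workers; ++w) {
        if (fork() == 0) {              // child process
            process_events(w, n_workers);
            _exit(0);
        }
    }
    while (wait(nullptr) > 0) {}        // parent waits for all workers
}
```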

9 Real life
- Directed acyclic graph of algorithm dependencies extracted from a real reconstruction job
- Today, algorithms run sequentially

10 Event-level parallelism
- The framework schedules algorithms intelligently from their dependency graph
- e.g. run tracking in parallel with calorimeter reconstruction, then electron ID (sketched below)
- In practice too few algorithms within one event can run in parallel
- Most cores remain idle
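A toy version of the slide's example, assuming nothing about any real framework: tracking and calorimeter reconstruction run concurrently with std::async, and electron ID waits for both. The algorithm stubs are hypothetical.

```cpp
#include <cstdio>
#include <future>

struct Event { int id; };

// Hypothetical algorithm stubs; a real framework would schedule these from
// declared data dependencies rather than hard-coding the order.
void run_tracking(Event& e)    { std::printf("tracking   evt %d\n", e.id); }
void run_calorimeter(Event& e) { std::printf("calo       evt %d\n", e.id); }
void run_electron_id(Event& e) { std::printf("electronID evt %d\n", e.id); }

void process(Event& e) {
    // Tracking and calorimeter reconstruction are independent: run them
    // concurrently. Electron ID needs both, so it waits on the two futures.
    auto trk  = std::async(std::launch::async, run_tracking, std::ref(e));
    auto calo = std::async(std::launch::async, run_calorimeter, std::ref(e));
    trk.get();
    calo.get();
    run_electron_id(e);
}

int main() {
    Event e{1};
    process(e);
}
```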

11 Event-level parallelism and concurrent event processing
- The framework processes several events simultaneously...
- ...and distributes algorithms intelligently to cores
  - can allocate more cores to the slowest algorithms
  - can optimise the use of specialised cores
- In addition to algorithm scheduling, the framework provides services to pipeline access to resources (I/O, conditions, message logging, ...)
- Algorithms should be thread-safe: no global objects (except through the framework), only thread-safe services and libraries used (see the sketch below)
- Algorithms do not need to handle threads themselves
- A regular physicist software developer can write algorithms with ad-hoc training, a.k.a. the Holy Grail
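A minimal sketch (not Gaudi or CMSSW code) of what "thread-safe algorithm" means in practice: all per-event state lives in the event context, the algorithm holds only configuration fixed at initialisation, and execute() is const so the scheduler may call it concurrently on many events. Names are hypothetical.

```cpp
#include <vector>

// Hypothetical event store: per-event data lives here, never in the algorithm.
struct EventContext {
    std::vector<double> cell_energies;    // input
    std::vector<double> cluster_energies; // output
};

class ClusteringAlg {
public:
    explicit ClusteringAlg(double threshold) : m_threshold(threshold) {}

    // const: no member is modified, so the framework may run this
    // concurrently on many events without any locking.
    void execute(EventContext& ctx) const {
        for (double e : ctx.cell_energies)
            if (e > m_threshold)
                ctx.cluster_energies.push_back(e);  // toy "clustering"
    }

private:
    const double m_threshold;  // configuration only, fixed after construction
};
```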

12 Processing steps
- (Diagram of the processing chain: Generation, Full Simulation / Fast Simulation, Detector, Reconstruction, Analysis)

13 Generators
- Each generator is a different beast, developed by a very small team (usually theorists)
- Some commonalities: HepData, HepMC event record
- Many generators (N...NLO) needed for many channels, even if "just" for systematics (e.g. ~40 generators interfaced in ATLAS)
- Significant work in each experiment to interface each generator to the experiment framework
- Overall CPU consumption is not negligible (up to 10% of the total grid)
- Little input, little output, no database connection: a candidate for parasitic use of HPC

14 Simulation
- Dominates CPU consumption on the grid
- HL-LHC: 10x read-out rate, hence 10x the number of simulated events? Even more, due to increased requirements on precision
- Continue the effort on Geant4 optimisation:
  - Geant4 multi-threaded to be released Dec 2013
  - Re-thinking core algorithms with vectorisation in mind
- Rely on a blend of Geant4 / fast simulation / parametric simulation (see the sketch below). Challenge: the optimal blend is very analysis dependent, but there is only one pot of resources.
- (Scale from the slide: Geant4 ~1000 s/evt vs 4-vector smearing ~ms/evt)
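To make "parametric / 4-vector smearing" concrete (my sketch, not any experiment's fast simulation): smear the generated momentum with a Gaussian resolution instead of tracking particles through the detector, which is what turns ~1000 s/event into milliseconds. The resolution function and its constants are invented for illustration.

```cpp
#include <random>

struct FourVector { double pt, eta, phi, m; };

// Toy parametric response: smear the true pt with a pt-dependent Gaussian
// resolution. The functional form and numbers are purely illustrative.
FourVector smear(const FourVector& truth, std::mt19937& rng) {
    const double sigma_rel = 0.01 + 0.0001 * truth.pt;   // invented resolution
    std::normal_distribution<double> gauss(1.0, sigma_rel);
    FourVector reco = truth;
    reco.pt *= gauss(rng);   // momentum scale fluctuation
    return reco;             // angles and mass left unsmeared in this toy
}

int main() {
    std::mt19937 rng(42);
    FourVector gen{50.0, 0.5, 1.2, 0.105};   // a generated muon, say
    FourVector rec = smear(gen, rng);        // "fast simulated" in microseconds
    (void)rec;
}
```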

15 Reconstruction
- Reconstructs analysis objects from raw data
- Trigger code looks more and more like (offline) reconstruction code:
  - access to more information on the event
  - desire to have exactly the same algorithms online and offline, for better control of systematics
  - still one major difference w.r.t. CPU optimisation: the trigger is optimised to reject bad events fast, rather than to reconstruct good events

16 Impact of pile-up on reconstruction
- (Plot: ATLAS reconstruction CPU time at Tier-0, in s/event, vs pile-up)
- Fight non-linearities
- Improve everywhere

17 CPU per domain (reconstruction)
- Tracking is dominant in absolute terms and also shows non-linear behaviour (ATLAS)
- However, the sum of the other domains (calorimeter, jets, muons, ...) is similar, so improvement is needed across the board
- (CMS is more tracking dominated)

18 CPU per algorithm (reconstruction)
- 20% of the algorithms are responsible for 80% of the CPU time: Pareto at work...
- However, at high pile-up and before optimisation, a handful of algorithms dominate

19 Analysis software
- Two levels of analysis (often mixed up):
  - Event combinatorics (e.g. compute the Higgs candidate mass per event):
    - needs an explicit event loop
    - ROOT or dedicated frameworks (CMSSW, Gaudi/Athena)
  - Final analysis:
    - no explicit event loop
    - TTree->Draw, histogram manipulations (snippet below)
    - limit setting and signal extraction with RooFit/RooStats
    - ROOT is definitely the framework here
- Order of kB per event, a small fraction of events used, not CPU intensive: I/O-bound jobs
- In the future: x10 larger disks, x10 larger bandwidth, but disk access rate unchanged at a few 100 Hz; sophisticated data organisation/access methods will be needed
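An illustrative ROOT macro for the "final analysis" level, with no explicit event loop; the file, tree, branch and cut names are hypothetical.

```cpp
// Minimal ROOT macro: TTree::Draw on a flat ntuple fills and plots a
// histogram inside ROOT's internal loop.
#include "TCanvas.h"
#include "TFile.h"
#include "TTree.h"

void plot_mass() {
    TFile f("higgs_ntuple.root");                       // hypothetical ntuple
    TTree* t = static_cast<TTree*>(f.Get("events"));    // hypothetical tree name
    TCanvas c("c", "Higgs candidate mass");
    // Histogram of the candidate mass for events passing a cut, booked on the fly.
    t->Draw("m_4l >> h_mass(60, 100, 160)", "pt_lead > 20");
    c.SaveAs("m4l.png");
}
```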

20 Analysis models
- (Diagram: analysis cycle from RAW and analysis objects (petabytes) through dedicated ROOT ntuples (giga/megabytes) to plots (kilobytes), with turnaround times ranging from every 4-12 months down to every minute or a few seconds)
- How much time does it take to redo a plot? (binning, cut, new variable, new calibration, new algorithm, ...)
- How much time does it take to redo a full analysis? (properly reweighted plots and values after all corrections)
- One or more dedicated intermediate datasets with event selection and information reduction; balance between completeness, resource usage, speed to use and speed to redo
- Difficulty: ~100 analysis streams/teams sharing the same resources

21 Software for GPUs
- Graphics co-processors are massively parallel, up to x100 speed-up on paper
- In practice, the task must be prepared by a traditional CPU and transferred to the GPU
- Successfully used in HEP for very focussed applications, e.g. ALICE HLT tracking (gain of a factor 3 in farm size)
- Code needs to be written from scratch using libraries such as CUDA, etc. ...
- ...and largely rewritten/re-tuned for different processors and generations
- Need to maintain a second, traditional version of the code for simulation (see the sketch below)
- Usage on the grid is unlikely/difficult due to the variety of hardware
- In the future, expect progress in generic libraries (e.g. OpenCL) which would ease maintenance (one code for all processors) at an acceptable loss in performance
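A sketch of my own (not any experiment's code) of one common way to live with the "second, traditional version" problem: keep both implementations behind a single interface and select the accelerated backend at build time. The GPU branch is only a stub here, and all names are hypothetical.

```cpp
#include <vector>

struct Hit   { float x, y, z; };
struct Track { float pt; };

// CPU reference implementation: always available, runs anywhere on the grid.
std::vector<Track> fit_tracks_cpu(const std::vector<Hit>& hits) {
    std::vector<Track> tracks;
    for (const Hit& h : hits)
        tracks.push_back({h.x + h.y});   // placeholder "fit"
    return tracks;
}

#ifdef USE_GPU
// Accelerated backend (e.g. CUDA or OpenCL), compiled only on machines that
// have the hardware; same signature as the CPU path.
std::vector<Track> fit_tracks_gpu(const std::vector<Hit>& hits);
#endif

// Single entry point for the rest of the reconstruction code, so client
// algorithms do not care which backend is present.
std::vector<Track> fit_tracks(const std::vector<Hit>& hits) {
#ifdef USE_GPU
    return fit_tracks_gpu(hits);
#else
    return fit_tracks_cpu(hits);
#endif
}
```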

22 Common software
- Can we have more common software?
- One monster package with "if (ATLAS) do_this(); if (LHCb) do_that();"? Certainly not...
- Still, we can most likely do more than we are doing right now
- Note that we are largely running on the same grid (the same processor can even run, at one time, some cores with ATLAS jobs, some with CMS, some with LHCb)
- Three angles:
  - Frameworks: introducing parallelism at different levels
  - Foundation libraries
  - Highly optimised HEP toolboxes
- Even if we do not share software, it is essential that we share experience, developer tools and infrastructure (already happening, e.g. in the concurrency forum)
- We should share tutorials, e.g. do's and don'ts for writing thread-safe code, proper use of vector libraries, etc.

23 Frameworks
- Hard to argue that different HEP experiments have really different needs w.r.t. their framework
- In practice, the LHC experiments use different frameworks (except ATLAS and LHCb, with different flavours of Gaudi)
- This is not the place to dwell on the many reasons for this
- Also, changing to a new framework is a huge migration (compared to a drop-in replacement of e.g. a minimisation algorithm)
- Still, if not code, many lessons can be shared

24 Toolboxes: the FastJet case
- FastJet's success is due to the simultaneous occurrence of:
  - a very clever sequential-recombination algorithm (order N log N vs N^3): "just" a speed-up, but a huge one at LHC multiplicities
  - a new physics algorithm, anti-kt, that became a de facto standard
  - many algorithms and plug-ins added later on
  - easy to interface as a drop-in replacement of previous in-house jet algorithms (see the snippet below)
- No attempt to oversell: the FastJet library is still interfaced with experiment-specific code for
  1) mapping the EDM back and forth
  2) preparing the input: calibration, noisy-channel masking
  3) massaging the output: more calibration
- Steps 2 and 3 are physics-loaded and depend on detector peculiarities and high-level choices (e.g. calorimeter-based jets vs particle-flow jets)
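An illustrative FastJet snippet (the particle momenta are invented) showing why it is so easy to adopt as a drop-in toolbox: a few lines cluster the event with anti-kt.

```cpp
#include "fastjet/ClusterSequence.hh"
#include <cstdio>
#include <vector>
using namespace fastjet;

int main() {
    // Input particles as four-momenta (px, py, pz, E); values are invented.
    std::vector<PseudoJet> particles = {
        PseudoJet(30.0,  5.0, 10.0, 32.0),
        PseudoJet(28.0,  6.0,  9.0, 30.0),
        PseudoJet(-2.0, 40.0,  3.0, 40.2)
    };

    // Anti-kt with R = 0.4, the de facto standard mentioned on the slide.
    JetDefinition jet_def(antikt_algorithm, 0.4);
    ClusterSequence cs(particles, jet_def);

    // Jets above 20 GeV, sorted in decreasing pt.
    std::vector<PseudoJet> jets = sorted_by_pt(cs.inclusive_jets(20.0));
    for (const PseudoJet& j : jets)
        std::printf("jet pt=%.1f eta=%.2f\n", j.pt(), j.eta());
}
```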

25 Toolboxes (2)
- The algorithm should be sufficiently CPU intensive
- Interfacing to and fro (from the experiment-specific representation to the toolbox representation) should use negligible resources (CPU and memory) and little manpower
- The most efficient data representation should be used (e.g. adjacent in memory; see the sketch below)
- Vectorisation and specific processor instructions used as much as possible (only a few experts can do this)
- Optimisation for the latest compilers and processor instruction sets
- Possible examples:
  - track extrapolation in an inhomogeneous magnetic field, track fitting (on the other hand, pattern recognition most often requires an experiment-specific implementation to maximise both CPU and physics performance)
  - calorimeter clustering, in different flavours, e.g. sliding window, topological clustering
- Benefits:
  - not necessarily a breakthrough algorithm like FastJet
  - for LHC experiments where these algorithms already exist, share the CPU optimisation manpower (experts)
  - for a new experiment, no need to reinvent the wheel
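A small illustration of the "adjacent in memory" point (my sketch, hypothetical names): a structure-of-arrays layout keeps each coordinate contiguous, which is what vectorised toolbox code typically wants.

```cpp
#include <vector>

// Array-of-structures: natural for an object-oriented EDM, but the x values
// of consecutive hits sit 24 bytes apart, which defeats SIMD loads.
struct HitAoS { double x, y, z; };
using HitsAoS = std::vector<HitAoS>;

// Structure-of-arrays: each coordinate is contiguous, so a loop over x
// streams through memory and vectorises cleanly.
struct HitsSoA {
    std::vector<double> x, y, z;
};

double sum_x(const HitsSoA& hits) {
    double s = 0.0;
    for (double xi : hits.x)   // contiguous, cache- and SIMD-friendly
        s += xi;
    return s;
}
```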

26 Foundation libraries
- Can study drop-in replacements of low-level libraries, for example:
  - arithmetic functions
  - memory management
  - random number generators
  - geometry (e.g. CLHEP) and 4-vectors (e.g. TLorentzVector); see the sketch below
- Sometimes even the right set of compiler/linker options can bring a few percent for free
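One way (my sketch, not a prescription from the slides) to make such drop-in replacements cheap: route all client code through a single hypothetical header that fixes the 4-vector type, so benchmarking an alternative library means editing only this file.

```cpp
// experiment_fourvector.h -- hypothetical central alias so the whole code
// base names a single 4-vector type; swapping the underlying library then
// touches only this header.
#pragma once
#include "TLorentzVector.h"

using FourMom = TLorentzVector;

// Thin factory hiding the construction convention of the chosen library.
inline FourMom make_fourmom(double pt, double eta, double phi, double m) {
    FourMom v;
    v.SetPtEtaPhiM(pt, eta, phi, m);
    return v;
}
```

Client code only ever writes FourMom p = make_fourmom(pt, eta, phi, m), so a faster geometry or 4-vector implementation can be tried without touching the algorithms.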

27 Conclusion
- HL-LHC: high pile-up and high read-out rate imply a large increase in processing needs
- With flat resources (in euros), and even with Moore's law holding true (likely, provided we maintain/improve efficient use of processors), this is not enough
- Future evolution of processors: many cores with less memory per core, more sophisticated processor instructions (micro-parallelism), possibility of specialised cores
  - parallel frameworks to distribute algorithms to cores, in a way that is semi-transparent to the regular physicist software developer
  - optimisation of the software to use advanced processor instructions, especially in identified hot spots (an expert task)
- The LHC experiments' code base is more than 10 million lines of code, written by more than 2000 people: a whole community to bring on board, starting essentially now, with new blood to inject
- We are already sharing effort and software; we can do much more

Back-up slides

29 Notes
- Mention Aida (see Laurent's mail)
- Revisit the Edinburgh talk
- Locality
- Multithreading
- Revisit the analysis slide
- Red Queen
- Plot of a single CPU
- Not enough images at the end