GeantV – Parallelism, transport structure and overall performance


Andrei Gheata (CERN), for the GeantV development team
October 2016, geant-dev@cern.ch

Outlook
- Motivation & objectives
- Implementation
- Concurrent services
- Tuning knobs and adaptive behavior
- Interface to accelerators
- NUMA awareness
- Global performance and perspectives

GeantV – Adapting simulation to modern hardware
Classical simulation finds it hard to approach the full machine potential; GeantV needs to profit as much as possible from all processing pipelines.
- Classical simulation: single-event, scalar transport; embarrassing parallelism; low cache coherence; low vectorization (scalar auto-vectorization only).
- GeantV simulation: multi-event, vector-aware transport; fine-grain parallelism; high cache coherence; high vectorization (explicit multi-particle interfaces, sketched below).
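To make the contrast concrete, here is a minimal sketch of the two interface styles; the function names and the physics are placeholders, not GeantV's actual API:

```cpp
#include <cmath>
#include <cstddef>

// Multi-particle, vector-aware interface: an explicit basket of n energies,
// letting the implementation use SIMD lanes across particles.
void CrossSectionV(const double* energy, double* xsec, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)   // trivially vectorizable loop
    xsec[i] = std::exp(-energy[i]);     // placeholder physics model
}

// Classical, single-track interface: one particle at a time, relying on
// scalar auto-vectorization at best.
double CrossSection(double energy) {
  double xs;
  CrossSectionV(&energy, &xs, 1);       // scalar case reuses the vector path
  return xs;
}
```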

GeantV concurrency: static thread approach
[Diagram: tracks from the input queue pass through geometry filters (one basket per volume, Vol1..Voln) and (fast)physics filters (e+/e-, γ, neutrals); the vector stepper performs step sampling and MT vector/scalar processing through the (field) propagator and step limiter, using the VecGeom navigator on full or simplified geometry; the physics sampler applies the post-step process, and secondaries together with reshuffled outputs are re-basketized back to the scheduler; a coprocessor broker can take baskets off the CPU path.]

Scheduler features
- Efficient concurrent basketizers: filter tracks by several possible locality criteria and deliver reasonably sized vectors all along the simulation (a minimal sketch follows this list).
- Provide a scalable and balanced workload; minimize the memory footprint; minimize the cool-down phase (tails).
- Adaptive behavior to maximize performance: dynamic switching between vector and scalar processing, dynamic learning of the “important” filters, dynamic adjustment of event slots to control memory.
- Accommodate additional concurrent processing in the simulation workflow: hits/digits/kinematics I/O, digitization/reconstruction tasks.
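The following is a minimal sketch of a concurrent per-volume basketizer; the class and member names are hypothetical, not the actual GeantV code:

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

struct Track { int id; /* position, momentum, volume, ... */ };
using Basket = std::vector<Track>;

class Basketizer {
public:
  Basketizer(std::size_t vectorSize, std::function<void(Basket&&)> dispatch)
      : fVectorSize(vectorSize), fDispatch(std::move(dispatch)) {}

  // Called concurrently by transport threads for tracks located in one volume.
  void AddTrack(const Track& t) {
    Basket full;
    {
      std::lock_guard<std::mutex> lock(fMutex);
      fPending.push_back(t);
      if (fPending.size() >= fVectorSize) full.swap(fPending);
    }
    // Dispatch outside the lock to keep the critical section short.
    if (!full.empty()) fDispatch(std::move(full));
  }

private:
  std::size_t fVectorSize;                  // target vector (basket) size
  std::function<void(Basket&&)> fDispatch;  // hands full baskets to a queue
  std::mutex fMutex;
  Basket fPending;
};
```

Note that this synchronizes on every AddTrack call: this fine-grain synchronization in re-basketizing is exactly the scalability cost discussed on the sub-NUMA clustering slide below.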

Data handling challenges
[Diagram: starting from SOA track containers (fEventV, fParticleV, ...), what we want is to select tracks before doing work on the data; what we need to do is compact and move tracks between containers (single-threaded) and reshuffle/basketize them (concurrently).]

And the price to pay…
[Figure: run-time fraction spent in the different parts of GeantV, measured on a 24-core dual-socket E5-2695 v2 @ 2.40GHz (Ivy Bridge).]

Concurrent services: queues
- Important as workload-balancing tools.
- Several mutex-based and lock-free implementations were evaluated (a mutex-based baseline is sketched below).
- GeantV queues can work at ~10^5 transactions/sec.
- Lock-free queues do much better than mutex-based ones on MacOSX + clang (a factor of 50!).
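For reference, a minimal blocking work queue of the mutex-based kind that the lock-free implementations were compared against; this is a sketch, not the GeantV implementation:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

template <typename T>
class WorkQueue {
public:
  void push(T item) {
    {
      std::lock_guard<std::mutex> lock(fMutex);
      fItems.push_back(std::move(item));
    }
    fCond.notify_one();  // wake one consumer
  }

  T pop() {  // blocks until an item is available
    std::unique_lock<std::mutex> lock(fMutex);
    fCond.wait(lock, [this] { return !fItems.empty(); });
    T item = std::move(fItems.front());
    fItems.pop_front();
    return item;
  }

private:
  std::mutex fMutex;
  std::condition_variable fCond;
  std::deque<T> fItems;
};
```

Under contention every push and pop serializes on the same mutex, which is why lock-free alternatives can win by large factors on some platform/compiler combinations.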

Scheduler “knobs” (headlines only)
- Keep memory under control: limit the number of buffered events, prioritize events that are “mostly” transported, use a watermark limit to clean baskets.
- Keep the vectors up: optimize the vector size (too large: too many pending baskets; too small: inefficient vectorization), trigger postponing tracks or tracking with scalar algorithms.
- Popularity service: basketize only the “important” volumes and also adjust the basket size dynamically.

Optimization of scheduling parameters
- Depends on what needs to be optimized, e.g. memory vs. computing time.
- A multivariate problem, probably too early to optimize: development is iterative, with short cycles.
- A genetic-algorithm (GA) approach has started to be investigated.

Monitoring and learning
- Real-time monitoring tools based on ROOT have been implemented: very useful to understand the behavior of the model.
- Some parameters can be truly adaptive, such as the “important” volumes that can feed vectors: basketizing only 10% of the volumes in CMS leads to 70% of the transport being done in vector mode.
- Top CMS volumes by number of steps:
  FixedShield102880: 708955 steps
  HVQX8780: 462821 steps
  ZDC_EMLayer9b00: 83838 steps
  BeamTube22b780: 78748 steps
  OQUA6780: 62597 steps
  QuadInner3300: 56376 steps
  ZDC_EMAbsorber9d00: 53672 steps
  QuadOuter3700: 52155 steps
  QuadCoil3680: 49086 steps
  ZDC_EMFiber9e80: 41705 steps

Memory control
- Memory is determined by the number of tracks “in flight”, which is in turn determined by the number of events “in flight”.
- Controlling the memory is important for low production cuts, where the number of secondary particles can explode.
- Currently implemented: a policy to delete empty baskets when reaching a memory watermark (sketched below). Not fully effective, but it keeps the memory constant.
- Extra levers: dynamically reducing the number of events in flight (possible with the new event server), prioritizing the transport of low-energy tracks.
[Plot: queued baskets, memory and tracks in flight versus time.]
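A sketch of the watermark policy, building on the Basket type from the basketizer sketch above; the names and the exact policy are illustrative:

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Empty baskets are parked here for reuse; above the memory watermark they
// are freed instead, which is what keeps the footprint roughly constant.
class BasketRecycler {
public:
  explicit BasketRecycler(std::size_t watermarkBytes)
      : fWatermark(watermarkBytes) {}

  void Park(Basket* b) {  // an emptied basket returns here
    std::lock_guard<std::mutex> lock(fMutex);
    fEmpty.push_back(b);
  }

  // Called periodically by the scheduler with the current memory estimate.
  void Cleanup(std::size_t residentBytes) {
    if (residentBytes <= fWatermark) return;
    std::vector<Basket*> victims;
    {
      std::lock_guard<std::mutex> lock(fMutex);
      victims.swap(fEmpty);              // grab all parked empty baskets
    }
    for (Basket* b : victims) delete b;  // release memory outside the lock
  }

private:
  std::size_t fWatermark;
  std::mutex fMutex;
  std::vector<Basket*> fEmpty;
};
```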

Optimizations for dense physics: reusing tracks
- If a track interacts in the current step it remains in the same volume, so there is no need to re-basketize it: the input basket can be recycled (see the sketch below).
- This is a large gain for dense physics, where the basketizer normally becomes fully blocking.
- A large fraction of the tracks can be reused in the same thread, releasing the load on the basketizer.
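A sketch of the basket-recycling loop, building on the Track/Basket/Basketizer sketches above; DoStep and StepOutcome are placeholder stand-ins for the geometry and physics step:

```cpp
struct StepOutcome { bool sameVolume; bool done; };
StepOutcome DoStep(Track&) { return {false, true}; }  // placeholder stub

// Tracks that interact and stay in the current volume are transported again
// by the same thread instead of being handed back to the basketizer.
void TransportBasket(Basket& input, Basketizer& basketizer) {
  while (!input.empty()) {
    Basket reuse;
    for (Track& t : input) {
      StepOutcome out = DoStep(t);  // one geometry + physics step
      if (out.done) continue;       // track finished
      if (out.sameVolume)
        reuse.push_back(t);         // recycle locally, keep cache locality
      else
        basketizer.AddTrack(t);     // changed volume: re-basketize
    }
    input.swap(reuse);              // iterate on the reused tracks
  }
}
```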

Integration with task-based experiment frameworks
- Some experiments (e.g. CMS) have adopted task-based frameworks, so integrating GeantV into a task-based workflow is very important (and now possible).
- Several scenarios invoking GeantV as a task are possible within the experiment framework, e.g.: event generator/reader → full/fast simulation (GeantV) → particle filter (type, region, energy, …) → digitization (MC truth + detector response) → tracking/reconstruction.

Framework: GeantV internal flow in the task approach
[Diagram: a user task injects events into the EventServer; an initial task (the top-level task spawning a “branch” in the TBB tree of tasks) launches transport tasks, each transporting one basket for one step (and possibly split further into subtasks); transported tracks either reuse the input basket, keeping locality, or go to the concurrent basketizer service, which injects full baskets into a concurrent basket queue; a flow-control task checks “event finished?”, “queue empty?” and the memory threshold, triggering the garbage-collector/flushing/prioritizing task (“dump all your baskets”), the I/O task, user scoring and the user digitizers.]

Integration with user framework (ongoing R&D)
[Diagram: a RunSimulation task configures GeantV and starts the event-loop task; a StartRunTask initializes GeantV if needed, triggers user event injection and starts the run; an optional EndRunTask handles post-simulation work. GeantRunManager connects the user components (a CMSSWApplication implementing GeantApplication, a CMSSWGenerator implementing PrimaryGenerator with NextEvent() and AddTrack(GeantTrack &atrack)) to the GeantV TBB task system and the event server.]

Preliminary TBB results
- A first implementation of a task-based approach for GeantV using TBB has been deployed: connectivity via a FeederTask, steering concurrency by launching InitialTask(s). A possible shape of this steering is sketched below.
- Some overheads appear on Haswell/AVX2; they are not so obvious on KNL/AVX512.
[Plots: scaling on AVX2 (Intel Xeon E5-2630 v3 @ 2.40GHz, 2 sockets x 8 physical cores) and on KNL/AVX512.]
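A minimal sketch of task-based steering with TBB; the slides only name FeederTask/InitialTask, so this lambda-per-worker shape and the placeholder kernels are assumptions:

```cpp
#include <tbb/task_group.h>
#include <vector>

// Placeholder stand-ins for the transport kernel and the basket queue;
// Basket is the type from the basketizer sketch above.
void TransportOneStep(Basket&) {}
std::vector<Basket> NextBaskets() { return {}; }

void RunTransport(int nWorkers) {
  tbb::task_group group;
  for (int i = 0; i < nWorkers; ++i) {
    group.run([] {                 // roughly one InitialTask per worker
      for (;;) {
        std::vector<Basket> work = NextBaskets();
        if (work.empty()) break;   // no more baskets: cool-down phase
        for (Basket& b : work) TransportOneStep(b);
      }
    });
  }
  group.wait();                    // join the whole branch of tasks
}
```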

GeantV beyond the socket: accelerators
- The GeantV scheduler can communicate with arbitrary device brokers, either getting work and processing it in native mode, or processing steps of work and sending the data back to the host.
- Implemented so far: a CUDA broker and a KNC offload interface (a possible broker interface is sketched below).
[Diagram: the generator feeds the CPU stepper (multithreaded, with geometry, physics and basketizer), which exchanges baskets with a GPU broker, a KNC broker (offload) and an MPI broker, each driving a device stepper.]
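The slides do not show the broker API, so the following interface is purely illustrative of how a scheduler could talk to heterogeneous brokers (types from the earlier sketches):

```cpp
// Hypothetical device-broker interface: the scheduler offers baskets and
// later collects the transported tracks back into host-side baskets.
class DeviceBroker {
public:
  virtual ~DeviceBroker() = default;

  // Offer a basket to the device; returns false if the broker is saturated
  // and the host should process the basket itself.
  virtual bool AddBasket(Basket&& work) = 0;

  // Drain tracks transported on the device back to the host basketizer.
  virtual void CollectOutputs(Basketizer& basketizer) = 0;
};
```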

Topology-aware GeantV
- Replicate the schedulers on NUMA clusters, with one basketizer per NUMA node and loose communication between NUMA nodes at the basketizing step (via a global basketizer).
- libhwloc is used to detect the topology (see the sketch below); pinning and NUMA allocators can be used to increase locality.
- Multi-propagator mode: running one or more clusters per quadrant.
- Implemented, currently being integrated.
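A minimal example of topology discovery with libhwloc, the library named on the slide; the hwloc 2.x object names are assumed:

```cpp
#include <hwloc.h>
#include <cstdio>

int main() {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);

  // One scheduler/basketizer pair would be replicated per NUMA node.
  int nNodes = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
  int nCores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
  std::printf("NUMA nodes: %d, cores: %d\n", nNodes, nCores);

  hwloc_topology_destroy(topo);
  return 0;
}
```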

Handling sub-NUMA clustering
- The full GeantV has known scalability issues (see the next slides) due to fine-grain synchronization in re-basketizing.
- A new approach deploying several propagators with sub-NUMA clustering (SNC) has been implemented.
- Objectives: improved scalability at the scale of KNL and beyond; addressing the HPC mode with MPI event servers (workload balancing) and non-homogeneous resources.
[Diagram: the GeantV run manager uses a NUMA discovery service (libhwloc) to instantiate one GeantV propagator per node or socket, each with its own scheduler and basketizer, tied together by a global basketizer.]

Multi-propagator performance
[Plot: multi-propagator scaling measurements, to be redone.]

GeantV plans for HPC environments
- Standard mode (one independent process per node): always possible, a no-brainer, but with possible issues with workload balancing (events take different times) and with output granularity (merging may be required).
- Multi-tier mode (event servers): useful for working with events from file and for handling merging and workload balancing; workers communicate with the event servers via MPI to get event IDs in common files (see the sketch below).
[Diagram: on each node an event feeder drives transport on Numa0/Numa1; every N-th node (node mod[N]) runs an event server and the merging service.]
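A sketch of the multi-tier idea: one rank acts as event server handing out event IDs, and workers read those events from a common file. The tags and the request/reply protocol are assumptions, not the GeantV protocol:

```cpp
#include <mpi.h>

enum { TAG_REQUEST = 1, TAG_EVENT_ID = 2 };

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int nEvents = 1000;  // events available in the common file
  if (rank == 0) {
    // Event server: hand out event IDs until all workers are told to stop.
    int next = 0, dummy;
    for (int stopped = 0; stopped < size - 1;) {
      MPI_Status st;
      MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
               MPI_COMM_WORLD, &st);
      int id = (next < nEvents) ? next++ : -1;  // -1 means "no more work"
      MPI_Send(&id, 1, MPI_INT, st.MPI_SOURCE, TAG_EVENT_ID, MPI_COMM_WORLD);
      if (id < 0) ++stopped;
    }
  } else {
    // Worker: fetch event IDs and simulate until the server runs dry.
    for (;;) {
      int dummy = 0, id;
      MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
      MPI_Recv(&id, 1, MPI_INT, 0, TAG_EVENT_ID, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      if (id < 0) break;
      // SimulateEvent(id);  // read event `id` from the common file
    }
  }
  MPI_Finalize();
  return 0;
}
```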

Validation and performance for LHC setups
- Exercise at the scale of LHC experiments (CMS & LHCb): full geometry converted to VecGeom, uniform magnetic field, tabulated physics with a fixed 1 MeV cut.
- Several cumulative observables measured in sensitive detectors: energy deposit and particle flux densities for p, π, K.
- GeantV single-threaded compared with the corresponding Geant4 application (Geant4 10.2, special physics list using the same tabulated physics): comparable signal, number of secondaries, total steps and physics steps, within statistical fluctuations.
- Speed-ups of T(G4)/T(GV) = 3.5 and 2.5 for the two setups. Speed-up build-up: 1.5 from infrastructure optimizations, 2.4 adding algorithmic improvements in geometry, 3.5 adding extra locality/vectorization (detailed profiling still to be done).

Future work
- SOA->AOS integration; tuning for many-core.
- R&D and testing in HPC environments; adapting to new architectures (Power8).
- Integration with physics and optimizations: Runge-Kutta propagator and multiple scattering.

Conclusions
- The GeantV core already delivers part of the hoped-for performance; there are many optimization requirements, and we now understand how to handle most of them. More performance is to be extracted from vectorization soon.
- Additional levels of locality (NUMA) are available in modern hardware; topology detection is available in GeantV and currently being integrated.
- Integration with task-based HEP frameworks is now possible: a TBB-enabled GeantV version is ready.
- Studying more efficient use of HPC resources, using a multi-tier approach for better workload balancing.
- Very promising results in complex applications: gains from infrastructure simplification, geometry, and locality/vectorization.