Scheduling fine grain workloads in GeantV
A. Gheata
Geant4 21st Collaboration Meeting, Ferrara, Italy, September 2016

HEP simulation: where should we go?
- The LHC uses more than 50% of its distributed computing power for simulation and related jobs; the need is tremendous and likely to increase by large factors.
- Technology is evolving faster than our software: our production code is able to use a smaller and smaller fraction of the power of the machines we run on.
- Re-engineering the code towards fine grain parallelism can bring large improvements. The GeantV project aims for a x3-x5 speedup, while understanding the hard limits for going beyond that, with fast simulation integrated seamlessly.
- The work required is far from trivial, but it is not optional: transistor density still evolves by Moore's law, but HEP programs have to evolve to profit from it, and parallelism has to be enabled down to the instruction level.
[Image: Intel® Many Integrated Core Architecture (MIC - KNL)]

GeantV – Adapting simulation to modern hardware
Classical simulation finds it hard to approach the full machine potential; GeantV simulation can profit at best from all processing pipelines.
Stack approach (classical):
- single-event transport
- embarrassing parallelism
- cache coherency: low
- vectorization: low (scalar auto-vectorization)
Basket approach (GeantV):
- multi-event transport
- fine grain parallelism
- cache coherency: high
- vectorization: high (explicit multi-particle interfaces; see the sketch below)
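To make the contrast concrete, here is a minimal sketch of the two interfaces; the types and functions are hypothetical illustrations, not the actual GeantV code. The basket variant exposes a tight loop over a structure-of-arrays container, which is what auto-vectorization (or an explicit SIMD back-end) can exploit:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays container: one basket of tracks of the
// same "kind", laid out contiguously so the step loop can be vectorized.
struct TrackBasket {
  std::vector<double> x, y, z;     // positions
  std::vector<double> px, py, pz;  // direction/momentum components
  std::size_t size() const { return x.size(); }
};

// Stack approach: one particle at a time. Every call works on a single
// track, so caches cool down between unrelated tracks and the compiler
// has no loop to vectorize.
void StepOne(double &x, double &y, double &z,
             double px, double py, double pz, double ds) {
  x += ds * px;
  y += ds * py;
  z += ds * pz;
}

// Basket approach: an explicit multi-particle interface. The inner loop
// runs over contiguous arrays and is trivially auto-vectorizable.
void StepBasket(TrackBasket &b, double ds) {
  for (std::size_t i = 0; i < b.size(); ++i) {
    b.x[i] += ds * b.px[i];
    b.y[i] += ds * b.py[i];
    b.z[i] += ds * b.pz[i];
  }
}
```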

GeantV targets
- Portable performance: GeantV developed a thin layer of back-ends allowing the hardware to be exploited at its best while maintaining portability (see the sketch after this list).
- High-performance re-engineered components: VecGeom, a fully vector-aware geometry modelling package aiming at a future replacement of the geometry in Geant4 and ROOT; and VecPhys, a highly optimized EM physics package with the same capabilities as Geant4 but better performance.
- Embedded fast simulation capability: provide combined full/fast simulation hooks and examples to drive further experiment customization within the framework.
- Tests from the onset on large, LHC-like setups, to demonstrate the performance compared to the standard simulation approach.
[Diagram: the GeantV scheduler dispatching work to CPU, GPU, Phi, Atom and other back-ends]
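One common way to build such a back-end layer is to template the algorithms on a back-end that fixes the working vector type, so the same source instantiates to scalar or SIMD code. The sketch below is illustrative only (it uses a GCC/Clang vector extension as a stand-in for a proper SIMD wrapper library) and does not show the real VecGeom/VecPhys types:

```cpp
#include <cstdio>
#include <cstddef>

// Scalar back-end: Double_v is a plain double (useful for scalar tails,
// debugging, or targets without SIMD).
struct ScalarBackend {
  using Double_v = double;
  static constexpr std::size_t kWidth = 1;
};

// 4-wide back-end: a compiler vector extension standing in for a proper
// SIMD wrapper type (Vc, UME::SIMD, ...).
struct Simd4Backend {
  typedef double Double_v __attribute__((vector_size(4 * sizeof(double))));
  static constexpr std::size_t kWidth = 4;
};

// The algorithm is written once against the back-end's Double_v...
template <class Backend>
typename Backend::Double_v DistSq(typename Backend::Double_v x,
                                  typename Backend::Double_v y,
                                  typename Backend::Double_v z) {
  return x * x + y * y + z * z;
}

int main() {
  // ...and instantiated per target without touching the algorithm.
  double s = DistSq<ScalarBackend>(1.0, 2.0, 3.0);
  Simd4Backend::Double_v vx = {1, 2, 3, 4};
  Simd4Backend::Double_v vs = DistSq<Simd4Backend>(vx, vx, vx);
  std::printf("scalar: %g  lane 0: %g\n", s, vs[0]);
}
```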

Integration with task-based frameworks
Some experiments (e.g. CMS) have adopted a task-based approach. Integrating GeantV in the simulation -> reconstruction -> analysis workflow is very important (and now possible).
Scenario (task flow): event generator/reader -> particle filter (type, region, energy, ...) -> full simulation (GeantV) or fast simulation (GeantV or the experiment framework) -> digitization (MC truth + detector response) -> tracking/reconstruction (shared with experimental data) -> analysis.
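As an illustration of the filter stage, the routing decision could look like the sketch below; the Track fields, the region id and the thresholds are invented for the example and are not GeantV definitions:

```cpp
// Hypothetical track summary seen by the filter.
struct Track {
  int pdg;        // particle type (PDG code)
  int regionId;   // detector region the track is in
  double energy;  // kinetic energy in GeV
};

enum class SimPath { Full, Fast };

// Particle filter: route each track to full GeantV transport or to a fast
// simulation hook based on type, region and energy. The ECAL region id and
// the 1 GeV threshold are illustrative assumptions.
SimPath Filter(const Track &t) {
  const bool emParticle = (t.pdg == 11 || t.pdg == -11 || t.pdg == 22);
  const bool inEcal = (t.regionId == 3 /* hypothetical ECAL id */);
  if (emParticle && inEcal && t.energy > 1.0)
    return SimPath::Fast;  // parametrized shower instead of full transport
  return SimPath::Full;
}
```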

Framework: GeantV moving to a task approach
GeantV has been fully re-structured to support both a "static" thread approach and TBB tasks. The main tasks and concurrent services (the pattern is sketched below):
- Feeder task: reads a number of events from file and invokes the concurrent basketizer service.
- Basketizer(s): concurrent service that collects injected particles and enqueues full baskets into the basket queue.
- Basket queue: concurrent service feeding the transport tasks.
- Transport task: transports one basket for one step, outputs the transported tracks and re-injects particles into the basketizer, reusing tracks to keep locality; it may be further split into subtasks.
- Flow control task: inspects the state (event finished? queue empty?) and spawns further work accordingly.
- Garbage collector: on command ("dump all your baskets"), forces partially filled baskets into the basket queue to boost concurrency.
- Scoring: a user task reading track information and creating "hits".
- Digitizer task: a user task working on the "hits" data.
- I/O task: writes data (hits, digits, kinematics) to disk.
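The skeleton of this task flow can be sketched with plain TBB primitives. The code below is a minimal illustration of the pattern (a feeder filling a concurrent basket queue, transport tasks draining it), not the actual GeantV task implementation:

```cpp
#include <tbb/concurrent_queue.h>
#include <tbb/task_group.h>
#include <cstdio>
#include <vector>

// Hypothetical basket: a batch of track indices of the same "kind".
using Basket = std::vector<int>;

// The concurrent basket queue filled by the basketizer service.
static tbb::concurrent_queue<Basket> gBasketQueue;

// Transport task: pop baskets and transport each for one step; a real
// version would re-inject surviving tracks into the basketizer.
static void TransportTask() {
  Basket b;
  while (gBasketQueue.try_pop(b)) {
    // ... vectorized geometry + physics step over all tracks in b ...
    std::printf("transported a basket of %zu tracks\n", b.size());
  }
}

int main() {
  // Feeder: read events and fill the basketizer; faked here as 8 baskets
  // of 16 tracks each.
  for (int i = 0; i < 8; ++i) gBasketQueue.push(Basket(16, i));

  // Spawn a few transport tasks; TBB's work-stealing scheduler balances
  // them across the cores. Flow control here is just "join when drained".
  tbb::task_group tg;
  for (int i = 0; i < 4; ++i) tg.run(TransportTask);
  tg.wait();
}
```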

Preliminary TBB results
A first implementation of a task-based approach for GeantV using TBB was deployed: connectivity via a FeederTask, steering the concurrency by launching InitialTask(s). There are some overheads on Haswell/AVX2 which are not so obvious on KNL/AVX512; some more code restructuring and tuning is needed.
[Plots: thread scalability on a 2-socket x 8-physical-core Intel(R) Xeon(R) (AVX2) and on KNL (AVX512)]

Re-structuring of GeantV for sub-NUMA clustering
- The full GeantV has known scalability issues (see the next slide) due to fine grain synchronization in re-basketizing.
- A new approach deploying several propagators with sub-NUMA clustering (SNC) has been implemented; it is now being debugged and tuned. A minimal structural sketch follows.
- Objectives: improved scalability at the scale of KNL and beyond; addressing HPC mode with MPI event servers (workload balancing) and non-homogeneous resources.
[Diagram: the GeantV run manager steering several propagators, one per node/socket, each with its own scheduler and basketizer, on top of a NUMA discovery service (libhwloc)]
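The multi-propagator structure could be organized as below; the classes are illustrative stand-ins for the run manager, propagator, scheduler and basketizer in the diagram, not the real GeantV interfaces:

```cpp
#include <memory>
#include <thread>
#include <vector>

struct Basketizer { /* per-propagator concurrent basketizer */ };

// One propagator per NUMA cluster: private basketizer, fixed thread count,
// so fine grain synchronization never crosses cluster boundaries.
struct Propagator {
  int numaNode;
  int nThreads;
  Basketizer basketizer;
  void RunTransportLoop(int /*tid*/) { /* per-thread scheduler + transport */ }
};

// Run manager deploying several propagators; geometry and cross sections
// would be shared read-only between them.
struct RunManager {
  std::vector<std::unique_ptr<Propagator>> props;

  void Initialize(int nClusters, int threadsPerCluster) {
    for (int n = 0; n < nClusters; ++n)
      props.emplace_back(new Propagator{n, threadsPerCluster});
  }

  void Run() {
    std::vector<std::thread> pool;
    for (auto &up : props) {
      Propagator *p = up.get();  // capture the raw pointer per propagator
      for (int t = 0; t < p->nThreads; ++t)
        pool.emplace_back([p, t] { p->RunTransportLoop(t); });
      // (pinning the threads to p->numaNode would go here, via libhwloc)
    }
    for (auto &th : pool) th.join();
  }
};

int main() {
  RunManager rm;
  rm.Initialize(/*nClusters=*/2, /*threadsPerCluster=*/4);
  rm.Run();
}
```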

Scalability (old model)
[Plots: thread scalability of the old model on Intel Xeon Phi and Intel Xeon hosts]

Multi-propagator mode
- Launch more than one propagator, each working with a fixed number of threads.
- Geometry, cross sections, ... are reused across propagators.
- Same as multi-process, but using work stealing for balancing.
- NUMA awareness is not yet added.
- Adds one level of complexity; needs more tuning.
[Plot: multi-propagator scalability on an Intel Xeon host]

NUMA aware GeantV
- Replicate the schedulers on NUMA clusters, with one basketizer per NUMA node.
- Use libhwloc to detect the topology (see the sketch below); pinning and NUMA allocators can be used to increase locality.
- Multi-propagator mode runs one or more clusters per quadrant, with loose communication between NUMA nodes at the basketizing step through a global basketizer.
- Currently being integrated.
[Diagram: four scheduler/basketizer pairs (0-3), each transporting its own tracks, connected through a global basketizer]
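The topology detection step could look like this minimal libhwloc sketch; the node choice and binding policy are illustrative:

```cpp
#include <hwloc.h>
#include <cstdio>

// Discover the NUMA topology and pin the calling thread to the cores of
// one NUMA node; a per-node scheduler/basketizer pair would do the same
// with its own node index.
int main() {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);

  int nNodes = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
  std::printf("NUMA nodes found: %d\n", nNodes);

  if (nNodes > 0) {
    // Bind this thread to the cpuset of NUMA node 0.
    hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, 0);
    hwloc_set_cpubind(topo, node->cpuset, HWLOC_CPUBIND_THREAD);
  }

  hwloc_topology_destroy(topo);
  return 0;
}
```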

GeantV plans for HPC environments
Standard mode (one independent process per node):
- always possible, a no-brainer;
- possible issues with work balancing (events take different times);
- possible issues with output granularity (merging may be required).
Multi-tier mode (event servers), sketched below:
- useful to work with events from file and to handle merging and workload balancing;
- communication with the event servers via MPI to get event ids in the common files.
[Diagram: event feeders on nodes 1, 2, ... and an event server on node mod[N], each transporting on NUMA 0/1, plus a merging service]
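A minimal sketch of the event-server communication in MPI is shown below; the tags, the in-file event ids and the termination protocol are illustrative assumptions:

```cpp
#include <mpi.h>
#include <cstdio>

// Rank 0 plays the event server: it hands out event ids from the common
// input files on request, which balances the very uneven per-event cost.
int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int kNEvents = 100, kTagReq = 1, kTagId = 2;

  if (rank == 0) {  // event server
    int next = 0, finished = 0;
    while (finished < size - 1) {
      int dummy;
      MPI_Status st;
      MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, kTagReq,
               MPI_COMM_WORLD, &st);
      int id = (next < kNEvents) ? next++ : -1;  // -1: no more work
      if (id < 0) ++finished;
      MPI_Send(&id, 1, MPI_INT, st.MPI_SOURCE, kTagId, MPI_COMM_WORLD);
    }
  } else {  // transport worker (one per node, NUMA-clustered internally)
    for (;;) {
      int dummy = 0, id;
      MPI_Send(&dummy, 1, MPI_INT, 0, kTagReq, MPI_COMM_WORLD);
      MPI_Recv(&id, 1, MPI_INT, 0, kTagId, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      if (id < 0) break;
      std::printf("rank %d transporting event %d\n", rank, id);
      // ... read event `id` from the common file and transport it ...
    }
  }
  MPI_Finalize();
}
```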

Conclusions
- GeantV needs to address parallelism in a fine-grained approach to handle locality (cache coherence, vectorization) efficiently. The resulting Amdahl overheads are to be compensated by a thread clustering approach: the implementation is ready and currently being fixed and tuned, and the improvement is already visible in the preliminary version.
- Additional levels of locality (NUMA) are available in modern hardware; topology detection is available in GeantV and currently being integrated.
- Integration with task-based HEP frameworks is now possible: a TBB-enabled GeantV version is ready.
- We are studying a more efficient use of HPC resources, using a multi-tier approach for better workload balancing.