Prototyping particle transport towards GEANT5
A. Gheata, 27 November 2012
Fourth International Workshop for Future Challenges in Tracking and Trigger Concepts


Outline
– Why change particle transport?
– Requirements for the LHC upgrade
– Ideas embedded in a prototype
– Preliminary results and lessons
– Moving forward

Classical particle transport
– For every particle, most of the data structures and code are traversed:
  – Geometry navigation
  – Material X-section tables
  – Physics processes
– Navigating very large data structures, with irregular access patterns
  – OO abused: very deep instruction stack
  – Cache misses: not enough spatial and temporal locality
  – Low IPC
– Event-level parallelism will make better use of resources, but won't improve any of the above
– This approach is hard to optimize in a fine-grained parallelism environment. The current state-of-the-art physics could profit from a transport and data model based on locality and vectorized computing.

Technology challenge
Talk by Intel: HEP applications are not doing great…
– CPI: …
– Load and store instructions %: …
– Resource stalls % (of cycles): …
– Branch instructions % (approx.): …
– Computational FP instructions %: 0.026%

MC requirements for the LS2 upgrade (to be commissioned by ~2018)
– Rate of events 50–100x higher than today
– Ratio of events to MC the same as now
– We will need to simulate more events, and much faster
  – The increase in computing resources will not scale accordingly
– We need a simulation framework performing much better than what we can do today!
  – Exploiting data locality and parallelism at multiple levels
– We have started R&D in the field in the form of a prototype
  – Exploring many new ideas in a small group: F. Carminati, J. Apostolakis, R. Brun, A. Gheata
  – Trying to materialize some of these ideas in an O(2000) LOC implementation

Simple observation: HEP transport is mostly local!
[Figure: ATLAS volumes sorted by transport time; 50% of the time is spent in 50 out of 7100 volumes.] The same behavior is observed for most HEP geometries. (Talk by René last year.)

A playground for new ideas
A simulation prototype attempting to explore parallelism and efficiency issues:
– Basic idea: simple physics, realistic geometry. Can we implement a parallel transport model on threads, exploiting data locality and vectorization?
– Clean re-design of data structures and steering, to easily exploit parallel architectures and allow for collaborative work among resources
– Can we achieve a continuous data flow from generation to digitization and I/O, without the need to block between different stages?
Events and primary tracks are independent:
– Transport together a vector of tracks coming from many events
– Study how the scattering/gathering of vectors of tracks and hits impacts the simulation data flow. Can we push useful vectors downstream?
Start with toy physics to develop the appropriate data structures and steering code:
– Keep in mind that the application should eventually be tuned based on realistic numbers

Volume-oriented transport model
We implemented a model where all particles traversing a given geometry volume are transported together as a vector until the volume gets empty:
– Same volume -> local (vs. global) geometry navigation, same material and same cross sections
– Load balancing: distribute all particles from a volume type into smaller work units called baskets; give a basket to one transport thread at a time
– Steering methods work with vectors, possibly allowing for future auto-vectorization
Particles exiting a volume are distributed to the baskets of the neighboring volumes until they exit the setup or disappear (see the sketch below):
– Like a champagne cascade, but lower glasses can also fill top ones…
– Wait for the glass to fill before drinking the champagne…
– Independent processing once the work is distributed
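To make the basket mechanics concrete, here is a minimal sketch in C++. All names (Track, TrackBasket, VolumeScheduler, kBasketThreshold) are illustrative, not the prototype's actual API: each logical volume has one basket being filled, and a basket reaching the fill threshold is handed over for transport.

```cpp
#include <mutex>
#include <queue>
#include <vector>

struct Track { int event; double x, y, z, px, py, pz; };

// Tracks located in the same logical volume, transported together as a vector.
struct TrackBasket {
    int volumeId;
    std::vector<Track> tracks;
};

class VolumeScheduler {
    static constexpr size_t kBasketThreshold = 16;  // assumed fill threshold
    std::vector<TrackBasket> filling_;   // one basket being filled per volume
    std::queue<TrackBasket> workQueue_;  // full baskets, ready for transport
    std::mutex mtx_;
public:
    explicit VolumeScheduler(int nVolumes) {
        filling_.resize(nVolumes);
        for (int v = 0; v < nVolumes; ++v) filling_[v].volumeId = v;
    }
    // "Champagne cascade": a track exiting one volume is poured into the
    // basket of the volume it enters; a full glass goes to the work queue.
    void AddTrack(int volumeId, const Track &t) {
        std::lock_guard<std::mutex> lock(mtx_);
        TrackBasket &b = filling_[volumeId];
        b.tracks.push_back(t);
        if (b.tracks.size() >= kBasketThreshold) {
            workQueue_.push(std::move(b));
            b = TrackBasket{volumeId, {}};  // start a fresh basket
        }
    }
};
```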

Event injection
– Realistic geometry + event generator
– Inject events in the volume containing the IP
– More events in flight are better: they cut event tails (by merging them) and fill the pipeline better!

Transport model
[Diagram: an event factory feeds track containers (any volume); tracks are sorted into track baskets (tracks in a single volume type) and flow through geometry transport and the physics processes.]
– Inject events into a track container, taken by a scheduler thread
– The scheduler holds a track "basket" for each logical volume; tracks go to the right basket
– Baskets are filled up to a threshold, then injected into a work queue
– Transport threads pick baskets first in, first out
– Physics processes and geometry transport work with vectors
– Tracks are transported to boundaries, then dumped into a per-thread track collection
– Track collections are pushed to a queue and picked up by the scheduler
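A sketch of the worker loop these steps imply, reusing Track and TrackBasket from the previous sketch. The blocking queue and the vector geometry/physics kernels (GeometryStep, PhysicsStep, NextVolume) are assumptions for illustration, not the prototype's code.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

template <typename T>
class BlockingQueue {                       // minimal concurrent FIFO
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void Push(T v) {
        { std::lock_guard<std::mutex> l(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T Pop() {                               // blocks until an item is ready
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};

// Hypothetical kernels acting on a whole vector of tracks at once.
void GeometryStep(int volumeId, std::vector<Track> &tracks);
void PhysicsStep(int volumeId, std::vector<Track> &tracks);
int  NextVolume(const Track &t);            // -1: exited the setup or absorbed

struct TrackCollection {                    // per-thread boundary crossings
    std::vector<std::pair<Track, int>> crossing;  // (track, next volume)
};

void TransportLoop(BlockingQueue<TrackBasket> &work,
                   BlockingQueue<TrackCollection> &toScheduler) {
    for (;;) {
        TrackBasket basket = work.Pop();    // first in, first out
        GeometryStep(basket.volumeId, basket.tracks);  // vector geometry
        PhysicsStep(basket.volumeId, basket.tracks);   // vector physics
        TrackCollection out;
        for (const Track &t : basket.tracks) {
            int next = NextVolume(t);
            if (next >= 0) out.crossing.push_back({t, next});
        }
        toScheduler.Push(std::move(out));   // scheduler re-baskets the tracks
    }
}
```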

Processing phases – old lessons
[Plot: basket population over time, with the garbage-collection threshold marked.]
– Initial event injection, then an optimal regime with constant basket content
– Sparse regime: fewer tracks per basket, more and more frequent garbage collections (drink the glass empty…)
– Depletion regime: continuous garbage collection; new events needed, force flushing some events

Prioritized transport
[Diagram: the main scheduler injects transportable and priority baskets into a deque; worker threads pick up baskets, call Stepping(tid, &tracks), push crossing tracks (itrack, ivolume) into per-thread track collections, and recycle baskets; a dispatch & garbage-collect thread loops over the tracks of full track collections and pushes them back to baskets; Generate(N events) flushes new events in; a digitize & I/O thread consumes hits via Digitize(iev) and writes to disk.]
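One plausible reading of the priority mechanism, sketched under the assumption that priority baskets simply jump to the front of the shared deque so that events being flushed finish first (BasketQueue is a hypothetical name):

```cpp
#include <deque>
#include <mutex>
#include <utility>

class BasketQueue {
    std::deque<TrackBasket> dq_;
    std::mutex m_;
public:
    void Push(TrackBasket b, bool priority) {
        std::lock_guard<std::mutex> l(m_);
        if (priority)
            dq_.push_front(std::move(b));   // flush these events first
        else
            dq_.push_back(std::move(b));    // normal FIFO order
    }
    bool TryPop(TrackBasket &out) {
        std::lock_guard<std::mutex> l(m_);
        if (dq_.empty()) return false;
        out = std::move(dq_.front());
        dq_.pop_front();
        return true;
    }
};
```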

Depletion of baskets (no re-injection)
[Plot: thread activity while transporting the initial buffer of events, then flushing events in the "priority" regime.]
– Excellent concurrency throughout
– Vectors "suffer" in this regime

Buffered events & re-injection
– Reuse of event slots for re-injected events keeps memory under control
– The depletion regime with sparse tracks is just a few % of the job
– Concurrency still excellent
– Vectors preserved much better

Preliminary benchmarks
– Benchmarking 10+1 threads on a 12-core Xeon (HT mode): excellent CPU usage
– Locks and waits: some overhead due to transitions coming from exchanging baskets via concurrent queues
– Event re-injection will improve the speed-up

GPU?
We cannot ignore the TFLOPS delivered, nor the good FLOPS/Watt (or FLOPS/$) ratio. GPUs come with different computing patterns, a transaction model, and their own requirements on data and code structure:
– Can our models fit these resources?
– Can we use the GPU in an opportunistic task model?
Many CPU-GPU workflows suffer from memory bus latency:
– Investigate non-blocking work flows against blocking transactions, data flow machines, efficient scatter/gather technologies
It will definitely be a challenge.

User data and code
Well-tuned performance can easily be spoiled by careless user code:
– Hits and digits are large user data structures, usually scattered in memory; we aim to pool memory resources via factories
– The I/O has to be well controlled and scalable; the framework should provide at least the tools to queue I/O transactions in an efficient manner
Callbacks to user code should execute asynchronously, in competition with the global task queue:
– The information passed to them should be stateful and fully decoupled from the state of the transport machinery

User data factories
[Diagram: a factory manager serves per-thread factories (thread 1 … thread N, obtained via GetFactory); each factory hands out hits from pre-allocated blocks (MyHit *hit = factory->NextHit()), cycling blocks between an "in use" and an "empty" pool.]
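A minimal sketch of such a block-pooling factory, assuming one factory per thread (so NextHit needs no locking), a hypothetical MyHit type and an arbitrary block size; returning empty blocks to the manager after I/O is only hinted at.

```cpp
#include <cstddef>
#include <vector>

struct MyHit { int detId; double edep; };   // assumed user hit type

class HitFactory {
    static constexpr size_t kBlockSize = 1024;     // hits per block (assumed)
    std::vector<std::vector<MyHit>> blocksInUse_;  // contiguous blocks of hits
    std::vector<MyHit> *current_ = nullptr;        // block currently filling
public:
    MyHit *NextHit() {
        if (!current_ || current_->size() == kBlockSize) {
            blocksInUse_.emplace_back();           // grab a fresh block
            current_ = &blocksInUse_.back();
            current_->reserve(kBlockSize);         // contiguous pre-allocation
        }
        current_->emplace_back();
        return &current_->back();
    }
    // After the hits are digitized and written out, the blocks would be
    // handed back to the factory manager's "empty" pool for reuse
    // rather than freed (not shown here).
};
```

Usage matches the diagram: a thread asks the manager for its factory once, then calls `MyHit *hit = factory->NextHit();` in the stepping code, so hits end up pooled in large contiguous blocks instead of scattered on the heap.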

Next steps
– Include hits, digitization and I/O in the prototype to test the full flow
  – Factories allowing contiguous pre-allocation and re-use of user structures
– Introduce realistic EM physics
  – Tune model parameters, better estimate memory requirements
– Re-design transport models from a "plug-in" perspective
  – E.g. the ability to use fast simulation on a per-track basis
– Look further into auto-vectorization options
  – New compiler features, Intel Cilk array notation, …
  – Check the impact of vector-friendly data restructuring: vector of objects -> object with vectors (see the sketch below)
– Push vectors to a lower level
  – Geometry and physics models are the main candidates
– The GPU is a great challenge for simulation code
  – Localize hot spots with high CPU usage and low data transfer requirements
  – Test performance on data flow machines
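As an illustration of the "vector of objects -> object with vectors" restructuring, a small assumed example: with the SoA layout each field becomes a unit-stride array, giving the compiler a fair chance to auto-vectorize the stepping loop. Field names and the propagation kernel are illustrative only.

```cpp
#include <cstddef>
#include <vector>

// Vector of objects (AoS): the fields of one track are contiguous, but the
// same field across tracks is strided -> poor vector loads.
struct TrackAoS { double x, y, z, px, py, pz; };
using TracksAoS = std::vector<TrackAoS>;

// Object with vectors (SoA): each field is its own contiguous array.
struct TracksSoA {
    std::vector<double> x, y, z, px, py, pz;
    size_t size() const { return x.size(); }
};

// A stepping kernel over the SoA layout: simple unit-stride loops that a
// compiler can auto-vectorize (e.g. at -O3, or -O2 -ftree-vectorize).
void Propagate(TracksSoA &t, double step) {
    for (size_t i = 0; i < t.size(); ++i) {
        t.x[i] += step * t.px[i];
        t.y[i] += step * t.py[i];
        t.z[i] += step * t.pz[i];
    }
}
```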

Outlook
– The classic simulation approach performs badly on the new computing infrastructures
  – Projected into the future, this looks even worse…
  – The technology trends motivate serious R&D in the field
– We have started looking at the problem from different angles
  – Data locality and contiguity, fine-grained parallelism, data flow, vectorization
  – Preliminary results confirm that the direction is good
– We need to understand what can be gained and how, what the impact on the existing code is, and what the challenges and effort to migrate to a new system are…

Thank you!