Report on Vector Prototype

J. Apostolakis, R. Brun, F. Carminati, A. Gheata
26 September 2012

Simple observation: HEP transport is mostly local!
- Locality is not exploited by the classical transport approach
- The existing code is very inefficient (0.6-0.8 instructions per cycle)
- Cache misses due to fragmented code
- 50 percent of the time is spent in 50 out of 7100 volumes
[Figure: ATLAS volumes sorted by transport time; the same behavior is observed for most HEP geometries]

A playground for new ideas
- Simulation prototype exploring parallelism and efficiency issues
- Basic idea: simple physics, realistic geometry. Can we implement a parallel transport model on threads, exploiting data locality and vectorisation?
- Clean re-design of the data structures and steering, to exploit parallel architectures easily and to share all the main data structures among threads
- Can we make it fully non-blocking from generation to digitization and I/O?
  - Events and primary tracks are independent
  - Transport together a vector of tracks coming from many events
- Study how the scattering/gathering of track and hit vectors impacts the simulation data flow
  - Can we push useful vectors downstream?
- Start with toy physics to develop the appropriate data structures and steering code
  - Keep in mind that the application should eventually be tuned based on realistic numbers

Volume-oriented transport model
- We implemented a model where all particles traversing a given geometry volume are transported together as a vector until the volume empties
- Same volume -> local (rather than global) geometry navigation, same material, same cross sections
- Load balancing: distribute the particles of a volume type into smaller work units called baskets; give one basket at a time to a transport thread (see the sketch below)
- Steering methods work with vectors, possibly allowing future auto-vectorisation
- Particles exiting a volume are redistributed to the baskets of the neighboring volumes, until they exit the setup or disappear
  - Like a champagne cascade, but lower glasses can also fill the top ones...
- No direct communication between threads, to avoid synchronization issues
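As a concrete illustration of the basketizing scheme, here is a minimal sketch assuming a simple mutex-protected FIFO; Track, Basket, WorkQueue, AddTrack and the dispatch threshold are illustrative names, not the prototype's actual classes.

```cpp
// Minimal sketch of per-volume baskets dispatched to a shared work queue.
// All names and the threshold are assumptions, not the prototype's API.
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

struct Track {                     // contiguous track data, vector-friendly
  double x, y, z, px, py, pz;
};

struct Basket {
  int volumeId;                    // all tracks share this logical volume
  std::vector<Track> tracks;
};

class WorkQueue {                  // FIFO shared by the scheduler and workers
  std::queue<Basket *> queue_;
  std::mutex mutex_;
public:
  void Push(Basket *b) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push(b);
  }
  Basket *TryPop() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (queue_.empty()) return nullptr;
    Basket *b = queue_.front();
    queue_.pop();
    return b;
  }
};

// Scheduler side: collect tracks per volume, dispatch a basket once full.
void AddTrack(Basket &fill, const Track &t, WorkQueue &work,
              std::size_t threshold) {
  fill.tracks.push_back(t);
  if (fill.tracks.size() >= threshold) {
    work.Push(new Basket(std::move(fill))); // hand off the full basket
    fill.tracks.clear();                    // keep filling a fresh one
  }
}
```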

Event injection
- Inject events in the volume containing the interaction point (IP)
- More events in flight cut the event tails and fill the pipeline better!
- Realistic geometry + event generator

Transport model
- Track baskets hold tracks in a single volume type
- Baskets are filled up to a threshold, then injected into a work queue
- Events are injected into a track container taken by a scheduler thread
- Transport threads pick baskets first-in, first-out (see the worker sketch below)
- Physics processes and geometry transport work with vectors
- Tracks are transported to the volume boundaries, then dumped into a per-thread track collection
- Track collections are pushed to a queue and picked up by the scheduler
- The scheduler holds a track basket for each logical volume; tracks go to the right basket
[Diagram: event factory feeding track containers (any volume); the scheduler dispatches track baskets to n physics-process and geometry-transport workers]
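To make this flow concrete, here is a hedged sketch of a transport worker: it drains baskets FIFO, applies the vectorized physics and geometry steps (stubbed out here), and hands the boundary-crossing tracks back to the scheduler for re-basketing. Every type and function name is an illustrative assumption, not the prototype's API.

```cpp
// Hedged sketch of the worker/scheduler hand-off; physics and geometry
// steps are stubs, and all names are illustrative.
#include <utility>
#include <vector>

struct Track { int nextVolume; double x, y, z, px, py, pz; };

// Stand-ins for the vectorized steps: real code would process the whole
// vector of tracks sharing one volume, material and cross sections.
void ApplyPhysics(std::vector<Track> &tracks) { /* vectorized physics */ }
void TransportToBoundary(std::vector<Track> &tracks) { /* vectorized geometry */ }

template <typename BasketQueue, typename CollectionQueue>
void TransportWorker(BasketQueue &work, CollectionQueue &done) {
  while (auto *basket = work.TryPop()) {   // first-in, first-out pick-up
    ApplyPhysics(basket->tracks);
    TransportToBoundary(basket->tracks);   // sets each track's nextVolume
    // Dump the crossing tracks into a per-thread collection; the scheduler
    // later loops over it and pushes each track to the right volume basket.
    done.Push(std::move(basket->tracks));
    delete basket;                         // real code would recycle baskets
  }
}
```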

Processing phases – old lessons
- Initial event injection: sparse regime, with more and more frequent garbage collections and fewer tracks per basket
- Optimal regime: constant basket content, garbage collection on threshold
- Depletion regime: continuous garbage collection; new events needed, some events force-flushed
[Figure: basket content over the job lifetime compared to the ideal, showing the three phases]

Prioritized transport
[Workflow diagram: the main scheduler runs Generate(Nevents) and generates transportable baskets, injecting/replacing them in a basket deque and recycling empty baskets; worker threads pick up baskets, run Stepping(tid, &tracks), and push crossing tracks (itrack, ivolume) into track collections; full collections are pushed/replaced into a deque drained by the dispatch & garbage-collect thread, which loops over the tracks, pushes them to the per-volume (ivolume) baskets and recycles the track collections; priority baskets are injected to flush the transport of selected events; a digitize & I/O thread runs Digitize(iev) and writes the hits to disk]
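The diagram implies a simple scheduling rule for the workers: drain priority baskets (events being flushed) before regular work. A tiny hedged sketch follows; the queue type and TryPop are placeholders consistent with the earlier sketches, not the prototype's API.

```cpp
// Sketch of the priority rule: baskets of events being flushed are drained
// before regular baskets. Queue and basket types are placeholders.
template <typename Queue>
auto NextBasket(Queue &priority, Queue &regular) -> decltype(regular.TryPop()) {
  if (auto *b = priority.TryPop())
    return b;              // priority baskets (event flushing) go first
  return regular.TryPop(); // otherwise fall back to the regular FIFO
}
```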

Depletion of baskets (no re-injection)
- Transporting the initial buffer of events, then flushing events in the "priority" regime
- Excellent concurrency all over
- Vectors "suffer" in this regime
[Figure: timeline of basket transport while flushing the buffered events in groups (0-4, 5-9, ..., 95-99)]

Buffered events & re-injection
- Reusing event slots for re-injected events keeps memory under control (sketched below)
- The depletion regime with sparse tracks is just a few percent of the job
- Concurrency still excellent
- Vectors preserved much better
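A minimal sketch of the slot-reuse idea, assuming a fixed-size buffer of events in flight; all names and fields are illustrative assumptions.

```cpp
// Minimal sketch of event-slot reuse: a fixed pool of slots bounds memory,
// and a finished event's slot is immediately taken by a re-injected one.
#include <vector>

struct EventSlot {
  int eventId;         // which event currently occupies the slot
  int tracksInFlight;  // tracks of this event still being transported
};

class EventBuffer {
  std::vector<EventSlot> slots_;
public:
  explicit EventBuffer(int nBuffered) : slots_(nBuffered, EventSlot{-1, 0}) {}
  // Called when the last track of the event in `slot` has been transported:
  // the slot is reused in place, so the memory footprint stays constant.
  void Reinject(int slot, int newEventId, int nTracks) {
    slots_[slot] = EventSlot{newEventId, nTracks};
  }
};
```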

Preliminary benchmarks
- Benchmarked with 10+1 threads on a 12-core Xeon in hyper-threading mode
- Excellent CPU usage
- Locks and waits: some overhead from the transitions involved in exchanging baskets via concurrent queues
- Event re-injection will improve the speed-up

Next steps
- Include hits, digitization and I/O in the prototype to test the full flow
  - Factories allowing contiguous pre-allocation and re-use of user structures (sketched after this list):
      MyHit *hit = HitFactory(MyHit::Class())->NextHit();
      hit->SetP(particle->P());
- Introduce realistic EM physics
  - Tune the model parameters; better estimate the memory requirements
- Re-design the transport models from a "plug-in" perspective
  - E.g. the ability to use fast simulation on a per-track basis
- Look further into auto-vectorization options
  - New compiler features, Intel Cilk array notation, ...
  - Check the impact of vector-friendly data restructuring: vector of objects -> object with vectors
- Push vectors to lower levels
  - Geometry and physics models are the main candidates
- GPU is a great challenge for simulation code
  - Localize hot-spots with high CPU usage and low data-transfer requirements
- Test performance on data-flow machines
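The factory line above can be fleshed out into a hedged sketch of contiguous pre-allocation and reuse. The constructor signature, capacity and wrap-around recycling are assumptions; the slide's HitFactory(MyHit::Class()) suggests a per-class factory lookup, which is simplified away here.

```cpp
// Hedged sketch of a hit factory with contiguous pre-allocation and reuse;
// MyHit, the capacity and the recycling policy are assumptions.
#include <cstddef>
#include <vector>

struct MyHit {
  double p = 0.;
  void SetP(double value) { p = value; }
};

class HitFactory {
  std::vector<MyHit> pool_;   // one contiguous block, allocated once
  std::size_t next_ = 0;      // index of the next free hit
public:
  explicit HitFactory(std::size_t capacity) : pool_(capacity) {}
  MyHit *NextHit() {
    if (next_ == pool_.size())
      next_ = 0;              // recycle; real code would flush to I/O first
    return &pool_[next_++];
  }
  void Reset() { next_ = 0; } // reuse the whole block after digitization/I/O
};

// Usage, mirroring the slide's snippet:
//   HitFactory factory(1024);
//   MyHit *hit = factory.NextHit();
//   hit->SetP(3.14);
```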

Outlook
- Existing simulation approaches perform badly on the new computing infrastructures
  - Projected into the future, this looks much worse...
- The technology trends motivate serious R&D in the field
- We have started looking at the problem from different angles
  - Data locality and contiguity, fine-grain parallelism, data flow, vectorization
- Preliminary results confirm that the direction is good
- We need to understand what can be gained and how, what the impact on the existing code is, and what the challenges and effort to migrate to a new system are...

Thank you!