Update on G5 prototype
Andrei Gheata
Computing Upgrade Weekly Meeting, 26 June 2012


Simple observation: HEP transport is mostly local!
ATLAS volumes sorted by transport time: 50 per cent of the time is spent in 50 of the 7100 volumes.
The same behavior is observed for most HEP geometries.
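
The locality claim above amounts to a cumulative-share calculation over per-volume transport times. A minimal sketch of that calculation follows; the function name `volumes_needed_for_share` and the toy timings are hypothetical and are not the ATLAS measurements quoted on the slide.

```python
def volumes_needed_for_share(times, share):
    """Return how many of the slowest volumes account for `share` of total time.

    `times` holds one transport time per volume (arbitrary units).
    """
    ranked = sorted(times, reverse=True)   # slowest volumes first
    target = share * sum(ranked)
    running = 0.0
    for count, t in enumerate(ranked, start=1):
        running += t
        if running >= target:
            return count
    return len(ranked)


# Toy timings (made up for illustration only).
toy_times = [50, 30, 10, 5, 5]
print(volumes_needed_for_share(toy_times, 0.5))  # prints 1
```

Applied to real per-volume timings, the same function would show how few volumes dominate the transport time.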

A playground for new ideas
Simple simulation prototype to help explore parallelism and efficiency issues
  – Basic idea: minimal physics to start with, realistic geometry: can we implement a parallel transport model on threads, exploiting data locality and vectorisation?
  – Clean re-design of data structures and steering to easily exploit parallel architectures
  – Can we make it fully non-blocking from generation to digitization and I/O?
Events and primary tracks are independent
  – Work chunk: basket containing a vector of tracks
  – Mixing tracks from different events to avoid tails and have reasonably sized vectors
  – Study how scattering/gathering of vectors impacts the simulation data flow
Toy physics at first, more realistic EM & hadronic processes to continue with
  – The application should eventually be tuned based on realistic numbers
New transport model more “detector element”-oriented, profiting from the cached data structures, both geometry-wise and cross-section-wise
Where to go from there
  – Re-design the particle stack and the I/O
  – Re-design transport models from a “plug-in” perspective, e.g. ability to use fast simulation on a per-track basis
  – Understand what can be gained and how, what is the impact on the existing code, what are the changes and effort to migrate to a new system…

Volume-oriented transport model
We implemented a model where all particles traversing a given geometry volume are transported together as a vector until the volume is empty
  – Same volume -> local (vs. global) geometry navigation, same material and same cross sections
  – Load balancing: distribute all particles from a volume type into smaller work units called baskets, and give one basket at a time to a transport thread
Particles exiting a volume are distributed to the baskets of the neighboring volumes until they exit the setup or disappear
  – Like a champagne cascade, but lower glasses can also fill top ones…
  – No direct communication between threads, to avoid synchronization issues
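
The load-balancing step described above (group particles by volume, then cut each group into baskets) can be sketched as follows. This is only an illustration under stated assumptions: tracks are modeled as `(track_id, volume)` pairs, and `scatter_to_baskets` and `basket_size` are hypothetical names, not the prototype's API.

```python
from collections import defaultdict


def scatter_to_baskets(tracks, basket_size):
    """Group tracks by volume, then split each group into baskets.

    `tracks` is an iterable of (track_id, volume) pairs. Each returned
    basket is a (volume, [track_id, ...]) pair holding at most
    `basket_size` tracks, all located in the same volume.
    """
    by_volume = defaultdict(list)
    for track_id, volume in tracks:
        by_volume[volume].append(track_id)

    baskets = []
    for volume, ids in by_volume.items():
        # Cut the per-volume vector into smaller work units.
        for start in range(0, len(ids), basket_size):
            baskets.append((volume, ids[start:start + basket_size]))
    return baskets


tracks = [(0, "A"), (1, "B"), (2, "A"), (3, "A")]
print(scatter_to_baskets(tracks, basket_size=2))
# prints [('A', [0, 2]), ('A', [3]), ('B', [1])]
```

Because every basket holds tracks from a single volume, a thread processing it can reuse the same local geometry, material and cross-section data for the whole vector.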

The beginning
Realistic geometry + event generator
Inject event in the volume containing the IP
More events are better: they cut event tails and fill the pipeline better!

A first approach
[Diagram labels: Work queue, Physics processes, Geometry transport, Particles(i_0, …, i_n)]
Scatter all injected tracks to baskets. Only baskets above some threshold are transported.
Transport threads pick up baskets from the work queue.
Physics processes and geometry transport are called with vectors of particles.
Each thread transports its basket of tracks to the boundaries of the current volume.
It moves crossing tracks to a buffer, then picks up the next basket from the queue.
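
A minimal sketch of this first approach, using Python's standard `queue` and `threading` modules. Everything specific here is an assumption made for illustration: the threshold is read as a minimum track count, volumes are integers, and `toy_transport` is a hypothetical stand-in that simply moves every track into the next volume.

```python
import queue
import threading


def fill_work_queue(baskets, threshold):
    """Queue only baskets holding at least `threshold` tracks (assumed meaning)."""
    work = queue.Queue()
    held_back = []
    for volume, tracks in baskets:
        if len(tracks) >= threshold:
            work.put((volume, tracks))
        else:
            held_back.append((volume, tracks))
    return work, held_back


def toy_transport(volume, tracks):
    """Hypothetical transport: every track crosses into volume + 1."""
    return [(t, volume + 1) for t in tracks]


def worker(work, crossing_buffer):
    """Pick up baskets until the queue is empty; buffer the crossing tracks."""
    while True:
        try:
            volume, tracks = work.get_nowait()
        except queue.Empty:
            return
        crossing_buffer.extend(toy_transport(volume, tracks))


def transport_baskets(baskets, threshold, n_threads):
    work, held_back = fill_work_queue(baskets, threshold)
    # One private buffer per thread: workers never talk to each other.
    buffers = [[] for _ in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(work, buf)) for buf in buffers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    crossed = sorted(pair for buf in buffers for pair in buf)
    return crossed, held_back


baskets = [(0, [1, 2, 3]), (1, [4]), (2, [5, 6])]
crossed, held_back = transport_baskets(baskets, threshold=2, n_threads=2)
print(crossed)    # prints [(1, 1), (2, 1), (3, 1), (5, 3), (6, 3)]
print(held_back)  # prints [(1, [4])]
```

Giving each thread its own buffer keeps the workers independent; the only shared object is the work queue.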

First version required synchronization…
[Diagram labels: Work queue, POP_CHUNK, QUEUE_EMPTY, ParticleBuffer, FLUSH]
Generation = pop work chunks until the queue is empty
Synchronization point: flush the transported particle buffer and sort baskets according to content
Recompute work chunks and start transporting the next generation of baskets
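
The generation loop can be sketched sequentially to make the synchronization point explicit. This is a single-threaded illustration under stated assumptions: `toy_step` is a hypothetical transport that moves each track to the next integer volume and drops it once it passes `last_volume`.

```python
from collections import defaultdict, deque


def toy_step(volume, tracks, last_volume):
    """Hypothetical transport: tracks cross into volume + 1, or leave the setup."""
    next_volume = volume + 1
    if next_volume > last_volume:
        return []  # tracks exit the setup
    return [(t, next_volume) for t in tracks]


def run_generations(initial_tracks, last_volume):
    """Transport (track, volume) pairs generation by generation; return the count."""
    particle_buffer = list(initial_tracks)
    generations = 0
    while particle_buffer:
        # Synchronization point: flush the buffer and sort tracks into baskets.
        by_volume = defaultdict(list)
        for track, volume in particle_buffer:
            by_volume[volume].append(track)
        particle_buffer = []
        work_queue = deque(by_volume.items())

        # One generation: pop work chunks until the queue is empty.
        while work_queue:
            volume, tracks = work_queue.popleft()
            particle_buffer.extend(toy_step(volume, tracks, last_volume))
        generations += 1
    return generations


print(run_generations([(0, 0), (1, 0), (2, 1)], last_volume=2))  # prints 3
```

In a threaded version, every worker would have to wait at the flush step before the next generation starts, which is the synchronization cost the slide title refers to.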

Processing phases
Initial events injection
Optimal regime: constant basket content
Sparse regime: more and more frequent garbage collections, fewer tracks per basket
Depletion regime: continuous garbage collection; new events needed; force flushing some events
[Plot annotations: Garbage collection threshold, ideal]

Prototype implementation
[Diagram of the prototype's threads and queues; only the labels survive, grouped here by the component they appear to belong to]
Main scheduler: Generate(N events), inject priority baskets, flush
Worker threads: pick up transportable baskets, transport with Stepping(tid, &tracks), recycle baskets
Dispatch & garbage collect thread: loop over crossing tracks (itrack, ivolume) and push them to the baskets of ivolume; inject/replace baskets; push/replace collection
Digitize & I/O thread: Hits, Digitize(iev), Disk
Queues (deque): transportable baskets, recycled baskets, full track collections, recycled track collections, priority baskets
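
The dispatch and garbage-collection role named in the diagram can be sketched as a small single-threaded class. This is an assumption-laden illustration, not the prototype's code: the class name `Dispatcher`, the fixed `basket_size`, and the rule that garbage collection flushes every partially filled basket are all hypothetical.

```python
from collections import deque


class Dispatcher:
    """Sorts crossing tracks into per-volume baskets (illustrative only)."""

    def __init__(self, basket_size):
        self.basket_size = basket_size
        self.filling = {}             # ivolume -> partially filled basket
        self.transportable = deque()  # full baskets, ready for worker threads

    def push_collection(self, crossing_tracks):
        """Loop over (itrack, ivolume) pairs and push each track to its basket."""
        for itrack, ivolume in crossing_tracks:
            basket = self.filling.setdefault(ivolume, [])
            basket.append(itrack)
            if len(basket) == self.basket_size:
                # Basket is full: hand it over and start a fresh one later.
                self.transportable.append((ivolume, basket))
                del self.filling[ivolume]

    def garbage_collect(self):
        """Flush all partially filled baskets so that workers do not starve."""
        for ivolume in sorted(self.filling):
            self.transportable.append((ivolume, self.filling[ivolume]))
        self.filling.clear()


d = Dispatcher(basket_size=2)
d.push_collection([(0, 5), (1, 5), (2, 7)])
print(list(d.transportable))  # prints [(5, [0, 1])]
d.garbage_collect()
print(list(d.transportable))  # prints [(5, [0, 1]), (7, [2])]
```

In a threaded setup, `transportable` would be a concurrent queue shared with the workers, and garbage collection would run only when too few full baskets are available.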

Evolution of populations
[Plot annotation: Flush events]

Preliminary benchmarks
Benchmarking 10+1 threads on a 12-core Xeon
HT mode
Excellent CPU usage
Locks and waits: some overhead due to transitions coming from exchanging baskets via concurrent queues
Event re-injection will improve the speed-up