The High Performance Simulation Project Status and short term plans 17 th April 2013 Federico Carminati.

Slides:

Advertisements

Similar presentations

© 2007 Eaton Corporation. All rights reserved. LabVIEW State Machine Architectures Presented By Scott Sirrine Eaton Corporation.

Advertisements

A Block-structured Heap Simplifies Parallel GC Simon Marlow (Microsoft Research) Roshan James (U. Indiana) Tim Harris (Microsoft Research) Simon Peyton.

Data Dependencies Describes the normal situation that the data that instructions use depend upon the data created by other instructions, or data is stored.

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Hikmet Aras

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.

Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.

Concurrency Important and difficult (Ada slides copied from Ed Schonberg)

Silberschatz, Galvin and Gagne ©2013 Operating System Concepts – 9 th Edition Chapter 5: Process Synchronization.

Impact of parallelism on HEP software April 29 th 2013 Ecole Polytechnique/LLR Rene Brun.

Study of Hurricane and Tornado Operating Systems By Shubhanan Bakre.

Silberschatz, Galvin and Gagne  2002 Modified for CSCI 399, Royden, Operating System Concepts Operating Systems Lecture 19 Scheduling IV.

Chapter 2: Computer-System Structures

Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.

OS2-1 Chapter 2 Computer System Structures. OS2-2 Outlines Computer System Operation I/O Structure Storage Structure Storage Hierarchy Hardware Protection.

CPSC-608 Database Systems Fall 2008 Instructor: Jianer Chen Office: HRBB 309B Phone: Notes #6.

Chapter 17 Parallel Processing.

1 New Architectures Need New Languages A triumph of optimism over experience! Ian Watson 3 rd July 2009.

CMS Full Simulation for Run-2 M. Hildrith, V. Ivanchenko, D. Lange CHEP'15 1.

Advanced Operating Systems CIS 720 Lecture 1. Instructor Dr. Gurdip Singh – 234 Nichols Hall –

1 Lecture 4: Threads Operating System Fall Contents Overview: Processes & Threads Benefits of Threads Thread State and Operations User Thread.

Nachos Phase 1 Code -Hints and Comments

Prototyping particle transport towards GEANT5 A. Gheata 27 November 2012 Fourth International Workshop for Future Challenges in Tracking and Trigger Concepts.

Requirements for a Next Generation Framework: ATLAS Experience S. Kama, J. Baines, T. Bold, P. Calafiura, W. Lampl, C. Leggett, D. Malon, G. Stewart, B.

Status of the vector transport prototype Andrei Gheata 12/12/12.

ALICE Upgrade for Run3: Computing HL-LHC Trigger, Online and Offline Computing Working Group Topical Workshop Sep 5 th 2014.

0 Status of Shower Parameterisation code in Athena Andrea Dell’Acqua CERN PH-SFT.

How to Build a CPU Cache COMP25212 – Lecture 2. Learning Objectives To understand: –how cache is logically structured –how cache operates CPU reads CPU.

9 February 2000CHEP2000 Paper 3681 CDF Data Handling: Resource Management and Tests E.Buckley-Geer, S.Lammel, F.Ratnikov, T.Watts Hardware and Resources.

Detector Simulation on Modern Processors Vectorization of Physics Models Philippe Canal, Soon Yung Jun (FNAL) John Apostolakis, Mihaly Novak, Sandro Wenzel.

ADAPTATIVE TRACK SCHEDULING TO OPTIMIZE CONCURRENCY AND VECTORIZATION IN GEANTV J Apostolakis, M Bandieramonte, G Bitzes, R Brun, P Canal, F Carminati,

FREERIDE: System Support for High Performance Data Mining Ruoming Jin Leo Glimcher Xuan Zhang Ge Yang Gagan Agrawal Department of Computer and Information.

QCAdesigner – CUDA HPPS project

EPICS Release 3.15 Bob Dalesio May 19, Features for 3.15 Support for large arrays - done for rsrv in 3.14 Channel access priorities - planned to.

Precomputation- based Prefetching By James Schatz and Bashar Gharaibeh.

Introduction What is detector simulation? A detector simulation program must provide the possibility of describing accurately an experimental setup (both.

ATLAS Meeting CERN, 17 October 2011 P. Mato, CERN.

PARTICLE TRANSPORT REFLECTING ON THE NEXT STEP R.BRUN, F.CARMINATI, A.GHEATA 1.

Parallel and Distributed Simulation Time Parallel Simulation.

CBM ECAL simulation status Prokudin Mikhail ITEP.

COMP SYSTEM ARCHITECTURE HOW TO BUILD A CACHE Antoniu Pop COMP25212 – Lecture 2Jan/Feb 2015.

1 Becoming More Effective with C++ … Day Two Stanley B. Lippman

Queue Manager and Scheduler on Intel IXP John DeHart Amy Freestone Fred Kuhns Sailesh Kumar.

Geant4 on GPU prototype Nicholas Henderson (Stanford Univ. / ICME)

4330/6310 FIRST ASSIGNMENT Spring 2015 Jehan-François Pâris

Lecture 27 Multiprocessor Scheduling. Last lecture: VMM Two old problems: CPU virtualization and memory virtualization I/O virtualization Today Issues.

Update on G5 prototype Andrei Gheata Computing Upgrade Weekly Meeting 26 June 2012.

Preliminary Ideas for a New Project Proposal.  Motivation  Vision  More details  Impact for Geant4  Project and Timeline P. Mato/CERN 2.

Observing the Current System Benefits Can see how the system actually works in practice Can ask people to explain what they are doing – to gain a clear.

Embedded Computer - Definition When a microcomputer is part of a larger product, it is said to be an embedded computer. The embedded computer retrieves.

® July 21, 2004GC Summer School1 Cycles to Recycle: Copy GC Without Stopping the World The Sapphire Collector Richard L. Hudson J. Eliot B. Moss Originally.

1/13 Future computing for particle physics, June 2011, Edinburgh A GPU-based Kalman filter for ATLAS Level 2 Trigger Dmitry Emeliyanov Particle Physics.

The DCS Databases Peter Chochula. 31/05/2005Peter Chochula 2 Outline PVSS basics (boring topic but useful if one wants to understand the DCS data flow)

STAR Simulation. Status and plans V. Perevoztchikov Brookhaven National Laboratory,USA.

Copyright © 2005 – Curt Hill MicroProgramming Programming at a different level.

Detector SimOOlation activities in ATLAS A.Dell’Acqua CERN-EP/ATC May 19th, 1999.

Follow-up to SFT Review (2009/2010) Priorities and Organization for 2011 and 2012.

Report on Vector Prototype J.Apostolakis, R.Brun, F.Carminati, A. Gheata 10 September 2012.

Buffering Techniques Greg Stitt ECE Department University of Florida.

GeantV prototype at a glance A.Gheata Simulation weekly meeting July 8, 2014.

16 th Geant4 Collaboration Meeting SLAC, September 2011 P. Mato, CERN.

GeantV – Adapting simulation to modern hardware Classical simulation Flexible, but limited adaptability towards the full potential of current & future.

Scheduler overview status & issues

A task-based implementation for GeantV

Parallel and Distributed Simulation Techniques

Issues with Simulating High Luminosities in ATLAS

Report on Vector Prototype

Operating Systems.

COMP60611 Fundamentals of Parallel and Distributed Systems

Lecture 12 Input/Output (programmer view)

Presentation transcript:

The High Performance Simulation Project Status and short term plans 17 th April 2013 Federico Carminati

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Where are we now? Present status  Several investigations of possible alternatives for “extremely parallel – no lock” transport  Not much code written, several blackboards full  Some investigation on a simplified but fully vectorized model to prove vectorization gain  New design in preparation

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Major points under discussion  How to minimise locks and maximise local handling of particles  How to handle hit and digit structures  How to preserve the history of the particles This point seems more difficult at the moment and it requires more design  What is the possible speedup obtained by micro- parallelisation  What are the bottlenecks and opportunities with parallel I/O

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Dispatcher thread Thread local 4 Current design Logical Volume Logical Volume basket p array p* array Transport Output particle store p* array p p*

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Features  Pros Good parallel performance but… Easy recording of particle history Limited data movement  Cons Possible limited scalability with large number of cores Non locality of particle in memory Difficult to introduce hits and digits maintaining locality 5

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s 6 Design under study Input particle list Output particle list p array Hits p array History List of logical Volumes List of baskets for lv Active event list Sensitive volumes Digits for lv and event ev Logical Volume lv List of active events for lv Event ev

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s 7 Design under study Input particle list Output particle list p array Hits p array History List of logical Volumes List of baskets for lv Active event list Sensitive volumes Digits for lv and event ev Logical Volume lv List of active events for lv Event ev Transport thread Digitizer thread Ev build thread Events

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s 8 Design under study Input particle list Output particle list p array Hits p array History List of logical Volumes List of baskets for lv Active event list Sensitive volumes Digits for lv and event ev Logical Volume lv List of active events for lv Event ev Continuously rotated Flushed at the end of event

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Features  Pros Excellent potential locality Easy to introduce hits and digits  Cons One more copy (but it is done in parallel) More difficult to preserve particle history (it is non-local!) and introduce particle pruning 9 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s

Processing flow I  The transport thread takes particles from the input buffer and transports them till they stop, interact or exit from the volume At this point they are inserted in the output particle buffer for further processing If the LV is a sensitive detector, hits are generated and stored per LV basket A LV basked history record is kept (we have no idea how for the moment, we need more blackboard work!)  Input and output particle buffers are fixed size structures, which can however evolve (be optimised) during simulation 10 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s

11 Design under study Input particle list Output particle list p array Hits p array History List of logical Volumes List of baskets for lv Active event list Sensitive volumes Digits for lv and event ev Logical Volume lv List of active events for lv Event ev ✔ full! ✗ empty!

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Processing flow II  When an input particle buffer is exhausted It is marked as such by the transport thread No lock if its kept assigned to the LV basket, but possible memory waste Can be passed to a queue of “used baskets”, but this implies a lock In case of a flag, the dispatcher thread has to scan all LV->all basked->all output buffers to know which ones used, but this can be optmized  Used buffers are scanned by the dispatcher thread that updates a global track counter per event -1 for each stopped “dead” particle  And then they are declared “empty” to be reused  The transport thread picks up another “ready” basket 12 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s

Processing flow III  When an output particle buffer is full, it is marked as such Again queue insertion or just a flag In case of a flag, the dispatcher thread has to scan all LV->all basked->all output buffers to know which ones are full, but this can be optmized  The transport thread picks another empty output buffer  The dispatcher thread copies particles from the full output particle buffer to LV-specific input particle buffers Increasing the global particle event counter  When an input particle buffer is full, the dispatcher declares it “ready to be transported” 13 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s

14 Design under study Input particle list Output particle list p array List of logical Volumes List of baskets for lv Logical Volume lv Empty buffer list Full buffer list Ready buffer list

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Processing flow IV  Note an important point The LV basket structure has input and output particle buffers and hits and history buffers  Input and output particle buffers are Multi-event Volatile, they get emptied and filled during transport of a single event  Hits and history buffers are Per event Permanent during the transport of a single event A basket of a LV can be handled by different threads successively, each one with a new input and output buffers …but all these threads will add to the Hits and history data structure till the event is flushed 15 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s

Processing flow V  When an event is finished, the digitizer thread kicks in and scans all the hits in all the baskets of all the LVs and digitise them, inserting them in the LV event->digit structure  When this is over, the event is built into the event structure (to be designed!) by the event builder thread  After that, the history for this event is assembled by the same thread  Then the event is output 16 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s

Questions?  How many dispatcher, digitizer and event-builder threads? Difficult to say, we need some more quantitative design work Measurements with G4 simulations could help  Transport thread numbers will have to adapt to the size of simulation and of the detector In ATLAS for instance 50% of the time is spent in 0.75% of the volumes Threads could be distributed proportionally to the time spent in the different LVs 17 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s

Simple observation: HEP transport is mostly local ! Locality not exploited by the classical transportation approach Existing code very inefficient ( IPC) Cache misses due to fragmented code 50 per cent of the time spent in 50/7100 volumes

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Questions?  What about memory?  Fortunately we do not have “that many” LVs 19 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s DetectorPhysical volumesLogical volumes ALICE4,354,7354,764 ATLAS29,046,9667,143 CMS1,166,3181,537 LHCb18,491,756709

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Grand strategy 20 Simulation job Create vectors Basic algorithms Use vectors We are concentrating here But we should look also here

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s 21 Short term tasks Continue the design work – essential before any more substantial implementation  This is the most important task at the moment  We have to evaluate the potential bottlenecks before starting the implementation Implement the new design and evaluate it against the first Demonstrate speedup of some chosen geometry routines  Both on x86 CPUs and GPUs Demonstrate speedup of some chosen physics methods  Particularly in the EM domain

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s 22 Possible timeline Summer 2013  Implement a prototype according to the present design  Get essential numbers from G4 (to be defined!) Total particle in a shower, profile of development of a shower in terms of multiplicity, locality of transport ecc ecc.  Vectorize, GPU-ize, Phi-ize at least three geometry classes (simple, intermediate, hard)  Vectorize, GPU-ize, Phi-ize at least a couple of EM simplified methods (from G4?) Fall 2013  Interface the methods above to the prototype to realise a first protype of vectorized transport

SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Thank you!