The High Performance Simulation Project Status and short term plans 17 th April 2013 Federico Carminati
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Where are we now? Present status Several investigations of possible alternatives for “extremely parallel – no lock” transport Not much code written, several blackboards full Some investigation on a simplified but fully vectorized model to prove vectorization gain New design in preparation
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Major points under discussion How to minimise locks and maximise local handling of particles How to handle hit and digit structures How to preserve the history of the particles This point seems more difficult at the moment and it requires more design What is the possible speedup obtained by micro- parallelisation What are the bottlenecks and opportunities with parallel I/O
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Dispatcher thread Thread local 4 Current design Logical Volume Logical Volume basket p array p* array Transport Output particle store p* array p p*
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Features Pros Good parallel performance but… Easy recording of particle history Limited data movement Cons Possible limited scalability with large number of cores Non locality of particle in memory Difficult to introduce hits and digits maintaining locality 5
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s 6 Design under study Input particle list Output particle list p array Hits p array History List of logical Volumes List of baskets for lv Active event list Sensitive volumes Digits for lv and event ev Logical Volume lv List of active events for lv Event ev
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s 7 Design under study Input particle list Output particle list p array Hits p array History List of logical Volumes List of baskets for lv Active event list Sensitive volumes Digits for lv and event ev Logical Volume lv List of active events for lv Event ev Transport thread Digitizer thread Ev build thread Events
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s 8 Design under study Input particle list Output particle list p array Hits p array History List of logical Volumes List of baskets for lv Active event list Sensitive volumes Digits for lv and event ev Logical Volume lv List of active events for lv Event ev Continuously rotated Flushed at the end of event
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Features Pros Excellent potential locality Easy to introduce hits and digits Cons One more copy (but it is done in parallel) More difficult to preserve particle history (it is non-local!) and introduce particle pruning 9 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s
Processing flow I The transport thread takes particles from the input buffer and transports them till they stop, interact or exit from the volume At this point they are inserted in the output particle buffer for further processing If the LV is a sensitive detector, hits are generated and stored per LV basket A LV basked history record is kept (we have no idea how for the moment, we need more blackboard work!) Input and output particle buffers are fixed size structures, which can however evolve (be optimised) during simulation 10 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s
11 Design under study Input particle list Output particle list p array Hits p array History List of logical Volumes List of baskets for lv Active event list Sensitive volumes Digits for lv and event ev Logical Volume lv List of active events for lv Event ev ✔ full! ✗ empty!
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Processing flow II When an input particle buffer is exhausted It is marked as such by the transport thread No lock if its kept assigned to the LV basket, but possible memory waste Can be passed to a queue of “used baskets”, but this implies a lock In case of a flag, the dispatcher thread has to scan all LV->all basked->all output buffers to know which ones used, but this can be optmized Used buffers are scanned by the dispatcher thread that updates a global track counter per event -1 for each stopped “dead” particle And then they are declared “empty” to be reused The transport thread picks up another “ready” basket 12 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s
Processing flow III When an output particle buffer is full, it is marked as such Again queue insertion or just a flag In case of a flag, the dispatcher thread has to scan all LV->all basked->all output buffers to know which ones are full, but this can be optmized The transport thread picks another empty output buffer The dispatcher thread copies particles from the full output particle buffer to LV-specific input particle buffers Increasing the global particle event counter When an input particle buffer is full, the dispatcher declares it “ready to be transported” 13 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s
14 Design under study Input particle list Output particle list p array List of logical Volumes List of baskets for lv Logical Volume lv Empty buffer list Full buffer list Ready buffer list
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Processing flow IV Note an important point The LV basket structure has input and output particle buffers and hits and history buffers Input and output particle buffers are Multi-event Volatile, they get emptied and filled during transport of a single event Hits and history buffers are Per event Permanent during the transport of a single event A basket of a LV can be handled by different threads successively, each one with a new input and output buffers …but all these threads will add to the Hits and history data structure till the event is flushed 15 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s
Processing flow V When an event is finished, the digitizer thread kicks in and scans all the hits in all the baskets of all the LVs and digitise them, inserting them in the LV event->digit structure When this is over, the event is built into the event structure (to be designed!) by the event builder thread After that, the history for this event is assembled by the same thread Then the event is output 16 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s
Questions? How many dispatcher, digitizer and event-builder threads? Difficult to say, we need some more quantitative design work Measurements with G4 simulations could help Transport thread numbers will have to adapt to the size of simulation and of the detector In ATLAS for instance 50% of the time is spent in 0.75% of the volumes Threads could be distributed proportionally to the time spent in the different LVs 17 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s
Simple observation: HEP transport is mostly local ! Locality not exploited by the classical transportation approach Existing code very inefficient ( IPC) Cache misses due to fragmented code 50 per cent of the time spent in 50/7100 volumes
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Questions? What about memory? Fortunately we do not have “that many” LVs 19 SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s DetectorPhysical volumesLogical volumes ALICE4,354,7354,764 ATLAS29,046,9667,143 CMS1,166,3181,537 LHCb18,491,756709
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Grand strategy 20 Simulation job Create vectors Basic algorithms Use vectors We are concentrating here But we should look also here
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s 21 Short term tasks Continue the design work – essential before any more substantial implementation This is the most important task at the moment We have to evaluate the potential bottlenecks before starting the implementation Implement the new design and evaluate it against the first Demonstrate speedup of some chosen geometry routines Both on x86 CPUs and GPUs Demonstrate speedup of some chosen physics methods Particularly in the EM domain
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s 22 Possible timeline Summer 2013 Implement a prototype according to the present design Get essential numbers from G4 (to be defined!) Total particle in a shower, profile of development of a shower in terms of multiplicity, locality of transport ecc ecc. Vectorize, GPU-ize, Phi-ize at least three geometry classes (simple, intermediate, hard) Vectorize, GPU-ize, Phi-ize at least a couple of EM simplified methods (from G4?) Fall 2013 Interface the methods above to the prototype to realise a first protype of vectorized transport
SFT S o F T w a r e D e v e l o p m e n t f o r E x p e r i m e n t s Thank you!