
GeantV – status and plan
A. Gheata for the GeantV team

Outline
• GeantV concept
• Milestones
• Concurrency & performance
• Vector optimisations
• Coprocessors
• Geometry & physics
• Conclusions

GeantV concept
[Schematic: a scheduler with a re-basketizer dispatches baskets of tracks (MIMD dispatching) to the geometry navigator/geometry algorithms and to physics (x-sections, reactions), which process them as vectors (SIMD).]

GeantV scheduler data flow
[Diagram: tracks from the input queue pass through per-volume geometry filters (Vol 1 … Vol n) and (fast)physics filters (e+/e-, γ) that fill baskets; the vector stepper (step sampling, step limiter, neutral-track filter, (field) propagator, reshuffle) feeds the VecGeom navigator, which works on either the full or a simplified geometry; the physics sampler applies the post-step physics process and sends secondaries back to the scheduler; baskets can also be routed to the coprocessor broker; the re-basketizing loop runs at ~10^5 baskets/sec.]

Project milestones
• CMS full simulation (Xeon): end Q4 2015
  – Realistic scoring / VecGeom geometry (vector mode) / tabulated physics
  – Performance goal vs. the standard approach: 2x-3x
• Co-processor mode (GPU, Phi) functional: end Q4 2015
  – Kernel offload doing part of the processing + performance optimizations
• GeantV simulation on Phi in native mode (KNL): prototype by end Q2 2016
• Performance optimisations by end Q4 2016
  – Geometry navigation optimizations + full support for missing features (e.g. divisions, solids)
  – Showcase: full CMS geometry ray-tracing on MIC at SC'15
  – Support for geometries other than CMS: 2016
• New vector-aware physics models (continuous integration)
  – Finalizing interfaces, smooth transition from tabulated Xsec to vectorized physics
  – By end 2016: new MSC, gammas, electrons

Fine-grain parallelism is challenging
• Test setup: 24-core dual-socket Xeon E5 (IVB), 2 NUMA nodes
• Several parameters have to be tuned
• Performance is monitored, which allows detecting and fixing bottlenecks
• The Amdahl (serial) fraction is still high due to re-basketizing operations
[Plot: scheduler performance for CMS2015.root, optimised against several parameters; a lock-free algorithm (memory polling) is compared with one using spinlocks; re-basketizing dominates the serial fraction.]

Lock-free basketizing
• Operations accounted by atomics
  – "copied", "copying", "booked"
  – Snapshot the counters, check the conditions, and validate at the end that the resource was not stolen (Ibook == Ibook0)
• "Try/retry" policy for acquiring ownership of basket operations
  – "replace" and "replace & dispatch"
• C++11 compare_exchange is used for basket stealing
  – The first thread (and any subsequent one) unable to book a slot in range tries to replace the basket and book into the next one: "weak" = try, and succeed if lucky
  – The first thread (and any subsequent one) that finishes copying at or beyond the last slot tries to replace and dispatch: "strong" = failing means someone else already did it
• The garbage collector can force a replace & dispatch at any time
• No dispatch before pending copy operations finish, so dispatched baskets are "clean"
[Diagram: threads T1 … Tn book slots in the current basket via the counters Ibook and Ibook0 (Ibook0 += threshold); slots progress through booking/booked/copying/copied; a "weak" replace swaps in a replacement basket, a "strong" replace pushes the full basket (Ncopied >= threshold) to the work queue for transport; a basket is an SOA of tracks.]
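
A minimal sketch of this try/retry booking scheme in C++11; the Basket structure and its member names are illustrative assumptions, not the actual GeantV basketizer code.

```cpp
#include <atomic>
#include <cstddef>

// Illustrative sketch of lock-free slot booking (hypothetical names).
struct Basket {
  std::atomic<std::size_t> fBooked{0};  // slots claimed by writer threads
  std::atomic<std::size_t> fCopied{0};  // slots whose track copy completed
  std::size_t fThreshold;               // capacity triggering dispatch

  explicit Basket(std::size_t threshold) : fThreshold(threshold) {}

  // Try to claim one slot ("weak" path): on failure the caller attempts
  // to replace the basket and book into its successor.
  bool BookSlot(std::size_t &slot) {
    std::size_t booked = fBooked.load(std::memory_order_acquire);
    do {
      if (booked >= fThreshold) return false;  // out of range: basket full
    } while (!fBooked.compare_exchange_weak(booked, booked + 1,
                                            std::memory_order_acq_rel));
    slot = booked;
    return true;
  }

  // Mark the copy into a slot as done; the thread completing the last
  // slot returns true and performs the "strong" replace & dispatch.
  bool FinishCopy() {
    std::size_t copied = fCopied.fetch_add(1, std::memory_order_acq_rel) + 1;
    return copied == fThreshold;  // last copier dispatches a clean basket
  }
};
```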

Contention sources
• Physics: low profile for now, will go up with real physics models
• Geometry + propagator: the main consumer, will balance with physics in the future
• Basketizer: the main Amdahl (serial) source
• Reshuffling (filling holes created in the vector container): constant overhead

Vector policies
• Track fluence is not uniform across geometry volumes
  – Most volumes have to accumulate tracks for a long time to reach the basket threshold
  – Many are never entered by transport…
• Basketizing is an overhead that has to be compensated by vectorization and locality
  – Need the ability to transport in both vector and scalar mode
  – Basketize only a subset of volumes
• Event tails are long, so transport is prioritized per event
  – Triggered by the ratio of tracks in flight to the maximum in flight
  – The remaining tracks are transported with priority in scalar mode in a MIXED basket
• Depending on the cuts, most steps in some volumes are forced by physics
  – Tracks remaining in the same volume are kept in the same basket

Learning dynamically which volumes are "important"
• Learning converges fast:
  === Learning phase of steps completed ===
  Activated 528/4156 volumes accounting for 90.0% of track steps
• CMS simulation: sort the distribution of steps per volume and cut out the tail based on the cumulative integral (see the sketch below)
[Plot: steps-per-volume distribution; the dominant annotated contribution is shielding (yes, cuts by region are not yet available).]
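
A sketch of this tail cut, assuming the learning phase only yields a per-volume step count; the function and variable names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: activate (i.e. basketize) the smallest set of
// volumes that together account for `fraction` of all observed steps.
std::vector<std::size_t> ActivateVolumes(
    const std::vector<long> &stepsPerVolume, double fraction = 0.9) {
  std::vector<std::size_t> index(stepsPerVolume.size());
  for (std::size_t i = 0; i < index.size(); ++i) index[i] = i;
  // Sort volume indices by decreasing step count.
  std::sort(index.begin(), index.end(), [&](std::size_t a, std::size_t b) {
    return stepsPerVolume[a] > stepsPerVolume[b];
  });
  long total = 0;
  for (long n : stepsPerVolume) total += n;
  std::vector<std::size_t> activated;
  long cumul = 0;
  for (std::size_t i : index) {
    if (cumul >= fraction * total) break;  // tail cut on the integral
    cumul += stepsPerVolume[i];
    activated.push_back(i);  // this volume gets dedicated baskets
  }
  return activated;
}
```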

"Prioritizing" algorithm
• Set a priority threshold: "in flight" (magenta) / "max in flight" (blue) < 0.1 triggers priority transport
• Tracks from prioritized events are copied to thread-local mixed baskets
• Transported in scalar mode
[Plot: tracks in flight (magenta) vs. maximum in flight (blue) per event, with log excerpt:]
  ### Event 1 prioritized at 10 % threshold
  === Garbage collect: 367 baskets
  === Garbage collect: 0 baskets
  Event 0: tracks transported, max in flight
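
The trigger condition itself is simple; a minimal sketch under the assumption that the scheduler keeps per-event in-flight counters (hypothetical names):

```cpp
#include <cstddef>

// Hypothetical sketch: an event enters the priority regime when fewer
// than 10% of its peak in-flight tracks remain; its leftover tracks are
// then copied to thread-local MIXED baskets and transported in scalar mode.
inline bool EnterPriorityRegime(std::size_t inFlight,
                                std::size_t maxInFlight,
                                double threshold = 0.1) {
  return static_cast<double>(inFlight) < threshold * maxInFlight;
}
```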

Data locality
• Main data structures:
  – Geometry ~ O(10 MB)
  – Physics Xsec ~ O(500 MB)
  – Baskets ~ O(1 GB) or more
• Data (baskets) gets flushed too fast from the caches due to non-local access
• Ongoing work:
  – Pin threads to NUMA nodes and enforce basket allocation with numa_alloc_onnode()
  – The NUMA allocator is slow, so big chunks need to be allocated once and re-used (see the sketch below)
  – Re-factoring the basket manager to favor filling from threads on the same NUMA node
  – Minimize exchanges across NUMA boundaries to preserve cache coherency
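
A sketch of the chunked re-use idea; libnuma's numa_alloc_onnode()/numa_free() are the real API, while the pool class, its names, and the (omitted) alignment handling are illustrative assumptions.

```cpp
#include <numa.h>  // link with -lnuma; check numa_available() first
#include <cstddef>
#include <vector>

// Hypothetical chunked NUMA pool: numa_alloc_onnode() is expensive, so
// large blocks are allocated once per node and baskets are carved out
// of them (real code would also handle alignment and thread safety).
class NumaChunkPool {
  int fNode;                    // NUMA node the client threads are pinned to
  std::size_t fChunkSize;       // size of each big block
  std::vector<void *> fChunks;  // blocks obtained from libnuma
  char *fCursor = nullptr;      // bump pointer inside the current block
  std::size_t fLeft = 0;        // bytes left in the current block

public:
  NumaChunkPool(int node, std::size_t chunkSize)
      : fNode(node), fChunkSize(chunkSize) {}

  void *Allocate(std::size_t size) {
    if (size > fLeft) {
      // One expensive libnuma call, amortized over many basket allocations.
      void *chunk = numa_alloc_onnode(fChunkSize, fNode);
      if (!chunk) return nullptr;
      fChunks.push_back(chunk);
      fCursor = static_cast<char *>(chunk);
      fLeft = fChunkSize;
    }
    void *ptr = fCursor;
    fCursor += size;
    fLeft -= size;
    return ptr;
  }

  ~NumaChunkPool() {
    for (void *c : fChunks) numa_free(c, fChunkSize);
  }
};
```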

I/O performance
• Transport threads fill hit blocks concurrently
• A single thread writes the blocks to a ROOT tree
• Job time is bound by the maximum throughput of a single thread writing to file (disk throughput ~350 MB/s); transport itself finishes much faster
• Without the I/O limitation, throughput would be much higher
• Need to parallelize the work up to the disk serialization
  – Split the API into "prepare buffer" and "write to device/socket"? (see the sketch below)
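
A sketch of that split, with many producers serializing hit blocks into memory buffers and a single consumer doing only the raw write; the class and its interface are hypothetical, not an existing GeantV or ROOT API.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical sketch: transport threads do the CPU-heavy serialization
// ("prepare buffer") in parallel; one writer thread only performs the
// sequential device write, which stays the sole serial part.
class HitWriter {
  std::queue<std::vector<char>> fQueue;  // prepared, ready-to-write buffers
  std::mutex fMutex;
  std::condition_variable fCond;
  bool fDone = false;

public:
  // Called concurrently by transport threads with a serialized hit block.
  void Push(std::vector<char> buffer) {
    std::lock_guard<std::mutex> lock(fMutex);
    fQueue.push(std::move(buffer));
    fCond.notify_one();
  }

  // Signal the writer thread that no more buffers will arrive.
  void Finish() {
    std::lock_guard<std::mutex> lock(fMutex);
    fDone = true;
    fCond.notify_one();
  }

  // Run by the single writer thread ("write to device/socket").
  void WriterLoop(std::FILE *f) {
    for (;;) {
      std::unique_lock<std::mutex> lock(fMutex);
      fCond.wait(lock, [&] { return !fQueue.empty() || fDone; });
      if (fQueue.empty()) return;  // done and drained
      std::vector<char> buf = std::move(fQueue.front());
      fQueue.pop();
      lock.unlock();
      std::fwrite(buf.data(), 1, buf.size(), f);
    }
  }
};
```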

Large-scale exercise (CMS Ecal)
• Full CMS2015 geometry with 1 MeV cuts
• Comparison (tabulated physics, single thread):
  – GeantV scheduler with TGeo/VecGeom geometry
  – Geant4 with the equivalent G4 geometry
• Make sure that transport behaves correctly and does the same job
• Ongoing work to reach realistic cuts/simulation
• Some preliminary profiling already done: many fewer instruction cache misses and a better CPI

Coprocessor broker
• Task: feed baskets to the coprocessor and copy the transported baskets back
• 2 threads working part time to carry data back and forth
• One thread assembles special baskets, generally larger, vector-aware or not (Phi or GPU)
• First implementation ready, now in the debugging phase
  – Offloads the full step computation to the GPU
• What can we offload to the Phi?
  – Geometry: full CMS, can run a basic ray-tracing task
  – Scheduling GeantV baskets requires removing some library dependencies

Geometry performance
• Real time for the simulation of 10 pp events at 7 TeV using the new VecGeom package instead of the existing ROOT geometry
• More details on geometry performance in the talk by S. Wenzel

Physics: ongoing work
• Several ongoing activities:
  – Vectorized RNG
  – Vectorized field propagator
  – Interfaces & management
  – Physics models (Compton, gamma, e)
  – Redesign of TabXsec to adopt physics processes
• Discussions/work on physics-related subjects planned for next week's GeantV workshop

Conclusions
• A very fruitful year, with a lot of progress in several areas
• Fast approaching our major milestone: running transport in a full LHC experiment in vector mode
• Performance bottlenecks and opportunities are better understood
• A dense work plan ahead
• New on the horizon: coprocessors and vector physics