GeantV – From CPU to accelerators
Andrei Gheata (CERN) for the GeantV development team
G.Amadio (UNESP), A.Ananya (CERN), J.Apostolakis (CERN), A.Arora (CERN), M.Bandieramonte (CERN), A.Bhattacharyya (BARC), C.Bianchini (UNESP), R.Brun (CERN), P.Canal (FNAL), F.Carminati (CERN), L.Duhem (Intel), D.Elvira (FNAL), A.Gheata (CERN), M.Gheata (CERN), I.Goulas (CERN), R.Iope (UNESP), S.Y.Jun (FNAL), G.Lima (FNAL), A.Mohanty (BARC), T.Nikitina (CERN), M.Novak (CERN), W.Pokorski (CERN), A.Ribon (CERN), R.Sehgal (BARC), O.Shadura (CERN), S.Vallecorsa (CERN), S.Wenzel (CERN), Y.Zhang (CERN)
17th International Workshop on Advanced Computing and Analysis Techniques in Physics Research, January 18 – 22, 2016, UTFSM, Valparaíso (Chile)

Outline
• GeantV – an introduction
• The problem
• Challenges, ideas, goals
• Main components and performance
• Vectorization: overheads vs. gains
• Geometry and physics performance: CPU and accelerators
• Performance benchmarks on Intel® Xeon Phi (KNC)
• Harnessing the Phi: the X-Ray benchmark
• Results, milestones, plans

The problem
• Detailed simulation of subatomic particle transport and interactions in detector geometries
• Using state-of-the-art physics models and propagation in fields, in geometries with complexities of millions of parts
• Heavy computation requirements, massively CPU-bound, naturally calling for HPC solutions
• The LHC uses more than 50% of its distributed GRID power for detector simulations (~ CPU years equivalent so far)

The ideas
• Transport particles in groups (vectors) rather than one by one
• Group particles by geometry volume or same physics
• No free lunch: data gathering overheads < vector gains
• Dispatch SoA to functions with vector signatures
• Use backends to abstract the interface: vector, scalar
• Use backends to insulate the technology/library: Vc, Cilk+, VecMic, …
• Redesign the library and workflow to target fine-grain parallelism: CPU, GPU, Phi, Atom, …
• Aim for 3x-5x faster code, understand the hard limits for more

Single kernel with scalar and vector interfaces (Vc backend: code.compeng.uni-frankfurt.de/projects/vc):

    // Single kernel algorithm written once with Backend types
    template <typename Backend>
    typename Backend::double_t common_distance_function(typename Backend::double_t input)
    {
      // Single kernel algorithm using Backend types
    }

    // Vector interface
    struct VectorVcBackend {
      typedef Vc::double_v double_t;
      typedef Vc::double_m bool_t;
      static const bool IsScalar = false;
      static const bool IsSIMD   = true;
    };

    // Scalar interface
    struct ScalarBackend {
      typedef double double_t;
      typedef bool   bool_t;
      static const bool IsScalar = true;
      static const bool IsSIMD   = false;
    };

    // Functions operating with backend types
    distance(vector_type &);   // vector signature
    distance(double &);        // scalar signature
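As an illustration of the backend pattern sketched above, here is a minimal, self-contained example in the same spirit (an assumption-laden sketch, not the actual GeantV/VecGeom code): the kernel is written once against Backend types and instantiated both for a scalar backend and for a toy 4-lane vector type standing in for a real SIMD library such as Vc.

    #include <array>
    #include <cmath>
    #include <cstdio>

    // Toy 4-lane "SIMD" type standing in for Vc::double_v (illustration only).
    struct double4 {
      std::array<double, 4> v{};
      double4() = default;
      explicit double4(double x) { v.fill(x); }
      friend double4 operator-(const double4 &a, const double4 &b) {
        double4 r;
        for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] - b.v[i];
        return r;
      }
      friend double4 abs(const double4 &a) {
        double4 r;
        for (int i = 0; i < 4; ++i) r.v[i] = std::fabs(a.v[i]);
        return r;
      }
    };

    struct ScalarBackend  { typedef double  double_t; static const bool IsSIMD = false; };
    struct Vector4Backend { typedef double4 double_t; static const bool IsSIMD = true;  };

    // Single kernel written once with backend types: trivial "distance to the plane z = z0".
    template <typename Backend>
    typename Backend::double_t DistanceToPlane(typename Backend::double_t z,
                                               typename Backend::double_t z0) {
      using std::abs;   // scalar case uses std::abs; the vector case finds abs(double4) via ADL
      return abs(z - z0);
    }

    int main() {
      // Scalar dispatch: one particle per call.
      double d1 = DistanceToPlane<ScalarBackend>(1.5, 4.0);
      // Vector dispatch: four particles per call (SoA-gathered input).
      double4 z(0.0);
      z.v = {0.5, 1.0, 2.0, 3.0};
      double4 d4 = DistanceToPlane<Vector4Backend>(z, double4(4.0));
      std::printf("scalar: %g  vector lane 0: %g\n", d1, d4.v[0]);
      return 0;
    }

With a real backend only the typedefs change (e.g. Vc::double_v), which is how a single algorithm source can serve scalar, SIMD and accelerator builds.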

GeantV scheduler data flow (diagram): track baskets are output back to the scheduler, at a rate of ~10^5 baskets/sec.

Geometry performance (Phi vs. Xeon)
• Geometry is 30-40% of the total CPU time in Geant4
• A library of vectorized geometry algorithms to take maximum advantage of SIMD architectures
• Substantial performance gains also in scalar mode
• Testing the same on GPU

Overall performance for a simplified detector vs. scalar ROOT (speedup per architecture):

    Architecture               16 particles   1024 particles   SIMD max
    Intel Ivy-Bridge (AVX)     ~2.8x          ~4x              4x
    Intel Haswell (AVX2)       ~3x            ~5x              4x
    Intel Xeon Phi (AVX-512)   ~4.1x          ~4.8x            8x

Plot: vectorization performance for trapezoid shape navigation (Xeon® Phi® C0PRQ-7120P).
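As a toy illustration of the "many particles per call" shape API implied here (a sketch under simple assumptions, not actual VecGeom code): a safety kernel over SoA coordinate arrays, where the contiguous loop is what the compiler, or an explicit SIMD backend, vectorizes.

    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Safety (isotropic distance) from N points to a sphere of radius R, SoA layout.
    // One call processes a whole basket of particles; the contiguous loop is trivially
    // auto-vectorizable (and could equally be written with explicit backend types).
    void SafetyToSphere(const double *x, const double *y, const double *z,
                        double R, double *safety, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) {
        double r = std::sqrt(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
        safety[i] = std::fabs(R - r);
      }
    }

    int main() {
      const std::size_t n = 1024;                         // a "basket" of particles
      std::vector<double> x(n), y(n), z(n), safety(n);
      for (std::size_t i = 0; i < n; ++i) { x[i] = 0.01 * i; y[i] = 1.0; z[i] = -2.0; }
      SafetyToSphere(x.data(), y.data(), z.data(), /*R=*/10.0, safety.data(), n);
      std::printf("safety[0]=%g safety[%zu]=%g\n", safety[0], n - 1, safety[n - 1]);
      return 0;
    }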

Geometry performance on K20
• Speedup for different navigation methods of the box shape, normalized to the scalar CPU case:
  - Scalar (specialized/unspecialized)
  - Vector
  - GPU (Kepler K20)
  - ROOT
• Data transfer in/out is asynchronous
• Only the kernel performance was measured, but providing constant throughput can hide the transfer latency
• The die can be saturated either with large track containers running a single kernel, or with smaller containers dynamically scheduled
• Just a baseline proving we can run the same code on CPU and accelerators; still to be optimized

Physics performance
• Objective: a vector/accelerator-friendly rewrite of the physics code
• The vectorized Compton scattering shows good performance gains
• The current prototype is able to run an exercise at the scale of an LHC experiment (CMS)
  - Simplified (tabulated) physics but full geometry, RK propagator in field
• Very preliminary results, still needing validation, but hinting at performance improvements of significant factors

Figures: CMS ECAL; alias sampling performance on a Kepler K30 (S.Y. Jun - EM physics models on parallel computing architectures)
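"Alias sampling" above refers to Walker's alias method for sampling discrete (tabulated) distributions in constant time per draw; it maps well onto SIMD and GPU hardware because every sample performs the same two table lookups. Below is a minimal CPU sketch of the method itself, not the GeantV physics code.

    #include <cstdio>
    #include <random>
    #include <vector>

    // Walker's alias method: O(n) table construction, O(1) work per sample.
    struct AliasTable {
      std::vector<double> prob;   // acceptance probability per bin
      std::vector<int>    alias;  // alternative bin if rejected

      explicit AliasTable(const std::vector<double> &weights) {
        const int n = static_cast<int>(weights.size());
        prob.resize(n);
        alias.resize(n);
        double sum = 0.0;
        for (double w : weights) sum += w;
        std::vector<double> scaled(n);
        std::vector<int> small, large;
        for (int i = 0; i < n; ++i) {
          scaled[i] = weights[i] * n / sum;
          (scaled[i] < 1.0 ? small : large).push_back(i);
        }
        while (!small.empty() && !large.empty()) {
          int s = small.back(); small.pop_back();
          int l = large.back(); large.pop_back();
          prob[s]  = scaled[s];
          alias[s] = l;
          scaled[l] -= (1.0 - scaled[s]);
          (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        for (int i : large) { prob[i] = 1.0; alias[i] = i; }
        for (int i : small) { prob[i] = 1.0; alias[i] = i; }
      }

      // Draw a bin index with two uniform random numbers: branch-light, constant work.
      template <typename RNG>
      int Sample(RNG &rng) const {
        std::uniform_real_distribution<double> u01(0.0, 1.0);
        int bin = static_cast<int>(u01(rng) * prob.size());
        return (u01(rng) < prob[bin]) ? bin : alias[bin];
      }
    };

    int main() {
      std::mt19937 rng(1234);
      AliasTable table({0.1, 0.2, 0.3, 0.4});      // toy discrete distribution
      std::vector<int> counts(4, 0);
      for (int i = 0; i < 100000; ++i) ++counts[table.Sample(rng)];
      for (int i = 0; i < 4; ++i) std::printf("bin %d: %d\n", i, counts[i]);
      return 0;
    }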

Hits/digits I/O
• "Data" mode
  - Send data concurrently to one thread dealing with the full I/O
• "Buffer" mode
  - Send concurrently local trees, connected to memory files produced by the workers, to one thread dealing with merging/writing to disk
• Integrating user code with a highly concurrent framework should not spoil performance
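A hedged sketch of the "one thread dealing with the I/O" pattern described above, in generic C++ (the names HitBlock and IOQueue are invented for the illustration; the actual prototype uses ROOT trees and memory files): workers only push their data, and a single consumer thread owns the non-thread-safe output.

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct HitBlock { int eventId; std::vector<float> data; };  // stand-in for hits/digits

    class IOQueue {
      std::queue<HitBlock> q_;
      std::mutex m_;
      std::condition_variable cv_;
      bool done_ = false;
    public:
      void Push(HitBlock b) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(b)); }
        cv_.notify_one();
      }
      void Close() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
      }
      bool Pop(HitBlock &b) {          // returns false once closed and drained
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty() || done_; });
        if (q_.empty()) return false;
        b = std::move(q_.front());
        q_.pop();
        return true;
      }
    };

    int main() {
      IOQueue queue;
      // Single I/O thread: the only place that touches the (non-thread-safe) output.
      std::thread io([&] {
        HitBlock b;
        while (queue.Pop(b))
          std::printf("writing event %d (%zu floats)\n", b.eventId, b.data.size());
      });
      // Worker threads produce hit blocks concurrently and only push to the queue.
      std::vector<std::thread> workers;
      for (int w = 0; w < 4; ++w)
        workers.emplace_back([&, w] {
          for (int e = 0; e < 3; ++e)
            queue.Push({w * 100 + e, std::vector<float>(256, 1.f)});
        });
      for (auto &t : workers) t.join();
      queue.Close();
      io.join();
      return 0;
    }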

Basketizer performance
• Investigated different ways of scheduling & sharing work: lock-free queues, …
• Changes in the scheduler require non-trivial effort (rewrite)
• Amdahl overhead still large, due to the high re-basketizing load (concurrent copying)
• O(10^5) baskets/second on Intel Core i7™
  - Algorithm already lock free
  - Rate will go down with physics processes
• Ongoing work to improve scalability
  - Re-use baskets in the same thread after the step if enough particles are doing physics-limited steps
  - Clone scheduling in NUMA-aware groups, important for many cores (e.g. KNL)

Figure: rebasketizing rate, lock-free algorithm (memory polling) vs. algorithm using spinlocks, measured on a dual-socket Intel® Xeon® host; work stealing only when needed.
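A minimal sketch of the lock-free basket-filling idea discussed here (assumptions: fixed basket capacity, invented Track/Basketizer names; this is not the GeantV basketizer): producers claim slots with an atomic counter, and the producer that fills the last slot dispatches the basket once all writers have committed.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    struct Track { int id; };

    // One shared basket filled concurrently by many producers, without locks.
    class Basketizer {
      static const int kCapacity = 16;
      Track slots_[kCapacity];
      std::atomic<int> claimed_{0};    // slots reserved by producers
      std::atomic<int> committed_{0};  // slots actually written
    public:
      // Returns true if this call completed the basket; 'out' then holds its content.
      bool AddTrack(const Track &t, std::vector<Track> &out) {
        int idx = claimed_.fetch_add(1, std::memory_order_relaxed);
        if (idx >= kCapacity) return false;     // basket full: caller would retry on a new basket
        slots_[idx] = t;
        committed_.fetch_add(1, std::memory_order_release);
        if (idx == kCapacity - 1) {
          // Last claimer: wait until every other writer has committed, then dispatch.
          while (committed_.load(std::memory_order_acquire) < kCapacity)
            std::this_thread::yield();
          out.assign(slots_, slots_ + kCapacity);
          return true;
        }
        return false;
      }
    };

    int main() {
      Basketizer basket;
      std::atomic<int> dispatched{0};
      std::vector<std::thread> producers;
      for (int p = 0; p < 4; ++p)
        producers.emplace_back([&, p] {
          std::vector<Track> full;
          for (int i = 0; i < 4; ++i)
            if (basket.AddTrack({p * 10 + i}, full))
              dispatched.fetch_add(static_cast<int>(full.size()));
        });
      for (auto &t : producers) t.join();
      std::printf("tracks dispatched in full baskets: %d\n", dispatched.load());
      return 0;
    }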

The X-Ray benchmark
• The X-Ray benchmark tests geometry navigation in a real detector geometry
  - X-Ray scans a module with virtual rays, in a grid corresponding to the pixels of the final image
  - Each ray is propagated from boundary to boundary
  - Pixel gray level determined by the number of crossings
• A simple geometry example (concentric tubes) emulating a tracker detector, used for the Xeon® Phi® benchmark
  - Probes the vectorized geometry elements + global navigation as a task
  - OMP parallelism + "basket" model
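A simplified sketch of the benchmark's structure (the concentric-tube navigation below, DistanceToNextBoundary, is a hypothetical stand-in for the real GeantV/VecGeom navigation): one ray per pixel is propagated from boundary to boundary inside an OpenMP parallel loop, and the crossing count becomes the pixel gray level.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Hypothetical navigator call: distance from point (x,y) along +x to the next
    // boundary of a set of concentric tubes (radii in 'radii'); negative if none ahead.
    double DistanceToNextBoundary(double x, double y, const std::vector<double> &radii) {
      double best = -1.0;
      for (double r : radii) {
        double disc = r * r - y * y;
        if (disc < 0) continue;                  // a ray at this y never reaches radius r
        double root = std::sqrt(disc);
        for (double xc : {-root, root}) {        // the two crossings of circle r
          double d = xc - x;
          if (d > 1e-9 && (best < 0 || d < best)) best = d;
        }
      }
      return best;
    }

    int main() {
      const int nPixels = 512;                                // one scan line of the image
      const std::vector<double> radii = {10, 20, 30, 40};     // concentric "tracker" tubes
      std::vector<int> crossings(nPixels, 0);

      // One independent ray per pixel, hence the parallel loop.
      #pragma omp parallel for schedule(dynamic)
      for (int i = 0; i < nPixels; ++i) {
        double y = -45.0 + 90.0 * i / (nPixels - 1);   // vertical position of this ray
        double x = -50.0;                              // start well outside the detector
        while (true) {
          double step = DistanceToNextBoundary(x, y, radii);
          if (step < 0) break;                         // no more boundaries ahead
          x += step;
          ++crossings[i];
        }
      }
      std::printf("pixel 0: %d crossings, pixel %d: %d crossings\n",
                  crossings[0], nPixels / 2, crossings[nPixels / 2]);
      return 0;
    }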

Scalability and throughput
• Better behavior using OMP balanced affinity
  - Approaching the ideal curve well, up to the native core count
  - Balanced threading converges towards the compact model as all the thread slots are filled
• It is worth running the Xeon Phi saturated for our application
  - The throughput of a saturated KNC is equivalent (for this setup) to that of the dual Xeon E5 server which hosts the card

Vector performance
• Gaining up to 4.5x from vectorization in basketized mode
  - Approaching the ideal vectorization case (when no regrouping of vectors is done)
• Vector starvation starts when filling more thread slots than the core count
  - The performance loss is not dramatic
• Better vectorization compared to the Sandy Bridge host (expected)
• Cases compared:
  - Scalar case: simple loop over pixels
  - Ideal vectorization case: fill vectors with N copies of the same X-ray
  - Realistic (basket) case: group baskets per geometry volume

Profiling for the X-Ray benchmark
• Good vectorization intensity, thread activity and core usage for the basketized X-Ray benchmark on a Xeon Phi (61-core C0PRQ-7120P)
• The performance tools gave us good insight into the current performance of GeantV

Looking forward to…
• … implementing a "smoking gun" demonstrator combining all prototype features
  - SIMD gains in the full CMS experiment setup
  - Coprocessor broker in action: part of the full transport kernel running on Xeon® Phi® and GPGPU
  - Scalability and NUMA awareness for the rebasketizing procedure
• … achieving these just moves the target a bit further
• … testing and optimizing the workflow on KNL
  - An important architecture to test how flexible our model is
  - Expecting epic "fights" for scaling up the performance
  - Already working on implementing device-specific scheduling policies and addressing NUMA awareness

Accelerators: plans for 2016
• Porting and optimizing GeantV on BDW & KNL architectures
  - Porting the prototype & adjusting the programming model
  - Profiling and understanding hotspots, reiterating the first phase
  - Tuning application parameters to optimize event throughput
• Running the full prototype using the GPU as a co-processor
  - Debugging the coprocessor broker and the data flow through the GPU
  - Understanding performance issues and latency limitations
  - Optimizing the application for the GPU cooperative mode (tuning scheduling parameters)

Acknowledgements
• The authors acknowledge the contribution and support from Intel (funding, technical and expert advisory, SW/HW) and from CERN openlab

Thank you!