Vectorizing the detector geometry to optimize particle transport
CHEP 2013, Amsterdam
J. Apostolakis, R. Brun, F. Carminati, A. Gheata, S. Wenzel

Motivation
Explore possibilities to recast particle simulation so that it takes advantage of all performance dimensions/technologies. Geometry navigation takes a large part of the particle transport budget (40-60% in ALICE).
- Dimension 1 ("sharing data"): multithreading/multicore. In HEP, used mainly to reduce the memory footprint.
- Dimension 2 ("throughput increase"): in-core instruction-level parallelism and vectorization. Currently not exploited, because it requires "parallel data" to work on.
Research projects (the GPU prototype and the Simulation Vector Prototype) have started targeting beyond dimension 1: parallel data ("baskets") = particles from different events grouped by logical volume.

Reminder of vector micro-parallelism
Commodity processors have vector registers on whose components a single instruction can be performed in parallel (micro-parallelism): single instruction, multiple data = SIMD. Examples of SIMD architectures: MMX, SSE, AVX, ...
[Figure: one CPU instruction combines the registers (a, b, c, d) and (v, w, y, z) element-wise into (a*v, b*w, c*y, d*z). Current scalar usage computes only one lane (a*v), leaving 3/4 of an AVX register empty, versus optimal usage with the vector registers full.] We are losing factors!
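As an illustration (not from the slides), the four-wide multiply in the figure written with AVX intrinsics; a vectorizing compiler emits this same kind of instruction automatically:

#include <immintrin.h>

// Multiply four doubles at once with a single AVX instruction;
// scalar code would need four separate multiply instructions.
void mul4(const double* x, const double* y, double* out) {
    __m256d vx = _mm256_loadu_pd(x);    // load (a, b, c, d)
    __m256d vy = _mm256_loadu_pd(y);    // load (v, w, y, z)
    __m256d vr = _mm256_mul_pd(vx, vy); // (a*v, b*w, c*y, d*z) in one go
    _mm256_storeu_pd(out, vr);
}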

1st goal: vector processing for a single solid (milestone 1)
Goal: enable the geometry components to process baskets/vectors of data and study the performance opportunities.
- Provide new interfaces to the basic geometry algorithms that process vectors: instead of 1 particle in / 1 result out, a vector of N particles in / N results out (see the sketch below).
- Make efficient use of baskets and try to use SIMD vector instructions wherever possible (throughput optimization).
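The shape of such an interface might look as follows (a sketch; the names and signature are illustrative, not the actual ROOT API):

// Scalar interface: one particle in, one result out.
double DistFromOutside(const double point[3], const double dir[3]);

// Hypothetical vector interface: N particles in, N results out.
// Coordinates are passed as separate arrays (structure-of-arrays)
// so that SIMD lanes can load contiguous values.
void DistFromOutside_v(const double* x, const double* y, const double* z,
                       const double* dx, const double* dy, const double* dz,
                       int np, double* dist);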

2nd goal: vector processing within a volume (milestone 2)
Scalar (simple) navigation versus vector navigation:
- scalar: each particle undergoes a series of basic algorithms (with the outer loop over particles)
- vector: each algorithm takes a basket of particles and spits out vectors to the next algorithm (sketched below)
Benefits: fewer function calls, SIMD (SSE, AVX) instructions, better code locality (instruction cache).
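A minimal sketch of the basket-pipeline idea, assuming a hypothetical Basket type and a toy stage (none of these names come from the actual prototype):

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical basket: structure-of-arrays holding N particle states.
struct Basket {
    std::vector<double> x, y, z;    // positions
    std::vector<double> dx, dy, dz; // directions
    std::size_t size() const { return x.size(); }
};

// Each stage processes the whole basket in one call; the inner loop
// is a plain strided loop that a compiler (or Vc) can vectorize.
void ComputeSafety(const Basket& b, std::vector<double>& safety) {
    for (std::size_t i = 0; i < b.size(); ++i)
        safety[i] = std::sqrt(b.x[i]*b.x[i] + b.y[i]*b.y[i]); // toy metric
}

void NavigateBasket(const Basket& b) {
    std::vector<double> safety(b.size());
    ComputeSafety(b, safety); // one call for N particles, not N calls
    // ... hand the basket and the results on to the next algorithm ...
}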

The programming model
In order to use SIMD CPU capabilities, special assembly instructions (e.g. "vaddpd" versus plain "add") have to reach the hardware. Multiple options exist:
1. "Autovectorization": let the compiler figure this out itself (without code changes). Pro: best option for portability and maintenance. Con: this currently works only in a few cases.
2. Explicit vector-oriented programming: manually instruct the compiler to use vector instructions. At the lowest level: intrinsics or assembler code, hard to write/read/maintain. At a higher level: template-based APIs that hide the low-level details, like the Vc library (code.compeng.uni-frankfurt.de/projects/Vc). Pro: good performance and portability, only little platform dependency (templates!). Con: requires some code changes / refactoring of code.
3. Language extensions, such as Intel Cilk Plus array notation. Similar to option 2; investigated but not covered in this talk.
(See the Vc and Cilk Plus examples in the backup slides.)

Status of simple shape/algorithm investigations
Provided vector interfaces to all shapes, and optimized code for the simple ROOT shapes for "DistFromInside", "DistFromOutside", "Safety" and "Contains". Good experience and results using the Vc programming model.
[Table: comparison of processing times for 1024 particles (AVX instructions), in microseconds, for DistFromOutside and DistFromInside; the numbers did not survive the transcript.]
For simple shapes the performance gains match our expectations.

Status of refactoring simple algorithms (II)
A lot of work still to do in SIMD-optimizing the more complicated shapes; preliminary results are available for the polycone (see backup).
Besides shapes, vector-optimize other simple algorithmic blocks: coordinate and vector transformations ("master-to-local", sketched below), min/max algorithms, legacy geometries (all 4 LHC experiments).
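For illustration, a sketch of a vectorizable master-to-local transformation; the matrix layout and convention (local = R * (master - T)) are assumptions, not the actual ROOT code:

// Transform np points from the master (global) frame to the local frame,
// written as a plain loop over structure-of-arrays data so that the
// compiler can vectorize it. R is a row-major 3x3 rotation, T a translation.
void MasterToLocal(const double R[9], const double T[3],
                   const double* mx, const double* my, const double* mz,
                   double* lx, double* ly, double* lz, int np) {
    for (int i = 0; i < np; ++i) {
        const double tx = mx[i] - T[0];
        const double ty = my[i] - T[1];
        const double tz = mz[i] - T[2];
        lx[i] = R[0]*tx + R[1]*ty + R[2]*tz;
        ly[i] = R[3]*tx + R[4]*ty + R[5]*tz;
        lz[i] = R[6]*tx + R[7]*ty + R[8]*tz;
    }
}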

Benchmarking the Vector Navigation
We now have everything together to compare scalar versus vector navigation:
in: N particles in a logical volume
out: steps and next boundaries for the N particles

Testing a simple navigation algorithm
Implemented a toy detector for a benchmark: 2 tubes (beampipe and tubular shield), 4 plate detectors, 2 endcaps (cones), 1 tubular mother volume.
The logical volume is filled with a test-particle pool (random positions and random directions), from which we use a subset of N particles for the benchmarks (P repetitions).
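For readers unfamiliar with the setup, a sketch of how such a toy geometry might be assembled with ROOT's TGeo classes; all dimensions and names are made up for illustration, and the plate detectors are omitted:

#include "TGeoManager.h"
#include "TGeoMaterial.h"
#include "TGeoMatrix.h"
#include "TGeoMedium.h"
#include "TGeoVolume.h"

// Toy detector: tubular mother volume containing a beampipe,
// a tubular shield and two conical endcaps.
void BuildToyDetector() {
    TGeoManager* geom = new TGeoManager("toy", "toy detector");
    TGeoMaterial* matVac = new TGeoMaterial("Vacuum", 0, 0, 0);
    TGeoMedium* vac = new TGeoMedium("Vacuum", 1, matVac);

    TGeoVolume* top = geom->MakeTube("MOTHER", vac, 0., 100., 200.);
    geom->SetTopVolume(top);

    TGeoVolume* pipe   = geom->MakeTube("BEAMPIPE", vac, 2., 2.5, 190.);
    TGeoVolume* shield = geom->MakeTube("SHIELD",   vac, 40., 45., 150.);
    TGeoVolume* endcap = geom->MakeCone("ENDCAP",   vac, 10., 5., 50., 5., 80.);

    top->AddNode(pipe,   1);
    top->AddNode(shield, 1);
    top->AddNode(endcap, 1, new TGeoTranslation(0., 0.,  170.));
    top->AddNode(endcap, 2, new TGeoTranslation(0., 0., -170.));
    geom->CloseGeometry();
}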

Results from Benchmark: Overall Runtime
Time to process/navigate N particles (P repetitions) using the scalar algorithm (ROOT) versus the vector version:
- already a considerable gain for small N
- excellent speedup for the SSE4 version, some further gain with AVX
- total speedup of 3.1
- there is an optimal point of operation (performance degradation for large N)

Further Metrics: Executed Instructions
To investigate the origin of the speedup we study hardware performance counters: we developed a "timer"-based approach where we read out the counters before and after an arbitrary code section (using libpfm).
The gain is mainly due to fewer instructions (for the same work). Detailed analysis (binary instrumentation) can give statistics, e.g. for N=1024 particles (AVX versus ROOT sequential):

  Instruction type       ROOT   Vec
  MOV                    30%    15%
  CALL                   4%     0.4%
  V..PD (SIMD instr.)    5%     55%
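The slides used libpfm; as an illustration of the same counter-bracketing idea, here is a minimal sketch using the raw Linux perf_event_open syscall to count retired instructions around a code section:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS; // retired instructions
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);   // start counting

    // ... arbitrary code section to measure goes here ...

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);  // stop counting
    long long count = 0;
    read(fd, &count, sizeof(count));
    std::printf("instructions: %lld\n", count);
    close(fd);
    return 0;
}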

Further Metrics: L1 instruction cache misses
Better code locality; expected to have even more impact when the navigation itself is embedded in a more complex environment.

Further Metrics: total cache misses
More data cache misses for large numbers of particles can degrade performance, likely due to the structure-of-arrays usage in the vector version (see the layout sketch below). The realistic N that can be scheduled is therefore an important parameter.
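For context (not from the slides), the two layouts in question; SoA lets SIMD lanes load contiguous values, but for large N its separate arrays become several widely spaced memory streams competing for cache:

// Array-of-structures (AoS): one particle's fields are adjacent in memory.
struct ParticleAoS { double x, y, z, dx, dy, dz; };

// Structure-of-arrays (SoA): each field is a contiguous array, ideal for
// SIMD loads, but six separate streams when N is large.
struct ParticlesSoA {
    double* x;  double* y;  double* z;
    double* dx; double* dy; double* dz;
};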

Summary / Outlook
Summary:
- Vectorization is not threading and needs special effort.
- A vector/basket architecture allows making use of SIMD but also increases locality (fewer function calls, more instruction-cache friendly).
- Provided a first refactored vector API in the ROOT geometry/navigation library and showed good performance gains for individual as well as combined algorithms on commodity hardware.
- Very good experience with the explicit vector-oriented programming models (Vc, Intel Cilk Plus arrays).
Outlook:
- more complex shapes and algorithms (voxelization), USolids, ...
- full flow of vectors in the Geant-V prototype (vectorization gains vs. scatter-gather overhead)
- gains on accelerators (GPUs, Intel Xeon Phi, ...) using vectorized code

Acknowledgements
Contributors to the basic Vc coding: Juan Valles (CERN summer student), Marilena Bandieramonte (University of Catania, Italy), Raman Sehgal (BARC, India).
Help with performance analysis / investigation of Intel Cilk Plus array notation: Laurent Duhem (Intel), CERN openlab.

Backup slides

Notes on benchmark conditions
- System: Ivy Bridge Core i7 (4 cores, hyperthreading off, so that 8 hardware performance counters can be read out)
- Compiler: gcc 4.7.2 (compile flags: -O2 -funroll-loops -ffast-math -mavx)
- OS: SLC6
- Vc version: 0.7.3
- Benchmarks are usually run on an empty system with CPU pinning (taskset -c).
- Benchmarks use a preallocated pool of test data, from which we take N particles for processing and repeat this P times. For the repetitions we distinguish between random access to the N particles (higher cache impact) and sequential access in the data pool (as shown here).
- The benchmarks shown use N x P = const, to time an overall similar amount of work.

Backup: A need for tools...
Converting code for data parallelism can be a pain (see challenges); it would be nice to have better tool support for this task, helping at least with the frequently recurring work:
- provide a trivial vectorized code version of a function
- unroll inner loops, rewrite early returns, ...
A possible direction: source-to-source transformations (preprocessing). The Clang/LLVM API is very promising for this; currently being investigated.
Some tools go in this direction: Scout (TU Dresden) can take code within a loop and emit intrinsics code for all kinds of architectures; it could be used in situations where the compiler does not autovectorize.

Example of Vc programming

Example in plain C:

void foo(double const *a, double const *b, double *out, int np) {
  for (int i = 0; i < np; i++) {
    out[i] = b[i] * (a[i] + b[i]);
  }
}

Although simple and data-parallel, this (probably) does not vectorize without further hints to the compiler ("restrict").

The same example in Vc:

#include <Vc/Vc>

void foo(double const *a, double const *b, double *out, int np) {
  for (int i = 0; i < np; i += Vc::double_v::Size) {
    // fetch a chunk of data into Vc vectors
    Vc::double_v a_v(&a[i]);
    Vc::double_v b_v(&b[i]);
    // computation just as before
    b_v = b_v * (a_v + b_v);
    // store the result back into the output array
    b_v.store(&out[i]);
  }
  // a tail part of the loop has to follow
}

- restructured loop stride, explicit inner vector declaration
- always vectorizes (no other hints necessary)
- architecture independent and portable, because Vc::double_v::Size is a template constant determined at compile time
- branches/masks supported

Example with Intel Cilk Plus Array Notation

Intel Cilk Plus array notation indicates operations on parallel data to the compiler and leads to better autovectorization. Programming can be similar to Vc (but with seemingly more code bloat at the moment). This is a somewhat constructed example (it is possible in an easier manner as well): working with small vectors of size VecSize is wanted because it allows for "early returns" and finer control.

// CEAN example
void foo(double const *a, double const *b, double *out, int np) {
  int const VecSize = 4;
  for (int i = 0; i < np; i += VecSize) {
    // cast input as fixed-size vectors
    double const (*av)[VecSize] = (double const (*)[VecSize]) &a[i];
    double const (*bv)[VecSize] = (double const (*)[VecSize]) &b[i];
    // give the compiler alignment hints
    __assume_aligned(av, 32);
    __assume_aligned(bv, 32);
    // cast output as a fixed-size vector
    double (*outv)[VecSize] = (double (*)[VecSize]) &out[i];
    __assume_aligned(outv, 32);
    // computation and storage in Cilk Plus array notation -- will vectorize
    outv[0][:] = bv[0][:] * (av[0][:] + bv[0][:]);
  }
}

Status for more complex shapes
The polycone is one of the more important complex shapes used in detector descriptions. The algorithm used in ROOT relies on recursive function calls, which are not directly translatable to Vc-type programming; similarly for the modern approach in USolids, which uses voxelization techniques.
A simple brute-force approach (for all particles, just test all segments) has been shown to give performance improvements for smaller polycones (work by Juan Valles, CERN summer student).

A different kind of data parallelism
"From particle data parallelism to segment (data) parallelism": for large polycones one could try to vectorize over the segments instead of over the particles (currently being developed). A similar idea could work even for voxelized/tessellated solids: "evaluate the distance to the shaded segments in a vectorized fashion".
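A sketch of the segment-parallel idea under strong simplifying assumptions: the polycone is reduced to z-slabs, and the per-segment test is a made-up placeholder, not the real geometry:

#include <algorithm>
#include <limits>

// Hypothetical, much-simplified polycone: nseg z-sections stored as
// structure-of-arrays, so one particle can be tested against many
// segments in a single vectorizable loop.
struct SimplePolycone {
    int nseg;
    const double* zmin; // per-segment lower z bound
    const double* zmax; // per-segment upper z bound
};

// Placeholder per-segment test: distance along d to enter the z-slab
// [zmin, zmax] (the radial part of the real geometry is omitted).
inline double SegmentDistance(double zmin, double zmax,
                              const double p[3], const double d[3]) {
    const double inf = std::numeric_limits<double>::max();
    if (d[2] == 0.) return inf;
    const double t = ((d[2] > 0. ? zmin : zmax) - p[2]) / d[2];
    return t > 0. ? t : inf;
}

// Distance from ONE particle to the polycone, vectorizing over the
// segments instead of the particles: the inner loop streams through
// contiguous segment data and reduces with min().
double DistToPolycone(const SimplePolycone& pc,
                      const double p[3], const double d[3]) {
    double dist = std::numeric_limits<double>::max();
    for (int s = 0; s < pc.nseg; ++s)
        dist = std::min(dist, SegmentDistance(pc.zmin[s], pc.zmax[s], p, d));
    return dist;
}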