Detector Simulation on Modern Processors: Vectorization of Physics Models
Philippe Canal, Soon Yung Jun (FNAL); John Apostolakis, Mihaly Novak, Sandro Wenzel (CERN)

Presentation transcript:

Detector Simulation on Modern Processors: Vectorization of Physics Models
Philippe Canal, Soon Yung Jun (FNAL); John Apostolakis, Mihaly Novak, Sandro Wenzel (CERN), for the GeantV team
CHEP, April 2015, Okinawa, Japan

Contents
– Introduction
– GeantV
– Vector Physics Model
– Validation and Performance
– Conclusion

Introduction
Motivations:
– performance of our code has historically scaled with processor clock speed, which is no longer increasing
– HEP code needs to exploit new architectures to improve
– data and instruction locality, and vectorization
– portability, better physics, and optimization are the targets

Introduction
GeantV goals:
– Develop an all-particle transport simulation program
  – 2 to 5 times faster than Geant4
  – continuous improvement of physics
  – full simulation and various options for fast simulation
  – portable across different architectures, including accelerators (GPUs and Xeon Phis)
– Understand the limiting factors for a 10x improvement
See "The GeantV project: preparing the future of simulation", presented 14 Apr 2015 at 17:15.

GeantV: The Next-Generation Detector Simulation Toolkit
The GeantV framework: scheduling, geometry, physics
[Framework diagram: work queue feeding the track scheduler]

Vector Physics Model
– Assumption: particles are independent during tracking
– Vectorization of the density of collisions, ψ
– Vector strategies: data locality and instruction throughput
  – decompose sequential tracking and regroup the work by tasks
  – algorithmic vectorization and parallel data patterns
  – targeting both external and internal (SIMD) vectorization

Portability (Template Approach): Scalar, Vector, CUDA, MIC
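The slide itself shows only the scheme; as a rough illustration of the template-backend idea, the sketch below instantiates one physics kernel with a scalar backend, and the same source would compile against a SIMD or CUDA backend by swapping the backend type. All type and function names here (ScalarBackend, Double_v, KleinNishinaCrossSection) are illustrative placeholders, not the actual GeantV API, and the formula is a stand-in rather than the real cross section.

    // Minimal sketch of a backend-templated kernel (illustrative names only).
    struct ScalarBackend {
      using Double_v = double;            // one track at a time
    };

    // A SIMD backend would wrap a vector type instead, e.g.
    // struct VectorBackend { using Double_v = Vc::double_v; };

    template <typename Backend>
    typename Backend::Double_v
    KleinNishinaCrossSection(typename Backend::Double_v energy)
    {
      using Double_v = typename Backend::Double_v;
      // The same source compiles to scalar or SIMD instructions
      // depending on which backend type is instantiated.
      Double_v e = energy / 0.511;        // energy in units of the electron mass (MeV)
      return 1.0 / (1.0 + 2.0 * e);       // stand-in expression, not the real formula
    }

    int main()
    {
      double sigma = KleinNishinaCrossSection<ScalarBackend>(1.0);
      (void)sigma;
      return 0;
    }

The point of the pattern is that the physics code is written once against Backend::Double_v while the architecture-specific details live entirely in the backend type.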

Prerequisites to Achieve Efficient Vectorization
– Vectorized pseudo-random number generator
– Data layout: coalesced memory access on vector operands
  – SoA (structure of arrays) for track parameters (x, p, t, E, ...)
  – ordered and aligned data arrays
– Data locality for the vector of particles
  – particle type, geometry and material, physics process
– Vector operations
  – identical instructions on each component of the vector
  – no conditional branches, no data dependencies
  – replace non-vectorizable algorithms (e.g. composition-and-rejection methods) with alternatives
(see the SoA sketch below)
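To make the data-layout point concrete, here is a minimal AoS-versus-SoA sketch; the field names and the TrackSoA container are hypothetical, not the GeantV track classes.

    // Array of structures (AoS): the fields of one track are contiguous,
    // but the same field across tracks is strided -> poor SIMD loads.
    struct Track {
      double x, y, z;        // position
      double px, py, pz;     // momentum
      double energy;
    };

    // Structure of arrays (SoA): the same field across tracks is contiguous
    // and aligned, enabling coalesced/unit-stride vector loads.
    constexpr int kMaxTracks = 4992;   // matches the track-buffer size quoted on the timing slide
    struct TrackSoA {
      alignas(64) double x[kMaxTracks], y[kMaxTracks], z[kMaxTracks];
      alignas(64) double px[kMaxTracks], py[kMaxTracks], pz[kMaxTracks];
      alignas(64) double energy[kMaxTracks];
      int size = 0;                    // number of tracks currently stored
    };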

Sampling Secondary Particles: Alias Method (A. J. Walker)
– Recast a cross section f(x) into N equally probable events, each with likelihood c = 1/N
– Alias table:
  – a[recipient] = donor
  – q[N] = non-alias probability
– Sampling x_j from two uniform random numbers u1, u2:
  – bin index: N·u1 = i + α (integer part i, fractional part α)
  – sample j = (u2 < q[i]) ? i : a[i]
  – x_j = [α·j + (1−α)·(j+1)]·Δx
– Replaces composition-and-rejection methods (conditional branches, not vectorizable)
(a sketch of the sampling step follows below)
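A minimal sketch of the sampling step described above, assuming the alias table a[] and the non-alias probabilities q[] have already been built by some preparation step. It is written for a single sample; because every sample executes the same instruction sequence with no rejection loop, the body maps directly onto SIMD lanes (the ternary becomes a masked select).

    // One alias-method sample from a tabulated distribution with N equal-width
    // bins of width dx. 'a' holds the alias (donor) indices and 'q' the
    // non-alias probability per bin; both are assumed precomputed.
    double SampleAlias(const int* a, const double* q, int N, double dx,
                       double u1, double u2)
    {
      double scaled = N * u1;
      int    i      = static_cast<int>(scaled);   // bin index
      double alpha  = scaled - i;                 // fractional part, reused for interpolation
      int    j      = (u2 < q[i]) ? i : a[i];     // keep bin i or jump to its alias
      // interpolate within the selected bin: x_j = [alpha*j + (1-alpha)*(j+1)] * dx
      return (alpha * j + (1.0 - alpha) * (j + 1)) * dx;
    }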

Coalesced Memory Access
– Sampling the step length and the physics process
  – cross-section calculation on the fly (fully vectorizable, but likely expensive)
  – tabulated physics (table lookups, bandwidth limited)
– Gather data to enable contiguously ordered accesses
  – loss from the gather overhead < gain from vectorization
(a gather sketch follows below)
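As an illustration of the gather step, this sketch copies the energies of the tracks selected for one physics process from the SoA store into a contiguous buffer before the vectorized sampler runs; the function name and arguments are hypothetical. The trade-off quoted on the slide is exactly this: the scalar gather must cost less than the SIMD speedup it enables.

    // Gather the energies of the tracks assigned to one physics process into a
    // contiguous buffer so the vectorized sampler can use unit-stride loads.
    // 'indices' lists which entries of the SoA energy array were selected.
    void GatherEnergies(const double* trackEnergy,   // SoA energy array of all tracks
                        const int* indices, int n,   // selected track indices
                        double* energyOut)           // contiguous output buffer
    {
      for (int k = 0; k < n; ++k) {
        energyOut[k] = trackEnergy[indices[k]];      // the gather itself is scalar/strided
      }
      // energyOut[0..n) is now contiguous and can be processed with aligned
      // SIMD loads, which is where the vector speedup comes from.
    }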

Validation: Alias vs. Composition-and-Rejection Method
Compton scattering (Klein-Nishina model): energy and angle of scattered photons

Vector Speedup: Factor 2 on Xeon

Runtime Performance
– Relative performance for sampling the secondary electron
  – composition-and-rejection method vs. alias method (scalar and vector)
  – average time over 100 trials for 4992 x 100 tracks, SSE
  – table size = [input energy bins, sampled energy bins]
– Timing table for table sizes [100,100] and [100,1000], with rows: Composition Method; Alias Method, Scalar; Alias Method, Vector (numerical values not preserved in the transcript)
– Note that for the Klein-Nishina model the composition-and-rejection method is one of the most efficient examples of its kind (acceptance efficiency ε ~ 1)

Status and Plan
– Implement one fully vectorized EM physics model (Klein-Nishina Compton) and test it with GeantV
  – backends: Scalar, Vector, CUDA
  – performance evaluation and validation
– Complete all EM physics by December
– Extend to hadron physics and explore other algorithms

Conclusion
Significant performance improvement is achievable in detector-simulation physics code using a combination of:
– alternative algorithms (reducing branching, etc.)
– vectorization
– increased use of code and data caches
Using template techniques, the code is portable to different modern computing architectures while still being tuned for each architecture.

Backup Slides
