Vector Physics Models Soon Yung Jun US January 30, 2015
Contents
– Emerging technologies and challenges for HEP software
– Vector physics models
– Vision
S.Y. ASCR-HEP, 1/30/2015
Computing Challenges: The Tower of Babel?
Exascale, Cloud, Big Data
Chaos -> Diversification
One model -> Many models
TOP500 (Nov. 15, 2014)
[Figure: Rmax and Cores]
Exascale: 2017 ± 2
No changes
2020 (?)
Q1: What fraction of software applications runs on 10,000 or more cores? (Application Software Report)
1. 0.1%
2. 1%
3. 5%
4. 10%
Is the 1% important? Absolutely!
Quantitative differences become qualitative ones (Marx)
Battle of speed -> Fitness for survival
The Other Corner of the Battlefield: Caching and Vector
Challenges for high performance HEP computing
– HEP (and most real-world) applications are memory bound
– HTC + HPC
Long ago at Troy: Death of Heroes
Two decades ago: Death of Vector in HEP
Era of Pentium: microprocessors rule HEP computing
[Figure labels: Microprocessor, Achilles]
Q2: What is the primary application running on the WLCG?
1. Reconstruction
2. Simulation
3. Analysis
LHC simulation (Run 1)
– Several $10^7$ volumes, events
– $10^{12}$ sec of CPU time using 250,000 cores
– 60% of WLCG (expected to reach 65% in LHC Run 2)
Challenges for High-Luminosity LHC
– Need at least 5x the computing power, most likely with a flat budget
Opportunities
– New architectures, new applications
Hardware Side: Emerging Technologies
Two categories
– General purpose CPU (ARM, MIC-native)
– Coprocessors (GPU, MIC-offload)
New metrics
– GFlop/kWatt
– Bandwidth/Latency (FLOPS is free)
Even more questions
– Will general purpose CPUs and coprocessors be merged?
– Will mobile chips rule the world (even the TOP500)?
Challenges to Software Developers
Evolution? Revolution?
- Traditional?
- Truly?
Demonstrators: Coprocessor (GPU/MIC) in HEP
Lattice QCD
Triggers
Reconstruction and Analysis
Simulation
– accelerator
– detector
– physics
– generator
Most of them are memory-bound applications
Hardware is far ahead of software (a biased ecosystem)
Examples of collaborative efforts
– GATE, GAP in RT, GeantV
General Purpose CPU (ARM/MIC) in HEP
Power efficiency
– More events/Watt in a big GRID
– ARM: “89% less energy, 94% less space, and 63% less cost”
– Porting existing software packages (trigger, reconstruction, etc.)
Vector pipeline
– geometry and particle navigation
– physics processes and models (???)
Present to Future
Challenges for high performance HEP computing
– lack of scalable applications in HEP
– no Moore’s law for software, algorithms and applications
– the overwhelming bias in favor of hardware should be rebalanced
New strategies
– mix the two (CPU and coprocessors) from the beginning of the design
– prioritize the data-intensive operations to be executed by the accelerator
– redesign kernels to keep both the accelerators and the processors busy
Is this (hybrid with many cores) the right solution?
Assumptions Are Important (at Least in Science)
The Sun’s angle is different at different latitudes
– Measured the circumference of the Earth (Eratosthenes, ~200 BC)
– Measured the height of the sky (ancient Chinese)
Two different ways of thinking follow different paths
Vector Physics Model
Assumption: particles are independent during tracking
Vectorization of the density of collisions, ψ
Vector strategies: data locality and instruction throughput
– decompose sequential tracking and regroup the work by task
– algorithmic vectorization
– parallel data patterns
Q3: Typically, what fraction of all essential FORTRAN statements (legacy physics codes) is IF-THEN?
1. 10%
2. 20%
3. 35%
4. 50%
Conditional statements
– implicit loops (do-while)
– conditional coding (case)
– optional coding (skip operations)
Potential solutions
– mask, shuffling, gather/scatter, pack-expand
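The mask solution listed above can be sketched in a few lines. This is a toy illustration in plain Python, not code from GeantV or Geant4: the function names, the threshold rule and the constant energy loss are all assumptions made for the example. The point is only that the masked version applies the same arithmetic to every element and lets a 0/1 mask select the result, which is the form a SIMD unit can execute.

```python
def apply_loss_branchy(energies, threshold, loss):
    """Scalar style: each element takes its own path through an IF-THEN."""
    out = []
    for e in energies:
        if e > threshold:
            out.append(e - loss)
        else:
            out.append(e)
    return out


def apply_loss_masked(energies, threshold, loss):
    """Mask style: identical arithmetic on every element, no branch.

    mask[i] is 1 where the condition holds and 0 elsewhere, so
    e - mask * loss equals e - loss or e without an IF-THEN.
    """
    mask = [int(e > threshold) for e in energies]
    return [e - m * loss for e, m in zip(energies, mask)]


# Hypothetical input values, chosen only for the illustration.
energies = [0.5, 2.0, 1.0, 3.5]
masked = apply_loss_masked(energies, threshold=1.0, loss=0.25)
```

Both functions return the same values; the masked one does some redundant work on elements that fail the condition, which is the usual price of removing the branch.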
Prerequisites
SIMD (single instruction, multiple data) pseudo-random number generator
Data layout: coalesced memory access on vector operands
– SoA (structure of arrays) track (x,p,E,t,…)[i], ordered data arrays
Data locality for the vector particles: share common data
– particle type, geometry and material, physics process
Vector operations
– identical instruction on each component of the vector
– scalar + vector = scalar (do not mix them)
– no conditional branches, no data dependencies
– replace un-vectorizable algorithms by alternatives
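The SoA layout named above can be illustrated with a small sketch. It uses only the Python standard library; the `Track` class, the `to_soa` helper and the field values are hypothetical and stand in for whatever track structure a real package defines. Each field ends up in its own contiguous array, so an operation on one field touches one block of memory.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Track:
    """Array-of-structures element: one object per track (hypothetical fields)."""
    x: float
    p: float
    E: float
    t: float


def to_soa(tracks):
    """Convert a list of Track objects into a structure of arrays.

    Each field becomes one contiguous array of doubles indexed by track number.
    """
    return {
        "x": array("d", (tr.x for tr in tracks)),
        "p": array("d", (tr.p for tr in tracks)),
        "E": array("d", (tr.E for tr in tracks)),
        "t": array("d", (tr.t for tr in tracks)),
    }


def advance_time(soa, dt):
    """Identical operation on every component of one field array."""
    t = soa["t"]
    for i in range(len(t)):
        t[i] += dt


tracks = [Track(0.0, 1.0, 1.5, 0.0), Track(1.0, 2.0, 2.5, 0.5)]
soa = to_soa(tracks)
advance_time(soa, 0.25)
```

In the AoS form the same update would stride across whole `Track` objects; in the SoA form it walks a single array.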
Target
Basic components for physics kernels
– free path analysis: sampling physics process and step length
– collision analysis: energy loss, multiple scattering, secondary production
Choices for vectorized physics models
– tabulated physics (cross section calculation, final state sampling)
– fully vectorized arithmetic algorithms (auto, deep SIMD)
Core techniques and patterns
– conditional branches: mask
– coalesced memory access: gather
– composition and rejection: replaced by alias
Q4: What is the most used Monte Carlo technique in Geant4?
1. Inverse transform
2. Acceptance and rejection
3. Composition and rejection
4. Vegas algorithm
Problem: conditional branches (do-while) are not vectorizable
– loop counter is non-deterministic
Use effectively vectorizable algorithms
– shuffling (overhead may be significant)
– inverse cumulative pdf method (potential bias)
– Alias method (A.J. Walker, 1974)
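The loop-counter problem can be made concrete with a toy density. The sketch below assumes f(x) = 2x on [0, 1], chosen only because its inverse CDF is analytic; it is not a Geant4 model, and all function names are hypothetical. Rejection sampling needs a different number of trials for each sample, while the inverse transform uses exactly one random number per sample.

```python
import math
import random


def target_pdf(x):
    """Toy density f(x) = 2x on [0, 1] (an assumption for this sketch)."""
    return 2.0 * x


def sample_by_rejection(rng):
    """Acceptance-rejection: the number of loop iterations is not known in advance."""
    trials = 0
    while True:
        trials += 1
        x = rng.random()
        # Accept x with probability f(x) / f_max, where f_max = 2.
        if rng.random() * 2.0 <= target_pdf(x):
            return x, trials


def sample_by_inverse_cdf(u):
    """Inverse transform: F(x) = x^2, so x = sqrt(u); one step, no loop."""
    return math.sqrt(u)


rng = random.Random(2015)
rejection = [sample_by_rejection(rng) for _ in range(1000)]
trial_counts = [n for _, n in rejection]
inverse = [sample_by_inverse_cdf(rng.random()) for _ in range(1000)]
```

In a vector of particles, the spread in `trial_counts` means some lanes finish early and idle while others keep looping, which is what the alternatives listed above avoid.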
Sampling Secondary Particles: Alias Method (A.J. Walker)
Replace composition and rejection methods (conditional branches, not vectorizable)
Recast a cross section f(x) into N equally probable events, each with likelihood c = 1/N
Alias table
– a[recipient] = donor
– q[N] = non-alias probability
Sampling $x_j$ from random $u_1$, $u_2$
– bin index: $N u_1 = i + \alpha$
– sample: $j = (u_2 < q[i])\ ?\ i : a[i]$
– $x_j = [\alpha j + (1-\alpha)(j+1)]\,\Delta x$
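A minimal sketch of the alias method follows, under stated assumptions. The table is built with Vose's variant of Walker's construction, which is one common choice and not necessarily the one used in the talk. The symbol names `alpha` (fractional part of N·u1) and `dx` (bin width) are assumptions, and q[i] is read as the probability of keeping bin i rather than jumping to its alias. All function names are hypothetical.

```python
def build_alias_table(pdf):
    """Build (q, a): q[i] is the non-alias probability, a[i] the donor bin."""
    n = len(pdf)
    total = sum(pdf)
    scaled = [p * n / total for p in pdf]   # average value is exactly 1
    q = [1.0] * n
    a = list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        recipient = small.pop()
        donor = large.pop()
        q[recipient] = scaled[recipient]
        a[recipient] = donor
        # The donor gives away what the recipient lacks to reach 1.
        scaled[donor] -= 1.0 - scaled[recipient]
        (small if scaled[donor] < 1.0 else large).append(donor)
    return q, a


def sample_alias(q, a, u1, u2, dx):
    """One branch-light sample: returns (bin j, value x inside bin j)."""
    n = len(q)
    scaled = n * u1
    i = min(int(scaled), n - 1)      # bin index
    alpha = scaled - i               # fractional part
    j = i if u2 < q[i] else a[i]     # keep bin i with probability q[i]
    x = (alpha * j + (1.0 - alpha) * (j + 1)) * dx
    return j, x


def implied_probabilities(q, a):
    """Probability of each bin implied by the table (for validation)."""
    n = len(q)
    prob = [0.0] * n
    for i in range(n):
        prob[i] += q[i] / n
        prob[a[i]] += (1.0 - q[i]) / n
    return prob


q, a = build_alias_table([1.0, 2.0, 3.0, 4.0])
probs = implied_probabilities(q, a)
```

Every sample costs two random numbers, one table read and one select, with no loop, so all lanes of a vector finish in the same number of steps.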
Alias Method: Validation and Performance
Differential cross sections of EM processes (e.g., scattering angle)
Sizable performance gains on both CPU and GPU
Coalesced Memory Access
Sampling the step length and the associated physics process
– cross section calculations on the fly (fully vectorizable, but may be costly)
– tabulated physics (gather operation for table look-ups, bandwidth limited)
Rearrange data to enable contiguously ordered memory accesses
Overhead?: reallocate data to the stack (< gain by vectorization)
Sizable performance gains on both CPU and GPU
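The gather pattern for tabulated physics can be sketched as follows. The uniform energy grid, the linear interpolation and the function name are assumptions made for the illustration; a real table would carry its own binning and units. The index computation and the interpolation are plain arithmetic on every element, and the only irregular step is the pair of table reads.

```python
def lookup_cross_sections(energies, e_min, e_max, table):
    """Interpolate a tabulated cross section for a whole vector of energies.

    The table holds values at len(table) equally spaced points on [e_min, e_max].
    """
    nbins = len(table) - 1
    inv_width = nbins / (e_max - e_min)
    # Position in bin units, clamped to the table range.
    pos = [min(max((e - e_min) * inv_width, 0.0), float(nbins)) for e in energies]
    idx = [min(int(p), nbins - 1) for p in pos]
    frac = [p - i for p, i in zip(pos, idx)]
    # Gather: each element reads the table at its own index.
    lo = [table[i] for i in idx]
    hi = [table[i + 1] for i in idx]
    return [l + f * (h - l) for l, h, f in zip(lo, hi, frac)]


# Hypothetical table values, chosen so the results are easy to check by hand.
table = [0.0, 10.0, 20.0, 30.0, 40.0]
sigma = lookup_cross_sections([0.5, 3.25, 4.0], 0.0, 4.0, table)
```

If the tracks are first ordered by energy, neighbouring elements read neighbouring table entries, which is the contiguous access the slide asks for.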
Portability (Template Approach): Scalar, Vector, CUDA, MIC
Common interface for different architectures (Backend)
Plan
Implement one fully vectorized EM physics model (Klein-Nishina Compton) and test with GeantV by CHEP2015
– Vector, Scalar, CUDA
– Performance evaluation and validation (tabulated and Geant4)
Complete all EM physics
– establish a backend schema for the vector physics package
Extend to hadron physics and explore other algorithms
Vision: Ancien Régime to Liberty
The failure of architecture-aware algorithms (and vice versa) is the dawn of a new age.
Scientific advancement is not evolutionary, but rather is a “series of peaceful interludes punctuated by intellectually violent revolutions”, and in those revolutions “one conceptual world view is replaced by another” (Thomas Kuhn).