Vector Physics Models Soon Yung Jun US January 30, 2015

Contents
–Emerging technologies and challenges for HEP software
–Vector physics models
–Vision

Computing Challenges: The Tower of Babel?
–Exascale, Cloud, Big Data
–Chaos -> Diversification; One model -> Many models

TOP500 (Nov. 15, 2014) [figure: Rmax and number of cores]
–Exascale: 2017 ± 2; no changes; 2020 (?)

Q1: What fraction of software applications runs on 10,000 or more cores? (Application Software Report)
1. 0.1%
2. 1%
3. 5%
4. 10%
Is the 1% important? Absolutely! Quantitative differences become qualitative ones (Marx).
–Battle of speed -> fitness for survival

The Other Corner of the Battlefield: Caching and Vectors
Challenges for high-performance HEP computing
–HEP (and most real-world) applications are memory bound
–HTC + HPC
Long ago at Troy: the death of heroes ("Microprocessor Achilles")
Two decades ago: the death of vector machines in HEP
The era of the Pentium: microprocessors rule HEP computing

Q2: What is the primary application running on the WLCG?
1. Reconstruction
2. Simulation
3. Analysis
LHC simulation (Run 1)
–Several 10^7 volumes, events
–10^12 sec of CPU time using 250,000 cores
–60% of the WLCG (expected to reach 65% in LHC Run 2)
Challenges for the High-Luminosity LHC
–Need at least 5x the computing power, most likely with a flat budget
Opportunities
–New architectures, new applications

Hardware Side: Emerging Technologies
Two categories
–General-purpose CPUs (ARM, MIC-native)
–Coprocessors (GPU, MIC-offload)
New metrics
–GFLOPS/kWatt
–Bandwidth/latency (FLOPS are free)
Even more questions
–Will general-purpose CPUs and coprocessors be merged?
–Will mobile chips rule the world (even the TOP500)?

Challenges to Software Developers
–Evolution? (traditional?)
–Revolution? (truly?)

Demonstrators: Coprocessors (GPU/MIC) in HEP
–Lattice QCD
–Triggers
–Reconstruction and analysis
–Simulation: accelerator, detector, physics, generator
Most of these are memory-bound applications
Hardware is far ahead of software (the ecosystem is biased toward hardware)
Examples of collaborative efforts
–GATE, GAP in RT, GeantV

General-Purpose CPUs (ARM/MIC) in HEP
Power efficiency
–More events/Watt in a big Grid
–ARM: "89% less energy, 94% less space, and 63% less cost"
–Porting existing software packages (trigger, reconstruction, etc.)
Vector pipeline
–Geometry and particle navigation
–Physics processes and models (???)

Present to Future
Challenges for high-performance HEP computing
–Lack of scalable applications in HEP
–No Moore's law for software, algorithms and applications
–The overwhelming bias in favor of hardware should be rebalanced
New strategies
–A mixture of the two (CPUs and coprocessors) from the beginning of the design
–Prioritize the data-intensive operations to be executed by the accelerator
–Redesign kernels to keep the accelerators and processors busy
Is this (hybrid with many cores) the right solution?

Assumptions are Important (at least in Science)
The Sun's angle is different at different latitudes:
–Measure the circumference of the Earth (Eratosthenes, ~200 BC)
–Measure the height of the sky (ancient Chinese)
Two different assumptions lead down two different paths


Vector Physics Model
Assumption: particles are independent during tracking
Vectorize the density of collisions, ψ
Vector strategies: data locality and instruction throughput
–Decompose sequential tracking and regroup the work by tasks
–Algorithmic vectorization
–Parallel data patterns

Q3: Typically, what fraction of all essential FORTRAN statements (in legacy physics codes) is IF-THEN?
1. 10%
2. 20%
3. 35%
4. 50%
Conditional statements
–Implicit loops (do-while)
–Conditional coding (case)
–Optional coding (skip operations)
Potential solutions (see the sketch below)
–Mask, shuffling, gather/scatter, pack-expand
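The mask idea can be illustrated with a minimal C++ sketch (the function names are hypothetical, not from any HEP package): the data-dependent IF-THEN is rewritten as a branch-free select, which compilers and SIMD instruction sets can map to a vector blend under a mask.

```cpp
#include <cstddef>

// Branchy version: the if statement obstructs straightforward SIMD execution.
void ClampBranchy(const double *in, double *out, double cut, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    if (in[i] > cut) out[i] = cut;
    else             out[i] = in[i];
  }
}

// Branch-free version: compute the condition as a mask and blend both outcomes.
// Compilers typically turn the ternary into a vector select instruction.
void ClampMasked(const double *in, double *out, double cut, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    const bool mask = (in[i] > cut);
    out[i] = mask ? cut : in[i];
  }
}
```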

Pre-requisites
SIMD (single instruction, multiple data) pseudo-random number generator
Data layout: coalesced memory access on vector operands (see the sketch below)
–SoA (structure of arrays): track (x, p, E, t, …)[i], ordered data arrays
Data locality for the vector of particles: share common data
–Particle type, geometry and material, physics process
Vector operations
–Identical instruction on each component of the vector
–scalar + vector = scalar (do not mix them)
–No conditional branches, no data dependencies
–Replace un-vectorizable algorithms with alternatives
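As an illustration of the SoA layout and of applying an identical instruction to every component, here is a minimal C++ sketch; the container and field names (TrackSoA, fX, ...) are invented for the example and are not the actual GeantV types.

```cpp
#include <vector>
#include <cstddef>

struct TrackSoA {
  // Each physical quantity is stored contiguously, so lane i can load
  // fX[i], fPz[i], fE[i] with unit-stride (coalesced) memory accesses.
  std::vector<double> fX, fY, fZ;     // positions
  std::vector<double> fPx, fPy, fPz;  // momenta
  std::vector<double> fE;             // energies

  std::size_t size() const { return fE.size(); }
};

// Toy vectorizable kernel: a Lorentz boost along z (natural units), the same
// arithmetic applied to every track with no branches or data dependencies.
void BoostZ(TrackSoA &tracks, double beta, double gamma) {
  for (std::size_t i = 0; i < tracks.size(); ++i) {
    const double e  = tracks.fE[i];
    const double pz = tracks.fPz[i];
    tracks.fE[i]  = gamma * (e + beta * pz);
    tracks.fPz[i] = gamma * (pz + beta * e);
  }
}
```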

Target
Basic components of physics kernels (a free-path sketch follows below)
–Free-path analysis: sampling the physics process and step length
–Collision analysis: energy loss, multiple scattering, secondary production
Choices for vectorized physics models
–Tabulated physics (cross-section calculation, final-state sampling)
–Fully vectorized arithmetic algorithms (auto, deep SIMD)
Core techniques and patterns
–Conditional branches: mask
–Coalesced memory access: gather
–Composition and rejection: replaced by alias sampling
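For the free-path analysis, a hedged sketch of the kind of branch-free, per-track loop that is the target, assuming the usual exponential path-length sampling step = -ln(u)/Σ (all names are illustrative):

```cpp
#include <cmath>
#include <cstddef>

// Sample the proposed step length for a basket of tracks from the total
// macroscopic cross section: identical instruction per lane, no branches.
void SampleFreePath(const double *macroXsec,  // per-track macroscopic xsec [1/cm]
                    const double *uniform,    // per-track uniforms in (0,1]
                    double *step,             // output: proposed step [cm]
                    std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    step[i] = -std::log(uniform[i]) / macroXsec[i];
}
```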

Q4: What is the most-used Monte Carlo technique in Geant4?
1. Inverse transform
2. Acceptance and rejection
3. Composition and rejection
4. VEGAS algorithm
Problem: conditional branches (do-while) are not vectorizable
–The loop count is non-deterministic
Use effectively vectorizable algorithms instead
–Shuffling (overhead may be significant)
–Inverse cumulative pdf method (potential bias)
–Alias method (A. J. Walker, 1974)

Sampling Secondary Particles: Alias Method (A. J. Walker)
Replace composition and rejection methods (conditional branches, not vectorizable)
Recast a cross section f(x) into N equally probable events, each with likelihood 1/N = c
Alias table
–a[recipient] = donor
–q[N] = non-alias probability
Sampling x_j from two uniform random numbers u_1, u_2 (a code sketch follows below)
–Bin index and fraction: N u_1 = i + λ
–Sample j = (u_2 < q[i]) ? i : a[i]
–x_j = [λ j + (1 − λ)(j + 1)] Δx
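A hedged C++ sketch of the alias method as described above: build the table (q = non-alias probability, a = alias index) from a discretized pdf, then sample with two uniforms. The identifiers are illustrative rather than the actual vector-physics code, and the sampled variable is assumed to start at zero.

```cpp
#include <vector>
#include <queue>
#include <random>

struct AliasTable {
  std::vector<double> q;  // non-alias ("keep") probability per bin
  std::vector<int>    a;  // alias (donor) bin index
};

// Build the table from a discretized pdf f[0..N-1] (need not be normalized).
AliasTable BuildAliasTable(const std::vector<double> &f) {
  const int N = static_cast<int>(f.size());
  double sum = 0.0;
  for (double v : f) sum += v;

  AliasTable t{std::vector<double>(N, 1.0), std::vector<int>(N, 0)};
  std::vector<double> p(N);
  std::queue<int> smallQ, largeQ;
  for (int i = 0; i < N; ++i) {
    p[i] = f[i] * N / sum;                  // scaled so the average weight is 1
    (p[i] < 1.0 ? smallQ : largeQ).push(i);
  }
  while (!smallQ.empty() && !largeQ.empty()) {
    const int s = smallQ.front(); smallQ.pop();
    const int l = largeQ.front(); largeQ.pop();
    t.q[s] = p[s];                          // keep-probability of the light bin
    t.a[s] = l;                             // its alias (the heavy donor bin)
    p[l] = (p[l] + p[s]) - 1.0;             // donor's remaining weight
    (p[l] < 1.0 ? smallQ : largeQ).push(l);
  }
  // Leftover bins (weight ~1) keep themselves with probability 1.
  while (!largeQ.empty()) { t.q[largeQ.front()] = 1.0; largeQ.pop(); }
  while (!smallQ.empty()) { t.q[smallQ.front()] = 1.0; smallQ.pop(); }
  return t;
}

// Sample x_j with two uniforms: u1 gives the bin i and the fraction lambda,
// u2 chooses between the bin and its alias, as on the slide.
double SampleAlias(const AliasTable &t, double dx, std::mt19937 &rng) {
  std::uniform_real_distribution<double> uni(0.0, 1.0);
  const int    N      = static_cast<int>(t.q.size());
  const double Nu     = N * uni(rng);       // N * u1 = i + lambda
  const int    i      = static_cast<int>(Nu);
  const double lambda = Nu - i;
  const double u2     = uni(rng);
  const int    j      = (u2 < t.q[i]) ? i : t.a[i];
  // Interpolate inside bin j (grid assumed to start at zero):
  return (lambda * j + (1.0 - lambda) * (j + 1)) * dx;
}
```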

Alias Method: Validation and Performance
–Differential cross sections of EM processes (e.g., scattering angle)
–Sizable performance gains on both CPU and GPU

Coalesced Memory Access
Sampling the step length and the associated physics process (a gather sketch follows below)
–Cross-section calculations on the fly (fully vectorizable, but may be costly)
–Tabulated physics (gather operation for table look-ups, bandwidth limited)
Rearrange data to enable contiguously ordered memory accesses
–Overhead: reallocating data to the stack (smaller than the gain from vectorization)
Sizable performance gains on both CPU and GPU
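A minimal sketch of the gather pattern for tabulated physics, assuming a simple per-energy-bin cross-section table; the names (XsecTable, energyBin, ...) are invented for the example and not the real GeantV interfaces.

```cpp
#include <vector>
#include <cstddef>

struct XsecTable {
  std::vector<double> value;  // tabulated cross sections, one entry per energy bin
};

// Scattered table rows needed by a basket of tracks are first gathered into a
// contiguous buffer, so the subsequent compute loop runs on unit-stride data.
void GatherAndScale(const XsecTable &table,
                    const std::vector<int> &energyBin,   // per-track bin index
                    const std::vector<double> &density,  // per-track material density
                    std::vector<double> &macroXsec) {    // output, per track
  const std::size_t n = energyBin.size();
  std::vector<double> gathered(n);

  // Gather: indexed (scattered) loads, bandwidth limited.
  for (std::size_t i = 0; i < n; ++i)
    gathered[i] = table.value[energyBin[i]];

  // Compute: contiguous accesses, easily vectorized.
  macroXsec.resize(n);
  for (std::size_t i = 0; i < n; ++i)
    macroXsec[i] = gathered[i] * density[i];
}
```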

Portability (Template Approach): Scalar, Vector, CUDA, MIC
–A common interface (Backend) for different architectures (see the sketch below)
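A minimal sketch of the template/backend idea, assuming a toy ScalarBackend; the real GeantV/VecGeom backends expose analogous SIMD and CUDA types, but the names below are placeholders.

```cpp
// A kernel is written once against Backend::Double_t / Backend::Bool_t and
// instantiated per backend (scalar, SIMD, CUDA) without changing its body.
struct ScalarBackend {
  using Double_t = double;
  using Bool_t   = bool;
  static Double_t Blend(Bool_t mask, Double_t a, Double_t b) {
    return mask ? a : b;  // a SIMD backend would issue a vector select here
  }
};

// Toy kernel, written once: clamp a proposed step length to a user limit.
template <class Backend>
typename Backend::Double_t LimitStep(typename Backend::Double_t proposed,
                                     typename Backend::Double_t maxStep) {
  typename Backend::Bool_t over = (proposed > maxStep);
  return Backend::Blend(over, maxStep, proposed);
}

// Usage with the scalar backend; a backend providing the same aliases and
// Blend() would compile the identical kernel unchanged.
double Example() { return LimitStep<ScalarBackend>(3.2, 2.5); }
```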

Plan
Implement one fully vectorized EM physics model (Klein-Nishina Compton) and test it with GeantV by CHEP 2015
–Vector, scalar, CUDA
–Performance evaluation and validation (against tabulated physics and Geant4)
Complete all EM physics
–Establish a backend schema for the vector physics package
–Extend to hadron physics and explore other algorithms

Vision: Ancien Régime to Liberty
The failure of architecture-aware algorithms (and vice versa) is the dawn of a new age. Scientific advancement is not evolutionary, but rather is a "series of peaceful interludes punctuated by intellectually violent revolutions", and in those revolutions "one conceptual world view is replaced by another" (Thomas Kuhn).