Geant4 Parallelisation, J. Apostolakis

Session Overview
Part 1: Geant4 Multi-threading
  o C++11 threads: opportunity for portability?
  o Open, revised and new requirements (from HEP experiments)
Part 2: Beyond MT
  o Geant4 on GPUs: prototypes
  o The ‘Geant’ prototype - moving towards Vector

Part 1: Geant4MT & new requests

Geant4 MT - major topics
New Requirements (2012)
  o Extending model of parallelism (TBB, dispatch) - CMS
  o Adapting to HEP experiment frameworks
Folding of Geant4-MT into Geant4 release-X (end 2013)
  o Streamline for maintainability, ...
Need to assess and ensure the compatibility of these directions

Geant4MT - Background
What is Geant4 MT? Goals, design, ... see background slides in Addendum (purple header)
Implementation is the PhD-thesis work of Xin Dong (Northeastern Univ.) under the supervision of Prof. Gene Cooperman, in collaboration with me (J.Ap.)
Updated to G4 9.4p1 (X+D+M+G), and to 9.5p1 by Daniel, Makoto and Gabriele.
Excellent speedup from 1 worker to 40+ workers - see CHEP 2012 poster
But: overhead vs sequential found (first reported by Philippe Canal, 2011)

Geant4 MT Prototype - brief update
MT updated to Geant4 9.5 patch Aug (Daniel Brandt, Makoto, Gabriele)
  o Improved integration of parallel main();
  o Corrected inclusion of tpmalloc.
Improvements to ‘one-worker’ overhead - now decreased from 30% to 18% (Xin)
  o Due to the interaction of Thread Local Storage (TLS) and dynamic libraries

Goals of Part 1
Geant4 MT and its future
  o Evaluate whether C++11 threads can replace pthreads (soon)
  o Identify issues, roadblocks for ‘on-demand’ version of G4MT
  o Note issues which arise from other new requirements.

Topics of Part 1 - Geant4MT
C++11 Threads and Portability
  o Talk by Marc Paterno
Request for support of ‘on demand’ parallelism
  o Talk in plenary by Chris J., Liz S.-K. (CMS)
New trial usage in ATLAS ISF
Discussion on these & related topics

C++11 threads:
Do C++11 ‘standard’ threads enable better portability (than pthreads)?
What other benefits can C++11 threads offer?
Are they available today - or soon?
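
To make the portability question concrete, here is a minimal sketch of starting and joining workers with C++11 std::thread, with the corresponding pthreads calls noted in comments. The helper name run_workers and the trivial per-worker task are assumptions for illustration; nothing here is Geant4-MT code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper (not Geant4 API): start n_workers threads using the
// C++11 standard library and return how many of them ran to completion.
int run_workers(int n_workers) {
    assert(n_workers >= 0);
    std::atomic<int> finished{0};
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(n_workers));
    for (int i = 0; i < n_workers; ++i) {
        // pthreads counterpart: pthread_create(&tid, nullptr, fn, arg)
        workers.emplace_back([&finished] { finished.fetch_add(1); });
    }
    for (auto& w : workers) {
        w.join();  // pthreads counterpart: pthread_join(tid, nullptr)
    }
    return finished.load();
}
```

The same source compiles wherever a conforming C++11 standard library is available, whereas the pthreads calls tie the code to POSIX platforms. Calling run_workers(4) returns 4.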

CMS & on-demand event simulation
Plenary presentation (Chris Jones, Eliz. Sexton-Kennedy)
Request: integration into on-demand event simulation
  o workload is handled by an outside framework (CMSSW, TBB = Threading Building Blocks)
  o unit of work: a full event.
What is required to adapt Geant4-MT to ‘on-demand’ / dispatch parallelism?
  o Key topic of Discussion session
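
One way to picture ‘on-demand’ parallelism is a simulation entry point that can be called from whatever thread the outside framework chooses, with per-thread state created lazily on first use. The sketch below is an illustration under that assumption: WorkerState, local_state, simulate_event and dispatch are hypothetical names, std::async merely stands in for the framework's scheduler, and the "simulation" is a placeholder computation. It is not the CMSSW, TBB or Geant4-MT interface.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical per-thread simulation state; a real application would keep
// its thread-local simulation objects here.
struct WorkerState {
    int events_done = 0;
};

// Create the state the first time a given framework thread calls in.
WorkerState& local_state() {
    thread_local WorkerState state;
    return state;
}

// Unit of work: one full event. The returned value is a placeholder result.
int simulate_event(int event_id) {
    ++local_state().events_done;
    return event_id * 2;
}

// Stand-in for the outside framework, which decides when and where each
// task runs; the simulation code does not own the threads.
std::vector<int> dispatch(int n_events) {
    assert(n_events >= 0);
    std::vector<std::future<int>> tasks;
    for (int id = 0; id < n_events; ++id) {
        tasks.push_back(std::async(std::launch::async, simulate_event, id));
    }
    std::vector<int> results;
    for (auto& t : tasks) {
        results.push_back(t.get());
    }
    return results;
}
```

The design point is that simulate_event makes no assumption about which thread runs it or how many events that thread has seen before.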

ATLAS input
Developing trial use - in new Integrated Simulation Framework (ISF)
  o Passes one track at a time, packaged as a G4 ‘event’ - for each primary or one entering a sub-detector
Sub-event level parallelization - using ‘event-level’ parallel Geant4-MT
  o This is the first use of this capability / potential

The ‘one-worker’ slowdown
Need more benchmarks and profiling.
Current known causes:
  o interaction of Thread Local Storage (TLS) and dynamic libraries?
  o extra calls to get_thread_id() - in singleton TLS and our “TLS for objects”
Can we avoid the slowdown due to the interaction of TLS and dynamic libraries?
  o Proposal: try putting all of G4 into one shared library
  o Or put the core - ‘nearly all’ - into one library, excluding only auxiliaries: persistency, visualization.

Other Topics for Discussion Your issues here

Part 2: Beyond threads/tasks

Overview
Need for more events by LHC/HEP experiments, medical users, ...
Challenge in CPUs: instruction fetch is the bottleneck
  o due to ‘granular’ OO methods, large number of branches, code size large compared to caches.
  o Each instruction, method does too little work
How to get more out of each instruction - and utilize the emerging architectures: GPUs, MIC, CPU with wider SIMD execution units?
  o Explore GPUs and Vectors

Opportunities
CPU evolution - wider Vector Units + instructions:
  o Widespread: CPUs with 128-bit units = 2 doubles or 4 floats
  o Emerging: 256-bit (AVX) = 4 doubles or 8 floats
GPUs: SIMD hardware, specialised languages (CUDA, OpenCL)
  o Hundreds of ‘threads’, tens of - one MultiProcessor
MIC: New public information - wide Vectors, 4 threads per core, ~60 cores
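
The lane counts quoted for the vector units follow from dividing the register width by the element size. A small compile-time check, assuming the usual 8-byte double and 4-byte float (the function name lanes is hypothetical):

```cpp
#include <cstddef>

// Number of elements of type T that fit in one vector register
// of the given width in bits.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

// These hold on platforms where sizeof(double) == 8 and sizeof(float) == 4.
static_assert(lanes<double>(128) == 2, "128-bit unit: 2 doubles");
static_assert(lanes<float>(128) == 4, "128-bit unit: 4 floats");
static_assert(lanes<double>(256) == 4, "256-bit AVX: 4 doubles");
static_assert(lanes<float>(256) == 8, "256-bit AVX: 8 floats");
```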

G4 - GPU efforts: external
GPU efforts external to G4 Collaboration
  o hGATE project, full gamma processes (2011), e- in progress: CUDA
  o G4MCD - team in Germany, both gamma and e-
These efforts focus on use in medical physics
  o Simple geometry (regular voxel volumes - trivial geometry, no Navigator.)
In touch with hGATE, part of OpenGATE: D. Visvikis (Brest)

Intro to Geant4-MT J. Apostolakis

Outline of the Geant4-MT design
There is one master thread that initialises and spawns workers; and several worker threads that execute all the ‘work’ of the simulation.
The unit of work for a worker is a Geant4 event
  o limited sub-event parallelism was foreseen by splitting a physical event (collision or trigger) into several Geant4 events.
Choice: limit changes to a few classes
  o other classes have a separate object for each worker
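
The master/worker layout with one event as the unit of work can be sketched as follows. How Geant4-MT actually hands events to workers is not stated here; the shared atomic counter is an assumption chosen for brevity, and run_simulation is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical sketch: the calling (master) thread spawns workers; each
// worker repeatedly claims the next event number until none are left.
// Returns the total number of events processed by all workers.
int run_simulation(int n_workers, int n_events) {
    assert(n_workers >= 0 && n_events >= 0);
    std::atomic<int> next_event{0};
    // One counter per worker, so workers never write to the same element.
    std::vector<int> processed(static_cast<std::size_t>(n_workers), 0);
    std::vector<std::thread> workers;
    for (int w = 0; w < n_workers; ++w) {
        workers.emplace_back([&next_event, &processed, n_events, w] {
            // fetch_add returns the previous value, so each event number
            // in [0, n_events) is claimed by exactly one worker.
            while (next_event.fetch_add(1) < n_events) {
                ++processed[static_cast<std::size_t>(w)];  // placeholder for one event
            }
        });
    }
    for (auto& t : workers) {
        t.join();
    }
    return std::accumulate(processed.begin(), processed.end(), 0);
}
```

With at least one worker, every event is processed exactly once regardless of how the scheduler interleaves the threads, so run_simulation(4, 100) returns 100.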

Goals of Geant4-MT
Key goals of G4-MT
  o allow full use of multi-core hardware (including hyper-threading)
  o reduce the memory footprint by sharing the large data structures
  o enable use of additional threads within limited memory
  o reduce cost of memory accesses.
Looking forward - a personal view:
Medium term goal: make Geant4 thread-safe (Geant4 X - Dec 2013)
  o for use in multi-threaded applications.
Longer term goal
  o increase the throughput of simulation by enabling the use of additional resources: co-processors and/or additional hardware threads.

Limit extent of changes
The choice was to concentrate revisions to a few classes
  o to reduce the effort required to create, test and maintain it
The few classes that are changed are ones that
  o manage the event loop
  o touch geometry objects with multiple physical instances (replicas etc.)
  o must share cross-sections for EM processes,
  o create or configure the above classes.
All other classes are unchanged
  o a separate object is created by each worker.

Implementation
Uses the POSIX threads library (pthreads)
  o currently works only on Linux.
Global data is separated by thread
  o using the gcc construct __thread - this includes singletons.
The master thread initializes all data
  o reads all parameters and starts the other threads;
Instances of separate objects are cloned by each worker
  o copying the contents of all these objects in the master thread (shallow copy or deep copy?)
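
The effect of separating global data by thread can be shown in a few lines. The sketch uses the C++11 keyword thread_local, which behaves like the gcc __thread construct for a plain int; the variable events_seen and the function count_in_other_thread are hypothetical.

```cpp
#include <cassert>
#include <thread>

// Hypothetical global: every thread gets its own copy, each starting at 0.
// With gcc the declaration could equally be written `__thread int events_seen;`.
thread_local int events_seen = 0;

// Increment the counter n times in a new thread and return the value that
// thread ended up with. The caller's own copy is never touched.
int count_in_other_thread(int n) {
    assert(n >= 0);
    int seen_by_worker = -1;
    std::thread worker([&seen_by_worker, n] {
        for (int i = 0; i < n; ++i) {
            ++events_seen;  // refers to the worker's copy only
        }
        seen_by_worker = events_seen;
    });
    worker.join();
    return seen_by_worker;
}
```

If the calling thread sets events_seen to 7 and then calls count_in_other_thread(3), the call returns 3 and the caller's events_seen is still 7.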

'Split' classes
Some classes are split:
  o part of their data is shared, and
  o part is thread local.
Shared data
  o is typically invariant in the event loop
  o but also 'joint' and updated: ion table, particle table.
Implementation - customized methodology
  o each instance of a split object has an integer id
  o instantiates an array of stub objects for each thread
  o an object uses the entry in the array - index = int id
  o the (sub-)object data is initialised by the worker thread that uses it.
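
The id-plus-stub-array scheme described above can be sketched as follows. All names (Stub, stubs, SplitObject, worker_modifies_own_copy) are hypothetical and the fields are placeholders; the sketch also assumes that split objects are constructed by the master thread before any worker starts, so the id counter needs no locking. It illustrates the idea only and is not the Geant4-MT implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Thread-local part of a split object (placeholder fields).
struct Stub {
    bool initialised = false;
    double cached_value = 0.0;
};

// One array of stubs per thread, indexed by the object's integer id.
thread_local std::vector<Stub> stubs;

class SplitObject {
public:
    // Assumed to run on the master thread only, before workers exist.
    explicit SplitObject(double shared) : id_(next_id()++), shared_(shared) {}

    // Return the calling thread's stub for this object. The stub is
    // initialised from the shared data by the thread that first uses it.
    // The reference is valid until another split object grows this
    // thread's array.
    Stub& local() {
        if (stubs.size() <= id_) {
            stubs.resize(id_ + 1);
        }
        assert(id_ < stubs.size());
        Stub& s = stubs[id_];
        if (!s.initialised) {
            s.cached_value = shared_;
            s.initialised = true;
        }
        return s;
    }

    std::size_t id() const { return id_; }

private:
    static std::size_t& next_id() {
        static std::size_t n = 0;
        return n;
    }
    std::size_t id_;       // index into every thread's stub array
    const double shared_;  // shared part, invariant during the event loop
};

// A worker thread changes its own stub and reports the value it sees.
double worker_modifies_own_copy(SplitObject& obj) {
    double seen = 0.0;
    std::thread t([&obj, &seen] {
        obj.local().cached_value += 1.0;
        seen = obj.local().cached_value;
    });
    t.join();
    return seen;
}
```

A worker that modifies its stub leaves the stubs of all other threads unchanged, while the shared part is read by every thread without being copied.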