R EDESIGN OF TG EO FOR CONCURRENT PARTICLE TRANSPORT ROOT Users Workshop, Saas-Fee 2013 Andrei Gheata.

Slides:

Advertisements

Similar presentations

List Ranking and Parallel Prefix

Advertisements

Concurrency The need for speed. Why concurrency? Moore’s law: 1. The number of components on a chip doubles about every 18 months 2. The speed of computation.

Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

Rechen- und Kommunikationszentrum (RZ) Parallelization at a Glance Christian Terboven / Aachen, Germany Stand: Version 2.3.

A Bridge to Your First Computer Science Course Prof. H.E. Dunsmore Concurrent Programming Threads Synchronization.

Prospector : A Toolchain To Help Parallel Programming Minjang Kim, Hyesoon Kim, HPArch Lab, and Chi-Keung Luk Intel This work will be also supported by.

Performance Evaluation of Parallel Processing. Why Performance?

Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.

Threads by Dr. Amin Danial Asham. References Operating System Concepts ABRAHAM SILBERSCHATZ, PETER BAER GALVIN, and GREG GAGNE.

Independent Component Analysis (ICA) A parallel approach.

Performance Measurement n Assignment? n Timing #include double When() { struct timeval tp; gettimeofday(&tp, NULL); return((double)tp.tv_sec + (double)tp.tv_usec.

Parallel Processing - introduction  Traditionally, the computer has been viewed as a sequential machine. This view of the computer has never been entirely.

Parallel transport prototype Andrei Gheata. Motivation Parallel architectures are evolving fast – Task parallelism in hybrid configurations – Instruction.

Orbited Scaling Bi-directional web applications A presentation by Michael Carter

Status of the vector transport prototype Andrei Gheata 12/12/12.

Detector Simulation on Modern Processors Vectorization of Physics Models Philippe Canal, Soon Yung Jun (FNAL) John Apostolakis, Mihaly Novak, Sandro Wenzel.

CHAPTER 8 SEARCHING CSEB324 DATA STRUCTURES & ALGORITHM.

Classic Model of Parallel Processing

Parallelization of likelihood functions for data analysis Alfio Lazzaro CERN openlab Forum on Concurrent Programming Models and Frameworks.

Manno, , © by Supercomputing Systems 1 1 COSMO - Dynamical Core Rewrite Approach, Rewrite and Status Tobias Gysi POMPA Workshop, Manno,

Multiprocessor Cache Consistency (or, what does volatile mean?) Andrew Whitaker CSE451.

AITI Lecture 18 Introduction to Data Structure, Stack, and Queue Adapted from MIT Course 1.00 Spring 2003 Lecture 23 and Tutorial Note 8 (Teachers: Please.

Processor Architecture

© David Kirk/NVIDIA, Wen-mei W. Hwu, and John Stratton, ECE 498AL, University of Illinois, Urbana-Champaign 1 CUDA Lecture 7: Reductions and.

VMC workshop1 Ideas for G4 navigation interface using ROOT geometry A.Gheata ALICE offline week, 30 May 05.

GEM: A Framework for Developing Shared- Memory Parallel GEnomic Applications on Memory Constrained Architectures Mucahid Kutlu Gagan Agrawal Department.

Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron.

The High Performance Simulation Project Status and short term plans 17 th April 2013 Federico Carminati.

© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems Operating Systems Processes and Threads.

Advanced Computer Networks Lecture 1 - Parallelization 1.

Weekly Report- Reduction Ph.D. Student: Leo Lee date: Oct. 30, 2009.

Uses some of the slides for chapters 7 and 9 accompanying “Introduction to Parallel Computing”, Addison Wesley, 2003.

Update on G5 prototype Andrei Gheata Computing Upgrade Weekly Meeting 26 June 2012.

Threads. Thread A basic unit of CPU utilization. An Abstract data type representing an independent flow of control within a process A traditional (or.

AUTO-GC: Automatic Translation of Data Mining Applications to GPU Clusters Wenjing Ma Gagan Agrawal The Ohio State University.

Andrei Gheata (CERN) for the GeantV development team G.Amadio (UNESP), A.Ananya (CERN), J.Apostolakis (CERN), A.Arora (CERN), M.Bandieramonte (CERN), A.Bhattacharyya.

1 Why Threads are a Bad Idea (for most purposes) based on a presentation by John Ousterhout Sun Microsystems Laboratories Threads!

Requirements for the O2 reconstruction framework R.Shahoyan, 14/08/

Concurrency and Performance Based on slides by Henri Casanova.

Report on Vector Prototype J.Apostolakis, R.Brun, F.Carminati, A. Gheata 10 September 2012.

New features in ROOT geometrical modeller for representing non-ideal geometries René Brun, Federico Carminati, Andrei Gheata, Mihaela Gheata CHEP’06,

1 ”MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs” John A. Stratton, Sam S. Stone and Wen-mei W. Hwu Presentation for class TDT24,

Lecture 5. Example for periority The average waiting time : = 41/5= 8.2.

December 1, 2006©2006 Craig Zilles1 Threads & Atomic Operations in Hardware  Previously, we introduced multi-core parallelism & cache coherence —Today.

Scheduling fine grain workloads in GeantV A.Gheata Geant4 21 st Collaboration Meeting Ferrara, Italy September

GeantV – Adapting simulation to modern hardware Classical simulation Flexible, but limited adaptability towards the full potential of current & future.

Chapter 4: Threads Modified by Dr. Neerja Mhaskar for CS 3SH3.

18-447: Computer Architecture Lecture 30B: Multiprocessors

ALICE experience with ROOT I/O

Parallel Processing - introduction

Threads Cannot Be Implemented As a Library

6/16/2010 Parallel Performance Parallel Performance.

Atomic Operations in Hardware

Atomic Operations in Hardware

Exploiting Parallelism

Ideas for G4 navigation interface using ROOT geometry

Report on Vector Prototype

Lecture 2: Intro to the simd lifestyle and GPU internals

Lecture 5: GPU Compute Architecture

Chapter 4: Threads.

Lecture 5: GPU Compute Architecture for the last time

Chapter 4: Threads.

Parallelism and Concurrency

CHAPTER 4:THreads Bashair Al-harthi OPERATING SYSTEM

ECE 498AL Lecture 15: Reductions and Their Implementation

CS510 - Portland State University

PERFORMANCE MEASURES. COMPUTATIONAL MODELS Equal Duration Model:  It is assumed that a given task can be divided into n equal subtasks, each of which.

Chapter 4: Threads & Concurrency

Problems with Locks Andrew Whitaker CSE451.

Presentation transcript:

R EDESIGN OF TG EO FOR CONCURRENT PARTICLE TRANSPORT ROOT Users Workshop, Saas-Fee 2013 Andrei Gheata

W HY " PARALLELIZING " GEOMETRY ? Key component in many HEP applications Simulation MC, reconstruction, event displays, DB’s, … Wrapper to 3D or material information Positions, sizes, matrices, alignment, densities, … Optimizing this use case is application responsibility E.g. event display + geometry Some of the HEP applications use directly geometry functionality, namely navigation Namely transport MC and tracking code As these applications work or will work concurrently, geometry has to follow… Enhance intrinsic geometry performance… 2 Redesign of TGeo for concurrent particle transport

W HAT KIND OF PARALLELISM ? Sequential usage: 3 Redesign of TGeo for concurrent particle transport QueryPropagateQueryPropagate  Parallel transport of events or tracks: QueryPropagateQueryPropagateQueryPropagateQueryPropagate  Parallel local vector transport: Local Query Local Propagate Local Query Local Propagate Local Query Local Propagate Local Query Local Propagate

H OW TO IMPLEMENT IT ? Thread safety Achieved and covered in this presentation Navigation is the concurrent part – provide a parallelism model here Task-based parallelism (e.g. different tracks/events to different threads) Preserve or improve locality Hierarchical navigation, difficult to factorize WITHIN a query Context cannot be stored in the data structures -> propagate with query Further finer grain concurrency – solids Main CPU consumer in geometry GPU kernels, other coprocessors Context exchange latency can be a limiting factor – to be tested Vectorization – ideal for low level computation Requires vectorized navigation methods AND an application sending vector queries Further discussed in next slides 4 Redesign of TGeo for concurrent particle transport

G EOMETRY DATA STRUCTURES ROOT geometry was NOT thread safe by design State backed up in common geometry structures (optimization purposes) Many methods, including simple getters, were not thread safe The context dependent content not clearly separated from the const 5 Redesign of TGeo for concurrent particle transport class TGeoPatternFinder : public TObject { … Double_t fStep; // division step length Double_t fStart; // starting point on divided axis Double_t fEnd; // ending point Int_t fCurrent; // current division element Int_t fNdivisions; // number of divisions Int_t fDivIndex; // index of first div. node TGeoMatrix *fMatrix; // generic matrix TGeoVolume *fVolume; // volume to which applies Int_t fNextIndex; //! index of next node

R E - DESIGN STRATEGY Thread safety without sacrificing existing optimizations Step 1: Split out the navigation part from the geometry manager Most data structures here depend on the state Different calling threads will work with different navigators Step 2: Spot all thread unsafe data members and methods within the structural geometry objects and protect them Shapes and optimization structures Convert object->data into object->data[thread_id] Step 3: Rip out all context-dependent data from structural objects to keep a compact const access geometry core Whenever possible, percolate the state in the calling sequence 6 Redesign of TGeo for concurrent particle transport

P ROBLEMS ALONG THE WAY Separating navigation out of the manager was a tedious process Keeping a large existing API functional Spotting the thread-unsafe objects was not obvious Practically all work done by Matevz Tadel (thanks!) Changing calling patterns was sometimes impossible, resources needed to be locked First approach suffered a lot from Amdahl law Many calls to get the thread Id needed, while there was no implementation of TLS in ROOT __thread not supported everywhere 7 Redesign of TGeo for concurrent particle transport

TGeo Navigator I MPLEMENTATION Thread data pre-alocated via TGeoManager::SetMaxThreads() User threads have to ask for a navigator via TGeoManager::CreateNavigator() Internal access to a stateful data member goes via: fStateful ↔ GetThreadData(tid)->fStateful Non-API methods use percolation: Method(data) ↔ Method(data, &context) 8 Redesign of TGeo for concurrent particle transport Analysis manager TGeo Navigator TGeo Navigator TGeo Navigator Stateless const TGeoManager::ThreadId() Data structures static __thread tid=0 tid=1tid=2tid=3 statefulstatefulstatefulstateful struct ThreadData_t {statefull data members;} mutable std::vector fThreadData

U SAGE 9 Redesign of TGeo for concurrent particle transport //_________________________________________________________ MyTransport::PropagateTracks(tracks) { // User transport method called by the main thread gGeoManager->SetMaxThreads(N); // mandatory SpawnNavigationThreads(N, my_navigation_thread, tracks) JoinNavigationThreads(); } void *my_navigation_thread(void *arg) { // Navigation method to be spawned as thread TGeoNavigator *nav = gGeoManager->GetCurrentNavigator(); if (!nav) nav = gGeoManager->AddNavigator(); int tid = nav->GetThreadId(); // or TGeoManager::ThreadId() PropagateTracks(subset(tid,tracks)); return 0; }

SCALABILITY Good scalability with rather small Amdahl effects (~0.7 % sequential No lock on memory resources however ! Work balancing is not perfect in the test Small overheads due to several hidden effects TreadId calls, false cache sharing, pthread calls May need to re-organize context data per thread rather than per object 10 Redesign of TGeo for concurrent particle transport

V ECTORIZING THE GEOMETRY Assuming the work can be organized such that: Several particles query the same solid (vector navigator) Difficult to achieve, requires vectorized transport A single particle queries several solids of the same type (content navigator) Requires modifying the current navigation optimizers per volume, as well as a "decent" volume content What could be the gain? Achieving sizeable speedup factors via SIMD features built into all modern processors Starting the exercise is needed in order to assess how much 11 Redesign of TGeo for concurrent particle transport

TOY EXAMPLE 12 Redesign of TGeo for concurrent particle transport TGeoBBox::Contains(Double_t *point) { if (fabs(point[0])>fDX) return false; if (fabs(point[1])>fDY) return false; if (fabs(point[2])>fDZ) return false; return true; } void ContainsNonVect(Int_t n, point_t *a, bool *inside) { int i; for (i=0; i<n; i++) { inside[i] = (fabs(a[i].x[0]<=fDX) & (fabs(a[i].x[1]<=fDY) & (fabs(a[i].x[2]<=fDZ) } void ContainsVect(Int_t n, point_t * __restrict__ a, bool *inside) { int i; for (i=0; i<n; i++) { double *temp = a[i].x; inside[i] = (fabs(*temp)<=fDX) & (fabs(*(temp+1))<=fDY) & (fabs(*(temp+2))<=fDZ) } sec 5.4 sec (x2 original) 3.74 sec (x3 original) can expect max. x2 non-vect

O VERVIEW Thread safety for geometry achieved, introducing 1-2% overhead compared to the initial version Additional ThreadId() and GetThreadData() calls Fast and portable thread ID retrieval implemented via ThreadLocalStorage.h __thread for Linux/AIX/MACOSX_clang __declspec(thread) on WIN pthread_(set/get)specific for SOLARIS, MACOSX Parallel navigation has to be enabled: gGeoManager->SetMaxThreads(N) One navigator: per thread: gGeoManager->AddNavigator() No locks, very good scalability Small overhead to be adressed by reshuffling context data 13 Redesign of TGeo for concurrent particle transport

N EXT STEPS Geometry navigation is not a simple parallel problem Re-shuffling of transport algorithms will be needed Propagating vectors from top to bottom When available from the tracking code… Re-think algorithms for solids from this perspective Factorizing loops within a volume Low level optimization at the level of voxels Data flow model and locality to be thought over Minimizing latencies and cache misses when using low-level computational units (GPU) Vectorizing the geometry can potentially gain an important factor The time scale is few years, but the work has to start now 14 Redesign of TGeo for concurrent particle transport