Perspective on Geant4 Parallelization

Koichi Murakami KEK/CRC

Outline
- Scope of this talk
- Review of each Geant4 parallelization technology: past / current / ongoing / future

Scope of this talk
An elder person says: “Speeding up Geant4 is a Clear and Present Issue” for many years.
We should answer from a practical point of view.
- What solution can we provide for users with a multi-core PC or a PC cluster?
- The GRID-wise / local batch job submission model is out of scope.
  - It is methodology, not technology. Ordinary people do not use the GRID.
  - It is not a Geant4 issue.
Review of each Geant4 parallelization technology, from the sides of both developers and users:
- Past / current: TOP-C, DIANE, MPI
- Ongoing: multi-threading
- Future?: GPU
(Diagram labels: Software, Technology, Methodology)

Current situation of Geant4 parallelization
Users can see in /examples/extended/parallel/:
  ExDiane/ : DIANE
  MPI/     : MPI
  ParaN02/ : TOP-C
  ParaN04/ : TOP-C
  README   : TOP-C
  info/    : TOP-C
Users’ impression:
- Does Geant4 not officially support parallelization?
- What is DIANE? What is TOP-C?
- Yes, I know the word MPI.

TOP-C vs. MPI

                        TOP-C                            MPI
Method                  Task-oriented, over MPI          Message passing
Library                 Private library                  Standard protocol; many compliant libraries
Installation            Self installation                Included in many Linux distributions
Parallelization level   Event-based                      Run-based
Effects on users        G4RunManager -> ParRunManager    G4MPIsession, one of the UI sessions
Required skill          High                             Easy

MPI support for MC packages
- GEANT3: users’ activities?
- EGS4/5: users’ activities
- MCNPX: support
- PHITS
- MARS
- Geant4: support?
Should we send a message to users?

A user’s voice
Multi-core CPUs are easy to get and run. At present, CPUs with up to 4 (6/8) cores are available. Although the current Geant4 is not multi-threaded, it can still profit from a multi-core CPU. I tried a few benchmarks.
- exMPI01 in geant4.9.3/examples/extended/parallel/MPI, by Koichi.
- Koichi implemented the generic MPI interface for Geant4. In particular, the /mpi/beamOn command is automatically broadcast to all slaves, so if each slave runs in parallel, we can expect a speed-up.
- With an Intel Core i7 and an AMD Phenom II, I measured and confirmed the speed-up. For example, the Core i7 took 122 seconds (single thread), 62 seconds (2 threads) and 34 seconds (4 threads) of wall-clock time.
- Therefore, with the current Geant4 + MPI, we can expect a speed-up proportional to the number of cores.
Preference for MPI rather than TOP-C
- As Koichi stated, it is relatively easy for users to make their code compliant with MPI.
- Based on MPI, we can build a PC cluster at reasonable cost and speed up.
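The Core i7 timings quoted above can be turned into speed-up and parallel-efficiency figures. The following is a minimal sketch; the function names are illustrative only and belong to neither Geant4 nor the benchmark.

```cpp
#include <cassert>
#include <cstdio>

// Speed-up relative to the single-worker wall-clock time.
double Speedup(double t1, double tn) {
  assert(t1 > 0.0 && tn > 0.0);
  return t1 / tn;
}

// Parallel efficiency: speed-up divided by the number of workers.
double Efficiency(double t1, double tn, int n) {
  assert(n > 0);
  return Speedup(t1, tn) / n;
}

// Print the figures for the quoted timings (122 s, 62 s, 34 s).
void PrintCoreI7Scaling() {
  const double t1 = 122.0;           // seconds with 1 worker
  const int n[] = {2, 4};            // number of workers
  const double tn[] = {62.0, 34.0};  // seconds with 2 and 4 workers
  for (int i = 0; i < 2; ++i) {
    std::printf("n=%d speedup=%.2f efficiency=%.0f%%\n", n[i],
                Speedup(t1, tn[i]), 100.0 * Efficiency(t1, tn[i], n[i]));
  }
}
```

With these numbers the speed-up is about 1.97 on 2 workers and about 3.59 on 4, which is an efficiency of roughly 98% and 90%.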

Another user’s experience

Implementation of G4MPI
- Simple master-slave model: UI commands are broadcast from the master to the slaves.
- Implemented in the “Interface” layer.
- There is no need for a special RunManager etc. for parallel runs.
(Diagram: the master G4MPISession broadcasts UI commands to the slave G4MPISession instances by message passing, and the slaves return responses.)
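The dispatch pattern on this slide can be modeled in a few lines. This is only an in-process sketch under stated assumptions: `Session`, `MakeSessions` and `Broadcast` are hypothetical names, not the G4MPI API, and a plain loop stands in for the MPI message passing.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one UI session (one per rank).
struct Session {
  int rank;
  std::vector<std::string> executed;  // commands this rank has applied

  // Apply a UI command locally and return a response string.
  std::string Execute(const std::string& command) {
    executed.push_back(command);
    return "rank " + std::to_string(rank) + ": ok";
  }
};

// Build n sessions; rank 0 plays the master, the others are slaves.
std::vector<Session> MakeSessions(int n) {
  std::vector<Session> sessions;
  for (int r = 0; r < n; ++r) sessions.push_back(Session{r, {}});
  return sessions;
}

// The same command reaches every rank and one response per rank comes back.
// A real implementation would use MPI message passing instead of this loop.
std::vector<std::string> Broadcast(std::vector<Session>& sessions,
                                   const std::string& command) {
  assert(!sessions.empty());
  std::vector<std::string> responses;
  for (Session& s : sessions) responses.push_back(s.Execute(command));
  return responses;
}
```

Because only command strings are forwarded, the run manager on each rank needs no knowledge of the parallel setup, which is the point of placing the logic in the interface layer.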

Proposal on MPI
- The MPI interface (core library) is moved from “extended examples” to “interface.”
  - The MPI interface is implemented as one of the UI sessions, so this is a natural conclusion.
  - Example codes are left in the extended examples.
- Concerns about the external library dependency can be treated as in the GDML case.
  - Additional configuration flags: ENABLE_MPI, MPI_LIBRARY_PATH,…

Geant4 multi-threading
- Ongoing, eager approach by Xin, Gene, JA.
- Benefit of multi-threading: efficient memory consumption
  - The MPI approach consumes x(#processes) memory.
  - Large geometry and cross-section data can be shared among multiple threads.
- Cons: the development cost is very high.
  - Parallelization of an old program
  - Guarantee of thread safety
  - Most of the STL is not thread-safe in general.

Quick look at Geant4 Multi-threading
- Parallel run manager, same as the TOP-C approach
  - not by MPI, but by multi-threading
- TLS (thread-local storage)
  - static/global variables are made thread-local with “__thread” (gcc)
  - automatic TLS conversion with a patched C++ parser
- For non-thread-safe variables?
  - Locking with a mutex: performance bottleneck
  - Push thread localization…
- Next steps:
  - Read-write data: check with Valgrind
  - Change to array data sized by the number of instances
  - Declare as static TLS
  - Set/Get methods should be modified to access by instance ID.

Next step : memory reallocation approach
- Parallel loop: parallel event loop
- Read-only data: check using Valgrind
- Read/write data, e.g. G4int G4PVReplica::fcopyNo, is replaced by a per-instance index plus a thread-local array:
    G4int G4PVReplica::instance_id
    static __thread G4int* G4PVReplica::fcopyNo    (TLS)
    G4int G4PVReplica::GetCopyNo() const { return fcopyNo[instance_id]; }
- Thread localized!!
- To use TLS, a variable should be static, which means that the variable is shared among all instances in each thread.
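The instance-id scheme above can be tried out in isolation. The sketch below is not Geant4 code: `Replica` and `RunWorkers` are hypothetical, and C++11 `thread_local` with a `std::vector` is swapped in for the gcc `__thread` pointer shown on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical miniature of a replicated volume; not the real G4PVReplica.
class Replica {
 public:
  Replica() : instance_id(next_id++) {}

  // Each worker thread calls this once, after all volumes have been built,
  // to allocate its own private array with one slot per instance.
  static void InitThreadStorage() { fcopyNo.assign(next_id, 0); }

  void SetCopyNo(int n) {
    assert(instance_id < static_cast<int>(fcopyNo.size()));
    fcopyNo[instance_id] = n;
  }
  int GetCopyNo() const { return fcopyNo[instance_id]; }

 private:
  int instance_id;      // fixed at construction, shared by all threads
  static int next_id;   // number of instances created so far
  static thread_local std::vector<int> fcopyNo;  // one array per thread
};

int Replica::next_id = 0;
thread_local std::vector<int> Replica::fcopyNo;

// Worker t writes 100*t + i into shared volume i and reads it back.
// Returns true if no worker ever saw another worker's value.
bool RunWorkers(std::vector<Replica>& volumes, int nthreads) {
  std::vector<char> ok(nthreads, 0);  // one result slot per worker
  std::vector<std::thread> pool;
  for (int t = 0; t < nthreads; ++t) {
    pool.emplace_back([&volumes, &ok, t] {
      Replica::InitThreadStorage();
      for (std::size_t i = 0; i < volumes.size(); ++i)
        volumes[i].SetCopyNo(100 * t + static_cast<int>(i));
      bool good = true;
      for (std::size_t i = 0; i < volumes.size(); ++i)
        good = good && volumes[i].GetCopyNo() == 100 * t + static_cast<int>(i);
      ok[t] = good ? 1 : 0;
    });
  }
  for (std::thread& th : pool) th.join();
  for (char c : ok) if (!c) return false;
  return true;
}
```

The volumes themselves are shared and read-only during the loop; only the copy numbers live in per-thread storage, so no mutex is needed around `SetCopyNo`.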

Programmer’s way : algorithmic approach
- Let’s consider multi-event processing.
  - Multiple track objects in different events concurrently access G4PVReplica objects.
- Parameterized geometry is not thread-safe for multi-particle tracking.
  - The “CopyNo” of G4PVReplica should be assigned for each track of concurrent events.
  - The “CopyNo” attribute should be moved to G4Track and updated by the Navigator in each step.
- In reality, this may not be enough.
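Moving the copy number from the volume to the track can be sketched as follows. All types here are hypothetical miniatures, not the real G4Track, G4PVReplica or G4Navigator, and the slicing along one axis is an assumption made only to keep the example small.

```cpp
#include <cassert>

// Hypothetical replicated volume: immutable during tracking.
struct ReplicaVolume {
  double width;   // width of one replica slice
  int nReplicas;  // number of slices
};

// Hypothetical track: the copy number is per-track state.
struct Track {
  double x;    // position along the replication axis
  int copyNo;  // moved here from the volume
};

// Navigator step: find the slice containing the track and store the result
// on the track. The volume is taken by const reference, so concurrent
// tracks cannot interfere with each other through it.
void LocateTrack(const ReplicaVolume& vol, Track& track) {
  assert(vol.width > 0.0 && vol.nReplicas > 0);
  int n = static_cast<int>(track.x / vol.width);
  if (n < 0) n = 0;                              // clamp to the first slice
  if (n >= vol.nReplicas) n = vol.nReplicas - 1; // clamp to the last slice
  track.copyNo = n;
}
```

Two tracks from different events can then be located against the same volume without sharing any mutable state. As the slide notes, a real redesign would likely have to move more than this one attribute.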

My Impression
- From a computer science view, it is very interesting and challenging.
  - Any application can be semi-automatically multi-threaded with a single prescription.
- But, from the users’ point of view:
  - Really thread-safe? Unsure…
  - Thread localization = thread safety? There is still a question.
  - A very, very tough job to make users’ applications multi-threaded.
- Comments as a Geant4 developer:
  - The current Geant4 is not designed for multi-threading!
  - We should remember that the STL is generally not thread-safe!
  - We should consider concurrent algorithms and a redesign of the object model.

GPU Computing – Future?
- Some users are interested in GPGPU for Geant4.
- Actually, there is a paper on Geant4 on GPU.
  - x20 faster
  - I guess this is not what users expect…
- The current Geant4 cannot run on a GPU as it is (with small modifications)!!
  - The Fermi architecture supports full C++.
  - Only for data processing, not for object modeling on the GPU (new/delete)
  - In future, it will be improved.
- The GPU way does not match object modeling like Geant4.
  - SIMD (Single Instruction, Multiple Data) approach
  - Particle methods are generally very suitable for GPUs.
  - Algorithm-driven methods and a simplified object model can work well with GPUs.
  - Use the current Geant4 as static data (material data, cross sections, particle data)

Actions for Speeding up Geant4: “Clear and Present Issue”
- Most practical solution: MPI support, as in other MC packages
  - Not applicable to cases of huge memory consumption (e.g. DICOM applications).
- Technical challenges: Geant4 multi-threading
  - The usage might be limited.
- Our future: where should we go?
  - Stay here or move on to the next stage? Redesign?
  - Software/hardware technology has changed over the past 15 years.

Thanks
