1 School of Nuclear Engineering
The Application of POSIX Threads and OpenMP to the U.S. NRC Neutron Kinetics Code PARCS. D.J. Lee and T.J. Downar, School of Nuclear Engineering, Purdue University. July 2001. I'd like to talk about the parallel application of the PARCS code.

2 Contents
Introduction. Parallelism in PARCS. Parallel Performance of PARCS. Cache Analysis. Conclusions.
This is the order of this presentation: an introduction to the PARCS code, the parallel implementation of PARCS, a cache analysis, and finally the conclusions.

3 Introduction

4 PARCS “Purdue Advanced Reactor Core Simulator”
U.S. NRC (Nuclear Regulatory Commission) Code for Nuclear Reactor Safety Analysis. Developed at the School of Nuclear Engineering of Purdue University. A Multi-Dimensional, Multi-Group Reactor Kinetics Code Based on the Nonlinear Nodal Method. PARCS stands for Purdue Advanced Reactor Core Simulator. It is the U.S. Nuclear Regulatory Commission's official code for nuclear reactor transient analysis, developed here at Purdue University.

5 Nuclear Power Plant
This figure is a simple diagram of a nuclear power plant, showing the nuclear reactor core together with the steam generator, turbine, and electricity generator. The PARCS code is used for the analysis of the reactor core. PARCS can also be run coupled with an external system code for the analysis of the whole plant.

6 Equations Solved in PARCS
Time-Dependent Boltzmann Transport Equation. T/H Field Equations: Heat Conduction Equation, Heat Convection Equation. PARCS solves the Boltzmann transport equation. The angular flux psi is the variable to be solved and represents the behavior of the neutrons in the reactor, while the coefficients, called "cross sections," represent material properties. In general, a cross section is a function of fuel temperature and coolant density, so PARCS also solves the thermal-hydraulics (T/H) field equations.
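The equation itself appears on the slide only as an image. As a reminder of its general structure, a schematic textbook form of the time-dependent Boltzmann transport equation for the angular flux is shown below; this is not the exact form discretized in PARCS, and the delayed-neutron precursor terms are omitted.

% Schematic time-dependent Boltzmann transport equation (delayed-neutron
% precursor terms omitted); illustrative, not the exact PARCS formulation.
\begin{equation}
\frac{1}{v(E)}\frac{\partial \psi}{\partial t}
+ \hat{\Omega}\cdot\nabla\psi
+ \Sigma_t(\mathbf{r},E,t)\,\psi
= \int_0^\infty\!\!\int_{4\pi} \Sigma_s(\mathbf{r},E'\!\to\!E,\hat{\Omega}'\!\to\!\hat{\Omega},t)\,
  \psi(\mathbf{r},E',\hat{\Omega}',t)\,d\Omega'\,dE'
+ \frac{\chi(E)}{4\pi}\int_0^\infty \nu\Sigma_f(\mathbf{r},E',t)\,\phi(\mathbf{r},E',t)\,dE'
\end{equation}

Here psi(r, E, Omega, t) is the angular flux, phi is the scalar flux, and the cross sections Sigma are the material-dependent coefficients mentioned above.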

7 Spatial Coupling
Thermal-Hydraulics: Computes new coolant/fuel properties. Sends moderator temperature, vapor and liquid densities, void fraction, boron concentration, and average, centerline, and surface fuel temperatures. Uses the neutronic power as the heat source for conduction.
Neutronics: Uses coolant and fuel properties for local node conditions. Updates macroscopic cross sections based on local node conditions. Computes the 3-D flux. Sends the node-wise power distribution.
The reactor core is discretized into many nodes for the numerical methods. One nodalization is used for the neutronics field equations and another for the T/H equations; different nodalizations can be used for the neutronics and T/H calculations. Furthermore, a different external code can be used for the T/H calculation.
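Purely as an illustration of the data exchanged at each coupled node (the module and field names below are hypothetical, not taken from PARCS), the feedback quantities listed on this slide could be grouped like this:

! Hypothetical sketch of the per-node coupling data described on this slide;
! names and layout are illustrative assumptions, not actual PARCS types.
module coupling_data
   implicit none
   type :: th_feedback
      real :: tmod        ! moderator temperature
      real :: rho_vap     ! vapor density
      real :: rho_liq     ! liquid density
      real :: void_frac   ! void fraction
      real :: boron_ppm   ! boron concentration
      real :: tfuel_avg   ! average fuel temperature
      real :: tfuel_cl    ! centerline fuel temperature
      real :: tfuel_surf  ! surface fuel temperature
   end type th_feedback
   type :: neutronics_feedback
      real :: power       ! node-wise power, used as the conduction heat source
   end type neutronics_feedback
end module coupling_data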

8 High Necessity of HPC for PARCS
Acceleration Techniques in PARCS: Nonlinear CMFD Method, Global (Low Order) + Local (High Order); BILU3D Preconditioned BICGSTAB; Wielandt Shift Method. Still, the computational burden of PARCS is very large: typically, the calculation speed is more than an order of magnitude slower than real time. Example: NEACRP benchmark, several tens of seconds for a 0.5 sec simulation; PARCS/TRAC coupled run, 4 hours for a 100 sec simulation. PARCS already has state-of-the-art acceleration techniques, such as the nonlinear Coarse Mesh Finite Difference (CMFD) method, which consists of a global low-order finite difference method and a local high-order nodal method; a Krylov subspace method with an efficient block-ILU preconditioner; and the Wielandt shift method for accelerating the outer iteration (sketched below). But the computational burden is still very large: compared to real time, the calculation speed of PARCS is more than an order of magnitude slower. So PARCS needs high-performance computing.
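For reference, the standard textbook form of the Wielandt shift (the exact variant used in PARCS may differ) writes the eigenvalue problem as M phi = (1/k) F phi, with M the neutron loss operator and F the fission production operator, and shifts the outer (power) iteration with an estimate k_s chosen slightly above the expected k_eff; this reduces the dominance ratio of the iteration and accelerates its convergence.

% Unshifted outer (power) iteration and its Wielandt-shifted form;
% k_s is a fixed shift chosen slightly above the expected k_eff.
M\phi^{(n+1)} = \frac{1}{k^{(n)}}F\phi^{(n)}
\quad\longrightarrow\quad
\left(M - \frac{1}{k_s}F\right)\phi^{(n+1)}
  = \left(\frac{1}{k^{(n)}} - \frac{1}{k_s}\right)F\phi^{(n)}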

9 Parallelism In PARCS

10 PARCS Computational Modules
CMFD: Solves the "Global" Coarse Mesh Finite Difference Equation. NODAL: Solves the "Local" Higher Order Differenced Equations. XSEC: Provides Temperature/Fluid Feedback through Cross Sections (Coefficients of the Boltzmann Equation). T/H: Solution of the Temperature/Fluid Field Equations. PARCS consists of four calculation modules: the CMFD module for the global calculation, the NODAL module for the local calculation, the XSEC module for cross-section feedback, and the T/H module for the thermal-hydraulics calculation.

11 Parallelism in PARCS
NODAL and XSEC Modules: Node-by-Node Calculation, Naturally Parallelizable. T/H Module: Channel-by-Channel Calculation. CMFD Module: Domain Decomposition Preconditioning; Example: Split the Reactor into Two Halves; The Number of Iterations Depends on the Number of Domains.
The NODAL, XSEC, and T/H modules are naturally parallelizable because their calculations are performed node by node or channel by channel. The CMFD module, however, is basically a finite difference formulation with correction factors supplied by the local solutions, so every node is coupled through a 7-stripe matrix. Domain decomposition is therefore used for its parallelization: for example, the core can be divided into two halves, an upper core and a lower core, as in this figure. A sketch of how axial planes can be assigned to threads follows below.
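As a minimal illustration (not PARCS source code), a plane-wise domain decomposition amounts to assigning a contiguous block of axial planes to each thread. The hypothetical helper below computes the plane range owned by a given thread and also shows why 18 planes divide unevenly among 4 or 8 threads:

! Hypothetical sketch: assign a contiguous range of axial planes to each
! thread (domain), as in a plane-wise domain decomposition of the core.
subroutine plane_range(nz, nthread, iam, kbeg, kend)
   implicit none
   integer, intent(in)  :: nz       ! total number of axial planes (e.g. 18)
   integer, intent(in)  :: nthread  ! number of threads (domains)
   integer, intent(in)  :: iam      ! thread index, 0-based
   integer, intent(out) :: kbeg, kend
   integer :: chunk, nrem
   chunk = nz / nthread             ! base number of planes per thread
   nrem  = mod(nz, nthread)         ! leftover planes cause load imbalance
   kbeg  = iam*chunk + min(iam, nrem) + 1
   kend  = kbeg + chunk - 1
   if (iam < nrem) kend = kend + 1
end subroutine plane_range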

12 Why Multi-Threaded Programming ?
Coupling of Domains: The information of one plane at the interface of two domains must be transferred to the other domain, and the size of this information is NOT SMALL compared with the amount of calculation in each domain. Message Passing: Large Communication Overhead. Multi-Threading: Shared Address Space, Negligible Communication Overhead. The multi-threaded programming technique is used for parallel PARCS: because the amount of information to be exchanged between domains for the domain coupling is not small compared with the calculation in each domain, we chose the technique with the smaller communication overhead.

13 Multi-threaded Programming
OpenMP: FORTRAN, C, C++; Simple Implementation Based on Directives. POSIX Threads: No Interface to FORTRAN; Developed a FORTRAN-to-C Wrapper; Much Caution Required to Avoid Race Conditions. We have two versions of parallel PARCS: an OpenMP version and a POSIX threads version. Because POSIX threads has no FORTRAN interface and PARCS is written in FORTRAN, we developed wrapper functions for POSIX threads that can be called from the PARCS code.
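A minimal sketch (illustrative, not taken from PARCS) of the directive-based style: a node-by-node loop is parallelized simply by wrapping it in an OpenMP directive pair, which is what makes the OpenMP version easy to implement.

! Minimal OpenMP sketch: a per-node update parallelized with one directive pair.
program omp_demo
   use omp_lib
   implicit none
   integer, parameter :: nnode = 100000
   real    :: flux(nnode), src(nnode)
   integer :: k
   src = 1.0
!$omp parallel do private(k) shared(flux, src)
   do k = 1, nnode
      flux(k) = 0.5*src(k)        ! stands in for a per-node calculation
   end do
!$omp end parallel do
   print *, 'threads available:', omp_get_max_threads(), '  flux(1) =', flux(1)
end program omp_demo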

14 POSIX THREADS WITH FORTRAN: nuc_threads
Mixed-language interface accessible to both the Fortran and C sections of the code. Minimal set of thread functions: nuc_init(*ncpu): initializes the mutex and condition variables. nuc_frk(*func_name,*nuc_arg,*arg): creates the POSIX threads. nuc_bar(*iam): used for synchronization. nuc_gsum(*iam,*A,*globsum): used to get a global sum of an array updated by each thread. These four functions (init, fork, barrier, and global sum) are the wrappers used for the implementation of the Pthreads version of parallel PARCS. A hypothetical usage sketch follows.
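The sketch below shows only the calling pattern suggested by the slide; the routine names and rough argument lists come from the slide, but the worker body, argument meanings, and the assumption that the wrapper passes each thread its index are illustrative guesses, not the real nuc_threads API.

! Hypothetical calling pattern for the nuc_threads wrappers described above.
subroutine worker(iam)
   implicit none
   integer, intent(in) :: iam
   real :: local(10), globsum(10)
   local = real(iam + 1)               ! each thread fills its local contribution
   call nuc_bar(iam)                   ! synchronize before the reduction
   call nuc_gsum(iam, local, globsum)  ! global sum across all threads
end subroutine worker

program nuc_demo
   implicit none
   external worker
   integer :: ncpu, dummy
   ncpu  = 2
   dummy = 0
   call nuc_init(ncpu)                 ! set up mutex and condition variables
   call nuc_frk(worker, dummy, dummy)  ! fork the POSIX threads once
end program nuc_demo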

15 Implementation of OpenMP and Pthreads
[Diagram: side-by-side execution timelines of the OpenMP version (fork at each parallel region, then join) and the Pthreads version (fork once at the beginning, synchronize at barriers, one thread idles during serial sections).]
The OpenMP and Pthreads versions of parallel PARCS basically do the same calculation, but the implementations differ slightly. For OpenMP, multiple threads are forked when they are needed and joined after the parallel work finishes, so multiple threads exist only in the parallel sections. For Pthreads, multiple threads are forked at the beginning of the execution and stay alive until the whole calculation is finished; synchronization is implemented by the wrapper function "barrier". In the parallel sections threads 1 and 2 work together, and after the parallel work the two threads are synchronized and thread 2 idles while thread 1 continues the serial work. The purpose of this implementation is to reduce the overhead of thread forking, even though it adds programming complexity. From the programming point of view, OpenMP is easier than POSIX threads; in particular, debugging the POSIX threads version took much more effort than the OpenMP version.

16 Parallel Performance of PARCS

17 Applications
Matrix-Vector Multiplication: the "MatVec" subroutine of PARCS; the size of the matrix is the same as in the NEACRP benchmark. NEACRP Reactor Transient Benchmark: control rod ejection from the hot zero power condition; full 3-dimensional transient. The first application is a simple matrix-vector multiplication using a subroutine of parallel PARCS: PARCS has a matrix-vector multiplication subroutine for its Krylov solver. The second application is a realistic benchmark problem: the NEACRP benchmark, a control rod ejection transient of a reactor core. A sketch of a parallel matrix-vector product is shown below.

18 Specification of Machine
Platform: SUN ULTRA-80 | SGI ORIGIN 2000
Number of CPUs: 2 | 32
CPU Type: ULTRA SPARC II, 450 MHz | MIPS R10000, 250 MHz, 4-way superscalar
L1 Cache: 16 KB D-cache + 16 KB I-cache, cache line size 32 bytes | 32 KB D-cache + 32 KB I-cache, cache line size 32 bytes
L2 Cache: 4 MB | 4 MB per CPU, cache line size 128 bytes
Main Memory: 1 GB | 16 GB
Compiler: SUN Workshop 6 FORTRAN | MIPSpro Compiler 7.2.1 FORTRAN 90
These are the machine specifications of the SUN and SGI workstations. The SGI machine has more CPUs, larger caches, and larger main memory. A FORTRAN 90 compiler was used on both machines because PARCS uses F90 features intensively, for example F90 modules and dynamic memory allocation.

19 Specification of Machine
Platform: LINUX Machine
Number of CPUs: 4
CPU Type: Intel Pentium-III 550 MHz (Slot 2 technology, 100 MHz bus, non-blocking cache; see ftp://download.intel.com/design/PentiumIII/xeon/datashts/ pdf)
L1 Cache: 16 KB D-cache + 16 KB I-cache, cache line size: ? bytes
L2 Cache: 512 KB
Main Memory: 1 GB
Compiler: NAGWare FORTRAN 90 Version 4.2

20 Matrix-Vector Multiplication (MatVec Subroutine of PARCS)
Times in seconds *2) with speedups in parentheses *3); columns are the number of threads *1).
Machine | Serial | OpenMP 1 | OpenMP 2 | OpenMP 4 | OpenMP 8 | Pthreads 1 | Pthreads 2 | Pthreads 4 | Pthreads 8
SUN | 3.76 | 23.43 (0.16) | 13.26 (0.28) | - | - | 3.71 (1.02) | 1.93 (1.95) | - | -
SGI | 1.73 | 1.73 (1.00) | 0.92 (1.89) | 0.52 (3.30) *4) | 0.37 (4.72) | 1.72 (1.01) | 1.80 (0.96) | 1.91 (0.91) | 1.96 (0.88)
*1) Number of Threads  *2) Time (seconds)  *3) Speedup  *4) Core is Divided into 18 Planes
This is the first application, a simple matrix-vector multiplication. Pthreads shows a good speedup on the SUN machine with 2 threads; this machine has only two CPUs, so only the two-thread case was tested there. On the SGI machine, OpenMP shows a good speedup for two threads, comparable to the Pthreads performance on the SUN machine. Pthreads on the SGI machine shows no speedup for this application because all threads are scheduled onto the same CPU; occasionally threads are scheduled onto different CPUs, but that is too rare. The thread scheduling is uncontrollable as long as we stay within Pthreads; it is entirely under the control of the OS. We also tested another simple C program using POSIX threads on the SGI machine, and it showed a good speedup, so the POSIX thread scheduling problem does not appear to be a general one, but it is not clear why parallel PARCS suffers from it. Finally, the OpenMP result on the SUN machine is strange: even with just one thread, the execution time increases dramatically (from 3.76 s to 23.43 s). This phenomenon occurs only when dynamic memory allocation is used; if we use common blocks instead of modules, the speedup is reasonable.

21 Matrix-Vector Multiplication (Subroutine of PARCS)
SUN Serial Run Time: 3.76 s SGI Serial Run Time: 1.73 s

22 NEACRP Benchmark (Simulation with Multiple Threads)
This is the second application, a more realistic benchmark problem. PARCS uses domain decomposition, so the total number of iterations depends on the number of domains: in other words, the more domains, the more work PARCS does. First, we confirmed that the quality of the solution is the same regardless of the number of threads.

23 Parallel Performance (SUN)
Columns are Serial, Pthreads with 1 thread, and Pthreads with 2 threads *); speedups relative to the serial run are in parentheses.
Time (sec):
CMFD | 36.7 | 32.1 | 20.8 (1.77)
Nodal | 11.5 | 11.3 | 6.4 (1.78)
T/H | 29.6 | 27.9 | 14.5 (2.04)
Xsec | 7.6 | 7.1 | 3.7
Total | 85.4 | 78.5 | 45.5 (1.88)
# of Updates:
CMFD | 445 | 445 | 456
Nodal | 31 | 31 | 33
T/H | 216 | 216 | 216
Xsec | 225 | 225 | 226
*) Number of Threads
This table is for POSIX threads on the SUN machine. The number of iterations increased by about 2% with 2 domains; nevertheless, the parallel performance is good. The overall speedup is 1.88 with 2 threads, which is good performance. Regarding the speedup of each module: the T/H and XSEC modules show super-linear speedup, whereas the speedup of the CMFD and Nodal modules is lower.

24 Parallel Performance (SGI)
Columns are Serial and OpenMP with 1, 2, 4, and 8 threads *1); speedups relative to the serial run are in parentheses.
Time (sec):
CMFD | 19.8 | 19.3 | 12.1 (1.63) | 8.93 (2.21) | 8.85 (2.23)
Nodal | 9.0 | 9.2 | 5.8 (1.55) | 3.56 (2.53) | 2.87 (3.14)
T/H | 26.6 | 25.3 | 12.3 (2.17) | 8.92 (2.99) | 7.14 (3.73)
Xsec | 4.8 | 4.4 | 2.4 (2.01) | 1.37 (3.53) | 1.11 (4.35)
Total | 60.2 | 58.1 | 32.6 (1.85) | 22.8 (2.64 *2)) | 20.0 (3.02 *2))
# of Updates:
CMFD | 445 | 445 | 456 | 497 | 565
Nodal | 31 | 31 | 33 | 38 | 39
T/H | 216 | 216 | 216 | 216 | 217
Xsec | 225 | 225 | 226 | 228 | 227
*1) Number of Threads  *2) Core is divided into 18 planes
This is OpenMP on the SGI machine. The speedup with 2 threads is good, comparable to POSIX threads on the SUN machine. Regarding the speedup of each module: the T/H and XSEC modules show super-linear speedup, while the speedup of the CMFD and Nodal modules is lower. Why? The speedups with 4 and 8 threads are not as good, because the number of iterations increases and because the load is not balanced: the total number of planes in this benchmark is 18, and 18 divided by 4 or 8 is not an integer.

25 Cache Analysis

26 Typical Memory Access Cycles (SGI)
[Diagram: memory hierarchy CPU - L1 Cache - L2 Cache - Memory, annotated with access times.]
Memory Access Type | Cycles
L1 cache hit | 2
L1 cache miss satisfied by L2 cache hit | 8
L2 cache miss satisfied from memory | 75
This is the typical memory access time on the SGI Origin 2000 machine. There exists a large difference between an L2 cache hit and an L2 cache miss, so the L2 cache is important for good cache performance.

27 Cache Miss Measurements (SGI)
Module | Cache | Serial | OpenMP 1 *1) | 2 | 4 | 8
CMFD (BICG) | L1 | 477,691 | 479,474 | 258,027 | 156,461 | 105,733
CMFD (BICG) | L2 | 28,242 | 29,650 | 17,007 | 11,751 | 9,309
Nodal | L1 | 857,744 | 853,866 | 444,849 | 249,507 | 160,699
Nodal | L2 | 54,163 | 55,534 | 33,846 | 19,016 | 12,848
T/H (TRTH) | L1 | 165,133 | 60,587 | 39,419 | 25,850 | 19,816
T/H (TRTH) | L2 | 9,551 | 9,512 | 9,673 | 6,451 | 4,620
XSEC | L1 | 62,324 | 57,462 | 29,845 | 17,715 | 11,344
XSEC | L2 | 9,456 | 9,518 | 5,517 | 3,737 | 2,578
*1) Number of Threads
The L1 cache misses and L2 cache misses were measured for each module on the SGI machine using the hardware counters.

28 Cache Miss & Speedup of XSEC Module (SGI)
If we look at the relation between cache misses and speedup, as in this figure, we can see a strong correlation between the two factors.

29 Cache Miss Ratio (SGI)
Cache Miss Ratio = (cache misses of the serial execution) / (cache misses of the parallel execution)
Module | Cache | Serial | OpenMP 1 *1) | 2 | 4 | 8
CMFD (BICG) | L1 | 1.00 | 1.00 | 1.85 | 3.05 | 4.52
CMFD (BICG) | L2 | 1.00 | 0.95 | 1.66 | 2.40 | 3.03
Nodal | L1 | 1.00 | 1.00 | 1.93 | 3.44 | 5.34
Nodal | L2 | 1.00 | 0.98 | 1.60 | 2.85 | 4.22
T/H (TRTH) | L1 | 1.00 | 2.73 | 4.19 | 6.39 | 8.33
T/H (TRTH) | L2 | 1.00 | 1.00 | 0.99 | 1.48 | 2.07
XSEC | L1 | 1.00 | 1.08 | 2.09 | 3.52 | 5.49
XSEC | L2 | 1.00 | 0.99 | 1.71 | 2.53 | 3.67
*1) Number of Threads
The cache miss ratio is defined as the cache misses of the serial execution divided by the cache misses of the parallel execution. The L2 cache miss ratio of each module for 2 threads is very close to the measured speedup, except for the T/H module; for the T/H module the L1 cache is also an important factor.

30 Speedup Estimation Using Cache Misses
Speedup = T_1 / T_2, where T_1 is the total data access time for the serial execution and T_2 is the total data access time for the 2-thread execution.
Data access time: T = T_L2 + T_mem = N_1 * t_L2 + N_2 * t_mem, where T_L2 is the total L2 cache access time, T_mem is the total main memory access time, N_1 is the number of L1 data cache misses satisfied by an L2 cache hit, N_2 is the number of L2 data cache misses satisfied from main memory, t_L2 is the L2 cache access time for one word, and t_mem is the main memory access time for one word.
We developed this simple model to estimate the speedup based on the cache misses: the total serial data access time divided by the 2-thread data access time is the estimated speedup, and each data access time is estimated from the measured cache misses.
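As a worked check of this model (illustrative code, not from the original work), one can plug the serial and 2-thread cache-miss counts from the earlier measurement table and the SGI access times (8 cycles for an L2 hit, 75 cycles for a memory access) into the formula. Counting the L1 misses satisfied by L2 as (L1 misses - L2 misses) is an interpretation of that table; with it, the model reproduces predicted speedups of about 1.78, 1.80, 2.04, and 1.86 for CMFD, Nodal, T/H, and XSEC, matching the next slide.

! Worked example of the speedup-estimation model (illustrative, not from the
! paper).  Miss counts are the serial and 2-thread values measured on the SGI;
! t_l2 = 8 and t_mem = 75 cycles come from the memory-access-cycles slide.
program speedup_model
   implicit none
   real, parameter :: t_l2 = 8.0, t_mem = 75.0
   character(len=5), dimension(4) :: name   = ['CMFD ', 'Nodal', 'T/H  ', 'XSEC ']
   real,             dimension(4) :: l1_ser = [477691., 857744., 165133., 62324.]
   real,             dimension(4) :: l2_ser = [ 28242.,  54163.,   9551.,  9456.]
   real,             dimension(4) :: l1_par = [258027., 444849.,  39419., 29845.]
   real,             dimension(4) :: l2_par = [ 17007.,  33846.,   9673.,  5517.]
   real    :: t1, t2
   integer :: i
   do i = 1, 4
      t1 = (l1_ser(i) - l2_ser(i))*t_l2 + l2_ser(i)*t_mem   ! serial access time
      t2 = (l1_par(i) - l2_par(i))*t_l2 + l2_par(i)*t_mem   ! 2-thread access time
      print '(a,a,f5.2)', name(i), ' predicted speedup: ', t1/t2
   end do
end program speedup_model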

31 Estimated 2-thread Speedup Based on Data Cache Misses for OpenMP on SGI
Module | Measured Speedup | Predicted Speedup
CMFD (BICG) | 1.63 | 1.78
Nodal | 1.55 | 1.80
T/H (TRTH) | 2.17 | 2.04
XSEC | 2.01 | 1.86
The predicted speedup agrees well with the measured speedup. This result means that memory access time is a dominant factor for performance and that the memory access time can be reasonably estimated from the cache misses.

32 Conclusions

33 Conclusions
Comparison of OpenMP and POSIX Threads: OpenMP is comparable to POSIX Threads in terms of parallel performance, and OpenMP is much easier to implement than POSIX Threads due to its directive-based nature.
Cache Analysis: The prediction of speedup based on data cache misses agrees well with the measured speedup.
The parallel performance of OpenMP is comparable to POSIX threads, and OpenMP can be implemented more easily than POSIX threads, so OpenMP is a preferred choice for parallel PARCS. The speedup predicted from the cache behavior agrees well with the measured speedup, which means that a modern CPU is so fast that memory access is the bottleneck: in other words, it is a more dominant factor for performance than CPU speed.

34 Continuing Work
Algorithmic: 3-D Domain Decomposition. Software: SUN Compiler; Pthreads Scheduling on SGI; Alternate Platforms.
Continuing work includes 3-dimensional domain decomposition for load balancing. The behavior of the SUN FORTRAN compiler with OpenMP is still an open question and needs more investigation, as is the POSIX threads scheduling on the SGI machine. We are also going to test parallel PARCS on different platforms, such as a DEC ALPHA machine and a Linux PC.

