Large-Scale Fluid Analysis with P-Flow, JAXA's In-House MPS-Method Program

Large-Scale Fluid Analysis with P-Flow, JAXA's In-House MPS-Method Program
EX18305 (project recommended by the Information Technology Center, The University of Tokyo)
宮島 敬明 (Takaaki Miyajima), RIKEN Center for Computational Science

I. MPS method and our in-house code: P-Flow

Moving Particle Semi-implicit (MPS) method
The MPS method is a particle method used for computational fluid dynamics. It was originally developed to simulate incompressible-flow phenomena such as fragmentation. The target fluid or object is divided into particles, and each particle interacts with its neighbor particles. The search for neighbor particles is the main bottleneck of the MPS method. We are researching and developing the in-house MPS program P-Flow.
[Figure: MPS simulation of the collapse of a water column]

Complexity of the data structure and memory access pattern
P-Flow adopts a complex data structure for scalability and extensibility. It is written in Fortran and uses multiple derived types.
- For bucket management (AoS): bktlist_array holds integer :: num (4 B), type(bktlist), dimension(:), allocatable :: array, and integer, dimension(2,3) :: minmax (24 B). Each bktlist (312 B) holds integer :: ib, integer :: rank, integer, dimension(2,3) :: next, and type(bucket) :: dat (280 B). A bucket holds type(bucket_particles) :: pc (8 B), type(bucket_wall) :: wall (16 B), and type(bucket_aero) :: aero(8) (256 B). bucket_particles carries integer :: Id_lc; bucket_aero carries fp :: cg(3) and fp :: rhog; bucket_wall carries integer :: near, integer :: face_id, and fp :: dist.
- For physical values (SoA): allocatable fp arrays p, c, x, and nden, addressed through an index calculation (Id_lc + num).
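Below is a minimal Fortran sketch of how the derived types and SoA arrays just listed might be declared. It is reconstructed only from the member names and byte sizes given on the poster; the kind parameter fp, the component ordering, the module name, and every interpretive comment are assumptions, not P-Flow's actual source.

module pflow_types_sketch
  implicit none
  integer, parameter :: fp = kind(1.0d0)      ! "fp" precision, assumed to be double

  type bucket_particles                       ! poster lists ~8 B; further components may be elided
    integer :: Id_lc                          ! local particle index for the bucket
  end type bucket_particles

  type bucket_wall                            ! ~16 B
    integer  :: near
    integer  :: face_id
    real(fp) :: dist
  end type bucket_wall

  type bucket_aero                            ! ~32 B each; aero(8) gives the 256 B on the poster
    real(fp) :: cg(3)
    real(fp) :: rhog
  end type bucket_aero

  type bucket                                 ! ~280 B = 8 + 16 + 256
    type(bucket_particles) :: pc
    type(bucket_wall)      :: wall
    type(bucket_aero)      :: aero(8)
  end type bucket

  type bktlist                                ! ~312 B = 4 + 4 + 24 + 280
    integer      :: ib
    integer      :: rank
    integer      :: next(2,3)                 ! assumed: links to adjacent buckets
    type(bucket) :: dat
  end type bktlist

  type bktlist_array
    integer                    :: num         ! 4 B
    type(bktlist), allocatable :: array(:)
    integer                    :: minmax(2,3) ! 24 B
  end type bktlist_array

  ! SoA arrays for per-particle physical values, accessed via (Id_lc + num)
  real(fp), allocatable :: p(:), c(:), x(:), nden(:)
end module pflow_types_sketch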
II. Search for neighbor particles

MPS performs the neighbor-particle search multiple times in each time step (e.g. for the density calculation). Particles drift as the time steps progress, so each particle's neighbors keep changing. The particle number density follows the standard MPS form n_i = Σ_{j ≠ i} w(|r_j − r_i|), where w is the weight function.

Search for neighbor particles with buckets
The following four-nested-loop procedure is used:
1. Choose a target bucket.
2. Pick a target particle (red in the figure) from the bucket.
3. Traverse the 3x3x3 adjacent buckets (in no particular order).
4. Search the particles in each traversed bucket:
   4-1. Calculate the distance and weight between the target particle and the found particle.
   4-2. Accumulate the weighted physical value onto the target particle (in no particular order).
The maximum number of particles in a bucket is 33.
[Figure: chosen bucket, its traversed adjacent buckets, and numbered particles; "N" denotes the N-th particle]

Memory access pattern and characteristics
- Indefinite loop: the number of particles in a bucket is not known in advance.
- Vectorization: each target particle accesses different buckets and particles.
- Cache: hard to exploit, because the adjacent particles change from step to step.
- Parallelism: thousands of in-flight data requests can hide the memory access latency.
The loop structure is: loop-1 chooses a bucket; loop-2 runs over the particles in the chosen bucket; loop-3 traverses the adjacent buckets; loop-4 runs over the particles in the traversed bucket; loop-4-1/2 calculate the physical quantity. The loops touch the bucket structure array, the particle index array, the particle position array, and the particle quantity array through indirect accesses, indefinite loop bounds, and accumulation.

III. Porting and optimization strategy

To keep maintenance easy and preserve the structure of the original code, OpenACC is used for the GPU and OpenMP for the CPU. OpenACC works well with the complex data structure. (A directive-level sketch of the loop nest is given after Section IV.)

P100: bucket per thread block, particle per CUDA thread
Each target bucket (loop-1) is assigned to a thread block, and each particle in that bucket (loop-2) is assigned to a CUDA thread; loops 3 and 4 are processed sequentially by each CUDA thread. The memory accesses of the threads in a block become identical, because every particle in the same bucket accesses the same particles in loops 3 and 4, but they are not coalesced and bandwidth utilization is low. CUDA-thread utilization is also low, because the number of particles in a bucket is often less than the 32 threads of a block (figure: the particles of one bucket mapped onto 32 threads per block for steps 3 and 4).

KNL and Skylake: bucket per logical thread
Each target bucket (loop-1) is assigned to a logical thread, and loops 2-4 are processed sequentially by that thread. OpenMP's schedule(dynamic) clause is used for load balancing, because the number of particles per bucket differs. The memory accesses of different threads are unrelated, so cache utilization is very low, and the loops are not vectorized either; a different algorithm (e.g. the Verlet list used in molecular dynamics) would be required to solve these problems. KNL and Skylake achieved higher performance with Hyper-Threading enabled. (Figure: one bucket per logical thread, with the particles of the bucket processed by that thread.)

IV. Performance on P100, KNL and Skylake

                       Single prec.   Clock (Boost)   Number of threads        Mem BW
                       [TFLOPS]       [GHz]                                    [GB/s]
  Single KNL 7210      5.3            1.3 (1.5)       64 * 4HT = 256           480
  Two Xeon Gold 6150   N/A            2.7 (3.7)       18 * 2HT * 2 CPUs = 72   256
  Two P100 (PCIe)      9.3            1.1 (1.3)       3,584 * 2 GPUs = 7,168   732
  Two P100 (NVLink)    10.6           1.3 (1.4)       3,584 * 2 GPUs = 7,168   732

Data sets
Collapse of a water column (40 cm x 40 cm x 8 cm); number of particles: 224,910; number of buckets: 70 x 70 x 14. The reported time is the average over 200 time steps of the particle-density computation.

Compilers and configurations
- P100: PGI Compiler 17.7 with "-ta=nvidia:cc60,cuda8.0,fastmath"
- KNL 7210 (Flat + Quadrant, 4 HT) and Xeon Gold 6150 (2 HT): Intel Compiler 2017 with "-qopenmp -O3 -xCOMMON-AVX512 -fp-model fast=2"

Results of the neighbor-particle search
[Chart: average time of the neighbor-particle search per configuration; the two-KNL value is a projection from the single-KNL measurement of 104.2 ms]
P100 is about 4.5x faster than Skylake, and performance differs when the host processor is different.
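To make the Section III mappings concrete, here is a minimal, self-contained Fortran sketch of the four-nested neighbor-search loop for the particle number density. Both directive families are written on the same loop nest for comparison; in practice only one set would be enabled per build (OpenACC for the GPU, OpenMP for the CPU, as above). The routine name, the flattened bucket bookkeeping arrays (bucket_start, bucket_count, neighbor_bucket, particle_id), and the standard MPS weight w(r) = re/r - 1 are illustrative assumptions, not P-Flow's actual code or data layout.

! Sketch of the 4-nested neighbor-search loop (Section II) with the
! Section III thread mappings. Array and routine names are illustrative.
subroutine number_density_sketch(nbkt, bucket_start, bucket_count, &
                                 neighbor_bucket, particle_id, x, y, z, nden, re)
  implicit none
  integer, parameter :: fp = kind(1.0d0)           ! "fp" precision, assumed
  integer,  intent(in)    :: nbkt                  ! number of buckets
  integer,  intent(in)    :: bucket_start(:)       ! first slot of each bucket in particle_id
  integer,  intent(in)    :: bucket_count(:)       ! particles per bucket (indefinite, <= 33)
  integer,  intent(in)    :: neighbor_bucket(:,:)  ! (27, nbkt): ids of the 3x3x3 adjacent buckets
  integer,  intent(in)    :: particle_id(:)        ! bucket-ordered particle indices
  real(fp), intent(in)    :: x(:), y(:), z(:)      ! particle positions (SoA)
  real(fp), intent(inout) :: nden(:)               ! particle number density
  real(fp), intent(in)    :: re                    ! effective radius of the weight function
  integer  :: ib, ip, jb, jp, i, j, nb
  real(fp) :: r, acc

  ! P100 mapping: bucket -> thread block (gang), particle in bucket -> CUDA thread (vector)
  !$acc parallel loop gang copyin(bucket_start, bucket_count, neighbor_bucket, &
  !$acc&                          particle_id, x, y, z) copy(nden)
  ! KNL/Skylake mapping: bucket -> logical thread, dynamic scheduling for load balance
  !$omp parallel do schedule(dynamic) private(ip, i, jb, nb, jp, j, r, acc)
  do ib = 1, nbkt                                          ! loop-1: choose a target bucket
    !$acc loop vector private(i, jb, nb, jp, j, r, acc)
    do ip = 1, bucket_count(ib)                            ! loop-2: particles in the chosen bucket
      i   = particle_id(bucket_start(ib) + ip - 1)
      acc = 0.0_fp
      do jb = 1, 27                                        ! loop-3: traverse 3x3x3 adjacent buckets
        nb = neighbor_bucket(jb, ib)
        do jp = 1, bucket_count(nb)                        ! loop-4: particles in the traversed bucket
          j = particle_id(bucket_start(nb) + jp - 1)
          if (j == i) cycle
          r = sqrt((x(j)-x(i))**2 + (y(j)-y(i))**2 + (z(j)-z(i))**2)  ! 4-1: distance
          if (r > 0.0_fp .and. r < re) acc = acc + (re/r - 1.0_fp)    ! 4-2: accumulate weight
        end do
      end do
      nden(i) = acc                                        ! loops 3 and 4 run sequentially per thread
    end do
  end do
end subroutine number_density_sketch

Assigning loop-1 to gangs and loop-2 to vector lanes reproduces the bucket-per-thread-block, particle-per-CUDA-thread mapping described above; building without OpenACC leaves the schedule(dynamic) clause, which gives the bucket-per-logical-thread mapping used on KNL and Skylake.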