A FAST ULTRASONIC SIMULATION TOOL BASED ON MASSIVELY PARALLEL IMPLEMENTATIONS Jason LAMBERT 1, Gilles ROUGERON 1, Lionel LACASSAGNE 2, and Sylvain CHATILLON.

Slides:



Advertisements
Similar presentations
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
Advertisements

1 A GPU Accelerated Storage System NetSysLab The University of British Columbia Abdullah Gharaibeh with: Samer Al-Kiswany Sathish Gopalakrishnan Matei.
EPOCH 1000 Series Procedure Phased Array DGS/AVG
The Path to Multi-core Tools Paul Petersen. Multi-coreToolsThePathTo 2 Outline Motivation Where are we now What is easy to do next What is missing.
2009/04/07 Yun-Yang Ma.  Overview  What is CUDA ◦ Architecture ◦ Programming Model ◦ Memory Model  H.264 Motion Estimation on CUDA ◦ Method ◦ Experimental.
Team Members: Tyler Drake Robert Wrisley Kyle Von Koepping Justin Walsh Faculty Advisors: Computer Science – Prof. Sanjay Rajopadhye Electrical & Computer.
DCABES 2009 China University Of Geosciences 1 The Parallel Models of Coronal Polarization Brightness Calculation Jiang Wenqian.
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
Network coding on the GPU Péter Vingelmann Supervisor: Frank H.P. Fitzek.
Parallelization and CUDA libraries Lei Zhou, Yafeng Yin, Hong Man.
Using the Trigger Test Stand at CDF for Benchmarking CPU (and eventually GPU) Performance Wesley Ketchum (University of Chicago)
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU Presented by: Ahmad Lashgar ECE Department, University of Tehran.
To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering,
Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.
ADVANCED SIMULATION OF ULTRASONIC INSPECTION OF WELDS USING DYNAMIC RAY TRACING Audrey GARDAHAUT (1), Karim JEZZINE (1), Didier CASSEREAU (2), Nicolas.
Motivation “Every three minutes a woman is diagnosed with Breast cancer” (American Cancer Society, “Detailed Guide: Breast Cancer,” 2006) Explore the use.
An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.
Training Program on GPU Programming with CUDA 31 st July, 7 th Aug, 14 th Aug 2011 CUDA Teaching UoM.
COLLABORATIVE EXECUTION ENVIRONMENT FOR HETEROGENEOUS PARALLEL SYSTEMS Aleksandar Ili´c, Leonel Sousa 2010 IEEE International Symposium on Parallel & Distributed.
7th Workshop on Fusion Data Processing Validation and Analysis Integration of GPU Technologies in EPICs for Real Time Data Preprocessing Applications J.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Implementation of Parallel Processing Techniques on Graphical Processing Units Brad Baker, Wayne Haney, Dr. Charles Choi.
General Purpose Computing on Graphics Processing Units: Optimization Strategy Henry Au Space and Naval Warfare Center Pacific 09/12/12.
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
Offline Coordinators  CMSSW_7_1_0 release: 17 June 2014  Usage:  Generation and Simulation samples for run 2 startup  Limited digitization and reconstruction.
Taking the Complexity out of Cluster Computing Vendor Update HPC User Forum Arend Dittmer Director Product Management HPC April,
Evaluating FERMI features for Data Mining Applications Masters Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal.
Tracking with CACTuS on Jetson Running a Bayesian multi object tracker on a low power, embedded system School of Information Technology & Mathematical.
Tracking with CACTuS on Jetson Running a Bayesian multi object tracker on an embedded system School of Information Technology & Mathematical Sciences September.
GPU Architecture and Programming
Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.
Parallelization and Characterization of Pattern Matching using GPUs Author: Giorgos Vasiliadis 、 Michalis Polychronakis 、 Sotiris Ioannidis Publisher:
Experiences Accelerating MATLAB Systems Biology Applications Heart Wall Tracking Lukasz Szafaryn, Kevin Skadron University of Virginia.
JPEG-GPU: A GPGPU IMPLEMENTATION OF JPEG CORE CODING SYSTEMS Ang Li University of Wisconsin-Madison.
GPU-Accelerated Computing and Case-Based Reasoning Yanzhi Ren, Jiadi Yu, Yingying Chen Department of Electrical and Computer Engineering, Stevens Institute.
Multi-Core Development Kyle Anderson. Overview History Pollack’s Law Moore’s Law CPU GPU OpenCL CUDA Parallelism.
Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.
Adam Wagner Kevin Forbes. Motivation  Take advantage of GPU architecture for highly parallel data-intensive application  Enhance image segmentation.
Parallelization of likelihood functions for data analysis Alfio Lazzaro CERN openlab Forum on Concurrent Programming Models and Frameworks.
QCAdesigner – CUDA HPPS project
Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations Vignesh Ravi, Wenjing Ma, David Chiu.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
CUDA Basics. Overview What is CUDA? Data Parallelism Host-Device model Thread execution Matrix-multiplication.
Debunking the 100X GPU vs. CPU Myth An Evaluation of Throughput Computing on CPU and GPU Present by Chunyi Victor W Lee, Changkyu Kim, Jatin Chhugani,
Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs Allen D. Malony, Scott Biersdorff, Sameer Shende, Heike Jagode†, Stanimire.
Co-Processor Architectures Fermi vs. Knights Ferry Roger Goff Dell Senior Global CERN/LHC Technologist |
Martin Kruliš by Martin Kruliš (v1.0)1.
3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.
My Coordinates Office EM G.27 contact time:
GPU Computing for GIS James Mower Department of Geography and Planning University at Albany.
Scaling up R computation with high performance computing resources.
S. Pardi Frascati, 2012 March GPGPU Evaluation – First experiences in Napoli Silvio Pardi.
Chalmers University of Technology Advanced NDT Mathematical modelling of the ultrasonic phased array technique within the project Quantification of the.
Use of Ultrasonic Phased Arrays for Examination of Austenitic Steel Welds Santanu Saha Technical Manager, Non-Destructive Testing Intertek INSPEC.
Accelerating K-Means Clustering with Parallel Implementations and GPU Computing Janki Bhimani Miriam Leeser Ningfang Mi
Heterogeneous Processing KYLE ADAMSKI. Overview What is heterogeneous processing? Why it is necessary Issues with heterogeneity CPU’s vs. GPU’s Heterogeneous.
Introduction to threads
Productive Performance Tools for Heterogeneous Parallel Computing
Generalized and Hybrid Fast-ICA Implementation using GPU
Parallel Plasma Equilibrium Reconstruction Using GPU
CS427 Multicore Architecture and Parallel Computing
Genomic Data Clustering on FPGAs for Compression
Dynamic Parallelism Martin Kruliš by Martin Kruliš (v1.0)
Map-Scan Node Accelerator for Big-Data
Ray-Cast Rendering in VTK-m
A Cloud System for Machine Learning Exploiting a Parallel Array DBMS
Faster File matching using GPGPU’s Deephan Mohan Professor: Dr
Multithreaded Programming
6- General Purpose GPU Programming
Presentation transcript:

A FAST ULTRASONIC SIMULATION TOOL BASED ON MASSIVELY PARALLEL IMPLEMENTATIONS Jason LAMBERT 1, Gilles ROUGERON 1, Lionel LACASSAGNE 2, and Sylvain CHATILLON 1. 1 CEA, LIST, F Gif-sur-Yvette, France 2 LRI, Université Paris-Sud, F Orsay cedex, France QNDE SESSION 43 | JULY 26, 2013 | PAGE 1

SOMMAIRE Context and objectives Description of the UT simulation tools Beam computation Defect diffraction Algorithm of a simulation Description of CPU-multicore and GPU implementations CPU and GPU specificities Model validation with CIVA 10.1 Performances results QNDE SESSION 43 | JULY 26, 2013 | PAGE 2

QNDE SESSION 43 | JULY 26, 2013 | PAGE 3 CONTEXT AND OBJECTIVES Objectives : fast UT inspection simulation Inspection designing tools (probe especially phased array, scanning, …) POD curves computation, Data Base computation, inversion procedure Means : exploitation of CPU-multicores and GPU capabilities Intensively parallel computation (beam propagation and diffraction with defect) New model based on analytical solution (beam propagation) Use of high performances signal processing libraries (Intel MKL/NVidia CUFFT) Limitations : canonical configurations Simple geometries described by planar surfaces (planar, 2D CAD) Homogeneous and isotropic materials Planar defects (Kirchhoff approximation) Half-skip

Optimization for isotropic and homogeneous planar component Superposition of elementary contributions over the probe t h(M,t) Computation of the impulse response UT BEAM SIMULATION USING THE PARAXIAL RAY TRACING METHOD Observation point Source point 1 - Analytical expression for ray paths computation (direct and after backwall reflection with mode conversion) Regular Probe discretization Ray path optimization (no ray tracing) 2 - Analytical expression of the amplitude of the contribution (no matrix computation) Divergence factor Reflection and refraction coefficients Extraction of beam characteristics for echo simulation Amplitude and time of flight Direction of propagation and polarization QNDE SESSION 43 | JULY 26, 2013 | PAGE 4

QNDE SESSIO N 43 | JULY 26, 2013 M Generic formulation : Generic formulation : Transmission Beam / Flaw scattering / Reception beam (by Auld’s reciprocity ) over the flaw discretization Incident beam Amplitude : q E (M) Time of flight : T E Incident direction incident polarization Amplitude : q R (M) Time of Flight : T R Observation direction Observation polarization « Received » Beam  M q R (M)q E (M){K(M,t).e i  .e i  R }.S(t-T E -T R )} S ER (t)= Flaw meshing Kirchhoff diffraction coefficient, valid near specular reflection K(M,t) PRINCIPLE OF UT DIFFRACTION SIMULATION Synthesis of the received signal using Auld’s reciprocity principle | PAGE 5

ALGORITHM OF A DIFFRACTION SIMULATION QNDE SESSION 43 | JULY 26, 2013 | PAGE 6

OVERVIEW OF CPU AND GPU PLATFORM Multicore CPU Small number of cores (10s) Cores can work on different tasks (task parallelism) Core can handle complex task GPU Numerous smaller cores (100s) A same task executed on multiple data by multiple cores ( data parallelism ) Memory access shared across cores Coprocessor : separate memory from host CPU QNDE SESSION 43 | JULY 26, 2013 | PAGE 7

CPU IMPLEMENTATION Algorithm optimized for coarse grained parallelism High level loops are multithreaded with OpenMP API at each step Memory quite abundant on CPU so results are reused (field impulse responses) Signal processing use Intel Math Kernel Library (MKL) Need to evaluate the time gate of signals to initialize only once FFT operations Calls to FFT are made by threads when needed, without any synchronization QNDE SESSION 43 | JULY 26, 2013 | PAGE 8

GPU CONSIDERATIONS GPU are aimed at data parallel computations Analytical and specialized approach over genericity Control flow optimized for executing the same task over multiple data, in subroutines called “kernels” Benefits from GPU efficient collaboration of nearby threads (shared memory) cuFFT library optimized for batches of multiple signals GPU are coprocessor driven by their host CPU Memory allocation should be done prior to executing GPU kernels Memory management, kernel launches and cuFFT calls are made asynchronously by the host GPU have less device memory than CPU have commonly QNDE SESSION 43 | JULY 26, 2013 | PAGE 9

GPU KERNEL REALIZATION Signal size determination Rely previous work for analytical ray path determination for time of flight computation done for the Total Focusing Method [1] Size is sent back to host to allocate memory on the GPU UT Field computation – summation of probe elementary contributions Contributions are recomputed here, cheaper than storing and retrieving from memory Signals resulting from the summation too big to fit in GPU memory → work divided along transducer positions and temporary data will be overwritten Response computation by Kirchhoff Model Computed once on the whole dataset (every positions & delay law) Synthetic inputs ( direction, amplitude & time of flight) will fit in the GPU memory Allows cuFFT to work on a batch of signals for the last convolution Size of resulting echo signal given by the user in input data QNDE SESSION 43 | JULY 26, 2013 | PAGE 10 [1] Implementation of a GPU Accelerated Total Focusing Reconstruction Method within Civa Software ---Gilles Rougeron, Jason Lambert, and Ekaterina Iakovleva, Lionel Lacassagne, QNDE 2013, Session 28

PARALLEL ALGORITHMS OVERVIEW QNDE SESSION 43 | JULY 26, 2013 | PAGE 11

Ø3mm FBH Serie n°1 Serie n°2 Serie n°1 Serie n°2 Serie n°1 FBHs Ø3mm 45° Front view MODEL VALIDATION ON QNDE BENCHMARK 2013 Flat bottom holes : Planar surface, L direct echo, homogeneous isotropic material (steel) Series1 of 10 Ø3mm FBH 1 out of 7 delay law : delay law L45° + 25mm depth focusing Reference : CIVA, validated against the benchmark QNDE SESSION 43 | JULY 26, 2013 | PAGE 12 0,6dB 0,2dB 0,7dB Time of flight OK CPU GPU Z= mm CIVA

Flat bottom holes : Planar surface, L direct echo, homogeneous isotropic material (steel) Series2 of 6 Ø3mm FBH 1 out of 7 delay law : delay law L45° + 35mm depth focusing Ø3mm FBH Serie n°1 Serie n°2 Serie n°1 Serie n°2 Serie n°1 FBHs Ø3mm 45° Front view MODEL VALIDATION ON QNDE BENCHMARK 2013 QNDE SESSION 43 | JULY 26, 2013 | PAGE 13 0,8dB 0,6dB Time of flight OK Z= mm CPU GPU CIVA

MODEL VALIDATION ON QNDE BENCHMARK 2013 Planar defects (backwall breaking notches) Planar surface, corner echo, homogeneous isotropic material (steel) Current implementation : L only → only the biggest notch (L mode distinct from others) 1 out of 4 delay law : delay law L45° + 30mm depth focusing QNDE SESSION 43 | JULY 26, 2013 | PAGE 14 GPU 0.9dB CPU 0.6dB CPU GPU CIVA Time of flight OK

DATASETS BREAKDOWN, QNDE SESSION 43 | JULY 26, 2013 | PAGE 15 DatasetFBH - Series 1FBH – Series 2Notch # of defects10 FBH6 FDH1 notch Probe64 channels ↔ 128 samples # of positions # delay laws74 # UT field sampling # defects sampling ModeL direct Half skip – L

PERFORMANCES – MULTICORE CPU VERSION Hardware configuration 2x CPU Intel Xeon each + Hyperthreading (2x12 logical cores) 24 GB of memory CPU implementation specificities Use optimized Intel Math Kernel Library for FFT computations Each step is parallelized with OpenMP API Simple precision computations (IEEE 754) Compiler : Intel® C++ Composer XE 2013 Update 5 Integration for Microsoft Visual Studio 2010, Version This machine is also the one running CIVA 11.0 QNDE SESSION 43 | JULY 26, 2013 | PAGE 16

PERFORMANCES – MULTICORE CPU DatasetFBH - Series 1FBH – Series 2Notch Fast model on CPU7,6 s2 s7,3 s CIVA s142 s363 s Gain / CIVA 11.0x31,7x71x49,7 QNDE SESSION 43 | JULY 26, 2013 | PAGE 17 Signal processing operations remain the most costly, especially for one echo mode

PERFORMANCES – GPU Hardware configuration NVidia GeForce GTX580 (high end, consumer grade GPU) 512 CUDA each Fermi architecture 1,5 GB of device memory GPU implementation specificities CUDA toolkit version 5.0 on Windows Simple precision computations (IEEE 754) Execution time do not include GPU initialization nor the transfer of the result from the device back to the host (at most 200ms on the studied cases) QNDE SESSION 43 | JULY 26, 2013 | PAGE 18

PERFORMANCES – GPU DatasetFBH - Series 1FBH – Series 2Notche GPU version5,8 s1,7 s3,6 s CPU version7,6 s2 s7,3 s GainX1,3X1,17x2 QNDE SESSION 43 | JULY 26, 2013 | PAGE 19 Signal processing is costly : -FFT over big memory chunks -Summation require Atomic Operation Signal processing operations remain the most costly, especially for one echo mode

CONCLUSIONS A fast UT inspection simulation tool has been developed. Implementation on CPU and GPU validated on QNDE benchmark 2013 (comparison with CIVA) Acceleration ranging on real data CPU (12 cores + hyperthreading) : from x31 to x70 over CIVA 11 GPU (512 cores) : from x1.17 to x 2.67 over CPU (12 core + HT) from x41 to x 101 over CIVA 11 QNDE SESSION 43 | JULY 26, 2013 | PAGE 20

PERSPECTIVES QNDE SESSION 43 | JULY 26, 2013 | PAGE 21

QNDE SESSION 43 | JULY 26, 2013 | PAGE 22 Thank you for your attention Partially supported by ANR (OPARUS Projet)