Presentation transcript:

Towards the Implementation of Wind Turbine Simulations on Many-Core Systems
I. E. Venetis¹, N. Nikoloutsakos¹, E. Gallopoulos¹, John Ekaterinaris²
¹ University of Patras, Greece   ² Embry-Riddle Aeronautical University, FL, USA

Introduction
Many systems are modelled by PDEs. To simulate them on a computer:
Discretization of the underlying PDEs, here with the Finite Element Method (FEM)
Construct a system of linear or non-linear equations
Solve the system of equations
This is typically very time consuming, hence the use of HPC systems.

Target
Accelerate fluid-structure interaction (FSI) simulations of next-generation wind turbine blades.
FSI application by J. A. Ekaterinaris.
Use GPU computing power to reduce execution time.

Typical FEM Workflow
1. Discretization of the application domain by applying a grid of elements
2. Numerical integration, which includes the calculation of the local stiffness matrix (LSM) of each element
3. Matrix assembly, constructing the global stiffness matrix from the local matrices (a rough sketch of this step is shown below)
4. Repeat: solve the system of linear equations described by the large, sparse matrix computed in the previous steps
In this workflow only the solver step is repeated, so improvements in calculating the LSMs have little impact on the overall execution time.
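As an illustration of the assembly step, a minimal host-side sketch is given below. It assumes a dense global matrix and a per-element table mapping local to global degrees of freedom; the data layout and the names (Element, assemble_global, NDOF) are illustrative assumptions, not the code used in this work. A real FEM code would use sparse storage for the global matrix.

#include <vector>

struct Element {
    static const int NDOF = 24;          // local degrees of freedom (assumed)
    int    gdof[NDOF];                   // local-to-global DOF map
    double lsm[NDOF][NDOF];              // local stiffness matrix
};

// Scatter-add every local stiffness matrix into the (dense) global matrix.
void assemble_global(const std::vector<Element>& elements,
                     std::vector<double>& K, int ndof_global)
{
    K.assign((size_t)ndof_global * ndof_global, 0.0);
    for (const Element& e : elements)
        for (int i = 0; i < Element::NDOF; ++i)
            for (int j = 0; j < Element::NDOF; ++j)
                K[(size_t)e.gdof[i] * ndof_global + e.gdof[j]] += e.lsm[i][j];
}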

Wind Turbine Simulation Application
Next-generation wind turbines are large. The blowing wind applies forces that:
cause deformation of the blades that can no longer be ignored;
cause movement of the turbine that can no longer be ignored.
As a result, parts of the turbine no longer correspond to the elements of the original discretization and the simulation results are not accurate.
Solution: the local stiffness matrix of each element has to be recalculated after each time step.

Workflow in Wind Turbine Simulation Application
1. Discretization of the application domain by applying a grid of elements
2. Repeat at every time step:
   a. Numerical integration, which includes the calculation of the local stiffness matrix (LSM) of each element
   b. Matrix assembly, constructing the global stiffness matrix from the local matrices
   c. Solve the system of linear equations described by the large, sparse matrix computed in the previous step
Since LSM construction now lies inside the repeated loop, accelerating it is worth the effort.

Recent activity

GPUs as accelerators
GPUs have evolved into extremely flexible and powerful processors.
Contemporary GPUs provide large numbers of cores (2880 cores on an NVIDIA Tesla K40) and a high throughput-to-cost ratio.
NVIDIA GPUs are programmable using CUDA, a set of extensions to industry-standard programming languages.
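As a reminder of what these extensions look like, a minimal, self-contained CUDA C example (not taken from this work) that scales an array on the GPU:

#include <cuda_runtime.h>

// Kernel: executed by many threads in parallel on the GPU.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    // ... fill d_x with data (e.g. cudaMemcpy from the host) ...
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);   // launch a grid of 256-thread blocks
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}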

GPUs as accelerators

LSM calculations on the GPU
The calculation of the LSM of an element does not depend on any other calculation, which makes it an ideal candidate for computing on the GPU.
There is typically a large number of elements, which is naturally handled by the programming model of the GPU.
However, there might be insufficient GPU memory to store all the elements.

LSM construction pseudocode
Hexahedral elements, second-order expansion: NVB = 27 basis functions, NQP = 5 integration points per direction.

// Iterate over all elements
for (k = 0; k < elnum; k++) {
    // Iterate over pairs of polynomial basis functions
    for (m = 0; m < NVB; m++) {
        for (n = 0; n < NVB; n++) {
            row, col = getrowcol(m, n);
            // Iterate over the integration points
            for (x = 0; x < NQP; x++) {
                for (y = 0; y < NQP; y++) {
                    for (z = 0; z < NQP; z++) {
                        el[k].lsm[row][col] += "elasticity equation";
                    }
                }
            }  // x, y, z
        }
    }  // m, n
}  // k

Mapping of calculations on the GPU
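The original slide presents this mapping as a figure. A minimal sketch of one plausible mapping is given below, assuming one thread block per element and one thread per (m, n) pair of basis functions; the names ElementData, getrowcol and elasticity_term are placeholders, not the authors' actual code.

#define NVB 27   // basis functions per element (second-order hexahedra)
#define NQP 5    // integration points per direction

struct ElementData {
    double lsm[NVB][NVB];    // simplified LSM storage (placeholder layout)
    // ... per-element geometry and material data would go here ...
};

// Placeholder device helpers; the real code computes the local row/column
// mapping and the elasticity integrand here.
__device__ void getrowcol(int m, int n, int *row, int *col) { *row = m; *col = n; }
__device__ double elasticity_term(const ElementData *e, int m, int n,
                                  int x, int y, int z) { return 0.0; }

__global__ void compute_lsm(ElementData *el, int elnum)
{
    int k = blockIdx.x;      // element handled by this thread block
    int m = threadIdx.y;     // first basis function
    int n = threadIdx.x;     // second basis function
    if (k >= elnum) return;

    int row, col;
    getrowcol(m, n, &row, &col);

    double acc = 0.0;        // accumulate over all integration points
    for (int x = 0; x < NQP; x++)
        for (int y = 0; y < NQP; y++)
            for (int z = 0; z < NQP; z++)
                acc += elasticity_term(&el[k], m, n, x, y, z);

    el[k].lsm[row][col] = acc;   // each thread owns exactly one LSM entry
}

// Possible launch, once the elements have been copied to d_el on the device:
//   dim3 threads(NVB, NVB);                  // 27 x 27 = 729 threads per block
//   compute_lsm<<<elnum, threads>>>(d_el, elnum);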

Improvements introduced for a single GPU
Overlap calculations with data transfers from/to the host.
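A minimal sketch of such an overlap is given below, processing the elements in chunks with two CUDA streams. It reuses the ElementData type and the compute_lsm kernel from the earlier sketch and assumes h_el is pinned host memory (cudaMallocHost) and d_el is device memory; the chunk size is illustrative.

void process_elements(ElementData *h_el, ElementData *d_el, int elnum)
{
    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++) cudaStreamCreate(&stream[s]);

    const int chunk = 1024;                      // illustrative chunk size
    for (int first = 0, s = 0; first < elnum; first += chunk, s ^= 1) {
        int count = (elnum - first < chunk) ? (elnum - first) : chunk;

        // Asynchronously copy the next chunk of elements to the device.
        cudaMemcpyAsync(d_el + first, h_el + first, count * sizeof(ElementData),
                        cudaMemcpyHostToDevice, stream[s]);

        // The kernel for this chunk runs in the same stream, after its copy,
        // and overlaps with the transfer issued in the other stream.
        dim3 threads(NVB, NVB);
        compute_lsm<<<count, threads, 0, stream[s]>>>(d_el + first, count);

        // Copy the computed LSMs of this chunk back to the host.
        cudaMemcpyAsync(h_el + first, d_el + first, count * sizeof(ElementData),
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < 2; s++) cudaStreamDestroy(stream[s]);
}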

Improvements introduced for a single GPU
All valid mappings of the loops onto the given number of threads have been tested for our configuration.
Input data are reordered in memory to become GPU memory-friendly.
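A common form of such a reordering is switching from an array-of-structures to a structure-of-arrays layout, so that consecutive threads access consecutive memory locations (coalesced accesses). The sketch below is only an assumed illustration of the idea, not the authors' actual data layout.

// Array-of-structures: natural on the CPU, the fields of one item are contiguous.
struct NodeAoS { double x, y, z; };

// Structure-of-arrays: GPU-friendly, the same field of consecutive items is
// contiguous, so threads i, i+1, ... issue coalesced loads.
struct NodesSoA { double *x, *y, *z; };

// Host-side repacking before uploading the data to the GPU.
void repack(const NodeAoS *in, NodesSoA out, int n)
{
    for (int i = 0; i < n; i++) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
    }
}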

Results for the single-GPU approach
The approach provides a large improvement in execution time:
LSM calculations only: up to 98.1%
Total execution time: up to 76.2%
Extension: single GPU → multi-GPU → multi-node

Multi-GPU

Multi-Node & Multi-GPU
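A common way to organize this level, shown as an assumed sketch below rather than the authors' exact scheme, is to run one MPI process per GPU: each rank binds to a device on its node and handles its own share of the elements.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Bind each rank to one of the GPUs on its node (e.g. 2 GPUs per node).
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0) cudaSetDevice(rank % ndev);

    // Static partition of the elements across ranks (illustrative numbers).
    int elnum = 16384;
    int per_rank = (elnum + size - 1) / size;
    int first = rank * per_rank;
    int count = (first >= elnum) ? 0
              : (elnum - first < per_rank) ? (elnum - first) : per_rank;

    // ... copy this rank's "count" elements to its GPU, run the LSM kernel,
    //     then assemble or gather the results (e.g. with MPI_Gatherv) ...

    MPI_Finalize();
    return 0;
}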

Computing platform
The systems used were made available through the LinkSCEEM-2 project (the full funding statement is given in the Acknowledgements).

Parameters of the application/experiments
8 degrees of freedom
4 cases for the number of elements: 256, 1024, 4096, 16384
Up to 8 nodes of the cluster were used, for a total of up to 16 GPUs.

Speedup of the LSM calculation using 1 GPU per node

Speedup of the LSM calculation using 2 GPUs per node

Special case: 65536 elements
This case does not fit into the memory of 1 GPU, but does fit into the memory of 2 GPUs.
Speedup is therefore measured against the execution time on 1 node using 2 GPUs.

Conclusion
LSM calculations are highly parallelizable.
Offloading them to GPUs yields a significant overall improvement in execution time.

Future Work
Execute on a larger cluster.
Allow larger numbers of elements, including cases where the elements do not fit into GPU memory.
Reorganize the representation of elements in memory to better fit the architectural characteristics of GPUs.
Parallelize more functions.
Include a parallel CUDA solver; PETSc is currently used for this purpose, and the solvers available for CUDA seem to have poor performance.

Acknowledgements
This research has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program THALES: "Reinforcement of the interdisciplinary and/or inter-institutional research and innovation" (MIS-379421, "Expertise development for the aeroelastic analysis and the design-optimization of wind turbines").
Support by the LinkSCEEM-2 project, funded by the European Commission under the 7th Framework Programme through Capacities Research Infrastructure, INFRA-2010-1.2.3 Virtual Research Communities, Combination of Collaborative Project and Coordination and Support Actions (CP-CSA), under grant agreement no RI-261600, is also acknowledged.