Towards the Implementation of Wind Turbine Simulations on Many-Core Systems
I. E. Venetis (1), N. Nikoloutsakos (1), E. Gallopoulos (1), John Ekaterinaris (2)
(1) University of Patras, Greece
(2) Embry-Riddle Aeronautical University, FL, USA
Introduction
- Many systems are modelled by PDEs
- To simulate them on a computer: discretization of the underlying PDEs, e.g. with the Finite Element Method (FEM)
- Construct a system of linear or non-linear equations
- Solve the system of equations
- Typically very time consuming: use of HPC systems
Target
- Accelerate fluid-structure interaction (FSI) simulations of next-generation wind turbine blades
- FSI application by J. A. Ekaterinaris
- Use GPU computing power to reduce execution time
Typical FEM Workflow
1. Discretization of the application domain by applying a grid of elements
2. Numerical integration, which includes calculation of the local stiffness matrix (LSM) for each element
3. Matrix assembly, constructing the global stiffness matrix from the local matrices
4. Repeat: solve the system of linear equations described by the large, sparse matrix computed in the previous step
- Improvements in calculating the LSMs have little impact on the overall execution time
Wind Turbine Simulation Application
- Next-generation wind turbines are large
- The blowing wind applies forces, causing deformation of the blades and movement of the turbine that can no longer be ignored
- Parts of the turbine then no longer correspond to the elements of the discretization, so the simulation results are not accurate
- Solution: the local stiffness matrix of each element has to be recalculated after each time step
Workflow in Wind Turbine Simulation Application
1. Discretization of the application domain by applying a grid of elements
2. Repeat for every time step (see the sketch below):
   a. Numerical integration, which includes calculation of the local stiffness matrix (LSM) for each element
   b. Matrix assembly, constructing the global stiffness matrix from the local matrices
   c. Solve the system of linear equations described by the large, sparse matrix computed in the previous step
- Accelerating the construction of the LSMs is now worth the effort
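A minimal sketch of how the two workflows differ, using hypothetical driver routines; compute_all_lsms, assemble_global and solve_system are placeholders, not the application's actual functions:

    // Typical FEM workflow: element matrices are built once, only the solve repeats.
    void typical_fem(Element *el, int elnum, int timesteps) {
        compute_all_lsms(el, elnum);        // numerical integration, done once
        assemble_global(el, elnum);         // global stiffness matrix, done once
        for (int t = 0; t < timesteps; t++)
            solve_system();                 // sparse linear solve per time step
    }

    // Wind turbine workflow: the blades deform, so the LSMs (and the assembly)
    // have to be recomputed in every time step.
    void wind_turbine_fem(Element *el, int elnum, int timesteps) {
        for (int t = 0; t < timesteps; t++) {
            compute_all_lsms(el, elnum);    // now inside the time loop
            assemble_global(el, elnum);
            solve_system();
        }
    }

This is why accelerating the LSM construction pays off in the second workflow but not in the first.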
Recent activity
GPUs as accelerators
- GPUs have evolved into extremely flexible and powerful processors
- Contemporary GPUs provide large numbers of cores (e.g., 2880 cores on the NVIDIA Tesla K40)
- High throughput-to-cost ratio
- NVIDIA GPUs are programmable with CUDA, a set of extensions to industry-standard programming languages
LSM calculations on the GPU
- The calculation of the LSM of an element does not depend on any other calculation: an ideal candidate for computing on the GPU
- Typically there is a large number of elements, which is naturally handled by the programming model of the GPU
- There might be insufficient GPU memory to store all the elements (see the batching sketch below)
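One way to cope with insufficient GPU memory is to process the elements in batches sized to the available device memory. This is only a sketch of the idea; Element, el, d_el and lsm_kernel (one possible version is sketched further below) are placeholders, not the application's actual names.

    // Sketch: process elements in batches that fit into free device memory.
    size_t free_mem, total_mem;
    cudaMemGetInfo(&free_mem, &total_mem);
    int batch = (int)((free_mem / 2) / sizeof(Element));    // keep some headroom
    Element *d_el;
    cudaMalloc((void **)&d_el, (size_t)batch * sizeof(Element));

    for (int start = 0; start < elnum; start += batch) {
        int count = (elnum - start < batch) ? (elnum - start) : batch;
        cudaMemcpy(d_el, &el[start], count * sizeof(Element), cudaMemcpyHostToDevice);
        lsm_kernel<<<count, NVB * NVB>>>(d_el, count);      // one block per element
        cudaMemcpy(&el[start], d_el, count * sizeof(Element), cudaMemcpyDeviceToHost);
    }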
LSM construction pseudocode
- Hexahedral elements, second-order expansion: NVB = 27 basis functions, NQP = 5 quadrature points per direction

    // Iterate over all elements
    for (k = 0; k < elnum; k++) {
        // Iterate over pairs of polynomial basis functions
        for (m = 0; m < NVB; m++) {
            for (n = 0; n < NVB; n++) {
                row, col = getrowcol(m, n);
                // Iterate over the integration (quadrature) points
                for (x = 0; x < NQP; x++) {
                    for (y = 0; y < NQP; y++) {
                        for (z = 0; z < NQP; z++) {
                            el[k].lsm[row][col] += "elasticity equation";
                        }
                    }
                } // x, y, z
            }
        } // m, n
    } // k
Mapping of calculations on the GPU
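The mapping actually chosen by the application is not spelled out in the text (several mappings were evaluated, as noted below). The following is a minimal sketch of one natural mapping, with one thread block per element and one thread per (m, n) basis-function pair; Element, getrowcol and elasticity_term are placeholders for the application's data structure, index mapping and elasticity integrand.

    #define NVB 27   // basis functions per hexahedral element (second-order expansion)
    #define NQP 5    // quadrature points per direction

    // Placeholder __device__ helpers: getrowcol() maps each (m, n) pair to a distinct
    // (row, col) entry of the LSM; elasticity_term() evaluates the integrand.
    __global__ void lsm_kernel(Element *el, int elnum)
    {
        int k = blockIdx.x;              // element index
        int m = threadIdx.x / NVB;       // first basis-function index
        int n = threadIdx.x % NVB;       // second basis-function index
        if (k >= elnum) return;

        int row, col;
        getrowcol(m, n, &row, &col);

        double acc = 0.0;                // accumulate over the quadrature points
        for (int x = 0; x < NQP; x++)
            for (int y = 0; y < NQP; y++)
                for (int z = 0; z < NQP; z++)
                    acc += elasticity_term(&el[k], m, n, x, y, z);
        el[k].lsm[row][col] = acc;       // no race: each thread owns one (row, col)
    }

    // Launch: one block per element, NVB * NVB = 729 threads per block
    lsm_kernel<<<elnum, NVB * NVB>>>(d_el, elnum);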
Improvements introduced for a single GPU
- Overlap calculations with data transfers from/to the host (see the streams sketch below)
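A minimal sketch of such overlapping with CUDA streams, assuming pinned host buffers (allocated with cudaMallocHost), an element count divisible by the number of streams, and the placeholder lsm_kernel sketched above:

    // Sketch: overlap host<->device transfers with kernel execution using CUDA streams.
    const int NSTREAMS = 2;
    cudaStream_t stream[NSTREAMS];
    for (int s = 0; s < NSTREAMS; s++)
        cudaStreamCreate(&stream[s]);

    int chunk = elnum / NSTREAMS;                 // assumes elnum % NSTREAMS == 0
    for (int s = 0; s < NSTREAMS; s++) {
        int offset = s * chunk;
        cudaMemcpyAsync(&d_el[offset], &h_el[offset], chunk * sizeof(Element),
                        cudaMemcpyHostToDevice, stream[s]);
        lsm_kernel<<<chunk, NVB * NVB, 0, stream[s]>>>(&d_el[offset], chunk);
        cudaMemcpyAsync(&h_el[offset], &d_el[offset], chunk * sizeof(Element),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();                      // wait for all streams to finish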
Improvements introduced for a single GPU (continued)
- All valid mappings of the loop onto the given number of threads have been tested for our configuration
- Input data reordered in memory to become GPU memory-friendly (see the layout sketch below)
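The slides do not specify the exact reordering; a common GPU-friendly transformation is switching from an array of structures to a structure of arrays, so that consecutive threads access consecutive memory locations. The field names below are illustrative, not the application's actual layout.

    #define NQP 5
    #define NQP3 (NQP * NQP * NQP)   // quadrature points per element

    // Array of structures (AoS): all data of one element are contiguous in memory.
    // Threads reading the same quantity for consecutive elements access strided memory.
    typedef struct {
        double jacobian[NQP3];       // placeholder: per-quadrature-point data
    } ElementAoS;
    // thread k reads el_aos[k].jacobian[q]          -> stride of sizeof(ElementAoS)

    // Structure of arrays (SoA): the same quantity of all elements is contiguous,
    // so accesses by consecutive threads coalesce into few memory transactions.
    typedef struct {
        double *jacobian;            // elnum * NQP3 values
    } ElementsSoA;
    // thread k reads el_soa.jacobian[q * elnum + k] -> consecutive addresses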
Results for single-GPU approach
- The approach provides a large improvement in execution time
  - LSM calculations only: up to 98.1%
  - Total execution time: up to 76.2%
- Extension: from a single GPU to multi-GPU and multi-node
Multi-GPU
Multi-Node & Multi-GPU
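These slides presumably illustrate how the elements are distributed over nodes and GPUs; the details are not in the text. A minimal sketch of one such distribution, assuming one MPI rank per GPU and a simple block partitioning of the elements (all names are placeholders):

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Sketch: one MPI rank per GPU; each rank computes the LSMs of its own slice.
    int rank, nprocs, ndevices;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    cudaGetDeviceCount(&ndevices);
    cudaSetDevice(rank % ndevices);              // e.g., 2 ranks per node -> 2 GPUs per node

    int chunk = (elnum + nprocs - 1) / nprocs;   // block partitioning of the elements
    int first = rank * chunk;
    int count = (first + chunk <= elnum) ? chunk : (elnum > first ? elnum - first : 0);

    if (count > 0) {
        // Copy the local slice to the selected GPU and launch the (placeholder) kernel.
        cudaMemcpy(d_el, &el[first], count * sizeof(Element), cudaMemcpyHostToDevice);
        lsm_kernel<<<count, NVB * NVB>>>(d_el, count);
        cudaMemcpy(&el[first], d_el, count * sizeof(Element), cudaMemcpyDeviceToHost);
    }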
Computing platform
Access to the computing platform was provided through the LinkSCEEM-2 project (see Acknowledgements).
Parameters of application/experiments
- 8 degrees of freedom
- 4 cases for the number of elements: 256, 1024, 4096, 16384
- Used up to 8 nodes of the cluster, for a total of up to 16 GPUs
Speedup of LSM calculations using 1 GPU per node
Speedup of LSM calculations using 2 GPUs per node
Special case: 65536 elements
- Does not fit into the memory of 1 GPU, but does fit into the memory of 2 GPUs
- Speedup measured against the execution time on 1 node using 2 GPUs
Conclusion
- LSM calculations are highly parallelizable
- Significant overall improvement in execution time
Future Work
- Execute on a larger cluster
- Allow larger numbers of elements, including cases where the elements do not fit into GPU memory
- Reorganize the representation of elements in memory to better fit the architectural characteristics of GPUs
- Parallelize more functions
- Include a CUDA parallel solver
  - Currently PETSc is used for this purpose
  - Available CUDA solvers seem to have poor performance
Acknowledgements
This research has been co-financed by the European Union (European Social Fund, ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program THALES: Reinforcement of the interdisciplinary and/or inter-institutional research and innovation (MIS-379421, "Expertise development for the aeroelastic analysis and the design-optimization of wind turbines").
Support by the LinkSCEEM-2 project, funded by the European Commission under the 7th Framework Programme through Capacities Research Infrastructure, INFRA-2010-1.2.3 Virtual Research Communities, Combination of Collaborative Project and Coordination and Support Actions (CP-CSA), under grant agreement no RI-261600, is also gratefully acknowledged.