Hybrid Parallel Implementation of The DG Method
Advanced Computing Department / CAAM, 03/03/2016
N. Chaabane, B. Riviere, H. Calandra, M. Sekachev, S. Hamlaoui

Presentation transcript:

Hybrid Parallel Implementation of The DG Method. Advanced Computing Department / CAAM. 03/03/2016. N. Chaabane, B. Riviere, H. Calandra, M. Sekachev, S. Hamlaoui

Outline: Numerical methods; Modern programming models; DG method: implementation and scalability.

Classical approaches. Finite difference: [figure]

Classical approaches. Finite volume: [figure, image credit: oldwww.unibas.it]

Limitations. The finite volume method is a low-order method: the approximate solution is piecewise constant, so accuracy can only be improved by refining the mesh. Very fine mesh => high number of degrees of freedom => large linear system (for example, halving the mesh size in 3D multiplies the number of unknowns by eight).

DG-Finite Element Method. Allows us to use higher-order approximations. Allows the modelling of complex geometries. Modern methods such as the DG method allow the implementation of hp-refinement in a relatively easy way. [Figure: neighbouring elements with different polynomial degrees, p = 1, 2, 3]

Outline: Numerical methods; Modern programming models; DG method: implementation and scalability.

Serial Computers. 1 Central Processing Unit (CPU). 1 Memory Unit. [Diagram: a serial computer, with the CPU connected to the memory unit]

From Serial to Parallel: Step I. Idea: add more cores! => Multi-core processor/CPU. Architecture: Uniform memory access (UMA). [Diagram: UMA node in which every core of the CPU reaches the memory unit at the same speed, Speed A]

From Serial to Parallel: Step II. Idea: add more processors => Multi-processor nodes. Architecture: Non-uniform memory access (NUMA). [Diagram: NUMA node with two CPUs; each core reaches its local memory unit at Speed A and the remote one at Speed B, with Speed A > Speed B]

From Serial to Parallel: Step III. Idea: connect nodes by a network (actual wires). Result: the majority of supercomputers around today. Architecture: interconnected NUMA nodes. [Diagram: NUMA nodes connected by a network at Speed C, with Speed A > Speed B > Speed C]

Outline: Numerical methods; Modern programming models; DG method: implementation and scalability.

Domain Decomposition and SPMD. Single Program, Multiple Data (SPMD) is the most common style of parallel programming: the same program is executed on every processor, and tasks are split up and run simultaneously on multiple processors with different input in order to obtain results faster.
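To make the SPMD picture concrete, here is a minimal sketch (our illustration, not the talk's code): every MPI rank runs the same program and uses its rank number to pick its own slice of a hypothetical array of N elements.

```c
/* spmd_demo.c -- minimal SPMD sketch: every rank runs this same program
 * but works on a different slice of the data. Build with: mpicc spmd_demo.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Hypothetical global problem: N elements split evenly across ranks. */
    const int N = 1000;
    int chunk = N / size;
    int first = rank * chunk;
    int last  = (rank == size - 1) ? N : first + chunk; /* last rank takes the remainder */

    double local_sum = 0.0;
    for (int e = first; e < last; ++e)
        local_sum += 1.0;               /* stand-in for per-element work */

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("processed %g elements on %d ranks\n", global_sum, size);

    MPI_Finalize();
    return 0;
}
```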

Domain Decomposition. [Figure: mesh partitioned between Core 1 and Core 2, with a ghost region along the interface]
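A common way to keep such a ghost region current is a paired send/receive of interface values between neighbouring ranks after each update. The sketch below is generic (a 1D array with one ghost cell on each side, not the authors' data structures):

```c
/* ghost_exchange.c -- hedged sketch of a 1D ghost-cell exchange between
 * neighbouring MPI ranks; array layout and neighbour logic are illustrative. */
#include <mpi.h>

#define NLOC 100                /* locally owned values */

void exchange_ghosts(double *u /* u[0] and u[NLOC+1] are ghost cells */)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my first owned value left, receive my right ghost from the right */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send my last owned value right, receive my left ghost from the left */
    MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, right, 1,
                 &u[0],        1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

MPI_PROC_NULL turns the sends/receives at the domain boundaries into no-ops, so the same code works on every rank.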

Domain Decomposition of The FE Method. [Figure: the part of the mesh assigned to Core 1]

Domain Decomposition of The FE Method. [Figure: mesh split between Core 1 and Core 2, which exchange interface data via MPI]

Load Balance. The domain decomposition is done by elements. Since elements with different polynomial degrees carry different amounts of work, we assign weights to the elements to ensure load balance. [Figure: elements with p = 1, 2, 3]
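The talk does not spell out the weighting, but a natural choice is to weight each element by its number of degrees of freedom, which for a 2D triangular element of degree p is (p+1)(p+2)/2; the weights are then handed to a weighted graph partitioner such as ParMETIS. A small sketch under that assumption:

```c
/* element_weights.c -- hedged sketch: weight each element by its DOF count
 * so that a weighted partitioner (e.g. ParMETIS) balances work rather than
 * element counts. The degree array `p` is illustrative. */
#include <stdio.h>

int dofs_per_triangle(int p)
{
    /* a P_p basis on a triangle has (p+1)(p+2)/2 degrees of freedom */
    return (p + 1) * (p + 2) / 2;
}

int main(void)
{
    int p[] = {1, 2, 3};            /* polynomial degree of each element */
    int nelem = 3;
    int weights[3];

    for (int e = 0; e < nelem; ++e) {
        weights[e] = dofs_per_triangle(p[e]);
        printf("element %d: p=%d, weight=%d\n", e, p[e], weights[e]);
    }
    /* `weights` would be passed as the vertex-weight array to the partitioner */
    return 0;
}
```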

Strong Scalability. CRAY machine: 52 nodes with 2 CPUs each (10 cores per CPU) => total number of cores = 1040. We use HYPRE to solve the linear system.
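For context, this is roughly what a distributed HYPRE solve looks like through its IJ interface, closely following HYPRE's bundled ex5 example (BoomerAMG-preconditioned CG on a 1D Laplacian). It is an illustration, not the authors' solver setup; exact integer types, and whether HYPRE_Init() is required, depend on the HYPRE version.

```c
/* hypre_solve.c -- hedged sketch modelled on HYPRE's ex5: assemble a
 * distributed tridiagonal matrix and solve with BoomerAMG-preconditioned CG. */
#include <mpi.h>
#include "HYPRE.h"
#include "HYPRE_parcsr_ls.h"
#include "HYPRE_krylov.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank owns `local_n` consecutive rows of a global 1D Laplacian */
    const int local_n = 1000;
    int n = local_n * size;
    int ilower = rank * local_n, iupper = ilower + local_n - 1;

    HYPRE_IJMatrix A;       HYPRE_ParCSRMatrix parcsr_A;
    HYPRE_IJVector bv, xv;  HYPRE_ParVector par_b, par_x;

    HYPRE_IJMatrixCreate(MPI_COMM_WORLD, ilower, iupper, ilower, iupper, &A);
    HYPRE_IJMatrixSetObjectType(A, HYPRE_PARCSR);
    HYPRE_IJMatrixInitialize(A);
    for (int i = ilower; i <= iupper; i++) {
        int cols[3], nnz = 0;
        double vals[3];
        if (i > 0)     { cols[nnz] = i - 1; vals[nnz++] = -1.0; }
        cols[nnz] = i;   vals[nnz++] = 2.0;
        if (i < n - 1) { cols[nnz] = i + 1; vals[nnz++] = -1.0; }
        HYPRE_IJMatrixSetValues(A, 1, &nnz, &i, cols, vals);
    }
    HYPRE_IJMatrixAssemble(A);
    HYPRE_IJMatrixGetObject(A, (void **) &parcsr_A);

    HYPRE_IJVectorCreate(MPI_COMM_WORLD, ilower, iupper, &bv);
    HYPRE_IJVectorSetObjectType(bv, HYPRE_PARCSR);
    HYPRE_IJVectorInitialize(bv);
    HYPRE_IJVectorCreate(MPI_COMM_WORLD, ilower, iupper, &xv);
    HYPRE_IJVectorSetObjectType(xv, HYPRE_PARCSR);
    HYPRE_IJVectorInitialize(xv);
    for (int i = ilower; i <= iupper; i++) {
        double one = 1.0, zero = 0.0;
        HYPRE_IJVectorSetValues(bv, 1, &i, &one);   /* rhs = 1 */
        HYPRE_IJVectorSetValues(xv, 1, &i, &zero);  /* initial guess = 0 */
    }
    HYPRE_IJVectorAssemble(bv); HYPRE_IJVectorGetObject(bv, (void **) &par_b);
    HYPRE_IJVectorAssemble(xv); HYPRE_IJVectorGetObject(xv, (void **) &par_x);

    /* CG preconditioned by one BoomerAMG V-cycle per iteration */
    HYPRE_Solver solver, precond;
    HYPRE_ParCSRPCGCreate(MPI_COMM_WORLD, &solver);
    HYPRE_PCGSetMaxIter(solver, 1000);
    HYPRE_PCGSetTol(solver, 1e-7);
    HYPRE_BoomerAMGCreate(&precond);
    HYPRE_BoomerAMGSetMaxIter(precond, 1);  /* one cycle when used as precond */
    HYPRE_BoomerAMGSetTol(precond, 0.0);
    HYPRE_PCGSetPrecond(solver, (HYPRE_PtrToSolverFcn) HYPRE_BoomerAMGSolve,
                        (HYPRE_PtrToSolverFcn) HYPRE_BoomerAMGSetup, precond);
    HYPRE_ParCSRPCGSetup(solver, parcsr_A, par_b, par_x);
    HYPRE_ParCSRPCGSolve(solver, parcsr_A, par_b, par_x);

    HYPRE_BoomerAMGDestroy(precond);
    HYPRE_ParCSRPCGDestroy(solver);
    HYPRE_IJMatrixDestroy(A);
    HYPRE_IJVectorDestroy(bv);
    HYPRE_IJVectorDestroy(xv);
    MPI_Finalize();
    return 0;
}
```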

Weak Scalability

Evolution of Supercomputers: GPUs. Idea: complement CPUs with accelerators/co-processors. Result: the biggest supercomputers today. Architecture: Hybrid. [Diagram: interconnected NUMA nodes, each pairing a CPU with a GPU; the network runs at Speed C]

Domain Decomposition of The FE Method. [Figure: the part of the mesh assigned to Node 1]

Domain Decomposition of The FE Method. [Figure: mesh split between Node 1 and Node 2, which exchange interface data via MPI]

Scalability of The Hybrid Implementation I. Comparison between HYPRE and AMGX, using 2 CPUs per node for HYPRE and one Tesla K40 GPU per node for AMGX.
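For comparison, here is a hedged sketch of the AMGX C API on a single GPU (the MPI partitioning used in the talk is omitted, and the config file name is hypothetical); this is an illustration, not the authors' code:

```c
/* amgx_solve.c -- hedged sketch: solve a tiny SPD system with NVIDIA's AMGX
 * C API on one GPU. The JSON config file name is a placeholder. */
#include <amgx_c.h>

int main(void)
{
    AMGX_initialize();          /* some AMGX versions also need plugin init */

    AMGX_config_handle    cfg;
    AMGX_resources_handle rsrc;
    AMGX_matrix_handle    A;
    AMGX_vector_handle    b, x;
    AMGX_solver_handle    solver;

    /* solver/preconditioner settings come from a JSON config file */
    AMGX_config_create_from_file(&cfg, "PCG_AMG.json");  /* hypothetical file */
    AMGX_resources_create_simple(&rsrc, cfg);

    /* dDDI = device matrix, double values, int indices */
    AMGX_matrix_create(&A, rsrc, AMGX_mode_dDDI);
    AMGX_vector_create(&b, rsrc, AMGX_mode_dDDI);
    AMGX_vector_create(&x, rsrc, AMGX_mode_dDDI);
    AMGX_solver_create(&solver, rsrc, AMGX_mode_dDDI, cfg);

    /* tiny 2x2 SPD system in CSR form: [[2,-1],[-1,2]] u = [1,1], u = (1,1) */
    int    row_ptrs[3] = {0, 2, 4};
    int    col_idx[4]  = {0, 1, 0, 1};
    double vals[4]     = {2.0, -1.0, -1.0, 2.0};
    double rhs[2]      = {1.0, 1.0};
    double sol[2]      = {0.0, 0.0};

    AMGX_matrix_upload_all(A, 2, 4, 1, 1, row_ptrs, col_idx, vals, NULL);
    AMGX_vector_upload(b, 2, 1, rhs);
    AMGX_vector_upload(x, 2, 1, sol);

    AMGX_solver_setup(solver, A);
    AMGX_solver_solve(solver, b, x);
    AMGX_vector_download(x, sol);       /* sol now holds the solution */

    AMGX_solver_destroy(solver);
    AMGX_vector_destroy(x);
    AMGX_vector_destroy(b);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(rsrc);
    AMGX_config_destroy(cfg);
    AMGX_finalize();
    return 0;
}
```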

Drawbacks. [Diagram: NUMA node in which a single core of the CPU accesses SUBDOMAIN i and builds the linear system for the GPU, with uniform access]

Optimized Implementation: OpenMP. [Diagram: NUMA node in which the cores of the CPU share the work on SUBDOMAIN i via OpenMP and build the linear system for the GPU]
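A generic sketch of the idea (our illustration, not the talk's code): the element loop that builds the subdomain's linear system is shared among the node's cores with an OpenMP pragma, while MPI continues to handle communication between subdomains.

```c
/* openmp_assembly.c -- hedged sketch: multi-threaded assembly of per-element
 * contributions inside one MPI subdomain. Element/DOF layout is illustrative. */
#include <omp.h>
#include <stdio.h>

#define NELEM 10000
#define NDOF  4            /* pretend each element touches 4 global DOFs */

double global_rhs[NELEM * NDOF];

int main(void)
{
    /* each thread assembles a chunk of elements; the atomic guards the
     * scatter into shared storage in case neighbouring elements shared DOFs */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int e = 0; e < NELEM; ++e) {
        double local[NDOF];
        for (int k = 0; k < NDOF; ++k)
            local[k] = 1.0;             /* stand-in for element integrals */
        for (int k = 0; k < NDOF; ++k) {
            int g = e * NDOF + k;       /* illustrative DOF map (no sharing) */
            #pragma omp atomic
            global_rhs[g] += local[k];
        }
    }
    printf("assembled with up to %d threads\n", omp_get_max_threads());
    return 0;
}
```

With this layout, one MPI rank per GPU can still keep every CPU core of the node busy during assembly instead of leaving all but one idle.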

Scalability of The Hybrid Implementation II

Conclusion. We were able to develop very scalable software that takes advantage of modern technology to simulate geophysical applications. hp-refinement is fairly easy as a result of using the DG method. Load balancing is ensured using ParMETIS.