Presentation transcript:

Poisson Solver: Optimization of the Poisson Operator in CHOMBO
Razvan Carbunescu, Meriem Ben Salah, Andrew Gearhart

Contact information:
Meriem Ben Salah: UC Berkeley ME Graduate Student, ParLab
Razvan Corneliu Carbunescu: UC Berkeley CS Graduate Student, ParLab
Andrew Gearhart: UC Berkeley CS Graduate Student, ParLab
James Demmel: UC Berkeley Math & CS Faculty
Phillip Colella: LBNL ANAG
Brian Van Straalen: LBNL ANAG

Motivation

For the sake of a demonstrative exposition, we use the Poisson potential-flow solve performed within the incompressible Navier-Stokes equations. A Poisson solve is conducted at the beginning of an incompressible flow simulation to obtain initial conditions for the evolution of the velocity and pressure (Fig. 1: flow boundary condition initialization; Fig. 2: flow evolution start-up after a Poisson solve).

Poisson Operator in CHOMBO

CHOMBO is a framework for implementing finite difference methods for the solution of partial differential equations on block-structured, adaptively refined rectangular grids. It provides elliptic and time-dependent modules, as well as support for standardized self-describing file formats, and it is architecture and operating system independent. Its main components are:

BoxTools: calculations over unions of rectangles
AMRTools: communication between refinement levels
EBTools: embedded boundary discretization
AMRElliptic: multigrid solvers for discretized elliptic (Poisson, resistivity, etc.) and parabolic equations
AMRTimeDependent: subcycling of time-dependent calculations
ParticleTools: particle dynamics

For use on parallel platforms, CHOMBO provides solely a distributed-memory implementation in MPI (Message Passing Interface). Is a distributed-memory implementation always beneficial?

Theoretical Background

The Poisson operator, also known as the Laplacian, is a second-order elliptic differential operator, defined in an n-dimensional Cartesian space by

    Δu = ∂²u/∂x₁² + ∂²u/∂x₂² + … + ∂²u/∂xₙ².

The Poisson operator appears in the Helmholtz differential equation

    Δu + k²u = f,

which for k = 0 reduces to the Poisson equation

    Δu = f.

The Poisson equation is used to model various boundary value problems in physics, e.g. the electric potential in electrostatics and potential flow in fluid dynamics. Prescribing appropriate boundary conditions, Dirichlet or Neumann, allows the Poisson problem to be solved. A numerical solution requires discretizing the continuous Poisson equation, e.g. by the standard centered-difference approximation, as well as a discrete treatment of the Dirichlet and Neumann boundary conditions. The discrete Poisson operator, the focus of this project, is given in two dimensions on a grid of spacing h by the five-point stencil

    (1/h²) [ 0   1   0
             1  -4   1
             0   1   0 ]
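To make the discrete operator concrete, here is a minimal C sketch that applies the five-point stencil to a single 2D cell-centered box with one layer of ghost cells. It is an illustration only, not CHOMBO's actual interface: the function name, array layout, and the toy zero-valued ghost cells are assumptions.

```c
/* Minimal sketch of the 5-point discrete Poisson (Laplacian) stencil
 * on a 2D cell-centered box with one layer of ghost cells.
 * Names and layout are illustrative, not CHOMBO's interface. */
#include <stdio.h>
#include <stdlib.h>

/* L[phi](i,j) = (phi[i-1,j] + phi[i+1,j] + phi[i,j-1] + phi[i,j+1]
 *                - 4*phi[i,j]) / h^2
 * on the nx-by-ny interior of an (nx+2)-by-(ny+2) array. */
static void apply_poisson(const double *phi, double *lphi,
                          int nx, int ny, double h)
{
    int stride = nx + 2;               /* row length including ghosts */
    for (int j = 1; j <= ny; j++) {
        for (int i = 1; i <= nx; i++) {
            int c = j * stride + i;    /* center index */
            lphi[c] = (phi[c - 1] + phi[c + 1] +
                       phi[c - stride] + phi[c + stride] -
                       4.0 * phi[c]) / (h * h);
        }
    }
}

int main(void)
{
    int nx = 32, ny = 32;              /* one CHOMBO-sized box, 32^2 */
    double h = 1.0 / nx;
    double *phi  = calloc((nx + 2) * (ny + 2), sizeof *phi);
    double *lphi = calloc((nx + 2) * (ny + 2), sizeof *lphi);

    /* Fill the interior with phi = x^2 + y^2; ghosts stay 0. */
    for (int j = 1; j <= ny; j++)
        for (int i = 1; i <= nx; i++)
            phi[j * (nx + 2) + i] = (double)(i * i + j * j) * h * h;

    apply_poisson(phi, lphi, nx, ny, h);
    printf("L[phi] at center: %f\n", lphi[(ny / 2) * (nx + 2) + nx / 2]);

    free(phi);
    free(lphi);
    return 0;
}
```

Because centered differences are exact for quadratics, the value printed at the box center is 4, which is a convenient sanity check for the kernel.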
Project Strategy and Goals

1. Focus on the stencil kernel that applies the Poisson operator to two-dimensional cell-centered data and is embedded in AMRElliptic, the multigrid-based elliptic and parabolic equation solver for adaptive mesh hierarchies.
2. Start with the serial and distributed-memory implementations supplied in CHOMBO for reference.
3. Implement the operator for parallel shared-memory architectures.
4. Conduct a parameter study to analyze the benefits and drawbacks of each implementation in terms of computational time.
5. Draw global conclusions on the consequences of the limitations of the parallel implementations in CHOMBO.

Existing Implementations and Our Interest

CHOMBO's implementation is currently tuned for distributed memory: the domain is split into small boxes, on the order of 32 squared in 2D and 32 cubed in 3D, and each box is assigned to a processor that runs serial Fortran 77 code. Because the boxes are small, there is not enough computational intensity to exploit a threaded shared-memory implementation or to hide the cost of transferring data to the GPU. While CHOMBO's code obtains good performance with MPI on a distributed-memory system, our study aims to determine whether this speedup can be improved by exploiting locality and the faster access times of on-chip cores.

Our current shared-memory implementation uses only a limited, basic strip-mining technique, since it maintains the abstraction of a data iterator that visits every box at each step; a more relaxed model that also allowed blocking in time could help further. Another interesting opportunity for speedup is running the operator on the GPU, but the benefit must outweigh the cost of moving data onto and off the GPU. These points raise the issue of threads: for small boxes, lightweight threads are important to allow fast context switching and thereby, hopefully, increase performance. To allow the use of Pthreads and CUDA, we implemented a C version of the operator. The particular reason for this choice is to enable heterogeneous systems that automatically select MPI, Pthreads, or CUDA underneath to achieve the best result.
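As an illustration of the strip-mining idea in the shared-memory implementation described above, the sketch below divides the interior rows of a single box among POSIX threads, each applying the five-point stencil to its own strip. The thread count, box size, and all names are assumptions made for this example, not the project's actual code.

```c
/* Illustrative sketch: rows of one box split among POSIX threads,
 * each thread applying the 5-point stencil to its strip. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NX 32
#define NY 32
#define NTHREADS 4

typedef struct {
    const double *phi;   /* input, (NX+2)x(NY+2) with ghost cells */
    double *lphi;        /* output */
    double h;            /* grid spacing */
    int j_lo, j_hi;      /* inclusive interior-row range for this thread */
} strip_t;

static void *apply_strip(void *arg)
{
    strip_t *s = (strip_t *)arg;
    int stride = NX + 2;
    for (int j = s->j_lo; j <= s->j_hi; j++) {
        for (int i = 1; i <= NX; i++) {
            int c = j * stride + i;
            s->lphi[c] = (s->phi[c - 1] + s->phi[c + 1] +
                          s->phi[c - stride] + s->phi[c + stride] -
                          4.0 * s->phi[c]) / (s->h * s->h);
        }
    }
    return NULL;
}

int main(void)
{
    double *phi  = calloc((NX + 2) * (NY + 2), sizeof *phi);
    double *lphi = calloc((NX + 2) * (NY + 2), sizeof *lphi);
    for (int j = 1; j <= NY; j++)
        for (int i = 1; i <= NX; i++)
            phi[j * (NX + 2) + i] = 1.0;     /* arbitrary interior data */

    pthread_t tid[NTHREADS];
    strip_t   strip[NTHREADS];
    int rows = NY / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        strip[t] = (strip_t){ phi, lphi, 1.0 / NX,
                              1 + t * rows,
                              (t == NTHREADS - 1) ? NY : (t + 1) * rows };
        pthread_create(&tid[t], NULL, apply_strip, &strip[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("L[phi] at center: %f\n", lphi[(NY / 2) * (NX + 2) + NX / 2]);
    free(phi);
    free(lphi);
    return 0;
}
```

For 32-squared boxes the work per strip is tiny, which is exactly why the discussion above stresses lightweight threads and fast context switching.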
Targeted Architectures

The serial, distributed-memory, and shared-memory implementations of the Poisson solve have been run on Franklin, the NERSC Cray XT4 system, a massively parallel processing system with 9,532 compute nodes (one quad-core processor each), for 38,128 processor cores in total. The GPU implementation is being conducted on personal Linux boxes as well as on the r56 and r57 nodes of the Millennium cluster.

Investigations and Results

The number of grid cells in the spatial partition and the number of grid refinements affect the computation of the discrete Poisson stencil and are the focus of these test runs. To allow a fair comparison, the number of grid cells and the number of grid refinements are kept equal across the test runs. To account for different loading of the machines, each test was repeated 5 times and the corresponding results were averaged.

Figure 3 compares the runtimes of the C serial code and the Fortran 77 serial code. Besides the growing runtime, it is interesting to note that, despite the similar (column-major) storage of the data arrays in the Fortran 77 and C versions, the Fortran 77 code is accessed fastest when its loops are ordered n, j, i, whereas the C code is indexed i, j, n. Figure 4 depicts the speedup of the various parallel implementations with respect to the associated C serial version; C_MPI44 refers to running the MPI version with all 4 cores on 1 node, and C_MPI41 refers to running the MPI version with 4 cores but 1 core per node. Figure 5 presents the relative speedup of all codes with respect to the fastest (Fortran) serial version.

The shared-memory implementation shows beneficial results compared with the C serial code, but more analysis is required to compare it correctly with the Fortran MPI code. At the time of this presentation, no results have yet been completed on the GPU. It will be interesting to find a correct methodology for comparing the GPU results from Millennium with the results of the MPI and shared-memory implementations; since Franklin does not provide GPUs, the GPU test runs have to be verified independently, and this work will be reported later in the project document.

Conclusions and Future Work

It is obvious that the choice of the C implementation led to a loss of computational performance, so a Fortran 77 shared-memory implementation could be attractive; however, a POSIX threads interface to Fortran 77 is currently not available. An improvement on the shared-memory implementation might therefore lie in writing an OpenMP Fortran 77 version of the solver code. Currently our simulations are performed only on the Cray XT4 and the GPU; it would be interesting to conduct these studies on different machine architectures.

References:
[1] John Kubiatowicz, 2009, "CS252 Graduate Computer Architecture, Lecture 24: Network Interface Design, Memory Consistency Models."
[2] P. Colella et al., 2009, "Chombo Software Package for AMR Applications: Design Document."
[3] Machine images obtained from nersc.gov, krunker.com, compsource.com, and bit-tech.net.

Figures:
Fig. 3: Fortran vs. C serial implementation (runtime vs. number of cells, ncell).
Fig. 4: Speedup vs. the C serial implementation.
Fig. 5: Speedup vs. the Fortran serial implementation.
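Finally, to illustrate the loop-ordering observation attached to Figure 3: the data arrays in both the Fortran 77 and C versions are stored in column-major order, so element (i, j, n) sits at offset i + NI·(j + NJ·n) and i is the contiguous index. The sketch below (its index macro, sizes, and loop bodies are assumptions for illustration only) contrasts the cache-friendly n, j, i loop order with an i, j, n order, as the C code's indexing suggests, whose innermost loop jumps NI·NJ elements between consecutive accesses.

```c
/* Illustrative sketch: traversing a column-major (Fortran-ordered)
 * array in C.  Element (i, j, n) lives at i + NI*(j + NJ*n), so i is
 * the contiguous index and should be the innermost loop. */
#include <stdio.h>
#include <stdlib.h>

#define NI 512
#define NJ 512
#define NC 4                                   /* number of components n */
#define IDX(i, j, n) ((i) + NI * ((j) + (size_t)NJ * (n)))

int main(void)
{
    double *a = malloc(sizeof(double) * NI * NJ * NC);
    double sum = 0.0;

    /* Cache-friendly order for column-major data: n outermost,
     * i innermost (the "n, j, i" ordering of the Fortran kernel). */
    for (int n = 0; n < NC; n++)
        for (int j = 0; j < NJ; j++)
            for (int i = 0; i < NI; i++)
                a[IDX(i, j, n)] = 1.0;

    /* Unfavourable order: the innermost loop over n jumps NI*NJ
     * elements between accesses, touching a new cache line each time. */
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            for (int n = 0; n < NC; n++)
                sum += a[IDX(i, j, n)];

    printf("sum = %f\n", sum);
    free(a);
    return 0;
}
```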