CS 471 Final Project: 2D Advection/Wave Equation Using Fourier Methods
December 10, 2003
Jose L. Rodriguez


Project Description
Use a spectral method (Fourier method) for the 2D advection equation shown below. Use the JST Runge-Kutta time integrator for each time step.
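The equation itself was an image on the slide and is not preserved in the transcript; assuming the standard constant-coefficient 2D advection equation, it reads

    \frac{\partial u}{\partial t} + c_x \frac{\partial u}{\partial x} + c_y \frac{\partial u}{\partial y} = 0,

where c_x and c_y are constant advection speeds and u(x, y, t) is periodic on the domain.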

Algorithm
For each time step that we take, we perform s sub-stages:
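The stage formulas were likewise an image. A common form of the s-stage JST (Jameson-Schmidt-Turkel) Runge-Kutta scheme, assumed here from the integrator named above, is

    u^{(0)} = u^n, \qquad
    u^{(k)} = u^{(0)} + \frac{\Delta t}{s - k + 1} F\left(u^{(k-1)}\right), \quad k = 1, \dots, s, \qquad
    u^{n+1} = u^{(s)},

where F(u) = -c_x u_x - c_y u_y is the spatial right-hand side of the advection equation.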

Algorithm with Spectral Representation
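The slide's formulas are missing, but with a Fourier spectral representation the derivatives in F(u) are presumably evaluated in Fourier space:

    \hat{u} = \mathcal{F}(u), \qquad
    u_x = \mathcal{F}^{-1}\left(i k_x \hat{u}\right), \qquad
    u_y = \mathcal{F}^{-1}\left(i k_y \hat{u}\right),

so each sub-stage costs one forward 2D FFT and two inverse 2D FFTs on the N x N grid.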

Code Development
- Develop serial C code based on the given Matlab code, using the FFTW library for the fft and ifft calls (see the sketch below).
  - Very straightforward.
  - Verifying that the code worked correctly was simply a matter of comparing with the Matlab result.
- Develop parallel C code based on the serial C code.
  - The FFTW library provides fft and ifft calls that do all the MPI calls for you.
  - The tricky part of this development was placing the data correctly on each processor for the fft and ifft calls.
  - Verification was again comparison with the Matlab result.
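As a rough illustration of the serial building block (a minimal sketch, not the project's actual code), here is a spectral x-derivative of an N x N periodic field using the serial FFTW 3 API; the 2003 project would have used FFTW 2, whose calls differ. The function name spectral_ddx and the 2*pi-periodic domain are assumptions.

    #include <fftw3.h>

    /* Sketch: compute du/dx of an N x N periodic field spectrally.
       u and dudx are row-major N*N complex arrays; x is the first
       (row) dimension, matching the slab layout described later. */
    void spectral_ddx(int N, fftw_complex *u, fftw_complex *dudx)
    {
        fftw_complex *uhat = fftw_alloc_complex((size_t)N * N);
        fftw_plan fwd = fftw_plan_dft_2d(N, N, u, uhat,
                                         FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_plan inv = fftw_plan_dft_2d(N, N, uhat, dudx,
                                         FFTW_BACKWARD, FFTW_ESTIMATE);

        fftw_execute(fwd);                         /* uhat = fft(u) */
        for (int i = 0; i < N; i++) {
            double kx = (i <= N / 2) ? i : i - N;  /* wavenumber of row i */
            for (int j = 0; j < N; j++) {
                double re = uhat[i * N + j][0];
                double im = uhat[i * N + j][1];
                uhat[i * N + j][0] = -kx * im;     /* multiply by i*kx */
                uhat[i * N + j][1] =  kx * re;
            }
        }
        fftw_execute(inv);                         /* unnormalized ifft */
        for (int n = 0; n < N * N; n++) {          /* undo FFTW's N*N factor */
            dudx[n][0] /= (double)N * N;
            dudx[n][1] /= (double)N * N;
        }
        fftw_destroy_plan(fwd);
        fftw_destroy_plan(inv);
        fftw_free(uhat);
    }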

Results: N=512, 1000 Iterations

Usage of the FFTW Library in Parallel: Function Calls
Notice: message passing is transparent to the user.
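The function-call listing on this slide was an image. As a hedged reconstruction: in the FFTW 2.x MPI interface (whose local_nx/local_x_start naming matches the variables quoted on later slides), a distributed 2D transform is executed like this, with plan_fwd, plan_inv, local_data, and work assumed to be set up as in the plan-creation sketch two slides below:

    #include <fftw_mpi.h>

    /* Forward distributed FFT, Fourier-space arithmetic, inverse FFT.
       All MPI communication (the distributed transpose) happens
       inside fftwnd_mpi, hence "transparent to the user". */
    fftwnd_mpi(plan_fwd, 1, local_data, work, FFTW_NORMAL_ORDER);
    /* ... multiply by i*kx / i*ky on the local rows ... */
    fftwnd_mpi(plan_inv, 1, local_data, work, FFTW_NORMAL_ORDER);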

Usage of the FFTW Library in Parallel: MPI Data Layout
The transform data used by the MPI FFTW routines is distributed: a distinct portion of it resides with each process involved in the transform. This allows the transform to be parallelized, for example, over a cluster of workstations, each with its own separate memory, so that you can take advantage of the total memory of all the processors you are parallelizing over. In particular, the array is divided according to the rows (first dimension) of the data: each process gets a subset of the rows of the data. (This is sometimes called a "slab decomposition.") One consequence of this is that you can't take advantage of more processors than you have rows (e.g. a 64x64x64 matrix can use at most 64 processors). This isn't usually much of a limitation, however, as each processor needs a fair amount of data in order for the parallel-computation benefits to outweigh the communication costs.
(Taken from the FFTW website/documentation.)

Usage of the FFTW Library in Parallel: MPI Data Layout
These calls are needed to create the fft and ifft plans, and to find out how much memory must be allocated on each process (see the sketch below).
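The calls themselves were an image; under the same FFTW 2.x assumption, plan creation and the local-size query would look roughly like this (N = 512 is just an example value):

    #include <mpi.h>
    #include <fftw_mpi.h>

    int N = 512;                    /* global grid size */
    fftwnd_mpi_plan plan_fwd, plan_inv;
    int ilocal_nx, ilocal_x_start;  /* variable names from the slides */
    int local_ny_after_transpose, local_y_start_after_transpose;
    int total_local_size;           /* elements to allocate per process */

    plan_fwd = fftw2d_mpi_create_plan(MPI_COMM_WORLD, N, N,
                                      FFTW_FORWARD,  FFTW_ESTIMATE);
    plan_inv = fftw2d_mpi_create_plan(MPI_COMM_WORLD, N, N,
                                      FFTW_BACKWARD, FFTW_ESTIMATE);

    fftwnd_mpi_local_sizes(plan_fwd, &ilocal_nx, &ilocal_x_start,
                           &local_ny_after_transpose,
                           &local_y_start_after_transpose,
                           &total_local_size);

    fftw_complex *local_data =
        (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * total_local_size);
    fftw_complex *work =
        (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * total_local_size);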

Usage of the FFTW Library in Parallel: MPI Data Layout
ilocal_x_start tells us where this process's slab begins (its first row) in the global 2D array, and ilocal_nx tells us how many rows live on the current processor. The local data is stored in row-major format.


Parallel Results
Two versions were written:
- A Non-Efficient version that is not optimized for FFTW's MPI calls:
  - No extra work array is used.
  - An extra un-transposing of the data is done before returning from the fft calls.
- An Efficient version that is optimized for FFTW's MPI calls (see the sketch below):
  - An extra work array is used.
  - The data is left transposed, so the extra communication step of un-transposing it is skipped.
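In the FFTW 2.x interface assumed above, this difference plausibly comes down to two arguments of fftwnd_mpi:

    /* Non-Efficient: no work array, and output in normal order, so
       FFTW performs an extra communication step to un-transpose. */
    fftwnd_mpi(plan_fwd, 1, local_data, NULL, FFTW_NORMAL_ORDER);

    /* Efficient: supply a work array and accept transposed output;
       the un-transposing step is skipped, but all Fourier-space
       arithmetic must then index the data as transposed. */
    fftwnd_mpi(plan_fwd, 1, local_data, work, FFTW_TRANSPOSED_ORDER);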

Notice the slight differences between the two versions.

The Efficient version is faster and achieves better parallel efficiency.

We begin to see some scaling; however, efficiency starts to taper off, indicating that much of the time is spent in communication.

Overall, we see the same trend as N increases: some scaling as the number of processors increases, but the speedup starts to flatten and the efficiency steadily decreases.

The sea of black for the Non-Efficient version (N=256, 10 iterations).

A lot of communication between processors.

Communication occurs between every pair of processors via MPI_Sendrecv, since each processor needs data from all the others. We can actually see here when an fft is being performed.

8 processors and 16 processors: same trend of communication.

The sea of white for the Efficient version (N=256, 10 iterations).

The Efficient version uses MPI_Alltoall for its communication among all processors.
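For reference (a generic sketch, not the project's code), a single collective MPI_Alltoall exchanges one equal-sized block between every pair of processes, replacing the many pairwise MPI_Sendrecv calls of the Non-Efficient version's transpose; block and the buffers here are hypothetical:

    #include <mpi.h>

    /* Each of the P processes sends `block` doubles to every other
       process in one collective call; sendbuf and recvbuf each hold
       P * block doubles. */
    MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                 recvbuf, block, MPI_DOUBLE, MPI_COMM_WORLD);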

Again, the white bars show, for each process, when an fft call is being performed.

8 processors and 16 processors: same trend of communication.

Conclusions
- A lot of time is spent in communication, since each process communicates with every other process.
- Efficiency goes down as a result: as the number of processes increases for a given problem size N, more communication is needed.
- We saw some scaling, but it starts to drop off as the number of processors increases (an efficiency issue).
Time spent on this project:
- Code development: ~8 hours, including debugging
- Data collection: ~2 days
- Overall: quite a bit of time