Parallelization of FFT in AFNI
Huang, Jingshan; Xi, Hong
Department of Computer Science and Engineering, University of South Carolina

Motivation
- AFNI: a widely used software package for medical image processing
- Drawback: it is not a real-time system
- Our goal: produce a parallelized version of AFNI
- First step: parallelize the FFT part of AFNI

Outline
- What is AFNI
- FFT in AFNI
- Introduction to MPI
- Our parallelization method
- Experimental results and analysis
- Conclusion

What is AFNI?
- AFNI stands for Analysis of Functional NeuroImages.
- It is a set of C programs (over 1,000 source files) for processing, analyzing, and displaying data from functional MRI (fMRI), a technique for mapping human brain activity.
- AFNI is an interactive program for viewing the results of 3D functional neuroimaging.

How to run AFNI?
- Log on to the cluster machine (daniel.cse.sc.edu)
- Go to the directory /home/ramsey/newafnigo
- Run "afni"
- The AFNI interface should then appear

AFNI Interfaces
[Screenshots of the AFNI interface across several slides: Axial, Sagittal, and Coronal views]

FFT in AFNI
- Fast Fourier Transform: an efficient algorithm for the discrete Fourier transform, which maps data from the discrete time (or spatial) domain to the discrete frequency domain
- Reduces the number of computations needed for N points from O(N^2) to O(N log N)
- Extensively used in AFNI
- Parallelizing the FFT therefore has great significance for AFNI
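As a reminder of where the O(N log N) bound comes from, here is a textbook radix-2 divide-and-conquer FFT in C (an illustrative sketch only; AFNI's csfft_cox() instead dispatches to fixed-size kernels such as fft2 through fft32768, as shown in the flow chart later):

```c
/* Textbook recursive radix-2 FFT sketch (NOT AFNI's csfft_cox).
 * Assumes n is a power of two. */
#include <complex.h>
#include <math.h>
#include <stdlib.h>

static void fft_rec(double complex *x, int n)
{
    if (n <= 1) return;
    double complex *even = malloc((n / 2) * sizeof *even);
    double complex *odd  = malloc((n / 2) * sizeof *odd);
    for (int i = 0; i < n / 2; i++) {   /* split into even/odd samples */
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    fft_rec(even, n / 2);               /* two half-size subproblems */
    fft_rec(odd,  n / 2);
    for (int k = 0; k < n / 2; k++) {   /* combine with twiddle factors */
        double complex t = cexp(-2.0 * I * M_PI * k / n) * odd[k];
        x[k]         = even[k] + t;
        x[k + n / 2] = even[k] - t;
    }
    free(even);
    free(odd);
}
```

Each level halves the problem and does O(N) combining work, giving the O(N log N) total.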

What is MPI?
- MPI stands for Message-Passing Interface.
- MPI is the most widely used approach to developing parallel systems.
- MPI specifies a library of functions that can be called from C or Fortran programs.
- The foundation of this library is a small group of functions that achieve parallelism by message passing.
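A minimal MPI program in C looks like this (a generic example, not AFNI code):

```c
/* Minimal MPI program: each process learns its rank and the
 * total process count. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                         /* shut down cleanly */
    return 0;
}
```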

What is Message Passing?
- Explicitly transmits data from one process to another
- A powerful and very general way of expressing parallelism
- Drawback: often called the "assembly language of parallel computing," because the programmer manages every transfer by hand
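For example, the core send/receive pair might be used like this (a generic sketch assuming two ranks, not AFNI code):

```c
/* Point-to-point message passing: rank 0 sends an array of
 * floats to rank 1. */
#include <mpi.h>

void exchange_example(int rank)
{
    float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    if (rank == 0) {
        /* (buffer, count, type, destination, tag, communicator) */
        MPI_Send(data, 4, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
}
```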

What does MPI do for us?
- Makes it possible to write libraries of parallel programs that are both portable and efficient
- Using these libraries hides many of the details of parallel programming
- This makes parallel computing much more accessible to professionals in all branches of science and engineering

Our Objective
- To parallelize the FFT part of AFNI
- In AFNI, every call to an FFT ultimately goes through the csfft_cox() function, whose structure is shown in the next slide

Flow Chart of csfft_cox
[Flow chart: csfft_cox starts and dispatches on transform length. Lengths of the form 3n and 5n are routed through fft_3dec and fft_5dec; other lengths go through fft_4dec to the power-of-two kernels fft2, fft4, fft8, fft16, fft32, fft64, fft128, fft256, fft512, fft1024, fft2048, fft4096, fft8192, fft16384, and fft32768; SCLINV is applied before csfft_cox returns.]

One-level parallelization
- There are several options for parallelizing the csfft_cox() function.
- At present we adopt the one-level parallelization method: the fft1024() calls made by fft4096(), and the fft2048() calls made by fft8192(), are executed in parallel (see the sketch below).
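A heavily simplified sketch of what such a one-level split could look like (hypothetical code, not the actual AFNI modification; fft1024() and the complex type are assumed to mirror AFNI's csfft.c, and the interleaving that the real 4-way decomposition requires is glossed over by treating each rank's quarter as contiguous):

```c
#include <mpi.h>

typedef struct { float r, i; } complex;   /* mirrors AFNI's complex */
extern void fft1024(complex *xc);         /* AFNI kernel, assumed */

/* Hypothetical one-level parallelization: fft4096 decomposes into
 * four fft1024 calls, so with 4 ranks each rank runs one of them
 * and rank 0 gathers the results. */
void parallel_fft4096_sketch(complex *x, int rank, int size)
{
    const int quarter = 1024;
    complex *mine = x + rank * quarter;

    fft1024(mine);   /* each rank transforms its quarter */

    if (rank != 0) {
        /* a complex is two floats, so send 2*quarter MPI_FLOATs */
        MPI_Send(mine, 2 * quarter, MPI_FLOAT, 0, rank, MPI_COMM_WORLD);
    } else {
        for (int r = 1; r < size; r++)
            MPI_Recv(x + r * quarter, 2 * quarter, MPI_FLOAT,
                     r, r, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* rank 0 would then perform the combining (twiddle) pass
         * over all 4096 points */
    }
}
```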

Correctness of our parallel code
- Applying FFT and then inverse FFT in sequence, we obtain a set of complex numbers that are almost identical to the ones in the original data file.
- The only differences come from floating-point storage error (the original serial code exhibits the same phenomenon).
- So, what is the speedup?
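A round-trip check of this kind can be expressed as a small helper (a generic sketch, not the authors' actual test harness):

```c
/* FFT followed by IFFT should reproduce the input to within
 * floating-point tolerance; compare element by element. */
#include <math.h>

int roundtrip_ok(const float *orig, const float *restored,
                 int n, float tol)
{
    for (int i = 0; i < n; i++)
        if (fabsf(orig[i] - restored[i]) > tol)
            return 0;   /* mismatch beyond rounding error */
    return 1;
}
```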

Two Kinds of Time
- Two kinds of time matter in analyzing our experimental results: CPU time and wall clock time (elapsed time).
- CPU time is the time spent in the computational part of the code.
- Wall clock time is the total elapsed time from the user's point of view.
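In C, the two can be measured side by side, for example with clock() for CPU time and MPI_Wtime() for wall clock time (a generic sketch, not AFNI's timing instrumentation):

```c
#include <stdio.h>
#include <time.h>
#include <mpi.h>

/* Times an arbitrary work function both ways. */
void timed_region(void (*work)(void))
{
    clock_t c0 = clock();          /* CPU time start */
    double  w0 = MPI_Wtime();      /* wall clock start */
    work();
    double cpu  = (double)(clock() - c0) / CLOCKS_PER_SEC;
    double wall = MPI_Wtime() - w0;
    printf("CPU time: %.2f s, wall clock time: %.2f s\n", cpu, wall);
}
```

On a shared machine the two can differ substantially, which is why both are reported in the experiments below.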

Experiments
Time analysis of the original code (4096 * 200,000 * 1)
[AFNI timing trace: user/system CPU time recorded at the beginning and end of the csfft run, followed by the total wall clock time; the numeric values were lost in transcription.]

Experiments --- Cont.
Time analysis of the code parallelized on 2 processors (4096 * 200,000 * 1)
[Same timing trace, with one beginning/ending entry per rank; numeric values lost in transcription.]

Experiments --- Cont.
Time analysis of the code parallelized on 4 processors (4096 * 200,000 * 1)
[Same timing trace, with four beginning/ending entries; numeric values lost in transcription.]

Analysis of speedup
[Table: CPU time and wall clock time for the original code, the 2-processor version (ranks 0 and 1), and the 4-processor version (ranks 0 through 3); the timing values were lost in transcription.]

Analysis of speedup --- Cont.
Two main reasons why we did not obtain the ideal speedup:
1. Competition among different users for the same CPUs.
2. Communication costs and other overhead make the ideal speedup unattainable on real machines.
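For reference, the ideal speedup can be made precise with the standard textbook definitions (added for clarity; not from the original slides):

\[
S(p) = \frac{T_1}{T_p}, \qquad T_p \approx \frac{T_1}{p} + T_{\mathrm{comm}}(p)
\]

Even with perfect load balance, the communication term \(T_{\mathrm{comm}}(p)\) keeps the measured speedup \(S(p)\) below the ideal value of \(p\).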

Conclusion
- We have parallelized the FFT part of the AFNI software package using MPI.
- The results show that for the FFT algorithm itself, we obtain a speedup of around 30 percent.
- Future work: increase the speedup of the FFT parallelization in the 3dDeconvolve program.

Questions?