Implementation of 2-D FFT on the Cell Broadband Engine Architecture
William Lundgren (Gedae), Kerry Barnes (Gedae), James Steed (Gedae)
HPEC 2009
Introduction
Processing is limited by either memory or CPU bandwidth
–The challenge is to achieve the practical limit of the constraining resource
–Large 2-D FFTs are limited by memory bandwidth
Automating the details of the implementation gives the developer more opportunity to optimize the structure of the algorithm
The Cell Broadband Engine is a good platform for studying efficient use of memory bandwidth
–Data movements are exposed and can be controlled, rather than hidden by a cache manager
Intel x86 and IBM Power processor clusters, Larrabee, Tilera, etc. have similar challenges
Cell/B.E. Memory Hierarchy
–Each SPE core has a 256 kB local storage (LS)
–Each Cell/B.E. chip has a large system memory
[Diagram: two Cell/B.E. chips (duplicate or heterogeneous subsystems), each with a PPE and eight SPEs with local stores on the EIB, connected through a bridge to system memory]
Effect of Tile Size on Throughput
[Chart: throughput vs. number of processors for varying tile sizes; times are measured within Gedae]
Limiting Resource is Memory Bandwidth
Simple algorithm: FFT, Transpose, FFT
–Scales well to large matrices
Data must be moved into and out of memory 6 times, for a total of
–2*4*512*512*6 = 12.6e6 bytes
–12.6e6 / 25.6e9 = 0.49 mSec
Total flops = 5*512*log2(512) * 2 * 512 = 23.6e6
–23.6e6 / 204.8e9 = 0.12 mSec
–Clearly the limiting resource is memory IO
For matrices up to 512x512, a faster algorithm is possible
–Reduces the memory IO to 2 transfers into and 2 out of system memory
–The expected performance, based on the practical memory IO bandwidth shown in the previous chart, is 62 gflops
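The two timing estimates above can be reproduced with a few lines of arithmetic. The following minimal C sketch (the 25.6 GB/s and 204.8 GFLOPS constants are the practical peak figures assumed in the slide) confirms that IO time dominates compute time:

/* Back-of-envelope check that the simple FFT, transpose, FFT algorithm
   is memory-bound on the Cell/B.E. Numbers taken from the slide. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double bytes = 2.0 * 4 * 512 * 512 * 6;          /* split complex floats, 6 passes */
    double io_time = bytes / 25.6e9;                 /* 25.6 GB/s memory bandwidth     */
    double flops = 5.0 * 512 * log2(512) * 2 * 512;  /* 5 N log2 N per FFT, 2*512 FFTs */
    double cpu_time = flops / 204.8e9;               /* 204.8 GFLOPS across 8 SPEs     */
    printf("IO:  %.3g bytes -> %.2f mSec\n", bytes, io_time * 1e3);   /* ~0.49 mSec */
    printf("CPU: %.3g flops -> %.2f mSec\n", flops, cpu_time * 1e3);  /* ~0.12 mSec */
    return 0;
}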
Overview of 4 Phase Algorithm (see the loop-structure sketch below)
Repeat C1 times
–Move C2/Procs columns to local storage
–FFT
–Transpose C2/Procs * R2/Procs matrix tiles and move to system memory
Repeat R1 times
–Move R2/Procs * C2/Procs tiles to local storage
–FFT
–Move R2/Procs columns to system memory
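A schematic of that loop structure on one SPE is sketched below. All kernel names are illustrative placeholders (extern stubs), not Gedae or Cell SDK calls; only the control flow mirrors the slide:

/* Loop-structure sketch of the 4 phase algorithm on one SPE. */
extern void move_columns_in(int i);      /* C2/Procs columns -> local storage */
extern void fft_local_rows(void);        /* FFT each row held in local store  */
extern void transpose_tiles_out(int i);  /* C2/Procs * R2/Procs tiles -> SM   */
extern void move_tiles_in(int i);        /* R2/Procs * C2/Procs tiles -> LS   */
extern void move_columns_out(int i);     /* R2/Procs columns -> SM            */

enum { C1 = 4, R1 = 4 };                 /* iteration counts from the slides  */

void two_d_fft_one_spe(void) {
    for (int c1 = 0; c1 < C1; c1++) {    /* first pass of 1-D FFTs */
        move_columns_in(c1);
        fft_local_rows();
        transpose_tiles_out(c1);
    }
    for (int r1 = 0; r1 < R1; r1++) {    /* second pass of 1-D FFTs */
        move_tiles_in(r1);
        fft_local_rows();
        move_columns_out(r1);
    }
}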
Optimization of Data Movement
We know that to achieve optimal performance we must use buffers with row length >= 2048 bytes
–This complexity is beyond the reach of many programmers
Approach:
–Use the Idea Language (Copyright Gedae, Inc.) to design the data organization and movement among memories
–Implement on the Cell processor using Gedae DF
–Future implementations will be fully automated from the Idea design
The following charts show the algorithm design using the Idea Language, together with the implementation diagrams
Multicores require the introduction of fundamentally new automation.
Row and Column Partitioning
Procs = number of processors (8)
–Slowest moving row and column partition index
Row decomposition
–R1 = Slow moving row partition index (4)
–R2 = Middle moving row partition index (8)
–R3 = Fast moving row partition index (2)
–Row size = Procs * R1 * R2 * R3 = 512
Column decomposition
–C1 = Slow moving column partition index (4)
–C2 = Middle moving column partition index (8)
–C3 = Fast moving column partition index (2)
–Column size = Procs * C1 * C2 * C3 = 512
Notation – Ranges and Comma Notation
range r1 = R1;
–r1 is an iterator that takes on the values 0, 1, …, R1-1
–#r1 equals R1, the size of the range variable
We define ranges as:
–range rp = Procs; range r1 = R1; range r2 = R2; range r3 = R3;
–range cp = Procs; range c1 = C1; range c2 = C2; range c3 = C3;
–range iq = 2; /* real, imag */
We define comma notation as:
–x[rp,r1] is a vector of size #rp * #r1, equivalent to x[rp * #r1 + r1]
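For readers unfamiliar with the notation, the small C program below (plain C with hypothetical names; the Idea language itself is not shown) demonstrates the flattening that comma notation denotes:

/* Demonstrates the comma notation from the slide: x[rp,r1] denotes a
   vector of size #rp * #r1 indexed as x[rp * #r1 + r1]. */
#include <stdio.h>

enum { PROCS = 8, R1 = 4 };               /* #rp = 8, #r1 = 4 */

/* flat index for the comma notation x[rp,r1] */
static int idx(int rp, int r1) { return rp * R1 + r1; }

int main(void) {
    float x[PROCS * R1];
    for (int rp = 0; rp < PROCS; rp++)    /* rp: slow moving index */
        for (int r1 = 0; r1 < R1; r1++)   /* r1: fast moving index */
            x[idx(rp, r1)] = 100.0f * rp + r1;
    printf("x[3,2] = %g (stored at flat offset %d)\n",
           x[idx(3, 2)], idx(3, 2));      /* prints 302 at offset 14 */
    return 0;
}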
512x512 Input Matrix
Input matrix is 512 by 512
–range r = R; range c = C;
–r decomposes as rp,r1,r2,r3; c decomposes as cp,c1,c2,c3
–range iq = 2; /* split complex (re, im) */
–x[iq][r][c]
Distribution of Data to SPEs
Decompose the input matrix by row for processing on 8 SPEs
–[rp]x1[iq][r1,r2,r3][c] = x[iq][rp,r1,r2,r3][c]
[Diagram: rows of the matrix in system memory distributed in contiguous blocks to SPE 0 through SPE 7]
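Because rp is the slowest moving row index, each SPE receives one contiguous block of R1*R2*R3 = 64 rows. A minimal sketch of that mapping (plain C; function names are hypothetical):

/* Row-block distribution implied by [rp]x1[...] = x[iq][rp,r1,r2,r3][c]. */
#include <stdio.h>

enum { R1 = 4, R2 = 8, R3 = 2 };          /* decomposition from the slides */

static int owner_spe(int row) { return row / (R1 * R2 * R3); }  /* 0..7  */
static int local_row(int row) { return row % (R1 * R2 * R3); }  /* 0..63 */

int main(void) {
    printf("row 200 -> SPE %d, local row %d\n",
           owner_spe(200), local_row(200));   /* row 200 -> SPE 3, local row 8 */
    return 0;
}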
Stream Submatrices from System Memory into SPE
Consider 1 SPE
Stream submatrices with R3 (2) rows from system memory into the local store of the SPE; Gedae will use list DMA to move the data (a sketch follows below)
–[rp]x2[iq][r3][c](r1,r2) = [rp]x1[iq][r1,r2,r3][c]
[Diagram: system memory streamed into local storage]
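A hedged sketch of what such a list DMA looks like at the SDK level is shown below. It uses the spu_mfcio.h interface; the buffer names and layout are assumptions for illustration, not the code Gedae generates. Note that each row transfer is 512 * 4 = 2048 bytes, the minimum row length identified earlier for full bandwidth:

/* Sketch of streaming one R3-row split-complex submatrix into local store
   with a single list DMA. ea_re and ea_im are assumed to share the same
   upper 32 address bits, since a DMA list supplies only the low 32 bits
   per element. */
#include <stdint.h>
#include <spu_mfcio.h>

#define C         512                     /* row length in floats          */
#define R3        2                       /* rows per submatrix            */
#define ROW_BYTES (C * sizeof(float))     /* 2048 bytes per row transfer   */

static volatile mfc_list_element_t dma_list[2 * R3] __attribute__((aligned(8)));
static volatile float ls_buf[2 * R3 * C] __attribute__((aligned(128)));

/* Gather R3 real rows then R3 imaginary rows, matching the x2[iq][r3][c]
   layout, and block until the transfer completes. */
void fetch_submatrix(uint64_t ea_re, uint64_t ea_im, unsigned int tag) {
    for (int i = 0; i < R3; i++) {
        dma_list[i].size      = ROW_BYTES;
        dma_list[i].eal       = (uint32_t)(ea_re + i * ROW_BYTES);
        dma_list[R3 + i].size = ROW_BYTES;
        dma_list[R3 + i].eal  = (uint32_t)(ea_im + i * ROW_BYTES);
    }
    mfc_getl(ls_buf, ea_re, dma_list,
             sizeof(mfc_list_element_t) * 2 * R3, tag, 0, 0);
    mfc_write_tag_mask(1u << tag);
    mfc_read_tag_status_all();            /* wait for the list to complete */
}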
FFT Processing
Process submatrices by computing an FFT on each row
–[rp]x3[r3][iq][c] = fft([rp]x2[iq][r3])
–Since the [c] dimension is dropped from the argument to the fft function, it is passed a complete row (vector)
–Stream indices are dropped; the stream processing is automatically implemented by Gedae
Notice that the real and imaginary data have been interleaved to keep each submatrix in contiguous memory: each submatrix contains 2 rows of real and 2 rows of imaginary data.
Create Buffer for Streaming Tiles to System Memory (Phase 1)
Collect R2 submatrices into a larger buffer with R2*R3 (16) rows
–[rp]x4[r2,r3][iq][c] = [rp]x3[r3][iq][c](r2)
–This process is repeated r1 (4) times to process the full 64 rows
There are now R2*R3 (16) rows in memory.
[Diagram: stream data into buffer]
Transpose Tiles to Contiguous Memory
A tile of real and a tile of imaginary data are extracted and transposed to contiguous memory (see the scalar sketch below)
–[rp]x5[iq][c2,c3][r2,r3](cp,c1) = [rp]x4[r2,r3][iq][cp,c1,c2,c3]
–This process is repeated r1 (4) times to process the full 64 rows
Now there is a stream of Cp*C1 (32) tiles. Each tile is IQ by R2*R3 by C2*C3 (2x16x16) data elements.
[Diagram: transpose data into stream of tiles]
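A scalar C sketch of one such tile extraction and transpose is shown below. The buffer names and the col0 parameter are illustrative, and a real SPE kernel would be SIMD-vectorized; only the index mapping follows the slide:

/* Extract one 2x16x16 tile from the x4 buffer ([r2,r3][iq][c] layout,
   iq interleaved per row) and transpose it to contiguous memory in the
   x5 layout [iq][c2,c3][r2,r3]. */
#include <assert.h>
#include <stdlib.h>

#define TILE 16                          /* R2*R3 = C2*C3 = 16 */

void transpose_tile(const float *x4, float *x5, int col0) {
    for (int iq = 0; iq < 2; iq++)
        for (int c = 0; c < TILE; c++)
            for (int r = 0; r < TILE; r++)
                x5[(iq * TILE + c) * TILE + r] =
                    x4[(r * 2 + iq) * 512 + col0 + c];
}

int main(void) {
    float *x4 = malloc(TILE * 2 * 512 * sizeof(float));
    float x5[2 * TILE * TILE];
    for (int i = 0; i < TILE * 2 * 512; i++) x4[i] = (float)i;
    transpose_tile(x4, x5, 32);          /* tile starting at column 32 */
    /* element (iq=1, c=5, r=7) came from x4 row 7, imag plane, column 37 */
    assert(x5[(1 * TILE + 5) * TILE + 7] == x4[(7 * 2 + 1) * 512 + 37]);
    free(x4);
    return 0;
}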
Stream Tiles into System Memory (Phase 1)
Stream tiles into a buffer in system memory. Each row contains all the tiles assigned to that processor.
–[rp]x6[r1,cp,c1][iq][c2,c3][r2,r3] = [rp]x5[iq][c2,c3][r2,r3](r1,cp,c1)
–The r1 iterations were created on the initial streaming of the r1,r2 submatrices
Now there is a buffer of R1*Cp*C1 (128) tiles, each IQ by R2*R3 by C2*C3.
[Diagram: stream of tiles into larger buffer in system memory, Tile 1 through Tile R1*Cp*C1-1]
Stream Tiles into System Memory (Phase 2)
Collect the buffers from each SPE into a full sized buffer in system memory.
–x7[rp,r1,cp,c1][iq][c2,c3][r2,r3] = [rp]x6[r1,cp,c1][iq][c2,c3][r2,r3]
–The larger matrix is created by Gedae and the pointers are passed back to the box on the SPE that is DMAing the data into system memory
The buffer is now Rp*R1*Cp*C1 (1024) tiles. Each tile is IQ by R2*R3 by C2*C3.
[Diagram: collect tiles into larger buffer in system memory, SPE 0 through SPE 7]
Stream Tiles into Local Store
Extract tiles from system memory to create 16 full sized columns (r index) in local store.
–[cp]x8[iq][c2,c3][r2,r3](c1,rp,r1) = x7[rp,r1,cp,c1][iq][c2,c3][r2,r3];
–All SPEs have access to the full buffer and extract data in a regular but scattered pattern (the address arithmetic is sketched below)
The buffer in local store is now Rp*R1 (32) tiles. Each tile is IQ by C2*C3 by R2*R3 (2x16x16). This scheme is repeated C1 (4) times on each SPE.
[Diagram: collect tiles into local store from regular but scattered locations in system memory, SPE 0 through SPE 7]
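The following C sketch makes the "regular but scattered" pattern concrete: for a given c1, SPE cp pulls the tiles at (rp, r1, cp, c1) for all rp and r1 out of the x7 tile buffer. Sizes are from the slides; the names are hypothetical:

/* Illustrative address arithmetic for the scattered tile gather. */
#include <stdio.h>

enum { RP = 8, R1 = 4, CP = 8, C1 = 4 };
enum { TILE_ELEMS = 2 * 16 * 16 };       /* IQ x R2*R3 x C2*C3 */

/* flat index of tile (rp, r1, cp, c1) in x7[rp,r1,cp,c1][...] */
static int tile_index(int rp, int r1, int cp, int c1) {
    return ((rp * R1 + r1) * CP + cp) * C1 + c1;
}

int main(void) {
    int cp = 2, c1 = 1;                  /* this SPE, this c1 iteration  */
    for (int rp = 0; rp < RP; rp++)      /* RP*R1 = 32 tiles gathered    */
        for (int r1 = 0; r1 < R1; r1++)
            printf("tile (%d,%d) at element offset %d\n", rp, r1,
                   tile_index(rp, r1, cp, c1) * TILE_ELEMS);
    return 0;
}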
Stream Tiles into Buffer
Stream tiles into a full length column (r index) buffer with a tile copy. Each row contains all the tiles assigned to that processor.
–[cp]x9[iq][c2,c3][rp,r1,r2,r3](c1) = [cp]x8[iq][c2,c3][r2,r3](c1,rp,r1)
The Rp*R1 (32) tiles now form full length columns of R (512) elements.
[Diagram: stream of tiles into full length column buffer]
Process Columns with FFT
Stream 2 rows into an FFT function that places the real and imaginary data into separate buffers. This allows reconstructing a split complex matrix in system memory. (A sketch follows below.)
–[cp]x10[iq][c3][r](c2) = fft([cp]x9[iq][c2,c3]);
[Diagram: stream 2 rows of real and imaginary data into the FFT function; place data into separate buffers on output]
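A compile-only C sketch of that step is shown below. fft_512 is a placeholder for any 512-point FFT kernel with split real/imaginary input and output (not a Gedae or SDK function); the x9 slice layout follows the [iq][c3][r] indexing above:

/* Each firing consumes one interleaved slice x9[iq][c3][r] (C3 = 2
   complex rows of length R = 512) and writes real and imaginary
   results to separate buffers, so a split complex matrix can be
   rebuilt in system memory. */
enum { R = 512, C3 = 2 };

extern void fft_512(const float *re_in, const float *im_in,
                    float *re_out, float *im_out);

void fft_columns_split(const float *x9, float *x10_re, float *x10_im) {
    for (int c3 = 0; c3 < C3; c3++)
        fft_512(&x9[(0 * C3 + c3) * R],   /* real row c3 (iq = 0) */
                &x9[(1 * C3 + c3) * R],   /* imag row c3 (iq = 1) */
                &x10_re[c3 * R],
                &x10_im[c3 * R]);
}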
Create Buffer for Streaming Rows to System Memory (Phase 1)
Collect C2 submatrices into a larger buffer with C2*C3 (16) rows
–[cp]x11[iq][c2,c3][r] = [cp]x10[iq][c3][r](c2)
–This process is repeated c1 (4) times to process the full 64 rows
There are now C2*C3 (16) rows in memory.
[Diagram: stream data into buffer]
Create Buffer for Streaming Rows to System Memory (Phase 2)
Collect C1 buffers into a larger buffer with C1*C2*C3 (64) rows
–[cp]x12[iq][c1,c2,c3][r] = [cp]x11[iq][c2,c3][r](c1)
There are now 2 buffers of C1*C2*C3 (64) rows of length R (512) in system memory.
[Diagram: stream submatrices into buffer]
Collection of Data from SPEs
Collect the output matrix by row (c index) from the 8 SPEs
–x13[iq][cp,c1,c2,c3][r] = [cp]x12[iq][c1,c2,c3][r]
Only the real plane is represented in the picture. Each SPE produces C1*C2*C3 = 64 rows.
[Diagram: rows collected from SPE 0 through SPE 7 into system memory]
512x512 Output Matrix
Output matrix is 512 by 512
–y[iq][c][r]
Results
Two days to design the algorithm using the Idea language
–66 lines of code
One day to implement on the Cell
–20 kernels
–14 parameter expressions
–Already a large savings in amount of code over difficult hand coding
–Future automation will eliminate coding at the kernel level altogether
Achieved 57* gflops out of the theoretical maximum 62 gflops
*Measured on a CAB board at 2.8 GHz, adjusted to 3.2 GHz. The measured algorithm is slightly different and the result is expected to increase to 60 gflops.
Gedae Status
12 months into an 18-20 month repackaging of Gedae
–Introduced the algebraic features of the Idea language used to design this algorithm
–Completed support for hierarchical memory
–Embarking now on code overlays and large scale heterogeneous systems
–Will reduce the memory footprint to enable 1024 x 1024 2-D FFTs