Large Multicore FFTs: Approaches to Optimization
Sharon Sacco and James Geraci
MIT Lincoln Laboratory
HPEC Workshop, 24 September 2008

This work is sponsored by the Department of the Air Force under Air Force contract FA C. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Outline
- Introduction
  - 1D Fourier Transform
  - Mapping 1D FFTs onto Cell
  - 1D as 2D Traditional Approach
- Technical Challenges
- Design
- Performance
- Summary

1D Fourier Transform

    g_j = sum_{k=0}^{N-1} f_k * e^(-2*pi*i*j*k/N)

This is a simple equation. A few people spend a lot of their careers trying to make it run fast.
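For reference, the equation transcribes directly into code. This is a minimal O(N^2) sketch of the definition above, not the optimized Cell implementation the talk describes; the function name dft() is ours.

    /* Naive O(N^2) DFT: a direct transcription of
     * g_j = sum_k f_k * e^(-2 pi i j k / N).
     * Reference only; an FFT computes the same result in O(N log N). */
    #include <complex.h>
    #include <math.h>

    void dft(const float complex *f, float complex *g, int n)
    {
        for (int j = 0; j < n; j++) {
            float complex acc = 0.0f;
            for (int k = 0; k < n; k++) {
                double ang = -2.0 * M_PI * (double)j * (double)k / (double)n;
                acc += f[k] * (float complex)(cos(ang) + I * sin(ang));
            }
            g[j] = acc;
        }
    }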

Mapping 1D FFT onto Cell

Cell FFTs can be classified by their memory requirements:
- Small FFTs fit into a single LS
- Medium FFTs fit into multiple LS memories
- Large FFTs must use XDR memory as well as LS memory

Medium and large FFTs require careful memory transfers.

1D as 2D Traditional Approach

[Figure: the 1D data viewed as a 2D array, with the matrix of central twiddle weights w^0, w^1, w^2, w^3, ...]

1. Corner turn to compact columns
2. FFT on columns
3. Corner turn to original orientation
4. Multiply (elementwise) by central twiddles
5. FFT on rows
6. Corner turn to correct data order

- 1D as 2D FFT reorganizes data a lot
  - Timing jumps when used
- Can reduce memory for twiddle tables
- Only one FFT needed
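The six steps above are the classic "four-step" decomposition. The code below is a minimal serial sketch of it, assuming N = rows * cols, row-major storage, and hypothetical helpers fft1d() and transpose() provided elsewhere; the real Cell design replaces the corner turns with the DMA strategies discussed later.

    /* 1D-as-2D FFT sketch: N = rows * cols, input x in row-major order,
     * tmp is scratch of the same size.  The result ends up in tmp in
     * natural order after the final corner turn. */
    #include <complex.h>
    #include <math.h>

    void fft1d(float complex *x, int n);   /* small contiguous FFT (hypothetical) */
    void transpose(float complex *dst, const float complex *src, int rows, int cols);

    void fft_1d_as_2d(float complex *x, float complex *tmp, int rows, int cols)
    {
        int n = rows * cols;
        transpose(tmp, x, rows, cols);           /* 1. corner turn: columns made contiguous */
        for (int c = 0; c < cols; c++)
            fft1d(tmp + (long)c * rows, rows);   /* 2. FFT on columns */
        transpose(x, tmp, cols, rows);           /* 3. corner turn to original orientation */
        for (int r = 0; r < rows; r++)           /* 4. elementwise central twiddles w^(r*c) */
            for (int c = 0; c < cols; c++) {
                double ang = -2.0 * M_PI * (double)r * (double)c / (double)n;
                x[(long)r * cols + c] *= (float complex)(cos(ang) + I * sin(ang));
            }
        for (int r = 0; r < rows; r++)
            fft1d(x + (long)r * cols, cols);     /* 5. FFT on rows */
        transpose(tmp, x, rows, cols);           /* 6. corner turn to correct data order */
    }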

Outline
- Introduction
- Technical Challenges
  - Communications
  - Memory
  - Cell Rounding
- Design
- Performance
- Summary

Communications

- Bandwidth to XDR memory is 25.3 GB/s
- EIB bandwidth is 96 bytes / cycle
- Each SPE connection to the EIB is 50 GB/s

Minimizing XDR memory accesses is critical:
- Leverage the EIB
- Coordinating SPE communication is desirable
  - Need to know the SPEs' relative geometry

Memory

- XDR memory is much larger than the 1M-point FFT requirement
- Each SPE has 256 KB of local store memory
- Each Cell has 2 MB of local store memory in total

Need to rethink algorithms to leverage the memory:
- Consider local store from both the individual and the collective SPE point of view

Cell Rounding

[Figure: IEEE 754 round-to-nearest vs. Cell truncation. Truncation always drops the low bits, so each result can lose up to 1 bit and the average error is biased rather than centered.]

- Cell single-precision arithmetic truncates results rather than rounding to nearest
- The cost to correct the basic binary operations (add, multiply, and subtract) is prohibitive
- Accuracy should instead be improved by minimizing the steps an algorithm takes to produce a result
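The effect of truncation can be demonstrated on any machine with C99 rounding-mode control. This small host-side demo (our own, not from the talk) accumulates the same series under round-to-nearest and round-toward-zero, the latter mimicking the Cell SPE's single-precision behavior; compile without fast-math options so the rounding mode is honored.

    /* Compare round-to-nearest with truncation (round toward zero). */
    #include <fenv.h>
    #include <stdio.h>

    static float accumulate(int mode, int n)
    {
        fesetround(mode);
        float s = 0.0f;
        for (int i = 1; i <= n; i++)
            s += 1.0f / (float)i;        /* each divide and add rounds in 'mode' */
        fesetround(FE_TONEAREST);
        return s;
    }

    int main(void)
    {
        int n = 1 << 20;
        double ref = 0.0;
        for (int i = 1; i <= n; i++)
            ref += 1.0 / (double)i;      /* double-precision reference */

        float nearest = accumulate(FE_TONEAREST, n);
        float trunc   = accumulate(FE_TOWARDZERO, n);
        printf("reference:        %.8f\n", ref);
        printf("round-to-nearest: %.8f (error %.2e)\n", nearest, nearest - ref);
        printf("truncation:       %.8f (error %.2e)\n", trunc, trunc - ref);
        return 0;
    }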

Outline
- Introduction
- Technical Challenges
- Design
  - Using Memory Well
    - Reducing Memory Accesses
    - Distributing on SPEs
    - Bit Reversal
    - Complex Format
  - Computational Considerations
- Performance
- Summary

FFT Signal Flow Diagram and Terminology

[Figure: signal flow diagram of a size-16 FFT, labeling a butterfly, a block, a radix-2 pair, and a stage.]

- Size 16 can illustrate the concepts for large FFTs
  - The ideas scale well, and size 16 is "drawable"
- This is the "decimation in frequency" data flow
- Where the weights are applied determines the algorithm
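To pin down the terms, here is the loop structure of an iterative radix-2 decimation-in-frequency FFT (a generic textbook form, not the Cell kernel itself): the outer loop walks the stages, the middle loop walks the blocks within a stage, and the inner loop performs the butterflies.

    /* Iterative radix-2 DIF FFT; n must be a power of 2.
     * Output is left in bit-reversed order (see the bit reversal slide). */
    #include <complex.h>
    #include <math.h>

    void fft_dif_radix2(float complex *x, int n)
    {
        for (int half = n / 2; half >= 1; half /= 2) {           /* stage */
            for (int start = 0; start < n; start += 2 * half) {  /* block */
                for (int k = 0; k < half; k++) {                 /* butterfly */
                    float complex a = x[start + k];
                    float complex b = x[start + k + half];
                    double ang = -2.0 * M_PI * (double)k / (2.0 * (double)half);
                    float complex w = (float complex)(cos(ang) + I * sin(ang));
                    x[start + k]        = a + b;
                    x[start + k + half] = (a - b) * w;   /* weight applied after the subtract: DIF */
                }
            }
        }
    }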

Reducing Memory Accesses

- Columns are loaded in strips that fit in the total Cell local store
- The FFT algorithm processes 4 columns at a time to leverage the SIMD registers (sketched below)
  - This requires code separate from the row FFTs
- Data reorganization requires SPE-to-SPE DMAs
- No bit reversal
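Processing 4 columns at a time works because adjacent columns hold four independent FFTs whose butterflies are identical, so one 4-wide SIMD operation advances all four. The sketch below uses a plain struct in place of the SPE's 128-bit vector type (on the SPE this would be vec_float4 with intrinsics such as spu_madd); the function names are ours.

    /* One Gentleman-Sande butterfly applied to 4 columns at once,
     * with real and imaginary parts in split format. */
    typedef struct { float v[4]; } f4;   /* one element from each of 4 columns */

    static f4 f4_add(f4 a, f4 b)      { f4 r; for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + b.v[i]; return r; }
    static f4 f4_sub(f4 a, f4 b)      { f4 r; for (int i = 0; i < 4; i++) r.v[i] = a.v[i] - b.v[i]; return r; }
    static f4 f4_scale(f4 a, float s) { f4 r; for (int i = 0; i < 4; i++) r.v[i] = a.v[i] * s;      return r; }

    void butterfly_4cols(f4 *are, f4 *aim, f4 *bre, f4 *bim, float wre, float wim)
    {
        f4 sre = f4_add(*are, *bre), sim = f4_add(*aim, *bim);   /* a + b */
        f4 dre = f4_sub(*are, *bre), dim = f4_sub(*aim, *bim);   /* a - b */
        *are = sre;
        *aim = sim;
        *bre = f4_sub(f4_scale(dre, wre), f4_scale(dim, wim));   /* (a - b) * w,    */
        *bim = f4_add(f4_scale(dre, wim), f4_scale(dim, wre));   /* complex product */
    }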

1D FFT Distribution with Single Reorganization

[Figure: butterfly stages across SPEs with a single "reorganize" step in the middle.]

One approach is to load everything needed for the first part of the computation onto a single SPE. After a single reorganization, each SPE owns an entire block and can complete the computations on its points.

1D FFT Distribution with Multiple Reorganizations

[Figure: butterfly stages across SPEs with a "reorganize" step after each stage.]

A second approach is to divide groups of contiguous butterflies among the SPEs and reorganize after each stage until each SPE owns a full block.
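The slides do not spell out the exchange pattern, but a pairwise (hypercube-style) exchange matches the "reorganize after each stage" description and the DMA counts on the next slide: stage s pairs SPE r with SPE r XOR 2^s, each partner trading half of its local data. This is our sketch, with exchange_halves() standing in for the actual SPE-to-SPE DMAs.

    /* Pairwise per-stage reorganization across P SPEs (P a power of 2):
     * each SPE does log2(P) exchanges of N/(2P) elements each, giving
     * P * log2(P) exchanges in total. */
    void exchange_halves(int me, int partner, int keep_low_half);  /* DMA stand-in */

    void reorganize_all_stages(int me, int num_spes)
    {
        for (int bit = 1; bit < num_spes; bit <<= 1) {   /* one pass per stage */
            int partner = me ^ bit;
            /* the lower-ranked SPE keeps the low half of the exchanged data */
            exchange_halves(me, partner, me < partner);
        }
    }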

Selecting the Preferred Reorganization

                      Single Reorganization       Multiple Reorganizations
  Number of SPEs   Exchanges   Data in 1 DMA    Exchanges   Data in 1 DMA
        2              2           N/4              2           N/4
        4             12           N/16             8           N/8
        8             56           N/64            24           N/16

Single reorganization:
  Number of exchanges: P * (P - 1)
  Number of elements exchanged: N * (P - 1) / P

Multiple reorganizations:
  Number of exchanges: P * log2(P)
  Number of elements exchanged: (N / 2) * log2(P)

(N is the number of elements in SPE memory; P is the number of SPEs. Typical N is 32k complex elements.)

Evaluation favors multiple reorganizations:
- Fewer DMAs have less bus contention; single reorganization exceeds the number of busses
- DMA overhead (~0.3 µs) is minimized
- Programming is simpler for multiple reorganizations

Column Bit Reversal

[Figure: binary row numbers, showing which row pairs exchange places under bit reversal.]

- Bit reversal of the columns can be implemented by the order of processing rows and double buffering
  - Both rows of a reversal pair are read into local store and then written to each other's memory location
- Exchanging rows for bit reversal has a low cost
  - DMA addresses are table driven
  - The bit reversal table can be very small
- Row FFTs are conventional 1D FFTs
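The table of DMA addresses follows directly from the bit-reversal permutation. This small generator (ours, assuming the 1M-point FFT is viewed as 1024 x 1024, so row indices have 10 bits) lists each exchange pair once; rows equal to their own reversal stay in place, which is why the table stays small.

    /* Print the row-exchange pairs for bit reversal of 2^bits rows. */
    #include <stdio.h>

    static unsigned bitrev(unsigned x, int bits)
    {
        unsigned r = 0;
        for (int i = 0; i < bits; i++) {
            r = (r << 1) | (x & 1);   /* move bits of x into r in reverse order */
            x >>= 1;
        }
        return r;
    }

    int main(void)
    {
        const int bits = 10;                  /* 1024 rows */
        for (unsigned r = 0; r < (1u << bits); r++) {
            unsigned p = bitrev(r, bits);
            if (p > r)                        /* list each pair once */
                printf("swap rows %4u <-> %4u\n", r, p);
        }
        return 0;
    }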

Complex Format

Two common formats for complex data:
- interleaved: real 0, imag 0, real 1, imag 1, ...
- split: real 0, real 1, ... followed by imag 0, imag 1, ...

- The complex format for the user should be standard
- The internal format should benefit the algorithm
  - The internal format is opaque to the user
  - Internal format conversion is lightweight
- Interleaved complex format reduces the number of DMAs
- SIMD units need split format for complex arithmetic
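Since DMAs favor interleaved data and the SIMD units favor split data, the natural internal step is a conversion between the two. A minimal scalar sketch follows; on the SPE this would use shuffle intrinsics rather than scalar loops.

    /* Interleaved <-> split conversion for n complex elements. */
    void interleaved_to_split(const float *in, float *re, float *im, int n)
    {
        for (int i = 0; i < n; i++) {
            re[i] = in[2 * i];        /* real parts:      in[0], in[2], ... */
            im[i] = in[2 * i + 1];    /* imaginary parts: in[1], in[3], ... */
        }
    }

    void split_to_interleaved(const float *re, const float *im, float *out, int n)
    {
        for (int i = 0; i < n; i++) {
            out[2 * i]     = re[i];
            out[2 * i + 1] = im[i];
        }
    }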

Outline
- Introduction
- Technical Challenges
- Design
  - Using Memory Well
  - Computational Considerations
    - Central Twiddles
    - Algorithm Choice
- Performance
- Summary

Central Twiddles

[Figure: the matrix of central twiddles for the 1M-point FFT, w^(j*k) for j, k = 0 ... 1023, from w^0 up to w^(1023*1023).]

Central twiddles are a significant part of the design:
- They can take as much memory as the input data
- Reading them from memory could increase FFT time by up to 20%
- For 32-bit FFTs the central twiddles can be computed as needed
  - Trigonometric identity methods require double precision; the next-generation Cell should make this the method of choice
  - Direct sine and cosine algorithms are long
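Computing a central twiddle on demand is a one-liner once double precision is available. A sketch for the 1M-point case (our code; N = 1024 * 1024, with w = e^(-2*pi*i/N)):

    /* Central twiddle w^(j*k) for the 1M-point FFT, computed on the fly. */
    #include <complex.h>
    #include <math.h>

    float complex central_twiddle(unsigned j, unsigned k)
    {
        const double n = 1048576.0;                                  /* N = 2^20 */
        unsigned long long e = (unsigned long long)j * k % 1048576;  /* exponent mod N */
        double ang = -2.0 * M_PI * (double)e / n;
        return (float complex)(cos(ang) + I * sin(ang));  /* needs double-precision sin/cos */
    }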

Algorithm Choice

Computational butterflies:

  Cooley-Tukey:       a' = a + b * w^k
                      b' = a - b * w^k

  Gentleman-Sande:    a' = a + b
                      b' = (a - b) * w^k

- Cooley-Tukey has a constant operation count
  - 2 muladds to compute each result in each stage
- Gentleman-Sande varies widely in operation count
  - 1 to 3 operations for each result
- The DC term has the same accuracy in both
- The Gentleman-Sande worst term has 50% more roundoff error when a fused multiply-add is available
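Written out in C, the fused multiply-add structure is easy to see: each Cooley-Tukey output component is two chained muladds, while Gentleman-Sande must complete the subtract before its multiply can begin.

    /* The two radix-2 butterflies. */
    #include <complex.h>

    void cooley_tukey_bfly(float complex *a, float complex *b, float complex w)
    {
        float complex t  = *b * w;    /* b * w^k     */
        float complex hi = *a + t;    /* a + b * w^k */
        float complex lo = *a - t;    /* a - b * w^k */
        *a = hi;
        *b = lo;
    }

    void gentleman_sande_bfly(float complex *a, float complex *b, float complex w)
    {
        float complex s = *a + *b;    /* a + b         */
        float complex d = *a - *b;    /* a - b         */
        *a = s;
        *b = d * w;                   /* (a - b) * w^k */
    }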

Radix Choice

Cooley-Tukey radix-4 butterfly:

  t1 = x_l * w^z
  t2 = x_k * w^(z/2)
  t3 = x_m * w^(3z/2)
  s0 = x_j + t1      a0 = x_j - t1
  s1 = t2 + t3       a1 = t2 - t3
  y_j = s0 + s1
  y_k = s0 - s1
  y_l = a0 - i * a1
  y_m = a0 + i * a1

- Number of operations for 1 radix-4 stage (real or imaginary): 9 (3 mul, 3 muladd, 3 add)
- Number of operations for 2 radix-2 stages (real or imaginary): 6 (6 muladd)
- What counts for accuracy is how many operations lie between the input and a particular result
- Higher radices reuse computations but do not reduce the amount of arithmetic needed
- Fused multiply-add instructions are more accurate than a multiply followed by an add
- Radix 2 will give the best accuracy
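A direct transcription of the radix-4 butterfly above makes the operation count easy to verify (w1 = w^(z/2), w2 = w^z, w3 = w^(3z/2) are precomputed twiddles; I is the imaginary unit from complex.h):

    /* Cooley-Tukey radix-4 butterfly, as written on the slide. */
    #include <complex.h>

    void radix4_bfly(float complex *xj, float complex *xk,
                     float complex *xl, float complex *xm,
                     float complex w1, float complex w2, float complex w3)
    {
        float complex t1 = *xl * w2;             /* x_l * w^z      */
        float complex t2 = *xk * w1;             /* x_k * w^(z/2)  */
        float complex t3 = *xm * w3;             /* x_m * w^(3z/2) */
        float complex s0 = *xj + t1, a0 = *xj - t1;
        float complex s1 = t2 + t3,  a1 = t2 - t3;
        *xj = s0 + s1;
        *xk = s0 - s1;
        *xl = a0 - I * a1;
        *xm = a0 + I * a1;
    }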

Outline
- Introduction
- Technical Challenges
- Design
- Performance
- Summary

Estimating Performance

Timing estimates typically cannot include all factors:
- Computation is estimated from the minimum number of instructions
- I/O timings are based on bus speed and the amount of data
- Experience is a guide for the efficiency of I/O and computations

I/O estimate:
  Number of byte transfers to/from XDR memory: (minimum)
  Bus speed to XDR: 25.3 GB/s
  Estimated efficiency: 80%
  Minimum I/O time: 1.7 ms

Computation estimate:
  Number of operations:
  Maximum FLOPS: 205 GFLOPS
  Estimated efficiency: 85%
  Minimum computation time: 0.6 ms (8 SPEs)

1M FFT estimate (without full ordering): 2 ms
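The slide's numbers can be checked from first principles. The sketch below is our own back-of-the-envelope version, with two loudly labeled assumptions: the byte count (elided on the slide) is taken as two full read+write passes over the 8 MB data set, and the operation count uses the standard 5*N*log2(N) FFT estimate.

    /* Reproduce the slide's I/O and computation estimates (assumptions noted). */
    #include <stdio.h>

    int main(void)
    {
        double n       = 1048576.0;              /* 1M complex points               */
        double bytes   = 2.0 * 2.0 * 8.0 * n;    /* ASSUMED: 2 passes, read+write,
                                                    8 bytes per complex element     */
        double io_bw   = 25.3e9 * 0.80;          /* XDR bandwidth at 80% efficiency */
        double flops   = 5.0 * n * 20.0;         /* ASSUMED: 5*N*log2(N), log2(N)=20 */
        double compute = 204.8e9 * 0.85;         /* 8 SPEs at 85% efficiency        */

        printf("minimum I/O time:         %.2f ms\n", bytes / io_bw   * 1e3);  /* ~1.7 ms */
        printf("minimum computation time: %.2f ms\n", flops / compute * 1e3);  /* ~0.6 ms */
        return 0;
    }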

Preliminary Timing Results

[Figure: measured FFT times versus predictions for 4, 8, and 16 SPEs.]

- Timing results are close to predictions
  - 4 SPEs is about a factor of 4 from the prediction
  - 8 and 16 SPEs are closer to the prediction
- Timings were performed on Mercury CTES Dual Cell Blades

Summary

- A good FFT design must consider the hardware features
  - Optimize memory accesses
  - Understand how different algorithms map to the hardware
- The design needs to be flexible in its approach
  - "One size fits all" isn't always the best choice
  - Size will be a factor
- Estimates of minimum time should be based on the hardware characteristics
- A 1M-point FFT is difficult to write, but possible