Mapping Signal Processing Kernels to Tiled Architectures
Henry Hoffmann, James Lebak [Presenter], MIT Lincoln Laboratory
HPEC 2004, 28 Sep 2004


Mapping Signal Processing Kernels to Tiled Architectures
Henry Hoffmann, James Lebak [Presenter]
Massachusetts Institute of Technology Lincoln Laboratory
Eighth Annual High-Performance Embedded Computing Workshop (HPEC 2004), 28 Sep 2004
This work is sponsored by the Defense Advanced Research Projects Agency under an Air Force contract. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Credits
Implementations on RAW:
– QR Factorization: Ryan Haney
– CFAR: Edmund Wong, Preston Jackson
– Convolution: Matt Alexander
Research sponsor:
– Robert Graybill, DARPA PCA Program

Tiled Architectures
Monolithic single-chip architectures are becoming rare in the industry:
– Designs are becoming increasingly complex
– Long wires cannot propagate across the chip in one clock cycle
Tiled architectures offer an attractive alternative:
– Multiple simple tiles (or "cores") on a single chip
– A simple interconnection network (short wires)
Examples exist in both industry and research:
– The IBM Power4 and Sun UltraSPARC IV each have two cores
– AMD and Intel are expected to introduce dual-core chips in mid-2005
– The DARPA Polymorphous Computer Architecture (PCA) program

PCA Block Diagrams
[Block diagrams: TRIPS (University of Texas), Smart Memories (Stanford), and RAW (MIT); tiles contain an ALU, register file, instruction cache, program counter, and data cache]
All of these are examples of tiled architectures. In particular, RAW is a 4x4 array of tiles:
– A small amount of memory per tile
– A scalar operand network allows delivery of operands between functional units
– There are plans for a 1024-tile RAW fabric
This research aims to develop programming methods for large tile arrays.

Outline
– Introduction
– Stream Algorithms and Tiled Architectures
– Mapping Signal Processing Kernels to RAW
– Conclusions

Stream Algorithms for Tiled Architectures
Decoupled Systolic Architecture:
– M(R) edge tiles are allocated to memory management
– P(R) inner tiles perform computation systolically, using registers and the static network
Stream algorithm efficiency:
E(N,R) = C(N) / [T(N,R) · (P(R) + M(R))]
where N = problem size, R = edge length of the tile array, C(N) = number of operations, T(N,R) = number of time steps, and P(R) + M(R) = total number of tiles.
Compute efficiency condition:
lim E(σR, R) = 1 as σ, R → ∞, where σ = N/R
Stream algorithms achieve high efficiency by:
– Partitioning the problem into sub-problems
– Decoupling memory access from computation
– Hiding communication latency
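As a sanity check on the efficiency definition, a small Python sketch (hypothetical helper names; the counts for C, T, P, and M are the ones from the matrix-multiply example on the next slide) evaluates E(N,R) and shows it rising toward 1 as R and σ = N/R grow:

```python
def stream_efficiency(C, T, P, M):
    """E(N,R) = C(N) / (T(N,R) * (P(R) + M(R)))."""
    return C / (T * (P + M))

def matmul_efficiency(N, R):
    # Counts for the matrix-multiply stream algorithm (next slide):
    # C(N) = 2N^3 operations; T(N,R) = 2N time steps per phase times
    # N^2/R^2 phases, plus 3R cycles of pipeline fill/drain;
    # P(R) = R^2 compute tiles; M(R) = 2R memory tiles.
    return stream_efficiency(2 * N**3, 2 * N * (N**2 // R**2) + 3 * R,
                             R**2, 2 * R)

for R in (4, 16, 64):
    print(R, round(matmul_efficiency(100 * R, R), 3))  # grows toward 1 with R
```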

Example Stream Algorithm: Matrix Multiply
Calculate C = A·B:
– Partition A into N/R row blocks and B into N/R column blocks
– In each phase, compute R² elements of C; this involves 2N operations per tile, and there are N²/R² phases
Computations can be pipelined:
– Cost is 2R cycles to start and drain the pipeline
– R cycles to output the result
Efficiency calculation (N = problem size, R = edge length of the tile array, σ = N/R):
E(N,R) = 2N³ / [(2N·(N²/R²) + 3R) · (R² + 2R)] = (2σ³ / (2σ³ + 3)) · (R / (R + 2))
lim E(σR, R) = 1 as σ, R → ∞
The matrix multiply achieves high efficiency as the array size (R) and data size (N) grow.
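The phase structure above can be mimicked in plain Python (an illustrative sketch, not the systolic implementation): each phase produces one R x R block of C, matching the R² compute tiles.

```python
def stream_matmul(A, B, R):
    """Blocked N x N matrix multiply: one R x R block of C per phase."""
    N = len(A)
    assert N % R == 0
    C = [[0.0] * N for _ in range(N)]
    for bi in range(0, N, R):            # row block of A
        for bj in range(0, N, R):        # column block of B -> one phase
            for i in range(bi, bi + R):  # each (i, j) maps to one compute tile
                for j in range(bj, bj + R):
                    C[i][j] = sum(A[i][k] * B[k][j] for k in range(N))
    return C
```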

Matrix Multiply Efficiency
– Assume a 4x4 decoupled systolic architecture, or RAW surrounded by memory tiles (maximum efficiency = 66%)
– Scale the number of overall tiles
– A smaller percentage of tiles devoted to memory leads to higher efficiency
Stream algorithms achieve high efficiency on large tile arrays; we need to identify algorithms that can be recast as stream algorithms.

Analyzing the Matrix Multiply
Consider the matrix multiply computation in more detail. To compute c_ij, row i of A is multiplied by column j of B:
– 2N inputs required
– 2N operations required
Examine the directed acyclic graph (DAG) for the matrix multiply. For each output produced:
– There are W inputs required, where W is O(N)
– Input i is used Q_i times, where Q_i is O(N); these uses are intermediate products
[DAG figure: inputs a11, a12, a21, a22 and b11, b21, b12, b22 feeding outputs c11, c12, c21, c22]
The matrix multiply is an example of an algorithm with a constant ratio of input data (W) to intermediate products (Q).
A constant W/Q implies a degree of scale-invariance: communication and computation maintain the same ratio as N increases, so the implementation can efficiently use more tiles on large problems.
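The W and Q counts above can be verified by building the matrix-multiply DAG explicitly (a hypothetical helper for small N):

```python
from collections import defaultdict

def matmul_dag(N):
    """Map each input element of A and B to the set of outputs c_ij it feeds."""
    uses = defaultdict(set)
    for i in range(N):
        for j in range(N):
            for k in range(N):
                uses[('a', i, k)].add((i, j))   # a_ik feeds c_ij
                uses[('b', k, j)].add((i, j))   # b_kj feeds c_ij
    return uses

uses = matmul_dag(4)
# Every input is reused Q_i = N times; each output consumes W = 2N inputs.
print(min(len(s) for s in uses.values()), max(len(s) for s in uses.values()))
```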

Outline
– Introduction
– Stream Algorithms and Tiled Architectures
– Mapping Signal Processing Kernels to RAW: QR factorization, convolution, CFAR, FFT
– Conclusions

RAW Test Board
Kernels are written to run on the prototype RAW board:
– 4x4 RAW chip, 100 MHz
MIT software includes a cycle-accurate simulator:
– Code written for the simulator runs easily on the board
– Initial tests show good agreement between the simulator and the board
An expansion connector allows direct access to the RAW static network:
– Firmware re-programming is required
– An external FPGA board streams data into and out of RAW
– The design streams data into ports on the corner tiles
– The interface is not yet complete, so present results are from the simulator
Typical RAW configuration for a stream algorithm on the prototype board:
– I/O tiles: stream data to and from the outside world
– Memory tiles: store intermediate values; stream data to and from computation tiles
– Computation tiles: perform computation systolically, using the static network and registers

QR Factorization Mapping
Algorithm to compute A = QR. For each block of columns:
1. Compute Givens rotations
2. Apply the Givens rotations to A
[Figure: data flow during rotation computation and rotation application for a matrix A with six columns; tiles are labeled I/O, Memory, Compute, or Unused; memory tiles store R, store and pass the rotations, and store the updated A]
I/O tiles are only used at the start and end of the process; in between, data is stored in the memory tiles.
The figure shows the flow for odd-numbered column blocks; for even-numbered column blocks, data flows from the bottom memory tiles to the top of the array.
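The compute-rotations / apply-rotations split can be illustrated with a scalar Givens QR in Python (a textbook sketch using column blocks of width 1, not the tiled mapping itself):

```python
import math

def givens(a, b):
    """Rotation (c, s) that zeroes b: applying [[c, s], [-s, c]] to (a, b)
    yields (r, 0) with r = hypot(a, b)."""
    r = math.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

def qr_givens(A):
    """Givens-rotation QR of an m x n list-of-lists matrix; returns R."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    for j in range(n):                    # one "column block" at a time
        for i in range(m - 1, j, -1):     # compute rotation zeroing R[i][j]
            c, s = givens(R[i - 1][j], R[i][j])
            for k in range(j, n):         # apply rotation to both rows
                x, y = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * x + s * y
                R[i][k] = -s * x + c * y
    return R
```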

Complex QR Factorization Performance
The QR factorization has a constant ratio of input data (W) to intermediate products (Q).
[Table: for each array edge length R, the number of memory tiles M(R), compute tiles P(R), and the projected matrix size N_80 required to achieve 80% efficiency on the compute tiles P(R)]
The QR factorization efficiency scales to 100% as array and data size increase.

MIT Lincoln Laboratory HPEC JML 28 Sep 2004 Compute TilesMemory and I/O Tiles Input Vector Filter Input Vector Filter Result n+k-1… 10 k-1 … 10 n-1 … 10 n+k-1 … 10 k-1 10 n-1 … 10 Convolution (Time Domain) Mapping Filter coefficients distributed cyclically to tiles –Each compute tile convolves the input with a subset of the filter –Assume n (data length) > k (filter length) Each stream is a different convolution operation –In multichannel signal processing applications we rarely perform just one convolution 12 of 16 tiles used for computation –Maximum 75% efficiency Stream 0 Stream … … 11 Tile 0Tile 1Tile 2 Tile 3 Tile 4Tile 5

Convolution Performance
Convolution achieves good performance in the RAW simulator:
– Longer filters and input vectors are more efficient
– Longer input vectors are also more easily mapped to more processors

CFAR Mapping
Constant False-Alarm Rate (CFAR) detection. For each output:
– There are W = O(N_cfar) inputs required
– Input i is used Q_i = O(1) times
[Figure: sliding window over cells C(i,j,k), with G guard cells and N_cfar training cells on either side, producing the threshold T(i,j,k)]
For a long stream, CFAR requires 7 operations per cell. Consider dividing one stream over R tiles:
– 7/R operations per tile
– N communication steps per tile
– Communication quickly dominates computation
Instead, consider parallel processing of streams.
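For concreteness, a 1-D cell-averaging CFAR sketch (hypothetical parameter names, following the slide's G guard cells and N_cfar training cells on either side of the cell under test):

```python
def cfar_thresholds(cells, n_cfar, guard, scale=1.0):
    """For each cell, average the n_cfar training cells on each side
    (skipping `guard` guard cells) to form a detection threshold.
    Edge cells use whatever training cells exist."""
    out = []
    for i in range(len(cells)):
        train = []
        for j in range(i - guard - n_cfar, i - guard):      # leading window
            if 0 <= j < len(cells):
                train.append(cells[j])
        for j in range(i + guard + 1, i + guard + 1 + n_cfar):  # lagging window
            if j < len(cells):
                train.append(cells[j])
        out.append(scale * sum(train) / len(train) if train else 0.0)
    return out
```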

MIT Lincoln Laboratory HPEC JML 28 Sep 2004 N rg Range Gates CFAR Mapping RAW Chip Data cube is streamed into RAW using the static network Corner input ports receive data Each quadrant processes data from one port One row of range data (“one stream”) is processed by a single tile Results gathered to corner tile and output C(i,j,k) GN cfar G T(i,j,k) Constant False-Alarm Rate (CFAR) Detection For each output: –There are W = O(N cfar ) inputs required –The input i is used Q i = O(1) times Goal is to move data through the chip as fast as possible Constant False-Alarm Rate (CFAR) Detection For each output: –There are W = O(N cfar ) inputs required –The input i is used Q i = O(1) times Goal is to move data through the chip as fast as possible This implementation does not scale with array size R –As R increased, there would be a greater latency involved in using tiles in the center of the chip

CFAR Performance
CFAR achieves an efficiency of 11-15%:
– Efficiency on conventional architectures, similarly optimized, is 5-10%
– The RAW implementation benefits from large off-chip bandwidth
Compute-tile efficiency does not scale to 100% as it does for stream algorithms (matrix multiply, convolution, QR).
[Plot: performance for streams that fit in cache vs. streams that do not]

Data Flow for the FFT
Cooley-Tukey radix-2 FFT: for each of log2(N) stages, compute N/2 "butterflies".
Radix-2 butterfly:
– 2 complex inputs a and b, and a precomputed weight ω
– Outputs a + ωb and a − ωb
– 10 real operations
For each output produced:
– There are W inputs required, where W is O(N)
– Input i is used Q_i times, where Q_i is O(log2 N); these are intermediate computations
W/Q is O(N / log2 N):
– As N increases, communication requirements grow faster than computation
– Therefore we expect that the radix-2 FFT cannot scale efficiently
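The butterfly recurrence can be written out as a short recursive radix-2 FFT in Python (a textbook sketch, not the RAW mapping); each loop iteration is one butterfly with precomputed weight w:

```python
import cmath

def fft_radix2(x):
    """Recursive Cooley-Tukey radix-2 FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return x[:]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # precomputed weight
        out[k] = even[k] + w * odd[k]           # butterfly output a + w*b
        out[k + n // 2] = even[k] - w * odd[k]  # butterfly output a - w*b
    return out
```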

Mapping the Radix-2 FFT to a Tile Array
For each butterfly:
– 4 + (R−1) cycles to clock the inputs across the array
– 10/R computations per tile
– When R = 2, the tiles are used efficiently: computation (5 cycles) can be overlapped with communication (5 cycles)
– When R > 2, the tiles cannot be used efficiently: the latency to clock the inputs exceeds the number of operations per tile
For each stage, pipeline the N/2 butterflies on R rows or columns.
Overall efficiency is limited to 50% (2x2 compute tiles plus 4 memory tiles).
[Figure: butterfly data flow for stages 1, 2, and 3]

Mapping the Radix-R FFT to a Tile Array
Idea: use a radix-R FFT algorithm on an R by R array.
A radix-R FFT algorithm:
– Uses log_R(N) stages
– Computes N/R radix-R butterflies per stage
Implement the radix-R butterfly with an R-point DFT:
– W and Q both scale with R for a DFT
– This allows us to use more processors for each stage
– The algorithm still becomes inefficient as R gets "too large"
– Efficiency limit for the radix-4 algorithm: 56%
– Efficiency limit for the radix-8 algorithm: 54%
Radix-4 implementation:
– Distribute a radix-4 butterfly over 4 processors in a row or column
– Perform 4 butterflies in parallel
– 8 memory tiles required
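A radix-4 butterfly expressed as a twiddled 4-point DFT, the unit that is distributed over four processors above (an illustrative sketch; `w` holds the three nontrivial twiddle factors, which are 1 in the first stage):

```python
def dft4(x, w):
    """Radix-4 butterfly: twiddle the four inputs, then take a 4-point DFT
    built from two radix-2 stages."""
    a, b, c, d = x[0], x[1] * w[0], x[2] * w[1], x[3] * w[2]
    t0, t1 = a + c, a - c                  # first radix-2 stage
    t2, t3 = b + d, (b - d) * -1j          # second pair, with the -j factor
    return [t0 + t2, t1 + t3, t0 - t2, t1 - t3]
```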

Radix-4 FFT Algorithm Performance
The radix-4 FFT algorithm achieves high throughput on the 4x4 RAW:
– Efficiency is comparable to FFTW on the G4 and Xeon
– RAW efficiency stays high for larger FFT sizes
[Plot: G4 and Xeon FFT results vs. the simulated radix-4 FFT on 4x4 RAW plus 8 memory tiles]

Classifying Kernels
Kernels may be classified by the ratio W/Q:
– Constant ratio: W = O(N), Q_i = O(N); e.g., matrix multiply, QR, convolution. Stream algorithms exist, and efficiency approaches 1 as R and N/R increase.
– Sub-linear ratio: W = O(N), Q_i < O(N); e.g., FFT. These kernels require a trade-off between efficiency and scalability.
– Linear ratio: W = O(N), Q_i = O(1); e.g., CFAR. It is difficult to find an efficient or scalable implementation.
[Figure: W/Q vs. data set size N for the constant, sub-linear, and linear classes]
Examining W/Q gives insight into whether a stream algorithm exists for the kernel.
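The three classes can be caricatured numerically (an illustrative sketch only; the constants are arbitrary, and only the growth rates matter):

```python
import math

def wq_ratio(kernel, N):
    """Illustrative W/Q for the three kernel classes, taking W = 2N inputs
    per output and the per-class reuse counts Q_i from the slide."""
    W = 2 * N
    Q = {'matmul': N,            # Q_i = O(N): constant W/Q
         'fft': math.log2(N),    # Q_i = O(log2 N): W/Q = O(N / log2 N)
         'cfar': 1}[kernel]      # Q_i = O(1): W/Q = O(N)
    return W / Q
```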

Conclusions
Stream algorithms map efficiently to tiled arrays:
– Efficiency can approach 100% as data size and array size increase
– Implementations on the RAW simulator show the efficiency of this approach
– We will be moving the implementations from the simulator to the board
The communication-to-computation ratio W/Q gives insight into the mapping process:
– A constant W/Q seems to indicate that a stream algorithm exists
– When W/Q is greater than a constant, it is hard to efficiently use more processors
This research could form the basis for a methodology of programming tile arrays; more research and formalism are required.