
1 An Ultra-High Performance Architecture for Embedded Defense Signal and Image Processing Applications
September 24, 2003

Authors:
- Ken Cameron (ken@clearspeed.com)
- Mike Koch (michael.j.koch@lmco.com)
- Simon McIntosh-Smith (simon@clearspeed.com)
- Rick Pancoast (rick.pancoast@lmco.com)
- Joe Racosky (joseph.r.racosky@lmco.com)
- Stewart Reddaway (sfr@wscapeinc.com)
- Pete Rogina (pete@wscapeinc.com)
- David Stuttard (daves@clearspeed.com)

2 Agenda
- Overview
- Architecture
- Benchmarking
- Applications
- Summary
- Q & A

3 Overview
- Work objective
- Provide a benchmark for Array Processing Technology
  – Enable embedded processing decisions to be accelerated for upcoming platforms (radar and others)
  – Provide cycle-accurate simulation and hardware validation for key algorithms
  – Extrapolate 64-PE pilot chip performance to the WorldScape 256-PE product under development
  – Support customers' strategic technology investment decisions
- Share results with industry
  – A new standard for performance AND performance per watt

4 Architecture
- ClearSpeed's Multi-Threaded Array Processor (MTAP) architecture
- Fully programmable in C
- Hardware multi-threading
- Extensible instruction set
- Fast processor initialization/restart

5 Architecture
- ClearSpeed's Multi-Threaded Array Processor (MTAP) architecture (cont.)
- Scalable internal parallelism
  – Array of Processor Elements (PEs)
  – Compute and bandwidth scale together
  – From 10s to 1,000s of PEs
  – Built-in PE fault tolerance and resiliency
- High performance, low power
  – ~10 GFLOPS/Watt
- Multiple high-speed I/O channels

6 Architecture
- Processor Element structure
- ALU plus accelerators: integer MAC, FPU
- High-bandwidth, multi-port register file
- Closely coupled SRAM for data
- High-bandwidth per-PE DMA: PIO, SIO
  – 64- to 256-bit wide paths
- High-bandwidth inter-PE communication

7 Architecture
- Processor Element structure (cont.)
- 256 PEs at 200 MHz:
  – 200 GBytes/s bandwidth to PE memory
  – 100 GFLOPS
  – Many GBytes/s DMA to external data
- Supports multiple data types:
  – 8-, 16-, 24-, 32-bit, ... fixed point
  – 32-bit IEEE floating point

8 Architecture
- Architecture DSP features:
- Multiple operations per cycle
  – Data-parallel array processing
  – Internal PE parallelism
  – Concurrent I/O and compute
  – Simultaneous mono and poly operations
- Specialized per-PE execution units
  – Integer MAC, floating-point unit
- On-chip memories
  – Instruction and data caches
  – High-bandwidth PE "poly" memories
  – Large scratchpad "mono" memory
- Zero-overhead looping
- Concurrent mono and poly operations

9 Architecture
- Software development:
- The MTAP architecture is simple to program:
  – Architecture and compiler co-designed
  – Uniprocessor, single compute thread programming model
  – Array processing: one processor, many execution units (see the sketch below)
- RISC-style pipeline
  – Simple, regular, 32-bit, 3-operand instruction set
- Extensible instruction set
- Large instruction and data caches
- Built-in debug: single-step, breakpoint, watch
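
To make the "one processor, many execution units" model concrete, here is a minimal, purely conceptual sketch in plain C. This is not ClearSpeed Cn code; the array sizes and names are hypothetical, and the explicit loop over PEs only models what the hardware does implicitly, with every PE executing the same single instruction stream on its own "poly" data while "mono" data is shared scalar state.

/* Conceptual model only: a scaled-down, plain-C sketch of the MTAP
 * programming model. On the real processor the outer loop over PEs
 * does not exist -- every PE executes the same instruction on its own
 * "poly" data simultaneously. Names and sizes are illustrative. */
#include <stdio.h>

#define NUM_PES        8     /* hypothetical array size           */
#define POINTS_PER_PE  4     /* per-PE ("poly") data              */

int main(void)
{
    float poly_data[NUM_PES][POINTS_PER_PE];  /* per-PE local SRAM */
    float mono_scale = 0.5f;                  /* shared scalar     */

    /* Initialise per-PE data. */
    for (int pe = 0; pe < NUM_PES; pe++)
        for (int i = 0; i < POINTS_PER_PE; i++)
            poly_data[pe][i] = (float)(pe * POINTS_PER_PE + i);

    /* One logical "poly" operation: conceptually a single instruction,
     * executed by every PE on its own slice of the data. */
    for (int pe = 0; pe < NUM_PES; pe++)
        for (int i = 0; i < POINTS_PER_PE; i++)
            poly_data[pe][i] *= mono_scale;

    printf("PE 0, element 0 = %f\n", poly_data[0][0]);
    return 0;
}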

10 Architecture
- Software Development Kit (SDK):
  – C compiler, assembler, libraries, visual debugger, etc.
  – Instruction-set and cycle-accurate simulators
  – Available for Windows, Linux and Solaris
  – Development boards and early silicon available from Q4 2003
- Application development support:
  – Reference source code available for various applications
  – Consultancy direct with ClearSpeed's experts
  – Consultancy and optimized code from ClearSpeed's partners

11 Architecture
- ClearSpeed debugger (screenshot not included in transcript)

12 Benchmarking
- FFTs on the 64-PE pilot chip and the 256-PE processor
- Benchmark is FFT, Complex Multiply, IFFT
  – FFT benchmark example is 1024-point, complex
  – IEEE 32-bit floating point
- Each FFT mapped onto a few PEs
  – Higher throughput than one FFT at a time
- In-situ assembler codes
  – A single 1024-point FFT or IFFT is spread over 8 PEs
  – 128 complex points per PE
  – Output is bit-reverse mapped in poly RAM (see the index sketch below)
  – IFFT input mapping matches the bit-reversed FFT output
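
The bit-reversed output ordering mentioned above is the standard result of a radix-2 transform. For reference, a minimal plain-C sketch of the index permutation for a 1024-point (10-bit) transform; this is illustrative only, and on the pilot chip the reordered points are distributed as 128 complex points per PE across 8 PEs, with the IFFT consuming data in the same order.

/* Illustrative only: the bit-reversal permutation a radix-2 FFT
 * produces for n = 1024 (log2(n) = 10 bits). */
#include <stdio.h>

static unsigned bit_reverse(unsigned x, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

int main(void)
{
    const unsigned bits = 10;                 /* 1024-point transform */
    for (unsigned i = 0; i < 8; i++)          /* print a few examples */
        printf("natural index %4u -> bit-reversed %4u\n",
               i, bit_reverse(i, bits));
    return 0;
}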

13 Benchmarking
- FFTs on the 64-PE pilot chip and the 256-PE processor (cont.)
- 128 complex points per PE enables
  – High throughput
  – Enough poly RAM buffer space for multi-threading to overlap I/O and processing
- Complex multiply code
  – Multiplies corresponding points in two arrays (sketched below)
  – Enables convolutions and correlations to be done via FFTs
- Performance measured on the Cycle Accurate Simulator (CASIM)
  – Individual FFT
  – IFFT
  – Complex Multiply
  – Iterative loops including I/O
  – FFT–CM–IFFT (using a fixed reference function)
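
A minimal plain-C reference for the point-wise complex multiply (CM) stage; this is illustrative only, since the benchmarked kernel is hand-written MTAP assembler operating on poly RAM. Each output point costs 4 multiplies and 2 adds, the 6 FLOPs per point used in the performance accounting later, and multiplying the FFT of a signal by the pre-stored FFT of a reference and inverse-transforming is what turns the FFT–CM–IFFT chain into convolution or correlation.

/* Illustrative reference for the complex-multiply stage:
 * y[i] = a[i] * b[i] for corresponding points of two complex arrays.
 * 4 multiplies + 2 adds = 6 FLOPs per point. */
#include <complex.h>
#include <stddef.h>

void complex_multiply(const float complex *a,
                      const float complex *b,
                      float complex *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float ar = crealf(a[i]), ai = cimagf(a[i]);
        float br = crealf(b[i]), bi = cimagf(b[i]);
        y[i] = (ar * br - ai * bi) + (ar * bi + ai * br) * I;
    }
}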

14 Benchmarking
- Detail on the current FFT codes
- 4 complete radix-2 butterfly instructions in microcode (a DIT butterfly is sketched below)
  – Decimation-in-time (DIT): twiddle multiply on one input
  – DIT variant: multiply inserts an extra 90-degree rotation
  – Decimation-in-frequency (DIF): multiply on one output
  – DIF variant: extra 90-degree rotation inserted
  – These instructions take 16 cycles
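
For reference, a plain-C sketch of the radix-2 DIT butterfly that such an instruction implements (illustrative only, not the microcode itself). The twiddle multiply on one input costs 6 real FLOPs and the complex add/subtract pair another 4, i.e. 10 FLOPs per butterfly; with (n/2)·log2(n) butterflies per n-point transform this reproduces the 5·n·log2(n) FLOP count used on the performance slide.

/* Radix-2 decimation-in-time butterfly (plain-C sketch; on the MTAP
 * this is a single 16-cycle microcoded instruction).
 * Twiddle multiply on one input (6 FLOPs) plus an add/subtract pair
 * (4 FLOPs): 10 FLOPs per butterfly. */
#include <complex.h>

void dit_butterfly(float complex *a, float complex *b, float complex w)
{
    float complex t = w * (*b);   /* twiddle multiply on one input */
    *b = *a - t;                  /* lower output                  */
    *a = *a + t;                  /* upper output                  */
}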

15 Benchmarking
- Detail on the current FFT codes (cont.)
- Twiddle factors are pre-stored in poly RAM
  – Use of the 90-degree variants halves twiddle factor storage
  – The IFFT uses the same twiddle factors
  – The fastest codes require 68 twiddle factors per PE
- I/O routines transfer data between poly and mono RAM
- Mono data is entirely in natural order
- A complete set of 1024 complex points per FFT per transfer
- Most load/store cycles are done in parallel with the butterfly arithmetic
- Moving data between PEs costs cycles

16 Benchmarking
- Measured performance
- Performance is measured in GFLOPS at 200 MHz (a worked check of the FFT row follows the table)
  – A single FFT or IFFT is counted as 5·n·log2(n) FLOPs
  – Complex Multiply is counted as 6 FLOPs per multiply

Code | Cycles | GFLOPS (256 PE) | Batch size (256 PE) | GFLOPS (64-PE pilot chip) | Batch size (pilot chip)
FFT | 13.9k | 23.5 | 32 | 5.9 | 8
IFFT | 14.5k | 22.6 | 32 | 5.6 | 8
CM | 1.7k | 23.1 | | 4.5 |
FFT-CM-IFFT | 30.1k | 23.1 | 32 | 5.8 | 8
FFT-CM-IFFT with I/O | 32k | 21.7 | 32 | 5.4 | 8
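
As a sanity check on the FFT row (my arithmetic; the cycle count, batch size and FLOP-counting convention come from the slides): a batch of 32 simultaneous 1024-point transforms completes in 13.9k cycles at 200 MHz.

/* Worked check of the 256-PE FFT row above. */
#include <stdio.h>

int main(void)
{
    double flops_per_fft = 5.0 * 1024.0 * 10.0;  /* 5*n*log2(n) = 51,200 */
    double batch         = 32.0;                 /* FFTs per pass        */
    double cycles        = 13.9e3;
    double clock_hz      = 200e6;

    double seconds = cycles / clock_hz;          /* ~69.5 us             */
    double gflops  = flops_per_fft * batch / seconds / 1e9;
    printf("%.1f GFLOPS\n", gflops);             /* ~23.6, close to the
                                                    23.5 in the table    */
    return 0;
}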

17 Benchmarking
- Measured performance (cont.)
- I/O transfers are to on-chip SRAM (mono memory)
  – For the pilot chip, this I/O is ~75% of processing time
  – Pilot chip bandwidth to off-chip memory is 50% lower
- The 256-PE product will have ~3 GB/s off-chip mono transfers
  – Data transfer will be 100% overlapped with processing

18 Benchmarking
- Performance analysis (raw and per watt)
- 256 PEs at 200 MHz: peak of 102.4 GFLOPS
  – 64-PE pilot chip at 200 MHz: peak of 25.6 GFLOPS
- Current code achieves 23.5 GFLOPS (23% of peak)
- In many applications, performance per watt counts most
  – The 256-PE product will be ~5 Watts, based on GDSII measurements and other supporting data
- 256-PE extrapolated FFT performance (200 MHz)
  – ~23.5 GFLOPS becomes ~4.7 GFLOPS/Watt
  – ~0.9 M FFTs/sec/Watt (1024-point complex)

19 Benchmarking
- Application timing diagram (figure not included in transcript)
- Cycle-accurate simulations show that the FFT-CM-IFFT compute and I/O phases completely overlap when double buffering is employed (the double-buffering pattern is sketched below)
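
To illustrate the double-buffering idea, here is a generic sketch in plain C, not ClearSpeed code; on the MTAP the DMA engines and hardware multi-threading perform the load/compute/store phases concurrently on alternating poly RAM buffers, and the I/O and compute routines below are hypothetical stubs.

/* Generic double-buffering pattern (illustrative only): while one
 * buffer is being processed, the other is being filled and drained,
 * so compute and I/O overlap. */
#include <stdio.h>

static void start_load (int buf, int blk) { printf("load  block %d -> buf %d\n", blk, buf); }
static void process    (int buf)          { printf("FFT-CM-IFFT on buf %d\n", buf); }
static void start_store(int buf, int blk) { printf("store block %d <- buf %d\n", blk, buf); }
static void wait_io    (void)             { /* wait for DMA completion */ }

int main(void)
{
    const int num_blocks = 4;
    int cur = 0, nxt = 1;

    start_load(cur, 0);                       /* prime the pipeline    */
    wait_io();
    for (int b = 0; b < num_blocks; b++) {
        if (b + 1 < num_blocks)
            start_load(nxt, b + 1);           /* I/O overlaps compute  */
        process(cur);
        start_store(cur, b);
        wait_io();                            /* both transfers done   */
        int t = cur; cur = nxt; nxt = t;      /* swap buffers          */
    }
    return 0;
}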

20 Benchmarking
- Other size FFTs
- Spreading one FFT over more PEs increases data movement
  – Cost: lower peak GFLOPS
  – Benefit: enables larger FFTs, or lower latency per FFT
- Example (1024-point complex FFT on the 64-PE pilot chip): reducing the batch size from 8 (8 PEs/FFT) to 1 (64 PEs/FFT) changes performance from ~5.7 to ~2.5 GFLOPS, but compute time reduces by ~2.6x

21 Benchmarking
- Other size FFTs (cont.)
- 8k-point FFT performance extrapolation:
  – ~2.9 GFLOPS on the pilot chip (batch size = 1)
  – ~11.6 GFLOPS on the 256-PE product (batch size = 4)
- 128-point FFT performance extrapolation:
  – ~31 GFLOPS on the 256-PE product (batch size = number of PEs)
- The 256-PE product under development can handle FFTs up to 32k points without intermediate I/O
  – FFTs larger than 8k points may nevertheless use intermediate I/O for speed
  – Intermediate I/O enables unlimited FFT sizes

22 Benchmarking
- Library FFT development
- WorldScape is developing optimized FFT (and other) libraries
  – Start with components callable from Cn
  – Within-PE multiple short FFTs
  – Inter-stage twiddle multiplies
  – Across-PE data reorganization
- A VSIPL interface is planned to maximize portability
- Seeking industry and university partners

23 Benchmarking
- Faster in the future
- New microcoded instructions
  – Better handling of the FP units' pipeline for multiple butterflies
  – Split into Startup, multiple Middle, and End instructions
  – Each Middle instruction does an extra butterfly, faster than now
  – Separate multiplies from add/subtracts
  – Enables higher FFT radices
  – Saves the first-stage multiplies
- Speedup
  – Within-PE: ~100%, to ~62 GFLOPS for 256 PEs (~60% of peak)
  – 1024-point FFT on 8 PEs: ~60%, to ~36 GFLOPS

24 Benchmarking
- Technology roadmap
  – Faster clock speeds
  – Faster/smaller memory libraries
  – More PEs per chip
  – Scalable I/O and chip-to-chip connections
  – More memory per PE and more mono memory
  – Custom cell implementation: <50% size and power (leverage from parallelism)
  – Future fabrication processes: embedded DRAM, smaller line widths, ...

25 Applications
- FFT/Pulse Compression application
- 1K complex 32-bit IEEE floating-point FFT/IFFT
- Lockheed Martin "Baseline"
  – Mercury MCJ6 with 400-MHz PowerPC 7410 daughtercards with AltiVec enabled, using one compute node
  – Pulse Compression implementation using VSIPL: FFT, complex multiply by a stored reference FFT, IFFT

26 Applications
- FFT/Pulse Compression application (cont.)
- ClearSpeed CASIM (Cycle Accurate SIMulator)
  – 64 PEs simulated at 200 MHz
  – 1K FFT or IFFT on 8 PEs, with 128 complex points per PE
  – Pulse Compression using custom instructions: FFT, complex multiply by a stored reference FFT, IFFT
  – 32-bit IEEE standard floating-point computations

27 Applications
- Performance comparison

Function | Mercury G4 (LM measured) | Mercury G4 (published) | WorldScape/ClearSpeed 64-PE chip | Speedup vs. Mercury G4 (LM measured) | Speedup vs. Mercury G4 (published)
FFT | 39.0 µs | 25.0 µs* | 8.7 µs | 4.5x | 2.9x
Complex Multiply | 10.8 µs | N/A | 1.1 µs | 7.7x | –
IFFT | 34.7 µs | N/A | 9.2 µs | 3.8x | –
Pulse Compression | 152.2 µs | N/A | 20.0 µs | 7.6x | –

* Mercury published figure for the time to complete a 1K FFT on a 400-MHz PowerPC 7410

28 Applications
- Power comparison

Processor | Clock | Power | FFTs/sec/Watt | PC/sec/Watt
Mercury PowerPC 7410 | 400 MHz | 8.3 Watts | 3,052 | 782.2
WorldScape/ClearSpeed 64-PE chip | 200 MHz | 2.0 Watts** | 56,870 | 24,980
Speedup | | | 18.6x | 31.9x

** Measured using the Mentor Mach PA tool

29 Applications
- Power comparison (cont.)

WorldScape/ClearSpeed 64-PE chip | Total cycles**
FFT | 14,067
Complex Multiply | 1,723
IFFT | 14,654
Pulse Compression | 32,026

** As simulated using CASIM at 200 MHz
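
These cycle counts line up with the per-operation times on the performance-comparison slide: the 64-PE chip runs a batch of 8 operations at once (8 PEs per 1K FFT), so time per operation is cycles / 200 MHz / 8. A quick consistency check (my arithmetic, using the figures above):

/* Relates the CASIM cycle counts to the per-operation times quoted
 * earlier: batch of 8 operations, 200 MHz clock. */
#include <stdio.h>

int main(void)
{
    const char  *name[]   = { "FFT", "Complex Multiply", "IFFT", "Pulse Compression" };
    const double cycles[] = { 14067, 1723, 14654, 32026 };
    const double clock_hz = 200e6, batch = 8.0;

    for (int i = 0; i < 4; i++) {
        double us = cycles[i] / clock_hz / batch * 1e6;
        printf("%-18s %.1f us\n", name[i], us);  /* ~8.8, 1.1, 9.2, 20.0 us,
                                                    matching the table    */
    }
    return 0;
}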

30 Summary
- Massively parallel array processing architecture (MTAP) presented
- Programmable in C
- Robust IP and tools
- Applicability to a wide range of applications
- Inherently scalable
  – On chip, chip-to-chip
- Low power
- A new standard for performance AND performance per watt

31 Questions and Answers

