A Programmable Single Chip Digital Signal Processing Engine
MAPLD 2005
Paul Chiang, MathStar Inc.
Pius Ng, Apache Design Solutions
MAPLD 2005/206 Chiang
Slide 2: Presentation Outline
- Spaceborne signal processing tasks
- FPOA architecture highlights: programmability and expandability
- System partition on FPOA device
- Spatial processing: 5x5 filter solution
- Temporal processing: motion estimation
- Internal bus and I/O throughput
- Resource utilization and future expansion
Slide 3: A System of Digital Signal Processing
- Input Data
- Data Extraction: mux/de-mux
- Spatial or Temporal Processing: average filter, min/max select, spatial edge filter, temporal difference filter
- Frequency or Time Domain Processing: time domain low/high/bandpass filter, frequency transformation, frequency domain low/high/bandpass filter
- Feature Extraction: apply the equation that defines the feature, check threshold
- Characterization: analyze and characterize signals
Slide 4: Processing Requirements
- High computation requirement on the following basic operations: add/sub and mul/mac
- Mixed control functions such as loop control and decision making
- High I/O bandwidth to balance processing against data input/output
- Large and fast temporary memory space to facilitate real-time processing
- Fast, programmable, direct data transfer to enable massively parallel processing
Slide 5: FPOA Architecture Summary
- Heterogeneous array of 16-bit Silicon Objects: MAC, ALU, Truth Tables, Register File, Internal RAM
- Single clock cycle execution for all objects
- Homogeneous 2-layer programmable interconnect mesh
- Tightly integrated data and control flow
- Integrated DDRII RLDRAM & SRAM controllers
- High-speed I/O at device boundaries: SerDes, LVDS, HSTL
Slide 6: Reconfigurable Interconnect Network
- Each link consists of 16 data bits, 1 valid bit, and 4 separate control bits
- Nearest neighbors: range = 1 (N/E/S/W + diagonal)
- Party lines: single cycle range = hop to 3 (skip 1GHz
- Extra clock cycles for digital retiming: 1 extra cycle reaches a 25-object neighborhood; more clock cycles reach the entire chip
Slide 7: FPOA Solution
- Four GPIO ports with 44-bit I/O at 100 MHz, that is, 17.6 Gbit/s
- Two 250 MHz DDR 32-bit external memories with 32 Gbit/s bandwidth
- 400 Silicon Objects running at 1 GHz
  - ALU: add/sub and combinational logic
  - MAC: mul/mac
  - Register File (RF): fast distributed data storage
  - Internal RAM (IRAM): intermediate data storage
- Party lines and muxes to support a flexible internal bus as well as dedicated connections
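The two I/O bandwidth figures on this slide can be reproduced from port count, bus width, and clock rate. The sketch below is only a check of that arithmetic; it assumes that DDR transfers data on both clock edges and that the 32 Gbit/s figure is the aggregate over both memory interfaces. The function names are illustrative and not part of any FPOA tool.

```python
def gpio_bandwidth_bps(ports, bits_per_port, clock_hz):
    """Aggregate GPIO bandwidth: one transfer per port per clock cycle."""
    return ports * bits_per_port * clock_hz


def ddr_bandwidth_bps(interfaces, bus_bits, clock_hz):
    """Aggregate DDR bandwidth, assuming two transfers per clock cycle."""
    return interfaces * bus_bits * clock_hz * 2


# Figures taken from the slide: 4 ports x 44 bits at 100 MHz,
# and 2 memories x 32 bits at 250 MHz DDR.
gpio = gpio_bandwidth_bps(4, 44, 100_000_000)
ddr = ddr_bandwidth_bps(2, 32, 250_000_000)

print(f"GPIO: {gpio / 1e9} Gbit/s, external memory: {ddr / 1e9} Gbit/s")
```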
Slide 8: Example FPOA Partition
Slide 9: 5x5 Convolution Filter
Apply the filter operation to a 2D data array, D[0:m-1, 0:n-1], with a 5x5 2D mask, W[0:4, 0:4]

for i = 2; i <= m - 3; i++
    for j = 2; j <= n - 3; j++
        temp = 0;
        for k = -2; k < 3; k++
            for l = -2; l < 3; l++
                temp = D[i+k, j+l] * W[k+2, l+2] + temp
            end_of_l
        end_of_k
        Y[i, j] = temp;
    end_of_j
end_of_i
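For readers who want to run the loop nest, here is a minimal Python sketch of the same operation on plain nested lists. It visits every position where the full 5x5 mask fits inside the array (rows 2 through m-3 and columns 2 through n-3). Leaving the untouched border outputs at zero is an assumption made for this sketch; the slide does not say how borders are handled. The name `filter_5x5` is illustrative.

```python
def filter_5x5(D, W):
    """Apply a 5x5 mask W to a 2D array D (lists of lists).

    Only positions where the whole mask fits are computed;
    border outputs stay at 0 (an assumption of this sketch).
    """
    m, n = len(D), len(D[0])
    Y = [[0] * n for _ in range(m)]
    for i in range(2, m - 2):          # i = 2 .. m-3 inclusive
        for j in range(2, n - 2):      # j = 2 .. n-3 inclusive
            acc = 0
            for k in range(-2, 3):
                for l in range(-2, 3):
                    acc += D[i + k][j + l] * W[k + 2][l + 2]
            Y[i][j] = acc
    return Y


# Small usage example: an 8x8 array of ones with an all-ones mask.
D = [[1] * 8 for _ in range(8)]
W = [[1] * 5 for _ in range(5)]
Y = filter_5x5(D, W)
```

With these inputs every interior output is the sum of 25 ones, which makes the 25 multiply-accumulate operations per output sample easy to see.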
Slide 10: Computation Requirements
- Assuming an m by n 2D data array and a 5x5 mask, there are 25 multiply and add (MAC) operations for each filtered sample
- The whole convolution filter operation requires 25 * m * n MAC operations
- With standard 720x480 image data at 30 frames per second, the convolution filter operation requires about 259 MMAC per second
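The 259 MMAC per second figure is the product of mask size, frame size, and frame rate. The short sketch below reproduces it and, as an extra data point that is not on the slide, also counts only the interior samples where the full mask fits. Variable names are illustrative.

```python
MASK_TAPS = 5 * 5                 # 25 MACs per filtered sample
width, height, fps = 720, 480, 30

# Count used on the slide: every sample of every frame is filtered.
macs_per_frame = MASK_TAPS * width * height
macs_per_second = macs_per_frame * fps

# Slightly lower count if the two-sample border on each side is skipped.
interior_macs_per_second = MASK_TAPS * (width - 4) * (height - 4) * fps

print(f"{macs_per_second / 1e6:.1f} MMAC/s for the full frame")
print(f"{interior_macs_per_second / 1e6:.1f} MMAC/s for interior samples only")
```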
Slide 11: Data Storage
- 2D data storage in a 1D linear memory where bit word can be accessed concurrently
- Example of an 8x8 2D matrix stored in a 1D memory
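The mapping from 2D coordinates to a 1D address is not spelled out in the surviving slide text. A common choice is row-major order, sketched below for the 8x8 example. Both the row-major layout and the aligned group of 4 words (borrowed from the "4 words at a time" access on the next slide) are assumptions, and all function names are hypothetical.

```python
def to_linear(row, col, n_cols):
    """Row-major address of element (row, col) in a matrix with n_cols columns."""
    return row * n_cols + col


def from_linear(addr, n_cols):
    """Inverse mapping: (row, col) stored at a linear address."""
    return divmod(addr, n_cols)


def burst_coords(addr, n_cols, burst=4):
    """Coordinates of all elements in the aligned burst containing addr.

    Assumes bursts of `burst` consecutive words aligned to a multiple of `burst`.
    """
    base = addr - addr % burst
    return [from_linear(base + k, n_cols) for k in range(burst)]


# 8x8 example: element (2, 3) and the 4-word group fetched along with it.
addr = to_linear(2, 3, 8)
group = burst_coords(addr, 8)
```

With 8 columns and aligned groups of 4, each group stays within a single row, so one access returns four horizontally adjacent samples.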
Slide 12: Data Access Analysis
- Samples are stored in the external memory, which has slower access speed
- Maximize data bandwidth by accessing 4 words at a time
- Use Register Files to store weights and sample data so that they can be reused repeatedly without going out to external memory
- Perform the calculation on 4 pixels concurrently, rotating coefficients and samples so that together they form the convolution operation
Slide 13: Data Processing Analysis
- Note 1: with a 5x5 filter, the first two rows and columns are skipped
- Note 2: the sequence pattern of samples and coefficients is for the concurrent calculation of Y22, Y32, Y42, and Y52
Slide 14: FPOA Solution
- Temporary data storage: 5 RFs, 3 ALUs
- Data access control: 3 ALUs
- Multiplier: 4 MACs
- Adder tree: 9 ALUs
- Temporary results: 2 RFs, 1 IRAM, 2 ALUs
Slide 15: 5x5 Convolution Filter Performance
- FPOA resources: ALU: 17, RF: 7, MAC: 4, IRAM: 1
- Total: 28 SOs + 1 IRAM
- Data throughput: 20 results every 125 cycles
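The throughput figure is consistent with the four MAC objects being busy on every cycle: 20 results at 25 MACs each is 500 MAC operations, and 4 MACs over 125 cycles provide exactly 500 slots. The sketch below checks this and converts the figure into results per second. It assumes one multiply-accumulate per MAC object per cycle and uses the 1 GHz clock quoted earlier in the deck; the comparison with the 720x480 at 30 frames per second workload is added here for illustration.

```python
from fractions import Fraction

MACS_PER_RESULT = 25      # 5x5 mask
NUM_MACS = 4              # MAC objects used by the filter
results, cycles = 20, 125

cycles_per_result = Fraction(cycles, results)
mac_ops_needed = results * MACS_PER_RESULT
mac_slots = NUM_MACS * cycles
mac_utilization = Fraction(mac_ops_needed, mac_slots)


def results_per_second(clock_hz):
    """Filtered samples per second at a given clock rate."""
    return Fraction(results, cycles) * clock_hz


at_1ghz = results_per_second(1_000_000_000)
required = 720 * 480 * 30   # samples per second for the video example

print(f"{float(cycles_per_result)} cycles per result")
print(f"MAC utilization: {float(mac_utilization):.0%}")
print(f"{float(at_1ghz) / 1e6} M results/s available, {required / 1e6} M needed")
```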
Slide 16: Motion Estimation
- Identify the movement of a similar pattern over time
- The main computation is the sum of absolute differences (SAD) between two 8x8 blocks, i.e., X[0:7, 0:7] and Y[0:7, 0:7]

sum = 0;
for i = 0 to 7
    for j = 0 to 7
        temp = X[i, j] - Y[i, j]
        sum = sum + abs(temp)
    end_of_j
end_of_i
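A runnable Python version of the SAD loop is sketched below, together with a small helper that picks the candidate block with the lowest SAD, which is how SAD is typically used in block-matching motion estimation. The helper `best_match` is hypothetical and only illustrates the use of the SAD value; the slides do not describe the search strategy.

```python
def sad_8x8(X, Y):
    """Sum of absolute differences between two 8x8 blocks (lists of lists)."""
    total = 0
    for i in range(8):
        for j in range(8):
            total += abs(X[i][j] - Y[i][j])
    return total


def best_match(block, candidates):
    """Index of the candidate block with the smallest SAD (hypothetical helper)."""
    scores = [sad_8x8(block, c) for c in candidates]
    return scores.index(min(scores))


# Usage example: a flat block compared with a darker copy and with itself.
X = [[10] * 8 for _ in range(8)]
Y = [[7] * 8 for _ in range(8)]
score = sad_8x8(X, Y)
winner = best_match(X, [Y, X])
```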
Slide 17: SAD Computation Dataflow
- 3 cycles throughput
- Generates two partial sums of positive differences
Slide 18: SAD Performance
- FPOA resources: ALU: 35, RF: 1
- Total: 36 SOs
- Data throughput: 24 cycles per 8x8 block
Slide 19: Internal System Bus
- Links all processing modules and the external host to the external system memory for data accesses
- Host-controlled round-robin access from module to module
- User-defined package format to utilize the 16-bit party line and minimize access overhead
Slide 20: System Bus Implementation
Slide 21: System Bus Performance
- FPOA resources: ALU: 20
- Cycles:
  - XRAM read: 4 cycles
  - XRAM write: 4 cycles
  - Module switch: 10 cycles
Slide 22: Performance of an Example Space Satellite Application
- Processing throughput: about 10 million samples per second
- FPOA resources (% of a device with 400 SOs running at 400 MHz):
  - Cycle utilization: 21%
  - SO utilization: 51%
  - IRAM utilization: 25%
  - XRAM b/w: 49% (100 MHz DDR RLDRAM)