A Programmable Single Chip Digital Signal Processing Engine MAPLD 2005 Paul Chiang, MathStar Inc. Pius Ng, Apache Design Solutions.

1 A Programmable Single Chip Digital Signal Processing Engine MAPLD 2005 Paul Chiang, MathStar Inc. Pius Ng, Apache Design Solutions

2 MAPLD 2005/206Chiang 2 Presentation Outline
Spaceborne signal processing tasks
FPOA architecture highlights programmability and expandability
System partition on FPOA device
Spatial processing - 5x5 filter solution
Temporal processing - motion estimation
Internal bus and I/O throughput
Resource utilization and future expansion

3 A System of Digital Signal Processing
Input Data -> Data Extraction -> Spatial or Temporal Processing -> Frequency or Time domain Processing -> Feature Extraction -> Characterization
Data Extraction: mux/de-mux
Spatial or Temporal Processing: average filter, min/max select, spatial edge filter, temporal difference filter
Frequency or Time domain Processing: time domain low/high/bandpass filter, frequency transformation, frequency domain low/high/bandpass filter
Feature Extraction: apply equation that defines feature, checking threshold
Characterization: analyze and characterize signals

4 Processing Requirements
High computation requirement on the following basic operations: add/sub and mul/mac
Mixed control functions such as loop control and decision making
High I/O bandwidth to keep processing balanced against data input/output
Large and fast temporary memory space to facilitate real-time processing
Fast, programmable, direct data transfer to enable massively parallel processing

5 FPOA Architecture Summary
Heterogeneous Array of 16-bit Silicon Objects
  - MAC, ALU, Truth Tables, Register File, Internal RAM
  - Single Clock Cycle Execution for All Objects
Homogeneous 2-Layer Programmable Interconnect Mesh
Tightly Integrated Data and Control Flow
Integrated DDRII RLDRAM & SRAM Controllers
High Speed I/O at Device Boundaries: SerDes, LVDS, HSTL

6 Reconfigurable Interconnect Network
Each link consists of 16 data bits, 1 valid bit, and 4 separate control bits
Nearest Neighbors
  - Range = 1 (N/E/S/W + diagonal)
Party Lines
  - Single cycle range = hop to 3 (skip 2) @ 1 GHz
  - Extra clock cycles for digital retiming
      1 extra -> 25-object neighborhood
      More clock cycles -> entire chip

7 FPOA Solution
Four GPIO ports with 44-bit I/O at 100 MHz, that is, 17.6 gigabits per second
Two 250 MHz DDR 32-bit external memories with 32 gigabits per second of bandwidth
400 Silicon Objects running at 1 GHz
  - ALU: add/sub, and combinational logic
  - MAC: mul/mac
  - Register File (RF): fast distributed data storage
  - Internal RAM (IRAM): intermediate data storage
Party lines and muxes to support a flexible internal bus as well as dedicated connections
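
The two bandwidth figures on this slide can be reproduced from port count, width, and clock rate. The short Python check below is only a sketch: the function names are illustrative, and it assumes that "DDR" means two transfers per clock on each memory interface.

```python
# Sanity check of the I/O figures quoted on the slide.
# Assumption: a DDR interface moves two words per clock cycle.

def gpio_bandwidth_gbps(ports: int, bits_per_port: int, clock_mhz: float) -> float:
    """Aggregate GPIO bandwidth in gigabits per second."""
    return ports * bits_per_port * clock_mhz * 1e6 / 1e9

def ddr_bandwidth_gbps(interfaces: int, bus_bits: int, clock_mhz: float) -> float:
    """Aggregate DDR memory bandwidth in gigabits per second (2 transfers/clock)."""
    return interfaces * bus_bits * 2 * clock_mhz * 1e6 / 1e9

gpio = gpio_bandwidth_gbps(ports=4, bits_per_port=44, clock_mhz=100)
ddr = ddr_bandwidth_gbps(interfaces=2, bus_bits=32, clock_mhz=250)
print(f"GPIO: {gpio:.1f} Gbit/s, external memory: {ddr:.1f} Gbit/s")
```

Under these assumptions the results agree with the 17.6 and 32 gigabits per second stated above.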

8 Example FPOA Partition

9 5x5 Convolution Filter
Apply the filter operation to a 2D data array, D[0:m-1, 0:n-1], with a 5x5 2D mask, W[0:4, 0:4]
  for i = 2; i < m - 2; i++
    for j = 2; j < n - 2; j++
      temp = 0;
      for k = -2; k < 3; k++
        for l = -2; l < 3; l++
          temp = D[i+k, j+l] * W[k+2, l+2] + temp
        end_of_l
      end_of_k
      Y[i, j] = temp;
    end_of_j
  end_of_i
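
The pseudocode on this slide can be written as a small runnable reference in Python. This is a plain software sketch, not the FPOA mapping. The loop bounds are chosen so that the mask always stays inside the array, and leaving the two-sample border of Y at zero is an assumption, since the slide does not say how borders are handled.

```python
def filter_5x5(D, W):
    """Apply a 5x5 mask W to the 2D array D (lists of lists).

    Border outputs (first and last two rows and columns) are left at 0,
    which is an assumption made for this sketch.
    """
    m, n = len(D), len(D[0])
    Y = [[0] * n for _ in range(m)]
    for i in range(2, m - 2):          # rows where the mask fits
        for j in range(2, n - 2):      # columns where the mask fits
            temp = 0
            for k in range(-2, 3):
                for l in range(-2, 3):
                    temp += D[i + k][j + l] * W[k + 2][l + 2]
            Y[i][j] = temp
    return Y

# Usage: an all-ones mask sums each 5x5 neighbourhood.
ones_data = [[1] * 6 for _ in range(6)]
ones_mask = [[1] * 5 for _ in range(5)]
result = filter_5x5(ones_data, ones_mask)
print(result[2][2])
```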

10 Computation Requirements
Assuming an m by n 2D data array and a 5x5 mask, there are 25 multiply-and-add (MAC) operations for each filtered sample
The whole convolution filter operation requires 25 * m * n MAC operations
With standard 720x480 image data at 30 frames per second, the convolution filter operation requires about 259 MMAC per second
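
The MAC rate quoted above follows directly from the frame size, frame rate, and mask size. The Python sketch below reproduces the arithmetic. The function name is illustrative, and, like the slide, it counts every pixel without discounting the skipped border.

```python
def mac_rate(width: int, height: int, fps: int, mask_rows: int = 5, mask_cols: int = 5) -> int:
    """MAC operations per second for a full-frame 2D filter."""
    return width * height * fps * mask_rows * mask_cols

rate = mac_rate(720, 480, 30)
print(f"{rate / 1e6:.1f} MMAC per second")
```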

11 Data Storage
2D data storage in a 1D linear memory where four 16-bit words can be accessed concurrently
Example of an 8x8 2D matrix stored in a 1D memory

12 Data Access Analysis
Samples are stored in the external memory with slower access speed
Maximize data bandwidth by accessing 4 words at a time
Use Register Files to store weights and sample data so that they can be repeatedly used without going out to external memory
Perform calculation on 4 pixels concurrently and rotate coefficients and samples in a way to form the convolution operation

13 Data Processing Analysis
Note 1: with a 5x5 filter the first two rows and columns are skipped
Note 2: the sequence pattern of samples and coefficients is for the concurrent calculation of Y22, Y32, Y42, and Y52
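
The figure with the exact sample and coefficient sequence is not part of this transcript, so the following Python sketch only illustrates the general idea of computing four outputs such as Y22, Y32, Y42, and Y52 concurrently. It assumes four accumulators (one per MAC) that share the same coefficient at each step. The function names and the loop ordering are assumptions, not the presenters' schedule.

```python
def direct_output(D, W, i, j):
    """Reference value of one filtered sample Y[i][j]."""
    return sum(D[i + k][j + l] * W[k + 2][l + 2]
               for k in range(-2, 3) for l in range(-2, 3))

def four_outputs(D, W, i0, j):
    """Compute Y[i0..i0+3][j] with four accumulators working in lockstep.

    Each step broadcasts one coefficient to all four lanes, so four
    results need 25 steps (one step modelling one cycle of four MACs).
    """
    acc = [0, 0, 0, 0]
    steps = 0
    for k in range(-2, 3):
        for l in range(-2, 3):
            w = W[k + 2][l + 2]              # coefficient shared by all lanes
            for lane in range(4):            # the four lanes run concurrently in hardware
                acc[lane] += D[i0 + lane + k][j + l] * w
            steps += 1
    return acc, steps

# Usage on an 8x8 array: outputs Y22, Y32, Y42, Y52.
D = [[r * 8 + c for c in range(8)] for r in range(8)]
W = [[k * 5 + l + 1 for l in range(5)] for k in range(5)]
acc, steps = four_outputs(D, W, 2, 2)
print(acc, steps)
```

In this model, five such groups would yield 20 results in 125 steps.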

14 FPOA Solution
Temporary data storage
  - 5 RFs, 3 ALUs
Data access control
  - 3 ALUs
Multiplier
  - 4 MACs
Adder Tree
  - 9 ALUs
Temporary Results
  - 2 RFs, 1 IRAM, 2 ALUs

15 5x5 Convolution Filter Performance
FPOA Resources
  - ALU: 17
  - RF: 7
  - MAC: 4
  - IRAM: 1
  - Total: 28 SOs + 1 IRAM
Data throughput
  - 20 results every 125 cycles
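
The throughput figure can be related to the resource count and to the video workload from the earlier slide. The Python sketch below does that arithmetic. The 1 GHz clock is taken from the "400 Silicon Objects running at 1 GHz" slide and is an assumption for this module; the function name is illustrative.

```python
MACS_PER_RESULT = 25          # 5x5 mask
RESULTS_PER_BATCH = 20        # from the slide
CYCLES_PER_BATCH = 125        # from the slide

def results_per_second(clock_hz: float) -> float:
    """Filtered samples per second at a given object clock rate."""
    return RESULTS_PER_BATCH * clock_hz / CYCLES_PER_BATCH

# MAC operations sustained per cycle; matches the 4 MAC objects listed above.
macs_per_cycle = RESULTS_PER_BATCH * MACS_PER_RESULT / CYCLES_PER_BATCH

achieved = results_per_second(1e9)   # assumed 1 GHz clock
required = 720 * 480 * 30            # samples per second for 720x480 at 30 frames/s
print(macs_per_cycle, achieved, required, achieved / required)
```

Under that clock assumption one filter module has roughly 15 times the sample rate the 720x480, 30 frames per second case needs.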

16 Motion Estimation
Identify the movement of a similar pattern over time
The main computation involves calculating the sum of absolute differences (SAD) between two 8x8 blocks, i.e. X[0:7, 0:7] and Y[0:7, 0:7]
  sum = 0;
  for i = 0 to 7
    for j = 0 to 7
      temp = X[i, j] - Y[i, j]
      sum = sum + abs(temp)
    end_of_j
  end_of_i
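
For reference, the SAD pseudocode translates directly into Python. This is a plain software sketch with an illustrative function name, not the FPOA dataflow.

```python
def sad_8x8(X, Y):
    """Sum of absolute differences between two 8x8 blocks (lists of lists)."""
    total = 0
    for i in range(8):
        for j in range(8):
            total += abs(X[i][j] - Y[i][j])
    return total

# Usage: two constant blocks that differ by 2 in every position.
block_a = [[5] * 8 for _ in range(8)]
block_b = [[3] * 8 for _ in range(8)]
print(sad_8x8(block_a, block_b))
```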

17 SAD Computation Dataflow
3 cycles throughput
Generates two partial sums of positive differences
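
The dataflow figure is not in this transcript. One plausible reading of "two partial sums of positive differences" is that X - Y is accumulated where X is larger and Y - X where Y is larger, so no explicit absolute value is needed. The Python sketch below illustrates that interpretation. It is an assumption, not the confirmed FPOA implementation.

```python
def sad_two_partial_sums(X, Y):
    """Return (pos, neg, sad) where pos sums x - y over x >= y and neg sums y - x over y > x.

    Both partial sums are non-negative; their total equals the SAD.
    """
    pos = 0
    neg = 0
    for row_x, row_y in zip(X, Y):
        for x, y in zip(row_x, row_y):
            if x >= y:
                pos += x - y
            else:
                neg += y - x
    return pos, neg, pos + neg

# Usage with two small synthetic 8x8 blocks.
X = [[(3 * i + 7 * j) % 11 for j in range(8)] for i in range(8)]
Y = [[(5 * i + 2 * j) % 13 for j in range(8)] for i in range(8)]
pos, neg, sad = sad_two_partial_sums(X, Y)
print(pos, neg, sad)
```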

18 SAD Performance
FPOA Resources
  - ALU: 35
  - RF: 1
  - Total: 36 SOs
Data throughput
  - 24 cycles per 8x8 block
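
The block throughput can be converted into per-cycle and per-second rates. In the Python sketch below the 1 GHz clock is an assumption taken from the earlier "400 Silicon Objects running at 1 GHz" slide, and the names are illustrative.

```python
BLOCK_PIXELS = 8 * 8        # pixel pairs compared per SAD
CYCLES_PER_BLOCK = 24       # from the slide

def sad_blocks_per_second(clock_hz: float, cycles_per_block: int = CYCLES_PER_BLOCK) -> float:
    """8x8 SAD evaluations per second at a given clock rate."""
    return clock_hz / cycles_per_block

pixels_per_cycle = BLOCK_PIXELS / CYCLES_PER_BLOCK
blocks_at_1ghz = sad_blocks_per_second(1e9)   # assumed 1 GHz clock
print(f"{pixels_per_cycle:.2f} pixel differences per cycle, "
      f"{blocks_at_1ghz / 1e6:.1f} M blocks per second")
```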

19 Internal System Bus
Links all processing modules and the external host to the external system memory for data accesses
Host-controlled round-robin access from module to module
User-defined package format to utilize the 16-bit party line and minimize the access overhead

20 System Bus Implementation

21 System Bus Performance
FPOA Resources
  - ALU: 20
Cycles
  - XRAM read: 4 cycles
  - XRAM write: 4 cycles
  - Module switch: 10 cycles
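
These cycle counts allow a rough estimate of bus time for one round-robin pass. The Python sketch below assumes a simple additive cost model: every module visit pays one module switch, and each read or write pays the quoted access cost. The slide does not say whether the costs apply per access or per word, or whether they overlap, so the model and its function name are assumptions.

```python
# Cycle costs quoted on the slide.
XRAM_READ_CYCLES = 4
XRAM_WRITE_CYCLES = 4
MODULE_SWITCH_CYCLES = 10

def round_robin_cycles(accesses_per_module):
    """Estimate bus cycles for one round-robin pass.

    accesses_per_module: list of (reads, writes) tuples, one per module,
    in the order the host visits them. Costs are assumed to add linearly.
    """
    total = 0
    for reads, writes in accesses_per_module:
        total += MODULE_SWITCH_CYCLES
        total += reads * XRAM_READ_CYCLES + writes * XRAM_WRITE_CYCLES
    return total

# Usage: three hypothetical modules with different access mixes.
example = round_robin_cycles([(2, 1), (0, 4), (3, 0)])
print(example)
```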

22 Performance of an Example Space Satellite Application
Processing Throughput
  - About 10 million samples per second
FPOA Resources (% of a device with 400 SOs and running at 400 MHz)
  - Cycle utilization: 21%
  - SO utilization: 51%
  - IRAM utilization: 25%
  - XRAM b/w: 49% (100 MHz DDR RLDRAM)

