© 2003 Mercury Computer Systems, Inc. Beamforming for Radar Systems on COTS Heterogeneous Computing Platforms Jeffrey A. Rudin Mercury Computer Systems,

1 Beamforming for Radar Systems on COTS Heterogeneous Computing Platforms
Jeffrey A. Rudin, Mercury Computer Systems, Inc.
High Performance Embedded Computing (HPEC) Conference, September 23, 2003
© 2003 Mercury Computer Systems, Inc.

2 Outline
- Beamforming Radar System Architecture
- Processing Resources
- Strawman System Analysis
  - Front-End Processing
  - Back-End Processing
  - Beamformer Architectures
- Summary

3 Radar System Architecture
- Beamforming requires massive dataflow and computation
  - ADC precision and data rate are chosen to provide high dynamic range and wide signal bandwidth
  - A high number of input channels is required in modern phased array radars to produce multiple beams and nulls
[Figure: radar processing chain repeated across parallel channels. Block labels: Sub-Array Beamformer, RF Combiner, Front-End, ADC, Adaptive Beamformer, Pulse Compression, Digital Memory. Region labels: ANALOG, FPGA, µP.]

4 Processing Resources
- Microprocessors
  - Fixed processing, I/O, and memory architecture
  - Task context switch requires microseconds
  - Native floating-point available
  - Low interaction between code modules
- FPGAs
  - Customizable processing, I/O, and memory architecture
  - Task context switch requires reconfiguration -- milliseconds
  - Floating-point must be built or bought
  - Considerable interaction between IP cores
  - Signal propagation issues
  - Currently harder to program than microprocessors

5 PowerPC Microprocessor
- 400 - 1000 MHz clock speeds
- 133 MHz system bus (MPC74xx) -- 851 MB/s
- 64-bit integer and floating-point units
- 128-bit AltiVec vector processing unit
- Pipelined instruction unit
- 32 kB instruction and data caches
- Up to 2 MB L2 cache
[Figure: PowerPC block diagram. Memory subsystem: instruction MMU, data MMU, instruction cache, data cache, L2 controller, load/store unit, bus interface unit, memory control unit. Instruction unit: completion unit, dispatch unit, branch processing. Arithmetic unit: vector ALU, integer ALU, floating-point ALU.]

6 Virtex-II Pro FPGA
- Clock speeds lower than processors: 100 - 200 MHz clocks
- Up to 20 full-duplex multi-gigabit transceivers
- Many DSP supporting features
- Each block RAM contains two banks with independent sets of address and data lines
- Gigabit transceivers provide over 240 MBps in each direction -- over 4800 MBps throughput!

Device    Gigabit Tx/Rx  Logic Slices  18-Bit Multiplier  18K-Bit Block RAM  Clock Manager  I/O Pads  CPU Blocks
XC2VP40   12             19,392        192                192                8              804       2
XC2VP50   16             23,616        232                232                8              852       2
XC2VP70   20             33,088        328                328                8              996       2
XC2VP100  20             44,096        444                444                12             1,164     2

[Figure: Virtex-II Pro resources. PowerPC 405 core, full-duplex transceivers, dual-port block RAM, dedicated multipliers, clock managers, I/O blocks, and configurable logic (registers, LUTs, carry logic, multiplexers, distributed RAM, shift registers).]

7 Strawman System Requirements
- Lots of channels -- 80+ input channels
- ADC with “good” bandwidth and dynamic range
  - 100 MSps -- 1.56 - 25 MHz bandwidth using fs/4 sampling
  - 14-bit precision -- over 80 dB dynamic range
- Reasonable implementation risk -- 100 MHz clock
- ADC precision, sample rate, and number of channels drive downstream requirements
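
The "over 80 dB" figure for 14-bit precision can be sanity-checked against the textbook ideal-quantizer result, roughly 6.02 N + 1.76 dB for a full-scale sine wave. The sketch below is only that check: it assumes an ideal converter, the function name is illustrative and not from the presentation, and a real ADC delivers fewer effective bits than its nominal width.

```python
import math

def ideal_adc_snr_db(bits: int) -> float:
    """Ideal quantization SNR in dB for a full-scale sine wave (assumed model)."""
    # Signal-to-quantization-noise power ratio of an ideal N-bit quantizer
    # driven by a full-scale sine wave is 1.5 * 2^(2N).
    return 10.0 * math.log10(1.5 * 2.0 ** (2 * bits))

print(f"{ideal_adc_snr_db(14):.1f} dB")  # ideal upper bound for 14 bits
```

The ideal value for 14 bits is about 86 dB, so a practical figure of "over 80 dB" leaves a few dB for converter imperfections.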

8 Front-End Processing
- Digital Down Converter
  - fs/4 IF & BW -- eliminates the need for numerically controlled oscillators (NCO)
  - 4x decimation
  - 31-tap complex FIR, real symmetric coefficients
  - Usually no bit growth
- Lowpass Decimation Filter
  - 1x (bypass), 2x, 4x, 8x, and 16x decimation rates
  - 0, 16, 32, 64, 128 taps
  - Real coefficients
  - 0 to 2 bits of bit growth
- Equalizer
  - 16-tap, complex coefficients -- cannot generally exploit symmetry
  - Usually no bit growth
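
The decimation settings on this slide appear to line up with the 1.56 - 25 MHz bandwidth range quoted in the strawman requirements. The sketch below works through that arithmetic under two assumptions that are not stated in the slides: that the usable bandwidth is taken as roughly equal to the complex output sample rate, and that the constant names are free choices made here.

```python
ADC_RATE_HZ = 100e6                 # 100 MSps, from the strawman requirements
DDC_DECIMATION = 4                  # digital down converter decimation
LPF_DECIMATIONS = (1, 2, 4, 8, 16)  # 1x is the bypass setting

def output_rates_hz():
    """Complex sample rate after the DDC for each LPF decimation setting."""
    ddc_rate = ADC_RATE_HZ / DDC_DECIMATION
    return {d: ddc_rate / d for d in LPF_DECIMATIONS}

for decimation, rate in output_rates_hz().items():
    print(f"LPF {decimation:>2}x -> {rate / 1e6} MHz")
```

Under these assumptions the bypass setting gives 25 MHz and the 16x setting gives 1.5625 MHz, which matches the endpoints of the quoted range.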

9 Digital Down Converter
- Reduce complexity -- exploit fs/4 center frequency and bandwidth
  - Complex mixing reduces to polyphase commutation
    - Cosine and sine select even and odd samples respectively
    - cos(2πn/4) = 1, 0, -1, 0, 1, ...; j sin(2πn/4) = 0, j, 0, -j, 0, ...
  - Exploit polyphase structure for decimation
- Odd number of taps creates symmetries in the FIR coefficients
[Figure: polyphase fs/4 DDC. Four coefficient branches feed adders that form the I and Q outputs: (h3, h7, h11, h15, h11, h7, h3), (h0, h4, h8, h12, h14, h10, h6, h2), (h1, h5, h9, h13, h9, h5, h1), (h2, h6, h10, h14, h12, h8, h4, h0).]
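
The claim that fs/4 mixing reduces to commutation can be checked numerically. The sketch below compares an explicit complex oscillator at one quarter of the sample rate against plain sample routing with sign flips. The function names and the sign convention (positive exponent) are assumptions made for illustration; the opposite sign gives the conjugate result and the same hardware savings.

```python
import cmath
import math

def mix_with_oscillator(x):
    """Reference: multiply each real sample by exp(j*2*pi*n/4)."""
    return [s * cmath.exp(1j * math.pi * n / 2) for n, s in enumerate(x)]

def mix_by_commutation(x):
    """Same result using only sample routing and sign flips (no multiplies)."""
    out = []
    for n, s in enumerate(x):
        phase = n % 4
        if phase == 0:
            out.append(complex(s, 0.0))    # cos = 1,  sin = 0
        elif phase == 1:
            out.append(complex(0.0, s))    # cos = 0,  sin = 1
        elif phase == 2:
            out.append(complex(-s, 0.0))   # cos = -1, sin = 0
        else:
            out.append(complex(0.0, -s))   # cos = 0,  sin = -1
    return out

samples = [0.5, -1.25, 2.0, 3.0, -0.75, 1.0, 0.25, -2.5]
reference = mix_with_oscillator(samples)
commutated = mix_by_commutation(samples)
assert all(abs(a - b) < 1e-12 for a, b in zip(reference, commutated))
```

Even-indexed samples land only in the real (I) path and odd-indexed samples only in the imaginary (Q) path, which is why no oscillator or mixer multiplier is needed.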

10 Digital Down Converter
- Reduce complexity -- exploit filter symmetries
  - Exploit symmetric filter structures for in-phase signal
  - Exploit symmetry pair filters for quadrature signal
  - Each tap calculation involves one coefficient and two samples
[Figure: symmetric filter and symmetry pair filter block diagrams built from adders and summing nodes.]
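
"One coefficient and two samples" per tap calculation is the usual folded form of a symmetric FIR: the two samples that share a coefficient are added first, then multiplied once. The sketch below illustrates only the symmetric (in-phase) case for an odd-length filter; the function names are hypothetical and the symmetry pair arrangement used for the quadrature path is not reproduced here.

```python
def fir_output_direct(h, window):
    """One FIR output sample: one multiply per tap."""
    return sum(c * s for c, s in zip(h, window))

def fir_output_folded(h, window):
    """Same output for symmetric h of odd length, using a pre-adder per tap pair."""
    n = len(h)
    assert n % 2 == 1 and len(window) == n
    assert all(h[k] == h[n - 1 - k] for k in range(n // 2)), "h must be symmetric"
    mid = n // 2
    acc = h[mid] * window[mid]                         # center tap has no partner
    for k in range(mid):
        acc += h[k] * (window[k] + window[n - 1 - k])  # one coefficient, two samples
    return acc

def multiplies_folded(num_taps):
    """Multiplies per output for an odd-length symmetric FIR after folding."""
    return (num_taps + 1) // 2

coeffs = [1, 2, 3, 4, 3, 2, 1]
window = [5, -1, 0, 2, 7, 3, -4]
assert fir_output_folded(coeffs, window) == fir_output_direct(coeffs, window)
```

For a 31-tap symmetric filter, folding alone brings the count from 31 multiplies per output to 16, before the fs/4 and decimation savings described on the neighboring slides.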

11 Digital Down Converter
- Reduce complexity -- exploit 4x decimation
  - Use MAC-Engine to do 4 multiplies per input sample
    - Use fclk = 4 x fs to time share multipliers
  - Configure logic slices as shift registers (SRLs) to save BRAM
    - Need to store 3 sets of numbers -- need 2 BRAMs
    - Save BRAM by using logic slices to store both sets of samples
[Figure: MAC-Engine block diagrams for the symmetric filter and the symmetry filter pair. Labels: BRAM, SRL-16, h[n], x[n], x1[n], x2[n].]

12 Low Pass Filter
- Reduce complexity -- use MAC-Engine FIR implementation
  - Run multipliers at 4x sample rate -- time share multipliers
  - Exploit constant length-decimation product
    - Single structure handles multiple filter implementations
    - Single clock frequency
  - Use dual-bank feature of BRAM
    - First bank stores samples
    - Second bank stores FIR coefficients
[Figure: MAC-Engine FIR block diagram. Labels: BRAM, h[n], x[n], MAC-Engine.]
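
One way to see why a single MAC-Engine structure can serve every LPF setting: the tap counts listed for the front end grow in step with the decimation rate, so the multiply rate stays fixed. The sketch below is a back-of-envelope check, not the author's sizing method. It assumes a 25 MSps complex input (100 MSps after the 4x DDC), a 100 MHz clock, and real coefficients applied separately to I and Q; all names are hypothetical.

```python
import math

INPUT_RATE_HZ = 25e6   # assumed: complex rate out of the 4x DDC (100 MSps / 4)
CLOCK_HZ = 100e6       # strawman clock, 4x the assumed input rate
LPF_SETTINGS = {2: 16, 4: 32, 8: 64, 16: 128}  # decimation -> taps (bypass omitted)

def multipliers_needed(decimation, taps, rails=2):
    """Time-shared multipliers for a decimating FIR on I and Q (rails=2)."""
    output_rate = INPUT_RATE_HZ / decimation
    # Real coefficients: one MAC per tap per rail for every output sample.
    macs_per_second = taps * output_rate * rails
    return math.ceil(macs_per_second / CLOCK_HZ)

for decimation, taps in LPF_SETTINGS.items():
    print(decimation, taps, multipliers_needed(decimation, taps))
```

Under these assumptions every setting needs the same number of multipliers, which is consistent with the 4 multipliers per channel quoted for the LPF on the Front-End Realization slide.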

13 Equalizer
- Reduce complexity -- reduce number of multipliers and BRAMs
  - Exploit fclk/fs -- use MAC-Engine
  - Implement complex multiply using only 3 MAC-Engines
    - Use common product term in complex multiply
- Trade logic slices for multipliers
- Trade logic slices for block RAM
[Figure: four-multiplier and three-multiplier complex multiply structures. Inputs xr, xi; coefficient terms hr, hi, (hr - hi), (hr + hi); outputs yr, yi.]
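
The three-multiplier complex multiply can be written out using the same coefficient terms that appear in the slide's figure: hr, (hr + hi), and (hr - hi). The exact wiring in the figure is not recoverable from the transcript, so the sketch below is one standard arrangement built from those terms rather than a reproduction of the author's circuit; the function name is hypothetical.

```python
def complex_multiply_3(hr, hi, xr, xi):
    """Complex product (hr + j*hi) * (xr + j*xi) with three real multiplies."""
    # Coefficient sums are fixed per tap, so they can be precomputed
    # and cost adders (logic slices) rather than multipliers.
    h_sum = hr + hi
    h_diff = hr - hi
    common = hr * (xr + xi)      # shared product term (multiply 1)
    yr = common - h_sum * xi     # multiply 2: hr*xr - hi*xi
    yi = common - h_diff * xr    # multiply 3: hr*xi + hi*xr
    return yr, yi

# (3 - 2j) * (5 + 7j) = 29 + 11j
assert complex_multiply_3(3, -2, 5, 7) == (29, 11)
```

The saving of one multiplier per tap is paid for with extra adders, which is the "trade logic slices for multipliers" point on the slide.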

14 Front-End Realization
- FPGA features can be exploited to maximize utilization
  - Up to 20 100-MSps channels per FPGA
  - DDC with 31-tap FIR using only 3 multipliers/channel
  - LPF 16-128 tap decimating FIR using only 4 multipliers/channel
  - EQU 16-tap complex FIR using only 12 multipliers/channel
[Figure: Digital Receiver Module for 20x 100 MSps channels on Virtex-II Pro 100. From ADCs: channels 1-20, 200 MByte/s, 20x 2.5 Gb FO. Per-channel DDC, LPF, and EQU blocks. Output: channels 1-20, 100 MByte/s, 9x 2.5 Gb FO, with an additional copy of each channel for distribution.]
[Chart: FPGA utilization for 20x 100 MSps channels. Multipliers, block RAM, and logic slices divided into processing, I/O, memory control, and margin. Result: high FPGA utilization.]
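
The per-channel multiplier counts and link counts on this slide can be tallied in a few lines. The sketch below is a rough budget, not the author's analysis: it reads the 100 MByte/s output label as a per-channel rate and takes 240 MBps as the usable payload per transceiver direction, both of which are assumptions, and the names are hypothetical.

```python
import math

CHANNELS = 20
MULTIPLIERS_PER_CHANNEL = {"DDC": 3, "LPF": 4, "EQU": 12}  # from the slide
OUTPUT_MBYTES_PER_CHANNEL = 100   # assumed to be a per-channel rate
LINK_MBYTES_PER_SECOND = 240      # assumed usable payload per transceiver direction

def total_multipliers():
    """Multipliers needed for all channels of the front end."""
    return CHANNELS * sum(MULTIPLIERS_PER_CHANNEL.values())

def output_links_needed():
    """Links needed if output channels are packed without regard to link boundaries."""
    total = CHANNELS * OUTPUT_MBYTES_PER_CHANNEL
    return math.ceil(total / LINK_MBYTES_PER_SECOND)

print(total_multipliers(), output_links_needed())
```

Under these assumptions the front end needs 380 multipliers, which fits within the 444 listed for the XC2VP100, and 9 output links, which agrees with the 9x 2.5 Gb label in the figure.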

15 Back-End Processing
- FPGAs can be used to address data flow requirements that persist in the system until application of adaptive beamforming weights
  - Digital Pulse Compression
    - Fast convolution with FFT IP cores
  - Doppler Processing
    - FPGA FFT IP cores available
  - Adaptive Beamforming Weight Application
    - Similar advantages to those in sub-array beamformer
- FPGAs can augment weight computation
  - QR Decomposition
    - New FPGA solutions may replace microprocessors
  - Cholesky Decomposition
    - Possibly form covariance matrix in adjunct FPGA

16 Digital Pulse Compression
- FFT IP cores can be used to implement pulse compression
  - 8192-point FFT @ 25 MSps/channel
  - 6 sub-array channels / FPGA
  - 3-stage pipelined convolver -- 2 convolvers / FPGA
  - Enough resources to sum partial products from beamformer
- FFT cores tend to be BRAM hungry
- Doppler processing can be implemented using similar FFT cores
[Figure: two convolver pipelines built from Memory, FFT, MUL, and IFFT blocks. A SUM block combines Partial Product 1 and Partial Product 2 ahead of the I/O.]
[Chart: digital pulse compression FPGA utilization. Multipliers, block RAM, and logic slices divided into processing, I/O, memory control, and margin. Result: good FPGA utilization.]
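
The FFT, MUL, IFFT chain in the convolver is the standard fast-convolution identity: multiply the spectra, then transform back. The sketch below demonstrates it in software with a small recursive radix-2 FFT written for this example. It is a functional illustration only, with hypothetical function names, and says nothing about how the FPGA IP cores are structured. For pulse compression the filter `h` would be the matched filter, that is, the time-reversed complex conjugate of the transmitted waveform.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. Inverse is unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fast_convolve(x, h):
    """Linear convolution via zero-padded FFT, multiply, inverse FFT."""
    n_out = len(x) + len(h) - 1
    n_fft = 1
    while n_fft < n_out:
        n_fft *= 2
    spectrum_x = fft(list(x) + [0j] * (n_fft - len(x)))
    spectrum_h = fft(list(h) + [0j] * (n_fft - len(h)))
    product = [a * b for a, b in zip(spectrum_x, spectrum_h)]
    y = fft(product, inverse=True)
    return [v / n_fft for v in y[:n_out]]   # apply the 1/N inverse scaling

def direct_convolve(x, h):
    """Reference time-domain convolution."""
    y = [0j] * (len(x) + len(h) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(h):
            y[i + j] += a * b
    return y

x = [1, 2 + 1j, -3, 0.5j, 4]
h = [1, -1j, 2]
assert all(abs(a - b) < 1e-9 for a, b in zip(fast_convolve(x, h), direct_convolve(x, h)))
```

In a streaming system the filter spectrum is computed once and stored, so each block costs one forward FFT, one vector multiply, and one inverse FFT.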

17 Beamformer Architectures
- Unconstrained Linear Architecture
  - All input channels contribute to each output
- Constrained Linear Architecture
  - A subset of input channels contributes to any output
- Mesh Architecture
  - All input channels contribute to each output

18 Beamformer Module Constraints
- Basic limits are imposed by I/O and number of multipliers
- Inputs over 18 bits can increase the number of multipliers
  - Keep watch on bit growth in front-end processing
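
The bit-growth warning can be made concrete with the figures given earlier in the deck: a 14-bit ADC, "usually no bit growth" in the DDC and equalizer, and up to 2 bits of growth in the lowpass filter. The sketch below simply adds those up against the 18-bit multiplier input width. Treating "usually no bit growth" as zero is an assumption, and the names are hypothetical.

```python
ADC_BITS = 14
# Worst case per stage; "usually no bit growth" is taken as 0 (assumption).
WORST_CASE_GROWTH_BITS = {"DDC": 0, "LPF": 2, "EQU": 0}
MULTIPLIER_INPUT_BITS = 18

def bits_into_beamformer():
    """Word width arriving at the beamformer multipliers in the worst case."""
    return ADC_BITS + sum(WORST_CASE_GROWTH_BITS.values())

def headroom_bits():
    """Spare bits before a sample no longer fits one 18-bit multiplier input."""
    return MULTIPLIER_INPUT_BITS - bits_into_beamformer()

print(bits_into_beamformer(), headroom_bits())
```

With these numbers the data stays within a single multiplier input, but only two further bits of growth anywhere in the chain would use up the margin.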

19 Beamformer Module Constraints
- Multiplexing must be designed to maximize communication
  - Beam-partitioned output multiplexing may reduce efficiency
  - Alternate multiplexing methods may be necessary
- Data can also be partitioned by link: each link carries an integral number of channels

20 Unconstrained Linear Architecture
- Full MxN unconstrained complex matrix multiply
- Outputs only from a single module
- Processing throughput limited by beamformer module I/O
- Communication latency across beamformer is an issue
- Additional beams can be produced by multiple passes on data
  - Decreases overall radar duty cycle
  - Memory should be located in digital beamformer to save I/O bandwidth
  - Increased beamformer processing speed may be required
[Figure: Digital Rx, Beamformer, and Memory feeding a processor (PROC). Timeline labels: input sets, CPI, N passes, beams.]
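
In software terms, the unconstrained beamformer is a complex matrix-vector product applied to every time sample: each output beam is a weighted sum of all input channels. The sketch below shows that operation and the number of complex multiply-accumulates (CMACs) it implies per sample. The function names are hypothetical, and the weights shown are arbitrary illustrative values, not adaptive weights from any real system.

```python
def beamform(weights, samples):
    """Form beams from channel samples: y[m] = sum over n of weights[m][n] * samples[n]."""
    return [sum(w * x for w, x in zip(row, samples)) for row in weights]

def cmacs_per_sample(num_outputs, num_inputs):
    """Complex multiply-accumulates per time sample for a dense weight matrix."""
    return num_outputs * num_inputs

weights = [
    [1 + 0j, 1j, -1 + 0j],   # beam 0
    [1 + 0j, 1 + 0j, 1 + 0j],  # beam 1
]
samples = [1 + 1j, 2 + 0j, 3j]   # one time sample from each of three channels
assert beamform(weights, samples) == [1 + 0j, 3 + 4j]
```

Because every output needs every input, the work grows with the product of the two dimensions; a matrix with 40 outputs and 100 inputs, for example, needs 4000 CMACs for each time sample.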

21 Unconstrained Linear Architecture
- Unconstrained linear beamformer module is I/O bound
  - Total number of input links plus output links is constant
  - Choice of input to output balance affects utilization
- Note: adding non-MGT connections could potentially increase throughput
[Chart: Number of Output Channels versus Number of Input Channels on logarithmic axes, showing the computational limit, the communication limit, the region of usable configurations, and the 4 input module and 20 input module design points.]

22 4 Input Module Realization
- I/O and compute bounds are not close -- low utilization
- 36 x 96 unconstrained matrix multiply
- 35 modules required for FPGA digital processor
  - Front-end -- 5 modules
  - Small-array beamformer -- 24 modules
  - Digital pulse compression -- 6 modules
[Figure: dataflow through Digital Rx, Beamform, and DPC. Labels: 96, 20, 4, 36, 8.4 GBps, 5.9 GBps.]
[Chart: beamformer module utilization. Multipliers, block RAM, and logic slices divided into processing, I/O, memory control, and margin. Result: low FPGA utilization.]

23 20 Input Module Realization
- I/O and compute bounds are close -- good utilization
- 40 x 100 unconstrained matrix multiply
- 22 modules required for FPGA digital processor
  - Front-end -- 5 modules
  - Small-array beamformer -- 10 modules
  - Digital pulse compression -- 7 modules
[Figure: dataflow through Digital Rx, five Beamform stages of 20 channels, and DPC. Labels: 100, 40, 20, 6.5 GBps, 8.8 GBps.]
[Chart: beamformer module utilization. Multipliers, block RAM, and logic slices divided into processing, I/O, memory control, and margin. Result: good FPGA utilization.]

24 Constrained Linear Architecture
- Use each beamformer module to produce outputs
- MxN constrained complex matrix multiply
  - Use only a subset of inputs for each output
  - Explicit zeros in beamforming matrix
- I/O and computation bounds are the same as in the unconstrained case
  - Inputs and outputs must be balanced to maximize utilization
[Figure: Digital Rx, Beamformer, and Memory blocks.]
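
"Explicit zeros in the beamforming matrix" means some weights are fixed at zero, so the corresponding multiplies never have to be performed. The sketch below illustrates this with a block-diagonal pattern in which each module sees only its own input channels. The block-diagonal layout is an assumption chosen for simplicity; the slides do not specify which entries are zero, and the function names are hypothetical.

```python
def block_constrained_weights(blocks):
    """Assemble a full weight matrix from per-module blocks, zero elsewhere."""
    total_inputs = sum(len(block[0]) for block in blocks)
    matrix = []
    offset = 0
    for block in blocks:
        width = len(block[0])
        for row in block:
            full_row = [0j] * total_inputs
            full_row[offset:offset + width] = row   # module's own channels only
            matrix.append(full_row)
        offset += width
    return matrix

def nonzero_cmacs(matrix):
    """CMACs per time sample that actually need to be computed."""
    return sum(1 for row in matrix for w in row if w != 0)

# Two hypothetical modules, each forming 1 output from its own 2 inputs.
module_a = [[1 + 1j, 2 + 0j]]
module_b = [[3j, -1 + 0j]]
full = block_constrained_weights([module_a, module_b])
assert full == [[1 + 1j, 2 + 0j, 0j, 0j], [0j, 0j, 3j, -1 + 0j]]
assert nonzero_cmacs(full) == 4   # versus 8 for a dense 2 x 4 matrix
```

Each module can therefore produce its outputs locally, at the cost of beams that no longer draw on every channel.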

25 20 Input Module Implementation
- Adding matrix constraints increases the number of outputs
- 50 x 100 constrained matrix multiply
- 19 modules required for FPGA digital processor
  - Front-end -- 5 modules
  - Small-array beamformer -- 5 modules
  - Digital pulse compression -- 9 modules
[Figure: five Digital Rx and Beamform pairs of 20 channels each feeding DPC. Labels: 100, 50, 20, 10, 8.8 GBps, 8.1 GBps.]
[Chart: beamformer module utilization. Multipliers, block RAM, and logic slices divided into processing, I/O, memory control, and margin. Result: good FPGA utilization.]

26 Mesh Architecture
- Mesh architecture offers utilization enhancement
  - I/O and computation bounds touch
- Full unconstrained matrix multiply
  - No explicit zeros in beamforming matrix
- Partially formed beams sent forward for summing in DPC
- Note: computation limit normalized for architecture
[Figure: 40 x 48 CMAC using 4 modules. Link labels: four of "5 links, 12 channels" and four of "5 links, 10 channels".]
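
The idea of sending partially formed beams forward for summing rests on a simple property of the matrix product: the sum over all input channels can be split into sums over disjoint groups of channels and recombined later. The sketch below checks that property on a small example. The function names and the way channels are divided between two hypothetical modules are illustrative only and do not reflect the link layout in the figure.

```python
def partial_beams(weights, samples, channel_slice):
    """Partial beam sums from one module's share of the input channels."""
    return [
        sum(w * x for w, x in zip(row[channel_slice], samples[channel_slice]))
        for row in weights
    ]

def sum_partials(partials):
    """Element-wise sum of partial beams (the role the slide assigns to the DPC stage)."""
    return [sum(values) for values in zip(*partials)]

weights = [[1, 2, 3, 4], [1j, -1, 2, 0]]   # 2 beams, 4 channels
samples = [1, 1j, 2, -1]

# Full product computed in one place.
full = [sum(w * x for w, x in zip(row, samples)) for row in weights]

# Same product split across two hypothetical modules and summed afterwards.
first = partial_beams(weights, samples, slice(0, 2))
second = partial_beams(weights, samples, slice(2, 4))
assert sum_partials([first, second]) == full == [3 + 2j, 4]
```

Because no weights are forced to zero, the result is identical to the unconstrained product; only the location of the final additions changes.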

27 Mesh Implementation
- I/O and compute bounds touch -- high utilization
- 40 x 96 unconstrained matrix multiply
- 20 modules required for FPGA digital processor
  - Front-end -- 5 modules
  - Small-array beamformer -- 8 modules
  - Digital pulse compression -- 7 modules
[Figure: dataflow through Digital Rx, Beamform, and DPC. Labels: 96, 40, 8.4 GBps, 6.5 GBps.]
[Chart: beamformer module utilization. Multipliers, block RAM, and logic slices divided into processing, I/O, memory control, and margin. Result: high FPGA utilization.]

28 Architecture Comparison
- Mesh architecture gives highest multiplier utilization

29 Large Systems
- Large systems can be created through layering beamformers
  - 8 beam system, 20 channels per beam -- 160 channels
  - 160 x 96 unconstrained matrix multiply
- 65 modules required for FPGA digital processor
  - Front-end -- 5 modules
  - Small-array beamformer -- 32 modules
  - Digital pulse compression -- 28 modules
[Figure: layered beamformers. Labels: Channels 1-96, Channels 1-160.]

30 Summary
- FPGAs can provide efficient I/O and computational power to address high input bandwidths of modern radar systems
  - Front-end processing
  - Sub-array beamformer
  - Digital pulse compression
  - Adaptive beamforming
- System topologies that provide efficient utilization of computational and I/O resources change dramatically as system requirements scale
  - Watch I/O and computation bounds
- Small changes in system requirements can dramatically increase complexity of FPGA implementations when the computational bounds of embedded resources are exceeded
  - Watch for symmetries in filters
  - Watch bit growth before 18-bit multipliers
- FPGAs should be used until application of adaptive beamforming weights due to high bandwidth dataflow

