Correlator Options for 128T MWA Cambridge Meeting Roger Cappallo MIT Haystack Observatory 2011.6.6.


1 Correlator Options for 128T MWA
Cambridge Meeting
Roger Cappallo, MIT Haystack Observatory
2011.6.6

2 Current Status
- Correlator Hardware Inventory
  - 10 each of v.2 Correlator Boards, PFB Boards, CB/RTM’s, PFB/RTM’s
  - 2 full-size card cages + 1 small, with power supplies
- e2e simulation software
  - file input packets → module → file output packets
- PFB FPGA Firmware for 32T
  - very limited de-skew capability
  - no inter-board transfer (via mesh backplane)
  - corner-turns specific to 32T case
  - PFB to 10 kHz channels needs no changes

3 Current Status (cont’d)
- CB FPGA Firmware
  - 32T: operational code uses every other 50 ms interval, though 100% duty cycle code is available
  - 512T: error-free CMAC (only) code for 115 cells working at 180 MHz

4 128T Correlator Requirements
- 30.72 MHz BW in 24 coarse channels of 1.28 MHz
- 256 inputs
- 16 Rx’s with 48 fibres
- 82.6 Gb/s aggregate bit rate
- ~32K correlation products
- F stage: ~150 GCMAC/s (12 tap FIR, 40 kHz channels)
- X stage: 1.01 TCMAC/s
KEY (compared to 32T): same, x4, x16
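
The product count and X-stage rate above can be reproduced with a short calculation. This is only a sketch under assumptions not stated on the slide: autocorrelations are counted among the products, one CMAC is spent per product per complex sample, and the aggregate complex sample rate after channelization equals the bandwidth. The 64-input figure for 32T assumes the same two polarizations per tile that give 256 inputs for 128T.

```python
# Sketch: reproduce "~32K correlation products" and "1.01 TCMAC/s".
# Assumptions (not from the slide): autocorrelations are included, and one
# CMAC is needed per product per complex sample at a rate equal to the bandwidth.

def correlation_products(n_inputs: int) -> int:
    """Number of input pairs, including each input paired with itself."""
    return n_inputs * (n_inputs + 1) // 2

N_COARSE_CHANNELS = 24
COARSE_CHANNEL_BW_HZ = 1.28e6
bandwidth_hz = N_COARSE_CHANNELS * COARSE_CHANNEL_BW_HZ   # 30.72 MHz

products_128t = correlation_products(256)   # 128 tiles x 2 polarizations
products_32t = correlation_products(64)     # assumed: 32 tiles x 2 polarizations

x_stage_cmac_per_s = products_128t * bandwidth_hz
product_ratio = products_128t / products_32t

print(f"bandwidth        : {bandwidth_hz / 1e6:.2f} MHz")
print(f"products (128T)  : {products_128t}")
print(f"X stage          : {x_stage_cmac_per_s / 1e12:.2f} TCMAC/s")
print(f"products vs 32T  : x{product_ratio:.1f}")
```

Under these assumptions the product count grows by roughly the x16 shown in the key, while the number of inputs grows by x4.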

5 Top Level Choices
- hardware: use current hardware, developing FPGA firmware as necessary
- software: get RX signals into standardized format (10 gigE) ASAP, do PFB and correlation in GPU-equipped server
- hybrid: use existing PFB’s for F stage and to form 10 gigE packets to be correlated in software

6 Hardware Solution
- Using existing 32T firmware it should take 4 PFB boards and 16 CB’s, but architecture doesn’t scale in a fully-parallel sense due to cross-correlations, and it would really take 6 PFB’s and 18 CB’s, with firmware mods
- unchanged 32T firmware leads to a system with 20 PFB’s and 20 CB’s!
- using tested CMAC design (115 cells @ 180 MHz) yields enough computation in ~6.5 CB’s; optimal partition appears to be 8 PFB’s and 8 CB’s

7

8

9 Brute Force 32T extension
- Group fibres into 3 sets of 16, each covering 8 coarse channels
- Replicate each fibre signal into 5 copies
- Use covering table to bring all pairs together
- Requires 20 complete board sets – massive (x5) redundancy of PFB’s

Covering table (20 groups of 4 fibres, fibres numbered 1 to 16):
 2  3  7  9
 6  7  8 10
 1  7 15 16
 1  3 12 14
 1  2  6 13
 1  9 10 11
 2  4 10 12
 1  4  5  8
10 13 14 16
 8  9 12 16
 3  8 11 13
 2  5 11 16
 5  7 12 13
 3  4  6 16
 5  6  9 14
 2  8 14 15
 4  9 13 15
 6 11 12 15
 4  7 11 14
 3  5 10 15
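
The covering property of the table can be checked mechanically. The sketch below assumes the 80 numbers on the slide are read as 20 consecutive groups of 4 fibres. It counts how often each pair of fibres shares a group and how many groups each fibre appears in; the function and variable names are illustrative only.

```python
from collections import Counter
from itertools import combinations

# The slide's covering table, read as 20 consecutive groups of 4 fibres.
COVERING_TABLE = [
    (2, 3, 7, 9), (6, 7, 8, 10), (1, 7, 15, 16), (1, 3, 12, 14),
    (1, 2, 6, 13), (1, 9, 10, 11), (2, 4, 10, 12), (1, 4, 5, 8),
    (10, 13, 14, 16), (8, 9, 12, 16), (3, 8, 11, 13), (2, 5, 11, 16),
    (5, 7, 12, 13), (3, 4, 6, 16), (5, 6, 9, 14), (2, 8, 14, 15),
    (4, 9, 13, 15), (6, 11, 12, 15), (4, 7, 11, 14), (3, 5, 10, 15),
]

def pair_coverage(groups):
    """Count how many groups contain each unordered pair of fibres."""
    counts = Counter()
    for group in groups:
        for pair in combinations(sorted(group), 2):
            counts[pair] += 1
    return counts

def copies_per_fibre(groups):
    """Count how many groups each fibre appears in."""
    return Counter(fibre for group in groups for fibre in group)

coverage = pair_coverage(COVERING_TABLE)
all_pairs = set(combinations(range(1, 17), 2))
missing_pairs = all_pairs - set(coverage)
copies = copies_per_fibre(COVERING_TABLE)

print("groups             :", len(COVERING_TABLE))
print("distinct pairs     :", len(coverage), "of", len(all_pairs))
print("missing pairs      :", sorted(missing_pairs))
print("max pair count     :", max(coverage.values()))
print("copies per fibre   :", sorted(set(copies.values())))
```

Read this way, every one of the 120 fibre pairs meets in exactly one group, and every fibre appears in 5 groups, which matches the 5 copies and 20 board sets quoted on the slide.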

10 20 CB Hardware Assessment
PRO
- very little FPGA design work on PFB
- system interfaces all tested and working
- use is made of all purpose-built boards
CON
- another build of ~10 CB’s (and CB/RTM’s) necessary (~120 K$)
- another build of ~10 PFB’s (and PFB/RTM’s)
- non-trivial changes to FPGA code on CB’s to implement an LTA

11 18 CB System
- split system into thirds, each getting 8 coarse chans
- each PFB gets 8 input fibres (need to do deskew)
- routing logic on CB’s changes, CMAC’s same

12 18 CB Hardware Assessment
PRO
- relatively minor FPGA design work on PFB
- modest amount of change to FPGA code on CB’s
- system interfaces all tested and working
- use is made of all purpose-built boards
CON
- another build of ~10 CB’s (and CB/RTM’s) necessary (~120 K$)

13 8 CB System
- Each PFB gets 6 input fibres total, from 2 Rx’s
- Each PFB outputs to 8 different CB’s
- CB uses CMAC design from 512T at only 80% of achieved speed
- CB needs some cleverness in allocating cells to CMAC chips
- LTA could be skipped due to low output rate (10 Hz dump rate)
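
The fibre allocation and the 80% figure are consistent with numbers quoted on earlier slides (48 fibres from 16 Rx’s, and ~6.5 CB’s of compute at the tested 180 MHz). A rough check follows, assuming fibres are spread evenly across boards and that CMAC throughput scales linearly with clock rate; neither assumption is stated in the slides.

```python
# Sketch: consistency check for the 8 CB partition.
# Inputs are figures quoted in the slides; even fibre distribution and
# linear scaling of throughput with clock rate are assumptions.
N_FIBRES = 48
N_RX = 16
N_PFB = 8
N_CB = 8
CB_NEEDED_AT_TESTED_SPEED = 6.5   # "~6.5 CB's" at 115 cells @ 180 MHz
TESTED_CLOCK_MHZ = 180.0

fibres_per_rx = N_FIBRES // N_RX             # fibres leaving each receiver
fibres_per_pfb = N_FIBRES // N_PFB           # fibres entering each PFB
rx_per_pfb = fibres_per_pfb // fibres_per_rx

# Spreading ~6.5 boards' worth of work over 8 boards lowers the speed
# each board has to reach.
speed_fraction = CB_NEEDED_AT_TESTED_SPEED / N_CB
required_clock_mhz = speed_fraction * TESTED_CLOCK_MHZ

print(f"fibres per PFB     : {fibres_per_pfb} (from {rx_per_pfb} Rx's)")
print(f"speed fraction     : {speed_fraction:.0%}")
print(f"implied clock rate : {required_clock_mhz:.0f} MHz")
```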

14 8 CB Hardware Assessment
PRO
- no additional cost for hardware
- relatively minor FPGA design work on PFB
- system interfaces all tested and working
- use is made of all purpose-built boards
CON
- significant amount of modified FPGA code on CB

15 Software Solution
- Put Rx coarse channel data into 10 gigE packets, by (e.g.)
  - modifying AgFo design
  - OTS programmable modules (a la 2PIP)
- F stage in host servers or GPU’s
- Do X stage in multiple GPU’s

16 GPU Correlation
- Wayth et al. (2009) correlated 1 coarse channel for 32T in realtime, using a single Nvidia C1060 GPU
- How can we gain a factor of 24 x 16 = 384 in performance?
  - 4x duty cycle – Wayth’s code did 1 s of processing in 0.19 s
  - 2x memory BW reduction – by using a channel width of 40 kHz a larger block can be fit into shared memory
  - 2x – by using a smaller word size (4 Re + 4 Im bits)
  - Tesla C2050 has triple the shared memory of C1060
  - integer arithmetic uses less shared memory space
  - multiple GPU units in parallel
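
A quick tally shows how much of the factor of 384 is left after the three quantified gains. This is a sketch only: it assumes the listed gains are independent and combine multiplicatively, which the slide does not state, and it makes no attempt to quantify the remaining items.

```python
import math

# Required speed-up relative to the single-GPU, single coarse channel 32T result.
N_COARSE_CHANNELS = 24
PRODUCT_GROWTH = 16          # growth in correlation products, 32T -> 128T
required_gain = N_COARSE_CHANNELS * PRODUCT_GROWTH

# Gains the slide puts a number on. Treating them as independent
# multiplicative factors is an assumption.
quantified_gains = {
    "duty cycle": 4,
    "memory BW reduction (40 kHz channels)": 2,
    "smaller word size (4 Re + 4 Im bits)": 2,
}
combined_gain = math.prod(quantified_gains.values())

# What remains must come from the unquantified items: the newer GPU,
# integer arithmetic, and multiple GPUs in parallel.
residual_gain = required_gain / combined_gain

print(f"required gain : x{required_gain}")
print(f"quantified    : x{combined_gain}")
print(f"still needed  : x{residual_gain:.0f}")
```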

17 GPU Bottlenecks
- NIC input rate
  - max of 7 or 8 Gb/s to Host
- Host → Device BW (set by PCIe bus)
  - PCIe gen 2 x16 spec max of 8 GB/s
- Global memory to processor BW
  - spec max for C2050 is 144 GB/s
- Multiply & accumulate rate
  - spec max for C2050 is 1.01 Tflops (single prec or 32 bit int)
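
Two of these limits can be turned into rough lower bounds using figures from the requirements slide (82.6 Gb/s aggregate input, 1.01 TCMAC/s for the X stage). The sketch below assumes the full aggregate rate has to enter the servers through their NICs, that one CMAC costs 8 real operations (4 multiplies and 4 adds), and that a GPU could sustain its spec maximum. None of these assumptions comes from the slides, and the last one is optimistic.

```python
import math

AGGREGATE_INPUT_GBPS = 82.6        # from the requirements slide
NIC_MAX_GBPS = (7.0, 8.0)          # "max of 7 or 8 Gb/s to Host"
X_STAGE_TCMAC_PER_S = 1.01         # from the requirements slide
GPU_PEAK_TFLOPS = 1.01             # C2050 spec max
OPS_PER_CMAC = 8                   # assumption: 4 real multiplies + 4 real adds

# Servers needed just to take in the data, one NIC per server (assumed).
servers_for_input = [math.ceil(AGGREGATE_INPUT_GBPS / rate) for rate in NIC_MAX_GBPS]

# X-stage load expressed in units of one GPU running at its spec maximum.
gpu_equivalents = OPS_PER_CMAC * (X_STAGE_TCMAC_PER_S / GPU_PEAK_TFLOPS)

print(f"servers for input at 7 / 8 Gb/s : {servers_for_input[0]} / {servers_for_input[1]}")
print(f"X stage in peak-GPU units       : {gpu_equivalents:.1f}")
```

Under these assumptions the NIC input rate, rather than arithmetic throughput, sets the larger server count.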

18 Software Assessment
PRO
- greatest flexibility, as all code is in software
- switched topology allows good match between # of servers and load
- easily expandable
CON
- format conversion to 10 gigE will require some mixture of hardware acquisition and FPGA coding
- acquisition cost of GPU-equipped servers

19 Hybrid System
- modified PFB output stage in INF chip forms 10 gigE packets
- 4 lanes through CX-4 connector to unidirectional optical transceiver
- GPU-equipped servers only do 4+4 bit cross mult & sum
- 8 PFB’s used
  - 6 inputs each
  - 1 stream of 8 Gb/s per PFB output
  - more real-estate
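
The figure of about 8 Gb/s per PFB can be cross-checked against the 256 inputs and 30.72 MHz bandwidth quoted earlier. The sketch assumes critically sampled PFB output (complex sample rate equal to the bandwidth), 4+4 bit samples, an even split of the output across the 8 PFB’s, and no packet overhead; these are assumptions rather than statements from the slides.

```python
# Sketch: payload rate out of each PFB in the hybrid system.
# Critical sampling, even split across PFBs, and zero packet overhead
# are assumptions.
N_INPUTS = 256
BANDWIDTH_HZ = 30.72e6
BITS_PER_COMPLEX_SAMPLE = 4 + 4
N_PFB = 8

total_gbps = N_INPUTS * BANDWIDTH_HZ * BITS_PER_COMPLEX_SAMPLE / 1e9
per_pfb_gbps = total_gbps / N_PFB

print(f"total PFB output : {total_gbps:.1f} Gb/s")
print(f"per PFB          : {per_pfb_gbps:.2f} Gb/s")
```

The payload alone comes close to the 7 or 8 Gb/s NIC input limit quoted on the bottlenecks slide, so packet overhead would need to be kept small.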

20 Incremental Hardware
- ~10 – 12 Supermicro 6016GT 1U servers, with Tesla C2050, 10 gigE NIC, memory, disk
  - ~6 K$ apiece
- Cisco Catalyst 4900M with plug-ins for 24 ports
  - ~10 K$
- transceivers, fibres or cables
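
Summing the priced items gives a rough range for the incremental hardware. The sketch uses only the approximate prices on the slide and leaves out transceivers, fibres and cables, for which no price is given.

```python
# Sketch: incremental cost range from the approximate prices on the slide.
# Transceivers, fibres and cables are excluded because no price is quoted.
SERVER_COUNT_RANGE = (10, 12)
SERVER_COST_KUSD = 6      # "~6 K$ apiece"
SWITCH_COST_KUSD = 10     # "~10 K$"

low_kusd, high_kusd = (
    count * SERVER_COST_KUSD + SWITCH_COST_KUSD for count in SERVER_COUNT_RANGE
)

print(f"servers + switch : ~{low_kusd} to ~{high_kusd} K$")
```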

21 Hybrid Assessment
PRO
- little additional cost to convert data to 10 gigE
- minimal FPGA design work
- relieves GPU of filtering burden
- switched topology allows good match between # of servers and load
- easily expandable
CON
- some risk in unidirectional 10 gigE transceiver mods
- acquisition cost of GPU-equipped servers

22 Level of Effort - none/modest/significant

