Correlator Options for 128T MWA
Cambridge Meeting
Roger Cappallo, MIT Haystack Observatory
2011.6.6

Current Status
- Correlator Hardware Inventory
  - 10 each of v.2 Correlator Boards, PFB Boards, CB/RTM's, PFB/RTM's
  - 2 full-size card cages + 1 small, with power supplies
- e2e simulation software
  - file input packets → module → file output packets
- PFB FPGA Firmware for 32T
  - very limited de-skew capability
  - no inter-board transfer (via mesh backplane)
  - corner-turns specific to 32T case
  - PFB to 10 kHz channels needs no changes

Current Status (cont'd)
- CB FPGA Firmware
  - 32T: operational code uses every other 50 ms interval, though 100% duty cycle code is available
  - 512T: error-free CMAC (only) code for 115 cells working at 180 MHz

128T Correlator Requirements
- 30.72 MHz BW in 24 coarse channels of 1.28 MHz
- 256 inputs; 16 Rx's with 48 fibres
- 82.6 Gb/s aggregate bit rate
- ~32K correlation products
- F stage: ~150 GCMAC/s (12-tap FIR, 40 kHz channels)
- X stage: 1.01 TCMAC/s
- KEY (compared to 32T): same / x4 / x16
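
The headline X-stage figure follows directly from the input count and bandwidth. A minimal check in Python, assuming one complex multiply-accumulate per correlation product per complex sample:

    # Sanity check of the 128T requirements quoted above.
    n_inputs = 256                               # 128 tiles x 2 polarisations
    bw_hz    = 24 * 1.28e6                       # 24 coarse channels of 1.28 MHz = 30.72 MHz
    products = n_inputs * (n_inputs + 1) // 2    # all cross + auto products
    x_cmacs  = products * bw_hz                  # one CMAC per product per complex sample

    print(f"bandwidth           : {bw_hz / 1e6:.2f} MHz")
    print(f"correlation products: {products}")                    # 32896, i.e. ~32K
    print(f"X stage             : {x_cmacs / 1e12:.2f} TCMAC/s")  # ~1.01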

Top Level Choices
- hardware: use current hardware, developing FPGA firmware as necessary
- software: get Rx signals into a standardized format (10 gigE) ASAP; do PFB and correlation in GPU-equipped servers
- hybrid: use existing PFB's for the F stage and to form 10 gigE packets, to be correlated in software

Hardware Solution
- using existing 32T firmware it should take 4 PFB boards and 16 CB's, but the architecture doesn't scale in a fully-parallel sense due to cross-correlations, so it would really take 6 PFB's and 18 CB's, with firmware mods
- unchanged 32T firmware leads to a system with 20 PFB's and 20 CB's!
- using the tested CMAC design (180 MHz) yields enough computation in ~6.5 CB's; the optimal partition appears to be 8 PFB's and 8 CB's
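
The scaling limit is generic to cross-correlation: splitting the inputs into groups still leaves every pair of groups to be correlated against each other. A small illustrative count (not the actual 128T board partitioning):

    # Dividing N inputs into G equal groups leaves G*(G+1)/2 group-pairs,
    # each needing a correlator resource (or a copy of the data), not just G.
    def group_pairs(n_groups: int) -> int:
        return n_groups * (n_groups + 1) // 2

    for g in (2, 4, 8):
        print(f"{g} input groups -> {group_pairs(g)} group-pairs to correlate")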

Brute Force 32T Extension
- group fibres into 3 sets of 16, each covering 8 coarse channels
- replicate each fibre signal into 5 copies
- use a covering table to bring all pairs together
- requires 20 complete board sets, i.e. massive (x5) redundancy of PFB's

20 CB Hardware Assessment
PRO
- very little FPGA design work on PFB
- system interfaces all tested and working
- use is made of all purpose-built boards
CON
- another build of ~10 CB's (and CB/RTM's) necessary (~120 K$)
- another build of ~10 PFB's (and PFB/RTM's)
- non-trivial changes to FPGA code on CB's to implement an LTA

18 CB System
- split the system into thirds, each getting 8 coarse channels
- each PFB gets 8 input fibres (need to do deskew)
- routing logic on CB's changes; CMAC's stay the same

18 CB Hardware Assessment
PRO
- relatively minor FPGA design work on PFB
- modest amount of change to FPGA code on CB's
- system interfaces all tested and working
- use is made of all purpose-built boards
CON
- another build of ~10 CB's (and CB/RTM's) necessary (~120 K$)

8 CB System
- each PFB gets 6 input fibres total, from 2 Rx's
- each PFB outputs to 8 different CB's
- CB uses the CMAC design from 512T at only 80% of achieved speed
- CB needs some cleverness in allocating cells to CMAC chips
- LTA could be skipped due to the low output rate (10 Hz dump rate)

8 CB Hardware Assessment
PRO
- no additional cost for hardware
- relatively minor FPGA design work on PFB
- system interfaces all tested and working
- use is made of all purpose-built boards
CON
- significant amount of modified FPGA code on CB

Software Solution
- put Rx coarse channel data into 10 gigE packets, e.g. by modifying the AgFo design or by using OTS programmable modules (a la 2PIP)
- F stage in host servers or GPU's
- do the X stage in multiple GPU's
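
For orientation, the X stage is nothing more than a cross-multiply-and-accumulate over all input pairs in each fine channel. A toy NumPy version (array sizes are illustrative only; a real GPU kernel would tile this over shared memory):

    import numpy as np

    # Toy X stage: accumulate all input-pair products per fine channel.
    n_inputs, n_chan, n_time = 16, 32, 1000
    x = (np.random.randn(n_time, n_chan, n_inputs)
         + 1j * np.random.randn(n_time, n_chan, n_inputs)).astype(np.complex64)

    # vis[c, i, j] = sum over time of x[t, c, i] * conj(x[t, c, j])
    vis = np.einsum('tci,tcj->cij', x, x.conj())
    print(vis.shape)   # (n_chan, n_inputs, n_inputs); only the i <= j half is kept in practice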

GPU Correlation
- Wayth et al. (2009) correlated 1 coarse channel for 32T in realtime, using a single Nvidia C1060 GPU
- How can we gain a factor of 24 x 16 = 384 in performance?
  - 4x duty cycle: Wayth's code did 1 s of processing in 0.19 s
  - 2x memory BW reduction: by using a channel width of 40 kHz a larger block can be fit into shared memory
  - 2x: by using a smaller word size (4 Re + 4 Im bits)
  - Tesla C2050 has triple the shared memory of the C1060
  - integer arithmetic uses less shared memory space
  - multiple GPU units in parallel
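
Multiplying out the per-GPU gains listed above shows how much must still come from running GPUs in parallel (the residual count below is an inference, not a figure from the slide):

    needed       = 24 * 16        # 384: coarse channels x baseline growth from 32T
    duty_cycle   = 4              # Wayth's code used ~0.19 s per 1 s of data
    mem_bw       = 2              # 40 kHz channels -> larger blocks fit in shared memory
    word_size    = 2              # 4+4 bit samples instead of wider words
    per_gpu_gain = duty_cycle * mem_bw * word_size    # 16x per GPU

    print(f"needed {needed}x; per-GPU gains {per_gpu_gain}x; "
          f"remaining ~{needed / per_gpu_gain:.0f}x from parallel GPUs (and C2050 vs C1060)")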

GPU Bottlenecks
- NIC input rate: max of 7 or 8 Gb/s to the host
- Host → Device BW (set by the PCIe bus): PCIe gen 2 x16 spec max of 8 GB/s
- Global memory ↔ processor BW: spec max for the C2050 is 144 GB/s
- Multiply & accumulate rate: spec max for the C2050 is 1.01 Tflops (single precision or 32-bit int)
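
Dividing the totals from the requirements slide across the coarse channels gives a feel for how close these limits are; the one-channel-per-server split below is an assumption, and ~8 real operations per complex MAC is the usual counting convention:

    total_cmacs = 1.01e12    # X stage, whole array
    total_gbps  = 82.6       # aggregate input bit rate
    n_channels  = 24

    cmacs_per_ch = total_cmacs / n_channels     # ~42 GCMAC/s
    flops_per_ch = cmacs_per_ch * 8             # ~340 Gflops, vs 1.01 Tflops C2050 peak
    gbps_per_ch  = total_gbps / n_channels      # ~3.4 Gb/s, vs 7-8 Gb/s NIC limit

    print(f"per coarse channel: {gbps_per_ch:.1f} Gb/s in, "
          f"{cmacs_per_ch / 1e9:.0f} GCMAC/s ~= {flops_per_ch / 1e9:.0f} Gflops")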

Software Assessment
PRO
- greatest flexibility, as all code is in software
- switched topology allows a good match between # of servers and load
- easily expandable
CON
- format conversion to 10 gigE will require some mixture of hardware acquisition and FPGA coding
- acquisition cost of GPU-equipped servers

Hybrid System
- modified PFB output stage in the INF chip forms 10 gigE packets
  - 4 lanes through a CX-4 connector to a unidirectional optical transceiver
- GPU-equipped servers only do the 4+4 bit cross mult & sum
- 8 PFB's used
  - 6 inputs each
  - 1 stream of 8 Gb/s per PFB output
  - more real-estate
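
As a check on the NIC limit noted earlier, the aggregate into the GPU servers for this option, using the figures above and the server count from the next slide:

    # 8 PFB's, each emitting one 8 Gb/s 10 gigE stream.
    n_pfb, gbps_per_pfb = 8, 8
    total_gbps = n_pfb * gbps_per_pfb             # 64 Gb/s into the switch
    for n_servers in (10, 12):
        print(f"{n_servers} servers -> {total_gbps / n_servers:.1f} Gb/s each "
              f"(NIC limit ~7-8 Gb/s)")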

Incremental Hardware
- ~10-12 Supermicro 6016GT 1U servers, with Tesla C2050, 10 gigE NIC, memory, disk: ~6 K$ apiece
- Cisco Catalyst 4900M with plug-ins for 24 ports: ~10 K$
- transceivers, fibres or cables
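
A rough total from the quantities and unit prices above (transceivers, fibres and cables excluded, since no prices are quoted for them):

    # ~6 K$ per server and ~10 K$ for the switch, from the slide above.
    server_cost, switch_cost = 6_000, 10_000
    for n_servers in (10, 12):
        total = n_servers * server_cost + switch_cost
        print(f"{n_servers} servers: ~{total / 1000:.0f} K$ plus transceivers/cables")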

Hybrid Assessment
PRO
- little additional cost to convert data to 10 gigE
- minimal FPGA design work
- relieves the GPU of the filtering burden
- switched topology allows a good match between # of servers and load
- easily expandable
CON
- some risk in the unidirectional 10 gigE transceiver mods
- acquisition cost of GPU-equipped servers

Level of Effort - none/modest/significant