Design of a Software Correlator for the Phase I SKA
Jongsoo Kim, Cavendish Lab., Univ. of Cambridge & Korea Astronomy and Space Science Institute
Collaborators: Paul Alexander (Univ. of Cambridge), Andrew Faulkner (Univ. of Cambridge), Bong Won Sohn (KASI, Korea), Minho Choi (KASI, Korea), Dongsu Ryu (Chungnam Nat'l Univ., Korea)

PrepSKA
– 7th EU Framework Programme, 01/04/2008 – 31/03/2011(?), institutions
– 09/2009 Manchester WP2 meeting
  – AA, FPAA, WBSF
  – Power issue
  – Correlator

Work Packages

Participating institutions

Correlators for Radio Interferometry
– ASIC (Application-Specific Integrated Circuit)
– FPGA (Field-Programmable Gate Arrays)
– Software (high-level languages, e.g., C/C++)
  – Rapid development
  – Expandability
  – …

Current Status of Software Correlators (SC)
LBA (Australian Long Baseline Array)
– 8 antennas (Parkes, …, 22-64 m, GHz)
– DiFX software correlator (2006; Deller et al. 2007)
VLBA (Very Long Baseline Array)
– 10 antennas (25 m, 330 MHz - 86 GHz)
– DiFX (2009)
GMRT (Giant Metrewave Radio Telescope)
– 30 antennas (45 m, 50 MHz - 1.5 GHz), 32 MHz bandwidth
– ASIC → software correlator (Roy et al. 2010)

Current Status of SC (cont.)
LOFAR (Low Frequency Array)
– LBA (Low Band Antennae): 10-90 MHz
– HBA (High Band Antennae): 110-250 MHz
– IBM Blue Gene/P: software correlation

SKA Timeline

Phase I SKA ( )
– 10% of the final collecting area
– ~300 dishes
– software correlator

Correlation Theorem, FX correlator
– F-step (FT): ~log₂(N_c) operations per sample
– X-step (CMAC): ~N operations per sample

FLOPS of the X-step in the FX correlator
FLOPS ≈ 4 × 8 × N(N+1)/2 × B
– 4 is from the four polarization products
– 8 is from 4 multiplications and 4 additions per complex MAC
– N(N+1)/2 is the number of auto- and cross-correlations with N antennas
If B (bandwidth) = 4 GHz:
– N = 2 → 96×4 GFLOPS = 384 GFLOPS
– N = 100 → 16×4×10^4 GFLOPS = 640 TFLOPS
– N = 300 → 16×9×4×10^4 GFLOPS = 5.76 PFLOPS
– N = 3000 → 16×9×4×10^6 GFLOPS = 576 PFLOPS
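A minimal sketch (not from the talk) of the sizing arithmetic above, using the same constants: 4 polarization products, 8 flops per complex MAC, N(N+1)/2 baselines and B = 4 GHz.

```cuda
#include <cstdio>

// X-step compute rate of an FX correlator:
// 4 polarization products x 8 flops per complex MAC x N(N+1)/2 baselines x B samples/s
static double xstep_flops(double n_ant, double bandwidth_hz)
{
    return 4.0 * 8.0 * n_ant * (n_ant + 1.0) / 2.0 * bandwidth_hz;
}

int main()
{
    const double B = 4.0e9;                      // 4 GHz bandwidth
    const int n_ant[] = {2, 100, 300, 3000};
    for (int n : n_ant)
        printf("N = %4d -> %10.2f TFLOPS\n", n, xstep_flops(n, B) / 1.0e12);
    // N = 2 -> ~0.38 TFLOPS, N = 100 -> ~646 TFLOPS,
    // N = 300 -> ~5.8 PFLOPS, N = 3000 -> ~576 PFLOPS
    return 0;
}
```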

Top500 supercomputer

Fermi Streaming Multiprocessor
Fermi features several major innovations:
– 512 CUDA cores
– NVIDIA Parallel DataCache technology
– NVIDIA GigaThread™ engine
– ECC support
– Performance: > 2 TFLOPS

Design goals
– Connect antennas and computer nodes with a simple network topology
– Use future technology development of HPC clusters
– Simplify programming

Software Correlator for Phase I SKA ('2018)
– per antenna: 2 × 4 × 2 × 4 GHz = 64 Gb/s (2 pols, 4-bit sampling, Nyquist, 4 GHz BW)
– 300 nodes of CPUs + (GPUs), > 20 TFLOPS each
– 100 Gb/s Ethernet
– required BW > 512 Gb/s = 64 GB/s per node

Communication between Computer Nodes I
[Diagram] Each node takes the 2×4×2×B raw stream from one antenna and runs the FFT; after the FFT the node holds all N_c channels of its own antenna as a 2×32×2×B stream.

Communication between Computer Nodes II
[Diagram] After an exchange between the two nodes, each node holds N_c/2 channels from both antennas (2×4×2×B raw in, 2×32×2×B after the FFT).

All-to-All Communication between Computer Nodes
[Diagram] With four nodes each node keeps N_c/4 channels from every antenna after an all-to-all exchange of the FFT output; in the limit N >> 1 the interconnect bandwidth per node is 8 × 2×4×2×B = 128B.
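A small sketch of the data-rate arithmetic behind these diagrams: the raw 4-bit stream per antenna and the 32-bit post-FFT stream that the all-to-all exchange has to carry; the numbers reproduce the 64 Gb/s and 512 Gb/s (= 64 GB/s) figures quoted elsewhere in the talk.

```cuda
#include <cstdio>

int main()
{
    const double B = 4.0e9;                     // bandwidth [Hz]
    // Per antenna: 2 pols x 4 bits x 2 (Nyquist) x B
    const double raw_bps = 2 * 4 * 2 * B;       // 64 Gb/s into each node
    // After the FFT the samples are 32-bit, so the corner-turn traffic per node
    // grows by 32/4 = 8:
    const double fft_bps = 2 * 32 * 2 * B;      // = 8 x raw_bps = 128*B
    printf("raw input per node : %6.1f Gb/s\n", raw_bps / 1e9);
    printf("all-to-all per node: %6.1f Gb/s = %5.1f GB/s\n",
           fft_bps / 1e9, fft_bps / 8e9);
    return 0;
}
```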

Data Flows
[Diagram] In each node the CPU performs the FFT and the 4 → 32 bits/sample conversion; data crosses the PCI-e bus (8 GB/s, gen2 16x) into GPU memory (102 GB/s on a C1060), where the CMAC threads run; nodes are connected by a 40 Gb/s InfiniBand interconnect.
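As an illustration of the 4 → 32 bits/sample step, a hypothetical unpack routine; the actual sample packing (signedness, nibble order, polarization interleaving) is not specified in the talk, so this is only a sketch assuming two's-complement 4-bit samples, two per byte.

```cuda
// Hypothetical host-side unpack of 4-bit samples (two per byte) into 32-bit floats,
// i.e. the "4 -> 32 bits/sample" step ahead of the CPU FFT.
#include <cstddef>

static void unpack4to32(const unsigned char* in, float* out, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; ++i) {
        int lo = in[i] & 0x0F;        // low nibble
        int hi = in[i] >> 4;          // high nibble
        if (lo > 7) lo -= 16;         // sign-extend 4-bit two's complement
        if (hi > 7) hi -= 16;
        out[2 * i]     = (float)lo;   // nibble order is an assumption
        out[2 * i + 1] = (float)hi;
    }
}
```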

AI (Arithmetic Intensity)
Definition: number of operations (flops) per byte
– AI = 8 flops / 16 bytes (R_i, R_j) = 0.5
– AI = 32 flops / 32 bytes (R_i, L_i, R_j, L_j) = 1.0 for a 1x1 tile
– AI = 2.4 for 3x2 tiles
Since the AIs are small numbers, correlation calculations are bounded by the memory bandwidth.
Performance ≈ AI × memory BW (= 102 GB/s)
Wayth+ (2009), van Nieuwpoort+ (2009)
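A CUDA sketch (not the author's kernel) of the 1×1-tile CMAC, showing where the 32 bytes loaded and 32 flops per time sample come from; the data layout and the baseline indexing are assumptions made for the example.

```cuda
#include <cuComplex.h>

// 1x1-tile X-step: per time sample load R_i, L_i, R_j, L_j (4 complex floats = 32 bytes)
// and do 4 complex MACs (32 flops), giving AI = 1.0. One thread per baseline of one
// frequency channel; inputs are assumed already unpacked to complex float.
__global__ void cmac_1x1(const cuFloatComplex* __restrict__ R,   // [n_ant * n_time]
                         const cuFloatComplex* __restrict__ L,   // [n_ant * n_time]
                         cuFloatComplex* vis,                    // [n_baseline * 4]
                         int n_ant, int n_time)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;        // baseline index
    int n_bl = n_ant * (n_ant + 1) / 2;
    if (b >= n_bl) return;

    // map the baseline index to an antenna pair (i <= j)
    int i = 0, rem = b;
    while (rem >= n_ant - i) { rem -= n_ant - i; ++i; }
    int j = i + rem;

    cuFloatComplex rr = make_cuFloatComplex(0.f, 0.f), rl = rr, lr = rr, ll = rr;
    for (int t = 0; t < n_time; ++t) {
        cuFloatComplex ri = R[i * n_time + t], li = L[i * n_time + t];
        cuFloatComplex rj = R[j * n_time + t], lj = L[j * n_time + t];
        // accumulate the four polarization products X_i * conj(X_j)
        rr = cuCaddf(rr, cuCmulf(ri, cuConjf(rj)));
        rl = cuCaddf(rl, cuCmulf(ri, cuConjf(lj)));
        lr = cuCaddf(lr, cuCmulf(li, cuConjf(rj)));
        ll = cuCaddf(ll, cuCmulf(li, cuConjf(lj)));
    }
    vis[4 * b + 0] = rr;  vis[4 * b + 1] = rl;
    vis[4 * b + 2] = lr;  vis[4 * b + 3] = ll;
}
```

Larger tiles reuse each loaded sample across several baselines: a p×q tile loads (p+q)×16 bytes against p×q×32 flops, which gives the quoted AI of 2.4 for a 3×2 tile.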

Measured bandwidth of host-to-device transfers (Tesla C1060)
[Plot] ~70% of the PCI-e 2 bandwidth
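A sketch of the kind of host-to-device bandwidth test behind the plot: time a large pinned-memory cudaMemcpy with CUDA events. The buffer size and repeat count are arbitrary choices, not taken from the talk.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256u << 20;              // 256 MB pinned buffer
    const int    reps  = 20;

    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);                // pinned host memory
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host -> device: %.2f GB/s\n",
           (double)bytes * reps / (ms * 1e-3) / 1e9);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```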

Performance of Tesla C1060 as a function of AI
[Plot] Performance is, indeed, memory-bound. The maximum performance is about 1/3 of the peak performance.

AI for host-device and host-host transfers
– data moved per time sample: N × 16 bytes (R, L) = 16N bytes
– flops per time sample: 4×8×N(N+1)/2 FLOP
– AI = (N+1) FLOP/byte
PCI bus bandwidth:
– PCI-e 2.0: 8.0 GB/s (15 Jan. 2007)
– PCI-e 3.0: 16.0 GB/s (2Q 2010, 2011)
– PCI-e 4.0: 32.0 GB/s (201?)
– PCI-e 5.0: 64.0 GB/s (201?)
Performance [GFLOPS] = PCI BW [GB/s] × AI [FLOP/B]
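The same relation as a tiny sketch, evaluating Performance = PCI BW × AI for the listed PCI-e generations; the antenna counts are just illustrative values.

```cuda
#include <cstdio>

int main()
{
    // Per time sample: 16*N bytes (R,L complex) moved, 4*8*N*(N+1)/2 flops of CMAC
    // work, so AI = N+1 flop/byte and the transfer-bound rate is BW * (N+1).
    const double pcie_gbps[] = {8.0, 16.0, 32.0, 64.0};   // PCI-e 2.0 .. 5.0 [GB/s]
    const int    n_ant[]     = {100, 300};
    for (int n : n_ant) {
        const double ai = n + 1.0;                         // flop/byte
        printf("N = %d (AI = %.0f flop/byte)\n", n, ai);
        for (double bw : pcie_gbps)
            printf("  PCI BW %4.1f GB/s -> up to %8.1f GFLOPS per GPU\n", bw, bw * ai);
    }
    return 0;
}
```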

Benchmark: Performance
[Plot; reference levels: 10 GFLOPS, 100 GFLOPS, 40 Gb/s IB]

Expected performance bounded by the BWs of PCI bus and interconnect

Power Usage and Costing
Computer nodes
– 1.4 kW and 4 k Euro for each server, including 2 × 0.236 kW for the 2 GPUs
– 0.4 MW and 1.2 M Euro for 300 servers
Network switches
– 3.8 kW for IB (40 Gb), 328 ports
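The totals follow directly from the per-server figures; a trivial sketch of the arithmetic:

```cuda
#include <cstdio>

int main()
{
    const int    n_servers      = 300;
    const double kw_per_server  = 1.4;      // includes 2 x 0.236 kW for the GPUs
    const double eur_per_server = 4000.0;
    printf("power: %.2f MW\n",    n_servers * kw_per_server / 1000.0);  // ~0.42 MW
    printf("cost : %.2f M Euro\n", n_servers * eur_per_server / 1e6);   // 1.2 M Euro
    return 0;
}
```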

Technology Development in 2010
– 2Q 2010 (2011): PCI-e 3rd generation
– 2Q (April) 2010: Nvidia Fermi (512 cores, L1/L2 cache)
– (March 29) 2010: AMD 12-core Opteron 6100 processor
– (March 30) 2010: Intel 8-core Nehalem-EX Xeon processor

InfiniBand Roadmap (IBTA)

Software Correlator for Phase I SKA ('2018)
– per antenna: 2 × 4 × 2 × 4 GHz = 64 Gb/s (2 pols, 4-bit sampling, Nyquist, 4 GHz BW)
– 300 nodes of CPUs + (GPUs), > 20 TFLOPS each
– 100 Gb/s Ethernet
– required BW > 512 Gb/s = 64 GB/s per node