Scientific requirements and dimensioning for the MICADO-SCAO RTC

MICADO SCAO in a few numbers
- Frame rate: 1000 Hz
- 1 SH WFS: 840x840 pixels @ 1 kHz, i.e. ~11.3 Gb/s
- 80x80 ssp, say ~10k measurements
- 1 DM: ~5k commands
- Command matrix: ~50M coefficients (10k x 5k)
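
These figures follow from simple arithmetic. A minimal sanity-check sketch (assuming 16-bit pixels and single-precision matrix coefficients, neither of which is stated on the slide):

```cpp
#include <cstdio>

int main() {
    // Assumptions (not on the slide): 16-bit WFS pixels, float32 matrix.
    const double frame_rate = 1000.0;        // Hz
    const double pixels     = 840.0 * 840.0; // SH WFS detector
    const double n_meas     = 10000.0;       // slopes from the 80x80 ssp grid
    const double n_cmd      = 5000.0;        // DM commands

    // 840*840 px * 16 bit * 1 kHz ~= 11.3 Gb/s
    std::printf("WFS data rate: %.1f Gb/s\n",
                pixels * 16.0 * frame_rate / 1e9);
    // 10k measurements x 5k commands = 50M coefficients
    std::printf("Command matrix: %.0fM coefficients (%.0f MB as float32)\n",
                n_meas * n_cmd / 1e6, n_meas * n_cmd * 4.0 / 1e6);
    return 0;
}
```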

MICADO-SCAO RTC
Hard real-time subsystem:
- Baseline: linear control with a matrix-vector multiply (MVM)
- Performance: 100 Gflop/s
- Determinism: 10% jitter at a 1000 FPS frame rate, i.e. a 100 µs requirement
- Technology trade-off study: CPU / GPU / FPGA
Supervision module: optimization of the hard real-time parameters
- Main task: regular control matrix update (SVD)
- Performance: 750 Gflop (but memory-bound)
- Addressed with a standard HPC cluster
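
The 100 Gflop/s figure is just the MVM cost: 2 x 10k x 5k = 1e8 flop per frame, at 1 kHz. To show how small the baseline hard real-time compute path is, a hedged sketch of one frame as a single cuBLAS call (names and data layout are illustrative, not the project's actual code):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// One RTC frame: commands = R * slopes, with R the (n_cmd x n_meas)
// command matrix. 2*n_cmd*n_meas = 1e8 flop/frame; at 1 kHz this is
// the 100 Gflop/s quoted on the slide.
void rtc_frame(cublasHandle_t h,
               const float* d_R,      // n_cmd x n_meas, column-major, on GPU
               const float* d_slopes, // n_meas measurements
               float* d_cmd,          // n_cmd DM commands
               int n_cmd, int n_meas)
{
    const float alpha = 1.0f, beta = 0.0f;
    // A single MVM is the entire compute path of the baseline controller.
    cublasSgemv(h, CUBLAS_OP_N, n_cmd, n_meas,
                &alpha, d_R, n_cmd, d_slopes, 1, &beta, d_cmd, 1);
}
```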

GPU Compute performance
For the computation part, 1 GPU is enough. But MVM is an I/O-bound problem: memory bandwidth is the first performance indicator.

Sizing rule: N_GPU = CEIL(CR / (B / FPP)), where CR is the required compute rate, B the usable memory bandwidth and FPP the bits moved per flop (32 for single precision). For a K40 with ECC off: 100 Gflop/s / (1888 Gb/s / 32) ~= 1.7, hence N_GPU = 2.

| Model | Architecture | Peak SP performance (TFLOPS) | Memory bandwidth (Gb/s) | ECC | Power (W) |
| NVidia Tesla K40 | Kepler | 4.29 | 2304 | Yes | 235 |
| NVidia Tesla K80 | Kepler | 8.5 | 3840 | Yes | 300 |
| NVidia M6000 | Maxwell | 6.07 | 2539 | No | 222 |

Measured bandwidth (Gb/s, fraction of theoretical):
| | K20C | K40 | K80 | M6000 |
| B theo | 1664 | 2304 | 1920 (x2) | 2539 |
| B ECC off | 1400 (84%) | 1888 (82%) | 1536 (x2, 80%) | 1968 (77%) |
| B ECC on | 1200 (72%) | 1664 (72%) | 1360 (x2, 70%) | n/a (no ECC) |
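
A minimal sketch of this sizing rule applied to the measured bandwidths from the table (the 32-bits-moved-per-flop accounting is as reconstructed above and should be treated as an assumption):

```cpp
#include <cmath>
#include <cstdio>

// Slide's sizing rule: the MVM streams one 32-bit matrix word per flop,
// so a GPU with B Gb/s of usable bandwidth sustains at most B/32 Gflop/s.
int n_gpus(double cr_gflops, double b_gbits)
{
    const double fpp = 32.0; // bits moved per flop (float32)
    return (int)std::ceil(cr_gflops / (b_gbits / fpp));
}

int main()
{
    // Measured ECC-off bandwidths (Gb/s) from the table above.
    std::printf("K40 : %d GPUs\n", n_gpus(100.0, 1888.0));      // ceil(1.7) = 2
    std::printf("K80 : %d GPUs\n", n_gpus(100.0, 2 * 1536.0));
    return 0;
}
```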

GPU Compute performance
- Need for a tailored data acquisition process
- See the Denis Perret & Julien Bernard presentations tomorrow
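
One common way to build such a tailored acquisition path (not necessarily the approach of the cited presentations) is to double-buffer the WFS frames and overlap the pixel upload with the compute stream. A minimal sketch; process_frame is a hypothetical placeholder for the centroiding + MVM kernels:

```cpp
#include <cuda_runtime.h>

// Double-buffered acquisition: upload frame k+1 on copy_s while the
// compute kernels for frame k run on comp_s.
void acquisition_loop(unsigned char* h_pix[2], // pinned host buffers
                      unsigned char* d_pix[2], // device buffers
                      size_t frame_bytes, int n_frames)
{
    cudaStream_t copy_s, comp_s;
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&comp_s);

    // Prime the pipeline with frame 0.
    cudaMemcpyAsync(d_pix[0], h_pix[0], frame_bytes,
                    cudaMemcpyHostToDevice, copy_s);
    cudaStreamSynchronize(copy_s);

    for (int k = 0; k < n_frames; ++k) {
        int cur = k & 1;
        if (k + 1 < n_frames) // prefetch next frame; overlaps compute below
            cudaMemcpyAsync(d_pix[cur ^ 1], h_pix[cur ^ 1], frame_bytes,
                            cudaMemcpyHostToDevice, copy_s);
        // process_frame<<<grid, block, 0, comp_s>>>(d_pix[cur], ...);
        cudaStreamSynchronize(comp_s); // frame k done -> commands out
        cudaStreamSynchronize(copy_s); // frame k+1 resident on the GPU
    }
    cudaStreamDestroy(copy_s);
    cudaStreamDestroy(comp_s);
}
```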

CPU Compute performance
- Intel Xeon CPU / Xeon Phi
- Between 100 and 280 GB/s memory bandwidth per processing unit
- Up to 5 CPUs or 2 Xeon Phis needed (see the David Barr presentation tomorrow)
- Reference solution using standard libraries and portable code
- Vectorization needed on Xeon Phi to reach peak performance (see the sketch below)
- Long-term availability of Xeon Phi questionable
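
For flavor, a hand-rolled portable MVM (in practice the reference solution would call an optimized BLAS sgemv from a standard library); the point is the loop structure whose inner loop must vectorize to approach peak bandwidth on Xeon Phi:

```cpp
// Portable reference MVM: y = R * s, with R row-major (n_cmd x n_meas).
// The inner loop over j is contiguous in memory and vectorizes; on
// Xeon Phi this vectorization is what closes the gap to peak.
void mvm_ref(const float* R, const float* s, float* y,
             int n_cmd, int n_meas)
{
    #pragma omp parallel for
    for (int i = 0; i < n_cmd; ++i) {
        float acc = 0.0f;
        #pragma omp simd reduction(+:acc)
        for (int j = 0; j < n_meas; ++j)
            acc += R[(size_t)i * n_meas + j] * s[j];
        y[i] = acc;
    }
}
```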

FPGA Compute performance
ARRIA 10AS066 SoC:
- 1.5 GHz ARM dual-core Cortex-A9 co-processor
- 1855 DSP blocks
- 5.2 MB internal RAM
- max. 48 XCVR links
ARRIA 10AX115:
- 1518 DSP blocks
- 6.6 MB internal RAM
- max. 96 XCVR links
40 GMAC/s peak performance (80 Gflop/s) per device, so against the 100 Gflop/s requirement, 2 boards are needed to meet the performance specification.

ESO specific constraints
- ESO is gathering instrument specifications to build a standardized AO RTC approach (a SPARTA evolution?)
- ESO will define an RTC toolkit for the instrument consortia
- E-ELT specificities must be taken into account: e.g. M4 control, handover between the telescope control system and the AO system, ...
- Instrument calibration issues: limited daytime slots for instrument calibration may impose efficient strategies with strong computing constraints

Conclusions
- The MICADO Preliminary Design Phase has started
- Dimensioning an RTC for the SCAO mode
- Will lead trade-off studies for the RTC technology down-selection
- Interaction with ESO needed to be compliant with the RTC toolkit and to design the full RTC pipeline, including M4 control and interaction with the telescope
- Output of the Green Flash project (see the Damien Gratadour presentation) critical for design considerations