Slide 1: Design and Testing of GPU Server based RTC for TMT NFIRAOS
Lianqi Wang, AO4ELT3, Florence, Italy, 5/31/2013. Thirty Meter Telescope Project (TMT.AOS.PRE.13.086.DRF01).

Slide 2: Introduction
Part of the ongoing RTC architecture trade study for NFIRAOS:
– Must handle the MCAO and NGS SCAO modes.
– Iterative algorithms; candidate hardware: FPGA, GPU, etc.
– Matrix-vector multiply (MVM) algorithm, made possible by using GPUs to compute the control matrix; candidate hardware: GPU, Xeon Phi, or just CPUs.
In this talk:
– GPU server based RTC design.
– Benchmarking the 10 GbE interface and the whole real-time chain.
– Updating the control matrix as the seeing changes.

Slide 3: Most Challenging Requirements
LGS WFS reconstruction (the data volumes and timing budget are tallied in the sketch below):
– Process pixels from 6 order-60x60 LGS WFSs: 2896 subapertures per WFS, averaging 70 pixels per subaperture (0.4 MB per WFS, 2.3 MB total); CCD readout takes 0.5 ms.
– Compute DM commands from the gradients: 6981 active (7673 total) DM actuators driven from ~35k gradients, using either iterative algorithms or matrix-vector multiply (MVM).
– At 800 Hz, everything has to finish in ~1.25 ms.
This talk focuses on MVM, in GPUs.
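For orientation, here is a minimal back-of-the-envelope check of the slide's data volumes and frame budget; the 16-bit pixel depth is an assumption, not stated on the slide.

```c
/* Rough check of the slide's numbers. Assumes 16-bit raw pixels. */
#include <stdio.h>

int main(void) {
    const int n_wfs        = 6;      /* order 60x60 LGS WFSs */
    const int subaps       = 2896;   /* subapertures per WFS */
    const int pix_per_sub  = 70;     /* average pixels per subaperture */
    const int bytes_per_px = 2;      /* assumption: 16-bit pixels */
    const double frame_hz  = 800.0;

    double mb_per_wfs = (double)subaps * pix_per_sub * bytes_per_px / 1e6;
    printf("pixels per WFS frame : %.2f MB\n", mb_per_wfs);          /* ~0.41 MB */
    printf("pixels per RTC frame : %.2f MB\n", mb_per_wfs * n_wfs);  /* ~2.4 MB  */
    printf("frame period         : %.2f ms\n", 1e3 / frame_hz);      /* 1.25 ms  */
    return 0;
}
```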

Slide 4: Number of GPUs for Applying MVM
Table comparing candidate devices (GTX-class GPU, Xeon Phi) by number of GPUs, memory (MB), compute (GFlops), device memory bandwidth (GB/s) and PCI-E bandwidth (GB/s), with theoretical versus achieved figures color coded (red: not achievable, yellow: nearly achievable, green: achievable). The listed theoretical device memory bandwidths are 192 GB/s for the GTX and 320 GB/s for the Xeon Phi; the measured values did not survive in this transcript.
– 925 MB control matrix.
– Assuming 1.00 ms total time → 2 GPUs per WFS → a dual-GPU board.
1: mance-briefs/xeon-phi-product-family-performance-brief.pdf
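Because MVM reads the whole control matrix from device memory every frame, the GPU count is driven largely by device memory bandwidth. The sketch below reproduces the 925 MB figure and the resulting per-GPU streaming bandwidth under the slide's stated assumptions (single-precision storage, 2 GPUs per WFS, 1.0 ms MVM window); it is illustrative arithmetic, not the trade-study code.

```c
/* Sizing sketch: single-precision MVM control matrix and per-GPU bandwidth. */
#include <stdio.h>

int main(void) {
    const double n_grad = 2.0 * 2896 * 6;   /* 34752 gradients (x/y per subap, 6 WFSs) */
    const double n_act  = 6981;             /* active DM actuators */
    const double bytes  = 4;                /* single precision */

    double matrix_mb = n_grad * n_act * bytes / (1024.0 * 1024.0);
    printf("control matrix       : %.0f MiB\n", matrix_mb);      /* ~925 MiB */

    const int    n_gpu   = 12;              /* 2 GPUs per WFS x 6 WFSs */
    const double t_mvm_s = 1.0e-3;          /* assumed 1.0 ms MVM window */
    double gibps = matrix_mb / 1024.0 / n_gpu / t_mvm_s;
    printf("per-GPU streaming BW : %.0f GiB/s\n", gibps);         /* ~75 GiB/s */
    return 0;
}
```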

Slide 5: A Possible GPU RTC Architecture (architecture diagram).

Slide 6: LGS WFS to RTC Interface
– 10GbE surpasses the requirement of 800 MB/s; it is widely deployed and low cost.
– Alternatives: sFPDP, Camera Link, etc.
– Testing with our Emulex 10 GbE PCI-E board: plain TCP/IP, no special features.

Slide 7: A Sends 0.4 MB of Pixels to B (100,000 Frames)
– Latency: median 0.465 ms → 880 MB/s.
– Jitter: 13 µs RMS, 115 µs peak-to-valley.
– CPU load: 100% of one core.
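As an illustration of what such a measurement can look like, here is a minimal sketch of the receive side of a 0.4 MB-per-frame transfer over plain TCP, timing how long each complete frame takes to arrive. Socket setup (socket/bind/listen/accept) and error handling are omitted, and FRAME_BYTES, N_FRAMES and receive_frames() are hypothetical names, not the presentation's code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/types.h>
#include <sys/socket.h>

#define FRAME_BYTES (400 * 1024)   /* ~0.4 MB of WFS pixels per frame */
#define N_FRAMES    100000

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
}

void receive_frames(int conn_fd) {
    char *buf = malloc(FRAME_BYTES);
    for (int i = 0; i < N_FRAMES; i++) {
        double t0 = now_ms();
        size_t got = 0;
        while (got < FRAME_BYTES) {            /* TCP may deliver partial reads */
            ssize_t n = recv(conn_fd, buf + got, FRAME_BYTES - got, 0);
            if (n <= 0) { perror("recv"); exit(1); }
            got += (size_t)n;
        }
        double dt = now_ms() - t0;             /* time to receive one frame */
        printf("%d %.3f\n", i, dt);            /* post-process for median/RMS/p-v */
    }
    free(buf);
}
```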

Slide 8: B Sends 0.4 MB of Pixels to A (100,000 Frames)
– Latency: median 0.433 ms → 950 MB/s.
– Jitter: 6 µs RMS, 90 µs peak-to-valley.

Slide 9: Server with GPUs
Block diagram of a 1U or 2U server: two CPUs with memory, a K10 GPU module (each K10 module has two GPUs) on a PCIe x16 link, a 10GbE adapter on PCIe x8, plus built-in 10GbE; external links go to the WFS DC (not accessible in the 1U form), the RPG and the central server.

Slide 10: Simulation Scheme: Benchmarking the End-to-End Latency to Qualify this Architecture
Diagram: Server A emulates one LGS WFS camera and the central server; Server B emulates 1/6 of the RTC. Pixels flow from A to B, and DM commands flow back from B to A, over TCP over 10GbE.

Slide 11: Benchmarking Hardware Configuration
– Server A: dual Xeon.
– Server B: Intel Core i-series CPU, with either two NVIDIA GTX 580 GPU boards or one NVIDIA GTX 590 dual-GPU board.
– Emulex OCe11102-NX PCIe 10GBase-CR adapters, connected back to back.

Slide 12: Benchmarking Software Configuration
Server optimizations for low latency (the per-process part is sketched in code below):
– Run a real-time preempt Linux kernel (rt-linux-3.6); CPU affinity set to core 0; lock all memory pages; scheduler SCHED_FIFO at maximum priority; priority -18.
– Disable hyper-threading.
– Disable virtualization technology and VT-d in the BIOS (critical).
– Disable all non-essential services; no other tasks running.
– CUDA 4.0 C runtime library.
– Vanilla TCP/IP.
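A minimal sketch of the per-process part of this configuration (core pinning, memory locking, SCHED_FIFO at maximum priority). The BIOS and service-level items above are system configuration and are not reproducible from code; setup_realtime() is a hypothetical helper, not the presentation's code.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static int setup_realtime(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                                /* CPU affinity: core 0 */
    if (sched_setaffinity(0, sizeof(mask), &mask)) { perror("affinity"); return -1; }

    if (mlockall(MCL_CURRENT | MCL_FUTURE)) {         /* lock all memory pages */
        perror("mlockall"); return -1;
    }

    struct sched_param sp;
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &sp)) {     /* FIFO at max priority */
        perror("sched_setscheduler"); return -1;
    }
    /* The slide also lists a priority of -18; under SCHED_FIFO the static
     * priority set above is what governs scheduling. */
    return 0;
}
```

Running under SCHED_FIFO and locking all memory both require elevated privileges (root, or the corresponding capabilities and rlimits).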

Slide 13: Pipelining Scheme
Benchmarked for one LGS WFS:
– MVM takes most of the time.
– Memory copying is indeed concurrent with computing.
Timing diagram (not to scale): three streams, synchronized with events, overlap the work within each 1.25 ms frame. While the WFS reads out (0.5 ms), pixels arrive over 10GbE and are copied to the GPU; gradient calculation and the MVM follow; the DM commands are then copied out to the DME, and MVM control-matrix updates are slotted in before the next frame starts. A sketch of this stream/event structure follows.
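Here is a sketch of how such a three-stream, event-synchronized pipeline can be expressed with the CUDA runtime and cuBLAS from host C code. calc_grad_launch() is a hypothetical kernel launcher, and the buffer layout, cuBLAS transpose choice and event placement are illustrative rather than the actual RTC code.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Hypothetical wrapper that launches the centroiding/gradient kernel. */
void calc_grad_launch(const void *pix_dev, float *grad_dev, int n_grad,
                      cudaStream_t stream);

void process_frame(cublasHandle_t blas,
                   const void *pix_host, void *pix_dev, size_t pix_bytes,
                   float *grad_dev, int n_grad,
                   const float *cmat_dev,          /* n_act x n_grad, column major */
                   float *act_dev, float *act_host, int n_act,
                   cudaStream_t s_copy, cudaStream_t s_comp, cudaStream_t s_out,
                   cudaEvent_t pix_ready, cudaEvent_t act_ready)
{
    const float one = 1.0f, zero = 0.0f;

    /* Stream 1: pixels arriving over 10GbE are pushed to the GPU; shown here
     * as a single async copy for brevity (in practice, chunk by chunk). */
    cudaMemcpyAsync(pix_dev, pix_host, pix_bytes, cudaMemcpyHostToDevice, s_copy);
    cudaEventRecord(pix_ready, s_copy);

    /* Stream 2: gradients, then MVM, started as soon as the pixels are in. */
    cudaStreamWaitEvent(s_comp, pix_ready, 0);
    calc_grad_launch(pix_dev, grad_dev, n_grad, s_comp);
    cublasSetStream(blas, s_comp);
    cublasSgemv(blas, CUBLAS_OP_N, n_act, n_grad,
                &one, cmat_dev, n_act, grad_dev, 1, &zero, act_dev, 1);
    cudaEventRecord(act_ready, s_comp);

    /* Stream 3: ship DM commands back to the host (then on to the DME);
     * control-matrix column updates would also be queued here afterward. */
    cudaStreamWaitEvent(s_out, act_ready, 0);
    cudaMemcpyAsync(act_host, act_dev, n_act * sizeof(float),
                    cudaMemcpyDeviceToHost, s_out);
}
```

cudaStreamWaitEvent lets the compute stream start the moment the pixel copy completes without stalling the host thread, which is how the copy/compute overlap shown in the diagram is obtained.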

Slide 14: End-to-End Timing for 100 AO Seconds (2x GTX 580)
– Latency mean: 0.85 ms.
– Jitter: 12 µs RMS, 114 µs peak-to-valley.

Slide 15: End-to-End Timing for 9 AO Hours (2x GTX 580)
– Latency mean: 0.85 ms.
– Jitter: 13 µs RMS, 144 µs peak-to-valley.

Slide 16: End-to-End Timing for 100 AO Seconds (GTX 590, dual-GPU board)
– Latency mean: 0.96 ms.
– Jitter: 9 µs RMS, 111 µs peak-to-valley.

Slide 17: Summary
For one LGS WFS, with one NVIDIA GTX 590 board and a 10 GbE interface, less than 1 ms elapses from the end of exposure to DM commands ready, covering:
– Pixel readout and transport (0.5 ms).
– Gradient computation.
– MVM computation.
The RTC is made up of 6 such subsystems, plus one more for soft real-time or background tasks, plus one more as an (online) spare.

Slide 18: Compute the Control Matrix
The control matrix is computed by solving the minimum variance reconstructor in GPUs:
(34752 x 6981) = (34752 x 62311) x (62311 x 62311)^-1 x (62311 x 6981) x (6981 x 6981)^-1
– The Cn2 profile is used for regularization, so it needs to be updated frequently.
– The control matrix then needs to be updated as well, using warm-restart FDPCG → how many iterations? (The warm-restart idea is illustrated below.)
– Benchmarking results of the iterative algorithms in GPUs follow.
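To illustrate the warm-restart idea, here is a plain (unpreconditioned) conjugate gradient sketch in C for a small dense SPD system A x = b: starting from the previous solution instead of zero means far fewer iterations are needed when A and b have changed only slightly, e.g. under a slowly evolving Cn2 profile. The real reconstructor uses Fourier-domain preconditioning (FDPCG) on GPUs and much larger, structured operators; only the control flow carries over.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

static void matvec(const double *A, const double *x, double *y, int n) {
    for (int i = 0; i < n; i++) {
        double s = 0;
        for (int j = 0; j < n; j++) s += A[i * n + j] * x[j];
        y[i] = s;
    }
}

/* x holds the warm-start guess on entry and the refined solution on exit. */
int cg_warm(const double *A, const double *b, double *x, int n,
            int max_iter, double tol) {
    double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p);
    double *Ap = malloc(n * sizeof *Ap);
    matvec(A, x, r, n);
    for (int i = 0; i < n; i++) r[i] = b[i] - r[i];   /* r = b - A*x0 */
    memcpy(p, r, n * sizeof *r);
    double rr = 0;
    for (int i = 0; i < n; i++) rr += r[i] * r[i];
    int k;
    for (k = 0; k < max_iter && sqrt(rr) > tol; k++) {
        matvec(A, p, Ap, n);
        double pAp = 0;
        for (int i = 0; i < n; i++) pAp += p[i] * Ap[i];
        double alpha = rr / pAp;
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = 0;
        for (int i = 0; i < n; i++) rr_new += r[i] * r[i];
        for (int i = 0; i < n; i++) p[i] = r[i] + (rr_new / rr) * p[i];
        rr = rr_new;
    }
    free(r); free(p); free(Ap);
    return k;   /* iterations actually used */
}
```

In the reconstructor-update context, the control matrix computed for the Cn2 profile 10 seconds earlier plays the role of the warm start, which is why FD10/FD20 warm is compared against FD100 cold on the later slides.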

Slide 19: Benchmarking Results of Iterative Algorithms for Tomography
– CG: conjugate gradients.
– FD: Fourier-domain preconditioned CG.
– OSn: oversampling n tomography layers (1/4 m spacing).
Table of timing (ms) and incremental WFE (nm) for the CG30 and FD1/FD2/FD3 variants at the various oversampling settings; the numerical entries did not survive in this transcript.

Slide 20: How Often to Update the Reconstructor?
– Tony Travouillon (TMT) computed the covariance function of the Cn2 profile from 5 years of site-testing data: one function per layer, independent between layers. The measurements have a minimum separation of 1 minute → interpolate the covariance to a finer time scale.
– Generate a Cn2 profile time series at 10 second separation; pick a case every 500 seconds (60 cases total over an 8 hour duration) and run MAOS simulations (2000 time steps) with:
– the control matrix initialized from the "true" Cn2 profile (FD100 cold start), and
– the control matrix updated using results from 10 seconds earlier (FD10 or FD20 warm restart).

Slide 21: Distribution of Seeing of the 60 Cases
Plot: for each case, r0 together with the r0 from 10 seconds earlier.

Slide 22: Performance of MVM (FD100 Cold) vs. CG30 (Baseline Iterative Algorithm)
Plot comparing the two; in only one point is MVM worse than CG30.

Slide 23: MVM Control Matrix Update (FD10/FD20 Warm vs. FD100 Cold)
Plot comparing warm and cold restarts; 8 GPUs can do FD10 in 10 seconds.

Slide 24: Conclusions
We demonstrated that the end-to-end hard real-time processing of the RTC can be done using 10 GbE and GPUs:
– Pixel transfer over 10GbE.
– Gradient computation and MVM in one GPU board per WFS.
– Total latency is 0.95 ms; peak-to-valley jitter is ~0.1 ms.
Control matrix update:
– FD10 warm restart is largely sufficient for a Cn2 profile that is 10 seconds stale.

Slide 25: Acknowledgements
The author gratefully acknowledges the support of the TMT collaborating institutions. They are:
– the Association of Canadian Universities for Research in Astronomy (ACURA),
– the California Institute of Technology,
– the University of California,
– the National Astronomical Observatory of Japan,
– the National Astronomical Observatories of China and their consortium partners,
– and the Department of Science and Technology of India and their supported institutes.
This work was supported as well by:
– the Gordon and Betty Moore Foundation,
– the Canada Foundation for Innovation,
– the Ontario Ministry of Research and Innovation,
– the National Research Council of Canada,
– the Natural Sciences and Engineering Research Council of Canada,
– the British Columbia Knowledge Development Fund,
– the Association of Universities for Research in Astronomy (AURA),
– and the U.S. National Science Foundation.

Slide 26: Other Tasks
Copying the updated MVM matrix to the RTC (see the sketch below):
– Do so after the DM actuator commands are ready.
– Measured 0.1 ms for 10 columns, so 579 time steps to copy 5790 columns.
Collecting statistics to update the matched filter coefficients:
– Do so after the DM actuator commands are ready, or in the CPUs.
– Complete before the next frame arrives.
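A sketch of how the column-by-column matrix refresh described above can be trickled out over successive frames. COLS_PER_FRAME, the column-major layout and the function name are assumptions for illustration; the slide quotes ~0.1 ms for 10 columns, i.e. 579 frames to refresh 5790 columns.

```c
#include <cuda_runtime.h>

#define COLS_PER_FRAME 10

/* Called once per frame, after the DM command copy has been issued. */
void trickle_matrix_update(const float *new_cmat_host, float *cmat_dev,
                           int n_rows, int n_cols, int *next_col,
                           cudaStream_t update_stream)
{
    if (*next_col >= n_cols) return;                 /* update finished */
    int cols = n_cols - *next_col;
    if (cols > COLS_PER_FRAME) cols = COLS_PER_FRAME;
    size_t off   = (size_t)(*next_col) * n_rows;     /* column-major layout */
    size_t bytes = (size_t)cols * n_rows * sizeof(float);
    cudaMemcpyAsync(cmat_dev + off, new_cmat_host + off, bytes,
                    cudaMemcpyHostToDevice, update_stream);
    *next_col += cols;
}
```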