Port AMSS-NCKU Code to GPU
Zhoujian Cao, Academy of Mathematics and Systems Science, CAS
In collaboration with Zhihui Du, Steven Brandt, Frank Loeffler and Quan Yang
International School on Numerical Relativity and Gravitational Waves, Pohang, Korea

Outline
- Motivations from gravitational wave detection
- New parallel mesh refinement numerical scheme
- GPU acceleration for NR
- Summary

The most stringent tests of GR
- "Anomalous" precession of the perihelion of Mercury (1915, v ≈ …)
- Deflection of starlight (1919, v ≈ …)
- Gravitational redshift (1965, v ≈ …)
- Gravitational time delay (1968, v ≈ …)
- Evidence of gravitational waves (1978, v ≈ …)
- Frame-dragging effect (2010, v ≈ …)
- Direct gravitational wave detection (?, v ≈ 1)
GR = Newtonian gravity + PN(v) + PN(v^2) + …
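In equation form, each post-Newtonian order contributes a further relative correction in powers of v/c on top of the Newtonian result; a schematic version of the expansion for the two-body acceleration (standard PN counting, not taken from the slides):

    % Schematic PN expansion: each full PN order is suppressed by a
    % further factor of (v/c)^2, with radiation reaction entering at
    % 2.5PN, i.e. (v/c)^5.
    \mathbf{a} = -\frac{GM}{r^{2}}\,\hat{\mathbf{n}}
      \left[ 1
      + \mathcal{O}\!\left(\frac{v^{2}}{c^{2}}\right)
      + \mathcal{O}\!\left(\frac{v^{4}}{c^{4}}\right)
      + \mathcal{O}\!\left(\frac{v^{5}}{c^{5}}\right)
      + \cdots \right]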

Gravitational wave astronomy
- Search back to the extremely early universe
- Hear the dark universe

Gravitational wave and its detection

Categories of black holes
- Supermassive black holes: M ~ 10^5 to 10^9 Msun
- Stellar-mass black holes: M ~ 1 to 10s Msun
- Intermediate-mass black holes (IMBH): M ~ 10s to 10^5 Msun (mainly in globular clusters) [Farrell et al., Nature 460 (2009) 73; Feng et al., New Astronomy Reviews 55 (2011) 166]

Categories of black holes: binaries (figure slide)

IMBH

IMBH and GW detection (figure slide; mass-ratio labels 1:1000 and 1:1)
- ALIA: Xuefei Gong et al., CQG 28 (2011)
- Advanced LIGO: Abadie et al., PRD 85 (2012)

Data analysis and templates: see Sang Hoon Oh's lecture.

Template model for BBH: ????? (Yi Pan's talk, 2013)

Template models for BBH
- PN templates: for the early inspiral stage
- EOBNR (effective-one-body model combined with numerical relativity): for the full inspiral + merger + ringdown; works well for mass ratios up to 1:8 and for extreme-mass-ratio BBHs, high spin, precession!
- But there is no reliable template for mass ratios from 1:10 to 1:100

PN estimation: starting from a given separation of the two BHs, the number of orbits increases quickly as the mass ratio increases, so the required full-GR numerical simulation grows correspondingly. Compared with 1:1, a 1:100 binary needs roughly 10 times more computational cost.

Computational cost (LSSC cluster II, 128 CPUs, for the last 2 orbits): 1:1 takes 9 days; 1:100 takes 20 days. Computational cost ratio: 1 to 20!

Challenge of large-mass-ratio BBH for NR: compared with 1:1, the computational cost of a 1:100 BBH increases roughly 200 times! A typical 1:1 BBH simulation takes 14 days, so the straightforward approach to 1:100 would need roughly 1 year!

Possible ways out
1. Physics level: approximation methods, such as the self-force framework (but still only first order yet), …
2. Numerical-algorithm level: implicit schemes [R. Lau et al., PRD 84 (2011)], combining Cauchy evolution with null evolution, …
3. Computer level: improve scalability to use more CPUs, use GPUs, …

Mesh refinement scheme: high-resolution mesh grids in the region near the BHs, low-resolution mesh grids in the far region.

Mesh refinement in CFD (result based on PARAMESH). Packages: PARAMESH, GrACE, JASMIN, …

Comparison of NR and CFD
- NR (BHs only): computation at each grid point is expensive, but the functions are quite smooth, so few grid points (hundreds) suffice with high-order finite differences (see the stencil sketch below)
- CFD: computation at each point is cheap, but the fluid dynamics is quite complex (compare the lectures on hydrodynamics), so the grid number is quite large (millions)
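As an illustration of the "few points, high order" trade-off, here is the standard 6th-order centered first-derivative stencil that high-order NR codes typically use; this is a generic sketch, not code from AMSS-NCKU:

    // Standard 6th-order centered first derivative on a uniform grid of
    // spacing h; the stencil needs 3 points on each side of the center.
    double d1_6th(const double* f, int i, double h) {
        return (-f[i-3] + 9.0*f[i-2] - 45.0*f[i-1]
                + 45.0*f[i+1] - 9.0*f[i+2] + f[i+3]) / (60.0*h);
    }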

Mesh refinement scheme adopted by PARAMESH (Level 0, Level 1).

Mesh refinement scheme adopted by PARAMESH: time stepping across Level 0 and Level 1 (t-x diagram).

Mesh refinement scheme for NR (Level 0, Level 1): distribute the data within one level across the available processes.

Mesh refinement scheme for NR: the LS scheme (Level 0, Level 1) [F. Loeffler et al., CQG 29 (2012)].
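A minimal sketch of the LS (level-sequential) subcycling pattern, assuming 2:1 time refinement between levels; all function names here are hypothetical placeholders, not AMSS-NCKU routines:

    void evolve_level(int level)            { /* one step of size dt/2^level (stub) */ }
    void prolong_coarse_to_fine(int level)  { /* fill fine-grid ghost zones (stub)  */ }
    void restrict_fine_to_coarse(int level) { /* copy fine data back to coarse (stub) */ }

    // Level-sequential step: every process works on one level at a time,
    // so the size of each level's grids limits how many processes help.
    void step_recursive(int level, int finest) {
        evolve_level(level);
        if (level < finest) {
            prolong_coarse_to_fine(level);      // boundary data for the fine grid
            step_recursive(level + 1, finest);  // two fine steps cover
            step_recursive(level + 1, finest);  //   one coarse step
            restrict_fine_to_coarse(level + 1);
        }
    }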

Mesh refinement scheme: parallelization limit. A 200x200x200 grid with 6th-order finite differences (8 ghost points for the two sides) supports at most … processes. How about distributing the data on all levels and calculating them in parallel?

Parallel mesh-level algorithm. PX scheme: distribute the data on all levels to all processes; calculate in parallel.

Mesh refinement scheme (PX timing; time runs downward):

procs for lev0 | procs for lev1 | procs for lev2
run            | run            | run
wait           | wait           | run
wait           | run            | run
wait           | wait           | run
run            | run            | run
…              | …              | …

- Strong scaling property, due to more data to distribute
- Resource wasting (L x the procs of LS) due to waiting!
- Calculation speed: 2 times faster!

Parallel mesh-level algorithm. P2 scheme: distribute the data on the finest level to half of the processes, and distribute the data on the other levels (level by level) to the other half; the finest level runs in parallel with the group of other levels, while the other levels are calculated sequentially among themselves (lev0, lev1, lev2; see the sketch below).
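A sketch of the P2-style process split using plain MPI; the halving policy and the work inside each branch are illustrative assumptions:

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // color 0: half of the ranks handle the finest level;
        // color 1: the other half handle all coarser levels.
        int color = (rank < size / 2) ? 0 : 1;
        MPI_Comm level_comm;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &level_comm);

        if (color == 0) {
            // ... advance the finest level in parallel on this half ...
        } else {
            // ... advance lev0, lev1, ... sequentially on the other half ...
        }

        MPI_Comm_free(&level_comm);
        MPI_Finalize();
        return 0;
    }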

Mesh refinement scheme (P2 timing; time runs downward):

procs for lower levels | procs for lev2
lev1                   | run
lev0                   | run
lev1                   | run
wait                   | run
lev1                   | run
…                      | …

- Scaling property is weaker than PX
- Less waiting (2x the procs of LS)!
- Calculation speed: 2 times faster!

Comparison to LS scheme

A more complicated case (t-x diagram, lev0/lev1/lev2): now the processes for the finest level have to wait!

A more complicated case, continued (t-x diagram, lev0/lev1/lev2).

GPU acceleration. For systems biology: Yamazaki & Igarashi, Neural Networks, 2013. For GW data analysis: Zhihui Du et al., CQG 29 (2012).

Put the RHS calculation on the GPU
- For the AMSS-NCKU code, the RHS calculation takes > 80% of the time
- The RHS function involves so many variables that even transferring their addresses is time consuming
- So pack these addresses and store them in constant memory (no further transfer during the evolution), saving shared memory at the same time (sketched below)
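A minimal CUDA sketch of the address-packing idea: the table of device pointers is copied to constant memory once before the evolution, so kernels never receive long pointer argument lists again. NVAR, the variable layout and the kernel body are illustrative assumptions:

    #include <cuda_runtime.h>

    #define NVAR 24                       // illustrative number of variables

    __constant__ double* g_field[NVAR];   // device pointers, set once

    // Copy the address table into constant memory a single time; during
    // the evolution the addresses never need to be transferred again.
    void pack_field_pointers(double* dev_ptrs[NVAR]) {
        cudaMemcpyToSymbol(g_field, dev_ptrs, NVAR * sizeof(double*));
    }

    __global__ void rhs_kernel(int npts) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= npts) return;
        double phi = g_field[0][i];       // read fields via the constant table
        g_field[1][i] = phi;              // placeholder RHS write
    }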

Put the RHS calculation on the GPU
- Keep the data on the GPU until the MPI data transfer between processes
- Use the buffer-point method to reduce the MPI transfers for RK4 from 4 times to only 1; this also reduces the number of data transfers between GPU and CPU (see the sketch below)
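A sketch of the buffer-point idea, under the assumption that the ghost zone is widened enough for all four RK4 substages to be computed locally: one wide halo exchange per full step replaces four narrow ones. Grid, exchange_halo and rk4_substage are hypothetical names:

    struct Grid { /* field arrays, ghost zones, ... */ };

    const int STENCIL_HALF = 3;                  // half-width of the FD stencil

    void exchange_halo(Grid&, int width) { /* MPI halo exchange (stub) */ }
    void rk4_substage(Grid&, int stage)  { /* RHS + RK4 update (stub)  */ }

    // One wide exchange per RK4 step: each substage consumes STENCIL_HALF
    // points of the buffer, so no further MPI (or GPU<->CPU) traffic is
    // needed until the next full step.
    void rk4_step_one_exchange(Grid& g) {
        exchange_halo(g, 4 * STENCIL_HALF);
        for (int s = 0; s < 4; ++s)
            rk4_substage(g, s);
    }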

Put the RHS calculation on the GPU
- Arrange shared memory: divide the RHS calculation into 8 parts, so that each part's memory requirement fits in shared memory
- For one RHS calculation, copy the data from global memory to shared memory once, then work in shared memory most of the time (illustrated below)
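A 1-D CUDA sketch of this shared-memory staging: each block copies its tile plus the stencil halo from global memory once, then all stencil reads hit shared memory. TILE, HALO and the derivative used are illustrative; launch with blockDim.x == TILE:

    #define TILE 128
    #define HALO 3                                   // 6th-order stencil half-width

    __global__ void rhs_part_kernel(const double* __restrict__ f,
                                    double* __restrict__ rhs,
                                    int n, double inv_h) {
        __shared__ double s[TILE + 2 * HALO];
        int gi = blockIdx.x * TILE + threadIdx.x;    // global index
        int li = threadIdx.x + HALO;                 // index into the tile

        if (gi < n) s[li] = f[gi];                   // interior load
        if (threadIdx.x < HALO) {                    // halo loads
            if (gi >= HALO)    s[li - HALO] = f[gi - HALO];
            if (gi + TILE < n) s[li + TILE] = f[gi + TILE];
        }
        __syncthreads();

        if (gi >= HALO && gi < n - HALO) {
            // 6th-order first derivative, reading only from shared memory
            rhs[gi] = (-s[li-3] + 9.0*s[li-2] - 45.0*s[li-1]
                       + 45.0*s[li+1] - 9.0*s[li+2] + s[li+3]) * inv_h / 60.0;
        }
    }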

Put restrict-prolong on the GPU: after moving the RHS to the GPU, the most time-consuming part is the restrict-prolong interpolation. How to treat this part? This work is ongoing.

Test of GPU acceleration on a desktop (figure slide).

OpenMP implementation
- AMSS-NCKU = Fortran90 + C++
- C++ is used for program flow control and memory administration
- Fortran90 is used for the main numerical calculation
- OpenMP directives are added in the Fortran90 segments
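The directives go onto the Fortran90 loops (!$omp parallel do); a C++ analogue of the same pattern, shown only to illustrate the idea:

    #include <omp.h>

    // Threads split the grid loop; in the real code the equivalent
    // directive sits on the Fortran90 numerical loops.
    void update_field(double* rhs, const double* f, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            rhs[i] = f[i];             // placeholder numerical kernel
    }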

Structure of the AMSS-NCKU GPU code: two groups of MPI processes, one for CPUs and one for GPUs. MPI + OpenMP + CUDA.
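A small sketch of how a GPU-group rank might be bound to a device in such a hybrid MPI + OpenMP + CUDA setup; round-robin binding is an assumption, and the CPU/GPU grouping itself can reuse the MPI_Comm_split pattern shown earlier:

    #include <cuda_runtime.h>

    // Bind an MPI rank in the GPU group to one of the node's devices.
    void bind_device(int rank) {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);     // devices visible on this node
        if (ndev > 0)
            cudaSetDevice(rank % ndev);  // simple round-robin binding
    }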

Test of the AMSS-NCKU GPU code on Titan, the top-1 supercomputer in the world at the time (now Tianhe-2): 1024x16 cores plus GPUs.

Summary
- Challenge from GW detection: AdvLIGO: 1:150; ALIA: 1:1000
- Parallel mesh-level calculation method: 2x speed-up
- GPU implementation for NR: roughly 5x speed-up obtained; 30x speed-up? in progress
- A 10x speed-up in all is ready for science simulations