INAF Osservatorio Astrofisico di Catania, “ScicomP 9”, Bologna, March 23-26, 2004. Using LAPI and MPI-2 in an N-body cosmological code on IBM SP.

Presentation transcript:

INAF Osservatorio Astrofisico di Catania, “ScicomP 9”, Bologna, March 23-26, 2004
Using LAPI and MPI-2 in an N-body cosmological code on IBM SP
M. Comparato, U. Becciani, C. Gheller, V. Antonuccio
The N-Body project at OACT
The FLY code
Performance analysis with LAPI and MPI-2
Questions

INAF Osservatorio Astrofisico di Catania
FLY Project
People: V. Antonuccio, U. Becciani, M. Comparato, D. Ferro
Funds: included in the project “Problematiche Astrofisiche Attuali ed Alta Formazione nel Campo del Supercalcolo”, funded by MIUR with more than 500,000 Euros + 170,000 Euros; INAF provides grants on MPP systems at CINECA
Resources: IBM SP, SGI Origin systems, Cray T3E

INAF Osservatorio Astrofisico di Catania
IBM SP POWER3, INAF Astrophysical Observatory of Catania
24 processors at 222 MHz
Global RAM memory: 48 GB
Disk space: 254 GB (72.8 GB HD per node, plus the HD on the cws)
Network topology: SPS scalable Omega switch and Fast Ethernet node interconnection
Bandwidth: 300 MB/s peak bi-directional transfer rate
Programming languages: C, C++, Fortran 90
Parallel paradigms: OpenMP, MPI, LAPI

INAF Osservatorio Astrofisico di Catania
NEW SYSTEM: IBM POWER4 P650, INAF Astrophysical Observatory of Catania
8 processors at 1.1 GHz
Global RAM memory: 16 GB
Memory: 2 GB per processor
Disk array: 254 GB
L2 cache: 1.5 MB
L3 cache: 128 MB

Gravitational N-body problem: the cosmological simulation
The N-body technique allows us to perform cosmological simulations that describe the evolution of the universe. With this method, matter is represented as a set of particles:
each particle is characterized by mass, position and velocity,
the only force considered is gravity,
the evolution of the system is obtained by numerically integrating the equations of motion over a proper time interval.

Gravitational N-body problem: the cosmological simulation
The direct interaction (P-P) method is conceptually the simplest, but it scales as O(N^2), which makes it impossible to run simulations with more than N ~ 10^5 particles. To overcome this problem, tree-based and mesh-based algorithms have been developed, which scale as N log N and N respectively. Even so, only supercomputers and parallel codes allow the user to run simulations with N >= 10^7 particles.
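To make the O(N^2) cost concrete, here is a minimal sketch, not taken from FLY, of direct P-P force summation in Fortran 90; the array names, units and the softening parameter eps are illustrative assumptions.

    ! Minimal direct-summation (P-P) sketch: every pair of particles is
    ! evaluated, hence the O(N**2) cost that limits N to ~1e5 particles.
    subroutine direct_forces(n, mass, pos, acc, eps)
      implicit none
      integer, intent(in)  :: n
      real(8), intent(in)  :: mass(n), pos(3, n), eps
      real(8), intent(out) :: acc(3, n)
      real(8), parameter   :: G = 6.674d-8     ! gravitational constant (cgs, illustrative)
      real(8) :: d(3), r2, inv_r3
      integer :: i, j

      acc = 0.0d0
      do i = 1, n
         do j = 1, n
            if (i == j) cycle
            d = pos(:, j) - pos(:, i)
            r2 = sum(d * d) + eps * eps        ! Plummer softening
            inv_r3 = 1.0d0 / (sqrt(r2) * r2)
            acc(:, i) = acc(:, i) + G * mass(j) * d * inv_r3
         end do
      end do
    end subroutine direct_forces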

FLY: a parallel tree N-body code for cosmological applications
Based on the Barnes-Hut algorithm (J. Barnes & P. Hut, Nature, 324, 1986)
Fortran 90 parallel code
High-performance code for MPP/SMP architectures using a one-sided communication paradigm: SHMEM and LAPI
It runs on Cray T3E systems, SGI Origin and IBM SP
Typical simulations require 350 MB of RAM per 1 million particles

Gravitational N-body problem: the Barnes-Hut algorithm
The particles evolve according to the laws of Newtonian physics:

    \ddot{\mathbf{x}}_i = -G \sum_{j \ne i} \frac{m_j \, \mathbf{d}_{ij}}{|\mathbf{d}_{ij}|^3}, \qquad \mathbf{d}_{ij} = \mathbf{x}_i - \mathbf{x}_j .

Considering a region \Omega, the force component on the i-th particle due to the bodies in \Omega may be computed as

    \mathbf{a}_i^{(\Omega)} = -G \sum_{j \in \Omega} \frac{m_j \, \mathbf{d}_{ij}}{|\mathbf{d}_{ij}|^3} ,

which the Barnes-Hut scheme approximates by a multipole expansion about the centre of mass of \Omega.

Gravitational N-body problem: tree formation
[Figure: 2D domain decomposition and the corresponding tree structure, starting from the root cell.]
The split of each sub-domain is carried out until only one body (a leaf) is contained in each box.
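As an illustration only (not FLY's actual routine), a short Fortran sketch of how a particle is assigned to one of the eight children when a 3D cell is split; the argument names are hypothetical.

    ! Sketch: pick the child octant (1..8) of a cubic cell that contains a
    ! particle, by testing its position against the cell centre.  A cell is
    ! split this way repeatedly until each box holds a single body (a leaf).
    integer function child_octant(p, centre)
      implicit none
      real(8), intent(in) :: p(3), centre(3)
      integer :: k
      child_octant = 1
      do k = 1, 3
         if (p(k) >= centre(k)) child_octant = child_octant + 2**(k - 1)
      end do
    end function child_octant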

Gravitational N-body problem: force computation
The force on any particle is computed as the sum of the forces from the nearby particles plus the forces from the distant cells, whose mass distributions are approximated by multipole series truncated, typically, at the quadrupole order.
A cell is marked (accepted for interaction) if d_cm,i >= Cellsize / θ, where d_cm,i is the distance between particle i and the cell's centre of mass and θ is the opening-angle parameter.
Two phases: the tree-walk procedure and the force computation.
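For reference, the textbook form of such a truncated expansion (monopole plus traceless quadrupole about the cell's centre of mass), written here in LaTeX; this is the standard expression, not copied from FLY's slides:

    \Phi_{\mathrm{cell}}(\mathbf{r}) \;\approx\;
      -\frac{G M}{|\mathbf{r}|}
      \;-\; \frac{G}{2\,|\mathbf{r}|^{5}} \sum_{k,l} Q_{kl}\, r_k r_l ,
    \qquad
    Q_{kl} \;=\; \sum_{j \in \mathrm{cell}} m_j \left( 3\, x_{j,k}\, x_{j,l} - |\mathbf{x}_j|^{2} \delta_{kl} \right)

Here M is the total mass of the cell, r is the vector from the cell's centre of mass to the particle, and x_j are the positions of the cell's bodies relative to that centre of mass; the dipole term vanishes because the expansion is taken about the centre of mass.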

FLY block diagram
SYSTEM INITIALIZATION
TIME STEP CYCLE:
TREE FORMATION and BARRIER
FORCE COMPUTATION (TREE INSPECTION, ACC. COMPONENTS) and BARRIER
UPDATE POSITIONS and BARRIER
STOP
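A minimal sketch of the loop implied by the block diagram; the subroutine names, their empty bodies and the fixed number of steps are placeholders, not FLY's actual interfaces.

    ! Skeleton of the time-step cycle shown in the block diagram.
    program fly_skeleton
      use mpi
      implicit none
      integer :: ierr, step, nsteps

      call MPI_Init(ierr)                          ! system initialization
      nsteps = 10                                  ! placeholder number of steps

      do step = 1, nsteps                          ! time step cycle
         call tree_formation()
         call MPI_Barrier(MPI_COMM_WORLD, ierr)
         call force_computation()
         call MPI_Barrier(MPI_COMM_WORLD, ierr)
         call update_positions()
         call MPI_Barrier(MPI_COMM_WORLD, ierr)
      end do

      call MPI_Finalize(ierr)                      ! stop

    contains
      subroutine tree_formation()        ! build and distribute the tree (stub)
      end subroutine tree_formation
      subroutine force_computation()     ! tree inspection + acc. components (stub)
      end subroutine force_computation
      subroutine update_positions()      ! advance particles by one step (stub)
      end subroutine update_positions
    end program fly_skeleton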

Parallel implementation of FLY: data distribution
Two main data structures: particles and tree.
Particles are distributed in blocks, so that each processor holds the same number of contiguous bodies (e.g., with 4 processors each PE holds one contiguous quarter of the bodies).
The tree structure is distributed in a cyclic way, so that each processor holds the same number of cells.
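A minimal sketch of the two ownership rules just described, assuming for simplicity that the numbers of bodies and cells are multiples of the number of PEs; the function names are illustrative.

    ! Block distribution for bodies: PE p owns the contiguous range
    ! p*nb_local+1 .. (p+1)*nb_local, with nb_local = nbodies/npes.
    integer function body_owner(ib, nbodies, npes)
      implicit none
      integer, intent(in) :: ib, nbodies, npes
      body_owner = (ib - 1) / (nbodies / npes)
    end function body_owner

    ! Cyclic distribution for tree cells: cell ic lives on PE mod(ic-1, npes).
    integer function cell_owner(ic, npes)
      implicit none
      integer, intent(in) :: ic, npes
      cell_owner = mod(ic - 1, npes)
    end function cell_owner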

Parallel implementation of FLY: work distribution
Each processor calculates the force on its local particles. To do that, the whole tree structure (which is distributed among the processors) must be accessed asynchronously, which requires one-sided communications. This leads to a huge communication overhead.

FLY: tips and tricks
In order to reduce the communication overhead, we have implemented several “tricks”:
Dynamical load balancing: processors help each other.
Grouping: close particles share the same interaction with far distributions of mass.
Data buffering.

FLY: data buffering
Data buffering: free RAM segments are dynamically allocated to store remote data (tree cell properties and remote bodies) already accessed during the tree-walk procedure.
Performance improvement (16 million bodies on a Cray T3E, 32 PEs, 156 MB per PE):
Without buffering: each PE executes 700 GET operations for each local body.
With buffering: each PE executes only … GET operations for each local body.
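A minimal sketch of the buffering idea in MPI-2 terms, assuming a simple direct-mapped cache of remote cells and a window tree_win that exposes each PE's cell array with a displacement unit of one double; all names are illustrative, and FLY's actual buffering over SHMEM/LAPI is more elaborate.

    ! Sketch: fetch the properties of a (possibly remote) tree cell, reusing
    ! buffered copies so that repeated accesses cost no further GET operations.
    subroutine get_cell(ic, props, cache_id, cache_val, ncache, ncell_prop, tree_win, npes)
      use mpi
      implicit none
      integer, intent(in)    :: ic, ncache, ncell_prop, tree_win, npes
      real(8), intent(out)   :: props(ncell_prop)
      integer, intent(inout) :: cache_id(ncache)            ! cached cell indices (-1 = empty)
      real(8), intent(inout) :: cache_val(ncell_prop, ncache)
      integer :: slot, owner, ierr
      integer(kind=MPI_ADDRESS_KIND) :: disp

      slot = mod(ic, ncache) + 1
      if (cache_id(slot) == ic) then                        ! hit: no communication at all
         props = cache_val(:, slot)
         return
      end if

      owner = mod(ic - 1, npes)                             ! cyclic cell distribution
      disp  = int((ic - 1) / npes, MPI_ADDRESS_KIND) * ncell_prop
      call MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, tree_win, ierr)
      call MPI_Get(props, ncell_prop, MPI_DOUBLE_PRECISION, owner, disp, &
                   ncell_prop, MPI_DOUBLE_PRECISION, tree_win, ierr)
      call MPI_Win_unlock(owner, tree_win, ierr)            ! props is valid after unlock

      cache_id(slot)     = ic                               ! buffer it for later reuse
      cache_val(:, slot) = props
    end subroutine get_cell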

Why FLYing from LAPI to MPI-2
LAPI is a proprietary parallel programming library (IBM).
Implementing FLY using MPI-2 improves the portability of our code.
The RMA calls introduced in MPI-2 make the porting simple, since there is a direct correspondence between the basic functions:
lapi_get(…)  ->  mpi_get(…)
lapi_put(…)  ->  mpi_put(…)
However, MPI-2 does not have an atomic fetch-and-increment call.
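A minimal sketch of what this correspondence looks like on the MPI-2 side, assuming the local portion of the tree is a plain array of doubles called local_cells that each PE exposes through a window; the names are illustrative.

    ! Expose each PE's local tree data once; afterwards any PE can read it
    ! with MPI_Get, much as it would with lapi_get.
    subroutine create_tree_window(local_cells, n_local, tree_win)
      use mpi
      implicit none
      integer, intent(in)    :: n_local
      real(8), intent(inout) :: local_cells(n_local)
      integer, intent(out)   :: tree_win
      integer :: ierr, sizeof_dbl
      integer(kind=MPI_ADDRESS_KIND) :: winsize

      call MPI_Type_size(MPI_DOUBLE_PRECISION, sizeof_dbl, ierr)
      winsize = int(n_local, MPI_ADDRESS_KIND) * sizeof_dbl
      ! disp_unit = sizeof_dbl, so target displacements are counted in doubles
      call MPI_Win_create(local_cells, winsize, sizeof_dbl, MPI_INFO_NULL, &
                          MPI_COMM_WORLD, tree_win, ierr)
    end subroutine create_tree_window
    ! A remote read then mirrors lapi_get: MPI_Win_lock + MPI_Get + MPI_Win_unlock,
    ! as in the buffering sketch above.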

MPI-2 synchronization
MPI_Win_lock / MPI_Win_unlock mechanism:
when we need the data just after the call;
when only one processor accesses the remote data.
MPI_Win_fence mechanism:
when we can separate non-RMA accesses from RMA accesses;
when all the processors access remote data at the same time.
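A minimal sketch of the fence mechanism (active-target synchronization): every PE enters the epoch together, issues its MPI_Get calls, and the data is only guaranteed after the closing fence. It assumes the tree_win window of the previous sketches, here holding one double per cell; names are illustrative.

    ! Fence-bracketed RMA epoch: collective over the window, and the gathered
    ! values are only safe to read after the second MPI_Win_fence.
    subroutine fetch_cells_fence(idx, nreq, buf, tree_win, npes)
      use mpi
      implicit none
      integer, intent(in)  :: nreq, npes, tree_win
      integer, intent(in)  :: idx(nreq)
      real(8), intent(out) :: buf(nreq)
      integer :: i, owner, ierr
      integer(kind=MPI_ADDRESS_KIND) :: disp

      call MPI_Win_fence(0, tree_win, ierr)          ! open the epoch (all PEs)
      do i = 1, nreq
         owner = mod(idx(i) - 1, npes)               ! cyclic cell distribution
         disp  = int((idx(i) - 1) / npes, MPI_ADDRESS_KIND)
         call MPI_Get(buf(i), 1, MPI_DOUBLE_PRECISION, owner, disp, &
                      1, MPI_DOUBLE_PRECISION, tree_win, ierr)
      end do
      call MPI_Win_fence(0, tree_win, ierr)          ! close: buf is now valid
    end subroutine fetch_cells_fence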

MPI-2 synchronization
The FLY algorithm requires continuous asynchronous access to remote data, so passive-target synchronization is needed: we have to use the lock/unlock mechanism.

MPI-2 drawback
Unfortunately, lock and unlock are usually not implemented efficiently (or they are not implemented at all):
LAM: not implemented
MPICH: not implemented
MPICH2: I am waiting for the next release
IBM AIX: poor performance
IBM TurboMPI2: testing phase

FLY 3.3
Problem: poor performance of the lock/unlock calls.
Workaround: rewrite portions of the code (where possible) to separate non-RMA accesses from RMA accesses, in order to use the fence calls.
Result: the MPI-2 version runs twice as fast.
Why not port these changes back to the LAPI version? This is how FLY 3.3 was born.
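A sketch of what that restructuring amounts to, under the assumptions of the earlier sketches: instead of interleaving one lock/get/unlock per remote cell with the force computation, the tree walk is split into a local-only pass that records the remote cells it will need, a single fence-bracketed epoch that fetches them (e.g. with the fetch_cells_fence routine above), and a local pass that uses the buffered data. collect_remote_requests and compute_forces are hypothetical helpers, not FLY routines.

    ! Restructured force step: non-RMA work and RMA work are kept in
    ! separate phases so that fence synchronization can be used.
    subroutine force_step_restructured(tree_win, npes, max_req)
      use mpi
      implicit none
      integer, intent(in) :: tree_win, npes, max_req
      integer :: needed(max_req), nreq
      real(8) :: remote(max_req)

      ! Phase 1 (no RMA): walk the locally available tree and record the
      ! indices of the remote cells that will be needed.
      call collect_remote_requests(needed, max_req, nreq)

      ! Phase 2 (RMA only): fetch them all in one fence-bracketed epoch.
      call fetch_cells_fence(needed, nreq, remote, tree_win, npes)

      ! Phase 3 (no RMA): finish the force computation on the buffered data.
      call compute_forces(remote, nreq)
    end subroutine force_step_restructured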

FLY 3.2 vs FLY 3.3: 2M particles test
[Charts: timing of the static part (tree generation, cell properties, …) and of the dynamic part (interaction list, force computation, …).]

FLY 3.2 vs FLY 3.3: 2M particles test
[Charts: total simulation time, and scalability as the timing normalized on the number of processors, (t_n · n) / t_1.]

FLY 3.3 vs FLY MPI-2: 2M particles test
[Charts: timing of the static part (tree generation, cell properties, …) and of the dynamic part (interaction list, force computation, …).]

Conclusions
Present:
a low-performance MPI-2 version of FLY (for now);
a more scalable LAPI version of FLY.
Future:
TurboMPI2;
MPICH2 (porting to Linux clusters);
an OO interface to hydrodynamic codes (FLASH).

FLY: dynamical load balancing and grouping
Dynamical load balancing: …
Grouping: …