HPC Components for CCA. Manoj Krishnan and Jarek Nieplocha, Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory.

Presentation transcript:

HPC Components for CCA
Manoj Krishnan and Jarek Nieplocha
Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory

2 HPC Components
- Distributed Arrays Component: Global Arrays (GA)
- Parallel I/O Component: Disk Resident Arrays (DRA)
- One-sided Communication Component: Remote Memory Access (RMA) communication, Aggregate Remote Memory Copy Interface (ARMCI)

3 Distributed Array Component
Based on Global Arrays (GA)
- Core capabilities:
  - dense arrays, 1 to 7 dimensions
  - global rather than per-task view of data structures
  - user control over data distribution: regular and irregular
  - a physically distributed dense array appears as a single, shared data structure
  - global indexing, e.g. A(4,3) rather than buf(7) on task 2 (see the sketch below)
- Ports:
  - GAClassicPort: (direct + indirect) GA methods
  - GADADFPort: distributed array descriptors (DAD) and templates proposed by the Data Working Group of the CCA Forum
  - LinearAlgebraPort (LA): manipulating vectors, matrices, and linear solvers (for TAO)
[Diagram: GA component exposing the GA Classic, LA, and DAD ports]
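The global-indexing idea can be made concrete with a small illustrative C program (not from the slides) using the standard GA calls; the array name, shape, and values are arbitrary:

    #include <stdio.h>
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();

        /* Create a 1000x1000 double array; chunk = {-1,-1} lets GA pick the
           (regular) distribution across all tasks. */
        int dims[2]  = {1000, 1000};
        int chunk[2] = {-1, -1};
        int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);

        /* Global indexing: any task reads or writes A(4,3) by its global
           coordinates, without knowing which task owns that element. */
        int lo[2] = {4, 3}, hi[2] = {4, 3}, ld[1] = {1};
        if (GA_Nodeid() == 0) {
            double val = 3.14;
            NGA_Put(g_a, lo, hi, &val, ld);   /* write A(4,3) from task 0 */
        }
        GA_Sync();

        double out;
        NGA_Get(g_a, lo, hi, &out, ld);       /* every task can fetch A(4,3) */
        printf("task %d sees A(4,3) = %f\n", GA_Nodeid(), out);

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }

These are roughly the operations the GAClassicPort exposes through the CCA framework.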

4 Distributed Arrays
- Data locality and distribution
- Ease of programming
- High performance: gets 5.2 GFLOP/s per CPU out of a 6 GFLOP/s peak

Example: inverting (transposing) a distributed array, MPI vs. GA (see the sketch below)
- MPI:
  - invert the data locally
  - identify where (which process ranks) to send the data
  - find the number of MPI_Recv's to post
  - manipulate the global indices for each Recv (identify where each piece of data fits locally)
  - do the actual data transfer
- GA:
  - invert the data locally
  - do a GA_Put of the transposed patch (inverts the data globally)
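As an illustration of the GA column, here is a rough, hypothetical sketch: each task transposes only its local patch and pushes it to the transposed global coordinates with a single NGA_Put. The helper name and the assumption that the destination array g_b was created with swapped dimensions are mine, not from the slides.

    #include <stdlib.h>
    #include "ga.h"

    /* Transpose g_a into g_b (g_b assumed to exist with swapped dimensions). */
    void ga_transpose(int g_a, int g_b) {
        int me = GA_Nodeid();
        int lo[2], hi[2], ld[1];
        double *buf;

        NGA_Distribution(g_a, me, lo, hi);    /* which patch of A do I own? */
        NGA_Access(g_a, lo, hi, &buf, ld);    /* direct pointer to the local patch */

        int rows = hi[0] - lo[0] + 1, cols = hi[1] - lo[1] + 1;
        double *t = malloc((size_t)rows * cols * sizeof(double));
        for (int i = 0; i < rows; i++)        /* "invert data locally" */
            for (int j = 0; j < cols; j++)
                t[j * rows + i] = buf[i * ld[0] + j];
        NGA_Release(g_a, lo, hi);

        /* "invert data globally": one put to the transposed coordinates. */
        int tlo[2] = {lo[1], lo[0]}, thi[2] = {hi[1], hi[0]}, tld[1] = {rows};
        NGA_Put(g_b, tlo, thi, t, tld);
        free(t);
        GA_Sync();
    }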

5 Parallel I/O Component
Based on Disk Resident Arrays (DRA)
- High-level API for transfer of data between N-dimensional arrays stored on disk and distributed arrays stored in memory
- Uses parallel or local filesystems; hides filesystem issues
- Scalable performance utilizing the local disks of a cluster: more nodes used means more disks available and higher aggregate bandwidth
- Use when:
  - arrays are too big to store in core
  - checkpoint/restart (see the sketch below)
  - out-of-core solvers
- Development (Ohio State collaboration, P. Sadayappan):
  - non-collective I/O
  - data reorganization/layout
  - recent paper at LACSI
[Diagram: transfer between an array in memory and an array on disk(s)]
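A rough sketch of the checkpoint/restart use case, written against the DRA C bindings (DRA_Init, NDRA_Create, NDRA_Write, DRA_Wait, DRA_Close); the type/mode constants, resource limits, and shapes below are assumptions, so check the DRA documentation before relying on them:

    #include "ga.h"
    #include "macdecls.h"
    #include "dra.h"

    /* Checkpoint a distributed global array g_a (assumed 1000x1000 doubles)
       to a disk resident array and wait for the transfer to complete. */
    void checkpoint(int g_a) {
        int dims[2]    = {1000, 1000};   /* global shape of the disk array */
        int reqdims[2] = {1000, 100};    /* typical I/O request (chunk) shape */
        int d_a, request;

        DRA_Init(4, 1e8, 1e9, 1e8);      /* max arrays, array size, disk, memory */
        NDRA_Create(C_DBL, 2, dims, "ckpt", "ckpt.dra", DRA_RW, reqdims, &d_a);

        NDRA_Write(g_a, d_a, &request);  /* asynchronous memory-to-disk transfer */
        /* ... computation could overlap the I/O here ... */
        DRA_Wait(request);

        DRA_Close(d_a);
        DRA_Terminate();
    }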

6 Communication Component
Based on ARMCI (Aggregate Remote Memory Copy Interface)
- Used in Global Arrays, the Rice Co-Array Fortran compiler, Ames GPSHMEM, and Co-Array Python
- Vendor supported (Cray XD1; IBM porting to BG/L)
- One-sided communication (put/get model): Remote Memory Access (RMA), as sketched below
- The CCA component offers language interoperability; only a C interface existed in ARMCI
- Plug-and-play for network drivers using CCA: comm driver components for ARMCI-Elan (Quadrics), ARMCI-GM (Myrinet), ARMCI-VAPI (InfiniBand), ARMCI-Sockets (Ethernet), or any other driver
[Diagram: process P0 performs a put directly into the memory of P1, illustrating the one-sided RMA model]
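To make the put/get model concrete, here is a minimal sketch using the core ARMCI C calls (ARMCI_Init, ARMCI_Malloc, ARMCI_Put, ARMCI_Fence); the buffer size and ring pattern are illustrative, and error checking is omitted:

    #include <mpi.h>
    #include <armci.h>

    #define N 1024

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        ARMCI_Init();

        int me, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Collectively allocate remotely accessible memory; ptrs[p] is the
           address of process p's buffer as seen by the other processes. */
        void *ptrs[nproc];               /* VLA is fine for a sketch */
        ARMCI_Malloc(ptrs, N * sizeof(double));

        double local[N];
        for (int i = 0; i < N; i++) local[i] = me;

        /* One-sided put into the neighbor's memory: no receive is posted there. */
        int neighbor = (me + 1) % nproc;
        ARMCI_Put(local, ptrs[neighbor], N * sizeof(double), neighbor);
        ARMCI_Fence(neighbor);           /* ensure remote completion */
        ARMCI_Barrier();

        ARMCI_Free(ptrs[me]);
        ARMCI_Finalize();
        MPI_Finalize();
        return 0;
    }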

7 Processor Group Issues in Distributed Array Management
- Access to data in components running on different processor groups (see the sketch below)
- Identifying the ranks of processes/threads and group naming in component interfaces
- Data movement and reorganization: an instance of the MxN problem revisited
- For component interoperability, support from the framework would be desirable for:
  - identifying and naming processes/groups
  - distributed and parallel environments, including hybrid ones
  - threads/processes and MPI/PVM issues
[Diagram: components A and B built on MPI and GA, running inside a CCA framework]
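These questions map onto GA's processor-group API. The following sketch (the group split, array shape, and guard logic are my assumptions, not from the slide) creates two disjoint groups of the kind two components might run on, and allocates an array on only one of them:

    #include <stdio.h>
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();

        int me = GA_Nodeid(), nproc = GA_Nnodes();
        int half = nproc / 2;               /* assumes at least 2 processes */

        /* First half of the ranks for "component A", second half for "component B". */
        int listA[half], listB[nproc - half];
        for (int i = 0; i < half; i++)      listA[i] = i;
        for (int i = half; i < nproc; i++)  listB[i - half] = i;
        int grpA = GA_Pgroup_create(listA, half);
        int grpB = GA_Pgroup_create(listB, nproc - half);
        (void)grpB;

        /* Allocate an array that lives only on group A; its ranks are group-local. */
        if (me < half) {
            int dims[2] = {512, 512};
            int g_a = GA_Create_handle();
            GA_Set_data(g_a, 2, dims, C_DBL);
            GA_Set_pgroup(g_a, grpA);
            GA_Allocate(g_a);
            printf("world rank %d is rank %d of group A\n", me, GA_Pgroup_nodeid(grpA));
            GA_Destroy(g_a);
        }
        /* Moving data between arrays owned by grpA and grpB is exactly the
           MxN redistribution problem the slide refers to. */

        GA_Terminate();
        MPI_Finalize();
        return 0;
    }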