November 9, 2000 PDCS-2000 A Generalized Portable SHMEM Library Krzysztof Parzyszek Ames Laboratory Jarek Nieplocha Pacific Northwest National Laboratory.

Presentation transcript:

November 9, 2000 PDCS-2000 A Generalized Portable SHMEM Library Krzysztof Parzyszek Ames Laboratory Jarek Nieplocha Pacific Northwest National Laboratory Ricky Kendall Ames Laboratory

Overview
- Introduction
  - global address space programming model
  - one-sided communication
- Cray SHMEM
- GPSHMEM - Generalized Portable SHMEM
- Implementation Approach
- Experimental Results
- Conclusions

Global Address Space and 1-Sided Communication
- Communication model: the global address space is the collection of the address spaces of the processes in a parallel job; a global address is a pair (address, pid), e.g. (0xf5670, P0) or (0xf32674, P5).
- Message passing: a transfer from P0 to P1 requires a send on P0 and a matching receive on P1.
- One-sided communication: P0 puts data directly into P1's address space; P1 does not issue a receive.
- Hardware examples: Cray T3E, Fujitsu VPP5000
- Language support: Co-Array Fortran, UPC
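A minimal C/MPI sketch of the two-sided half of this picture (the buffer name and size are illustrative, not from the slides): the transfer requires a matching call on both processes, in contrast to the one-sided put shown in the SHMEM sketch a few slides below.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buf[4] = {0};
        if (rank == 0) {
            /* the sender... */
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* ...and the receiver must both participate */
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }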

Motivation: global address space versus other programming models

One-sided communication interfaces
- First commercial implementation - SHMEM on the Cray T3D
  - put, get, scatter, gather, atomic swap
  - memory consistency issues (solved on the T3E)
  - maps well to the Cray T3E hardware - excellent application performance
- Vendor-specific interfaces
  - IBM LAPI, Fujitsu MPlib, NEC Parlib/CJ, Hitachi RDMA, Quadrics Elan
- Portable interfaces
  - MPI-2 1-sided (related but rather restrictive model)
  - ARMCI one-sided communication library
  - SHMEM (some platforms)
  - GPSHMEM -- the first fully portable implementation of SHMEM

History of SHMEM
- Introduced on the Cray T3D in 1993
  - one-sided operations: put, get, scatter, gather, atomic swap
  - collective operations: synchronization, reduction
  - cache not coherent w.r.t. SHMEM operations (problem solved on the T3E)
  - highest level of performance on any MPP at that time
- Increased availability
  - SGI, after purchasing Cray, ported it to IRIX systems and Cray vector systems
    - but not always with full functionality (no atomic ops on vector systems such as the Cray J90)
    - extensions to match more datatypes - the SHMEM API is datatype oriented
  - HPVM project led by Andrew Chien (UIUC/UCSD)
    - ported and extended a subset of SHMEM
    - on top of Fast Messages for Linux (later dropped) and Windows clusters
  - Quadrics/Compaq port to Elan
    - available on Linux and Tru64 clusters with the QSW switch
  - subset on top of LAPI for the IBM SP
    - internal porting tool by the IBM ACTS group at Watson

Characteristics of SHMEM
- Memory addressability
  - symmetric objects
  - stack and heap allocation on the T3D
  - Cray memory allocation routine shmalloc
- Ordering of operations
  - ordered in the original version on the T3D
  - out-of-order on the T3E (adaptive routing); shmem_quiet added
- Progress rules
  - fully one-sided, no explicit or implicit polling by the remote node
  - much simpler model than MPI-2 1-sided: no redundant locking or remote process cooperation
- Figure: P1 calls shmem_put(a, b, n, 0) to copy its local buffer b into the symmetric object a on P0.
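A minimal C sketch of the put in the figure, using classic Cray/SGI SHMEM calls (shmalloc plus a put); the element type, count, and the trailing barrier are illustrative assumptions, not taken from the slides.

    #include <stdlib.h>
    #include <mpp/shmem.h>   /* classic SHMEM header */

    int main(void)
    {
        start_pes(0);                        /* join the SHMEM job */
        int me = _my_pe();
        int n  = 100;                        /* illustrative size */

        /* Symmetric object: shmalloc returns the same address on every PE. */
        long *a = (long *) shmalloc(n * sizeof(long));
        long *b = (long *) malloc(n * sizeof(long));   /* ordinary local buffer */
        for (int i = 0; i < n; i++) b[i] = me;

        if (me == 1)
            shmem_long_put(a, b, n, 0);      /* P1 writes b into a on P0;
                                                P0 takes no action (one-sided) */

        shmem_barrier_all();                 /* make the transfer globally visible */
        shfree(a);
        free(b);
        return 0;
    }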

GPSHMEM
- Full interface of the Cray T3D SHMEM version
- Ordering of operations
- Portability restriction: must use shmalloc for memory allocation
- Extensions for block-strided data transfers
  - the original Cray strided interface (shmem_iget) transfers single elements
  - GPSHMEM adds shmem_strided_get(prem, ploc, rstride, lstride, nbytes, nblock, proc), which moves nblock blocks of nbytes each from remote address prem (stride rstride) on processor proc into local address ploc (stride lstride)
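A hypothetical use of the block-strided extension: fetching a 4-column strip of a remote row-major matrix in one call. The parameter semantics (byte strides, block size in bytes) and the argument types are assumptions read off the figure, not the definitive GPSHMEM interface.

    /* Prototype restated from the slide; the argument types are assumed. */
    void shmem_strided_get(void *prem, void *ploc,
                           int rstride, int lstride,
                           int nbytes, int nblock, int proc);

    #define ROWS 100
    #define COLS 100

    /* The remote source must be a remotely accessible (symmetric) object;
       it is shown as a static global here for brevity, although GPSHMEM
       would normally require shmalloc. */
    double remote_matrix[ROWS][COLS];
    double local_strip[ROWS][4];

    void fetch_strip(int proc)
    {
        shmem_strided_get(&remote_matrix[0][0],    /* prem: remote source    */
                          &local_strip[0][0],      /* ploc: local target     */
                          COLS * sizeof(double),   /* rstride: remote pitch  */
                          4    * sizeof(double),   /* lstride: local pitch   */
                          4    * sizeof(double),   /* nbytes per block       */
                          ROWS,                    /* nblock: one per row    */
                          proc);
    }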

GPSHMEM implementation approach (layered architecture)
- SHMEM interfaces
- GPSHMEM run-time support
- One-sided operations implemented on the ARMCI library; collective operations on a message-passing library (MPI, PVM)
- Underneath, platform-specific communication interfaces (active messages, remote memory copy, threads, shared memory)

ARMCI: portable 1-sided communication library
- Functionality
  - put, get, accumulate (also with noncontiguous interfaces)
  - atomic read-modify-write, mutexes and locks
  - memory allocation operations
- Characteristics
  - simple progress rules - truly one-sided
  - operations ordered w.r.t. the target (ease of use)
  - compatible with message-passing libraries (MPI, PVM)
  - low-level system, no Fortran API
- Portability
  - MPPs: Cray T3E, Fujitsu VPP, IBM SP (uses the vendors' 1-sided operations)
  - clusters of Unix and Windows systems (Myrinet, VIA, TCP/IP)
  - large shared-memory servers: SGI, Sun, Cray SV1, HP
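For comparison with the SHMEM sketches above, a minimal ARMCI fragment using operations listed on this slide (collective memory allocation plus a one-sided put). Buffer sizes and data are illustrative, and exact signatures can vary across ARMCI releases, so treat this as a sketch rather than the definitive API.

    #include <stdlib.h>
    #include <mpi.h>
    #include <armci.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);   /* ARMCI is compatible with MPI and */
        ARMCI_Init();             /* typically runs alongside it      */

        int me, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Collective allocation: ptrs[p] holds the address of the remotely
           accessible buffer on process p. */
        void **ptrs = malloc(nproc * sizeof(void *));
        ARMCI_Malloc(ptrs, 1024);

        if (me == 0 && nproc > 1) {
            double local[8] = {1, 2, 3, 4, 5, 6, 7, 8};
            ARMCI_Put(local, ptrs[1], sizeof(local), 1);  /* one-sided write to process 1 */
            ARMCI_Fence(1);                               /* complete the put at the target */
        }

        MPI_Barrier(MPI_COMM_WORLD);
        ARMCI_Free(ptrs[me]);     /* release the collectively allocated buffer */
        free(ptrs);
        ARMCI_Finalize();
        MPI_Finalize();
        return 0;
    }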

Multiprotocols in ARMCI (IBM SP example)
- Within an SMP node: shared memory, with threads for process/thread synchronization
- Between nodes: remote memory copy
- Active Messages used for noncontiguous transfers and atomic operations
- ARMCI_Malloc() places all of the user's data in shared memory

Experience
- Performance studies
  - GPSHMEM overhead over SHMEM on the Cray T3E
  - comparison to MPI-2 1-sided on the Fujitsu VX-4
- Applications - see paper
  - matrix multiplication on a Linux cluster
  - porting Cray T3E codes

GPSHMEM Overhead on the T3E
- Approach
  - renamed the GPSHMEM calls to avoid conflicts with Cray SHMEM
  - collected latency and bandwidth numbers
- Overhead
  - shmem_put: 3.5 µs
  - shmem_get: 3 µs
  - bandwidth is the same, since GPSHMEM and ARMCI do not add extra memory copies
- Discussion
  - the overhead includes both GPSHMEM and ARMCI
  - it reflects address conversion: searching the table of addresses of allocated objects, which can be avoided when the addresses are identical
- Figure: software layering - GPSHMEM on top of ARMCI on top of Cray SHMEM.

Performance of GPSHMEM and MPI-2 on the Fujitsu VX-4

Conclusions
- Described a fully portable implementation of a SHMEM-like library
  - SHMEM becomes a viable alternative to MPI-2 1-sided
  - good performance is closely tied to ARMCI
  - offers potentially wide portability to other tools based on SHMEM, e.g. Co-Array Fortran
- The Cray SHMEM API is incomplete for strided data structures
  - extensions for block-strided transfers improve performance
- More work with applications is needed to drive future extensions and development
- Code availability: