SCI Networking for Shared-Memory Computing in UPC: Blueprints of the GASNet SCI Conduit
Hung-Hsun Su, Burton C. Gordon, Sarp Oral, and Alan D. George
HPN and UPC Groups, HCS Research Laboratory, University of Florida
IEEE Workshop on HSLN, 16 Nov 2004

Outline
  - Introduction
  - Design
  - Results
  - Updates
  - Conclusions

Introduction
  - Goal: enable and evaluate UPC computing over SCI networks
  - Parallel programming models have emerged to give programmers alternative ways of solving complex, computationally intensive problems:
    - Message passing
    - Shared memory
    - Global address space (GAS)
  - Message passing and shared memory are the two most popular approaches
  - The GAS model is quickly gaining momentum:
    - Programming can be very complex in the message-passing model
    - The shared-memory model does not work well on server clusters
    - The GAS model attempts to provide the best of both worlds: ease of programming and support for various system architectures

Introduction
  - Unified Parallel C (UPC) -- see the sketch after this slide
    - Partitioned GAS language
    - Parallel extension to the ISO C standard
    - Smaller learning curve for people with C experience
  - Available on various platforms
    - Vendor supplied: Cray T3D, T3E; HP AlphaServer SC, SMP, and PA-RISC SMP; under development at IBM and Sun
    - Open source:
      - Berkeley UPC (BUPC): wide variety of architectures and operating systems, including large-scale multiprocessors, PC clusters, and clusters of shared-memory multiprocessors
      - Intrepid GCC-UPC: SGI Origin 2000/3000, IRIX, Altix
      - Michigan Tech UPC (MuPC): Linux clusters
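As a taste of the partitioned global address space style, here is a minimal UPC sketch (an illustrative example, not taken from the slides): a shared array is distributed cyclically across threads, and upc_forall restricts each iteration to the thread that owns the element it touches.

    #include <stdio.h>
    #include <upc.h>

    #define N 1024

    shared int a[N];      /* elements distributed cyclically across all THREADS */

    int main(void) {
        int i;
        /* each thread executes only the iterations whose element it owns */
        upc_forall (i = 0; i < N; i++; &a[i]) {
            a[i] = MYTHREAD;
        }
        upc_barrier;      /* synchronize before any thread reads remote elements */
        if (MYTHREAD == 0)
            printf("a[1] lives on thread %d\n", a[1]);
        return 0;
    }

Compiled with a UPC compiler such as Berkeley upcc, the same source runs unchanged over any GASNet conduit, including the SCI conduit described in this talk.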

Introduction
  - Global Address Space Networking (GASNet)
    - Communication system developed at UCB/LBNL
    - Language-independent and network-independent communication middleware
    - Provides high-performance communication primitives aimed at supporting GAS languages (see the Active Message sketch below)
    - Used by Berkeley UPC, Titanium (a parallel extension of Java), and Intrepid GCC-UPC
    - Supports execution on UDP, MPI, Myrinet, Quadrics, InfiniBand, and IBM LAPI
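Because the core of GASNet is built around Active Messages, a rough sketch of the client-side pattern may help: register a handler table at startup, then send a short (header-only) AM whose handler runs on the destination node. This follows the GASNet 1.x core API from memory, so names and argument counts should be checked against the GASNet specification; error checking is omitted.

    #include <stdio.h>
    #include <gasnet.h>

    #define HELLO_IDX 201    /* client AM handler indices live in the range 128-255 */

    /* runs on the destination node when the short AM arrives */
    static void hello_reqh(gasnet_token_t token, gasnet_handlerarg_t arg0) {
        gasnet_node_t src;
        gasnet_AMGetMsgSource(token, &src);
        printf("node %d: short AM from node %d, arg0 = %d\n",
               (int)gasnet_mynode(), (int)src, (int)arg0);
    }

    int main(int argc, char **argv) {
        gasnet_handlerentry_t htable[] = {
            { HELLO_IDX, (void (*)()) hello_reqh }
        };

        gasnet_init(&argc, &argv);
        gasnet_attach(htable, 1, GASNET_PAGESIZE, GASNET_PAGESIZE);

        if (gasnet_mynode() == 0 && gasnet_nodes() > 1)
            gasnet_AMRequestShort1(1, HELLO_IDX, 42);    /* header-only message */

        /* anonymous barrier so every node polls and runs pending handlers before exit */
        gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_exit(0);
        return 0;
    }

Medium and long AMs follow the same pattern but carry a payload, which the SCI conduit moves with PIO or DMA as described in the design slides.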

Introduction
  - Scalable Coherent Interface (SCI) -- IEEE Std 1596
    - High-performance interconnect for system-area-network (SAN) clusters and embedded systems
    - Allows memory on each node of the system to be accessed by every other node on the network
    - Uses point-to-point links (1D, 2D, or 3D topologies)
    - Low-latency transfers: single-digit microseconds for remote writes, tens of microseconds for remote reads
    - High data rate: 5.3 Gb/s
    - Good fit for GASNet
  - Dolphin SISCI API
    - Standard set of API calls provided by the hardware vendor to access and control the SCI hardware
    - Two modes of transfer (contrasted in the sketch below):
      - PIO: low-latency, shared-memory operation; requires memory importation (maps a portion of virtual memory to a remote memory region)
      - DMA: high-bandwidth, zero-copy operation; no memory importation required
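To contrast the two SISCI transfer modes in code, here is a hypothetical sketch; import_remote_segment, dma_post, and dma_wait are placeholder wrappers, not actual SISCI calls (the real API involves segment creation, connection, mapping, and DMA queues with more arguments). Once a remote segment has been imported, PIO is just a store through the mapped pointer, while DMA is a queued block transfer handled by the adapter.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical wrappers around the vendor API */
    extern volatile uint8_t *import_remote_segment(int node, int seg_id, size_t len);
    extern void dma_post(int node, int seg_id, size_t off, const void *src, size_t len);
    extern void dma_wait(int node);

    void example_transfers(int node, int seg_id, const uint8_t *buf, size_t len) {
        /* PIO: map remote memory into our address space and write to it directly;
         * lowest latency, but every byte is pushed by the CPU as programmed I/O */
        volatile uint8_t *remote = import_remote_segment(node, seg_id, len);
        for (size_t i = 0; i < len; i++)
            remote[i] = buf[i];

        /* DMA: no importation needed; the adapter moves the block itself,
         * giving higher bandwidth and a zero-copy path */
        dma_post(node, seg_id, 0, buf, len);
        dma_wait(node);
    }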

Design
  - GASNet is divided into two layers
    - Core API: narrow interface based on Active Messages (AM)
    - Extended API: provides medium- and high-level operations on remote memory and collective operations; a reference version built solely on the Core API is available, so a successful implementation of the Core API is sufficient for a complete GASNet conduit
  - Core API design
    - Supports the three types of AM messages:
      - Short: header only
      - Medium: header with payload stored in buffer space
      - Long: header with payload stored at a designated address
    - Uses remote writes exclusively to achieve maximum performance
    - Communication regions
      - Command region (layout sketched below)
        - Buffer space for incoming AM requests and replies
        - Able to hold k request/reply pairs; the pairing ensures a deadlock-free system
        - Size of each request/reply buffer = longest AM header size + maximum medium payload size
        - PIO transfer mode
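A hypothetical layout for the command region described above; the constants (K, MAX_HEADER, MAX_MEDIUM) are assumptions for illustration, and the conduit's real structures differ.

    #include <stdint.h>

    #define K            16    /* assumed number of outstanding request/reply pairs */
    #define MAX_HEADER   72    /* assumed longest AM header size, in bytes          */
    #define MAX_MEDIUM 4096    /* assumed maximum medium-AM payload size            */
    #define SLOT_BYTES (MAX_HEADER + MAX_MEDIUM)

    /* Each request buffer is paired with a reply buffer, so a node that receives
     * a request can always emit its reply without waiting for buffer space on the
     * requester -- this pairing is what keeps the system deadlock-free. */
    typedef struct {
        uint8_t request[SLOT_BYTES];
        uint8_t reply[SLOT_BYTES];
    } command_slot_t;

    /* Each node exports one command region per peer; the peer fills the slots
     * remotely using PIO writes. */
    typedef struct {
        command_slot_t slot[K];
    } command_region_t;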

Design
  - Communication regions (cont.)
    - Control region (polling sketched below)
      - Buffer space for message flags, allowing nodes to check for incoming messages locally
      - Message-ready flags: one for each request/reply message; indicates the presence of a particular incoming request/reply message
      - Message-exist flag: one per node; indicates the presence of any incoming message
      - PIO transfer mode
    - Payload region
      - Corresponds to the range of remotely accessible memory specified by the GAS language (user-defined)
      - DMA transfer mode: achieves higher bandwidth and improves the scalability of the system
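A matching sketch of how the control-region flags might be polled (again with hypothetical names, reusing the assumed constant K from the previous sketch): the per-node message-exist flag makes the common "nothing pending" case a single local read.

    #include <stdint.h>

    #define K 16                            /* assumed request/reply pairs per peer */

    typedef struct {
        volatile uint8_t msg_exist;         /* any incoming message from this peer?  */
        volatile uint8_t msg_ready[2 * K];  /* one flag per request/reply buffer;
                                               a nonzero value encodes the AM type   */
    } control_region_t;

    extern void dispatch_am(int node, int slot, uint8_t am_type);   /* hypothetical */

    /* Poll the locally held control regions for all peers and dispatch any AMs found. */
    void poll_incoming(control_region_t *ctrl, int num_nodes) {
        for (int node = 0; node < num_nodes; node++) {
            if (!ctrl[node].msg_exist)      /* cheap local check, no network traffic */
                continue;
            ctrl[node].msg_exist = 0;
            for (int s = 0; s < 2 * K; s++) {
                uint8_t type = ctrl[node].msg_ready[s];
                if (type != 0) {
                    ctrl[node].msg_ready[s] = 0;   /* consume the slot */
                    dispatch_am(node, s, type);
                }
            }
        }
    }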

Design
  - AM communication (long-AM payload transfer sketched below)
    1. Obtain a free slot (tracked locally using an array of flags)
    2. Package the AM header
    3. Transfer the data
       - Short AM: PIO write (header)
       - Medium AM: PIO write (header), PIO write (medium payload)
       - Long AM: PIO write (header); PIO write of the long payload when the payload size is <= 1024 bytes and for the unaligned portion of the payload; DMA write for the rest, in multiples of 64 bytes
    4. Wait for transfer completion
    5. Signal AM arrival with PIO writes (message-ready flag = type of AM; message-exist flag = TRUE)
    6. Wait for the reply/control signal, then free up the remote slot for reuse
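The step-3 decision for long AMs can be summarized in a short sketch; pio_write, dma_write, and dma_wait are hypothetical wrappers, while the 1024-byte crossover and 64-byte DMA granularity are the values given above.

    #include <stddef.h>
    #include <stdint.h>

    #define PIO_DMA_CROSSOVER 1024   /* payloads up to this size are sent entirely by PIO */
    #define DMA_GRANULARITY     64   /* DMA transfers are issued in 64-byte multiples     */

    extern void pio_write(int dest, size_t off, const uint8_t *src, size_t len); /* hypothetical */
    extern void dma_write(int dest, size_t off, const uint8_t *src, size_t len); /* hypothetical */
    extern void dma_wait(int dest);                                              /* hypothetical */

    void send_long_payload(int dest, size_t off, const uint8_t *payload, size_t len) {
        if (len <= PIO_DMA_CROSSOVER) {
            pio_write(dest, off, payload, len);     /* small payload: PIO only */
            return;
        }
        size_t dma_len  = (len / DMA_GRANULARITY) * DMA_GRANULARITY;  /* aligned bulk part */
        size_t tail_len = len - dma_len;
        dma_write(dest, off, payload, dma_len);     /* bulk part: zero-copy DMA */
        if (tail_len)
            pio_write(dest, off + dma_len, payload + dma_len, tail_len);  /* unaligned tail */
        dma_wait(dest);                             /* step 4: wait for transfer completion */
    }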

Results
  - Experimental testbeds
    - SCI, Quadrics, InfiniBand, and MPI conduits (via testbed at UF)
      - Nodes: dual 2.4 GHz Intel Xeons, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
      - SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus
      - Quadrics (Elan): 528 MB/s (340 MB/s sustained) Elan3, using PCI-X, in two nodes with a QM-S16 16-port switch
      - InfiniBand (VAPI): 4x (10 Gb/s, 800 MB/s sustained) InfiniServ HCAs, using PCI-X 100, InfiniIO port switch from Infinicon
      - RedHat 9.0 with gcc compiler V 3.3.2; MPI uses MPICH 1.2.5; Berkeley UPC runtime system 2.0
    - Myrinet (GM) conduit (via testbed at MTU*)
      - Nodes: dual 2.0 GHz Intel Xeons, 2 GB DDR PC2100 (DDR266) RAM
      - Myrinet: 250 MB/s Myrinet 2000, using PCI-X, on 8 nodes connected with a 16-port M3F-SW16 switch
      - RedHat 7.3 with Intel C compiler V 7.1; Berkeley UPC runtime system 2.0
    - ES80 AlphaServer (Marvel)
      - Four 1 GHz EV7 Alpha processors, 8 GB RD1600 RAM, proprietary inter-processor connection network
      - Tru64 5.1B Unix, HP UPC V2.1 compiler
  * Testbed made available courtesy of Michigan Tech

Results
  - Core-level experiments
    - SCI raw (SISCI API)
      - scipp (PIO benchmark, ping-pong)
      - dma_bench (DMA benchmark, one-way)
    - GASNet conduits (GASNet test suite)
      - testam (AM round-trip latency)

Results
  - Extended-level experiments
    - GASNet conduits (GASNet test suite)
      - testsmall (put/get round-trip latency)
      - testlarge (put/get bandwidth)
    - The spike in the Myrinet conduit results was consistent over many runs and does seem to be an anomaly

Updates
  - UPC-level experiments (where # of threads = # of processors)
    - IS (Class A) from the NAS benchmarks: IS (Integer Sort) has lots of fine-grain communication and a low amount of computation
    - DES Differential Attack Simulator: S-DES (8-bit key) cipher (integer-based), a bandwidth-intensive application
  [Charts: IS Benchmark and DES Benchmark]

Conclusions
  - An experimental version of our conduit is available as part of the Berkeley UPC V2.0+ release
  - Despite being limited by the existing SCI driver from the vendor, the conduit achieves performance fairly comparable to the other high-performance networking conduits of GASNet
  - Enhancements to resolve the driver limitations are being investigated in close collaboration with Dolphin:
    - Support access to all virtual memory on a remote node
    - Minimize transfer setup overhead

Questions and Answers

Appendix: GASNet Latency on Conduits

Appendix: GASNet Throughput on Conduits