Optimization of Collective Communication in Intra-Cell MPI - Ashok Srinivasan, Florida State University

Presentation transcript:

Optimization of Collective Communication in Intra-Cell MPI
Ashok Srinivasan, Florida State University

Goals
1. Efficient implementation of collectives for intra-Cell MPI
2. Evaluate the impact of different algorithms on the performance

Collaborators: A. Kumar(1), G. Senthilkumar(1), M. Krishna(1), N. Jayam(1), P.K. Baruah(1), R. Sarma(1), S. Kapoor(2)
(1) Sri Sathya Sai University, Prashanthi Nilayam, India
(2) IBM, Austin

Acknowledgment: IBM, for providing access to a Cell blade under the VLP program

Outline
- Cell Architecture
- Intra-Cell MPI Design Choices
- Barrier
- Broadcast
- Reduce
- Conclusions and Future Work

Cell Architecture
- A PowerPC core (PPE), with 8 co-processors (SPEs), each with a 256 KB local store
- Shared 512 MB - 2 GB main memory; the SPEs access it via DMA
- Peak speeds of 204.8 Gflops in single precision and 14.6 Gflops in double precision across the 8 SPEs
- 204.8 GB/s EIB bandwidth, 25.6 GB/s to main memory
- Two Cell processors can be combined to form a Cell blade with global shared memory
[Figure: DMA put times, and memory-to-memory copy times using the SPE local store vs. memcpy by the PPE]
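Since everything later in the talk hinges on the SPEs reaching main memory only through DMA, a minimal SPU-side sketch of such a transfer follows. It assumes the Cell SDK's spu_mfcio.h interface (compiled with spu-gcc); the buffer size, the tag, and the use of argp to carry the effective address of a 128-byte-aligned main-memory buffer are illustrative assumptions, not details from the talk.

```c
/* Minimal sketch: an SPE pulls one chunk from main memory into its local
 * store, works on it, and puts the result back.  Assumes the Cell SDK
 * (spu_mfcio.h, spu-gcc); argp is assumed to hold the 64-bit effective
 * address of a 16 KB, 128-byte-aligned buffer in main memory. */
#include <spu_mfcio.h>

#define CHUNK 16384                                  /* one DMA, max 16 KB */
static char ls_buf[CHUNK] __attribute__((aligned(128)));   /* in local store */

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    unsigned int tag = 1;                            /* any tag in 0..31 */

    mfc_get(ls_buf, argp, CHUNK, tag, 0, 0);         /* main memory -> LS */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                       /* wait for completion */

    /* ... compute on ls_buf in the local store ... */

    mfc_put(ls_buf, argp, CHUNK, tag, 0, 0);         /* LS -> main memory */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
    return 0;
}
```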

Intra-Cell MPI Design Choices
- Cell features
  - In-order execution, but DMAs can be out of order
  - Over 100 simultaneous DMAs can be in flight
- Constraints
  - Unconventional, heterogeneous architecture
  - SPEs have limited functionality, and can act directly only on local stores
  - SPEs access main memory through DMA
  - Use of the PPE should be limited to get good performance
- MPI design choices (see the sketch after this list)
  - Application data in: (i) local store or (ii) main memory
  - MPI data in: (i) local store or (ii) main memory
  - PPE involvement: (i) active or (ii) only during initialization and finalization
  - Collective calls can: (i) synchronize or (ii) not synchronize
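To make the "MPI data in main memory, PPE only during initialization" option concrete, here is a hypothetical layout for a shared control block that the PPE could set up once and whose effective address each SPE then keeps. The struct name, field names, and sizes are invented for illustration; the talk does not specify this layout.

```c
/* Hypothetical control block kept in main memory.  The PPE fills it in during
 * MPI initialization; afterwards the SPEs read and update it purely via DMA,
 * keeping the PPE out of the critical path.  Names and sizes are illustrative. */
#include <stdint.h>

#define MAX_SPES   16
#define CACHE_LINE 128     /* DMAs perform best on 128-byte-aligned lines */

typedef struct {
    /* one line per SPE so that flag updates by different SPEs do not collide */
    volatile uint32_t barrier_flag[MAX_SPES][CACHE_LINE / sizeof(uint32_t)]
        __attribute__((aligned(CACHE_LINE)));

    /* effective addresses of each rank's application buffer, for the
       "application data in main memory" design choice */
    uint64_t app_buffer_ea[MAX_SPES] __attribute__((aligned(CACHE_LINE)));

    uint32_t num_spes;     /* SPEs participating in the communicator */
} mpi_control_block_t;
```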

Barrier (1)
- OTA List: the "root" receives notification from all others, and then acknowledges through a DMA list
- OTA: like OTA List, but the root notifies the others through individual non-blocking DMAs
- SIG: like OTA, but the others notify the root through a signal register in OR mode
- Degree-k TREE (a generic sketch of the tree structure follows)
  - In each step, a node has k-1 children
  - In the first phase, children notify parents
  - In the second phase, parents acknowledge children
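The degree-k tree's two phases are easiest to see in code. The sketch below shows the same notify-then-acknowledge structure with ordinary C11 atomics between threads on a generic shared-memory machine; it is not the DMA and signal-register implementation the slide describes, and the names and the mapping of ranks to SPEs are invented for illustration.

```c
/* Generic shared-memory sketch of the degree-k TREE barrier:
 * children notify parents, then parents acknowledge children. */
#include <stdatomic.h>

#define MAX_P 16

typedef struct {
    atomic_int arrived[MAX_P];   /* child -> parent notifications */
    atomic_int release[MAX_P];   /* parent -> child acknowledgements */
    int k;                       /* each node has up to k-1 children */
    int nprocs;
} tree_barrier_t;

void tree_barrier(tree_barrier_t *b, int rank)
{
    int k = b->k, P = b->nprocs;
    int parent = (rank == 0) ? -1 : (rank - 1) / (k - 1);
    int first_child = rank * (k - 1) + 1;

    /* Phase 1: wait for all children, then notify the parent. */
    for (int c = first_child; c < first_child + (k - 1) && c < P; c++)
        while (atomic_load(&b->arrived[c]) == 0) ;       /* spin */
    for (int c = first_child; c < first_child + (k - 1) && c < P; c++)
        atomic_store(&b->arrived[c], 0);                 /* reset for reuse */
    if (parent >= 0) {
        atomic_store(&b->arrived[rank], 1);
        while (atomic_load(&b->release[rank]) == 0) ;    /* wait for the ack */
        atomic_store(&b->release[rank], 0);
    }

    /* Phase 2: acknowledge the children. */
    for (int c = first_child; c < first_child + (k - 1) && c < P; c++)
        atomic_store(&b->release[c], 1);
}
```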

Barrier (2)
- PE: consider the SPUs to be a logical hypercube; in each step, each SPU exchanges messages with its neighbor along one dimension
- DIS: in step i, SPU j sends to SPU (j + 2^i) mod P and receives from SPU (j - 2^i) mod P (see the sketch below)
- [Table: Comparison of MPI_Barrier latency (in microseconds) on different hardware: Cell (PE), Xeon/Myrinet, NEC SX-8, SGI Altix BX2]
- Alternatives
  - Atomic increments in main memory: several microseconds
  - PPE coordinates using its mailbox: tens of microseconds
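For DIS, the round structure is shown below, again with generic C11 atomics rather than SPE DMAs. The flag array and names are invented; a production version would add sense reversal so the flags can be reused across consecutive barriers, which is omitted here to keep the sketch short.

```c
/* Sketch of one episode of the dissemination (DIS) barrier: in round i,
 * rank j signals rank (j + 2^i) mod P and waits on the flag written by rank
 * (j - 2^i + P) mod P.  After ceil(log2 P) rounds every rank has, directly or
 * transitively, heard from every other rank.  Flags start at zero. */
#include <stdatomic.h>

#define MAX_P      16
#define MAX_ROUNDS 4                      /* log2(MAX_P) */

atomic_int dis_flag[MAX_ROUNDS][MAX_P];   /* dis_flag[round][rank] */

void dissemination_barrier(int rank, int P)
{
    for (int i = 0, dist = 1; dist < P; i++, dist *= 2) {
        int partner = (rank + dist) % P;

        atomic_store(&dis_flag[i][partner], 1);        /* notify (j + 2^i) mod P */
        while (atomic_load(&dis_flag[i][rank]) == 0)   /* wait for (j - 2^i) mod P */
            ;
    }
}
```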

Broadcast (1)
- OTA: each SPE copies the data to its own location; different shifts are used to avoid hotspots in memory (plot: OTA on 4 SPUs; a sketch of the shifted copy pattern follows)
  - Different shifts on larger numbers of SPUs yield results that are close to each other
- AG: each SPE is responsible for a different portion of the data; different minimum sizes are tried (plot: AG on 16 SPUs)
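The OTA point about shifts is illustrated below: each rank copies the whole message, but starts at a rank-dependent chunk and wraps around, so the ranks do not all touch the same region of memory at the same moment. Plain memcpy stands in for the SPEs' DMA transfers, and the particular shift formula is an assumption; the slide only notes that different shifts were used to avoid hotspots.

```c
/* Sketch of OTA's shifted copy pattern: rank r starts roughly r/nprocs of the
 * way into the message and wraps around, copying chunk by chunk. */
#include <stddef.h>
#include <string.h>

void ota_copy(char *dst, const char *src, size_t len,
              int rank, int nprocs, size_t chunk)
{
    size_t nchunks = (len + chunk - 1) / chunk;
    size_t start   = (nchunks * (size_t)rank) / (size_t)nprocs;  /* the shift */

    for (size_t i = 0; i < nchunks; i++) {
        size_t c   = (start + i) % nchunks;
        size_t off = c * chunk;
        size_t n   = (off + chunk <= len) ? chunk : len - off;
        memcpy(dst + off, src + off, n);   /* a DMA on the real SPEs */
    }
}
```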

Broadcast (2)
- TREEMM: tree-structured Send/Recv type implementation (plot: TREEMM on 12 SPUs)
  - Data for degrees 2 and 4 are close to each other
  - Degree 3 is best, or close to it, for all SPU counts
- TREE: pipelined tree-structured communication based on local stores (plot: TREE on 16 SPUs; a sketch of the pipelining follows)
  - Results are similar to this figure for other SPU counts
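The pipelining behind TREE can be sketched with ordinary MPI point-to-point calls: the message is split into chunks, and an interior node forwards chunk c to its children before receiving chunk c+1 from its parent, so different chunks travel down different levels of the tree at the same time. The real implementation moves the chunks between SPE local stores with DMAs; the degree and chunk size below are illustrative, and "degree" here means the number of children per node (the slide's degree-k tree has k-1 children).

```c
/* Sketch of a pipelined tree broadcast using plain MPI_Send/MPI_Recv. */
#include <mpi.h>

void pipelined_tree_bcast(char *buf, int len, int root,
                          int chunk, int degree, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);

    int rel    = (rank - root + P) % P;                        /* root -> rank 0 */
    int parent = (rel == 0) ? -1 : ((rel - 1) / degree + root) % P;

    for (int off = 0; off < len; off += chunk) {
        int n = (off + chunk <= len) ? chunk : len - off;

        if (parent >= 0)                                       /* get this chunk */
            MPI_Recv(buf + off, n, MPI_CHAR, parent, 0,
                     comm, MPI_STATUS_IGNORE);

        for (int c = 1; c <= degree; c++) {                    /* forward it before
                                                                  the next chunk */
            int child_rel = rel * degree + c;
            if (child_rel < P)
                MPI_Send(buf + off, n, MPI_CHAR,
                         (child_rel + root) % P, 0, comm);
        }
    }
}
```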

Broadcast (3)
- Broadcast on 16 SPEs (2 processors), comparing:
  - TREE: pipelined tree-structured communication based on local stores
  - TREEMM: tree-structured Send/Recv type implementation
  - AG: each SPE is responsible for a different portion of the data
  - OTA: each SPE copies the data to its own location
  - G: the root copies all the data
- Broadcast with a good choice of algorithm for each data size and SPE count (see the selection sketch below)
- The maximum main memory bandwidth is also shown
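How such a "good choice" might be wired up is sketched below as a simple dispatch on message size and SPE count. The thresholds are placeholders, not the tuned crossover points behind the figure.

```c
/* Hypothetical algorithm selection for broadcast; thresholds are placeholders. */
#include <stddef.h>

typedef enum { BCAST_G, BCAST_OTA, BCAST_AG, BCAST_TREEMM, BCAST_TREE } bcast_alg_t;

bcast_alg_t choose_bcast(size_t nbytes, int nspes)
{
    if (nbytes <= 128)       return BCAST_TREEMM;  /* tiny: latency-bound         */
    if (nbytes <= 16 * 1024) return BCAST_TREE;    /* medium: pipeline through LS */
    if (nspes <= 4)          return BCAST_G;       /* few SPEs: root copies all   */
    return BCAST_AG;                               /* large: split the copying    */
}
```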

Broadcast (4)
- Each node of the SX-8 has 8 vector processors capable of 16 Gflop/s, with 64 GB/s bandwidth to memory from each processor
  - The total bandwidth to memory for a node is 512 GB/s
  - Nodes are connected through a crossbar switch capable of 16 GB/s in each direction
- The Altix is a CC-NUMA system with a global shared memory
  - Each node contains eight Itanium 2 processors
  - Nodes are connected using NUMALINK4; bandwidth between processors on a node is 3.2 GB/s, and between nodes 1.6 GB/s
- [Table: Comparison of MPI_Bcast latency (in microseconds) on different hardware for P = 8 and P = 16: Cell (PE), Infiniband, NEC SX-8, SGI Altix BX2, at data sizes ranging from bytes to megabytes]

Reduce
- Reduce of MPI_INT with MPI_SUM on 16 SPUs (plot)
  - Similar trends were observed for other SPU counts
- [Table: Comparison of MPI_Reduce latency (in microseconds) on different hardware for P = 8 and P = 16: Cell (PE), IBM SP, NEC SX-8, SGI Altix BX2, at data sizes ranging from bytes to 1 MB]
- Each node of the IBM SP was a 16-processor SMP
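For reference, the operation being timed here is the standard MPI reduction of integers with MPI_SUM; a plain MPI version looks like the following (the payload size is arbitrary).

```c
/* The operation measured on the slide: MPI_Reduce of MPI_INT with MPI_SUM. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    enum { N = 1024 };                 /* e.g. a 4 KB payload of ints */
    int local[N], sum[N];
    for (int i = 0; i < N; i++) local[i] = rank + i;

    /* every rank contributes; rank 0 receives the element-wise sums */
    MPI_Reduce(local, sum, N, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum[0] = %d (expected %d)\n", sum[0], nprocs * (nprocs - 1) / 2);

    MPI_Finalize();
    return 0;
}
```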

Conclusions and Future Work
- Conclusions
  - The Cell processor has good potential for MPI implementations
    - The PPE should have a limited role
    - High bandwidth and low latency even with application data in main memory
  - But the local store should be used effectively, with double buffering to hide latency (see the sketch below)
  - Main memory bandwidth is then the bottleneck
- Current and future work
  - Implemented: collective communication operations optimized for contiguous data
  - Future work: optimize collectives for derived data types with non-contiguous data
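A minimal sketch of the double buffering mentioned above, on the SPU side and assuming the Cell SDK's spu_mfcio.h: while one 16 KB chunk is being processed out of one local-store buffer, the DMA for the next chunk is already in flight into the other buffer, hiding main-memory latency. The function names, and the assumption that the total size is a multiple of the chunk size, are for illustration only.

```c
/* Double-buffered streaming of data from main memory through the local store.
 * Assumes the Cell SDK (spu_mfcio.h) and that `total` is a multiple of CHUNK. */
#include <spu_mfcio.h>

#define CHUNK 16384
static char buf[2][CHUNK] __attribute__((aligned(128)));

void process(char *data, unsigned int n);        /* whatever the collective does */

void stream_from_memory(unsigned long long ea, unsigned long long total)
{
    unsigned int tag[2] = { 0, 1 };
    int cur = 0;

    mfc_get(buf[cur], ea, CHUNK, tag[cur], 0, 0);            /* prefetch chunk 0 */

    for (unsigned long long off = 0; off < total; off += CHUNK) {
        int nxt = cur ^ 1;
        if (off + CHUNK < total)                             /* overlap: next DMA */
            mfc_get(buf[nxt], ea + off + CHUNK, CHUNK, tag[nxt], 0, 0);

        mfc_write_tag_mask(1 << tag[cur]);                   /* wait only for the */
        mfc_read_tag_status_all();                           /* current buffer    */

        process(buf[cur], CHUNK);
        cur = nxt;
    }
}
```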