Optimizing Collective Communication for Multicore
By Rajesh Nishtala
Parallel Computing Laboratory (Par Lab), EECS, UC Berkeley

What Are Collectives?
- An operation called by all threads together to perform globally coordinated communication
- May involve a modest amount of computation, e.g. to combine values as they are communicated
- Can be extended to teams (or communicators), in which case they operate on a predefined subset of the threads
- Focus here is on collectives in Single Program Multiple Data (SPMD) programming models

Some Collectives
- Barrier (MPI_Barrier()): a thread cannot exit a call to the barrier until all other threads have called it
- Broadcast (MPI_Bcast()): a root thread sends a copy of an array to all the other threads
- Reduce-To-All (MPI_Allreduce()): each thread contributes an operand to an arithmetic operation across all the threads; the result is then broadcast to all the threads
- Exchange (MPI_Alltoall()): for all i, j < N, thread i copies the j-th piece of its input array to the i-th slot of an output array located on thread j
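For concreteness, a minimal MPI/C sketch that exercises each of these collectives from an SPMD program (the buffer sizes and the use of MPI_SUM are arbitrary choices for the example, not taken from the talk):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Barrier: no rank exits until every rank has arrived. */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Broadcast: the root (rank 0) sends a copy of buf to all the others. */
        double buf[4] = { (double)rank, 0.0, 0.0, 0.0 };
        MPI_Bcast(buf, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Reduce-To-All: every rank contributes an operand; every rank gets the sum. */
        double mine = (double)rank, sum = 0.0;
        MPI_Allreduce(&mine, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* Exchange: the j-th block of rank i's input lands in the i-th block of
           rank j's output (one double per block in this example). */
        double *sendbuf = malloc(nprocs * sizeof(double));
        double *recvbuf = malloc(nprocs * sizeof(double));
        for (int j = 0; j < nprocs; j++) sendbuf[j] = rank * 100.0 + j;
        MPI_Alltoall(sendbuf, 1, MPI_DOUBLE, recvbuf, 1, MPI_DOUBLE, MPI_COMM_WORLD);

        if (rank == 0) printf("sum of ranks = %g (expect %d)\n", sum, nprocs * (nprocs - 1) / 2);
        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }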

Why Are They Important?
- Basic communication building blocks
- Found in many parallel programming languages and libraries
- Abstraction: if an application is written with collectives, it passes the responsibility of tuning to the runtime

Percentage of runtime spent in collectives (Opteron/InfiniBand, 256 processors):

                            Class C    Class D
  Exchange in NAS FT        ~28%       ~23%
  Reductions in NAS CG      ~42%       ~28%

Experimental Setup
- Platforms:
  - Sun Niagara2: 1 socket of 8 multi-threaded cores; each core supports 8 hardware thread contexts, for 64 total threads
  - Intel Clovertown: 2 "traditional" quad-core sockets
  - BlueGene/P: 1 quad-core socket
- MPI for inter-process communication: shared-memory MPICH

Threads vs. Processes (Niagara2)
- Barrier performance: perform a barrier across all 64 threads
- Threads are arranged into processes in different ways: one extreme has one thread per process, the other has 1 process with 64 threads
- MPI_Barrier() is called between processes; a flat barrier is used amongst the threads within a process
- 2 orders of magnitude difference in performance!
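A rough sketch of the configuration being measured (the thread/process counts and the use of a pthread barrier as the "flat" intra-process barrier are assumptions): each process runs several threads that synchronize through shared memory, and one thread per process calls MPI_Barrier().

    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    #define THREADS_PER_PROC 8   /* assumption: e.g. 8 processes x 8 threads = 64 on Niagara2 */
    #define ITERS 1000

    static pthread_barrier_t node_barrier;

    /* Barrier across all threads: flat barrier amongst the threads of each
       process, MPI_Barrier() called between processes by one thread each. */
    static void hybrid_barrier(int tid) {
        pthread_barrier_wait(&node_barrier);   /* every thread in this process arrived */
        if (tid == 0)
            MPI_Barrier(MPI_COMM_WORLD);       /* processes synchronize */
        pthread_barrier_wait(&node_barrier);   /* release all threads */
    }

    static void *worker(void *arg) {
        int tid = (int)(long)arg;
        for (int i = 0; i < ITERS; i++)
            hybrid_barrier(tid);
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        pthread_barrier_init(&node_barrier, NULL, THREADS_PER_PROC);

        pthread_t th[THREADS_PER_PROC];
        double t0 = MPI_Wtime();
        for (long t = 0; t < THREADS_PER_PROC; t++)
            pthread_create(&th[t], NULL, worker, (void *)t);
        for (int t = 0; t < THREADS_PER_PROC; t++)
            pthread_join(th[t], NULL);
        double t1 = MPI_Wtime();

        if (rank == 0)  /* timing includes thread create/join; good enough for a sketch */
            printf("avg hybrid barrier time: %g us\n", (t1 - t0) / ITERS * 1e6);

        pthread_barrier_destroy(&node_barrier);
        MPI_Finalize();
        return 0;
    }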

Threads vs. Processes (Niagara2), cont.
- Other collectives see similar scaling issues when using processes
- MPI collectives are called between processes, while shared memory is leveraged within a process

Intel Clovertown and BlueGene/P
- Fewer threads per node
- Differences are not as drastic, but they are non-trivial

Optimizing Barrier with Trees
- Leveraging shared memory is a critical optimization
- Flat trees don't scale; use trees to add parallelism
- Requires two passes of a tree (sketched below):
  - First (UP) pass indicates that all threads have arrived: signal your parent once all of your children have arrived; once the root gets the signal from all of its children, all threads have reported in
  - Second (DOWN) pass releases the threads: wait for your parent to send you a clear signal, then propagate the clear signal down to your children
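A minimal pthreads sketch of this two-pass scheme, assuming a binomial (radix-2) tree, monotonically increasing counters in place of the two alternating flag sets described in the backup slides, and gcc-style alignment attributes. It is illustrative, not the Berkeley implementation; a production version also needs memory fences on weakly ordered machines.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS  8
    #define CACHELINE 64

    typedef struct {
        volatile unsigned long arrived;  /* UP pass: barriers I have entered           */
        volatile unsigned long go;       /* DOWN pass: barriers my subtree was cleared */
        char pad[CACHELINE - 2 * sizeof(unsigned long)];  /* avoid false sharing       */
    } flags_t;

    static flags_t flags[NTHREADS] __attribute__((aligned(CACHELINE)));

    static int parent_of(int id) {       /* binomial tree: clear the highest set bit */
        if (id == 0) return -1;
        int p = id;
        while (p & (p - 1)) p &= p - 1;  /* isolate the highest set bit */
        return id - p;
    }

    static void tree_barrier(int me, unsigned long phase) {
        /* UP pass: wait for each child's subtree, then report to my parent. */
        for (long stride = 1; stride < NTHREADS; stride <<= 1) {
            int child = me + (int)stride;
            if (stride > me && child < NTHREADS)
                while (flags[child].arrived < phase) ;  /* spin: child's subtree arrived */
        }
        flags[me].arrived = phase;

        /* DOWN pass: wait for my parent's clear signal, then pass it to my children. */
        if (me != 0)
            while (flags[parent_of(me)].go < phase) ;   /* spin on parent's flag */
        flags[me].go = phase;
    }

    static void *worker(void *arg) {
        int me = (int)(long)arg;
        for (unsigned long phase = 1; phase <= 10000; phase++)
            tree_barrier(me, phase);
        return NULL;
    }

    int main(void) {
        pthread_t th[NTHREADS];
        for (long t = 0; t < NTHREADS; t++) pthread_create(&th[t], NULL, worker, (void *)t);
        for (int t = 0; t < NTHREADS; t++)  pthread_join(th[t], NULL);
        printf("completed 10000 tree barriers across %d threads\n", NTHREADS);
        return 0;
    }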

Example Tree Topologies
- Radix 2 k-nomial tree (binomial)
- Radix 4 k-nomial tree (quadnomial)
- Radix 8 k-nomial tree (octnomial)
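One common way to generate these k-nomial shapes, sketched below for an arbitrary radix r (the function names are illustrative, not from the Berkeley library): node 0 is the root, and in round t every node that already holds the signal contacts r-1 new nodes at stride r^t.

    #include <stdio.h>

    /* Parent of `me` in a radix-r k-nomial tree rooted at 0 (-1 for the root). */
    static int knomial_parent(int me, int r) {
        if (me == 0) return -1;
        long stride = 1;
        while (stride * r <= me) stride *= r;  /* largest power of r that is <= me */
        return me % (int)stride;               /* drop the leading base-r digit    */
    }

    /* Children of `me`: one group of (r-1) children per round after the round in
       which `me` itself received the signal. Returns the number of children. */
    static int knomial_children(int me, int n, int r, int *children) {
        int count = 0;
        for (long stride = 1; stride < n; stride *= r) {
            if (stride <= me) continue;        /* rounds at or before my own arrival */
            for (int j = 1; j < r; j++) {
                long child = me + j * stride;
                if (child < n) children[count++] = (int)child;
            }
        }
        return count;
    }

    int main(void) {
        int kids[64];
        int n = 64, radix = 4;                 /* e.g. 64 threads, quadnomial tree */
        for (int id = 0; id < n; id++) {
            int k = knomial_children(id, n, radix, kids);
            printf("thread %2d: parent %2d, %d children\n",
                   id, knomial_parent(id, radix), k);
        }
        return 0;
    }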

Barrier Performance Results
- Time many back-to-back barriers
- Flat tree: just one level, with all threads reporting to thread 0; leverages shared memory but is non-scalable
- Architecture-independent tree (radix 2): pick a generic "good" radix that is suitable for many platforms, but mismatched to the architecture
- Architecture-dependent tree: search over all radices to pick the tree that best matches the architecture

Broadcast Performance Results
- Time a latency-sensitive broadcast (8 bytes)
- Time a broadcast followed by a barrier, and subtract the time for the barrier alone
- Yields an approximation of how long it takes for the last thread to get the data
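A sketch of the measurement methodology only, using plain MPI_Bcast()/MPI_Barrier() as stand-ins for the flat and tree implementations actually being compared:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        const int iters = 10000;
        double payload = 3.14;                 /* 8-byte broadcast */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            MPI_Bcast(&payload, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            MPI_Barrier(MPI_COMM_WORLD);
        }
        double bcast_plus_barrier = (MPI_Wtime() - t0) / iters;

        t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        double barrier_only = (MPI_Wtime() - t0) / iters;

        if (rank == 0)   /* approximate time until the last rank has the data */
            printf("estimated broadcast latency: %g us\n",
                   (bcast_plus_barrier - barrier_only) * 1e6);
        MPI_Finalize();
        return 0;
    }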

Reduce-To-All Performance Results
- 4 kB (512 doubles) Reduce-To-All
- In addition to the data movement, we also want to parallelize the computation
- In the flat approach, the computation gets serialized at the root
- Tree-based approaches allow us to parallelize the computation amongst all the floating-point units
- 8 threads share one FPU, so radix 2, 4, and 8 serialize the computation in about the same way

Optimization Summary
- Relying on flat trees is not enough for most collectives
- Architecture-dependent tuning is a further and important optimization

Extending the Results to a Cluster
- Use one rack of BlueGene/P (1024 nodes, or 4096 cores)
- Reduce-To-All is implemented by having one representative thread per process make the call to the inter-node allreduce, which reduces the number of messages in the network (see the sketch below)
- Vary the number of threads per process but use all the cores
- Relying purely on shared memory doesn't always yield the best performance: the number of active cores working on the computation drops
- Can optimize so that the computation is partitioned across the cores, but that variant is not suitable for a direct call to MPI_Allreduce()
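A hedged sketch of the hybrid scheme, with the thread count, the pthread barrier, and the function names as illustrative assumptions: threads deposit their operands in shared memory, one representative thread issues the inter-node MPI_Allreduce(), and the result then becomes visible to every thread.

    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    #define THREADS_PER_PROC 4                  /* assumption: one quad-core BG/P node */

    static pthread_barrier_t bar;
    static double contrib[THREADS_PER_PROC];
    static double global_sum;

    static void hybrid_allreduce(int tid, double my_val) {
        contrib[tid] = my_val;
        pthread_barrier_wait(&bar);             /* all contributions are in place */
        if (tid == 0) {                         /* representative thread */
            double node_sum = 0.0;
            for (int i = 0; i < THREADS_PER_PROC; i++)
                node_sum += contrib[i];         /* intra-node reduction */
            MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);      /* one message per process */
        }
        pthread_barrier_wait(&bar);             /* global_sum now valid for every thread */
    }

    static void *worker(void *arg) {
        int tid = (int)(long)arg;
        hybrid_allreduce(tid, (double)tid);
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        pthread_barrier_init(&bar, NULL, THREADS_PER_PROC);

        pthread_t th[THREADS_PER_PROC];
        for (long t = 0; t < THREADS_PER_PROC; t++)
            pthread_create(&th[t], NULL, worker, (void *)t);
        for (int t = 0; t < THREADS_PER_PROC; t++)
            pthread_join(th[t], NULL);

        if (rank == 0) printf("global sum = %g\n", global_sum);
        pthread_barrier_destroy(&bar);
        MPI_Finalize();
        return 0;
    }

Note that the intra-node combine above is serialized on the representative thread; the partitioned-computation variant mentioned in the last bullet splits that loop across cores, which is why it no longer maps onto a direct MPI_Allreduce() call.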

Potential Synchronization Problem
1. Broadcast variable x from the root (proc 0)
2. Have proc 1 set a new value for x on proc 4

    broadcast x=1 from proc 0
    if (myid == 1) {
        put x=5 to proc 4
    } else {
        /* do nothing */
    }

If proc 1 exits the broadcast while the collective is still globally incomplete, the put of x=5 by proc 1 can be overwritten by the late-arriving broadcast and lost: proc 1 thinks the collective is done, but it has observed a globally incomplete collective.

Strict vs. Loose Synchronization
- A fix to the problem: add a barrier before/after the collective, which enforces a global ordering of the operations
- Is there a problem with that? We want to decouple synchronization from data movement
- Specify the synchronization requirements instead: potential to aggregate synchronization, done by the user or a smart compiler
- How can we realize these gains in applications?

Conclusions
- Moving from processes to threads is a crucial optimization for single-node collective communication
- Tree-based collectives can realize better performance, even for collectives on one node
- Picking the tree that best matches the architecture yields the best performance
- Multicore adds to the (auto)tuning space for collective communication
- Shared-memory semantics allow us to create new loosely synchronized collectives

Questions?

Backup Slides

Threads and Processes
- Threads: a sequence of instructions and an execution stack
  - Communication between threads happens through a common, shared address space: no OS/network involvement needed, though reasoning about inter-thread communication can be tricky
- Processes: a set of threads and an associated memory space
  - All threads within a process share the address space
  - Communication between processes must be managed through the OS: inter-process communication is explicit but may be slow, and it is more expensive to switch between processes

Experimental Platforms: Niagara2, Clovertown, BG/P

Specs

                                     Niagara2          Clovertown            BlueGene/P
  # Sockets                          1                 2                     1
  # Cores / Socket                   8                 4                     4
  Threads per Core                   8                 1                     1
  Total Thread Count                 64                8                     4
  Instruction Set                    SPARC             x86/64                PowerPC
  Core Frequency                     1.4 GHz           2.6 GHz               0.85 GHz
  Peak DP FP Performance / Core      1.4 GFlop/s       10.4 GFlop/s          3.4 GFlop/s
  DRAM Read Bandwidth / Socket       42.7 GB/s         21.3 GB/s             13.6 GB/s
  DRAM Write Bandwidth / Socket      21.3 GB/s         10.7 GB/s             13.6 GB/s
  L1 Cache Size                      8 kB              32 kB                 32 kB
  L2 Cache Size                      4 MB (shared)     16 MB (4 MB/2 cores)  8 MB (shared)
  OS Version                         Solaris 5.10      Linux                 BG/P Compute Kernel
  C Compiler                         Sun C (5.9)       Intel ICC (10.1)      IBM BlueGene XLC
  MPI Implementation                 MPICH2 (ch3:shm)  MPICH2 (ch3:shm)      BG/P MPI

Details of Signaling
- For optimum performance, have many readers and one writer
  - Each thread sets a flag (a single word) that the others will read
  - Every reader gets a copy of the cache line and spins on that copy
  - When the writer changes the value of the variable, the cache-coherency system handles broadcasting/updating the change
  - Avoids atomic primitives
- On the way up the tree, a child sets a flag indicating that its subtree has arrived; the parent spins on that flag for each child
- On the way down, each child spins on its parent's flag; when it is set, it indicates that the parent wants to broadcast the clear signal down
- Flags must be on different cache lines to avoid false sharing
- Need to switch back and forth between two sets of flags
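A small illustration of this layout, assuming 64-byte cache lines and gcc-style alignment; the names, sizes, and the tiny two-thread demo are assumptions, and a production version would add memory fences on weakly ordered machines.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdint.h>

    #define CACHELINE 64

    typedef struct {
        volatile uint32_t val[2];                    /* two alternating sets of flags        */
        char pad[CACHELINE - 2 * sizeof(uint32_t)];  /* keep each thread's flags on own line */
    } signal_flag_t;

    static signal_flag_t arrived[64] __attribute__((aligned(CACHELINE)));

    /* Single writer flips its own word; many readers spin on their cached copy,
       and the coherence protocol propagates the update (no atomics needed). */
    static void signal_set  (signal_flag_t *f, int phase) { f->val[phase] = 1; }
    static void signal_reset(signal_flag_t *f, int phase) { f->val[phase] = 0; }
    static void signal_wait (const signal_flag_t *f, int phase) { while (f->val[phase] == 0) ; }

    /* Tiny demo: thread 1 signals its arrival, thread 0 (main) waits for it. */
    static void *signaller(void *arg) { (void)arg; signal_set(&arrived[1], 0); return NULL; }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, signaller, NULL);
        signal_wait(&arrived[1], 0);      /* spin until thread 1's phase-0 flag is set */
        pthread_join(t, NULL);
        signal_reset(&arrived[1], 0);     /* clear the old set so it can be reused     */
        printf("observed signal from thread 1\n");
        return 0;
    }

The alternation between the two flag sets (phase 0 and phase 1) is what lets consecutive barriers reuse the same memory: the old set is only cleared after the operation that used it has completed everywhere.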