AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER
Authors: Y. Zhao, C. Hu, S. Wang, S. Zhang
Source: Proceedings of the 2nd IASTED International Conference on Advances in Computer Science and Technology
Speaker: Cheng-Jung Wu

Outline
- Introduction
- Extensions in EOMP
- Computing Resource Definition
- Hierarchical Data Layout and Data Mapping
- Execution Model for EOMP
- Experiments and Results
- Dot Product
- Matrix Multiplication under the EOMP Execution Model on an SMP Cluster
- Conclusion

Introduction
- Clusters of shared-memory multiprocessors (SMPs) are increasingly popular in high-performance computing.
- The hybrid architecture of SMP clusters supports a wide range of parallel paradigms.
- Three programming paradigms:
  - Standard message passing
  - A hybrid paradigm corresponding to the underlying architecture
  - A shared-memory paradigm built on a software distributed shared memory (SDSM)
- Three major metrics: performance, portability, programmability

Introduction
- None of the three parallel programming paradigms does well on all three metrics.
- EOMP is a new parallel paradigm: a compromise model that balances the three major metrics.
- Features:
  - Good programmability
  - Acceptable performance
  - Improved memory behavior through data locality: programs running on an SMP cluster exploit both inter-node and intra-node data locality.

Extensions in EOMP
- OpenMP targets shared-memory systems and lacks support for distributed-memory systems.
- EOMP adds new directives for:
  - Computing resource definition
  - Data mapping

Computing Resource Definition
- Definitions: virtual node (VN) and virtual processor (VP).
- VNs are mapped to physical nodes and are the target units of inter-node data distribution.
- VPs are mapped to physical processors and are the target units of intra-node data reallocation and task scheduling during compilation.
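For orientation only (this is not the paper's EOMP syntax, and MPI_Comm_split_type is an MPI-3 call that postdates this work), the VN/VP hierarchy can be emulated with plain MPI plus OpenMP: one MPI rank per physical node plays the virtual node, and that node's OpenMP threads play the virtual processors. A minimal sketch under those assumptions:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        /* Group the ranks that share a physical node: each group acts as a VN. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);
        #pragma omp parallel
        {
            /* Each OpenMP thread plays the role of a VP inside its VN. */
            printf("VN-local rank %d, VP (thread) %d of %d\n",
                   node_rank, omp_get_thread_num(), omp_get_num_threads());
        }
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }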

Computing Resource Definition
- The paper defines the semantics of the computing resource definition directives and gives examples of processor mapping.

Hierarchical Data Layout and Data Mapping: Inter-node Data Mapping
- Scalar data defined in EOMP is shared by default; every node gets its own copy of the data.
- Inter-node task parallelism allows the shared scalar data to be modified on certain nodes.
- Global addresses of distributed arrays are translated to local addresses.
- Inter-node data mapping distributes the mapped arrays to the VNs.
- Semantics of the inter-node data distribution directive: #pragma eomp distribute a (BLOCK*) onto N
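As a rough illustration (a standard block-distribution formula is assumed here, not taken from the paper), a BLOCK row distribution of an n-row array onto N virtual nodes gives node r a contiguous band of rows; references outside that band require compiler-generated inter-node communication.

    /* Block row distribution over N virtual nodes: node r owns rows
       [block_lo, block_hi) of the n-row array (formula is an assumption). */
    long block_lo(long n, int N, int r) { return (long)r * n / N; }
    long block_hi(long n, int N, int r) { return (long)(r + 1) * n / N; }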

Hierarchical Data Layout and Data Mapping: Intra-node Data Mapping
- The shared-memory data layout takes advantage of global addressing, so technically no further data mapping is required inside a node.
- In certain cases, however, an improper data-access order degrades cache performance through false sharing or long-stride accesses.
- For instance, if two threads always access neighboring array elements in memory at the same time, cache performance may be very poor due to severe false sharing.
- Optimizations of the intra-node data layout are therefore necessary.

Hierarchical Data Layout and Data Mapping: Intra-node Data Mapping
- An extreme example: an experiment on this access pattern shows an overall 90% reduction in L1 cache misses after the intra-node data reallocation optimization (on a 4-CPU IA64 SMP; the array a is of size 1M).
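A minimal sketch of the kind of access pattern described above (an assumed shape, not the paper's exact benchmark): with an interleaved work split, neighboring elements are written by different threads, so the same cache lines bounce between processors.

    #include <omp.h>

    /* Thread t updates a[t], a[t+T], a[t+2T], ...: adjacent elements belong
       to different threads, producing severe false sharing. */
    void interleaved_update(double *a, long n) {
        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            int T = omp_get_num_threads();
            for (long i = t; i < n; i += T)
                a[i] += 1.0;
        }
    }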

Hierarchical Data Layout and Data Mapping: Intra-node Data Mapping
Two strategies can be adopted to reduce cache misses (see the sketch after this list):
- Rearrange the access order of each thread. This is not always possible as a compiler optimization because it depends closely on the structure of the source program; in the interleaved-data case above, it means avoiding simultaneous accesses to neighboring data in memory.
- Reallocate the data layout in memory, storing the data accessed by the same thread in a contiguous memory block. This does not change the data dependences of the source program, which assures the correctness of the optimization.
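A minimal sketch of the second strategy for the same assumed access pattern: each thread's elements are copied into a contiguous block before the hot loop and copied back afterwards, so every thread works on its own cache lines while the computed values stay identical.

    #include <omp.h>

    /* a2 is a scratch array of the same length as a; for brevity the thread
       count T is assumed to divide n. */
    void reallocated_update(double *a, double *a2, long n) {
        int T = omp_get_max_threads();
        long per = n / T;
        /* Gather: a2[t*per + k] is the k-th element owned by thread t. */
        for (int t = 0; t < T; t++)
            for (long k = 0; k < per; k++)
                a2[t * per + k] = a[k * T + t];

        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            double *mine = a2 + (long)t * per;  /* contiguous, per-thread block */
            for (long k = 0; k < per; k++)
                mine[k] += 1.0;
        }

        /* Scatter back: this copying is part of the reallocation overhead. */
        for (int t = 0; t < T; t++)
            for (long k = 0; k < per; k++)
                a[k * T + t] = a2[t * per + k];
    }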

Hierarchical Data Layout and Data Mapping: Intra-node Data Mapping
- Intra-node data reallocation is guided by programmer-specified directives and by compiler reference analysis.
- Reallocating data in memory incurs additional time and space overheads, which must be taken into account when evaluating the performance speedup of this optimization.
- The data locations have been changed, so direct access to the reallocated data should be forbidden.
- Semantics of the intra-node data reallocation directive: #pragma eomp distribute a (CYCLIC,*) intra

Execution Model for EOMP
- Inter-node barriers and broadcasts propagate modifications of shared variables at the edges of task-parallel regions, maintaining data consistency.
- Inter-node communication uses explicit message passing.
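A minimal sketch of that consistency rule expressed in explicit message passing (the variable and function names are illustrative, not the paper's runtime API): the node that modified a shared scalar broadcasts its value, and all nodes synchronize at the region's edge.

    #include <mpi.h>

    double shared_alpha;   /* shared scalar: every node holds its own copy */

    /* Called at the edge of a task-parallel region in which owner_rank
       modified shared_alpha. */
    void close_task_parallel_region(int owner_rank) {
        MPI_Bcast(&shared_alpha, 1, MPI_DOUBLE, owner_rank, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
    }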

Execution Model for EOMP
- The compiler generates a message-passing and multithreading program: it first distributes the data and schedules the tasks across nodes, and then deals with intra-node data reallocation and task scheduling.

Experiments and Results: Dot Product
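A hedged sketch of a hybrid MPI+OpenMP dot product, roughly the kind of program the EOMP compiler and runtime would produce for this benchmark (the vector length and initialization below are assumptions): the vectors are block-distributed across nodes, OpenMP threads reduce within each node, and message passing combines the per-node partial sums.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int P, r;
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        MPI_Comm_rank(MPI_COMM_WORLD, &r);

        const long n = 1L << 24;           /* global vector length (assumed) */
        long nloc = n / P;                 /* BLOCK distribution across nodes */
        double *x = malloc(nloc * sizeof *x);
        double *y = malloc(nloc * sizeof *y);
        for (long i = 0; i < nloc; i++) { x[i] = 1.0; y[i] = 2.0; }

        double local = 0.0, global = 0.0;
        /* Intra-node: OpenMP threads (the VPs) share the node's block. */
        #pragma omp parallel for reduction(+:local)
        for (long i = 0; i < nloc; i++)
            local += x[i] * y[i];

        /* Inter-node: explicit message passing combines the partial sums. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (r == 0) printf("dot = %f\n", global);
        free(x); free(y);
        MPI_Finalize();
        return 0;
    }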

The experimental results show that the efficiency of EOMP, built on the runtime library, is similar to that of the MPI+OpenMP program (and better in some cases), but not as good as pure MPI, because the amount of computation in the dot product is too small compared with the cost of intra-node scheduling.

Matrix Multiplication under the EOMP Execution Model on an SMP Cluster
- C = A * B
- A and C are distributed by rows; B is distributed by columns (one plausible realization is sketched below).
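Under this distribution each node holds full rows of A and C but only a column block of B, so one plausible realization (an assumption, not necessarily the paper's algorithm) computes the matching column block of its C rows and then rotates the B block around a ring of nodes, with OpenMP threads parallelizing the local block product.

    #include <mpi.h>

    /* A_loc: (n/P) x n rows of A; B_blk: n x (n/P) column block of B;
       C_loc: (n/P) x n rows of C. Row-major storage; P must divide n. */
    void distributed_matmul(int n, const double *A_loc, double *B_blk,
                            double *C_loc) {
        int P, r;
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        MPI_Comm_rank(MPI_COMM_WORLD, &r);
        int nb = n / P;                  /* rows per node == columns per block */
        for (int s = 0; s < P; s++) {
            int cb = (r + s) % P;        /* original owner of the block we hold */
            /* Intra-node parallelism over the local rows (the VPs). */
            #pragma omp parallel for
            for (int i = 0; i < nb; i++)
                for (int j = 0; j < nb; j++) {
                    double sum = 0.0;
                    for (int k = 0; k < n; k++)
                        sum += A_loc[(long)i * n + k] * B_blk[(long)k * nb + j];
                    C_loc[(long)i * n + cb * nb + j] = sum;
                }
            /* Rotate the B column block around the ring of nodes. */
            MPI_Sendrecv_replace(B_blk, n * nb, MPI_DOUBLE,
                                 (r - 1 + P) % P, 0, (r + 1) % P, 0,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }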

Matrix Multiplication under the EOMP Execution Model on an SMP Cluster
- When the matrix size is small, the cost of inter-node scheduling and communication is relatively high compared with the computation cost, and the three distributed-memory models cannot achieve a speedup.
- As the matrix size becomes larger, the three distributed-memory models achieve reasonable speedups.

Matrix Multiplication under the EOMP Execution Model on an SMP Cluster
- Notice that the EOMP model with intra-node data reallocation achieves a high speedup when the matrix size is large, showing that improved intra-node cache performance can greatly benefit the overall performance of programs on SMP clusters.

Matrix Multiplication under the EOMP Execution Model on an SMP Cluster
- Peaks of the EOMP-INDR curves in the 500*500 and 1000*1000 cases: the effect of data reallocation is related to both the cache-line size and the size of the local block of B.
- As the number of nodes grows, the local block of B on each node becomes smaller, so a cache line can hold more rows of the local B and cache misses are reduced.
- This explains why the peak in the 500*500 case comes earlier than in the 1000*1000 case.

Conclusion
- The experimental results demonstrate the feasibility of our execution model and the benefit gained from intra-node data reallocation.
- For future work, we plan to develop a complete source-to-source EOMP compiler based on ORC (the Open Research Compiler for IA-64) and on our current runtime library prototype, focusing on communication generation and data management.