San Diego Supercomputer Center / National Partnership for Advanced Computational Infrastructure

On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon
D. Pekurovsky, L. Nett-Carrington, D. Holland, T. Kaiser
San Diego Supercomputer Center

Overview
- Blue Horizon hardware
- Motivation for this work
- Two methods of hybrid programming
- Fine grain results
- A word on coarse grain techniques
- Coarse grain results
- Time variability
- Effects of thread binding
- Final conclusions

Blue Horizon Hardware
- 144 IBM SP High Nodes
- Each node: 8-way SMP, 4 GB memory, crossbar
- Each processor: Power3, 222 MHz, 4 flops/cycle
- Aggregate peak: ~1 Tflop/s (144 nodes x 8 CPUs x 222 MHz x 4 flops/cycle)
- Compilers: IBM mpxlf_r; KAI guidef90, version 3.9

Blue Horizon Hardware
Interconnect (between nodes):
- Currently: 115 MB/s, 4 MPI tasks per node; must use OpenMP to utilize all processors
- Soon: 500 MB/s, 8 MPI tasks per node; can use OpenMP to supplement MPI (if it is worth it)

Hybrid Programming: why use it?
Non-performance-related reasons:
- Avoid replication of data on the node
Performance-related reasons:
- Avoid latency of MPI on the node
- Avoid unnecessary data copies inside the node
- Reduce latency of MPI calls between the nodes
- Decrease global MPI operations (reductions, all-to-all)
The price to pay:
- OpenMP overheads
- False sharing
Is it really worth trying?

Hybrid Programming
Two methods of combining MPI and OpenMP in parallel programs:

Fine grain:

      program main
      ! MPI initialization
      ...
      ! CPU-intensive loop
!$OMP PARALLEL DO
      do i = 1, n
         ! work
      end do
      ...
      end

Coarse grain:

      program main
      ! MPI initialization
!$OMP PARALLEL
      ...
      do i = 1, n
         ! work
      end do
      ...
!$OMP END PARALLEL
      end
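For concreteness, here is a minimal, compilable sketch of the fine grain style (hypothetical loop body and names, not taken from the NPB codes): OpenMP appears only around individual loops, a thread team is forked and joined at each one, and all MPI calls stay outside the parallel regions.

    program fine_grain_sketch
       use mpi
       implicit none
       integer, parameter :: n = 1000000
       integer  :: ierr, rank, i
       real(8)  :: local_sum, global_sum

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

       local_sum = 0.0d0
       ! The only OpenMP in the program: a thread team is forked and joined
       ! at every such loop, which is where the directive overhead discussed
       ! on the following slides comes from.
    !$OMP PARALLEL DO REDUCTION(+:local_sum)
       do i = 1, n
          local_sum = local_sum + dble(i + rank)
       end do
    !$OMP END PARALLEL DO

       ! MPI is called only from serial code, outside any parallel region.
       call MPI_Reduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                       MPI_SUM, 0, MPI_COMM_WORLD, ierr)
       if (rank == 0) print *, 'global sum = ', global_sum
       call MPI_Finalize(ierr)
    end program fine_grain_sketch

The coarse grain counterpart, where one parallel region encloses essentially the whole computation, is sketched after the methodology slide below.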

Hybrid programming
Fine grain approach:
- Easy to implement
- Performance: low, due to the overheads of OpenMP directives (OMP PARALLEL DO)
Coarse grain approach:
- Time-consuming implementation
- Performance: less overhead for thread creation

Hybrid NPB using fine grain parallelism
CG, MG, and FT suites of the NAS Parallel Benchmarks (NPB).

Suite name                 # loops parallelized
CG - Conjugate Gradient    18
MG - Multi-Grid            50
FT - Fourier Transform     8

Results shown are the best of 5-10 runs.
Complete results at

Fine grain results - CG (A&B class)

Fine grain results - MG (A&B class)

Fine grain results - MG (C class)

Fine grain results - FT (A&B class)

Fine grain results - FT (C class)

Hybrid NPB using coarse grain parallelism: MG suite
Overview of the method (diagram: MPI tasks 1-4, each running OpenMP threads 1-2)

Coarse grain programming methodology
- Start with the MPI code
- Each MPI task spawns its threads once, in the beginning
- Serial work (initialization etc.) and MPI calls are done inside MASTER or SINGLE regions
- Main arrays are global
- Work distribution: each thread gets a chunk of the array based on its thread number (omp_get_thread_num()); in this work, one-dimensional blocking
- Avoid using OMP DO
- Be careful with variable scoping and synchronization
(a minimal code sketch of this structure follows)
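A minimal sketch of this structure, with hypothetical array names and sizes (the real MG code performs halo exchanges and a more involved decomposition): threads are spawned once, each thread derives its own block bounds from omp_get_thread_num() instead of relying on OMP DO, and MPI is called only by the master thread inside the parallel region.

    program coarse_grain_sketch
       use mpi
       use omp_lib
       implicit none
       integer, parameter :: n = 4096
       real(8) :: u(n)                  ! main array: global, shared by all threads
       real(8) :: local_norm, global_norm
       integer :: ierr, provided, rank
       integer :: tid, nthreads, chunk, ilo, ihi, i

       ! MPI calls will be made only by the master thread (funneled model).
       call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

       ! Threads are spawned once; all work below happens inside this region.
    !$OMP PARALLEL PRIVATE(tid, nthreads, chunk, ilo, ihi, i)
       nthreads = omp_get_num_threads()
       tid      = omp_get_thread_num()

       ! One-dimensional blocking: block bounds come from the thread number,
       ! not from an OMP DO directive.
       chunk = (n + nthreads - 1) / nthreads
       ilo   = tid * chunk + 1
       ihi   = min(n, (tid + 1) * chunk)

       do i = ilo, ihi
          u(i) = dble(i + rank)         ! "work" on this thread's block
       end do

       ! All threads must finish their blocks before the master communicates.
    !$OMP BARRIER
    !$OMP MASTER
       local_norm = sum(u**2)
       call MPI_Allreduce(local_norm, global_norm, 1, MPI_DOUBLE_PRECISION, &
                          MPI_SUM, MPI_COMM_WORLD, ierr)
    !$OMP END MASTER
       ! Keep the other threads from racing ahead of the communication.
    !$OMP BARRIER
    !$OMP END PARALLEL

       if (rank == 0) print *, 'global norm = ', global_norm
       call MPI_Finalize(ierr)
    end program coarse_grain_sketch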

Coarse grain results - MG (A class)

Coarse grain results - MG (C class)

Coarse grain results - MG (C class): full node results
(table: Max and Min MOPS/CPU on 8 and 64 SMP nodes for different MPI tasks x OpenMP threads combinations, e.g. 8x1, 4x2, 2x4, 1x8)

Variability of run times (on 64 nodes)
- Seen mostly when the full node is used
- Seen both in fine grain and coarse grain runs
- Seen both with the IBM and the KAI compiler
- Seen in runs on the same set of nodes as well as between different sets
- On a large number of nodes, the average performance suffers a lot
- Confirmed in a micro-study of OpenMP on 1 node

OpenMP on 1 node: microbenchmark results
Benchmarks/Open/open.html
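For illustration only (a sketch of the general technique, not the code behind the results above): single-node OpenMP microbenchmarks typically measure the overhead of a construct such as PARALLEL by timing a fixed amount of per-thread work with and without the directive and attributing the difference to fork/join and synchronization cost.

    program omp_overhead_sketch
       use omp_lib
       implicit none
       integer, parameter :: reps = 1000, dlen = 100000
       integer :: r
       real(8) :: x, t0, t_ref, t_par

       x = 0.0d0

       ! Reference: the delay executed by one thread, no directives.
       t0 = omp_get_wtime()
       do r = 1, reps
          call delay(dlen, x)
       end do
       t_ref = omp_get_wtime() - t0

       ! Each thread now executes the same delay inside a parallel region,
       ! so the extra time per repetition is the fork/join overhead.
       t0 = omp_get_wtime()
       do r = 1, reps
    !$OMP PARALLEL REDUCTION(+:x)
          call delay(dlen, x)
    !$OMP END PARALLEL
       end do
       t_par = omp_get_wtime() - t0

       print *, 'threads: ', omp_get_max_threads()
       print *, 'PARALLEL overhead (microseconds): ', &
                (t_par - t_ref) * 1.0d6 / reps
       print *, x   ! use the result so the delay loops are not optimized away

    contains

       subroutine delay(n, y)
          ! burns a small, fixed amount of floating-point work
          integer, intent(in)    :: n
          real(8), intent(inout) :: y
          integer :: i
          do i = 1, n
             y = y + 1.0d-9
          end do
       end subroutine delay

    end program omp_overhead_sketch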

Thread binding
Question: is the variability related to thread migration?
A study on 1 node:
- Each OpenMP thread performs an independent matrix inversion taking about 1.6 seconds
- Monitor the processor id and run time for each thread
- Repeat 100 times
- Threads bound OR not bound
(a sketch of this kind of experiment follows)
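A sketch of such an experiment, under stated assumptions: it uses a MATMUL call as a stand-in for the 1.6-second matrix inversion, and the Linux/glibc sched_getcpu() call in place of whatever AIX facility the original study used to read the processor id. Each thread times its own work and logs the processor it started and finished on, so migrations and slow iterations show up directly in the output.

    program thread_migration_sketch
       use omp_lib
       use iso_c_binding, only: c_int
       implicit none

       ! sched_getcpu() is a Linux/glibc call, used here only as a stand-in
       ! for a platform-specific way of reading the current processor id.
       interface
          function sched_getcpu() bind(c, name='sched_getcpu') result(cpu)
             import :: c_int
             integer(c_int) :: cpu
          end function sched_getcpu
       end interface

       integer, parameter :: iters = 100, n = 400
       integer :: it, tid, cpu_before, cpu_after
       real(8) :: t0, t1
       real(8), allocatable :: a(:,:), b(:,:)

    !$OMP PARALLEL PRIVATE(it, tid, cpu_before, cpu_after, t0, t1, a, b)
       tid = omp_get_thread_num()
       allocate(a(n,n), b(n,n))
       call random_number(a)

       do it = 1, iters
          cpu_before = sched_getcpu()
          t0 = omp_get_wtime()
          b = matmul(a, a)            ! stand-in for the ~1.6 s matrix inversion
          t1 = omp_get_wtime()
          cpu_after = sched_getcpu()

          ! Log thread id, the processors it started and finished on, and the
          ! run time; a change in processor id indicates migration.
    !$OMP CRITICAL
          print '(a,i3,a,i4,a,i4,a,f8.3,a)', 'thread ', tid, &
                ' cpu ', cpu_before, ' ->', cpu_after, '  time ', t1 - t0, ' s'
    !$OMP END CRITICAL
       end do
       deallocate(a, b)
    !$OMP END PARALLEL
    end program thread_migration_sketch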

Thread binding
Results for OMP_NUM_THREADS = 8:
- Without binding, threads migrated in about 15% of the runs
- With thread binding turned on, there was no migration
- 3% of iterations had threads with run times > 2.0 sec., a 25% slowdown
- The slowdown occurs with or without binding
Effect of a single slow thread:
- Probability that the complete calculation is slowed: P = 1 - (1 - c)^M, with c = 3% and M = 144 nodes of Blue Horizon
- P = probability that the overall result is slowed by 25%
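Plugging the slide's numbers into that formula (a check of the arithmetic, not a value taken from the original slides):

P = 1 - (1 - 0.03)^144 = 1 - 0.97^144 ≈ 1 - 0.012 ≈ 0.99

so a job spanning all 144 nodes is almost certain to contain at least one node with a slowed thread, and a tightly synchronized calculation then proceeds at the pace of that slowest node.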

Thread binding
The calculation was rerun with OMP_NUM_THREADS = 7:
- 12.5% reduction in available computational power (one of the 8 CPUs per node left idle)
- No threads showed a slowdown; all ran in about 1.6 seconds
Summary:
- OMP_NUM_THREADS = 7 yields a 12.5% reduction in computational power
- OMP_NUM_THREADS = 8 gives a high probability that overall results are slowed by 25%, independent of thread binding

Overall Conclusions
Based on our study of the NPB on Blue Horizon:
- The fine grain hybrid approach is generally worse than pure MPI
- The coarse grain approach for MG is comparable with pure MPI, or slightly better
- The coarse grain approach is time- and effort-consuming
- Coarse grain techniques are given
- There is big variability when using the full node; until this is fixed, we recommend using fewer than 8 threads
- Thread binding does not influence performance