Comparison of Threading Programming Models

Presentation transcript:

Comparison of Threading Programming Models
Solmaz Salehian, Jiawen Liu and Yonghong Yan
Department of Computer Science and Engineering, Oakland University
High-Level Parallel Programming Models and Supportive Environments (HIPS), Orlando, Florida, USA, May 29-June 2, 2017

Thank you for coming to this talk. It is about two conventional and important topics that we already know well: programming models and compilers. But at this particular time they face different challenges than before. I hope you find it interesting.

Contents
- Motivation
- Comparison of threading programming models:
  - List of features: parallelism patterns, abstractions of memory hierarchy and programming for data locality, mutual exclusion, error handling, tool support, language binding
  - Runtime scheduling: fork-join execution, work-stealing, work-sharing
- Performance evaluation
- Conclusion

Multi-threading APIs
- Thread: the smallest sequence of programmed instructions that can be managed independently by a scheduler. Threading creates additional independent execution paths; multi-threading improves performance and concurrency.
- Intra-node threading programming models:
  - High level: OpenMP, OpenACC, Intel Cilk Plus
  - Mid level: C++11, Intel TBB
  - Low level: Nvidia CUDA, OpenCL, PThreads
- Aims of the comparison:
  - Serve as a guideline for users choosing a proper API
  - Help users choose a parallelism pattern
  - Provide a performance-wise evaluation

List of Features
- Parallelism patterns: data parallelism, task parallelism, data/event driven, offloading
- Abstraction of memory hierarchy and programming for data locality
- Synchronizations: barrier, reduction, join
- Mutual exclusion
- Error handling
- Tool support
- Language binding

Parallelism Patterns
- Data parallelism: the same task works on different data in parallel
- Task parallelism: different tasks work in parallel on different or the same data
- Offloading: offload a program region to a target device (a GPU or a many-core CPU)
- Data/event-driven computation: captures computations characterized by the data-flow programming paradigm, which supports the parallel execution of a stream of data tasks
(The first two patterns are illustrated by the sketch below.)
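To make the first two patterns concrete, here is a minimal OpenMP sketch (not taken from the paper; work_a and work_b are placeholder functions) contrasting a data-parallel loop with two concurrent tasks:

    #include <stdio.h>

    #define N 1000000

    void work_a(void) { printf("task A\n"); }   /* placeholder */
    void work_b(void) { printf("task B\n"); }   /* placeholder */

    /* Data parallelism: iterations of the same loop are divided among threads. */
    void data_parallel(double *x, double *y, double a) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Task parallelism: two independent pieces of work run concurrently. */
    void task_parallel(void) {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task
            work_a();
            #pragma omp task
            work_b();
            #pragma omp taskwait   /* wait for both child tasks */
        }
    }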

Parallelism Patterns
- Asynchronous tasking or threading can be viewed as the foundational parallel mechanism supported by all the models.
- Overall, OpenMP provides the most comprehensive set of features to support all four parallelism patterns.
- Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs and GPUs. The program run on the device (CPU or GPU) is called a kernel; an OpenCL program requires writing code for both the host and the device side.
- OpenACC (for open accelerators) is a programming standard for parallel computing developed by Cray, CAPS, Nvidia and PGI, designed to simplify parallel programming of heterogeneous CPU/GPU systems.
- Both OpenACC and OpenMP provide high-level offloading constructs.
- Only OpenMP and Cilk Plus provide constructs for vectorization support.
- For data/event-driven parallelism, C++'s std::future, OpenMP's depend clause, and OpenACC's wait all let the user specify asynchronous task dependencies. The class template std::future provides a mechanism to access the result of asynchronous operations.
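The data/event-driven pattern can be expressed in a few lines. Below is a minimal sketch (not from the paper; produce and consume are placeholder functions) in which the same producer/consumer dependency is written once with C++11's std::future and once with OpenMP's depend clause:

    #include <future>
    #include <iostream>

    int produce() { return 42; }             // placeholder producer
    int consume(int v) { return v + 1; }     // placeholder consumer that depends on produce()

    int main() {
        // C++11: the future carries the dependency; get() blocks until produce() has finished.
        std::future<int> f = std::async(std::launch::async, produce);
        std::cout << consume(f.get()) << std::endl;

        // OpenMP: the depend clause orders the two tasks on the shared variable v.
        int v = 0, r = 0;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: v)
            v = produce();
            #pragma omp task depend(in: v)
            r = consume(v);
        }
        std::cout << r << std::endl;
        return 0;
    }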

Abstractions of Memory Hierarchy and Synchronizations
- Programming models that support manycore architectures provide interfaces for organizing a large number of threads (on the order of thousands) into a two-level thread hierarchy, e.g., OpenMP's teams of threads, OpenACC's gang/worker/vector clauses, CUDA's blocks/threads and OpenCL's work-groups.
- The C++11 thread memory model includes interfaces for a rich memory consistency model and guarantees sequential consistency for programs without data races [6]; comparable interfaces are not available in most other models, apart from OpenMP's flush directive.
- An API may provide constructs to bind computation and data in order to influence runtime execution under the principle of locality; for example, OpenMP's high-level affinity interface uses environment variables to determine the machine topology and assign OpenMP threads to processors based upon their physical location in the machine.
- Models that support offloading computation provide constructs to specify explicit data movement between discrete memory spaces; models that do not target other compute devices do not require them.
- Synchronizations: a programming model often provides constructs for coordinating parallel work units.
  - Barrier: stops each thread until the last thread reaches the synchronization point. OpenMP's BARRIER directive synchronizes all threads in the team explicitly; parallel regions and worksharing constructs also end with an implicit barrier.
  - TASKWAIT: specifies a wait on the completion of the child tasks generated since the beginning of the current task. Because taskwait does not have a C language statement as part of its syntax, it may be placed only where a base-language statement is allowed; it may not be used in place of the statement following an if, while, do, switch, or label (see the OpenMP 3.1 specification for details).
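A minimal OpenMP sketch of the two synchronization constructs mentioned above, barrier and taskwait (it assumes an OpenMP-capable compiler; phase1 and phase2 are placeholder routines):

    #include <stdio.h>
    #include <omp.h>

    void phase1(int id) { printf("phase1 on %d\n", id); }   /* placeholder work */
    void phase2(int id) { printf("phase2 on %d\n", id); }   /* placeholder work */

    int main(void) {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            phase1(tid);
            #pragma omp barrier    /* no thread starts phase2 before all finish phase1 */
            phase2(tid);
        }                          /* implicit barrier at the end of the parallel region */

        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task
            phase1(0);
            #pragma omp task
            phase2(0);
            #pragma omp taskwait   /* wait only for the child tasks generated above */
            printf("child tasks done\n");
        }
        return 0;
    }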

Mutual Exclusion, Error Handling, Tool Support, and Language Binding
- Mutual exclusion: for example, OpenMP's ATOMIC directive specifies that a specific memory location must be updated atomically, rather than letting multiple threads attempt to write to it.
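As a brief illustration of the atomic update described above, a minimal sketch (in practice a reduction clause would usually be faster; this only shows the construct):

    #include <stdio.h>

    int main(void) {
        long sum = 0;
        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++) {
            /* Without the atomic directive (or a critical section / reduction),
               this shared update would be a data race. */
            #pragma omp atomic
            sum += i;
        }
        printf("%ld\n", sum);
        return 0;
    }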

Runtime Scheduling: Fork-Join Execution
- The master thread is the single thread executing the sequential parts.
- The master thread forks a team of worker threads at a parallel region.
- Upon exiting the parallel region, all threads synchronize and join.
- Dynamic loop schedules have much higher overheads than static schedules (see the sketch below).
Fig. 1. Diagram of fork-join [1]
[1] http://www.nersc.gov/users/software/programming-models/openmp/openmp-resources/
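A minimal sketch of the static versus dynamic loop schedules compared above (irregular_work is a placeholder whose cost varies per iteration; the chunk size 64 is arbitrary):

    /* Placeholder loop body with iteration-dependent cost. */
    void irregular_work(int i) {
        volatile double x = 0;
        for (int k = 0; k < i % 1000; k++) x += k;
    }

    void run(int n) {
        /* static: iterations are divided among threads at loop entry; lowest overhead. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            irregular_work(i);

        /* dynamic: threads grab chunks of 64 iterations as they become free;
           better load balance for irregular work, but higher scheduling overhead. */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < n; i++)
            irregular_work(i);
    }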

Runtime Scheduling: Work-Stealing and Work-Sharing
- Work-stealing (used by Cilk Plus):
  - Each worker thread has a double-ended queue (deque) of tasks.
  - The scheduler dynamically balances the load by stealing tasks from the deques of other workers.
- Work-sharing:
  - Distributes the iterations of a parallel loop among threads in data parallelism.
Fig. 2. Diagram of work-stealing [2]
[2] http://actor-framework.readthedocs.io/en/stable/Scheduler.html
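As an example of the kind of code a work-stealing scheduler executes, here is a minimal Cilk Plus sketch of the recursive Fibonacci kernel used later in the evaluation (it requires a Cilk Plus-capable compiler, e.g., the Intel compiler used for the measurements):

    #include <cilk/cilk.h>

    long fib(int n) {
        if (n < 2) return n;
        long x = cilk_spawn fib(n - 1);   /* the spawned child may be stolen by an idle worker */
        long y = fib(n - 2);
        cilk_sync;                        /* wait until the spawned child has completed */
        return x + y;
    }

Each spawn pushes work onto the worker's deque and idle workers steal from the opposite end, which balances load dynamically but adds per-task overhead compared with a work-sharing loop.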

Runtime Scheduling
- OpenMP: a hybrid of multiple runtime techniques (fork-join, work-stealing and work-sharing).
- Cilk Plus and TBB: use random work-stealing.
- Runtimes for the low-level programming models (C++ std::thread, CUDA, OpenCL, and PThreads) can be simpler.

Performance Comparison
- Applications:
  - Simple kernels: AXPY, Matrix Multiplication, Matrix Vector Multiplication, Sum, Fib
  - Applications from Rodinia: BFS, Hotspot, LUD, LavaMD, SRAD
- APIs: OpenMP, Cilk Plus, C++11
- Parallelism patterns: data and task parallelism

Constructs for Data and Task Parallelism
- Data parallelism:
  - #pragma omp parallel for
  - cilk_for
- Task parallelism:
  - #pragma omp task and taskwait
  - cilk_spawn and cilk_sync
  - std::async and std::future
- Simulating data parallelism in C++11 (see the sketch below):
  - a for loop with iterations manually chunked among threads
  - std::thread and join
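A minimal sketch of the C++11 "manual chunking" approach listed above, using the AXPY kernel (the function name axpy_threads and its signature are illustrative, not taken from the paper):

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Each std::thread processes one contiguous chunk of the iteration space.
    void axpy_threads(double a, const std::vector<double>& x, std::vector<double>& y,
                      unsigned num_threads) {
        std::vector<std::thread> workers;
        std::size_t n = x.size();
        std::size_t chunk = (n + num_threads - 1) / num_threads;
        for (unsigned t = 0; t < num_threads; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(n, begin + chunk);
            workers.emplace_back([=, &x, &y] {
                for (std::size_t i = begin; i < end; ++i)
                    y[i] = a * x[i] + y[i];
            });
        }
        for (auto& w : workers) w.join();   // the explicit counterpart of OpenMP's implicit join
    }

Calling, for example, axpy_threads(2.0, x, y, std::thread::hardware_concurrency()) reproduces by hand what #pragma omp parallel for or cilk_for does automatically.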

Applications (Simple Kernels)
- AXPY: solves the equation y = a * x + y
- MVM: Matrix Vector Multiplication
- MM: Matrix Multiplication
- Sum: calculates the sum of a * X[N]
- Fib: Fibonacci numbers
- Problem sizes: AXPY 100 million, MM 2k, MVM 40k, Sum, Fib 40
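For reference, serial sketches of three of these kernels (assuming Sum means summing the scaled elements a * x[i]; the problem sizes above are applied by the benchmark drivers, not shown here):

    /* AXPY: y = a * x + y over n elements. */
    void axpy(long n, double a, const double *x, double *y) {
        for (long i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Sum: accumulate a * x[i]. */
    double sum(long n, double a, const double *x) {
        double s = 0.0;
        for (long i = 0; i < n; i++)
            s += a * x[i];
        return s;
    }

    /* Fib: naive recursion, the basis of the task-parallel versions. */
    long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }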

Applications (Rodinia)
- BFS: Breadth-First Search
- Hotspot: estimates processor temperature
- LUD: LU Decomposition, which accelerates solving linear equations
- LavaMD: calculates particle relocation due to mutual forces
- SRAD: Speckle Reducing Anisotropic Diffusion, which removes noise in ultrasonic/radar imaging
- Problem sizes: BFS 16 million nodes, Hotspot 8192, LUD 2048, LavaMD 10 boxes, SRAD

Machine for Evaluation
- Two-socket Intel Xeon E5-2699 v3 CPUs, each with 18 physical cores (36 cores in total)
- 256 GB of 2133 MHz DDR4 ECC memory, forming a shared-memory NUMA system
- Operating system: CentOS 6.7 Linux, kernel version 2.6.32-573.12.1.el6.x86_64
- Intel icc compiler version 13.1.3
- Rodinia version 3.1

Simple Kernels
Fig. 3. AXPY performance
Fig. 4. MVM performance
Fig. 5. MM performance
- As the computation intensity increases from AXPY to MVM and MM, runtime scheduling has less impact on performance.
- To avoid creating too many threads and tasks, a threshold is used in the recursive versions (see the sketch below).
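A minimal sketch of the cutoff used in the recursive versions, shown here for Fib with OpenMP tasks (THRESHOLD is a hypothetical value; the cutoff actually used in the experiments is not stated here):

    #define THRESHOLD 20                 /* hypothetical cutoff */

    long fib_serial(int n) {
        return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
    }

    long fib_task(int n) {
        if (n < THRESHOLD)               /* stop spawning tasks for small subproblems */
            return fib_serial(n);
        long x, y;
        #pragma omp task shared(x)
        x = fib_task(n - 1);
        #pragma omp task shared(y)
        y = fib_task(n - 2);
        #pragma omp taskwait
        return x + y;
    }

    long fib_parallel(int n) {
        long result = 0;
        #pragma omp parallel
        #pragma omp single
        result = fib_task(n);
        return result;
    }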

Simple Kernels
Fig. 6. Sum performance
Fig. 7. Fib performance
- The Intel OpenMP runtime uses a lock-based deque for pushing, popping and stealing tasks created with omp task.
- The lock-based deque increases contention and overhead; synchronization is expensive.

Simple Kernels
- Work-stealing for data parallelism introduces high overhead; stealing can cost more than it is worth for small tasks.
- Work-sharing shows better performance for data parallelism.
- Work-stealing has better performance for task parallelism.

Rodinia Applications
Fig. 8. LavaMD performance
Fig. 9. Hotspot performance
Fig. 10. SRAD performance
- In the SRAD and LavaMD applications, each thread receives tasks with the same amount of work.

Rodinia Applications
- Work-sharing introduces less overhead for data parallelism.
- Work-stealing has better performance for task parallelism (LUD and BFS).
- For uniform task workloads, the different implementations perform closely (LavaMD and SRAD).
- Tasking outperforms data parallelism when there are dependencies between different parallel loop phases (Hotspot).

Conclusion
- Comparison of language features and runtime scheduling.
- Performance comparison of three popular parallel programming models (OpenMP, Cilk Plus and C++11) for task and data parallelism.
- Execution times of the different implementations vary because of:
  - different load-balancing strategies in the runtimes,
  - different (or uniform) task workloads across applications,
  - scheduling and loop-distribution overhead.
- A work-stealing runtime creates high overhead for data parallelism; work-sharing is more suitable for data parallelism.

Acknowledgment
This work is based upon work supported by the National Science Foundation under Grant Nos. 1409946 and 1652732.