Comparison of Threading Programming Models

1 Comparison of Threading Programming Models
Solmaz Salehian, Jiawen Liu and Yonghong Yan, Department of Computer Science and Engineering, Oakland University. High-Level Parallel Programming Models and Supportive Environments (HIPS), Orlando, Florida, USA, May 29 - June 2, 2017. Thank you for coming to this talk. It is about two conventional and important topics that we already know, programming models and compilers, but today they face different challenges than before. I hope you find it interesting.

2 Contents
Motivation
Comparison of threading programming models:
List of features: parallelism patterns, abstractions of memory hierarchy and programming for data locality, mutual exclusion, error handling, tools support, language binding
Runtime scheduling: fork-join execution, work-stealing, work-sharing
Performance evaluation
Conclusion

3 Multi-threading APIs
Thread: the smallest sequence of programmed instructions that can be managed independently by a scheduler; threading creates additional independent execution paths, and multi-threading improves performance and concurrency
Intra-node threading programming models:
High level: OpenMP, OpenACC, Intel Cilk Plus
Mid level: C++11, Intel TBB
Low level: NVIDIA CUDA, OpenCL, PThreads
Aim of the comparison: serve as a guideline for users choosing a proper API, help users choose a parallelism pattern, and provide a performance-wise evaluation

4 List of Features
Parallelism patterns: data parallelism, task parallelism, data/event driven, offloading
Abstraction of memory hierarchy and programming for data locality
Synchronizations: barrier, reduction, join
Mutual exclusion
Error handling
Tools support
Language binding

5 Parallelism Patterns
Data parallelism: the same task works on different data in parallel
Task parallelism: different tasks work in parallel on different or the same data
Offloading: offload a program to a target device (GPU or many-core CPU)
Data/event-driven computation: captures computations characterized by the data-flow programming paradigm, supporting the parallel execution of a stream of data tasks

6 Parallelism Patterns
Asynchronous tasking or threading can be viewed as the foundational parallel mechanism that is supported by all the models. Overall, OpenMP provides the most comprehensive set of features to support all four parallelism patterns. Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs) and graphics processing units (GPUs); the program that runs on the device (CPU or GPU) is called a kernel, and an OpenCL program requires writing code for both the host and the device sides. OpenACC (for open accelerators) is a programming standard for parallel computing developed by Cray, CAPS, Nvidia and PGI; the standard is designed to simplify parallel programming of heterogeneous CPU/GPU systems. Both OpenACC and OpenMP provide high-level offloading constructs. Only OpenMP and Cilk Plus provide constructs for vectorization support. For data/event-driven parallelism, C++'s std::future, OpenMP's depend clause, and OpenACC's wait all let the user specify asynchronous task dependencies to achieve this kind of parallelism. The class template std::future provides a mechanism to access the result of asynchronous operations.
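As a minimal sketch of the data/event-driven pattern (illustrative only, assuming an OpenMP 4.0 compiler and C++11), the same producer/consumer dependency can be written with OpenMP's depend clause or with std::async and std::future:

  #include <stdio.h>
  #include <future>

  void openmp_depend_example(void) {
      int x = 0;
      #pragma omp parallel
      #pragma omp single
      {
          #pragma omp task depend(out: x)    // producer task writes x
          x = 42;
          #pragma omp task depend(in: x)     // consumer task runs only after the producer
          printf("x = %d\n", x);
      }
  }

  void cxx_future_example(void) {
      // std::async launches asynchronous work; get() waits for its result
      std::future<int> f = std::async(std::launch::async, [] { return 42; });
      printf("result = %d\n", f.get());      // blocks until the asynchronous task completes
  }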

7 Abstractions of Memory Hierarchy and Synchronizations
Programming models that support manycore architectures provide interfaces for organizing a large number of threads (thousands) into a two-level thread hierarchy, e.g., OpenMP's teams of threads, OpenACC's gang/worker/vector clauses, CUDA's blocks/threads and OpenCL's work groups. The C++11 thread memory model includes interfaces for a rich memory consistency model and guarantees sequential consistency for programs without data races [6]; such guarantees are not available in most of the other models, except for OpenMP's flush directive. OpenMP also provides constructs to support binding of computation and data, influencing runtime execution under the principle of locality; its high-level affinity interface uses an environment variable to determine the machine topology and assigns OpenMP threads to processors based upon their physical location in the machine. Models that support offloading computation provide constructs to specify explicit data movement between discrete memory spaces; models that do not support other compute devices do not require them. Synchronizations: a programming model often provides constructs for supporting coordination between parallel work units. A barrier stops each thread until the last thread of the team reaches that point; the BARRIER directive synchronizes all threads in the team, and barriers may also be implicit (e.g., at the end of a parallel region). The TASKWAIT construct specifies a wait on the completion of child tasks generated since the beginning of the current task. Because the taskwait construct does not have a C language statement as part of its syntax, there are some restrictions on its placement within a program: the taskwait directive may be placed only at a point where a base language statement is allowed, and it may not be used in place of the statement following an if, while, do, switch, or label. See the OpenMP 3.1 specification document for details.
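A minimal sketch of these two synchronizations in OpenMP (illustrative only):

  #include <omp.h>
  #include <stdio.h>

  void sync_example(void) {
      #pragma omp parallel
      {
          printf("phase 1, thread %d\n", omp_get_thread_num());
          #pragma omp barrier               // every thread waits here until the whole team arrives
          printf("phase 2, thread %d\n", omp_get_thread_num());

          #pragma omp single
          {
              #pragma omp task
              printf("child task\n");
              #pragma omp taskwait          // wait for the child tasks generated so far
              printf("after taskwait\n");
          }
      }                                     // implicit barrier at the end of the parallel region
  }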

8 Mutual Exclusion, Error Handling, Tool Support, and Language Binding
The ATOMIC directive specifies that a specific memory location must be updated atomically, rather than letting multiple threads attempt to write to it. 
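A short sketch of the same idea with OpenMP (illustrative only; the C++11 counterpart would be std::atomic):

  int count_even(const int *v, int n) {
      int count = 0;
      #pragma omp parallel for
      for (int i = 0; i < n; i++) {
          if (v[i] % 2 == 0) {
              #pragma omp atomic            // the update of the shared counter is atomic
              count++;                      // without atomic (or critical) this is a data race
          }
      }
      return count;
  }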

9 Runtime Scheduling
Fork-join execution
The master thread is the single thread in the sequential parts. The master thread forks a team of worker threads. Upon exiting the parallel region, all threads synchronize and join. Dynamic loop schedules have much higher overheads than static schedules.
Fig. 1. Diagram of fork-join [1]
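A minimal fork-join sketch (illustrative only): the master thread runs the sequential parts, forks a team at the parallel directive, and all threads join at the implicit barrier at the end of the region.

  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      printf("sequential part: master thread only\n");
      #pragma omp parallel                  // fork: a team of worker threads is created
      {
          printf("parallel part: thread %d of %d\n",
                 omp_get_thread_num(), omp_get_num_threads());
      }                                     // join: implicit barrier, team synchronizes
      printf("sequential part again: master thread only\n");
      return 0;
  }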

10 Runtime Scheduling
Work-stealing: used by Cilk Plus
Each worker thread has a double-ended queue (deque)
The scheduler dynamically balances the load by stealing tasks
Work-sharing: distributes the iterations of a parallel loop in data parallelism
Fig. 2. Diagram of work-stealing [2]
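To make the contrast concrete, a hedged sketch (the first fragment assumes a Cilk Plus capable compiler, the second an OpenMP compiler): cilk_spawn pushes work onto the worker's deque, from which idle workers steal, while an OpenMP worksharing loop splits its iterations among the team.

  #include <cilk/cilk.h>

  // work-stealing: each worker has a deque; idle workers steal work from busy ones
  long fib(long n) {
      if (n < 2) return n;
      long a = cilk_spawn fib(n - 1);       // spawned child; the continuation may be stolen
      long b = fib(n - 2);
      cilk_sync;                            // wait for the spawned child
      return a + b;
  }

  // work-sharing: the iterations of the parallel loop are distributed among the team
  double parallel_sum(const double *v, int n) {
      double s = 0.0;
      #pragma omp parallel for reduction(+:s) schedule(static)
      for (int i = 0; i < n; i++)
          s += v[i];
      return s;
  }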

11 Runtime Scheduling
OpenMP: a hybrid of multiple runtime technologies (fork-join, work-stealing and work-sharing)
Cilk Plus and TBB: use random work-stealing
Runtimes for low-level programming models (C++ std::thread, CUDA, OpenCL, and PThreads) could be simpler

12 Performance Comparison
Applications
Simple kernels: AXPY, Matrix multiplication, Matrix vector multiplication, Sum, Fib
Applications from Rodinia: BFS, Hotspot, LUD, LavaMD, SRAD
APIs: OpenMP, Cilk Plus, C++11
Parallelism patterns: data and task parallelism

13 Constructs for Data and Task Parallelism
Data parallelism: #pragma omp parallel for; cilk_for
Task parallelism: #pragma omp task and taskwait; cilk_spawn and cilk_sync; std::async and std::future
Simulating data parallelism with C++11: a for loop that manually chunks iterations among threads; std::thread and join
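A hedged sketch of how the same AXPY-style loop might look with each family of constructs (illustrative only; the C++11 version simulates data parallelism by chunking iterations manually):

  #include <algorithm>
  #include <thread>
  #include <vector>
  #include <cilk/cilk.h>

  // OpenMP data parallelism
  void axpy_omp(float *y, const float *x, float a, int n) {
      #pragma omp parallel for
      for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
  }

  // Cilk Plus data parallelism
  void axpy_cilk(float *y, const float *x, float a, int n) {
      cilk_for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
  }

  // C++11: simulate data parallelism with a for loop, manually chunking iterations
  void axpy_threads(float *y, const float *x, float a, int n, int nthreads) {
      std::vector<std::thread> workers;
      int chunk = (n + nthreads - 1) / nthreads;
      for (int t = 0; t < nthreads; t++) {
          int lo = t * chunk, hi = std::min(n, lo + chunk);
          workers.emplace_back([=] {                     // each thread owns one chunk
              for (int i = lo; i < hi; i++) y[i] = a * x[i] + y[i];
          });
      }
      for (auto &w : workers) w.join();                  // wait for all worker threads
  }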

14 Applications
AXPY: solves the equation y = a * x + y
MVM: Matrix Vector Multiplication
MM: Matrix Multiplication
Sum: calculates the sum of a * X[N]
Fib: Fibonacci numbers
Problem sizes of the simple applications: AXPY 100 million, MM 2k, MVM 40k, Sum, Fib 40

15 Applications
BFS: Breadth-First Search
Hotspot: estimates processor temperature
LUD: LU Decomposition (LUD) accelerates solving linear equations
LavaMD: calculates relocation due to mutual forces
SRAD: Speckle Reducing Anisotropic Diffusion (SRAD) removes noise in ultrasonic/radar imaging
Problem sizes of the Rodinia applications: BFS 16 million nodes, Hotspot 8192, LUD 2048, LavaMD 10 boxes, SRAD

16 Machine for Evaluation
Two-socket Intel Xeon E5-2699v3 CPUs, each with 18 physical cores (36 cores in total)
256 GB of 2133 MHz DDR4 ECC memory forming a shared-memory NUMA system
Operating system: CentOS 6.7 Linux with an el6.x86_64 kernel
Compiler: Intel icc
Rodinia version 3.1

17 Simple Kernels
Fig. 3. AXPY performance. Fig. 4. MVM performance.
However, as the computational intensity increases from AXPY to MVM and MM, we see less impact of runtime scheduling on performance. To avoid creating too many threads, we use a threshold in the recursive versions (see the sketch below).
Fig. 5. MM performance.
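For example, a hedged sketch of the kind of cutoff used in the recursive, task-based versions (hypothetical threshold value): below the threshold the task computes serially instead of creating more tasks.

  #define CUTOFF 4096                        // hypothetical threshold, tuned per kernel

  double sum_task(const double *v, long lo, long hi) {
      if (hi - lo <= CUTOFF) {               // small range: compute serially, no new tasks
          double s = 0.0;
          for (long i = lo; i < hi; i++) s += v[i];
          return s;
      }
      long mid = lo + (hi - lo) / 2;
      double left, right;
      #pragma omp task shared(left)          // recurse on one half in a child task
      left = sum_task(v, lo, mid);
      right = sum_task(v, mid, hi);          // the parent handles the other half
      #pragma omp taskwait                   // wait for the child before combining
      return left + right;
  }

  double sum(const double *v, long n) {
      double result = 0.0;
      #pragma omp parallel
      #pragma omp single
      result = sum_task(v, 0, n);
      return result;
  }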

18 Simple Kernels
Fig. 6. Sum performance. Fig. 7. Fib performance.
The Intel OpenMP runtime uses a lock-based deque for pushing, popping and stealing tasks created with omp task. A lock-based deque increases contention and overhead; synchronization is expensive.

19 Simple Kernels
Work-stealing for data parallelism introduces high overhead
Work-stealing can cost more and is not worthwhile for small tasks
Work-sharing shows better performance for data parallelism
Work-stealing shows better performance for task parallelism

20 Rodinia Applications
In the SRAD and LavaMD applications, each thread receives a task with the same amount of work.
Fig. 8. LavaMD performance. Fig. 9. Hotspot performance. Fig. 10. SRAD performance.

21 Rodinia Applications
Work-sharing introduces less overhead for data parallelism
Work-stealing has better performance for task parallelism (LUD and BFS applications)
For uniform task workloads, different implementations perform closely (LavaMD and SRAD applications)
Tasking outperforms data parallelism when there are dependencies between different parallel loop phases (Hotspot application)

22 Conclusion
Comparison of language features and runtime scheduling
Performance comparison of three popular parallel programming models (OpenMP, Cilk Plus and C++11) for task and data parallelism
Execution times of the different implementations vary because of:
Different load-balancing strategies in the runtime
Different or uniform task workloads of the applications
Scheduling and loop distribution overhead
Work-stealing runtimes create high overhead for data parallelism
Work-sharing is more suitable for data parallelism

23 Acknowledgment This work is based upon work supported by the National Science Foundation under Grant No and

