Introduction to Parallel Programming Models CS 5802 Monica Borra
Overview: Types of parallel programming models; Shared Memory model; OpenMP; POSIX Threads; Cilk/Cilk Plus/Cilk++; Thread Building Blocks
Parallel Programming Model A set of software technologies for expressing parallel algorithms and matching applications to the underlying parallel system: "an abstraction above hardware and memory architectures". Types of parallel programming models: Shared Memory model, Threads model, Distributed Memory model, and Hybrid models.
Programming models are NOT specific to a particular type of machine or memory architecture. "Virtual Shared Memory": machine memory is physically distributed across networked machines but appears to the user as a single shared global address space. Every task has direct access to the global address space, yet message passing (e.g., sending and receiving messages with MPI) can also be implemented on top of it.
Shared Memory A common block of read/write memory shared among processes. The shared memory segment is created by the first process, identified by a unique key; other processes that know the key can attach the segment to their own address space and share data with one another. [Figure: several processes attach pointers to one shared memory segment created with a unique key.] int shmget(key_t key, size_t size, int shmflg);
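A minimal sketch of the create/attach pattern described above, using the System V shared memory calls (the key value and message text are illustrative):

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    key_t key = 1234;                                 /* illustrative key; cooperating processes must agree on it */
    int shmid = shmget(key, 1024, IPC_CREAT | 0666);  /* create (or open) a 1 KB segment */
    char *ptr = (char *) shmat(shmid, NULL, 0);       /* attach the segment to this process's address space */

    strcpy(ptr, "hello from shared memory");          /* any attached process can now read this string */
    printf("%s\n", ptr);

    shmdt(ptr);                                       /* detach; shmctl(shmid, IPC_RMID, NULL) would remove it */
    return 0;
}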
Thread Model A program is a collection of threads of control, which in some languages can be created dynamically, mid-execution. Each thread has a set of private variables, e.g., local stack variables, and a set of shared variables, e.g., static variables, shared common blocks, or the global heap. Threads communicate implicitly by writing and reading shared variables. This creates the data race problem: synchronization is required to ensure that no more than one thread is updating the same global address at any time.
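A minimal sketch of why this synchronization matters, using POSIX threads (introduced below); the counter, loop bound, and thread count are illustrative:

#include <pthread.h>
#include <stdio.h>

long counter = 0;                          /* shared variable: visible to all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {     /* i is a private (stack) variable */
        pthread_mutex_lock(&lock);         /* without this lock, counter++ is a data race */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);    /* 400000 only because the updates were serialized */
    return 0;
}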
Several Thread Libraries/Systems: PTHREADS, the POSIX standard; OpenMP, a standard for application-level programming; TBB: Thread Building Blocks; CILK: a language of the C "ilk"; Java threads.
Distributed Memory Model A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines. Tasks exchange data by sending and receiving messages. Data transfer usually requires cooperative operations to be performed by each process; for example, a send operation must have a matching receive operation.
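A minimal sketch of this cooperative send/receive pattern using MPI (the ranks and message value are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                             /* data lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);     /* the send must be matched...        */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                            /* ...by a corresponding receive      */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Compile with mpicc and launch with at least two processes, e.g. mpirun -np 2 ./a.out, assuming an MPI installation is available.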
Open Multi-Processing (OpenMP) A simple API that allows you to add parallelism to existing source code without having to rewrite it significantly. Programming in C/C++/Fortran. It is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications on platforms ranging from the desktop to the supercomputer. It is composed of a set of compiler directives, library routines, and environment variables. Easier to understand and maintain.
Fork-Join Model: an OpenMP program begins as a single master thread, which forks a team of threads at each parallel region; the threads join back into the master thread at the end of the region.
(Note that launching more threads than the number of available processing units can actually slow down the whole program.) OpenMP Since it is compiler-directive based, it requires a compiler that supports it. The directives can be added incrementally, allowing gradual parallelization.
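A minimal sketch of this incremental approach: the serial loop below becomes parallel by adding a single directive (the loop bound and arithmetic are illustrative).

#include <omp.h>
#include <iostream>

int main()
{
    const int n = 1000000;
    double sum = 0.0;

    // Adding this one directive parallelizes the existing serial loop;
    // removing it leaves a correct sequential program.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += i * 0.5;
    }

    std::cout << "sum = " << sum << std::endl;
    return 0;
}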
OpenMP Example:
#include <iostream>
#include <omp.h>
using namespace std;

/* Sample OpenMP program which at stage 1 has 4 threads and at stage 2 has 2 threads */
int main()
{
    #pragma omp parallel num_threads(4) // create 4 threads; the region inside is executed by all of them
    {
        #pragma omp critical // allow one thread at a time to execute the statement below
        cout << "Thread Id in OpenMP stage 1 = " << omp_get_thread_num() << endl;
    }
    // here all threads merge back into a single thread
    cout << "I am alone" << endl;

    #pragma omp parallel num_threads(2) // create two threads
    {
        cout << "Thread Id in OpenMP stage 2 = " << omp_get_thread_num() << endl;
    }
    return 0;
}
Command to run the executable a.out on Linux: ./a.out (compile with g++ -fopenmp)
Output (thread order may vary):
Thread Id in OpenMP stage 1 = 2
Thread Id in OpenMP stage 1 = 0
Thread Id in OpenMP stage 1 = 3
Thread Id in OpenMP stage 1 = 1
I am alone
Thread Id in OpenMP stage 2 = 1
Thread Id in OpenMP stage 2 = 0
OpenMP Advantages: The programmer need not specify the processors (nodes). No need for message passing, since it uses shared memory. The same code can serve both the serial and parallel paradigms. Able to handle coarse-grain parallelism with shared memory. Disadvantages: Runs efficiently only on shared-memory platforms. Scalability is limited by the shared-memory architecture. No reliable error-handling mechanisms. Synchronization among a subset of threads is not supported.
POSIX THREADS POSIX: Portable Operating System Interface for UNIX, an interface to operating-system utilities. Pthreads: the POSIX threading interface. Implementations of the API are available in C/C++ on many Unix-like operating systems; on Windows, however, third-party packages such as pthreads-w32 are needed to implement Pthreads on top of the existing Windows API. Pthreads defines a set of programming-language types, functions, and constants, implemented with a pthread.h header and a thread library. There are around 100 Pthreads procedures, all prefixed "pthread_", which can be categorized into four groups: thread management, mutexes, condition variables, and synchronization.
Forking a POSIX Thread: int pthread_create(pthread_t *, const pthread_attr_t *, void * (*)(void *), void *); Example call: errcode = pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg); thread_id is the thread id or handle (used to halt the thread, etc.) thread_attribute holds various attributes: a. standard default values are obtained by passing a NULL pointer b. sample attribute: minimum stack size thread_fun is the function to be run (takes and returns void*) fun_arg is an argument that can be passed to thread_fun when it starts errcode will be set nonzero if the create operation fails
Some other functions: pthread_yield(); informs the scheduler that the thread is willing to yield its quantum; requires no arguments. pthread_exit(void *value); exits the thread and passes value to the joining thread (if one exists). pthread_join(pthread_t thread, void **result); waits for the specified thread to finish and places its exit value into *result. pthread_t me; me = pthread_self(); allows a pthread to obtain its own identifier. pthread_t thread; pthread_detach(thread); informs the library that the thread's exit status will not be needed by subsequent pthread_join calls, resulting in better thread performance.
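A minimal sketch of passing an exit value from a thread back through pthread_join (the square computation is illustrative):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *square(void *arg)
{
    long n = (long) arg;
    long *result = malloc(sizeof(long));   /* heap storage outlives the thread's stack */
    *result = n * n;
    pthread_exit(result);                  /* the exit value is delivered to the joining thread */
}

int main(void)
{
    pthread_t tid;
    void *ret;

    pthread_create(&tid, NULL, square, (void *) 7L);
    pthread_join(tid, &ret);               /* wait for the thread; its exit value lands in ret */

    printf("square returned %ld\n", *(long *) ret);
    free(ret);
    return 0;
}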
Simple Example:
#include <pthread.h>
#include <stdio.h>

void *SayHello(void *foo)
{
    printf("Hello, world!\n");
    return NULL;
}

int main()
{
    pthread_t threads[16];
    int tn;
    for (tn = 0; tn < 16; tn++) {
        pthread_create(&threads[tn], NULL, SayHello, NULL);
    }
    for (tn = 0; tn < 16; tn++) {
        pthread_join(threads[tn], NULL);
    }
    return 0;
}
Compile using gcc -lpthread
CILK/CILK PLUS/CILK++ Programming languages that extend C and C++. Initially developed at MIT, based on ANSI C; it now belongs to Intel. The initial applications of Cilk were in high-performance computing. Intel Cilk Plus keywords: cilk_spawn - specifies that a function call can execute asynchronously, without requiring the caller to wait for it to return. This is an expression of an opportunity for parallelism, not a command that mandates parallelism; the Intel Cilk Plus runtime chooses whether to run the function in parallel with its caller. cilk_sync - specifies that all spawned calls in a function must complete before execution continues. There is an implied cilk_sync at the end of every function that contains a cilk_spawn. cilk_for - allows iterations of the loop body to be executed in parallel. Cilk Plus also introduces "reducers", which provide a lock-free mechanism that lets parallel code use private "views" of a variable that are merged at the next sync.
Example of Cilk Plus (uses the header file <cilk/cilk.h>):
int fib(int n)
{
    if (n < 2) return n;
    int x = cilk_spawn fib(n-1);
    int y = fib(n-2);
    cilk_sync;
    return x + y;
}

for (int i = 0; i < 8; ++i) {
    cilk_spawn do_work(i);
}
cilk_sync;

cilk_for (int i = 0; i < 8; ++i) {
    do_work(i);
}
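A minimal sketch of a reducer, the other Cilk Plus feature mentioned above (the addition reducer and loop bound are illustrative; exact reducer API details vary between Cilk Plus versions):

#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
#include <iostream>

int main()
{
    cilk::reducer_opadd<long> sum(0);     // each strand gets a private view of the sum

    cilk_for (long i = 0; i < 1000000; ++i) {
        *sum += i;                        // updates the strand-local view, no lock needed
    }

    // the private views are merged when the parallel strands sync
    std::cout << "sum = " << sum.get_value() << std::endl;
    return 0;
}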
Thread Building Blocks (TBB) A C++ template library developed by Intel for parallel programming on multi-core processors. TBB lets you specify tasks instead of threads. TBB is compatible with other threading packages. A TBB program creates, synchronizes, and destroys graphs of dependent tasks according to high-level parallel programming paradigms (algorithmic skeletons). TBB emphasizes scalable, data-parallel programming and optimizes core utilization, though it may introduce scheduling overhead.
TBB COMPONENTS Basic algorithms: parallel_for, parallel_reduce, parallel_scan. Advanced algorithms: parallel_while, parallel_do, parallel_pipeline, parallel_sort. Containers: concurrent_queue, concurrent_priority_queue, concurrent_vector, concurrent_hash_map. Memory allocation: scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator, cache_aligned_allocator. Mutual exclusion: mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive_mutex. Atomic operations: fetch_and_add, fetch_and_increment, fetch_and_decrement, compare_and_swap, fetch_and_store. TBB relies on generic programming and is similar in spirit to the Standard Template Library (STL).
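A minimal sketch of the parallel_for component listed above (the vector size and the doubling operation are illustrative):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <iostream>

int main()
{
    std::vector<double> data(1000000, 1.0);

    // TBB splits the iteration range into chunks and schedules them as tasks;
    // the programmer specifies the work, not the threads.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
                      [&](const tbb::blocked_range<size_t> &r) {
                          for (size_t i = r.begin(); i != r.end(); ++i)
                              data[i] *= 2.0;
                      });

    std::cout << "data[0] = " << data[0] << std::endl;
    return 0;
}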
High Performance Fortran (HPF) An extension of Fortran 90 with constructs that support parallel computing. Allows efficient implementation on both SIMD- and MIMD-style architectures. Implicit parallelization (mapping, distribution, communication, synchronization). High productivity.
Parallel Virtual Machine (PVM) Enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource. Supports software execution on each machine in a user-configurable pool. Heterogeneous applications can exploit the specific strengths of individual machines on a network. Provides dynamic resource management and powerful process-control functions. Fault tolerant (applications can survive host or task failures) and portable.
THANK YOU!