Parallel Programming on the IUCAA Clusters
Sunu Engineer
IUCAA Clusters
- The Cluster – a cluster of Intel machines running Linux
- Hercules – a cluster of HP ES45 quad-processor nodes
- References: http://www.iucaa.ernet.in/
The Cluster
- Four single-processor nodes with a 100 Mbps Ethernet interconnect
- 1.4 GHz Intel Pentium 4, 512 MB RAM
- Linux 2.4 kernel (Red Hat 7.2 distribution)
- MPI – LAM 6.5.9
- PVM – 3.4.3
Hercules
- Four quad-processor nodes with a Memory Channel interconnect
- 1.25 GHz Alpha 21264D RISC processor, 4 GB RAM
- Tru64 5.1A with TruCluster software
- Native MPI, LAM 7.0, PVM 3.4.3
Expected Computational Performance
Intel cluster:
- Processor: 512/590 (SPECint/SPECfp)
- System: ~2 GFLOPS
- Benchmarks used: SPECint, SPECfp, HPL
ES45 cluster:
- Processor: ~679/960 (SPECint/SPECfp)
- System: ~30 GFLOPS
- Benchmarks used: SPECint, SPECfp, HPL
Parallel Programs
- A move towards large-scale distributed programs
- A larger class of problems, at higher resolution
- Enhanced levels of detail to be explored …
The Starting Point
- Model → single-processor program → multiprocessor program
- Model → multiprocessor program
Decomposition of a Single-Processor Program
Temporal:
- Initialization
- Control
- Termination
Spatial:
- Functional
- Modular
- Object based
Multiprocessor Programs
- Spatial delocalization – dissolving the boundary
- A single spatial coordinate is no longer valid
- A single time coordinate is no longer valid
- Temporal multiplicity – multiple streams running at different rates w.r.t. an external clock
In Comparison
- Multiple points of initialization
- Distributed control
- Multiple points and times of termination
- Distribution of activity in space and time
Breaking Up a Problem
Yet Another Way
And Another
Amdahl’s Law
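The slide carries only the title; for reference, Amdahl's law states that if a fraction p of a program's work can be parallelized across N processors, the overall speedup is

  S(N) = 1 / ((1 - p) + p/N)

As N grows, S(N) approaches 1/(1 - p): the serial fraction, however small, ultimately bounds the achievable speedup.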
Degrees of Refinement
Fine-grained parallelism:
- Instruction level
- Program statement level
- Loop level
Coarse-grained parallelism:
- Process level
- Task level
- Region level
Patterns and Frameworks
- Patterns – documented solutions to recurring design problems
- Frameworks – software and hardware structures implementing the infrastructure
Processes and Threads
- From heavyweight multitasking to lightweight multitasking on a single processor
- From isolated memory spaces to a shared memory space
POSIX Threads in Brief
- pthread_create(pthread_t *id, const pthread_attr_t *attributes, void *(*thread_function)(void *), void *arguments)
- pthread_exit
- pthread_join
- pthread_self
- pthread_mutex_init
- pthread_mutex_lock / pthread_mutex_unlock
- Link with -lpthread
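A minimal sketch tying the calls above together; the thread count and the function and variable names are illustrative, not from the slide. Compile with cc hello.c -lpthread.

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NTHREADS 4            /* illustrative choice */

  /* Thread body: each thread receives its logical id via the void * argument. */
  static void *hello(void *arg)
  {
      long id = (long) arg;
      printf("Hello from thread %ld\n", id);
      return NULL;
  }

  int main(void)
  {
      pthread_t tid[NTHREADS];
      long i;

      for (i = 0; i < NTHREADS; i++)
          if (pthread_create(&tid[i], NULL, hello, (void *) i) != 0) {
              perror("pthread_create");
              exit(1);
          }

      /* Join all threads so the process does not exit while they still run. */
      for (i = 0; i < NTHREADS; i++)
          pthread_join(tid[i], NULL);

      return 0;
  }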
Multiprocessing Architectures
- Symmetric multiprocessing
- A shared, unified memory space
- Different temporal streams
- The OpenMP standard
OpenMP Programming
- A set of compiler directives expressing shared-memory parallelism
- A small library of functions
- Environment variables
- Standard language bindings defined for Fortran, C and C++
An OpenMP Example
C:
  #include <stdio.h>
  #include <omp.h>   /* for omp_get_thread_num() */

  int main(int argc, char **argv)
  {
      #pragma omp parallel
      {
          printf("Hello World from %d\n", omp_get_thread_num());
      }
      return 0;
  }
Fortran:
  program openmp
  integer omp_get_thread_num
  !$OMP PARALLEL
  print *, "Hello world from", omp_get_thread_num()
  !$OMP END PARALLEL
  stop
  end
OpenMP Directives
Parallel and work-sharing:
- OMP parallel [clauses]
- OMP do [clauses]
- OMP sections [clauses]
- OMP section
- OMP single
Combined work-sharing:
- OMP parallel do
- OMP parallel sections
Synchronization:
- OMP master
- OMP critical
- OMP barrier
- OMP atomic
- OMP flush
- OMP ordered
- OMP threadprivate
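As a sketch of how a combined work-sharing construct and a synchronization construct interact (the variable names are invented for the example), two sections update a shared counter, each update guarded by a critical region:

  #include <stdio.h>

  int main(void)
  {
      int total = 0;                 /* shared by default */

      #pragma omp parallel sections
      {
          #pragma omp section
          {
              #pragma omp critical
              total += 1;            /* serialized update */
          }
          #pragma omp section
          {
              #pragma omp critical
              total += 2;
          }
      }
      printf("total = %d\n", total); /* always 3 */
      return 0;
  }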
OpenMP Directive Clauses
- shared(list)
- private(list) / threadprivate
- firstprivate(list) / lastprivate(list)
- default(private | shared | none) in Fortran; default(shared | none) in C/C++
- reduction(operator | intrinsic : list)
- copyin(list)
- if(expr)
- schedule(type[, chunk])
- ordered / nowait
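A sketch combining several of these clauses on a single directive (the array, its size, and the chunk size are illustrative):

  #include <stdio.h>
  #define N 100

  int main(void)
  {
      double a[N], sum = 0.0, tmp;
      int i;

      for (i = 0; i < N; i++)
          a[i] = 1.0;

      /* tmp gets a private copy per thread; partial sums are combined by
         the reduction; iterations are dealt out statically in chunks of 4. */
      #pragma omp parallel for private(tmp) reduction(+:sum) schedule(static,4)
      for (i = 0; i < N; i++) {
          tmp = a[i] * a[i];
          sum += tmp;
      }

      printf("sum = %f\n", sum);
      return 0;
  }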
OpenMP Library Functions
- omp_get_num_threads() / omp_set_num_threads()
- omp_get_max_threads()
- omp_get_thread_num()
- omp_get_num_procs()
- omp_in_parallel()
- omp_get_dynamic() / omp_set_dynamic(), omp_get_nested() / omp_set_nested()
- omp_init_lock() / omp_destroy_lock() / omp_test_lock()
- omp_set_lock() / omp_unset_lock()
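A small sketch exercising a few of these calls; the request for four threads is an arbitrary choice:

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      omp_set_num_threads(4);              /* request four threads */

      #pragma omp parallel
      {
          if (omp_get_thread_num() == 0)   /* let one thread report */
              printf("%d threads on %d processors (in parallel: %d)\n",
                     omp_get_num_threads(), omp_get_num_procs(),
                     omp_in_parallel());
      }
      return 0;
  }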
OpenMP Environment Variables
- OMP_SCHEDULE
- OMP_NUM_THREADS
- OMP_DYNAMIC
- OMP_NESTED
OpenMP Reduction and Atomic Operators
- Reduction: +, -, *, &, |, &&, ||
- Atomic: ++, --, +, *, -, /, &, >>, <<, |
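A sketch contrasting the two mechanisms (the loop bound is arbitrary): a reduction accumulates per-thread partial sums and combines them once at the end, while atomic protects each individual update of a shared variable:

  #include <stdio.h>

  int main(void)
  {
      int i, rsum = 0, asum = 0;

      #pragma omp parallel for reduction(+:rsum)
      for (i = 0; i < 1000; i++)
          rsum += i;                 /* private partial sums, combined at loop end */

      #pragma omp parallel for
      for (i = 0; i < 1000; i++) {
          #pragma omp atomic
          asum += i;                 /* every update serialized atomically */
      }

      printf("rsum = %d, asum = %d\n", rsum, asum);
      return 0;
  }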
Simple Loops
Serial:
  do I = 1, N
     z(I) = a * x(I) + y
  end do
Parallel:
  !$OMP parallel do
  do I = 1, N
     z(I) = a * x(I) + y
  end do
Data Scoping
- The loop index is private by default
- Declare other variables as shared, private or reduction
Private Variables
Fortran:
  !$OMP parallel do private(a,b,c)
  do I = 1, m
     do j = 1, n
        b = f(I)
        c = k(j)
        call abc(a, b, c)
     end do
  end do
C:
  #pragma omp parallel for private(a,b,c)
Dependencies
- Data dependencies (lexical/dynamic extent)
- Flow dependencies
- Classifying and removing the dependencies
- Non-removable dependencies
Examples:
  do I = 2, n
     a(I) = a(I) + a(I-1)
  end do
Each iteration reads the value written by the previous one, so this loop cannot be parallelized as written.
  do I = 2, N, 2
     a(I) = a(I) + a(I-1)
  end do
With a stride of 2 the loop writes only even elements and reads only odd ones, so its iterations are independent and it can safely run in parallel.
Making Sure Everyone Has Enough Work
- Parallel overhead – creation of threads and synchronization vs. work done in the loop
  !$OMP parallel do schedule(dynamic,3)
- Schedule types: static, dynamic, guided, runtime
Parallel Regions – from Fine to Coarse Parallelism
- !$OMP parallel
- threadprivate and copyin
- Work-sharing constructs: do, sections, section, single
- Synchronization: critical, atomic, barrier, ordered, master
To Distributed-Memory Systems
- MPI, PVM, BSP …
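For comparison with the OpenMP examples above, a minimal MPI program of the same flavour; under LAM (the MPI installed on both clusters) it would typically be built with mpicc and launched with mpirun -np 4 after lamboot:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);                 /* start the MPI runtime */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

      printf("Hello from process %d of %d\n", rank, size);

      MPI_Finalize();
      return 0;
  }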
Some Parallel Libraries
Existing parallel libraries and toolkits include:
- PUL, the Parallel Utilities Library, from EPCC
- The Multicomputer Toolbox, from Tony Skjellum and colleagues at LLNL and MSU
- PETSc, the Portable, Extensible Toolkit for Scientific Computation, from ANL
- ScaLAPACK, from ORNL and UTK
- ESSL and PESSL on AIX
- PBLAS, PLAPACK, ARPACK