1
Parallel Programming AMANO, Hideharu
2
Parallel Programming
Message Passing: PVM, MPI
Shared Memory: POSIX thread, OpenMP, CUDA/OpenCL
Automatic Parallelizing Compilers
3
Message passing (blocking: rendezvous)
[diagram: matched Send/Receive pairs; the sender waits until the matching Receive is ready]
4
Message passing (with buffer)
[diagram: Send deposits the message into a buffer; the matching Receive takes it later]
5
Message passing (non-blocking)
[diagram: Send returns immediately and other jobs run while the transfer completes before Receive]
6
PVM (Parallel Virtual Machine)
A buffer is provided for the sender. Both blocking and non-blocking receives are provided. Barrier synchronization is supported.
7
MPI (Message Passing Interface)
A superset of PVM for one-to-one communication. Group communication. Various communication modes are supported. Error checking with communication tags.
8
Programming style using MPI
SPMD (Single Program, Multiple Data streams): multiple processes execute the same program, and each performs independent processing based on its process number.
Program execution using MPI: the specified number of processes are generated and distributed to the nodes of the NORA machine or PC cluster.
9
Communication methods
Point-to-point communication: a sender and a receiver execute functions for sending and receiving; each send must be strictly matched with a receive.
Collective communication: communication among multiple processes; the same function is executed by all participating processes. It can be replaced with a sequence of point-to-point communications, but is sometimes more effective (a small sketch follows).
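As an illustrative sketch (not from the original slides), the standard collective MPI_Reduce sums one value from every process at process 0, replacing what would otherwise be a loop of point-to-point receives:

/* Illustrative sketch: every process contributes one value and
   MPI_Reduce sums them at process 0. The same effect could be
   written as nprocs-1 MPI_Send/MPI_Recv pairs. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int pid, nprocs, local, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    local = pid + 1;                /* each process's contribution */
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (pid == 0)
        printf("sum of 1..%d = %d\n", nprocs, sum);

    MPI_Finalize();
    return 0;
}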
10
Fundamental MPI functions
Most programs can be described using six fundamental functions:
MPI_Init() … MPI initialization
MPI_Comm_rank() … get the process number (rank)
MPI_Comm_size() … get the total number of processes
MPI_Send() … send a message
MPI_Recv() … receive a message
MPI_Finalize() … MPI termination
11
Other MPI functions
Functions for measurement:
MPI_Barrier() … barrier synchronization
MPI_Wtime() … get the clock time
Non-blocking functions: they consist of a communication request and a completion check, so other calculation can be executed while waiting (see the sketch below).
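For illustration, a sketch of the non-blocking pattern using the standard MPI_Irecv/MPI_Wait pair (the value 42 and the two-process setup are made up, not from the lecture):

/* Sketch: process 1 posts a non-blocking receive, does other work,
   then waits for completion. Run with at least 2 processes. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int pid, data = 0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    if (pid == 0) {
        data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (pid == 1) {
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);  /* request */
        /* ... other calculation can be executed here ... */
        MPI_Wait(&req, &status);                                   /* check */
        printf("received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}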
12
An Example

#include <stdio.h>
#include <mpi.h>

#define MSIZE 64

int main(int argc, char **argv)
{
    char msg[MSIZE];
    int pid, nprocs, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) {               /* process 0 collects the greetings */
        for (i = 1; i < nprocs; i++) {
            MPI_Recv(msg, MSIZE, MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
            fputs(msg, stdout);
        }
    } else {                      /* the other processes send a greeting */
        sprintf(msg, "Hello, world! (from process #%d)\n", pid);
        MPI_Send(msg, MSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();

    return 0;
}
13
Initialize and Terminate

int MPI_Init(
    int *argc,      /* pointer to argc */
    char ***argv    /* pointer to argv */
);

mpi_init(ierr)
integer ierr        ! return code

The command-line arguments must be passed directly to argc and argv.

int MPI_Finalize();

mpi_finalize(ierr)
integer ierr        ! return code
14
Communicator functions

MPI_Comm_rank returns the rank (process ID) within the communicator comm.

int MPI_Comm_rank(
    MPI_Comm comm,   /* communicator */
    int *rank        /* process ID (output) */
);

mpi_comm_rank(comm, rank, ierr)
integer comm, rank
integer ierr         ! return code

MPI_Comm_size returns the total number of processes in the communicator comm.

int MPI_Comm_size(
    MPI_Comm comm,   /* communicator */
    int *size        /* number of processes (output) */
);

mpi_comm_size(comm, size, ierr)
integer comm, size
integer ierr         ! return code

Communicators are used to share a communication space among a subset of processes. MPI_COMM_WORLD is the pre-defined communicator containing all processes.
15
MPI_Send sends data to process dest.

int MPI_Send(
    void *buf,              /* send buffer */
    int count,              /* # of elements to send */
    MPI_Datatype datatype,  /* datatype of elements */
    int dest,               /* destination (receiver) process ID */
    int tag,                /* tag */
    MPI_Comm comm           /* communicator */
);

mpi_send(buf, count, datatype, dest, tag, comm, ierr)
buf(*)
integer count, datatype, dest, tag, comm
integer ierr                ! return code

Tags are used to identify messages.
16
MPI_Recv

int MPI_Recv(
    void *buf,              /* receive buffer */
    int count,              /* # of elements to receive */
    MPI_Datatype datatype,  /* datatype of elements */
    int source,             /* source (sender) process ID */
    int tag,                /* tag */
    MPI_Comm comm,          /* communicator */
    MPI_Status *status      /* status (output) */
);

mpi_recv(buf, count, datatype, source, tag, comm, status, ierr)
buf(*)
integer count, datatype, source, tag, comm, status(mpi_status_size)
integer ierr                ! return code

The same tag as the sender's must be passed to MPI_Recv. Pass a pointer to an MPI_Status variable; it is a structure with three members, MPI_SOURCE, MPI_TAG and MPI_ERROR, which store the process ID of the sender, the tag, and an error code.
17
datatype and count
The size of the message is specified by count and datatype:
MPI_CHAR    char
MPI_INT     int
MPI_FLOAT   float
MPI_DOUBLE  double
… etc.
A short illustrative fragment follows.
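For example (an illustrative fragment with made-up ranks and tag, not from the lecture), a message of 100 ints is described by count = 100 and datatype = MPI_INT:

/* Illustrative fragment: pid is assumed to hold the rank obtained
   from MPI_Comm_rank(). */
int data[100];
MPI_Status status;

if (pid == 0)
    MPI_Send(data, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* sender   */
else if (pid == 1)
    MPI_Recv(data, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* receiver */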
18
Compile and Execution
% icc -o hello hello.c -lmpi
% mpirun -np 8 ./hello
Hello, world! (from process #1)
Hello, world! (from process #2)
Hello, world! (from process #3)
Hello, world! (from process #4)
Hello, world! (from process #5)
Hello, world! (from process #6)
Hello, world! (from process #7)
19
POSIX Thread
The standard API on Linux for controlling threads (POSIX: Portable Operating System Interface).
Thread handling: pthread_create(); pthread_join(); pthread_exit();
Synchronization:
mutex: pthread_mutex_lock(); pthread_mutex_trylock(); pthread_mutex_unlock();
condition variables (semaphore-style waiting): pthread_cond_signal(); pthread_cond_wait(); etc.
A minimal sketch follows.
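A minimal sketch of thread creation, joining, and mutex use (the counter example and iteration count are illustrative, not from the lecture):

/* Two threads increment a shared counter protected by a mutex. */
#include <stdio.h>
#include <pthread.h>

static int counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);   /* always 2000 thanks to the mutex */
    return 0;
}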
20
OpenMP

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int tid, npes;
        tid = omp_get_thread_num();
        npes = omp_get_num_threads();
        printf("Hello World from %d of %d\n", tid, npes);
    }
    return 0;
}

Multiple threads are generated by the pragma. Variables declared globally can be shared.
21
Convenient pragma for parallel execution

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 1000; i++) {
        c[i] = a[i] + b[i];
    }
}

The assignment of loop iterations (i) to threads is adjusted automatically so that the load on each thread becomes even. A complete version is sketched below.
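A minimal compilable version of the fragment above, with float arrays and initial values assumed only for illustration (build with a compiler's OpenMP flag, e.g. gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void)
{
    float a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    #pragma omp parallel
    {
        #pragma omp for      /* iterations are divided among the threads */
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    printf("c[999] = %f\n", c[999]);   /* expect 2997.0 */
    return 0;
}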
22
CUDA/OpenCL
CUDA was developed for GPGPU programming.
SPMD (Single Program, Multiple Data); 3-D management of threads; 32 threads are managed as a warp (SIMD programming); architecture-dependent memory model.
OpenCL is a standard language for heterogeneous accelerators.
23
Heterogeneous Programming with CUDA
[diagram: serial code runs on the host; parallel kernels KernelA<<<...>>>(args) and KernelB<<<...>>>(args) run on the device]
24
Threads and thread blocks
[diagram: thread blocks 0 .. N-1, each containing threads 0-7; every thread runs the same code:
float x = input[threadID]; float y = func(x); output[threadID] = y;]
A kernel is a grid of thread blocks; each thread executes the same code.
Threads in the same block may synchronize with barriers: __syncthreads();
Thread blocks cannot synchronize with each other, so execution order depends on the machine (a small sketch follows).
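A sketch of a block-level barrier (the kernel name, the 256-thread tile, and the reverse-within-block operation are assumptions for illustration, not from the lecture):

/* Threads in a block cooperate through per-block shared memory;
   __syncthreads() is the block-level barrier.
   Launch with 256 threads per block, e.g. kernel<<<nBlocks, 256>>>(d_data); */
__global__ void reverse_in_block(float *data)
{
    __shared__ float tile[256];          /* per-block shared memory */
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];
    __syncthreads();                     /* wait until the whole tile is loaded */

    data[base + t] = tile[blockDim.x - 1 - t];
}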
25
Memory Hierarchy
[diagram: per-thread local memory; per-block shared memory; per-device global memory shared by sequential kernels (Kernel 0, Kernel 1)]
Transfers between host memory and device memory use cudaMemcpy().
26
CUDA extensions
Declaration specifiers:
__global__ void kernelFunc(...);  // kernel function, runs on device
__device__ int GlobalVar;         // variable in device memory
__shared__ int sharedVar;         // variable in per-block shared memory
Extended function invocation syntax for parallel kernel launch:
KernelFunc<<<dimGrid, dimBlock>>>(...);  // launch dimGrid blocks with dimBlock threads each
Special variables for thread identification in kernels:
dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;
Barrier synchronization between threads: __syncthreads();
27
CUDA runtime
Device management: cudaGetDeviceCount(), cudaGetDeviceProperties()
Device memory management: cudaMalloc(), cudaFree(), cudaMemcpy()
Graphics interoperability: cudaGLMapBufferObject(), cudaD3D9MapResources()
Texture management: cudaBindTexture(), cudaBindTextureToArray()
28
Example: Increment Array Elements

CPU version:

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    ...
    increment_cpu(a, b, N);
}

CUDA version:

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    ...
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
29
Example: Increment Array Elements
Assume N = 16 and blockDim.x = 4:
blockIdx.x = 0, threadIdx.x = 0,1,2,3 -> idx = 0,1,2,3
blockIdx.x = 1, threadIdx.x = 0,1,2,3 -> idx = 4,5,6,7
blockIdx.x = 2, threadIdx.x = 0,1,2,3 -> idx = 8,9,10,11
blockIdx.x = 3, threadIdx.x = 0,1,2,3 -> idx = 12,13,14,15
int idx = blockDim.x * blockIdx.x + threadIdx.x; maps the local index threadIdx to a global index.
blockDim should be >= 32 in real code! Using more blocks hides the memory latency in the GPU.
30
Host code

// allocate host memory
unsigned int numBytes = N * sizeof(float);
float *h_A = (float *) malloc(numBytes);

// allocate device memory
float *d_A = 0;
cudaMalloc((void**)&d_A, numBytes);

// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

// execute the kernel
increment_gpu<<<dimGrid, dimBlock>>>(d_A, b, N);

// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

// free device memory
cudaFree(d_A);
31
GeForce GTX280 (240 cores)
[block diagram: Host -> Input Assembler -> Thread Execution Manager -> many clusters of thread processors, each with per-block shared memory (PBSM); Load/Store to Global Memory]
32
Hardware Implementation: Execution Model
[diagram: the host launches Kernel 1 and Kernel 2 on the device; each kernel is a grid of blocks (0,0)..(2,1); within a block, threads (0,0)..(31,0) form Warp 0, threads (32,0)..(63,0) form Warp 1, and so on]
A multiprocessor executes the same instruction on a group of threads called a warp. Warp size = the number of threads in a warp.
33
Automatic parallelizing compilers
They automatically translate code written for uniprocessors into code for multiprocessors. Loop-level parallelism is the main target of parallelization.
Fortran codes have been the main targets: no pointers, and the array structure is simple. Recently, restricted C has become a target language.
Examples: OSCAR Compiler (Waseda Univ.), COINS.
34
Shared memory model vs. message passing model
Benefits of shared memory: a distributed OS is easy to implement; automatic parallelizing compilers; POSIX thread, OpenMP.
Benefits of message passing: formal verification is easy (blocking); no side effects (a shared variable is a side effect in itself); small cost.
35
Parallel Programming Contest
In this lecture, a parallel programming contest will be held. All students who want to receive credit must join it. At a minimum, the program must run correctly. Students with good results will be given credit unconditionally.