Parallel Programming AMANO, Hideharu

Parallel Programming
Message Passing: PVM, MPI
Shared Memory: POSIX thread, OpenMP, CUDA/OpenCL
Automatic Parallelizing Compilers

Message passing (blocking: rendezvous) — Send / Receive, Send / Receive. A blocking send waits until the matching receive is executed, so the two processes meet at the communication point.

Message passing (with buffer) — Send / Receive, Send / Receive. The message is copied into a buffer, so the sender can continue before the matching receive is executed.

Message passing (non-blocking) — Send / Receive / Other job. The send returns immediately, and other jobs can be executed while the communication completes.

PVM (Parallel Virtual Machine): a buffer is provided for the sender; both blocking and non-blocking receive are provided; barrier synchronization is supported.

MPI (Message Passing Interface): a superset of PVM for one-to-one communication; group communication and various other communication styles are supported; error checking is done with communication tags.

Programming style using MPI
SPMD (Single Program Multiple Data Streams): multiple processes execute the same program, and each process does independent work based on its process number.
Program execution using MPI: the specified number of processes are generated and distributed to the nodes of the NORA machine or PC cluster.

Communication methods
Point-to-point communication: a sender and a receiver execute functions for sending and receiving; each send must be strictly matched with a receive.
Collective communication: communication among multiple processes; the same function is called by all participating processes. It can be replaced with a sequence of point-to-point communications, but the collective form is often more efficient (see the sketch below).
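
A minimal sketch of collective communication, assuming the standard MPI C API (the variable names and the value 1000 are illustrative): rank 0 broadcasts a parameter with MPI_Bcast, every rank does some work, and the partial results are combined on rank 0 with MPI_Reduce. Note that every process calls the same collective function.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int pid, nprocs;
        int n = 0;
        double partial, total;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (pid == 0) n = 1000;            /* parameter known only to rank 0 */
        /* broadcast n from rank 0 to all processes */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        partial = (double)n / nprocs;      /* dummy per-process work */
        /* sum the partial results onto rank 0 */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (pid == 0) printf("total = %f\n", total);
        MPI_Finalize();
        return 0;
    }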

Fundamental MPI functions
Most programs can be described using six fundamental functions:
MPI_Init() … MPI initialization
MPI_Comm_rank() … get the process number (rank)
MPI_Comm_size() … get the total number of processes
MPI_Send() … message send
MPI_Recv() … message receive
MPI_Finalize() … MPI termination

Other MPI functions
Functions for measurement: MPI_Barrier() … barrier synchronization; MPI_Wtime() … get the wall-clock time.
Non-blocking functions: consist of a communication request and a completion check, so other calculations can be executed while waiting (see the sketch below).
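
A minimal sketch of non-blocking communication and timing, assuming the standard MPI C API (the ring pattern and buffer contents are illustrative): MPI_Irecv and MPI_Isend issue the requests, other work can overlap the transfer, and MPI_Wait checks completion; MPI_Barrier and MPI_Wtime bracket the measured region.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int pid, nprocs, partner;
        double sendbuf = 1.0, recvbuf = 0.0, t0, t1;
        MPI_Request sreq, rreq;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        partner = (pid + 1) % nprocs;          /* send to the next rank (ring) */

        MPI_Barrier(MPI_COMM_WORLD);           /* start measuring from a common point */
        t0 = MPI_Wtime();

        /* post the requests; both calls return immediately */
        MPI_Irecv(&recvbuf, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &rreq);
        MPI_Isend(&sendbuf, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &sreq);

        /* other calculation can be executed here, overlapped with the transfer */

        MPI_Wait(&sreq, MPI_STATUS_IGNORE);    /* check (wait for) completion */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);

        t1 = MPI_Wtime();
        if (pid == 0) printf("elapsed: %f sec\n", t1 - t0);
        MPI_Finalize();
        return 0;
    }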

An Example

#include <stdio.h>
#include <mpi.h>

#define MSIZE 64

int main(int argc, char **argv)
{
    char msg[MSIZE];
    int pid, nprocs, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) {
        for (i = 1; i < nprocs; i++) {
            MPI_Recv(msg, MSIZE, MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
            fputs(msg, stdout);
        }
    } else {
        sprintf(msg, "Hello, world! (from process #%d)\n", pid);
        MPI_Send(msg, MSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Initialize and Terminate

C:
    int MPI_Init(
        int *argc,       /* pointer to argc */
        char ***argv     /* pointer to argv */
    );
Fortran:
    mpi_init(ierr)
        integer ierr     ! return code

The command-line arguments must be passed directly through argc and argv.

C:
    int MPI_Finalize();
Fortran:
    mpi_finalize(ierr)
        integer ierr     ! return code

Communicator functions

MPI_Comm_rank() returns the rank (process ID) of the calling process in the communicator comm.

C:
    int MPI_Comm_rank(
        MPI_Comm comm,   /* communicator */
        int *rank        /* process ID (output) */
    );
Fortran:
    mpi_comm_rank(comm, rank, ierr)
        integer comm, rank
        integer ierr     ! return code

MPI_Comm_size() returns the total number of processes in the communicator comm.

C:
    int MPI_Comm_size(
        MPI_Comm comm,   /* communicator */
        int *size        /* number of processes (output) */
    );
Fortran:
    mpi_comm_size(comm, size, ierr)
        integer comm, size
        integer ierr     ! return code

Communicators are used to share a communication space among a subset of processes. MPI_COMM_WORLD is the predefined communicator containing all processes.

MPI_Send

MPI_Send() sends data to the process "dest".

C:
    int MPI_Send(
        void *buf,               /* send buffer */
        int count,               /* # of elements to send */
        MPI_Datatype datatype,   /* datatype of elements */
        int dest,                /* destination (receiver) process ID */
        int tag,                 /* tag */
        MPI_Comm comm            /* communicator */
    );
Fortran:
    mpi_send(buf, count, datatype, dest, tag, comm, ierr)
        buf(*)
        integer count, datatype, dest, tag, comm
        integer ierr     ! return code

Tags are used to identify messages.

MPI_Recv

C:
    int MPI_Recv(
        void *buf,               /* receive buffer */
        int count,               /* # of elements to receive */
        MPI_Datatype datatype,   /* datatype of elements */
        int source,              /* source (sender) process ID */
        int tag,                 /* tag */
        MPI_Comm comm,           /* communicator */
        MPI_Status *status       /* status (output) */
    );
Fortran:
    mpi_recv(buf, count, datatype, source, tag, comm, status, ierr)
        buf(*)
        integer count, datatype, source, tag, comm, status(mpi_status_size)
        integer ierr     ! return code

The same tag as the sender's must be passed to MPI_Recv. Pass a pointer to an MPI_Status variable; it is a structure with three members, MPI_SOURCE, MPI_TAG and MPI_ERROR, which store the sender's process ID, the tag, and the error code.

datatype and count
The size of the message is specified with count and datatype:
    MPI_CHAR    char
    MPI_INT     int
    MPI_FLOAT   float
    MPI_DOUBLE  double
    … etc.
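
A small sketch of how count and datatype describe a message (the function name exchange, the tag 7, and N = 100 are illustrative, and at least two ranks are assumed): an array of 100 doubles is sent as count = 100, datatype = MPI_DOUBLE.

    #include <mpi.h>

    #define N 100

    /* fragment: assumes MPI_Init has already been called */
    void exchange(int pid)
    {
        double data[N];
        MPI_Status status;

        if (pid == 0) {
            for (int i = 0; i < N; i++) data[i] = (double)i;
            /* N elements of type MPI_DOUBLE, sent to rank 1 with tag 7 */
            MPI_Send(data, N, MPI_DOUBLE, 1, 7, MPI_COMM_WORLD);
        } else if (pid == 1) {
            /* count and datatype must describe a buffer at least this large */
            MPI_Recv(data, N, MPI_DOUBLE, 0, 7, MPI_COMM_WORLD, &status);
        }
    }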

Compile and Execution

% icc -o hello hello.c -lmpi
% mpirun -np 8 ./hello
Hello, world! (from process #1)
Hello, world! (from process #2)
Hello, world! (from process #3)
Hello, world! (from process #4)
Hello, world! (from process #5)
Hello, world! (from process #6)
Hello, world! (from process #7)

POSIX Thread
Standard API on Linux for controlling threads (Portable Operating System Interface).
Thread handling: pthread_create(); pthread_join(); pthread_exit();
Synchronization:
    mutex: pthread_mutex_lock(); pthread_mutex_trylock(); pthread_mutex_unlock();
    condition variables (and semaphores): pthread_cond_signal(); pthread_cond_wait(); etc.
A small sketch using these calls follows below.
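
A minimal sketch, assuming the standard pthread API (the worker function and iteration count are illustrative): two threads are created, each increments a shared counter under a mutex, and the main thread joins them.

    #include <stdio.h>
    #include <pthread.h>

    static int counter = 0;                    /* shared variable */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);         /* protect the shared counter */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];

        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);          /* wait for both threads */

        printf("counter = %d\n", counter);     /* prints 200000 */
        return 0;
    }

Compile with the pthread option, e.g. % gcc -o count count.c -pthread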

OpenMP

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int tid, npes;
        tid = omp_get_thread_num();
        npes = omp_get_num_threads();
        printf("Hello World from %d of %d\n", tid, npes);
    }
    return 0;
}

Multiple threads are generated by the pragma. Variables declared globally can be shared.

Convenient pragma for parallel execution

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 1000; i++) {
        c[i] = a[i] + b[i];
    }
}

The assignment of loop iterations i to threads is adjusted automatically so that the load on each thread becomes even.

CUDA/OpenCL
CUDA is developed for GPGPU programming.
SPMD (Single Program Multiple Data)
3-D management of threads
32 threads are managed as a warp → SIMD programming
Architecture-dependent memory model
OpenCL is a standard language for heterogeneous accelerators.

Heterogeneous Programming with CUDA
Serial code runs on the host; parallel kernels (KernelA(args), KernelB(args)) run on the device. Execution alternates: serial host code, parallel kernel A on the device, serial host code, parallel kernel B, and so on.

Threads and thread blocks
A kernel is a grid of thread blocks (Thread Block 0 … Thread Block N-1); each thread executes the same code, selecting its data with its threadID:

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

Threads in the same block may synchronize with barriers: __syncthreads();
Thread blocks cannot synchronize with each other → execution order depends on the machine.
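
A hedged sketch of a block-level barrier (the kernel name shift_left and block size 256 are illustrative, and the kernel is assumed to be launched with BLOCK threads per block): each thread loads one element into per-block shared memory, __syncthreads() guarantees all loads are complete, and each thread then reads its neighbour's element.

    #define BLOCK 256

    __global__ void shift_left(const float *input, float *output, int N)
    {
        __shared__ float tile[BLOCK];              /* per-block shared memory */
        int tid = threadIdx.x;
        int idx = blockIdx.x * blockDim.x + tid;

        if (idx < N)
            tile[tid] = input[idx];                /* each thread loads one element */

        __syncthreads();                           /* barrier: all loads are done */

        if (idx < N) {
            /* read a neighbour's element, staying inside this block and inside N */
            int next = (tid + 1 < blockDim.x && idx + 1 < N) ? tid + 1 : tid;
            output[idx] = tile[next];
        }
    }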

Memory Hierarchy
Per-thread local memory, per-block shared memory, and per-device global memory (shared by sequentially launched kernels, Kernel 0, Kernel 1, …). Transfers between host memory and device global memory use cudaMemcpy().

CUDA extensions
Declaration specifiers:
    __global__ void kernelFunc(…);   // kernel function, runs on the device
    __device__ int GlobalVar;        // variable in device memory
    __shared__ int sharedVar;        // variable in per-block shared memory
Extended function invocation syntax for parallel kernel launch:
    KernelFunc<<<dimGrid, dimBlock>>>(…);   // launch dimGrid blocks with dimBlock threads each
Special variables for thread identification in kernels:
    dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;
Barrier synchronization between threads:
    __syncthreads();

CUDA runtime
Device management: cudaGetDeviceCount(), cudaGetDeviceProperties()
Device memory management: cudaMalloc(), cudaFree(), cudaMemcpy()
Graphics interoperability: cudaGLMapBufferObject(), cudaD3D9MapResources()
Texture management: cudaBindTexture(), cudaBindTextureToArray()
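
A minimal device-query sketch using the runtime calls listed above (the output format is illustrative): cudaGetDeviceCount() reports how many devices are present, and cudaGetDeviceProperties() fills a cudaDeviceProp structure for each one.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);                /* how many CUDA devices? */

        for (int i = 0; i < count; i++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);     /* fill in device properties */
            printf("device %d: %s, %zu bytes of global memory\n",
                   i, prop.name, prop.totalGlobalMem);
        }
        return 0;
    }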

Example: Increment Array Elements

CPU program:

    void increment_cpu(float *a, float b, int N)
    {
        for (int idx = 0; idx < N; idx++)
            a[idx] = a[idx] + b;
    }

    void main()
    {
        …
        increment_cpu(a, b, N);
    }

CUDA program:

    __global__ void increment_gpu(float *a, float b, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] + b;
    }

    void main()
    {
        …
        dim3 dimBlock(blocksize);
        dim3 dimGrid(ceil(N / (float)blocksize));
        increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
    }

Example: Increment Array Elements

Assume N = 16 and blockDim = 4:

    blockIdx.x = 0, blockDim.x = 4, threadIdx.x = 0,1,2,3 → idx = 0,1,2,3
    blockIdx.x = 1, blockDim.x = 4, threadIdx.x = 0,1,2,3 → idx = 4,5,6,7
    blockIdx.x = 2, blockDim.x = 4, threadIdx.x = 0,1,2,3 → idx = 8,9,10,11
    blockIdx.x = 3, blockDim.x = 4, threadIdx.x = 0,1,2,3 → idx = 12,13,14,15

int idx = blockDim.x * blockIdx.x + threadIdx.x; maps the local index threadIdx to a global index. blockDim should be >= 32 in real code! Using a larger number of blocks hides the memory latency of the GPU.

Host code

    // allocate host memory
    unsigned int numBytes = N * sizeof(float);
    float *h_A = (float *)malloc(numBytes);

    // allocate device memory
    float *d_A = 0;
    cudaMalloc((void **)&d_A, numBytes);

    // copy data from host to device
    cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

    // execute the kernel
    increment_gpu<<<dimGrid, dimBlock>>>(d_A, b, N);

    // copy data from device back to host
    cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

    // free device memory
    cudaFree(d_A);

GPU organization (GeForce GTX): an array of multiprocessors, each consisting of thread processors with a per-block shared memory (PBSM), fed by a thread execution manager and an input assembler connected to the host, with load/store access to global memory.

Hardware Implementation: Execution Model
The host launches kernels as grids of thread blocks (Kernel 1 → Grid 1, Kernel 2 → Grid 2, each with blocks (0,0) … (2,1)). Within a block, consecutive threads are grouped into warps: threads (0,0)…(31,0) form Warp 0, threads (32,0)…(63,0) form Warp 1, and so on. A multiprocessor executes the same instruction on a group of threads called a warp; the warp size is the number of threads in a warp.

Automatic Parallelizing Compilers
Automatically translate code written for uniprocessors into code for multiprocessors. Loop-level parallelism is the main target of parallelization. Fortran codes have been the main targets: no pointers, and the array structure is simple. Recently, restricted C has also become a target language. Examples: OSCAR compiler (Waseda Univ.), COINS. The kind of loops such compilers look for is sketched below.
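
A hedged illustration (not tied to any particular compiler) of loop-level analysis: the first loop has no cross-iteration dependences, so its iterations can be divided among processors; the second carries a dependence through a[i-1] and cannot be distributed directly.

    /* Independent iterations: an automatic parallelizing compiler can split
       the index range 0..N-1 among processors (the same transformation a
       programmer would express with an OpenMP parallel for). */
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Loop-carried dependence: iteration i reads a[i-1], written by iteration
       i-1, so the iterations cannot simply be distributed across processors. */
    for (i = 1; i < N; i++)
        a[i] = a[i-1] + b[i];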

Shared memory model vs. message passing model
Shared memory benefits:
    A distributed OS is easy to implement.
    Automatic parallelizing compilers.
    POSIX thread, OpenMP.
Message passing benefits:
    Formal verification is easy (blocking).
    No side effects (a shared variable is a side effect in itself).
    Small cost.

Parallel Programming Contest
In this lecture, a parallel programming contest will be held. All students who want to get the credit must join it. At a minimum, the program must run correctly. Students with good results will be given the credit unconditionally.