1
An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos
2
What is Serial Computing? Traditionally, software has been written for serial computation:
– To be run on a single computer having a single Central Processing Unit (CPU)
– A problem is broken into a discrete series of instructions
– Instructions are executed one after another
– Only one instruction may execute at any moment in time
3
Serial Computing:
4
What is Parallel Computing? In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
– To be run using multiple CPUs
– A problem is broken into discrete parts that can be solved concurrently
– Each part is further broken down to a series of instructions
– Instructions from each part execute simultaneously on different CPUs
5
Parallel Computing
6
Computer Architecture (von Neumann) Comprised of four main components:
– Memory
– Control Unit
– Arithmetic Logic Unit
– Input/Output
Read/write, random-access memory is used to store both program instructions and data:
– Program instructions are coded data which tell the computer to do something
– Data is simply information to be used by the program
The Control Unit fetches instructions/data from memory, decodes the instructions, and then sequentially coordinates operations to accomplish the programmed task. The Arithmetic Logic Unit performs basic arithmetic operations. Input/Output is the interface to the human operator.
7
UMA, or Uniform Memory Access In the UMA memory architecture, all processors access shared memory through a bus (or another type of interconnect) as seen in the following diagram:
8
UMA, or Uniform Memory Access UMA gets its name from the fact that each processor must use the same shared bus to access memory, resulting in a memory access time that is uniform across all processors. Note that access time is also independent of data location within memory. That is, access time remains the same regardless of which shared memory module contains the data to be retrieved.
9
NUMA (Non-Uniform Memory Access) In the NUMA shared memory architecture, each processor has its own local memory module that it can access directly and with a distinctive performance advantage. At the same time, it can also access any memory module belonging to another processor using a shared bus (or some other type of interconnect) as seen in the diagram below:
10
NUMA (Non-Uniform Memory Access) What gives NUMA its name is that memory access time varies with the location of the data to be accessed. If data resides in local memory, access is fast. If data resides in remote memory, access is slower. The advantage of the NUMA architecture as a hierarchical shared memory scheme is its potential to improve average-case access time through the introduction of fast, local memory.
11
Modern multiprocessor systems In this complex hierarchical scheme, processors are grouped by the multi-core CPU package, or "node", they physically belong to. Processors within a node share access to memory modules as per the UMA shared memory architecture. At the same time, they may also access memory on a remote node over a shared interconnect, but with slower performance, as per the NUMA shared memory architecture.
12
Distributed computing A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable.
13
Parallel algorithm for Distributed Memory Computing Assume we have the numbers [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] and we want to add them with a parallel algorithm on 4 CPUs. Solution:
CPU0: 1, 2, 3 → 6
CPU1: 4, 5, 6 → 15
CPU2: 7, 8, 9 → 24
CPU3: 10, 11, 12 → 33
CPU0 then combines the partial results: 6 + 15 + 24 + 33 = 78
14
What are the benefits of a parallel program? Assume Tn is the time to pass a message through the network and To is the time to execute one operation. In our example we would need: Tn + 3To + Tn + 4To = 2Tn + 7To. For a serial program: 12To. Assume To = 1 and Tn = 10:
Parallel = 2×10 + 7×1 = 27
Serial = 12×1 = 12
Conclusion: serial is faster than parallel.
15
What are the benefits of a parallel program? Now assume we have 12,000 numbers to add.
Parallel: Tn + 3,000To + Tn + 4To = 10 + 3,000×1 + 10 + 4×1 = 3,024
Serial: 12,000To = 12,000×1 = 12,000
Conclusion: parallel is about 4 times faster than serial. Parallel computing becomes beneficial for large-scale computational problems.
16
MPICH MPICH is a freely available, portable implementation of MPI, a message-passing standard for distributed-memory applications used in parallel computing. Message Passing Interface (MPI) is a specification for an API that allows many computers to communicate with one another. MPICH provides libraries for C, C++, and Fortran.
17
Installation of MPICH for Linux
Web page: http://www.mcs.anl.gov/research/projects/mpi/mpich1/
Download for Linux: ftp://ftp.mcs.anl.gov/pub/mpi/mpich.tar.gz
Untar: tar xvfz mpich.tar.gz
Configure:
as root: ./configure --prefix=/usr/local -rsh=ssh
as user: ./configure --prefix=/home/username -rsh=ssh
Compile: make
Install: make install
18
Testing MPICH
$ which mpicc
It should print the path of mpicc where we installed it, e.g. ~/bin/mpicc, and the same for mpirun.
To run a test, from the MPICH installation dir:
$ cd examples/basic
$ make
$ mpirun -np 2 cpi
Result:
Process 0 of 2 on localhost.localdomain
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000000
Process 1 of 2 on localhost.localdomain
19
Possible Errors
If the path of mpicc is not found:
$ cd ~
$ gedit .bashrc
Add the following line at the bottom:
export PATH=$PATH:/path_of_mpich/bin
Save and relogin.
When we run: mpirun -np 2 cpi
p0_29223: p4_error: Could not gethostbyname for host buster.localdomain; may be invalid name
This means it cannot resolve buster.localdomain, which is our hostname.
As root: gedit /etc/hosts, locate the 127.0.0.1 line, and add the hostname at the end of it.
Example: 127.0.0.1 localhost.localdomain localhost buster.localdomain
20
ssh login without password
To avoid typing our password as many times as the np value, we can set up login without a password:
$ ssh-keygen
Finishing this process creates two files:
$ cd ~/.ssh
$ ls
id_rsa id_rsa.pub known_hosts
$ cp id_rsa.pub authorized_keys2
Now when we do $ ssh localhost it logs in without a password.
21
hello.c mpich program

#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv) {
    int my_rank;
    int size;
    int namelen;
    char proc_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(proc_name, &namelen);

    if (my_rank == 2)
        printf("Hello - I am process 2\n");
    else
        printf("Hello from process %d of %d on %s\n", my_rank, size, proc_name);

    MPI_Finalize();
    return 0;
}
22
Run hello.c
$ mpicc hello.c
$ mpirun -np 4 a.out
Result:
Hello from process 0 of 4 on localhost.localdomain
Hello from process 1 of 4 on localhost.localdomain
Hello from process 3 of 4 on localhost.localdomain
Hello - I am process 2
NOTE: the results may be displayed in a different order; that depends on how the operating system schedules the processes.
23
From the Documentation of MPICH http://www.mcs.anl.gov/research/projects/mpi/mpich1/docs.html
MPI_MAX_PROCESSOR_NAME – maximum length of name returned by MPI_GET_PROCESSOR_NAME
MPI_Init – initialize the MPI execution environment
MPI_Comm_rank – determines the rank of the calling process in the communicator
MPI_Comm_size – determines the size of the group associated with a communicator
MPI_Get_processor_name – gets the name of the processor
MPI_Finalize – terminates the MPI execution environment
24
Prepare data for parallel sum

if (my_rank == 0) {                     /* ON CPU0 */
    array_size = 12;
    for (i = 0; i < array_size; i++)
        data[i] = i + 1;                /* fill the data array: 1, 2, 3, ..., 12 */
    for (target = 1; target < p; target++)
        MPI_Send(&array_size, 1, MPI_INT, target, tag1,
                 MPI_COMM_WORLD);       /* send array size to the other CPUs */
    loc_array_size = array_size / p;    /* calculate local array size */
    k = loc_array_size;
    for (target = 1; target < p; target++) {
        MPI_Send(&data[k], loc_array_size, MPI_INT, target, tag2,
                 MPI_COMM_WORLD);       /* send data to the other CPUs */
        k += loc_array_size;            /* k = 3, 6, 9 */
    }
    for (k = 0; k < loc_array_size; k++)
        data_loc[k] = data[k];          /* initialize CPU0's local array */
} else {
    MPI_Recv(&array_size, 1, MPI_INT, 0, tag1, MPI_COMM_WORLD,
             &status);                  /* receive array size from CPU0 */
    loc_array_size = array_size / p;
    MPI_Recv(&data_loc[0], loc_array_size, MPI_INT, 0, tag2, MPI_COMM_WORLD,
             &status);                  /* receive local array from CPU0 */
}
25
Parallel sum

res = 0;                                /* parallel sum */
for (k = 0; k < loc_array_size; k++)
    res = res + data_loc[k];

if (my_rank != 0) {
    MPI_Send(&res, 1, MPI_INT, 0, tag3, MPI_COMM_WORLD);   /* send result to CPU0 */
} else {
    finres = res;                       /* res of CPU0 */
    printf("\n Result of process %d: %d\n", my_rank, res);
    for (source = 1; source < p; source++) {
        MPI_Recv(&res, 1, MPI_INT, source, tag3, MPI_COMM_WORLD,
                 &status);              /* receive results from the other CPUs */
        finres = finres + res;
        printf("\n Result of process %d: %d\n", source, res);
    }
    printf("\n\n\n Final Result: %d\n", finres);
}
MPI_Finalize();
26
Parallel Sum Output
$ mpirun -np 4 a.out
Result of process 0: 6
Result of process 1: 15
Result of process 2: 24
Result of process 3: 33
Final Result: 78
27
MPI_Send Performs a basic send
Synopsis:
#include "mpi.h"
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )
Input Parameters:
– buf: initial address of send buffer (choice)
– count: number of elements in send buffer (nonnegative integer)
– datatype: datatype of each send buffer element (handle)
– dest: rank of destination (integer)
– tag: message tag (integer)
– comm: communicator (handle)
28
MPI_Recv Basic receive
Synopsis:
#include "mpi.h"
int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status )
Output Parameters:
– buf: initial address of receive buffer (choice)
– status: status object (Status)
Input Parameters:
– count: maximum number of elements in receive buffer (integer)
– datatype: datatype of each receive buffer element (handle)
– source: rank of source (integer)
– tag: message tag (integer)
– comm: communicator (handle)