Computer Organization David Monismith CS345 Notes to help with the in-class assignment.

1 Computer Organization David Monismith CS345 Notes to help with the in-class assignment.

2 Flynn’s Taxonomy

SISD = Single Instruction Single Data = Serial Programming
SIMD = Single Instruction Multiple Data = Implicit Parallelism (Instruction/Architecture Level)
MISD = Multiple Instruction Single Data (Rarely implemented)
MIMD = Multiple Instruction Multiple Data = Multiprocessor

                       Single Data    Multiple Data
Single Instruction     SISD           SIMD
Multiple Instruction   MISD           MIMD

3 Flynn’s Taxonomy

SIMD instructions and architectures allow for implicit parallelism when writing programs.
To provide a sense of how these work, examples are shown in the following slides.
Our focus on MIMD is through the use of processes and threads, and examples are shown in later slides.

4 Understanding SIMD Instructions

Implicit parallelism occurs via AVX (Advanced Vector Extensions) or SSE (Streaming SIMD Extensions) instructions.
Example: without SIMD, the following loop might be executed with four add instructions:

//Serial Loop
for(int i = 0; i < n; i += 4) {
    c[i]   = a[i]   + b[i];   //add c[i],   a[i],   b[i]
    c[i+1] = a[i+1] + b[i+1]; //add c[i+1], a[i+1], b[i+1]
    c[i+2] = a[i+2] + b[i+2]; //add c[i+2], a[i+2], b[i+2]
    c[i+3] = a[i+3] + b[i+3]; //add c[i+3], a[i+3], b[i+3]
}

5 Understanding SIMD Instructions

With SIMD, the following loop might be executed with one add instruction:

//SIMD Loop
for(int i = 0; i < n; i += 4) {
    c[i]   = a[i]   + b[i];   //add c[i to i+3], a[i to i+3], b[i to i+3]
    c[i+1] = a[i+1] + b[i+1];
    c[i+2] = a[i+2] + b[i+2];
    c[i+3] = a[i+3] + b[i+3];
}

6 Understanding SIMD Instructions

Note that the add instructions above are pseudo-assembly instructions.
The serial loop is implemented as follows:

+------+   +------+    +------+
| a[i] | + | b[i] | -> | c[i] |
+------+   +------+    +------+
|a[i+1]| + |b[i+1]| -> |c[i+1]|
+------+   +------+    +------+
|a[i+2]| + |b[i+2]| -> |c[i+2]|
+------+   +------+    +------+
|a[i+3]| + |b[i+3]| -> |c[i+3]|
+------+   +------+    +------+

7 Understanding SIMD Instructions

Versus SIMD:

+------+   +------+    +------+
| a[i] |   | b[i] |    | c[i] |
|      |   |      |    |      |
|a[i+1]|   |b[i+1]|    |c[i+1]|
|      | + |      | -> |      |
|a[i+2]|   |b[i+2]|    |c[i+2]|
|      |   |      |    |      |
|a[i+3]|   |b[i+3]|    |c[i+3]|
+------+   +------+    +------+

8 Understanding SIMD Instructions

In the previous example a 4x speedup was achieved by using SIMD instructions.
Note that SIMD registers are often 128, 256, or 512 bits wide, allowing for addition, subtraction, multiplication, etc., of 2, 4, or 8 double precision variables at once.

Reference: Hwancheol Jeong, Weonjong Lee, Sunghoon Kim, and Seok-Ho Myung, "Performance of SSE and AVX Instruction Sets," Proceedings of Science, 2012, http://arxiv.org/pdf/1211.0820.pdf
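To make the register picture concrete, the same array addition can be written with AVX compiler intrinsics. This is a minimal sketch, not from the original slides: the function name vec_add is hypothetical, and it assumes n is a multiple of 4 and a compiler with AVX support (e.g., gcc -mavx):

#include <immintrin.h>

/* Add two arrays of doubles using 256-bit AVX registers:
   four double-precision sums per SIMD add instruction.
   Assumes n % 4 == 0. */
void vec_add(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]); /* load a[i..i+3]     */
        __m256d vb = _mm256_loadu_pd(&b[i]); /* load b[i..i+3]     */
        __m256d vc = _mm256_add_pd(va, vb);  /* one add, four sums */
        _mm256_storeu_pd(&c[i], vc);         /* store c[i..i+3]    */
    }
}

In practice a vectorizing compiler will often emit such instructions automatically from the plain serial loop, which is the implicit parallelism referred to on slide 4.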

9 Processes and Threads

Processes and threads exist only at execution time, and they change state quickly (e.g., running, in memory, or waiting).

A process:
– is a fundamental computation unit
– can have one or more threads
– is handled by the process management module
– requires system resources

10 Process

Process (job) - a program in execution, ready to execute, or waiting for execution.
A program is static, whereas a process (a running program) is dynamic.
In Operating Systems (CS550) we will implement processes using an API called the Message Passing Interface (MPI). MPI provides an abstraction layer that allows us to create and identify processes without worrying about the creation of data structures for sockets or shared memory.
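As a preview of what MPI hides for us, a minimal process-identification sketch might look like the following. This is an illustration, not course-provided code; it assumes an MPI installation (compile with mpicc, run with, e.g., mpirun -np 4 ./a.out):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);               /* start the MPI processes   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's identifier */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
    printf("Process %d of %d\n", rank, size);
    MPI_Finalize();                       /* shut down MPI cleanly     */
    return 0;
}

No sockets or shared-memory structures appear in the program; MPI creates and connects the processes behind the scenes.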

11 Threads

Threads are lightweight processes:
– the dynamic component of a process
– often, many threads are part of a process

Current OSes and hardware support multithreading:
– multiple threads (tasks) per process
– one or more threads per CPU core

Execution of threads is handled more efficiently than that of full-weight processes (although there are other costs). At process creation, one thread is created: the "main" thread. Other threads are created from the "main" thread, as in the sketch below.
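For instance, with POSIX threads (an assumed API here, since the slides do not name one; compile with -pthread), the "main" thread creates the others:

#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

void *worker(void *arg) {
    long id = (long) arg;
    printf("Worker thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    /* The "main" thread exists at process creation; it spawns the rest. */
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *) t);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);  /* wait for each worker to finish */
    return 0;
}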

12 Embarrassingly Parallel (Map)

Processes and threads are MIMD. Performing array (or matrix) addition is a straightforward example that is easily parallelized. The serial version follows:

for(i = 0; i < N; i++)
    C[i] = A[i] + B[i];

OpenMP allows you to write a #pragma to parallelize code that you write in a serial (normal) fashion. Three OpenMP parallel versions follow on the next slides.

13 OpenMP First Try

We could parallelize the loop on the last slide directly as follows:

#pragma omp parallel private(i) shared(A,B,C)
{
    int start = omp_get_thread_num() * (N / omp_get_num_threads());
    int end = start + (N / omp_get_num_threads());
    for(i = start; i < end; i++)
        C[i] = A[i] + B[i];
}

Notice that i is declared private because it is not shared between threads: each thread gets its own copy of i.
Arrays A, B, and C are declared shared because they are shared between threads.
Note that this manual partitioning misses the last few elements when N is not evenly divisible by the number of threads.

14 OpenMP for clause

It is preferable to let OpenMP parallelize loops directly using the for clause, as follows:

#pragma omp parallel private(i) shared(A,B,C)
{
    #pragma omp for
    for(i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

Notice that the loop can be written in a serial fashion and it will be automatically partitioned, with each piece tasked to a thread.

15 Shortened OpenMP for

When using a single for loop, the parallel and for clauses may be combined:

#pragma omp parallel for private(i) \
        shared(A,B,C)
for(i = 0; i < N; i++)
    C[i] = A[i] + B[i];
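Putting the pieces together, a complete, compilable version of this last example might look like the following. The array size N and the initialization values are assumptions for illustration; compile with gcc -fopenmp:

#include <stdio.h>

#define N 1000

int main(void) {
    static double A[N], B[N], C[N];
    int i;

    for (i = 0; i < N; i++) {   /* initialize the inputs */
        A[i] = i;
        B[i] = 2.0 * i;
    }

    /* The combined parallel-for from the slide above. */
    #pragma omp parallel for private(i) shared(A, B, C)
    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

    printf("C[%d] = %f\n", N - 1, C[N - 1]);  /* expect 3*(N-1) = 2997.0 */
    return 0;
}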

