Exploiting Parallelism
We can exploit parallelism in two ways:
  - At the instruction level: Single Instruction, Multiple Data (SIMD); make use of extensions to the ISA.
  - At the core level: rewrite the code to parallelize operations across many cores; make use of extensions to the programming language.
Exploiting Parallelism at the Instruction level (SIMD)
Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

Operating on one element at a time.
With SIMD, we can instead operate on MULTIPLE elements per instruction: Single Instruction, Multiple Data (SIMD).
Intel SSE/SSE2 as an example of SIMD
Added new 128-bit registers (XMM0 - XMM7), each of which can store:
  - 4 single-precision FP values (SSE):   4 * 32b
  - 2 double-precision FP values (SSE2):  2 * 64b
  - 16 byte values (SSE2):                16 * 8b
  - 8 word values (SSE2):                 8 * 16b
  - 4 double-word values (SSE2):          4 * 32b
  - 1 128-bit integer value (SSE2):       1 * 128b

[Figure: packed single-precision addition: two XMM registers, each holding four 32-bit floats, are added lane by lane to produce four 32-bit results.]
SIMD Extensions
More than 70 instructions. Arithmetic operations supported: addition, subtraction, multiplication, division, square root, maximum, minimum. Can operate on floating-point or integer data.
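As an illustration (not from the slides), here is roughly how the earlier array_add loop could be vectorized for single-precision floats using SSE compiler intrinsics; the function name is hypothetical, and the sketch assumes length is a multiple of 4:

    #include <xmmintrin.h>   /* SSE intrinsics (packed single-precision FP) */

    /* Hypothetical sketch: add two float arrays four elements at a time.
       Assumes length is a multiple of 4; a real version needs a scalar tail loop. */
    void array_add_sse(float A[], float B[], float C[], int length) {
      for (int i = 0; i < length; i += 4) {
        __m128 a = _mm_loadu_ps(&A[i]);            /* load A[i..i+3] into an XMM register */
        __m128 b = _mm_loadu_ps(&B[i]);            /* load B[i..i+3] */
        _mm_storeu_ps(&C[i], _mm_add_ps(a, b));    /* one packed add performs four additions */
      }
    }

Each trip through the loop now does four additions with a single add instruction, which is where the factor-of-4 speedup on this kernel comes from.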
Is it always that easy? No, not always. Let's look at a slightly more challenging one:

    unsigned sum_array(unsigned *array, int length) {
      unsigned total = 0;
      for (int i = 0; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Can we exploit SIMD-style parallelism?
We first need to restructure the code
    unsigned sum_array2(unsigned *array, int length) {
      unsigned total, i;
      unsigned temp[4] = {0, 0, 0, 0};
      /* length & ~0x3 rounds length down to a multiple of 4
         (note the parentheses: & binds more loosely than <) */
      for (i = 0; i < (length & ~0x3); i += 4) {
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
      }
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {   /* leftover elements */
        total += array[i];
      }
      return total;
    }
Then we can write SIMD code for the hot part
The "hot part" is the four-way unrolled loop above: each iteration performs four independent 32-bit additions into temp[0..3], and those four additions map naturally onto a single packed-add instruction.
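As a hedged sketch (not the original slide's code; the function name is mine), here is how the hot part might look with SSE2 intrinsics:

    #include <emmintrin.h>   /* SSE2 intrinsics (packed integer operations) */

    unsigned sum_array_sse2(unsigned *array, int length) {
      unsigned temp[4];
      unsigned total;
      int i;
      __m128i sums = _mm_setzero_si128();                      /* four packed 32-bit accumulators */
      for (i = 0; i < (length & ~0x3); i += 4) {
        __m128i vals = _mm_loadu_si128((__m128i *)&array[i]);  /* load array[i..i+3] */
        sums = _mm_add_epi32(sums, vals);                      /* one packed add per four elements */
      }
      _mm_storeu_si128((__m128i *)temp, sums);                 /* spill the four partial sums */
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i)                                 /* leftover elements */
        total += array[i];
      return total;
    }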
Exploiting Parallelism at the core level
Consider the code from the second practice midterm:

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

A parallel_for construct splits the iterations across multiple cores (fork/join).

If there is a choice, what should we parallelize? The inner loop? The outer loop? Both?

How are the N iterations (row i producing c[i]) split across c cores?
  - Blocked:     Core 0 gets [0, N/c), Core 1 gets [N/c, 2N/c), ...
  - Interleaved: Core 0 gets 0, c, 2c, ...; Core 1 gets 1, c+1, 2c+1, ...
OpenMP
Many APIs exist or are being developed to exploit multiple cores. OpenMP is easy to use, but has limitations.

    #include <omp.h>

    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

Compile your code with: g++ -fopenmp mp6.cxx
Google "OpenMP LLNL" for a handy reference page.
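Coming back to the blocked vs. interleaved question from the previous slide, OpenMP's schedule clause can express either distribution. A sketch under those assumptions (the wrapper function and its signature are illustrative, not from the slides):

    #include <omp.h>

    /* Hypothetical wrapper around the matrix-vector loop above. */
    void matvec(int N, int **a, int *b, int *c, int interleaved) {
      if (interleaved) {
        /* Interleaved: chunk size 1 hands rows out round-robin, so with P cores
           core 0 gets rows 0, P, 2P, ... and core 1 gets rows 1, P+1, 2P+1, ... */
        #pragma omp parallel for schedule(static, 1)
        for (int i = 0; i < N; ++i) {
          c[i] = 0;
          for (int j = 0; j < N; ++j)
            c[i] += a[i][j] * b[j];
        }
      } else {
        /* Blocked: the default static schedule gives each core one contiguous
           range of rows: [0, N/P), [N/P, 2N/P), ... */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; ++i) {
          c[i] = 0;
          for (int j = 0; j < N; ++j)
            c[i] += a[i][j] * b[j];
        }
      }
    }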
Race conditions
If we parallelize the inner loop, the code produces wrong answers:

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      #pragma omp parallel for
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

In each iteration, each thread does: load c[i], compute new c[i], update c[i].

What if this happens?

    thread 1              thread 2
    load c[i]
                          load c[i]
    compute new c[i]
                          compute new c[i]
    update c[i]
                          update c[i]

This is a data race: thread 2's update overwrites thread 1's, so one contribution to c[i] is lost.
Another example
Assume a[N][N] and b[N] are integer arrays as before:

    int total = 0;
    #pragma omp parallel for   /* plus a bit more! */
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j)
        total += a[i][j] * b[j];
    }

Which loop should we parallelize?
Answer: the outer loop, if we can avoid the race condition on total:
  - Each thread maintains a private copy of total (initialized to zero).
  - When the threads are done, the private copies are reduced into one.
  - This works because + (and many other operations) is associative.
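The private-copy-then-combine scheme described above is exactly what OpenMP's reduction clause provides; presumably that is the "(plus a bit more!)", though the slide doesn't spell it out. A sketch (the function name and signature are mine):

    #include <omp.h>

    /* reduction(+:total) gives every thread a private total initialized to 0
       and adds the private copies into the shared total when the loop ends. */
    int reduce_total(int N, int **a, int *b) {
      int total = 0;
      #pragma omp parallel for reduction(+:total)
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          total += a[i][j] * b[j];
      return total;
    }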
Let’s try the version with race conditions
We're expecting a significant speedup on the 8-core EWS machines; not quite an 8-times speedup because of Amdahl's Law, plus the overhead of forking and joining multiple threads.

But it's actually slower!! Why??

Here's the mental picture that we have: [Figure: two processors sharing one memory, with total as a shared variable in that memory.]
This mental picture is wrong!
We've forgotten about caches! The memory may be shared, but each processor has its own L1 cache. As each processor updates total, the cache line holding total bounces back and forth between the L1 caches, and all that bouncing slows performance.
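For concreteness, here is a guess at what the "version with race conditions" looks like (the slides don't show it); the function name is hypothetical:

    #include <omp.h>

    /* total stays shared: every thread's "total += ..." pulls the cache line
       holding total into its own L1 in exclusive state, so the line ping-pongs
       between cores on every update. That is why this runs slower than expected,
       and the unsynchronized updates are a data race as well. */
    int racy_total(int N, int **a, int *b) {
      int total = 0;
      #pragma omp parallel for
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          total += a[i][j] * b[j];   /* shared variable: slow AND incorrect */
      return total;
    }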
Summary
Performance is of primary concern in some applications: games, servers, mobile devices, supercomputers.

Many important applications have parallelism; exploiting it is a good way to speed up programs.
  - Single Instruction, Multiple Data (SIMD) does this at the ISA level: registers hold multiple data items, and instructions operate on them. Can achieve factor-of-2, 4, or 8 speedups on kernels. May require some restructuring of code to expose the parallelism.
  - Exploiting core-level parallelism: watch out for data races! (They make code incorrect.)
Cachegrind comparison
[Table: cachegrind comparison of Version 1 vs. Version 2, reporting instruction refs, I1/L2i misses and miss rates, data refs, D1/L2d misses and miss rates, and L2 refs, misses, and miss rate.]