Exploiting Parallelism

Exploiting Parallelism
We can exploit parallelism in two ways:
- At the instruction level: Single Instruction, Multiple Data (SIMD), making use of extensions to the ISA.
- At the core level: rewrite the code to parallelize operations across many cores, making use of extensions to the programming language.

Exploiting Parallelism at the Instruction level (SIMD)
Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

Written this way, we operate on one element at a time: each loop iteration performs a single addition.

Exploiting Parallelism at the Instruction level (SIMD)
Instead, we can operate on MULTIPLE elements at once: a single instruction performs several additions in parallel. This is Single Instruction, Multiple Data (SIMD).

Intel SSE/SSE2 as an example of SIMD
Added new 128-bit registers (XMM0 – XMM7); each can store:
- 4 single-precision FP values (SSE): 4 * 32b
- 2 double-precision FP values (SSE2): 2 * 64b
- 16 byte values (SSE2): 16 * 8b
- 8 word values (SSE2): 8 * 16b
- 4 double-word values (SSE2): 4 * 32b
- 1 128-bit integer value (SSE2): 1 * 128b
(Slide figure: two XMM registers, each holding four 32-bit values, added element-wise to produce a third.)
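
For reference, these layouts map onto C types from the SSE intrinsic headers roughly as follows (a sketch; the variable names are purely illustrative):

    #include <emmintrin.h>   /* SSE2 intrinsic types (includes the SSE ones) */

    __m128  four_floats;   /* 4 x single-precision FP (SSE): 4 * 32b            */
    __m128d two_doubles;   /* 2 x double-precision FP (SSE2): 2 * 64b           */
    __m128i packed_ints;   /* integer data (SSE2): 16 * 8b, 8 * 16b, 4 * 32b,
                              or 1 * 128b, depending on the instruction used    */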

SIMD Extensions
More than 70 instructions. Arithmetic operations supported: addition, subtraction, multiplication, division, square root, maximum, minimum. They can operate on floating-point or integer data.
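
The slides don't show the intrinsic code itself; as an illustrative sketch (not code from the lecture), here is one way array_add could be written with SSE2 integer intrinsics, assuming the arrays need not be 16-byte aligned (hence the unaligned load/store forms):

    #include <emmintrin.h>

    void array_add_simd(int A[], int B[], int C[], int length) {
      int i;
      /* Process 4 ints (128 bits) per iteration. */
      for (i = 0; i + 4 <= length; i += 4) {
        __m128i a = _mm_loadu_si128((__m128i *)&A[i]);
        __m128i b = _mm_loadu_si128((__m128i *)&B[i]);
        _mm_storeu_si128((__m128i *)&C[i], _mm_add_epi32(a, b));
      }
      /* Handle any leftover elements one at a time. */
      for ( ; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }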

Is it always that easy?
No, not always. Let's look at a slightly more challenging one:

    unsigned sum_array(unsigned *array, int length) {
      int total = 0;
      for (int i = 0; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Can we exploit SIMD-style parallelism?

We first need to restructure the code

    unsigned sum_array2(unsigned *array, int length) {
      unsigned total, i;
      unsigned temp[4] = {0, 0, 0, 0};
      for (i = 0; i < (length & ~0x3); i += 4) {
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
      }
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Then we can write SIMD code for the hot part
The hot part is the first loop of sum_array2, which now does four independent additions per iteration: exactly the pattern a 128-bit XMM register can handle. One possible intrinsic version is sketched below.
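
A possible SSE2 version of the hot loop (an illustrative sketch, not the code shown in the lecture) keeps the four partial sums in one XMM register:

    #include <emmintrin.h>

    unsigned sum_array_simd(unsigned *array, int length) {
      unsigned temp[4], total;
      __m128i sums = _mm_setzero_si128();   /* four 32-bit partial sums, all zero */
      int i;
      for (i = 0; i < (length & ~0x3); i += 4) {
        __m128i v = _mm_loadu_si128((__m128i *)&array[i]);
        sums = _mm_add_epi32(sums, v);      /* element-wise add of 4 values */
      }
      _mm_storeu_si128((__m128i *)temp, sums);
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {            /* leftover elements */
        total += array[i];
      }
      return total;
    }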

Exploiting Parallelism at the core level
Consider the code from the second practice midterm:

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

A parallel_for would split the iterations across multiple cores (fork/join). If there is a choice, what should we parallelize: the inner loop, the outer loop, or both? And how are the N iterations (row i producing c[i]) split across c cores? Two options:
- Contiguous blocks: Core 0 gets [0, N/c), Core 1 gets [N/c, 2N/c), ...
- Cyclic: Core 0 gets 0, c, 2c, ...; Core 1 gets 1, c+1, 2c+1, ...
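
In OpenMP (introduced on the next slide), these two ways of splitting the iterations correspond to different schedule clauses; a sketch using the same loop:

    /* Contiguous blocks: Core 0 gets i in [0, N/c), Core 1 gets [N/c, 2N/c), ... */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

    /* Cyclic: Core 0 gets i = 0, c, 2c, ...; Core 1 gets i = 1, c+1, 2c+1, ... */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }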

OpenMP
Many APIs exist (or are being developed) to exploit multiple cores. OpenMP is easy to use, but has limitations.

    #include <omp.h>

    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

Compile your code with: g++ -fopenmp mp6.cxx
Google "OpenMP LLNL" for a handy reference page.
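
A minimal, self-contained version of this example (file name, array size, and initialization are made up for illustration) that compiles with g++ -fopenmp:

    /* matvec_omp.cxx -- hypothetical, self-contained version of the slide's example */
    #include <omp.h>
    #include <stdio.h>

    #define N 1024
    static int a[N][N], b[N], c[N];

    int main() {
      /* Fill the inputs with something simple so the result is checkable. */
      for (int i = 0; i < N; ++i) {
        b[i] = 1;
        for (int j = 0; j < N; ++j)
          a[i][j] = 1;
      }

      /* Each outer-loop iteration writes a different c[i], so the
         iterations are independent and safe to run in parallel. */
      #pragma omp parallel for
      for (int i = 0; i < N; ++i) {
        c[i] = 0;
        for (int j = 0; j < N; ++j)
          c[i] += a[i][j] * b[j];
      }

      printf("c[0] = %d (expect %d), threads available: %d\n",
             c[0], N, omp_get_max_threads());
      return 0;
    }

Compile and run with, for example, g++ -fopenmp matvec_omp.cxx -o matvec && ./matvec; the OMP_NUM_THREADS environment variable controls how many threads the runtime uses.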

Race conditions
If we parallelize the inner loop, the code produces wrong answers:

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      #pragma omp parallel for
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

In each iteration, each thread does: load c[i], compute the new c[i], update c[i]. What if this happens?

    thread 1              thread 2
    load c[i]
                          load c[i]
    compute new c[i]
                          compute new c[i]
    update c[i]
                          update c[i]

Both threads read the same old value of c[i], so one of the updates is lost: a data race.
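
One way to remove the data race without restructuring (not the approach the lecture takes next) is to make the update itself atomic; this gives correct answers but serializes the updates, so it is usually slow. A sketch:

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      #pragma omp parallel for
      for (int j = 0; j < N; ++j) {
        int prod = a[i][j] * b[j];
        #pragma omp atomic
        c[i] += prod;          /* load/add/store performed as one atomic update */
      }
    }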

Another example
Assume a[N][N] and b[N] are integer arrays as before:

    int total = 0;
    #pragma omp parallel for    /* plus a bit more! */
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j)
        total += a[i][j] * b[j];
    }

Which loop should we parallelize? Answer: the outer loop, if we can avoid the race condition on total. Each thread maintains a private copy of total (initialized to zero); when the threads are done, the private copies are reduced into one. This works because + (and many other operations) is associative.
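
The "plus a bit more" is OpenMP's reduction clause, which provides exactly this private-copy-then-combine pattern:

    int total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j)
        total += a[i][j] * b[j];
    }
    /* Each thread accumulates into its own private total (initialized to 0);
       at the join, OpenMP adds the private copies into the shared total. */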

Let's try the version with race conditions
We're expecting a significant speedup on the 8-core EWS machines: not quite an 8-times speedup, because of Amdahl's Law plus the overhead of forking and joining multiple threads. But it's actually slower!! Why?? Here's the mental picture we have: two processors sharing memory, with total as a shared variable in that memory.

This mental picture is wrong!
We've forgotten about caches! The memory may be shared, but each processor has its own L1 cache. As each processor updates total, the cache line holding total bounces between the L1 caches, and all that bouncing slows performance.
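
The reduction clause avoids this bouncing automatically; written out by hand, the same idea looks roughly like this (a sketch: each thread accumulates into a local variable and touches the shared total only once):

    int total = 0;
    #pragma omp parallel
    {
      int local = 0;              /* per-thread accumulator: stays in a register
                                     or the thread's own cache, so no bouncing */
      #pragma omp for
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          local += a[i][j] * b[j];

      #pragma omp atomic
      total += local;             /* one shared update per thread */
    }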

Summary
Performance is of primary concern in some applications: games, servers, mobile devices, supercomputers. Many important applications have parallelism, and exploiting it is a good way to speed up programs.
Single Instruction, Multiple Data (SIMD) does this at the ISA level: registers hold multiple data items, and instructions operate on all of them at once. This can achieve factor-of-2, 4, or 8 speedups on kernels, but may require some restructuring of the code to expose the parallelism.
Exploiting core-level parallelism: watch out for data races! (They make code incorrect.)

Cachegrind comparison

                     Version 1      Version 2
    I refs:        107,147,147      1,324,684
    I1 misses:           1,412          1,078
    L2i misses:          1,372          1,059
    I1 miss rate:        0.00%          0.08%
    L2i miss rate:       0.00%          0.07%
    D refs:         48,255,513        446,009
    D1 misses:         352,425          8,190
    L2d misses:         17,279          4,942
    D1 miss rate:         0.7%           1.8%
    L2d miss rate:        0.0%           1.1%
    L2 refs:           353,837          9,268
    L2 misses:          18,651          6,001
    L2 miss rate:         0.0%           0.3%