Exploiting Parallelism

Exploiting Parallelism
We can exploit parallelism in two ways:
- At the instruction level: Single Instruction, Multiple Data (SIMD), making use of extensions to the ISA.
- At the core level: rewrite the code to parallelize operations across many cores, making use of extensions to the programming language.

Exploiting Parallelism at the Instruction level (SIMD)
Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

Written this way, we operate on one element at a time: each loop iteration performs a single addition.

Exploiting Parallelism at the Instruction level (SIMD)
Instead, we can operate on MULTIPLE elements at once: a single instruction performs several additions in parallel. This is Single Instruction, Multiple Data (SIMD).

Intel SSE/SSE2 as an example of SIMD
Added new 128-bit registers (XMM0 – XMM7); each can store:
- 4 single-precision FP values (SSE): 4 * 32b
- 2 double-precision FP values (SSE2): 2 * 64b
- 16 byte values (SSE2): 16 * 8b
- 8 word values (SSE2): 8 * 16b
- 4 double-word values (SSE2): 4 * 32b
- 1 128-bit integer value (SSE2): 1 * 128b
(Slide figure: two XMM registers, each holding four 32-bit values, added element-wise to produce a third.)
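
For reference, these layouts map onto C types from the SSE intrinsic headers roughly as follows (a sketch; the variable names are purely illustrative):

    #include <emmintrin.h>   /* SSE2 intrinsic types (includes the SSE ones) */

    __m128  four_floats;   /* 4 x single-precision FP (SSE): 4 * 32b            */
    __m128d two_doubles;   /* 2 x double-precision FP (SSE2): 2 * 64b           */
    __m128i packed_ints;   /* integer data (SSE2): 16 * 8b, 8 * 16b, 4 * 32b,
                              or 1 * 128b, depending on the instruction used    */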

SIMD Extensions
More than 70 instructions. Arithmetic operations supported: addition, subtraction, multiplication, division, square root, maximum, minimum. They can operate on floating-point or integer data.
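
The slides don't show the intrinsic code itself; as an illustrative sketch (not code from the lecture), here is one way array_add could be written with SSE2 integer intrinsics, assuming the arrays need not be 16-byte aligned (hence the unaligned load/store forms):

    #include <emmintrin.h>

    void array_add_simd(int A[], int B[], int C[], int length) {
      int i;
      /* Process 4 ints (128 bits) per iteration. */
      for (i = 0; i + 4 <= length; i += 4) {
        __m128i a = _mm_loadu_si128((__m128i *)&A[i]);
        __m128i b = _mm_loadu_si128((__m128i *)&B[i]);
        _mm_storeu_si128((__m128i *)&C[i], _mm_add_epi32(a, b));
      }
      /* Handle any leftover elements one at a time. */
      for ( ; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }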

Is it always that easy?
No, not always. Let's look at a slightly more challenging one:

    unsigned sum_array(unsigned *array, int length) {
      int total = 0;
      for (int i = 0; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Can we exploit SIMD-style parallelism?

We first need to restructure the code

    unsigned sum_array2(unsigned *array, int length) {
      unsigned total, i;
      unsigned temp[4] = {0, 0, 0, 0};
      for (i = 0; i < (length & ~0x3); i += 4) {
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
      }
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Then we can write SIMD code for the hot part
The hot part is the first loop of sum_array2, which now does four independent additions per iteration: exactly the pattern a 128-bit XMM register can handle. One possible intrinsic version is sketched below.
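
A possible SSE2 version of the hot loop (an illustrative sketch, not the code shown in the lecture) keeps the four partial sums in one XMM register:

    #include <emmintrin.h>

    unsigned sum_array_simd(unsigned *array, int length) {
      unsigned temp[4], total;
      __m128i sums = _mm_setzero_si128();   /* four 32-bit partial sums, all zero */
      int i;
      for (i = 0; i < (length & ~0x3); i += 4) {
        __m128i v = _mm_loadu_si128((__m128i *)&array[i]);
        sums = _mm_add_epi32(sums, v);      /* element-wise add of 4 values */
      }
      _mm_storeu_si128((__m128i *)temp, sums);
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {            /* leftover elements */
        total += array[i];
      }
      return total;
    }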

Exploiting Parallelism at the core level
Consider the code from the second practice midterm:

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

A parallel_for would split the iterations across multiple cores (fork/join). If there is a choice, what should we parallelize: the inner loop, the outer loop, or both? And how are the N iterations (row i producing c[i]) split across c cores? Two options:
- Contiguous blocks: Core 0 gets [0, N/c), Core 1 gets [N/c, 2N/c), ...
- Cyclic: Core 0 gets 0, c, 2c, ...; Core 1 gets 1, c+1, 2c+1, ...
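
In OpenMP (introduced on the next slide), these two ways of splitting the iterations correspond to different schedule clauses; a sketch using the same loop:

    /* Contiguous blocks: Core 0 gets i in [0, N/c), Core 1 gets [N/c, 2N/c), ... */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

    /* Cyclic: Core 0 gets i = 0, c, 2c, ...; Core 1 gets i = 1, c+1, 2c+1, ... */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }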

OpenMP
Many APIs exist (or are being developed) to exploit multiple cores. OpenMP is easy to use, but has limitations.

    #include <omp.h>

    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

Compile your code with: g++ -fopenmp mp6.cxx
Google "OpenMP LLNL" for a handy reference page.
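
A minimal, self-contained version of this example (file name, array size, and initialization are made up for illustration) that compiles with g++ -fopenmp:

    /* matvec_omp.cxx -- hypothetical, self-contained version of the slide's example */
    #include <omp.h>
    #include <stdio.h>

    #define N 1024
    static int a[N][N], b[N], c[N];

    int main() {
      /* Fill the inputs with something simple so the result is checkable. */
      for (int i = 0; i < N; ++i) {
        b[i] = 1;
        for (int j = 0; j < N; ++j)
          a[i][j] = 1;
      }

      /* Each outer-loop iteration writes a different c[i], so the
         iterations are independent and safe to run in parallel. */
      #pragma omp parallel for
      for (int i = 0; i < N; ++i) {
        c[i] = 0;
        for (int j = 0; j < N; ++j)
          c[i] += a[i][j] * b[j];
      }

      printf("c[0] = %d (expect %d), threads available: %d\n",
             c[0], N, omp_get_max_threads());
      return 0;
    }

Compile and run with, for example, g++ -fopenmp matvec_omp.cxx -o matvec && ./matvec; the OMP_NUM_THREADS environment variable controls how many threads the runtime uses.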

Race conditions
If we parallelize the inner loop, the code produces wrong answers:

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      #pragma omp parallel for
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

In each iteration, each thread does: load c[i], compute the new c[i], update c[i]. What if this happens?

    thread 1              thread 2
    load c[i]
                          load c[i]
    compute new c[i]
                          compute new c[i]
    update c[i]
                          update c[i]

Both threads read the same old value of c[i], so one of the updates is lost: a data race.
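
One way to remove the data race without restructuring (not the approach the lecture takes next) is to make the update itself atomic; this gives correct answers but serializes the updates, so it is usually slow. A sketch:

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      #pragma omp parallel for
      for (int j = 0; j < N; ++j) {
        int prod = a[i][j] * b[j];
        #pragma omp atomic
        c[i] += prod;          /* load/add/store performed as one atomic update */
      }
    }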

Another example
Assume a[N][N] and b[N] are integer arrays as before:

    int total = 0;
    #pragma omp parallel for    /* plus a bit more! */
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j)
        total += a[i][j] * b[j];
    }

Which loop should we parallelize? Answer: the outer loop, if we can avoid the race condition on total. Each thread maintains a private copy of total (initialized to zero); when the threads are done, the private copies are reduced into one. This works because + (and many other operations) is associative.
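
The "plus a bit more" is OpenMP's reduction clause, which provides exactly this private-copy-then-combine pattern:

    int total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j)
        total += a[i][j] * b[j];
    }
    /* Each thread accumulates into its own private total (initialized to 0);
       at the join, OpenMP adds the private copies into the shared total. */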

Let's try the version with race conditions
We're expecting a significant speedup on the 8-core EWS machines: not quite an 8-times speedup, because of Amdahl's Law plus the overhead of forking and joining multiple threads. But it's actually slower!! Why?? Here's the mental picture we have: two processors sharing memory, with total as a shared variable in that memory.

This mental picture is wrong!
We've forgotten about caches! The memory may be shared, but each processor has its own L1 cache. As each processor updates total, the cache line holding total bounces between the L1 caches, and all that bouncing slows performance.
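
The reduction clause avoids this bouncing automatically; written out by hand, the same idea looks roughly like this (a sketch: each thread accumulates into a local variable and touches the shared total only once):

    int total = 0;
    #pragma omp parallel
    {
      int local = 0;              /* per-thread accumulator: stays in a register
                                     or the thread's own cache, so no bouncing */
      #pragma omp for
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          local += a[i][j] * b[j];

      #pragma omp atomic
      total += local;             /* one shared update per thread */
    }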

Summary
Performance is of primary concern in some applications: games, servers, mobile devices, supercomputers. Many important applications have parallelism, and exploiting it is a good way to speed up programs.
Single Instruction, Multiple Data (SIMD) does this at the ISA level: registers hold multiple data items, and instructions operate on all of them at once. This can achieve factor-of-2, 4, or 8 speedups on kernels, but may require some restructuring of the code to expose the parallelism.
Exploiting core-level parallelism: watch out for data races! (They make code incorrect.)

Cachegrind comparison

                     Version 1      Version 2
    I refs:        107,147,147      1,324,684
    I1 misses:           1,412          1,078
    L2i misses:          1,372          1,059
    I1 miss rate:        0.00%          0.08%
    L2i miss rate:       0.00%          0.07%
    D refs:         48,255,513        446,009
    D1 misses:         352,425          8,190
    L2d misses:         17,279          4,942
    D1 miss rate:         0.7%           1.8%
    L2d miss rate:        0.0%           1.1%
    L2 refs:           353,837          9,268
    L2 misses:          18,651          6,001
    L2 miss rate:         0.0%           0.3%