Presentation transcript: What is parallel processing? (University of Washington, Winter 2015 Wrap-up)

Slide 1: What is parallel processing? When can we execute things in parallel?
- Parallelism: use extra resources to solve a problem faster.
- Concurrency: correctly and efficiently manage access to shared resources.
[Diagram: parallelism as extra resources applied to one problem; concurrency as many requests sharing one resource.]

Slide 2: What is parallel processing?
A brief introduction to the key ideas of parallel processing:
- instruction-level parallelism
- data-level parallelism
- thread-level parallelism

Slide 3: Exploiting Parallelism
Of the computing problems for which performance is important, many have inherent parallelism.
Computer games:
- Graphics, physics, sound, AI, etc. can be done separately.
- Furthermore, there is often parallelism within each of these:
  - The color of each pixel on the screen can be computed independently.
  - Non-contacting objects can be updated/simulated independently.
  - The artificial intelligence of non-human entities can be computed independently.
Search engine queries:
- Every query is independent.
- Searches are (pretty much) read-only!

Slide 4: Instruction-Level Parallelism

    add  %r2 <- %r3, %r4
    or   %r2 <- %r2, %r4
    lw   %r6 <- 0(%r4)
    addi %r7 <- %r6, 0x5
    sub  %r8 <- %r8, %r4

Dependencies?
- RAW: read after write
- WAW: write after write
- WAR: write after read
When can we reorder instructions?

    add  %r2 <- %r3, %r4
    or   %r5 <- %r2, %r4
    lw   %r6 <- 0(%r4)
    sub  %r8 <- %r8, %r4
    addi %r7 <- %r6, 0x5

When should we reorder instructions?
Superscalar processors: multiple instructions executing in parallel at the *same* stage.

Slide 5: Data Parallelism
Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i < length; ++i) {
            C[i] = A[i] + B[i];
        }
    }

This operates on one element at a time.


Slide 7: Data Parallelism with SIMD
The same array_add loop, but operating on MULTIPLE elements at a time: Single Instruction, Multiple Data (SIMD). A sketch of what this can look like with vector intrinsics follows.
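The vectorized source is not shown on the slides; as a rough sketch (not the lecture's code), here is what SIMD execution of array_add could look like with x86 SSE2 intrinsics, where each _mm_add_epi32 adds four ints in one instruction:

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Illustrative only: processes 4 ints per vector instruction. */
    void array_add_simd(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i + 4 <= length; i += 4) {
            __m128i va = _mm_loadu_si128((__m128i *)&A[i]);  /* load 4 ints from A */
            __m128i vb = _mm_loadu_si128((__m128i *)&B[i]);  /* load 4 ints from B */
            __m128i vc = _mm_add_epi32(va, vb);              /* 4 additions at once */
            _mm_storeu_si128((__m128i *)&C[i], vc);          /* store 4 results to C */
        }
        for (; i < length; ++i) {                            /* leftover elements */
            C[i] = A[i] + B[i];
        }
    }

Modern compilers will often auto-vectorize the original loop, but writing the intrinsics makes the data parallelism explicit.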

Slide 8: Is it always that easy?
Not always... a more challenging example:

    unsigned sum_array(unsigned *array, int length) {
        unsigned total = 0;
        for (int i = 0; i < length; ++i) {
            total += array[i];
        }
        return total;
    }

Is there parallelism here? Each loop iteration uses data from the previous iteration.

Slide 9: Restructure the code for SIMD...

    // one option...
    unsigned sum_array2(unsigned *array, int length) {
        unsigned total, i;
        unsigned temp[4] = {0, 0, 0, 0};
        // chunks of 4 at a time
        for (i = 0; i < (length & ~0x3); i += 4) {
            temp[0] += array[i];
            temp[1] += array[i+1];
            temp[2] += array[i+2];
            temp[3] += array[i+3];
        }
        // add the 4 sub-totals
        total = temp[0] + temp[1] + temp[2] + temp[3];
        // add the non-4-aligned parts
        for (; i < length; ++i) {
            total += array[i];
        }
        return total;
    }

The four partial sums are independent, so the main loop can now be vectorized; an intrinsics sketch follows.
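For comparison, here is a rough sketch (not from the slides; the name sum_array_simd is made up) of the same four-way accumulation written directly with SSE2 intrinsics, keeping all four partial sums in one 128-bit register:

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Illustrative only: four partial sums live in one vector register. */
    unsigned sum_array_simd(unsigned *array, int length) {
        __m128i vsum = _mm_setzero_si128();              /* 4 partial sums, all zero */
        int i;
        for (i = 0; i + 4 <= length; i += 4) {
            __m128i v = _mm_loadu_si128((__m128i *)&array[i]);
            vsum = _mm_add_epi32(vsum, v);               /* 4 independent adds per iteration */
        }
        unsigned temp[4];
        _mm_storeu_si128((__m128i *)temp, vsum);         /* spill the partial sums */
        unsigned total = temp[0] + temp[1] + temp[2] + temp[3];
        for (; i < length; ++i) {                        /* non-multiple-of-4 tail */
            total += array[i];
        }
        return total;
    }

Because unsigned addition wraps modulo 2^32, regrouping the additions gives exactly the same result as the serial loop; the same restructuring for floating-point data would change rounding.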

Slide 10: March 13 (Friday the 13th, boo!). Last lecture.

Slide 11: What are threads?
An independent "thread of control" within a process; like multiple processes within one process, but sharing the same virtual address space.
Per-thread:
- logical control flow
- program counter
- stack
Shared virtual address space:
- all threads in a process use the same virtual address space.
Lighter-weight than processes:
- faster context switching
- the system can support more threads.
(A minimal thread-creation sketch follows.)
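The slides don't include thread code; as a minimal sketch (the names are illustrative), here is POSIX thread creation showing that a spawned thread sees the same address space as main, since both read and write the same global variable:

    #include <pthread.h>
    #include <stdio.h>

    static int shared = 0;                       /* one copy, visible to all threads */

    static void *worker(void *arg) {
        shared = 42;                             /* same variable main sees */
        printf("worker: shared = %d\n", shared);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);  /* new logical control flow + stack */
        pthread_join(t, NULL);                   /* wait for it to finish */
        printf("main: shared = %d\n", shared);   /* prints 42: address space is shared */
        return 0;
    }

Each thread gets its own stack and program counter, but globals and the heap are shared, which is exactly what makes the concurrency issues later in this deck possible. (Compile with -pthread.)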

Slide 12: Thread-level parallelism: multi-core processors
Two (or more) complete processors, fabricated on the same silicon chip, execute instructions from two (or more) programs/threads at the same time.
[Diagram: IBM Power5 die photo with cores #1 and #2.]

Slide 13: Multi-cores are everywhere (circa 2015)
Laptops, desktops, servers:
- Almost any machine from the past few years has at least 2 cores.
Game consoles:
- Xbox 360: 3 PowerPC cores; Xbox One: 8 AMD cores
- PS3: 9 Cell cores (1 master, 8 special SIMD cores); PS4: 8 custom AMD x86-64 cores
- Wii U: 2 Power cores
Smartphones:
- iPhone 4S, 5: dual-core ARM CPUs
- Galaxy S II, III, IV: dual-core ARM or Snapdragon
- ...
Your home automation nodes...

Slide 14: Why Multicores Now?
The number of transistors we can put on a chip keeps growing exponentially...
...but single-core performance is no longer growing along with transistor count.
So let's use those transistors to add more cores and do more at once.

Slide 15: What happens if we run this program on a multicore?

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i < length; ++i) {
            C[i] = A[i] + B[i];
        }
    }

As programmers, do we care?
[Diagram: the loop and two cores, #1 and #2.]

Slide 16: What if we want one program to run on multiple processors (cores)?
We have to explicitly tell the machine exactly how to do this.
- This is called parallel programming or concurrent programming.
There are many parallel/concurrent programming models.
- We will look at a relatively simple one: fork-join parallelism (see the sketch below).
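The lecture does not give fork-join code here; as a rough illustration only (the chunk struct and function names are invented for this sketch), here is array_add split across two POSIX threads, with a join before execution continues serially:

    #include <pthread.h>

    /* Hypothetical work descriptor for one thread's slice of the arrays. */
    struct chunk { int *A, *B, *C; int start, end; };

    static void *add_chunk(void *arg) {           /* body run by each forked thread */
        struct chunk *ch = (struct chunk *)arg;
        for (int i = ch->start; i < ch->end; ++i)
            ch->C[i] = ch->A[i] + ch->B[i];
        return NULL;
    }

    /* Fork-join version of array_add using two threads. */
    void array_add_parallel(int A[], int B[], int C[], int length) {
        int mid = length / 2;
        struct chunk lo = { A, B, C, 0, mid };
        struct chunk hi = { A, B, C, mid, length };
        pthread_t t;
        pthread_create(&t, NULL, add_chunk, &hi); /* fork: second half on a new thread */
        add_chunk(&lo);                           /* first half on the calling thread  */
        pthread_join(t, NULL);                    /* join: wait before continuing      */
    }

With P cores the array would be split into P chunks, one per thread; frameworks like OpenMP generate essentially this pattern from a single annotation. (Compile with -pthread.)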

Slide 17: How does this help performance?
Parallel speedup measures the improvement from parallelization:

    speedup(p) = (time for best serial version) / (time for version with p processors)

What can we realistically expect?

Slide 18: Reason #1: Amdahl's Law
In general, the whole computation is not (easily) parallelizable; serial regions limit the potential parallel speedup.
[Diagram: an execution timeline alternating between serial regions and parallel regions.]

Slide 19: Reason #1: Amdahl's Law
Suppose a program takes 1 unit of time to execute serially, and a fraction s of the program is inherently serial (unparallelizable). On p processors:

    New Execution Time = (1 - s)/p + s

For example, consider a program that, when executing on one processor, spends 10% of its time in a non-parallelizable region. How much faster will this program run on a 3-processor system?

    New Execution Time = 0.9T/3 + 0.1T = 0.4T, so Speedup = T / 0.4T = 2.5

What is the maximum speedup from parallelization? As p grows, the parallel term vanishes, so the speedup is bounded by 1/s, here at most 10x. (A small numeric check follows.)
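As a quick numeric check of the example (this helper is illustrative, not from the lecture):

    #include <stdio.h>

    /* Amdahl's Law: serial fraction s, p processors, serial time normalized to 1. */
    static double amdahl_speedup(double s, int p) {
        double new_time = (1.0 - s) / p + s;
        return 1.0 / new_time;
    }

    int main(void) {
        printf("p = 3:    speedup = %.2f\n", amdahl_speedup(0.1, 3));     /* 2.50 */
        printf("p = 1000: speedup = %.2f\n", amdahl_speedup(0.1, 1000));  /* ~9.91, approaching 1/s = 10 */
        return 0;
    }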

Slide 20: Reason #2: Overhead
Forking and joining is not instantaneous:
- it involves communicating between processors
- it may involve calls into the operating system
- it depends on the implementation

    New Execution Time = (1 - s)/P + s + overhead(P)

Slide 21: Multicore: what should worry us?
Concurrency: what if we're sharing resources, memory, etc.?
Cache coherence:
- What if two cores have the same data in their own caches? How do we keep those copies in sync?
Memory consistency, ordering, interleaving, synchronization:
- With multiple cores, we can have truly concurrent execution of threads. In what order do their memory accesses appear to happen? Do the orders seen by different cores/threads agree?
Concurrency bugs (see the sketch below):
- When it all goes wrong... hard to reproduce, hard to debug.
- http://cacm.acm.org/magazines/2012/2/145414-you-dont-know-jack-about-shared-variables-or-memory-models/fulltext
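To make "hard to reproduce" concrete, here is a classic data-race sketch (illustrative, not from the slides): two threads increment a shared counter with no synchronization, so increments can be lost and the final value changes from run to run.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                       /* shared, unsynchronized */

    static void *bump(void *arg) {
        for (int i = 0; i < 1000000; ++i)
            counter++;                             /* read-modify-write: not atomic */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);        /* expected 2000000, often less */
        return 0;
    }

A mutex or an atomic increment removes the race; courses like 332 and 451 cover the synchronization tools in depth.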

Slide 22: Summary
Multicore: more than one processor on the same chip.
- Almost all devices now have multicore processors.
- A result of Moore's Law plus power constraints.
Exploiting multicore requires parallel programming.
- Automatically extracting parallelism is too hard for the compiler, in general.
- But the compiler can do much of the bookkeeping for us.
Fork-join model of parallelism:
- At a parallel region, fork a bunch of threads, do the work in parallel, and then join, continuing with just one thread.
- Expect a speedup of less than P on P processors:
  - Amdahl's Law: speedup limited by the serial portion of the program.
  - Overhead: forking and joining are not free.
Take 332, 451, or 471 to learn more!

Slide 23: "Innovation is most likely to occur at the intersection of multiple fields or areas of interest."
[Diagram: overlapping fields: OS, Biology, ML, PL, Hardware, Architecture.]

Slide 24: My current research themes
- Approximate computing: trading off output quality for better energy efficiency and performance.
- Data- and I/O-centric systems: systems for large-scale data analysis (graphs, images, etc.).
- New applications and technologies: non-volatile memory, DNA-based architectures, energy-harvesting systems, etc.

Slide 25: [Application domains: image, sound, and video processing; image rendering; sensor data analysis and computer vision; simulations, games, search, machine learning.]
These applications consume a lot (most?) of our cycles/bytes/bandwidth!
- Often the input data is inexact by nature (from sensors).
- They have multiple acceptable outputs.
- They do not require "perfect execution."

Slide 26: A big idea: Let computers be sloppy!

Slide 27: "Approximate computing"
[Diagram: trading accuracy against performance and resource usage (e.g., energy).]
Sources of systematic accuracy loss:
- Unsound code transformations: ~2x
- Unreliable, probabilistic hardware (near/sub-threshold operation, etc.): ~5x
- Fundamentally different, inherently inaccurate execution models (e.g., neural networks, analog computing): ~10-100x

Slide 28: [image only]

Slide 29: But approximation needs to be done carefully... or...

Slide 30: [image only]

Slide 31: [image only]

Slide 32: "Disciplined" approximate programming
Some data must stay precise; other data can safely be approximate:
- Precise: references, jump targets, JPEG headers
- Approximate: pixel data, neuron weights, audio samples, video frames

Slide 33: Disciplined Approximate Programming (EnerJ, EnerC, ...)
Example code with EnerJ-style type annotations marking approximate data:

    int p = 5;
    @Approx int a = 7;
    for (int x = 0; ...) {
        a += func(2);
        @Approx int z;
        z = p * 2;
        p += 4;
    }
    a /= 9;
    p += 10;
    socket.send(z);
    write(file, z);

The abstraction spans the whole stack: relaxed algorithms, aggressive compilation, approximate data storage, variable-accuracy ISA/ALU, approximate logic/circuits, variable-quality wireless communication.
Goal: support a wide range of approximation techniques with a single unified abstraction.

Slide 34: Collaboration with DNA Nanotech
Applying architecture concepts to nano-scale molecular circuits. Really new and exciting!
Lots of applications: diagnostics, intelligent drugs, art, etc. And now storage!

Slide 35: [Diagram: OS, Biology, ML, PL, Hardware, Architecture.]
https://uw.iasystem.org/survey/141213

Slide 36: https://uw.iasystem.org/survey/141213

