
1 “By the end of this chapter, you should have obtained a basic understanding of how modern processors execute parallel programs & understand some rules of thumb for scaling performance of parallel applications.”

2
Serial Model
- SISD
Parallel Models
- SIMD
- MIMD
- MISD*
(S = Single, M = Multiple, I = Instruction, D = Data)

3
- Task vs. Data: tasks are instructions that operate on data, modifying it or creating new data.
- Parallel computation → multiple tasks, which must be coordinated and managed.
- Dependencies:
  - Data: a task requires data from another task.
  - Control: events/steps must be ordered (e.g., I/O).

4
- Fork: split the flow of control, creating a new control flow.
- Join: control flows are synchronized & merged.
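A minimal fork–join sketch in C++ (an assumed illustration for these notes, not code from the chapter): std::thread provides the fork, and join() provides the synchronize-and-merge step.

    #include <iostream>
    #include <thread>

    int main() {
        // Fork: split control flow by creating a new flow of control.
        std::thread worker([] {
            std::cout << "child flow of control running\n";
        });

        std::cout << "parent flow of control running\n";  // both flows proceed

        // Join: the two control flows are synchronized and merged back into one.
        worker.join();
        std::cout << "single flow of control again\n";
        return 0;
    }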

5 [Figure: fork–join diagram labeling Task, Data, Fork, Join, and a Dependency]

6
- Data Parallelism: the best strategy for scalable parallelism, i.e., parallelism that grows as the data set / problem size grows.
- Split the data set over a set of processors, with a task processing each subset.
- More data → more tasks.
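A small data-parallel sketch (an assumed example, not from the slides): the data set is split into one chunk per worker and each task processes only its own chunk, so more data simply means more, or larger, chunks.

    #include <algorithm>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> data(1000000, 1);                 // the data set
        const unsigned workers =
            std::max(1u, std::thread::hardware_concurrency());
        std::vector<long long> partial(workers, 0);        // one result per task
        std::vector<std::thread> pool;

        const std::size_t chunk = data.size() / workers;
        for (unsigned w = 0; w < workers; ++w) {
            std::size_t begin = w * chunk;
            std::size_t end = (w + 1 == workers) ? data.size() : begin + chunk;
            // Each task processes only its own slice of the data set.
            pool.emplace_back([&, w, begin, end] {
                for (std::size_t i = begin; i < end; ++i) partial[w] += data[i];
            });
        }
        for (auto& t : pool) t.join();

        long long total = 0;
        for (long long p : partial) total += p;            // combine partial results
        std::cout << total << '\n';                        // prints 1000000
        return 0;
    }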

7
- Control Parallelism, or Functional Decomposition: different program functions run in parallel.
- Not scalable – the best speedup is a constant factor; as the data grows, the parallelism doesn’t.
- May have less overhead, or none.

8
- Regular: tasks are similar, with predictable dependencies (e.g., matrix multiplication).
- Irregular: tasks differ in ways that create unpredictable dependencies (e.g., a chess program).
- Many problems contain combinations of both.

9 The two most important mechanisms:
- Thread Parallelism: implemented in hardware as a separate flow of control for each worker – supports regular parallelism, irregular parallelism, and functional decomposition.
- Vector Parallelism: implemented in hardware as one flow of control operating on multiple data elements – supports regular parallelism and some irregular parallelism.

10 Detrimental to parallelism? Locality? Pipelining? HOW?

11 Original code, containing a branch:

    if (a & 1) a = 3*a + 1; else a = a/2;

The if/else contains branch statements. With masking, both parts are executed in parallel and only one result is kept; there is a single flow of control with no branches. Masking works as if the code were written this way:

    p = (a & 1);
    t = 3*a + 1;  if (p)  a = t;
    t = a/2;      if (!p) a = t;
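A runnable version of the same idea (a sketch assumed for these notes): step_branch uses the if/else with branches, while step_masked computes both alternatives and uses the predicate p to select which result to keep, the way a vector unit applies a mask.

    #include <cstdio>

    // Branching version: only one side of the if/else is executed.
    int step_branch(int a) {
        if (a & 1) a = 3 * a + 1;
        else       a = a / 2;
        return a;
    }

    // Masked version: both sides are computed; the predicate keeps one result.
    int step_masked(int a) {
        int p  = a & 1;        // predicate (the mask)
        int t1 = 3 * a + 1;    // "then" result, always computed
        int t2 = a / 2;        // "else" result, always computed
        return p ? t1 : t2;    // keep only the result selected by the mask
    }

    int main() {
        for (int a = 1; a <= 10; ++a)
            std::printf("%2d -> branch %2d, masked %2d\n",
                        a, step_branch(a), step_masked(a));
        return 0;
    }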

12 Core
- Functional units
- Registers
- Cache memory – multiple levels


14
- Blocks (cache lines): the amount of data fetched at once.
- Bandwidth: the amount of data that can be transferred concurrently.
- Latency: the time to complete a transfer.
- Cache coherence: consistency among multiple copies of the same data.

15 Virtual Memory
- A memory system combining disk storage and chip memory.
- Allows programs larger than physical memory to run.
- Allows multiprocessing.
- Swaps pages between memory and disk.
- Hardware maps logical (virtual) addresses to physical addresses.
- Data locality is important to efficiency.
- Page fault: the referenced page is not currently in memory.
- Thrashing: excessive page swapping.

16
- Caches (multiple levels)
- NUMA – Non-Uniform Memory Access
- PRAM – Parallel Random Access Machine: a theoretical model that assumes uniform memory access times.

17 Data Locality
- Choose code segments that fit in cache.
- Design to use data in close proximity.
- Align data with cache lines (blocks).
- Dynamic grain size is a good strategy.

18
- Arithmetic Intensity: a large number of on-chip compute operations for every off-chip memory access.
- Otherwise, communication overhead is high.
- Related concept: grain size.
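As a rough, hypothetical illustration (numbers not from the slides): a loop computing y[i] = a*x[i] + y[i] over double-precision arrays performs 2 floating-point operations per element while moving 3 eight-byte values (read x[i], read y[i], write y[i]), an arithmetic intensity of about 2/24 ≈ 0.08 operations per byte – usually limited by memory bandwidth. A blocked matrix multiplication that reuses data already in cache achieves a much higher intensity.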

19
Serial Model
- SISD
Parallel Models
- SIMD – array processors, vector processors
- MIMD – heterogeneous computers, clusters
- MISD* – not useful

20
- Shared Memory – each processor accesses a common memory.
  - Access (contention) issues.
  - No message passing.
  - Each processor usually also has a small local memory.
- Distributed Memory – each processor has its own local memory.
  - Processors communicate by sending explicit messages.

21
- GPUs – graphics accelerators, now used for general-purpose computation.
- Offload – running computations on an accelerator (a GPU or co-processor), not on the regular CPUs.
- Heterogeneous – different kinds of hardware working together.
- Host processor – handles distribution of work, I/O, etc.

22 Various interpretations of Performance
- Reducing the total time for a computation → latency.
- Increasing the rate at which a series of results is computed → throughput.
- Reducing power consumption.
*Each is a possible performance target.

23
- Latency: the time to complete a task.
- Throughput: the rate at which tasks are completed, in units per unit of time (e.g., jobs per hour).


25 Speedup: S_p = T_1 / T_p
- T_1: time to complete on 1 processor
- T_p: time to complete on p processors
- REMEMBER: “time” means number of instructions
Efficiency: E = S_p / P = T_1 / (P * T_p)
- E = 1 is “perfect”
- Linear speedup – occurs when the algorithm runs P times faster on P processors
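A small worked example with made-up numbers: if a job takes T_1 = 120 s on one processor and T_4 = 40 s on P = 4 processors, then S_4 = 120 / 40 = 3 and E = 3 / 4 = 0.75; linear speedup would have required T_4 = 30 s.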

26 Efficiency > 1 (superlinear speedup)
- Very rare.
- Often due to hardware effects (e.g., more total cache available).
- Working in parallel may eliminate some work that is done in the serial version.

27
- Amdahl's Law: speedup is limited by the amount of serial work required.
- Gustafson-Barsis: as the problem size grows, parallel work grows faster than serial work, so speedup increases.
- See examples.

28 Work: the total operations (time) for the task
- T_1 = Work
- P * T_p = Work?  i.e., T_1 = P * T_p ??
- Rare – due to ???

29 Work-Span Model
- Describes dependencies among tasks & allows execution times to be estimated.
- Represents tasks as a DAG (Figure 2.8).
- Critical Path – the longest path through the DAG.
- Span – the minimum possible time: the time of the Critical Path.
- Assumes greedy task scheduling – no wasted resources or time.
- Parallel Slack – excess parallelism: more tasks than can be scheduled at once.

30
- Speedup ≤ Work / Span
- Upper Bound: ??
- No more than…
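A small worked example with made-up numbers: if a task graph has Work = 100 units and Span = 10 units (its critical path), the bound gives speedup ≤ 100 / 10 = 10 no matter how many processors are added; with P processors the speedup is also no more than P.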

31 ASYMPTOTIC COMPLEXITY (2.5.7)
- Used for comparing algorithms!
- Time complexity: describes the growth of execution time in terms of input size.
- Space complexity: describes the growth of memory requirements in terms of input size.
- Ignores constants.
- Machine independent.

32 BIG OH NOTATION (P.66)
Big O of F(n) – Upper Bound:

    O(F(n)) = { G(n) | there exist positive constants c and n_0 such that |G(n)| ≤ c·F(n) for all n ≥ n_0 }

*Memorize
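For example, G(n) = 3n^2 + 5n is in O(n^2): choosing c = 4 and n_0 = 5 gives 3n^2 + 5n ≤ 4n^2 for all n ≥ 5.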

33 BIG OMEGA & BIG THETA
- Big Omega – functions that define a lower bound.
- Big Theta – functions that define a tight bound, i.e., both an upper & a lower bound.

34
- Parallel → work actually occurring at the same time; limited by the number of processors.
- Concurrent → tasks in progress at the same time but not necessarily executing simultaneously; “unlimited.”
(Omit 2.5.8 & most of 2.5.9.)

35 Pitfalls = issues that can cause problems
- Synchronization – often required.
- Too little → non-determinism.
- Too much → reduces scaling, increases time & may cause deadlock.

36 Race Conditions
- A situation in which the final result depends upon the order in which tasks complete their work.
- Occurs when concurrent tasks share a memory location & at least one of them writes to it.
- Unpredictable – races don’t always cause errors.
- Interleaving: instructions from 2 or more tasks are executed in an alternating manner.

37
    Task A:        Task B:
      A = X          B = X
      A += 1         B += 2
      X = A          X = B
Assume X is initially 0. What are the possible results? So, Tasks A & B are not REALLY independent!
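Working through the interleavings: if one task runs entirely before the other, X ends at 3 (0 → 1 → 3 or 0 → 2 → 3); but if both tasks read X = 0 before either writes, one update is lost and X ends at 1 or 2. The sketch below (an assumed example, not from the slides) reproduces this with two threads; std::atomic is used only so the separate reads and writes are well defined – the read-modify-write as a whole is still not atomic.

    #include <atomic>
    #include <iostream>
    #include <set>
    #include <thread>

    int main() {
        std::set<int> seen;                          // distinct final values of X
        for (int trial = 0; trial < 10000; ++trial) {
            std::atomic<int> X{0};
            // Task A: read X, add 1, write back (not an atomic update as a whole).
            std::thread A([&] { int a = X.load(); a += 1; X.store(a); });
            // Task B: read X, add 2, write back.
            std::thread B([&] { int b = X.load(); b += 2; X.store(b); });
            A.join();
            B.join();
            seen.insert(X.load());
        }
        for (int v : seen) std::cout << v << '\n';   // some subset of {1, 2, 3}
        return 0;
    }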

38
    Task A:        Task B:
      X = 1          Y = 1
      A = Y          B = X
Assume X & Y are initially 0. What are the possible results?
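Reasoning it through: under any single interleaving of the four statements, at least one of the writes X = 1 or Y = 1 must happen before the corresponding read, so (A, B) can be (0, 1), (1, 0), or (1, 1). On real hardware with store buffers (a relaxed memory model), both reads can even observe the old values, giving (0, 0) – a result no simple interleaving explains.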

39
- Mutual exclusion, locks, semaphores, atomic operations.
- Mechanisms that prevent simultaneous access to a memory location (or locations) – one task completes its access before another is allowed to start.
- Does not always solve the problem – the result may still depend upon which task executes first.
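A minimal sketch of mutual exclusion with a lock (an assumed example using std::mutex, not code from the chapter): each task's read-modify-write of X becomes a critical section, so the updates can no longer interleave. For this particular example the final value is 3 either way; in general, which task runs first may still matter.

    #include <iostream>
    #include <mutex>
    #include <thread>

    int main() {
        int X = 0;
        std::mutex m;                                   // protects X

        std::thread A([&] {
            std::lock_guard<std::mutex> guard(m);       // acquire the lock
            int a = X; a += 1; X = a;                   // critical section
        });                                             // lock released here
        std::thread B([&] {
            std::lock_guard<std::mutex> guard(m);
            int b = X; b += 2; X = b;
        });

        A.join();
        B.join();
        std::cout << X << '\n';                         // always 3 in this example
        return 0;
    }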

40 Deadlock
- A situation in which 2 or more processes cannot proceed because each is waiting on the other – everything STOPs.
- Recommendations for avoidance:
  - Avoid mutual exclusion.
  - Hold at most 1 lock at a time.
  - Acquire locks in the same order.
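A sketch of the “acquire locks in the same order” rule (an assumed example, not from the chapter): if one task took m1 then m2 while another took m2 then m1, each could hold one lock while waiting forever for the other. C++17's std::scoped_lock acquires several locks using a deadlock-avoidance algorithm, so both tasks below are safe even though they list the locks in different orders.

    #include <mutex>
    #include <thread>

    std::mutex m1, m2;   // two shared resources, each protected by a lock

    void taskA() {
        std::scoped_lock both(m1, m2);   // acquires m1 and m2 without deadlock
        // ... work that needs both resources ...
    }

    void taskB() {
        std::scoped_lock both(m2, m1);   // safe even in the other order
        // ... work that needs both resources ...
    }

    int main() {
        std::thread a(taskA), b(taskB);
        a.join();
        b.join();
        return 0;
    }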

41 Four necessary conditions for deadlock:
1. Mutual Exclusion: the resources involved are non-shareable. At least one resource must be held in a non-shareable mode, that is, only one process at a time can claim exclusive control of it. If another process requests that resource, the requesting process must be delayed until the resource has been released.
2. Hold and Wait: there must exist a process that is holding resources already allocated to it while waiting for additional resources that are currently held by other processes.
3. No Preemption: resources already allocated to a process cannot be preempted; a resource is released only voluntarily by the process holding it, once that process has finished using it.
4. Circular Wait: the processes form a circular list or chain in which each process is waiting for a resource held by the next process in the chain.

42
- Fine-Grain Locking – the use of many locks on small sections rather than 1 lock on a large section.
- Notes:
  - 1 large lock is faster to manage but blocks other processes for longer.
  - Setting & releasing many locks takes time.
  - Example: lock a row of the matrix, not the entire matrix.
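A sketch of the matrix example (assumed, not code from the chapter): one mutex per row instead of a single lock for the whole matrix, so tasks that update different rows do not block each other, at the cost of storing and managing many locks.

    #include <mutex>
    #include <thread>
    #include <vector>

    struct LockedMatrix {
        std::vector<std::vector<double>> data;
        std::vector<std::mutex> row_locks;       // one lock per row, not per matrix

        LockedMatrix(std::size_t rows, std::size_t cols)
            : data(rows, std::vector<double>(cols, 0.0)), row_locks(rows) {}

        void add_to_row(std::size_t r, double v) {
            std::lock_guard<std::mutex> guard(row_locks[r]);   // lock only row r
            for (double& x : data[r]) x += v;
        }
    };

    int main() {
        LockedMatrix m(4, 4);
        // Updates to different rows proceed without blocking each other.
        std::thread t1([&] { m.add_to_row(0, 1.0); });
        std::thread t2([&] { m.add_to_row(3, 2.0); });
        t1.join();
        t2.join();
        return 0;
    }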

43 Two assumptions behind good locality:
- Temporal locality – a location accessed recently is likely to be accessed again soon.
- Spatial locality – locations near one just accessed are likely to be accessed soon.
- Reminder: cache line – the block that is retrieved on a miss.
- Currently, a cache miss costs roughly 100 cycles.

44 Load Imbalance
- Uneven distribution of work over processors.
- Related to the decomposition of the problem.
- Few vs. many tasks – what are the implications?

45 Overhead
- Always present in parallel processing – launching and synchronizing tasks.
- Small vs. larger processors – implications???
~ the end of chapter 2 ~

