“By the end of this chapter, you should have obtained a basic understanding of how modern processors execute parallel programs & understand some rules of thumb for scaling performance of parallel applications.” 1

Serial model: SISD
Parallel models: SIMD, MIMD, MISD*
(S = Single, M = Multiple, I = Instruction, D = Data) 2

Tasks vs. data: tasks are instructions that operate on data, modifying it or creating new data.
Parallel computation means multiple tasks, which must be coordinated and managed.
Dependencies:
  Data dependency: a task requires data produced by another task.
  Control dependency: events/steps must be ordered (e.g., I/O). 3

Fork: splits a control flow, creating a new control flow.
Join: control flows are synchronized and merged. 4
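
A minimal fork/join sketch in C++ with std::thread (an illustration, not from the slides; the function name worker is made up): creating the thread is the fork, and join() is the join.

    #include <iostream>
    #include <thread>

    void worker(int id) {
        // Work done on the new control flow created by the fork.
        std::cout << "worker " << id << " running\n";
    }

    int main() {
        std::thread t(worker, 1);   // fork: a second control flow starts here
        // ... the original control flow can do other work here ...
        t.join();                   // join: wait for the forked flow, then merge
        return 0;
    }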

[Slide 5: figure labeled Task, Data, Fork, Join, Dependency]

Data parallelism: the best strategy for scalable parallelism, i.e., parallelism that grows as the data set/problem size grows.
Split the data set across a set of processors, with a task processing each subset.
More data means more tasks. 6
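
A sketch of data parallelism, assuming a chunked parallel sum (the chunking scheme, worker count, and names are illustrative):

    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> data(1'000'000, 1);
        const int workers = 4;                      // more data could justify more tasks
        std::vector<long long> partial(workers, 0);
        std::vector<std::thread> pool;

        // Split the data set over the workers; each task processes its own chunk.
        const size_t chunk = data.size() / workers;
        for (int w = 0; w < workers; ++w) {
            size_t begin = w * chunk;
            size_t end = (w == workers - 1) ? data.size() : begin + chunk;
            pool.emplace_back([&, begin, end, w] {
                partial[w] = std::accumulate(data.begin() + begin,
                                             data.begin() + end, 0LL);
            });
        }
        for (auto& t : pool) t.join();
        std::cout << std::accumulate(partial.begin(), partial.end(), 0LL) << "\n";
    }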

Control parallelism (functional decomposition): different program functions run in parallel.
Not scalable: the best speedup is a constant factor; as the data grows, the parallelism does not.
May have less overhead, or none. 7

Regular: tasks are similar, with predictable dependencies (e.g., matrix multiplication).
Irregular: tasks differ in ways that create unpredictable dependencies (e.g., a chess program).
Many problems contain combinations of both. 8

The 2 most important mechanisms:
Thread parallelism: implemented in hardware using a separate control flow for each worker; supports regular, irregular, and functional decomposition.
Vector parallelism: implemented in hardware with one control flow applied to multiple data elements; supports regular and some irregular parallelism. 9
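
As a sketch of the difference, here is the same loop written both ways in C++ (the OpenMP pragmas are an assumption about the compiler/build; without OpenMP they are simply ignored):

    #include <vector>

    // Vector parallelism: one control flow, many data elements.
    // A vectorizing compiler (helped by '#pragma omp simd') maps this loop to SIMD units.
    void scale_simd(std::vector<float>& a, float s) {
        #pragma omp simd
        for (size_t i = 0; i < a.size(); ++i)
            a[i] *= s;
    }

    // Thread parallelism: a separate control flow per worker.
    void scale_threads(std::vector<float>& a, float s) {
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(a.size()); ++i)
            a[i] *= s;
    }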

Detrimental to parallelism, locality, pipelining: HOW? 10

Branching code:
    if (a & 1) a = 3*a + 1; else a = a / 2;
The if/else contains branch instructions.
Masking: both parts are executed in parallel, and only one result is kept. Masking works as if the code were written this way:
    p = (a & 1);
    t = 3*a + 1;  if (p)  a = t;
    t = a / 2;    if (!p) a = t;
No branches: a single control flow. 11

A core contains:
  Functional units
  Registers
  Cache memory (multiple levels) 12

[Slide 13: figure only]

Blocks (cache lines): the amount fetched at once.
Bandwidth: the amount that can be transferred concurrently.
Latency: the time to complete a transfer.
Cache coherence: consistency among the copies. 14

Virtual memory: a memory system built from disk storage plus chip memory.
Allows programs larger than physical memory to run, and allows multiprocessing.
Pages are swapped in and out; hardware maps logical to physical addresses.
Data locality is important to efficiency: page faults and thrashing. 15

Caches (multiple levels).
NUMA: Non-Uniform Memory Access.
PRAM: Parallel Random-Access Machine, a theoretical model that assumes uniform memory access times. 16

Data locality:
  Choose code segments that fit in cache.
  Design to use data that is in close proximity.
  Align data with cache lines (blocks).
  Dynamic grain size is a good strategy. 17
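
A sketch of why proximity matters, assuming a row-major 2-D layout as in C++ (the names and the rectangular-matrix assumption are illustrative): traversing along a row reuses each fetched cache line, while traversing down columns touches a different line on nearly every access.

    #include <vector>

    // Good locality: consecutive elements of a row share cache lines.
    double sum_by_rows(const std::vector<std::vector<double>>& m) {
        double s = 0.0;
        for (size_t i = 0; i < m.size(); ++i)
            for (size_t j = 0; j < m[i].size(); ++j)
                s += m[i][j];
        return s;
    }

    // Poor locality: each access jumps to a different row (assumes a rectangular matrix).
    double sum_by_cols(const std::vector<std::vector<double>>& m) {
        double s = 0.0;
        const size_t cols = m.empty() ? 0 : m[0].size();
        for (size_t j = 0; j < cols; ++j)
            for (size_t i = 0; i < m.size(); ++i)
                s += m[i][j];
        return s;
    }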

Arithmetic intensity: a large number of on-chip compute operations for every off-chip memory access.
Otherwise, communication overhead is high.
Related concept: grain size. 18
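
An illustrative calculation (not from the slides): the single-precision update y[i] = a*x[i] + y[i] performs 2 floating-point operations per element while moving about 12 bytes (read x[i] and y[i], write y[i]), so its arithmetic intensity is roughly 2/12 ≈ 0.17 flops per byte. That is low, so such a loop is limited by memory traffic; fusing more computation per element raises the intensity.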

Serial model: SISD.
Parallel models:
  SIMD: array processors, vector processors.
  MIMD: heterogeneous computers, clusters.
  MISD*: not useful. 19

Shared memory: each processor accesses a common memory. There are access issues but no message passing; each processor usually also has a small local memory.
Distributed memory: each processor has its own local memory, and processors send explicit messages to one another. 20

GPUs: graphics accelerators, now also used for general-purpose computing.
Offload: running computations on an accelerator, GPU, or co-processor rather than on the regular CPUs.
Heterogeneous: different kinds of hardware working together.
Host processor: handles work distribution, I/O, etc. 21

Performance has various interpretations:
  Reducing the total time for a computation (latency).
  Increasing the rate at which a series of results is computed (throughput).
  Reducing power consumption.
*Performance target 22

Latency: the time to complete a task.
Throughput: the rate at which tasks are completed, in units per time (e.g., jobs per hour). 23

[Slide 24: figure only]

Speedup: S_P = T_1 / T_P
  T_1: time to complete on 1 processor.
  T_P: time to complete on P processors.
  REMEMBER: "time" means the number of instructions.
Efficiency: E = S_P / P = T_1 / (P * T_P)
  E = 1 is "perfect".
  Linear speedup occurs when an algorithm runs P times faster on P processors. 25
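
A quick worked example with illustrative numbers (not from the slides): if T_1 = 120 s and T_4 = 40 s on P = 4 processors, then S_4 = 120 / 40 = 3 and E = 3 / 4 = 0.75, i.e., 75% efficiency.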

Efficiency > 1 (superlinear speedup) is very rare.
Often due to hardware variations (e.g., more total cache in the parallel run).
Working in parallel may also eliminate some work that would be done when running serially. 26

Amdahl: speedup is limited by the amount of serial work required.
Gustafson-Barsis: as the problem size grows, the parallel work grows faster than the serial work, so the speedup increases.
See the examples. 27
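
For reference, the standard forms of the two laws (the symbol f for the serial fraction is not from the slides): Amdahl's law bounds the speedup on P processors as S_P <= 1 / (f + (1 - f)/P), so even with unlimited processors S <= 1/f. Gustafson-Barsis measures f on the parallel run and gives the scaled speedup S_P = P - f(P - 1); as the problem grows and f shrinks, S_P approaches P.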

Work: the total operations (time) for a task.
T_1 = Work.
P * T_P = Work?
Is T_1 = P * T_P?? Rare due to ??? 28

The work-span model describes dependencies among tasks and allows times to be estimated.
Tasks are represented as a DAG (Figure 2.8).
Critical path: the longest path through the DAG.
Span: the minimum possible time, i.e., the time of the critical path.
Assumes greedy task scheduling: no wasted resources or time.
Parallel slack: excess parallelism, more tasks than can be scheduled at once. 29
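
A small illustrative example (not Figure 2.8): suppose 7 unit-time tasks whose longest dependency chain is 3 tasks long. Then Work = 7 and Span = 3, so no schedule on any number of processors can finish in fewer than 3 time units, and by the bound on the next slide the speedup is at most 7/3 ≈ 2.33.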

Speedup <= Work / Span.
Upper bound: ??
No more than… 30

ASYMPTOTIC COMPLEXITY (2.5.7): for comparing algorithms!!
Time complexity: how execution time grows in terms of the input size.
Space complexity: how memory requirements grow in terms of the input size.
Ignores constants; machine independent. 31

BIG OH NOTATION (p. 66)
Big Oh of F(n): an upper bound.
O(F(n)) = { G(n) : there exist positive constants c and n0 such that |G(n)| ≤ c·F(n) for all n ≥ n0 }
*Memorize 32
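
A quick illustrative check (not from the slides): G(n) = 3n^2 + 10n is in O(n^2), since with c = 4 and n0 = 10 we have 3n^2 + 10n ≤ 4n^2 whenever n ≥ 10.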

BIG OMEGA & BIG THETA
Big Omega: functions that define a lower bound.
Big Theta: functions that define a tight bound, i.e., both an upper and a lower bound. 33

Parallel: work actually occurring at the same time; limited by the number of processors.
Concurrent: tasks in progress at the same time but not necessarily executing; "unlimited".
Omit 2.5.8 & most of

Pitfalls: issues that can cause problems.
Synchronization is often required:
  Too little leads to non-determinism.
  Too much reduces scaling, increases time, and may cause deadlock. 35

Race condition: a situation in which the final result depends on the order in which tasks complete their work.
Occurs when concurrent tasks share a memory location and at least one of them writes to it.
Unpredictable: races don't always cause errors.
Interleaving: instructions from 2 or more tasks are executed in an alternating manner. 36

Task A:         Task B:
  A = X           B = X
  A += 1          B += 2
  X = A           X = B

Assume X is initially 0. What are the possible results?
So, Tasks A & B are not REALLY independent! 37
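
A sketch of this race as C++ code (illustrative; the unsynchronized access to X below is deliberately a data race, which is exactly the point of the slide): run it several times and the printed values can differ.

    #include <iostream>
    #include <thread>

    int X = 0;   // shared location, intentionally unprotected

    void taskA() { int a = X; a += 1; X = a; }
    void taskB() { int b = X; b += 2; X = b; }

    int main() {
        for (int run = 0; run < 10; ++run) {
            X = 0;
            std::thread ta(taskA), tb(taskB);
            ta.join();
            tb.join();
            std::cout << X << ' ';   // result depends on how the tasks interleave
        }
        std::cout << '\n';
    }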

Task A:         Task B:
  X = 1           Y = 1
  A = Y           B = X

Assume X & Y are initially 0. What are the possible results? 38

Mutual exclusion, locks, semaphores, atomic operations.
Mechanisms that prevent simultaneous access to a memory location (or locations): one task is allowed to complete its access before another is allowed to start.
They do not always solve the problem: the result may still depend on which task executes first. 39
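
A minimal sketch of mutual exclusion using std::mutex (one possible mechanism; the slides don't prescribe a specific one), protecting the shared X from the earlier race: the lock makes each task's read-modify-write indivisible, so this particular example now always ends with X == 3. As the slide notes, a lock alone cannot fix cases where the answer genuinely depends on which task goes first.

    #include <mutex>
    #include <thread>

    int X = 0;
    std::mutex x_mutex;

    void taskA() {
        std::lock_guard<std::mutex> guard(x_mutex);  // critical section: one task at a time
        X += 1;
    }

    void taskB() {
        std::lock_guard<std::mutex> guard(x_mutex);
        X += 2;
    }

    int main() {
        std::thread ta(taskA), tb(taskB);
        ta.join();
        tb.join();
        // X is now always 3, regardless of the order in which the tasks ran.
    }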

Deadlock: a situation in which 2 or more processes cannot proceed because each is waiting on another: everything STOPs.
Recommendations for avoidance:
  Avoid mutual exclusion where possible.
  Hold at most 1 lock at a time.
  Acquire locks in the same order. 40
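
A sketch of the "acquire locks in the same order" idea using C++17 std::scoped_lock, which locks several mutexes with a built-in deadlock-avoidance algorithm (the Account type and transfer function are made up for illustration):

    #include <mutex>

    struct Account {
        double balance = 0.0;
        std::mutex m;
    };

    void transfer(Account& from, Account& to, double amount) {
        // Locks both mutexes atomically (deadlock-free), so concurrent
        // transfer(a, b) and transfer(b, a) calls cannot deadlock each other.
        std::scoped_lock lock(from.m, to.m);
        from.balance -= amount;
        to.balance += amount;
    }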

The four conditions for deadlock: 41
1. Mutual exclusion condition: the resources involved are non-shareable. At least one resource must be held in a non-shareable mode, that is, only one process at a time claims exclusive control of it. If another process requests that resource, the requesting process must be delayed until the resource has been released.
2. Hold-and-wait condition: a requesting process already holds resources while waiting for the requested resources. There must exist a process that is holding a resource already allocated to it while waiting for additional resources that are currently held by other processes.
3. No-preemption condition: resources already allocated to a process cannot be preempted. Resources cannot be taken away from a process; they are used to completion or released voluntarily by the process holding them.
4. Circular-wait condition: the processes in the system form a circular list or chain in which each process is waiting for a resource held by the next process in the list.

Fine-grain locking: use many locks on small sections rather than 1 lock on a large section.
Notes:
  1 large lock is faster to manage but blocks other processes.
  Setting/releasing many locks takes time.
  Example: lock a row of the matrix, not the entire matrix. 42
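
A sketch of the row-locking example, assuming one mutex per row (the Matrix type and names are illustrative): updates to different rows proceed in parallel, while updates to the same row are still serialized.

    #include <mutex>
    #include <vector>

    struct Matrix {
        std::vector<std::vector<double>> rows;
        std::vector<std::mutex> row_locks;   // one lock per row, not one for the whole matrix

        Matrix(size_t r, size_t c)
            : rows(r, std::vector<double>(c, 0.0)), row_locks(r) {}

        void add_to_row(size_t i, double v) {
            std::lock_guard<std::mutex> guard(row_locks[i]);  // lock only row i
            for (double& x : rows[i]) x += v;
        }
    };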

Two assumptions behind good locality:
  Temporal locality: the same location will be accessed again soon.
  Spatial locality: nearby locations will be accessed soon.
Reminder: the cache line is the block that is retrieved on a miss.
Currently, a cache miss costs roughly 100 cycles. 43

Load imbalance: an uneven distribution of work over the processors.
Related to how the problem is decomposed.
Few vs. many tasks: what are the implications? 44

Overhead is always present in parallel processing: launching tasks, synchronizing.
Small vs. larger processors ~ implications???
~the end of chapter 2~ 45