Chapter 2 Parallel Programming background


“By the end of this chapter, you should have obtained a basic understanding of how modern processors execute parallel programs & understand some rules of thumb for scaling performance of parallel applications.”

Traditional Parallel Models
Serial model: SISD
Parallel models: SIMD, MIMD, MISD*
(S = Single, M = Multiple, I = Instruction, D = Data)

Vocabulary & Notation (2.1)
Task vs. data: tasks are instructions that operate on data, modifying it or creating new data
Parallel computation → multiple tasks that must be coordinated and managed
Dependencies:
  Data: a task requires data produced by another task
  Control: events/steps must occur in a particular order (e.g., I/O)

Task Management – Fork-Join
Fork: splits a control flow, creating a new control flow
Join: control flows are synchronized and merged back together (a minimal sketch follows)
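
A minimal fork-join sketch in C++ using std::thread (not from the text; the child's work is a hypothetical placeholder):

    #include <iostream>
    #include <thread>

    // Hypothetical work done by the forked control flow.
    void child_task() {
        std::cout << "child control flow running\n";
    }

    int main() {
        std::thread child(child_task);      // fork: a new control flow begins here
        std::cout << "parent control flow continues in parallel\n";
        child.join();                       // join: wait for the child, then merge
        return 0;
    }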

Graphical Notation – Fig. 2.1
Symbols for: task, data, fork, join, dependency

Strategies (2.2) – Data Parallelism
The best strategy for scalable parallelism, i.e., parallelism that grows as the data set / problem size grows
Split the data set across the processors, with a task processing each subset
More data → more tasks (a sketch follows)
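
A minimal data-parallel sketch using C++17 parallel algorithms (illustrative only; the element-wise operation and data size are made up):

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<double> data(1'000'000, 1.0);
        // Each element is processed independently, so the available
        // parallelism grows with the size of the data set.
        std::for_each(std::execution::par, data.begin(), data.end(),
                      [](double& x) { x = 2.0 * x + 1.0; });
        return 0;
    }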

Strategies – Control Parallelism, or Functional Decomposition
Different program functions run in parallel
Not scalable – the best speedup is a constant factor
As the data grows, the parallelism does not (a sketch follows)
May have less overhead, or none
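
A sketch of functional decomposition with std::async (the two functions are hypothetical):

    #include <algorithm>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Two independent program functions.
    double total(const std::vector<double>& v) {
        return std::accumulate(v.begin(), v.end(), 0.0);
    }
    double largest(const std::vector<double>& v) {
        return *std::max_element(v.begin(), v.end());
    }

    int main() {
        std::vector<double> v(1'000'000, 2.0);
        // Run the two functions concurrently.
        auto sum = std::async(std::launch::async, total, std::cref(v));
        auto mx  = std::async(std::launch::async, largest, std::cref(v));
        std::cout << sum.get() << ' ' << mx.get() << '\n';
        // Only two functions exist, so the speedup is capped at 2x
        // no matter how large v grows.
        return 0;
    }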

Regular vs. Irregular Parallelism
Regular: tasks are similar, with predictable dependencies (e.g., matrix multiplication)
Irregular: tasks differ in ways that create unpredictable dependencies (e.g., a chess program)
Many problems contain a combination of both

Hardware Mechanisms (2.3)
The two most important:
Thread parallelism: implemented in hardware with a separate flow of control for each worker; supports regular and irregular parallelism and functional decomposition
Vector parallelism: implemented in hardware with one flow of control operating on multiple data elements; supports regular and some irregular parallelism

Branch Statements
Detrimental to parallelism, locality, and pipelining – HOW?

Masking
All control paths are executed, but unwanted results are masked out (not used).

Original code – the if/else contains a branch:
    if (a & 1)
        a = 3*a + 1;
    else
        a = a / 2;

Masking: both parts are executed, and only one result is kept:
    p = (a & 1);
    t = 3*a + 1;
    if (p)  a = t;
    t = a / 2;
    if (!p) a = t;

No branches – a single flow of control: in hardware, the "if (p) a = t" steps become masked/predicated assignments rather than branches. Masking works as if the code were written this way. (A runnable comparison follows.)
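
A runnable check (not from the text) that the branchy and masked forms agree:

    #include <cassert>

    // Branchy form: contains a data-dependent branch.
    int step_branch(int a) {
        if (a & 1) a = 3 * a + 1;
        else       a = a / 2;
        return a;
    }

    // Masked form: both paths are computed; the predicate selects the result.
    int step_masked(int a) {
        int p = (a & 1);
        int t = 3 * a + 1;
        if (p)  a = t;
        t = a / 2;
        if (!p) a = t;
        return a;
    }

    int main() {
        for (int a = 1; a < 1000; ++a)
            assert(step_branch(a) == step_masked(a));
        return 0;
    }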

Machine Models (2.4)
Core
Functional units
Registers
Cache memory – multiple levels

Cache Memory
Blocks (cache lines) – the unit of data fetched
Bandwidth – the amount of data transferred concurrently
Latency – the time to complete a transfer
Cache coherence – consistency among copies of the same data

Virtual Memory
Memory system = disk storage + main memory
Allows programs larger than physical memory to run
Allows multiprocessing
Swaps pages between memory and disk
Hardware maps logical (virtual) addresses to physical addresses
Data locality is important to efficiency
Too many page faults → thrashing

Parallel Memory Access
Multiple caches
NUMA – Non-Uniform Memory Access
PRAM – Parallel Random-Access Machine model: a theoretical model that assumes uniform memory access times

Performance Issues (2.4.2) – Data Locality
Choose code segments that fit in cache
Design to use data in close proximity
Align data with cache lines (blocks)
Dynamic grain size – a good strategy (see the sketch below)
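
A minimal locality sketch (not from the text): both functions compute the same sum of an n-by-n row-major matrix, but the first touches consecutive addresses and uses every fetched cache line fully, while the second strides through memory and wastes most of each line.

    #include <vector>

    double sum_row_order(const std::vector<double>& m, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i)        // rows
            for (int j = 0; j < n; ++j)    // consecutive elements within a row
                s += m[i * n + j];
        return s;
    }

    double sum_column_order(const std::vector<double>& m, int n) {
        double s = 0.0;
        for (int j = 0; j < n; ++j)        // columns
            for (int i = 0; i < n; ++i)    // stride of n doubles between accesses
                s += m[i * n + j];
        return s;
    }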

Performance Issues – Arithmetic Intensity
A large number of on-chip compute operations for every off-chip memory access
Otherwise, communication overhead is high
Related concept: grain size (see the worked example below)
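
A worked example with assumed numbers: for a SAXPY-style loop, y[i] = a*x[i] + y[i], on double-precision values,

    floating-point ops per element = 2               (one multiply, one add)
    off-chip bytes per element     = 3 * 8 = 24      (read x[i], read y[i], write y[i])
    arithmetic intensity           = 2 / 24 ≈ 0.08 flops/byte

Such a low intensity means the loop is memory-bound: the processor spends most of its time waiting on memory traffic rather than computing.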

Flynn's Categories
Serial model: SISD
Parallel models:
  SIMD – array processors, vector processors
  MIMD – heterogeneous computers, clusters
  MISD* – not useful

Classification Based on Memory
Shared memory – each processor accesses a common memory
  Access issues; no message passing
  Each processor usually also has a small local memory
Distributed memory – each processor has its own local memory
  Explicit messages are sent between processors

Evolution (2.4.4)
GPU – graphics accelerators, now general purpose
Offload – running computations on an accelerator, GPU, or co-processor rather than the regular CPUs
Heterogeneous – different kinds of hardware working together
Host processor – handles work distribution, I/O, etc.

Performance (2.5)
Various interpretations of "performance":
Reducing the total time for a computation → latency
Increasing the rate at which a series of results is computed → throughput
Reducing power consumption
*Each of these is a possible performance target

Latency & Throughput (2.5.1)
Latency: the time to complete a task
Throughput: the rate at which tasks are completed, in units per time (e.g., jobs per hour)

Omit Section 2.5.3 – Power

Speedup & Efficiency (2.5.2)
Speedup: Sp = T1 / Tp
  T1: time to complete on 1 processor
  Tp: time to complete on P processors
  REMEMBER: here "time" is measured as a number of instructions
Efficiency: E = Sp / P = T1 / (P * Tp)
  E = 1 is "perfect" efficiency
Linear speedup – occurs when the algorithm runs P times faster on P processors (see the worked example below)
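
A worked example with hypothetical numbers:

    T1 = 120, P = 4, T4 = 40
    S4 = T1 / T4 = 120 / 40 = 3
    E  = S4 / P  = 3 / 4 = 0.75        (linear speedup would give S4 = 4 and E = 1)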

Superlinear Speedup (p. 57)
Efficiency > 1
Very rare
Often due to hardware effects (e.g., the combined caches of several processors hold more of the data)
Working in parallel may also eliminate some work that the serial version must do

Amdahl & Gustafson-Barsis (2.5.4, 2.5.5)
Amdahl: speedup is limited by the amount of serial work required
Gustafson-Barsis: as the problem size grows, the parallel work grows faster than the serial work, so the achievable speedup increases
See the examples in the text and the worked example below
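
A worked Amdahl example (the 10% serial fraction is hypothetical): if a fraction s of the work must run serially,

    Sp = 1 / (s + (1 - s) / P)
    s = 0.10, P = 8:      S8 = 1 / (0.10 + 0.90 / 8) ≈ 4.7
    s = 0.10, P → ∞:      S∞ = 1 / 0.10 = 10

No matter how many processors are added, the 10% serial portion caps the speedup at 10.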

Work
Work = total operations (time) for a task
T1 = Work
P * Tp = Work, i.e., T1 = P * Tp ?? Rare – due to ???

Work-Span Model (2.5.6)
Describes dependencies among tasks and allows execution times to be estimated
Represents the tasks as a DAG (Figure 2.8)
Critical path – the longest path through the DAG
Span – the minimum possible time, i.e., the time of the critical path
Assumes greedy task scheduling – no wasted resources or time
Parallel slack – excess parallelism: more tasks available than can be scheduled at once

Work-Span Model
Speedup <= Work / Span
Upper bound: ?? No more than… (see the worked example below)
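
A worked example with hypothetical numbers, using the standard work-span bounds:

    Work = T1 = 100, Span = T∞ = 10
    Sp <= Work / Span = 100 / 10 = 10       (no more than 10x, on any number of processors)
    Greedy scheduling: Tp <= Span + (Work - Span) / P
        P = 5:  T5 <= 10 + 90 / 5 = 28,  so S5 >= 100 / 28 ≈ 3.6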

Parallel Slack Decomposing a program or data set into more parallelism than hardware can utilize WHY? Advantages? Disadvantages?

Asymptotic Complexity (2.5.7)
Used for comparing algorithms!!
Time complexity: describes how execution time grows with input size
Space complexity: describes how memory requirements grow with input size
Ignores constant factors; machine independent

Big Oh Notation (p. 66)
Big Oh of F(n) – an upper bound
O(F(n)) = { G(n) : there exist positive constants c and N0 such that |G(n)| ≤ c·F(n) for all n ≥ N0 }
*Memorize (a worked example follows)
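
A worked example of the definition (the constants shown are one valid choice, not the only one):

    G(n) = 3n² + 5n is in O(n²):
    choose c = 4 and N0 = 5; for all n ≥ 5, 5n ≤ n², so 3n² + 5n ≤ 4n² = c·n².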

Big Omega & Big Theta
Big Omega – functions that define a lower bound
Big Theta – functions that define a tight bound: both an upper and a lower bound

Concurrency vs. Parallelism
Parallel → work actually occurring at the same time; limited by the number of processors
Concurrent → tasks in progress at the same time but not necessarily executing simultaneously; "unlimited"
Omit 2.5.8 & most of 2.5.9

Pitfalls of Parallel Programming (2.6)
Pitfalls = issues that can cause problems, usually due to dependencies
Synchronization is often required:
Too little → non-determinism
Too much → reduces scaling, increases execution time, and may cause deadlock

7 Pitfalls That Can Hinder Parallel Speedup
1. Race conditions
2. Mutual exclusion & locks
3. Deadlock
4. Strangled scaling
5. Lack of locality
6. Load imbalance
7. Overhead

Race Conditions (2.6.1)
A situation in which the final result depends on the order in which tasks complete their work
Occurs when concurrent tasks share a memory location and at least one of them writes to it
Unpredictable – races don't always cause errors
Interleaving: instructions from two or more tasks are executed in an alternating manner

Race Conditions ~ Example 2.2
Task A:          Task B:
  A = X            B = X
  A += 1           B += 2
  X = A            X = B
Assume X is initially 0. What are the possible results?
So, tasks A & B are not REALLY independent! (A runnable sketch follows.)
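
A minimal runnable sketch of Example 2.2 (this is deliberately a data race, which is undefined behavior in C++; it is only a teaching illustration, and its output varies from run to run):

    #include <iostream>
    #include <thread>

    int X = 0;   // shared and intentionally unsynchronized

    void task_a() {
        int a = X;      // read shared X
        a += 1;
        X = a;          // write shared X
    }

    void task_b() {
        int b = X;
        b += 2;
        X = b;
    }

    int main() {
        std::thread ta(task_a), tb(task_b);
        ta.join();
        tb.join();
        // Depending on the interleaving, X can be 3 (one task sees the
        // other's result) or 1 or 2 (both read 0 and one update is lost).
        std::cout << "X = " << X << '\n';
        return 0;
    }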

Race Conditions ~ Example 2.3
Task A:          Task B:
  X = 1            Y = 1
  A = Y            B = X
Assume X & Y are initially 0. What are the possible results?

Solutions to Race Conditions (2.6.2)
Mutual exclusion, locks, semaphores, atomic operations
Mechanisms that prevent simultaneous access to a memory location (or locations) – one task completes its update before another is allowed to start
They cause serialization of the protected operations
They do not always solve the problem – the result may still depend on which task executes first (see the sketch below)
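
A sketch of one way to protect Example 2.2 with a mutex (std::lock_guard). In this particular example either serialized order yields X = 3, but as noted above, locking by itself does not fix outcomes that still depend on which task runs first.

    #include <iostream>
    #include <mutex>
    #include <thread>

    int X = 0;
    std::mutex x_mutex;   // protects X

    void task_a() {
        std::lock_guard<std::mutex> guard(x_mutex);   // mutual exclusion
        int a = X;
        a += 1;
        X = a;            // the whole read-modify-write is now one critical section
    }

    void task_b() {
        std::lock_guard<std::mutex> guard(x_mutex);
        int b = X;
        b += 2;
        X = b;
    }

    int main() {
        std::thread ta(task_a), tb(task_b);
        ta.join();
        tb.join();
        std::cout << "X = " << X << '\n';   // always 3
        return 0;
    }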

Deadlock (2.6.3)
A situation in which two or more processes cannot proceed because each is waiting on the other – everything STOPs
Recommendations for avoidance (see the sketch below):
Avoid mutual exclusion where possible
Hold at most one lock at a time
Always acquire locks in the same order
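
A sketch of the "acquire locks in the same order" rule using two hypothetical resources (std::scoped_lock also applies a deadlock-avoidance algorithm when given several mutexes at once):

    #include <mutex>
    #include <thread>

    std::mutex m1, m2;   // two hypothetical shared resources

    // Both workers take the locks in the same order (m1, then m2),
    // so a circular wait cannot form.
    void worker_a() {
        std::scoped_lock lock(m1, m2);
        // ... use both resources ...
    }

    void worker_b() {
        std::scoped_lock lock(m1, m2);
        // ... use both resources ...
    }

    int main() {
        std::thread a(worker_a), b(worker_b);
        a.join();
        b.join();
        return 0;
    }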

Deadlock – Necessary & Sufficient Conditions
1. Mutual exclusion condition: the resources involved are non-shareable.
Explanation: At least one resource must be held in a non-shareable mode, that is, only one process at a time claims exclusive control of the resource. If another process requests that resource, the requesting process must be delayed until the resource has been released.
2. Hold-and-wait condition: a requesting process already holds resources while waiting for the requested resources.
Explanation: There must exist a process that is holding a resource already allocated to it while waiting for additional resources that are currently being held by other processes.
3. No-preemption condition: resources already allocated to a process cannot be preempted.
Explanation: Resources cannot be forcibly removed from a process; they are released only when used to completion or given up voluntarily by the process holding them.
4. Circular-wait condition: the processes in the system form a circular list or chain where each process in the list is waiting for a resource held by the next process in the list.

Strangled Scaling (2.6.4)
Fine-grain locking – use many locks on small sections instead of one lock on a large section
Notes:
One large lock has less locking overhead but blocks other processes
Setting/releasing many locks takes time of its own
Example: lock a row of a matrix, not the entire matrix (see the sketch below)
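
A sketch of row-level (fine-grain) locking for a hypothetical Matrix type: tasks that update different rows never contend, whereas a single matrix-wide lock would serialize them all.

    #include <mutex>
    #include <vector>

    struct Matrix {
        int n;
        std::vector<double> data;            // n x n, row-major
        std::vector<std::mutex> row_locks;   // one lock per row

        explicit Matrix(int n) : n(n), data(n * n, 0.0), row_locks(n) {}

        void add_to_row(int row, double value) {
            std::lock_guard<std::mutex> guard(row_locks[row]);  // lock only this row
            for (int j = 0; j < n; ++j)
                data[row * n + j] += value;
        }
    };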

Lack of Locality (2.6.5)
Two assumptions behind good locality – a core will…
Temporal locality – access the same location again soon
Spatial locality – access a nearby location soon
Reminder: a cache line is the block that is retrieved on a miss
Currently, a cache miss costs roughly 100 cycles

Load Imbalance (2.6.6)
Uneven distribution of work over the processors
Related to how the problem is decomposed
Few vs. many tasks – what are the implications?

Overhead (2.6.7)
Always present in parallel processing: launching tasks, synchronizing
Small vs. larger numbers of processors ~ implications???

~ the end of chapter 2 ~