Slides 8d-1: Programming with Shared Memory: Specifying parallelism and performance issues. ITCS4145/5145, Parallel Programming. B. Wilkinson, Fall 2010.

2 We have seen OpenMP for specifying parallelism. The programmer decides which parts of the code should be parallelized and inserts compiler directives (pragmas). In any programming environment where the programmer explicitly says what should be done in parallel, the issue for the programmer is deciding what can be done in parallel. Let us use generic language constructs for parallelism.

3 par Construct For specifying concurrent statements:

par {
  S1;
  S2;
  …
  Sn;
}

Says one can execute statements S1 to Sn simultaneously if resources are available, or execute them in any order, and still get the correct result.

4 Question How is this specified in OpenMP?
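One possible answer, as a minimal sketch (not from the original slides): OpenMP's sections construct expresses the same idea, with each section a statement that may run concurrently on a different thread. Compile with OpenMP enabled (e.g., gcc -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* Each section may execute on a different thread, in any order. */
        #pragma omp parallel sections
        {
            #pragma omp section
            printf("S1 on thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("S2 on thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("Sn on thread %d\n", omp_get_thread_num());
        }
        return 0;
    }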

5 forall Construct To start multiple similar processes together:

forall (i = 0; i < n; i++) {
  S1;
  S2;
  …
  Sm;
}

Says each iteration of the body can be executed simultaneously if resources are available, or in any order, and still get the correct result. The statements of each instance of the body are executed in the order given. Each instance of the body uses a different value of i.

6 Example

forall (i = 0; i < 5; i++)
  a[i] = 0;

clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.

7 Question How is this specified in OpenMP?
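Again as a sketch (not from the original slides): the usual OpenMP counterpart of forall is the parallel for directive; the loop variable i is automatically private to each thread:

    #include <stdio.h>
    #include <omp.h>

    #define N 5

    int main(void) {
        int a[N];
        /* Iterations may run concurrently, in any order. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 0;
        for (int i = 0; i < N; i++)
            printf("a[%d] = %d\n", i, a[i]);
        return 0;
    }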

8 Dependency Analysis To identify which processes could be executed together. Example: one can see immediately in the code forall (i = 0; i < 5; i++) a[i] = 0; that every instance of the body is independent of the other instances, so all instances can be executed simultaneously. However, it may not always be that obvious. A parallelizing compiler needs an algorithmic way of recognizing dependencies.

9 Bernstein's Conditions Two processes P1 and P2, with input sets I1 and I2 (the variables read) and output sets O1 and O2 (the variables written), can be executed concurrently if:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅

10 Example Suppose P1 is a = x + y; and P2 is b = x + z; Then I1 = {x, y}, O1 = {a}, I2 = {x, z}, O2 = {b}. All three conditions hold, so the two statements can be executed concurrently.

11 Can use Bernstein's conditions at: the machine-instruction level inside the processor, where logic detects whether the conditions are satisfied (see a computer architecture course); and at the process level, to detect whether two processes can be executed simultaneously (using the inputs and outputs of the processes). Can be extended to more than two processes, but the number of conditions rises: every input/output combination must be checked. For three statements, how many conditions need to be checked?
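To make the pairwise test concrete, here is a minimal sketch (our own illustration, not from the slides) that represents each statement's input and output sets as bitmasks over the program's variables and checks Bernstein's three conditions:

    #include <stdbool.h>
    #include <stdio.h>

    /* One bit per shared variable: bit 0 = x, bit 1 = y, and so on. */
    typedef struct { unsigned in, out; } Stmt;

    /* Bernstein's conditions: I1∩O2, I2∩O1, and O1∩O2 must all be empty. */
    bool can_run_concurrently(Stmt s1, Stmt s2) {
        return (s1.in  & s2.out) == 0 &&
               (s2.in  & s1.out) == 0 &&
               (s1.out & s2.out) == 0;
    }

    int main(void) {
        enum { X = 1, Y = 2, Z = 4, A = 8, B = 16 };
        Stmt p1 = { X | Y, A };   /* a = x + y; */
        Stmt p2 = { X | Z, B };   /* b = x + z; */
        printf("%s\n", can_run_concurrently(p1, p2) ? "independent" : "dependent");
        return 0;
    }

For n statements, every pair must be tested, so the number of pairwise checks grows as n(n-1)/2, each involving three set intersections.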

12 Shared Memory Programming Performance Issues

13 Performance Issues with Threads A program might actually go slower when parallelized. Too many threads can significantly reduce program performance.

14 Reasons:

Work split among too many threads gives each thread too little work, so the overhead of starting and terminating threads swamps the useful work.

Too many concurrent threads incur overhead from having to share fixed hardware resources. The OS typically schedules threads round-robin with a time slice, and time slicing incurs overhead: saving registers, effects on cache memory, virtual memory management, ….

Waiting to acquire a lock. When a thread is suspended while holding a lock, all threads waiting for the lock have to wait for that thread to restart.

Source: Multi-core Programming by S. Akhter and J. Roberts, Intel Press.

15 Some Strategies Limit the number of runnable threads to the number of hardware threads. (We will see later that we do not do this with GPUs.) For an n-core machine (not hyper-threaded), have n runnable threads; if hyper-threaded (with 2 virtual threads per core), double this. You can have more threads in total, but the others may be blocked. Separate I/O threads from compute threads; I/O threads wait for external events. Never hard-code the number of threads: leave it as a tuning parameter (see the sketch below).
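A sketch of the last point, assuming OpenMP (the environment-variable name NUM_THREADS below is our own illustrative choice, not a standard one): default to one runnable thread per hardware thread, but leave the count tunable at run time:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Default: one thread per hardware thread; override with NUM_THREADS=... */
    int choose_num_threads(void) {
        const char *s = getenv("NUM_THREADS");
        return (s != NULL) ? atoi(s) : omp_get_num_procs();
    }

    int main(void) {
        omp_set_num_threads(choose_num_threads());
        #pragma omp parallel
        {
            #pragma omp single
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }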

16 Let OpenMP optimize the number of threads. Implement a thread pool. Implement a work-stealing approach in which each thread has a work queue; threads with no work take work from other threads' queues. (See the task sketch below.)
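OpenMP tasks (OpenMP 3.0 and later) give much of this for free: a fixed team of threads acts as the pool, and idle threads pull queued tasks from the runtime (many implementations use work stealing internally). A minimal sketch:

    #include <stdio.h>
    #include <omp.h>

    void do_work(int i) {
        printf("item %d on thread %d\n", i, omp_get_thread_num());
    }

    int main(void) {
        #pragma omp parallel      /* the fixed pool of threads */
        #pragma omp single        /* one thread creates all the tasks */
        for (int i = 0; i < 8; i++) {
            #pragma omp task firstprivate(i)
            do_work(i);           /* any idle thread may pick this up */
        }
        return 0;
    }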

17 Critical Sections Serializing Code High-performance programs should have as few critical sections as possible, because their use can serialize the code. Suppose all processes happen to reach their critical sections together: they will execute their critical sections one after the other. In that situation, the execution time becomes almost that of a single processor. (A concrete sketch follows.)
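As a concrete sketch of the effect (our own illustration): if every iteration's work sits inside a critical section, the parallel loop degenerates to roughly serial execution; a reduction avoids the critical section entirely:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        long slow = 0, fast = 0;

        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            #pragma omp critical          /* all threads queue here: serialized */
            slow += i;
        }

        #pragma omp parallel for reduction(+:fast)
        for (int i = 0; i < N; i++)
            fast += i;                    /* private partial sums, combined once */

        printf("%ld %ld\n", slow, fast);
        return 0;
    }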

18 Illustration

19 Shared Data in Systems with Caches All modern computer systems have cache memory: high-speed memory closely attached to each processor for holding recently referenced data and code.

[Figure: processors, each with a cache memory, connected to main memory]

20 Cache Coherence Protocols Update policy: copies of the data in all caches are updated at the time one copy is altered. Invalidate policy: when one copy of the data is altered, the same data in any other cache is invalidated (by resetting a valid bit in the cache); these copies are updated only when the associated processor next references the data.

21 False Sharing Different parts of a cache block are required by different processors, but not the same bytes. If one processor writes to one part of the block, copies of the complete block in the other caches must be updated or invalidated even though the actual data is not shared. (A sketch that exhibits the effect follows.)
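A small sketch that typically exhibits the effect (our own illustration; compile without aggressive optimization so the stores are not hoisted into registers): two threads repeatedly write adjacent ints that almost certainly share a cache block, so each write invalidates the other cache's copy even though no byte is shared:

    #include <omp.h>

    #define ITERS 100000000L

    volatile int counters[2];   /* adjacent ints: very likely the same block */

    int main(void) {
        #pragma omp parallel num_threads(2)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++)
                counters[id]++;          /* invalidates the other core's copy */
        }
        return 0;
    }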

22 Solution for False Sharing Have the compiler alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
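One common realization of this layout change, as a sketch (the 64-byte block size is typical but not universal, and the aligned attribute is GCC/Clang syntax; C11 offers _Alignas):

    #include <omp.h>

    #define CACHE_LINE 64               /* assumed block size; check your machine */
    #define ITERS 100000000L

    /* Pad each counter to a full block so no two counters share one. */
    struct padded {
        volatile int value;
        char pad[CACHE_LINE - sizeof(int)];
    };
    struct padded counters[2] __attribute__((aligned(CACHE_LINE)));

    int main(void) {
        #pragma omp parallel num_threads(2)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++)
                counters[id].value++;   /* no invalidation traffic between cores */
        }
        return 0;
    }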

23 Sequential Consistency Formally defined by Lamport (1979): A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program. That is, the overall effect of a parallel program is not changed by an arbitrary interleaving of instruction execution in time.

24 Sequential Consistency

25 Writing a parallel program for a system which is known to be sequentially consistent enables us to reason about the result of the program.

26 Example

Process P1:
  data = new;
  flag = TRUE;

Process P2:
  while (flag != TRUE) { };
  data_copy = data;

We expect data_copy to be set to new, because we expect data = new to be executed before flag = TRUE, and while (flag != TRUE) { } to be executed before data_copy = data. This ensures that process P2 reads the new data from process P1; process P2 simply waits for the new data to be produced.
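In plain C this pattern is formally a data race, but C11 atomics with the default ordering (memory_order_seq_cst) give exactly the sequentially consistent behavior the argument assumes. A runnable sketch (our own rendering, compiled with -pthread):

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int data = 0, flag = 0;       /* plain accesses default to seq_cst */

    void *p1(void *arg) {
        atomic_store(&data, 42);         /* "data = new;"  */
        atomic_store(&flag, 1);          /* "flag = TRUE;" */
        return NULL;
    }

    void *p2(void *arg) {
        while (atomic_load(&flag) != 1)
            ;                            /* wait for the new data */
        printf("data_copy = %d\n", atomic_load(&data));   /* always 42 */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&b, NULL, p2, NULL);
        pthread_create(&a, NULL, p1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }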

27 Program Order Sequential consistency refers to the "operations of each individual processor" occurring "in the order specified in its program", i.e., program order. In the previous figure, this order is that of the stored machine instructions to be executed.

28 Compiler Optimizations The order of execution is not necessarily the same as the order of the corresponding high-level statements in the source program, as a compiler may reorder statements for improved performance. In this case, the term program order will depend upon context: either the order in the source program or the order in the compiled machine instructions.

29 High-Performance Processors Modern processors also usually reorder machine instructions internally during execution for increased performance. This does not stop a multiprocessor from being sequentially consistent, provided each processor produces its final results in program order (that is, retires values to registers in program order, which most processors do). All multiprocessors have the option of operating under the sequential consistency model. However, it can severely limit compiler optimizations and processor performance.

30 Example of Processor Re-ordering

Process P1:
  new = a * b;
  data = new;
  flag = TRUE;

Process P2:
  while (flag != TRUE) { };
  data_copy = data;

The multiply machine instruction corresponding to new = a * b is issued for execution. The next instruction, corresponding to data = new, cannot be issued until the multiply has produced its result. However, the following statement, flag = TRUE, is completely independent, and a clever processor could start this operation before the multiply has completed, leading to the sequence:

31

Process P1:
  new = a * b;
  flag = TRUE;
  data = new;

Process P2:
  while (flag != TRUE) { };
  data_copy = data;

Now the while statement might complete before new is assigned to data, and the code would fail. All multiprocessors have the option of operating under the sequential consistency model, i.e., not reordering instructions: here, forcing the multiply instruction to complete before starting the subsequent instructions that depend upon its result.

32 Relaxing Read/Write Orders Processors may be able to relax the consistency in terms of the order of reads and writes of one processor with respect to those of another processor, to obtain higher performance, and provide instructions to enforce consistency when needed.

33 Examples Alpha processors: Memory barrier (MB) instruction: waits for all previously issued memory access instructions to complete before issuing any new memory operations. Write memory barrier (WMB) instruction: as MB, but only for memory write operations, i.e., waits for all previously issued memory write instructions to complete before issuing any new memory write operations. This means memory reads could be issued after a memory write operation, overtake it, and complete before the write operation.

34 SUN Sparc V9 processors: Memory barrier (MEMBAR) instruction with four bits for variations. The write-to-read bit prevents any reads that follow it from being issued before all writes that precede it have completed. Others: write-to-write, read-to-read, read-to-write. IBM PowerPC processor: SYNC instruction, similar to the Alpha MB instruction (check differences).
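In portable modern C, these machine-specific barriers correspond to C11 atomic fences. As a sketch (our own illustration, not from the slides), a release fence before the flag write plays roughly the role WMB plays on the Alpha, and an acquire fence after reading the flag orders the subsequent data read:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    int data;                 /* ordinary, non-atomic shared data */
    atomic_int flag = 0;

    void *producer(void *arg) {
        data = 42;                                      /* ordinary write          */
        atomic_thread_fence(memory_order_release);      /* like WMB: orders writes */
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
        return NULL;
    }

    void *consumer(void *arg) {
        while (atomic_load_explicit(&flag, memory_order_relaxed) != 1)
            ;                                           /* spin on the flag        */
        atomic_thread_fence(memory_order_acquire);      /* orders the read below   */
        printf("data_copy = %d\n", data);               /* guaranteed to be 42     */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }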

35 Questions