
CS 267 Applications of Parallel Computers
Lecture 4: More about Shared Memory Processors and Programming
Jim Demmel http://www.cs.berkeley.edu/~demmel/cs267_Spr99

Recap of Last Lecture
There are several standard programming models (plus variations) that were developed to support particular kinds of architectures: shared memory, message passing, and data parallel. The programming models are no longer strictly tied to particular architectures, and so offer portability of correctness; portability of performance still depends on tuning for each architecture. In each model, parallel programming has 4 phases: decomposition into parallel tasks, assignment of tasks to threads, orchestration of communication and synchronization among threads, and mapping of threads to processors.

Outline Performance modeling and tradeoffs Shared memory architectures Shared memory programming

Cost Modeling and Performance Tradeoffs

Example: s = f(A[1]) + … + f(A[n])
Decomposition: computing each f(A[j]) gives n-fold parallelism, where n may be >> p; computing the sum s is a separate task.
Assignment: thread k sums sk = f(A[k*n/p]) + … + f(A[(k+1)*n/p-1]); thread 1 sums s = s1 + … + sp (for simplicity of this example; will be improved); thread 1 communicates s to the other threads.
Orchestration: starting up threads; communicating and synchronizing with thread 1.
Mapping: processor j runs thread j.
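A minimal sketch of this decomposition in C with POSIX threads may make the phases concrete. The names f, N, NTHREADS, and worker are hypothetical and chosen for illustration; the lecture's own Solaris-threads version appears later in these slides.

```c
/* Sketch of the decomposition above: each thread sums f over its block,
   and the joining thread forms s = s1 + ... + sp. */
#include <pthread.h>
#include <stdio.h>

#define N        1024          /* assume N is divisible by NTHREADS */
#define NTHREADS 4

static double A[N];
static double partial[NTHREADS];

static double f(double x) { return x * x; }   /* stand-in for f() */

static void *worker(void *arg) {
    int k = *(int *)arg;
    int lo = k * (N / NTHREADS), hi = (k + 1) * (N / NTHREADS);
    double s = 0.0;
    for (int j = lo; j < hi; j++)      /* each thread evaluates its block of f(A[j]) */
        s += f(A[j]);
    partial[k] = s;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    int id[NTHREADS];
    for (int i = 0; i < N; i++) A[i] = i;
    for (int k = 0; k < NTHREADS; k++) {
        id[k] = k;
        pthread_create(&tid[k], NULL, worker, &id[k]);
    }
    double s = 0.0;
    for (int k = 0; k < NTHREADS; k++) {   /* "thread 1" (here: main) sums the sk */
        pthread_join(tid[k], NULL);
        s += partial[k];
    }
    printf("s = %g\n", s);
    return 0;
}
```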

Identifying Enough Concurrency
Simple decomposition: each f(A[i]) is a parallel task; the sum is sequential. In the parallelism profile, the area is the total work done: n concurrent tasks of cost time(f), followed by one task of cost time(sum(n)). Amdahl's law bounds speedup: let s be the fraction of total work done sequentially. After mapping onto p processors, the concurrency is p, each doing n/p x time(f) work.

Algorithmic Trade-offs
Parallelize the partial sums of the f's: what fraction of the computation is now "sequential"? What does this do for communication and locality? What if you sum only what you "own"? The concurrency profile becomes: p x n/p x time(f), then p x time(sum(n/p)), then 1 x time(sum(p)).

Problem Size is Critical
Amdahl's law bounds: total work = n + P, serial work = P, parallel work = n, so the serial fraction is s = P/(n+P) and Speedup(P) = (n+P)/(n/P + P). Speedup decreases for large P if n is small. In general, seek to exploit only a fraction of the peak parallelism in the problem.
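A quick numeric check of this bound can make the trend visible. The following sketch simply tabulates the formula above for a hypothetical problem size n; the numbers are illustrative, not from the slides.

```c
/* Worked check of the bound: speedup(P) = (n + P) / (n/P + P),
   with serial work P and parallel work n. */
#include <stdio.h>

static double speedup(double n, double P) {
    return (n + P) / (n / P + P);
}

int main(void) {
    double n = 1e6;                       /* hypothetical problem size */
    for (int P = 1; P <= 4096; P *= 4)
        printf("P = %4d  speedup = %8.1f\n", P, speedup(n, (double)P));
    /* Speedup grows roughly linearly while P*P << n, then flattens and
       eventually falls as the serial term P dominates. */
    return 0;
}
```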

Algorithmic Trade-offs (continued)
Parallelize the final summation as a tree sum, and generalize Amdahl's law for an arbitrary "ideal" parallelism profile. The concurrency profile becomes: p x n/p x time(f), then p x time(sum(n/p)), then log_2(p) x time(sum(2)).

Shared Memory Architectures

Recap: Basic Shared Memory Architecture
Processors are all connected to a large shared memory, with a local cache for each processor (figure: P1 … Pn, each with a cache, connected through a network to memory). Cost: a cache access is much cheaper than a main memory access. This is the simplest architecture to program, but hard to build with many processors. Now take a closer look at structure, costs, and limits.

Limits of Using a Bus as the Network
Assume a 100 MB/s bus and a 50 MIPS processor without cache: that is 200 MB/s of instruction bandwidth per processor and 60 MB/s of data bandwidth at 30% load-stores, or 260 MB/s combined. Now suppose a 98% instruction hit rate and a 95% data hit rate (16-byte blocks): 4 MB/s instruction bandwidth and 12 MB/s data bandwidth per processor, or 16 MB/s combined. Therefore 8 processors will saturate the bus. The cache provides a bandwidth filter, as well as reducing average access time.

Cache Coherence: The Semantic Problem
p1 and p2 both have cached copies of x (as 0). p1 writes x = 1 and then the flag f = 1, as a signal to other processors that it has updated x; writing f pulls it into p1's cache, and both writes "write through" to memory. p2 reads f (bringing it into p2's cache) to see if it is 1, which it is. p2 therefore reads x, expecting the value written by p1, but gets the "stale" cached copy x = 0. SMPs have complicated caches to enforce coherence.

Programming SMPs
The programmer sees a coherent view of shared memory, and all addresses are equidistant, so there is no need to worry about data partitioning. Caches automatically replicate shared data close to the processor: if the program concentrates on a block of the data set that no one else updates, it runs very fast. Communication occurs only on cache misses, and cache misses are slow. The processor cannot distinguish communication misses from regular cache misses, and the cache block may introduce unnecessary communication: two distinct variables in the same cache block cause "false sharing".
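A small sketch of false sharing may help; it is a hypothetical example, not from the slides, and assumes 64-byte cache lines. Two threads update two distinct counters; padding each counter onto its own line removes the coherence traffic without changing the result.

```c
/* False-sharing sketch: each thread touches only its own counter, yet an
   unpadded layout would put both counters in one cache line. */
#include <pthread.h>
#include <stdio.h>

#define ITERS (50L * 1000 * 1000)

struct padded { long count; char pad[64 - sizeof(long)]; };  /* assume 64-byte lines */

static struct padded counters[2];        /* one cache line per counter */

static void *bump(void *arg) {
    struct padded *c = arg;
    for (long i = 0; i < ITERS; i++)
        c->count++;                      /* no sharing at the program level */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump, &counters[0]);
    pthread_create(&t1, NULL, bump, &counters[1]);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", counters[0].count, counters[1].count);
    /* Removing pad[] keeps the program correct but typically makes it
       noticeably slower: the line ping-pongs between the two caches. */
    return 0;
}
```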

Where Are Things Going?
High-end: collections of almost-complete workstations/SMPs on a high-speed network (Millennium), with a specialized communication assist integrated with the memory system to provide global access to shared data.
Mid-end: almost all servers are bus-based cache-coherent SMPs; high-end servers are replacing the bus with a network (Sun Enterprise 10000, IBM J90, HP/Convex SPP); the volume approach is Pentium Pro quad-packs plus an SCI ring (Sequent, Data General).
Low-end: the SMP desktop is here.
Major change ahead: SMP on a chip as a building block.

Programming Shared Memory Machines: creating parallelism in shared memory models; synchronization; building shared data structures; performance programming (throughout).

Programming with Threads
There are several thread libraries. PTHREADS is the POSIX standard (Solaris threads are very similar): relatively low level, portable but possibly slow. P4 (Parmacs) is a widely used portable package, higher level than Pthreads. OpenMP is a newly proposed standard with support for scientific programming on shared memory; it currently has a Fortran interface, was initiated by SGI, and Sun is not currently supporting it.

Creating Parallelism

Language Notions of Thread Creation: cobegin/coend vs. fork/join
cobegin is cleaner, but fork is more general.
cobegin job1(a1); job2(a2); coend
Statements in the cobegin block may run in parallel; cobegins may be nested; the construct is scoped, so you cannot have a missing coend.
tid1 = fork(job1, a1); job2(a2); join tid1;
The forked function runs in parallel with the current one; join waits for its completion (and may be called in a different function).
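The fork/join form maps naturally onto POSIX threads. A minimal sketch follows; job1 and job2 are hypothetical placeholders, not part of the original slides.

```c
/* fork/join expressed with pthread_create/pthread_join. */
#include <pthread.h>
#include <stdio.h>

static void *job1(void *a1) { printf("job1(%d)\n", *(int *)a1); return NULL; }
static void  job2(int a2)   { printf("job2(%d)\n", a2); }

int main(void) {
    pthread_t tid1;
    int a1 = 1;
    pthread_create(&tid1, NULL, job1, &a1);  /* tid1 = fork(job1, a1) */
    job2(2);                                 /* runs in parallel with job1 */
    pthread_join(tid1, NULL);                /* join tid1 */
    return 0;
}
```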

Forking Threads in Solaris
Signature:
int thr_create(void *stack_base, size_t stack_size, void *(*start_func)(void *), void *arg, long flags, thread_t *new_tid)
Example: thr_create(NULL, 0, start_func, arg, 0, &tid)
start_func defines the thread body; it takes one argument of type void* and returns void*. An argument can be passed as arg: the j-th thread gets arg = j so it knows who it is. stack_base and stack_size give the stack (use standard default values). flags controls various attributes (standard default values for now). new_tid returns the thread id, so the creating thread can identify its threads.

Example Using Solaris Threads
main() {
  thread_ptr = (thrinfo_t *) malloc(NTHREADS * sizeof(thrinfo_t));
  thread_ptr[0].chunk = 0;
  thread_ptr[0].tid = myID;
  for (i = 1; i < NTHREADS; i++) {
    thread_ptr[i].chunk = i;
    if (thr_create(0, 0, worker, (void*)&thread_ptr[i].chunk, 0, &thread_ptr[i].tid)) {
      perror("thr_create");
      exit(1);
    }
  }
  worker(0);      /* the main thread does chunk 0 itself */
  for (i = 1; i < NTHREADS; ++i)
    thr_join(thread_ptr[i].tid, NULL, NULL);
}

Synchronization

Basic Types of Synchronization: Barrier
A barrier is a global synchronization: fork multiple copies of the same function "work" (SPMD, "Single Program Multiple Data"). The simple use of barriers is that all threads hit the same one:
work_on_my_subgrid(); barrier; read_neighboring_values();
More complicated uses put barriers on branches or in loops; in that case every thread must execute an equal number of barriers:
if (tid % 2 == 0) { work1(); barrier; } else { barrier; }
Barriers are not provided in many thread libraries, so you may need to build them yourself (see the sketch below).
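One way to build a barrier is from a mutex and a condition variable. This is a minimal sketch with hypothetical names, not the lecture's implementation; the generation counter avoids waking threads from the previous phase too early.

```c
/* A counting barrier built from Pthreads primitives. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_here;
    int count;        /* threads still expected in this phase */
    int nthreads;
    int phase;        /* generation counter */
} barrier_t;

void barrier_init(barrier_t *b, int nthreads) {
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_here, NULL);
    b->count = nthreads;
    b->nthreads = nthreads;
    b->phase = 0;
}

void barrier_wait(barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    int my_phase = b->phase;
    if (--b->count == 0) {            /* last thread to arrive releases the rest */
        b->count = b->nthreads;
        b->phase++;
        pthread_cond_broadcast(&b->all_here);
    } else {
        while (b->phase == my_phase)  /* wait until the phase advances */
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}
```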

Basic Types of Synchronization: Mutexes
Mutexes provide mutual exclusion (aka locks): threads work mostly independently but need to access a common data structure.
lock *l = alloc_and_init(); /* shared */
acquire(l); … access data … release(l);
Java and other languages have lexically scoped synchronization, similar to cobegin/coend vs. fork and join. Semaphores give guarantees on "fairness" in getting the lock, but embody the same idea of mutual exclusion. Locks only affect the processors using them: they are pair-wise synchronization.
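In Pthreads, the acquire/release pattern above looks like the following sketch; the shared counter and function name are hypothetical.

```c
/* Critical section protected by a Pthreads mutex. */
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* shared */
static long shared_count = 0;                               /* the common data */

void add_to_count(long mine) {
    pthread_mutex_lock(&lock);     /* acquire(l) */
    shared_count += mine;          /* access data inside the critical section */
    pthread_mutex_unlock(&lock);   /* release(l) */
}
```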

Basic Types of Synchronization: Post/Wait
Post/wait is producer-consumer synchronization ("post/wait" is not as common a term). It could be done with a generalization of locks to reader/writer locks, and is sometimes done with a barrier if there is global agreement.
lock *l = alloc_and_init(); /* shared */
P1: big_data = new_value; post(l);
P2: wait(l); use big_data;
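A post/wait pair can be built from a mutex, a condition variable, and a flag. The sketch below uses hypothetical names and is not the lecture's code.

```c
/* One-shot post/wait event built from Pthreads primitives. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  posted_cv;
    int posted;
} event_t;

void event_init(event_t *e) {
    pthread_mutex_init(&e->lock, NULL);
    pthread_cond_init(&e->posted_cv, NULL);
    e->posted = 0;
}

void post(event_t *e) {                 /* producer: data is ready */
    pthread_mutex_lock(&e->lock);
    e->posted = 1;
    pthread_cond_signal(&e->posted_cv);
    pthread_mutex_unlock(&e->lock);
}

void wait_for(event_t *e) {             /* consumer: block until posted */
    pthread_mutex_lock(&e->lock);
    while (!e->posted)
        pthread_cond_wait(&e->posted_cv, &e->lock);
    pthread_mutex_unlock(&e->lock);
}
```

Usage mirrors the slide: P1 fills big_data and calls post(&e); P2 calls wait_for(&e) before reading big_data.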

Synchronization at Different Levels
You can build synchronization yourself out of flags: while (!flag) {};. Lock/unlock primitives build in the waiting; they are typically well tested, system friendly, and sometimes optimized for the machine, but sometimes have higher overhead than building your own. Most systems also provide higher-level synchronization primitives: barriers (global synchronization), semaphores, and monitors.

Solaris Threads Example
mutex_t mul_lock;
barrier_t ba;
int sum;

main() {
  sync_type = USYNC_PROCESS;
  mutex_init(&mul_lock, sync_type, NULL);
  barrier_init(&ba, NTHREADS, sync_type, NULL);
  … spawn all the threads as above …
}

worker(int me) {
  int x = all_do_work(me);
  barrier_wait(&ba);
  mutex_lock(&mul_lock);
  sum += x;
  mutex_unlock(&mul_lock);
}

Producer-Consumer Synchronization
This is a very common pattern in parallel programs; a special case is the write-once variable (figure: a producer writes x = 100, a consumer reads it). It has motivated language constructs that combine parallelism and synchronization. A future, as in Multilisp:
next_job(future(job1(x1)), future(job2(x2)))
job1 and job2 will run in parallel; next_job runs until their results are needed, e.g., arg1+arg2 inside next_job. Futures are implemented using presence bits (in hardware?). A promise is like a future, but you need to ask for the value explicitly: T and promise[T] are different types, which is more efficient but requires more programmer control.

Rolling Your Own Synchronization
It is natural to use a variable for producer/consumer synchronization:
P1: big_data = new_value; flag = 1;
P2: while (flag != 1) ; … = …big_data…;
This works as long as your compiler and machine implement "sequential consistency" [Lamport]: the parallel execution must behave as if it were an interleaving of the sequences of memory operations from each of the processors. That is, there must exist some serial (total) order, consistent with the partial order formed by the union of each processor's program order, that is a correct sequential execution.

But Machines Aren't Always Sequentially Consistent
Hardware does out-of-order execution. Hardware issues multiple writes, placed into a (mostly FIFO) write buffer, and a second write to the same location can be merged into an earlier one. Hardware reorders reads (the first misses in cache, the second does not). The compiler puts something in a register, eliminating some reads and writes, or reorders reads and writes. Writes may go to physically remote locations, and the first write's destination may be farther away than the second's.

Programming Solutions
At the compiler level, the best defense is declaring variables as volatile. At the hardware level, there are instructions called memory barriers (a bad term) or fences that force serialization; they only serialize operations executed by one processor, and come in different flavors depending on the machine. Sequential consistency is sufficient but not necessary for hand-made synchronization: many machines only reorder reads with respect to writes, and the flag example breaks only if read/read pairs or write/write pairs are reordered (see the sketch below).
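To make the ordering requirement concrete, here is the big_data/flag example written with explicit fences in C11 <stdatomic.h>. This modern syntax did not exist in 1999 and is an assumption for illustration, not the lecture's code; the point is only that a fence must sit between the data write and the flag write, and between the flag read and the data read.

```c
/* Flag-based signaling with explicit fences (C11 atomics). */
#include <stdatomic.h>

static int big_data;                 /* ordinary shared data */
static atomic_int flag = 0;          /* the synchronization flag */

void producer(void) {                /* runs on P1 */
    big_data = 42;                   /* the write that must become visible first */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

int consumer(void) {                 /* runs on P2 */
    while (atomic_load_explicit(&flag, memory_order_relaxed) != 1)
        ;                            /* spin until the flag is posted */
    atomic_thread_fence(memory_order_acquire);
    return big_data;                 /* sees 42, not a stale value */
}
```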

Foundation Behind Sequential Consistency
All violations of sequential consistency are "figure 8s": they appear as simple cycles that visit each processor at most twice [Shasha & Snir], generalized to multiple processors (the cycles can be longer). Intuition: time cannot move backwards. The big_data/flag example above is the two-processor case.

Building Shared Data Structures

Shared Address Allocation
Some systems provide a special form of malloc/free: p4_shmalloc and p4_shfree allocate shared memory, while p4_malloc and p4_free are just basic malloc/free with sharing unspecified. In Solaris threads, malloc'd and static variables are shared.

Building Parallel Data Structures
Since data is shared, this is easy. Some awkwardness comes in the setup: only one argument is passed to all threads, so you need to package data (or pointers to it) into a structure and pass that; otherwise everything is global, which is hard to read for the usual reasons. The rest depends on the type of data structure.

For data structures that are static (regular in time): typically partition logically, although not physically; divide the work evenly, often by dividing the data; and use the "owner computes" rule (see the sketch below). This holds for irregular (unstructured) data like meshes too, and usually barriers are sufficient synchronization.
For dynamic data structures: locking or other synchronization is needed. Example: a tree-based particle computation. Parts of this computation are naturally partitioned: each processor/thread has a set of particles and works on updating a part of the tree. During a tree walk, a thread needs to look at other parts of the tree, so locking is used (on nodes) to prevent simultaneous updates.
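A sketch of the "owner computes" rule for a static, block-partitioned array follows; the names are hypothetical, and the mesh and tree cases follow the same idea with less regular ownership sets.

```c
/* Owner computes: thread `me` owns a contiguous block and updates only
   the entries it owns; a barrier separates update and read phases. */
#include <pthread.h>

#define N        1000
#define NTHREADS 4

static double grid[N];

void *update_owned(void *arg) {
    int me = *(int *)arg;
    int lo = me * N / NTHREADS;
    int hi = (me + 1) * N / NTHREADS;
    for (int i = lo; i < hi; i++)
        grid[i] = 0.5 * grid[i];      /* stand-in for the real update */
    /* A barrier would go here before anyone reads neighbors' values. */
    return NULL;
}
```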

Summary

Uniform Shared Address Space
The programmer's view is still processor-centric: the program specifies what each processor/thread does, and the global view is implicit in the pattern of data sharing (figure: a shared data structure plus per-thread local/stack data).

Segmented Shared Address Space
The programmer has both a local and a global view: the program specifies what each processor/thread does, and global data, operations, and synchronization are explicit (figure: a shared data structure plus per-thread local/stack data).

Work vs. Data Assignment
for (I = MyProc; I < n; I += PROCS) { A[I] = f(I); }
Assignment of work is easier in a global address space, and it is faster if it corresponds to the data placement. Hardware replication moves data to where it is accessed.
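The loop on this slide assigns iterations cyclically. For comparison, a block assignment, which often matches a block data placement better, looks like the following sketch; MyProc, PROCS, and f are hypothetical, as on the slide.

```c
/* Cyclic vs. block assignment of the loop A[i] = f(i). */
static double f(int i) { return (double)i; }   /* stand-in for f() */

void cyclic_assign(double *A, int n, int MyProc, int PROCS) {
    for (int i = MyProc; i < n; i += PROCS)    /* iteration i goes to i % PROCS */
        A[i] = f(i);
}

void block_assign(double *A, int n, int MyProc, int PROCS) {
    int lo = MyProc * n / PROCS;
    int hi = (MyProc + 1) * n / PROCS;
    for (int i = lo; i < hi; i++)              /* each thread gets one contiguous block */
        A[i] = f(i);
}
```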