
1 CS 267 Applications of Parallel Computers
Lecture 4: More about Shared Memory Processors and Programming
Jim Demmel
http://www.cs.berkeley.edu/~demmel/cs267_Spr99

2 Recap of Last Lecture
There are several standard programming models (plus variations) that were developed to support particular kinds of architectures:
- shared memory
- message passing
- data parallel
The programming models are no longer strictly tied to particular architectures, and so offer portability of correctness. Portability of performance still depends on tuning for each architecture.
In each model, parallel programming has 4 phases:
- decomposition into parallel tasks
- assignment of tasks to threads
- orchestration of communication and synchronization among threads
- mapping of threads to processors

3 Outline
- Performance modeling and tradeoffs
- Shared memory architectures
- Shared memory programming

4 Cost Modeling and Performance Tradeoffs

5 Example: s = f(A[1]) + … + f(A[n])
Decomposition:
- computing each f(A[j]): n-fold parallelism, where n may be >> p
- computing the sum s
Assignment:
- thread k sums sk = f(A[k*n/p]) + … + f(A[(k+1)*n/p-1])
- thread 1 sums s = s1 + … + sp (for simplicity of this example; will be improved)
- thread 1 communicates s to the other threads
Orchestration:
- starting up threads
- communicating and synchronizing with thread 1
Mapping:
- processor j runs thread j
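A minimal C sketch of this decomposition and assignment, written with POSIX threads rather than the Solaris API shown later in the lecture; the function f, the array A, and the sizes N and P are assumptions for illustration.

#include <pthread.h>

#define P 4                              /* number of threads (assumption) */
#define N 1000000                        /* problem size (assumption) */

double A[N], partial[P], s;              /* shared: input, per-thread sums, result */
double f(double x) { return x * x; }     /* placeholder for f */

void *worker(void *arg) {
    int k = *(int *)arg;                 /* which thread am I? */
    int lo = k * N / P, hi = (k + 1) * N / P;
    double sk = 0.0;
    for (int j = lo; j < hi; j++)        /* each thread sums its block of f(A[j]) */
        sk += f(A[j]);
    partial[k] = sk;
    return NULL;
}

int main(void) {
    pthread_t tid[P];
    int id[P];
    for (int k = 0; k < P; k++) {        /* decomposition + assignment */
        id[k] = k;
        pthread_create(&tid[k], NULL, worker, &id[k]);
    }
    for (int k = 0; k < P; k++)          /* orchestration: wait for all threads */
        pthread_join(tid[k], NULL);
    s = 0.0;                             /* "thread 1" (here: main) adds the partial sums */
    for (int k = 0; k < P; k++)
        s += partial[k];
    return 0;
}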

6 Identifying enough Concurrency
Simple decomposition: f(A[i]) is the parallel task; the sum is sequential.
Parallelism profile: the area under the concurrency-vs-time curve is the total work done.
- before mapping: concurrency n for n x time(f), then concurrency 1 for time(sum(n))
- after mapping to p processors: concurrency p for p x n/p x time(f), then concurrency 1 for time(sum(n))
Amdahl's law bounds speedup: let s = the fraction of total work done sequentially.

7 Algorithmic Trade-offs
Parallelize the partial sums of the f's: each thread sums what it "owns".
- What fraction of the computation is now "sequential"?
- What does this do for communication? For locality?
Concurrency profile: p x n/p x time(f), then p x time(sum(n/p)), then 1 x time(sum(p)).

8 Problem Size is Critical
Amdahl's Law bounds:
- total work = n + P
- serial work: P
- parallel work: n
- s = serial fraction = P / (n + P)
- Speedup(P) = (n + P) / (n/P + P)
Speedup decreases for large P if n is small.
In general, seek to exploit only a fraction of the peak parallelism in the problem.
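A quick worked example with an assumed problem size n = 10^6:
Speedup(100)   = (10^6 + 100)   / (10^6/100   + 100)   ≈ 99
Speedup(1000)  = (10^6 + 1000)  / (10^6/1000  + 1000)  ≈ 500
Speedup(10000) = (10^6 + 10000) / (10^6/10000 + 10000) ≈ 100
Past roughly P = sqrt(n), the serial summation term dominates and speedup falls off.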

9 Algorithmic Trade-offs
Parallelize the final summation as well (tree sum).
Generalize Amdahl's law for an arbitrary "ideal" parallelism profile.
Concurrency profile: p x n/p x time(f), then p x time(sum(n/p)), then log_2(p) steps of time(sum(2)).
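A minimal sketch of the tree (log-depth) reduction over the p partial sums, assuming the partial[] array from the earlier sketch; barrier() is an assumed primitive standing in for whatever barrier the program provides (one hand-built version appears later in the lecture).

extern double partial[];   /* per-thread partial sums from the previous sketch */
void barrier(void);        /* assumed barrier primitive */

/* Tree sum: at step d, thread k adds in partial[k + d] if k is a multiple of 2*d.
   After log2(p) steps, partial[0] holds the total. */
void tree_sum(int k, int p) {
    for (int d = 1; d < p; d *= 2) {
        barrier();                       /* all threads reach each level together */
        if (k % (2 * d) == 0 && k + d < p)
            partial[k] += partial[k + d];
    }
}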

10 Shared Memory Architectures

11 Recap Basic Shared Memory Architecture
Processors are all connected to a large shared memory, with a local cache for each processor (diagram: P1 … Pn, each with a cache ($), connected by a network to memory).
Cost: a cache access is much cheaper than a main-memory access.
Simplest to program, but hard to build with many processors.
Now take a closer look at structure, costs, limits.

12 Limits of using Bus as Network
Assume a 100 MB/s bus and 50 MIPS processors without caches:
- 200 MB/s instruction BW per processor (50 M instructions/s x 4 bytes)
- 60 MB/s data BW per processor at 30% load-store
Now suppose a 98% instruction hit rate and a 95% data hit rate (16-byte blocks):
- 4 MB/s instruction BW per processor
- 12 MB/s data BW per processor
- 16 MB/s combined BW per processor
Therefore about 8 processors will saturate the bus.
(Diagram: each PROC generates 260 MB/s of traffic to its cache, but only 16 MB/s passes through the cache onto the shared bus connecting the caches to MEM and I/O.)
The cache provides a bandwidth filter, as well as reducing average access time.

13 Cache Coherence: The Semantic Problem
p1 and p2 both have cached copies of x (as 0).
p1 writes x = 1 and then the flag f = 1, as a signal to other processors that it has updated x:
- writing f pulls it into p1's cache
- both of these writes "write through" to memory
p2 reads f (bringing it into p2's cache) to see if it is 1, which it is.
p2 therefore reads x, expecting the value written by p1, but gets the "stale" cached copy x = 0.
(Diagram: memory holds x = 1, f = 1; p1's cache holds x = 1, f = 1; p2's cache holds x = 0, f = 1.)
SMPs have complicated cache hardware to enforce coherence.

14 Programming SMPs Coherent view of shared memory
All addresses are equidistant, so you don't have to worry about data partitioning.
Caches automatically replicate shared data close to the processor:
- if the program concentrates on a block of the data set that no one else updates => very fast
Communication occurs only on cache misses, and cache misses are slow.
The processor cannot distinguish communication misses from regular cache misses.
Cache blocks may introduce unnecessary communication: two distinct variables in the same cache block lead to false sharing (see the sketch below).
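A minimal C sketch of false sharing, assuming 64-byte cache lines: two threads update distinct counters that happen to share a line, so the line ping-pongs between the two caches even though no data is logically shared. Switching the workers to the padded layout (each counter on its own line) removes the ping-ponging.

#include <pthread.h>

/* Two logically independent counters that share one cache line: */
struct { long a, b; } unpadded;              /* a and b likely in the same 64-byte line */

/* Padded layout: each counter gets its own line (64-byte lines assumed): */
struct { long a; char pad[56]; long b; } padded;

void *bump_a(void *arg) {
    for (long i = 0; i < 100000000; i++) unpadded.a++;   /* invalidates the other cache's copy */
    return NULL;
}
void *bump_b(void *arg) {
    for (long i = 0; i < 100000000; i++) unpadded.b++;   /* invalidates the other cache's copy */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}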

15 Where are things going?
High-end: collections of almost complete workstations/SMPs on a high-speed network (Millennium), with a specialized communication assist integrated with the memory system to provide global access to shared data.
Mid-end: almost all servers are bus-based CC SMPs; high-end servers are replacing the bus with a network (Sun Enterprise 10000, IBM J90, HP/Convex SPP); the volume approach is Pentium Pro quad packs + an SCI ring (Sequent, Data General).
Low-end: the SMP desktop is here.
Major change ahead: SMP on a chip as a building block.

16 Programming Shared Memory Machines
- Creating parallelism in shared memory models
- Synchronization
- Building shared data structures
- Performance programming (throughout)

17 Programming with Threads
Several threads libraries:
- PTHREADS is the POSIX standard; Solaris threads are very similar; relatively low level; portable but possibly slow.
- P4 (Parmacs) is a widely used portable package; higher level than Pthreads.
- OpenMP is a newly proposed standard: support for scientific programming on shared memory; currently a Fortran interface; initiated by SGI (Sun is not currently supporting it).

18 Creating Parallelism

19 Language Notions of Thread Creation
cobegin/coend vs. fork/join: cobegin is cleaner, but fork is more general.
    cobegin
      job1(a1);
      job2(a2);
    coend
- statements in the block may run in parallel
- cobegins may be nested
- scoped, so you cannot have a missing coend
    tid1 = fork(job1, a1);
    job2(a2);
    join tid1;
- the forked function runs in parallel with the current one
- join waits for completion (it may be in a different function)
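The fork/join pattern above maps directly onto the POSIX threads library mentioned on slide 17; a minimal sketch, with job1 and job2 as assumed placeholder functions:

#include <pthread.h>

void *job1(void *a1) { /* ... work ... */ return NULL; }
void  job2(void *a2) { /* ... work ... */ }

int main(void) {
    pthread_t tid1;
    int a1 = 1, a2 = 2;                      /* placeholder arguments */
    pthread_create(&tid1, NULL, job1, &a1);  /* "fork": job1 runs in parallel */
    job2(&a2);                               /* current thread keeps working */
    pthread_join(tid1, NULL);                /* "join": wait for job1 to finish */
    return 0;
}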

20 Forking Threads in Solaris
Signature:
    int thr_create(void *stack_base, size_t stack_size,
                   void *(*start_func)(void *), void *arg,
                   long flags, thread_t *new_tid)
Example: thr_create(NULL, NULL, start_func, arg, NULL, &tid)
- start_func defines the thread body; it takes one argument of type void* and returns void*
- an argument can be passed as arg; the j-th thread gets arg = j so it knows who it is
- stack_base and stack_size give the stack; use the standard default values
- flags controls various attributes; use the standard default values for now
- new_tid returns the thread id (for the thread creator to identify threads)

21 Example Using Solaris Threads
main() {
  thread_ptr = (thrinfo_t *) malloc(NTHREADS * sizeof(thrinfo_t));
  thread_ptr[0].chunk = 0;
  thread_ptr[0].tid = myID;
  for (i = 1; i < NTHREADS; i++) {
    thread_ptr[i].chunk = i;
    if (thr_create(0, 0, worker, (void *)&thread_ptr[i].chunk, 0, &thread_ptr[i].tid)) {
      perror("thr_create");
      exit(1);
    }
  }
  worker(0);                                 /* the creating thread does chunk 0 itself */
  for (i = 1; i < NTHREADS; ++i)
    thr_join(thread_ptr[i].tid, NULL, NULL); /* wait for the other workers */
}

22 Synchronization

23 Basic Types of Synchronization: Barrier
Barrier -- global synchronization.
- fork multiple copies of the same function "work": SPMD, "Single Program Multiple Data"
- simple use of barriers -- all threads hit the same one:
    work_on_my_subgrid();
    barrier;
    read_neighboring_values();
- more complicated -- barriers on branches or in loops -- each path must execute an equal number of barriers:
    if (tid % 2 == 0) {
      work1();
      barrier;
    } else {
      barrier;
    }
- barriers are not provided in many thread libraries, so you need to build them (a sketch follows below)
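Since many thread libraries don't provide a barrier, here is a minimal sketch of one built from a POSIX mutex and condition variable. The type and function names (my_barrier_t, etc.) are hypothetical helpers, chosen to avoid clashing with the Solaris barrier_t used on a later slide.

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_here;
    int count;        /* threads that have arrived at this barrier */
    int nthreads;     /* total threads expected */
    int phase;        /* flips each time the barrier opens, so it can be reused */
} my_barrier_t;

void my_barrier_init(my_barrier_t *b, int nthreads) {
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_here, NULL);
    b->count = 0;
    b->nthreads = nthreads;
    b->phase = 0;
}

void my_barrier_wait(my_barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    int my_phase = b->phase;
    if (++b->count == b->nthreads) {        /* last arrival opens the barrier */
        b->count = 0;
        b->phase = 1 - b->phase;
        pthread_cond_broadcast(&b->all_here);
    } else {
        while (b->phase == my_phase)        /* everyone else waits for the phase to flip */
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}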

24 Basic Types of Synchronization: Mutexes
Mutexes -- mutual exclusion, aka locks.
- threads are working mostly independently, but need to access a common data structure:
    lock *l = alloc_and_init();   /* shared */
    acquire(l);
      access data
    release(l);
- Java and other languages have lexically scoped synchronization (similar to cobegin/coend vs. fork and join)
- semaphores give guarantees on "fairness" in getting the lock, but embody the same idea of mutual exclusion
- locks only affect the processors using them: pair-wise synchronization

25 Basic Types of Synchronization: Post/Wait
Post/Wait -- producer/consumer synchronization.
- post/wait is not as common a term; it could be done with a generalization of locks to reader/writer locks
- sometimes done with a barrier, if there is global agreement
    lock *l = alloc_and_init();   /* shared */
    /* P1 */                      /* P2 */
    big_data = new_value;         wait(l);
    post(l);                      use big_data;
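Post/wait maps naturally onto POSIX semaphores; a minimal sketch (big_data and the value 42 are placeholders standing in for the slide's new_value):

#include <semaphore.h>
#include <pthread.h>

sem_t ready;                  /* counts how many "post"s have happened */
int big_data;                 /* the value handed from producer to consumer */

void *producer(void *arg) {
    big_data = 42;            /* placeholder for new_value */
    sem_post(&ready);         /* "post": signal that big_data is valid */
    return NULL;
}

void *consumer(void *arg) {
    sem_wait(&ready);         /* "wait": block until the producer has posted */
    int x = big_data;         /* safe to use big_data now */
    (void)x;
    return NULL;
}

int main(void) {
    pthread_t p, c;
    sem_init(&ready, 0, 0);   /* initially 0: nothing produced yet */
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}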

26 Synchronization at Different Levels
Can build it yourself out of flags:
    while (!flag) {};
Lock/unlock primitives build in the waiting:
- typically well tested
- system friendly
- sometimes optimized for the machine
- sometimes higher overhead than building your own
Most systems provide higher-level synchronization primitives:
- barriers -- global synchronization
- semaphores
- monitors

27 Solaris Threads Example
mutex_t mul_lock;
barrier_t ba;
int sum;

main() {
  sync_type = USYNC_PROCESS;
  mutex_init(&mul_lock, sync_type, NULL);
  barrier_init(&ba, NTHREADS, sync_type, NULL);
  …. spawn all the threads as above …
}

worker(int me) {
  int x = all_do_work(me);
  barrier_wait(&ba);          /* wait until everyone has finished its work */
  mutex_lock(&mul_lock);
  sum += x;                   /* add my contribution under the lock */
  mutex_unlock(&mul_lock);
}

28 Producer-Consumer Synchronization
A very common pattern in parallel programs (a producer writes x, a consumer reads it); a special case is the write-once variable.
This motivated language constructs that combine parallelism and synchronization:
- future, as in Multilisp:
    next_job(future(job1(x1)), future(job2(x2)))
  job1 and job2 will run in parallel; next_job will run until the args are needed, e.g., arg1 + arg2 inside next_job; implemented using presence bits (hardware?)
- promise: like future, but you need to explicitly ask for the value; T and promise[T] are different types; more efficient, but requires more programmer control

29 Rolling Your Own Synchronization
It is natural to use a variable for producer/consumer synchronization:
    /* P1 */                      /* P2 */
    big_data = new_value;         while (flag != 1) ;
    flag = 1;                     …. = ...big_data...
This works as long as your compiler and machine implement "sequential consistency" [Lamport]: the parallel execution must behave as if it were an interleaving of the sequences of memory operations from each of the processors.
That is, there must exist some serial (total) order, consistent with the partial order formed by the union of each processor's program order, that is a correct sequential execution.

30 But Machines Aren't Always Sequentially Consistent
Hardware does out-of-order execution.
Hardware issues multiple writes, placed into a (mostly FIFO) write buffer; a second write to the same location can be merged into an earlier one.
Hardware reorders reads: the first misses in the cache, the second does not.
The compiler puts something in a register, eliminating some reads and writes.
The compiler reorders reads or writes.
Writes go to physically remote locations: the first write's destination is further away than the second's.

31 Programming Solutions
At the compiler level, the best defense is declaring variables as volatile.
At the hardware level, there are instructions: memory barriers (a bad term) or fences, which force serialization; they only serialize operations executed by one processor, and come in different flavors depending on the machine.
Sequential consistency is sufficient but not necessary for hand-made synchronization: many machines only reorder reads with respect to writes, and the flag example breaks only if read/read pairs or write/write pairs are reordered.
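For reference, a modern, portable way to express the flag example so that it stays correct on machines that are not sequentially consistent is C11 atomics with explicit release/acquire ordering; this is a sketch in today's C, not something available in the libraries discussed in this 1999 lecture, and the value 42 is a placeholder.

#include <stdatomic.h>

int big_data;                         /* ordinary shared data */
atomic_int flag = 0;                  /* the synchronization flag */

/* P1 (producer) */
void producer(void) {
    big_data = 42;                                          /* write the data first */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* release: data visible before flag */
}

/* P2 (consumer) */
int consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
        ;                                                   /* acquire: see everything written before the flag */
    return big_data;                                        /* guaranteed to see 42 */
}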

32 Foundation Behind Sequential Consistency
All violations of sequential consistency between two processors are "figure 8s"; generalized to multiple processors, the cycles can be longer.
Intuition: time cannot move backwards.
All violations appear as "simple cycles" that visit each processor at most twice [Shasha & Snir].
Example (the flag code again):
    /* P1 */                      /* P2 */
    big_data = new_value;         while (flag != 1) ;
    flag = 1;                     …. = ...big_data...

33 Building Shared Data Structures

34 Shared Address Allocation
Some systems provide a special form of malloc/free:
- p4_shmalloc, p4_shfree allocate and free shared memory
- p4_malloc, p4_free are just basic malloc/free; sharing is unspecified
In Solaris threads, malloc'd and static variables are shared.

35 Building Parallel Data Structures
Since data is shared, this is easy.
Some awkwardness comes in the setup:
- only one argument is passed to all threads
- you need to package the data (pointers to it) into a structure and pass that
- otherwise everything is global (hard to read, for the usual reasons)
How you proceed depends on the type of data structure.

36 For data structures that are static (regular in time)
- typically partition logically, although not physically
- need to divide work evenly, often done by dividing the data
- use the "owner computes" rule
- true for irregular (unstructured) data like meshes too
- usually barriers are sufficient synchronization
For dynamic data structures, you need locking or other synchronization.
Example: tree-based particle computation (a locking sketch follows below):
- parts of this computation are naturally partitioned: each processor/thread has a set of particles
- each works on updating a part of the tree
- during a tree walk, a thread may need to look at other parts of the tree
- locking is used (on nodes) to prevent simultaneous updates
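A minimal sketch of per-node locking for the tree example, using POSIX mutexes; the node layout and the update_node function are assumptions for illustration, not the actual code of any particular particle application.

#include <pthread.h>

typedef struct node {
    pthread_mutex_t lock;       /* one lock per tree node; initialized when the node is built */
    double mass, x, y, z;       /* assumed per-node summary data (e.g., center of mass) */
    struct node *child[8];      /* assumed octree children */
} node_t;

/* Update a node's summary data; the lock prevents two threads whose tree walks
   reach the same node from updating it simultaneously. */
void update_node(node_t *n, double dm, double dx, double dy, double dz) {
    pthread_mutex_lock(&n->lock);
    n->mass += dm;
    n->x += dx; n->y += dy; n->z += dz;
    pthread_mutex_unlock(&n->lock);
}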

37 Summary

38 Uniform Shared Address Space
The programmer's view is still processor-centric: you specify what each processor/thread does.
The global view is implicit in the pattern of data sharing.
(Diagram: a shared data structure, plus per-thread local/stack data.)

39 Segmented Shared Address Space
The programmer has both a local and a global view: you specify what each processor/thread does, plus global data, operations, and synchronization.
(Diagram: a shared data structure, plus per-thread local/stack data.)

40 Work vs. Data Assignment
    for (i = MyProc; i < n; i += PROCS) {
      A[i] = f(i);
    }
Assignment of work is easier in a global address space.
It is faster if it corresponds to the data placement.
Hardware replication moves data to where it is accessed.
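The loop above assigns iterations cyclically; a blocked assignment, sketched below, keeps each thread's iterations (and its portion of A) contiguous, which tends to correspond better to data placement. MyProc, PROCS, n, A, and f are the same assumed names as in the slide's loop.

    /* Blocked assignment: thread MyProc owns a contiguous chunk of A. */
    int chunk = (n + PROCS - 1) / PROCS;          /* ceiling division */
    int lo = MyProc * chunk;
    int hi = (lo + chunk < n) ? lo + chunk : n;
    for (int i = lo; i < hi; i++) {
        A[i] = f(i);                              /* "owner computes" on the owned block */
    }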

