Shared Memory Programming via Posix threads


1 Shared Memory Programming via Posix threads
Laxmikant Kale CS433

2 Shared Address Space Model
All memory is accessible to all processes. Processes are mapped to processors, typically by a symmetric OS. Processes coordinate by sharing variables, and avoid "stepping on each other's toes" by using locks and barriers.

3 Running Example: computing pi
Area of a circle of radius r: π*r*r. Ratio of the area of the circle to that of the enclosing square: π/4. Method: generate random number pairs (each coordinate in the range 0-1) and count the pairs that fall inside the circle; the fraction of pairs that land inside estimates π/4. In parallel: let each processor generate a different set of random number pairs and count how many of its pairs fall inside the circle.

4 Pi on shared memory

int count;
Lock countLock;

piFunction(int myProcessor)
{
  seed s = makeSeed(myProcessor);
  for (i = 0; i < 100000/P; i++) {
    x = random(s);
    y = random(s);
    if (x*x + y*y < 1.0) {
      lock(countLock);        /* serialize the shared update */
      count++;
      unlock(countLock);
    }
  }
  barrier();                  /* wait until every processor has finished */
  if (myProcessor == 0)
    printf("pi=%f\n", 4.0*count/100000);   /* 4.0 forces floating-point division */
}

5 main()

main()
{
  countLock = createLock();
  parallel(piFunction);
}

The system needs to provide the functions for locks, barriers, and thread (or process) creation; a minimal sketch of how such primitives could be built on Pthreads is given below.
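As an illustration (not part of the original slides), the hypothetical primitives createLock/lock/unlock/barrier/parallel used above could be backed by standard Pthreads calls roughly as follows; the fixed thread count P and all the names here are assumptions made for the sketch.

#include <pthread.h>
#include <stdlib.h>

#define P 4                                /* number of threads (assumed) */

typedef pthread_mutex_t Lock;
static pthread_barrier_t barr;             /* one barrier shared by all threads */

Lock *createLock(void) {
    Lock *l = malloc(sizeof(Lock));
    pthread_mutex_init(l, NULL);
    return l;
}
void lock(Lock *l)   { pthread_mutex_lock(l); }
void unlock(Lock *l) { pthread_mutex_unlock(l); }
void barrier(void)   { pthread_barrier_wait(&barr); }

/* parallel(f): run f(0) .. f(P-1) on P threads and wait for all of them */
void parallel(void *(*f)(void *)) {
    pthread_t tid[P];
    pthread_barrier_init(&barr, NULL, P);
    for (long i = 0; i < P; i++)
        pthread_create(&tid[i], NULL, f, (void *) i);
    for (long i = 0; i < P; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barr);
}

(Note that the slide's piFunction takes an int argument while a Pthreads start routine takes a void *, so the real code on the later slides passes the processor number through the void * parameter; also, POSIX barriers are an optional feature and are not available on every platform.)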

6 How fast will this run? Assume perfect shared memory machine
(i.e., no problem scaling up because of limited bandwidth to memory). But locks are a sequential bottleneck: with many processors, most of them will be sitting in the queue waiting for the lock at any given time, yet we do very little work inside the "locked" critical section, and obtaining the lock itself is expensive (we will revisit why later). Can we analyze the performance more precisely? Let Tw be the time for computing outside the lock, Tc the time for getting the lock, doing the critical-section work, and unlocking, and P the number of processors.

7 Analysis: How fast will this run?
Can we analyze the performance more precisely? Let Tw be the time for computing outside the lock, Tc the time for getting the lock, doing the critical-section work, and unlocking, and P the number of processors. [The slide shows a per-processor timeline: Tw = work, Tc = critical section.]

8 Analysis: How fast will this run?
The other case is when the work section is larger than P*Tc. Write expressions for the completion time in both cases. [Timeline legend as before: Tw = work, Tc = critical section.]
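One plausible way to write those expressions (an answer sketched here for reference, not given on the slides): consider one pass of the loop on each of the P processors, each pass being Tw of independent work followed by a critical section of length Tc that only one processor can execute at a time.

- If Tw >= (P-1)*Tc, the other processors' critical sections can be hidden under a processor's own work, so each pass completes in roughly Tw + Tc and the lock is not the bottleneck.
- If Tw < (P-1)*Tc, the lock saturates: the P critical sections serialize, each pass costs roughly Tw + P*Tc for the last processor through the lock (about P*Tc per pass in steady state), and adding processors beyond roughly Tw/Tc + 1 no longer helps.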

9 Pi on shared memory: efficient version
int count;
Lock countLock;

piFunction(int myProcessor)
{
  int c = 0;                              /* private, per-processor count */
  seed s = makeSeed(myProcessor);
  for (i = 0; i < 100000/P; i++) {
    x = random(s);
    y = random(s);
    if (x*x + y*y < 1.0)
      c++;
  }
  lock(countLock);                        /* one locked update per processor */
  count += c;
  unlock(countLock);
  barrier();
  if (myProcessor == 0)
    printf("pi=%f\n", 4.0*count/100000);
}

10 Real SAS systems Posix threads (Pthreads) is a standard for threads-based shared memory programming Shared memory calls: just a few, normally standard calls In addition, lower level calls: fetch-and-inc, fetch-and-add

11 Posix Threads on Origin 2000
Shared memory programming on Origin 2000: important calls.

Thread creation and joining:
  pthread_create(pthread_t *threadID, pthread_attr_t *attr, functionName, (void *) arg);
  pthread_join(pthread_t threadID, void **result);

Locks:
  pthread_mutex_t lock;
  pthread_mutex_lock(&lock);
  pthread_mutex_unlock(&lock);

Condition variables:
  pthread_cond_t cv;
  pthread_cond_init(&cv, (pthread_condattr_t *) 0);
  pthread_cond_wait(&cv, &cv_mutex);
  pthread_cond_broadcast(&cv);

Semaphores, and other calls.
Follow the web link on the class web page for detailed documentation.
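The condition-variable calls listed above are almost always used in the following pattern; this is a minimal sketch, and the 'ready' flag and its mutex are illustrative names, not from the slides.

#include <pthread.h>

pthread_mutex_t cv_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv       = PTHREAD_COND_INITIALIZER;
int ready = 0;                              /* the condition being waited on */

void wait_until_ready(void)
{
    pthread_mutex_lock(&cv_mutex);
    while (!ready)                          /* loop: wakeups can be spurious */
        pthread_cond_wait(&cv, &cv_mutex);  /* atomically unlocks, sleeps, relocks */
    pthread_mutex_unlock(&cv_mutex);
}

void announce_ready(void)
{
    pthread_mutex_lock(&cv_mutex);
    ready = 1;
    pthread_cond_broadcast(&cv);            /* wake every waiting thread */
    pthread_mutex_unlock(&cv_mutex);
}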

12 Computing pi (Pthreads): Declarations
/* pgm.c */
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#define nThreads 4
#define nSamples 1000000   /* the value was lost in this transcript; 1000000 is a placeholder */

typedef struct _shared_value {
  pthread_mutex_t lock;
  int value;
} shared_value;

shared_value sval;

13 Function in each thread
void *doWork(void *id)
{
  size_t tid = (size_t) id;
  int nsucc, ntrials, i;

  ntrials = nSamples/nThreads;
  nsucc = 0;
  srand48((long) tid);                   /* per-thread seed */
  for (i = 0; i < ntrials; i++) {
    double x = drand48();
    double y = drand48();
    if ((x*x + y*y) <= 1.0)
      nsucc++;
  }
  pthread_mutex_lock(&(sval.lock));      /* one shared update per thread, not per sample */
  sval.value += nsucc;
  pthread_mutex_unlock(&(sval.lock));
  return 0;
}
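One caveat worth noting: srand48/drand48 share a single hidden state across all threads, so the per-thread seeding above does not really give independent streams, and the concurrent drand48 calls race on that state. A common fix (a hedged sketch, not part of the original slides) is erand48 with a per-thread state buffer:

/* variant of doWork using a per-thread random state (the name doWorkReentrant is assumed) */
void *doWorkReentrant(void *id)
{
  size_t tid = (size_t) id;
  unsigned short state[3] = { 0x330E, (unsigned short) tid, (unsigned short) (tid >> 16) };
  int nsucc = 0, ntrials = nSamples/nThreads, i;

  for (i = 0; i < ntrials; i++) {
    double x = erand48(state);           /* uses only the caller's state array */
    double y = erand48(state);
    if (x*x + y*y <= 1.0)
      nsucc++;
  }
  pthread_mutex_lock(&(sval.lock));
  sval.value += nsucc;
  pthread_mutex_unlock(&(sval.lock));
  return 0;
}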

14 Main function: init lock(s), create threads, wait for threads to complete

int main(int argc, char *argv[])
{
  pthread_t tids[nThreads];
  size_t i;
  double est;

  pthread_mutex_init(&(sval.lock), NULL);      /* init lock */
  sval.value = 0;

  printf("Creating Threads\n");
  for (i = 0; i < nThreads; i++)               /* create threads */
    pthread_create(&tids[i], NULL, doWork, (void *) i);
  printf("Created Threads... waiting for them to complete\n");

  for (i = 0; i < nThreads; i++)               /* wait for threads to complete */
    pthread_join(tids[i], NULL);
  printf("Threads Completed...\n");

  est = 4.0 * ((double) sval.value / (double) nSamples);
  printf("Estimated Value of PI = %lf\n", est);
  exit(0);
}

15 Compiling: Makefile

# Makefile
# for Solaris
FLAGS = -mt
# for Origin2000
# FLAGS =

pgm: pgm.c
        cc -o pgm $(FLAGS) pgm.c -lpthread

clean:
        rm -f pgm *.o *~

16 So, do we understand the programming model?
Consider the following code, run concurrently by two processors A and B (z and t start out as 0):

Processor A:                          Processor B:
  a = 1;                                b = 1;
  if (b == 0) {                         if (a == 0) {
    if (z == 0) { z = 1; t = 1; }         if (z == 0) { z = 2; t = 2; }
    a = 0;                                b = 0;
  }                                     }

[The slide also shows a timing diagram of the memory operations each processor issues: store a, load b, load z, store z, store t on A, and store b, load a, load z, store z, store t on B.]

Expectation (with z and t starting as (0,0)): the final values can be (0,0), (1,1) or (2,2), but not (1,2) or (2,1). If each processor lets its instructions complete out of order (as long as its own results look consistent), the result can be wrong; for example, the store to a from processor A may get delayed past its load of b.
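For reference (an assumption using modern C11 atomics, not something available in the original slides' setting): one way to ask the implementation for exactly the no-visible-reordering guarantee discussed on the next slide is to make the flag variables atomic; C11 atomic loads and stores default to sequential consistency, which rules out the (1,2) and (2,1) outcomes.

#include <stdatomic.h>

atomic_int a = 0, b = 0;     /* the two flags from the slide, now atomic */
int z = 0, t = 0;

void processorA(void)
{
    atomic_store(&a, 1);                 /* seq_cst store: cannot be reordered past the load below */
    if (atomic_load(&b) == 0) {
        if (z == 0) { z = 1; t = 1; }
        atomic_store(&a, 0);
    }
}

void processorB(void)
{
    atomic_store(&b, 1);
    if (atomic_load(&a) == 0) {
        if (z == 0) { z = 2; t = 2; }
        atomic_store(&b, 0);
    }
}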

17 Sequential consistency
So, we want to state that the implementation should disallow such reordering (of one processor's instructions) as seen by other processors. That is, it is not enough for processor A to issue its operations in order; they must be seen as completed by the other processors in that same order. But we don't want to restrict the freedom of the processor any more than really necessary, or speed will suffer. Sequential consistency: a parallel program should behave as if there were one processor and one memory (and no cache), i.e., the results should be as if the instructions were interleaved in some order.

18 Sequential consistency
More precisely, the operational semantics: the machine behaves as if there were a single FIFO queue of memory operations coming from all processors (and no cache). The architect must keep this contract in mind while building a machine, but the programmer gets a concrete understanding of what to expect from their programs, and it agrees with most people's intuitions. The architect is NOT required to build such a FIFO queue, only to make sure the system behaves as if there were one.

19 Another example

Proc 1:                     Proc 2:
  a = 1;                      while (b == 0) ;   // wait
  b = 1;                      print a;

We should not see a 0 printed, right? But a and b may be in different memory modules (or caches), and the change to b may become "visible" to the second processor before the change to a. Sequential consistency forces the machine (designer) to make a visible before b is visible.
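The same guarantee can be requested explicitly in modern C (again an assumption about today's tools, not part of the original slides): making the flag b atomic and pairing a release store with an acquire load is already enough to make the write to a visible before the write to b is observed, without demanding full sequential consistency for every access.

#include <stdio.h>
#include <stdatomic.h>

int a = 0;                    /* the payload: an ordinary variable */
atomic_int b = 0;             /* the flag: atomic */

void proc1(void)              /* "Proc 1" from the slide */
{
    a = 1;
    atomic_store_explicit(&b, 1, memory_order_release);   /* publish: a=1 becomes visible first */
}

void proc2(void)              /* "Proc 2" */
{
    while (atomic_load_explicit(&b, memory_order_acquire) == 0)
        ;                     /* spin until the flag is seen */
    printf("%d\n", a);        /* guaranteed to print 1 */
}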

