Programming Shared Address Space Platforms Carl Tropper Department of Computer Science.

Slides:



Advertisements
Similar presentations
Operating Systems Semaphores II
Advertisements

Chapter 6: Process Synchronization
Pthreads & Concurrency. Acknowledgements  The material in this tutorial is based in part on: POSIX Threads Programming, by Blaise Barney.
Multi-core Programming Programming with Posix Threads.
Shared Address Space Computing: Programming Alistair Rendell See Chapter 6 or Lin and Synder, Chapter 7 of Grama, Gupta, Karypis and Kumar, and Chapter.
Computer Architecture II 1 Computer architecture II Programming: POSIX Threads OpenMP.
8-1 JMH Associates © 2004, All rights reserved Windows Application Development Chapter 10 - Supplement Introduction to Pthreads for Application Portability.
Lecture 18 Threaded Programming CPE 401 / 601 Computer Network Systems slides are modified from Dave Hollinger.
Programming Shared Address Space Platforms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel.
Threads© Dr. Ayman Abdel-Hamid, CS4254 Spring CS4254 Computer Network Architecture and Programming Dr. Ayman A. Abdel-Hamid Computer Science Department.
Race Conditions CS550 Operating Systems. Review So far, we have discussed Processes and Threads and talked about multithreading and MPI processes by example.
CSCI-6964: High Performance Parallel & Distributed Computing (HPDC) AE 216, Mon/Thurs 2-3:20 p.m. Pthreads (reading Chp 7 thru 7.9) Prof. Chris Carothers.
Netprog Threads Programming1 Threads Programming Refs: Chapter 23.
UNIT -6 PROGRAMMING SHARED ADDRESS SPACE PLATFORMS THREAD BASICS PREPARED BY:-H.M.PATEL.
Jaehyuk Huh Computer Science, KAIST
Pthread (continue) General pthread program structure –Encapsulate parallel parts (can be almost the whole program) in functions. –Use function arguments.
Introduction to Pthreads. Pthreads Pthreads is a POSIX standard for describing a thread model, it specifies the API and the semantics of the calls. Model.
CS 346 – Chapter 4 Threads –How they differ from processes –Definition, purpose Threads of the same process share: code, data, open files –Types –Support.
Threads and Thread Control Thread Concepts Pthread Creation and Termination Pthread synchronization Threads and Signals.
Parallel Programming with PThreads. Threads  Sometimes called a lightweight process  smaller execution unit than a process  Consists of:  program.
CS345 Operating Systems Threads Assignment 3. Process vs. Thread process: an address space with 1 or more threads executing within that address space,
Programming with POSIX* Threads Intel Software College.
Super computers By: Lecturer \ Aisha Dawood. Overview of Programming Models Programming models provide support for expressing concurrency and synchronization.
1 Pthread Programming CIS450 Winter 2003 Professor Jinhua Guo.
POSIX Synchronization Introduction to Operating Systems: Discussion Module 5.
IT 325 Operating systems Chapter6.  Threads can greatly simplify writing elegant and efficient programs.  However, there are problems when multiple.
Pthreads.
1 CMSC421: Principles of Operating Systems Nilanjan Banerjee Principles of Operating Systems Acknowledgments: Some of the slides are adapted from Prof.
CS252: Systems Programming Ninghui Li Based on Slides by Prof. Gustavo Rodriguez-Rivera Topic 11: Thread-safe Data Structures, Semaphores.
U NIVERSITY OF M ASSACHUSETTS A MHERST Department of Computer Science Software Systems Advanced Synchronization Emery Berger and Mark Corner University.
1 Condition Variables CS 241 Prof. Brighten Godfrey March 16, 2012 University of Illinois.
Chapter 6 P-Threads. Names The naming convention for a method/function/operation is: – pthread_thing_operation(..) – Where thing is the object used (such.
CS252: Systems Programming Ninghui Li Based on Slides by Prof. Gustavo Rodriguez-Rivera Topic 13: Condition Variable, Read/Write Lock, and Deadlock.
Programming Shared Address Space Platforms Adapted from Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar ``Introduction to Parallel Computing'',
Super computers By: Lecturer \ Aisha Dawood. Overview of Programming Models Programming models provide support for expressing concurrency and synchronization.
PThread Synchronization. Thread Mechanisms Birrell identifies four mechanisms commonly used in threading systems –Thread creation –Mutual exclusion (mutex)
4.1 Introduction to Threads Overview Multithreading Models Thread Libraries Threading Issues Operating System Examples Windows XP Threads Linux Threads.
Thread Basic Thread operations include thread creation, termination, synchronization, data management Threads in the same process share:  Process address.
COMP7330/7336 Advanced Parallel and Distributed Computing Pthread RW Locks - Implementation Dr. Xiao Qin Auburn University
Working with Pthreads. Operations on Threads int pthread_create (pthread_t *thread, const pthread_attr_t *attr, void * (*routine)(void*), void* arg) Creates.
Programming with Threads. Threads  Sometimes called a lightweight process  smaller execution unit than a process  Consists of:  program counter 
COMP7330/7336 Advanced Parallel and Distributed Computing Midterm Exam - Review Dr. Xiao Qin Auburn University
pThread synchronization
COMP7330/7336 Advanced Parallel and Distributed Computing POSIX Thread API Dr. Xiao Qin Auburn University
COMP7330/7336 Advanced Parallel and Distributed Computing POSIX Thread API Dr. Xiao Qin Auburn University
Case Study: Pthread Synchronization Dr. Yingwu Zhu.
COMP7330/7336 Advanced Parallel and Distributed Computing Pthread RW Locks – An Application Dr. Xiao Qin Auburn University
COMP7330/7336 Advanced Parallel and Distributed Computing Thread Basics Dr. Xiao Qin Auburn University
Background on the need for Synchronization
Programming Shared-memory Machines
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Thread Basics Dr. Xiao Qin Auburn University.
PThreads.
Principles of Operating Systems Lecture 11
Threads Threads.
Netprog: Threads Programming
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Pthread Mutex Dr. Xiao Qin Auburn University.
Multithreading Tutorial
Dr. Xiao Qin Auburn University
Programming Shared Address Space Platforms
Lecture 14: Pthreads Mutex and Condition Variables
Programming Shared Address Space Platforms
Multithreading Tutorial
Read-Write Locks typedef struct { int readers; int writer;
Pthread Prof. Ikjun Yeom TA – Mugyo
Multithreading Tutorial
Multithreading Tutorial
Lecture 14: Pthreads Mutex and Condition Variables
CS 144 Advanced C++ Programming May 7 Class Meeting
“The Little Book on Semaphores” Allen B. Downey
POSIX Threads(pthreads)
Presentation transcript:

Programming Shared Address Space Platforms Carl Tropper Department of Computer Science

Threads On a shared memory architecture, processes are expensive To maintain-they require that all data associated with a process is private To manipulate-overhead for scheduling processes and maintaining security is significant Enter threads, which assume that all memory is globalf Except for stacks associated with the thread Threads are invoked via function calls and are created by a [C] function create_thread

Threads-an example

Logical memory model-a picture Logical model treats thread stack as local data

Why threads? Portable Can develop application on a serial machine and then run it on a parallel machine. Can migrate between different shared memory platforms. Latency While one thread waits for communications, another can use the CPU I/O, memory access latencies can also be hidden Support for load balancing and scheduling in API’s Result- easier to get good performance then message passing, development tools for POSIX threads are widespread and stable

POSIX Thread API IEEE Standard, supported by most vendors Pthreads for short Other thread packages-NT threads, Solaris threads, Java threads Example-compute π by Generating random numbers Counting how many fall within the largest circle which can be inscribed in a unit square (area=π/4) Idea of program-a bunch of threads, each has a fixed number of random points, each counts how many are in the circle

Threads-pthread_create Pthread_create creates threads Invokes thread function include int pthread_create ( pthread_t *thread_handle, const pthread_attr_t *attribute, void * (*thread_function)(void *), void *arg); Thread_handle points to thread ID Thread has attributes specified by attribute argument (later). If null, then get a default thread Arg passes important stuff to thread, e.g. integer which serves as the random seed Pthread_create returns 0 after successful creation of thread

Threads-Pthread_join Pthread_join provides mechanism to communicate results produced by individual threads int pthread_join ( pthread_t thread, void **ptr); Suspends execution of the calling program until thread with ID given by thread terminates Upon completion of pthread_join, the value passed to pthread.exit is returned in the location pointed to by ptr Pthread_join returns 0 upon successful completion

Compute Pi 1 #include #define MAX_THREADS 512 void *compute_pi (void *);.... main() {... pthread_t p_threads[MAX_THREADS]; pthread_attr_t attr; pthread_attr_init (&attr); for (i=0; i< num_threads; i++) { hits[i] = i; pthread_create(&p_threads[i], &attr, compute_pi, (void *) &hits[i]); } for (i=0; i< num_threads; i++) { pthread_join(p_threads[i], NULL); total_hits += hits[i]; }... }

Compute Pi 2 void *compute_pi (void *s) { int seed, i, *hit_pointer; double rand_no_x, rand_no_y; int local_hits; hit_pointer = (int *) s; seed = *hit_pointer; local_hits = 0; for (i = 0; i < sample_points_per_thread; i++) { rand_no_x =(double)(rand_r(&seed))/(double)((2<<14)-1); rand_no_y =(double)(rand_r(&seed))/(double)((2<<14)-1); if (((rand_no_x - 0.5) * (rand_no_x - 0.5) + (rand_no_y - 0.5) * (rand_no_y - 0.5)) < 0.25) local_hits ++; seed *= i; } *hit_pointer = local_hits; pthread_exit(0); }

Threads are so sensitive Could increase local_hits outside (instead of inside) the main loop in compute_pi by changing localHits++ to *(hit_pointer)++ and deleting *hit_pointer=local_hits outside the loop But performance gets whacked because adjacent items in a shared cache line of the hit array are getting written to, causing invalidates to be issued (“False sharing”) Solution- change hits to a 2-D array. Arrays are stored row major, so cache lines won’t be shared because we force distance between the entries

False sharing and the solution Spaced_1 is the original change Spaced_16=16 integer 2nd column Spaced_32 correspond to 32 integer 2nd column

Synchronization Primitives Mutual exclusion for shared variables Race conditions-for example If (my_cost<best cost) best_cost=my_cost If best cost is initially 100, there are two threads with my_cost =50,75, we have a non-deterministic outcome ! A race determines the value of best_cost So threaded API’s provide support for critical sections and atomic operations

Mutex_locks Mutex_locks (mutual exclusion locks) Atomic operation associated with code or a critical section Thread tries to get lock-if it is locked, the thread is blocked Thread leaving the cs has to unlock the mutex_lock P_threads provides functions for dealing with locks

Mutex_locks-the 3 maidens Int Pthread_mutex_lock { pthread_mutex_t *mutex_lock}; attempts a lock on mutex_lock, blocks if mutex_lock is blocked success returns 0 Int Pthread_mutex_unlock{ pthread_mutex_t *mutex_lock}; Mutex_lock is released and a blocked thread is allowed in Int Pthread_mutex_init{ pthread_mutex_t *mutex_lock, const pthread_mutexattr_t *lock_attr}; Unlocks mutex_lock. Atributes of mutex_lock are specified in in *lock_attr. NULL supplies default values Atributes of mutex_lock are specified in in *lock_attr. NULL supplies default values.

Compute the minimum of a list of integers We can now write our previously incorrect code segment as: pthread_mutex_t minimum_value_lock;... main() {.... pthread_mutex_init(&minimum_value_lock, NULL);.... } void *find_min(void *list_ptr) {.... pthread_mutex_lock(&minimum_value_lock); if (my_min < minimum_value) minimum_value = my_min; /* and unlock the mutex */ pthread_mutex_unlock(&minimum_value_lock); }

Producer Consumer Work Queues Scenario-producer thread creates tasks and puts them in work queues. Consumer threads grab tasks and executes them Can use shared data structure for the tasks, but…… Producer thread must not overwrite buffer before consumer thread picks up task Consumer threads must not pick up tasks which don’t exist Pick up one task at a time

Producer Consumer Task_available =1 producer task has to wait to produce =0 consumer thread can pick up task Need to protect operations on task available variable

Producer Code pthread_mutex_t task_queue_lock; int task_available;... main() {.... task_available = 0; pthread_mutex_init(&task_queue_lock, NULL);.... } void *producer(void *producer_thread_data) {.... while (!done()) { inserted = 0; create_task(&my_task); while (inserted == 0) { pthread_mutex_lock(&task_queue_lock); if (task_available == 0) { insert_into_queue(my_task); task_available = 1; inserted = 1; } pthread_mutex_unlock(&task_queue_lock); }

Consumer Code void *consumer(void *consumer_thread_data) { int extracted; struct task my_task; /* local data structure declarations */ while (!done()) { extracted = 0; while (extracted == 0) { pthread_mutex_lock(&task_queue_lock); if (task_available == 1) { extract_from_queue(&my_task); task_available = 0; extracted = 1; } pthread_mutex_unlock(&task_queue_lock); } process_task(my_task); }

Overheads of locking Size of critical section matters-only one thread at a time is allowed in the section, i.e. the program becomes serial What to do? Instead of blocking a thread when mutex is locked, allow thread to do other work and poll mutex to see if it is unlocked Int Pthread_mutex_trylock{ pthread_mutex_t *mutex_lock}

K matches in a list /* Finding k matches in a list */ void *find_entries(void *start_pointer) { /* This is the thread function */ struct database_record *next_record; int count; current_pointer = start_pointer; do { next_record = find_next_entry(current_pointer); count = output_record(next_record); } while (count < requested_number_of_records); } int output_record(struct database_record *record_ptr) { int count; pthread_mutex_lock(&output_count_lock); output_count ++; count = output_count; pthread_mutex_unlock(&output_count_lock); if (count <= requested_number_of_records) print_record(record_ptr); return (count); }

K matches (continued) /* rewritten output_record function */ int output_record(struct database_record *record_ptr) { int count; int lock_status; lock_status=pthread_mutex_trylock(&output_count_lock); if (lock_status == EBUSY) { insert_into_local_list(record_ptr); return(0); } else { count = output_count; output_count += number_on_local_list + 1; pthread_mutex_unlock(&output_count_lock); print_records(record_ptr, local_list, requested_number_of_records - count); return(count + number_on_local_list + 1); }

Condition variables Polling takes time. In the producer-consumer example, the producer and consumer threads have to poll for a lock. An easier idea would be for the consumer thread to signal the producer thread when buffer space is available. Enter condition variables-a thread can block itself if a predicate becomes true Task_available==1 in producer-consumer Thread locks mutex on condition variable, and tests the predicate-if the predicate is false, the thread waits on the condition variable

Condition variables Int pthread_cond_wait(pthread_cond_t *cond,pthread_mutex_t *mutex); is the function used to wait on the condition variable The function blocks the thread until it receives a signal from the OS or from another thread Releases the lock on mutex-have to do this in order for other threads to be able to change the predicate Thread enters a queue When a condition becomes true, it is signaled using pthread_cond_signal, unblocking one of the threads in the queue

Producer Consumer Producer produces task, inserts task on queue, sets task-available=1 Sends signal to consumer thread via Int pthread_cond_signal(pthread_cond_t *cond); Releases lock on mutex via pthread_mutex_unlock Initialize condition variable via Int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr); Destroy condition variable via Int pthread_cond_destroy(pthread_cond_t *cond)

Producer Consumer Using Condition Variables pthread_cond_t cond_queue_empty, cond_queue_full; pthread_mutex_t task_queue_cond_lock; int task_available; /* other data structures here */ main() { /* declarations and initializations */ task_available = 0; pthread_init(); pthread_cond_init(&cond_queue_empty, NULL); pthread_cond_init(&cond_queue_full, NULL); pthread_mutex_init(&task_queue_cond_lock, NULL); /* create and join producer and consumer threads */ }

Producer Consumer Using Condition Variables II /*create and join producer and consumer threads*/ void *producer(void *producer_thread_data) { int inserted; while (!done()) { create_task(); pthread_mutex_lock(&task_queue_cond_lock); while (task_available == 1) pthread_cond_wait(&cond_queue_empty, &task_queue_cond_lock); insert_into_queue(); task_available = 1; pthread_cond_signal(&cond_queue_full); pthread_mutex_unlock(&task_queue_cond_lock); }

Producer Consumer Using Condition Variables II void *consumer(void *consumer_thread_data) { while (!done()) { pthread_mutex_lock(&task_queue_cond_lock); while (task_available == 0) pthread_cond_wait(&cond_queue_full, &task_queue_cond_lock); my_task = extract_from_queue(); task_available = 0; pthread_cond_signal(&cond_queue_empty); pthread_mutex_unlock(&task_queue_cond_lock); process_task(my_task); }

Controlling Thread and Synchronization Attributes Threads have different attributes (scheduling algorithms, stack sizes….) Attributes object has default attributes- programmer can modify them Advantage-system implements the attributes easier to read easier to change attributes easier to port to another OS

Attribute Objects for threads Use pthread_attr_init to create an attributes object (pointed to by *attr)with default values pthread_attr_destroy takes out attributes object Individual properties associated with the attributes object can be changed using the following functions: pthread_attr_setdetachstate, pthread_attr_setguardsize_np, pthread_attr_setstacksize, pthread_attr_setinheritsched, pthread_attr_setschedpolicy, pthread_attr_setschedparam

Attribute objects for mutexes Pthreads supports 3 types of locks Normal-the default brand. Deadlocks if a thread with a lock attempts to lock it again. Recursive code can cause this to happen. Recursive mutex-single thread can lock a mutex multiple times » Each time it locks the mutex a counter is incremented »each time it unlocks the mutex, a counter is decremented. »Any other thread can lock only if the counter==0 Error-check mutex-like a normal mutex, only it returns an error if a thread tries to lock a mutex twice. Good for debugging.

Attribute Objects for mutexes Initialize the attributes object using function: pthread_mutexattr_init. The function pthread_mutexattr_settype_np can be used for setting the type of mutex specified by the mutex attributes object. pthread_mutexattr_settype_np ( pthread_mutexattr_t *attr, int type); Here, type specifies the type of the mutex and can take one of: – PTHREAD_MUTEX_NORMAL_NP – PTHREAD_MUTEX_RECURSIVE_NP – PTHREAD_MUTEX_ERRORCHECK_NP

Thread cancellation Sometimes it is necessary to cancel thread(s)- threads which evaluate different strategies in parallel need to go after best is chosen, e.g. chess Use pthread_cancel (pthread_t thread); Idea-a thread can cancel itself or other threads

Composite Synchronization Structures Look at Useful “higher-level” constructs Useful because they meet commonly occurring needs Constructs built by combining basic synch structures Look at –Read-write locks –Barriers

Read-write locks Think of readers and writers in a critical section Model-many readers, few writers Idea-let the readers read concurrently, but serialize the writers Read lock is granted if there are only readers If there is a write lock (or queued writes), thread does a condition_wait

Read write locks Read write locks based on a data structure mylib_rwlock_t containing the following –a count of the number of readers, –the writer (a 0/1 integer specifying whether a writer is present), –a condition variable readers_proceed that is signaled when readers can proceed, –a condition variable writer_proceed that is signaled when one of the writers can proceed, –a count pending_writers of pending writers, and –a mutex read_write_lock associated with the shared data structure

Read-write locks II Mylib_rwlock_rlock first checks for a write lock or if there are writers queueing. If so, then it does a condition_wait on readers_proceed, else it grants read lock and increments the count of readers Mylib_rwlock_wlock checks for readers or writers-if so, it increments the count of writers and does a condition_wait on writer_proceed, else it grants the lock Mylib_rwlock_unlock

Read-write locks typedef struct { int readers; int writer; pthread_cond_t readers_proceed; pthread_cond_t writer_proceed; int pending_writers; pthread_mutex_t read_write_lock; } mylib_rwlock_t; void mylib_rwlock_init (mylib_rwlock_t *l) { l -> readers = l -> writer = l -> pending_writers = 0; pthread_mutex_init(&(l -> read_write_lock), NULL); pthread_cond_init(&(l -> readers_proceed), NULL); pthread_cond_init(&(l -> writer_proceed), NULL); }

Read-write locks code II void mylib_rwlock_rlock(mylib_rwlock_t *l) { /* if there is a write lock or pending writers, perform condition wait.. else increment count of readers and grant read lock */ pthread_mutex_lock(&(l -> read_write_lock)); while ((l -> pending_writers > 0) || (l -> writer > 0)) pthread_cond_wait(&(l -> readers_proceed), &(l -> read_write_lock)); l -> readers ++; pthread_mutex_unlock(&(l -> read_write_lock)); }

Read-write locks code III void mylib_rwlock_wlock(mylib_rwlock_t *l) { /* if there are readers or writers, increment pending writers count and wait. On being woken, decrement pending writers count and increment writer count */ pthread_mutex_lock(&(l -> read_write_lock)); while ((l -> writer > 0) || (l -> readers > 0)) { l -> pending_writers ++; pthread_cond_wait(&(l -> writer_proceed), &(l -> read_write_lock)); } l -> pending_writers --; l -> writer ++; pthread_mutex_unlock(&(l -> read_write_lock)); }

Barriers Can be implemented using counter, mutex,condition variable. Use an integer to keep track of the number of threads which reach the barrier If count<number of threads, threads execute condition wait If count==number of threads, last thread wakes up all the threads using a condition broadcast

Barrier code typedef struct { pthread_mutex_t count_lock; pthread_cond_t ok_to_proceed; int count; } mylib_barrier_t; void mylib_init_barrier(mylib_barrier_t *b) { b -> count = 0; pthread_mutex_init(&(b -> count_lock), NULL); pthread_cond_init(&(b -> ok_to_proceed), NULL); }

Barrier code II void mylib_barrier (mylib_barrier_t *b, int num_threads) { pthread_mutex_lock(&(b -> count_lock)); b -> count ++; if (b -> count == num_threads) { b -> count = 0; pthread_cond_broadcast(&(b -> ok_to_proceed)); } else while (pthread_cond_wait(&(b -> ok_to_proceed), &(b -> count_lock)) != 0); pthread_mutex_unlock(&(b -> count_lock)); }

Barrier thoughts Lower bound on execution time is O(n), since broadcast message follows queue Speed things up by using a tree to implement the barriers Pair threads up so that each pair shares a single condition variable mutex pair A designated member of the pair waits for both threads to arrive at the barrier Continue until there is one thread Releasing threads requires signaling n/2 condition variables

Performance on SGI Origin 2000

Designing asynchronous programs- things that go wrong Thread T1 creates thread T2.It wants to put data to send to T2 in global memory T1 gets switched after creating T2 but before placing data.T2 might look for data and get garbage. T1 creates T2, which is on T1’s stack. T1 passes data to T1, but finishes before T2 is scheduled. Stack is destroyed and T2 never gets its data.