
Portable Operating System Interface Thread
Yukai Hung
Department of Mathematics, National Taiwan University

POSIX Thread Basic

3 What is a process? What is a thread?
- a thread of execution is the smallest unit of processing that the operating system can schedule, and it is contained inside a process
- multiple threads can exist within the same process and share its resources, while different processes do not share resources

How to create a new process?
- use the system function fork(), which creates a copy of the calling process
- the parent and child process can tell each other apart by examining the return value of fork() (non-zero in the parent, zero in the child)
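
A minimal sketch of this fork() return-value convention (an illustrative addition, not part of the original slides):

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid=fork();                        /* create a copy of the calling process */

    if(pid==0)
        printf("child process\n");           /* fork() returned zero: child side     */
    else
    {
        printf("parent of child %d\n",pid);  /* fork() returned the child pid        */
        wait(NULL);                          /* wait for the child to terminate      */
    }

    return 0;
}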

4 POSIX Thread Basic

int pthread_create(…)
create a new thread with the specified thread attributes and execute the thread function with the specified function arguments

void pthread_exit(…)
terminate the current calling thread and make the return value pointer available to any successful join with the terminating thread

int pthread_join(…)
suspend the execution of the current calling thread until the target thread terminates, unless the target thread has already terminated
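
For reference, the POSIX prototypes of these three functions as declared in <pthread.h>:

int  pthread_create(pthread_t* thread,const pthread_attr_t* attr,
                    void* (*start_routine)(void*),void* arg);
void pthread_exit(void* value_ptr);
int  pthread_join(pthread_t thread,void** value_ptr);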

5 POSIX Thread Basic

#include <stdio.h>
#include <pthread.h>

void* tfunction(void* input);

int main(int argc,char** argv)
{
    int input1;
    int input2;
    int error1;
    int error2;

    void* return1;
    void* return2;

    pthread_t thread1;
    pthread_t thread2;

    input1=1;
    input2=2;

    error1=pthread_create(&thread1,NULL,tfunction,(void*)&input1);
    error2=pthread_create(&thread2,NULL,tfunction,(void*)&input2);

6 POSIX Thread Basic

    if(error1!=0||error2!=0)
        printf("Error:thread create\n");

    error1=pthread_join(thread1,&return1);
    error2=pthread_join(thread2,&return2);

    if(error1!=0||error2!=0)
        printf("Error:thread join\n");

    printf("thread 1 return %ld\n",(long)return1);
    printf("thread 2 return %ld\n",(long)return2);

    return 0;
}

7 POSIX Thread Basic

void* tfunction(void* input)
{
    printf("thread %d is executing\n",*((int*)input));
    pthread_exit((void*)1);
}
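
Assembling slides 5-7 into one source file, a typical build and run (the file name is arbitrary):

gcc pthread_basic.c -o pthread_basic -pthread
./pthread_basic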

8 POSIX Thread Basic

int pthread_equal(…)
compare two thread handles for equality

pthread_t pthread_self(…)
return the thread handle of the current calling thread

int pthread_cancel(…)
request that the thread be canceled; the target thread's cancelability state and type determine when the cancellation takes effect
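
Their prototypes in <pthread.h>:

int       pthread_equal(pthread_t t1,pthread_t t2);
pthread_t pthread_self(void);
int       pthread_cancel(pthread_t thread);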

9 POSIX Thread Basic

void pthread_cleanup_push(…)
push the specified cancellation cleanup handler routine onto the calling thread's cancellation cleanup stack

void pthread_cleanup_pop(…)
remove the routine at the top of the calling thread's cancellation cleanup stack and optionally invoke it (if the input is non-zero)

10 POSIX Thread Basic

#include <stdio.h>
#include <pthread.h>

void* tfunction(void* input);
void cleanup(void* string);

int main(int argc,char** argv)
{
    void* rvalue;

    pthread_t thread;

    if(pthread_create(&thread,NULL,tfunction,(void*)1)!=0)
        printf("Error:thread create\n");

    if(pthread_join(thread,&rvalue)!=0)
        printf("Error:thread join\n");

    printf("thread return %ld\n",(long)rvalue);

    return 0;
}

11 POSIX Thread Basic

void* tfunction(void* input)
{
    printf("thread start\n");

    pthread_cleanup_push(cleanup,"thread first handler");
    pthread_cleanup_push(cleanup,"thread second handler");

    printf("thread push complete\n");

    /* push and pop are macros that must be balanced in the same scope; */
    /* pop both handlers and invoke them (non-zero argument)            */
    pthread_cleanup_pop(1);
    pthread_cleanup_pop(1);

    return (void*)1;
}

void cleanup(void* string)
{
    printf("cleanup:%s\n",(char*)string);
    return;
}

Race Condition and Mutex Lock

13 Race Condition and Mutex Lock

Consider the following parallel program
- the two threads' instructions can interleave in a different order on every run

[figure: two threads each execute R = R + 1 on a shared variable R]

14 Race Condition and Mutex Lock

Scenario 1
- the result value R is 2 if the initial value R is 1 (both threads read R before either writes back, so one increment is lost)

15 Race Condition and Mutex Lock

Scenario 2
- the result value R is 2 if the initial value R is 1 (a different interleaving in which one update is again overwritten)

16 Race Condition and Mutex Lock

Scenario 3
- the result value R is 3 if the initial value R is 1 (the two increments execute one after the other, so both take effect)

17 Race Condition and Mutex Lock

Solve the race condition by locking
- manage the shared resource between threads
- avoid deadlock or imbalance problems

18 Race Condition and Mutex Lock

Guarantee that the instructions execute in the correct order
- the protected section is effectively back to a sequential procedure
- the lock and release operations have high overhead

19 Race Condition and Mutex Lock

Solve the race condition by semaphore
- a multi-valued locking method (an extension of binary locking)
- the instructions in the P and V procedures are atomic operations
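
The slides do not show semaphore code; a minimal sketch using the POSIX semaphore interface, in which sem_wait corresponds to P and sem_post to V:

#include <semaphore.h>

sem_t sem;

/* binary semaphore: pshared=0 means shared between the threads
   of one process, and the initial value is 1                   */
sem_init(&sem,0,1);

sem_wait(&sem);      /* P: decrement, block while the value is zero */
/* ...critical section... */
sem_post(&sem);      /* V: increment, wake one blocked thread       */

sem_destroy(&sem);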

20 Race Condition and Mutex Lock

#include <stdio.h>
#include <pthread.h>

void* tfunction(void* input);

int main(int argc,char** argv)
{
    int value;
    int error1;
    int error2;

    pthread_t thread1;
    pthread_t thread2;

    value=0;

    error1=pthread_create(&thread1,NULL,tfunction,(void*)&value);
    error2=pthread_create(&thread2,NULL,tfunction,(void*)&value);

    if(error1!=0||error2!=0)
        printf("Error:thread create\n");

21 Race Condition and Mutex Lock

    error1=pthread_join(thread1,NULL);
    error2=pthread_join(thread2,NULL);

    if(error1!=0||error2!=0)
        printf("Error:thread join\n");

    printf("final result is %d\n",value);

    return 0;
}

void* tfunction(void* input)
{
    *((int*)input)=*((int*)input)+1;
    return NULL;
}

22 Race Condition and Mutex Lock

int pthread_mutex_init(…)
initialize the mutex referenced by mutex with the specified attributes; initializing an already initialized mutex results in undefined behavior

int pthread_mutex_destroy(…)
destroy a previously initialized mutex; the mutex must not be used after it has been destroyed

23 Race Condition and Mutex Lock

int pthread_mutex_lock(…)
lock the specified initialized mutex; if the mutex is already locked, the calling thread blocks until the mutex becomes available

int pthread_mutex_unlock(…)
unlock the specified mutex; if threads are blocked on the mutex when the unlock function is called, the mutex becomes available and the scheduling policy determines which thread acquires it

int pthread_mutex_trylock(…)
try to lock the specified mutex; if the mutex is already locked, an error is returned immediately, otherwise the function returns with the mutex in the locked state and the calling thread as its owner
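
Prototypes in <pthread.h>:

int pthread_mutex_lock(pthread_mutex_t* mutex);
int pthread_mutex_trylock(pthread_mutex_t* mutex);
int pthread_mutex_unlock(pthread_mutex_t* mutex);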

24 Race Condition and Mutex Lock

#include <stdio.h>
#include <pthread.h>

void* tfunction(void* input);

pthread_mutex_t work_mutex;

int main(int argc,char** argv)
{
    int value;
    int error1;
    int error2;

    pthread_t thread1;
    pthread_t thread2;

    value=0;

    if(pthread_mutex_init(&work_mutex,NULL)!=0)
        printf("Error:work mutex create\n");

25 Race Condition and Mutex Lock

    error1=pthread_create(&thread1,NULL,tfunction,(void*)&value);
    error2=pthread_create(&thread2,NULL,tfunction,(void*)&value);

    if(error1!=0||error2!=0)
        printf("Error:thread create\n");

    error1=pthread_join(thread1,NULL);
    error2=pthread_join(thread2,NULL);

    if(error1!=0||error2!=0)
        printf("Error:thread join\n");

    printf("final result is %d\n",value);

    if(pthread_mutex_destroy(&work_mutex)!=0)
        printf("Error:work mutex destroy\n");

    return 0;
}

26 Race Condition and Mutex Lock

void* tfunction(void* input)
{
    if(pthread_mutex_lock(&work_mutex)!=0)
        printf("Error:lock work mutex\n");

    *((int*)input)=*((int*)input)+1;

    if(pthread_mutex_unlock(&work_mutex)!=0)
        printf("Error:work mutex unlock\n");

    return NULL;
}

Signal and Condition Variable

28 Signal and Condition Variable

int pthread_cond_init(…)
initialize the condition variable referenced by cond with the specified attributes; initializing an already initialized condition variable results in undefined behavior

int pthread_cond_destroy(…)
destroy a previously initialized condition variable; the condition variable must not be used after it has been destroyed
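
The example on the following slides also uses the wait and signal functions; their prototypes in <pthread.h>:

int pthread_cond_wait(pthread_cond_t* cond,pthread_mutex_t* mutex);
int pthread_cond_signal(pthread_cond_t* cond);
int pthread_cond_broadcast(pthread_cond_t* cond);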

29 Signal and Condition Variable

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

void* fthread1(void* input);
void* fthread2(void* input);

int loop=1;

pthread_cond_t cond;
pthread_mutex_t mutex;

int main(int argc,char** argv)
{
    pthread_t thread1;
    pthread_t thread2;

    pthread_cond_init(&cond,NULL);
    pthread_mutex_init(&mutex,NULL);

    pthread_create(&thread1,NULL,fthread1,(void*)NULL);
    pthread_create(&thread2,NULL,fthread2,(void*)NULL);

    pthread_join(thread1,NULL);
    pthread_join(thread2,NULL);

    pthread_cond_destroy(&cond);
    pthread_mutex_destroy(&mutex);

    return 0;
}

30 Signal and Condition Variable

void* fthread1(void* input)
{
    for(loop=1;loop<=9;loop++)
    {
        pthread_mutex_lock(&mutex);

        if(loop%3==0)
            pthread_cond_signal(&cond);
        else
            printf("thread1:%d\n",loop);

        pthread_mutex_unlock(&mutex);

        sleep(1);
    }

    return NULL;
}

31 Signal and Condition Variable

void* fthread2(void* input)
{
    while(loop<9)
    {
        pthread_mutex_lock(&mutex);

        /* a while loop around the wait would also guard against spurious wakeups */
        if(loop%3!=0)
            pthread_cond_wait(&cond,&mutex);

        printf("thread2:%d\n",loop);

        pthread_mutex_unlock(&mutex);

        sleep(1);
    }

    return NULL;
}

Multiple Thread and Multiple GPU

33 Multiple Thread and Multiple GPU

A host thread can maintain one context at a time
- as many host threads as GPUs are needed to use all devices
- multiple host threads can establish contexts with the same GPU; the hardware driver handles time-sharing and resource partitioning

[diagram: host threads 0, 1, and 2 share host memory, and each holds a context on device 0, 1, or 2]

34 Multiple Thread and Multiple GPU

cudaGetDeviceCount(…)
returns the number of devices on the current system with compute capability greater than or equal to 1.0 that are available for execution

cudaSetDevice(…)
sets the specific device on which the active host thread executes the device code; if the host thread has already initialized the CUDA runtime by calling non-device-management runtime functions, an error is returned
must be called prior to context creation; it fails if a context has already been established; one can force context creation with cudaFree(0)

cudaGetDevice(…)
returns the device on which the active host thread executes the code
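
A minimal sketch of this device-selection pattern (illustrative; the complete multi-threaded example follows on the next slides):

int devicecount=0;
cudaGetDeviceCount(&devicecount);   /* how many CUDA devices exist        */

cudaSetDevice(0);                   /* select device 0 before any context */
cudaFree(0);                        /* optional: force context creation   */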

35 Multiple Thread and Multiple GPU

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <pthread.h>

#define MaxDevice 8

struct pthread_c
{
    int index;       /* device index for this host thread        */
    int subsz;       /* number of elements handled by the thread */
    float* hveca;
    float* hvecb;
    float* hvecc;
};

void* tfunction(void* content);

__global__ void vecAdd(float* veca,float* vecb,float* vecc,int size);

int main(int argc,char** argv)
{
    int size;
    int loop;
    int devicecount;

    float* h_veca;
    float* h_vecb;
    float* h_vecc;

    pthread_t threadt[MaxDevice];
    pthread_c threadc[MaxDevice];

    size=32000*4;

    h_veca=(float*)malloc(sizeof(float)*size);
    h_vecb=(float*)malloc(sizeof(float)*size);
    h_vecc=(float*)malloc(sizeof(float)*size);

36 Multiple Thread and Multiple GPU

    for(loop=0;loop<size;loop++)
    {
        h_veca[loop]=1.0f;
        h_vecb[loop]=2.0f;
        h_vecc[loop]=0.0f;
    }

    cudaGetDeviceCount(&devicecount);
    devicecount=(devicecount>MaxDevice)?MaxDevice:devicecount;

    printf("device number is %d\n",devicecount);

    for(loop=0;loop<devicecount;loop++)
    {
        threadc[loop].index=loop;
        threadc[loop].subsz=size/devicecount;
        threadc[loop].hveca=h_veca+loop*threadc[loop].subsz;
        threadc[loop].hvecb=h_vecb+loop*threadc[loop].subsz;
        threadc[loop].hvecc=h_vecc+loop*threadc[loop].subsz;
    }

    for(loop=0;loop<devicecount;loop++)
        pthread_create(threadt+loop,NULL,tfunction,(void*)(threadc+loop));

37 Multiple Thread and Multiple GPU

    for(loop=0;loop<devicecount;loop++)
        pthread_join(threadt[loop],NULL);

    for(loop=0;loop<size;loop++)
        if(h_vecc[loop]!=3.0f)
            printf("Error:check result\n");

    free(h_veca);
    free(h_vecb);
    free(h_vecc);

    return 0;
}

38 Multiple Thread and Multiple GPU

void* tfunction(void* content)
{
    int index;
    int subsz;
    int gsize;
    int bsize;

    float *hveca,*dveca;
    float *hvecb,*dvecb;
    float *hvecc,*dvecc;

    index=(*((pthread_c*)content)).index;
    subsz=(*((pthread_c*)content)).subsz;
    hveca=(*((pthread_c*)content)).hveca;
    hvecb=(*((pthread_c*)content)).hvecb;
    hvecc=(*((pthread_c*)content)).hvecc;

    printf("thread %d start!\n",index);

    //for(int loop=0;loop<subsz;loop++)
    //    hvecc[loop]=hveca[loop]+hvecb[loop];

    cudaSetDevice(index);

39 Multiple Thread and Multiple GPU

    cudaMalloc((void**)&dveca,sizeof(float)*subsz);
    cudaMalloc((void**)&dvecb,sizeof(float)*subsz);
    cudaMalloc((void**)&dvecc,sizeof(float)*subsz);

    cudaMemcpy(dveca,hveca,sizeof(float)*subsz,cudaMemcpyHostToDevice);
    cudaMemcpy(dvecb,hvecb,sizeof(float)*subsz,cudaMemcpyHostToDevice);

    bsize=256;
    gsize=(int)ceil((float)subsz/256);

    vecAdd<<<gsize,bsize>>>(dveca,dvecb,dvecc,subsz);

    cudaMemcpy(hvecc,dvecc,sizeof(float)*subsz,cudaMemcpyDeviceToHost);

    cudaFree(dveca);
    cudaFree(dvecb);
    cudaFree(dvecc);

    cudaError_t error;

    if((error=cudaGetLastError())!=cudaSuccess)
        printf("cudaError:%s\n",cudaGetErrorString(error));

    printf("thread %d finish!\n",index);

    return NULL;
}

40 Multiple Thread and Multiple GPU

__global__ void vecAdd(float* veca,float* vecb,float* vecc,int size)
{
    int index;

    index=blockIdx.x*blockDim.x+threadIdx.x;

    if(index<size)
        vecc[index]=veca[index]+vecb[index];

    return;
}
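
A typical way to build the assembled example from slides 35-40 (the file name is arbitrary):

nvcc vecadd.cu -o vecadd -lpthread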

41 Multiple Thread and Multiple GPU

Where is constant memory?
- the data is stored in the device global memory
- data is read through the multiprocessor constant cache
- 64KB of constant memory, with an 8KB cache per multiprocessor

How about the performance?
- optimized when a warp of threads reads the same location
- 4 bytes per cycle, broadcast to the warp of threads
- serialized when a warp of threads reads different locations
- very slow on a cache miss (the data is read from global memory)
- access latency can range from one to hundreds of clock cycles

42 Multiple Thread and Multiple GPU

How to use constant memory?
- declare constant memory at file scope (as a global variable)
- copy data to constant memory from the host (because it is constant!!)

//declare constant memory
__constant__ float cst_ptr[size];

//copy data from host to constant memory
cudaMemcpyToSymbol(cst_ptr,host_ptr,data_size);

43 Multiple Thread and Multiple GPU

//declare constant memory
__constant__ float cangle[360];

int main(int argc,char** argv)
{
    int size=3200;

    float* darray;
    float hangle[360];

    //allocate device memory
    cudaMalloc((void**)&darray,sizeof(float)*size);

    //initialize allocated memory
    cudaMemset(darray,0,sizeof(float)*size);

    //initialize angle array on host
    for(int loop=0;loop<360;loop++)
        hangle[loop]=acos(-1.0f)*loop/180.0f;

    //copy host angle data to constant memory
    cudaMemcpyToSymbol(cangle,hangle,sizeof(float)*360);

44 Constant Memory

    //execute device kernel
    //(the original launch configuration is not shown; 64 threads per block assumed)
    test_kernel<<<size/64,64>>>(darray);

    //free device memory
    cudaFree(darray);

    return 0;
}

__global__ void test_kernel(float* darray)
{
    int index;

    //calculate each thread global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    #pragma unroll 10
    for(int loop=0;loop<360;loop++)
        darray[index]=darray[index]+cangle[loop];

    return;
}

45 Multiple Thread and Multiple GPU

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <pthread.h>

#define MaxDevice 8

__constant__ float cangle[360];

struct pthread_c
{
    int index;        /* device index for this host thread */
    float* hangle;    /* host angle table                  */
    float summation;  /* expected per-element result       */
};

void* tfunction(void* content);

__global__ void kernel(float* dvector,int size);

int main(int argc,char** argv)
{
    int loop;
    int devicecount;

    float summation;
    float hangle[360];

    pthread_t threadt[MaxDevice];
    pthread_c threadc[MaxDevice];

    for(loop=0;loop<360;loop++)
        hangle[loop]=acos(-1.0f)*loop/180.0f;

    for(loop=0,summation=0.0f;loop<360;loop++)
        summation=summation+hangle[loop];

46 Multiple Thread and Multiple GPU

    cudaGetDeviceCount(&devicecount);
    devicecount=(devicecount>MaxDevice)?MaxDevice:devicecount;

    for(loop=0;loop<devicecount;loop++)
    {
        threadc[loop].index=loop;
        threadc[loop].hangle=hangle;
        threadc[loop].summation=summation;
    }

    for(loop=0;loop<devicecount;loop++)
        pthread_create(threadt+loop,NULL,tfunction,(void*)(threadc+loop));

    for(loop=0;loop<devicecount;loop++)
        pthread_join(threadt[loop],NULL);

    return 0;
}

47 Multiple Thread and Multiple GPU

void* tfunction(void* content)
{
    int size;
    int loop;
    int index;
    int gsize;
    int bsize;

    float summation;

    float* hangle;
    float* hvector;
    float* dvector;

    size=32000;

    index=(*((pthread_c*)content)).index;
    hangle=(*((pthread_c*)content)).hangle;
    summation=(*((pthread_c*)content)).summation;

    printf("thread %d start!\n",index);

    cudaSetDevice(index);
    cudaMemcpyToSymbol(cangle,hangle,sizeof(float)*360);

48 Multiple Thread and Multiple GPU

    hvector=(float*)malloc(sizeof(float)*size);
    cudaMalloc((void**)&dvector,sizeof(float)*size);

    bsize=256;
    gsize=(int)ceil((float)size/256);

    kernel<<<gsize,bsize>>>(dvector,size);

    cudaMemcpy(hvector,dvector,sizeof(float)*size,cudaMemcpyDeviceToHost);

    for(loop=0;loop<size;loop++)
        if(hvector[loop]!=summation)
            printf("Error: check result\n");

    free(hvector);
    cudaFree(dvector);

    cudaError_t error;

    if((error=cudaGetLastError())!=cudaSuccess)
        printf("cudaError:%s\n",cudaGetErrorString(error));

    printf("thread %d finish!\n",index);

    return NULL;
}

49 Multiple Thread and Multiple GPU

__global__ void kernel(float* dvector,int size)
{
    int loop;
    int index;

    float temp;

    index=blockIdx.x*blockDim.x+threadIdx.x;

    if(index<size)
    {
        for(loop=0,temp=0.0f;loop<360;loop++)
            temp=temp+cangle[loop];

        *(dvector+index)=temp;
    }

    return;
}