Lecture 2 CUDA C Programming

Kyu Ho Park, March 8, 2016

Computer Architecture
Single Instruction Multiple Data (SIMD)
Multiple Instruction Multiple Data (MIMD)
Single Instruction Single Data (SISD)
Multiple Instruction Single Data (MISD)
NVIDIA GPU: SIMT (Single Instruction Multiple Thread)

NVIDIA’s GPU
Tegra: for mobile and embedded devices such as tablets and phones.
GeForce: for consumer graphics.
Quadro: for professional visualization.
Tesla: for data center parallel computing.
Fermi: GPU accelerator in the Tesla product family.
Kepler: the current generation of GPU computing architecture after Fermi.

Fermi and Kepler
                    FERMI (Tesla C2050)    KEPLER (Tesla K10)
CUDA cores          448                    1536 x 2
Memory              6 GBytes               8 GBytes
Peak performance    1.0 TFLOPS             4.6 TFLOPS
Memory bandwidth    144 GBytes/s           320 GBytes/s

CUDA Architecture
2006: advent of NVIDIA’s GeForce 8800 GTX, the first GPU built on the CUDA architecture for general-purpose computing. Its ALUs comply with IEEE requirements, and it allows reads and writes of shared memory.
A few months after the launch of the GeForce 8800 GTX, CUDA C was announced: the first language for general-purpose computing on GPUs.

Hello, World!
#include <stdio.h>

int main(void)
{
    printf("Hello, World!\n");
    return 0;
}

Kernel Program
#include <stdio.h>

__global__ void funct(void)
{
    printf("Hello from GPU!\n");
}

int main(void)
{
    funct<<<1,4>>>( );    /* angle brackets: 1 block of 4 threads */
    printf("Hello, World from CPU!\n");
    cudaDeviceReset();
    return 0;
}

Terminology
Host: the CPU and the system’s memory.
Device: the GPU and its memory.
Kernel: a function that executes on the device.

__global__ qualifier
CUDA C adds the __global__ qualifier to standard C. It alerts the compiler that a function is to be compiled to run on a device.
The CUDA compiler ‘nvcc’ separates the device code from the host code during compilation: it sends funct( ) to the compiler that handles device code and feeds main( ) to the host compiler.

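CUDA NVCC
[Figure: nvcc compilation flow. Integrated CPU and GPU code, together with the CUDA libraries, enters the CUDA compiler. The compiler emits CUDA assembly for computing, which goes through the CUDA driver/debugger to the GPU, and CPU host code, which goes through the C compiler to the CPU.]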

CUDA C Extension: function type qualifiers
__global__, __device__, __host__, __device__ __host__
__global__ void function1( ) { … }, launched from the host as function1<<<6,4>>>( );
function1 is executed only on the device. It can be called from the host, but it cannot be called recursively on the device.

__global__ void function1
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main()
{
    int c;
    int *dev_c;
    cudaMalloc((void**)&dev_c, sizeof(int));    /* device memory for the result */
    add<<<1, 1>>>(2, 7, dev_c);
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);
    return 0;
}

__global__ void function( ), launched as function<<<6,4>>>( )
The return type is always void.
Recursive calls are not allowed.
The function cannot have static variables.
It cannot be declared __host__ at the same time.
<<<6,4>>> launches 6 blocks with 4 threads per block.

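__device__ int function( )
The function is executed on the device.
It can be called from the device and cannot be called from the host.
It cannot be called recursively (this restriction applies to devices of compute capability 1.x; later devices support recursion in __device__ functions).
It cannot have a variable number of arguments.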

__host__ int function (int a, int b)
It is executed only on the host.
It cannot be called from the device.
If none of __global__, __host__, __device__ is declared, the function is treated as __host__.
It cannot be used together with __global__.
It can be declared together with __device__; in this case the function can be executed on both the host and the device.

Variable type qualifiers
1. __device__
The variable is allocated in global memory and remains valid until the end of program execution. All threads can access the variable; the host can access it through API functions.

2. __constant__
The variable is allocated in the constant memory area. All threads can only read the variable; the host can write its value through the cudaMemcpyToSymbol( ) API.

3. __shared__
The variable is allocated in the shared memory area of a block. All threads in the block can read and write the variable.
Question: Global Memory, Constant Memory, Shared Memory?

Memory Management
C functions    CUDA C functions
malloc         cudaMalloc
memcpy         cudaMemcpy
memset         cudaMemset
free           cudaFree

Allocation, Deallocation and Copying of Graphic Card Memory (GCM)
1. Memory allocation of GCM
   cudaError_t cudaMalloc(void** devPtr, size_t count);
2. Freeing GCM
   cudaError_t cudaFree(void* devPtr);
3. Copying GCM
   cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind);
   The last parameter selects the direction, for example cudaMemcpyHostToDevice.

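Each of these calls returns a cudaError_t: cudaSuccess on success, or an error code such as cudaErrorMemoryAllocation (returned by cudaMalloc when the allocation fails).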

cudaMemcpy( ): last parameter (MM = main memory, GCM = graphic card memory)
Parameter                   Action
cudaMemcpyHostToHost        Copy from MM to MM
cudaMemcpyHostToDevice      Copy from MM to GCM
cudaMemcpyDeviceToHost      Copy from GCM to MM
cudaMemcpyDeviceToDevice    Copy from GCM to GCM

GPU Parallel Processing

Processing Procedure
1. Allocate input/output main memory (MM).
2. Allocate input/output GCM.
3. Put the input data into main memory.
4. Copy the input data from MM to GCM.
5. Distribute the data to the memory of the GPU.
6. Process the data by executing threads in parallel.
7. Load the processed data into the output GCM.
8. Send the output data from GCM to MM.
9. Free GCM.
10. Free MM.

Host and Device
Host: the CPU and the system’s memory.
Device: the GPU and its memory.

Kernel Function of CUDA
A function that executes on the device is called a kernel.
__global__ void KernelFunction(int a, int b, int c)
{
    …
}
/* Qualifier __global__: it alerts the compiler that a function should be
   compiled to run on a device instead of the host.
   No return value: void */

Calling a Kernel Function
#include <stdio.h>

__global__ void kernel(int a, int b)
{
    int sum = a + b;
}

int main()
{
    kernel<<<6, 1>>>(10, 21);
    printf(…);
    return 0;
}

kernel<<<6,1>>>(10,21)
<<<6,1>>>: the number of blocks is 6 and the number of threads per block is 1.

Thread-Block-Grid Model
<<<4,4>>>: a grid of 4 blocks with 4 threads per block.

Memory Hierarchy and Thread-Block-Grid
[Figure: threads Thread(0,0), Thread(1,0), Thread(2,0), Thread(3,0) of a block use the block's shared memory; the host reaches global memory through cudaMalloc, cudaMemcpy, cudaMemset, and cudaFree.]

CUDA Processor Architecture

Streaming Processor(SP)

SM(Streaming Multiprocessor)

Supplementary: Process, Thread and Task

How to create a process. How to create a thread. What is the task?

creating a process

Process Concept
An operating system executes a variety of programs:
Batch system: jobs.
Time-shared systems: user programs or tasks.
Process: a program in execution; process execution must progress in sequential fashion.
A process includes: program counter, stack, data section.
In a batch system, a single job is executed at a time. To achieve high CPU utilization, multiprogramming systems appeared. To also give a fair opportunity to all processes, time-sharing systems were developed. In MS Windows, which is a single-user system, a word processor, a web browser, and an e-mail process may all be running; at the same time, a memory-management process is running in the OS.

Process in Memory
[Figure: layout of a process in memory: stack (SP), heap, data, text (PC).]
Text: the program code.
Data: global variables.
Heap: dynamic data area, allocated during program execution by ‘new’ and ‘malloc’.
Stack: temporary data such as function parameters, return addresses, and local variables.

Memory Map of a process
[Figure: memory map of a process in a 4 GB address space with a boundary marked at 3 GB; regions shown: stack, heap, data, text.]

Diagram of Process State
new -> ready: admitted
ready -> running: scheduler dispatch
running -> ready: interrupt
running -> waiting: I/O or event wait
waiting -> ready: I/O or event completion
running -> terminated: exit

Process Control Block (PCB)
Information associated with each process:
Process state
Program counter
CPU registers
CPU scheduling information (for example, priority)
Memory-management information
Accounting information
I/O status information
Each process is represented in the OS by a PCB. A PCB is also called a task control block.

Process Control Block (PCB)
Some of the fields of the MINIX process table:
Process management: registers, program counter, program status word, stack pointer, process state, time when process started, CPU time used, children’s CPU time, time of next alarm, message queue pointers, pending signal bits, process id, various flag bits.
Memory management: pointer to text segment, pointer to data segment, pointer to bss segment, exit status, signal status, parent process, process group, real uid, effective uid, real gid, effective gid, bit maps for signals.
Files management: UMASK mask, root directory, working directory, file descriptors, effective uid, system call parameters.
The process table has one PCB entry per process. The PCB holds all the information that must be saved when the process is switched from the running to the ready state, so that it can be restarted later as if it had never been stopped.

Process Descriptor
Topics: process descriptor, process switching, creating processes, destroying processes.
Overview of the Linux process descriptor (struct task_struct). This seminar (chapter 3) focuses on:
Process state: TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_STOPPED, TASK_TRACED, TASK_ZOMBIE, EXIT_DEAD.
Process parent/child relationship.

Task List
Representation of a process: a process descriptor of type struct task_struct (declared in <linux/sched.h>), about 1.7 KBytes on a 32-bit machine.
Task list: a circular doubly linked list of task_struct.

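CPU Switch From Process to Process
[Figure: the OS switches the CPU between two processes: save state into PCB0, reload state from PCB1; later save state into PCB1, reload state from PCB0. Each process is idle while the other one runs.]
Making a process run means that the state in its PCB is changed from ‘ready’ to ‘running’ and the PC value saved in the PCB is loaded into the PC of the CPU.
User mode / kernel mode.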

Operations on Processes: Process Creation
A parent process creates child processes, which in turn create other processes, forming a tree of processes.
Possibilities of resource sharing:
Parent and children share all resources.
Children share a subset of the parent’s resources.
Parent and child share no resources.
Possibilities of execution:
Parent and children execute concurrently.
Parent waits until children terminate.

Process Creation (Cont.)
Possibilities of address space:
The child is a duplicate of the parent.
The child has a new program loaded into it.
UNIX examples:
The fork system call creates a new process.
The exec system call is used after a fork to replace the process’ memory space with a new program.

fork()
[Figure: the initial process calls fork(). In the original process, which continues, fork() returns the pid of the new child process. In the new (child) process, fork() returns zero.]

fork()
[Figure: fork() duplicates the parent (pid=1000) into a child (pid=1001). Each process has its own ID, text, data, BSS, heap, stack, registers (SP, PC), open files, and resources; the PC of both parent and child points just after the fork( ) call.]

fork()

forkOut

fork()
[Figure: running parent processes call fork(); the resulting child processes are placed in the ready state. PIDs shown in the figure: 999, 1000, 1234, 1345, 1452.]

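fork()
#include <stdio.h>
#include <unistd.h>

int main()
{
    int i;
    for (i = 0; i < 10; i++) {
        printf("Process_id=%d, i=%d\n", getpid(), i);
        if (i == 5) {
            printf("Process_id=%d: fork() to start\n", getpid());
            int forkValue = fork();    /* from here on, two processes run the loop */
            printf("forkValue=%d\n", forkValue);
        }
    }
    return 0;
}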

fork( ) output

Duplicating a process image
#include <sys/types.h>
#include <unistd.h>
pid_t fork(void);
Return value of fork(): on success, the pid of the child is returned to the parent and 0 is returned to the child. On failure, -1 is returned to the parent.
unistd.h: declares the fork() prototype.
sys/types.h: defines pid_t.

fork()
[Figure: the original process reaches fork(). Afterwards the parent process (pid > 0) and the child process (pid == 0) both run the same program and both continue from the statement after fork() (PC+1).]
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    pid_t pid;
    printf("Start!\n");
    pid = fork();
    if (pid == 0)
        printf("I am the Child !\n");
    else if (pid > 0)
        printf("Parent pid=%d, Child pid=%d\n", (int)getpid(), (int)pid);
    else
        printf("fork() failed!\n");
    return 0;
}

Replacing a process image
#include <unistd.h>
char **environ;
int execl(const char *path, const char *arg0, …, (char *)0);
int execlp(const char *file, const char *arg0, …, (char *)0);
int execle(const char *path, const char *arg0, …, (char *)0, char *const envp[]);
int execv(const char *path, char *const argv[]);
int execvp(const char *file, char *const argv[]);
int execve(const char *path, char *const argv[], char *const envp[]);

Examples of exec*()
#include <unistd.h>
char *const ls_argv[] = {"ls", "-l", (char *)0};
char *const ls_envp[] = {"PATH=/bin:/usr/bin", "TERM=console", (char *)0};
/* Each call replaces the current process image, so only one of them would run. */
execl("/bin/ls", "ls", "-l", (char *)0);
execlp("ls", "ls", "-l", (char *)0);
execle("/bin/ls", "ls", "-l", (char *)0, ls_envp);
execv("/bin/ls", ls_argv);
execvp("ls", ls_argv);
execve("/bin/ls", ls_argv, ls_envp);

Waiting for a process
#include <sys/types.h>
#include <sys/wait.h>
pid_t wait(int *status);
The parent process executing wait() pauses until one of its child processes terminates. The call returns the pid of that child process.

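status
status: the value passed to the parent by exit(int status) in the child.
Parent process: wait(&status). Child process: exit(1).
For a normal exit, the exit code is stored in the second byte of status and the low-order byte is 0x00. Thus exit(1) yields status = 0x0100 = 256, and exit(100), i.e. 0x64, yields status = 0x6400 = 25600.
./forkwait
Start!
Parent pid=25820, Child pid=25821
I am the Child !
status=256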

if (pid != 0) {
    int status;
    pid_t pid_child;
    pid_child = wait(&status);
    printf("Child has finished, pid_child=%d\n", pid_child);
    if (WIFEXITED(status))    /* true when the child called exit() or returned from main */
        printf("Child finished normally\n");
    else
        printf("Child finished abnormally\n");
}

wait(&status): forkwait.c
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    pid_t pid;
    int status;
    printf("Start!\n");
    pid = fork();
    if (pid == 0) {
        printf("I am the Child !\n");
        exit(100);
    } else if (pid > 0) {
        printf("Parent pid=%d, Child pid=%d\n", (int)getpid(), (int)pid);
        wait(&status);
        printf("status=%d\n", status);
    } else
        printf("fork() failed!\n");
    return 0;
}

./forkwait
Start!
Parent pid=25789, Child pid=25790
I am the Child !
status=25600

exit()
#include <stdlib.h>
void exit(int status);
/* Terminates the current process and passes status to the parent.
   Only an integer value from 0 to 255 is passed on. */

waitpid( )
#include <sys/types.h>
#include <sys/wait.h>
pid_t waitpid(pid_t pid, int *status, int options);
pid: pid of the child.
status: transferred from the child executing exit(status).
options:
0: the usual wait; the parent waits until the child process finishes.
WNOHANG: the parent process does not stay in the ‘wait’ state; the call returns immediately.

Zombie Processes
new -> ready: admitted
ready -> running: scheduler dispatch
running -> ready: interrupt
running -> waiting: I/O or event wait
waiting -> ready: I/O or event completion
running -> EXIT_ZOMBIE: exit
EXIT_ZOMBIE -> terminated: the parent calls wait( )

forkZombie.c

Threads
So far, a process has been a single thread of execution. The single thread of control allows the process to perform only one task at a time: the user cannot simultaneously type in characters and run the spell checker within the same process.
Therefore modern OSs have extended the process concept to allow a process to have multiple threads of execution. In a traditional process there is a single thread of control and a single PC; modern OSs provide support for multiple threads of control.
Threads will be discussed in Lecture 3.

Process and Threads
[Figure: (a) Three processes, each with one thread. (b) One process with three threads. Labels: computer, program counter, thread, process.]

Thread Usage [Tanenbaum]
[Figure: a word processor with three threads.]

Linux Implementation of Threads
In Linux, each thread has a unique task_struct. Linux implements all threads as standard processes: a thread in Linux is just a process that shares certain resources (such as an address space) with other processes.

A Thread
[Figure: a process with a single thread: text, data, BSS, heap, and stack, plus registers (SP, PC), open files, and resources.]

Sharing of threads
[Figure: Thread1 and Thread2 of one process share the text, data, BSS, heap, open files, and resources. Each thread has its own stack (Stack1, Stack2) and its own registers, including SP and PC.]

Creating Threads: clone( )
#define _GNU_SOURCE
#include <sched.h>
int clone(int (*func)(void *), void *child_stack, int flags, void *func_arg, …);
/* The caller must allocate a suitably sized block for the argument child_stack.
   The stack grows downward, so the child_stack argument should point to the
   high end of the allocated block. */
int thread_func(void *arg);

int main()
{
    int child_stack[4096];
    ….
    /* CLONE_THREAD requires CLONE_SIGHAND, which in turn requires CLONE_VM */
    clone(thread_func, (void *)(child_stack+4095), CLONE_VM | CLONE_SIGHAND | CLONE_THREAD, NULL);
    …..
}

int thread_func(void *arg)
{
    …
}
If the flags are set to CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID instead, the created task is a process.

clone() example
#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <linux/unistd.h>
#include <sched.h>

int thread_func(void *arg)
{
    printf("Child :TGID(%d), PID(%d)\n", (int)getpid(), (int)syscall(__NR_gettid));
    sleep(1);
    return 0;
}

int main(void)
{
    int pid;
    int child_stack[4096];
    printf("Start!\n");
    printf("Parent:TGID(%d), PID(%d)\n", (int)getpid(), (int)syscall(__NR_gettid));
    pid = clone(thread_func, (void *)(child_stack+4095),
                CLONE_VM | CLONE_THREAD | CLONE_SIGHAND, NULL);
    if (pid == -1) {
        perror("clone");
        exit(1);
    }
    sleep(2);    /* keep the thread group alive until the child has finished */
    printf("End!\n");
    return 0;
}

pthread example
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *pthread_func(void *data)
{
    int id = *((int *)data);    /* argument passed by the creating thread */
    printf("I am the created pthread %d.\n", id);
    sleep(2);
    return NULL;
}

int main(void)
{
    int err;
    void *status;
    int a = 1;
    int b = 2;
    pthread_t p_thread[2];

    printf("Start!\n");
    /* pthread_create() returns 0 on success and an error number on failure */
    if ((err = pthread_create(&p_thread[0], NULL, pthread_func, (void *)&a)) != 0) {
        fprintf(stderr, "Creation0 failed: error %d\n", err);
        exit(1);
    }
    if ((err = pthread_create(&p_thread[1], NULL, pthread_func, (void *)&b)) != 0) {
        fprintf(stderr, "Creation1 failed: error %d\n", err);
        exit(2);
    }
    pthread_join(p_thread[0], &status);
    printf("pthread_join0\n");
    pthread_join(p_thread[1], &status);
    printf("pthread_join1\n");
    printf("The Parent.\n");
    printf("End!\n");
    return 0;
}

POSIX Threads (pthread)
Embedded Software, Kyu Ho Park, CORE Lab.

Pthread and LinuxThreads
Threads got attention with the advent of the POSIX 1003.1c specification, but they were not so popular at first.
LinuxThreads: a thread implementation that was very close to the POSIX standard.
POSIX thread implementations: NPTL (Native POSIX Thread Library) and NGPT (Next Generation POSIX Threads).
POSIX: Portable OS Interface (UNIX).

Drawbacks of thread program
Difficult to write multi-threaded programs. Difficult to debug multi-threaded programs.

pthread functions
#include <pthread.h>
int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);
void pthread_exit(void *retval);
int pthread_join(pthread_t th, void **thread_return);    // the thread counterpart of wait( )

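pthread_create, exit, and join
[Figure: the master thread calls pthread_create() to start the created threads; each created thread ends with pthread_exit(), and the master thread waits for them with pthread_join().]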

pthread

pthread_create

Task?
In Linux, processes and threads are all represented by task_struct. Every task has its own unique id, stored in the pid field of task_struct.
According to the POSIX standard, however, all threads of a process should have the same pid. Linux adopts the tgid (thread group id) to satisfy the POSIX standard.
Therefore, for a task:
if it is a process: pid == tgid;
if it is a thread: pid != tgid, and the thread has the same tgid as its parent.
