A Brief Introduction to OpenCL. Examples of Use in Scientific Computing. A.S. Ayriyan, 7 May 2015, Laboratory of Information Technologies.

What OpenCL is Open Computing Language (OpenCL) is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. OpenCL is based on C99. Credit:

How OpenCL works OpenCL consists of a kernel programming API, used to program a co-processor or GPU, and a run-time host API, used to coordinate the execution of these kernels and to perform other operations such as memory synchronization. The OpenCL programming model is based on the parallel execution of a kernel over many threads, exploiting SIMD and/or SIMT architectures. SIMD - Single Instruction, Multiple Data. SIMT - Single Instruction, Multiple Threads. The single instruction stream is expressed in a kernel; the multiple data are stored in arrays; multiple threads execute the kernel in parallel. The same code is executed in parallel by different threads, and each thread processes different data.
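As a minimal illustration of this model (an example added here, not from the original slides), consider a vector-addition kernel: every work-item executes the same code, and the global ID selects the data element each one processes. The kernel name vec_add and its arguments are illustrative:

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    // Each work-item receives a distinct global index.
    int i = get_global_id(0);
    // Same instruction stream, different data element per thread.
    c[i] = a[i] + b[i];
}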

OpenCL Programming model (1) Credit:

OpenCL Programming model (2) The work-items are the smallest execution entities (equivalent to CUDA threads). Every time a kernel is launched, many work-items (a number specified by the programmer) are launched, all executing the same code. Each work-item has an ID, accessible from the kernel, which is used to distinguish the data to be processed by each work-item. Work-groups exist to allow communication and cooperation between work-items (they are equivalent to CUDA thread blocks) and reflect how work-items are organized (an N-dimensional grid of work-items, with N = 1, 2 or 3). Work-groups also have a unique ID that can be referenced from a kernel. The ND-Range is the next organization level, specifying how work-groups are organized (again, as an N-dimensional grid of work-groups, N = 1, 2 or 3).
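A small kernel sketch (illustrative, not from the slides) showing the built-in functions that return these IDs:

__kernel void show_ids(__global int *out)
{
    int gid = get_global_id(0);  // unique ID within the whole ND-Range
    int lid = get_local_id(0);   // ID within the work-group
    int grp = get_group_id(0);   // ID of the work-group itself
    int lsz = get_local_size(0); // number of work-items per work-group
    // The global ID decomposes into group and local IDs:
    out[gid] = grp * lsz + lid;  // equals gid for a 1-dimensional range
}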

OpenCL Memory Hierarchy (1) (credit: Rob Farber)
Correspondence between OpenCL and CUDA memory terminology:
OpenCL   | CUDA
Global   | Global
Local    | Shared
Private  | Register

OpenCL Memory Hierarchy (2) The OpenCL address qualifiers (which are distinct from access qualifiers):
__global: memory allocated from the global address space
__constant: a region of read-only memory
__local: memory shared by a work-group
__private: per-work-item private memory
Note: pointer kernel arguments have to be __global, __constant or __local. The result of casting a pointer between different address qualifiers is undefined. (credit: Rob Farber)
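A kernel sketch (illustrative names, with a hypothetical coefficient table) that touches all four address spaces:

__kernel void qualifiers_demo(__global const float *in, // global device memory
                              __constant float *coeff,  // read-only region
                              __local float *scratch,   // shared by the work-group
                              __global float *out)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float x = in[gid];            // x lives in __private memory (the default)
    scratch[lid] = x * coeff[0];  // stage the value in local memory
    barrier(CLK_LOCAL_MEM_FENCE); // make local writes visible to the group
    out[gid] = scratch[lid];
}

The size of the __local buffer is supplied from the host, e.g. clSetKernelArg(kernel, 2, n_bytes, NULL).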

OpenCL Program Structure (1)
1. Get platform information
clGetPlatformIDs(1, &platform_id, &num_platforms);
2. Get information about devices
clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &num_devices);
3. Create an OpenCL context
context = clCreateContext(NULL, 1, &device_id, NULL, NULL, NULL);
4. Create a command queue
queue = clCreateCommandQueue(context, device_id, 0, NULL);
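The slide code passes NULL for all error arguments for brevity. In real code every call returns or writes a cl_int status worth checking; below is a minimal sketch of steps 3 and 4 with checking (the CHECK macro and the function name are assumptions of this example, not part of the API):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

// Abort with a message if an OpenCL call did not return CL_SUCCESS.
#define CHECK(err) \
    do { if ((err) != CL_SUCCESS) { \
        fprintf(stderr, "OpenCL error %d at line %d\n", (int)(err), __LINE__); \
        exit(EXIT_FAILURE); } } while (0)

cl_command_queue make_queue(cl_device_id device_id, cl_context *ctx_out)
{
    cl_int err;
    cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &err);
    CHECK(err);
    cl_command_queue queue = clCreateCommandQueue(context, device_id, 0, &err);
    CHECK(err);
    *ctx_out = context;
    return queue;
}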

OpenCL Program Structure (2)
5. Create a program from the kernel source and build it
program = clCreateProgramWithSource(context, 1, (const char**)&source_str, (const size_t*)&source_size, NULL);
clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
where source_str contains the kernel source (and source_size is its length):
const char *source_str =
    "__kernel void hello_world(__global int *A) {  \n"
    "    // Get the index of the current work-item \n"
    "    int i = get_global_id(0);                 \n"
    "    // Write this index into the data array   \n"
    "    A[i] = i;                                 \n"
    "}                                             \n";
The same kernel as a separate file "helloWorld_kernel.cl":
__kernel void hello_world(__global int *A) {
    // Get the index of the current work-item
    int i = get_global_id(0);
    // Write this index into the data array
    A[i] = i;
}

OpenCL Program Structure (3)
6. Create the OpenCL kernel
kernel = clCreateKernel(program, "hello_world", NULL);
7. Create a device memory buffer and copy the vector A from the host
cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR, nVector*sizeof(int), A, NULL);
8. Set the kernel arguments
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&a_mem_obj);
9. Execute the OpenCL kernel
size_t global_item_size = nVector; // Total number of work-items (threads)
size_t local_item_size = 64; // Work-group size
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
10. Get the data back: copy the buffer a_mem_obj into the host array A
clEnqueueReadBuffer(queue, a_mem_obj, CL_TRUE, 0, nVector*sizeof(int), A, 0, NULL, NULL);
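Putting steps 1-10 together, here is a compilable end-to-end sketch. It deviates from the slides in two labelled places: the buffer is created write-only (hello_world only writes A, so no host copy is needed) and resources are released at the end:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

static const char *source_str =
    "__kernel void hello_world(__global int *A) { \n"
    "    int i = get_global_id(0);                \n"
    "    A[i] = i;                                \n"
    "}                                            \n";

int main(void)
{
    const size_t nVector = 1024; // a multiple of the work-group size below
    int *A = (int*)malloc(nVector * sizeof(int));

    cl_platform_id platform_id;
    cl_device_id device_id;
    clGetPlatformIDs(1, &platform_id, NULL);
    clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);

    cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(context, device_id, 0, NULL);

    cl_program program = clCreateProgramWithSource(context, 1, &source_str, NULL, NULL);
    clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "hello_world", NULL);

    // The kernel only writes A, so the buffer need not be copied from the host.
    cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                      nVector * sizeof(int), NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_mem_obj);

    size_t global_item_size = nVector; // total number of work-items
    size_t local_item_size = 64;       // work-group size
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_item_size,
                           &local_item_size, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, a_mem_obj, CL_TRUE, 0,
                        nVector * sizeof(int), A, 0, NULL, NULL);

    printf("A[0] = %d, A[%d] = %d\n", A[0], (int)nVector - 1, A[nVector - 1]);

    // Release resources in reverse order of creation.
    clReleaseMemObject(a_mem_obj);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    free(A);
    return 0;
}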

How to Get Device Properties (1)
// Assumed declarations, not shown on the slide:
// char name[MAX_NAME_SIZE]; cl_device_id devices[MAX_NUM_DEVICES];
// cl_uint nDevices; cl_ulong entries; size_t p_size;

// Get the first available platform
cl_platform_id platform_id = NULL;
clGetPlatformIDs(1, &platform_id, NULL);
// Get information about the platform
clGetPlatformInfo(platform_id, CL_PLATFORM_NAME, MAX_NAME_SIZE, name, NULL);
clGetPlatformInfo(platform_id, CL_PLATFORM_VERSION, MAX_NAME_SIZE, name, NULL);
// Get the list of available devices on the platform
clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, MAX_NUM_DEVICES, devices, &nDevices);
// Get information about each device
for (int iDevices = 0; iDevices < nDevices; ++iDevices) {
    clGetDeviceInfo(devices[iDevices], CL_DEVICE_NAME, MAX_NAME_SIZE, name, NULL);
    clGetDeviceInfo(devices[iDevices], CL_DRIVER_VERSION, MAX_NAME_SIZE, name, NULL);
    clGetDeviceInfo(devices[iDevices], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(cl_ulong), &entries, NULL);
    clGetDeviceInfo(devices[iDevices], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(cl_ulong), &entries, NULL);
    clGetDeviceInfo(devices[iDevices], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(cl_ulong), &entries, NULL);
    clGetDeviceInfo(devices[iDevices], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &p_size, NULL);
}

How to Get Device Properties (2)
Compilation: nvcc findDevice.c -o findDevice -lOpenCL
Sample output on a notebook ASUS (Intel Core i7, NVIDIA GeForce GT 635M, 2 GB):
CL_PLATFORM_NAME = NVIDIA CUDA
CL_PLATFORM_VERSION = OpenCL 1.1 CUDA
Found 1 devices
Device name = GeForce GT 635M
Driver version =
Global Memory (MB): 2047
Local Memory (KB): 32
Max clock (MHz): 950
Max Work Group Size: 1024

Examples of Use in Scientific Computing

Calculation of Integral (1) Compute numerically the integral of a function f(x) over the interval [a,b]. The answer is I ≈ The approximation error formula:
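The code on the following slides implements the composite midpoint rule; written out (a standard formula, stated here because the transcript lost the original one), together with the usual error bound for this rule:

I = \int_a^b f(x)\,dx \;\approx\; h \sum_{i=0}^{N-2} f\!\left( \frac{x_i + x_{i+1}}{2} \right),
\qquad h = \frac{b-a}{N-1}, \quad x_i = a + i h,

|E| \;\le\; \frac{(b-a)\,h^2}{24}\,\max_{x \in [a,b]} \bigl| f''(x) \bigr| .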

Calculation of Integral (2) Discretization points: 2^26. Calculation of the integral on the CPU. Exact answer is Integral = Time = sec.

double func(double x) {
    return pow( sin(x) * log(x+1.0), x/5.0 ) * pow(x, 0.5);
}

int main() {
    // left, right, N, X[] and Y[] are assumed defined elsewhere (not shown on the slide)
    double h = (right - left) / (N - 1);
    for (int i = 0; i < N; i++)
        X[i] = left + i * h;
    double integral = 0.0;
    for (int i = 0; i < N-1; i++) {
        Y[i] = h * func(0.5 * (X[i] + X[i+1]));
        integral += Y[i];
    }
    printf("Integral = %e\n", integral);
    return 0;
}

Calculation of Integral (3) Discretization points: 2^26. Calculation of the integral on the GPU. Exact answer is Integral = Time = 2.78 sec.

#pragma OPENCL EXTENSION cl_khr_fp64: enable

double func(double x) {
    return pow( sin(x) * log(x+1.0), x/5.0 ) * pow(x, 0.5);
}

__kernel void calc_integral(__global double *h, __global double *X, __global double *Y) {
    // Get the index of the current element
    int i = get_global_id(0);
    // Compute this interval's contribution by the midpoint rule
    Y[i] = h[0] * func(0.5 * (X[i] + X[i+1]));
}
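The kernel only fills the array Y of per-interval contributions; the final sum has to be done on the host after clEnqueueReadBuffer. A sketch of that step (the slides do not show it):

// Sum the per-interval contributions computed on the GPU.
double sum_contributions(const double *Y, int n_intervals)
{
    double integral = 0.0;
    for (int i = 0; i < n_intervals; i++)
        integral += Y[i];
    return integral;
}

Alternatively, a parallel reduction on the device would avoid transferring the whole Y array back to the host.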

Calculation of Integral (4) Discretization points: 2^26. Calculating the integral on the CPU with an artificially increased amount of work per point. Factor of complexity: 5. Exact answer is Integral = Time = sec. Each contribution is recomputed with weights j = 1..5 and normalized by their sum, so the result is unchanged while the work per point grows fivefold.

double func(double x) {
    return pow( sin(x) * log(x+1.0), x/5.0 ) * pow(x, 0.5);
}

const int fc = 5; // Factor of Complexity
double tmp1;
int a1 = 1, an = 5;
// tmp2 = (a1 + an) * an / 2 = 15, the sum 1 + 2 + ... + 5
double tmp2 = (double)a1;
tmp2 += (double)an;
tmp2 *= (double)an;
tmp2 /= 2.0;
double integral = 0;
for (int i = 0; i < N-1; i++) {
    tmp1 = 0.0;
    for (int j = a1; j <= an; j++)
        tmp1 += (double)j * func(0.5 * (X[i] + X[i+1]));
    Y[i] = h * tmp1 / tmp2;
    integral += Y[i];
}

Calculation of Integral (5) Discretization points: 2^26. Calculating the integral on the GPU. Factor of complexity: 5. Exact answer is Integral = Time = 2.88 sec.

#pragma OPENCL EXTENSION cl_khr_fp64: enable

double func(double x) {
    return pow( sin(x) * log(x+1.0), x/5.0 ) * pow(x, 0.5);
}

__kernel void calc_integral(__global double *h, __global double *X, __global double *Y, __global int *fc) {
    int gid = get_global_id(0);
    // Private (per-work-item) variables; __local variables would be shared
    // by the whole work-group and cannot be initialized at declaration.
    double step = h[0];
    double x = X[gid] + 0.5*step; // midpoint of the interval
    int a1 = 1;
    int an = fc[0]; // Factor of Complexity
    double tmp2 = (double)(a1 + an)*an / 2.0;
    double tmp1 = 0.0;
    for (int j = a1; j <= an; j++)
        tmp1 = tmp1 + (double)j * step * func(x);
    Y[gid] = tmp1 / tmp2;
}

Calculation on the GPU is around 13 times faster than on the CPU.

Calculation of Integral (6) [Figure: run-time comparison of the integral calculation on the HybriLIT cluster and on my laptop]

Calculation of Integral (7)

Calculation of Integral (8)

Modeling of Heat Processes (1)

Modeling of Heat Processes (3) For example, typical sizes of the computational grid (Nr x Nz): 631 x 401, 631 x 601. [Figure: geometry of the multilayer domain; layers 1 and 2 and layer 3 with thicknesses given in cm, one dimension being 0.12 cm]

Modeling of Heat Processes (2)

Modeling of Heat Processes (3) For example, typical sizes of the computational grid (Nr x Nz) are 631 x 401 and 631 x 601.

Modeling of Heat Processes (4)

Modeling of Heat Processes (5)

Modeling of Heat Processes (6)
1. Initialize two arrays, Told and Tnew, on the host
2. Send data from host to device
3. Execute the kernel: Tnew = Alg(Tnew, Told)
4. Get Tnew back from the device
5. Told = Tnew
6. Repeat steps 2-5 until |Told - Tnew| < eps (a host-code sketch of this loop follows below)
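A host-side sketch of this loop (illustrative names; the slides do not show the actual code). It follows the steps above literally; in practice one would rather swap the Told/Tnew buffers on the device than copy through the host every iteration:

#include <math.h>
#include <CL/cl.h>

// d_told and d_tnew are device buffers of n doubles; h_told and h_tnew
// are the corresponding host arrays, with h_told already initialized.
void iterate_heat(cl_command_queue queue, cl_kernel kernel,
                  cl_mem d_told, cl_mem d_tnew,
                  double *h_told, double *h_tnew, size_t n, double eps)
{
    double diff;
    do {
        // 2. Send the current Told from host to device.
        clEnqueueWriteBuffer(queue, d_told, CL_TRUE, 0, n * sizeof(double),
                             h_told, 0, NULL, NULL);
        // 3. Execute the kernel: Tnew = Alg(Tnew, Told).
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_tnew);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_told);
        size_t global = n;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                               0, NULL, NULL);
        // 4. Get Tnew back from the device (blocking read).
        clEnqueueReadBuffer(queue, d_tnew, CL_TRUE, 0, n * sizeof(double),
                            h_tnew, 0, NULL, NULL);
        // 5. Told = Tnew, while measuring max |Told - Tnew|.
        diff = 0.0;
        for (size_t i = 0; i < n; i++) {
            double d = fabs(h_told[i] - h_tnew[i]);
            if (d > diff) diff = d;
            h_told[i] = h_tnew[i];
        }
    } while (diff >= eps); // 6. repeat until |Told - Tnew| < eps
}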

Modeling of Heat Processes (7)

Modeling of Heat Processes (8)

Modeling of Heat Processes (9)

Modeling of Heat Processes (10)

Modeling of Heat Processes (11) Alexander Ayriyan, Jan Busa Jr., Eugeny E. Donets, Hovik Grigorian, Jan Pribis. Algorithm for Solving Non-Stationary Heat Conduction Problem for Design of a Technical Device. arXiv: [physics.comp-ph] (2014), 12 p. A. Ayriyan, J. Busa Jr., E.E. Donets, H. Grigorian, J. Pribis. Computational Scheme for Solving Heat Conduction Problem in Multilayer Cylindrical Domain // Bulletin of PFUR. Series Mathematics. Information Sciences. Physics (2015), no. 1, pp. 54–60.

Heterogeneous cluster at JINR-LIT