
1 Parallelization and CUDA libraries Lei Zhou, Yafeng Yin, Hong Man

2 Outline  GPU & CUDA  Manual CUDA Coding  CUDA Libraries  FIR Realization  Auto-Parallelizing Tools

3 GPU & CUDA  GPUs are massively multithreaded, many-core chips  Hundreds of scalar processors  Tens of thousands of concurrent threads  CUDA stands for Compute Unified Device Architecture  A parallel computing architecture developed by NVIDIA  The computing engine in the GPU  CUDA is accessible to software developers through industry-standard programming languages. Examples: GeForce 8800 GTX (128 cores), Tesla C1060 (240 cores)

4 Processing Flow Serial code executes on the host while parallel code executes on the device.
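A minimal sketch of this flow, assuming an illustrative scale_kernel and problem size (not taken from the slides): the host code runs serially, while the kernel runs in parallel on the device.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Parallel code: each device thread scales one array element
__global__ void scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Serial code on the host: allocate and initialize the input
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // Copy host -> device, launch the kernel, copy device -> host
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    printf("h_data[10] = %f\n", h_data[10]);   // expect 20.0
    cudaFree(d_data);
    free(h_data);
    return 0;
}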

5 Manual CUDA Coding  Find parallel kernels  Improve data reuse inside kernels for better compute intensity  Access memory in a GPU-friendly (coalesced) pattern  Take advantage of the complex memory hierarchy that makes the GPU fast  Reduce the copy-in and copy-out transfers that pile up on the PCIe bus  Reduce memory usage on the GPU  Limit inter-block synchronizations (see the data-reuse sketch below)
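A minimal sketch of the data-reuse and coalesced-access points above: a 1D moving-average kernel (the kernel name, RADIUS, and block size are illustrative assumptions). Each block stages its input tile in shared memory once, so neighboring threads reuse those values instead of re-reading global memory.

#define RADIUS 4
#define BLOCK  256

__global__ void moving_avg(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + RADIUS;

    // Coalesced load of the block's tile (plus halo) into shared memory
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = gid - RADIUS;
        int right = gid + BLOCK;
        tile[lid - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
        tile[lid + BLOCK]  = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();

    // Each element loaded once from global memory is reused by 2*RADIUS+1 threads
    if (gid < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[lid + k];
        out[gid] = sum / (2 * RADIUS + 1);
    }
}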

6 CUDA Libraries  Basic CUDA computation libraries  CUBLAS  CUFFT  GPULib  Advanced CUDA computation libraries  CULA  MAGMA  VSIPL

7 Basic libraries  CUBLAS provides a set of functions for basic vector and matrix operations  matrix/vector copy, sort, dot product, Euclidean norm, etc.  CUFFT is the CUDA FFT library  cufftPlan1d(), cufftPlan2d(), cufftPlan3d()  GPULib provides a library of mathematical functions  addition, subtraction, multiplication, and division, as well as unary functions such as sin(), cos(), gamma(), and exp()  interpolation, array reshaping, array slicing, and reduction operations
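A minimal sketch of calling one of these basic routines, using the cuBLAS v2 handle-based API (the vector length and contents are illustrative):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int n = 1024;
    float h_x[1024], h_y[1024];
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    // Dot product on the GPU: result = sum_i x[i] * y[i]
    cublasHandle_t handle;
    cublasCreate(&handle);
    float result = 0.0f;
    cublasSdot(handle, n, d_x, 1, d_y, 1, &result);
    printf("dot = %f\n", result);   // expect 2048

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}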

8 Advanced libraries  CULA: GPU Accelerated Linear Algebra  provides LAPACK (Linear Algebra PACKage) functions on CUDA GPUs  MAGMA: Matrix Algebra on GPU and Multicore Architectures  develops a dense linear algebra library similar to LAPACK, but for heterogeneous/hybrid "multicore + GPU" systems
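A rough sketch of the LAPACK-style interface these libraries expose, shown with MAGMA's CPU-interface LU factorization; the header name, matrix size, and test data are assumptions, and the exact call should be checked against the MAGMA release in use (CULA exposes analogous routines).

#include <magma.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    magma_init();

    magma_int_t n = 512, lda = 512, info = 0;
    double *A = (double *)malloc(lda * n * sizeof(double));
    magma_int_t *ipiv = (magma_int_t *)malloc(n * sizeof(magma_int_t));

    // Diagonally dominant test matrix (column-major, like LAPACK)
    for (magma_int_t j = 0; j < n; ++j)
        for (magma_int_t i = 0; i < lda; ++i)
            A[i + j * lda] = (i == j) ? (double)n : 1.0;

    // Same semantics as LAPACK's dgetrf, but MAGMA offloads the bulk
    // of the factorization to the GPU
    magma_dgetrf(n, n, A, lda, ipiv, &info);
    printf("magma_dgetrf info = %d\n", (int)info);

    free(A); free(ipiv);
    magma_finalize();
    return 0;
}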

9 Advanced libraries - VSIPL  VSIPL: Vector Signal Image Processing Library  Generalized matrix product  Fast FIR filtering  Correlation  Fast Fourier Transform  QR decomposition  Random number generation  Elementwise arithmetic, logical, and comparison operators; linear algebra procedures
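A rough sketch of the elementwise operations listed above, assuming the standard VSIPL core-profile C API that GPU VSIPL implements (vsip_init, vsip_vcreate_f, vsip_vfill_f, vsip_vmul_f); exact signatures should be checked against the library headers.

#include <vsip.h>
#include <stdio.h>

int main(void)
{
    vsip_init((void *)0);

    // Create three length-8 single-precision vector views
    vsip_vview_f *a = vsip_vcreate_f(8, VSIP_MEM_NONE);
    vsip_vview_f *b = vsip_vcreate_f(8, VSIP_MEM_NONE);
    vsip_vview_f *c = vsip_vcreate_f(8, VSIP_MEM_NONE);

    vsip_vfill_f(2.0f, a);          // a[i] = 2
    vsip_vfill_f(3.0f, b);          // b[i] = 3
    vsip_vmul_f(a, b, c);           // c[i] = a[i] * b[i]

    printf("c[0] = %f\n", vsip_vget_f(c, 0));   // expect 6

    vsip_valldestroy_f(a);
    vsip_valldestroy_f(b);
    vsip_valldestroy_f(c);
    vsip_finalize((void *)0);
    return 0;
}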

10 Example
// Allocate device memory for filter kernel
Complex* d_filter_kernel;
cutilSafeCall(cudaMalloc((void**)&d_filter_kernel, mem_size));

// Copy host memory to device
cutilSafeCall(cudaMemcpy(d_filter_kernel, h_padded_filter_kernel, mem_size,
                         cudaMemcpyHostToDevice));

// CUFFT plan
cufftHandle plan;
cufftSafeCall(cufftPlan1d(&plan, new_size, CUFFT_C2C, 1));

// Transform signal and kernel
cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_signal,
                           (cufftComplex *)d_signal, CUFFT_FORWARD));
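The slide's snippet ends after the forward transform. A rough sketch of how such an FFT-based convolution is typically finished (the ComplexPointwiseMulAndScale kernel, launch configuration, and host buffer names are assumptions, not taken from the slides):

// Forward-transform the filter kernel too
cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_filter_kernel,
                           (cufftComplex *)d_filter_kernel, CUFFT_FORWARD));

// Multiply element-wise in the frequency domain; the 1/new_size factor
// compensates for CUFFT's unnormalized transforms (kernel assumed)
ComplexPointwiseMulAndScale<<<32, 256>>>(d_signal, d_filter_kernel,
                                         new_size, 1.0f / new_size);

// Inverse transform back to the time domain
cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_signal,
                           (cufftComplex *)d_signal, CUFFT_INVERSE));

// Copy the result back and clean up
cutilSafeCall(cudaMemcpy(h_convolved_signal, d_signal, mem_size,
                         cudaMemcpyDeviceToHost));
cufftSafeCall(cufftDestroy(plan));
cutilSafeCall(cudaFree(d_filter_kernel));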

11 FIR Realization on CUDA

12 (Figure: FIR computation mapped onto CUDA threads over time t)
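A minimal sketch of the per-thread mapping suggested by the figure, with one thread per output sample (the kernel name and launch configuration are illustrative assumptions, not the authors' actual implementation):

// Each thread computes one output sample y[i] = sum_k h[k] * x[i-k]
__global__ void fir_kernel(const float *x, const float *h, float *y,
                           int n, int taps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < taps; ++k)
        if (i - k >= 0)
            acc += h[k] * x[i - k];
    y[i] = acc;
}

// Launch with one thread per output sample, e.g.:
//   fir_kernel<<<(n + 255) / 256, 256>>>(d_x, d_h, d_y, n, taps);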

13 CUDA Demo (FIR)  GPU: NVIDIA GeForce 8600 GT  CPU: Intel Duo CPU, 2.33 GHz  Software: Visual Studio 2005

14 CUDA Demo (FIR)

15 Auto-Parallelizing Tools  Par4All (open-source environment): C and Fortran to CUDA C  PGI Accelerator: Fortran and C to CUDA C auto-parallelizing compiler  CAPS HMPP: C and Fortran to CUDA C auto-parallelizing compiler  Goose: C to CUDA C auto-parallelizing compiler  NOAA F2C: Fortran to CUDA C translator (a directive-style sketch follows below)
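The directive-based entries in this list (PGI Accelerator, CAPS HMPP) work by annotating ordinary loops and letting the compiler generate the CUDA C, kernel launches, and data transfers. A rough sketch in the OpenACC directive style that grew out of the PGI Accelerator model (the saxpy loop itself is illustrative):

// The compiler offloads this loop to the GPU, generating the CUDA
// kernel and the host/device data movement automatically.
void saxpy(int n, float a, const float *x, float *y)
{
#pragma acc kernels loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}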

16  Par4All (open-source environment): C and Fortran to CUDA C

