Team Programming Project Byunghyun (Byung) Jang Ph.D student Northeastern University Jul CRA-W/CDC Careers in High Performance Systems (CHiPS) Mentoring Workshop July National Center for Supercomputing Applications (NCSA) at University of Illinois at Urbana-Champaign (UIUC)
CHiPS - Team Programming Project
Some words about me
▪ 4th-year Ph.D. student
▪ Born and raised in South Korea
▪ 34 years old (never too late to learn)
▪ B.S. in mechanical engineering and M.S. in computer science
▪ Full-time engineer at Samsung Electronics for 3 years
▪ GPGPU
▪ Internship at AMD and fellowship from AMD
▪ Happy

Goals
▪ Understand general-purpose computing on GPUs (a.k.a. GPGPU)
▪ Experience CUDA GPU programming
▪ Understand how massively multi-threaded parallel programming works
▪ Think about solving a problem in a parallel fashion
▪ Experience the tremendous computational power of the GPU
▪ Experience the challenges of efficient parallel programming

Outline
▪ Application 1: Image Rotation
  ▪ Introduction and Design (15 min)
  ▪ Preparation (5 min)
    ▪ Installing the skeleton code, compile test, image view test
  ▪ Hands-on Programming (30 min)
    ▪ Replace ??? with your own CUDA code
▪ Application 2: Histogram
  ▪ Introduction and Design (15 min)
  ▪ Preparation (5 min)
    ▪ Installing the skeleton code, compile test
  ▪ Hands-on Programming (40 min)
    ▪ Replace ??? with your own CUDA code
▪ Conclusion

Application 1: Image Rotation - Introduction -
[Figure: original input image and rotated output image]
▪ Rotate an image by a given angle
▪ A basic feature in image processing applications

Application 1: Image Rotation - Introduction -
▪ What the application does:
  Step 1. Compute a new location according to the rotation angle (trigonometric computation)
  Step 2. Read the pixel value at the original location
  Step 3. Write the pixel value to the new location computed in Step 1
▪ Create the same number of threads as the number of pixels
▪ Each thread takes care of moving one pixel
▪ Our goals are
  ▪ To understand how to use the GPU for data parallelism
  ▪ To know how to map threads to data

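The three steps above can be sketched as a CUDA kernel in which each thread moves one pixel. This is only a minimal sketch under stated assumptions, not the workshop's skeleton code: the names rotate_kernel, rotate_point and run_check, the rotation about the image center, the 8 x 8 block shape, and the 16 x 16 test image are all hypothetical choices made for illustration.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Step 1: rotate pixel (x, y) about the image center by `angle` radians and
// round to the nearest pixel. Hypothetical helper, callable on host and device.
__host__ __device__ inline void rotate_point(int x, int y, int width, int height,
                                             float angle, int *nx, int *ny)
{
    float cx = (width - 1) * 0.5f, cy = (height - 1) * 0.5f;
    float dx = x - cx, dy = y - cy;
    float s = sinf(angle), c = cosf(angle);
    *nx = (int)floorf(cx + dx * c - dy * s + 0.5f);
    *ny = (int)floorf(cy + dx * s + dy * c + 0.5f);
}

// One thread per pixel: compute the new location, read the source pixel,
// and write it to the new location (Steps 1 to 3).
__global__ void rotate_kernel(const unsigned char *src, unsigned char *dst,
                              int width, int height, float angle)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // threads outside the image do nothing

    int nx, ny;
    rotate_point(x, y, width, height, angle, &nx, &ny);   // Step 1
    if (nx >= 0 && nx < width && ny >= 0 && ny < height)
        dst[ny * width + nx] = src[y * width + x];          // Steps 2 and 3
}

// Rotates a small test image by 90 degrees and compares against a CPU loop
// that uses the same helper.
bool run_check()
{
    const int W = 16, H = 16;               // assumed toy image size
    const float angle = 1.57079632679f;     // 90 degrees in radians
    std::vector<unsigned char> h_src(W * H), h_dst(W * H, 0), h_ref(W * H, 0);
    for (int i = 0; i < W * H; i++) h_src[i] = (unsigned char)i;

    for (int y = 0; y < H; y++)             // CPU reference
        for (int x = 0; x < W; x++) {
            int nx, ny;
            rotate_point(x, y, W, H, angle, &nx, &ny);
            if (nx >= 0 && nx < W && ny >= 0 && ny < H)
                h_ref[ny * W + nx] = h_src[y * W + x];
        }

    unsigned char *d_src = NULL, *d_dst = NULL;
    cudaMalloc(&d_src, W * H);
    cudaMalloc(&d_dst, W * H);
    cudaMemcpy(d_src, h_src.data(), W * H, cudaMemcpyHostToDevice);
    cudaMemset(d_dst, 0, W * H);

    dim3 block(8, 8);                       // assumed block shape
    dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
    rotate_kernel<<<grid, block>>>(d_src, d_dst, W, H, angle);
    cudaError_t err = cudaMemcpy(h_dst.data(), d_dst, W * H, cudaMemcpyDeviceToHost);

    cudaFree(d_src);
    cudaFree(d_dst);
    return err == cudaSuccess && h_dst == h_ref;
}

int main()
{
    bool ok = run_check();
    printf("%s\n", ok ? "GPU result matches CPU reference" : "mismatch");
    return ok ? 0 : 1;
}
```

A 90-degree angle is used in the check because every source pixel then lands on a distinct destination pixel. For arbitrary angles, this forward mapping can leave some destination pixels unwritten and can send two source pixels to the same destination; handling that is outside the scope of this sketch.
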
Application 1: Image Rotation - Design -
[Figure: the image is mapped onto a 64 x 64 grid of thread blocks, labeled Thread Block (0, 0), Thread Block (0, 1), ..., Thread Block (0, 63), ..., Thread Block (63, 0), ..., Thread Block (63, 63); annotations: "512 Threads", "Mapping"]

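To make the mapping in the diagram concrete, the sketch below launches a 64 x 64 grid of thread blocks and lets every thread work out which pixel it owns. The 8 x 8 block shape (so that 64 blocks x 8 threads gives 512 threads per dimension, enough for a 512 x 512 image) is an assumption read off the diagram's labels, and mark_owner and run_check are hypothetical names.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread derives the pixel it owns from its block and thread indices
// and increments that pixel's counter.
__global__ void mark_owner(int *hits, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column of this thread's pixel
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row of this thread's pixel
    if (x < width && y < height)
        hits[y * width + x] += 1;   // no atomics needed: one thread per pixel
}

// Returns true if every pixel of a 512 x 512 image is covered exactly once.
bool run_check()
{
    const int W = 512, H = 512;     // assumed image size
    const dim3 block(8, 8);         // assumed: 8 x 8 threads per block
    const dim3 grid(64, 64);        // 64 x 64 thread blocks, as in the diagram

    int *d_hits = NULL;
    cudaMalloc(&d_hits, W * H * sizeof(int));
    cudaMemset(d_hits, 0, W * H * sizeof(int));
    mark_owner<<<grid, block>>>(d_hits, W, H);

    std::vector<int> h_hits(W * H, -1);
    cudaError_t err = cudaMemcpy(h_hits.data(), d_hits, W * H * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_hits);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < W * H; i++)
        if (h_hits[i] != 1) return false;
    return true;
}

int main()
{
    bool ok = run_check();
    printf("%s\n", ok ? "every pixel is owned by exactly one thread"
                      : "mapping check failed");
    return ok ? 0 : 1;
}
```

The bounds test in the kernel is redundant for this exact size, but it keeps the same mapping safe when the image dimensions are not a multiple of the block shape.
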
Application 1: Image Rotation - Preparation -
1. Deploy the skeleton code in the proper directory
    ~]$ cp /tmp/projects.tar ./
    ~]$ cp /tmp/cuda.pdf ./
    ~]$ tar -xf projects.tar
2. Request a cluster node for interactive use for 2 hours
    ~]$ qsub -I -l walltime=02:00:00
3. Compile
    ~]$ cd PROJECTS/projects/ImageRotation
    ~]$ make clean
    ~]$ make
    To use printf() to debug, use “make emu=1” instead of “make”
4. Execute
    ~]$ ./ImageRotation
5. Convert the image from “pgm” to “jpg” format
    ~]$ convert data/lena_out.pgm data/lena_out.jpg
6. Download “lena_out.jpg” to your laptop to view it
Download for your future reference

Application 1: Image Rotation - Hands-on Programming -
▪ Replace ??? in the skeleton code with your own CUDA code
▪ Refer to the hints and comments in the skeleton code
▪ Talk to me if you have any questions or are done
▪ Try to finish by 2:30 pm
▪ Help others if you finish early

Application 2: Histogram - Introduction -
[Figure: input image and its output histogram; x-axis: intensity, from 0 (black) to 255 (white); y-axis: number of pixels]
▪ A histogram shows how often each intensity value occurs among the pixels of an image
▪ A commonly used analysis tool in image processing and data mining applications

Application 2: Histogram - Introduction -
▪ The serial implementation looks like this:
    unsigned char data[DATA_COUNT];      // input data (element type assumed)
    unsigned int  histogram[BIN_COUNT];  // histogram data (element type assumed)
    for (int i = 0; i < BIN_COUNT; i++)
        histogram[i] = 0;                // initialization
    for (int i = 0; i < DATA_COUNT; i++)
        histogram[data[i]]++;            // update the corresponding bin
▪ Access to data[] is sequential, but access to histogram[] is random because the bin index depends on each data value
▪ Therefore, we will use fast shared memory to store a per-block sub-histogram (s_hist[]), because shared memory handles random access much more efficiently than global memory does

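One way to picture the per-block sub-histogram is the sketch below, in which each thread block accumulates into its own s_hist[] in shared memory and then writes it out as one row of d_result[]. It uses atomicAdd on shared memory, which is a simplification chosen here and not necessarily what the skeleton code does. BIN_COUNT, THREAD_N, s_hist and d_result follow the slides; hist_kernel, run_check, DATA_COUNT = 10000, BLOCK_N = 4 and the synthetic input are assumptions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BIN_COUNT  64      // from the slides
#define THREAD_N   192     // from the slides
#define BLOCK_N    4       // assumed number of thread blocks
#define DATA_COUNT 10000   // assumed input size

// Each block builds a sub-histogram in shared memory, then stores it as
// row blockIdx.x of d_result. Every data value is assumed to be < BIN_COUNT.
__global__ void hist_kernel(const unsigned char *data, int n, unsigned int *d_result)
{
    __shared__ unsigned int s_hist[BIN_COUNT];

    for (int i = threadIdx.x; i < BIN_COUNT; i += blockDim.x)
        s_hist[i] = 0;                        // clear the sub-histogram
    __syncthreads();

    // Grid-stride loop: sequential reads of data[], random updates of s_hist[].
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&s_hist[data[i]], 1u);
    __syncthreads();

    for (int i = threadIdx.x; i < BIN_COUNT; i += blockDim.x)
        d_result[blockIdx.x * BIN_COUNT + i] = s_hist[i];
}

// Compares the summed per-block rows against the serial loop.
bool run_check()
{
    std::vector<unsigned char> h_data(DATA_COUNT);
    std::vector<unsigned int> h_ref(BIN_COUNT, 0), h_result(BLOCK_N * BIN_COUNT, 0);
    for (int i = 0; i < DATA_COUNT; i++) {
        h_data[i] = (unsigned char)((i * 7 + 3) % BIN_COUNT);   // synthetic input
        h_ref[h_data[i]]++;                                     // serial histogram
    }

    unsigned char *d_data = NULL;
    unsigned int *d_result = NULL;
    cudaMalloc(&d_data, DATA_COUNT);
    cudaMalloc(&d_result, BLOCK_N * BIN_COUNT * sizeof(unsigned int));
    cudaMemcpy(d_data, h_data.data(), DATA_COUNT, cudaMemcpyHostToDevice);

    hist_kernel<<<BLOCK_N, THREAD_N>>>(d_data, DATA_COUNT, d_result);
    cudaError_t err = cudaMemcpy(h_result.data(), d_result,
                                 BLOCK_N * BIN_COUNT * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_result);
    if (err != cudaSuccess) return false;

    for (int bin = 0; bin < BIN_COUNT; bin++) {
        unsigned int sum = 0;
        for (int b = 0; b < BLOCK_N; b++) sum += h_result[b * BIN_COUNT + bin];
        if (sum != h_ref[bin]) return false;
    }
    return true;
}

int main()
{
    bool ok = run_check();
    printf("%s\n", ok ? "GPU histogram matches serial histogram" : "mismatch");
    return ok ? 0 : 1;
}
```

In this sketch the per-block rows are summed on the host to keep the kernel short. Shared-memory atomics need a GPU that supports them; an approach without atomics would give each thread its own private counters and merge them afterwards.
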
Application 2: Histogram - Design -
▪ The structure of shared memory looks like the following
▪ Note that shared memory is allocated per thread block and is limited in size
[Figure: data[DATA_COUNT], labeled "64 data elements", and s_hist[] in shared memory]

Application 2: Histogram - Design -
▪ Merging per-thread histograms into a per-block histogram
[Figure: s_hist[] in shared memory (per block), with BIN_COUNT = 64 and THREAD_N = 192; d_result[], with dimensions labeled "# of thread blocks" and "BIN_COUNT"; final histogram]

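The last stage in the diagram, going from d_result[] (one row of BIN_COUNT counters per thread block) to the final histogram, could look like the sketch below, where one thread per bin adds up that bin across all rows. The row-major layout of d_result, the names merge_kernel, d_hist and run_check, and BLOCK_N = 8 are assumptions for illustration; the skeleton code may organize this step differently.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BIN_COUNT 64   // from the slides
#define BLOCK_N   8    // assumed number of per-block partial histograms

// One thread per bin: sum that bin over all per-block rows of d_result.
__global__ void merge_kernel(const unsigned int *d_result, int block_n,
                             unsigned int *d_hist)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin >= BIN_COUNT) return;

    unsigned int sum = 0;
    for (int b = 0; b < block_n; b++)
        sum += d_result[b * BIN_COUNT + bin];   // row b, column bin (assumed layout)
    d_hist[bin] = sum;
}

// Fills d_result with known values and checks the merged sums on the host.
bool run_check()
{
    std::vector<unsigned int> h_result(BLOCK_N * BIN_COUNT), h_hist(BIN_COUNT, 0);
    for (int b = 0; b < BLOCK_N; b++)
        for (int bin = 0; bin < BIN_COUNT; bin++)
            h_result[b * BIN_COUNT + bin] = (unsigned int)(b + bin);   // synthetic counts

    unsigned int *d_result = NULL, *d_hist = NULL;
    cudaMalloc(&d_result, BLOCK_N * BIN_COUNT * sizeof(unsigned int));
    cudaMalloc(&d_hist, BIN_COUNT * sizeof(unsigned int));
    cudaMemcpy(d_result, h_result.data(),
               BLOCK_N * BIN_COUNT * sizeof(unsigned int), cudaMemcpyHostToDevice);

    merge_kernel<<<1, BIN_COUNT>>>(d_result, BLOCK_N, d_hist);
    cudaError_t err = cudaMemcpy(h_hist.data(), d_hist,
                                 BIN_COUNT * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    cudaFree(d_hist);
    if (err != cudaSuccess) return false;

    for (int bin = 0; bin < BIN_COUNT; bin++) {
        unsigned int expected = 0;
        for (int b = 0; b < BLOCK_N; b++) expected += (unsigned int)(b + bin);
        if (h_hist[bin] != expected) return false;
    }
    return true;
}

int main()
{
    bool ok = run_check();
    printf("%s\n", ok ? "merged histogram matches host sums" : "mismatch");
    return ok ? 0 : 1;
}
```

With this layout, consecutive threads read consecutive addresses within each row of d_result, which is generally the friendlier access pattern for global memory.
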
Application 2: Histogram - Preparation -
1. Compile
    ~]$ cd PROJECTS/projects/Histogram
    ~]$ make clean
    ~]$ make
    To use printf() to debug, use “make emu=1” instead of “make”
2. Execute
    ~]$ ./Histogram
3. Check the output message
    “*** TEST FAILED”: something is wrong
    “*** TEST PASSED”: you got it

Application 2: Histogram - Hands-on Programming -
▪ Replace ??? in the skeleton code with your own CUDA code
▪ Refer to the hints and comments in the skeleton code
▪ Talk to me if you have any questions or are done
▪ Try to finish by 3:30 pm
▪ Help others if you finish early

Conclusions
▪ What we’ve learned throughout the two projects
  ▪ Understood massively parallel computing on the GPU
  ▪ Experienced what CUDA programming looks like
  ▪ Understood how to explicitly program hardware resources
  ▪ Understood the importance and challenges of parallel programming
  ▪ Experienced solving a problem in a massively parallel fashion
▪ The GPU is the platform of choice for data-parallel, computationally intensive applications
▪ In a few years, we are likely to see many people buying a new graphics card to increase their desktop’s computing performance, not its 3D game performance

Thank you!