Team Programming Project Byunghyun (Byung) Jang, Ph.D. student, Northeastern University, Jul. 26, 2009, CRA-W/CDC Careers in High Performance Systems (CHiPS)


Team Programming Project
Byunghyun (Byung) Jang, Ph.D. student, Northeastern University
Jul. 26, 2009
CRA-W/CDC Careers in High Performance Systems (CHiPS) Mentoring Workshop
National Center for Supercomputing Applications (NCSA) at University of Illinois at Urbana-Champaign (UIUC)

CHiPS - Team Programming Project Some words about me
▪ 4th-year Ph.D. student
▪ Born and raised in South Korea
▪ 34 years old (never too late to learn)
▪ B.S. in mechanical engineering and M.S. in computer science
▪ Full-time engineer at Samsung Electronics for 3 years
▪ GPGPU
▪ Internship at AMD and fellowship from AMD
▪ Happy

CHiPS - Team Programming Project Goals
▪ Understand general-purpose computing on the GPU (a.k.a. GPGPU)
▪ Experience CUDA GPU programming
▪ Understand how massively multi-threaded parallel programming works
▪ Think about solving a problem in a parallel fashion
▪ Experience the tremendous computational power of the GPU
▪ Experience the challenges of efficient parallel programming

CHiPS - Team Programming Project Outline
▪ Application 1: Image Rotation
  ▪ Introduction and Design (15 min)
  ▪ Preparation (5 min)
    ▪ Install the skeleton code, compile test, image view test
  ▪ Hands-on Programming (30 min)
    ▪ Replace ??? with your own CUDA code
▪ Application 2: Histogram
  ▪ Introduction and Design (15 min)
  ▪ Preparation (5 min)
    ▪ Install the skeleton code, compile test
  ▪ Hands-on Programming (40 min)
    ▪ Replace ??? with your own CUDA code
▪ Conclusion

CHiPS - Team Programming Project Application 1: Image Rotation - Introduction -
[Figure: original input image and rotated output image]
▪ Rotate an image by a given angle
▪ A basic feature in image processing applications

CHiPS - Team Programming Project Application 1: Image Rotation - Introduction -
▪ What the application does:
  Step 1. Compute each pixel's new location according to the rotation angle (trigonometric computation)
  Step 2. Read the pixel value at the original location
  Step 3. Write the pixel value to the new location computed in Step 1
▪ Create the same number of threads as the number of pixels
▪ Each thread takes care of moving one pixel
▪ Our goals are
  ▪ To understand how to use the GPU for data parallelism
  ▪ To know how to map threads to data
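▪ For reference, Step 1 is the standard 2-D rotation about the image center. A minimal sketch of that computation (all variable names here are illustrative assumptions, not the skeleton's):

   // Rotate pixel (x, y) by angle theta about the image center (cx, cy).
   float xr = (float)x - cx;
   float yr = (float)y - cy;
   int xNew = (int)(xr * cosf(theta) - yr * sinf(theta) + cx);   // Step 1
   int yNew = (int)(xr * sinf(theta) + yr * cosf(theta) + cy);
   // Steps 2-3: after a bounds check, copy in[y * width + x]
   // to out[yNew * width + xNew].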

CHiPS - Team Programming Project Application 1: Image Rotation - Design -
[Diagram: a 64 × 64 grid of thread blocks, Thread Block (0, 0) through Thread Block (63, 63), tiling the 512-pixel-wide image; each thread maps to one pixel]
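▪ To make the mapping concrete, a thread-per-pixel kernel could look like the sketch below. This is a hedged illustration (the names rotateKernel, d_in, d_out and the launch geometry are assumptions), not the workshop skeleton itself:

   // One thread per pixel: each thread computes its pixel's new location
   // (Step 1), reads the original value (Step 2), and writes it out (Step 3).
   __global__ void rotateKernel(const unsigned char *d_in, unsigned char *d_out,
                                int width, int height, float theta)
   {
       int x = blockIdx.x * blockDim.x + threadIdx.x;   // pixel column
       int y = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row
       if (x >= width || y >= height) return;

       float cx = 0.5f * width, cy = 0.5f * height;
       float xr = x - cx, yr = y - cy;
       int xNew = (int)(xr * cosf(theta) - yr * sinf(theta) + cx);
       int yNew = (int)(xr * sinf(theta) + yr * cosf(theta) + cy);

       if (xNew >= 0 && xNew < width && yNew >= 0 && yNew < height)
           d_out[yNew * width + xNew] = d_in[y * width + x];
   }

   // Example launch for a 512 x 512 image, matching the 64 x 64 block grid above:
   // dim3 block(8, 8);                       // 64 threads per block
   // dim3 grid(width / 8, height / 8);       // 64 x 64 thread blocks
   // rotateKernel<<<grid, block>>>(d_in, d_out, width, height, theta);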

CHiPS - Team Programming Project Application 1: Image Rotation - Preparation -
1. Deploy the skeleton code in the proper directory
   ~]$ cp /tmp/projects.tar ./
   ~]$ cp /tmp/cuda.pdf ./        (download “cuda.pdf” for your future reference)
   ~]$ tar -xf projects.tar
2. Request a cluster node for interactive use for 2 hours
   ~]$ qsub -I -l walltime=02:00:00
3. Compile
   ~]$ cd PROJECTS/projects/ImageRotation
   ~]$ make clean
   ~]$ make
   To use printf() for debugging, build with “make emu=1” instead of “make”
4. Execute
   ~]$ ./ImageRotation
5. Convert the output image from “pgm” to “jpg” format
   ~]$ convert data/lena_out.pgm data/lena_out.jpg
6. Download “lena_out.jpg” to your laptop to view it

CHiPS - Team Programming Project Application 1: Image Rotation - Hands-on Programming -
▪ Replace ??? in the skeleton code with your own CUDA code
▪ Refer to the hints and comments in the skeleton code
▪ Talk to me if you have any questions or when you are done
▪ Try to finish by 2:30 pm
▪ Help others if you finish early

CHiPS - Team Programming Project Application 2: Histogram - Introduction -
[Figure: input image and output histogram; x-axis: intensity, from 0 (black) to 255 (white); y-axis: number of pixels]
▪ Shows the frequency of occurrence of each pixel intensity value
▪ A commonly used analysis tool in image processing and data mining applications

CHiPS - Team Programming Project Application 2: Histogram - Introduction -
▪ A serial implementation looks like this:

   unsigned char data[DATA_COUNT];      // input data
   unsigned int  histogram[BIN_COUNT];  // histogram data

   for (int i = 0; i < BIN_COUNT; i++)
       histogram[i] = 0;                // initialization

   for (int i = 0; i < DATA_COUNT; i++)
       histogram[ data[i] ]++;          // update the corresponding bin

▪ Access to data[] is sequential, but access to histogram[] is random, depending on the data values
▪ We will therefore use fast shared memory to store a per-block sub-histogram (s_hist[]), because shared memory handles random memory access much more efficiently than global memory does
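▪ As a rough CUDA sketch of this idea (simplified: the skeleton's actual scheme, with per-thread sub-histograms, differs in detail; shared-memory atomicAdd() requires compute capability 1.2 or later; names other than s_hist[], d_result[], and BIN_COUNT are assumptions):

   #define BIN_COUNT 64   // 64 bins, as in the project

   // Phase 1: each block accumulates a sub-histogram in fast shared memory,
   // then writes it to its own row of d_result[] (BLOCK_N rows x BIN_COUNT bins).
   __global__ void histogramKernel(const unsigned char *d_data, int dataCount,
                                   unsigned int *d_result)
   {
       __shared__ unsigned int s_hist[BIN_COUNT];

       // Cooperatively zero the per-block sub-histogram.
       for (int i = threadIdx.x; i < BIN_COUNT; i += blockDim.x)
           s_hist[i] = 0;
       __syncthreads();

       // Grid-stride loop: the random accesses land in shared memory, not global.
       for (int i = blockIdx.x * blockDim.x + threadIdx.x;
            i < dataCount;
            i += gridDim.x * blockDim.x)
           atomicAdd(&s_hist[d_data[i]], 1u);
       __syncthreads();

       // Export this block's sub-histogram for the merge step (next slides).
       for (int i = threadIdx.x; i < BIN_COUNT; i += blockDim.x)
           d_result[blockIdx.x * BIN_COUNT + i] = s_hist[i];
   }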

CHiPS - Team Programming Project Application 2: Histogram - Design -
▪ The structure of shared memory would look like the following
▪ Note that shared memory is per thread block and limited in size
[Diagram: the input data[DATA_COUNT], in chunks of 64 data elements, feeding each block's shared-memory sub-histogram s_hist[]]

CHiPS - Team Programming Project Application 2: Histogram - Design -
▪ Merging per-thread histograms into a per-block histogram
[Diagram: each block's shared-memory s_hist[] (BIN_COUNT = 64 bins, THREAD_N = 192 threads) is written into d_result[], which holds (# of thread blocks) × BIN_COUNT entries and is then reduced into the final BIN_COUNT-bin histogram]
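▪ The final reduction can be its own small kernel; a hedged sketch (kernel and parameter names are assumptions; BIN_COUNT = 64 as in the accumulation sketch), with one thread per bin summing that bin across all per-block rows of d_result[]:

   // Phase 2: reduce the BLOCK_N x BIN_COUNT partial results into the
   // final BIN_COUNT-bin histogram, one thread per bin.
   __global__ void mergeHistogramKernel(const unsigned int *d_result,
                                        unsigned int *d_hist, int blockN)
   {
       int bin = blockIdx.x * blockDim.x + threadIdx.x;
       if (bin >= BIN_COUNT) return;

       unsigned int sum = 0;
       for (int b = 0; b < blockN; b++)
           sum += d_result[b * BIN_COUNT + bin];
       d_hist[bin] = sum;   // final histogram
   }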

CHiPS - Team Programming Project Application 2: Histogram - Preparation -
1. Compile
   ~]$ cd PROJECTS/projects/Histogram
   ~]$ make clean
   ~]$ make
   To use printf() for debugging, build with “make emu=1” instead of “make”
2. Execute
   ~]$ ./Histogram
3. Check the output message
   “*** TEST FAILED”: something is wrong
   “*** TEST PASSED”: you got it

CHiPS - Team Programming Project Application 2: Histogram - Hands-on Programming -
▪ Replace ??? in the skeleton code with your own CUDA code
▪ Refer to the hints and comments in the skeleton code
▪ Talk to me if you have any questions or when you are done
▪ Try to finish by 3:30 pm
▪ Help others if you finish early

CHiPS - Team Programming Project Conclusions
▪ What we've learned throughout the two projects
  ▪ Understood massively parallel computing on the GPU
  ▪ Experienced what CUDA programming looks like
  ▪ Understood how to explicitly program hardware resources
  ▪ Understood the importance of, and the challenges in, parallel programming
  ▪ Experienced solving a problem in a massively parallel fashion
▪ The GPU is the platform of choice for data-parallel, computationally intensive applications
▪ In a few years, we are likely to see many people buying a new graphics card to increase their desktop's computing performance, not its 3D game performance

CHiPS - Team Programming Project Thank you!