Team Members: Tyler Drake, Robert Wrisley, Kyle Von Koepping, Justin Walsh. Faculty Advisors: Computer Science – Prof. Sanjay Rajopadhye; Electrical & Computer Engineering – Prof. Olivera Notaros

Project Goals: Develop parallel versions of applications that will run on a graphics card and measure their performance.
– Started with a simple Matrix Multiply program.
– We intend to develop at least one or two additional applications and to pursue an analysis of hardware optimizations.
– Develop a process for tuning applications & hardware that other developers can follow more easily.

Tyler Drake – Computer Science major
Robert Wrisley – Computer Science/Computer Engineering dual major
Kyle Von Koepping – Electrical Engineering major
Justin Walsh – Computer Science/Computer Engineering dual major
Shared coding responsibilities
– Enables comparison and greater understanding for all team members
– Possibly divide responsibilities for the second half of the project

Moore's Law: transistor densities on single-core processors were doubling approximately every 18 months. This trend has held since it was first observed in 1965 and is expected to hold for several more years. This natural trend had become the standard goal for hardware companies.

There is an ultimate limit to Moore's law: transistors will soon approach atomic scale. Moore's law also does not apply to Random Access Memory (RAM) speeds or hard drive seek times (the so-called Memory Wall). The redesign of processor architecture isn't driven directly by Moore's law, but by the fact that these and other factors have not kept pace with its growth rate.

The CPU (or multiple CPUs) is not the only processor found in a personal computer. The graphics card has a graphics processing unit (GPU). The GPU is specifically designed to render 3D models onto a 2D display, and it is built for floating point computation with a highly parallel architecture.

Engineers have begun to exploit the highly parallel architecture of the GPU for general applications, and graphics companies encourage general purpose computing on the GPU (GPGPU). Nvidia has developed CUDA (Compute Unified Device Architecture). Because CUDA is based on the C language, programmers can shift to developing on the GPU relatively easily.
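To give a feel for what this extension looks like (a generic illustration, not code from this project), a CUDA kernel is an ordinary C function marked with __global__ and launched over many threads with the <<<blocks, threads>>> syntax:

#include <cuda_runtime.h>

// Each GPU thread adds one pair of elements; the grid of threads covers the array.
__global__ void vector_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard against extra threads
        c[i] = a[i] + b[i];
}

// Host-side launch: 256 threads per block, enough blocks to cover n elements.
// d_a, d_b, d_c are assumed to be device pointers allocated beforehand.
// vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);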

What We Have Done So Far

Learning about CUDA
– NVIDIA CUDA guides
– Lecture slides from the University of Illinois at Urbana-Champaign
– Papers from various academic groups: University of Illinois at Urbana-Champaign, Tokyo Institute of Technology, University of California at Berkeley
Learning to write parallel programs in CS475 using MPI & OpenMP
Writing simple programs using CUDA and observing performance
– Matrix Multiply

Results
– Achieved 131 Gigaflops/sec on a GTX 280; the GTX 280 peak is 933 Gigaflops/sec.
Optimizations
– Tiling the result matrix into smaller sub-matrices and having each thread block compute one sub-matrix reduces the amount of data each thread block needs to load.
– This helps to reduce memory latency.
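To illustrate the tiling idea (a hypothetical sketch, not the project's actual kernel), each thread block stages a TILE × TILE sub-block of the input matrices in fast on-chip shared memory, so each global-memory element is read once per tile instead of once per multiply-add. The sketch assumes square N × N matrices with N a multiple of TILE.

#define TILE 16

// C = A * B; each block computes one TILE x TILE tile of C.
// Launch: matmul_tiled<<<dim3(N/TILE, N/TILE), dim3(TILE, TILE)>>>(A, B, C, N);
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of each tile from global memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                      // wait until the tile is fully loaded

        for (int k = 0; k < TILE; ++k)        // partial dot product from shared memory
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // wait before overwriting the tile
    }
    C[row * N + col] = sum;
}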

Memory
– Must allocate memory on the graphics card from the main program running on the CPU
– Memory for the graphics card is explicitly managed by the programmer
An "extension" to C, not a separate language
– Similar to MPI, OpenMP, etc.
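The host-side pattern looks roughly like this (a generic sketch using standard CUDA runtime calls; h_data, my_kernel, blocks, and threads are placeholder names, not the project's code):

size_t bytes = n * sizeof(float);
float *d_data;                                              // pointer into GPU memory

cudaMalloc(&d_data, bytes);                                 // allocate on the graphics card
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // copy input from CPU to GPU

my_kernel<<<blocks, threads>>>(d_data, n);                  // run the kernel on the GPU

cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // copy results back to the CPU
cudaFree(d_data);                                           // release GPU memory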

Increasing problem complexity  Some are no longer “Pleasantly Parallel”  Higher degree of kernel analysis  Moving to more dynamic programs

Additional programs being written for the GPU include:
– Scan: a prefix-sum computation where the i-th output element is the sum of the previous i-1 input elements
– Knapsack: profit maximization given a capacity and a list of items with their weight & profit
– Matrix Multiply for still larger matrices
– Triangular Matrix Multiplication
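As a sketch of how scan maps onto the GPU (not the project's implementation), a single thread block can compute an inclusive prefix sum with the classic doubling scheme: in each round, every thread adds in the value offset positions to its left. The exclusive variant the slide describes is the same result shifted by one position.

// Inclusive scan of up to blockDim.x elements within one block.
// Launch: inclusive_scan_block<<<1, n, n * sizeof(float)>>>(d_in, d_out, n);
__global__ void inclusive_scan_block(const float *in, float *out, int n)
{
    extern __shared__ float temp[];            // dynamically sized shared buffer
    int tid = threadIdx.x;
    if (tid < n) temp[tid] = in[tid];
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        float val = 0.0f;
        if (tid >= offset && tid < n)
            val = temp[tid - offset];          // read before anyone overwrites
        __syncthreads();
        if (tid >= offset && tid < n)
            temp[tid] += val;                  // add in the partial sum
        __syncthreads();
    }
    if (tid < n) out[tid] = temp[tid];
}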

Mandelbrot Set  Pleasantly parallel, familiar  Easily scalable
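Because every pixel of the Mandelbrot set is independent, a hypothetical kernel can simply assign one thread per pixel (the viewing window and iteration limit here are illustrative choices, not the project's):

// Each thread computes the escape count for one pixel of a width x height image.
__global__ void mandelbrot(int *counts, int width, int height, int max_iter)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    // Map the pixel to a point c in the complex plane: [-2, 1] x [-1.5, 1.5].
    float cr = -2.0f + 3.0f * px / width;
    float ci = -1.5f + 3.0f * py / height;

    float zr = 0.0f, zi = 0.0f;
    int iter = 0;
    while (zr * zr + zi * zi <= 4.0f && iter < max_iter) {
        float tmp = zr * zr - zi * zi + cr;    // z = z^2 + c, real part
        zi = 2.0f * zr * zi + ci;              //             imaginary part
        zr = tmp;
        ++iter;
    }
    counts[py * width + px] = iter;            // iteration count colors the pixel
}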

Ray Tracing  Very computationally intensive  Feasible for non-realtime computations  Very dynamic, due to recursion  High degree of realism

Examples of images generated by Ray Tracing

Hidden Markov Models  Clear parallelism  Wide range of applications

Uses of Hidden Markov Models

To develop a more complex application for the GPU and optimize its performance. To analyze hardware optimizations and evaluate the performance gains. To develop a process for future programmers that will give them the best performance increases with the minimum development effort. Please note: these goals are tentative and subject to change.

Moore's Law is now being applied to cores per processor instead of transistors per processor. Multi-core machines offer the next generation of performance enhancements… and they are already here! GPUs provide massively parallel architectures that programmers can take advantage of to see phenomenal performance gains.

Learning to use the CUDA library and some of its nuances. Have gotten good performance on Matrix Multiply attempts. Also completing CUDA versions of the Scan and Knapsack problems. Move on to a more complex application. Researching hardware optimizations that can further enhance performance on GPUs. Develop a combined approach for future applications programmers to follow.

$50 spent on a graphics card that is CUDA compatible. We'd like to thank Prof. Dan Connors for the use of his machines with Nvidia GTX 280 graphics cards.
– This provided us free access to a consistent build for all of us to run our code and sample code on.
We don't project any major costs next semester, except perhaps for some materials for our E-Days presentation.