MASS CUDA Performance Analysis and Improvement

MASS CUDA Performance Analysis and Improvement
Ahmed Musazay
Faculty Advisor: Dr. Munehiro Fukuda

MASS (Multi-Agent Spatial Simulation)
- A library that allows non-computing specialists to parallelize simulations
- Built on the concept of Place and Agent objects
- Three versions: C++, Java, and CUDA
- Provides a high-level abstraction to non-computing specialists

CUDA
- A C/C++ extension by NVIDIA: a heterogeneous parallel programming interface
- Host (CPU) and Device (GPU)
- Functions executing on the GPU are called kernel functions
- Kernel launches take configuration parameters for the number of threads (sketched below)
- Fast, but difficult to use and hard to tune for performance
- MASS CUDA aims to utilize this performance while keeping the high-level abstraction of MASS
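A minimal sketch of the host/device split and a kernel launch with its configuration parameters; the kernel, its name, and the problem size are illustrative only and not taken from MASS CUDA:

// A kernel function: runs on the device (GPU), one thread per element.
__global__ void scale(double *data, double factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    double *d_data;
    cudaMalloc(&d_data, n * sizeof(double));          // allocate on the device
    int threads = 256;                                // threads per block
    int blocks = (n + threads - 1) / threads;         // enough blocks to cover n
    scale<<<blocks, threads>>>(d_data, 0.5, n);       // launch configuration parameters
    cudaDeviceSynchronize();                          // wait for the kernel to finish
    cudaFree(d_data);
    return 0;
}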

MASS CUDA
- Current version written by Nathaniel Hart for his Master's thesis
- Ported the C++ version of MASS to CUDA
- Object oriented: allows users to extend the Place and Agent objects
- Designed with the intention of using multiple GPU cards

Problem
- Performance issues; difficult to tune performance
Goals of the project:
- Understand the MASS CUDA library and how it works
- Write unit tests to find where the performance issues occur
- Propose solutions that can be implemented to increase the performance of MASS CUDA

Heat2D Fourier’s heat equation Place objects – Metal Simulation describing spread of heat in a given region over period of time Place objects – Metal Ran at four different sizes 250x250, 500x500, 1000x1000, 2000x2000

Test Case: Running Heat2D - Primitive Array
- Heat2D simulation using a flat array of doubles
- No objects created to contain information, as opposed to MASS
- Simulation functions written directly as kernel functions (a sketch follows below)
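A hedged sketch of what such a primitive-array kernel might look like; it is not the exact test code, and the names and the diffusion ratio r are placeholders:

// Illustrative stencil kernel over a flat, row-major array of doubles.
// Each interior cell is updated from its four neighbors.
__global__ void heatStep(const double *in, double *out,
                         int width, int height, double r) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;
    int idx = y * width + x;                        // (x, y) -> unique array element
    out[idx] = in[idx] + r * (in[idx - 1] + in[idx + 1] +
                              in[idx - width] + in[idx + width] - 4.0 * in[idx]);
}

The host side would allocate two device buffers, launch heatStep once per time step while swapping the buffers, and copy the result back when the simulation ends.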

Results

Proposed Solution
- Store all data in MASS as arrays of user-defined primitive types
- Index mapping to a unique element (sketched below)
Pros:
- Fast accesses
- Can run larger simulations, since less heap memory overhead is required
Cons:
- Reduced user programmability
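The index-mapping idea can be sketched as a simple row-major mapping; the function name is illustrative:

// Row-major mapping from a 2D place coordinate to a unique index in a flat array.
__host__ __device__ inline int placeIndex(int x, int y, int width) {
    return y * width + x;
}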

Test Case: Running Heat2D - Place Objects
- Ran the simulation with the same objects used in MASS, but without using library function calls
- Metal & MetalState derived from the library classes, containing the same memory and internal functions
- Simulation functions re-written in CUDA as kernel functions (a hypothetical sketch follows below)
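Purely as a hypothetical illustration of the object-based variant: MetalState is named on the slide, but its members and this kernel are assumptions, not the MASS CUDA API.

// Object-based variant of the stencil step: each place-like object carries its state.
struct MetalState {
    double temperature;      // assumed per-place heat value
};

__global__ void heatStepObjects(const MetalState *in, MetalState *out,
                                int width, int height, double r) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;
    int idx = y * width + x;
    out[idx].temperature = in[idx].temperature +
        r * (in[idx - 1].temperature + in[idx + 1].temperature +
             in[idx - width].temperature + in[idx + width].temperature -
             4.0 * in[idx].temperature);
}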

Results

Proposed Solution
- Remove unnecessary functionality that may be slowing the library down, such as:
  - Excessive memory transfers between host and device (see the transfer pattern sketched below)
  - Partitioning logic
Pros:
- Features can be added back one at a time, making sure each meets the performance standard
- More computation spent on the actual simulation rather than on management
Cons:
- The library's scalability will be missing early in development
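A hedged sketch of the transfer pattern this proposal favors: data stays resident on the device across iterations, with one upload before the loop and one download after it. The function is illustrative and reuses the heatStep kernel sketched in the primitive-array test case above.

// Keep the grid on the GPU for the whole run; avoid per-iteration copies.
void simulate(double *h_grid, int width, int height, int iterations, double r) {
    size_t bytes = (size_t)width * height * sizeof(double);
    double *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_grid, bytes, cudaMemcpyHostToDevice);    // single upload

    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    for (int t = 0; t < iterations; ++t) {
        heatStep<<<grid, block>>>(d_in, d_out, width, height, r);
        double *tmp = d_in; d_in = d_out; d_out = tmp;          // ping-pong device buffers
        // no cudaMemcpy here: the grid never leaves the GPU between steps
    }

    cudaMemcpy(h_grid, d_in, bytes, cudaMemcpyDeviceToHost);    // single download
    cudaFree(d_in);
    cudaFree(d_out);
}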

Test Case: Running Heat2D - Coalesced Accesses
- Ran the simulation using primitive values, but taking advantage of coalesced memory accesses
- Kernel functions take array parameters in their native dimension, e.g., a 2D array
- cudaMallocPitch(), cudaMalloc3D() (see the pitched-allocation example below)
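A hedged example of a pitched 2D allocation; the kernel body and sizes are illustrative:

// cudaMallocPitch pads each row so rows start on an alignment boundary,
// which helps keep a warp's accesses along a row coalesced.
__global__ void addOne(double *grid, size_t pitch, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    // A row's start is computed from the pitch (in bytes), not the logical width.
    double *row = (double *)((char *)grid + y * pitch);
    row[x] += 1.0;
}

int main() {
    int width = 1000, height = 1000;
    double *d_grid;
    size_t pitch;                                    // row stride in bytes, chosen by CUDA
    cudaMallocPitch((void **)&d_grid, &pitch, width * sizeof(double), height);
    dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
    addOne<<<grid, block>>>(d_grid, pitch, width, height);
    cudaDeviceSynchronize();
    cudaFree(d_grid);
    return 0;
}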

Results

Proposed Solution
- Let MASS run the simulation in its native dimension (1D, 2D, or 3D)
Pros:
- Faster memory accesses, increasing performance
Cons:
- Extra overhead of determining which dimension to run a function in
- Will only be able to natively run up to 3 dimensions (see the 3D launch example below)
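As a small illustration of why the limit is three dimensions: CUDA grids and blocks are themselves at most 3D, so a launch can map directly onto a 1D, 2D, or 3D space of places. The kernel and sizes below are illustrative only.

// A 3D grid of blocks maps one thread onto each cell of a 3D space.
__global__ void fill3D(double *data, int nx, int ny, int nz, double value) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;
    data[(z * ny + y) * nx + x] = value;             // 3D coordinate -> flat index
}

int main() {
    int nx = 64, ny = 64, nz = 64;
    double *d;
    cudaMalloc(&d, (size_t)nx * ny * nz * sizeof(double));
    dim3 block(8, 8, 8);
    dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);
    fill3D<<<grid, block>>>(d, nx, ny, nz, 1.0);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}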

Conclusion
- Remove unused features; implement one feature at a time
- Use coalesced memory accesses via native array dimensions
- Use primitive arrays
- Consider: shared memory (a tiling sketch follows below)
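The shared-memory idea could look roughly like this for the stencil; a hedged sketch, with the tile size and names as placeholders:

// Each block stages its tile, plus a one-cell halo, in on-chip shared memory
// so neighbor reads hit fast memory instead of global memory.
#define TILE 16

__global__ void heatStepShared(const double *in, double *out,
                               int width, int height, double r) {
    __shared__ double tile[TILE + 2][TILE + 2];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;   // +1 leaves room for the halo
    bool inGrid = (x < width && y < height);
    int idx = y * width + x;

    if (inGrid) {
        tile[ty][tx] = in[idx];
        // Edge threads also load the one-cell halo around the block's tile.
        if (threadIdx.x == 0 && x > 0)                 tile[ty][0]        = in[idx - 1];
        if (threadIdx.x == TILE - 1 && x < width - 1)  tile[ty][TILE + 1] = in[idx + 1];
        if (threadIdx.y == 0 && y > 0)                 tile[0][tx]        = in[idx - width];
        if (threadIdx.y == TILE - 1 && y < height - 1) tile[TILE + 1][tx] = in[idx + width];
    }
    __syncthreads();   // every thread reaches the barrier before any neighbor reads

    if (!inGrid || x == 0 || y == 0 || x == width - 1 || y == height - 1) return;
    out[idx] = tile[ty][tx] + r * (tile[ty][tx - 1] + tile[ty][tx + 1] +
                                   tile[ty - 1][tx] + tile[ty + 1][tx] -
                                   4.0 * tile[ty][tx]);
}

This would be launched with dim3 block(TILE, TILE) and a grid covering the place space, just like the unshared kernel.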

Final Words
Relevant courses:
- CSS 430 Operating Systems
- CSS 422 Hardware and Computer Organization
Special thanks to:
- Dr. Fukuda
- Nathaniel Hart