MASS CUDA Performance Analysis and Improvement
Ahmed Musazay
Faculty Advisor: Dr. Munehiro Fukuda
MASS: Multi-Agent Spatial Simulation
- Allows non-computing specialists to parallelize simulations
- Built around the concepts of Place and Agent objects
- Three versions: C++, Java, and CUDA
- Presents a high-level abstraction to non-computing specialists
CUDA
- C/C++ extension by NVIDIA: a heterogeneous parallel programming interface
- Host (CPU) and device (GPU)
- Functions executing on the GPU are called kernel functions
- Kernel launches take configuration parameters for the number of blocks and threads (see the sketch below)
- Fast, but difficult to use and hard to tune for performance
- Goal: exploit CUDA's performance while keeping the high-level abstraction of MASS
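As a minimal, self-contained illustration of a kernel launch and its configuration parameters (the kernel name, array, and sizes are illustrative only, not taken from MASS CUDA):

    // A kernel that scales an array of doubles, launched with an explicit
    // block/thread configuration.
    #include <cuda_runtime.h>

    __global__ void scale(double *data, double factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        double *d_data;
        cudaMalloc(&d_data, n * sizeof(double));
        cudaMemset(d_data, 0, n * sizeof(double));        // initialize device buffer
        int threads = 256;                                // threads per block
        int blocks = (n + threads - 1) / threads;         // enough blocks to cover n
        scale<<<blocks, threads>>>(d_data, 0.5, n);       // configuration parameters
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }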
MASS CUDA
- Current version written by Nathaniel Hart for his Master's thesis, porting the MASS C++ version to CUDA
- Object oriented: allows users to extend the Place and Agent objects
- Designed with the intention of using multiple GPU cards
Problem
- Performance issues
- Difficult to tune performance
Goals of the project:
- Understand the MASS library and how it works
- Write unit tests to find where performance issues occur
- Propose solutions that can be implemented to increase the performance of MASS CUDA
Heat2D
- Based on Fourier's heat equation
- Simulation describing the spread of heat in a given region over a period of time
- Place objects: Metal
- Ran at four different sizes: 250x250, 500x500, 1000x1000, and 2000x2000
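For reference, a standard explicit finite-difference update for the 2D heat equation (the exact coefficients used by the Heat2D benchmark are an assumption here) is

    T_{i,j}^{t+1} = T_{i,j}^{t} + \frac{k\,\Delta t}{h^2}\left(T_{i+1,j}^{t} + T_{i-1,j}^{t} + T_{i,j+1}^{t} + T_{i,j-1}^{t} - 4\,T_{i,j}^{t}\right)

where k is the thermal diffusivity, \Delta t the time step, and h the grid spacing.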
Test Case: Running Heat2D with a Primitive Array
- Heat2D simulation using a plain array of doubles
- No objects created to contain the data, as opposed to MASS
- Simulation functions written as kernel functions (see the sketch below)
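A rough sketch of such a kernel over a flat array of doubles (the array names, the coefficient r, and the boundary handling are assumptions, not the benchmark's actual code):

    // One explicit heat-diffusion step over a flat array of doubles.
    // 'in' holds the current temperatures, 'out' the next step; r = k*dt/h^2.
    __global__ void heat_step(const double *in, double *out,
                              int width, int height, double r) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;
        int i = y * width + x;                    // 2D coordinate flattened to 1D
        out[i] = in[i] + r * (in[i - 1] + in[i + 1] +
                              in[i - width] + in[i + width] - 4.0 * in[i]);
    }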
Results
Proposed Solution
- Store all data in MASS as arrays of user-defined primitive types
- Map each index to a unique element (see the sketch below)
Pros:
- Fast accesses
- Can run larger simulations, since less heap memory overhead is required
Cons:
- Reduced user programmability
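A hedged sketch of what such an index mapping could look like (the helper name and layout are hypothetical, not part of the MASS CUDA API):

    // Each place's state lives at one slot of a primitive array instead of
    // inside a heap-allocated object.
    __host__ __device__ inline int placeIndex(int x, int y, int width) {
        return y * width + x;          // unique element per (x, y) place
    }

    // Example: one double of state per place for a 500x500 simulation.
    // double *d_temps;  cudaMalloc(&d_temps, 500 * 500 * sizeof(double));
    // d_temps[placeIndex(x, y, 500)] holds the temperature of place (x, y).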
Test Case: Running Heat2D with Place Objects
- Ran the simulation with the same objects used in MASS, but without using library function calls
- Metal and MetalState derived from the library classes, containing the same data members and internal functions
- Simulation functions rewritten in CUDA as kernel functions (see the sketch below)
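A sketch of the object-based variant (member names and layout are assumed, not copied from the MASS CUDA Metal and MetalState classes):

    struct MetalState { double temperature; };
    struct Metal      { MetalState state;   };

    // Same stencil as before, but every place is a Metal object, so each
    // access goes through an object rather than a raw double.
    __global__ void metal_step(const Metal *in, Metal *out,
                               int width, int height, double r) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;
        int i = y * width + x;
        out[i].state.temperature = in[i].state.temperature +
            r * (in[i - 1].state.temperature + in[i + 1].state.temperature +
                 in[i - width].state.temperature + in[i + width].state.temperature -
                 4.0 * in[i].state.temperature);
    }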
Results
Proposed Solution
- Remove unnecessary functionality that may be slowing the library down:
  - Excessive memory transfers between host and device (see the sketch below)
  - Partitioning logic
Pros:
- Can add back a single library feature at a time, making sure each meets the performance standard
- More computation spent on the actual simulation rather than on management
Cons:
- Scalability of the library will be missing early in development
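To illustrate the first point, a hedged sketch of keeping data resident on the device across iterations instead of copying it back and forth every step (this reuses the heat_step kernel sketched earlier; it is not the library's actual transfer logic):

    #include <cuda_runtime.h>
    #include <utility>

    // Runs 'steps' diffusion iterations entirely on the device, touching host
    // memory only once at the start and once at the end.
    void run_on_device(double *h_temps, int width, int height,
                       int steps, double r) {
        size_t bytes = size_t(width) * height * sizeof(double);
        double *d_in, *d_out;
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);
        cudaMemcpy(d_in, h_temps, bytes, cudaMemcpyHostToDevice);  // one copy in

        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        for (int t = 0; t < steps; ++t) {
            heat_step<<<grid, block>>>(d_in, d_out, width, height, r);
            std::swap(d_in, d_out);   // double buffering: output becomes next input
        }
        cudaMemcpy(h_temps, d_in, bytes, cudaMemcpyDeviceToHost);  // one copy out
        cudaFree(d_in);
        cudaFree(d_out);
    }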
Test Case: Running Heat2D with Coalesced Accesses
- Ran the simulation using primitive values, but taking advantage of coalesced memory accesses
- Kernel functions take array parameters in their native dimension (a 2D array)
- Allocated with cudaMallocPitch() / cudaMalloc3D() (see the sketch below)
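A sketch of the pitched (2D) version of the same stencil; the row-pointer arithmetic follows the standard cudaMallocPitch() usage, while the kernel name and parameters are assumptions:

    #include <cuda_runtime.h>

    // Pitched 2D allocation pads each row so rows start on aligned addresses,
    // which helps neighboring threads issue coalesced loads.
    __global__ void heat_step_pitched(const double *in, double *out,
                                      size_t pitch, int width, int height, double r) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;
        const double *rowU = (const double *)((const char *)in + (y - 1) * pitch);
        const double *row  = (const double *)((const char *)in + y * pitch);
        const double *rowD = (const double *)((const char *)in + (y + 1) * pitch);
        double *orow       = (double *)((char *)out + y * pitch);
        orow[x] = row[x] + r * (row[x - 1] + row[x + 1] + rowU[x] + rowD[x]
                                - 4.0 * row[x]);
    }

    // Host-side allocation: the requested width is given in bytes and the
    // runtime returns the padded row pitch.
    // size_t pitch;
    // double *d_in;
    // cudaMallocPitch((void **)&d_in, &pitch, width * sizeof(double), height);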
Results
Proposed Solution
- Let MASS run the simulation in its native dimension (1D, 2D, or 3D)
Pros:
- Faster memory accesses, increasing performance
Cons:
- Extra overhead of determining which dimension to run a function in
- Only up to three dimensions can be run natively
Conclusion
- Remove unused features and reintroduce one feature at a time
- Use coalesced memory accesses via native array dimensions
- Store data in primitive arrays
- Also consider: shared memory (see the sketch below)
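A hedged sketch of the shared-memory idea for the same stencil, assuming a 16x16 block; each block stages its tile plus a one-cell halo in on-chip memory before computing:

    #define TILE 16   // launch with blockDim = (TILE, TILE)

    __global__ void heat_step_shared(const double *in, double *out,
                                     int width, int height, double r) {
        __shared__ double tile[TILE + 2][TILE + 2];
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        int tx = threadIdx.x + 1, ty = threadIdx.y + 1;
        int i = y * width + x;

        bool inside = (x < width && y < height);
        if (inside) {
            tile[ty][tx] = in[i];
            // Load the halo cells bordering this block's tile.
            if (threadIdx.x == 0        && x > 0)          tile[ty][0]      = in[i - 1];
            if (threadIdx.x == TILE - 1 && x < width - 1)  tile[ty][tx + 1] = in[i + 1];
            if (threadIdx.y == 0        && y > 0)          tile[0][tx]      = in[i - width];
            if (threadIdx.y == TILE - 1 && y < height - 1) tile[ty + 1][tx] = in[i + width];
        }
        __syncthreads();   // every thread reaches this, even those outside the grid

        if (x > 0 && y > 0 && x < width - 1 && y < height - 1)
            out[i] = tile[ty][tx] + r * (tile[ty][tx - 1] + tile[ty][tx + 1] +
                                         tile[ty - 1][tx] + tile[ty + 1][tx] -
                                         4.0 * tile[ty][tx]);
    }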
Final Words
Relevant courses:
- CSS 430 Operating Systems
- CSS 422 Hardware and Computer Organization
Special thanks to:
- Dr. Fukuda
- Nathaniel Hart