CUBLAS and CUSPARSE MVM Timing
Gavin Harrison
SMVM Algorithm
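The slide carries only the title, so as a point of reference, here is a minimal serial CSR sparse matrix-vector multiply (y = A*x). The CSR layout (row_ptr / col_idx / val arrays) and all names are assumptions; the slide does not state which sparse format was used.

    // Plain CSR sparse matrix-vector multiply, y = A*x, shown serially for reference.
    void spmv_csr_serial(int num_rows,
                         const int *row_ptr, const int *col_idx, const float *val,
                         const float *x, float *y)
    {
        for (int row = 0; row < num_rows; ++row) {
            float sum = 0.0f;
            // row_ptr[row] .. row_ptr[row+1] delimit the nonzeros of this row.
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                sum += val[j] * x[col_idx[j]];   // gather from x through the column indices
            y[row] = sum;
        }
    }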
NVIDIA Memory Hierarchy
Global memory: large, high latency.
Shared memory: on-chip memory shared by the processors of each multiprocessor.
Constant/texture memory: read-only data in global memory with an on-chip cache.
– Constant memory is faster, but has only one port.
– Texture memory doesn't suffer greatly from irregular access; it also benefits from 2D spatial locality.
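As an illustration of where data can live in this hierarchy, the sketch below declares one buffer per memory space. All names are hypothetical, and the texture declaration uses the legacy texture reference API of the GT200/Fermi era.

    // Illustrative CUDA declarations only; names are hypothetical.
    __constant__ float c_coeffs[256];                    // constant memory: read-only, cached on chip, single port
    texture<float, 1, cudaReadModeElementType> tex_vec;  // texture memory: read-only, cached, tolerant of irregular access

    __global__ void memory_spaces_demo(const float *g_in, float *g_out, int n)
    {
        __shared__ float tile[256];                      // shared memory: on-chip, per multiprocessor (launch with blockDim.x == 256)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tile[threadIdx.x] = g_in[i];                 // global memory: large, high latency
            g_out[i] = tile[threadIdx.x] * c_coeffs[threadIdx.x]
                     + tex1Dfetch(tex_vec, i);
        }
    }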
Tuning SMVM for GPU (GTX 280)
Use multiple threads per row; combine partial results with __syncthreads().
Access memory at stride.
– Half-warps access sequential addresses.
– Allows for fewer memory reads from global memory.
Align rows.
– Also helps decrease memory reads from global memory.
Use texture memory for the input vector.
– The input vector is reused.
– Texture reads are cached and benefit from spatial locality.
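A sketch of how these tuning points combine, in the spirit of the CSR "vector" kernel: one warp per row, coalesced reads of each row, and the input vector fetched through the texture cache. This is a hedged reconstruction, not the author's kernel; it uses the warp-synchronous reduction common on GT200/Fermi rather than a full block-wide __syncthreads() reduction, and all names are hypothetical.

    // Launch with blockDim.x == 128 (four warps per block).
    texture<float, 1, cudaReadModeElementType> tex_x;    // reused input vector, read through the texture cache

    __global__ void spmv_csr_vector(int num_rows,
                                    const int   *row_ptr,
                                    const int   *col_idx,
                                    const float *val,
                                    float       *y)
    {
        __shared__ volatile float partial[128];          // one partial sum per thread

        int tid     = blockIdx.x * blockDim.x + threadIdx.x;
        int warp_id = tid / 32;                          // one warp handles one row
        int lane    = threadIdx.x & 31;

        if (warp_id < num_rows) {
            int row_start = row_ptr[warp_id];
            int row_end   = row_ptr[warp_id + 1];

            // Consecutive lanes read consecutive nonzeros, so each half-warp's
            // loads of val[] and col_idx[] hit sequential global-memory addresses.
            float sum = 0.0f;
            for (int j = row_start + lane; j < row_end; j += 32)
                sum += val[j] * tex1Dfetch(tex_x, col_idx[j]);

            // Combine the 32 partial results (warp-synchronous, volatile shared memory).
            partial[threadIdx.x] = sum;
            if (lane < 16) partial[threadIdx.x] += partial[threadIdx.x + 16];
            if (lane <  8) partial[threadIdx.x] += partial[threadIdx.x +  8];
            if (lane <  4) partial[threadIdx.x] += partial[threadIdx.x +  4];
            if (lane <  2) partial[threadIdx.x] += partial[threadIdx.x +  2];
            if (lane <  1) partial[threadIdx.x] += partial[threadIdx.x +  1];

            if (lane == 0) y[warp_id] = partial[threadIdx.x];
        }
    }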
Improvements in Fermi (GTX 580)
General L1/L2 cache structure.
– L1 cache and shared memory are configurable as 48 KB or 16 KB each (64 KB shared between them).
– L2 is 768 KB.
Improved support for double-precision floating point.
Added support for 32-bit integer multiplication.
32 SPs per SM.
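Because the 64 KB per SM is split between L1 and shared memory, a kernel can request the split it prefers through the CUDA runtime call cudaFuncSetCacheConfig. The kernel name below is hypothetical (it matches the earlier sketch).

    __global__ void spmv_csr_vector(int num_rows, const int *row_ptr,
                                    const int *col_idx, const float *val, float *y);

    void configure_fermi_cache(void)
    {
        // 48 KB L1 / 16 KB shared memory: good when the kernel makes little use of shared memory.
        cudaFuncSetCacheConfig(spmv_csr_vector, cudaFuncCachePreferL1);
        // 48 KB shared memory / 16 KB L1: good when the kernel stages data in shared memory.
        // cudaFuncSetCacheConfig(spmv_csr_vector, cudaFuncCachePreferShared);
    }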
CUSPARSE SMVM Performance
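A hedged sketch of how such a measurement might look, using the legacy CSR SpMV routine cusparseScsrmv timed with CUDA events. The exact parameter list of cusparseScsrmv changed across CUSPARSE releases (this follows the later pointer-based form), and the handle, matrix descriptor, and device arrays are assumed to be set up elsewhere.

    #include <cuda_runtime.h>
    #include <cusparse.h>

    // Times one y = A*x with CUSPARSE CSR SpMV; returns milliseconds.
    float time_cusparse_spmv(cusparseHandle_t handle, cusparseMatDescr_t descr,
                             int m, int n, int nnz,
                             const float *d_val, const int *d_row_ptr, const int *d_col_idx,
                             const float *d_x, float *d_y)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       m, n, nnz, &alpha, descr,
                       d_val, d_row_ptr, d_col_idx, d_x, &beta, d_y);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }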
CUSPARSE SMVM Speedup Over OSKI (single precision)
CUBLAS MVM Performance
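For the dense case, the timed operation would be a single GEMV. A minimal sketch with the CUBLAS v2 API (cublasSgemv), assuming the matrix is stored column-major on the device and the handle is created elsewhere; names are hypothetical.

    #include <cublas_v2.h>

    // Dense y = A*x for an m-by-n matrix A stored column-major on the device.
    void dense_mvm(cublasHandle_t handle, int m, int n,
                   const float *d_A, const float *d_x, float *d_y)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemv(handle, CUBLAS_OP_N, m, n,
                    &alpha, d_A, m,      // lda = m for a column-major m-by-n matrix
                    d_x, 1, &beta, d_y, 1);
    }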
CUBLAS MVM Speedup over ATLAS