CUBLAS and CUSPARSE MVM Timing Gavin Harrison
SMVM Algorithm
NVIDIA Memory Hierarchy
- Global memory: large, high latency.
- Shared memory: a shared cache for each set of processors.
- Constant/texture memory: read-only regions of global memory backed by an on-chip cache.
  - Constant memory is faster, but has only one port.
  - Texture memory does not suffer greatly from irregular access, and is also beneficial given 2D spatial locality.
Tuning SMVM for the GPU (GTX 280)
- Use multiple threads per row; use syncthreads and combine partial results.
- Access memory at stride, so that half-warps access sequential addresses. This allows fewer reads from global memory.
- Align rows, which also helps decrease reads from global memory.
- Use texture memory for the input vector: the input vector is reused, and texture reads are cached and benefit from spatial locality.
Improvements in Fermi (GTX 580)
- General L1/L2 cache structure:
  - L1 cache and shared memory are configurable as 48 KB or 16 KB (64 KB shared between them).
  - L2 cache is 768 KB.
- Improved support for double-precision floating-point numbers.
- Added support for 32-bit integer multiplication.
- 32 SPs per SM.
CUSPARSE SMVM Performance
CUSPARSE SMVM Speedup Over OSKI (single precision)
CUBLAS MVM Performance
CUBLAS MVM Speedup over ATLAS