1) Leverage the raw computational power of the GPU: order-of-magnitude performance gains are possible.
2) Leverage the maturation of GPU hardware and software:

Era           HW                                         SW
1995 - 2000   Dedicated fixed-function 3D accelerators   Assembly code
2000 - 2005   Programmable graphics pipeline (shaders)   Shader programming languages (Cg/HLSL)
2006 -        General computing (nVidia G80)             General programming languages (CUDA)
Nanoscale Molecular Dynamics (NAMD), University of Illinois at Urbana-Champaign: tools for simulating and visualizing biomolecular processes. GPU acceleration yields 3.5x - 8x performance gains.
Develop a high-performance library of core computational methods using the GPU.
Library level: BLAS (Basic Linear Algebra Subprograms), numerical methods, image processing kernels.
Application level: port LONI algorithms.
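As an illustration of the kind of core BLAS routine such a library could expose, here is a minimal sketch of a single-precision SAXPY (y = a*x + y) kernel; the kernel name and launch configuration are illustrative, not taken from the project plan.

```cuda
// Minimal SAXPY (y = a*x + y): an example of a BLAS level-1 routine
// that a GPU compute library could provide. Names are illustrative.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Host-side launch: 256 threads per block (a multiple of the warp size).
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```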
G80 chipset (nVidia 8800 GTX):
680 million transistors (Intel Core 2: 290 million)
128 micro-processors: 16 multiprocessor units @ 1.3 GHz, 8 processors per multiprocessor unit
Device memory: 768 MB
High-performance parallel architecture:
On-chip shared memory (16 KB per multiprocessor)
Texture cache (8 KB)
Constant memory (64 KB) and constant cache (8 KB)
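A minimal sketch that confirms these figures on a given card by querying the CUDA runtime (device 0 assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query and print the device properties discussed above: multiprocessor
// count, clock rate, device memory, shared memory, and constant memory.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Name:                  %s\n", prop.name);
    printf("Multiprocessors:       %d\n", prop.multiProcessorCount);
    printf("Clock rate:            %.2f GHz\n", prop.clockRate / 1.0e6);
    printf("Device memory:         %zu MB\n", prop.totalGlobalMem >> 20);
    printf("Shared mem per block:  %zu KB\n", prop.sharedMemPerBlock >> 10);
    printf("Constant memory:       %zu KB\n", prop.totalConstMem >> 10);
    return 0;
}
```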
Compatible with all cards with the CUDA driver (Linux / Windows):
mobile (GeForce 8M), desktop (GeForce 8), server (Quadro)
Scalable to multiple GPUs:
nVidia SLI
workstation cluster (nVidia Tesla): 1.5 GB dedicated memory, 2 or 4 G80 GPUs (256 or 512 micro-processors)
Attractive cost-to-performance ratio:
nVidia 8800 GTX: $550
nVidia Tesla: $7,500 - $12,000
nVidia CUDA is a first-generation platform:
Not all algorithms scale well to the GPU
Host memory to device memory transfers are a bottleneck
Single-precision floating point only
Cross-GPU development is currently not available
Task                                                                           Time
a) Identify computational methods to implement; b) evaluate if scalable to GPU  2 - 4 weeks
Experimentation / implementation                                               3 - 4 months
Develop prototype                                                              Feb 2008
Basic definitions:
BLOCK = conceptual computational node
Maximum number of blocks = 65,535
Optimal if the number of blocks is a multiple of the number of multiprocessors (16)
Each BLOCK runs a number of threads
Maximum threads per block = 512
Optimal if the number of threads is a multiple of the warp size (32)
Pivot-divide for 3D volume data:
Matrix pivot-divide is applied to each slice independently
Each slice is mapped to one block (NUMBLOCKS = N)
Each thread in a block handles one row of the slice (NUMTHREADS = N)
A sketch of this mapping follows.
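This is a minimal sketch, assuming an N x N x N float volume already resident in device memory; the kernel body is only a placeholder for the actual pivot-divide arithmetic and is meant to show the thread-to-data mapping.

```cuda
// One block per slice, one thread per row (NUMBLOCKS = NUMTHREADS = N).
__global__ void pivotDivideSlices(float *volume, int N)
{
    int slice = blockIdx.x;    // which N x N slice this block owns
    int row   = threadIdx.x;   // which row of that slice this thread owns

    float *r = volume + (size_t)slice * N * N + (size_t)row * N;

    float pivot = r[0];        // placeholder pivot choice
    if (pivot != 0.0f) {
        for (int col = 0; col < N; ++col)
            r[col] /= pivot;   // divide the row by its pivot element
    }
}

// Host-side launch for an N x N x N volume already copied to d_volume:
// pivotDivideSlices<<<N, N>>>(d_volume, N);   // N blocks, N threads each
```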
As long as there is no synchronization among slices, this scales well to the GPU; concurrent reads of other slices should be possible.
Host-to-device latency: 1 GB/s measured (2 GB/s reported). PCIe settings?
Needs investigating:
NUMBLOCKS vs. multiprocessor count?
Fine-tune the number of slices per block? The CUDA scheduler seems to handle NUMBLOCKS = N well.
Scaling issues when N > NUMTHREADS?
Will we ever hit the block limit?
t(total) = t(mem) + t(compute)
GPU: t(mem) = host-to-device transfer time, t(compute) = kernel time
CPU: t(mem) = memcpy() time, t(compute) = loop time
Parameters: for N = 16…256, BLOCKS = 256; for N = 272…512, BLOCKS = 512
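One way to measure the two GPU terms separately is with CUDA events; a minimal sketch, assuming the pivot-divide kernel from the earlier sketch and a buffer of bytes = N * N * N * sizeof(float) (function and variable names are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void pivotDivideSlices(float *volume, int N);  // defined earlier

// Time t(mem) (host-to-device copy) and t(compute) (kernel) separately.
void timeGpu(const float *h_vol, float *d_vol, size_t bytes, int N)
{
    cudaEvent_t start, afterCopy, afterKernel;
    cudaEventCreate(&start);
    cudaEventCreate(&afterCopy);
    cudaEventCreate(&afterKernel);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_vol, h_vol, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(afterCopy, 0);

    pivotDivideSlices<<<N, N>>>(d_vol, N);
    cudaEventRecord(afterKernel, 0);
    cudaEventSynchronize(afterKernel);

    float tMem = 0.0f, tCompute = 0.0f;               // milliseconds
    cudaEventElapsedTime(&tMem, start, afterCopy);     // t(mem)
    cudaEventElapsedTime(&tCompute, afterCopy, afterKernel);  // t(compute)
    // t(total) = tMem + tCompute
}
```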
Host-to-device memory is a bottleneck.
Pageable vs. pinned memory allocation: transfers are about 2x faster with pinned memory.
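A minimal sketch of the two allocation strategies; only the host allocation call changes, the copy itself is identical (buffer size and names are illustrative):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 << 20;   // 64 MB test buffer
    float *d_buf;
    cudaMalloc((void **)&d_buf, bytes);

    // Pageable host memory: ordinary malloc; the copy goes through an
    // intermediate staging buffer inside the driver.
    float *pageable = (float *)malloc(bytes);
    cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice);

    // Pinned (page-locked) host memory: cudaMallocHost; the DMA engine can
    // read it directly, which is what yields the ~2x faster transfer.
    float *pinned;
    cudaMallocHost((void **)&pinned, bytes);
    cudaMemcpy(d_buf, pinned, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d_buf);
    return 0;
}
```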
Single Instruction, Multiple Data (SIMD) model
Less synchronization means higher performance; v1.0 has no synchronization among blocks
High arithmetic intensity: arithmetic intensity = arithmetic ops / memory ops
Computation can overlap with memory operations
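To make arithmetic intensity concrete, here is a hypothetical kernel that performs many arithmetic operations per element loaded; while one warp waits on its memory request, the multiprocessor runs arithmetic from other warps, which is the overlap referred to above. All names and constants are illustrative.

```cuda
#define ITER 64   // arithmetic iterations per loaded element (illustrative)

// High-arithmetic-intensity sketch: one global read and one global write
// per thread, with roughly 2*ITER floating-point ops in between.
__global__ void highIntensity(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];                  // 1 memory read
    for (int k = 0; k < ITER; ++k)    // ~2*ITER arithmetic ops
        x = x * 1.0001f + 0.5f;       // multiply-add
    out[i] = x;                       // 1 memory write
}
// Arithmetic intensity here is roughly 2*ITER / 2 = ITER ops per memory op.
```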
Memory operations have the highest latency.
Shared memory: as fast as accessing a register when there are no bank conflicts; limited to 16 KB (see the sketch after this list)
Texture memory: cached from device memory; optimized for 2D spatial locality; built-in filtering/interpolation methods; reads packed data in one operation (e.g., RGBA)
Constant memory: cached from device memory; as fast as a register if all threads read the same address
Device memory: uncached, very slow; faster if accesses are byte-aligned and coalesced into a single contiguous access
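A minimal sketch of the shared-memory pattern: stage a tile of data from slow device memory into the fast on-chip shared memory, then work on it from there. Kernel name, tile size, and the trivial scaling operation are illustrative.

```cuda
#define TILE 256   // threads per block = elements staged per block

// Each thread loads one element, so consecutive threads touch consecutive
// addresses (coalesced); the tile is then reused from shared memory.
__global__ void scaleWithShared(const float *in, float *out, int n, float s)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * TILE + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];    // one coalesced read per thread

    __syncthreads();                  // wait until the whole tile is loaded

    if (i < n)
        out[i] = s * tile[threadIdx.x];
}

// Launch: scaleWithShared<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n, 2.0f);
```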
Arithmetic operations:
4 clock cycles for float add, multiply, and multiply-add, and for int add
16 clock cycles for 32-bit int multiply (4 cycles for 24-bit multiply)
36 clock cycles for float division (int division and modulo are very costly)
v1.0: floats only (doubles are converted to float)
Atomic operations (v1.1 only) provide locking mechanisms
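A short sketch touching both points: the cheaper 24-bit integer multiply intrinsic for index arithmetic, and an atomic add, which requires CUDA v1.1 / compute capability 1.1 hardware. The kernel and counter are illustrative.

```cuda
// __mul24 uses the 4-cycle 24-bit multiplier instead of the 16-cycle
// 32-bit one; safe here because block and thread indices fit in 24 bits.
// atomicAdd on a global int requires compute capability 1.1 (CUDA v1.1).
__global__ void countPositives(const float *in, int n, int *count)
{
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (i < n && in[i] > 0.0f)
        atomicAdd(count, 1);   // serialized among contending threads, but race-free
}
```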
Minimize host-to-device memory transfers
Minimize device memory accesses; optimize with byte alignment and coalescing
Minimize execution divergence: minimize branching in the kernel, unroll loops
Make heavy use of shared memory; data must be striped correctly to avoid bank conflicts
For image processing tasks, texture memory may be more efficient
Threads per block: a multiple of the warp size (32)
Number of blocks: still an open question
A sketch applying several of these guidelines follows.
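This is a minimal sketch, not a tuned implementation: it shows coalesced device memory access, an unrolled loop, and a launch configuration built from multiples of the warp size and multiprocessor count. All names, sizes, and the block count are illustrative assumptions.

```cuda
#define THREADS 128        // multiple of the warp size (32)
#define UNROLL  4          // each thread handles 4 elements

// Each thread touches elements spaced one grid-stride apart, so within a
// warp the accesses stay contiguous and coalesce into single transactions.
// The inner loop is unrolled to reduce branch overhead.
__global__ void scaleUnrolled(const float *in, float *out, int n, float s)
{
    int stride = gridDim.x * THREADS;
    int i = blockIdx.x * THREADS + threadIdx.x;

#pragma unroll
    for (int k = 0; k < UNROLL; ++k) {
        int idx = i + k * stride;
        if (idx < n)
            out[idx] = s * in[idx];
    }
}

// Launch with a block count that is a multiple of the 16 multiprocessors:
// scaleUnrolled<<<64, THREADS>>>(d_in, d_out, n, 2.0f);
```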