
1

2  1) Leverage the raw computational power of the GPU  Order-of-magnitude performance gains possible

3  2) Leverage the maturation of GPU HW and SW
   Period        HW                                        SW
   1995 - 2000   Dedicated fixed-function 3D accelerators  Assembly code
   2000 - 2005   Programmable graphics pipeline (shaders)  Shader programming languages (Cg/HLSL)
   2006 - now    General computing (nVidia G80)            General programming languages (CUDA)

4  Nanoscale Molecular Dynamics (NAMD), University of Illinois at Urbana-Champaign
 Tools for simulating and visualizing biomolecular processes
 Yields 3.5x – 8x performance gains

5  Develop a high-performance library of core computational methods using the GPU
 Library level
   BLAS (Basic Linear Algebra Subprograms)
   Numerical methods
   Image processing kernels
 Application level
   Port LONI algorithms

6  G80 chipset: nVidia 8800 GTX
 680 million transistors (Intel Core 2: 290 million)
 128 micro-processors
   16 multi-processor units @ 1.3 GHz
   8 processors per multi-processor unit
 Device memory: 768 MB
 High-performance parallel architecture
   On-chip shared memory (16 KB)
   Texture cache (8 KB)
   Constant memory (64 KB) and cache (8 KB)
 (these figures can be queried at run time; see the sketch below)
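These per-card capabilities do not have to be hard-coded; a minimal sketch, assuming the CUDA runtime API, that queries them from the driver:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 // device 0, e.g. the 8800 GTX
        printf("Name:                  %s\n", prop.name);
        printf("Multiprocessors:       %d\n", prop.multiProcessorCount);   // 16 on G80
        printf("Clock rate:            %d kHz\n", prop.clockRate);
        printf("Device memory:         %lu MB\n", (unsigned long)(prop.totalGlobalMem >> 20));    // 768 MB
        printf("Shared memory / block: %lu KB\n", (unsigned long)(prop.sharedMemPerBlock >> 10)); // 16 KB
        printf("Constant memory:       %lu KB\n", (unsigned long)(prop.totalConstMem >> 10));     // 64 KB
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);    // 512 on G80
        return 0;
    }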

7

8  Compatible with all cards with the CUDA driver
 Linux / Windows
 Mobile (GeForce 8M), desktop (GeForce 8), server (Quadro)
 Scalable to multiple GPUs
   nVidia SLI
   Workstation cluster (nVidia Tesla)
     1.5 GB dedicated memory
     2 or 4 G80 GPUs (256 or 512 micro-processors)
 Attractive cost-to-performance ratio
   nVidia 8800 GTX: $550
   nVidia Tesla: $7,500 - $12,000

9  nVidia CUDA is a first-generation platform
 Not all algorithms scale well to the GPU
 Host-memory-to-device-memory transfers are a bottleneck
 Single-precision floating point only
 Cross-GPU development currently not available

10  Task                                             Time
    a) Identify computational methods to implement
    b) Evaluate if scalable to GPU                   2 - 4 weeks
    Experimentation / Implementation                 3 - 4 months
    Develop prototype                                Feb 2008

11  Basic definitions
 BLOCK = conceptual computational node
   Max number of blocks = 65,535
   Optimal if the number of blocks is a multiple of the number of multiprocessors (16)
 Each BLOCK runs a number of threads
   Max threads per block = 512
   Optimal if the number of threads is a multiple of the warp size (32)
 Pivot-divide for 3D volume data
   Matrix pivot-divide applied to each slice independently
   Each slice mapped to a “block” (NUMBLOCKS = N)
   Each thread in a block handles one row in the slice (NUMTHREADS = N)
   (see the launch sketch below)
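A minimal sketch of how this slice-per-block, row-per-thread mapping could look as a CUDA kernel and launch; the kernel name, the row-major slice layout, and the choice of the diagonal entry as the pivot are illustrative assumptions, not the project's actual code:

    // Hypothetical kernel: one block per slice of an N x N x N volume,
    // one thread per row of that slice (NUMBLOCKS = N, NUMTHREADS = N).
    __global__ void pivotDivideKernel(float *volume, int N) {
        int slice = blockIdx.x;                        // which slice this block owns
        int row   = threadIdx.x;                       // which row this thread owns
        if (slice < N && row < N) {
            float *r = volume + (slice * N + row) * N; // start of the row (row-major layout assumed)
            float pivot = r[row];                      // diagonal entry used as the pivot (assumption)
            if (pivot != 0.0f) {
                for (int col = 0; col < N; ++col)      // divide the whole row by its pivot
                    r[col] /= pivot;
            }
        }
    }

    // Launch: pivotDivideKernel<<<N, N>>>(d_volume, N);   // requires N <= 512 on G80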

12

13

14  As long as there is no synchronization among slices, the method scales well to the GPU
 Concurrent reads of other slices should be possible
 Host-to-device latency
   1 GB/s measured (2 GB/s reported)
   PCIe settings?
 Needs investigating:
   NUMBLOCKS vs. multiprocessor count?
   Fine-tune the number of slices per block?
   The CUDA scheduler seems to handle it well when NUMBLOCKS = N
 Scaling issues
   N > NUMTHREADS?
   Will we ever hit the block limit (65,535)?

15  t(total) = t(mem) + t(compute)
 GPU
   t(mem) = host-to-device transfer time
   t(compute) = kernel execution time
 CPU
   t(mem) = memcpy() time
   t(compute) = loop execution time
 Parameters
   for N = 16 … 256, BLOCKS = 256
   for N = 272 … 512, BLOCKS = 512
 (timing sketch below)
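A minimal sketch of how the GPU-side split could be measured with CUDA events; the kernel and buffer names carry over from the illustrative sketch above:

    // Measure t(mem) and t(compute) separately with CUDA events.
    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start);  cudaEventCreate(&mid);  cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_volume, h_volume, bytes, cudaMemcpyHostToDevice);   // t(mem): host-to-device copy
    cudaEventRecord(mid, 0);
    pivotDivideKernel<<<NUMBLOCKS, NUMTHREADS>>>(d_volume, N);       // t(compute): kernel time
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float tMem, tCompute;
    cudaEventElapsedTime(&tMem, start, mid);                         // milliseconds
    cudaEventElapsedTime(&tCompute, mid, stop);
    float tTotal = tMem + tCompute;                                  // t(total) = t(mem) + t(compute)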

16  Host-to-device memory bottleneck
 Pageable vs. pinned memory allocation
 2x faster transfers with pinned memory (sketch below)
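A minimal sketch of the two allocation paths, assuming cudaMallocHost for the page-locked buffer; buffer names and the volume size are illustrative:

    // Pageable (ordinary malloc) vs. pinned (page-locked) host buffers.
    size_t bytes = N * N * N * sizeof(float);               // illustrative volume size

    float *h_pageable = (float *)malloc(bytes);             // pageable: staged through a driver buffer
    float *h_pinned;
    cudaMallocHost((void **)&h_pinned, bytes);               // pinned: DMA-able, roughly 2x faster transfers

    float *d_volume;
    cudaMalloc((void **)&d_volume, bytes);

    cudaMemcpy(d_volume, h_pageable, bytes, cudaMemcpyHostToDevice);  // slower path
    cudaMemcpy(d_volume, h_pinned,   bytes, cudaMemcpyHostToDevice);  // faster path

    cudaFreeHost(h_pinned);  cudaFree(d_volume);  free(h_pageable);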

17

18

19

20  Single Instruction, Multiple Data (SIMD) model
 Less synchronization, higher performance
   v1.0 – no synchronization among blocks
 High arithmetic intensity
   Arithmetic intensity = arithmetic ops / memory ops
   Computation can overlap with memory operations
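As an illustrative example, a per-element update such as y[i] = a*x[i] + y[i] performs 2 arithmetic operations for every 3 memory operations, an arithmetic intensity of roughly 0.7; kernels well suited to the GPU pack many more arithmetic operations per memory access, so the processors stay busy while loads are in flight.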

21  Memory operations have the highest latency
 Shared memory
   As fast as a register access when there are no bank conflicts
   Limited to 16 KB
 Texture memory
   Cached from device memory
   Optimized for 2D spatial locality
   Built-in filtering/interpolation methods
   Reads packed data in one operation (e.g. RGBA)
 Constant memory
   Cached from device memory
   As fast as a register if all threads read the same address
 Device memory
   Uncached, very slow
   Faster if accesses are byte-aligned and coalesced into a single contiguous access
 (shared-memory sketch below)
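A minimal sketch of staging data through shared memory before working on it; the kernel name and the 256-element row length are illustrative assumptions:

    // Each block stages one 256-float row (1 KB of the 16 KB shared memory) on chip,
    // operates on it there, then writes it back with coalesced accesses.
    #define ROW_LEN 256

    __global__ void scaleRowShared(float *data, float factor) {
        __shared__ float row[ROW_LEN];                  // on-chip, visible to the whole block
        int i = threadIdx.x;

        row[i] = data[blockIdx.x * ROW_LEN + i];        // coalesced read from device memory
        __syncthreads();                                // wait until the row is fully loaded

        row[i] *= factor;                               // register-speed access, no bank conflicts

        data[blockIdx.x * ROW_LEN + i] = row[i];        // coalesced write back
    }
    // Launch: scaleRowShared<<<numRows, ROW_LEN>>>(d_data, 2.0f);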

22  Arithmetic operations
 4 clock cycles for float (+, *, *+) and int (+)
 16 clock cycles for 32-bit int multiply (4 cycles for a 24-bit multiply)
 36 clock cycles for float division
 int division and modulo are very costly
 v1.0 – floats only (doubles are converted to floats)
 Atomic operations (v1.1 only)
   Provide locking mechanisms
 (sketch below)
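A minimal sketch showing the 24-bit multiply intrinsic and an atomic add; the histogram kernel is an illustrative use, and atomicAdd on global memory needs v1.1 (compute capability 1.1) hardware:

    // __mul24: 24-bit integer multiply (4 cycles on G80 vs. 16 for a full 32-bit multiply).
    // atomicAdd gives a locked read-modify-write on a global counter (v1.1 only).
    __global__ void histogram256(const unsigned int *values, int n, unsigned int *bins) {
        int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;   // cheap index arithmetic
        if (i < n)
            atomicAdd(&bins[values[i] & 255], 1u);               // one of 256 bins, updated atomically
    }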

23  Minimize host-to-device memory transfers
 Minimize device memory accesses
   Optimize with byte alignment and coalescing
 Minimize execution divergence
   Minimize branching in the kernel
   Unroll loops
 Make heavy use of shared memory
   Must stripe data correctly to avoid bank conflicts
 For image processing tasks, texture memory may be more efficient
 Number of threads per block = multiple of 32
 Number of blocks = ? (see the sketch below)
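A minimal sketch pulling a few of these guidelines together (one element per thread so accesses coalesce, threads per block as a multiple of the warp size); the kernel and sizes are illustrative:

    // Thread i touches element i, so consecutive threads read consecutive addresses
    // and the device-memory accesses coalesce into contiguous transactions.
    __global__ void saxpy(const float *x, float *y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Block size: a multiple of the warp size (32); block count: enough to cover n.
    // int threads = 256;                                 // 8 warps per block
    // int blocks  = (n + threads - 1) / threads;
    // saxpy<<<blocks, threads>>>(d_x, d_y, 2.0f, n);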

