Published by Lorin Pope. Modified over 9 years ago.
1
GPU in HPC
Scott A. Friedman
friedman@ats.ucla.edu
ATS Research Computing Technologies
2
IDRE GPU Lunch
First of all…
Double precision is coming!
  – GPU: late 2007 or early 2008 (NVIDIA)
      Will be half speed (word on the street)
      At G80 speed, that equals 175 Gflop
  – Cell HPC: summer 2008
      First appears in LANL Roadrunner
      5x increase to 100 Gflop
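The double-precision figures above are simple arithmetic, and a quick check may help. This is only a sketch: it assumes the rumored half-rate double precision and uses the 350 Gflop single-precision G80 peak quoted on the hardware slide. The "implied current Cell rate" is derived from the stated 5x increase and is not a number given in the slides.

```python
# Back-of-the-envelope check of the double-precision estimates.
G80_SP_GFLOPS = 350.0   # single-precision peak quoted for the NVIDIA G80
DP_RATE = 0.5           # assumption: DP runs at half SP speed (rumor at the time)

g80_dp_gflops = G80_SP_GFLOPS * DP_RATE

CELL_DP_TARGET_GFLOPS = 100.0   # projected Cell HPC double-precision rate
CELL_SPEEDUP = 5.0              # stated 5x increase
# Derived, not stated in the slides: the rate implied before the 5x increase.
cell_dp_baseline = CELL_DP_TARGET_GFLOPS / CELL_SPEEDUP

print(f"G80-class DP estimate: {g80_dp_gflops:.0f} Gflop")
print(f"Implied current Cell DP rate: {cell_dp_baseline:.0f} Gflop")
```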
3
Hardware
Remember!
  – GPUs are for graphics (graphics processing unit)
Think data parallelism!
Must hide memory latency
  – Lots of computation: 'arithmetic intensity'
  – Low-latency memory is a precious resource
Limitations
  – Registers zeroed, minimal shared/static data, no read-modify-write buffers
  – Varying latencies, dependent on the memory type accessed
  – Designed for independent operations (legacy of graphics)
  – Lots of gotchas that will kill performance
  – Hardware constantly changing
Current generation
  – Proprietary architectures
  – NVIDIA G80, 128 ALUs, 350 Gflop SP
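To make "arithmetic intensity" concrete, here is a rough upper bound on throughput as a function of flops performed per byte moved. It is a sketch under stated assumptions: the 86.4 GB/s memory bandwidth is taken from the published 8800 GTX specification rather than from these slides, and the function and variable names are illustrative.

```python
PEAK_SP_GFLOPS = 350.0     # G80 single-precision peak quoted above
MEM_BANDWIDTH_GBS = 86.4   # assumption: 8800 GTX device memory bandwidth in GB/s

def attainable_gflops(flops_per_byte):
    """Upper bound on throughput for a kernel with this arithmetic intensity."""
    return min(PEAK_SP_GFLOPS, MEM_BANDWIDTH_GBS * flops_per_byte)

# SAXPY (y = a*x + y) on 4-byte floats: 2 flops per element,
# 12 bytes moved per element (read x, read y, write y).
saxpy_intensity = 2 / 12
saxpy_bound = attainable_gflops(saxpy_intensity)

# Intensity needed before the ALUs, rather than memory, become the limit.
break_even = PEAK_SP_GFLOPS / MEM_BANDWIDTH_GBS

print(f"SAXPY bound: {saxpy_bound:.1f} Gflop of {PEAK_SP_GFLOPS:.0f} peak")
print(f"Break-even intensity: {break_even:.2f} flops per byte")
```

Under these assumptions a low-intensity kernel such as SAXPY is limited by memory to a small fraction of peak, which is why the slides stress doing lots of computation per memory access.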
4
Programming Model
Streaming
  – Elements (array) processed by a kernel (function)
      Sounds like a SIMD vector processor
      Not exactly; the term SPMD is often used (P = program)
      No index ops on streams
      Input stream(s) -> compute -> output stream
      No dependencies between stream elements
  – CUDA relaxes this somewhat
Experimentation required
  – Balancing essential
Compute rather than move data
  – Maximize use of precious low-latency, high-bandwidth memory
  – Cover latencies with as much computation as possible
      High arithmetic intensity, you will hear this a lot!
  – Often better to re-compute than cache data
  – Avoid code that is memory bound
      Memory access speed is improving much more slowly than the number of ALUs
      Better to batch memory moves into large transfers
      Complex memory access rules have a major impact on performance
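The streaming model above can be sketched in plain Python. This is an illustration, not GPU code: `run_stream` and `saxpy` are hypothetical names, and the shuffled evaluation order stands in for hardware that may process elements in any order. Because the kernel sees only its own element and there are no dependencies between stream elements, the output is the same regardless of order.

```python
import random

def run_stream(kernel, *streams, order=None):
    """Apply kernel independently to each element position of the input streams."""
    n = len(streams[0])
    out = [None] * n
    for i in (order if order is not None else range(n)):
        # The kernel receives only element i of each stream: no index ops,
        # no access to neighbouring elements.
        out[i] = kernel(*(s[i] for s in streams))
    return out

def saxpy(x, y, a=2.0):
    """Per-element kernel: a*x + y."""
    return a * x + y

xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 20.0, 30.0, 40.0]

in_order = run_stream(saxpy, xs, ys)

shuffled = list(range(len(xs)))
random.shuffle(shuffled)   # stand-in for an arbitrary hardware schedule
any_order = run_stream(saxpy, xs, ys, order=shuffled)

print(in_order)
print(any_order == in_order)
```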
5
Tools
Cell SDK
  – Direct access to the hardware
  – Very low level
CUDA (NVIDIA, 8xxx series and later)
  – C API: provides scalar execution model (with caveats)
  – Low level, think of MPI?
      Certain amount of hardware abstraction
  – User maps problem domain to processing units and memory hierarchy
      Re-imagining of graphics hardware as programming concepts (e.g. threads, arrays)
      GLSL, graphics tools: even lower level, but not as necessary now
  – Kernel is: 1D, 2D, or 3D Grid : Blocks : Threads
      Threads within a block can communicate via on-chip shared memory and synchronize
      Blocks are independent!
        – No communication between blocks
        – No execution ordering or concurrency guarantees
  – Free but specific to NVIDIA hardware (will hide future architecture changes)
CTM (AMD/ATI)
  – Similar, but lower level than CUDA
RapidMind
  – Integrate into C++ code
  – Higher-level abstractions, think OpenMP?
  – SPMD oriented: e.g. streams and kernels, more restrictive than CUDA
  – Portable?
  – Let the experts do the mapping to memory hierarchy
      Several back-ends supported: Cell, GPUs, multicore CPUs
      Allows tuning to specific hardware
  – Not free
Brook, Sh
  – Open source tools
  – Sh is the precursor to the RapidMind kit
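The Grid : Blocks : Threads hierarchy can be emulated in plain Python to show what the independence rules imply. This is a sketch, not CUDA: `block_sum` is a hypothetical name, the loops stand in for parallel hardware, and finishing the load loop before reducing plays the role of the barrier that CUDA provides as `__syncthreads()`. Blocks are visited in a shuffled order to mimic the lack of ordering guarantees, so each block writes only its own partial result and the final combination happens outside the kernel.

```python
import random

def block_sum(data, block_dim):
    """Sum data in per-block partials, emulating a 1D grid of 1D blocks."""
    grid_dim = (len(data) + block_dim - 1) // block_dim   # blocks needed to cover data
    partial = [0.0] * grid_dim

    block_order = list(range(grid_dim))
    random.shuffle(block_order)          # blocks may execute in any order

    for block_idx in block_order:
        shared = [0.0] * block_dim       # shared memory, private to this block
        # Phase 1: each thread loads one element into shared memory.
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx   # global element index
            if i < len(data):                        # guard the ragged last block
                shared[thread_idx] = data[i]
        # Barrier: all loads finish before any thread reads shared memory.
        # Phase 2: one thread reduces the block's shared data.
        partial[block_idx] = sum(shared)
    return partial

data = [float(v) for v in range(10)]
partials = block_sum(data, block_dim=4)
total = sum(partials)    # blocks cannot communicate, so the host combines results

print(partials, total)
```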
6
Resources
One stop shopping
  – http://www.gpgpu.org
More good stuff
  – http://www.rapidmind.com/resources.php
Great survey paper
  – http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=907
Cell HPC presentation
  – http://www.power.org/resources/devcorner/cellcorner/hpcspe.pdf
Siggraph 2007 GPGPU course (very good)
  – http://www.gpgpu.org/s2007/
IBM Cell
  – http://www.ibm.com/developerworks/power/cell/
NVIDIA
  – http://developer.nvidia.com/object/cuda.html
AMD/ATI
  – http://ati.amd.com/technology/streamcomputing/index.html
RapidMind
  – http://developer.rapidmind.com
Google is your friend, of course
7
Conclusions
This is the future of the highest-performance codes
  – GPU, Cell or Larrabee-esque multi-core
  – Industry is scaling cores, not clocks
  – Industry contacts report that customers are 'in denial' and need to get on board
Programming is going to get a whole lot more complex
  – Memory hierarchies
  – Load and system balancing
  – More and more people doing it, fewer and fewer who are any good at it
      Education!
Mapping problem domains to these architectures is still evolving
  – Lots of clever solutions to lots of problems
      Domain and algorithm level
Tools are currently pretty weak
  – Industry appears to be aware of this, not just the market opportunity
Hopefully
  – APIs will insulate us from the variety and evolution of hardware
8
Thank you
Questions?
Please feel free to contact me
ATS has several resources that you can access to try some of these things out
  – Sony PlayStation 3, Cell SDK
  – NVIDIA 8800 GTX, CUDA, RapidMind