Brook for GPUs
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan
Stanford University
DARPA Site Visit, UNC, May 6th, 2004
Motivation
GPUs are faster than CPUs, and GPUs are getting faster, faster. Why?
– Massive parallelism (1000s of ALUs)
– Choreographed communication
– Efficiently utilize VLSI resources [DIS/PCA mantra]
Programmable GPUs = stream processors
Many streaming applications beyond graphics
Buy a desktop supercomputer for $50!
Revolutionize computing?
Recent Performance Trends
CPU vs GPU
Intel 3 GHz Pentium 4
– 12 GFLOPS peak performance (via SSE2)
– 5.96 GB/sec peak memory bandwidth
– 44 GB/sec peak bandwidth from the 8 KB L1 data cache
NVIDIA GeForce 6800
– 45 GFLOPS peak performance
– 36 GB/sec peak memory bandwidth
– Texture cache bandwidth and size? (undisclosed)
Deliverables
Develop a version of PCA Brook for GPUs
– Programmer need not know GL
Versions
– New ATI (R420) and NVIDIA (NV40) hardware
– Linux and Windows
– DX and OpenGL
Release as open source [v1.0, Dec 2003]
Support OneSAF LOS, collision detection, and route planning algorithms
Research Issues
Brook semantics
– E.g. variable-length streams: vout
– …
Compilation techniques
– Virtualization of the GPU
– Splitting kernels (MRDS)
Explore the streaming application space
– Scientific computing: RT, MD, BLAS, FFT, …
– Machine learning: HMM, linear mod., Bayes, …
Brook Update
Ian Buck
Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
Dense Matrix-Matrix Multiplication
ATLAS on the Intel P4 wins!
CPU vs GPU
Intel 3 GHz Pentium 4
– 12 GFLOPS peak performance (via SSE2)
– 5.96 GB/sec peak memory bandwidth
– 44 GB/sec peak bandwidth from the 8 KB L1 data cache
NVIDIA GeForce 6800
– 43 GFLOPS peak performance
– 36 GB/sec peak memory bandwidth
– Texture cache bandwidth and size? (undisclosed)
Why is graphics hardware so slow?
Why is Graphics Hardware so Slow?
Microbenchmark (MAD):

GPU          GFLOPS   Cache BW (GB/sec)   Seq Read BW (GB/sec)
NV35         39.99    11.08               4.40
NV40         43.00    18.9                3.85
ATI 9800XT   26.14    12.20               7.33
ATI X800     33.4     30.7                18.4

NVIDIA: 8% compute efficiency, 82% of cache bandwidth. Arithmetic intensity: 12 math operations per float fetched from cache.
ATI: 18% of peak performance, 99% of peak cache bandwidth. Arithmetic intensity: 8-to-1 math-to-cache-fetch ratio.
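To see what these figures imply, here is a back-of-the-envelope sketch (mine, not from the slides) of the arithmetic intensity a kernel needs before it becomes compute-bound rather than cache-bandwidth-bound; the 4-byte-float assumption and the choice of the NV40 row are illustrative:

```python
# Break-even arithmetic intensity for the NV40 row above (illustrative only).
peak_gflops = 43.0        # peak MAD rate, GFLOPS
cache_bw = 18.9e9         # measured cache bandwidth, bytes/sec
bytes_per_float = 4       # assuming 4-byte floats

floats_per_sec = cache_bw / bytes_per_float
break_even = peak_gflops * 1e9 / floats_per_sec
print(f"{break_even:.1f} flops per float fetched")   # ~9.1
# A kernel doing fewer math ops per cache fetch than this is bandwidth-limited.
```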
Why is Graphics Hardware so Slow?
Matrix-matrix multiplication is bandwidth-limited on the GPU.
– Memory blocking to increase cache utilization does not help
– Architectural problem, not a programming-model problem
PCA stream processing architectures (Imagine) will do much better!

Matrix-Matrix Multiplication:

             GFLOPS   Bandwidth (GB/sec)
NV35         3.04     9.07
NV40         7.24     14.88
ATI 9800XT   4.83     12.06
ATI X800     ~12      ~30
P4           7.78     27.68
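For readers unfamiliar with the memory blocking the slide dismisses, a minimal sketch of a cache-blocked matrix multiply; NumPy, the block size, and the names are illustrative and not from the talk:

```python
import numpy as np

def blocked_matmul(A, B, block=32):
    # Cache-blocked C = A @ B: each block x block tile of A and B is reused
    # many times while it is (hopefully) resident in cache.
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                C[i0:i0+block, j0:j0+block] += (
                    A[i0:i0+block, k0:k0+block] @ B[k0:k0+block, j0:j0+block]
                )
    return C

A = np.random.rand(128, 128).astype(np.float32)
B = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(blocked_matmul(A, B), A @ B, atol=1e-3)
```

On a CPU this kind of blocking raises the math-to-fetch ratio; the slide's point is that on these GPUs the same trick does not escape the bandwidth limit, which is why it is described as an architectural rather than a programming-model problem.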
Variable Output Shaders
Daniel Horn, Ian Buck, Pat Hanrahan
Motivation: Enabling Algorithms
Not all algorithms map to the 1-in/1-out semantics of GPUs.
Other classes of algorithms require data filtering (1-in/0-out) and amplification (1-in/n-out).
vout is a conditional write on Imagine.
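As a point of reference (plain Python generators, not Brook), a minimal sketch of the three stream shapes being contrasted; the function names are mine:

```python
def map_stream(xs, f):                 # 1-in, 1-out: what fragment programs give you
    for x in xs:
        yield f(x)

def filter_stream(xs, keep):           # 1-in, 0-or-1-out: data filtering
    for x in xs:
        if keep(x):
            yield x

def amplify_stream(xs, expand):        # 1-in, n-out: data amplification
    for x in xs:
        yield from expand(x)

print(list(filter_stream(range(8), lambda x: x % 2 == 0)))   # [0, 2, 4, 6]
print(list(amplify_stream(range(3), lambda x: [x] * x)))     # [1, 2, 2]
```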
Algorithms
Ray tracing terrains
Marching cubes
Adaptive subdivision surfaces
Collision detection [OBB]
Graph traversal
…
Implementation on GPU
Push output (sentinel if no push)
Options to consolidate sentinels (a sketch of the scan/scatter option follows this list):
– Sort, O(n (log n)^2): sort sentinels to the end, then truncate
– Scan/Search, O(n log n): perform a running sum, then search for the gather location
– Scan/Scatter, O(n log n): perform a running sum, then scatter to the destination
– Constant-time hardware implementation
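A minimal serial sketch (mine, not from the talk) of the Scan/Scatter option: an exclusive running sum over valid flags gives each surviving element its output index, and a scatter drops the sentinels. On the GPU each step would be a data-parallel pass:

```python
SENTINEL = None

def compact(stream):
    valid = [x is not SENTINEL for x in stream]
    # Exclusive prefix sum ("running sum") of the valid flags.
    index, total = [], 0
    for v in valid:
        index.append(total)
        total += v
    out = [None] * total
    for x, v, i in zip(stream, valid, index):
        if v:
            out[i] = x        # scatter to the final destination
    return out

print(compact([3, SENTINEL, 7, SENTINEL, SENTINEL, 1]))   # [3, 7, 1]
```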
Timing and Bandwidth Numbers
Future Work
Brook: semantics, compiling, virtualization
– Support new GPU features (branching, FB ops, …)
– Predication
Integration with the graphics pipeline
– Documented path to texture for rendering
– Access to other GPU features, e.g. occlusion culling
Interactive simulation; new algorithms
– Collision detection and line-of-sight calculations
– Merge ray tracer with UNC/SAIC algorithm
– Machine learning: HMM, GLM, k-means, …
– Protein folding (StreamMD) and docking
– Virtual surgery
Distributed Brook
Stream- and thread-level parallelism
UPC distributed memory semantics
PCI-Express system for fast readback
GPU Cluster [DOE]
16-node cluster
– 32 2.4 GHz P4 Xeons
– 16 GB DDR
– 1.2 TB disk
– Infiniband 4X interconnect
Each node (3U, half depth)
– Dual 2.4 GHz P4 Xeons
– Intel E7505 chipset
– 1 GB DDR
– ATI Radeon 9800 Pro 256 MB
– GigE
– 80 GB IDE disk
Questions?
Fly-fishing fly images from The English Fly Fishing Shop