
Slide 1: Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor
José-María Arnau (UPC), Joan-Manuel Parcerisa (UPC), Polychronis Xekalakis (Intel)

Slide 2: Focusing on Mobile GPUs
Market demands and technology limitations drive the need for energy-efficient mobile GPUs.
[1] http://www.digitalversus.com/mobile-phone/samsung-galaxy-note-p11735/test.html (Samsung Galaxy SII vs. Samsung Galaxy Note running the game Shadow Gun 3D)
[2] http://www.ispsd.com/02/battery-psd-templates/

Slide 3: GPU Performance and Memory
Graphical workloads:
- Large working sets not amenable to caching
- Texture memory accesses are fine-grained and unpredictable
Traditional techniques for dealing with memory:
- Caches
- Prefetching
- Multithreading
A mobile single-threaded GPU with perfect caches achieves a speedup of 3.2x on a set of commercial Android games.

Slide 4: Outline
- Background
- Methodology
- Multithreading & Prefetching
- Decoupled Access/Execute
- Conclusions

Slide 5: Assumed GPU Architecture

Slide 6: Assumed Fragment Processor
- Warp: group of threads executed in lockstep (SIMD group)
- 4 threads per warp
- 4-wide vector registers (16 bytes)
- 36 registers per thread

Slide 7: Methodology
Power model: CACTI 6.5 and Qsilver

Main memory:           latency = 100 cycles, bandwidth = 4 bytes/cycle
Pixel/texture caches:  2 KB, 2-way, 2-cycle access
L2 cache:              32 KB, 8-way, 12-cycle access
Number of cores:       4 vertex processors, 4 pixel processors
Warp width:            4 threads
Register file size:    2304 bytes per warp
Number of warps:       1-16 warps/core

Slide 8: Workload Selection
2D games:          small/medium-sized textures; texture filtering: 1 memory access; small fragment programs
Simple 3D games:   small/medium-sized textures; texture filtering: 1-4 memory accesses; small/medium fragment programs
Complex 3D games:  medium/big-sized textures; texture filtering: 4-8 memory accesses; big, memory-intensive fragment programs

Slide 9: Improving Performance Using Multithreading
- Very effective, but carries a high energy cost (25% more energy)
- Requires a huge register file to hold the state of all the threads: a 36 KB main register file (MRF) for a GPU with 16 warps/core, bigger than the L2 cache
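The register-file sizes quoted on this slide and in the methodology table follow directly from the assumed configuration (36 registers per thread, 16-byte vector registers, 4 threads per warp); a quick check:

```python
# Figures taken from the slides; the arithmetic just verifies them.
REGS_PER_THREAD = 36
BYTES_PER_REG = 16      # 4-wide vector register, 4 bytes per lane
THREADS_PER_WARP = 4
WARPS_PER_CORE = 16

warp_rf = REGS_PER_THREAD * BYTES_PER_REG * THREADS_PER_WARP
mrf = warp_rf * WARPS_PER_CORE

print(warp_rf)  # 2304 bytes per warp, matching the methodology table
print(mrf)      # 36864 bytes = 36 KB, larger than the 32 KB L2 cache
```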

Slide 10: Employing Prefetching
Hardware prefetchers evaluated:
- Global History Buffer: K. J. Nesbit and J. E. Smith, "Data Cache Prefetching Using a Global History Buffer", HPCA 2004.
- Many-Thread Aware: J. Lee, N. B. Lakshminarayana, H. Kim and R. Vuduc, "Many-Thread Aware Prefetching Mechanisms for GPGPU Applications", MICRO 2010.
Prefetching is effective, but there is still ample room for improvement.
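For context, a Global History Buffer records recent misses in a FIFO, with an index table chaining together misses from the same PC; walking a chain yields the address deltas used to predict upcoming misses. A minimal sketch of that idea (our simplification, not the paper's exact design):

```python
class GHBPrefetcher:
    """Simplified PC-localized Global History Buffer (after Nesbit & Smith).
    Sizes, degree, and the single-delta prediction rule are our assumptions."""

    def __init__(self, size=256, degree=2):
        self.size, self.degree = size, degree
        self.history = []   # global FIFO of (pc, addr) miss records
        self.index = {}     # pc -> positions in history (the link chain)

    def miss(self, pc, addr):
        pos = len(self.history)
        self.history.append((pc, addr))
        chain = self.index.setdefault(pc, [])
        chain.append(pos)
        # drop chain entries that have aged out of the logical buffer window
        self.index[pc] = [i for i in chain if i > pos - self.size]
        return self.predict(pc)

    def predict(self, pc):
        chain = self.index.get(pc, [])
        if len(chain) < 2:
            return []
        # delta between the two most recent misses from this pc
        a_prev = self.history[chain[-2]][1]
        a_last = self.history[chain[-1]][1]
        delta = a_last - a_prev
        if delta == 0:
            return []
        return [a_last + delta * (k + 1) for k in range(self.degree)]


# A strided miss stream (delta 64) yields the next two stride addresses.
p = GHBPrefetcher()
p.miss(pc=1, addr=0)
p.miss(pc=1, addr=64)
print(p.miss(pc=1, addr=128))  # [192, 256]
```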

Slide 11: Decoupled Access/Execute
- Use the fragment information to compute the addresses that will be requested when processing the fragment
- Issue memory requests while the fragments are waiting in the tile queue
- Tile queue size: too small, and timeliness is not achieved; too big, and cache conflicts arise
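The scheme above can be sketched in a few lines: an access stage computes a fragment's texture addresses as soon as the fragment enters the tile queue and prefetches them, so the execute stage finds the data cached when it pops the fragment. The `Cache` model and the `texture_addresses` calculation below are illustrative assumptions, not the paper's design:

```python
from collections import deque

TILE_QUEUE_SIZE = 16  # too small: prefetches not timely; too big: cache conflicts


class Cache:
    """Toy cache with 64-byte lines that counts hits and misses."""

    def __init__(self):
        self.lines, self.hits, self.misses = set(), 0, 0

    def prefetch(self, addr):
        self.lines.add(addr // 64)

    def read(self, addr):
        if addr // 64 in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            self.lines.add(addr // 64)
        return addr


def texture_addresses(fragment):
    # Hypothetical address calculation: point sampling from a 64-texel-wide
    # texture with 4-byte texels, given the fragment's (u, v) coordinates.
    u, v = fragment
    return [(v * 64 + u) * 4]


class DecoupledFragmentPipeline:
    def __init__(self, cache):
        self.tile_queue = deque()  # bounded in hardware; enqueue stalls when full
        self.cache = cache

    def enqueue(self, fragment):
        # Access stage: prefetch the fragment's data before it is processed.
        for addr in texture_addresses(fragment):
            self.cache.prefetch(addr)
        self.tile_queue.append(fragment)

    def execute_one(self):
        # Execute stage: by now the prefetched data should be in the cache.
        fragment = self.tile_queue.popleft()
        return [self.cache.read(a) for a in texture_addresses(fragment)]
```

With enough fragments queued between access and execute, every texture read hits in the cache, which is the latency-hiding effect the slide describes.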

Slide 12: Inter-Core Data Sharing
- 66.3% of cache misses are requests for data already available in the L1 cache of another fragment processor
- Use the prefetch queue to detect inter-core data sharing
- Saves bandwidth to the L2 cache
- Saves power (L1 caches are smaller than the L2)
- Associative comparisons require additional energy
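A minimal sketch of the remote-L1 idea, with our own naming and a much-simplified memory model: on a local L1 miss, probe the other fragment processors' L1 caches before spending bandwidth on the shared L2:

```python
class MemorySystem:
    """Toy model: per-core L1 caches (sets of line ids) over a shared L2."""

    def __init__(self, num_cores=4):
        self.l1 = [set() for _ in range(num_cores)]
        self.l2_requests = 0

    def access(self, core, line):
        if line in self.l1[core]:
            return "local L1 hit"
        # On a local miss, check the other cores' L1s first (the slide's
        # associative comparison): a remote hit avoids any L2 traffic.
        for other, cache in enumerate(self.l1):
            if other != core and line in cache:
                self.l1[core].add(line)
                return f"remote L1 hit (core {other})"
        self.l2_requests += 1  # only truly unshared data reaches the L2
        self.l1[core].add(line)
        return "L2 access"
```

Since 66.3% of misses would hit in a remote L1, roughly two thirds of the L2 requests in a conventional design are avoidable this way, at the energy cost of the extra associative lookups.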

Slide 13: Decoupled Access/Execute
- 33% faster than hardware prefetchers, with 9% energy savings
- DAE with 2 warps/core achieves 93% of the performance of a bigger GPU with 16 warps/core, while providing 34% energy savings

Slide 14: Benefits of Remote L1 Cache Accesses
- Single-threaded GPU; baseline: Global History Buffer prefetcher
- 30% speedup
- 5.4% energy savings

Slide 15: Conclusions
- High-performance, energy-efficient GPUs can be architected based on the decoupled access/execute concept
- Combining decoupled access/execute (to hide memory latency) with multithreading (to hide functional-unit latency) provides the most energy-efficient solution
- Allowing remote L1 cache accesses saves both L2 cache bandwidth and energy
- The decoupled access/execute architecture outperforms hardware prefetchers: 33% speedup, 9% energy savings

Slide 16: Thank You! Questions?

