Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor. José-María Arnau, Joan-Manuel Parcerisa (UPC), Polychronis Xekalakis.

Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor José-María Arnau, Joan-Manuel Parcerisa (UPC) Polychronis Xekalakis (Intel)

Focusing on Mobile GPUs
Market demands and technology limitations call for energy-efficient mobile GPUs.
Example: Samsung Galaxy SII vs. Samsung Galaxy Note when running the 3D game Shadow Gun.

GPU Performance and Memory
Graphical workloads:
 Large working sets not amenable to caching
 Texture memory accesses are fine-grained and unpredictable
Traditional techniques to deal with memory:
 Caches
 Prefetching
 Multithreading
A mobile single-threaded GPU with perfect caches achieves a speedup of 3.2x on a set of commercial Android games.

Outline
 Background
 Methodology
 Multithreading & Prefetching
 Decoupled Access/Execute
 Conclusions

Assumed GPU Architecture

Assumed Fragment Processor
 4 threads per warp
 4-wide vector registers (16 bytes)
 36 registers per thread
 Warp: a group of threads executed in lockstep (a SIMD group)
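Lockstep execution means a warp issues one instruction and all of its threads apply it to their own registers in the same cycle. A minimal Python sketch of this SIMD-style model (the 4-thread warp width matches the configuration above; the `(opcode, dest, src1, src2)` instruction format is a hypothetical simplification):

```python
# Sketch of lockstep (SIMD) warp execution, assuming a hypothetical
# three-operand instruction format: (opcode, dest, src1, src2).
WARP_WIDTH = 4  # threads per warp, as in the assumed fragment processor

def execute_warp(instructions, registers):
    """Run each instruction once, applied to all threads in lockstep.

    `registers` maps a register name to a list of WARP_WIDTH
    per-thread values (each thread has its own register state).
    """
    for op, dest, src1, src2 in instructions:
        a, b = registers[src1], registers[src2]
        if op == "add":
            registers[dest] = [a[t] + b[t] for t in range(WARP_WIDTH)]
        elif op == "mul":
            registers[dest] = [a[t] * b[t] for t in range(WARP_WIDTH)]
    return registers

# One "add" executes for all four threads at once.
regs = {"r0": [1, 2, 3, 4], "r1": [10, 10, 10, 10], "r2": [0] * 4}
execute_warp([("add", "r2", "r0", "r1")], regs)
```

The key property this illustrates is that a single fetched instruction drives all threads, which is what makes the warp cheap to control but sensitive to any one thread's memory stalls.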

Methodology
Power model: CACTI 6.5 and Qsilver
 Main memory: latency = 100 cycles, bandwidth = 4 bytes/cycle
 Pixel/texture caches: 2 KB, 2-way, 2-cycle latency
 L2 cache: 32 KB, 8-way, 12-cycle latency
 Number of cores: 4 vertex processors, 4 pixel processors
 Warp width: 4 threads
 Register file size: 2304 bytes per warp
 Number of warps: 1-16 warps/core

Workload Selection
 2D games: small/medium-sized textures; texture filtering: 1 memory access; small fragment programs
 Simple 3D games: small/medium-sized textures; texture filtering: 1-4 memory accesses; small/medium fragment programs
 Complex 3D games: medium/big-sized textures; texture filtering: 4-8 memory accesses; big, memory-intensive fragment programs

Improving Performance Using Multithreading
 Very effective at hiding memory latency
 High energy cost (25% more energy)
 A huge register file is needed to maintain the state of all threads: a 36 KB main register file (MRF) for a GPU with 16 warps/core (bigger than the L2 cache)

Employing Prefetching
Hardware prefetchers evaluated:
 Global History Buffer: K. J. Nesbit and J. E. Smith, "Data Cache Prefetching Using a Global History Buffer", HPCA 2004
 Many-Thread Aware: J. Lee, N. B. Lakshminarayana, H. Kim and R. Vuduc, "Many-Thread Aware Prefetching Mechanisms for GPGPU Applications", MICRO 2010
Prefetching is effective, but there is still ample room for improvement
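To make the comparison concrete, the core idea behind the Global History Buffer can be sketched in a few lines. This is a heavily simplified, PC-localized stride version, not the full linked-list GHB from the paper: a real GHB keeps a global miss FIFO whose entries are chained by link pointers; here each PC just remembers its last two miss addresses.

```python
from collections import deque

class GHBPrefetcher:
    """Simplified sketch of a Global History Buffer (GHB) prefetcher.

    The real GHB (Nesbit & Smith) chains misses from the same PC
    through link pointers in a global FIFO; this sketch keeps only
    the last two miss addresses per PC and prefetches the next
    address along the detected stride.
    """

    def __init__(self, history_size=256):
        self.history = deque(maxlen=history_size)  # global miss FIFO
        self.by_pc = {}                            # pc -> last two miss addresses

    def on_miss(self, pc, addr):
        """Record a cache miss; return an address to prefetch, or None."""
        self.history.append(addr)
        last_two = (self.by_pc.get(pc, []) + [addr])[-2:]
        self.by_pc[pc] = last_two
        if len(last_two) == 2:
            stride = last_two[1] - last_two[0]
            if stride != 0:
                return last_two[1] + stride
        return None

ghb = GHBPrefetcher()
ghb.on_miss(0x400, 0x1000)         # first miss from this PC: no prediction yet
hint = ghb.on_miss(0x400, 0x1040)  # stride of 0x40 detected; hint is 0x1080
```

The limitation this exposes is exactly the slide's point: stride-correlation works for regular streams, but fine-grained, unpredictable texture accesses leave ample room for improvement.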

Decoupled Access/Execute
 Use the fragment information to compute the addresses that will be requested when processing the fragment
 Issue memory requests while the fragments are waiting in the tile queue
 Tile queue size:
   Too small: timeliness is not achieved
   Too big: cache conflicts
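The mechanism above can be sketched as a queue whose push path doubles as the access stage. This is an illustrative sketch only: `compute_addresses` and `prefetch` are hypothetical stand-ins for the fragment address-generation logic and the memory request path, and the bounded size models the timeliness-vs-conflicts trade-off.

```python
from collections import deque

class DecoupledTileQueue:
    """Sketch of decoupled access/execute for fragments: when a
    fragment is pushed into the tile queue, its texture addresses are
    computed up front and prefetched, so the data arrives while the
    fragment waits its turn to be processed.
    """

    def __init__(self, size, compute_addresses, prefetch):
        self.queue = deque()
        self.size = size  # too small: no timeliness; too big: cache conflicts
        self.compute_addresses = compute_addresses
        self.prefetch = prefetch

    def push(self, fragment):
        if len(self.queue) >= self.size:
            return False              # queue full; caller must stall
        for addr in self.compute_addresses(fragment):
            self.prefetch(addr)       # issue request while the fragment waits
        self.queue.append(fragment)
        return True

    def pop(self):
        return self.queue.popleft()   # data should now be in the cache

# Hypothetical address function: two texels per fragment.
issued = []
tq = DecoupledTileQueue(size=4,
                        compute_addresses=lambda frag: [frag * 64, frag * 64 + 16],
                        prefetch=issued.append)
tq.push(1)
```

The design choice mirrored here is that the access stream runs ahead of execution by exactly the queue depth, which is why sizing the tile queue is the central tuning knob.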

Inter-Core Data Sharing
 66.3% of cache misses are requests for data already available in the L1 cache of another fragment processor
 Use the prefetch queue to detect inter-core data sharing
 Saves bandwidth to the L2 cache
 Saves power (L1 caches are smaller than the L2)
 Associative comparisons require additional energy
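The remote-L1 lookup can be sketched as follows. The dict-of-sets cache model and the counters are assumptions for illustration; the loop over sibling caches stands in for the associative comparison the slide notes as an energy cost.

```python
class InterCoreL1Directory:
    """Sketch of inter-core data sharing: on a local L1 miss, check
    the other fragment processors' L1 caches before going to the
    shared L2, saving L2 bandwidth when a sibling core has the line.
    """

    def __init__(self, num_cores):
        self.l1 = {c: set() for c in range(num_cores)}  # core -> cached lines
        self.l2_requests = 0
        self.remote_hits = 0

    def load(self, core, line):
        if line in self.l1[core]:
            return "local L1 hit"
        for other, cache in self.l1.items():
            if other != core and line in cache:          # associative compare
                self.remote_hits += 1                    # L2 bandwidth saved
                self.l1[core].add(line)
                return "remote L1 hit (core %d)" % other
        self.l2_requests += 1                            # fall back to L2
        self.l1[core].add(line)
        return "L2 access"

mem = InterCoreL1Directory(num_cores=4)
mem.load(0, 0x80)           # cold miss: goes to L2
result = mem.load(1, 0x80)  # core 1 finds the line in core 0's L1
```

With the 66.3% sharing figure from the slide, most misses would take the cheaper remote-L1 path in this model, which is where the bandwidth and power savings come from.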

Decoupled Access/Execute: Results
 33% faster than hardware prefetchers, with 9% energy savings
 DAE with 2 warps/core achieves 93% of the performance of a bigger GPU with 16 warps/core, while providing 34% energy savings

Benefits of Remote L1 Cache Accesses
 Single-threaded GPU
 Baseline: Global History Buffer prefetcher
 30% speedup
 5.4% energy savings

Conclusions
 High-performance, energy-efficient GPUs can be architected around the decoupled access/execute concept
 A combination of decoupled access/execute (to hide memory latency) and multithreading (to hide functional-unit latency) provides the most energy-efficient solution
 Allowing remote L1 cache accesses saves L2 cache bandwidth and energy
 The decoupled access/execute architecture outperforms hardware prefetchers: 33% speedup, 9% energy savings

Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor
José-María Arnau (UPC), Joan-Manuel Parcerisa (UPC), Polychronis Xekalakis (Intel)
Thank you! Questions?