WarpPool: Sharing Requests with Inter-Warp Coalescing for Throughput Processors
John Kloosterman, Jonathan Beaumont, Mick Wollman, Ankit Sethia, Ron Dreslinski, Trevor Mudge, Scott Mahlke
Computer Engineering Laboratory, University of Michigan

Introduction
- GPUs have high peak performance
- For many benchmarks, memory throughput limits performance

GPU Architecture
- 32 threads are grouped into SIMD warps
- The warp scheduler sends ready warps to the functional units (ALUs and load/store unit)
[Figure: warps 0-47 issue instructions such as a warp-wide add r1, r2, r3 and a per-thread load [r1], r2 through the warp scheduler to the ALUs and the load/store unit]

GPU Memory System
[Figure: a load issued by the warp scheduler enters the load/store unit, where the intra-warp coalescer groups the warp's addresses by cache line; the resulting requests access the L1, and misses walk through the MSHRs to the L2 and DRAM]
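To make the baseline's intra-warp coalescing concrete, here is a minimal CUDA sketch (mine, not from the talk): consecutive lanes of a warp read consecutive 4-byte elements, so the warp's 32 addresses fall in a single 128-byte cache line and the intra-warp coalescer issues one L1 request for the whole warp.

    // Fully coalesced load: lanes 0..31 of each warp read adjacent floats,
    // so the warp's 32 addresses map to one 128-byte cache line and the
    // intra-warp coalescer sends a single request to the L1.
    __global__ void coalesced_copy(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];   // one cache line per warp
    }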

Problem: Divergence
[Figure: the same memory pipeline, but a divergent load generates many cache-line requests that back up between the intra-warp coalescer and the L1 on their way to the MSHRs, L2, and DRAM]
Speaker notes: There are access patterns where this merging does not work. The important point is that the bottleneck here is not off-chip bandwidth: the divergent warp occupies the path into the L1, so nothing else can be serviced, including hits. That is why fixing the pipeline into the cache is worth doing.
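For contrast with the coalesced case above, a minimal CUDA sketch (mine, not from the talk) of a memory-divergent load; the indirection array idx and the kernel name are illustrative only.

    // Divergent gather: idx[] holds scattered indices, so the 32 lanes of a
    // warp can touch up to 32 different cache lines.  The intra-warp
    // coalescer cannot merge them, and the warp occupies the path into the
    // L1 for many cycles while other warps' loads (even hits) wait.
    __global__ void divergent_gather(const int *idx, const float *in,
                                     float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[idx[i]];   // up to 32 cache lines per warp
    }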

Problem: Bottleneck at L1
[Figure: loads from warps 0-5 wait at the warp scheduler while the load/store unit's intra-warp coalescer feeds requests one at a time into the L1 and MSHRs]
Speaker notes: When there are loads waiting to be scheduled, we need to make every access to the cache count. Example: a scalar load.

Hazards in Benchmarks
- Benchmarks fall into three classes: memory divergent, bandwidth-limited, and cache-limited
- In each class the problem is caused by too many requests; can we merge requests?
[Figure: hazard breakdown per benchmark, grouped into memory-divergent, bandwidth-limited, and cache-limited kernels]
Speaker notes: For the cache-sensitivity comparison, the number of cache ways was doubled: the small configuration was 64x128x4 and the big one 64x128x8. Ways rather than sets were doubled because of the hash algorithm.

Inter-Warp Spatial Locality
- Spatial locality exists not just within a warp
[Figure: the accesses of warps 0-4, divergent inside each warp]

Inter-Warp Spatial Locality (continued)
[Figure: the same accesses from warps 0-4, regrouped across warps]

Inter-Warp Spatial Locality (continued)
- Spatial locality exists not just within a warp
- Key insight: use this locality to address throughput bottlenecks
[Figure: warps 0-4 whose requests fall into the same cache lines and can be merged]
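The access pattern these slides illustrate can be written as a small CUDA sketch (mine, not from the talk; the transpose-like layout and the name read_columns are illustrative). Each warp reads one column of a row-major matrix: within a warp the 32 loads hit 32 different cache lines, but warps working on adjacent columns touch exactly the same lines, which is the inter-warp locality WarpPool merges.

    // Each warp reads one column of a row-major matrix 'in' (width assumed
    // to be a multiple of 32 floats; only the first 32 rows are read, which
    // is enough for the illustration).  Lane l loads in[l * width + col]:
    // divergent within the warp, but the warp for column col+1 reads the
    // very next float in each of the same cache lines.
    __global__ void read_columns(const float *in, float *out,
                                 int width, int height) {
        int lane = threadIdx.x % 32;                              // row index for this lane
        int col  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one column per warp
        if (col < width && lane < height)
            out[col * height + lane] = in[lane * width + col];
    }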

Inter-Warp Window
[Figure: old vs. new memory pipeline. Baseline: the warp scheduler feeds a single intra-warp coalescer that takes 32 addresses and sends 1 cache line per cycle from one warp to the L1, or several per cycle under divergence. WarpPool: the warp scheduler feeds multiple intra-warp coalescers, whose outputs (many cache lines from many warps) are buffered in an inter-warp window, and 1 cache line from many warps is sent to the L1 per cycle.]
Speaker notes: Make this similar to the problem slide with the bottlenecks, as an old vs. new comparison. Key things to get across: buffering requests in the inter-warp window turns back-pressure into an advantage, and the intra-warp coalescing stage is made wider because more requests can now be serviced. The L1 still sees 1 access per cycle, but on behalf of multiple loads, which raises effective bandwidth.

Design Overview
[Figure: the warp scheduler feeds intra-warp coalescers; their requests flow into the inter-warp coalescers' queues, and selection logic chooses which request goes to the L1]
Speaker notes: Consider using a matrix transpose as the running example.

Intra-Warp Coalescers
- Queue load instructions before address generation
- Intra-warp coalescers are the same as in the baseline
- 1 request for 1 cache line exits per cycle to the inter-warp coalescer
Speaker notes: Requests can stay in the intra-warp coalescers for multiple cycles.
[Figure: queued loads from the warp scheduler pass through address generation and an intra-warp coalescer on their way to the inter-warp coalescer]
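As a purely functional model (an assumption about the behavior described on this slide, not the paper's hardware), the grouping step an intra-warp coalescer performs can be sketched in host-side C++:

    #include <cstdint>
    #include <set>
    #include <vector>

    constexpr uint64_t kLineBytes = 128;   // assumed L1 cache-line size

    // Collapse a warp's 32 byte addresses into the set of distinct cache
    // lines; the coalescer then issues one of these requests per cycle.
    std::vector<uint64_t> intra_warp_coalesce(const uint64_t addr[32]) {
        std::set<uint64_t> lines;                   // distinct lines, in address order
        for (int lane = 0; lane < 32; ++lane)
            lines.insert(addr[lane] / kLineBytes);  // group by cache line
        return {lines.begin(), lines.end()};        // 1..32 requests per warp
    }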

Inter-Warp Coalescer
- Many coalescing queues, each with a small number of tags
- Requests are mapped to coalescing queues by cache-line address
- Each entry stores the cache-line address plus the warp ID and thread mapping of every merged request
- Insertion is a tag lookup, at most 1 per cycle per queue (the reason for having many queues)
Speaker notes: How this works: each cycle the inter-warp queues have a request ready to issue, and the address of an incoming request determines which inter-warp queue it maps to.
[Figure: animation in which requests from warps W0 and W1 arrive from the intra-warp coalescers, are sorted by address, and merge into shared queue entries]
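A host-side C++ sketch of the insertion path described above (my functional model under stated assumptions, not the paper's RTL; the modulo hash and field names are illustrative):

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Entry {
        uint64_t line;                                   // cache-line address (the tag)
        std::vector<std::pair<int, uint32_t>> sharers;   // (warp ID, thread mask) per merged request
    };

    struct InterWarpCoalescer {
        static constexpr int kQueues = 32;               // matches the evaluated configuration
        std::vector<Entry> queues[kQueues];

        // Insert one cache-line request from one warp.  The real design allows
        // at most one insertion per queue per cycle; timing is not modeled here.
        void insert(uint64_t line, int warp_id, uint32_t thread_mask) {
            std::vector<Entry> &q = queues[line % kQueues];  // map request to a queue by address
            for (Entry &e : q) {
                if (e.line == line) {                        // tag hit: merge with an earlier warp
                    e.sharers.push_back({warp_id, thread_mask});
                    return;
                }
            }
            Entry fresh;
            fresh.line = line;
            fresh.sharers.push_back({warp_id, thread_mask});
            q.push_back(fresh);                              // no match: allocate a new entry
        }
    };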

Selection Logic
- Select a cache line from the inter-warp queues to send to the L1 cache
- 2 strategies:
  - Default: pick the oldest request
  - Cache-sensitive: prioritize one warp
- Switch between them based on the miss rate over a quantum
Speaker notes: A natural question: if inter-warp coalescing is good, why not select round-robin? Because intra-warp temporal locality has to be maintained. Experiments with round-robin selection, and with adding latency in the coalescer to collect more merges, tend not to be worth it: there is an optimal point between prioritization and coalescing, and this design gets close to it.
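A host-side C++ sketch of the policy switch (my assumption about the mechanism based on this slide; the 50% miss-rate threshold is illustrative, while the 100,000-cycle quantum matches the configuration given later):

    #include <cstdint>

    struct Selector {
        static constexpr uint64_t kQuantum   = 100000;  // cycles per quantum (from the evaluated config)
        static constexpr double   kThreshold = 0.5;     // illustrative miss-rate threshold

        uint64_t accesses = 0, misses = 0, cycle = 0;
        bool cache_sensitive = false;   // false: issue oldest request; true: prioritize one warp

        void record_access(bool miss) { ++accesses; if (miss) ++misses; }

        // Once per quantum, pick the strategy for the next quantum from the
        // observed L1 miss rate.
        void tick() {
            if (++cycle % kQuantum == 0) {
                double miss_rate = accesses ? (double)misses / accesses : 0.0;
                cache_sensitive = miss_rate > kThreshold;
                accesses = misses = 0;
            }
        }
    };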

Methodology
- Implemented in GPGPU-Sim 3.2.2
  - GTX 480 baseline: 32 MSHRs, 32 kB L1 cache, GTO warp scheduler
- Verilog implementation for power and area
- Benchmark criteria
  - Parboil, PolyBench, and Rodinia benchmark suites
  - Memory-throughput limited: memory requests are waiting for more than 90% of execution time
- WarpPool configuration
  - 2 intra-warp coalescers
  - 32 inter-warp queues
  - 100,000-cycle quantum for the request selector
  - Up to 4 inter-warp coalesces per L1 access

Results: Speedup
- 2 mechanisms: increased L1 throughput and decreased L1 misses
[Figure: speedup per benchmark over the baseline, grouped into memory-divergent, bandwidth-limited, and cache-limited; individual bars are labeled 2.35, 3.17, and 5.16, and the overall speedup is 1.38x]
Speaker notes: What to cover on this slide: what the related work was, and that the banked cache did not see a speedup. Use callouts for the bars being discussed, highlighting the benchmarks that show the L1-throughput and L1-miss mechanisms.
[1] MRPB: Memory Request Prioritization for Massively Parallel Processors, HPCA 2014

Results: L1 Throughput
- A banked cache exploits divergence, not locality
- WarpPool merges requests even when they are not divergent
- No speedup for the banked cache: it still handles only 1 miss per cycle
[Figure: L1 throughput per benchmark for memory-divergent, bandwidth-limited, and cache-limited kernels]
Speaker notes: Key insights: 8 banks are often better for raw throughput because they do not require matches with other warps, and for bandwidth-limited kernels the backup of instructions gives WarpPool more merging opportunity. But the banked cache sees no speedup because it services only 1 miss per cycle, on behalf of one warp.

Results: L1 Misses
- MRPB has larger queues
- The oldest-first policy sometimes preserves cross-warp temporal locality
[Figure: L1 misses per benchmark for memory-divergent, bandwidth-limited, and cache-limited kernels, compared against MRPB [1]]
Speaker notes: The difference between WarpPool's prioritization and MRPB's is that WarpPool can reorder individual requests, not just load instructions. In kmeans_2, which has a high level of divergence, WarpPool can interrupt a load that asks for 32 different cache lines. Where WarpPool does worse, it is because MRPB's queues are bigger, since they do not hold addresses.
[1] MRPB: Memory Request Prioritization for Massively Parallel Processors, HPCA 2014

Conclusion
- Many kernels are limited by memory throughput
- Key insight: use inter-warp spatial locality to merge requests
- WarpPool improves performance by 1.38x:
  - Merging requests: increases L1 throughput by 8%
  - Prioritizing requests: decreases L1 misses by 23%

WarpPool: Sharing Requests with Inter-Warp Coalescing for Throughput Processors
John Kloosterman, Jonathan Beaumont, Mick Wollman, Ankit Sethia, Ron Dreslinski, Trevor Mudge, Scott Mahlke
Computer Engineering Laboratory, University of Michigan