Mascar: Speeding up GPU Warps by Reducing Memory Pitstops

Presentation transcript:

Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Ankit Sethia*, D. Anoushe Jamshidi, Scott Mahlke
University of Michigan

GPU usage is expanding
Graphics, Data Analytics, Linear Algebra, Machine Learning, Simulation, Computer Vision
GPUs are taking center stage in HPC. Traditional applications such as graphics, linear algebra, and other simulations are becoming more complex and use the GPU as their computational workhorse. Emerging applications not only want to exploit the computational power of GPUs, but also the high memory bandwidth provided by GDDR5. As a result, all kinds of data-parallel applications, whether compute or memory intensive, are targeting GPUs. Next, look at the performance achieved by these different kinds of kernels.

Performance variation of kernels
[Chart: performance of compute-intensive vs. memory-intensive kernels]
Memory-intensive kernels saturate memory bandwidth and achieve lower performance, which motivates a closer look at the problems associated with bandwidth and memory saturation.

Impact of memory saturation - I
[Diagram: several SMs, each with an L1, FPUs, and an LSU, sharing the memory system]
The impact of bandwidth saturation on the system is shown through an illustration. The LSU is the load/store unit; once the memory system is saturated, a warp can issue a new request only when an earlier one returns. Memory-intensive kernels therefore serialize their memory requests, and it becomes critical to prioritize the order of requests coming from the SMs.

Impact of memory saturation
[Chart: the earlier compute-intensive vs. memory-intensive performance plot, overlaid with the fraction of cycles the LSU is stalled]
To quantify the impact of memory saturation, the earlier plot is extended with the fraction of cycles the LSU is stalled. As memory intensity increases, LSU stall cycles grow to as high as 90%, so saturation significantly delays a core from receiving the data it needs to begin processing. Significant LSU stalls correspond to low performance in memory-intensive kernels. There are further problems associated with saturation of the memory system, discussed next.

Impact of memory saturation - II
[Diagram: SMs with L1, FPUs, and LSU; warp W1's data is resident in a cache block, but the cache is blocked]
The second problem is that the data cache cannot accept new requests while it is blocked. Data that is already present in the cache cannot be accessed by the LSU, so the SM is unable to feed enough data to the functional units for processing.
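To make the blocking behavior concrete, below is a minimal C++ sketch of a baseline LSU in front of an L1 cache. This is an editor's illustration under assumed names (L1Cache, mshr_full, lsu_cycle_baseline), not the paper's or GPGPU-Sim's implementation: once the request at the head of the queue misses while all MSHRs are busy, the whole LSU stalls, even if younger requests would hit in the cache.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_set>

// Simplified single-ported L1 + LSU model (illustrative names only).
struct L1Cache {
    std::unordered_set<uint64_t> lines;   // tags currently resident
    unsigned mshrs_in_use = 0;
    unsigned mshr_count   = 64;

    bool hit(uint64_t addr) const { return lines.count(addr >> 7) != 0; }
    bool mshr_full() const        { return mshrs_in_use == mshr_count; }
};

// Baseline behavior: the LSU services requests strictly in order.
// If the head request misses while all MSHRs are busy, the LSU stalls,
// and requests behind it stall too, even if their data is in the L1.
bool lsu_cycle_baseline(L1Cache& l1, std::deque<uint64_t>& pending) {
    if (pending.empty()) return false;
    uint64_t addr = pending.front();
    if (l1.hit(addr)) {                  // hit: data returned to the register file
        pending.pop_front();
        return true;
    }
    if (!l1.mshr_full()) {               // miss: allocate an MSHR, send to L2
        ++l1.mshrs_in_use;
        pending.pop_front();
        return true;
    }
    return false;                        // cache blocked: the whole LSU stalls
}
```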

Increasing memory resources
Large # of MSHRs + full associativity + 20% bandwidth boost
Such a combination of resources is UNBUILDABLE in practice.

Mascar = Mas + Car
During memory saturation:
- Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (Mas)
- Data present in the cache cannot be reused because the data cache cannot accept any request: Cache Access Re-execution (Car)

Memory Aware Scheduling
[Illustration: Warp 0, Warp 1, and Warp 2, each with a set of memory requests, under memory saturation]
This illustrates the shortcoming of current schedulers and how MAS improves the overlap of compute and memory. Serving one request from a warp and then switching to another warp (round-robin, RR) means that, under saturation, no warp has all of its data, so no warp is ready to make forward progress.

Memory Aware Scheduling
[Illustration: the same three warps and their memory requests under saturation]
GTO issues instructions from another warp whenever there is no instruction in the i-buffer for the current warp or there is a dependency between its instructions. This behaves similarly to RR, since multiple warps may still issue memory requests. Serving one request and then switching to another warp (RR) leaves no warp ready to make forward progress. Serving all requests from one warp and then switching to another (MAS) means one warp is ready to begin computation early.

MAS operation
1. Check whether the kernel is memory intensive (MSHRs or miss queue almost full).
   - No: schedule in Equal Priority (EP) mode.
   - Yes: schedule in Memory Priority (MP) mode.
2. In MP mode, assign an owner warp; only the owner's requests can go beyond the L1.
3. Execute memory instructions only from the owner; other warps can execute compute instructions.
4. If the owner's next instruction depends on an already issued load, assign a new owner warp; otherwise keep scheduling the current owner in MP mode.
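The mode and ownership decisions can be summarized in a short C++ sketch. This is a simplified model with invented names (SchedulerState, memory_saturated, pick_mem_issuer) and an assumed "almost full" threshold; it is not the paper's exact hardware.

```cpp
#include <optional>
#include <vector>

// Illustrative sketch of MAS's mode selection and owner-warp hand-off.
struct Warp {
    int  id;
    bool next_is_mem   = false;   // next instruction is a load/store
    bool next_dep_load = false;   // next instruction depends on an outstanding load
};

struct SchedulerState {
    bool memory_priority = false;     // MP vs. EP mode
    std::optional<int> owner_warp;    // only the owner's requests may go past L1
};

// MSHRs or the L1 miss queue being almost full signals memory saturation.
// The "-2" slack is an assumption for this sketch.
bool memory_saturated(unsigned mshrs_used, unsigned mshrs_total,
                      unsigned missq_used, unsigned missq_total) {
    return mshrs_used + 2 >= mshrs_total || missq_used + 2 >= missq_total;
}

// One scheduling decision: which warp may issue a memory instruction this cycle?
// Warp ids are assumed to equal their indices in the vector.
std::optional<int> pick_mem_issuer(SchedulerState& s, std::vector<Warp>& warps,
                                   bool saturated) {
    s.memory_priority = saturated;
    if (!s.memory_priority) {
        // EP mode: any warp with a ready memory instruction may issue.
        for (auto& w : warps)
            if (w.next_is_mem) return w.id;
        return std::nullopt;
    }
    // MP mode: keep the current owner until its next instruction depends on an
    // already-issued load; then pass ownership to another warp.
    if (!s.owner_warp || warps[*s.owner_warp].next_dep_load) {
        for (auto& w : warps)
            if (w.next_is_mem && !w.next_dep_load) { s.owner_warp = w.id; break; }
    }
    // Only the owner issues memory instructions; other warps do compute only.
    if (s.owner_warp && warps[*s.owner_warp].next_is_mem) return s.owner_warp;
    return std::nullopt;
}
```

In EP mode any ready warp may issue memory instructions, while in MP mode ownership only moves on when the current owner's next instruction depends on a load that is still outstanding, so one warp gets all of its data and can start computing early.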

Implementation of MAS
[Block diagram: Decode, I-Buffer, Scoreboard, and Scheduler, extended with a Warp Status Table (WST) of ordered warps (warp id, stall bit, OPtype), a Warp Readiness Checker (WRC) holding the memory saturation flag, and Mem_Q / Comp_Q heads feeding the issued warp]
- Warps are divided into memory warps and compute warps in the ordered warp list.
- Warp Readiness Checker (WRC): tests whether a warp should be allowed to issue memory instructions.
- Warp Status Table (WST): decides whether the scheduler should schedule from a warp.
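A rough sketch of the per-SM state this adds, again with illustrative field names rather than the paper's exact microarchitecture:

```cpp
#include <cstdint>
#include <vector>

// Warp Status Table (WST): one entry per warp, used to order warps into
// memory warps and compute warps for the scheduler.
struct WSTEntry {
    uint8_t warp_id;
    bool    is_memory_warp;   // next instruction to issue is a load/store
    bool    stall;            // set by the WRC: warp may not issue memory ops
};

// Warp Readiness Checker (WRC): tracks the owner warp and the saturation flag,
// and decides whether a given warp may issue a memory instruction.
struct WRC {
    int  owner_warp             = -1;
    bool memory_saturation_flag = false;

    bool may_issue_memory(const WSTEntry& e) const {
        // Under saturation, only the owner's requests may go beyond the L1.
        return !memory_saturation_flag || e.warp_id == owner_warp;
    }
};

// The scheduler keeps two ordered lists; in memory-priority mode it drains
// the head of the memory queue (Mem_Q) before compute-only warps (Comp_Q).
struct OrderedWarps {
    std::vector<WSTEntry> mem_q;    // warps whose next instruction is memory
    std::vector<WSTEntry> comp_q;   // warps whose next instruction is compute
};
```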

Mascar = Mas + Car (recap)
During memory saturation:
- Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (Mas)
- Data present in the cache cannot be reused because the data cache cannot accept any request: Cache Access Re-execution (Car), covered next

Cache access re-execution
[Diagram: Load Store Unit with requests from W0, W1, and W2 in front of the L1 cache; W1's data is resident and would HIT, and rejected accesses are placed in a Re-execution Queue]
When the L1 cannot accept a request, the LSU places it in a re-execution queue instead of stalling, so later accesses whose data is already in the cache can still hit. One can argue that this problem could be solved by adding more MSHRs, but re-execution is better than adding more MSHRs: more MSHRs cause faster saturation of the memory system and faster thrashing of the data cache.
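A sketch of how the re-execution queue keeps the LSU unblocked follows; it reuses the L1Cache model from the earlier baseline sketch, and the queue size and names are assumptions rather than the paper's parameters.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Cache access re-execution (Car) sketch; uses the L1Cache struct defined in
// the earlier baseline sketch. Instead of stalling the LSU when the L1 cannot
// accept a miss, the request is parked in a small re-execution queue and
// retried later, so younger requests whose data is already in the cache can
// still be serviced (more hit-under-miss opportunities).
struct ReexecutionQueue {
    std::deque<uint64_t> q;
    std::size_t capacity = 32;                   // illustrative size
    bool full() const { return q.size() >= capacity; }
};

bool lsu_cycle_with_car(L1Cache& l1, std::deque<uint64_t>& pending,
                        ReexecutionQueue& rq) {
    // First try to replay a parked request; it may hit now or find a free MSHR.
    if (!rq.q.empty()) {
        uint64_t addr = rq.q.front();
        if (l1.hit(addr))    { rq.q.pop_front(); return true; }
        if (!l1.mshr_full()) { ++l1.mshrs_in_use; rq.q.pop_front(); return true; }
    }
    if (pending.empty()) return false;
    uint64_t addr = pending.front();
    if (l1.hit(addr))    { pending.pop_front(); return true; }
    if (!l1.mshr_full()) { ++l1.mshrs_in_use; pending.pop_front(); return true; }
    if (!rq.full()) {                            // park the miss instead of stalling
        rq.q.push_back(addr);
        pending.pop_front();
        return true;                             // LSU stays unblocked
    }
    return false;                                // only stall when the queue is full
}
```

Compared with the baseline lsu_cycle_baseline above, the only change is that a miss that cannot allocate an MSHR is parked rather than blocking the pipeline.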

Experimental methodology
GPGPU-Sim 3.2.2 – GTX 480 architecture
SMs – 15, with 32 PEs/SM
Schedulers – LRR, GTO, OWL, and CCWS
L1 cache – 32 kB, 64 sets, 4-way, 64 MSHRs
L2 cache – 768 kB, 8-way, 6 partitions, 200 core-cycle latency
DRAM – 32 requests/partition, 440 core-cycle latency

Performance of compute intensive kernels
The performance of compute-intensive kernels is insensitive to the scheduling policy.

Performance of memory intensive kernels
[Chart: speedup of memory-intensive kernels, split into bandwidth-intensive and cache-sensitive categories; outlier bars labeled 3.0, 4.8, and 4.24]
Kernels are grouped into bandwidth-intensive and cache-sensitive categories, and Mascar is compared against the GTO, OWL, and CCWS schedulers. Average speedups ("–" marks values not recoverable from the transcript):

Scheduler              GTO    OWL    CCWS    Mascar
Bandwidth Intensive     4%     –      –       17%
Cache Sensitive        24%     4%     55%     56%
Overall                13%     4%     24%     34%

Conclusion
During memory saturation:
- Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (Mas) allows one warp to issue all of its requests and begin computation early.
- Data present in the cache cannot be reused because the data cache cannot accept any request: Cache Access Re-execution (Car) exploits more hit-under-miss opportunities through a re-execution queue.
Overall: 34% speedup and 12% energy savings.

Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Questions?