Supporting x86-64 Address Translation for 100s of GPU Lanes
Jason Power, Mark D. Hill, David A. Wood
Based on the HPCA 2014 paper
UW-Madison Computer Sciences, 2/19/2014


Summary
CPUs & GPUs: physically integrated, logically separate
Near future: cache coherence, shared virtual address space
Proof-of-concept GPU MMU design
– Per-CU TLBs, highly-threaded PTW, page walk cache
– Full x86-64 support
– Modest performance decrease (2% vs. ideal MMU)
Design alternatives not chosen

Motivation
Closer physical integration of GPGPUs
Programming model still decoupled
– Separate address spaces
– Unified virtual addressing (NVIDIA) simplifies code
– Want: shared virtual address space (HSA hUMA)

[Figure: a CPU address space containing pointer-based data structures: a tree index and a hash-table index.]

Separate Address Space
[Figure: CPU address space and GPU address space. One cannot simply copy data; it must be transformed to use new pointers.]
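To make "copy data, transform pointers" concrete, here is a minimal sketch of the separate-address-space model in today's CUDA; the kernel, array size, and launch shape are illustrative, not from the talk. Note that a pointer-based structure could not be moved this way without rewriting every embedded pointer.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void addOne(int *in, int *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] + 1;
    }

    int main() {
        const int n = 1024;
        int h_in[n], h_out[n];                       // host-side arrays
        for (int i = 0; i < n; i++) h_in[i] = i;

        int *d_in, *d_out;                           // separate GPU address space
        cudaMalloc(&d_in,  n * sizeof(int));
        cudaMalloc(&d_out, n * sizeof(int));
        cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

        addOne<<<n / 256, 256>>>(d_in, d_out);       // kernel sees only d_* pointers

        cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
        printf("h_out[0] = %d\n", h_out[0]);
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }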

Unified Virtual Addressing
[Figure: CPU and GPU address spaces mapped 1-to-1.]

Unified Virtual Addressing

void main() {
    int *h_in, *h_out;
    h_in = cudaHostMalloc(sizeof(int)*1024);   // allocate input array on host
    h_in = ...                                 // initialize host array
    h_out = cudaHostMalloc(sizeof(int)*1024);  // allocate output array on host

    Kernel<<<...>>>(h_in, h_out);
    ...
    // continue host computation with result in h_out

    cudaHostFree(h_in);                        // free memory on host
    cudaHostFree(h_out);
}

Shared Virtual Address Space
[Figure: CPU and GPU address spaces mapped 1-to-1; the GPU shares the CPU's virtual address space.]

Shared Virtual Address Space
No caveats
Programs "just work"
Same as the multicore model

Shared Virtual Address Space
Simplifies code
Enables rich pointer-based data structures
– Trees, linked lists, etc. (see the sketch below)
Enables composability
Need: an MMU (memory management unit) for the GPU
– Low overhead
– Support for CPU page tables (x86-64), 4 KB pages
– Page faults, TLB flushes, TLB shootdown, etc.
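As a concrete illustration of "rich pointer-based data structures": a minimal sketch that builds a linked list on the host and walks it unchanged on the GPU. It uses CUDA managed memory as a stand-in for a true shared virtual address space; the node layout and list length are illustrative.

    #include <cuda_runtime.h>
    #include <cstdio>

    struct Node { int value; Node *next; };

    __global__ void sumList(const Node *head, int *sum) {
        int s = 0;
        for (const Node *n = head; n != nullptr; n = n->next)
            s += n->value;        // pointer values identical to the CPU's
        *sum = s;
    }

    int main() {
        Node *head = nullptr;
        for (int i = 1; i <= 16; i++) {        // build the list on the host
            Node *node;
            cudaMallocManaged(&node, sizeof(Node));
            node->value = i;
            node->next  = head;
            head = node;
        }
        int *sum;
        cudaMallocManaged(&sum, sizeof(int));

        sumList<<<1, 1>>>(head, sum);          // no copies, no pointer rewriting
        cudaDeviceSynchronize();
        printf("sum = %d\n", *sum);            // 1 + 2 + ... + 16 = 136
        return 0;
    }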

Outline
Motivation
Data-driven GPU MMU design
– D1: Post-coalescer MMU
– D2: + Highly-threaded page table walker
– D3: + Shared page walk cache
Alternative designs
Conclusions

System Overview
[Figure: overview of the heterogeneous CPU-GPU system.]

GPU Overview
[Figure: GPU organization.]

Methodology
Target system
– Heterogeneous CPU-GPU system: 16 CUs, 32 lanes each
– Linux OS
Simulation
– gem5-gpu in full-system mode (gem5-gpu.cs.wisc.edu)
Ideal MMU
– Infinite caches
– Minimal latency

Workloads
Rodinia benchmarks
– backprop
– bfs: breadth-first search
– gaussian
– hotspot
– lud: LU decomposition
– nn: nearest neighbor
– nw: Needleman-Wunsch
– pathfinder
– srad: anisotropic diffusion
Database sort
– 10 byte keys, 90 byte payload

GPU MMU Design 0
[Figure: Design 0, per-lane MMUs.]

GPU MMU Design 1
Per-CU MMUs: reducing the translation request rate

[Figure sequence: memory operations issued by the lanes (1x) are filtered first by scratchpad memory (0.45x remain) and then by the coalescer (0.06x remain to be translated).]

Reducing the Translation Request Rate
Shared memory and the coalescer effectively filter global memory accesses (see the sketch below)
Average of 39 TLB accesses per 1000 cycles for a 32-lane CU
[Figure: breakdown of memory operations.]
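A toy illustration of that filtering; the line size, stride, and base address are illustrative. Thirty-two unit-stride lane accesses collapse into a single post-coalescer request, and only post-coalescer requests reach the TLB.

    #include <cstdint>
    #include <set>
    #include <cstdio>

    int main() {
        const uint64_t kLineBytes = 128;
        std::set<uint64_t> lines;            // unique cache lines this warp touches
        for (int lane = 0; lane < 32; lane++) {
            uint64_t addr = 0x7f0000000000ULL + lane * 4;  // unit-stride int access
            lines.insert(addr / kLineBytes);
        }
        // 32 lane accesses coalesce to very few line requests (here: 1),
        // and only those post-coalescer requests need address translation.
        printf("%zu memory requests reach translation\n", lines.size());
        return 0;
    }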

GPU MMU Design 1
[Figure: Design 1, post-coalescer per-CU MMUs.]

Performance
[Figure: performance of Design 1.]

GPU MMU Design 2
Highly-threaded page table walker: increasing TLB miss bandwidth

Outstanding TLB Misses
[Figure: lanes, scratchpad memory, and coalescer generating multiple outstanding TLB misses.]

Multiple Outstanding Page Walks
Many workloads are bursty
With a blocking page walker, miss latency skyrockets due to queueing delays
[Figure: concurrent page walks per CU when non-zero (log scale).]

GPU MMU Design 2
[Figure: Design 2 adds a highly-threaded page table walker.]

Highly-threaded PTW
[Figure: the highly-threaded page table walker.]
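A minimal sketch of the idea behind a highly-threaded walker, assuming a standard 4-level x86-64 walk; the structure, names, and slot count are illustrative, not the paper's hardware. Each walk owns an independent state machine, so bursty misses proceed in parallel instead of queueing behind one blocking walk.

    #include <cstdint>
    #include <vector>
    #include <cstdio>

    struct WalkState {
        bool     busy;
        uint64_t vaddr;
        int      level;    // current level of the 4-level x86-64 walk (4..1)
        uint64_t ptePtr;   // physical address of the next PTE to fetch
    };

    class ThreadedWalker {
        std::vector<WalkState> walks;   // one slot per concurrent walk
    public:
        explicit ThreadedWalker(int n) : walks(n, WalkState{false, 0, 0, 0}) {}

        // Accept a TLB miss if any slot is free; returning false applies
        // back-pressure instead of blocking the whole MMU.
        bool startWalk(uint64_t vaddr, uint64_t cr3) {
            for (auto &w : walks) {
                if (!w.busy) {
                    w.busy   = true;
                    w.vaddr  = vaddr;
                    w.level  = 4;    // begin at the PML4
                    w.ptePtr = cr3 + (((vaddr >> 39) & 0x1ff) * 8);
                    return true;
                }
            }
            return false;
        }

        // Called when memory returns a PTE: descend one level; at the
        // leaf, the walk would fill the TLB and free its slot.
        void pteReturned(WalkState &w, uint64_t pte) {
            if (--w.level == 0) { w.busy = false; return; }   // translation done
            uint64_t idx = (w.vaddr >> (12 + 9 * (w.level - 1))) & 0x1ff;
            w.ptePtr = (pte & ~0xfffULL) + idx * 8;           // next table + index
        }
    };

    int main() {
        ThreadedWalker ptw(32);                      // 32 concurrent walks
        int accepted = 0;
        for (int i = 0; i < 40; i++)                 // a bursty group of misses
            accepted += ptw.startWalk(0x1000000ULL * i, /*cr3=*/0x2000);
        printf("%d of 40 misses walked concurrently\n", accepted);
        return 0;
    }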

Performance
[Figure: performance of Design 2.]

GPU MMU Design 3
Page walk cache: overcoming the high TLB miss rate

High L1 TLB Miss Rate
Average miss rate: 29%!
[Figure: per-access miss rate with a 128-entry L1 TLB.]

High TLB Miss Rate
Requests from the coalescer are spread out
– Multiple warps issuing unique streams
– Mostly 64- and 128-byte requests
Often streaming access patterns (illustrated below)
– Each entry is not reused many times
– Many demand misses
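A toy quantification of the streaming claim; the request size, page size, and footprint are illustrative. A streaming pattern touches each 4 KB page about 32 times (4096 / 128) and then never again, so every new page is a demand miss and entries age out of a small TLB before they can be reused.

    #include <cstdint>
    #include <unordered_map>
    #include <cstdio>

    int main() {
        // One streaming warp: 128-byte requests marching through memory.
        std::unordered_map<uint64_t, long> useCount;   // per-page entry reuse
        long misses = 0, accesses = 0;
        for (uint64_t addr = 0; addr < (64ull << 20); addr += 128) {
            uint64_t vpn = addr >> 12;                 // 4 KB pages
            if (useCount.find(vpn) == useCount.end()) misses++;  // demand miss
            useCount[vpn]++;
            accesses++;
        }
        printf("accesses=%ld, demand misses=%ld, uses per entry=%.0f\n",
               accesses, misses, (double)accesses / useCount.size());
        return 0;
    }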

GPU MMU Design 3
[Figure: Design 3 adds a page walk cache shared by the 16 CUs.]
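A minimal sketch of what a page walk cache buys, with illustrative structure and naming (not the paper's AMD-style design): caching upper-level page table entries lets a walk resume near the leaf instead of re-fetching every level from the root (CR3).

    #include <cstdint>
    #include <unordered_map>
    #include <cstdio>

    // Key: the virtual-address bits an entry at a given level covers
    // (a level-2 entry maps 2 MB, level 3 maps 1 GB, level 4 maps 512 GB),
    // combined with the level itself.
    class PageWalkCache {
        std::unordered_map<uint64_t, uint64_t> entries;  // key -> cached PTE
        static uint64_t key(int level, uint64_t vaddr) {
            return ((vaddr >> (12 + 9 * (level - 1))) << 3) | (uint64_t)level;
        }
    public:
        // Return the deepest cached level and its PTE so the walker can
        // start there, skipping memory accesses for the upper levels.
        bool lookup(uint64_t vaddr, int &level, uint64_t &pte) {
            for (int l = 2; l <= 4; l++) {               // deepest level first
                auto it = entries.find(key(l, vaddr));
                if (it != entries.end()) { level = l; pte = it->second; return true; }
            }
            return false;
        }
        void fill(int level, uint64_t vaddr, uint64_t pte) {
            entries[key(level, vaddr)] = pte;  // real hardware bounds this (8 KB here)
        }
    };

    int main() {
        PageWalkCache pwc;
        pwc.fill(2, 0x7f0000200000ULL, /*pte=*/0xabc000);  // cache a PD entry
        int level; uint64_t pte;
        // A nearby address in the same 2 MB region skips the top three
        // levels; only the leaf PTE fetch remains.
        if (pwc.lookup(0x7f0000201000ULL, level, pte))
            printf("walk can resume at level %d\n", level);
        return 0;
    }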

Performance
[Figure: performance of Design 3.]
Worst case: 12% slowdown
Average: less than 2% slowdown

Correctness Issues
Page faults
– Stall the GPU as on a TLB miss
– Raise an interrupt on the CPU core
– Allow the OS to handle the fault in the normal way
– One change to CPU microcode, no OS changes
TLB flush and shootdown
– Leverage the CPU core, again
– Flush GPU TLBs whenever the CPU core's TLB is flushed

Page-fault Overview
1. Write faulting address to CR2
2. Interrupt the CPU
3. OS handles the page fault
4. iret instruction executed
5. GPU notified of completed page fault
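The same five steps as a minimal host-side state-machine sketch; the names and structure are illustrative, not the paper's microcode. Only CR2 and iret come from x86-64.

    #include <cstdint>
    #include <cstdio>

    // States mirror the five steps above.
    enum class FaultState { Idle, Raised, OsHandling, Done };

    struct GpuFaultPath {
        FaultState state = FaultState::Idle;
        uint64_t   cr2   = 0;

        void raise(uint64_t faultAddr) {
            cr2 = faultAddr;               // 1. write faulting address to CR2
            state = FaultState::Raised;    // 2. interrupt the CPU core
        }
        void osEntered() { state = FaultState::OsHandling; }   // 3. OS handles fault
        void iretSeen()  { state = FaultState::Done; }         // 4. iret executed
        bool canReplay() { return state == FaultState::Done; } // 5. GPU notified
    };

    int main() {
        GpuFaultPath f;
        f.raise(0x7fffdeadb000ULL);   // GPU stalls the faulting warp, as on a TLB miss
        f.osEntered();
        f.iretSeen();
        printf(f.canReplay() ? "fault serviced; replay memory access\n"
                             : "fault pending\n");
        return 0;
    }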

Page Faults
[Table: number of page faults and average page-fault latency for lud, pathfinder, cell, hotspot, and backprop.]
Minor page faults are handled correctly by Linux
Average fault latency of about 3.5 µsec: too fast to justify a GPU context switch

Summary
D3: proof of concept, a non-exotic design

Design     | Per-CU L1 TLB entries | Highly-threaded PTW | Page walk cache | Shared L2 TLB entries
Ideal MMU  | Infinite              | --                  | --              | None
Design 0   | N/A: per-lane MMUs    | --                  | --              | None
Design 1   | 128                   | Per-CU walkers      | --              | None
Design 2   | 128                   | Yes (32-way)        | --              | None
Design 3   | 64                    | Yes (32-way)        | 8 KB            | None

Low overhead. Fully compatible. Correct execution.

Outline
Motivation
Data-driven GPU MMU design
Alternative designs
– Shared L2 TLB
– TLB prefetcher
– Large pages
Conclusions

Alternative Designs
Shared L2 TLB
– Shared between all CUs
– Captures address translation sharing
– Poor performance: increased TLB-miss penalty
Shared L2 TLB and PWC
– No performance difference
– Trade-offs in the design space

Alternative Design Summary
Each design has 16 KB of extra SRAM across all 16 CUs
A PWC is needed for some workloads
Trade-offs in energy and area

Design          | Per-CU L1 TLB entries | Page walk cache | Shared L2 TLB entries | Performance
Design 3        | 64                    | 8 KB            | None                  | 0%
Shared L2       | 64                    | None            | ...                   | ...%
Shared L2 & PWC | 32                    | 8 KB            | 512                   | 0%

GPU MMU Design Tradeoffs
Shared L2 & PWC provides lower area but higher energy
Proof-of-concept Design 3 provides low area and energy
* Data produced using McPAT and CACTI

Alternative Designs
TLB prefetching
– A simple one-ahead prefetcher does not affect performance
Large pages
– Effective at reducing TLB misses
– Requiring them places a burden on the programmer
– Must maintain compatibility (4 KB pages)

Related Work
"Architectural Support for Address Translation on GPUs." Bharath Pichai, Lisa Hsu, Abhishek Bhattacharjee (ASPLOS 2014).
– Concurrent with our work
– High level: modest changes for high performance
– Pichai et al. investigate warp scheduling
– We use a highly-threaded PTW

Conclusions
Shared virtual memory is important
Non-exotic MMU design
– Post-coalescer L1 TLBs
– Highly-threaded page table walker
– Page walk cache
Full compatibility with minimal overhead

Questions?

Simulation Details

Workload memory footprints:
backprop    74 MB       nn          500 KB
bfs         37 MB       nw          128 MB
gaussian    340 KB      pathfinder  38 MB
hotspot     12 MB       srad        96 MB
lud         4 MB        sort        208 MB

CPU                 1 core, 2 GHz, 64 KB L1, 2 MB L2
GPU                 16 CUs, 1.4 GHz, 32 lanes
L1 cache (per-CU)   64 KB, 15 ns latency
Scratchpad memory   16 KB, 15 ns latency
GPU L2 cache        1 MB, 16-way set associative, 130 ns latency
DRAM                2 GB, DDR3 timing, 8 channels, 667 MHz

Shared L2 TLB
[Figure: translation sharing pattern across CUs.]
Many benchmarks share translations between CUs

Performance of Alternative Designs
D3 uses an AMD-like, physically-addressed page walk cache
An ideal PWC performs slightly better (1% on average, up to 10%)

Large Pages Are Effective
2 MB pages significantly reduce the miss rate
But can they be used?
– Slow adoption on CPUs
– Special allocation API (see the sketch below)
[Figure: relative miss rate with 2 MB pages.]
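An example of the "special allocation API" burden on Linux today: explicitly requesting a 2 MB page with mmap and MAP_HUGETLB. This is a hedged sketch of the general mechanism, not how the paper allocates memory, and it fails unless hugepages have been reserved on the system.

    #include <sys/mman.h>
    #include <cstdio>

    int main() {
        const size_t len = 2 * 1024 * 1024;   // one 2 MB huge page
        void *buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {              // fails unless hugepages are reserved
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }
        // ... use buf; one TLB entry now covers the whole 2 MB region ...
        munmap(buf, len);
        return 0;
    }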

Shared L2 TLB
[Figure: shared L2 TLB organization.]

Shared L2 TLB and PWC
[Figure: design with both a shared L2 TLB and a page walk cache.]

Shared L2 TLB
[Figure: performance relative to D3.]
A shared L2 TLB can work as well as a page walk cache
Some workloads need low-latency TLB misses

Shared L2 TLB and PWC
Benefits of both a page walk cache and a shared L2 TLB
Can reduce the L1 TLB size without a performance penalty

Unified MMU Design
[Figure: memory requests enter with virtual addresses and leave with physical addresses; the MMU issues page walk requests and receives responses.]