Programmable Accelerators

Presentation transcript:

Programmable Accelerators
Jason Lowe-Power
powerjg@cs.wisc.edu | cs.wisc.edu/~powerjg

Increasing specialization: we need to program these accelerators. Challenges:
1. Consistent pointers
2. Data movement
3. Security
(and all of it fast)
This talk: GPGPUs.
*NVIDIA via anandtech.com: http://www.anandtech.com/show/4144/lg-optimus-2x-nvidia-tegra-2-review-the-first-dual-core-smartphone/3

Programming accelerators (baseline)

CPU-side code:

    int main() {
        int a[N], b[N], c[N];
        init(a, b, c);
        add(a, b, c);
        return 0;
    }

Accelerator-side code: none yet (a plain CPU program).

Programming accelerators (baseline)

CPU-side code:

    int main() {
        int a[N], b[N], c[N];
        init(a, b, c);
        add(a, b, c);
        return 0;
    }

Accelerator-side code (here still an ordinary CPU loop):

    void add(int *a, int *b, int *c) {
        for (int i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }

Programming accelerators (GPU)

CPU-side code:

    int main() {
        int a[N], b[N], c[N];
        init(a, b, c);
        add(a, b, c);
        return 0;
    }

Accelerator-side code:

    void add_gpu(int *a, int *b, int *c) {
        for (int i = get_global_id(0); i < N; i += get_global_size(0)) {
            c[i] = a[i] + b[i];
        }
    }

Programming accelerators (GPU)

CPU-side code (note how much of it is memory management, not computation):

    int main() {
        int a[N], b[N], c[N];
        int *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, N*sizeof(int));
        cudaMalloc(&d_b, N*sizeof(int));
        cudaMalloc(&d_c, N*sizeof(int));
        init(a, b, c);
        cudaMemcpy(d_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
        add_gpu(d_a, d_b, d_c);
        cudaMemcpy(c, d_c, N*sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
        return 0;
    }

Accelerator-side code:

    void add_gpu(int *a, int *b, int *c) {
        for (int i = get_global_id(0); i < N; i += get_global_size(0)) {
            c[i] = a[i] + b[i];
        }
    }
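For reference, here is a minimal compilable CUDA rendering of the same explicit-copy pattern. This is a sketch, not the talk's exact code: the slide uses OpenCL-style get_global_id, while CUDA spells the same grid-stride loop with blockIdx/blockDim/threadIdx, and the launch configuration <<<4, 256>>> and the init body are illustrative choices.

    #include <cuda_runtime.h>

    #define N 1024

    __global__ void add_gpu(int *a, int *b, int *c) {
        // Grid-stride loop: each thread handles one element every
        // gridDim.x * blockDim.x elements.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
             i += gridDim.x * blockDim.x) {
            c[i] = a[i] + b[i];
        }
    }

    static void init(int *a, int *b, int *c) {
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; c[i] = 0; }
    }

    int main() {
        int a[N], b[N], c[N];
        int *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, N * sizeof(int));
        cudaMalloc(&d_b, N * sizeof(int));
        cudaMalloc(&d_c, N * sizeof(int));
        init(a, b, c);
        cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
        add_gpu<<<4, 256>>>(d_a, d_b, d_c);   // launch on device pointers
        cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }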

Programming accelerators (GOAL)

(Same code as the previous slide, with every line of memory-management boilerplate, the cudaMalloc, cudaMemcpy, and cudaFree calls, struck out: none of it expresses the computation.)

Programming accelerators (GOAL)

CPU-side code:

    int main() {
        int a[N], b[N], c[N];
        init(a, b, c);
        add_gpu(a, b, c);
        return 0;
    }

Accelerator-side code:

    void add_gpu(int *a, int *b, int *c) {
        for (int i = get_global_id(0); i < N; i += get_global_size(0)) {
            c[i] = a[i] + b[i];
        }
    }
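As an aside not from the talk: CUDA's unified memory is one shipping approximation of this goal. cudaMallocManaged (a real CUDA API) returns a single pointer valid on both CPU and GPU, eliminating the explicit copies, though stack arrays remain off-limits, so it falls short of the full shared-virtual-memory goal above. A minimal sketch, reusing init and add_gpu from the earlier CUDA sketch:

    // Sketch using CUDA unified memory; the launch configuration is an
    // illustrative choice.
    int *a, *b, *c;
    cudaMallocManaged(&a, N * sizeof(int));
    cudaMallocManaged(&b, N * sizeof(int));
    cudaMallocManaged(&c, N * sizeof(int));
    init(a, b, c);                   // CPU writes through the same pointers
    add_gpu<<<4, 256>>>(a, b, c);    // GPU reads/writes the same pointers
    cudaDeviceSynchronize();         // wait before the CPU touches results
    cudaFree(a); cudaFree(b); cudaFree(c);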

Key challenges
[Diagram: the CPU issues a load of a[i] to virtual address 0x1000000; its MMU translates the pointer to physical address 0x5000, and the access proceeds through the cache to memory.]

Key challenges
[Diagram: the GPU now also loads a[i] at virtual address 0x1000000, raising two questions. Consistent pointers: does the GPU have an MMU that can translate the pointer to physical address 0x5000? Data movement: do the CPU and GPU caches give a coherent view of the data in memory?]

Consistent pointers Data movement

Outline:
Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes (high-bandwidth address translation) [HPCA 2014]
Data movement: Heterogeneous System Coherence (high-bandwidth cache coherence) [MICRO 2013]
Note: the focus is on proof of concept, not the highest possible performance.

Heterogeneous System
[Diagram: four CPU cores, each with private L1 and L2 caches, and eight GPU cores, each with a private L1, all connected to a shared directory and a shared memory controller.]
Note: the default here is the old, non-integrated programming model; however, tight physical integration gives us the chance for tight logical integration, which is what we're looking at here.

Why not CPU solutions? It's all about bandwidth: translating addresses for 100s of GPU lanes and sustaining roughly 500 GB/s at the directory (many accesses per cycle).
[Chart: theoretical memory bandwidth (GB/s) of recent GPUs. *NVIDIA via anandtech.com]
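Some illustrative arithmetic (my numbers, not the talk's: assuming 64-byte cache blocks and a 2 GHz on-chip clock): 500 GB/s divided by 64 B per block is roughly 7.8 billion block accesses per second, i.e. about 4 directory accesses every cycle, well beyond what a single-ported CPU-style directory or TLB is built to handle.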

Outline:
Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes (high-bandwidth address translation) [HPCA 2014]
Data movement: Heterogeneous System Coherence (high-bandwidth cache coherence) [MICRO 2013]
Note: the focus is on proof of concept, not the highest possible performance.

Why virtual addresses? Virtual memory

Why virtual addresses?
[Diagram: with a separate GPU address space, copying a pointer-based data structure to the GPU requires transforming every embedded pointer to the new address space. With shared virtual memory, you simply copy the data; no pointer transformation is needed.]
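A small C sketch of what "transform to new pointers" means in practice (illustrative code, not from the talk; alloc_in_gpu_space is a hypothetical allocator): copying a linked structure into a separate accelerator address space forces a deep copy that rewrites every pointer, whereas with shared virtual memory the original pointers remain valid.

    #include <stddef.h>

    struct node {
        int value;
        struct node *next;   // embedded pointer: only valid in one address space
    };

    void *alloc_in_gpu_space(size_t bytes);   // hypothetical allocator

    // Separate address spaces: a deep copy must allocate device-side nodes
    // and rewrite ("swizzle") each next pointer.
    struct node *copy_to_gpu_space(const struct node *head) {
        struct node *gpu_head = NULL, **link = &gpu_head;
        for (const struct node *n = head; n != NULL; n = n->next) {
            struct node *g = alloc_in_gpu_space(sizeof *g);
            g->value = n->value;
            g->next = NULL;
            *link = g;           // fix up the previous node's pointer
            link = &g->next;
        }
        return gpu_head;
    }
    // Shared virtual memory: none of this is needed; the GPU dereferences
    // head->next directly, because the pointers mean the same thing everywhere.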

Bandwidth problem
[Diagram: on the CPU, a single TLB translates the core's virtual memory requests into physical memory requests.]

Bandwidth problem
[Diagram: GPU processing elements (one GPU core): many lanes, each potentially issuing its own virtual memory request every cycle.]

Solution: Filtering
[Diagram: GPU processing elements (one GPU core) with many lanes.]

Solution: Filtering
[Diagram: one GPU core with 16 lanes. Routing accesses through the shared memory (scratchpad) filters the translation stream from 1x at the lanes down to 0.45x.]

Solution: Filtering
[Diagram: adding the coalescer after the shared memory (scratchpad) filters the stream further: 1x at the lanes, 0.45x after shared memory, and only 0.06x at the post-coalescer TLB.]
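A sketch of why coalescing filters translations so effectively (illustrative code, not the hardware): SIMT lanes usually touch adjacent words, so the 32 addresses of a warp commonly fall on one or two pages, and a post-coalescer TLB needs one lookup per unique page rather than one per lane.

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SHIFT 12   // 4 KB pages

    // Count unique virtual page numbers among one warp's lane addresses
    // (lanes <= 64). The post-coalescer TLB performs one lookup per
    // unique page, not one per lane.
    size_t unique_pages(const uint64_t *lane_addr, size_t lanes) {
        uint64_t seen[64];
        size_t count = 0;
        for (size_t i = 0; i < lanes; i++) {
            uint64_t vpn = lane_addr[i] >> PAGE_SHIFT;
            int dup = 0;
            for (size_t j = 0; j < count && !dup; j++)
                dup = (seen[j] == vpn);
            if (!dup)
                seen[count++] = vpn;
        }
        return count;   // often 1 or 2 for a unit-stride warp of 32 lanes
    }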

Design 1
[Diagram: in each GPU core, the lanes feed the shared memory and coalescer, followed by a post-coalescer L1 TLB; TLB misses go to a shared page walk unit containing a single page table walker.]

Poor performance: average 3x slowdown.

Bottleneck 1: Bursty TLB misses. An average of 60 outstanding requests, with a maximum of 140, causes huge queuing delays. Solution: a highly-threaded page table walker.
Note: the performance degradation comes from two things; the first is many outstanding requests at the page-table-walker hardware, and the solution is to make it multithreaded.

Bottleneck 2: High miss rate. Even a large 128-entry TLB doesn't help, because there are many address streams; misses therefore need low latency. Solution: a shared page walk cache.
Note: the second problem is many misses no matter what you do; the solution is to reduce miss latency with a page walk cache (PWC).

GPU MMU Design
[Diagram: per-core lanes, shared memory, coalescer, and post-coalescer L1 TLB, backed by a shared page walk unit containing a highly-threaded page table walker and a page walk cache.]
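A sketch of the work each walker thread performs (illustrative C, assuming the standard x86-64 4-level radix page table; read_phys and the PWC helpers are hypothetical stand-ins, not the paper's hardware): the page walk cache memoizes upper-level entries so most walks skip straight to the lower levels, and making the walker highly threaded lets many such walks be outstanding at once.

    #include <stdint.h>

    #define PTE_PRESENT 0x1ULL
    #define ADDR_MASK   0x000FFFFFFFFFF000ULL
    #define PAGE_FAULT  (~0ULL)

    // Hypothetical helpers, stubbed so the sketch compiles; real hardware
    // would read PTEs from DRAM and keep a small page-walk cache (PWC).
    static uint64_t read_phys(uint64_t paddr) { (void)paddr; return PTE_PRESENT; }
    static int pwc_lookup(int level, uint64_t va, uint64_t *base)
        { (void)level; (void)va; (void)base; return 0; }
    static void pwc_insert(int level, uint64_t va, uint64_t base)
        { (void)level; (void)va; (void)base; }

    // One x86-64 translation: up to four dependent memory accesses
    // (PML4 -> PDPT -> PD -> PT).
    uint64_t walk(uint64_t cr3, uint64_t va) {
        uint64_t base = cr3 & ADDR_MASK;
        int level = 4;
        // Page walk cache: resume at the deepest cached level, if any.
        for (int l = 1; l <= 3; l++)
            if (pwc_lookup(l, va, &base)) { level = l; break; }
        for (; level >= 1; level--) {
            uint64_t idx = (va >> (12 + 9 * (level - 1))) & 0x1FF;
            uint64_t pte = read_phys(base + idx * 8);  // dependent access
            if (!(pte & PTE_PRESENT))
                return PAGE_FAULT;
            base = pte & ADDR_MASK;
            if (level > 1)
                pwc_insert(level - 1, va, base);       // memoize next level
        }
        return base | (va & 0xFFF);                    // physical address
    }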

Performance: Low overhead. Average: less than 2% slowdown. Worst case: 12% slowdown.

Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes (summary)
Shared virtual memory is important, and a non-exotic MMU design supports it:
- Post-coalescer L1 TLBs
- Highly-threaded page table walker
- Page walk cache
Full compatibility with minimal overhead, and still room to optimize.
Note: the focus is on proof of concept, not the highest possible performance.

Outline:
Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes (high-bandwidth address translation) [HPCA 2014]
Data movement: Heterogeneous System Coherence (high-bandwidth cache coherence) [MICRO 2013]
Note: the focus is on proof of concept, not the highest possible performance.

Legacy Interface
[Diagram: the CPU writes memory, the CPU initiates a DMA transfer (invalidating cached copies), then the GPU accesses memory directly: high bandwidth, with no directory access on the GPU's path.]

CC (cache-coherent) Interface
[Diagram: the CPU writes memory, then the GPU accesses it through the coherence protocol. Bottleneck: the directory, in both (1) access rate and (2) buffering.]
Note: the key benefit is using on-chip communication when possible; with explicit copies, copy time can be up to 98% of the runtime for a simple scan operation.

Directory Bottleneck 1: Access rate. The directory sees many requests per cycle, and a multi-ported directory is difficult to design.

Directory Bottleneck 2: Buffering. The directory must track many outstanding requests, causing huge queuing delays. Solution: reduce pressure on the directory.

HSC Design
Goal: direct access (bandwidth) + cache coherence. Add a region directory and region buffers, decoupling permission from access: only permission traffic reaches the directory.
[Diagram: CPU and GPU cores as before, now with a region buffer on each side and a region directory in front of memory.]
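A sketch of the decoupling idea (illustrative code, simplified from the paper's hardware; the region size and structure names are assumptions): the region buffer caches permission for a whole region, so after one directory interaction, every block access within that region can go straight to the memory controller.

    #include <stdint.h>
    #include <stdbool.h>

    #define REGION_SHIFT 10        // assumed 1 KB regions (16 x 64 B blocks)
    #define BUF_ENTRIES  256       // assumed region-buffer capacity

    typedef struct {
        bool     valid;
        uint64_t region_tag;       // address >> REGION_SHIFT
        bool     can_read, can_write;
    } region_entry;

    static region_entry region_buffer[BUF_ENTRIES];

    // Returns true if the access may bypass the directory and go directly
    // to the memory controller; false means one permission request must be
    // sent to the region directory (after which the entry is cached here
    // and later blocks in the same region hit).
    bool region_buffer_hit(uint64_t addr, bool is_write) {
        uint64_t tag = addr >> REGION_SHIFT;
        region_entry *e = &region_buffer[tag % BUF_ENTRIES];
        if (e->valid && e->region_tag == tag)
            return is_write ? e->can_write : e->can_read;
        return false;              // miss: fetch region permission, then retry
    }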

HSC: Performance Improvement

Data movement: Heterogeneous System Coherence (summary)
We want cache coherence without sacrificing bandwidth, but current coherence implementations have two major bottlenecks:
1. High bandwidth is difficult to support at the directory
2. Extreme resource (buffering) requirements
Heterogeneous System Coherence leverages spatial locality to reduce bandwidth and resource requirements by 95%.

Increasing specialization: we need to program these accelerators. Challenges:
1. Consistent pointers
2. Data movement
3. Security
(and all of it fast)
This talk: GPGPUs.
*NVIDIA via anandtech.com: http://www.anandtech.com/show/4144/lg-optimus-2x-nvidia-tegra-2-review-the-first-dual-core-smartphone/3

Security & tightly-integrated accelerators
What if accelerators come from 3rd parties? Then they are untrusted. Today there are two options: route all accesses via the IOMMU (safe, but low performance) or bypass the IOMMU (high performance, but unsafe).
[Diagram: a trusted CPU core (L1, L2) alongside untrusted accelerators (each with a TLB and L1); the IOMMU stands between the accelerators and memory, which holds protected OS data and process data.]

Border Control: Sandboxing Accelerators [MICRO 2015]
Solution: Border Control. Key idea: decouple translation from safety, getting safety + performance.
[Diagram: as before, but a Border Control unit sits between each untrusted accelerator and memory, alongside the IOMMU.]
Note: this is better because you only need 2 bits of data per page compared to 64 bits.
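A sketch of that key idea (illustrative code, a simplified model rather than the paper's design; sizes and names are assumptions): the accelerator may translate addresses itself and cache physical addresses, but every request it emits is checked against a compact per-physical-page permission table, 2 bits (read, write) per page, before it reaches memory.

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 12
    #define NUM_PAGES  (1u << 20)      // assumed 4 GB of physical memory

    // 2 bits per physical page: bit 0 = readable, bit 1 = writable.
    // Contrast with re-translating a full 64-bit address at the boundary.
    static uint8_t perm_table[(NUM_PAGES * 2 + 7) / 8];

    static inline uint8_t get_perm(uint64_t ppn) {
        uint64_t bit = ppn * 2;
        return (perm_table[bit / 8] >> (bit % 8)) & 0x3;
    }

    // Border Control check on every accelerator request: allow the access
    // only if the OS granted the needed right on that physical page.
    bool border_check(uint64_t phys_addr, bool is_write) {
        uint8_t p = get_perm(phys_addr >> PAGE_SHIFT);
        return is_write ? ((p & 0x2) != 0) : ((p & 0x1) != 0);
    }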

Conclusions
Goal: enable programmers to use the whole chip.
Challenges and solutions:
1. Consistent addresses: GPU MMU design
2. Data movement: Heterogeneous System Coherence
3. Security: Border Control

Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]
Jason Power, Mark D. Hill, David A. Wood

Data movement: Heterogeneous System Coherence [MICRO 2013]
Jason Power, Arkaprava Basu*, Junli Gu*, Sooraj Puthoor*, Bradford M. Beckmann*, Mark D. Hill, Steven K. Reinhardt*, David A. Wood

Security: Border Control: Sandboxing Accelerators [MICRO 2015]
Lena E. Olson, Jason Power, Mark D. Hill, David A. Wood

Contact: Jason Lowe-Power, powerjg@cs.wisc.edu, cs.wisc.edu/~powerjg
I'm on the job market this year! Graduating in spring.

Other work

Analytic databases + tightly-integrated GPUs:
- When to Use 3D Die-Stacked Memory for Bandwidth-Constrained Big-Data Workloads [BPOE 2016]
  Jason Lowe-Power, Mark D. Hill, David A. Wood
- Towards GPUs Being Mainstream in Analytic Processing [DaMoN 2015]
- Implications of Emerging 3D GPU Architecture on the Scan Primitive [SIGMOD Rec. 2015]
  Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, David A. Wood

Simulation infrastructure:
- gem5-gpu: A Heterogeneous CPU-GPU Simulator [CAL 2014]
  Jason Power, Joel Hestness, Marc S. Orr, Mark D. Hill, David A. Wood

Comparison to CAPI/OpenCAPI
Both CAPI and my work provide the same virtual address space, cache coherence, and system safety from the accelerator. Unlike CAPI, my work also assumes on-chip accelerators, allows accelerators to have physical caches, and allows pre-translation, which allows for high-performance accelerator optimizations.

Detailed HSC Performance

[Diagram: per-core lanes, shared memory, coalescer, and post-coalescer L1 TLB, backed by the shared page walk unit with its highly-threaded page table walker and page walk cache.]

Other accelerators
The first step is CPU-GPU, but what about other accelerators? Many are emerging: ISPs, always-on sensors, neuromorphic hardware, database accelerators. These also need a coherent and consistent view of memory! The GPU is a stress test.

Highly-threaded PTW Design