Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems
Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, Stephen W. Keckler

Opportunity
[Slide figure: the main GPU's SMs (Streaming Multiprocessors) connect through a crossbar switch to 3D-stacked memories (memory stacks); each stack has vault controllers and an SM on its logic layer]
Processing data directly in 3D-stacked memories is a promising direction

The Problem
[Slide figure: the same GPU and memory-stack system]
However, it requires significant programmer effort

Key Challenge 1
Challenge 1: Which operations should be executed on the logic-layer SMs?
[Slide figure: the same system, with question marks over the logic-layer SMs]

Key Challenge 2
Challenge 2: How should data be mapped to different 3D memory stacks?
[Slide figure: the same system]

Our Approach: TOM
Component 1: A new programmer-transparent mechanism to identify and decide which code portions to offload (see the first sketch below)
The compiler identifies candidate code portions for offloading, based on their memory profiles.
The runtime system decides whether or not to offload each candidate code portion, based on runtime characteristics.
Component 2: A new, simple, programmer-transparent data mapping mechanism to maximize code/data co-location (see the second sketch below)
Key results: 30% average (76% maximum) performance improvement across the evaluated GPU workloads
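The two-stage offloading decision in Component 1 can be made concrete with a small sketch. The C++ fragment below is a minimal illustration, assuming a simple cost model in which a code block is worth offloading only if the off-chip memory traffic it would otherwise generate exceeds the cost of shipping its live-in and live-out registers to and from a memory-stack SM. All names and thresholds here (OffloadCandidate, shouldOffloadNow, the 0.9 and 0.5 cutoffs) are illustrative assumptions, not the paper's actual interface or values.

```cpp
#include <cstdint>

// Hypothetical per-candidate statistics gathered by the compiler;
// field names are illustrative, not the paper's data structures.
struct OffloadCandidate {
    uint64_t bytes_loaded;    // estimated off-chip read traffic of the block
    uint64_t bytes_stored;    // estimated off-chip write traffic of the block
    uint64_t live_in_bytes;   // registers shipped to the memory-stack SM
    uint64_t live_out_bytes;  // registers shipped back to the main GPU
};

// Compile-time filter (conceptual): a block is a candidate only if
// offloading it would reduce traffic on the off-chip memory link.
bool isOffloadCandidate(const OffloadCandidate& c) {
    uint64_t trafficIfLocal     = c.bytes_loaded + c.bytes_stored;
    uint64_t trafficIfOffloaded = c.live_in_bytes + c.live_out_bytes;
    return trafficIfOffloaded < trafficIfLocal;
}

// Runtime gate (conceptual): offload only when current conditions favor it,
// e.g. the stack SMs have spare capacity and the off-chip link is the
// bottleneck. The thresholds are placeholders for illustration.
bool shouldOffloadNow(const OffloadCandidate& c,
                      double stackSmOccupancy,        // 0.0 .. 1.0
                      double offChipLinkUtilization)  // 0.0 .. 1.0
{
    if (!isOffloadCandidate(c)) return false;
    if (stackSmOccupancy > 0.9) return false;  // stack SMs saturated
    return offChipLinkUtilization > 0.5;       // link pressure makes it pay off
}
```

The split mirrors the two stages above: the compile-time filter is a conservative static test, while the runtime gate uses information that only exists during execution, such as occupancy and link utilization.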

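Component 2 can likewise be sketched. A simple programmer-transparent mapping can select the memory stack from a small field of consecutive physical-address bits, with a runtime mechanism picking the bit position that best co-locates data with the offloaded code that accesses it. The C++ below is a conceptual sketch under those assumptions; kNumStacks, stackIndex, and bestBitPosition are hypothetical names, and the sampling-based search is an illustration rather than the paper's exact mechanism.

```cpp
#include <cstddef>
#include <cstdint>

// Assume a power-of-two number of memory stacks so the stack index
// can be extracted with a mask. The constant is illustrative.
constexpr unsigned kNumStacks = 4;

// The stack holding an address is chosen by a small field of
// consecutive physical-address bits starting at bitPos.
unsigned stackIndex(uint64_t physAddr, unsigned bitPos) {
    return static_cast<unsigned>((physAddr >> bitPos) & (kNumStacks - 1));
}

// Toy evaluation loop: given a sample of addresses touched by offloaded
// blocks and the stack each block would execute on, count how often a
// candidate bit position maps the data to the executing stack, and pick
// the position with the most co-location hits.
unsigned bestBitPosition(const uint64_t* addrs, const unsigned* execStack,
                         std::size_t n, const unsigned* candidates,
                         std::size_t m) {
    unsigned best = candidates[0];
    std::size_t bestHits = 0;
    for (std::size_t j = 0; j < m; ++j) {
        std::size_t hits = 0;
        for (std::size_t i = 0; i < n; ++i)
            if (stackIndex(addrs[i], candidates[j]) == execStack[i]) ++hits;
        if (hits > bestHits) { bestHits = hits; best = candidates[j]; }
    }
    return best;
}
```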
Transparent Offloading and Mapping (TOM)
Talk on Monday at 2:50 PM (Session 3B)
Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, Stephen W. Keckler