Amirhossein Mirhosseini

Presentation transcript:

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching
Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, Hamid Sarbazi-Azad, Mario Drumond, Babak Falsafi, Rachata Ausavarungnirun, Onur Mutlu
Presenter: Amirhossein Mirhosseini

Register file size limits GPU scalability
The register file already accounts for about 60% of on-chip storage, yet there is still demand for more registers to achieve maximum performance and concurrency. [Chart: register file capacity (KB) across GPU generations, annotated with 5.9x and 2.3x growth]
Unfortunately, all known mechanisms that expand register file capacity without large area/power overheads (emerging memory technologies, register file compression, register virtualization, etc.) significantly increase access latency.
Goal: Tolerate register file access latencies

Contributions (1): Compiler-Driven Register Prefetching
Break the control flow graph into "prefetch subgraphs"
Prefetch registers at the beginning of each subgraph
Use interval analysis to identify prefetch subgraphs
[Diagram: CFG blocks reading R1-R4, with prefetch operations P(R1,R2), P(R3), and P(R4) inserted at the entry of each subgraph]
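The compiler pass above can be illustrated with classic interval analysis: grow each single-entry region from a header block, absorbing blocks whose predecessors all lie inside the region, then emit one prefetch op for every register the region reads. This is a hypothetical sketch, not the paper's actual implementation; the CFG representation, function names, and the `P(...)` notation are illustrative only.

```python
def compute_prefetch_subgraphs(cfg, entry):
    """Partition a CFG into single-entry intervals ("prefetch subgraphs").

    cfg maps a block name to {"succ": [...], "regs": [...]}.
    Returns a list of (header, sorted member list) pairs.
    """
    preds = {n: [m for m in cfg if n in cfg[m]["succ"]] for n in cfg}
    assigned, intervals, worklist = set(), [], [entry]
    while worklist:
        h = worklist.pop(0)
        if h in assigned:
            continue
        members = {h}
        assigned.add(h)
        changed = True
        while changed:
            changed = False
            for n in cfg:
                # Absorb a block once ALL of its predecessors are inside;
                # blocks targeted by a back edge stay out and become headers.
                if n not in assigned and preds[n] and \
                        all(p in members for p in preds[n]):
                    members.add(n)
                    assigned.add(n)
                    changed = True
        intervals.append((h, sorted(members)))
        for n in sorted(members):
            for s in cfg[n]["succ"]:
                if s not in assigned:
                    worklist.append(s)
    return intervals

def insert_prefetches(cfg, intervals):
    """At each subgraph's header, emit P(...) for all registers it reads."""
    ops = {}
    for header, members in intervals:
        regs = sorted({r for n in members for r in cfg[n]["regs"]})
        ops[header] = f"P({','.join(regs)})"
    return ops

# Toy CFG mirroring the slide: a straight-line block, then two loops.
cfg = {
    "A": {"succ": ["B"], "regs": ["R1", "R2"]},
    "B": {"succ": ["B", "C"], "regs": ["R3"]},   # self-loop
    "C": {"succ": ["C"], "regs": ["R4"]},        # self-loop
}
# insert_prefetches(cfg, compute_prefetch_subgraphs(cfg, "A"))
# -> {"A": "P(R1,R2)", "B": "P(R3)", "C": "P(R4)"}
```

Each loop header starts a new subgraph because its back edge keeps it out of the preceding interval, so prefetches land exactly once per region entry rather than once per loop iteration.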

Contributions (2): Latency-Tolerant Register File (LTRF)
Two-level organization: large main register file + small register cache
Performs prefetch operations while executing other warps
Tolerates up to 6x higher register file access latencies
Paves the way for various area/power optimizations
An example LTRF deployment enables an 8x larger register file and 32% higher performance
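The scheduling idea behind the two-level organization can be sketched as a toy cycle simulation: a warp whose registers are not yet in the register cache issues a prefetch from the slow main register file and is descheduled; other ready warps execute in the meantime, hiding the main RF latency. Everything here (class names, the 6-cycle latency, the one-slot scheduler) is an illustrative assumption, not the paper's actual microarchitecture.

```python
MAIN_RF_LATENCY = 6  # assumed cycles to move a warp's registers into the cache

class Warp:
    def __init__(self, wid, regs, insts):
        self.wid = wid        # warp id
        self.regs = regs      # registers this warp's code reads
        self.insts = insts    # instructions left to execute
        self.ready_at = 0     # cycle when its in-flight prefetch completes

def schedule(warps, max_cycles=1000):
    """Greedy one-slot-per-cycle scheduler over a shared register cache."""
    cache, trace, cycle = set(), [], 0
    while any(w.insts for w in warps) and cycle < max_cycles:
        for w in warps:
            if not w.insts or cycle < w.ready_at:
                continue  # finished, or descheduled waiting on its prefetch
            missing = [r for r in w.regs if (w.wid, r) not in cache]
            if missing:
                # Registers absent from the cache: issue a prefetch and
                # deschedule this warp; the entries are marked now but the
                # warp may not run until the transfer finishes.
                w.ready_at = cycle + MAIN_RF_LATENCY
                cache.update((w.wid, r) for r in missing)
                trace.append((cycle, w.wid, "prefetch"))
            else:
                w.insts -= 1
                trace.append((cycle, w.wid, "execute"))
            break  # one issue slot per cycle
        cycle += 1
    return trace, cycle
```

With two warps of three instructions each, warp 1's prefetch is issued one cycle after warp 0's, so the two transfers overlap; with more warps than the main RF latency, execution would fully cover the prefetch cycles, which is the latency-tolerance argument the slide makes.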

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching
Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, Hamid Sarbazi-Azad, Mario Drumond, Babak Falsafi, Rachata Ausavarungnirun, Onur Mutlu
Please attend our talk at 2:00 in Virginia EF