Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons, Onur Mutlu

Presentation transcript:

The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality in GPUs
ISCA 2018
Nandita Vijaykumar, Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons, Onur Mutlu

Data locality is critical to GPU performance. [Figure: cores with L1 caches and a shared L2 illustrate cache locality; multiple memory partitions illustrate NUMA locality.]

Exploiting data locality in GPUs is a challenging and elusive feat…

…requiring a range of architectural techniques: cache management, CTA scheduling, data placement, data prefetching, …

Furthermore, a single technique is often insufficient, and the required set of techniques depends on the program: Program A may need data placement and data prefetching, while Program B needs cache management and CTA scheduling.

Challenging for the programmer/software: there is no easy access to many architectural techniques, and controlling them directly is tedious and unportable (e.g., "bypass cache line A", "schedule thread block 2 at SM 1", …).

Challenging for the architect: hardware misses key program semantics required for optimization. Where to place data? Which threads to schedule together? Which data to bypass?

The Locality Descriptor: a hardware-software abstraction to express and exploit data locality. Through a new software interface, the application expresses its Locality Descriptors, giving hardware access to key program semantics; hardware then connects those locality semantics to the underlying techniques: data placement, cache management, CTA scheduling, data prefetching, …

A Sneak Peek. A Locality Descriptor names an address range (the data structure), a locality type, and tile, locality, and priority semantics:
LocalityDescriptor ldesc(A, len, INTER-THREAD, tile, loc, priority);
Key performance results: leveraging cache locality improves performance by 26.6% on average (up to 46.6%); leveraging NUMA locality, by 53.7% (up to 2.8X).