Cache-Conscious Wavefront Scheduling: How should a highly multithreaded architecture, like a GPU, pick which threads to issue? Use feedback from the memory system.

Presentation transcript:

Cache-Conscious Wavefront Scheduling
How should a highly multithreaded architecture, like a GPU, pick which threads to issue?
Use feedback from the memory system.

[Diagram: a Thread Scheduler (Thread 0, Thread 1, Thread 2, Thread 3) issues accesses to the Cache System; the Cache System sends feedback to the Thread Scheduler.]
Fix your replacement policy? Feedback!
Better hit rate than optimal replacement with other schedulers.
63% performance improvement!
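The feedback loop on this slide can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the mechanism presented in the talk: the `LRUCache` class, the `simulate` function, the per-thread victim sets, and the "lower the issue limit when a thread misses on a line it recently lost" rule are all hypothetical choices made for illustration. Each thread loops over a small private working set; a round-robin scheduler interleaves all threads, while the feedback scheduler restricts how many threads may issue once the cache reports lost locality.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative cache with LRU replacement (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Return (hit, evicted_addr); evicted_addr is None if nothing was evicted."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return True, None
        evicted = None
        if len(self.lines) >= self.capacity:
            evicted, _ = self.lines.popitem(last=False)  # drop least recently used
        self.lines[addr] = True
        return False, evicted


def simulate(num_threads, working_set, reps, capacity, use_feedback):
    """Return the cache hit rate for a toy workload.

    Thread t repeatedly touches addresses (t, 0) .. (t, working_set - 1).
    Without feedback, all unfinished threads issue in round-robin order.
    With feedback (a hypothetical rule), the scheduler lowers the number of
    threads allowed to issue whenever a thread misses on a line it lost earlier.
    """
    cache = LRUCache(capacity)
    length = working_set * reps
    pos = [0] * num_threads                       # next access index per thread
    victims = [set() for _ in range(num_threads)]  # lines each thread has lost
    limit = num_threads                            # threads allowed to issue
    hits = total = 0

    while True:
        unfinished = [t for t in range(num_threads) if pos[t] < length]
        if not unfinished:
            break
        for t in unfinished[:limit]:
            addr = (t, pos[t] % working_set)
            pos[t] += 1
            hit, evicted = cache.access(addr)
            total += 1
            hits += hit
            if evicted is not None:
                victims[evicted[0]].add(evicted)  # remember who lost the line
            if use_feedback and not hit and addr in victims[t]:
                # Feedback from the cache: this thread lost locality, so throttle.
                victims[t].discard(addr)
                limit = max(1, limit - 1)
    return hits / total


if __name__ == "__main__":
    print("round-robin hit rate:", simulate(4, 4, 8, 8, use_feedback=False))
    print("feedback hit rate:   ", simulate(4, 4, 8, 8, use_feedback=True))
```

In this toy setup the combined working set of the four threads is twice the cache capacity, so round-robin issue thrashes the LRU cache, while throttling lets each thread reuse its lines before they are evicted. The sketch never raises the limit again, which a real scheduler would need to do to keep the machine busy.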

Timothy G. Rogers¹, Mike O’Connor², Tor M. Aamodt¹
¹The University of British Columbia, ²AMD Research
Cache-Conscious Wavefront Scheduling: today at 3:30pm, right here in the Columbia Ballroom