Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures (presented by Zhang Liang)
Introduction: GPGPU. Multiple shader cores (Nvidia's Kepler already has more than 2000 shader cores); massive numbers of threads run in parallel; high arithmetic intensity (very good at floating-point computation, high GFLOPS). API support: OpenCL, Nvidia CUDA, graphics shading languages. CUDA example: https://www.youtube.com/watch?v=K1I4kts5mqc
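A minimal CUDA sketch of the kind of example referenced above (a generic vector-add kernel, not the code from the linked video; names and launch parameters are illustrative):

    // Each of the many parallel threads adds one element of two arrays.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard the tail block
            c[i] = a[i] + b[i];
    }
    // Host-side launch, one thread per element (illustrative):
    //   vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);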
Graphics pipeline: Vertex Processing (shader core) -> Rasterization (fixed-function unit) -> Pixel Processing (shader core). GPGPU takes advantage of the powerful shader cores.
Problem of GPGPU: it requires very high memory bandwidth, and interleaving requests from multiple shader cores destroys DRAM "row access locality". That is why bandwidth is always a central topic for new GPU architectures.
Introduction: DDR. Each DDR chip has 4 banks; each bank has multiple rows and multiple columns; a row must be brought into the bank's row buffer before its columns can be accessed.
As a result, DDR cannot always respond immediately.
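A rough sketch of why a response can be delayed (latency values are illustrative placeholders, not the paper's timing parameters): a request that hits the open row in the row buffer only needs a column access, while a request to a different row must first precharge and activate.

    // Sketch of one DRAM bank's row-buffer behavior (illustrative latencies).
    struct Bank {
        int open_row = -1;                        // row held in the row buffer, -1 = none

        int access(int row) {                     // returns service latency in DRAM cycles
            if (row == open_row)
                return 4;                         // row-buffer hit: column access only
            int lat = (open_row == -1) ? 0 : 10;  // precharge the currently open row
            lat += 10;                            // activate the requested row
            open_row = row;
            return lat + 4;                       // then the column access
        }
    };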
Reasons for research: there is a gap between pre-interconnect and post-interconnect row access locality, and it grows as the number of shader cores increases.
Reasons for research: FR-FCFS (first-ready, first-come first-served) delivers high performance, but it is area-intensive and may need to compare all requests in the DRAM controller queue every cycle.
Reasons for research: having fewer DDR chips per controller is more effective as the number of controllers increases, but this reduces the number of read/write commands issued per activated row on each chip; with low row access locality, DRAM efficiency can therefore drop greatly.
This is why out-of-order scheduling is needed in the first place.
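A sketch of what an out-of-order FR-FCFS scheduler does and where its cost comes from (the data structures are illustrative, not the paper's hardware): each cycle it scans the whole queue, preferring the oldest request that hits the currently open row and falling back to the oldest request overall.

    #include <cstdint>
    #include <vector>

    struct Request { int bank; int row; uint64_t arrival; };

    // FR-FCFS selection for one controller: every queued request is compared
    // every cycle (this is the area/complexity cost). open_row[b] is bank b's
    // currently open row. Returns the index of the chosen request, or -1.
    int frfcfs_pick(const std::vector<Request>& q, const std::vector<int>& open_row) {
        int best_hit = -1, oldest = -1;
        for (int i = 0; i < (int)q.size(); ++i) {
            if (oldest < 0 || q[i].arrival < q[oldest].arrival)
                oldest = i;                                   // track oldest overall
            bool hit = (q[i].row == open_row[q[i].bank]);
            if (hit && (best_hit < 0 || q[i].arrival < q[best_hit].arrival))
                best_hit = i;                                 // track oldest row-buffer hit
        }
        return best_hit >= 0 ? best_hit : oldest;
    }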
Proposed design: use simple in-order memory controllers together with a "smart" NoC to achieve performance comparable to FR-FCFS.
Interconnect architectures considered: crossbar, mesh, and ring networks.
Row access locality decreases by almost 47% on average between pre-interconnect and post-interconnect.
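Row access locality can be read as the average number of requests serviced per row activation; below is a sketch of measuring it over a request stream, either per shader core (pre-interconnect) or as seen at the DRAM controller (post-interconnect). The counting is an illustrative simplification, not necessarily the paper's exact methodology.

    #include <vector>

    struct Req { int bank; int row; };

    // Requests per row switch over an in-order stream of requests.
    double row_access_locality(const std::vector<Req>& stream, int num_banks) {
        std::vector<int> open(num_banks, -1);
        int activations = 0;
        for (const Req& r : stream)
            if (open[r.bank] != r.row) { ++activations; open[r.bank] = r.row; }
        return activations ? (double)stream.size() / activations : 0.0;
    }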
Two arbitration policies to keep row access locality. Hold Grant (HG): if a pair of input and output ports won in the last cycle, they win again in the current cycle. Row-Matching Hold Grant (RMHG): the pair wins again only if the requested row matches the previous one, so that inherent bank-level parallelism is not lost.
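A sketch of the two grant-holding rules as a single per-output decision (the state layout and the fallback arbiter are illustrative; a router would keep this state per output port):

    #include <vector>

    struct ArbState { int last_winner = -1; int last_row = -1; };  // per output port

    // req_row[i] is the DRAM row requested by input i for this output (-1 = no request).
    // With row_matching == false this behaves as HG; with true it behaves as RMHG.
    int arbitrate(ArbState& s, const std::vector<int>& req_row, bool row_matching) {
        int w = s.last_winner;
        bool hold = (w >= 0) && (req_row[w] != -1) &&
                    (!row_matching || req_row[w] == s.last_row);
        if (!hold) {
            w = -1;                                        // fall back to the normal arbiter
            for (int i = 0; i < (int)req_row.size(); ++i)  // (fixed priority for brevity)
                if (req_row[i] != -1) { w = i; break; }
        }
        if (w >= 0) { s.last_winner = w; s.last_row = req_row[w]; }
        return w;                                          // granted input, or -1 if none
    }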
Two router stages where HG is applied: virtual channel allocation (requests from the same shader core to the same DRAM controller must be assigned the same virtual channel in each router, improving performance by 17.7%) and switch allocation (so that it does not undo the effect of virtual channel allocation); a small sketch of the virtual channel restriction follows.
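A sketch of the virtual channel restriction just mentioned; the hash is an illustrative assumption, the only requirement being that requests from one shader core to one DRAM controller always map to the same virtual channel so they cannot be reordered across VCs inside a router.

    // Deterministic VC choice from (source shader core, destination DRAM controller).
    int assign_vc(int src_core, int dst_mc, int num_vcs) {
        return (src_core * 31 + dst_mc) % num_vcs;  // illustrative hash; any fixed mapping works
    }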
Parallel iterative matching (PIM) allocation: (1) an input arbiter at each input port selects a single request; (2) the output of this first stage of input arbiters is then fed to the output arbiters, which choose one request per output port.
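A sketch of how the two PIM stages compose, with the hold-grant policy layered on top (one iteration, fixed-priority arbiters for brevity; the request matrix and hold vector are illustrative):

    #include <vector>
    using std::vector;

    // req[i][o] is true if input port i has a request for output port o.
    // hold[o] is the input whose grant output o is holding (HG/RMHG), or -1.
    // Returns grant[o] = winning input for output o, or -1 if none.
    vector<int> pim_allocate(const vector<vector<bool>>& req, const vector<int>& hold) {
        int I = (int)req.size(), O = I ? (int)req[0].size() : 0;
        vector<int> input_choice(I, -1), grant(O, -1);

        // Stage 1: each input arbiter selects a single request, preferring an
        // output that is currently holding this input's grant.
        for (int i = 0; i < I; ++i) {
            for (int o = 0; o < O; ++o)
                if (hold[o] == i && req[i][o]) { input_choice[i] = o; break; }
            if (input_choice[i] < 0)
                for (int o = 0; o < O; ++o)
                    if (req[i][o]) { input_choice[i] = o; break; }
        }

        // Stage 2: each output arbiter picks among the inputs that selected it,
        // again honoring a held grant first.
        for (int o = 0; o < O; ++o) {
            if (hold[o] >= 0 && input_choice[hold[o]] == o) { grant[o] = hold[o]; continue; }
            for (int i = 0; i < I; ++i)
                if (input_choice[i] == o) { grant[o] = i; break; }
        }
        return grant;
    }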
Complexity comparison criteria: the amount of storage required and the number of bit comparisons performed.
Microarchitecture parameters used to evaluate the design.
Benchmarks used for evaluation.
Row access locality preservation with different topologies: HG and RMHG preserve about 70% of the row access locality, while FR-FCFS only keeps 56.2%.
Average latency and DRAM efficiency: HG and RMHG reduce the latency of the in-order DRAM controller by more than 30% and improve DRAM efficiency by 15%.
Gap between HG/RMHG and FR-FCFS: the remaining gap is dominated by "row streak breakers", i.e. interleaved requests arriving from different shader cores.
Virtual channel configurations: static virtual channel assignment performs better, improving performance by about 18%; adding more virtual channels does not improve IPC.
Sensitivity analysis: the ring network performs poorly; HG and RMHG achieve relatively good performance even with simple DRAM scheduling.
Sensitivity analysis: as the number of DRAM controllers increases, HG/RMHG performance stays steady relative to FR-FCFS, while the controller complexity remains unchanged.
Related work. PAR-BS: highly complex and area-intensive for tens of thousands of threads. In most of the applications a warp sends requests to only a single bank, which our design can also handle; only one application consistently sends requests to at least 3 different DRAM controllers, so PAR-BS is not expected to perform well here.
Conclusion: the design achieves about 90% of the performance of aggressive out-of-order scheduling. The NoC modifications (HG and RMHG arbitration, combined with the restricted virtual channel assignment) allow simple in-order DRAM controllers to achieve high performance.
Thank You!