Memory Access Scheduling Authors: Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, John D. Owens, Computer Systems Laboratory, Stanford University, Stanford, CA. Kamlesh Raiter, Electrical and Computer Engineering, University of Alberta

Abstract The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure.

Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth.

Introduction Modern computer systems are becoming increasingly limited by memory performance. While processor performance increases at a rate of 60% per year, the bandwidth of a memory chip increases by only 10% per year, making it costly to provide the memory bandwidth required to match processor performance. The memory bandwidth bottleneck is even more acute for media processors, whose streaming memory reference patterns do not cache well. Without an effective cache to reduce the bandwidth demands on main memory, these media processors are more often limited by memory system bandwidth than other computer systems.

This paper introduces memory access scheduling in which DRAM operations are scheduled, possibly completing memory references out of order, to optimize memory system performance. The several memory access scheduling strategies introduced in this paper increase the sustained memory bandwidth of a system by up to 144% over a system with no access scheduling when applied to realistic synthetic benchmarks. Media processing applications exhibit a 30% improvement in sustained memory bandwidth with memory access scheduling, and the traces of these applications offer a potential bandwidth improvement of up to 93%.

What is Memory Access Scheduling? Modern DRAM architecture has a three-dimensional structure: bank, row, and column.

Memory Access System Three steps in accessing memory data: precharge (3 cycles), row access (3 cycles), and column access (1 cycle). Once a row has been accessed, a new column access can issue each cycle until the bank is precharged.
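The cycle counts above can be turned into a small cost model. The following is a minimal sketch, not the simulator used in the paper: it assumes references are served strictly in order with no overlap between operations, and that every bank starts precharged, so the first access to a bank needs only a row access and a column access. The function name `in_order_cycles` and the `(bank, row, column)` tuple layout are hypothetical.

```python
PRECHARGE_CYCLES = 3      # close the active row of a bank
ROW_ACCESS_CYCLES = 3     # activate a row
COLUMN_ACCESS_CYCLES = 1  # read or write one column of the active row


def in_order_cycles(references):
    """Cycles needed to serve (bank, row, column) references strictly in order.

    Assumption: no operation overlaps with any other, and every bank
    starts out precharged (no active row).
    """
    active_row = {}  # bank -> row currently active in that bank
    cycles = 0
    for bank, row, _column in references:
        if active_row.get(bank) != row:
            if bank in active_row:
                # A different row is active: precharge before activating.
                cycles += PRECHARGE_CYCLES
            cycles += ROW_ACCESS_CYCLES
            active_row[bank] = row
        cycles += COLUMN_ACCESS_CYCLES
    return cycles


# Four columns of one row versus four accesses alternating between two rows.
same_row = [(0, 0, c) for c in range(4)]
alternating = [(0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 1)]
print(in_order_cycles(same_row), in_order_cycles(alternating))
```

Under these assumptions, the same four accesses cost far more when they alternate between rows of one bank, which is the gap that reordering tries to close.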

References are initially sorted by DRAM bank. Each pending reference is represented by six fields: valid (V), load/store (L/S), address (Row and Col), data, and whatever additional state is necessary for the scheduling algorithm. Examples of state that can be accessed and modified by the scheduler are the age of the reference and whether or not that reference targets the currently active row. The precharge manager for each bank simply decides when its associated bank should be precharged. Similarly, the row arbiter for each bank decides which row, if any, should be activated when that bank is idle. A single column arbiter is shared by all the banks. The precharge managers, row arbiters, and column arbiter can use several different policies to select DRAM operations. The combination of policies used by these units, along with the address arbiter's policy, determines the memory access scheduling algorithm.
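The per-bank bookkeeping described above can be sketched as follows. This is a simplified illustration, not the paper's scheduler: it models a single bank, ignores DRAM timing and the shared column and address arbiters, and uses one hypothetical policy pair (serve the oldest reference that hits the active row; otherwise activate the row of the oldest pending reference). The names `PendingReference`, `pick_column_access`, `pick_row_to_activate`, and `drain_bank` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingReference:
    valid: bool                 # V: entry holds an unserved reference
    is_load: bool               # L/S: True for a load, False for a store
    row: int                    # address: row within the bank
    col: int                    # address: column within the row
    data: Optional[int] = None  # store data, or space for load data
    age: int = 0                # scheduler state: smaller means older


def pick_column_access(bank_queue, active_row):
    """Oldest valid reference that targets the active row, or None."""
    hits = [r for r in bank_queue if r.valid and r.row == active_row]
    return min(hits, key=lambda r: r.age, default=None)


def pick_row_to_activate(bank_queue):
    """Row of the oldest valid reference, or None if the bank has no work."""
    pending = [r for r in bank_queue if r.valid]
    oldest = min(pending, key=lambda r: r.age, default=None)
    return None if oldest is None else oldest.row


def drain_bank(bank_queue):
    """Serve every reference; return (service order by age, row activations)."""
    active_row = None
    order, activations = [], 0
    while any(r.valid for r in bank_queue):
        ref = pick_column_access(bank_queue, active_row)
        if ref is None:
            # Nothing hits the active row: switch to the oldest reference's row.
            active_row = pick_row_to_activate(bank_queue)
            activations += 1
            continue
        ref.valid = False
        order.append(ref.age)
    return order, activations


# References alternate between rows 0 and 1; in-order service would need
# four row activations, one per reference.
queue = [PendingReference(True, True, row=i % 2, col=i, age=i) for i in range(4)]
print(drain_bank(queue))
```

Serving row hits first completes references out of order, which is why each entry keeps its own valid bit and age rather than relying on queue position.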

Experimental Setup In a stream (or vector) processor, the stream transfer bandwidth, rather than the latency of any individual memory reference, drives processor performance. To evaluate the performance impact of memory access scheduling on media processing, the Imagine streaming media processor was simulated running typical media processing applications. For the simulations, it was assumed that the processor frequency was 500 MHz and that the DRAM frequency was 125 MHz. At this frequency, Imagine has a peak computation rate of 20 GFLOPS on single-precision floating-point computations and 20 GOPS on 32-bit integer computations.

Experimental Setup The experiments were run on a set of microbenchmarks and five media processing applications. For the microbenchmarks, no computations are performed outside of the address generators. This allows memory references to be issued at their maximum throughput, constrained only by the buffer storage in the memory banks. For the applications, the simulations were run both with the applications' computations and without. When running just the memory traces, dependencies were maintained by assuming the computation occurred at the appropriate times but was instantaneous.

The 14% drop in sustained bandwidth from the unit load benchmark to the unit benchmark shows the performance degradation imposed by forcing intermixed load and store references to complete in order. The unit conflict benchmark further shows the penalty of swapping back and forth between rows in the DRAM banks, which drops the sustainable bandwidth down to 51% of the peak. The random benchmarks sustain about 15% of the bandwidth of the unit load benchmark.

The QRD and MPEG traces include many unit and small constant stride accesses, leading to a sustained bandwidth that approaches that of the unit benchmark. The Depth trace consists almost exclusively of constant stride accesses, but dependencies limit the number of simultaneous stream accesses that can occur. The FFT trace is composed of constant stride loads and bit-reversed stores. The bit-reversed accesses sustain less bandwidth than constant stride accesses. The Tex trace includes constant stride accesses, but is dominated by texture accesses, which are essentially random within the texture memory space. These texture accesses lead to the lowest sustained bandwidth of the applications.

Comparison between In-order and First-ready scheduling With first-ready scheduling, the sustained bandwidth is increased by 79% for the microbenchmarks, 17% for the applications, and 40% for the application traces. As should be expected, unit load shows little improvement, as it already sustains almost all of the peak SDRAM bandwidth, and the random benchmarks show an improvement of over 125%, as they are able to increase the number of column accesses per row activation significantly.
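A rough way to see why column accesses per row activation matter is the following back-of-the-envelope model. It is an illustrative assumption, not a result from the paper: it considers a single bank with no overlap between operations, uses the cycle counts given earlier (3 for precharge, 3 for row access, 1 per column access), and lets k stand for the number of column accesses served per row activation.

```latex
\[
  \text{utilization}(k)
  = \frac{k \, t_{\text{col}}}{t_{\text{pre}} + t_{\text{row}} + k \, t_{\text{col}}}
  = \frac{k}{6 + k},
  \qquad t_{\text{pre}} = 3,\; t_{\text{row}} = 3,\; t_{\text{col}} = 1 .
\]
```

In this model a single column access per activation (k = 1) keeps the data pins busy for 1/7 of the cycles, about 14%, while k = 8 raises that to 8/14, about 57%. Real schedulers also overlap operations across banks, which this single-bank model leaves out.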

The Figure presents the sustained memory bandwidth for each memory access scheduling algorithm on the given benchmarks. These aggressive scheduling algorithms improve the memory bandwidth of the microbenchmarks by %, the applications by 27-30%, and the application traces by 85-93% over in-order scheduling.

Conclusion Memory bandwidth is becoming the limiting factor in achieving higher performance, especially in media processing systems. Memory access scheduling greatly increases the bandwidth utilization of modern DRAMs by buffering memory references and choosing to complete them in an order that both accesses the internal banks in parallel and maximizes the number of column accesses per row access, resulting in improved system performance.

Conclusion Memory access scheduling realizes significant bandwidth gains on a set of media processing applications as well as on synthetic benchmarks and application address traces. A simple reordering algorithm that advances the first ready memory reference gives a 17% performance improvement on applications, a 79% bandwidth improvement for the microbenchmarks, and a 40% bandwidth improvement on the application traces. With more aggressive reordering, bandwidth for synthetic benchmarks improved by up to 144%, performance of the media processing applications improved by up to 30%, and the bandwidth of the application traces increased by up to 93%.

Things to Ponder What is the cost of implementing the memory access scheduler? What about the extra logic and the power dissipation? Why do contemporary cache organizations waste memory bandwidth to reduce memory latency?