
Presented By: Michael Bieniek

Introduction: Context

Embedded systems are increasingly using chip multiprocessors (CMPs) due to their low power and high performance capabilities. Shared main memory is becoming a significant bottleneck. There is a need to bound worst-case memory latency in order to provide hard guarantees to real-time tasks.

Introduction: Context (continued)

Researchers have already proposed many analyses that assume a constant service time for each memory request. Modern CMPs use Double Data Rate Dynamic RAM (DDR DRAM) as their main memory. A constant service time cannot be assumed for this type of memory, because the latency of each request depends heavily on previous requests and on the device state. Worse, Commercial-Off-The-Shelf (COTS) memory controllers can exhibit unbounded latency!

Introduction: Paper Contributions

- A modified memory controller that makes memory request latency bounded.
- A derived worst-case memory latency bound for DDR DRAM.
- An evaluation of the memory system against alternative approaches.

DRAM Background: Organization

(diagram slide: DRAM device organization)

DRAM Background: Open and Closed Requests

Open request: the requested data is already cached in the row buffer. Row hit!
Closed request: the requested data is not cached in the row buffer. Row miss!
What happens on an open request? R/W [CAS]
What happens on a closed request? PRE, ACT, R/W [CAS]
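To make the two cases concrete, here is a minimal sketch of how a request expands into DRAM commands (Python, with illustrative names; the paper itself gives no code):

    # Sketch: expanding one memory request into DRAM commands.
    # Assumes one row buffer per bank; names are illustrative only.
    def commands_for_request(open_row, req_row, op):
        """Return the command sequence for one request.
        op is the CAS command: 'RD' or 'WR'."""
        if open_row == req_row:
            return [op]                 # open request: row hit, CAS only
        cmds = []
        if open_row is not None:
            cmds.append('PRE')          # close the currently open row
        cmds.append('ACT')              # activate (open) the target row
        cmds.append(op)                 # then issue the CAS
        return cmds

    print(commands_for_request(7, 7, 'RD'))   # ['RD']
    print(commands_for_request(3, 7, 'WR'))   # ['PRE', 'ACT', 'WR']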

DRAM Background: Refresh

DRAM storage elements are capacitors which must be recharged periodically. To do this, a Refresh (REF) command must be issued to all ranks and banks at regular intervals.

DRAM Background: Timing Constraints

A JEDEC standard defines the operation and the timing constraints of memory devices.
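These constraints can be pictured as a table of minimum delays between commands. The values below are representative of DDR3-1333 in memory clock cycles; they are illustrative assumptions, not taken from the paper, and a real device follows the JEDEC datasheet of the specific part:

    # Representative DDR3-1333 timing constraints, in clock cycles
    # (illustrative values, not taken from the paper).
    TIMING = {
        'tRCD': 9,    # ACT to CAS, same bank
        'tRP':  9,    # PRE to ACT, same bank
        'tCL':  9,    # read CAS to first data
        'tRAS': 24,   # ACT to PRE, same bank
        'tRRD': 4,    # ACT to ACT, different banks
    }

    def min_gap(prev_cmd, next_cmd):
        """Minimum cycle gap between two commands to the same bank (sketch)."""
        key = {('ACT', 'CAS'): 'tRCD',
               ('PRE', 'ACT'): 'tRP',
               ('ACT', 'PRE'): 'tRAS'}.get((prev_cmd, next_cmd))
        return TIMING[key] if key else 0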

DRAM Background: Row Policy

Open row policy: leave the row buffer open as long as possible.
Closed row policy: automatically precharge the row buffer after every request. Every request then issues an ACT and a CAS command, which makes latency predictable.

DRAM Background: Row Mapping

Incoming requests must be mapped to the correct bank, row, and column.
Interleaved banks: each request accesses all banks. More bandwidth, but every requestor must utilize all banks in parallel.
Private banks: each requestor is assigned its own bank. No interference from other requestors, but less bandwidth.
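A minimal sketch of the difference, assuming a made-up address layout (the bit widths and function names are illustrative only):

    # Sketch: which bank(s) a request touches under each mapping.
    # COL_BITS and BANK_BITS describe a made-up address layout.
    COL_BITS, BANK_BITS = 10, 3
    NUM_BANKS = 2 ** BANK_BITS

    def interleaved_banks(addr):
        """Interleaved mapping: the request is striped across all banks."""
        return list(range(NUM_BANKS))

    def private_bank(requestor_id):
        """Private mapping: each requestor statically owns one bank."""
        return requestor_id % NUM_BANKS

    def row_of(addr):
        """Row index under the made-up layout."""
        return addr >> (COL_BITS + BANK_BITS)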

Memory Controller: Design

(diagram slide: per-requestor private command buffers feeding a global FIFO)

Memory Controller: FIFO Arbitration Rules

1. Only one command from each private buffer can be in the FIFO; the requestor must wait for it to be serviced before enqueuing more.
2. A command can only be enqueued once all timing constraints caused by previous commands from the same requestor are met.
3. At the start of each memory cycle, the FIFO is scanned front to back, and the first command that can be serviced is issued.
4. If one CAS command is blocked due to timing constraints, all other CAS commands in the FIFO are also blocked.
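A minimal sketch of one arbitration cycle under rules 3 and 4 (the data layout and names are illustrative, not the paper's implementation):

    # Sketch: one cycle of FIFO arbitration (rules 3 and 4).
    # Each entry is (requestor, kind, ready_cycle); kind is 'PRE', 'ACT' or 'CAS'.
    def issue_one(fifo, now):
        """Scan front to back and issue the first serviceable command.
        Rule 4: once a CAS is found blocked, all later CASes are skipped."""
        cas_blocked = False
        for i, (req, kind, ready) in enumerate(fifo):
            if kind == 'CAS' and cas_blocked:
                continue                  # rule 4: blocked behind an older CAS
            if ready <= now:
                return fifo.pop(i)        # rule 3: first serviceable command
            if kind == 'CAS':
                cas_blocked = True        # this CAS now blocks all later CASes
        return None                       # nothing can issue this cycle

Rule 4 keeps CAS commands in FIFO order, so together with rule 1 a requestor's CAS waits behind at most one CAS from each other requestor, matching the M-1 interfering CAS commands assumed in the worst-case analysis later.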

Memory Controller: Rule 4

Rule 4 is needed to prevent unbounded latency.

Worst Case Analysis: System Model

The analysis assumes a system with M memory requestors and in-order memory requests. To model latency, we need the number of requests of each type and the order in which they are generated. How do we get this information? Either run the task and collect a memory trace, or use static analysis.
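For example, the required per-task information can be pictured as an ordered request profile (a hypothetical encoding; the analysis only needs the counts and the order):

    # Sketch: per-task request profile assumed by the analysis.
    # 'O' = open request (row hit), 'C' = closed request (row miss),
    # obtained from a memory trace or from static analysis.
    profile = ['C', 'O', 'O', 'C', 'O']       # hypothetical task
    num_open, num_close = profile.count('O'), profile.count('C')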

Worst Case Analysis: Per-Request Latency – Arrival to CAS

Arrival-to-CAS latency, t_AC, is the worst-case interval between the arrival of a request at the memory controller and the subsequent enqueuing of its CAS command.

Worst Case Analysis: Per-Request Latency – Arrival to CAS, Open Request

For an open request, the memory request consists of just a single CAS command, because the row is already open. This means that t_AC consists of the latency imposed by the timing constraints of previous commands.

Worst Case Analysis: Per-Request Latency – Arrival to CAS, Closed Request

Worst Case Analysis: Per-Request Latency – CAS to Data, Write

The worst-case interference occurs when all M-1 other requestors enqueue a CAS command in the global FIFO in an alternating pattern of read and write commands.

Worst Case Analysis: Per-Request Latency – CAS to Data, Read

The worst-case interference again occurs when all M-1 other requestors enqueue a CAS command in the global FIFO in an alternating pattern of read and write commands. This is the same as the write case, except that the last read command contributes a different latency.
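The shape of this bound can be sketched as follows (the transition delays are placeholder symbols standing in for the JEDEC timing constraints; this is not the paper's exact formula):

    # Sketch: worst-case CAS-to-data delay with M-1 older CAS commands
    # alternating read/write ahead of the request under analysis.
    def cas_to_data_bound(M, t_rtw, t_wtr, t_last):
        """Each older CAS adds one read<->write turnaround delay;
        t_last is the final term, which differs for reads and writes."""
        turnarounds = sum(t_rtw if i % 2 == 0 else t_wtr
                          for i in range(M - 1))
        return turnarounds + t_last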

Worst Case Analysis: Cumulative Latency – t_AC

(formula slide)

Worst Case Analysis: Refresh, Overall Latency

Refresh commands are issued periodically and must be accounted for, since the DRAM cannot service any requests while a refresh is in progress. Refresh adds complexity because it closes all row buffers, turning open requests into closed requests. To determine how many open requests turn into closed requests, we need to compute how many refresh operations can take place during the execution of the task; but the task's execution time depends on its cumulative memory latency: a circular dependency! A standard way to resolve it is fixed-point iteration, sketched below.
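A minimal sketch of that fixed-point iteration (a standard resolution for this kind of circular dependency; the latency function is a placeholder and the refresh interval is a typical DDR3 value, not the paper's equations):

    # Sketch: resolving the refresh/execution-time circular dependency
    # by fixed-point iteration. latency_bound(r) stands for the cumulative
    # worst-case memory latency when r refreshes occur during the task.
    T_REFI = 7.8e-6   # refresh interval in seconds (typical DDR3 value)

    def overall_wcet(compute_time, latency_bound, max_iter=100):
        wcet = compute_time + latency_bound(0)
        for _ in range(max_iter):
            refreshes = int(wcet / T_REFI) + 1    # refreshes during the task
            new_wcet = compute_time + latency_bound(refreshes)
            if new_wcet == wcet:                  # fixed point reached
                return wcet
            wcet = new_wcet
        raise RuntimeError('no convergence')

Since more refreshes can only turn more open requests into closed ones, latency_bound is nondecreasing in its argument, so the iterates grow monotonically toward the fixed point.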

Evaluation: Setup

Compared against the Analyzable Memory Controller (AMC). This is a fair comparison, as AMC also uses a fair round-robin arbitration scheme that does not prioritize requestors. AMC uses interleaved banks and a closed row policy.

Evaluation: Synthetic Benchmark

(results slide)

Evaluation: CHStone Benchmark

(results slide)

Conclusion

The proposed approach, which uses an open row policy and private bank mapping, appears to provide a better worst-case bound on memory latency. The evaluation shows that the proposed method scales better as the number of requestors grows, and that it is better suited to current and future generations of memory devices as memory speeds increase. The proposed approach also minimizes the amount of wasted data transferred.

Q&A