Sungjun Kim Columbia University Edward A. Lee UC Berkeley

Slides:

Advertisements

Similar presentations

DRAM background Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling, Garnesh, HPCA'07 CS 8501, Mario D. Marino, 02/08.

Advertisements

Main MemoryCS510 Computer ArchitecturesLecture Lecture 15 Main Memory.

Outline Memory characteristics SRAM Content-addressable memory details DRAM © Derek Chiou & Mattan Erez 1.

Recent Progress In Embedded Memory Controller Design

Application-Aware Memory Channel Partitioning † Sai Prashanth Muralidhara § Lavanya Subramanian † † Onur Mutlu † Mahmut Kandemir § ‡ Thomas Moscibroda.

+ CS 325: CS Hardware and Software Organization and Architecture Internal Memory.

5-1 Memory System. Logical Memory Map. Each location size is one byte (Byte Addressable) Logical Memory Map. Each location size is one byte (Byte Addressable)

Anshul Kumar, CSE IITD CSL718 : Main Memory 6th Mar, 2006.

Main Mem.. CSE 471 Autumn 011 Main Memory The last level in the cache – main memory hierarchy is the main memory made of DRAM chips DRAM parameters (memory.

Predictable Programming on a Precision Timed Architecture Hiren D. Patel UC Berkeley Joint work with: Ben Lickly, Isaac Liu, Edward.

Timing Analysis of Embedded Software for Families of Microarchitectures Jan Reineke, UC Berkeley Edward A. Lee, UC Berkeley Representing Distributed Sense.

CS.305 Computer Architecture Memory: Structures Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made.

Chapter 9 Memory Basics Henry Hexmoor1. 2 Memory Definitions  Memory ─ A collection of storage cells together with the necessary circuits to transfer.

Lecture 12: DRAM Basics Today: DRAM terminology and basics, energy innovations.

1 Lecture 15: DRAM Design Today: DRAM basics, DRAM innovations (Section 5.3)

8th Biennial Ptolemy Miniconference Berkeley, CA April 16, 2009 Precision Timed (PRET) Architecture Hiren D. Patel, Ben Lickly, Isaac Liu and Edward A.

1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Nov. 19, 2003 Topic: Main Memory (DRAM) Organization.

Registers –Flip-flops are available in a variety of configurations. A simple one with two independent D flip-flops with clear and preset signals is illustrated.

Memory Hierarchy.1 Review: Major Components of a Computer Processor Control Datapath Memory Devices Input Output.

ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )

1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Nov. 13, 2002 Topic: Main Memory (DRAM) Organization.

The Case for Precision Timed (PRET) Machines Edward A. Lee Professor, Chair of EECS UC Berkeley With thanks to Stephen Edwards, Columbia University. National.

Registers  Flip-flops are available in a variety of configurations. A simple one with two independent D flip-flops with clear and preset signals is illustrated.

Caching I Andreas Klappenecker CPSC321 Computer Architecture.

7th Biennial Ptolemy Miniconference Berkeley, CA February 13, 2007 Cyber-Physical Systems: A Vision of the Future Edward A. Lee Robert S. Pepper Distinguished.

February 21, 2008 Center for Hybrid and Embedded Software Systems Mapping A Timed Functional Specification to a Precision.

Chapter 6 Memory and Programmable Logic Devices

CSIT 301 (Blum)1 Memory. CSIT 301 (Blum)2 Types of DRAM Asynchronous –The processor timing and the memory timing (refreshing schedule) were independent.

Memory Technology “Non-so-random” Access Technology:

Charles Kime & Thomas Kaminski © 2008 Pearson Education, Inc. (Hyperlinks are active in View Show mode) Chapter 8 – Memory Basics Logic and Computer Design.

CPE232 Memory Hierarchy1 CPE 232 Computer Organization Spring 2006 Memory Hierarchy Dr. Gheith Abandah [Adapted from the slides of Professor Mary Irwin.

CSIE30300 Computer Architecture Unit 07: Main Memory Hsin-Chou Chi [Adapted from material by and

Survey of Existing Memory Devices Renee Gayle M. Chua.

Timing Channel Protection for a Shared Memory Controller Yao Wang, Andrew Ferraiuolo, G. Edward Suh Feb 17 th 2014.

Lecture 19 Today’s topics Types of memory Memory hierarchy.

A Mixed Time-Criticality SDRAM Controller MeAOW Sven Goossens, Benny Akesson, Kees Goossens COBRA – CA104 NEST.

EEE-445 Review: Major Components of a Computer Processor Control Datapath Memory Devices Input Output Cache Main Memory Secondary Memory (Disk)

Main Memory CS448.

CS 312 Computer Architecture Memory Basics Department of Computer Science Southern Illinois University Edwardsville Summer, 2015 Dr. Hiroshi Fujinoki

CPEN Digital System Design

University of Tehran 1 Interface Design DRAM Modules Omid Fatemi

Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.

+ CS 325: CS Hardware and Software Organization and Architecture Memory Organization.

1 Presented By: Michael Bieniek. Embedded systems are increasingly using chip multiprocessors (CMPs) due to their low power and high performance capabilities.

By Edward A. Lee, J.Reineke, I.Liu, H.D.Patel, S.Kim

Computer Architecture Lecture 24 Fasih ur Rehman.

Shih-Fan, Peng 2013 IEE5008 –Autumn 2013 Memory Systems DRAM Controller for Video Application Shih-Fan, Peng Department of Electronics Engineering National.

Chapter 4 Memory Design: SOC and Board-Based Systems

COMP541 Memories II: DRAMs

1 Adapted from UC Berkeley CS252 S01 Lecture 18: Reducing Cache Hit Time and Main Memory Design Virtucal Cache, pipelined cache, cache summary, main memory.

1 Memory Hierarchy (I). 2 Outline Random-Access Memory (RAM) Nonvolatile Memory Disk Storage Suggested Reading: 6.1.

Contemporary DRAM memories and optimization of their usage Nebojša Milenković and Vladimir Stanković, Faculty of Electronic Engineering, Niš.

February 12, 2009 Center for Hybrid and Embedded Software Systems Timing-aware Exceptions for a Precision Timed (PRET)

Chapter 11 System Performance Enhancement. Basic Operation of a Computer l Program is loaded into memory l Instruction is fetched from memory l Operands.

CS35101 Computer Architecture Spring 2006 Lecture 18: Memory Hierarchy Paul Durand ( ) [Adapted from M Irwin (

“With 1 MB RAM, we had a memory capacity which will NEVER be fully utilized” - Bill Gates.

15-740/ Computer Architecture Lecture 25: Main Memory

CS 704 Advanced Computer Architecture

COMP541 Memories II: DRAMs

Chapter 1: Introduction

Reducing Memory Interference in Multicore Systems

Reducing Hit Time Small and simple caches Way prediction Trace caches

Cache Memory Presentation I

Sven Goossens Slides: Benny Åkesson

A Precision Timed Architecture for Predictable and Repeatable Timing

EE345: Introduction to Microcontrollers Memory

Hiren D. Patel Isaac Liu Ben Lickly Edward A. Lee

Timing-aware Exceptions for a Precision Timed (PRET) Target

15-740/ Computer Architecture Lecture 19: Main Memory

DRAM Hwansoo Han.

Presentation transcript:

PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation Sungjun Kim Columbia University Edward A. Lee UC Berkeley Isaac Liu UC Berkeley Hiren D. Patel University of Waterloo Jan Reineke UC Berkeley <speaker> CODES+ISSS as part of ESWEEK 2011 Taipei, Taiwan, October 10th, 2011

Predictability and Temporal Isolation Many embedded systems are real-time systems Memory hierarchy has a strong influence on their performance: Need for Predictability Trend towards integrated architectures: Need for Temporal Isolation Audio + video playback with latency and bandwidth constraints

Outline Introduction DRAM Basics Related Work: Predator and AMC PRET DRAM Controller: Main Ideas Evaluation Integration into Precision-Timed ARM

DRAM Basics Outline Introduction Related Work: Predator and AMC PRET DRAM Controller: Main Ideas Evaluation Integration into Precision-Timed ARM

Memory Hierarchy: Dynamic RAM vs Static RAM SRAM Fast  Low Latency Low Capacity DRAM Slow  High Latency High Capacity from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2007.

Dynamic RAM Organization Overview DRAM Device Set of DRAM banks + Control logic I/O gating Accesses to banks can be pipelined, however I/O + control logic are shared DRAM Cell Leaks charge  Needs to be refreshed (every 64ms for DDR2/DDR3) therefore “dynamic” DRAM Bank = Array of DRAM Cells + Sense Amplifiers and Row Buffer Sharing of sense amplifiers and row buffer DRAM Module Collection of DRAM Devices Rank = groups of devices that operate in unison Ranks share data/address/command bus

DRAM Memory Controller Translates sequences of memory accesses by Clients (CPUs and I/O) into legal sequences of DRAM commands Needs to obey all timing constraints Needs to insert refresh commands sufficiently often Needs to translate “physical” memory addresses into row/column/bank tuples

Dynamic RAM Timing Constraints DRAM Memory Controllers have to conform to different timing constraints that define minimal distances between consecutive DRAM commands. Almost all of these constraints are due to the sharing of resources at different levels of the hierarchy: Needs to insert refresh commands sufficiently often Banks within a DRAM device share I/O gating and control logic Rows within a bank share sense amplifiers Different ranks share data/address/command busses

General-Purpose DRAM Controllers Schedule DRAM commands dynamically Timing hard to predict even for single client: Timing of request depends on past requests: Request to same/different bank? Request to open/closed row within bank? Controller might reorder requests to minimize latency Controllers dynamically schedule refreshes Non-composable timing. Timing depends on behavior of other clients: They influence sequence of “past requests” Arbitration may or may not provide guarantees

General-Purpose DRAM Controllers Load B1.R3.C2 Load B1.R4.C3 Load B1.R3.C5 … Memory Controller RAS B1.R3 CAS B1.C2 … B1.R4 B1.C3 B1.C5 ? RAS B1.R3 CAS B1.C2 CAS B1.C5 RAS B1.R4 CAS B1.C3 … …

General-Purpose DRAM Controllers Thread 1 Thread 2 Load B1.R3.C2 Load B2.R4.C3 Store B4.R3.C5 Load B3.R3.C2 Load B3.R5.C3 Store B2.R3.C5 Arbitration ? Load B1.R3.C2 Load B3.R3.C2 Load B2.R4.C3 Store B4.R3.C5 Load B3.R5.C3 Store B2.R3.C5 Load B1.R3.C2 Load B3.R3.C2 Load B2.R4.C3 Store B4.R3.C5 Load B3.R5.C3 Store B2.R3.C5 Memory Controller

Related Work: Predator and AMC Outline Introduction DRAM Basics Related Work: Predator and AMC PRET DRAM Controller: Main Ideas Evaluation Integration into Precision-Timed ARM

Predictable DRAM Controllers: Predator (Eindhoven) and AMC (Barcelona) Spread each request over all banks, pipeline accesses to banks. Closed-page policy: timing independent of previously accessed row Statically precomputed sequences for writes, reads, write->read, read->write, refresh. Predictable and/or composable arbitration: Predator: CCSP AMC: TDMA

Predictable DRAM Controllers: Predator (Eindhoven) Load B1.R3.C2 Load B1.R4.C3 Store B1.R3.C5 … Predictable Memory Controller: Predator Read Pattern Read Pattern R/W Pattern Write Pattern Closed-page policy: timing independent of previously accessed row Spread each request over all banks, pipeline accesses to banks. Statically precomputed sequences for writes, reads, write->read, read->write, refresh. increases access granularity

Predictable DRAM Controllers: Predator (Eindhoven) and AMC (Barcelona) Thread 1 Thread 2 Load B1.R3.C2 Load B3.R3.C2 Load B3.R5.C3 Store B2.R3.C5 … Predictable and/or Composable Arbitration (e.g. time-division multiple access) ? Load B1.R3.C2 Load B3.R3.C2 Load B3.R5.C3 Store B2.R3.C5 Memory Controller

PRET DRAM Controller: Main Ideas Outline Introduction DRAM Basics Related Work: Predator and AMC PRET DRAM Controller: Main Ideas Evaluation Integration into Precision-Timed ARM

PRET DRAM Controller: Three Innovations Expose internal structure of DRAM devices: Expose individual banks within DRAM device as multiple independent resources Defer refreshes to the end of transactions Allows to hide refresh latency Perform refreshes “manually”: Replace standard refresh command with multiple reads

PRET DRAM Controller: Exploiting Internal Structure of DRAM Module Consists of 4-8 banks in 1-2 ranks Share only command and data bus, otherwise independent Partition into four groups of banks in alternating ranks Cycle through groups in a time-triggered fashion Successive accesses to same group obey timing constraints Reads/writes to different groups do not interfere Rank 0: Group 0 Group 2 Bank 0 Bank 1 Bank 2 Bank 3 Rank 1: Group 1 Group 3 Provides four independent and predictable resources Bank 0 Bank 1 Bank 2 Bank 3

PRET DRAM Controller: Exploiting Internal Structure of DRAM Module Load B1.R3.C2 Load B1.R4.C3 Store B1.R3.C5 … PRET DRAM Controller Read Pattern Read Pattern Write Pattern …

Pipelined Bank Access Scheme READ WRITE READ

PRET DRAM Controller: “Manual” Refreshes Every row needs to be refreshed every 64ms Dedicated refresh commands refresh one row in each bank at once We replace these with “manual” refreshes through reads Improves worst-case latency of short requests Dedicated refresh commands vs refreshes through reads. (refresh latencies not to scale)

PRET DRAM Controller: Defer Refreshes Refreshes do not have to happen periodically Refresh every row at least every 64 ms Schedule refreshes slightly more often than necessary  Enables to defer refreshes

General-Purpose DRAM Controller vs PRET DRAM Controller General-Purpose Controller PRET DRAM Controller Abstracts DRAM as a single shared resource Schedules refreshes dynamically Schedules commands dynamically “Open page” policy speculates on locality Abstracts DRAM as multiple independent resources Refreshes as reads: shorter interruptions Defer refreshes: improves perceived latency Follows periodic, time-triggered schedule “Closed page” policy: access-history independence

Evaluation Outline Introduction DRAM Basics Related Work: Predator and AMC PRET DRAM Controller: Main Ideas Evaluation Integration into Precision-Timed ARM

Conventional DRAM Controller (DRAMSim2) vs PRET DRAM Controller: Latency Evaluation Varying Interference: Varying Transfer Size:

PRET DRAM Controller vs Predator: Analytical Evaluation abstracts DRAM as single resource uses standard refresh mechanism PRET controller improves worst-case access latency of small transfers

PRET DRAM Controller vs Predator: Analytical Evaluation Less of a difference for larger transfers Predator provides slightly higher bandwidth due to more efficient refresh mechanism

Integration into Precision-Timed ARM Outline Introduction DRAM Basics Related Work: Predator and AMC PRET DRAM Controller: Main Ideas Evaluation Integration into Precision-Timed ARM

Precision-Timed ARM (PTARM) Architecture Overview http://chess.eecs.berkeley.edu/pret/ Thread-Interleaved Pipeline for predictable timing without sacrificing high throughput One private DRAM Resource + DMA Unit per Hardware Thread Shared Scratchpad Instruction and Data Memories for low latency access

Conclusions and Future Work Temporal isolation and improved worst-case latency by bank privatization How to program the inverted memory hierarchy? Raffaello Sanzio da Urbino – The Athens School Eh eh eh.... A little classic lecture -:)... The School represents the clash of the two views of Plato and Aristoteles about the phylosophical foundations... Plato is for the abstract... Aristoteles is for the immanent... Plato reduces everything to a general "metamodel" (the world of ideas) where the real "things" are refinements of the general metamodel. Aristoteles is for the induction process where the abstraction takes place from the real world and moves up... Ideally platform based design is the meeting of the two wolds (top down- bottom up). The school of Athens can be seen as our groups where people debate about the nature of design. Our common view seems to merge the two worlds... The models of computation crowd (e.g., you and me) who like to see things at higher levels of abstraction, the implementation crowd (e.g. Kurt) sees the need of the application focus. I believe we as a team do both and believe in the both (the vision!). Sorry for the long winded explanation... i do have a ball explaining this to people while giving talks....-:) Finally my 8 years of philosophy pay out!!!!!!!!!!!!!!! Ciao Alberto

References Related Work on Memory Controllers: M. Paolieri, E. Quiñones, F. Cazorla, and M. Valero, “An analyzable memory controller for hard real-time CMPs,” IEEE Embedded Systems Letters, vol. 1, no. 4, pp. 86–90, 2010. B. Akesson, K. Goossens, and M. Ringhofer, “Predator: a predictable SDRAM memory controller,” in CODES+ISSS. ACM, 2007, pp. 251–256. Work within the PRET project: [CODES ’11] Jan Reineke, Isaac Liu, Hiren D. Patel, Sungjun Kim, Edward A. Lee, PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation, International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), October, 2011. [DAC ‘11] Dai Nguyen Bui, Edward A. Lee, Isaac Liu, Hiren D. Patel, Jan Reineke, Temporal Isolation on Multiprocessing Architectures, Design Automation Conference (DAC), June, 2011. [Asilomar ‘10] Isaac Liu, Jan Reineke, and Edward A. Lee, PRET Architecture Supporting Concurrent Programs with Composable Timing Properties, in Signals, Systems, and Computers (ASILOMAR), Conference Record of the Forty Fourth Asilomar Conference, November 2010, Pacific Grove, California. [CASES ’08] Ben Lickly, Isaac Liu, Sungjun Kim, Hiren D. Patel, Stephen A. Edwards and Edward A. Lee, "Predictable Programming on a Precision Timed Architecture," in Proceedings of International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), Piscataway, NJ, pp. 137-146, IEEE Press, October, 2008.