
CS6290 Memory

Views of Memory
Real machines have limited amounts of memory
–640KB? A few GB? (This laptop = 2GB)
Programmer doesn’t want to be bothered
–Do you think, “oh, this computer only has 128MB so I’ll write my code this way…”?
–What happens if you run on a different machine?

Programmer’s View
Example: 32-bit memory
–When programming, you don’t care about how much real memory there is
–Even if you use a lot, memory can always be paged to disk
[Figure: 4GB virtual address space with Kernel, Text, Data, Heap, and Stack regions; user space spans 0-2GB, top of memory at 4GB. A.k.a. virtual addresses]

Programmer’s View
Really “Program’s View”
Each program/process gets its own 4GB space
[Figure: three processes, each with its own Kernel/Text/Data/Heap/Stack address space]

CPU’s View
At some point, the CPU is going to have to load-from/store-to memory… all it knows is the real, a.k.a. physical memory
–… which unfortunately is often < 4GB
–… and is never 4GB per process

Pages
Memory is divided into pages, which are nothing more than fixed-size, aligned regions of memory
–Typical size: 4KB/page (but not always)
[Figure: memory divided into Page 0, Page 1, Page 2, Page 3, …]

Page Table
Map from virtual addresses to physical locations
–“Physical location” may include the hard disk
The page table implements this V → P mapping
[Figure: virtual pages 0K–12K mapped to physical pages among 0K–28K]

Page Tables
[Figure: two processes’ page tables, each mapping virtual pages 0K–12K into a shared physical memory 0K–28K]

Need for Translation
[Figure: a virtual address splits into a virtual page number and a page offset; the page table translates the VPN, and main memory is accessed with the resulting physical address]
Example: virtual address 0xFC51908B → VPN 0xFC519, page offset 0x08B
The page table maps VPN 0xFC519 → physical page 0x00152
Physical address = 0x0015208B
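A minimal C sketch of this split, assuming 4KB pages (12 offset bits) and a 32-bit address; the VPN 0xFC519 → PPN 0x00152 mapping is the example from the slide:

#include <stdint.h>
#include <assert.h>

#define PAGE_SHIFT 12                 /* 4KB pages -> 12 offset bits */
#define PAGE_MASK  0xFFFu

int main(void) {
    uint32_t vaddr  = 0xFC51908B;
    uint32_t vpn    = vaddr >> PAGE_SHIFT;          /* 0xFC519 */
    uint32_t offset = vaddr & PAGE_MASK;            /* 0x08B   */
    uint32_t ppn    = 0x00152;                      /* page table lookup result */
    uint32_t paddr  = (ppn << PAGE_SHIFT) | offset; /* 0x0015208B */
    assert(vpn == 0xFC519 && offset == 0x08B && paddr == 0x0015208B);
    return 0;
}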

Simple Page Table
Flat organization: one entry per page
–Entry contains the physical page number (PPN), or indicates the page is on disk or invalid
–Also meta-data (e.g., permissions, dirtiness, etc.)
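As a sketch, one such entry might look like this in C (the field names and widths are illustrative assumptions, not from the slides):

#include <stdint.h>

/* One flat page-table entry (illustrative layout) */
typedef struct {
    uint32_t ppn      : 20;  /* physical page number */
    uint32_t valid    : 1;   /* mapping exists at all */
    uint32_t present  : 1;   /* in DRAM (0 = paged out to disk) */
    uint32_t dirty    : 1;   /* page has been written */
    uint32_t writable : 1;   /* permission: stores allowed */
    uint32_t user     : 1;   /* permission: user-mode access allowed */
} pte_t;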

Multi-Level Page Tables
[Figure: the virtual page number is split into a Level 1 index and a Level 2 index; together with the page offset, the two-level lookup yields the physical page number]
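A sketch of that index split for a 32-bit address with 4KB pages and two 10-bit levels (a common layout, but the exact widths here are an assumption):

#include <stdint.h>

void split_vaddr(uint32_t vaddr, uint32_t *l1, uint32_t *l2, uint32_t *off) {
    *l1  = (vaddr >> 22) & 0x3FF;  /* top 10 bits index the level-1 table */
    *l2  = (vaddr >> 12) & 0x3FF;  /* next 10 bits index a level-2 table */
    *off =  vaddr        & 0xFFF;  /* low 12 bits: page offset (4KB pages) */
}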

Choosing a Page Size
Page size is inversely proportional to page table overhead
Large page size permits more efficient transfer to/from disk
–vs. many small transfers
–Like downloading one large file from the Internet vs. many small ones
Small page size leads to less fragmentation
–A big page is likely to have more bytes unused
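To make the table-overhead side concrete: with a 32-bit virtual address, 4KB pages, and 4-byte entries, a flat page table needs 2^20 entries = 4MB per process; with 64KB pages it shrinks to 2^16 entries = 256KB, at the cost of wasting, on average, half of a (now larger) page per allocated region to internal fragmentation.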

CPU Memory Access
Program deals with virtual addresses
–“Load R1 = 0[R2]”
On a memory instruction (see the sketch below):
1. Compute the virtual address (0[R2])
2. Compute the virtual page number
3. Compute the physical address of the VPN’s page table entry
4. Load* the mapping
5. Compute the physical address
6. Do the actual Load* from memory
*Could be more loads, depending on page table organization
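Putting those steps together, a minimal sketch for a flat page table (page_table_base and mem_load are hypothetical stand-ins for hardware state and physical memory reads):

#include <stdint.h>

extern uint32_t page_table_base;            /* hypothetical: PA of the flat table */
extern uint32_t mem_load(uint32_t paddr);   /* hypothetical: physical memory read */

uint32_t virtual_load(uint32_t r2) {
    uint32_t vaddr  = r2 + 0;                       /* 1. virtual address (0[R2]) */
    uint32_t vpn    = vaddr >> 12;                  /* 2. virtual page number */
    uint32_t pte_pa = page_table_base + vpn * 4;    /* 3. PA of the page table entry */
    uint32_t pte    = mem_load(pte_pa);             /* 4. load the mapping */
    uint32_t paddr  = ((pte & 0xFFFFF) << 12)       /* 5. physical address = */
                    | (vaddr & 0xFFF);              /*    PPN plus page offset */
    return mem_load(paddr);                         /* 6. the actual load */
}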

Impact on Performance?
Every time you load/store, the CPU must perform two (or more) memory accesses!
Even worse, every fetch requires translation of the PC!
Observation:
–Once a virtual page is mapped into a physical page, it’ll likely stay put for quite some time

Idea: Caching!
Not caching of data, but caching of translations
[Figure: the same virtual-to-physical mapping as before, with one cached entry: VPN 8 → PPN 16]

Translation Cache: TLB
TLB = Translation Look-aside Buffer
[Figure: the virtual address goes to the TLB, which produces the physical address used to check the cache tags for a hit]
If the TLB hits, there is no need to do a page table lookup from memory
Note: the data cache is accessed by physical addresses now
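A minimal sketch of a fully-associative TLB lookup (the sizes and the walk_page_table slow path are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 32

typedef struct { uint32_t vpn, ppn; bool valid; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];

extern uint32_t walk_page_table(uint32_t vpn);   /* hypothetical slow path */

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn = vaddr >> 12, off = vaddr & 0xFFF;
    for (int i = 0; i < TLB_ENTRIES; i++)        /* hardware compares all entries in parallel */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << 12) | off;     /* hit: no memory access needed */
    uint32_t ppn = walk_page_table(vpn);         /* miss: do the page table walk */
    /* ...install the new translation in the TLB (replacement policy omitted)... */
    return (ppn << 12) | off;
}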

PAPT Cache
Previous slide showed a Physically-Addressed, Physically-Tagged cache
–Sometimes called PIPT (I = Indexed)
Con: TLB lookup and cache access are serialized
–Caches already take > 1 cycle
Pro: cache contents stay valid so long as the page table is not modified

Virtually Addressed Cache
(VIVT: virtually indexed, virtually tagged)
[Figure: the virtual address indexes the cache directly; the TLB is consulted only on a cache miss to produce the physical address sent to L2]
Pro: latency – no need to check the TLB
Con: the cache must be flushed on a process change
Con: how to enforce permissions?

Virtually Indexed Physically Tagged
[Figure: the virtual address indexes the cache while the TLB translates in parallel; the physical tag from the cache is compared against the TLB’s physical address]
Pro: latency – the TLB lookup is parallelized with cache indexing
Pro: don’t need to flush the $ on a process swap
Con: limit on cache indexing (can only use bits not from the VPN/PPN, i.e., page-offset bits)
–A big page size can help here
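For example, with 4KB pages the low 12 address bits are untranslated; if cache lines are 64B (6 offset bits), only 12 − 6 = 6 bits remain for the set index, limiting the cache to 64 sets (4KB per way, so 32KB for an 8-way cache) unless the page size grows.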

TLB Design
Often fully-associative
–For latency, this means few entries
–However, each entry covers a whole page
–Ex. 32-entry TLB, 4KB pages… how big a working set can you have while avoiding TLB misses?
If many misses:
–Increase TLB size (latency problems)
–Increase page size (fragmentation problems)
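For that example: 32 entries × 4KB per page = 128KB of working set covered without TLB misses.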

Process Changes
With physically-tagged caches, you don’t need to flush the cache on a context switch
–But the TLB is no longer valid!
–Add a process ID to each translation
[Figure: TLB entries (PID:0, VPN:8 → PPN:28) and (PID:1, VPN:8 → PPN:44) coexisting]
Only flush the TLB when recycling PIDs

SRAM vs. DRAM
DRAM = Dynamic RAM
SRAM: 6T per bit
–Built with the normal high-speed CMOS technology
DRAM: 1T per bit
–Built with a special DRAM process optimized for density

Hardware Structures
[Figure: an SRAM cell with a wordline and complementary bitlines (b and b̄) vs. a DRAM cell with a wordline, a single bitline, and a storage capacitor]

Implementing the Capacitor
[Figure: “trench cell” cross-section, labeling the cell plate Si, capacitor insulator, storage node poly, field oxide, refilling poly, and Si substrate]
DRAM figures on this slide were taken from Prof. Nikolic’s EECS141/2003 lecture notes from UC-Berkeley

DRAM Chip Organization
[Figure: the row address drives a row decoder into the memory cell array; sense amps capture a row into the row buffer, and the column address selects data out through the column decoder onto the data bus]

DRAM Chip Organization (2)
Differences with SRAM:
–Reads are destructive: contents are erased by reading
–The row buffer reads lots of bits all at once, and then parcels them out based on different column addresses
–Similar to reading a full cache line, but only accessing one word at a time
“Fast-Page Mode” (FPM) DRAM organizes the DRAM row to contain the bits for a complete page
–The row address is held constant, followed by fast reads from different locations in the same page

DRAM Read Operation
[Figure: row 0x1FE is decoded and captured in the row buffer; column reads at 0x000, 0x001, 0x002 stream out over the data bus]
Accesses need not be sequential

Destructive Read
[Figure: bitline and storage-cell voltage waveforms as the wordline and then the sense amp are enabled, for a stored 1 and a stored 0]
After a read of 0 or 1, the cell contains something close to ½ Vdd

Row Buffer Refresh
So after a read, the contents of the DRAM cells are gone
–The values are still stored in the row buffer
–Write them back into the cells for the next read in the future
[Figure: sense amps writing the row buffer contents back into the DRAM cells]

Refresh (2)
Fairly gradually, a DRAM cell will lose its contents even if it’s not accessed
–This is why it’s called “dynamic”
–Contrast with SRAM, which is “static” in that once written, it maintains its value forever (so long as power remains on)
[Figure: a stored 1 decaying toward 0 due to gate leakage]
All DRAM rows need to be regularly read and re-written
If it keeps its value even when power is removed, then it’s “non-volatile” (e.g., flash, HDD, DVDs)

DRAM Read Timing
Accesses are asynchronous: triggered by the RAS and CAS signals, which can in theory occur at arbitrary times (subject to DRAM timing constraints)

SDRAM Read Timing
[Figure: SDRAM read timing, showing a burst of data words returned per command (the burst length)]
Double-Data Rate (DDR) DRAM transfers data on both the rising and falling edges of the clock
–The command frequency does not change
Timing figures taken from “A Performance Comparison of Contemporary DRAM Architectures” by Cuppu, Jacob, Davis, and Mudge
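For instance, assuming a 64-bit (8-byte) data bus clocked at 200 MHz: DDR transfers on both edges, so peak bandwidth is 200M × 2 × 8B = 3.2 GB/s, while commands can still issue at most once per 200 MHz cycle.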

Rambus (RDRAM)
Synchronous interface
Row buffer cache
–The last 4 rows accessed are cached
–Higher probability of a low-latency hit
–DRDRAM increases this to 8 entries
Uses other tricks that have since been adopted by SDRAM
–Multiple data words per clock, high frequencies
Chips can self-refresh
Expensive for PCs; used by the X-Box and PS2

Example Memory Latency Computation
FSB freq = 200 MHz; SDRAM RAS delay = 2, CAS delay = 2
Access stream: A0, A1, B0, C0, D3, A2, D0, C1, A3, C3, C2, D1, B1, D2
What’s this in CPU cycles? (assume 2GHz)
Impact on AMAT?
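One way to start, assuming each newly opened row needs RAS then CAS: a 200 MHz bus cycle is 5 ns, so RAS (2) + CAS (2) = 4 bus cycles = 20 ns before the first data word. At 2GHz a CPU cycle is 0.5 ns, so each bus cycle is 10 CPU cycles and that 20 ns is 40 CPU cycles; back-to-back accesses to an already-open row can skip the RAS. Any such penalty enters AMAT as hit time + miss rate × miss penalty.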

More Latency
Width/speed varies depending on the memory type
Significant wire delay just getting from the CPU to the memory controller
More wire delay getting to the memory chips (plus the return trip…)

Memory Controller
[Figure: the memory controller sits between the CPU and the DRAM banks; read, write, and response queues face the CPU, and a scheduler plus buffer drive the command and data paths to Bank 0 and Bank 1]
Like a write-combining buffer, the scheduler may coalesce multiple accesses together, or re-order them to reduce the number of row accesses
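A hedged sketch of one such re-ordering policy, “open-row-first” (the request record and queue here are illustrative, not the slides’ design):

#include <stdint.h>

typedef struct { uint32_t row, col; int valid; } req_t;

#define QLEN 16
static req_t    read_q[QLEN];
static uint32_t open_row = ~0u;    /* row currently latched in this bank’s row buffer */

int pick_next(void) {
    for (int i = 0; i < QLEN; i++)     /* first pass: requests hitting the open row (CAS only) */
        if (read_q[i].valid && read_q[i].row == open_row)
            return i;
    for (int i = 0; i < QLEN; i++)     /* fallback: oldest valid request (needs RAS + CAS) */
        if (read_q[i].valid)
            return i;
    return -1;                         /* queue empty */
}

A real scheduler would also bound this re-ordering so row-miss requests can’t starve, and must respect the ordering rules on the next slide.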

Memory Reference Scheduling
Just like with registers, need to enforce RAW, WAW, and WAR dependencies
No “memory renaming” in the memory controller, so all three dependencies must be enforced
Like everything else, still need to maintain the appearance of sequential access
–Consider multiple read/write requests to the same address
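For example, if a write to address 0x100 is waiting in the write queue and a younger read to 0x100 arrives, the read must either wait behind the write or be serviced from the queued write data; re-ordering it ahead of the write would violate the RAW dependence.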

Example Memory Latency Computation (3)
FSB freq = 200 MHz; SDRAM RAS delay = 2, CAS delay = 2
Scheduling in the memory controller
Access stream: A0, A1, B0, C0, D3, A2, D0, C1, A3, C3, C2, D1, B1, D2
Think about the hardware complexity…

So what do we do about it?
Caching
–Reduces average memory instruction latency by avoiding DRAM altogether
Limitations
–Capacity: programs keep increasing in size
–Compulsory misses

Faster DRAM Speed
Clock the FSB faster
–DRAM chips may not be able to keep up
Latency is dominated by wire delay
–Bandwidth may be improved (DDR vs. regular), but latency doesn’t change much
–Instead of 2 cycles for a row access, it may take 3 cycles at the faster bus speed
Doesn’t address the latency of the memory access

On-Chip Memory Controller
All on the same chip:
–No slow PCB wires to drive
–The memory controller can run at CPU speed instead of the FSB clock speed
Disadvantage: the memory type is now tied to the CPU implementation
Also enables more sophisticated memory scheduling algorithms