Chapter 4 Memory Design: SOC and Board-Based Systems


Chapter 4 Memory Design: SOC and Board-Based Systems
Computer System Design: System-on-Chip, by M. Flynn & W. Luk, Wiley, 2011

Cache and Memory
- cache memory performance
- cache partitioning
- multi-level cache
- off-die memory designs

Outline for memory design

Area comparison of memory tech.

System environments and memory

Performance factors
- physical word size: processor ↔ cache
- block / line size: cache ↔ memory
- cache hit time: cache size, organization
- cache miss time: memory and bus
- virtual-to-real translation time
- number of processor requests per cycle

Design target miss rates
- beyond 1 MB, doubling the cache size halves the miss rate

System effects limit hit rate
- the operating system increases the miss ratio by about 20%
- so does multiprogramming (M)
- these added misses may not be reduced by increased cache size
- Q = no. of instructions between task switches

System effects
- cold-start: short transactions are created frequently and run quickly to completion
- warm-start: long processes are executed in time slices

Some common cache types

Multi-level caches: mostly on die
- useful for matching the processor to memory
- generally at least 2 levels; for microprocessors, L1 runs at the pipeline frequency and L2 at a slower latency
- often 3 levels are used
- L1 size is limited by access time, especially as cycle times improve

Cache partitioning: scaling effect on cache access time
- the access time of a cache is approximately
  access time (ns) = (0.35 + 3.8f + (0.006 + 0.025f)C) × (1 + 0.3(1 - 1/A))
  where f is the feature size in microns, C is the cache capacity in KB, and A is the associativity (A = 1 for direct-mapped)
- for example, at f = 0.1 µm, A = 1 and C = 32 KB, the access time is 1.00 ns
- the problem with small feature sizes is cache access time, not cache size
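A minimal Python sketch of the access-time formula above (function and parameter names are mine); it reproduces the slide's 1.00 ns example:

    def cache_access_time_ns(f_um, cap_kb, assoc):
        # f in microns, C in KB, A = associativity, per the slide's model
        return ((0.35 + 3.8 * f_um + (0.006 + 0.025 * f_um) * cap_kb)
                * (1 + 0.3 * (1 - 1 / assoc)))

    # f = 0.1 um, A = 1 (direct-mapped), C = 32 KB
    print(round(cache_access_time_ns(0.1, 32, 1), 2))   # 1.0 ns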

Minimum cache access time
- holds for 1 array; larger sizes use multiple arrays (interleaving)
- L3: multiple 256 KB arrays
- L2: usually less than 512 KB (interleaved from smaller arrays)
- L1: usually less than 64 KB

Analysis: multi-level cache miss rate
- L2 cache analysis by statistical inclusion: if the L2 cache is > 4 × the size of the L1 cache, assume statistically that the contents of L1 lie in L2
- relevant L2 miss rates:
  - local miss rate: no. of L2 misses / no. of L2 references
  - global miss rate: no. of L2 misses / no. of processor references
  - solo miss rate: no. of misses without L1 / no. of processor references
- inclusion implies solo miss rate = global miss rate
- miss penalty calculation: L1 miss rate × (miss-in-L1, hit-in-L2 penalty) plus L2 miss rate × (miss-in-L1, miss-in-L2 penalty - L1-to-L2 penalty)

Multi-level cache example
- miss rates: L1 = 4%, L2 = 1% (global)
- delays: miss in L1, hit in L2 = 2 cycles; miss in L1, miss in L2 = 15 cycles
- assume one reference per instruction
- L1 delay = 1 ref/instr × 0.04 misses/ref × 2 cycles/miss = 0.08 CPI
- L2 delay = 1 ref/instr × 0.01 misses/ref × (15 - 2) cycles/miss = 0.13 CPI
- total effect of the 2-level system = 0.08 + 0.13 = 0.21 CPI
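The same arithmetic as a small Python sketch (identifiers are mine, not from the book), using the global L2 miss rate as on the slide:

    def memory_stall_cpi(refs_per_instr, l1_miss_rate, l2_miss_rate,
                         l2_hit_cycles, l2_miss_cycles):
        l1_delay = refs_per_instr * l1_miss_rate * l2_hit_cycles
        l2_delay = refs_per_instr * l2_miss_rate * (l2_miss_cycles - l2_hit_cycles)
        return l1_delay + l2_delay

    print(round(memory_stall_cpi(1.0, 0.04, 0.01, 2, 15), 2))   # 0.21 CPI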

Memory design
- logical inclusion
- embedded RAM
- off-die: DRAM
- basic memory model
- Strecker's model

Physical memory system

Hierarchy of caches

  Name  What               Size        Access       Transfer size
  L0    Registers          <256 words  <1 cycle     word
  L1    Core local         <64 KB      <4 cycles    line
  L2    On chip            <64 MB      <30 cycles   -
  L3    DRAM on chip       <1 GB       <60 cycles   >= line
  M0    Off-chip cache     -           -            -
  M1    Local main memory  <16 GB      <150 cycles  -
  M2    Cluster memory     -           -            -

Hierarchy of caches
- working set: how much memory an “iteration” requires
- if it fits in a level, that level's access time is the worst case
- if it does not, the hit rate typically determines performance
- double the cache level size, halve the miss rate: a good rule of thumb
- with a 90% hit rate and a 10× memory access time, the average access is 0.9 × 1 + 0.1 × 10 = 1.9× the hit time, so performance is about 50%, and that's for 1 core
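The 90%-hit-rate claim, worked as a tiny Python sketch (a simple average-access-time model; the 1-cycle hit time is my assumption):

    def avg_access_time(hit_rate, hit_cycles, miss_cycles):
        return hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles

    # 90% hits at 1 cycle, misses cost 10x: ~1.9 cycles, roughly half speed
    print(round(avg_access_time(0.9, 1, 10), 2))   # 1.9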

Logical inclusion
- multiprocessors with L1 and L2 caches
- important: the L2 cache not containing a line must be sufficient to determine that the L1 cache does not have the line
- need to ensure all the contents of L1 are always in L2
- this property is called logical inclusion

Logical inclusion techniques
- passive control: cache size, organization, and policies chosen so inclusion holds
  - no. of L2 sets ≥ no. of L1 sets
  - L2 set size ≥ L1 set size
  - compatible replacement algorithms
  - but: highly restrictive and difficult to guarantee
- active control: whenever a line is replaced or invalidated in the L2, ensure it is not present in L1 or evict it from L1 (see the sketch below)
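A toy Python sketch of active inclusion control (my own construction: line addresses only, no associativity or data):

    class InclusiveCachePair:
        def __init__(self):
            self.l1 = set()   # line addresses currently in L1
            self.l2 = set()   # line addresses currently in L2

        def fill(self, line):
            # every L1 fill also fills L2, so L1 stays a subset of L2
            self.l2.add(line)
            self.l1.add(line)

        def l2_evict(self, line):
            # active control: back-invalidate L1 whenever L2 loses a line
            self.l2.discard(line)
            self.l1.discard(line)

After any sequence of fills and L2 evictions, l1 <= l2 always holds, which is exactly the inclusion property the slide requires.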

Memory system design outline
- memory chip technology: on-die or off-die
- static versus dynamic: SRAM versus DRAM
- access protocol: talking to memory; synchronous vs asynchronous DRAMs
- simple memory performance model: Strecker's model for memory banks

Why BIG memory?

Memory
- many times, computation is limited by memory, not by processor organization or cycle time
- memory is characterized by 3 parameters:
  - size
  - access time (latency)
  - cycle time (bandwidth)

Embedded RAM

Embedded RAM density (1)

Embedded RAM density (2)

Embedded RAM cycle time

Embedded RAM error rates

Off-die memory module
- the module contains the DRAM chips that make up the physical memory word
- if each DRAM chip is organized as 2^n words × b bits and the memory has p bits per physical word, then the module has p/b DRAM chips
- total memory size is then 2^n words × p bits
- parity or an error-correction code (ECC) is generally required for error detection and availability
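The chip-count arithmetic as a short Python sketch (names are mine):

    def module_geometry(n, b, p):
        # chips of 2**n words x b bits, physical word of p bits
        assert p % b == 0, "physical word must be a whole number of chips"
        chips = p // b
        total_bits = (2 ** n) * p
        return chips, total_bits

    # e.g. 2**24-word x8 chips and a 64-bit word: prints (8, 1073741824),
    # i.e. 8 chips and 2**30 bits = 128 MB total
    print(module_geometry(24, 8, 64))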

Simple asynchronous DRAM array
- DRAM cell:
  - capacitor: stores charge for the 0/1 state
  - transistor: switches the capacitor onto the bit line
  - charge decays, so refresh is required
- DRAM array:
  - stores 2^n bits in a square array
  - 2^(n/2) row lines connect to the data lines
  - 2^(n/2) column bit lines connect to the sense amplifiers

DRAM basics
- a row read is destructive
- sequence:
  1. read a row from the dynamic memory into SRAM (>1000 bits)
  2. select a word (<64 bits)
  3. write the word into the row (on a write)
  4. repeat until done with the row
  5. write the row back into the dynamic memory
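A toy Python model of that sequence (my own construction, with a small 8 × 8 array instead of the >1000-bit rows on the slide): the destructive read moves a row into an SRAM buffer, words are accessed there, and precharge writes the row back:

    class DramBank:
        def __init__(self, rows=8, cols=8):
            self.array = [[0] * cols for _ in range(rows)]
            self.row_buffer = None
            self.open_row = None

        def activate(self, row):
            # read an entire row into the SRAM buffer; the read is destructive
            self.row_buffer = self.array[row][:]
            self.array[row] = [0] * len(self.row_buffer)   # cells lose their charge
            self.open_row = row

        def read_word(self, col):
            return self.row_buffer[col]        # select a word within the open row

        def write_word(self, col, value):
            self.row_buffer[col] = value       # writes update the row buffer

        def precharge(self):
            # write the (possibly updated) row back into the dynamic array
            self.array[self.open_row] = self.row_buffer
            self.row_buffer = None
            self.open_row = None

    bank = DramBank()
    bank.activate(3)
    bank.write_word(5, 1)
    bank.precharge()        # row 3 now holds the update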

DRAM timing
- row and column addresses are multiplexed on the same pins
- row and column address strobes (RAS, CAS) provide the timing

Increase DRAM bandwidth
- burst mode: aka page mode, nibble mode, fast page mode
- synchronous DRAM (SDRAM)
- DDR SDRAM: DDR1, DDR2, DDR3

DDR SDRAM (Double Data Rate Synchronous DRAM)

Burst mode
- save the most recently accessed row (“page”): only a column address + CAS is needed to access within the page
- most DDR SDRAMs: multiple rows can be open
- an address counter in each row buffer supports sequential accesses: only CAS (DRAM) or the bus clock (SDRAM) is needed for each sequential access
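A toy latency model (my own construction; T_RAS and T_CAS are illustrative numbers, not real part timings) showing why page hits are cheap:

    T_RAS, T_CAS = 30, 10

    def total_latency(addresses, cols_per_row=1024):
        open_row, cycles = None, 0
        for addr in addresses:
            row = addr // cols_per_row
            if row == open_row:
                cycles += T_CAS             # page hit: column access only
            else:
                cycles += T_RAS + T_CAS     # page miss: open the new row first
                open_row = row
        return cycles

    print(total_latency(range(64)))                       # one row: 40 + 63*10 = 670
    print(total_latency([i * 1024 for i in range(64)]))   # all misses: 64*40 = 2560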

Configuration parameters: parameters for typical DRAM chips used in a 64-bit module

DRAM timing

Physical memory system

Basic memory model
- assume n processors, each making 1 request per Tc to one of m memories
- B(m, n): number of successes per memory cycle
- Tc: memory cycle time
- one processor making n requests per Tc behaves as n processors each making 1 request per Tc

Achieved vs. offered bandwidth
- offered request rate: the rate at which the processor(s) would make requests if the memory had unlimited bandwidth and no contention

Basic terms
- B = B(m, n) or B(m): number of requests that succeed each Tc (= average number of busy modules); B is bandwidth normalized to Tc
- Ts: a more general term for service time; here Tc = Ts
- BW: achieved bandwidth in requests serviced per second; BW = B / Ts = B(m, n) / Ts

Modeling + evaluation methodology
- relevant physical parameters for memory: word size, module size, number of modules, cycle time Tc (= Ts)
- find the offered bandwidth: number of requests / Ts
- find the bottleneck: performance is limited by the most restrictive service point

Strecker's model: compute B(m, n)
- model description:
  - each processor generates 1 reference per cycle
  - requests are randomly/uniformly distributed over the modules
  - any busy module serves 1 request
  - all unserviced requests are dropped each cycle (no queues)
- B(m, n) = m[1 - (1 - 1/m)^n]
- relative performance: Prel = B(m, n) / n

Deriving Strecker's model
- Prob[a given processor does not reference a given module] = 1 - 1/m
- Prob[no processor references the module] = P[idle] = (1 - 1/m)^n
- Prob[module busy] = 1 - (1 - 1/m)^n
- the average number of busy modules is B(m, n) = m[1 - (1 - 1/m)^n]
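Strecker's formula is direct to implement; this Python sketch reproduces the two examples that follow (the second prints Prel = 0.99 at full precision, where the slide's 0.98 comes from rounding B to 1.18 first):

    def strecker_b(m, n):
        # B(m, n) = m * (1 - (1 - 1/m)**n): expected number of busy modules
        return m * (1 - (1 - 1 / m) ** n)

    for m, n in [(4, 1.152), (8, 1.2)]:
        b = strecker_b(m, n)
        print(f"B({m}, {n}) = {b:.2f}, Prel = {b / n:.2f}")
    # B(4, 1.152) = 1.13, Prel = 0.98
    # B(8, 1.2) = 1.18, Prel = 0.99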

Example 1
- 2 dual-core processor dice share a memory with Ts = 24 ns; each die has 2 processors sharing a 4 MB L2
- the L2 miss rate is 0.001 misses/reference; each processor makes 3 references/cycle at 4 GHz
- request rate = 2 dice × 2 processors × 3 refs/cycle × 0.001 misses/ref = 0.012 requests/cycle
- Ts = 4 cycles/ns × 24 ns = 96 cycles, so n = 0.012 × 96 = 1.152 processor requests per Ts; take m = 4
- success rate: B(m, n) = B(4, 1.152) = 4[1 - (0.75)^1.152] ≈ 1.13
- relative performance = B/n ≈ 1.13/1.152 ≈ 0.98

Example 2
- 8-way interleaved associative data cache
- the processor issues 2 LD/ST per cycle, with 0.6 data references per cycle per port
- n = 2 × 0.6 = 1.2; m = 8
- B(m, n) = B(8, 1.2) ≈ 1.18
- relative performance = B/n = 1.18/1.2 ≈ 0.98

Summary
- cache memory: performance, cache partitioning, multi-level cache
- memory chip technology: on-die or off-die; static versus dynamic (SRAM versus DRAM)
- access protocol: talking to memory; synchronous vs asynchronous DRAMs
- simple memory performance model: Strecker's model for memory banks