Chapter 4 Memory Design: SOC and Board-Based Systems Computer System Design System-on-Chip by M. Flynn & W. Luk Pub. Wiley 2011 (copyright 2011)
Cache and Memory
- cache memory: performance, cache partitioning, multi-level cache
- memory: off-die memory designs
Outline for memory design
Area comparison of memory tech.
System environments and memory
Performance factors
- physical word size (processor to cache)
- block / line size (cache to memory)
- cache hit time (cache size, organization)
- cache miss time (memory and bus)
- virtual-to-real translation time
- number of processor requests per cycle
Design target miss rates
- beyond 1 MB: doubling the cache size halves the miss rate
System effects limit hit rate
- the operating system affects the miss ratio: about a 20% increase
- so does multiprogramming (M)
- miss rates may not be affected by increased cache size
- Q = number of instructions between task switches
System effects
- Cold-start: short transactions are created frequently and run quickly to completion
- Warm-start: long processes are executed in time slices
Some common cache types
Multi-level caches: mostly on die
- useful for matching processor to memory
- generally at least 2 levels
- for microprocessors: L1 at the frequency of the pipeline and L2 at slower latency
- often 3 levels are used
- size limited by access time and improved cycle times
Cache partitioning: scaling effect on cache access time
- access time to a cache is approximately
  access time (ns) = (0.35 + 3.8f + (0.006 + 0.025f)C) x (1 + 0.3(1 - 1/A))
  where f is the feature size in microns, C is the cache capacity in KB, and A is the associativity (direct mapped: A = 1)
- for example, at f = 0.1 micron, A = 1 and C = 32 KB the access time is 1.00 ns
- problem with small feature size: cache access time, not cache size
Minimum cache access time
- 1 array; larger sizes use multiple arrays (interleaving)
- L3: multiple 256 KB arrays
- L2: usually less than 512 KB (interleaved from smaller arrays)
- L1: usually less than 64 KB
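The access-time approximation above can be checked numerically. The following is a minimal sketch; the function name and parameter names are illustrative choices, not part of the original slides.

```python
def cache_access_time_ns(f_um: float, c_kb: float, assoc: int) -> float:
    """Approximate cache access time in ns.

    f_um  : feature size in microns (f in the formula)
    c_kb  : cache capacity in KB (C in the formula)
    assoc : associativity (A in the formula; 1 = direct mapped)
    """
    base = 0.35 + 3.8 * f_um + (0.006 + 0.025 * f_um) * c_kb
    assoc_factor = 1 + 0.3 * (1 - 1 / assoc)
    return base * assoc_factor


# The slide's example: f = 0.1 micron, C = 32 KB, direct mapped
print(f"{cache_access_time_ns(0.1, 32, 1):.2f} ns")  # 1.00 ns
```

Evaluating the same function for larger C or A shows how quickly access time grows with capacity and associativity at a fixed feature size.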
Analysis: multi-level cache miss rate
- L2 cache analysis by statistical inclusion
  - if the L2 cache is more than 4 x the size of the L1 cache, assume statistically that the contents of L1 lie in L2
- relevant L2 miss rates
  - local miss rate: no. L2 misses / no. L2 references
  - global miss rate: no. misses / no. processor references
  - solo miss rate: no. misses without L1 / no. processor references
  - inclusion => solo miss rate = global miss rate
- miss penalty calculation
  - L1 miss rate x (miss in L1, hit in L2 penalty), plus
  - L2 miss rate x (miss in L1, miss in L2 penalty - L1 to L2 penalty)
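To make the difference between the local and global L2 miss rates concrete, here is a small sketch. It assumes every L1 miss produces exactly one L2 reference, and the event counts are hypothetical values chosen only for illustration.

```python
def l2_miss_rates(proc_refs: int, l1_misses: int, l2_misses: int):
    """Return (local, global) L2 miss rates.

    Assumption: each L1 miss becomes one L2 reference,
    so the number of L2 references equals l1_misses.
    """
    local_rate = l2_misses / l1_misses    # per L2 reference
    global_rate = l2_misses / proc_refs   # per processor reference
    return local_rate, global_rate


# Hypothetical counts: 1000 processor references, 40 L1 misses, 10 L2 misses
local_rate, global_rate = l2_miss_rates(1000, 40, 10)
print(local_rate, global_rate)  # 0.25 0.01
```

The local rate looks much worse than the global rate because the L1 has already filtered out most references before they reach the L2.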
Multi-level cache example
- miss rates: L1 4%, L2 1%
- delays: miss in L1, hit in L2: 2 cycles; miss in L1, miss in L2: 15 cycles
- assume one reference/instruction
- L1 delay is 1 ref/instr x 0.04 misses/ref x 2 cycles/miss = 0.08 cpi
- L2 delay is 1 ref/instr x 0.01 misses/ref x (15 - 2) cycles/miss = 0.13 cpi
- total effect of the 2-level system is 0.08 + 0.13 = 0.21 cpi
Memory design
- logical inclusion
- embedded RAM
- off-die: DRAM
- basic memory model
- Strecker’s model
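The miss-penalty arithmetic for a two-level cache can be sketched as follows. The function and argument names are illustrative; the numbers are the ones given in the example (4% and 1% miss rates, 2-cycle and 15-cycle delays, one reference per instruction).

```python
def miss_delay_cpi(refs_per_instr, l1_miss_rate, l2_miss_rate,
                   l1_miss_l2_hit_cycles, l2_miss_cycles):
    """Return (L1 delay, L2 delay, total) in cycles per instruction."""
    l1_delay = refs_per_instr * l1_miss_rate * l1_miss_l2_hit_cycles
    # An L2 miss has already been charged the L1-to-L2 penalty above,
    # so only the additional cycles are charged here.
    l2_delay = refs_per_instr * l2_miss_rate * (l2_miss_cycles - l1_miss_l2_hit_cycles)
    return l1_delay, l2_delay, l1_delay + l2_delay


l1_delay, l2_delay, total = miss_delay_cpi(1, 0.04, 0.01, 2, 15)
print(f"{l1_delay:.2f} + {l2_delay:.2f} = {total:.2f} cpi")  # 0.08 + 0.13 = 0.21 cpi
```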
Physical memory system
Hierarchy of caches

Name  Description        Size        Access       Transfer size
L0    Registers          <256 words  <1 cycle     word
L1    Core local         <64K        <4 cycles    line
L2    On chip            <64M        <30 cycles
L3    DRAM on chip       <1G         <60 cycles   >= line
M0    Off-chip cache
M1    Local main memory  <16G        <150 cycles
M2    Cluster memory
Hierarchy of caches
- working set: how much memory an “iteration” requires
- if the working set fits in a level, that level's access time is the worst case
- if it does not, the hit rate typically determines performance
- doubling the size of a cache level halves the miss rate: a good rule of thumb
- with a 90% hit rate and a miss costing 10x the hit access time, performance drops to about 50%
- and that is for 1 core
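The "90% hit rate, 10x access time, about 50% performance" figure follows from the average access time. A minimal sketch, assuming a hit costs one time unit, a miss costs `miss_cost_ratio` units in total, and performance is inversely proportional to average access time (all simplifying assumptions made here for illustration):

```python
def relative_performance(hit_rate: float, miss_cost_ratio: float) -> float:
    """Performance relative to an ideal level that always hits."""
    avg_access = hit_rate * 1.0 + (1.0 - hit_rate) * miss_cost_ratio
    return 1.0 / avg_access


print(f"{relative_performance(0.90, 10):.2f}")  # 0.53, i.e. roughly half
```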
Logical inclusion
- applies to multiprocessors with L1 and L2 caches
- important: knowing that the L2 cache does not have a line must be sufficient to determine that the L1 cache does not contain it
- need to ensure all the contents of L1 are always in L2
- this property: logical inclusion
Logical inclusion techniques
- passive: control cache size, organization, policies
  - no. L2 sets ≥ no. L1 sets
  - L2 set size ≥ L1 set size
  - compatible replacement algorithms
  - but: highly restrictive and difficult to guarantee
- active: whenever a line is replaced or invalidated in the L2, ensure it is not present in L1 or it is evicted from L1
Memory system design outline
- memory chip technology: on-die or off-die
- static versus dynamic: SRAM versus DRAM
- access protocol: talking to memory; synchronous vs asynchronous DRAMs
- simple memory performance model: Strecker’s model for memory banks
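The active technique can be illustrated with a toy model. This sketch assumes a single fully associative set per cache with LRU replacement, and assumes L1 hits are not reported to L2; the class and method names are hypothetical and the model ignores coherence traffic entirely.

```python
from collections import OrderedDict


class InclusiveHierarchy:
    """Toy L1/L2 pair that enforces inclusion by back-invalidation."""

    def __init__(self, l1_lines: int, l2_lines: int):
        self.l1 = OrderedDict()   # least recently used entry first
        self.l2 = OrderedDict()
        self.l1_cap = l1_lines
        self.l2_cap = l2_lines

    def access(self, line: int) -> None:
        if line in self.l1:
            self.l1.move_to_end(line)        # L1 hit: L2 never sees it
            return
        if line in self.l2:
            self.l2.move_to_end(line)
        else:
            if len(self.l2) >= self.l2_cap:
                victim, _ = self.l2.popitem(last=False)
                self.l1.pop(victim, None)    # active control: evict from L1 too
            self.l2[line] = True
        if len(self.l1) >= self.l1_cap:
            self.l1.popitem(last=False)
        self.l1[line] = True

    def is_inclusive(self) -> bool:
        return all(line in self.l2 for line in self.l1)


h = InclusiveHierarchy(l1_lines=2, l2_lines=4)
for line in [0, 1, 0, 2, 0, 3, 0, 4]:
    h.access(line)
print(h.is_inclusive(), 0 in h.l1)  # True False
```

Line 0 stays hot in L1, but L2 never sees those hits, so L2 eventually picks it as its LRU victim. Without the back-invalidation step, L1 would then hold a line that L2 does not, which is exactly the case logical inclusion rules out.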
Why BIG memory?
Memory
- many times, computation is limited by memory, not by processor organization or cycle time
- memory is characterized by 3 parameters
  - size
  - access time: latency
  - cycle time: bandwidth
Embedded RAM
Embedded RAM density (1)
Embedded RAM density (2)
Embedded RAM cycle time
Embedded RAM error rates
Off-die Memory Module
- the module contains the DRAM chips that make up the physical memory word
- if the DRAM is organized as $2^n$ words x b bits and the memory has p bits per physical word, then the module has p/b DRAM chips
- total memory size is then $2^n$ words x p bits
- parity or Error-Correction Code (ECC) is generally required for error detection and availability
Simple asynchronous DRAM array
- DRAM cell
  - capacitor: stores charge for the 0/1 state
  - transistor: switches the capacitor to the bit line
  - charge decays => refresh required
- DRAM array
  - stores $2^n$ bits in a square array
  - $2^{n/2}$ row lines connect to data lines
  - $2^{n/2}$ column bit lines connect to sense amplifiers
DRAM basics
- row read is destructive
- sequence
  - read row into SRAM from dynamic memory (>1000 bits)
  - select word (<64 bits)
  - for a write, write the word into the row
  - repeat until done with the row
  - write the row back into dynamic memory
DRAM timing
- row and column addresses are multiplexed
- row and column strobes provide timing
Increase DRAM bandwidth
- burst mode: aka page mode, nibble mode, fast page mode
- synchronous DRAM (SDRAM)
- DDR SDRAM: DDR1, DDR2, DDR3
DDR SDRAM (Double Data Rate Synchronous DRAM)
Burst mode
- burst mode
  - save the most recently accessed row (“page”)
  - only need column address + CAS to access within the page
- most DDR SDRAMs: multiple rows can be open
  - address counter in each row for sequential accesses
  - only need CAS (DRAM) or bus clock (SDRAM) for sequential accesses
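The module sizing rule above can be sketched as a short calculation. The chip organization in the example call is a hypothetical value picked for illustration, and parity or ECC bits are deliberately left out.

```python
def module_config(n_addr_bits: int, b_bits_per_chip: int, p_bits_per_word: int):
    """Return (number of DRAM chips, total module size in bits).

    Each chip is organized as 2**n_addr_bits words x b_bits_per_chip bits;
    the physical memory word is p_bits_per_word wide (no ECC bits counted).
    """
    if p_bits_per_word % b_bits_per_chip != 0:
        raise ValueError("p must be a multiple of b")
    chips = p_bits_per_word // b_bits_per_chip
    total_bits = (2 ** n_addr_bits) * p_bits_per_word
    return chips, total_bits


# Hypothetical chip: 2**26 words x 8 bits, used in a 64-bit module
chips, total_bits = module_config(26, 8, 64)
print(chips, total_bits // 8 // 2**20, "MiB")  # 8 512 MiB
```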
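Because the row and column addresses share the same pins, the controller splits each cell address into two halves and presents them one after the other. A minimal sketch, assuming a square array with an even number of address bits and the upper half used as the row address (the bit assignment is an assumption made here, not something the slides specify):

```python
def split_address(addr: int, n_bits: int):
    """Split an n-bit cell address into (row, column) halves."""
    half = n_bits // 2
    row = addr >> half                 # presented first, with the row strobe
    col = addr & ((1 << half) - 1)     # presented second, with the column strobe
    return row, col


row, col = split_address(0xABCDE, 20)
print(hex(row), hex(col))  # 0x2af 0xde
```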
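The benefit of keeping the most recently accessed row open can be seen by counting how often a new row has to be opened for a given address stream. This sketch assumes a single bank with one open row at a time; the function name and the 10-bit column width are illustrative assumptions.

```python
def count_row_activations(addresses, col_bits: int) -> int:
    """Count row openings for a single bank that keeps one row open."""
    open_row = None
    activations = 0
    for addr in addresses:
        row = addr >> col_bits
        if row != open_row:        # page miss: a new row must be opened
            activations += 1
            open_row = row
        # otherwise: page hit, only a column access is needed
    return activations


sequential = range(2048)                               # walks through 2 rows in order
alternating = [0, 1024] * 1024                         # ping-pongs between 2 rows
print(count_row_activations(sequential, 10))           # 2
print(count_row_activations(alternating, 10))          # 2048
```

Both streams make 2048 accesses to the same two rows, but the sequential one needs only two row openings while the alternating one needs a row opening on every access.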
Configuration parameters Parameters for typical DRAM chips used in a 64-bit module
DRAM timing
Physical memory system
Basic memory model
- assume that n processors each make 1 request per Tc to one of m memories
- B(m,n): number of successes
- Tc: memory cycle time
- one processor making n requests per Tc behaves as n processors making 1 request per Tc
Achieved vs. offered bandwidth
- offered request rate: the rate at which the processor(s) would make requests if memory had unlimited bandwidth and no contention
Basic terms
- B = B(m,n) or B(m): number of requests that succeed each Tc (= average number of busy modules)
- B: bandwidth normalized to Tc
- Ts: more general term for service time; here Tc = Ts
- BW: achieved bandwidth in requests serviced per second; BW = B / Ts = B(m,n) / Ts
Modeling + evaluation methodology
- relevant physical parameters for memory
  - word size
  - module size
  - number of modules
  - cycle time Tc (= Ts)
- find the offered bandwidth: number of requests per Ts
- find the bottleneck: performance is limited by the most restrictive service point
Strecker’s model: compute B(m,n)
- model description
  - each processor generates 1 reference per cycle
  - requests are randomly and uniformly distributed over modules
  - any busy module serves 1 request
  - all unserviced requests are dropped each cycle
  - assume there are no queues
- $B(m,n) = m\left[1 - (1 - 1/m)^n\right]$
- relative performance: $P_{rel} = B(m,n) / n$
Deriving Strecker’s model
- Prob[given processor does not reference module] = $1 - 1/m$
- Prob[no processor references module] = P[idle] = $(1 - 1/m)^n$
- Prob[module busy] = $1 - (1 - 1/m)^n$
- the average number of busy modules is B(m,n)
- $B(m,n) = m\left[1 - (1 - 1/m)^n\right]$
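Strecker's expression is simple enough to evaluate directly. The sketch below is a straightforward transcription of the formula; the function names are illustrative, and n is allowed to be fractional, as it is when the offered request rate per Ts is not an integer.

```python
def strecker_bandwidth(m: int, n: float) -> float:
    """Average number of busy modules: B(m,n) = m * (1 - (1 - 1/m)**n)."""
    return m * (1.0 - (1.0 - 1.0 / m) ** n)


def relative_performance(m: int, n: float) -> float:
    """Fraction of offered requests that are served: B(m,n) / n."""
    return strecker_bandwidth(m, n) / n


# m = 8 modules, n = 1.2 requests per Ts
print(f"B(8, 1.2) = {strecker_bandwidth(8, 1.2):.2f}")  # B(8, 1.2) = 1.18
```

With a single request (n = 1) the formula gives B = 1 for any m, as expected when there is no contention; as n grows with m fixed, B approaches m.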
Example 1
- 2 dual-core processor dice share memory with Ts = 24 ns
- each die has 2 processors sharing a 4 MB L2; the miss rate is 0.001 misses per reference
- each processor makes 3 references/cycle at 4 GHz
- offered rate: 2 x 2 x 3 x 0.001 = 0.012 refs/cycle
- Ts = 4 x 24 = 96 cycles, so n = 0.012 x 96 = 1.152 processor requests per Ts
- if m = 4, the success rate is B(m,n) = B(4, 1.152) = 4[1 - 0.75^1.152] ≈ 1.13
- relative performance = B/n = 1.13/1.152 ≈ 0.98
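The offered request rate n in Example 1 is a chain of unit conversions, which can be written out step by step. The variable names are illustrative; the values are the ones stated in the example.

```python
dies = 2
cores_per_die = 2
refs_per_core_per_cycle = 3
l2_miss_rate = 0.001        # misses per reference
clock_ghz = 4               # processor cycles per ns
ts_ns = 24                  # memory service time

# Memory requests leaving the L2 caches, per processor cycle
requests_per_cycle = dies * cores_per_die * refs_per_core_per_cycle * l2_miss_rate

# Processor cycles in one memory service time
cycles_per_ts = clock_ghz * ts_ns

# Offered requests per Ts: the n that is fed into B(m, n)
n_requests_per_ts = requests_per_cycle * cycles_per_ts

print(f"{requests_per_cycle:.3f} refs/cycle, {cycles_per_ts} cycles/Ts, n = {n_requests_per_ts:.3f}")
# 0.012 refs/cycle, 96 cycles/Ts, n = 1.152
```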
Example 2
- 8-way interleaved associative data cache
- processor issues 2 LD/ST per cycle
- each makes 0.6 data references per cycle
- n = 2 x 0.6 = 1.2; m = 8
- B(m,n) = B(8, 1.2) = 1.18
- relative performance = B/n = 1.18/1.2 = 0.98
Summary
- cache: performance, cache partitioning, multi-level cache
- memory chip technology: on-die or off-die
- static versus dynamic: SRAM versus DRAM
- access protocol: talking to memory; synchronous vs asynchronous DRAMs
- simple memory performance model: Strecker’s model for memory banks