Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy 6 September, 2018 Chapter 5 Large and Fast: Exploiting Memory Hierarchy Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers 6 September, 2018 Block Placement Determined by associativity Direct mapped (1-way associative) One choice for placement n-way set associative n choices within a set Fully associative Any location Higher associativity reduces miss rate Increases complexity, cost, and access time Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers Finding a Block 6 September, 2018 Associativity Location method Tag comparisons Direct mapped Index 1 n-way set associative Set index, then search entries within the set n Fully associative Search all entries #entries in cache Full lookup table Hardware caches Choice depends on the cost of a miss versus the cost of implementing associativity, both in time and in extra hardware. Reduce comparisons to reduce cost Fully associative caches are prohibitive except for small sizes Virtual memory Full table lookup makes full associativity feasible Benefit in reduced miss rate through sophisticated replacement schemes The full map can be easily indexed with no extra hardware and no searching required Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers 6 September, 2018 Replacement Choice of entry to replace on a miss for set associative and fully associative caches Least recently used (LRU) Complex and costly hardware for high associativity For larger associativity, either LRU is approximated or random replacement is used Random Close to LRU, easier to implement in hardware For a two-way set-associative cache, random replacement has a miss rate about 1.1 times higher than LRU replacement. Virtual memory LRU approximation as tiny reduction in the miss rate can be important when the cost of a miss is enormous Because misses are so expensive and relatively infrequent, approximating this information primarily in software is acceptable. Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers 6 September, 2018 Write Policy Write-through Update both upper and lower levels Simplifies replacement, but may require write buffer Write-back Update upper level only Update lower level when block is replaced Need to keep more state Virtual memory Only write-back is feasible, given disk write latency Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Write Policy Advantages of write-back Advantages of write through Individual words written by the processor at the rate of the cache. Multiple writes within a block require only one write to the lower level. When blocks are written back, a high-bandwidth transfer, since the entire block is written. Advantages of write through Misses are simpler and cheaper because they never require a block to be written back to the lower level. Write-through is easier to implement than write-back, although it will need a write buffer.
Sources of Misses (three C’s) Morgan Kaufmann Publishers 6 September, 2018 Sources of Misses (three C’s) Compulsory misses (aka cold start misses) First access to a block Capacity misses Due to finite cache size A replaced block is later accessed again Conflict misses (aka collision misses) In a non-fully associative cache Due to competition for entries in a set Would not occur in a fully associative cache of the same total size Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Cache Design Trade-offs Morgan Kaufmann Publishers 6 September, 2018 Cache Design Trade-offs Design change Effect on miss rate Negative performance effect Increase cache size Decrease capacity misses May increase access time Increase associativity Decrease conflict misses Increase block size Decrease compulsory misses Increases miss penalty. For very large block size, may increase miss rate due to pollution. Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers Cache Control 6 September, 2018 Example cache characteristics Direct-mapped, write-back, write allocate Block size: 4 words (16 bytes, 128 bits) Cache size: 16 KiB (1024 blocks) 32-bit addresses Valid bit and dirty bit per block Tag Index Offset 3 4 9 10 31 4 bits 10 bits 18 bits Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers 6 September, 2018 Interface Signals Cache Memory CPU Read/Write Read/Write Valid Valid 32 32 Address Address 32 128 Write Data Write Data 32 128 Read Data Read Data Ready Ready Multiple cycles per access Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers Finite State Machines 6 September, 2018 Use an FSM to sequence control steps Set of states, transition on each clock edge State values are binary encoded Current state stored in a register Next state = fn (current state, current inputs) Control output signals = fo (current state) Blocking cache CPU waits until access is complete Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers Cache Controller FSM 6 September, 2018 Waits for read or write request from processor tests if the requested read or write is a hit or a miss Could partition into separate states to reduce clock cycle time writes the 128-bit block to memory Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Parallelism and Memory Hierarchy: Cache Coherence Morgan Kaufmann Publishers Parallelism and Memory Hierarchy: Cache Coherence 6 September, 2018 Suppose two CPU cores share a physical address space Caching shared data introduces a new problem as each processor views its own cache Write-through caches Time step Event CPU A’s cache CPU B’s cache Memory 1 CPU A reads X 2 CPU B reads X 3 CPU A writes 1 to X Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers 6 September, 2018 Coherence Defined Informally: Reads return most recently written value Two aspects of memory system behaviour Coherence: what values can be returned by a read Consistency: determines when a written value will be returned by a read Formally, a memory system is coherent if: P writes X; P reads X (no intervening writes by another processor) read returns value written by P P1 writes X; P2 reads X (sufficiently later and no other writes to X in-between) read returns written value c.f. CPU B should read X as 1 after step 3 in example P1 writes X, P2 writes X (writes are serialized) all processors see writes in the same order End up with the same final value for X Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Cache Coherence Protocols Morgan Kaufmann Publishers 6 September, 2018 Cache Coherence Protocols Operations performed by caches in multiprocessors to ensure coherence Migration of data to local caches Reduces latency and bandwidth for shared memory Replication of shared data read simultaneously Reduces latency of access and contention for a read shared data item. Snooping protocols Each cache monitors bus reads/writes Directory-based protocols Caches and memory record sharing status of blocks in a directory Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Invalidating Snooping Protocols Morgan Kaufmann Publishers 6 September, 2018 Invalidating Snooping Protocols Write invalidate protocol: processor gets exclusive access to a block when it is to be written Broadcasts an invalidate message on the bus Subsequent read in another cache misses Owning cache supplies updated value CPU activity Bus activity CPU A’s cache CPU B’s cache Memory CPU A reads X Cache miss for X CPU B reads X CPU A writes 1 to X Invalidate for X 1 CPU B read X Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers 6 September, 2018 Memory Consistency When are writes seen by other processors “Seen” means a read returns the written value Can’t be instantaneously Assumptions A write completes only when all processors have seen it A processor does not reorder writes with other accesses Consequence P writes X then writes Y all processors that see new Y also see new X Processors can reorder reads, but not writes Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Redundant Arrays of Inexpensive Disks (RAID) Impact of I/O on System Performance elapsed time =100 s, CPU time = 90 s, rest is I/O time #of processors double every 2 years, speed of processor remains same, I/O time remains same I/O time = elapsed time – CPU time = 10 s After 0 years, CPU time = 90s, I/O time = 10s, elapsed time= 100s, %I/O time = 10% After 2 years, CPU time = 45s, I/O time = 10s, elapsed time= 55s, %I/O time = 18% After 0 years, CPU time = 23s, I/O time = 10s, elapsed time= 33s, %I/O time = 31% After 0 years, CPU time = 11s, I/O time = 10s, elapsed time= 21s, %I/O time = 47% Improvement in CPU = 90/11=8, improvement in elapsed time = 100/21=4.7, increase in I/O time from 10% to 47%
Redundant Arrays of Inexpensive Disks (RAID)
Redundant Arrays of Inexpensive Disks (RAID) No redundancy Data striped across all disks Increase speed Multiple data requests probably not on same disk Disks seek in parallel A set of data is likely to be striped across multiple disks RAID 1 Mirroring or shadowing, uses twice as many disks. On a write, every strip is written twice. On a read, either copy can be used, distributing the load over more drives. Most expensive RAID solution, since it requires the most disks.
Redundant Arrays of Inexpensive Disks (RAID) Each byte is split into a pair of 4-bit nibbles, and Hamming code added to form a 7-bit word, of which bits 1, 2 and 4 are parity bits. all drives have to be rotationally synchronized Not in use now
Redundant Arrays of Inexpensive Disks (RAID) simplified version of RAID level 2 a single parity bit is computed for each data word and written to a parity drive A single parity bit provides a full 1-bit error correction for the case of a drive crashing, since the position of the bad bit is known. RAID 4 A strip-for-strip parity written onto an extra drive. All the strips are EXCLUSIVE ORed together, resulting in a parity strip Performs poorly for small updates.
Redundant Arrays of Inexpensive Disks (RAID) The bottleneck in RAID 4 is eliminated by distributing the parity bits uniformly over all the drives, round robin fashion. In the event of a drive crash, reconstructing the contents of the failed drive is a complex process. RAID 6 Two different parity calculations are carried out and stored in separate blocks on different disks Advantage: provides extremely high data availability Three disks would have to fail within the mean time to repair (MTTR) interval to cause data to be lost Incurs a substantial write penalty because each write affects two parity blocks Performance benchmarks show a RAID 6 controller can suffer more than a 30% drop in overall write performance compared with a RAID 5 implementation.
The ARM Cortex-A53 and Intel Core i7 Memory Hierarchies Morgan Kaufmann Publishers The ARM Cortex-A53 and Intel Core i7 Memory Hierarchies 6 September, 2018 2-Level TLB Organization Chapter 5 — Large and Fast: Exploiting Memory Hierarchy