Adaptive Cache Compression for High-Performance Processors
Alaa Alameldeen and David Wood
University of Wisconsin-Madison
Wisconsin Multifacet Project
ISCA 2004
Overview
Design of high-performance processors:
- Processor speed improves faster than memory
- Memory latency dominates performance
- Need more effective cache designs
On-chip cache compression:
+ Increases effective cache size
- Increases cache hit latency
Does cache compression help or hurt?
Does Cache Compression Help or Hurt?
Adaptive compression determines when compression is beneficial.
Outline
- Motivation
- Cache Compression Framework
  - Compressed Cache Hierarchy
  - Decoupled Variable-Segment Cache
- Adaptive Compression
- Evaluation
- Conclusions
Compressed Cache Hierarchy
[Diagram: the instruction fetcher and load-store queue access uncompressed L1 I- and D-caches (backed by an L1 victim cache); a compression pipeline sits between the L1s and the compressed L2, and a decompression pipeline sits on the L2-to-L1 path with an uncompressed-line bypass; the L2 connects to memory.]
Decoupled Variable-Segment Cache
Objective: pack more lines into the same space.
Starting point: a 2-way set-associative cache with 64-byte lines; each tag contains the address tag, permissions, and LRU (replacement) bits.
- Add two more tags per set (Addresses A, B, C, D)
- Add a compression status, a compressed size, and more LRU bits to each tag
- Divide the data area into 8-byte segments; data lines are composed of 1-8 segments
Resulting example set (CSize records a line's compressed size in segments even when it is stored uncompressed):
- Addr A: uncompressed, CSize 3
- Addr B: compressed, CSize 2
- Addr C: compressed, CSize 6
- Addr D: compressed, CSize 4 (tag is present but the line isn't: A, B, and C already fill the 16 data segments)
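A minimal C++ sketch (ours, not the paper's implementation) of the per-tag state this design implies; field names and widths are hypothetical:

```cpp
#include <cstdint>

// Hypothetical tag for a decoupled variable-segment cache set. A set
// holds four such tags but only 16 data segments (two uncompressed
// 64-byte lines), so a tag can be valid while its line is not resident.
struct SegmentedTag {
    uint64_t addr_tag;      // address tag
    uint8_t  permissions;   // coherence/permission bits
    uint8_t  lru;           // LRU / replacement bits
    bool     cstatus;       // true if the line is stored compressed
    uint8_t  csize;         // compressed size in 8-byte segments (1..8),
                            // recorded even for uncompressed lines
    bool     data_present;  // tag may be present while the line is not
};

// Segments the line occupies in the data area: csize if compressed,
// all 8 segments (64 bytes / 8 bytes per segment) if uncompressed.
inline int segmentsUsed(const SegmentedTag& t) {
    return t.cstatus ? t.csize : 8;
}
```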
Outline
- Motivation
- Cache Compression Framework
- Adaptive Compression
  - Key insight
  - Classification of L2 accesses
  - Global compression predictor
- Evaluation
- Conclusions
Adaptive Compression
Use the past to predict the future.
Key insight: the LRU stack [Mattson et al., 1970] indicates, for each reference, whether compression helps or hurts.
Decision rule: if Benefit(Compression) > Cost(Compression), compress future lines; otherwise, do not.
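As a concrete aid (ours, not from the talk), a minimal LRU-stack helper; the stack order it returns is the 1-based recency depth used by the classification on the following slides:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative LRU stack for one cache set (a real set would cap it at
// four tags). stackOrder() returns the 1-based recency depth of a tag
// (1 = most recently used) and promotes it to the top; it returns 0 and
// inserts the tag if the tag was not present.
class LruStack {
    std::vector<uint64_t> stack_;  // front = MRU, back = LRU
public:
    int stackOrder(uint64_t tag) {
        auto it = std::find(stack_.begin(), stack_.end(), tag);
        if (it == stack_.end()) {
            stack_.insert(stack_.begin(), tag);
            return 0;                         // not present
        }
        int depth = static_cast<int>(it - stack_.begin()) + 1;
        stack_.erase(it);                     // promote to MRU
        stack_.insert(stack_.begin(), tag);
        return depth;
    }
};
```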
Cost/Benefit Classification
Classify each cache reference against the example set: a four-way set-associative cache with space for two 64-byte lines, i.e., 16 available segments. The LRU stack, from most to least recently used: Addr A (uncompressed, CSize 3), Addr B (compressed, CSize 2), Addr C (compressed, CSize 6), Addr D (compressed, CSize 4).
An Unpenalized Hit
Read/write Address A: LRU stack order = 1 ≤ 2, so it hits regardless of compression. The line is uncompressed: no decompression penalty, so neither cost nor benefit.
A Penalized Hit
Read/write Address B: LRU stack order = 2 ≤ 2, so it hits regardless of compression. The line is compressed: a decompression penalty is incurred, a compression cost.
An Avoided Miss
Read/write Address C: LRU stack order = 3 > 2, so it hits only because of compression. Compression benefit: an off-chip miss is eliminated.
An Avoidable Miss
Read/write Address D: the line is not in the cache, but its tag exists at LRU stack order = 4. Since Sum(CSize) = 3 + 2 + 6 + 4 = 15 ≤ 16, all four lines would have fit had they all been compressed; the reference missed only because some lines are not compressed. Potential compression benefit.
An Unavoidable Miss
Read/write Address E: LRU stack order > 4; the line is not in the cache and no tag exists. Compression wouldn't have helped: neither cost nor benefit.
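Putting the five cases together, a sketch of the per-reference classification using this example's constants (four tags, two uncompressed ways, 16 data segments); the names and structure are illustrative, not the paper's implementation:

```cpp
#include <array>

enum class AccessClass {
    UnpenalizedHit,   // neither cost nor benefit
    PenalizedHit,     // cost: decompression latency
    AvoidedMiss,      // benefit: off-chip miss eliminated
    AvoidableMiss,    // potential benefit
    UnavoidableMiss   // neither cost nor benefit
};

struct TagState {
    bool valid;         // tag present in the set
    bool compressed;    // CStatus
    int  csize;         // CSize in segments (compressed size, always)
    bool data_present;  // line resident in the data area
};

// tags are ordered from most- to least-recently used; stack_order is the
// referenced tag's 1-based LRU stack depth, or 0 if no tag matches.
AccessClass classify(int stack_order,
                     const std::array<TagState, 4>& tags) {
    constexpr int kUncompressedWays = 2;   // space for two 64B lines
    constexpr int kTotalSegments    = 16;

    if (stack_order == 0) return AccessClass::UnavoidableMiss;

    const TagState& t = tags[stack_order - 1];
    if (t.data_present) {
        if (stack_order <= kUncompressedWays)
            return t.compressed ? AccessClass::PenalizedHit
                                : AccessClass::UnpenalizedHit;
        return AccessClass::AvoidedMiss;   // hit only due to compression
    }
    // Tag present but data not resident: would compressing every line
    // in the set have made them all fit?
    int sum = 0;
    for (const TagState& s : tags)
        if (s.valid) sum += s.csize;
    return (sum <= kTotalSegments) ? AccessClass::AvoidableMiss
                                   : AccessClass::UnavoidableMiss;
}
```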
Compression Predictor
Estimate: Benefit(Compression) − Cost(Compression).
Single counter: the Global Compression Predictor (GCP), a saturating up/down 19-bit counter.
GCP updated on each cache access:
- Benefit: increment by the memory latency
- Cost: decrement by the decompression latency
- Optimization: normalize to decompression latency = 1
Cache allocation:
- Allocate a compressed line if GCP ≥ 0
- Allocate an uncompressed line if GCP < 0
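A minimal sketch of the GCP as described, with the benefit already normalized so that the decompression latency equals 1:

```cpp
#include <algorithm>
#include <cstdint>

// Global Compression Predictor: one saturating up/down 19-bit counter.
class GlobalCompressionPredictor {
    static constexpr int32_t kMax = (1 << 18) - 1;   // 19-bit signed range
    static constexpr int32_t kMin = -(1 << 18);
    int32_t gcp_ = 0;

public:
    // Avoided or avoidable miss: compression saved (or would have saved)
    // a memory access; increment by the normalized memory latency.
    void creditBenefit(int32_t memory_latency_norm) {
        gcp_ = std::min(kMax, gcp_ + memory_latency_norm);
    }
    // Penalized hit: compression cost one decompression (normalized to 1).
    void chargeCost() { gcp_ = std::max(kMin, gcp_ - 1); }

    // Allocation policy: compress new lines while the predictor is
    // non-negative; allocate uncompressed lines otherwise.
    bool shouldCompress() const { return gcp_ >= 0; }
};
```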
Outline
- Motivation
- Cache Compression Framework
- Adaptive Compression
- Evaluation
  - Simulation setup
  - Performance
- Conclusions
Simulation Setup
Simics full-system simulator, augmented with:
- A detailed OoO processor simulator [TFSim; Mauer et al., 2002]
- A detailed memory timing simulator [Martin et al., 2002]
Workloads:
- Commercial workloads:
  - Database servers: OLTP and SPECJBB
  - Static web serving: Apache and Zeus
- SPEC2000 benchmarks:
  - SPECint: bzip, gcc, mcf, twolf
  - SPECfp: ammp, applu, equake, swim
System Configuration
A dynamically scheduled SPARC V9 uniprocessor. Configuration parameters:
- L1 cache: split I & D, 64KB each, 2-way SA, 64B lines, 2 cycles/access
- L2 cache: unified 4MB, 8-way SA, 64B lines, 20 cycles + decompression latency per access
- Memory: 4GB DRAM, 400-cycle access time, 128 outstanding requests
- Processor pipeline: 4-wide superscalar, 11-stage pipeline: fetch (3), decode (3), schedule (1), execute (1+), retire (3)
- Reorder buffer: 64 entries
Simulated Cache Configurations
- Always: all compressible lines are stored in compressed format; every compressed line incurs the decompression penalty.
- Never: all cache lines are stored in uncompressed format; the cache is 8-way set-associative with half the number of sets; no decompression penalty.
- Adaptive: our adaptive compression scheme.
Performance
[Figure: performance of the Never, Always, and Adaptive configurations across SPECint, SPECfp, and commercial workloads; compression yields up to a 35% speedup on some benchmarks and up to an 18% slowdown on others.]
Performance
Adaptive performs similar to the best of Always and Never.
Effective Cache Capacity
[Figure: effective cache capacity achieved by compression for each benchmark.]
Cache Miss Rates
[Figure: misses per 1000 instructions and penalized hits per avoided miss, per benchmark.]
Adapting to L2 Sizes
[Figure: misses per 1000 instructions and penalized hits per avoided miss as the L2 size varies.]
Conclusions
Cache compression increases cache capacity but slows down cache hits:
- Helps some benchmarks (e.g., apache, mcf)
- Hurts other benchmarks (e.g., gcc, ammp)
Our proposal: adaptive compression
- Uses the (LRU) replacement stack to determine whether compression helps or hurts
- Updates a single global saturating counter on cache accesses
Adaptive compression performs similar to the better of Always Compress and Never Compress.
Backup Slides
- Frequent Pattern Compression (FPC)
- Decoupled Variable-Segment Cache
- Classification of L2 Accesses
- (LRU) Stack Replacement
- Cache Miss Rates
- Adapting to L2 Sizes – mcf
- Adapting to L1 Size
- Adapting to Decompression Latency – mcf
- Adapting to Decompression Latency – ammp
- Phase Behavior – gcc
- Phase Behavior – mcf
- Can We Do Better Than Adaptive?
Decoupled Variable-Segment Cache
Each set contains four tags but space for only two uncompressed lines; the data area is divided into 8-byte segments.
Each tag is composed of:
- Address tag (same as an uncompressed cache)
- Permissions (same as an uncompressed cache)
- CStatus: 1 if the line is compressed, 0 otherwise
- CSize: size of the compressed line in segments
- LRU/replacement bits (same as an uncompressed cache)
Frequent Pattern Compression (FPC)
A significance-based compression algorithm combined with zero run-length encoding:
- Compresses each 32-bit word separately
- Suitable for short cache lines
- Compressible patterns: zero runs; sign-extended 4, 8, and 16 bits; zero-padded halfword; two sign-extended halfwords; repeated byte
- A 64-byte line is decompressed in a five-stage (five-cycle) pipeline
Related work:
- X-Match and X-RL algorithms [Kjelso et al., 1996]
- Address and data significance-based compression [Farrens and Park, 1991; Citron and Rudolph, 1995; Canal et al., 2000]
More details in the technical report: "Frequent Pattern Compression: A Significance-Based Compression Algorithm for L2 Caches," Alaa R. Alameldeen and David A. Wood, Dept. of Computer Sciences Technical Report CS-TR, April 2004 (available online).
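A rough illustration of the per-word pattern tests listed above; prefix encodings and zero-run handling are omitted, and the pattern definitions here are our reading of the slide, not the technical report's exact encoding:

```cpp
#include <cstdint>

// Can a 32-bit word be rebuilt from fewer stored bits?
bool fitsSignExtended(uint32_t w, int bits) {   // sign-ext. 4/8/16 bits
    int32_t v = static_cast<int32_t>(w);
    return v >= -(1 << (bits - 1)) && v <= (1 << (bits - 1)) - 1;
}
bool isZeroPaddedHalfword(uint32_t w) {         // low halfword is zero
    return (w & 0xFFFFu) == 0;
}
bool isTwoSignExtHalfwords(uint32_t w) {        // each half fits in a byte
    auto fits = [](uint16_t h) {
        int16_t v = static_cast<int16_t>(h);
        return v >= -128 && v <= 127;
    };
    return fits(static_cast<uint16_t>(w >> 16)) &&
           fits(static_cast<uint16_t>(w & 0xFFFFu));
}
bool isRepeatedByte(uint32_t w) {               // e.g., 0xABABABAB
    return w == (w & 0xFFu) * 0x01010101u;
}
bool isCompressibleWord(uint32_t w) {           // zero runs handled elsewhere
    return w == 0 || fitsSignExtended(w, 4) || fitsSignExtended(w, 8) ||
           fitsSignExtended(w, 16) || isZeroPaddedHalfword(w) ||
           isTwoSignExtHalfwords(w) || isRepeatedByte(w);
}
```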
Classification of L2 Accesses
Cache hits:
- Unpenalized hit: hit to an uncompressed line that would have hit without compression (neither cost nor benefit)
- Penalized hit (cost): hit to a compressed line that would have hit without compression
- Avoided miss (benefit): hit to a line that would NOT have hit without compression
Cache misses:
- Avoidable miss (potential benefit): miss to a line that would have hit with compression
- Unavoidable miss: miss to a line that would have missed even with compression (neither cost nor benefit)
(LRU) Stack Replacement
How to differentiate penalized hits from avoided misses? Only hits to the top half of the tags in the LRU stack are penalized hits.
How to differentiate avoidable from unavoidable misses? This does not depend on LRU replacement; the classification works with:
- Any replacement algorithm for the top half of the tags
- Any stack algorithm for the remaining tags
Cache Miss Rates
Adapting to L2 Sizes – mcf
[Figure: misses per 1000 instructions and penalized hits per avoided miss as the L2 size varies.]
Adapting to L1 Size
Adapting to Decompression Latency – mcf
Adapting to Decompression Latency – ammp
Phase Behavior – gcc
[Figure: predictor value (K) and cache size (MB) over time.]
Phase Behavior – mcf
[Figure: predictor value (K) and cache size (MB) over time.]
Can We Do Better Than Adaptive?
Optimal is an unrealistic configuration: Always, but with no decompression penalty.