Published by Marjory Wilson. Modified over 9 years ago.
1
Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Somayeh Sardashti and David A. Wood
University of Wisconsin-Madison
4
Communication vs. Computation
[Figure: communication versus computation, annotated ~200X (Keckler, Micro 2011)]
Improving cache utilization is critical for energy efficiency!
5
Compressed Cache: Compress and Compact Blocks
+ Higher effective cache size
+ Small area overhead
+ Higher system performance
+ Lower system energy
Previous work limits compression effectiveness:
- Limited number of tags
- High internal fragmentation
- Energy-expensive re-compaction
6
Previous work limits compression effectiveness:
- Limited number of tags
- High internal fragmentation
- Energy-expensive re-compaction
Decoupled Compressed Cache addresses these limits with:
- Decoupled super-blocks
- Non-contiguous sub-blocks
8
Outline
- Motivation
- Compressed caching
- Our Proposals: Decoupled compressed cache
- Experimental Results
- Conclusions
9
Uncompressed Caching
A fixed one-to-one tag/data mapping.
[Figure: tag array and data array]
10
Compressed Caching
- Compress cache blocks.
- Compact compressed blocks to make room.
- Add more tags to increase effective capacity.
[Figure: tag array and data array]
11
Compression
(1) Compression: how to compress blocks?
- There are different compression algorithms.
- Choosing among them is not the focus of this work, but which algorithm is used matters!
[Figure: a compressor reduces a 64-byte block to 20 bytes]
12
Compression Potentials
A high compression ratio potentially yields a large normalized effective cache capacity.
Compression Ratio = Original Size / Compressed Size
[Figure: compression ratio and cycles to decompress per compression algorithm; compression ratios shown: 1.5, 2.8, 3.9]
We use C-PACK+Z for the rest of the talk!
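The ratio defined on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `compression_ratio` is an assumption made for illustration, and the 64-byte and 20-byte sizes are the example from the compressor slide.

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Compression Ratio = Original Size / Compressed Size."""
    if original_bytes <= 0 or compressed_bytes <= 0:
        raise ValueError("sizes must be positive")
    return original_bytes / compressed_bytes


# A 64-byte block that compresses to 20 bytes.
ratio = compression_ratio(64, 20)
```

A ratio above 1 means the block shrank; whether that saving turns into extra capacity depends on how blocks are compacted.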
13
Compaction
(2) Compaction: how to store and find blocks?
- Critical to achieving the compression potential.
- This work focuses on compaction.
Fixed Sized Compressed Cache (FixedC) [Kim’02, WMPI, Yang Micro 02]
- Internal fragmentation!
[Figure: tag array and data array]
14
Compaction
(2) Compaction: how to store and find blocks?
Variable Sized Compressed Cache (VSC) [Alameldeen, ISCA 2002]
[Figure: tag array and data array, with a sub-block labeled]
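Internal fragmentation comes from rounding each compressed block up to the allocation granule. The sketch below compares a coarse and a fine granule. The granule sizes (32 bytes for a FixedC-style half block, 8 bytes for a VSC-style sub-block) are assumptions chosen for illustration and are not taken from the slides.

```python
import math


def stored_bytes(compressed_bytes: int, granule_bytes: int) -> int:
    """Space occupied once the block is rounded up to whole granules."""
    return math.ceil(compressed_bytes / granule_bytes) * granule_bytes


def internal_fragmentation(compressed_bytes: int, granule_bytes: int) -> int:
    """Bytes allocated to the block but not holding compressed data."""
    return stored_bytes(compressed_bytes, granule_bytes) - compressed_bytes


# A 20-byte compressed block under two assumed allocation granules.
waste_coarse = internal_fragmentation(20, granule_bytes=32)  # FixedC-style
waste_fine = internal_fragmentation(20, granule_bytes=8)     # VSC-style
```

Under these assumptions the coarse granule wastes 12 bytes and the fine granule wastes 4, which is why a finer granule recovers more of the compression potential.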
15
Previous Compressed Caches
(Limit 1) Limited tag/metadata
- High area overhead from adding 4X or more tags
(Limit 2) Internal fragmentation
- Low cache capacity utilization
Normalized Effective Capacity = Number of Valid LLC Blocks / Maximum Number of (Uncompressed) Blocks
[Figure: normalized effective capacity; labels and values shown: 10B, 16B, 2.6, 2.3, 2.0, 1.7; Potential: 3.9, 3.1]
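The capacity metric on this slide is a simple ratio. The sketch below computes it for an 8-MB LLC, assuming 64-byte blocks; the count of valid blocks is a made-up input, not a measured result.

```python
def normalized_effective_capacity(valid_blocks: int,
                                  max_uncompressed_blocks: int) -> float:
    """Valid LLC blocks divided by the blocks an uncompressed LLC can hold."""
    if max_uncompressed_blocks <= 0:
        raise ValueError("max_uncompressed_blocks must be positive")
    return valid_blocks / max_uncompressed_blocks


LLC_BYTES = 8 * 1024 * 1024   # 8-MB shared LLC
BLOCK_BYTES = 64              # assumed block size
max_blocks = LLC_BYTES // BLOCK_BYTES

# Hypothetical occupancy: twice as many valid blocks as the baseline holds.
capacity = normalized_effective_capacity(2 * max_blocks, max_blocks)
```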
16
(Limit 3) Energy-Expensive Re-Compaction
VSC requires energy-expensive re-compaction.
- Example: after an update, block B needs 2 sub-blocks.
- 3X higher LLC dynamic energy!
[Figure: tag array and data array during the update of B]
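To see why re-compaction costs energy, consider a set whose blocks are stored back to back. The sketch below counts how many sub-blocks belonging to other blocks must be moved when one block changes size. It assumes that every block stored after the updated one shifts, which is a simplification, and the layout is hypothetical.

```python
def recompaction_moves(layout, name, new_size):
    """Count sub-blocks of other blocks that move when `name` is resized.

    `layout` is a list of (block_name, sub_block_count) pairs in storage
    order. In a contiguous layout, a size change shifts every later block.
    """
    names = [block_name for block_name, _ in layout]
    index = names.index(name)
    if layout[index][1] == new_size:
        return 0  # same footprint, nothing shifts
    return sum(size for _, size in layout[index + 1:])


# Hypothetical set: A, B, C, D stored contiguously; B grows from 1 to 2.
layout = [("A", 1), ("B", 1), ("C", 2), ("D", 1)]
moved = recompaction_moves(layout, "B", new_size=2)
```

Here the update to B forces three sub-blocks (two of C, one of D) to be read and rewritten.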
18
Decoupled Compressed Cache
(1) Exploiting spatial locality
- Low area overhead
(2) Decoupling tag/data mapping
- Eliminates energy-expensive re-compaction
- Reduces internal fragmentation
(3) Co-DCC: dynamically co-compacting super-blocks
- Further reduces internal fragmentation
19
(1) Exploiting Spatial Locality
Neighboring blocks co-reside in the LLC.
[Figure: value shown: 89%]
20
(1) Exploiting Spatial Locality
DCC tracks LLC blocks at super-block granularity.
- Example super-blocks: Quad (Q): A, B, C, D; Singleton (S): E
- Super tag format: super-block tag, then one state field per block (state A, state B, state C, state D)
- Up to 4X blocks with low area overhead!
[Figure: tag and data arrays with 4X tags, 2X tags, and super tags]
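One way to picture a super tag is as a single address tag shared by four neighboring blocks, with one state per block. The sketch below is illustrative only: the 64-byte block size, the aligned four-block grouping, the boolean `valid` state, and all names are assumptions rather than the design from the paper.

```python
from dataclasses import dataclass, field

BLOCK_BYTES = 64             # assumed block size
BLOCKS_PER_SUPER_BLOCK = 4   # a quad: A, B, C, D


def split_address(addr: int) -> tuple:
    """Return (super_block_number, blk_id) for a byte address."""
    block_number = addr // BLOCK_BYTES
    return divmod(block_number, BLOCKS_PER_SUPER_BLOCK)


@dataclass
class SuperBlockTag:
    super_block: int
    # One state per block; a real design would hold coherence state too.
    valid: list = field(
        default_factory=lambda: [False] * BLOCKS_PER_SUPER_BLOCK)

    def covers(self, addr: int) -> bool:
        super_block, blk_id = split_address(addr)
        return super_block == self.super_block and self.valid[blk_id]


# Four neighboring blocks share one tag instead of needing four tags.
quad = SuperBlockTag(super_block=16)
for addr in (0x1000, 0x1040, 0x1080, 0x10C0):
    _, blk_id = split_address(addr)
    quad.valid[blk_id] = True
```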
21
(2) Decoupling Tag/Data Mapping
DCC decouples the tag/data mapping to eliminate re-compaction.
- Flexible allocation (example: Update B)
[Figure: super tags and data array holding Quad (Q): A, B, C, D and Singleton (S): E]
22
(2) Decoupling Tag/Data Mapping
Back pointers identify the owner block of each sub-block.
- Back pointer fields: Tag ID, Blk ID
[Figure: super tags, data array, and back pointers for Quad (Q): A, B, C, D and Singleton (S): E]
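A back pointer array can be sketched as one entry per sub-block that names its owner. Because any free sub-block can be claimed, a block can grow without other blocks moving. The class below is a hypothetical illustration: the names, the eight-sub-block set, and the first-free allocation policy are assumptions, and eviction is left out.

```python
class DecoupledDataSet:
    """One cache set whose sub-blocks are tracked by back pointers."""

    def __init__(self, num_sub_blocks: int) -> None:
        # back_pointers[i] is None when sub-block i is free, otherwise
        # (tag_id, blk_id) naming the block that owns sub-block i.
        self.back_pointers = [None] * num_sub_blocks

    def owned_by(self, tag_id: int, blk_id: int) -> list:
        return [i for i, bp in enumerate(self.back_pointers)
                if bp == (tag_id, blk_id)]

    def resize(self, tag_id: int, blk_id: int, new_count: int) -> list:
        """Grow or shrink a block; other blocks never move."""
        owned = self.owned_by(tag_id, blk_id)
        if new_count < len(owned):
            for i in owned[new_count:]:
                self.back_pointers[i] = None
        else:
            free = [i for i, bp in enumerate(self.back_pointers)
                    if bp is None]
            needed = new_count - len(owned)
            if needed > len(free):
                raise MemoryError("set full; a real cache would evict")
            for i in free[:needed]:
                self.back_pointers[i] = (tag_id, blk_id)
        return self.owned_by(tag_id, blk_id)


cache_set = DecoupledDataSet(num_sub_blocks=8)
cache_set.resize(0, 0, 1)                   # A takes one sub-block
cache_set.resize(0, 1, 1)                   # B takes one sub-block
cache_set.resize(0, 2, 2)                   # C takes two sub-blocks
b_after_update = cache_set.resize(0, 1, 2)  # Update B: now two sub-blocks
```

After the update, B owns two non-contiguous sub-blocks while A and C stay where they were.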
23
(3) Co-Compacting Super-Blocks
Co-DCC dynamically co-compacts super-blocks, reducing internal fragmentation.
[Figure: a sub-block of Quad (Q): A, B, C, D]
25
Experimental Methodology
Integrated DCC with AMD Bulldozer Cache.
- We model the timing and allocation constraints of sequential regions at the LLC in detail.
- No need for an alignment network.
Verilog implementation and synthesis of the tag match and sub-block selection logic.
- One additional cycle of latency due to sub-block selection.
26
Experimental Methodology
Full-system simulation with a simulator based on GEMS.
Wide range of applications with different levels of cache sensitivity:
- Commercial workloads: apache, jbb, oltp, zeus
- Spec-OMP: ammp, applu, equake, mgrid, wupwise
- Parsec: blackscholes, canneal, freqmine
- Spec 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm

Cores         Eight OOO cores, 3.2 GHz
L1I$/L1D$     Private, 32-KB, 8-way
L2$           Private, 256-KB, 8-way
L3$           Shared, 8-MB, 16-way, 8 banks
Main Memory   4GB, 16 banks, 800 MHz bus frequency DDR3
27
Effective LLC Capacity
[Figure: normalized effective LLC capacity versus normalized LLC area for Baseline, 2X Baseline, FixedC, VSC, DCC, and Co-DCC]

Components            FixedC/VSC-2X   DCC     Co-DCC
Tag Array             6.3%            2.1%    11.3%
Back Pointer Array    0               4.4%    5.4%
(De-)Compressors      1.8%            1.8%    1.8%
Total Area Overhead   8.1%            8.3%    18.5%
28
(Co-)DCC Performance
[Figure: values shown: 0.93, 0.96, 0.95, 0.90, 0.86]
(Co-)DCC boosts system performance significantly.
29
(Co-)DCC Energy Consumption
[Figure: values shown: 0.93, 0.96, 0.97, 0.91, 0.88]
(Co-)DCC reduces system energy by reducing the number of accesses to main memory.
30
Summary
Analyze the limits of compressed caching:
- Limited number of tags
- Internal fragmentation
- Energy-expensive re-compaction
Decoupled Compressed Cache:
- Improves the performance and energy of compressed caching
- Decoupled super-blocks
- Non-contiguous sub-blocks
Co-DCC further reduces internal fragmentation.
Practical designs [details in the paper]
31
Backup
- (De-)Compression overhead
- DCC data array organization with AMD Bulldozer
- DCC Timing
- DCC Lookup
- Applications
- Co-DCC design
- LLC effective capacity
- LLC miss rate
- Memory dynamic energy
- LLC dynamic energy
32
(De-)Compression Overhead

Parameters               Compressor   Decompressor
Pipeline Depth           6            2
Latency (cycles)         16           9
(unlabeled row)          0.016
Power Consumption (mW)   25.84        19.01
33
DCC Data Array Organization with AMD Bulldozer
34
DCC Timing
35
DCC Lookup
1. Access Super Tags and Back Pointers in parallel.
2. Find the matched Back Pointers.
3. Read the corresponding sub-blocks and decompress.
[Figure: example lookup (Read C) across super tags, back pointers, and data for Quad (Q): A, B, C, D and Singleton (S): E; values shown: Q 1010, S 1111 1111 1111]
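The three lookup steps can be sketched in software as follows. This is a sequential approximation of what the slide describes as a parallel hardware access. The data layout, the dictionary-based tags, and the identity `decompress` default are assumptions for illustration, as is the convention that a block's sub-blocks are concatenated in ascending index order.

```python
def dcc_lookup(super_tags, back_pointers, data, super_block, blk_id,
               decompress=lambda raw: raw):
    """Return the requested block's bytes, or None on a miss."""
    # Step 1: match the super tag (hardware reads the back pointers
    # in parallel; this sketch does it sequentially).
    tag_id = None
    for i, tag in enumerate(super_tags):
        if tag["super_block"] == super_block and tag["valid"][blk_id]:
            tag_id = i
            break
    if tag_id is None:
        return None
    # Step 2: find the back pointers that name this (tag, block) pair.
    matched = [i for i, bp in enumerate(back_pointers)
               if bp == (tag_id, blk_id)]
    # Step 3: read the corresponding sub-blocks and decompress.
    return decompress(b"".join(data[i] for i in matched))


# Hypothetical set contents: a quad (A, B, C, D) and a singleton (E).
super_tags = [
    {"super_block": 16, "valid": [True, True, True, True]},
    {"super_block": 40, "valid": [True, False, False, False]},
]
back_pointers = [(0, 0), (0, 2), (1, 0), (0, 1), None, (0, 2), (0, 3), None]
data = [b"A0", b"C0", b"E0", b"B0", b"", b"C1", b"D0", b""]

# Read C: block 2 of the quad, stored in two non-contiguous sub-blocks.
block_c = dcc_lookup(super_tags, back_pointers, data,
                     super_block=16, blk_id=2)
```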
36
Applications
Spec2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm
[Figure: applications grouped as Cache Insensitive, Sensitive to Cache Capacity, Sensitive to Cache Latency, and Sensitive to Cache Capacity and Latency]
37
Co-DCC Design
38
LLC Effective Cache Capacity
39
LLC Miss Rate
40
Memory Dynamic Energy
41
LLC Dynamic Energy