1
A Robust Main-Memory Compression Scheme (ISCA'06)
Magnus Ekman and Per Stenström, Chalmers University of Technology, Göteborg, Sweden
Speaker: 雋中
2
Outline
- Introduction
- Contributions
- Effectiveness of Zero-Aware Compressors
- The Proposed Compression Scheme
- Performance Results
- Conclusions
3
Introduction
Memory resources are wasted to compensate for the increasing processor/memory/disk speed gap:
- >50% of die size is occupied by caches
- >50% of the cost of a server is DRAM (and increasing)
Lossless data compression techniques have the potential to free up more than 50% of memory resources. Unfortunately, compression introduces several challenging design and performance issues.
4
[Diagram: a core with L1 and L2 caches sends a request to uncompressed main-memory space (64 B blocks at addresses 0, 64, 128, ..., 448) and receives the data directly.]
5
[Diagram: with compression, the request first goes through address translation via a translation table, the compressed (and possibly fragmented) main-memory space is accessed, and a decompressor delivers the data. The block addresses 0, 64, ..., 448 no longer map one-to-one onto compressed locations.]
6
Contributions
A low-overhead main-memory compression scheme:
- Low decompression latency by using a simple and fast (zero-aware) algorithm
- Fast address translation through a proposed small translation structure that fits on the processor die
- Reduction of fragmentation through occasional relocation of data when compressibility varies
Overall, our compression scheme frees up 30% of the memory at a marginal performance loss of 0.2%!
7
Frequency of zero-valued locations
- 12% of all 8 KB pages contain only zeros
- 30% of all 64 B blocks contain only zeros
- 42% of all 4 B words contain only zeros
- 55% of all bytes are zero!
Zero-aware compression schemes have great potential!
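As a concrete illustration, here is a minimal C sketch of how such statistics could be gathered from a memory image (zero_stats and the sample buffer are illustrative; the talk's numbers come from full-system workload data, not this harness):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Count zero-valued bytes, 4 B words, and 64 B blocks in a buffer. */
static void zero_stats(const uint8_t *mem, size_t len) {
    size_t zero_bytes = 0, zero_words = 0, zero_blocks = 0;

    for (size_t i = 0; i < len; i++)
        zero_bytes += (mem[i] == 0);

    for (size_t i = 0; i + 4 <= len; i += 4) {
        uint32_t w;
        memcpy(&w, mem + i, 4);            /* alignment-safe load */
        zero_words += (w == 0);
    }

    for (size_t i = 0; i + 64 <= len; i += 64) {
        size_t j = 0;
        while (j < 64 && mem[i + j] == 0) j++;
        zero_blocks += (j == 64);
    }

    printf("zero bytes:  %zu/%zu\n", zero_bytes, len);
    printf("zero words:  %zu/%zu\n", zero_words, len / 4);
    printf("zero blocks: %zu/%zu\n", zero_blocks, len / 64);
}

int main(void) {
    static uint8_t image[4096];            /* stand-in for a memory dump */
    image[100] = 0xAB;                     /* one nonzero byte */
    zero_stats(image, sizeof image);
    return 0;
}
```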
8
System Overview
9
Evaluated Algorithms
Zero-aware algorithms:
- FPC (Alameldeen and Wood) + 3 simplified versions
For comparison, we also consider:
- X-Match Pro (efficient hardware implementations exist)
- LZSS (popular algorithm, previously used by IBM for memory compression)
- Deflate (upper bound on compressibility)
10
Frequent Pattern Compression (Alameldeen and Wood)
Each 32-bit word is coded using a prefix plus data.

Simplified encoding (2-bit prefix):
Prefix  Pattern encoded          Data size
00      Zero run                 3 bits (for runs of up to 8 zero words)
01      One byte sign-extended   8 bits
10      Halfword sign-extended   16 bits
11      Uncompressed             32 bits

Full FPC encoding (3-bit prefix):
Prefix  Pattern encoded                            Data size
000     Zero run                                   3 bits (for runs of up to 8 zero words)
001     4-bit sign-extended                        4 bits
010     One byte sign-extended                     8 bits
011     Halfword sign-extended                     16 bits
100     Halfword padded with a zero halfword       16 bits
101     Two halfwords, each a sign-extended byte   16 bits
110     Word consisting of repeated bytes          8 bits
111     Uncompressed                               32 bits
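To make the pattern table concrete, here is a minimal C sketch of the per-word pattern match for the full 3-bit encoding (fpc_data_bits is an illustrative helper, not the paper's hardware; real FPC additionally merges consecutive zero words into a single run-length code):

```c
#include <stdint.h>

/* Return the number of data bits a 32-bit word needs after its 3-bit
   prefix, by matching patterns in the order of the table above. */
static int fpc_data_bits(uint32_t w) {
    int32_t s = (int32_t)w;

    if (w == 0)                     return 3;   /* 000: zero run (length field)       */
    if (s >= -8 && s <= 7)          return 4;   /* 001: 4-bit sign-extended           */
    if (s >= -128 && s <= 127)      return 8;   /* 010: one byte sign-extended        */
    if (s >= -32768 && s <= 32767)  return 16;  /* 011: halfword sign-extended        */
    if ((w & 0xFFFFu) == 0)         return 16;  /* 100: halfword padded with zeros    */

    int16_t lo = (int16_t)(w & 0xFFFFu);
    int16_t hi = (int16_t)(w >> 16);
    if (lo >= -128 && lo <= 127 &&
        hi >= -128 && hi <= 127)    return 16;  /* 101: two byte-sign-extended halves */

    if (w == 0x01010101u * (w & 0xFFu))
                                    return 8;   /* 110: four repeated bytes           */

    return 32;                                  /* 111: uncompressed                  */
}
```

For example, 0xFFFFFFF5 (-11) matches the 4-bit sign-extended pattern and costs 3 + 4 = 7 bits instead of 32.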
11
Resulting Compressed Sizes
[Chart: fraction of memory freed per benchmark, grouped into SpecInt, SpecFP, and Server.]
Main observations:
- FPC and all its variations can free up about 45% of memory
- LZSS and X-Match Pro are only marginally better, in spite of their complexity
- Deflate can free up about 80% of memory, but it is not clear how to exploit this
Fast and efficient compression algorithms exist!
12
Three FPC versions
- FPC only zeros – uses all eight patterns but does not code sign-extended negative numbers
- FPC simple – uses four patterns (zero run, 1-byte sign-extended, halfword sign-extended, and uncompressed)
- FPC simple on zeros – uses four patterns but does not allow sign-extended negative numbers
13
Address translation
[Diagram: uncompressed data, compressed data, and compressed fragmented data laid out side by side; each page carries a 2-bit block size vector (e.g., 01 00 11 10 ...), and the Block Size Table (BST) sits next to the TLB feeding an address calculator.]
A block is assigned one out of n predefined sizes; in this example n = 4.
OS changes:
- The block size vector is kept in the page table
- Each page is assigned one out of k predefined sizes; the physical address grows by log2(k) bits
The Block Size Table enables fast translation!
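A minimal sketch of the address calculation the BST enables, assuming the n = 4 size classes used in the evaluation (0, 22, 44, 64 bytes; block_offset is my own illustrative helper): a block's offset inside its compressed page is the sum of the coded sizes of the blocks before it, which hardware can compute with a small adder tree as the BST entry is read.

```c
#include <stdint.h>

/* n = 4 predefined block sizes; code 0 marks an all-zero block that
   occupies no memory at all. */
static const unsigned block_size_bytes[4] = { 0, 22, 44, 64 };

/* Offset of block 'block_index' inside its compressed page, given one
   2-bit size code per block (the page's block size vector). */
static unsigned block_offset(const uint8_t *size_codes, unsigned block_index) {
    unsigned offset = 0;
    for (unsigned i = 0; i < block_index; i++)
        offset += block_size_bytes[size_codes[i] & 0x3u];
    return offset;
}
```

For example, with size codes {01, 00, 11, 10}, block 3 starts at 22 + 0 + 64 = 86 bytes into the page.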
14
Size changes and compaction
[Diagram: a page split into sub-page 0 and sub-page 1, with slack after each block, after each sub-page, and at the end of the page; a block overflow grows a block into its slack, a block underflow shrinks it and leaves slack behind.]
Terminology: block overflow/underflow, sub-page overflow/underflow, page overflow/underflow.
15
Handling of overflows/underflows
- Block and sub-page overflows/underflows imply moving data within a page
- On a page overflow/underflow, the entire page is moved, to avoid having to move several pages
- Block and sub-page overflows/underflows are handled in hardware by an off-chip DMA engine
- On a page overflow/underflow, a trap is taken and the mapping for the page is changed
The processor has to stall if it accesses data that is being moved!
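The escalation can be pictured with a small sketch (the struct fields and absorb_growth are my own illustrative names, not the paper's hardware): growth is absorbed first by sub-page slack, then by page slack, and only a page overflow traps to the OS.

```c
#include <stdbool.h>

/* Illustrative per-page bookkeeping: free bytes left inside each
   sub-page and at the end of the page. */
struct page_state {
    unsigned subpage_slack[8];
    unsigned page_slack;
};

/* A block written back with a larger size code grows by 'growth' bytes.
   Returns true if the off-chip DMA engine can absorb the move in
   hardware; false signals a page overflow, which traps to the OS so the
   page can be remapped to a larger size class. */
static bool absorb_growth(struct page_state *p, unsigned subpage, unsigned growth) {
    if (growth <= p->subpage_slack[subpage]) {
        p->subpage_slack[subpage] -= growth;   /* block overflow: shift blocks
                                                  within the sub-page */
        return true;
    }
    if (growth <= p->page_slack) {
        p->page_slack -= growth;               /* sub-page overflow: shift
                                                  sub-pages within the page */
        return true;
    }
    return false;                              /* page overflow: OS trap */
}
```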
16
Putting it all together
[Diagram: the core with its L1/L2 caches plus the on-die BST and address calculator; off chip, a compressor/decompressor and a DMA engine sit in front of memory laid out as sub-page 0 and sub-page 1 of page 0.]
17
Experimental Methodology
Key issues to experimentally evaluate:
- Compressibility and impact of fragmentation
- Performance losses for the proposed approach
18
Architectural Parameters
Instr. issue:      4-way out-of-order
Exec. units:       4 int, 2 int mul/div
Branch pred.:      16k-entry gshare, 2k-entry BTB
L1 I-cache:        32 KB, 2-way, 2-cycle
L1 D-cache:        32 KB, 4-way, 2-cycle
L2 cache:          512 KB/thread, 8-way, 16 cycles
Memory latency:    150 cycles
Block lock-out:    4000 cycles
Subpage lock-out:  23000 cycles
Page lock-out:     23000 cycles

Predefined sizes (bytes):
Block:    0, 22, 44, 64
Subpage:  256, 512, 768, 1024
Page:     2048, 4096, 6144, 8192

Loads to a block only containing zeros can retire without accessing memory!
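That last point can be sketched as follows (bst_size_code and memory_read are stand-ins for the BST lookup and a DRAM access; size code 0 marks an all-zero block, as in the size table above):

```c
#include <stdint.h>
#include <string.h>

/* Stubs standing in for the BST lookup and a DRAM fetch. */
static unsigned bst_size_code(uintptr_t block_addr) { (void)block_addr; return 0; }
static void memory_read(uintptr_t addr, uint8_t *dst) { (void)addr; memset(dst, 0, 64); }

/* Load one 64 B block: if the BST says its size class is 0 (all zeros),
   the load retires immediately without touching memory. */
static void load_block(uintptr_t addr, uint8_t out[64]) {
    if (bst_size_code(addr) == 0)
        memset(out, 0, 64);     /* zero block: no DRAM access, no decompression */
    else
        memory_read(addr, out); /* fetch, then decompress (omitted) */
}
```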
19
Benchmarks
SPEC2000 ran with the reference input set; SAP and SpecJBB ran 4 billion instructions per thread.
SpecInt2000: Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perlbmk, Twolf, Vpr
SpecFP2000:  Ammp, Art, Equake, Mesa
Server:      SAP S&D, SpecJBB
20
Fragmentation - Results
21
Detailed Performance Results
22
Conclusion
It is possible to free up significant amounts of memory resources with virtually zero performance overhead. This was achieved by:
- exploiting zero-valued bytes, which account for as much as 55% of the memory contents
- leveraging a fast compression/decompression scheme
- a fast translation mechanism
- a hierarchical memory layout which offers some slack at the block, sub-page, and page levels
Overall, 30% of memory could be freed up at an average performance loss of 0.2%.