Skewed Compressed Cache. MICRO 2014. Somayeh Sardashti, David A. Wood. Computer Sciences Department, University of Wisconsin-Madison. Hello. My name is 안재형, and I will be presenting the MICRO 2014 SCC paper. (bows)
SCC. Off-chip access -> latency, bandwidth, power. LLC size is already large => increase effective capacity instead.
Cache Compression Observation : many cache lines have low dynamic range data.
SCC: Designing a Compressed Cache. (1) A compression algorithm to compress blocks. (2) A compaction mechanism to fit compressed blocks in the cache. *In general, SCC is independent of the compression algorithm in use.
= Early compressed caches stuck to a single compression ratio in exchange for fast lookup and relatively low metadata overhead. That ratio was too low, however, so effective capacity stayed low.
= Internal fragmentation = extra metadata, indirection.
Motivation. How can we design a compressed cache? (Design goal) 1. tightly compacting variable-size compressed blocks. 2. keeping tag and other metadata overhead low 3. allowing fast lookups. => Previous compressed cache designs failed to achieve all these goals.
Compressed Cache Taxonomy. [Figure labels: 64 -> 16; 0-4; VSC; compression size; IIC-C, DCC: pointer; superblock tag.] Ex) 16 tags per set in a 16-way associative cache. Each tag tracks a 4-block superblock and can map up to 4 cache blocks. Two questions: how to provide additional tags, and how to find the corresponding block given a matching tag.
SCC Key Observation. 1) Spatial locality (neighboring blocks tend to reside in the cache at the same time). 2) Compression locality (neighboring blocks tend to compress similarly).
SCC. [Figure: 48-bit physical address (PA); tag array with superblock tags; data array.] Compression factor (CF) encoding: CF = 2'b00: 64 bytes = 16 words; CF = 2'b01: 32B; CF = 2'b10: 16B; CF = 2'b11: 8B.
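The CF encoding on this slide can be sketched as a small lookup. This is a minimal illustration, not the paper's hardware: the function names are hypothetical, and the 16-byte size for CF = 0b10 is inferred from the 64 / 32 / 8 progression rather than stated on the slide.

```python
BLOCK_BYTES = 64  # uncompressed cache block size used throughout the slides


def compression_factor(compressed_bytes: int) -> int:
    """Return the 2-bit CF of the smallest supported size that still fits.

    CF 0 -> 64B, CF 1 -> 32B, CF 2 -> 16B (inferred), CF 3 -> 8B.
    """
    if not 0 < compressed_bytes <= BLOCK_BYTES:
        raise ValueError("compressed size must be in 1..64 bytes")
    for cf in (3, 2, 1, 0):  # try the tightest size first
        if compressed_bytes <= BLOCK_BYTES >> cf:
            return cf
    raise AssertionError("unreachable: CF 0 always fits")


def blocks_per_entry(cf: int) -> int:
    """How many blocks of this CF share one 64-byte data entry."""
    return 1 << cf
```

Under these assumptions a block that compresses to 20 bytes is stored with CF = 1 (a 32-byte slot), which is where the internal fragmentation of a fixed set of sizes comes from.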
SCC
SuperBlock Cache
16-way set-associative cache. Address: 48 bits (A47 down to A0). [Figure labels: cache block, subblock, 4-way set associative.]
SuperBlock. 1 superblock = 8 contiguous blocks = 64 bytes x 8 = 512B.
= Unlike a large block, only one block is fetched at a time, and the valid bit and other state are kept separately for each block.
Address bits: [47:11] | [10:9] | [8:6] Block ID | [5:0] Byte Select (6 bits -> 64B).
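The address layout on this slide can be sketched as a field split. The field boundaries follow the bit positions listed on the slide; the names `SplitAddress`, `upper`, and `a10_a9` are hypothetical labels chosen for illustration.

```python
from typing import NamedTuple


class SplitAddress(NamedTuple):
    upper: int        # A47..A11
    a10_a9: int       # A10..A9
    block_id: int     # A8..A6, which of the 8 blocks in the superblock
    byte_select: int  # A5..A0, byte within the 64-byte block


def split_address(addr: int) -> SplitAddress:
    """Split a 48-bit physical address into the fields shown on the slide."""
    if not 0 <= addr < (1 << 48):
        raise ValueError("expected a 48-bit physical address")
    return SplitAddress(
        upper=addr >> 11,
        a10_a9=(addr >> 9) & 0b11,
        block_id=(addr >> 6) & 0b111,
        byte_select=addr & 0x3F,
    )
```

Three block-ID bits cover the 8 blocks of a superblock, and six byte-select bits cover the 64 bytes of a block, which together give the 512B superblock.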
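Way group selection. [Figure: address bits [47:11] | [10:9] | [8:6] | [5:0]; A10A9 xor CF selects the way group; superblock tag.]
= On a write, the compression factor is known, so the way group can be chosen directly. On a read, the compression factor of the requested data is not known.
= The compression factor can be recovered by inverting the operation on A10A9 and each way group. For example, once A's tag is found (superblock tag match and valid bit), the way group it was found in gives the compression factor.
= Store: write, cache miss. Load: lookup.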
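The XOR relation described in these notes can be sketched as follows. This is a minimal illustration assuming four way groups and a 2-bit CF; the function names are hypothetical.

```python
def way_group(addr: int, cf: int) -> int:
    """Write path: the CF is known, so pick the way group as A10A9 xor CF."""
    return ((addr >> 9) & 0b11) ^ (cf & 0b11)


def cf_from_way_group(addr: int, group: int) -> int:
    """Read path: invert the XOR to get the CF a block would have
    if it were stored in the given way group."""
    return ((addr >> 9) & 0b11) ^ (group & 0b11)
```

Because XOR is its own inverse, each of the four way groups corresponds to exactly one CF for a given address, so a tag hit in a way group reveals the CF without storing it.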
[Address bits: [47:11] | [10:9] | [8:6] | [5:0].] = Compressed blocks are placed inside a data entry in superblock order, so that they can be located easily.
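Placing blocks in superblock order means the position inside a 64-byte data entry follows from the block ID and the CF alone. A minimal sketch, assuming sub-block sizes of 64 >> CF bytes and hypothetical function names:

```python
def byte_offset(addr: int, cf: int) -> int:
    """Byte offset of a compressed block inside its 64-byte data entry."""
    block_id = (addr >> 6) & 0b111   # A8..A6
    blocks_in_entry = 1 << cf        # 1, 2, 4 or 8 blocks share an entry
    subblock_bytes = 64 >> cf        # 64, 32, 16 or 8 bytes each
    # The low CF bits of the block ID pick the slot, in superblock order.
    return (block_id % blocks_in_entry) * subblock_bytes
```

For example, block 5 of a superblock sits at offset 40 when CF = 3 (slot 5 of eight 8-byte slots) and at offset 16 when CF = 2 (slot 1 of four 16-byte slots).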
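Example. [Address bits: [47:11] | [10:9] | [8:6] | [5:0].]
= Why bother with a hash function? For the skewed cache.
= A47-A11 are used for skewing, so that each superblock maps differently. Open question: what about adjacent superblocks?
= Blocks with the same compression ratio should get the same index, so that they pack as tightly as possible. When blocks compress less, they should spread out as much as possible without conflict misses.
= Why are A9 and A10 not considered? Because it is 4-way set associative; those bits already go through the way group in the first place.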
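The indexing idea in these notes can be sketched as below. Which block-ID bits enter the hash for each CF is an assumption made for illustration (drop the low CF bits of the block ID so that neighbors sharing a data entry agree), and the mixing function is a placeholder, since the actual hash functions are not given in these notes. `NUM_SETS`, `index_bits`, and `set_index` are hypothetical names.

```python
NUM_SETS = 8  # hypothetical, matching the 8-set example cache


def index_bits(addr: int, cf: int) -> int:
    """Bits fed to the hash: A47..A11 plus the block-ID bits that
    distinguish data entries at this CF. A10A9 are deliberately skipped."""
    upper = addr >> 11
    block_id = (addr >> 6) & 0b111
    kept = block_id >> cf            # CF 3 keeps none, CF 0 keeps all three
    return (upper << (3 - cf)) | kept


def set_index(addr: int, cf: int, num_sets: int = NUM_SETS) -> int:
    """Placeholder per-CF hash; stands in for the real hash functions."""
    x = index_bits(addr, cf)
    x = (x ^ (x >> 7) ^ (x >> 13)) * (2 * cf + 3)
    return x % num_sets
```

With this choice, the eight blocks of a superblock all hash to one set when CF = 3 and pack into a single entry, while uncompressed blocks (CF = 0) feed eight different inputs to the hash and can spread across sets.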
2-way Skewed Cache. Index each way with a different hash function, spreading out accesses and reducing conflict misses.
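A toy model can make the skewing effect concrete. This is a minimal sketch under stated assumptions: the class name, the two index functions, and the replacement choice are all hypothetical and are not taken from the paper.

```python
class SkewedCache2Way:
    """Toy 2-way skewed cache: each way has its own index function."""

    def __init__(self, num_sets: int = 8):
        self.num_sets = num_sets
        self.ways = [[None] * num_sets for _ in range(2)]

    def _index(self, way: int, block_addr: int) -> int:
        if way == 0:
            return block_addr % self.num_sets          # conventional index
        # Way 1 mixes in higher address bits (placeholder hash).
        return (block_addr ^ (block_addr // self.num_sets)) % self.num_sets

    def lookup(self, block_addr: int) -> bool:
        return any(
            self.ways[w][self._index(w, block_addr)] == block_addr
            for w in range(2)
        )

    def insert(self, block_addr: int) -> None:
        for w in range(2):
            i = self._index(w, block_addr)
            if self.ways[w][i] in (None, block_addr):
                self.ways[w][i] = block_addr
                return
        # Both candidate slots are taken: evict from way 0 (arbitrary choice).
        self.ways[0][self._index(0, block_addr)] = block_addr
```

Block addresses 0, 8, and 16 all collide in set 0 of way 0, so a conventional 2-way cache with 8 sets could keep only two of them. Here way 1 maps them to different sets, so all three can be resident at once.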
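SCC. 16-way cache with 8 cache sets, divided into 4 way groups. 64-byte cache blocks, 8-block superblocks (1, 2, 4 or 8 subblocks). Separate sparse super-block tags.
= The address and the CF determine the way group and the set index where a block is stored, so no separate metadata about the compression ratio is needed. The location itself tells us.
= For convenience, the figure shows each block allocated to the first way of its way group. In reality, each way group is itself a skewed cache.
= Looking at block X, the block offset is preserved as-is, so no metadata is needed to track where the block sits in the data array. The datapath gets simpler, lookups are fast, area is smaller, the design is easier, and multiple compression ratios are still supported.
= Because the cache is split into way groups, a block cannot use all 16 ways. Conflict misses increase, hence the skewed cache.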
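SCC.
= Lookup: the CF is not known at first, so every way group has to be checked. The set index depends on the CF, so for each way group the CF is first derived with the formula above and then used for the lookup (e.g., Fig. 2.A).
= If the tag is present and the block is valid, it is a hit. The sub-block location is computed from that CF and the byte offset of Eq. (3), and the data is read.
= Write: blocks are uncompressed in L1 or L2 and are compressed at the LLC, so the CF is known on a write. What about a cache miss? The block is compressed when it is brought from memory into the LLC, so the CF is known then as well.
= The CF fixes the way group. If the target set is full, an eviction is needed, and the victim is chosen by LRU within the way group.
= Ex) write-back to an inclusive LLC.
= Difference from DCC: DCC may have to evict blocks that belong to other superblocks, and when a superblock is evicted, all of its blocks must be evicted with it. SCC is therefore simpler to implement.
* 97% of updated blocks fit in their original place.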
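The lookup and fill flow in these notes can be sketched end to end. This is a toy model under several assumptions that are not in the slides: one way per way group (as the figure shows for convenience), a placeholder hash, no LRU, and an entry owner made of the superblock address plus the block-ID bits that select the entry. `ToySCC` and the helper names are hypothetical.

```python
NUM_SETS = 8  # hypothetical


def _a10a9(addr):
    return (addr >> 9) & 0b11


def _block_id(addr):
    return (addr >> 6) & 0b111


def _set_index(addr, cf):
    bits = ((addr >> 11) << (3 - cf)) | (_block_id(addr) >> cf)
    return (bits * (2 * cf + 3)) % NUM_SETS  # placeholder hash


def _owner(addr, cf):
    # Superblock address (512B granularity) plus the entry-selecting bits.
    return (addr >> 9, _block_id(addr) >> cf)


class ToySCC:
    def __init__(self):
        # (way_group, set_index) -> (owner, set of valid block IDs)
        self.entries = {}

    def fill(self, addr, cf):
        """Write or miss fill: the CF is known, so the location is fixed."""
        key = (_a10a9(addr) ^ cf, _set_index(addr, cf))
        owner, blocks = self.entries.get(key, (None, set()))
        if owner != _owner(addr, cf):
            blocks = set()  # evict whatever occupied this entry
        blocks.add(_block_id(addr))
        self.entries[key] = (_owner(addr, cf), blocks)

    def lookup(self, addr):
        """Probe all four way groups; return (cf, byte_offset) or None."""
        for group in range(4):
            cf = _a10a9(addr) ^ group  # the CF this way group implies
            entry = self.entries.get((group, _set_index(addr, cf)))
            if entry and entry[0] == _owner(addr, cf) and _block_id(addr) in entry[1]:
                return cf, (_block_id(addr) % (1 << cf)) * (64 >> cf)
        return None
```

In hardware the four probes would proceed in parallel; the loop here only shows that each way group is probed with its own implied CF and index, and that a hit yields the CF and the byte offset with no extra metadata.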
Area Overhead. [Slide labels: 1 bit; FixedC; 3 bits per block; VSC; to locate BPE per subblock in a set.] SCC only needs LRU state for the tags, and no extra data to locate subblocks: only tag addresses, LRU state, and per-block coherence states.
Baseline: conventional 16-way 8MB LLC
FixedC: doubles the number of tags; compresses only to half the size
VSC: 0-4 16B subblocks
DCC4-16: 0-4 16B subblocks
SCC8-8: 0-8 8B subblocks
Methodology. GEMS simulator; CACTI 6.5 (area and power at 32nm). Different applications from SPEC OMP, PARSEC, commercial workloads, and SPEC CPU 2006. Mixes of multi-programmed workloads built from memory-bound and compute-bound SPEC CPU 2006 benchmarks.
= Increasing order for the Baseline.
= Caches are warmed up; results are averaged over multiple runs.
Baseline: conventional 16-way 8MB LLC
2X Baseline: conventional 32-way 16MB LLC
Evaluation: MPKI. 2X Baseline: 15% improvement on average. SCC: 13% on average.
Evaluation: Energy. SCC improves system energy by up to 20%, and by 6% on average.
Conclusion. SCC achieves performance comparable to that of a conventional cache with twice the capacity and associativity, with only 1.5% area overhead.
= Area overhead: SCC 1.5% vs. DCC 6.8%.
= Lower design complexity: the replacement mechanism is simpler than DCC's.
SCC's replacement mechanism is much simpler than that needed by DCC. In DCC, allocating space for a block can trigger the eviction of several blocks, sometimes belonging to different super-blocks. In case of a super-block miss, all blocks associated with the victim super-block tag must be evicted, unlike SCC that evicts only blocks belonging to a particular data entry. In addition, in DCC, blocks belonging to other super-blocks may need to be evicted too. Thus, determining which block or super-block is best to replace in DCC is very complex. SCC also never needs to evict a block on a super-block hit, while DCC may. SCC will allocate the missing block in its corresponding data entry, which is guaranteed to have enough space since the compression factor is used as part of the search criteria. In DCC, a super-block hit does not guarantee that there is any free space in the data array.
FixedC
VSC
DCC
SCC
Sector Cache
Cache Compression [Goal] Fast (low decompression latency) Simple (avoid complex hardware changes) Effective (good compression ratio)
Motivation. Off-chip memory latency is high -> a larger cache reduces misses, at the cost of more area and power. Off-chip memory accesses require high energy -> a larger cache reduces accesses to off-chip memory. Off-chip interconnect bandwidth is limited -> larger cache (last level).