1
Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy
Snehasish Kumar, Hongzhou Zhao†, Arrvindh Shriraman Eric Matthews∗, Sandhya Dwarkadas†, Lesley Shannon∗ School of Computing Sciences, Simon Fraser University †Department of Computer Science, University of Rochester ∗School of Engineering Science, Simon Fraser University IEEE/ACM 45th Annual International Symposium on Microarchitecture
2
Overview Problem being addressed Prior work Challenges encountered
Proposed solution Results Review of the paper
3
Problem Statement Current caches are designed for fixed block size
Block granularity is chosen based on the average spatial locality across general workloads
Many common applications exhibit considerably lower spatial locality than this design point
Unused words occupy between 17% and 80% of a 64K L1 cache and between 1% and 79% of a 1MB private LLC
Unused-word transfers account for 11% of on-chip cache hierarchy energy consumption
4
Prior Work Sector Cache: Aims at minimizing bandwidth
Fetches sub-blocks on demand
Works well for applications with low-to-moderate spatial locality
Reduces mispredicted spatial prefetches, thus reducing bandwidth usage and energy consumption
5
Prior Work
6
Challenges to static design
Small cache lines fetch fewer unused words
But they impose significant performance penalties by missing opportunities for spatial prefetching in applications with high spatial locality
Large cache lines effectively prefetch neighboring words
But they increase the number of unused words and the network bandwidth consumed
Determining a fixed optimal cache line granularity at hardware design time is therefore a challenge
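The tension above can be sketched with a simple model (an illustration, not the paper's methodology): count cold misses and unused fetched words for two line sizes on the same trace, assuming an infinite cache.

```python
# Illustrative sketch: cold-miss count vs. unused words for a given
# line size, assuming an infinite (never-evicting) cache.

def cold_misses_and_unused(trace, line_words):
    seen_lines, used = set(), {}
    misses = 0
    for w in trace:
        line = w // line_words
        if line not in seen_lines:
            seen_lines.add(line)
            misses += 1
        used.setdefault(line, set()).add(w % line_words)
    unused = sum(line_words - len(s) for s in used.values())
    return misses, unused

# Sequential trace: large lines prefetch well (fewer misses, no waste).
seq = list(range(32))
# Sparse trace: large lines waste most fetched words.
sparse = list(range(0, 256, 16))

print(cold_misses_and_unused(seq, 2), cold_misses_and_unused(seq, 16))
# (16, 0) (2, 0)
print(cold_misses_and_unused(sparse, 2), cold_misses_and_unused(sparse, 16))
# (16, 16) (16, 240)
```

No single line size wins both cases: 16-word lines cut sequential misses 8x but fetch 240 wasted words on the sparse trace, which is exactly why a fixed design point is hard to pick.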
7
Amoeba Cache A novel cache architecture that supports fine-grained (per-miss) dynamic adjustment of the cache block size and the # of blocks per set
Filters out unused words in a block and prevents them from being inserted into the cache, allowing the freed space to hold other useful blocks
Adapts to the available spatial locality
8
Amoeba Cache
9
Amoeba Cache How to grow and shrink the # of tags as the # of blocks per set varies with block granularity?
Eliminates the tag array
Tags and data are kept together in a single data array
A tag bitmap indicates which words in the data array are tags
Valid bits are also stored as a bitmap
10
Amoeba Cache
11
Amoeba Cache Data lookup:
The tag bitmap activates the words in the array that contain tags for comparison
The minimum size of an Amoeba block is 2 words (1 tag, 1 data), so adjacent words cannot both be tags
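A minimal sketch of this lookup within one set, assuming each tag word is modeled as a (region_tag, start, end) tuple; the field layout and function names are illustrative, not the paper's exact encoding.

```python
# Sketch of Amoeba-style lookup in one set: a flat word array plus a
# tag bitmap marking which words hold tags. Each tag covers the data
# words that immediately follow it.

def lookup(word_array, tag_bitmap, region_tag, word_offset):
    """Return the index of the matching data word, or None on miss."""
    for i, is_tag in enumerate(tag_bitmap):
        if not is_tag:
            continue
        rtag, start, end = word_array[i]
        # Hit if the region matches and the offset lies in [start, end).
        if rtag == region_tag and start <= word_offset < end:
            return i + 1 + (word_offset - start)  # data follows the tag
    return None

# One Amoeba block holding words 2..5 of region 7, placed at index 0.
words = [(7, 2, 6), "w2", "w3", "w4", "w5", None, None, None]
tags  = [1, 0, 0, 0, 0, 0, 0, 0]
print(lookup(words, tags, 7, 3))  # 2  (word "w3")
print(lookup(words, tags, 7, 6))  # None (offset outside the block)
```

In hardware the bitmap would gate which array words drive the comparators in parallel; the loop here is just the sequential analogue.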
12
Amoeba Cache Block Insertion:
The valid bitmap is used to determine empty slots within the set: 1 means allocated, 0 means empty
For an incoming block of m words, a run of m consecutive 0s is searched for
The replacement algorithm is triggered repeatedly until enough space is created
To reclaim space from an Amoeba block, the tag and valid bits corresponding to the block are cleared in the bitmaps
Uses an LRU policy to choose a way and picks a random candidate from within the set for block replacement
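The free-slot search over the valid bitmap can be sketched as a first-fit scan for m consecutive 0s (the first-fit choice is an assumption for illustration; on failure the real design would invoke replacement and retry):

```python
# Sketch of the free-slot search: scan the set's valid bitmap for m
# consecutive 0s, first fit. Returns the start index, or None if the
# caller must trigger replacement and retry.

def find_free_run(valid_bitmap, m):
    run_start, run_len = 0, 0
    for i, bit in enumerate(valid_bitmap):
        if bit == 0:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len == m:
                return run_start
        else:
            run_len = 0
    return None

bitmap = [1, 1, 0, 1, 0, 0, 0, 1]  # 1 = allocated, 0 = empty
print(find_free_run(bitmap, 3))  # 4
print(find_free_run(bitmap, 4))  # None -> evict, then search again
```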
13
Partial Misses: Low probability (5 in 1K accesses)
Identify overlapping blocks
Evict them to the MSHR
Allocate space for the entire block
Issue the miss request
Copy in the returned block
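The steps above can be modeled abstractly as interval merging; this is a behavioral sketch only (blocks as (start, end) word ranges, MSHR handling collapsed into list bookkeeping), not the hardware flow.

```python
# Behavioral sketch of a partial miss: cached sub-blocks overlapping
# the requested range are pulled out (to an MSHR in hardware) and
# merged with the refill into one contiguous block.

def service_partial_miss(request, cached_blocks):
    lo, hi = request
    overlapping = [b for b in cached_blocks if b[0] < hi and b[1] > lo]
    remaining = [b for b in cached_blocks if b not in overlapping]
    # The merged block covers the request plus every evicted overlap.
    merged_lo = min([lo] + [b[0] for b in overlapping])
    merged_hi = max([hi] + [b[1] for b in overlapping])
    return (merged_lo, merged_hi), remaining

# Request words 2..6; block (0, 3) overlaps, block (8, 10) does not.
blk, rest = service_partial_miss((2, 6), [(0, 3), (8, 10)])
print(blk, rest)  # (0, 6) [(8, 10)]
```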
14
Results Result 1: Amoeba-Cache increases effective cache capacity by harvesting space from unused words and achieves an 18% reduction in both L1 and L2 miss rates. Result 2: Amoeba-Cache adaptively sizes the cache block granularity, reducing L1↔L2 bandwidth by 46% and L2↔Memory bandwidth by 38%.
15
Results
16
Results Result 3: Boosts performance by 10% on commercial applications, saving 11% of on-chip memory hierarchy energy. The off-chip L2↔memory interface sees a mean energy reduction of 41% across all workloads.
17
Review of the paper Connects proposed work with prior work
Builds on the proposed idea gradually with sufficient examples
Algorithms are explained well with control-flow diagrams
Plenty of comparative graphs support the results
The maximum region size (RMAX) is stated differently in the text (bytes) and in a diagram (words)
The reason for using the metric 1/(MissRate×Bandwidth) to determine block granularity could have been supported better
18
Thank you!