ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors Mohammad Hammoud, Sangyeun Cho, and Rami Melhem Presenter: Socrates Demetriades Dept. of Computer Science University of Pittsburgh
Tiled CMP Architectures. Tiled CMP architectures have recently been advocated as a scalable design. They replicate identical building blocks (tiles) connected over a switched network-on-chip (NoC). A tile typically incorporates a private L1 cache and an L2 cache bank. A traditional practice in CMP caches is to logically share the physically distributed L2 banks: the shared scheme.
The home tile of a cache block B is designated by the home select (HS) bits of B's physical address. Tile T1 requests B: L2 miss. B is fetched from main memory and mapped at its home tile (together with its directory info). Pros: high capacity utilization; simple coherence enforcement (needed only for the L1 caches).
Shared Scheme: Latency Problem (Cons). Access latencies to L2 banks differ depending on the distances between requester cores and target banks. This design is referred to as a Non-Uniform Cache Architecture (NUCA).
NUCA Solution: Block Migration. Idea: move accessed blocks closer to the requesting cores (block migration). Example (HS of B = 1111, i.e., T15): T0 requests block B, total hops = 14. B is migrated from T15 to T0. T0 requests B again: local hit, total hops = 0.
NUCA Solution: Block Migration. HS of B = 0110 (T6). Before migration: T3 requests B (hops = 6), T0 requests B (hops = 8), T8 requests B (hops = 8); total hops = 22. Assume B is migrated to T3. After migration: T3 requests B (hops = 0), T0 requests B (hops = 11), T8 requests B (hops = 13); total hops = 24. Though T3 saved 6 hops, in total there is a loss of 2 hops.
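To make the hop arithmetic concrete, here is a minimal sketch that reproduces the slide's numbers under assumptions of mine (not stated on the slide): tiles are numbered row-major on the 4x4 mesh, a one-way traversal costs the Manhattan distance plus one hop, a remote hit is a round trip, and an access after migration (with no location mechanism yet) detours through the home tile as a 3-way transfer.

```python
# Worked check of the hop counts on a 4x4 mesh (assumed model: row-major tile
# numbering, one-way cost = Manhattan distance + 1, remote hit = round trip,
# post-migration access = 3-way transfer via the home tile).

def dist(a, b, width=4):
    """Manhattan distance between tiles a and b on a width x width mesh."""
    return abs(a % width - b % width) + abs(a // width - b // width)

def round_trip(requester, bank):
    """Hops for a remote L2 hit; a local hit costs nothing."""
    return 0 if requester == bank else 2 * (dist(requester, bank) + 1)

def three_way(requester, home, host):
    """Hops when the request detours through the home tile to reach the host."""
    if requester == host:
        return 0
    legs = [(requester, home), (home, host), (host, requester)]
    return sum(dist(a, b) + 1 for a, b in legs)

# Before migration, B sits at its home tile T6.
print([round_trip(t, 6) for t in (3, 0, 8)])              # [6, 8, 8]   -> total 22
# After migrating B to T3, every other sharer pays the detour through T6.
print([three_way(t, home=6, host=3) for t in (3, 0, 8)])  # [0, 11, 13] -> total 24
```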
Our work. Collect information about the tiles (sharers) that have accessed a block B. Rely on the past to predict the future: a core that accessed a block in the past is likely to access it again. Migrate B to a tile (host) that minimizes the overall number of NoC hops needed.
Talk roadmap: predicting the optimal host location; locating migratory blocks (the cache-the-cache-tag policy); replacement policy upon migration (the swap-with-the-lru policy); quantitative evaluation; conclusion and future work.
Predicting Optimal Host Location. Keeping a cache block B at its home tile might not be optimal, and the best host location of B is not known until runtime. Adaptive Controlled Migration (ACM): keep a pattern recording which tiles have accessed B; at runtime (after a specific migration frequency level is reached for B), compute the best host for B by finding the tile that minimizes the total latency cost among B's sharers.
ACM: A Working Example. Tiles 0 and 6 are sharers. Case 1: Tile 3 is the host, total latency cost = 14. Case 2: Tile 15 is the host, total latency cost = 22. Case 3: Tile 2 is the host, total latency cost = 10. Case 4: Tile 0 is the host, total latency cost = 8. Select T0.
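Under the same assumed hop model as the earlier sketch, the host-selection step can be viewed as a search over candidate hosts for the minimum total latency cost; choose_host() and the candidate list below are illustrative, not the paper's code.

```python
# Minimal sketch of ACM's host selection: pick the tile that minimizes the
# total round-trip latency cost summed over the sharers of block B.
# Assumes the same hop model as the previous sketch (an inference of mine).

def dist(a, b, width=4):
    return abs(a % width - b % width) + abs(a // width - b // width)

def cost(sharer, host):
    return 0 if sharer == host else 2 * (dist(sharer, host) + 1)

def choose_host(sharers, candidates):
    return min(candidates, key=lambda h: sum(cost(s, h) for s in sharers))

sharers = (0, 6)
for h in (3, 15, 2, 0):                          # the four cases on the slide
    print(h, sum(cost(s, h) for s in sharers))   # 14, 22, 10, 8
print(choose_host(sharers, (3, 15, 2, 0)))       # -> 0, i.e., select T0
```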
Locating Migratory Blocks. After a cache block B is migrated, the HS bits of B's physical address can no longer be used to locate B on a subsequent access. Assume B has been migrated from its home tile T4 (HS of B = 0100) to a new host tile T7. T3 requests B and sees a false L2 miss at T4. A tag can be kept at T4 to point to T7, yielding a 3-way cache-to-cache transfer (T3, T4, and T7). Deficiencies: the migration becomes useless, and the scheme fails to exploit distance locality.
Locating Migratory Blocks: cache-the-cache-tag Policy. Idea: cache the tag of block B at the requester's tile (within a data structure referred to as the MT table). HS of B = 0100 (T4). First access: T3 requests B and looks up its MT table before reaching B's home tile; on an MT miss, a 3-way communication follows and T3 caches B's tag in its MT table. Second and later accesses: T3 requests B, looks up its MT table, gets an MT hit, and fetches B directly from its current host.
Locating Migratory Blocks: cache-the-cache-tag Policy. The MT table of a tile T can now hold two types of tags: a tag for each block B whose home tile is T but which has been migrated to another tile (local entry), and tags that track the locations of migratory blocks recently accessed by T whose home tile is not T (remote entries). The MT table replacement policy: replace an invalid tag first, then the LRU remote entry. The remote and local MT tags of B are kept consistent by extending the local entry of B at B's home tile with a bit mask indicating which tiles have cached corresponding remote entries.
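A minimal sketch of how a per-tile MT table might be modeled; the class name MTTable, its capacity, and the fallback used when no remote entry exists are my assumptions, not details from the paper.

```python
# Sketch of a per-tile MT table: maps a block address to the tile currently
# hosting it. Entries whose home tile is this tile are "local"; others are
# "remote". Replacement uses a free slot if one exists, else the LRU remote entry.

from collections import OrderedDict

class MTTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()            # addr -> (host_tile, is_local); LRU first

    def lookup(self, addr):
        """Return the host tile on an MT hit, or None on an MT miss."""
        if addr in self.entries:
            self.entries.move_to_end(addr)      # refresh LRU position
            return self.entries[addr][0]
        return None

    def insert(self, addr, host_tile, is_local):
        if addr not in self.entries and len(self.entries) >= self.capacity:
            victim = next((a for a, (_, loc) in self.entries.items() if not loc), None)
            if victim is None:
                victim = next(iter(self.entries))   # assumption: fall back to global LRU
            del self.entries[victim]
        self.entries[addr] = (host_tile, is_local)
        self.entries.move_to_end(addr)

# First access by T3: MT miss -> 3-way transfer via B's home tile, then cache the tag.
mt_t3 = MTTable(capacity=4)
assert mt_t3.lookup(0xB) is None
mt_t3.insert(0xB, host_tile=7, is_local=False)
assert mt_t3.lookup(0xB) == 7                   # later accesses: MT hit, direct fetch
```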
Replacement Policy Upon Migration: swap-with-the-lru Policy. After the ACM algorithm predicts the optimal host H for a block B, a decision must be made about which block to replace at H when B is migrated there. Idea: swap B with the LRU block at H (the swap-with-the-lru policy). The LRU block at H could be either a migratory or a non-migratory one. The swap-with-the-lru policy is very effective, especially for workloads whose working sets are large relative to the L2 banks (it bears similarity to victim replication but is more robust).
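A minimal sketch of the swap-with-the-lru idea, assuming each L2 bank is modeled as an LRU-ordered map; L2Bank and migrate() are illustrative names, and sending the victim back into B's old slot is my reading of "swap".

```python
# Sketch of swap-with-the-lru: when block B migrates to host bank H, the LRU
# block of H (migratory or not) takes B's old slot instead of being evicted.

from collections import OrderedDict

class L2Bank:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()             # addr -> data, LRU first / MRU last

def migrate(addr, src, host):
    data = src.blocks.pop(addr)                 # remove B from its current bank
    if len(host.blocks) >= host.capacity:
        victim_addr, victim_data = host.blocks.popitem(last=False)  # LRU at host
        src.blocks[victim_addr] = victim_data   # swap: victim moves into B's old slot
    host.blocks[addr] = data                    # B is installed MRU at the host

# Usage: migrate block 0xB from its home bank to the predicted host bank.
home_bank, host_bank = L2Bank(2), L2Bank(2)
home_bank.blocks[0xB] = "data_B"
host_bank.blocks[0x1], host_bank.blocks[0x2] = "d1", "d2"
migrate(0xB, src=home_bank, host=host_bank)
assert 0xB in host_bank.blocks and 0x1 in home_bank.blocks
```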
Quantitative Evaluation: Methodology and Benchmarks. We simulate a 16-way tiled CMP. Simulator: Simics (Solaris OS). Cache line size: 64 bytes. L1 I/D size/ways/latency: 16KB / 2 ways / 1 cycle. L2 size/ways/latency: 512KB per bank / 16 ways / 6 cycles. Latency per hop: 5 cycles. Memory latency: 300 cycles. Migration frequency level: 10.
Benchmarks (name: input):
SPECjbb: Java HotSpot(TM) server VM 1.5, 4 warehouses
Lu: 1024x1024 (16 threads)
Ocean: 514x514 (16 threads)
Radix: 2M integers (16 threads)
Barnes: 16K particles (16 threads)
Parser, Art, Equake, Mcf, Ammp, Vortex: reference inputs
MIX1: Vortex, Ammp, Mcf, and Equake
MIX2: Art, Equake, Parser, and Mcf
Quantitative Evaluation: Single-threaded and Multiprogramming Results. VR successfully offsets its L2 miss-rate increase with fast replica hits for all the single-threaded benchmarks, but fails to do so for MIX1 and MIX2 (poor capacity utilization). For single-threaded workloads, ACM generates on average 20.5% and 3.7% better AAL (average L2 access latency) than S and VR, respectively. For multiprogramming workloads, ACM generates on average 2.8% and 31.3% better AAL than S and VR, respectively, while maintaining efficient capacity utilization.
Quantitative Evaluation: Multithreaded Results. An increase in the degree of sharing suggests that the capacity occupied by replicas could increase significantly, leading to a decrease in the effective L2 cache size. ACM exhibits AALs that are on average 27% and 37.1% better than S and VR, respectively.
Quantitative Evaluation: Avg. Memory Access Cycles Per 1K Instr. ACM performs on average 18.6% and 2.6% better than S for the single-threaded and multiprogramming workloads, respectively. ACM performs on average 20.7% better than S for multithreaded workloads. VR performs on average 15.1% better than S and 38.4% worse than S for the single-threaded and multiprogramming workloads, respectively. VR performs on average 19.6% worse than S for multithreaded workloads.
Quantitative Evaluation: ACM Scalability. As the number of tiles on a CMP platform increases, the NUCA problem is exacerbated. ACM is independent of the underlying platform and always selects hosts that minimize AAL; more exposure to the NUCA problem therefore translates into a larger benefit from ACM. For the simulated benchmarks: with a 16-way CMP, ACM improves AAL by 11.6% over S; with a 32-way CMP, ACM improves AAL by 56.6% on average over S.
Quantitative Evaluation: Sensitivity to MT Table Sizes. With half (50%) and quarter (25%) MT table sizes, relative to the regular L2 cache bank size, ACM increases AAL by 5.9% and 11.3%, respectively, over the base configuration (100%, i.e., an MT table identical in size to the L2 cache bank).
Quantitative Evaluation: Sensitivity to L2 Cache Sizes. ACM maintains an AAL improvement of 39.7% over S across the tested L2 cache sizes, whereas VR fails to demonstrate such stability.
Conclusion. This work proposes ACM, a strategy for managing CMP NUCA caches. ACM offers better average L2 access latency than the traditional shared NUCA scheme (20.4% better on average) while maintaining NUCA's L2 miss rate. ACM proposes a robust location strategy (cache-the-cache-tag) that can work with any NUCA migration scheme. ACM demonstrates the usefulness of the migration technique in the CMP context.
Future work. Improve the ACM prediction mechanism. Currently, cores are treated equally: we consider only 0-1 weights, assigning 1 to a core that accessed block B and 0 to one that didn't. Improvement: reflect the non-uniformity in cores' access weights (a trade-off between access weights and storage overhead). Also, propose an adaptive mechanism for selecting migration frequency levels.
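The weighted-cost direction above could look like the following sketch; the per-core access counts, the cost formula, and weighted_choose_host() are hypothetical illustrations of the idea, not something evaluated in the paper.

```python
# Hypothetical sketch of the proposed improvement: weight each sharer's
# distance by how often it accessed B, instead of the current 0/1 weights.

def dist(a, b, width=4):
    return abs(a % width - b % width) + abs(a // width - b // width)

def weighted_choose_host(access_counts, tiles=range(16)):
    """access_counts: {tile_id: number of accesses to block B}."""
    return min(tiles, key=lambda h: sum(w * dist(t, h)
                                        for t, w in access_counts.items()))

# If T0 touched B nine times and T6 only once, the host gravitates toward T0.
print(weighted_choose_host({0: 9, 6: 1}))   # -> 0
```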
ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors. M. Hammoud, S. Cho, and R. Melhem. Special thanks to Socrates Demetriades, Dept. of Computer Science, University of Pittsburgh. Thank you!