Improving Multi-Core Performance Using Mixed-Cell Cache Architecture

Presentation transcript:

Improving Multi-Core Performance Using Mixed-Cell Cache Architecture Samira Khan*†, Alaa R. Alameldeen*, Chris Wilkerson*, Jaydeep Kulkarni* and Daniel A. Jiménez§ *Intel Labs †Carnegie Mellon University §Texas A&M University

Summary
Problem: Cache cells become unreliable at low voltage
Mixed-cell cache: Use some larger robust cells [Ghasemi 2011]
Smaller non-robust cells are turned off at low voltage
Capacity loss leads to performance loss
Goal: No capacity loss at low voltage to gain high performance
Observation: A clean line has a duplicate copy in the memory hierarchy; a modified line is the only existing copy
Our Approach: Protect a modified line in larger robust cells; store a clean line in smaller non-robust cells; fetch data from the lower level on an error in a clean line
Significantly improves performance and reduces power compared to prior work

Today's systems are power-limited. Within a given power budget, we can activate more cores if we lower the voltage. The problem is that caches become unreliable at low voltage. Prior work has proposed the mixed-cell cache to solve this problem. A mixed-cell cache has some larger robust cells that remain reliable even at low voltage. At higher voltage, both robust and non-robust cells are operational; at low voltage, the regular cells are turned off. This capacity loss leads to significant performance loss at low voltage. The goal of this work is to gain high performance by using the whole cache. The intuition is that a clean line has extra copies in the memory hierarchy, so there is natural redundancy of data in clean lines, but a modified line is the only existing copy, so it is critical and needs to be protected. We propose to use the whole cache by protecting modified data in the robust cells, so that there is no error, and storing clean data in the non-robust cells. On an error in a clean line, we propose to fetch the data from the lower level of the memory hierarchy. Our design significantly improves performance and reduces power compared to prior work.

Outline
Summary
Background and Motivation
Mixed-Cell Cache Architecture
Methodology and Results
Conclusion

Background and Motivation
Multi-core designs are power-limited
Can activate more cores by lowering the voltage
Voltage scale: more active cores at low voltage

Today's systems are power-limited. Within a given power budget, we can activate more cores if we lower the voltage. In the figure, the system can activate just two cores within its power budget. However, if it lowers the voltage, it can turn on four cores, and lowering the voltage further enables eight active cores. Lowering the voltage thus enables more active cores on the die.

Ensuring Resiliency at Lower Voltage
Cache cells begin to fail at lower voltage
[Figure: a conventional cache with failing cells at low voltage next to a mixed-cell cache with robust and non-robust cells]
Mixed-Cell Cache [Ghasemi 2011]: some ways built with robust cells
+ Resilient to errors at low voltage
- Area and power overhead
Only robust cells are operational at low voltage
Cache capacity loss at lower voltage can degrade performance significantly

But the problem with lowering the voltage is that cache cells begin to fail at low voltage. In the picture, the black rectangles represent errors. To solve this problem, previous work has proposed the mixed-cell cache. A mixed-cell cache has some larger robust cells that can work at the lower voltage; however, they have area and power overhead. In the left picture, the bigger cells are the robust cells. At low voltage, only the robust cells are operational and the other regular cells are disabled. The capacity loss at lower voltage can degrade performance, especially in multi-core systems.

Effect of Cache Capacity Reduction in a 4-Core System
In our experiments, a 75% reduction in cache capacity leads to a 20% performance loss on average

In this slide we show the effect of cache capacity reduction in a 4-core system. The cores are running at lower voltage, so the frequency is scaled down too: the cores run at 825 MHz. On the x-axis we show two configurations: one uses the whole cache, and the other is the prior work, which disables 75% of the cache. All levels of the cache hierarchy are mixed-cell caches. The benchmarks are multi-programmed SPEC 2006 mixes. The graph shows that a 75% reduction in cache capacity leads to a 20% slowdown on average.

Goal: Improve performance using the whole cache at low voltage

Outline
Summary
Background and Motivation
Mixed-Cell Cache Architecture
Methodology and Results
Conclusion

Now I will describe the mixed-cell architecture that we propose.

Our Mixed-Cell Architecture
Observation:
A clean line has a duplicate copy in the memory hierarchy; on an error, we can get the data from the duplicate copy
A modified line is the only copy in the system; it is critical to keep its data error-free
Idea:
Protect a modified line using larger robust cells
Store a clean line in smaller non-robust cells
Use parity/ECC to detect errors in clean lines
Fetch data from the lower level on an error in clean lines

The intuition behind this work is that a clean line has a duplicate copy in the memory hierarchy, so if there is an error, we can get the data from the duplicate copy. A modified line, however, is the only existing copy in the system, so it is critical to keep its data error-free. We propose to protect modified lines in the robust cells and store clean lines in non-robust cells. We use parity or error-correcting codes to detect errors in the clean lines. On an error in a clean line, we treat it as a miss and fetch the data from the lower level of the memory hierarchy. A sketch of this error-handling path follows.
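The following is a minimal sketch, not the paper's implementation, of the read path for a clean line held in non-robust cells: a per-line parity code detects an error, and a failed check is treated as a miss that refetches the duplicate copy from below. The Line class, the one-byte parity, and the next_level_read callback are illustrative assumptions.

```python
# Minimal sketch: reading a clean line stored in non-robust cells.
# Assumptions (not from the paper): one parity byte per line, and a
# next_level_read(tag) callback that returns the duplicate copy.

def parity(data: bytes) -> int:
    p = 0
    for b in data:
        p ^= b  # XOR-fold the line into one parity byte
    return p

class Line:
    def __init__(self, tag: int, data: bytes):
        self.tag = tag
        self.data = data
        self.dirty = False          # modified lines never live in non-robust ways
        self.check = parity(data)   # error-detection code for the clean copy

def read_clean_line(line: Line, next_level_read) -> bytes:
    if parity(line.data) != line.check:
        # A non-robust cell flipped, but the line is clean, so a duplicate
        # exists below us: treat the access as a miss and refetch.
        line.data = next_level_read(line.tag)
        line.check = parity(line.data)
    return line.data

# Usage: inject a bit error and recover from the next level.
line = Line(tag=0x40, data=b"hello")
line.data = b"hellp"  # error in the non-robust copy
assert read_clean_line(line, lambda tag: b"hello") == b"hello"
```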

Our Mixed-Cell Architecture
[Figure: the prior mixed-cell cache (Disable) [Ghasemi 2011] with non-robust ways turned off at low voltage, next to our design with modified lines in robust ways and clean lines in non-robust ways]
Use both robust and non-robust ways at low voltage
A modified line is stored only in a robust way
A clean line is stored only in a non-robust way
Modify cache management techniques to ensure clean and modified lines are stored appropriately

In this slide we show our proposed mechanism pictorially. On the left we show the mixed-cell cache where non-robust cells are disabled at low voltage. We want to use both the robust and non-robust ways at low voltage. On the right we show our design: we store modified lines only in robust ways and clean lines only in non-robust ways, and we modify the cache management policy to ensure that modified lines are always protected in the robust ways.

Mixed-Cell Architecture: Cache Miss
Write miss: Allocate line in a robust way
Read miss: Allocate line in a non-robust way
[Figure: over time, write misses X and Y are allocated into the LRU robust ways, while read misses A and B are allocated into the LRU non-robust ways]

On a cache miss, we allocate the data depending on the type of the miss. On a write miss, we allocate the line in a robust way, and on a read miss we allocate it in a non-robust way. In the figure, the write misses (X and Y) are allocated in robust ways and the read misses (A and B) in non-robust ways. A sketch of this allocation policy is shown below.
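Below is a minimal sketch of the miss-allocation policy, under the assumption (matching the evaluated configuration) that the first two ways of each set are robust; the dict-based set and the LRU timestamps are illustrative.

```python
# Minimal sketch: choosing a victim way on a miss. Assumes ways 0..1 of
# each set are robust; set representation and LRU timestamps are illustrative.

ROBUST_WAYS = 2
NUM_WAYS = 8

def allocate_on_miss(cache_set, is_write: bool) -> int:
    if is_write:
        # Write miss: the incoming line will be dirty, so it must land
        # in a robust way to stay error-free at low voltage.
        candidates = range(0, ROBUST_WAYS)
    else:
        # Read miss: the line stays clean (a duplicate exists below),
        # so a non-robust way is safe.
        candidates = range(ROBUST_WAYS, NUM_WAYS)
    # Evict the least recently used line among the eligible ways.
    return min(candidates, key=lambda w: cache_set[w]["lru_time"])

# Usage: way 3 holds the oldest non-robust line, way 0 the oldest robust one.
cache_set = [{"lru_time": t} for t in (5, 7, 9, 1, 4, 6, 8, 2)]
assert allocate_on_miss(cache_set, is_write=False) == 3
assert allocate_on_miss(cache_set, is_write=True) == 0
```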

Mixed-Cell Architecture: Cache Hit
Read hit: No change
Write hit in a robust way: No change
Write hit in a non-robust way: We propose three mechanisms
Writeback
Swap
Duplicate
[Figure: a read hit on J and a write hit on E (robust way) need no action; a write hit on G in a non-robust way does]

So let's see what happens on a cache hit. If we have a read hit, we are just reading from a line and not changing its clean/dirty status, so we do not need to change anything. If we have a write hit, we need to consider two cases. If the write hits in a robust way, the modified line is already protected in the larger cells, so again we do not have to change anything. However, if the write hits in a non-robust way, that means the line was allocated on a read miss and now we are writing to it, so we have a modified line vulnerable in a non-robust way. We propose three simple mechanisms to handle this case: writeback, swap, and duplicate, dispatched as in the sketch below.
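The dispatch might look like the following minimal sketch; the handler bodies are stubs standing in for the three mechanisms detailed on the next slides, and all names are illustrative, not from the paper.

```python
# Minimal sketch: write-hit dispatch. Only a write hit in a non-robust way
# needs action; the handlers are stubs for the three mechanisms sketched
# on the following slides. Which one runs is a design choice.

ROBUST_WAYS = 2

HANDLERS = {
    "writeback": lambda way, line: print("write the line back; mark it clean"),
    "swap":      lambda way, line: print("swap with the LRU robust line"),
    "duplicate": lambda way, line: print("copy the line into its partner way"),
}

def on_write_hit(way: int, line: dict, policy: str) -> None:
    line["dirty"] = True
    if way < ROBUST_WAYS:
        return  # robust cells already protect the modified line
    HANDLERS[policy](way, line)  # dirty line now sits in non-robust cells

# Usage: a write hit on way 5 (non-robust) under the Duplicate policy.
on_write_hit(way=5, line={"dirty": False}, policy="duplicate")
```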

Write to a Non-Robust Line: Writeback
Write the data back to the next level of the memory hierarchy
Make the data clean in the non-robust cell
+ Simple
- An extra writeback at each write to a non-robust way
[Figure: a write hit on G makes it vulnerable; writing G back leaves the non-robust way holding clean data]

In the writeback mechanism, we propose to write the data back to the next level of the memory hierarchy and make the data clean. For example, on a write hit to G, G becomes vulnerable, so we write G back so that the non-robust cell contains clean data. The advantage of this mechanism is that it is simple. The disadvantage is that we now have an extra writeback at each write to a non-robust cell. A sketch follows.
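A minimal sketch of the Writeback mechanism, assuming an illustrative next_level_write(addr, data) callback into the next cache level:

```python
# Minimal sketch of the Writeback mechanism. next_level_write is an
# assumed callback into the next cache level, not the paper's interface.

def writeback_on_nonrobust_write(line: dict, next_level_write) -> None:
    # Push the freshly written data down the hierarchy...
    next_level_write(line["addr"], line["data"])
    # ...so the copy left in the non-robust way is clean (redundant) again.
    line["dirty"] = False

# Usage: after a write hit on G in a non-robust way.
g = {"addr": 0x40, "data": b"G'", "dirty": True}
writeback_on_nonrobust_write(g, lambda addr, data: None)
assert g["dirty"] is False
```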

Write to a Non-Robust Line: Swap
Swap the modified line with the LRU robust line
Write back the displaced robust data to the next cache level
+ Increases write hits in robust cells
- Extra latency for the swap
[Figure: a write hit on G swaps G with the LRU robust line E; E is written back, so the non-robust way again holds clean data]

The next mechanism is swap. Here we propose to swap the modified line with the LRU robust line, then write the displaced robust data back to the next level of the hierarchy. In this example, on a write hit to G, we swap G and E and write back E. The advantage of this mechanism is that it increases write hits in the robust cells, mainly because writes have locality. The disadvantage is that we have to swap two cache lines, so there is extra latency for the swap. A sketch follows.
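A minimal sketch of the Swap mechanism, carrying over the assumptions of the earlier sketches (ways 0..1 robust, dict-based set, illustrative next_level_write callback):

```python
# Minimal sketch of the Swap mechanism. Assumes ways 0..1 are robust;
# the set representation and next_level_write callback are illustrative.

ROBUST_WAYS = 2

def swap_on_nonrobust_write(cache_set, hit_way: int, next_level_write) -> None:
    # Pick the LRU victim among the robust ways (E in the slide's example).
    victim = min(range(ROBUST_WAYS), key=lambda w: cache_set[w]["lru_time"])
    # The displaced line will land in non-robust cells, so it must be clean.
    if cache_set[victim]["dirty"]:
        next_level_write(cache_set[victim]["addr"], cache_set[victim]["data"])
        cache_set[victim]["dirty"] = False
    # Exchange the two lines: the dirty line is now protected in a robust way.
    cache_set[victim], cache_set[hit_way] = cache_set[hit_way], cache_set[victim]
```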

Write to a Non-Robust Line: Duplicate
Pair two non-robust ways; static pairing: ways <0,1>, <2,3>, …
Duplicate the data in the partner way
On an error in one way, use the data from the partner way
+ Simple, no extra writeback
- Capacity loss, extra latency for the duplication
[Figure: a write hit on G duplicates G into its partner way]

The next mechanism is duplicate. Here we pair up two non-robust ways. We use static pairing, so way 0 is always paired with way 1, and so on. On a write to a non-robust way, we duplicate the data in the partner way. If one copy has an error, we can use the other copy to get the actual data; the probability that both ways have errors is very low. In this example, on a write hit to G, we duplicate G in its partner way. The advantage of this mechanism is that it is simple and does not require any extra writeback. The disadvantage is that we lose some cache capacity due to the duplication, and there is also latency involved in duplicating. A sketch follows.
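A minimal sketch of the Duplicate mechanism with static pairing among the non-robust ways; the XOR trick for computing partners and the has_error predicate are illustrative assumptions.

```python
# Minimal sketch of the Duplicate mechanism. With ways 2..7 non-robust,
# the static pairs are <2,3>, <4,5>, <6,7>; the XOR partner computation
# and has_error predicate are illustrative, not the paper's code.

def partner_way(way: int) -> int:
    return way ^ 1  # static pairing: even/odd neighbors pair up

def duplicate_on_nonrobust_write(cache_set, hit_way: int) -> None:
    # Copy the dirty line into its partner way. The partner's previous
    # contents are evicted, which is where the capacity loss comes from.
    cache_set[partner_way(hit_way)] = dict(cache_set[hit_way])

def read_duplicated(cache_set, way: int, has_error) -> bytes:
    # On an error in one copy, fall back to the partner's copy; the odds
    # of both copies failing are very low.
    if has_error(cache_set[way]):
        way = partner_way(way)
    return cache_set[way]["data"]

# Usage: a write hit on way 5 duplicates it into way 4, which then serves
# the data when the primary copy reports an error.
s = [{"data": bytes([i])} for i in range(8)]
duplicate_on_nonrobust_write(s, 5)
assert read_duplicated(s, 5, has_error=lambda line: True) == bytes([5])
```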

Outline
Summary
Background and Motivation
Mixed-Cell Cache Architecture
Methodology and Results
Conclusion

Evaluation Methodology
Simulator: CMP$im, a Pin-based x86 simulator [Jaleel 2008]
Benchmarks: 20 4-core multi-programmed mixes from SPEC 2006
Each cache level has 2 robust ways:
L1D: 32KB, 2 robust + 6 non-robust ways, 3 cycles
L2: 256KB, 2 robust + 6 non-robust ways, 10 cycles
L3 (shared): 4MB, 2 robust + 14 non-robust ways, 25 cycles
Memory latency: 80 cycles
Vmin: 590 mV at 825 MHz

Comparison Points
Robust: Cache uses only robust cells
Smaller capacity: L1D 20KB, L2 160KB, L3 2.25MB
Disable: Mixed-Cell Cache [Ghasemi 2011]
Only 1/4 of the cache works at low voltage: L1D 8KB, L2 64KB, L3 1MB
Ideal: Cache uses only non-robust cells
Larger capacity: L1 40KB, L2 320KB, L3 4.5MB
Cannot work at low voltage; providing a higher voltage to the cache via a separate Vcc increases complexity and adds latency to signals crossing voltage domains

Now I will discuss our comparison points. We compare our mechanisms against Robust, where all the cache cells are robust cells; it has smaller capacity since the robust cells are bigger. We also compare against Disable, the prior work where only one fourth of the cache is on at low voltage. Finally, we compare against Ideal, where the cache has only smaller cells and therefore larger capacity. Note that these cells would not work at low voltage. One way to make them work is to use separate power supplies for the core and the cache, so the core can run at a lower voltage while the cache runs at a higher one, but a separate voltage domain comes with additional cost and complexity.

4-Core Performance at Low Voltage
[Chart labels: 17%, 2.6%]
Swap provides 17% speedup over Disable
Swap performs within 2.6% of Ideal

Here we show the performance of the 4-core system. On the x-axis we show all the mechanisms, and on the y-axis the weighted speedup normalized to Disable, where three-fourths of the cache is off at low voltage. The blue bars are our mechanisms. Robust performs better than Disable as it has higher cache capacity. Our first mechanism, Writeback, actually degrades performance: the extra writebacks at write hits in the non-robust ways have a significant performance impact. The next two bars are Swap and Duplicate; both improve performance. On average, Swap improves performance by 17% compared to Disable and is within 2.6% of the Ideal case.

Normalized Memory Bandwidth
[Chart labels: 6.15, 21%, 28.5%, 3%]
Duplicate increases memory bandwidth by only 3% compared to Ideal

Since our mechanisms introduce extra writebacks, this graph shows memory bandwidth normalized to the prior work, Disable. Again the x-axis shows all the mechanisms and the y-axis the normalized memory bandwidth. Robust uses less bandwidth than Disable: since Disable has smaller capacity, it issues more misses. Our Writeback mechanism increases memory bandwidth by more than 6x. However, both Swap and Duplicate use lower bandwidth; in fact, Duplicate increases memory bandwidth by only 3% compared to the Ideal case.

Normalized LLC Static Power at Vmin (590mV)
[Chart labels: 10%, 2.3X]
Swap and Duplicate reduce LLC static power by 10% compared to Ideal

Since robust cells have more static and dynamic power, we show the power consumption of the different mechanisms. First we show LLC static power, as leakage is the major power component in the LLC. Again the x-axis shows the mechanisms and the y-axis the LLC static power normalized to Disable. Robust uses more static power as it is bigger than Disable. Both Swap and Duplicate also increase static power relative to Disable, as they are bigger and use larger robust cells. However, they reduce static power by 10% compared to the Ideal case, where the cache and the core use separate power supplies, and they do so without the additional overhead of the separate supplies.

Normalized L1D Dynamic Power at Vmin (590mV)
[Chart labels: 22%, 50%, 30%]
Duplicate reduces dynamic power by 50% compared to Disable
Duplicate is within 30% of Ideal

Here we show the dynamic power of the L1 data cache. As before, the x-axis shows all the mechanisms and the y-axis the dynamic power normalized to Disable. Disable and Robust have the same dynamic power, since both consist only of larger cells, so all accesses are served by larger cells. Swap and Duplicate significantly reduce dynamic power, with Duplicate reducing it more than Swap. The reason is that Swap moves modified lines into the robust ways, which increases write hits in the robust ways, while Duplicate spreads the writes across the non-robust ways; so Swap has more accesses to larger cells than Duplicate. Duplicate reduces dynamic power by 50% compared to Disable and is within 30% of the Ideal case.

Conclusion
Problem: Cache cells become unreliable at low voltage
Mixed-cell cache: Use some larger robust cells
Smaller non-robust cells are turned off at low voltage
Capacity loss leads to performance loss
Goal: No capacity loss at low voltage to gain high performance
Observation: A clean line has a duplicate copy in the memory hierarchy; a modified line is the only existing copy
Our Approach: Protect a modified line in larger robust cells; store a clean line in smaller non-robust cells; fetch data from the lower level on an error in a clean line
Improves performance by 17% and reduces L1D dynamic power by 50% compared to prior work

The problem is that caches become unreliable at low voltage. Prior work has proposed the mixed-cell cache to solve this problem: it has some larger robust cells that are reliable even at low voltage. At higher voltage, both robust and non-robust cells are operational; at low voltage, the regular cells are turned off, and this capacity loss leads to significant performance loss. The goal of this work is to gain high performance by using the whole cache. The intuition is that clean lines have extra copies in the memory hierarchy, so there is natural redundancy of data in clean lines, but a modified line is the only existing copy, so modified lines are critical and need to be protected. We propose to use the whole cache by protecting modified data in the robust cells and storing clean data in the non-robust cells. On an error in a clean line, we fetch the data from the lower level of the memory hierarchy. Our design significantly improves performance and reduces power compared to the prior work.

Thank you