Javier Lira (UPC, Spain), Carlos Molina (URV, Spain), David Brooks (Harvard, USA), Antonio González (Intel-UPC,

1 Javier Lira (UPC, Spain), javier.lira@ac.upc.edu; Carlos Molina (URV, Spain), carlos.molina@urv.net; David Brooks (Harvard, USA), dbrooks@eecs.harvard.edu; Antonio González (Intel-UPC, Spain), antonio.gonzalez@intel.com. HiPC 2011, Bangalore (India) – December 21, 2011

2 CMPs incorporate a large last-level cache (LLC). POWER7 implements its L3 cache with eDRAM: 3x density and 3.5x lower energy consumption, at the cost of a few extra cycles of latency. We propose a placement policy to accommodate both technologies in a NUCA cache. Figure annotation: 40-45% of chip area.

3 NUCA divides a large cache into smaller, faster banks. Cache access latency consists of the routing latency and the bank access latency. Banks close to the cache controller have lower latencies than banks farther away. [1] Kim et al. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Architectures. ASPLOS 2002.

4 SRAM provides high performance. eDRAM provides low power and high density.

                 SRAM   eDRAM
Latency          X      1.5x
Density          X      3x
Leakage          2x     X
Dynamic energy   1.5x   X
Need refresh?    No     Yes

5 Outline: Introduction; Methodology; Implementing a hybrid NUCA cache; Analysis of our design; Exploiting architectural benefits; Conclusions

6 NUCA policies (figure: Core 0 to Core 7 around the NUCA cache):
Placement: 16 positions per data
Access: Partitioned multicast
Migration: Gradual promotion
Replacement: LRU + Zero-copy
[2] Beckmann and Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. MICRO 2004.

7 Simulated configuration:

Number of cores        8 x UltraSPARC IIIi
Frequency              1.5 GHz
Main memory size       4 Gbytes
Memory bandwidth       512 Bytes/cycle
Private L1 caches      8 x 32 Kbytes, 2-way
Shared L2 NUCA cache   8 MBytes, 128 banks
NUCA bank              64 KBytes, 8-way
L1 cache latency       3 cycles
NUCA bank latency      4 cycles
Router delay           1 cycle
On-chip wire delay     1 cycle
Main memory latency    250 cycles (from core)

Simulation infrastructure and workloads (figure labels): GEMS, Simics, Solaris 10, PARSEC, SPEC CPU2006, Ruby, Garnet, Orion.
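The parameters above can be combined with the latency decomposition from slide 3 (routing latency plus bank access latency) in a small back-of-the-envelope model. This is only a sketch: the names `NucaConfig`, `total_capacity_mb`, and `access_latency` are hypothetical, and the assumption that each hop costs one router delay plus one wire delay, paid on both the request and the reply, is ours and not stated on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NucaConfig:
    """Values copied from the configuration slide."""
    num_banks: int = 128
    bank_size_kb: int = 64
    bank_latency: int = 4      # cycles
    router_delay: int = 1      # cycles
    wire_delay: int = 1        # cycles
    memory_latency: int = 250  # cycles, from core


def total_capacity_mb(cfg: NucaConfig) -> float:
    """Total NUCA capacity implied by bank count and bank size."""
    return cfg.num_banks * cfg.bank_size_kb / 1024


def access_latency(cfg: NucaConfig, hops: int) -> int:
    """Routing latency plus bank access latency, in cycles.

    Assumption (not from the slides): every hop costs one router delay
    plus one wire delay, and request and reply each travel `hops` hops.
    """
    routing = 2 * hops * (cfg.router_delay + cfg.wire_delay)
    return routing + cfg.bank_latency


cfg = NucaConfig()
print("capacity (MB):", total_capacity_mb(cfg))
for hops in (1, 4, 8):
    print(f"{hops} hops -> {access_latency(cfg, hops)} cycles")
```

Under these assumptions, 128 banks of 64 KBytes add up to the stated 8 MBytes, and a bank several hops away costs noticeably more than the 4-cycle bank access alone, which is why placement matters.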

9 Fast SRAM banks are located close to the cores; slower eDRAM banks sit in the center of the NUCA cache. PROBLEM: Migration tends to concentrate shared data in the central banks. (Figure: Core 0 to Core 7 around the cache, with SRAM and eDRAM banks marked.)

10 A significant amount of the data in the LLC is never accessed during its lifetime. SRAM banks store the most frequently accessed data. eDRAM banks hold data blocks that either have just arrived in the NUCA cache or were evicted from SRAM banks.

11 An incoming data block first goes to an eDRAM bank. If it is accessed, it moves to an SRAM bank. Features: migration between SRAM banks; no communication between eDRAM banks; no eviction from SRAM banks; eDRAM is extra storage for SRAM. PROBLEM: The access scheme must search twice as many banks. (Figure: Core 0 to Core 7 with SRAM and eDRAM banks.)
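The placement policy on slides 10 and 11 can be illustrated with a toy model of a single SRAM/eDRAM pair. This is a simplified sketch, not the authors' implementation: the class `HybridSet` and its methods are hypothetical, LRU is assumed within each side, and migration between SRAM banks is left out.

```python
from collections import OrderedDict


class HybridSet:
    """Toy model of one SRAM/eDRAM pair (hypothetical, for illustration)."""

    def __init__(self, sram_ways: int, edram_ways: int):
        self.sram_ways = sram_ways
        self.edram_ways = edram_ways
        self.sram = OrderedDict()    # least recently used entry first
        self.edram = OrderedDict()   # least recently used entry first

    def _insert_edram(self, tag):
        if len(self.edram) >= self.edram_ways:
            self.edram.popitem(last=False)   # block leaves the NUCA cache
        self.edram[tag] = True

    def _insert_sram(self, tag):
        if len(self.sram) >= self.sram_ways:
            victim, _ = self.sram.popitem(last=False)
            self._insert_edram(victim)       # demoted to eDRAM, not evicted
        self.sram[tag] = True

    def access(self, tag) -> str:
        if tag in self.sram:
            self.sram.move_to_end(tag)       # refresh LRU position
            return "sram_hit"
        if tag in self.edram:
            del self.edram[tag]
            self._insert_sram(tag)           # accessed block moves to SRAM
            return "edram_hit"
        self._insert_edram(tag)              # new blocks start in eDRAM
        return "miss"


s = HybridSet(sram_ways=2, edram_ways=2)
trace = ["A", "A", "A", "B", "B", "C", "C", "A"]
print([s.access(t) for t in trace])
```

In this model a block only reaches SRAM after it has been reused once, so blocks that are never accessed again stay in eDRAM until they are replaced there.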

12 A Tag Directory Array (TDA) stores the tags of the eDRAM banks. Using the TDA, the access scheme looks up at most 17 banks. The TDA requires 512 Kbytes for an 8 Mbyte (4S-4D) hybrid NUCA cache.
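The cost and benefit of the TDA can be put side by side with a few lines of arithmetic. The 17-bank and 512-Kbyte figures come from this slide. The 32-bank figure is our reading of "double the number of banks", applied to the 16 candidate positions per data block from slide 6, and should be treated as an assumption.

```python
KB = 1024
MB = 1024 * KB

positions_per_block = 16                        # from slide 6
lookups_without_tda = 2 * positions_per_block   # assumed meaning of "double"
lookups_with_tda = 17                           # stated on this slide

tda_bytes = 512 * KB                            # stated on this slide
cache_bytes = 8 * MB                            # 4S-4D hybrid NUCA cache
tda_overhead = tda_bytes / cache_bytes

print("banks searched without TDA:", lookups_without_tda)
print("banks searched with TDA:", lookups_with_tda)
print(f"TDA storage overhead: {tda_overhead:.2%}")
```

The extra storage is the price the later slides refer to when they say the hybrid NUCA pays for the TDA.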

13 Heterogeneous + TDA outperforms the other hybrid alternatives. We use Heterogeneous + TDA as the hybrid NUCA cache in the rest of the analysis.

15 Well-balanced configurations achieve performance similar to the all-SRAM NUCA cache. The majority of hits are in SRAM banks.

16 The hybrid NUCA pays for the TDA. The less SRAM the hybrid NUCA uses, the better.

17 The 4S-4D configuration achieves performance similar to all-SRAM, reduces power consumption by 10%, and occupies 15% less area than all-SRAM.

19 Configurations compared (figure labels):
All SRAM banks
4S-4D: SRAM 4 MBytes, eDRAM 4 MBytes; 15% reduction in area
5S-4D: +1 MByte in SRAM banks
4S-6D: +2 MBytes in eDRAM banks

20 Both configurations increase performance by 4% and do not increase power consumption.

22 IBM® integrates eDRAM in its latest general-purpose processor. We implement a hybrid NUCA cache that effectively combines SRAM and eDRAM technologies. Our placement policy succeeds in concentrating most accesses in the SRAM banks. A well-balanced hybrid cache achieves performance similar to the all-SRAM configuration, but occupies 15% less area and dissipates 10% less power. Exploiting architectural benefits, we achieve a performance improvement of up to 10%, and 4% on average.

23 Questions? HiPC 2011, Bangalore (India) – December 21, 2011

