
1 Javier Lira (Intel-UPC, Spain), javierx.lira@intel.com
Timothy M. Jones (U. of Cambridge, UK), timothy.jones@cl.cam.ac.uk
Carlos Molina (URV, Spain), carlos.molina@urv.net
Antonio González (Intel-UPC, Spain), antonio.gonzalez@intel.com
HiPEAC 2012, Paris (France) – January 23, 2012

2  CMPs have become the dominant paradigm.  They incorporate large shared last-level caches.  Access latency in large caches is dominated by wire delays.  Examples: Intel® Nehalem (24 MBytes), IBM® POWER7 (32 MBytes), Tilera® Tile-GX (32 MBytes).

3  NUCA divides a large cache into smaller and faster banks.  Cache access latency consists of the routing latency plus the bank access latency.  Banks close to the cache controller have smaller latencies than more distant banks. [1] Kim et al. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Architectures. ASPLOS'02
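The latency model on this slide can be sketched numerically. This is our illustration, not code from the talk; the per-hop and bank delays are taken from the methodology slide later in the presentation, and the function name is an assumption:

```python
# Illustrative sketch: NUCA access latency = routing latency + bank access
# latency. Delay values below come from the talk's methodology slide.

BANK_LATENCY = 4   # cycles per NUCA bank access
ROUTER_DELAY = 1   # cycles per router traversal
WIRE_DELAY = 1     # cycles per on-chip link

def nuca_access_latency(hops: int) -> int:
    """Total latency for a bank that is `hops` network hops away."""
    return hops * (ROUTER_DELAY + WIRE_DELAY) + BANK_LATENCY

print(nuca_access_latency(1))   # a bank next to the cache controller: 6 cycles
print(nuca_access_latency(8))   # a distant bank: 20 cycles
```

The gap between near and far banks is exactly why migrating hot data toward the requesting core pays off.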

4  Data can be mapped to multiple banks (S-NUCA vs. D-NUCA).  Migration allows data to adapt to the application's behaviour.  Migration movements are effective, but about 50% of hits still happen in non-optimal banks.

5  Introduction  Methodology  The Migration Prefetcher  Analysis of results  Conclusions

6 Baseline D-NUCA organization (8 cores, Core 0 to Core 7) [2]:  Placement: 16 possible positions per data block.  Access: partitioned multicast.  Migration: gradual promotion.  Replacement: LRU + zero-copy. [2] Beckmann and Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. MICRO'04
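The gradual-promotion migration policy listed above can be sketched as follows. The list-based bankset model and the function name are our simplification of the Beckmann and Wood scheme, not the talk's implementation:

```python
# Sketch of D-NUCA "gradual promotion": on a hit, the accessed block swaps
# with the block in the neighbouring bank one step closer to the requester.
# The bankset is modeled as a list ordered from closest (index 0) to farthest.

def access_with_promotion(bankset, addr):
    """Return the hit position, promoting the block one bank closer on a hit."""
    for i, block in enumerate(bankset):
        if block == addr:
            if i > 0:  # gradual promotion: swap with the next-closer bank
                bankset[i - 1], bankset[i] = bankset[i], bankset[i - 1]
            return i
    return None  # miss in this bankset

bankset = ["A", "B", "C", "D"]
access_with_promotion(bankset, "C")  # C moves one bank closer per hit
print(bankset)  # ['A', 'C', 'B', 'D']
```

Repeated hits gradually walk a hot block toward the closest bank, which is why frequently used data ends up near the requesting core.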

7 Simulation parameters:
Number of cores: 8 (UltraSPARC IIIi)
Frequency: 1.5 GHz
Main memory size: 4 GBytes
Memory bandwidth: 512 Bytes/cycle
Private L1 caches: 8 x 32 KBytes, 2-way
Shared L2 NUCA cache: 8 MBytes, 128 banks
NUCA bank: 64 KBytes, 8-way
L1 cache latency: 3 cycles
NUCA bank latency: 4 cycles
Router delay: 1 cycle
On-chip wire delay: 1 cycle
Main memory latency: 250 cycles (from core)
Tools and workloads: GEMS (Ruby, Garnet, Orion), Simics (8 x UltraSPARC IIIi, Solaris 10), PARSEC and SPEC CPU2006.

8  Introduction  Methodology  The Migration Prefetcher  Analysis of results  Conclusions

9  Uses prefetching principles for data migration.  This is not a traditional prefetcher: ◦ It does not bring data from main memory. ◦ Its potential benefits are therefore more limited.  It requires only simple data correlation.

10 [Diagram: 8-core CMP (Core 0 to Core 7) with the prefetcher (PS) attached to the NUCA cache. Each Next Address Table (NAT) entry holds a next address and the bank (Bank @) where that data block was last found.]
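The NAT-based correlation this slide illustrates can be sketched in a few lines. This is a hypothetical simplification under our own naming (a plain dictionary standing in for the hardware table, no confidence bits yet), not the paper's exact design:

```python
# Sketch of the Next Address Table (NAT) idea: each entry correlates an
# address with the next address observed after it and the NUCA bank where
# that next block was last found, so its migration can be anticipated.

class MigrationPrefetcher:
    def __init__(self):
        self.nat = {}          # addr -> (next_addr, last_bank)
        self.last_addr = None  # previously accessed address

    def on_access(self, addr, bank):
        # Learn: record that the previous access is followed by this one.
        if self.last_addr is not None:
            self.nat[self.last_addr] = (addr, bank)
        self.last_addr = addr
        # Predict: if a likely successor is known, return it so it can be
        # prefetched (i.e., migrated toward the requesting core) early.
        return self.nat.get(addr)

pf = MigrationPrefetcher()
pf.on_access(0xA0, bank=5)
pf.on_access(0xB0, bank=2)
print(pf.on_access(0xA0, bank=5))  # predicts (0xB0, 2): B follows A
```

Because the predicted block is already somewhere in the NUCA cache, the "prefetch" is really an early migration toward the requester rather than a fetch from memory.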

11  Fraction of prefetching requests that ended up being useful.  A single confidence bit is effective; more than one bit is not worthwhile.
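The 1-bit confidence scheme the slide finds effective can be sketched as follows; the function names are ours:

```python
# Sketch of 1-bit confidence: a NAT entry's prefetch is only issued while
# its confidence bit is set, and one wrong prediction clears the bit again.

def update_confidence(prediction_correct: bool) -> int:
    # 1-bit "counter": set on a correct prediction, cleared otherwise.
    return 1 if prediction_correct else 0

def should_prefetch(confidence_bit: int) -> bool:
    return confidence_bit == 1

bit = 0                         # start with no confidence
bit = update_confidence(True)   # a correct prediction sets the bit
print(should_prefetch(bit))     # True
bit = update_confidence(False)  # a single miss clears it
print(should_prefetch(bit))     # False
```

With only one bit there is no hysteresis, which matches the slide's finding that wider saturating counters add cost without improving accuracy here.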

12  Percentage of prefetching requests submitted using another address's information (aliasing in the NAT).  With 12-14 addressable bits, about 25% of prefetches use erroneous information.  A NAT with 12 addressable bits takes 232 KBytes in total.
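The storage figure can be checked with back-of-envelope arithmetic. This is our calculation, assuming the 29 KBytes-per-table figure quoted later in the talk and one table per core:

```python
# A NAT indexed by 12 address bits has 2**12 entries. At 29 KBytes per
# table that is 58 bits per entry, and one table per core in the 8-core
# CMP gives the quoted 232 KBytes total.

entries = 2 ** 12              # 12 addressable bits -> 4096 entries
table_bytes = 29 * 1024        # 29 KBytes per table (from slide 14)
bits_per_entry = table_bytes * 8 / entries
total_kbytes = 8 * 29          # one NAT per core

print(bits_per_entry)   # 58.0
print(total_kbytes)     # 232
```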

13  Percentage of prefetching requests that are found in the NUCA cache.  Predicting the data location based on its last appearance provides 50% accuracy.  Accuracy increases when the local bank is also searched.
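The "last responder + local" search scheme adopted later can be sketched as follows; the dictionary model of bank contents and the function name are our assumptions:

```python
# Sketch of the search scheme: a prefetch first probes the bank that
# responded last time and the requester's local bank, falling back to a
# full lookup only if both likely banks miss.

def search_banks(addr, last_responder, local_bank, bank_contents):
    """Return the bank holding addr, probing the two likely banks first."""
    for bank in (last_responder, local_bank):
        if addr in bank_contents[bank]:
            return bank
    for bank, contents in bank_contents.items():  # fallback: full search
        if addr in contents:
            return bank
    return None

banks = {0: {"X"}, 3: {"Y"}, 5: {"Z"}}
print(search_banks("Y", last_responder=3, local_bank=0, bank_contents=banks))  # 3
print(search_banks("Z", last_responder=3, local_bank=0, bank_contents=banks))  # 5
```

Probing the two likely banks first keeps prefetch lookups cheap while recovering the accuracy lost when a block has migrated since its last appearance.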

14  The realistic Migration Prefetcher uses: ◦ 1-bit confidence for data patterns. ◦ A NAT with 12 addressable bits (29 KBytes per table). ◦ Last responder + local bank as the search scheme.  Total hardware overhead is 264 KBytes.  Latency: 2 cycles.

15  Introduction  Methodology  The Migration Prefetcher  Analysis of results  Conclusions

16 [Performance results figure]

17  Achieves overall performance improvements of 4% on average, and up to 17%.  The NUCA cache is up to 25% faster with the Migration Prefetcher.  Reduces NUCA cache latency by 15%, on average.

18  This technique does not increase energy consumption.  The prefetcher introduces extra traffic into the network.  On a prefetch hit, however, it significantly reduces the number of messages.

19  Introduction  Methodology  The Migration Prefetcher  Analysis of results  Conclusions

20  Existing migration techniques effectively concentrate the most frequently accessed data in banks close to the cores.  Still, about 50% of hits in the NUCA cache occur in non-optimal banks.  The Migration Prefetcher anticipates migrations based on past access patterns.  It reduces the average NUCA latency by 15%.  It outperforms the baseline configuration by 4%, on average, and does not increase energy consumption.

21 Questions? HiPEAC 2012, Paris (France) – January 23, 2012

