HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs
Javier Lira (Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain; javier.lira@ac.upc.edu)
Carlos Molina (Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain; carlos.molina@urv.net)
Antonio González (Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain, and Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya; antonio.gonzalez@intel.com)
IPDPS 2011, Anchorage, AK (USA), May 17, 2011
Introduction
[Diagram: 8-core CMP (Core 0 to Core 7) sharing a banked NUCA cache]
S-NUCA (Static NUCA):
- One possible location in the NUCA
- Simple
- Trivial search of data
- Does not leverage locality
D-NUCA (Dynamic NUCA):
- Multiple candidate banks (the lookup contrast is sketched below)
- Migration increases complexity
- Data is not easy to find
- Optimizes cache access latency
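As a rough illustration of the difference this slide describes, the following C sketch (not from the talk) contrasts the two lookups. The geometry is assumed from later slides: 128 banks grouped into 16-bank banksets, with a 64-byte line size as an extra assumption; function names are illustrative.

```c
/* Minimal sketch of S-NUCA vs D-NUCA lookup, under assumed geometry:
 * 128 banks, 16-bank banksets, 64-byte cache lines. */
#include <stdint.h>

#define NUM_BANKS     128
#define BANKSET_SIZE   16
#define LINE_BITS       6   /* assumed 64-byte cache lines */

/* S-NUCA: the address statically determines the single bank to probe. */
static unsigned snuca_bank(uint64_t addr) {
    return (unsigned)((addr >> LINE_BITS) % NUM_BANKS);
}

/* D-NUCA: the address selects a bankset; any bank of that bankset may hold
 * the block, because migration moves hot blocks towards the requester, so
 * the access scheme must decide which of these candidates to probe. */
static void dnuca_candidate_banks(uint64_t addr, unsigned banks[BANKSET_SIZE]) {
    unsigned bankset = (unsigned)((addr >> LINE_BITS) % (NUM_BANKS / BANKSET_SIZE));
    for (unsigned i = 0; i < BANKSET_SIZE; i++)
        banks[i] = bankset * BANKSET_SIZE + i;
}
```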
Motivation
- Significant performance potential
- Limited by the access scheme
Access schemes in D-NUCA
A directory is not an alternative:
- Needs to update the block location on every migration
- Reduces D-NUCA's potential
- Potential bottleneck
Algorithmic-based schemes:
- Serial: low performance, low energy
- Parallel: high performance, high energy
Partitioned multicast (hybrid access scheme; sketched below):
- 1st step: local bank + central banks (9 banks)
- 2nd step: the other cores' local banks
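A hedged sketch of the partitioned-multicast access described on this slide. The helper names and the bank placement are illustrative; the only extern, probe_banks, stands for sending the multicast probes and collecting the hit/miss replies. It assumes 8 central banks, so that the first step probes 9 banks as stated on the slide.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES    8
#define NUM_BANKS  128
#define NUM_CENTRAL  8   /* assumption: 8 central banks + local bank = 9 probes */

/* Placeholder for sending a multicast probe and collecting hit/miss replies. */
extern bool probe_banks(const unsigned *banks, unsigned count, uint64_t addr);

/* Toy mapping of each core to "its" closest bank; illustrative only. */
static unsigned local_bank_of(unsigned core) {
    return core * (NUM_BANKS / NUM_CORES);
}

bool partitioned_multicast(unsigned core, uint64_t addr) {
    /* 1st step: the requester's local bank plus the central banks. */
    unsigned step1[1 + NUM_CENTRAL];
    step1[0] = local_bank_of(core);
    for (unsigned i = 0; i < NUM_CENTRAL; i++)
        step1[1 + i] = (NUM_BANKS / 2) + i;          /* toy "central" banks */
    if (probe_banks(step1, 1 + NUM_CENTRAL, addr))
        return true;

    /* 2nd step: the local banks of the other cores. */
    unsigned step2[NUM_CORES - 1], n = 0;
    for (unsigned c = 0; c < NUM_CORES; c++)
        if (c != core)
            step2[n++] = local_bank_of(c);
    return probe_banks(step2, n, addr);              /* a miss here goes off-chip */
}
```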
Serial vs Parallel
Reducing the number of messages required per access is crucial.
Objectives
- Optimize NUCA features
- Provide fast access when the data is near the requesting core
- Reduce network contention
- Crucial for both performance and energy
Outline: Introduction and motivation, Methodology, HK-NUCA, Results, Conclusions
Methodology
Simulation tools: Simics + GEMS, CACTI v6.0
Two scenarios:
- Multi-programmed: mixes of SPEC CPU2006
- Parallel applications: PARSEC
Simulated configuration:
- Number of cores: 8 (UltraSPARC IIIi)
- Frequency: 1.5 GHz
- Main memory size: 4 GBytes
- Memory bandwidth: 512 Bytes/cycle
- Private L1 caches: 8 x 32 KBytes, 2-way
- Shared L2 NUCA cache: 8 MBytes, 128 banks
- NUCA bank: 64 KBytes, 8-way
- L1 cache latency: 3 cycles
- NUCA bank latency: 4 cycles
- Router delay: 1 cycle
- On-chip wire delay: 1 cycle
- Main memory latency: 250 cycles (from core)
Baseline architecture
[Diagram: 8-core CMP (Core 0 to Core 7) around the banked NUCA cache]
- D-NUCA cache: 8 MBytes, 128 banks
- Bank: 64 KBytes, 8-way
- Migration scheme: gradual promotion
- Replacement: LRU
- Access: partitioned multicast
Outline: Introduction and motivation, Methodology, HK-NUCA, Results, Conclusions
HK-NUCA: Home Knows where to find data in the NUCA cache
- The home bank knows which other banks hold at least one data block that it manages
- There is an HK-PTR per cache set in every bank, e.g. the 16-bit vector 0010110000001010 (see the sketch below)
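A minimal sketch of how the home knowledge could be held, assuming the baseline geometry from the methodology slide (128 banks of 64 KBytes, 8-way) plus an assumed 64-byte line size, which gives 128 sets per bank, and 16-bank banksets inferred from the 16-bit example vector. Field and function names are illustrative, not the paper's.

```c
#include <stdint.h>

#define NUM_BANKS      128
#define SETS_PER_BANK  128   /* 64 KB / (8 ways * assumed 64-byte lines) */
#define BANKSET_SIZE    16

typedef struct {
    /* One bit per bank of the bankset: bit i is set when bank i currently
     * holds at least one block that this home bank manages for this set. */
    uint16_t hk_ptr[SETS_PER_BANK];
} home_knowledge_t;

/* Every NUCA bank keeps one HK-PTR per cache set. */
static home_knowledge_t hk[NUM_BANKS];

static inline int bank_may_hold(unsigned home_bank, unsigned set, unsigned bank_in_set) {
    return (hk[home_bank].hk_ptr[set] >> bank_in_set) & 1u;
}
```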
12
(2) Call Home(3) Parallel access HK-NUCA 12 Core 0Core 1Core 2Core 3 Core 4Core 5Core 6Core 7 Core 0 (1) Fast access 0010110000001010
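A hedged C sketch of the three stages shown on this slide. The extern helpers are placeholders for on-chip network messages (not the paper's interface), and the number of banks probed in the fast-access stage is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define BANKSET_SIZE 16

/* Placeholders for on-chip network messages; names are illustrative. */
extern unsigned fast_access_banks(unsigned core, uint64_t addr, unsigned out[], unsigned max);
extern unsigned home_bank_of(uint64_t addr);
extern uint16_t read_hk_ptr(unsigned home_bank, uint64_t addr);   /* the "call home" message */
extern unsigned bankset_bank(uint64_t addr, unsigned i);          /* i-th bank of the bankset */
extern bool     probe_banks_parallel(const unsigned banks[], unsigned n, uint64_t addr);

bool hk_nuca_access(unsigned core, uint64_t addr) {
    /* (1) Fast access: probe the candidate banks closest to the requester,
       where migration is most likely to have placed a hot block. */
    unsigned near[4];                                  /* probing 4 banks is an assumption */
    unsigned n = fast_access_banks(core, addr, near, 4);
    if (probe_banks_parallel(near, n, addr))
        return true;

    /* (2) Call Home: the home bank looks up the HK-PTR for this set. */
    uint16_t ptr = read_hk_ptr(home_bank_of(addr), addr);

    /* (3) Parallel access: probe only the banks flagged in the HK-PTR. */
    unsigned flagged[BANKSET_SIZE], m = 0;
    for (unsigned i = 0; i < BANKSET_SIZE; i++)
        if ((ptr >> i) & 1u)
            flagged[m++] = bankset_bank(addr, i);
    return probe_banks_parallel(flagged, m, addr);     /* a miss here goes to main memory */
}
```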
Managing home knowledge
Actions that trigger an HK-PTR update (sketched below):
- New data enters the cache
- Eviction from the NUCA cache
- Migration movements
Migrations are synchronized with HK-PTR updates.
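A sketch of the update rules implied by the list above, assuming the HK-PTR layout from the earlier sketch. The bank_holds_other_blocks_of_home check is an assumption about when a bit can be cleared safely; how the home bank tracks this, and the exact synchronization protocol with migrations, are not detailed on the slides.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BANKS     128
#define SETS_PER_BANK 128   /* assumes 64-byte lines in the 64-KByte, 8-way banks */

static uint16_t hk_ptr[NUM_BANKS][SETS_PER_BANK];   /* one 16-bit HK-PTR per set per bank */

/* Placeholder: does `bank` still hold other blocks managed by `home_bank` in `set`? */
extern bool bank_holds_other_blocks_of_home(unsigned bank, unsigned home_bank, unsigned set);

/* New data enters the NUCA cache in bank `bank` (index within the bankset). */
void hk_on_insert(unsigned home_bank, unsigned set, unsigned bank) {
    hk_ptr[home_bank][set] |= (uint16_t)(1u << bank);
}

/* A block managed by `home_bank` leaves bank `bank` (eviction from the NUCA). */
void hk_on_evict(unsigned home_bank, unsigned set, unsigned bank) {
    if (!bank_holds_other_blocks_of_home(bank, home_bank, set))
        hk_ptr[home_bank][set] &= (uint16_t)~(1u << bank);
}

/* Migration moves a block from `src` to `dst`; performed together with the
   migration itself, so the home knowledge never points at a stale location. */
void hk_on_migrate(unsigned home_bank, unsigned set, unsigned src, unsigned dst) {
    hk_on_insert(home_bank, set, dst);
    hk_on_evict(home_bank, set, src);
}
```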
Overheads
- Hardware implementation: the HK-PTRs (NUCA cache: 8 MBytes; HK-PTRs: 32 KBytes, derivation below)
- Network: home-knowledge update messages
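The 32-KByte figure can be checked against the baseline parameters, assuming 64-byte cache lines and 16-bit HK-PTRs (one bit per bank of a 16-bank bankset); both are assumptions, since the slide only gives the totals.

```c
/* Back-of-the-envelope check of the 32-KByte HK-PTR storage figure. */
#include <stdio.h>

int main(void) {
    const unsigned banks        = 128;
    const unsigned bank_bytes   = 64 * 1024;   /* 64 KBytes per bank */
    const unsigned ways         = 8;
    const unsigned line_bytes   = 64;          /* assumed              */
    const unsigned bankset_size = 16;          /* bits per HK-PTR      */

    const unsigned sets_per_bank = bank_bytes / (ways * line_bytes);   /* 128 */
    const unsigned total_bits    = banks * sets_per_bank * bankset_size;
    printf("HK-PTR storage: %u KBytes\n", total_bits / 8 / 1024);      /* 32  */
    return 0;
}
```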
Outline: Introduction and motivation, Methodology, HK-NUCA, Results, Conclusions
Performance results
Overall performance improvement of 4-6%.
[Chart: speedup per workload, annotated with three groups: workloads with a high miss rate; workloads with a low miss rate but a high hit rate in the first two HK-NUCA stages; workloads with a low miss rate and a high hit rate in the parallel access stage of HK-NUCA]
HK-NUCA accuracy
85% of memory requests send fewer than 6 messages to the NUCA cache.
On-chip network traffic
Average messages sent per request:
- Partitioned Multicast: 10.03
- HK-NUCA (3 steps): 3.82
- HK-NUCA (2 steps): 4.06
- Perfect Search: 1
Energy consumption results
HK-NUCA reduces dynamic energy consumption by more than 50%.
Outline: Introduction and motivation, Methodology, HK-NUCA, Results, Conclusions
Conclusions
- D-NUCA makes it possible to exploit the non-uniformity of NUCA caches
- D-NUCA benefits are restricted by the access scheme used
- HK-NUCA is an access scheme for D-NUCA organizations:
  - Allows fast access to data that is near the requesting core
  - Home knowledge reduces miss-resolution time and network contention
  - Outperforms the best-performing access scheme by 6%
  - Reduces dynamic energy consumption by 50%
HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs
Questions?
Migration is not the problem
[Chart: S-NUCA vs D-NUCA]
The access scheme is the main limitation in D-NUCA.