HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs
  Javier Lira (ψ), Carlos Molina (ф), Antonio González (ψ, λ)
  ψ Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain
  ф Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain
  λ Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain
  IPDPS 2011, Anchorage, AK (USA), May 17, 2011
Introduction
  [Figure: Core 0 to Core 7 and the NUCA cache]
  S-NUCA (Static NUCA)
    One possible location in the NUCA
    Simple
    Trivial search of data
    Does not leverage locality
  D-NUCA (Dynamic NUCA)
    Multiple candidate banks
    Migration increases complexity
    Not easy to find data
  Optimize cache access latency
Motivation
  Significant performance potential
  Limited by the access scheme
Access schemes in D-NUCA
  Directory is not an alternative
    Needs to update block location on every migration
    Reduces D-NUCA's potential
    Potential bottleneck
  Algorithmic-based schemes
    Scheme     Performance / Energy
    Serial     Low
    Parallel   High
  Partitioned multicast (hybrid access scheme)
    1st step: local bank + central banks (9 banks)
    2nd step: the other cores' local banks
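The two-step partitioned multicast can be sketched in a few lines to show where its message count comes from. This is a minimal illustration, not the simulator's implementation: the bank names and the function names are hypothetical, and the figure of 8 central banks is an assumption derived from the slide's "9 banks" minus the one local bank.

```python
from typing import List, Optional, Tuple

NUM_CORES = 8      # from the methodology slide
NUM_CENTRAL = 8    # assumption: 9 first-step banks minus the 1 local bank


def candidate_banks(core: int) -> Tuple[List[str], List[str]]:
    """Return the banks probed in step 1 and in step 2 for a given core."""
    first = [f"local{core}"] + [f"central{i}" for i in range(NUM_CENTRAL)]
    second = [f"local{c}" for c in range(NUM_CORES) if c != core]
    return first, second


def partitioned_multicast(core: int, holder: Optional[str]) -> Tuple[bool, int]:
    """Return (hit, messages_sent); holder is the bank with the block, or None."""
    first, second = candidate_banks(core)
    messages = len(first)          # step 1: all first-step banks in parallel
    if holder in first:
        return True, messages
    messages += len(second)        # step 2: the other cores' local banks
    return holder in second, messages
```

Under these assumptions a request costs 9 messages when the first step hits and 16 otherwise, including on a NUCA miss.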
Serial vs Parallel
  Reducing the number of messages required per access is crucial
Objectives
  Optimize NUCA features
    Provide fast access when the data is near the requesting core
  Reduce network contention
    Crucial in both performance and energy
Outline
  Introduction and motivation
  Methodology
  HK-NUCA
  Results
  Conclusions
Methodology
  Simulation tools: Simics + GEMS, CACTI v6.0
  Two scenarios:
    Multi-programmed: mix of SPEC CPU2006
    Parallel applications: PARSEC
  Configuration:
    Number of cores: 8 (UltraSPARC IIIi)
    Frequency: 1.5 GHz
    Main memory size: 4 GBytes
    Memory bandwidth: 512 Bytes/cycle
    Private L1 caches: 8 x 32 KBytes, 2-way
    Shared L2 NUCA cache: 8 MBytes, 128 banks
    NUCA bank: 64 KBytes, 8-way
    L1 cache latency: 3 cycles
    NUCA bank latency: 4 cycles
    Router delay: 1 cycle
    On-chip wire delay: 1 cycle
    Main memory latency: 250 cycles (from core)
Baseline architecture
  D-NUCA cache
    8 MBytes, 128 banks
    Bank: 64 KBytes, 8-way
  Migration scheme: Gradual Promotion
  Replacement: LRU
  Access: Partitioned Multicast
  [Figure: Core 0 to Core 7]
HK-NUCA
  Home Knows where to find data in the NUCA cache
  The home bank knows which other banks hold at least one data block that it manages
  There is an HK-PTR per cache set in all banks
  [Figure: HK-PTR]
HK-NUCA
  (1) Fast access
  (2) Call Home
  (3) Parallel access
  [Figure: the three access steps for a request from Core 0, on the chip with Core 0 to Core 7]
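The three steps can be read as the following lookup sketch. It is an interpretation of the step names and of the HK-PTR description, not the authors' implementation: the function name, the dictionary-based bank model, and the choice to count one message per probed bank are all assumptions.

```python
from typing import Dict, Optional, Set, Tuple


def hk_nuca_lookup(block: int, local_bank: str, home_bank: str,
                   contents: Dict[str, Set[int]],
                   hk_ptr: Set[str]) -> Tuple[Optional[str], int]:
    """Return (bank holding the block or None, messages sent to the NUCA).

    contents maps a bank name to the blocks it stores (hypothetical model);
    hk_ptr is the set of banks the home bank's HK-PTR marks as holding at
    least one block managed by that home bank.
    """
    # (1) Fast access: probe the bank closest to the requesting core.
    messages = 1
    if block in contents.get(local_bank, set()):
        return local_bank, messages

    # (2) Call Home: probe the home bank, which also supplies the HK-PTR.
    if home_bank != local_bank:
        messages += 1
        if block in contents.get(home_bank, set()):
            return home_bank, messages

    # (3) Parallel access: probe only the banks flagged by the HK-PTR.
    targets = sorted(hk_ptr - {local_bank, home_bank})
    messages += len(targets)
    for bank in targets:
        if block in contents.get(bank, set()):
            return bank, messages
    return None, messages          # NUCA miss: go to main memory
```

Because the HK-PTR only records that a bank holds some block managed by the home bank, a flagged bank may still miss for the requested block. The sketch charges that message anyway.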
Managing Home knowledge
  Actions that provoke an update of the HK-PTR:
    New data enters the cache
    Eviction from the NUCA cache
    Migration movements
  Migrations are synchronized with HK-PTR updates
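The three update triggers can be illustrated with a toy model of a single HK-PTR. The class and method names are hypothetical. The per-bank sets of resident blocks are a modelling convenience for deciding when a bit may be cleared, and the slides do not say how the hardware makes that decision.

```python
from collections import defaultdict
from typing import Dict, Set


class HomeKnowledge:
    """Toy model of one HK-PTR (one cache set of one home bank)."""

    def __init__(self) -> None:
        # bank name -> blocks managed by this home set that live in that bank
        self.resident: Dict[str, Set[int]] = defaultdict(set)

    def pointer(self) -> Set[str]:
        """Banks whose HK-PTR bit would be set."""
        return {bank for bank, blocks in self.resident.items() if blocks}

    def insert(self, block: int, bank: str) -> None:
        """New data enters the cache."""
        self.resident[bank].add(block)

    def evict(self, block: int, bank: str) -> None:
        """Eviction from the NUCA cache."""
        self.resident[bank].discard(block)

    def migrate(self, block: int, src: str, dst: str) -> None:
        """Migration: both updates happen together, mirroring the
        synchronization between migrations and HK-PTR updates."""
        self.evict(block, src)
        self.insert(block, dst)
```

In this model a bank's bit stays set after a migration or eviction as long as another block managed by the same home set remains in that bank.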
Overheads
  Hardware implementation: HK-PTRs
    NUCA cache: 8 MBytes
    HK-PTRs: 32 KBytes
  Network: Home knowledge updates
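A back-of-the-envelope check shows how a 32 KByte figure can arise from one HK-PTR per cache set in every bank. The bank count, bank size, and associativity come from the methodology slide. The 64-byte line size and the 16-bit HK-PTR width are assumptions that do not appear on the slides; they were chosen because they reproduce the stated total, and other combinations (for example 32-byte lines with 8-bit pointers) would too.

```python
CACHE_BYTES = 8 * 1024 * 1024   # shared L2 NUCA cache, from the slides
NUM_BANKS = 128                 # from the slides
BANK_BYTES = 64 * 1024          # from the slides
WAYS = 8                        # from the slides


def hk_ptr_storage_bytes(line_bytes: int = 64, ptr_bits: int = 16) -> int:
    """Total HK-PTR storage; line_bytes and ptr_bits are assumed values."""
    sets_per_bank = BANK_BYTES // (line_bytes * WAYS)
    total_sets = sets_per_bank * NUM_BANKS
    return total_sets * ptr_bits // 8


if __name__ == "__main__":
    storage = hk_ptr_storage_bytes()
    print(f"HK-PTR storage: {storage // 1024} KBytes")
    print(f"Relative to the NUCA cache: {storage / CACHE_BYTES:.2%}")
```

With these assumed values the pointers occupy well under 1% of the NUCA data capacity.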
Performance results
  Overall performance improvement of 4-6%
  Chart annotations:
    Workloads with high miss rate
    Low miss rate, but high hit rate in the first two HK-NUCA stages
    Low miss rate, high hit rate in the parallel access stage of HK-NUCA
HK-NUCA accuracy
  85% of memory requests send fewer than 6 messages to the NUCA
On-chip network traffic
  Average messages sent per request:
    Partitioned Multicast: 10.03
    HK-NUCA (3-steps): 3.82
    HK-NUCA (2-steps): 4.06
    Perfect Search: 1
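The relative traffic reduction follows directly from these averages. The short calculation below uses only the four values on the slide; the dictionary and function names are illustrative.

```python
AVG_MESSAGES = {
    "Partitioned Multicast": 10.03,
    "HK-NUCA (3-steps)": 3.82,
    "HK-NUCA (2-steps)": 4.06,
    "Perfect Search": 1.0,
}


def reduction_vs_baseline(scheme: str,
                          baseline: str = "Partitioned Multicast") -> float:
    """Fraction of per-request messages saved relative to the baseline."""
    return 1.0 - AVG_MESSAGES[scheme] / AVG_MESSAGES[baseline]


if __name__ == "__main__":
    for name in ("HK-NUCA (3-steps)", "HK-NUCA (2-steps)"):
        print(f"{name}: {reduction_vs_baseline(name):.1%} fewer messages")
```

Relative to partitioned multicast, this gives roughly 62% fewer messages for the 3-step variant and roughly 60% fewer for the 2-step variant.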
Energy consumption results
  HK-NUCA reduces dynamic energy consumption by more than 50%
Conclusions
  D-NUCA makes it possible to exploit the non-uniformity of NUCA caches
  D-NUCA benefits are restricted by the access scheme used
  HK-NUCA is an access scheme for D-NUCA organizations
    Allows fast access to data that is near the requesting core
    Home knowledge reduces miss resolution time and network contention
    Outperforms the best-performing access scheme by 6%
    Reduces dynamic energy consumption by 50%
HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs
  Questions?
Migration is not the problem
  [Figure: S-NUCA vs D-NUCA]
  Access scheme is the main limitation in D-NUCA