Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring
Lei Jin and Sangyeun Cho
Dept. of Computer Science, University of Pittsburgh
CMPMSI’07 02/11/07
Multicore distributed L2 caches
  - L2 caches are typically sub-banked and distributed
    - IBM Power4/5: 3 banks
    - Sun Microsystems T1: 4 banks
    - Intel Itanium2 (L3): many “sub-arrays”
  - Distributed L2 caches + switched NoC: NUCA
  - Hardware-based management schemes
    - Private caching
    - Shared caching
    - Hybrid caching
  [Figure labels: Local L2 Cache, Processor Core, Router]
Private and shared caching
  - Private caching
    - short hit latency (always local)
    - high on-chip miss rate
    - long miss resolution time
    - complex coherence enforcement
  - Shared caching
    - low on-chip miss rate
    - straightforward data location
    - simple coherence (no replication)
    - long average hit latency
Other approaches
  - Hybrid/flexible schemes
    - “Core clustering” [Speight et al., ISCA2005]
    - “Flexible CMP cache sharing” [Huh et al., ICS2004]
    - “Flexible bank mapping” [Liu et al., HPCA2004]
  - Improving shared caching
    - “Victim replication” [Zhang and Asanovic, ISCA2005]
  - Improving private caching
    - “Cooperative caching” [Chang and Sohi, ISCA2006]
    - “CMP-NuRAPID” [Chishti et al., ISCA2005]
Motivation
  [Figure labels: Miss rate, Hit latency]
  - What is the optimal balance between miss rate and hit latency?
Talk roadmap
  - Data mapping, a key property [Cho and Jin, MICRO 2006]
  - Two-dimensional (2D) page coloring algorithm
  - Evaluation and results
  - Conclusion and future work
Data mapping
  - Data mapping: memory data location in the L2 cache
  - Private caching
    - Data mapping determined by program location
    - Mapping created at miss time
    - No explicit control
  - Shared caching
    - Data mapping determined by address: slice number = (block address) % (N slice)
    - Mapping is static
    - No explicit control
Change mapping granularity
  - Block granularity: slice number = (block address) % (N slice)
  - Page granularity: slice number = (page address) % (N slice)
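The two mapping formulas can be compared with a short sketch. The block size (64 bytes), page size (4 KB), and the helper names are illustrative assumptions, not values from the slides; the 16 slices match the 16-tile configuration in the experiment setup.

```python
BLOCK_SIZE = 64    # bytes per cache block (assumed for illustration)
PAGE_SIZE = 4096   # bytes per page (assumed for illustration)
N_SLICE = 16       # L2 slices, one per tile in the 16-core setup


def slice_by_block(phys_addr: int) -> int:
    """Block granularity: slice number = (block address) % N_SLICE."""
    return (phys_addr // BLOCK_SIZE) % N_SLICE


def slice_by_page(phys_addr: int) -> int:
    """Page granularity: slice number = (page address) % N_SLICE."""
    return (phys_addr // PAGE_SIZE) % N_SLICE


def slices_touched(page_number: int, mapping) -> set:
    """Return the set of slices that hold blocks of one physical page."""
    base = page_number * PAGE_SIZE
    return {mapping(base + offset) for offset in range(0, PAGE_SIZE, BLOCK_SIZE)}


if __name__ == "__main__":
    # Block-granularity mapping spreads one page over every slice;
    # page-granularity mapping keeps the whole page in a single slice.
    print(sorted(slices_touched(5, slice_by_block)))
    print(sorted(slices_touched(5, slice_by_page)))
```

Under these assumptions, page-granularity mapping places all blocks of a page in one slice, which is what lets the choice of physical page decide where the data lives.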
OS controlled page mapping
  [Figure labels: Memory pages; Program 1; Program 2; OS page allocation; Virtual address space; Physical address space]
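One way an OS could steer a virtual page to a chosen L2 slice is to keep a free list per page color and hand out a physical page whose number maps to that slice. The sketch below is a toy model under that assumption; `ColoredPageAllocator` and its methods are hypothetical names, and a real kernel allocator would differ considerably.

```python
from collections import deque

N_SLICE = 16  # number of page colors, one per L2 slice (assumed)


class ColoredPageAllocator:
    """Toy allocator: free physical pages are grouped by color (ppn % N_SLICE)."""

    def __init__(self, num_phys_pages: int):
        self.free = [deque() for _ in range(N_SLICE)]
        for ppn in range(num_phys_pages):
            self.free[ppn % N_SLICE].append(ppn)
        self.page_table = {}  # virtual page number -> physical page number

    def map_page(self, vpn: int, color: int) -> int:
        """Back virtual page `vpn` with a free physical page of the given color."""
        if not self.free[color]:
            raise MemoryError(f"no free physical page with color {color}")
        ppn = self.free[color].popleft()
        self.page_table[vpn] = ppn
        return ppn


if __name__ == "__main__":
    allocator = ColoredPageAllocator(num_phys_pages=64)
    ppn = allocator.map_page(vpn=100, color=5)
    print(ppn, ppn % N_SLICE)
```

The program's virtual addresses are unchanged; only the physical page chosen at allocation time determines the slice.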
2D page coloring: the problem
  - Network latency / hop = 3 cycles
  - Memory latency = 300 cycles
  - Cost(color #) = (# access x # hop x 3 cycles) + (# miss x 300 cycles)
  [Figure labels: Page, access, miss, cost, P]
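The cost expression can be evaluated directly. In this sketch the latencies come from the slide, while the access, hop, and miss counts are made-up numbers chosen only to show the trade-off between locality and misses.

```python
HOP_CYCLES = 3     # network latency per hop, from the slide
MEM_CYCLES = 300   # memory latency, from the slide


def placement_cost(accesses: int, hops: int, misses: int) -> int:
    """Cost(color) = (# access x # hop x 3 cycles) + (# miss x 300 cycles)."""
    return accesses * hops * HOP_CYCLES + misses * MEM_CYCLES


if __name__ == "__main__":
    # Hypothetical page with 1000 accesses:
    local_but_conflicting = placement_cost(accesses=1000, hops=0, misses=30)
    remote_but_quiet = placement_cost(accesses=1000, hops=2, misses=5)
    print(local_but_conflicting, remote_but_quiet)
```

With these hypothetical counts, the slice two hops away is cheaper than the local one because the extra hop latency is outweighed by the avoided misses.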
2D coloring algorithm
  - Collect L2 reference trace
  - Derive conflict information [Sherwood et al., ICS1999]
  [Figure labels: Page A, Page C, Page B; Reference 1, Reference 2, Reference 3, Reference 4]
2D coloring algorithm (cont’d)
  - Derive conflict information
  - Trace: Page A (Reference 1)
  Reference Matrix    Conflict Matrix
      A B C               A B C
    A 0 0 0             A 0 0 0
    B 0 0 0             B 0 0 0
    C 0 0 0             C 0 0 0
2D coloring algorithm (cont’d)
  - Derive conflict information
  - Trace: Page A (Reference 1)
  Reference Matrix    Conflict Matrix
      A B C               A B C
    A 0 0 0             A 0 0 0
    B 1 0 0             B 0 0 0
    C 1 0 0             C 0 0 0
2D coloring algorithm (cont’d)
  - Derive conflict information
  - Trace: Page A (Reference 1), Page B (Reference 2)
  Reference Matrix    Conflict Matrix
      A B C               A B C
    A 0 0 0             A 0 0 0
    B 1 0 0             B 0 0 0
    C 1 0 0             C
2D coloring algorithm (cont’d)
  - Derive conflict information
  - Trace: Page A (Reference 1), Page B (Reference 2)
  Reference Matrix    Conflict Matrix
      A B C               A B C
    A 0 1 0             A 0 0 0
    B 1 0 0             B 0 0 0
    C 1 1 0             C
2D coloring algorithm (cont’d)
  - Derive conflict information
  - Trace: Page A (Reference 1), Page B (Reference 2), Page B (Reference 3)
  Reference Matrix    Conflict Matrix
      A B C               A B C
    A 0 1 0             A 0 0 0
    B 0 0 0             B 1 0 0
    C 1 1 0             C 0 0 0
2D coloring algorithm (cont’d)
  - Derive conflict information
  - Trace: Page A (Reference 1), Page B (Reference 2), Page B (Reference 3), Page C (Reference 4)
  Reference Matrix    Conflict Matrix
      A B C               A B C
    A 0 1 0             A 0 0 0
    B 0 0 0             B 1 0 0
    C 1 1 0             C
2D coloring algorithm (cont’d)
  - Derive conflict information
  - Trace: Page A (Reference 1), Page B (Reference 2), Page B (Reference 3), Page C (Reference 4)
  Reference Matrix    Conflict Matrix
      A B C               A B C
    A 0 1 1             A 0 0 0
    B 0 0 1             B 1 0 0
    C 1 1 0             C
2D coloring algorithm (cont’d)
  - 2D page coloring
  - Trace: Page A (Reference 1), Page B (Reference 2), Page B (Reference 3), Page C (Reference 4)
  Reference Matrix    Conflict Matrix     Access Counter
      A B C               A B C             A B C
    A 0 1 1             A 0 0 0             1 2 1
    B 0 0 1             B 1 0 0
    C 0 0 0             C 1 1 0
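The matrix walk-through for the trace A, B, B, C can be reproduced with a short sketch. The update rule used here is an assumption inferred from the final matrices on the slides: on each reference to a page, its reference row is added to its conflict row and cleared, and the page is then marked in every other page's reference row. The function name `derive_conflicts` is hypothetical, and the exact ordering of intermediate steps in the original animation may differ.

```python
def derive_conflicts(trace):
    """Build reference, conflict, and access-count tables from a page trace.

    reference[p][q] == 1 means page q was touched since the last reference to p.
    conflict[p][q] accumulates those marks each time p is referenced again.
    """
    pages = sorted(set(trace))
    reference = {p: {q: 0 for q in pages} for p in pages}
    conflict = {p: {q: 0 for q in pages} for p in pages}
    access = {p: 0 for p in pages}

    for page in trace:
        access[page] += 1
        # Fold the pages seen since the previous reference into the conflict row,
        # then clear the reference row for this page.
        for other in pages:
            conflict[page][other] += reference[page][other]
            reference[page][other] = 0
        # Mark this page as seen in every other page's reference row.
        for other in pages:
            if other != page:
                reference[other][page] = 1
    return reference, conflict, access


if __name__ == "__main__":
    reference, conflict, access = derive_conflicts(["A", "B", "B", "C"])
    for name, table in (("reference", reference), ("conflict", conflict)):
        print(name)
        for page, row in table.items():
            print(" ", page, list(row.values()))
    print("access", access)
```

For this trace the sketch ends with the same reference matrix, conflict matrix, and access counter as the last step of the walk-through.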
2D coloring algorithm (cont’d)
  - 2D page coloring
  Conflict Matrix     Access Counter
      A B C             A B C
    A 0 0 0             1 2 1
    B 1 0 0
    C 1 1 0
  - Cost(color, page#) = (α x #Conflict(color) x mem latency) + ((1-α) x #Access x #hop(color) x hop delay)
  - Optimal color(page#) = {C | Cost(C, page#) = MIN[Cost(color, page#)] over all colors}
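A minimal sketch of the color selection follows. Several details are assumptions made for illustration rather than taken from the slides: tiles form a 4x4 mesh, #hop is the Manhattan distance from the requesting core's tile, and #Conflict(color) is the sum of the page's conflict counts against pages already assigned to that color. The names `hops`, `cost`, and `optimal_color` are hypothetical.

```python
MEM_LATENCY = 300   # cycles, from the problem slide
HOP_DELAY = 3       # cycles per network hop, from the problem slide
MESH_WIDTH = 4      # assumed 4x4 mesh for the 16-tile CMP
N_COLORS = MESH_WIDTH * MESH_WIDTH


def hops(src: int, dst: int) -> int:
    """Manhattan distance between two tiles on the assumed mesh."""
    sx, sy = src % MESH_WIDTH, src // MESH_WIDTH
    dx, dy = dst % MESH_WIDTH, dst // MESH_WIDTH
    return abs(sx - dx) + abs(sy - dy)


def cost(page, color, alpha, conflict, access, assigned, core_tile):
    """alpha x #Conflict x mem latency + (1 - alpha) x #Access x #hop x hop delay."""
    n_conflict = sum(conflict[page].get(other, 0)
                     for other, c in assigned.items() if c == color)
    miss_term = alpha * n_conflict * MEM_LATENCY
    hop_term = (1 - alpha) * access[page] * hops(core_tile, color) * HOP_DELAY
    return miss_term + hop_term


def optimal_color(page, alpha, conflict, access, assigned, core_tile=0):
    """Return the lowest-cost color; ties go to the lowest color number."""
    return min(range(N_COLORS),
               key=lambda color: cost(page, color, alpha, conflict,
                                      access, assigned, core_tile))


if __name__ == "__main__":
    # Conflict counts and access counters from the walk-through (A, B, B, C).
    conflict = {"A": {}, "B": {"A": 1}, "C": {"A": 1, "B": 1}}
    access = {"A": 1, "B": 2, "C": 1}
    assigned = {"A": 0}  # page A already placed in the core's local slice
    print(optimal_color("B", 0.5, conflict, access, assigned))
    print(optimal_color("B", 1 / 256, conflict, access, assigned))
```

In this toy setting a large α moves page B away from the slice that holds page A to avoid the conflict, while a small α keeps it local, which is the balance α is meant to tune.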
Experiment setup
  - Experiments were carried out using a simulator derived from the SimpleScalar toolset.
  - The simulator models a 16-core tile-based CMP.
  - Each core has private 32KB I/D L1 caches and a 256KB slice of the globally shared L2 (4MB total).
  [Figure labels: Profiling, 2D coloring, Timing simulation; Trace, Page mapping, Tuning α]
Optimal page mapping
  [Figure: gcc; # of pages plotted over x and y; panels for α = 1/64 and α = 1/256]
Access distribution
  [Figure: α from 1/32 to 1/2048]
Relative performance
Value of α
Conclusions
  - With careful data placement, there is substantial room for performance improvement.
  - Dynamic mapping schemes that use hardware-assisted information may achieve a similar performance improvement.
  - This method can also be applied to other optimization targets.
Current and future work
  - Dynamic mapping schemes
  - Performance
  - Power
  - Multiprogrammed and parallel workloads
Thank you & Questions?
Private caching
  1. L1 miss
  2. L2 access
     - Hit
     - Miss
  3. Access directory
     - A copy on chip
     - Global miss
  [Figure labels: L1 miss, Local L2 access]
  - short hit latency (always local)
  - high on-chip miss rate
  - long miss resolution time
  - complex coherence enforcement
Shared caching
  1. L1 miss
  2. L2 access
     - Hit
     - Miss
  [Figure label: L1 miss]
  - low on-chip miss rate
  - straightforward data location
  - simple coherence (no replication)
  - long average hit latency
Performance
  [Figure: performance improvement over shared caching; labeled values 141% and 150%]