University of Utah
The Effect of Interconnect Design on the Performance of Large L2 Caches
Naveen Muralimanohar, Rajeev Balasubramonian
Motivation: Large Caches
- Future processors will have large on-chip caches; Intel Montecito already has 24 MB of on-chip cache
- Wire delay dominates in large caches; a conventional design can lead to very high hit times (CACTI's access time for a 24 MB cache is 90 cycles at 5 GHz, 65 nm technology)
- Careful network choices improve access time, open room for several other optimizations, and reduce power significantly
Effect of L2 Hit Time
- 8-issue, out-of-order processor; L2 hit time reduced from 30 to 15 cycles
- Average performance improvement: 17%
Cache Design
(Diagram: the input address drives the decoder; wordlines and bitlines feed the tag and data arrays; column muxes, sense amps, comparators, and mux drivers produce the valid-output signal; output drivers produce the data output)
Existing Model: CACTI
- Decoder delay; wordline and bitline delay
- (Figures: cache model with 4 sub-arrays vs. cache model with 16 sub-arrays)
Shortcomings of CACTI
- Suboptimal for large cache sizes: access delay equals the delay of the slowest sub-array, giving very high hit times for large caches
- Employs a separate bus for each cache bank in multi-banked caches
Non-Uniform Cache Access (NUCA)
- The large cache is broken into a number of small banks
- An on-chip network handles communication
- Access delay depends on the distance between the bank and the cache controller
- (Figure: CPU & L1 connected to a grid of cache banks)
Shortcomings of NUCA
- Banks are sized such that the link latency is one cycle (Kim et al., ASPLOS '02)
- Increased routing complexity
- Dissipates more power
Extension to CACTI
- On-chip network: wires modeled using ITRS 2005 parameters
- Grid network: the number of rows equals the number of columns (or half the number of columns)
- Network latency vs. bank access latency tradeoff: the exhaustive search is modified to include the network overhead
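The modified search can be sketched as follows: for each candidate bank count, total access time is the bank access delay plus the worst-case grid traversal, and the search keeps the minimum. All cycle counts and the per-hop cost below are illustrative placeholders, not CACTI's actual numbers.

```python
import math

def total_access_time(num_banks, bank_access_cycles, hop_cycles=2):
    """Total L2 access = bank access + network traversal.

    The grid has rows == columns (sqrt(num_banks) banks per side);
    the worst-case route crosses one full row and one full column.
    """
    side = int(math.sqrt(num_banks))
    worst_hops = 2 * (side - 1)            # row traversal + column traversal
    return bank_access_cycles + worst_hops * hop_cycles

# Sweep bank counts: more banks means faster individual banks but
# more network hops.  The bank-access values are made-up stand-ins
# for what a CACTI lookup would return.
candidates = {16: 17, 64: 6, 256: 4, 1024: 3}
best = min(candidates, key=lambda n: total_access_time(n, candidates[n]))
```

This captures the tradeoff the slide names: past the delay-optimal point, extra banking is swamped by network hops.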
Effect of Network Delay (32 MB cache)
(Graph: total access time across bank counts, marking the delay-optimal point)
Outline
- Overview
- Cache Design
- Effect of Network Delay
- Wire Design Space
- Exploiting Heterogeneous Wires
- Results
Wire Characteristics
- Wire resistance and capacitance per unit length
- (Figure: resistance, capacitance, and bandwidth as functions of wire width and spacing)
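The qualitative relationship can be sketched with a toy model: the delay of an optimally repeated wire scales with sqrt(R x C), resistance per unit length falls as the wire gets wider, and coupling capacitance falls as spacing grows. The constants below are illustrative, not ITRS 2005 values.

```python
import math

def wire_delay_per_mm(width, spacing, rho=0.05, c_ground=0.04, c_couple=0.08):
    """Relative delay per mm of an optimally repeated wire (toy model)."""
    r = rho / width                            # resistance falls with width
    c = c_ground * width + c_couple / spacing  # coupling falls with spacing
    return math.sqrt(r * c)

# Quadrupling width and spacing (an L-wire-like layout) lowers delay,
# at the cost of routing area and therefore bandwidth.
base = wire_delay_per_mm(1.0, 1.0)
fast = wire_delay_per_mm(4.0, 4.0)
```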
Design Space Exploration: Tuning Wire Width and Spacing
- Base case: B-wires
- Increasing width and spacing yields fast but low-bandwidth L-wires, trading bandwidth for delay
Design Space Exploration: Tuning Repeater Size and Spacing
- Traditional wires: large repeaters, optimally spaced
- Power-optimal wires: smaller repeaters with increased spacing, trading delay for power
Design Space Exploration: Wire Types

Wire type                       Plane  Latency  Power  Area
B-wires (base case)             8x     1x       1x     1x
W-wires (base case)             4x     1.6x     0.9x   0.5x
PW-wires (power optimized)      4x     3.2x     0.3x   0.5x
L-wires (fast, low bandwidth)   8x     0.5x     0.5x   5x
Access Time for Different Link Types (cycles)

Bank count  Bank access  Avg access: 8x-wires  4x-wires  L-wires
16          17           46                    75        21
32          9            40                    71        15
64          6            38                    63        14
128         5            44                    68        17
256         4            51                    83        20
512         3            82                    113       27
1024        3            100                   133       35
2048        3            99                    162       51
4096        3            131                   196       67
Outline
- Overview
- Cache Design
- Effect of Network Delay
- Wire Design Space
- Exploiting Heterogeneous Wires
- Results
Cache Look-Up
Total cache access time:
- Network delay (requires 6-8 address bits to identify the cache bank)
- Decoder, wordline, and bitline delay (requires 10-15 address bits)
- Comparator and output-driver delay (requires the remaining address bits for the tag match)
The entire access happens sequentially.
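As a tiny model of this serialization (cycle counts are illustrative placeholders, not measured values):

```python
# Baseline (sequential) look-up: each stage starts only after the
# previous one completes, so the stage delays simply add up.
def sequential_lookup(network=10, bank_access=6, tag_match=2):
    return network + bank_access + tag_match
```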
Early Look-Up
- Send a partial address (10-15 bits) on L-wires to initiate the bank lookup
- Wait for the complete address, then finish the access with the tag match
- This hides 60-70% of the bank access delay
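A rough timing sketch of the overlap, with illustrative cycle counts (the 65% hidden fraction reflects the 60-70% claim above; everything else, including the L-wire lead time, is a placeholder):

```python
def early_lookup(network=10, bank_access=6, tag_match=2,
                 l_wire_lead=6, hidden_fraction=0.65):
    """Partial address bits race ahead on fast L-wires and start the
    decoder/wordline/bitline work early; only the final tag match
    waits for the full address on the slower wires.  `l_wire_lead`
    is how many cycles earlier the L-wire bits arrive."""
    hidden = min(bank_access * hidden_fraction, l_wire_lead)
    return network + bank_access - hidden + tag_match

sequential = 10 + 6 + 2         # the same stages done back to back
overlapped = early_lookup()     # part of the bank access is hidden
```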
Aggressive Look-Up
- Send partial address bits on L-wires; do the early look-up plus a partial tag match (requires an additional 8 address bits)
- Aggressively send all matching blocks to the cache controller, where the full tag match is done
- Network delay is reduced
Aggressive Look-Up
+ Significant reduction in network delay (for the address transfer)
+ Increase in traffic due to false matches is below 1%
+ Marginal increase in link overhead: an additional 8 bits of L-wires compared to the early look-up
- Adds complexity to the cache controller, which needs logic to perform the tag match
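A minimal sketch of the partial tag match and the false-match possibility (the helper name and tag values are made up for illustration):

```python
def partial_match(stored_tags, partial, bits=8):
    """Aggressive look-up: the bank compares only `bits` low-order
    tag bits and forwards every way that matches on those bits.
    Occasionally more than one way matches (a false match), costing
    extra traffic; the full tag match at the controller filters it."""
    mask = (1 << bits) - 1
    return [t for t in stored_tags if (t & mask) == (partial & mask)]

ways = [0x1A2B, 0x3C2B, 0x55F0, 0x901D]   # tags in one 4-way set
matches = partial_match(ways, 0x002B)     # two ways share the low byte
```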
Outline
- Overview
- Cache Design
- Effect of Network Delay
- Wire Design Space
- Exploiting Heterogeneous Wires
- Results
Experimental Setup
- SimpleScalar with contention modeled in detail
- Single-core, 8-issue out-of-order processor
- 32 MB, 8-way set-associative on-chip L2 cache (S-NUCA organization)
- 32 KB I-cache and 32 KB D-cache with a hit latency of 3 cycles
- Main memory latency: 300 cycles
Cache Models

Model  Bank access (cycles)  Bank count  Network link  Description
1      3                     512         B-wires       Based on prior work
2      6                     64          B-wires       CACTI-L2
3      6                     64          B & L-wires   Early look-up
4      6                     64          B & L-wires   Aggressive look-up
5      6                     64          B & L-wires   Upper bound
Performance Results (Global Wires)
- Model 2 (CACTI-L2): average performance improvement 11%; 16.3% on L2-latency-sensitive benchmarks
- Model 3 (Early Look-up): average 14.4%; 21.6% on L2-latency-sensitive benchmarks
- Model 4 (Aggressive Look-up): average 17.6%; 26.6% on L2-latency-sensitive benchmarks
- Model 6 (L-Network): average 11.4%; 16.2% on L2-latency-sensitive benchmarks
Performance Results (4x Wires)
- In this wire-delay-constrained model, the performance improvements are larger
- Early look-up performs 5% better; the aggressive model performs 28% better
Future Work
- Heterogeneous network in a CMP environment
- Hybrid network: a combination of point-to-point links and buses for L-messages
- Effective use of L-wires: latency/bandwidth tradeoff
- Use of heterogeneous wires in a D-NUCA environment
- Cache design focusing on power: prefetching and writebacks over power-optimized wires
Conclusion
- Traditional design approaches for large caches are suboptimal
- Network parameters play a significant role in the performance of large caches
- The modified CACTI model, which includes the network overhead, performs 16.3% better than previous models
- A heterogeneous network can further improve performance: early look-up 21.6%, aggressive look-up 26.6%