
1 Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi
University of Utah & HP Labs

2 Large Caches
• Cache hierarchies will dominate chip area
• 3D stacked processors with an entire die for on-chip cache could be common
• Montecito has two private 12 MB L3 caches (27 MB including L2)
• Long global wires are required to transmit data/address
[Figure: Intel Montecito cache]

3 Wire Delay/Power
• Wire delays are costly for performance and power
• Latencies of 60 cycles to reach the ends of a chip at 32 nm (at 5 GHz)
• 50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
• CACTI (version 4) access time for a 24 MB cache is 90 cycles at 5 GHz in 65 nm technology
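To put the cycle counts on this slide in absolute terms, the conversion below is a minimal sketch. The helper name `cycles_to_ns` is introduced here for illustration and is not part of CACTI.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds.

    At f GHz one cycle lasts 1/f ns, so the latency is cycles / f.
    """
    return cycles / clock_ghz


# Figures quoted on the slide, both at a 5 GHz clock.
cross_chip_ns = cycles_to_ns(60, 5.0)    # cross-chip wire latency at 32 nm
cacti4_24mb_ns = cycles_to_ns(90, 5.0)   # CACTI 4 access time, 24 MB cache

print(f"cross-chip: {cross_chip_ns} ns, 24 MB access: {cacti4_24mb_ns} ns")
```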

4 Contribution
• Support for various interconnect models
  • Improved design space exploration
• Support for modeling Non-Uniform Cache Access (NUCA)

5 Cache Design Basics
[Figure: cache organization. Labels: input address, decoder, wordline, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output drivers, data output, valid output?]

6 Existing Model - CACTI
[Figure: cache model with 4 sub-arrays and cache model with 16 sub-arrays, annotated with decoder delay and wordline & bitline delay]
• Decoder delay = H-tree delay + logic delay

7 Power/Delay Overhead of Wires
• H-tree delay increases with cache size
• H-tree power continues to dominate
• Bitlines are the other major contributors to total power

8 Motivation
• The dominant role of interconnect is clear
• The lack of a tool to model interconnect in detail can impede progress
• Current solutions (Orion, CACTI) have limited wire options
  - Weak wire model
  - No support for modeling multi-megabyte caches

9 CACTI 6.0 Enhancements
• Incorporation of
  • Different wire models
  • Different router models
  • Grid topology for NUCA
  • Shared bus for UCA
  • Contention values for various cache configurations
• Methodology to compute optimal NUCA organization
• Improved interface that enables trade-off analysis
• Validation analysis

10 Full-swing Wires
[Figure: full-swing wire, points labeled X, Y, Z]

11 Full-swing Wires II
[Figure: three different design points (10%, 20%, and 30% delay penalty) plotted against repeater size]
• Caveat: repeater sizing and spacing cannot be controlled precisely all the time

12 Full-Swing Wires
• Fast and simple
  • Delay proportional to sqrt(RC) rather than RC
• High bandwidth
  • Can be pipelined
- Requires silicon area
- High energy
  • Quadratic dependence on voltage
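A rough way to see why repeaters help: the delay of a bare distributed RC wire grows with the product of total resistance and total capacitance, so it is quadratic in length, while a wire cut into fixed-length repeated segments grows linearly. The sketch below uses the standard 0.38/0.69 Elmore-style coefficients. All electrical values are hypothetical placeholders, not CACTI 6.0 or ITRS parameters, and this is not CACTI's actual wire model.

```python
def unrepeated_delay(r_per_mm, c_per_mm, length_mm):
    """50% delay of a distributed RC line: 0.38 * R_total * C_total."""
    r_total = r_per_mm * length_mm
    c_total = c_per_mm * length_mm
    return 0.38 * r_total * c_total


def repeated_delay(r_per_mm, c_per_mm, length_mm, n_segments, r_driver, c_gate):
    """Delay of a wire split into n equal segments, each driven by a repeater."""
    seg_mm = length_mm / n_segments
    r_seg = r_per_mm * seg_mm
    c_seg = c_per_mm * seg_mm
    per_segment = (0.69 * r_driver * (c_seg + c_gate)   # repeater charging wire + next gate
                   + 0.38 * r_seg * c_seg               # distributed wire RC
                   + 0.69 * r_seg * c_gate)             # wire resistance charging next gate
    return n_segments * per_segment


def switching_energy(c_total, vdd):
    """Full-swing dynamic energy per transition: C * Vdd^2 (quadratic in voltage)."""
    return c_total * vdd ** 2


# Hypothetical values, chosen only to make the trend visible.
R_PER_MM = 1000.0      # ohm per mm
C_PER_MM = 0.2e-12     # farad per mm
R_DRIVER = 500.0       # repeater output resistance, ohm
C_GATE = 10e-15        # repeater input capacitance, farad

for length_mm in (5, 10):
    plain = unrepeated_delay(R_PER_MM, C_PER_MM, length_mm)
    # One repeater per millimetre, so the segment length stays fixed.
    rep = repeated_delay(R_PER_MM, C_PER_MM, length_mm, length_mm, R_DRIVER, C_GATE)
    print(f"{length_mm} mm: unrepeated {plain * 1e12:.1f} ps, repeated {rep * 1e12:.1f} ps")
```

Doubling the length quadruples the unrepeated delay but only doubles the repeated one. The cost is the repeaters themselves, which is where the silicon area and energy drawbacks listed on the slide come from.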

13 Low-swing wires
[Figure: differential wires. Labels: 400 mV, 50 mV raise, 50 mV drop, 400 mV]

14 Differential Low-swing
+ Very low power, can be routed over other modules
- Relatively slow, low bandwidth, high area requirement, requires special transmitter and receiver
• Bitlines are a form of low-swing wire
  • Optimized for speed and area rather than power
  • Driver and pre-charger employ full Vdd voltage
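The power advantage of low-swing signaling can be sketched with the usual first-order energy expressions. This assumes the energy drawn from a supply to move a capacitance C by a swing V_swing is C * V_supply * V_swing, which reduces to C * Vdd^2 for a full-swing wire. The capacitance, the 1.0 V Vdd, and the reading of the 400 mV / 50 mV labels as supply level and swing are illustrative assumptions, not values taken from CACTI 6.0.

```python
def full_swing_energy(c_wire, vdd):
    """Energy drawn from the supply when a wire swings rail to rail."""
    return c_wire * vdd * vdd


def low_swing_energy(c_wire, v_supply, v_swing, n_wires=1):
    """Energy drawn when each of n_wires moves by v_swing from a v_supply rail."""
    return n_wires * c_wire * v_supply * v_swing


C_WIRE = 1e-12   # hypothetical wire capacitance, farad
VDD = 1.0        # hypothetical full-swing supply, volt

full = full_swing_energy(C_WIRE, VDD)
# Assumed reading of the figure labels: a 400 mV low-swing supply and
# a 50 mV excursion on each wire of the differential pair.
low = low_swing_energy(C_WIRE, v_supply=0.4, v_swing=0.05, n_wires=2)

print(f"full swing: {full:.3e} J, differential low swing: {low:.3e} J")
```

The sketch leaves out the transmitter and receiver circuits, which the slide lists as a cost of this wire type.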

15 Delay Characteristics
[Figure: wire delay characteristics; annotation: quadratic increase in delay]

16 Energy Characteristics

17 Search Space of CACTI-5
• Design space with global wires optimized for delay

18 Search Space of CACTI-6
• Design space with global and low-swing wires
[Figure: design points grouped as least delay, 30% delay penalty, and low-swing]
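One way to read this enlarged search space is as a constrained selection: accept any design within a chosen delay penalty of the fastest one, then take the lowest-energy candidate. The sketch below illustrates that selection only. The `Design` record, the function name, and every number in `candidates` are made up for illustration and are not CACTI outputs or its interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    delay_ns: float
    energy_nj: float


def pick_lowest_energy(designs, max_delay_penalty):
    """Return the lowest-energy design within the allowed delay penalty.

    max_delay_penalty is a fraction, e.g. 0.30 allows up to 30% more delay
    than the fastest design in the list.
    """
    best_delay = min(d.delay_ns for d in designs)
    limit = best_delay * (1.0 + max_delay_penalty)
    feasible = [d for d in designs if d.delay_ns <= limit]
    return min(feasible, key=lambda d: d.energy_nj)


# Hypothetical design points, one per wire type named on the slide.
candidates = [
    Design("least-delay global wires", delay_ns=4.0, energy_nj=3.0),
    Design("30%-penalty global wires", delay_ns=5.0, energy_nj=2.0),
    Design("low-swing wires", delay_ns=9.0, energy_nj=0.8),
]

for penalty in (0.0, 0.30, 2.0):
    choice = pick_lowest_energy(candidates, penalty)
    print(f"penalty {penalty:.0%}: {choice.name}")
```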

19 CACTI: Another Limitation
• Access delay is equal to the delay of the slowest sub-array
  • Very high hit time for large caches
• Employs a separate bus for each cache bank in multi-banked caches
  • Not scalable
• Exploit different wire types and network design choices to improve the search space
• Potential solution: NUCA
  • Extend CACTI to model NUCA

20 Non-Uniform Cache Access (NUCA)*
• Large cache is broken into a number of small banks
• Employs an on-chip network for communication
• Access delay ∝ distance between bank and cache controller
[Figure: CPU & L1 surrounded by cache banks]
*(Kim et al., ASPLOS 02)

21 Extension to CACTI
• On-chip network
  • Wire model based on ITRS 2005 parameters
  • Grid network
  • 3-stage speculative router pipeline
• Network latency vs. bank access latency tradeoff
  • Iterate over different bank sizes
  • Calculate the average network delay based on the number of banks and bank sizes
  • Consider contention values for different cache configurations
• Similarly, we also consider the power consumed for each organization
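The bank-size iteration described on this slide can be sketched as a simple sweep: for each bank count, add a bank access latency, an average network latency over a grid, and a contention term, then keep the minimum. Everything numeric below is a hypothetical stand-in (the bank latency formula, the per-hop cost, the corner placement of the cache controller, and the contention table); CACTI 6.0 derives these from its own circuit and network models.

```python
import math

CACHE_KB = 32 * 1024   # 32 MB total capacity, as in the trade-off slides
HOP_CYCLES = 4         # hypothetical cost of one router + link traversal

# Hypothetical contention cycles per configuration (fewer banks, more conflicts).
CONTENTION_CYCLES = {1: 10.0, 4: 6.0, 16: 3.0, 64: 2.0, 256: 1.5}


def bank_access_cycles(bank_kb):
    """Hypothetical bank latency: grows with the square root of bank capacity."""
    return 2.0 + 0.25 * math.sqrt(bank_kb)


def avg_hops(n_banks):
    """Average Manhattan distance from a controller at grid corner (0, 0)."""
    side = math.isqrt(n_banks)   # assumes a square grid of banks
    total = sum(x + y for x in range(side) for y in range(side))
    return total / (side * side)


def evaluate(n_banks):
    bank_kb = CACHE_KB // n_banks
    network = 2 * avg_hops(n_banks) * HOP_CYCLES   # request plus reply
    total = bank_access_cycles(bank_kb) + network + CONTENTION_CYCLES[n_banks]
    return {"n_banks": n_banks, "bank_kb": bank_kb, "total_cycles": total}


results = [evaluate(n) for n in CONTENTION_CYCLES]
best = min(results, key=lambda r: r["total_cycles"])

for r in results:
    print(f"{r['n_banks']:4d} banks x {r['bank_kb']:6d} KB: {r['total_cycles']:.1f} cycles")
print("lowest average latency:", best["n_banks"], "banks")
```

The same loop could accumulate an energy figure per organization in place of, or alongside, the latency, which is the power-side evaluation the last bullet refers to.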

22 Trade-off Analysis (32 MB Cache)
[Figure: trade-off analysis for a 16-core CMP]

23 Effect of Core Count

24 Power Centric Design (32 MB Cache)

25 Validation
• HSPICE tool
• Predictive Technology Model (PTM), 65 nm technology
• Analytical model that employs PTM parameters, compared against HSPICE
• Distributed wordlines, bitlines, low-swing transmitters, wires, and receivers
  • Verified to be within 12%

26 Case Study: Heterogeneous D-NUCA
• Dynamic-NUCA
  • Reduces access time by dynamic data movement
  • Nearby banks are accessed more frequently
• Heterogeneous banks
  • Nearby banks are made smaller and hence faster
  • Accesses to nearby banks consume less power
  • Other banks can be made larger and more power efficient

27 Access Frequency
• % of requests satisfied by x KB of cache

28 Few Heterogeneous Organizations Considered by CACTI
[Figure: Model 1 and Model 2]

29 Other Applications
• Exposing wire properties
  • Novel cache pipelining
    • Early lookup, aggressive lookup (ISCA 07)
  • Flit-reservation flow control (Peh et al., HPCA 00)
  • Novel topologies
    • Hybrid network (ISCA 07)

30 Conclusion
• Network parameters and contention play a critical role in deciding NUCA organization
• Wire choices have a significant impact on cache properties
• CACTI 6.0 can identify models that reduce power by a factor of three for a delay penalty of 25%
http://www.hpl.hp.com/personal/Norman_Jouppi/cacti6.html
http://www.cs.utah.edu/~rajeev/cacti6/

