Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 (Naveen Muralimanohar, Rajeev Balasubramonian; University of Utah & HP Labs)


1 Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi
University of Utah & HP Labs

2 Large Caches
 Cache hierarchies will dominate chip area
 3D-stacked processors with an entire die of on-chip cache could be common
 Montecito has two private 12 MB L3 caches (27 MB including L2)
 Long global wires are required to transmit data/address
[Figure: Intel Montecito cache]

3 Wire Delay/Power
 Wire delays are costly for performance and power
 Latencies of 60 cycles to reach the ends of a chip (32 nm, 5 GHz)
 50% of dynamic power is in interconnect switching (Magen et al., SLIP '04)
 CACTI (version 4) access time for a 24 MB cache is 90 cycles (5 GHz, 65 nm technology)

4 Contribution
 Support for various interconnect models, improving design space exploration
 Support for modeling Non-Uniform Cache Access (NUCA)

5 Cache Design Basics
[Figure: cache block diagram. The input address drives the decoder; wordlines and bitlines span the tag and data arrays; column muxes, sense amps, comparators, mux drivers, and output drivers produce the valid signal and data output.]

6 Existing Model - CACTI
[Figures: cache models with 4 sub-arrays and with 16 sub-arrays, annotated with decoder delay and wordline & bitline delay]
 Decoder delay = H-tree delay + logic delay

7 Power/Delay Overhead of Wires
 H-tree delay increases with cache size
 H-tree power continues to dominate
 Bitlines are the other major contributor to total power

8 Motivation
 The dominant role of interconnect is clear
 The lack of a tool that models interconnect in detail can impede progress
 Current solutions (Orion, CACTI) have limited wire options:
-Weak wire model
-No support for modeling multi-megabyte caches

9 CACTI 6.0 Enhancements
 Incorporation of:
-Different wire models
-Different router models
-Grid topology for NUCA
-Shared bus for UCA
-Contention values for various cache configurations
 Methodology to compute the optimal NUCA organization
 Improved interface that enables trade-off analysis
 Validation analysis

10 Full-swing Wires
[Figure: a repeated full-swing wire, with labeled nodes X, Y, and Z]

11 Full-swing Wires II
[Figure: delay vs. repeater size and spacing, marking three design points at 10%, 20%, and 30% delay penalty]
 Caveat: repeater sizing and spacing cannot always be controlled precisely

12 Full-Swing Wires
 Fast and simple: delay proportional to sqrt(RC) as against RC
 High bandwidth: can be pipelined
-Requires silicon area
-High energy
-Quadratic dependence on voltage
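The sqrt(RC)-vs-RC distinction can be illustrated with a toy repeater model. This is a minimal sketch with invented per-mm resistance and capacitance values and repeater delay, not CACTI's actual 65 nm wire parameters:

```python
# Illustrative per-mm wire parameters (assumed, not CACTI's values).
R_PER_MM = 250.0      # ohms per mm
C_PER_MM = 200e-15    # farads per mm

def unrepeated_delay(length_mm):
    """Distributed RC delay of a bare wire: 0.38 * R * C * L^2.
    Grows quadratically with length."""
    return 0.38 * R_PER_MM * C_PER_MM * length_mm ** 2

def repeated_delay(length_mm, segment_mm=1.0, repeater_delay=20e-12):
    """Breaking the wire into repeater-driven segments makes total delay
    linear in length: n short segments, each paying its own (small,
    quadratic-in-segment) RC delay plus a fixed repeater delay."""
    n = max(1, round(length_mm / segment_mm))
    seg = length_mm / n
    return n * (0.38 * R_PER_MM * C_PER_MM * seg ** 2 + repeater_delay)
```

Doubling the length quadruples the unrepeated delay but only doubles the repeated delay, which is why long full-swing wires are always repeated despite the silicon area and energy cost of the repeaters.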

13 Low-swing Wires
[Figure: differential low-swing signaling, with 400 mV drive and a 50 mV rise/drop on the differential wire pair]

14 Differential Low-swing
+Very low power; can be routed over other modules
-Relatively slow, low bandwidth, high area requirement; requires a special transmitter and receiver
 Bitlines are a form of low-swing wire
-Optimized for speed and area as against power
-Driver and pre-charger employ the full Vdd voltage
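The power advantage follows from first-order dynamic-energy formulas: a full-swing transition costs C*Vdd^2, while a low-swing driver draws charge from the supply but only swings the wire by a small amplitude, giving roughly C*Vswing*Vdd. A sketch with illustrative voltages (assumed, not CACTI's parameters):

```python
VDD = 1.1        # full-swing supply (V), illustrative 65 nm-era value
V_SWING = 0.1    # low-swing amplitude (V), assumed

def full_swing_energy(c_wire):
    """Dynamic energy of one full-swing transition: C * Vdd^2."""
    return c_wire * VDD ** 2

def low_swing_energy(c_wire):
    """Low-swing transition: charge C*V_SWING drawn from Vdd,
    so energy scales as C * V_SWING * Vdd."""
    return c_wire * V_SWING * VDD
```

With these numbers the energy ratio is Vdd/Vswing, an order of magnitude, which is why low-swing wires appear on the low-power end of the design space despite their delay and area drawbacks.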

15 Delay Characteristics
[Figure annotation: quadratic increase in delay]

16 Energy Characteristics

17 Search Space of CACTI-5
 Design space with global wires optimized for delay

18 Search Space of CACTI-6
 Design space with global and low-swing wires
[Figure: design points labeled Least Delay, 30% Delay Penalty, and Low-swing]
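The expanded search space matters because no single wire choice dominates both delay and power, so the interesting outputs are the non-dominated (Pareto-optimal) configurations. A minimal sketch of such a filter over (delay, power) design points:

```python
def pareto_frontier(points):
    """Return the design points not dominated in both delay and power.

    points: list of (delay, power) tuples; a point is dominated if some
    other point is no worse in both dimensions (and not identical).
    """
    frontier = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            frontier.append(p)
    return sorted(frontier)
```

A "least delay" point, a "30% delay penalty" point, and a "low-swing" point would all survive this filter, each trading delay for power, which is what the CACTI-6 scatter plot visualizes.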

19 CACTI - Another Limitation
 Access delay is equal to the delay of the slowest sub-array: very high hit time for large caches
 Employs a separate bus for each cache bank in multi-banked caches: not scalable
 Potential solution: NUCA
 Extend CACTI to model NUCA, exploiting different wire types and network design choices to improve the search space

20 Non-Uniform Cache Access (NUCA)*
 Large cache is broken into a number of small banks
 Employs an on-chip network for communication
 Access delay ∝ (distance between bank and cache controller)
[Figure: CPU & L1 with an array of cache banks]
*(Kim et al., ASPLOS '02)

21 Extension to CACTI
 On-chip network:
-Wire model based on ITRS 2005 parameters
-Grid network
-3-stage speculative router pipeline
 Network latency vs. bank access latency trade-off:
-Iterate over different bank sizes
-Calculate the average network delay based on the number of banks and bank sizes
-Consider contention values for different cache configurations
 Power consumed by each organization is considered similarly
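The bank-size iteration above can be sketched as a small search loop. The cost model below (a caller-supplied bank-access function and a fixed per-hop cycle count, with no contention term) is an invented stand-in for CACTI 6.0's detailed models:

```python
import math

def best_nuca_partition(total_mb, bank_counts, bank_access_cycles, hop_cycles):
    """Sweep candidate bank counts and pick the partition minimizing
    average access time = bank access latency + avg hops * per-hop delay.

    bank_access_cycles: assumed function mapping bank size (MB) to cycles.
    hop_cycles: assumed router + link traversal delay per hop.
    """
    best = None
    for n in bank_counts:
        cols = int(math.sqrt(n))
        rows = n // cols
        # Average Manhattan distance (in hops) from a corner controller
        # to a bank in a rows x cols grid.
        avg_hops = ((rows - 1) + (cols - 1)) / 2
        total = bank_access_cycles(total_mb / n) + avg_hops * hop_cycles
        if best is None or total < best[1]:
            best = (n, total)
    return best
```

More banks mean smaller, faster banks but more network hops; the sweep finds the balance point, and the answer shifts toward fewer banks as the per-hop cost grows.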

22 Trade-off Analysis (32 MB Cache, 16-core CMP)

23 Effect of Core Count

24 Power-Centric Design (32 MB Cache)

25 Validation
 HSPICE tool
 Predictive Technology Model (65 nm tech.)
 Analytical model that employs PTM parameters, compared against HSPICE
 Distributed wordlines, bitlines, low-swing transmitters, wires, receivers
 Verified to be within 12%

26 Case Study: Heterogeneous D-NUCA
 Dynamic NUCA:
-Reduces access time by dynamic data movement
-Nearby banks are accessed more frequently
 Heterogeneous banks:
-Nearby banks are made smaller and hence faster
-Accesses to nearby banks consume less power
-Other banks can be made larger and more power-efficient

27 Access Frequency
[Figure: percentage of requests satisfied by the first x KB of cache]
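The heterogeneous D-NUCA argument reduces to a weighted average: if most requests are satisfied by the nearest banks, making those banks small and fast lowers mean latency even when the far banks get slower. A sketch with invented hit fractions and latencies (not measured data):

```python
def avg_access_latency(hit_fracs, latencies):
    """Mean access latency given, per bank group, the fraction of
    requests it satisfies and its access latency."""
    assert abs(sum(hit_fracs) - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(f * l for f, l in zip(hit_fracs, latencies))

# Assumed distribution: most requests hit in the nearest bank group.
fracs = [0.6, 0.3, 0.1]
uniform = avg_access_latency(fracs, [20, 20, 20])  # homogeneous banks
hetero = avg_access_latency(fracs, [12, 24, 40])   # small/fast near banks
```

Here the heterogeneous layout wins on average even though its farthest banks are twice as slow as the homogeneous ones, because the skewed access frequency weights the fast near banks most heavily.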

28 A Few Heterogeneous Organizations Considered by CACTI
[Figure: two example bank layouts, Model 1 and Model 2]

29 Other Applications
 Exposing wire properties:
-Novel cache pipelining: early lookup, aggressive lookup (ISCA '07)
-Flit-reservation flow control (Peh et al., HPCA '00)
 Novel topologies:
-Hybrid network (ISCA '07)

30 Conclusion
 Network parameters and contention play a critical role in deciding the NUCA organization
 Wire choices have a significant impact on cache properties
 CACTI 6.0 can identify models that reduce power by a factor of three for a delay penalty of 25%