Non-Uniform Power Access in Large Caches with Low-Swing Wires
Aniruddha N. Udipi
with Naveen Muralimanohar*, Rajeev Balasubramonian
University of Utah and *HP Labs

Motivation
Future CMPs likely to be power-limited
Growing gap between processor and main memory performance – the Bandwidth Wall
–Large caches required to alleviate this problem
–Nehalem already has 8MB of last-level cache
These large caches contribute significantly to energy consumption
–They are often the cache coherence interface in CMPs
–The cache's contribution to energy is likely to rise as core energy falls with simpler and more efficient cores

Executive Summary
H-tree identified as the energy bottleneck within large cache banks
Various techniques studied to introduce low-swing wiring to address this bottleneck
Non-Uniform Power Access: different regions of the cache can be accessed at different energy costs
Architectural mechanisms to increase the fraction of accesses hitting in the low-power region
Significant cache energy reductions at very modest performance penalties

Outline
Cache design background
Technique I – Single low-swing bus
Technique II – Multiple low-swing buses
Technique III – Fully-pipelined low-swing bus
Technique IV – Non-Uniform Power Access
Technique V – Architectural mechanisms
Evaluation
Conclusion

NUCA design
Increasing disparity in access delays to different parts of the cache
Non-Uniform Cache Access
–Divide the large cache into multiple "banks"
–An on-chip network connects these banks and transfers addresses and data
–Bank count and bank size are determined by the relative contribution of banks and network to total energy/delay
–Per CACTI 6.0, even a 64MB NUCA cache is likely to have large 2MB or 4MB banks
[Figure: CMP tile layout – cores and cache banks connected by an interconnect]

Bank design basics
[Figure: cache bank access path – input address → decoder → wordline → bitlines → tag array / data array → column muxes → sense amps → comparators → mux drivers → output driver]

Bank design considerations
A naïve implementation would take the form of a single array of memory cells with centralized control logic, but such a design would not scale
–Wordlines (area considerations) and bitlines (differential signaling) cannot be repeated – delay increases with cache size
–Cache bandwidth is a function of cycle time – a single array would have low bandwidth
Performance is limited by wordline/bitline length
–Divide the array into multiple segments called "subarrays"
–Subarrays are connected by an internal network

Bank organization
Bank organization is determined by Ndwl and Ndbl, the number of segments into which the wordlines and bitlines are divided
Fewer subarrays give better area efficiency, but larger delay due to longer wordlines/bitlines
[Figure: a bank split into subarrays with Ndwl = 4, Ndbl = 4, connected by an H-tree]
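
To make the Ndwl/Ndbl trade-off concrete, here is a back-of-the-envelope sketch (our own illustrative model, not CACTI; the one-bit-per-cell square-mat layout and the cell pitch are assumptions) of how partitioning shortens the wordline and bitline segments:

```python
# Back-of-the-envelope view of bank partitioning (NOT CACTI's model).
# Cell pitch and the square-mat assumption are illustrative only.

def subarray_geometry(capacity_bytes, ndwl, ndbl, cell_um=0.2):
    """Split a bank of SRAM cells into ndwl x ndbl subarrays and report
    the resulting wordline/bitline segment lengths."""
    bits = capacity_bytes * 8
    side = int(bits ** 0.5)              # assume a roughly square mat of cells
    cols_per_subarray = side // ndwl     # wordline segment length (in cells)
    rows_per_subarray = side // ndbl     # bitline segment length (in cells)
    return {
        "subarrays": ndwl * ndbl,
        "wordline_um": cols_per_subarray * cell_um,
        "bitline_um": rows_per_subarray * cell_um,
    }

if __name__ == "__main__":
    for ndwl, ndbl in [(2, 2), (4, 4), (8, 32)]:
        print((ndwl, ndbl), subarray_geometry(4 * 1024 * 1024, ndwl, ndbl))
```

Larger Ndwl/Ndbl values shorten the RC-limited wordlines and bitlines (less delay) at the cost of more subarrays, more peripheral circuitry, and a larger H-tree to reach them.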

Bank Energy Consumption
The H-tree is clearly the dominant component of energy consumption

Low-swing wires
High power dissipation in global wires due to the full-swing requirement imposed by repeaters
Use low-voltage-swing differential signaling
–Two wires per signal
–Voltage swing as low as 100mV
–Approx. 10X energy savings compared to full-swing wires
–Increased delay; cannot be used over long distances
–Non-trivial pipelining costs
What is the best way to use low-swing wires to build the H-tree?
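
The roughly 10X figure is consistent with a first-order dynamic-energy estimate. The sketch below is our own (the supply voltage, per-mm capacitance, and the simplification that ignores transmitter/receiver energy and the second wire of the differential pair are all assumptions):

```python
# First-order dynamic energy per transition of a global wire segment.
# Full swing (repeated):       E ~ C * Vdd^2
# Low swing, driven from Vdd:  E ~ C * Vdd * Vswing  (charge C*Vswing drawn from Vdd)
# Values below are illustrative assumptions.

C_PER_MM = 200e-15   # ~200 fF of wire capacitance per mm (assumption)
VDD = 0.9            # volts (assumption)
VSWING = 0.1         # volts (slide: swing can be as low as ~100 mV)

def full_swing_fj(length_mm):
    return C_PER_MM * length_mm * VDD ** 2 * 1e15

def low_swing_fj(length_mm):
    return C_PER_MM * length_mm * VDD * VSWING * 1e15

if __name__ == "__main__":
    for mm in (1, 2, 4):
        fs, ls = full_swing_fj(mm), low_swing_fj(mm)
        print(f"{mm} mm: full-swing {fs:.0f} fJ  low-swing {ls:.0f} fJ  ({fs / ls:.0f}x)")
```

With these numbers the ratio is Vdd/Vswing, roughly 9x, in line with the slide's "approximately 10X".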

Outline
Cache design background
Technique I – Single low-swing bus
Technique II – Multiple low-swing buses
Technique III – Fully-pipelined low-swing bus
Technique IV – Non-Uniform Power Access
Technique V – Architectural mechanisms
Evaluation
Conclusion

Single low-swing bus
Simplest solution: build the entire H-tree with low-swing wires
Best energy savings
Significant performance drops
–Cycle time becomes equal to access time
–Increased contention
Not worth considering unless energy is considerably more important than performance

Outline
Cache design background
Technique I – Single low-swing bus
Technique II – Multiple low-swing buses
Technique III – Fully-pipelined low-swing bus
Technique IV – Non-Uniform Power Access
Technique V – Architectural mechanisms
Evaluation
Conclusion

Multiple low-swing buses
Spread contention across several buses
A fast vertical bus, with tri-state buffers at the intersections
The energy overhead is modeled accurately
[Figure: bank with multiple low-swing buses joined to the vertical bus through tri-state buffers]

Outline
Cache design background
Technique I – Single low-swing bus
Technique II – Multiple low-swing buses
Technique III – Fully-pipelined low-swing bus
Technique IV – Non-Uniform Power Access
Technique V – Architectural mechanisms
Evaluation
Conclusion

Fully-pipelined low-swing bus
Pipelining low-swing wires is non-trivial
A differential transmitter and receiver are required at every pipeline stage
Amortized over 1mm of wire, every transceiver is a 58% energy overhead
Performance improves compared to the non-pipelined low-swing bus
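
To see how the per-stage transceivers eat into the savings, a small sketch (illustrative; the only number taken from the slide is the 58%-per-mm transceiver overhead, while the 1 mm stage length and the single end-to-end transceiver pair in the unpipelined case are assumptions):

```python
# Relative energy of a pipelined vs. unpipelined low-swing bus.
# Wire energy is normalized to 1.0 per mm of low-swing wire.

WIRE_PER_MM = 1.0
XCVR_OVERHEAD = 0.58   # slide: each transceiver ~58% of 1 mm of wire energy
STAGE_MM = 1.0         # assumed pipeline stage length

def unpipelined(length_mm):
    # one transmitter/receiver pair for the whole bus (assumption)
    return length_mm * WIRE_PER_MM + XCVR_OVERHEAD * WIRE_PER_MM

def pipelined(length_mm):
    stages = max(1, round(length_mm / STAGE_MM))
    return length_mm * WIRE_PER_MM + stages * XCVR_OVERHEAD * WIRE_PER_MM

if __name__ == "__main__":
    for mm in (2, 4, 8):
        up, p = unpipelined(mm), pipelined(mm)
        print(f"{mm} mm: unpipelined {up:.2f}  pipelined {p:.2f}  (+{100 * (p / up - 1):.0f}%)")
```

The energy overhead grows with hop count, but pipelining restores bank bandwidth, which is why this design point trades some of the low-swing energy advantage for performance.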

Outline
Cache design background
Technique I – Single low-swing bus
Technique II – Multiple low-swing buses
Technique III – Fully-pipelined low-swing bus
Technique IV – Non-Uniform Power Access
Technique V – Architectural mechanisms
Evaluation
Conclusion

Non-Uniform Power Access
[Figure: bank with a low-swing H-tree trunk alongside the default full-swing H-tree; the subarrays reachable over the trunk form the low-power region, the rest form the high-power region]

Non-Uniform Power Access
Introducing the low-swing trunk does not significantly affect the basic H-tree design
Limited low-swing length
–Access time is the same as for the default H-tree
–The new bus is transparent to the processor
Energy savings are proportional to the fraction of rows accessible via the low-swing bus
–Only the two central rows – 1/16th in our case (Ndbl = 32)
–Architectural mechanisms are required to increase this fraction
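
Why that fraction matters can be captured in a one-line expected-value model (a sketch; the relative LP/HP H-tree energies are assumptions, with the high-power traversal normalized to 1):

```python
# Expected H-tree energy per access, given the fraction f of accesses
# that are served out of the low-power region.

E_HP = 1.0    # full-swing H-tree traversal (normalized)
E_LP = 0.1    # assumed cost of a traversal over the low-swing trunk

def expected_energy(f):
    return f * E_LP + (1 - f) * E_HP

if __name__ == "__main__":
    for f in (1 / 16, 0.5, 0.9):
        print(f"LP fraction {f:.3f}: {expected_energy(f):.2f}x of baseline")
```

With only the two central rows (a 1/16 fraction if accesses were spread uniformly) the savings are marginal, which is exactly why the mechanisms on the following slides try to steer most accesses into the low-power region.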

Outline
Cache design background
Technique I – Single low-swing bus
Technique II – Multiple low-swing buses
Technique III – Fully-pipelined low-swing bus
Technique IV – Non-Uniform Power Access
Technique V – Architectural mechanisms
Evaluation
Conclusion

Exploiting Non-Uniform Power Access
Increase the fraction of accesses served by the "low-power region"
Assign a fraction of each set's ways to the "low-power region (LP)" and the rest of the ways to the "high-power region (HP)"
On every access, check all tags in parallel; if it hits in the LP region, it is a low-power access
If not, bring the line into the low-power region at this point – the next use will then likely be a low-power access (sketched below)
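
A minimal sketch of that per-set flow (our own illustrative code, not the authors' simulator; the 4-way/12-way split is an assumption, and the handling of the LP victim follows the Swap scheme described on the next slide):

```python
# One set of a 16-way cache with its ways split between a low-power (LP)
# and a high-power (HP) region. Tags only; data movement, writebacks and
# the HP-side replacement policy are omitted for brevity.

class NUPASet:
    def __init__(self, lp_ways=4, hp_ways=12):
        self.lp_ways, self.hp_ways = lp_ways, hp_ways
        self.lp = []   # MRU-ordered tags resident in the low-power ways
        self.hp = []   # tags resident in the high-power ways

    def access(self, tag):
        """Classify the access and migrate the line toward the LP region."""
        if tag in self.lp:                    # hit in LP: a low-power access
            self.lp.remove(tag)
            self.lp.insert(0, tag)
            return "LP hit"
        if tag in self.hp:                    # hit in HP: high-power access,
            self.hp.remove(tag)               # then pull the line into LP so
            self._install_lp(tag)             # the next use is cheap
            return "HP hit"
        self._install_lp(tag)                 # miss: fill into LP on first touch
        return "miss"

    def _install_lp(self, tag):
        self.lp.insert(0, tag)
        if len(self.lp) > self.lp_ways:       # LP full: demote its LRU line
            self.hp.insert(0, self.lp.pop())  # (Swap scheme behaviour)
            if len(self.hp) > self.hp_ways:
                self.hp.pop()                 # evict from the set altogether
```

Feeding a short trace such as tags 1, 2, 1, 3, 1 through access() yields miss, miss, LP hit, miss, LP hit: repeatedly used lines quickly become low-power hits.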

Swap scheme
Bring the block into the low-power region on first touch
The block currently in the LRU position of that set is swapped out into the high-power region
–The most recently used (MRU) ways of every set are therefore in the LP region
Every low-power fetch incurs a swap, which costs two low-power and two high-power accesses
For Swap to consume less energy than the baseline over N accesses:
–N * H > 2 * H + (N+1) * L
–i.e., N > 2.5
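
Rearranging the slide's inequality makes the break-even reuse count explicit (the only input below is the high-to-low energy ratio; a ratio of about 7 is an assumption chosen so the threshold lands at the quoted 2.5):

```python
from fractions import Fraction

def swap_breakeven(h, l):
    """Smallest reuse count N for which N*h > 2*h + (N+1)*l holds,
    i.e. N > (2*h + l) / (h - l)."""
    return Fraction(2 * h + l, h - l)

# Example: a high-power access ~7x the energy of a low-power access
# (assumption; it reproduces the slide's N > 2.5 threshold).
print(float(swap_breakeven(7, 1)))   # 2.5
```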

Duplicate scheme
Bring the block into both the low-power and the high-power region on first touch
The block currently in the LRU position of the low-power region is
–Simply dropped if clean – better than Swap
–Written back to the high-power region if dirty – same as Swap
Every L2 miss initially results in one additional HP access
Forming equations similar to Swap:
–N_clean > 1.16
–N_dirty > 2.6
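
A small sketch of how Duplicate differs from Swap at LP-eviction time (illustrative only, not the authors' code; dirty tracking and HP copy management are simplified):

```python
# LP-region eviction in the Duplicate scheme: a copy of the line already
# exists in the HP region, so a clean LP victim can simply be dropped;
# only a dirty victim needs a (high-power) writeback to the HP copy.

def evict_from_lp(victim_tag, dirty, hp_region):
    if dirty:
        hp_region[victim_tag] = "dirty"   # high-power write back (same as Swap)
        return "writeback to HP"
    # clean victim: HP already holds the data, nothing to do (cheaper than Swap)
    return "dropped"

hp = {0x1000: "clean"}                    # the duplicate made on first touch
print(evict_from_lp(0x1000, dirty=False, hp_region=hp))   # dropped
print(evict_from_lp(0x1000, dirty=True,  hp_region=hp))   # writeback to HP
```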

Dynamic Reconfiguration
Good energy savings if the hit rate in the low-power region is modestly high
Below a certain threshold, the extra energy required to move blocks between the LP and HP regions overshadows the savings
Track average reuse and turn off the architectural mechanisms in bad phases, operating like the default cache
–A single five-bit saturating counter for the entire cache
–Increment the counter on a hit in the LP region, decrement on a miss
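
A minimal sketch of that reuse-tracking counter (illustrative; the slide specifies only the counter width and the increment/decrement policy, so the threshold and initial value below are assumptions):

```python
# Five-bit saturating counter that gates the LP/HP migration mechanisms.

class ReuseMonitor:
    MAX = 31                      # five-bit saturating counter

    def __init__(self, threshold=8):
        self.count = self.MAX // 2        # assumed starting point
        self.threshold = threshold        # assumed enable/disable threshold

    def lp_hit(self):
        self.count = min(self.MAX, self.count + 1)

    def miss(self):
        self.count = max(0, self.count - 1)

    def migrations_enabled(self):
        # In "bad" phases (little reuse in the LP region) fall back to
        # behaving like the default cache: stop swapping/duplicating.
        return self.count >= self.threshold
```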

Comparison to an L2/L3 hierarchy or a Filter Cache
The data placement and mapping schemes do bear a resemblance to an L2/L3 hierarchy or a filter cache
–Our approach is orthogonal to the hierarchy and can continue to be used for the largest last-level cache
–The need for interconnects between multiple physical cache structures is eliminated
–The non-uniform access model is 25% more efficient than a filter-cache model with similar capacities

Outline
Cache design background
Technique I – Single low-swing bus
Technique II – Multiple low-swing buses
Technique III – Fully-pipelined low-swing bus
Technique IV – Non-Uniform Power Access
Technique V – Architectural mechanisms
Evaluation
Conclusion

Methodology
SimpleScalar 3.0 out-of-order simulator
CACTI 6.0 for cache energy/delay computation
32nm process, 5GHz clock
32KB each of I-L1 and D-L1, 2-way
Unified 4MB L2 cache, 16-way
300-cycle main memory latency
SPEC2k benchmark suite

Low-swing design points – Energy

Low-swing design points – IPC

Low-swing design points
There is clearly a trade-off between energy savings and performance drops
ED² metric
–The non-uniform model gives a 5% improvement over the baseline
–The pipelined low-swing model is next best, with a 3% improvement over the baseline
–These are the two most compelling design points
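
For reference, the metric used here is energy times delay squared, which penalizes slowdowns more heavily than energy; a tiny sketch with made-up numbers (not the paper's data):

```python
# Energy-delay-squared product, normalized to a baseline design.
def ed2(energy, delay):
    return energy * delay ** 2

baseline = ed2(1.0, 1.0)
candidate = ed2(0.7, 1.1)   # hypothetical: 30% less energy, 10% slower
print(f"ED^2 relative to baseline: {candidate / baseline:.2f}")  # ~0.85
```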

Architectural mechanisms

Dynamic reconfiguration

Sensitivity to cache size

Outline
Cache design background
Technique I – Single low-swing bus
Technique II – Multiple low-swing buses
Technique III – Fully-pipelined low-swing bus
Technique IV – Non-Uniform Power Access
Technique V – Architectural mechanisms
Evaluation
Conclusion

Related Work
Low-swing wires
–The "Smart Memories" project, CACTI 6.0
Cache access energy
–Drowsy cache, gated-ground cache, L0 instruction cache, non-uniformity in the number of ways per set
Ours is the first work to optimize the internal structure of the cache and to propose non-uniform power access within a cache bank

Key Contributions
Study of the internal organization of large cache banks and identification of its energy bottleneck
Exploration of the design space of low-swing wiring within large caches
Introduction of the notion of Non-Uniform Power Access
–Definition of the architectural mechanisms required to maximize the energy-saving potential of low-swing wires

Thank you. Questions?