The Effect of Interconnect Design on the Performance of Large L2 Caches
Naveen Muralimanohar, Rajeev Balasubramonian
University of Utah

Motivation: Large Caches
 Future processors will have large on-chip caches
   Intel Montecito has a 24MB on-chip cache
 Wire delay dominates in large caches
   A conventional design can lead to very high hit time (CACTI access time for a 24MB cache is 90 cycles at 5GHz, 65nm technology)
 Careful network choices
   Improve access time
   Open room for several other optimizations
   Reduce power significantly

Effect of L2 Hit Time
8-issue, out-of-order processor (L2 hit time in cycles)
Avg = 17%

Cache Design
[Figure: cache organization showing the input address, decoder, wordlines, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, and output drivers producing the valid signal and data output]

Existing Model: CACTI
[Figure: decoder delay and wordline & bitline delay, shown for a cache model with 4 sub-arrays and a cache model with 16 sub-arrays]

Shortcomings of CACTI
 Suboptimal for large cache sizes
   Access delay equals the delay of the slowest sub-array, giving very high hit times for large caches
 Employs a separate bus for each cache bank in multi-banked caches

Non-Uniform Cache Access (NUCA)
 The large cache is broken into a number of small banks
 Employs an on-chip network for communication
 Access delay is proportional to the distance between the bank and the cache controller
[Figure: CPU & L1 connected to a grid of cache banks]
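The distance-proportional delay above can be made concrete with a toy model. Everything here is an illustrative assumption (grid coordinates, cycle counts) rather than the paper's parameters:

```python
# Toy NUCA latency model: access delay grows with the Manhattan
# distance between the cache controller and the target bank.
# All cycle counts below are illustrative assumptions.

BANK_ACCESS = 3      # cycles to read one bank
LINK_LATENCY = 1     # cycles per network link
ROUTER_LATENCY = 1   # cycles per router traversal

def nuca_access_cycles(bank_xy, controller_xy=(0, 0)):
    """Round-trip access time to a bank at grid position bank_xy."""
    hops = (abs(bank_xy[0] - controller_xy[0]) +
            abs(bank_xy[1] - controller_xy[1]))
    one_way = hops * (LINK_LATENCY + ROUTER_LATENCY)
    return one_way + BANK_ACCESS + one_way  # request + bank read + reply

print(nuca_access_cycles((0, 1)))  # nearby bank
print(nuca_access_cycles((7, 7)))  # far corner bank
```

The spread between the two printed values is exactly the non-uniformity NUCA exposes: near banks answer in a handful of cycles while far banks pay tens of cycles of network traversal.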

Shortcomings of NUCA
 Banks are sized such that the link latency is one cycle (Kim et al., ASPLOS '02)
 Increased routing complexity
 Dissipates more power

Extension to CACTI
 On-chip network
   Wires modeled using ITRS 2005 parameters
   Grid network: number of rows = number of columns (or half the number of columns)
 Network latency vs. bank access latency tradeoff
   Modified the exhaustive search to include the network overhead
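The network-aware search can be sketched as follows. The two delay functions are illustrative stand-ins, not CACTI's internal models; only the shape of the tradeoff (smaller banks are faster to access but enlarge the network) is taken from the slides:

```python
import math

# Toy version of the network-aware exhaustive search: splitting a
# cache into more banks makes each bank faster, but the grid network
# connecting them grows.  The best bank count minimizes the sum.

def bank_access_delay(bank_count, total_mb=32):
    """Cycles to access one bank; smaller banks are faster (illustrative)."""
    return 2 + total_mb / bank_count

def avg_network_delay(bank_count, hop_cycles=1):
    """Round-trip cycles across a square grid of banks (illustrative)."""
    side = math.isqrt(bank_count)      # rows == columns in the grid
    return 2 * side * hop_cycles       # ~average request + reply hops

def best_bank_count(candidates=(4, 16, 64, 256, 1024)):
    """Exhaustive search over bank counts, including network overhead."""
    return min(candidates,
               key=lambda n: bank_access_delay(n) + avg_network_delay(n))

print(best_bank_count())
```

With these toy constants the optimum is an intermediate bank count: few large banks lose on bank access time, many tiny banks lose on network hops, mirroring the delay-optimal point on the next slide.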

Effect of Network Delay (32MB cache)
[Figure: total access time vs. bank count, marking the delay-optimal point]

Outline
 Overview
 Cache Design
 Effect of Network Delay
 Wire Design Space
 Exploiting Heterogeneous Wires
 Results

Wire Characteristics
 Wire resistance and capacitance per unit length
[Figure: resistance, capacitance, and bandwidth as functions of wire width and spacing]

Design Space Exploration
 Tuning wire width and spacing
   Base case: B wires
   Fast but low bandwidth: L wires
   Increasing width & spacing decreases delay but also decreases bandwidth
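The width/spacing tradeoff follows from a first-order RC argument: wider wires lower resistance, wider spacing lowers coupling capacitance, and the delay of an optimally repeated wire grows with the square root of RC per unit length, while fat wires fit fewer signals per routing track. A sketch with purely illustrative constants:

```python
# First-order wire model: R falls with width, C falls with spacing,
# and repeated-wire delay scales as sqrt(R*C) per unit length.
# rho, c_base, c_coupling are illustrative constants, not ITRS values.

def wire_delay_per_mm(width, spacing, rho=0.1, c_base=0.2, c_coupling=0.3):
    r = rho / width                    # resistance falls as the wire widens
    c = c_base + c_coupling / spacing  # coupling cap falls with spacing
    return (r * c) ** 0.5              # delay ~ sqrt(RC) with repeaters

def wires_per_track(track_width, width, spacing):
    """Bandwidth cost: fat, widely spaced wires fit fewer per track."""
    return int(track_width // (width + spacing))

base   = wire_delay_per_mm(width=1.0, spacing=1.0)   # B-wire-like
l_wire = wire_delay_per_mm(width=4.0, spacing=4.0)   # fat, fast L-wire-like
print(l_wire < base)                                  # faster per mm...
print(wires_per_track(64, 4.0, 4.0),
      wires_per_track(64, 1.0, 1.0))                  # ...but fewer wires
```

This is exactly the L-wire bargain on the following slides: roughly half the latency for several times the area per bit.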

Design Space Exploration
 Tuning repeater size and spacing
   Traditional wires: large repeaters, optimum spacing
   Power-optimal wires: smaller repeaters, increased spacing, trading delay for power

Design Space Exploration
 B wires (base case, 8x plane): Latency 1x, Power 1x, Area 1x
 W wires (base case, 4x plane): Latency 1.6x, Power 0.9x, Area 0.5x
 PW wires (power optimized, 4x plane): Latency 3.2x, Power 0.3x, Area 0.5x
 L wires (fast, low bandwidth, 8x plane): Latency 0.5x, Power 0.5x, Area 5x

Access Time for Different Link Types
[Figure: bank access time and average access time vs. bank count, for 8x-wires, 4x-wires, and L-wires]

Outline
 Overview
 Cache Design
 Effect of Network Delay
 Wire Design Space
 Exploiting Heterogeneous Wires
 Results

Cache Look-Up
 Total cache access time:
   Network delay (requires 6-8 bits to identify the cache bank)
   Bank access: decoder, wordline, bitline delay (requires a subset of the address bits)
   Comparator and output driver delay (requires the remaining address bits for the tag match)
 The entire access happens in a sequential manner

Early Look-Up
 Send the partial address on L-wires
 Initiate the bank lookup
 Wait for the complete address to arrive
 Complete the access with the tag match
 This can hide 60-70% of the bank access delay
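The steps above amount to overlapping the bank read with the slow transfer of the remaining address bits. A timing sketch, with hypothetical cycle counts chosen only to show the overlap:

```python
# Timing sketch of early lookup: the index bits arrive first on fast
# L-wires and start the bank read; the full address arrives later on
# slow B-wires, and only the tag match waits for it.
# All cycle counts are illustrative assumptions.

def sequential_access(net_b, bank_read, tag_match):
    """Baseline: address transfer, bank read, tag match in series."""
    return net_b + bank_read + tag_match

def early_lookup(net_l, net_b, bank_read, tag_match):
    """Bank read starts when index bits arrive on L-wires;
    the tag match needs both the read data and the full address."""
    data_ready = net_l + bank_read   # read overlapped with B-wire transfer
    addr_ready = net_b               # full address on slow wires
    return max(data_ready, addr_ready) + tag_match

seq   = sequential_access(net_b=12, bank_read=6, tag_match=2)
early = early_lookup(net_l=6, net_b=12, bank_read=6, tag_match=2)
print(seq, early)
```

With these numbers the entire bank read hides under the B-wire transfer, which is the effect the slide quantifies as 60-70% of the bank access delay.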

Aggressive Look-Up
 Send partial address bits on L-wires
 Do the early look-up and a partial tag match at the bank (requires an additional 8 bits of address for the partial tag match)
 Send all matching blocks aggressively to the controller
 Full tag match at the cache controller
 Network delay reduced
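The partial tag match can be sketched as a mask-and-compare on the low tag bits; the tag values below are made up purely to show one false partial match being filtered out at the controller:

```python
# Sketch of aggressive lookup's partial tag match: the bank compares
# only the 8 tag bits sent early on L-wires and forwards every
# matching way; the controller performs the full match.
# Tag values and way contents are illustrative.

PARTIAL_BITS = 8
MASK = (1 << PARTIAL_BITS) - 1

def partial_match(stored_tags, partial_tag):
    """Ways whose low tag bits match -- may include false positives."""
    return [t for t in stored_tags if (t & MASK) == (partial_tag & MASK)]

def full_match(candidates, full_tag):
    """Final, exact tag match at the cache controller."""
    return [t for t in candidates if t == full_tag]

ways = [0x1A2B, 0x3C2B, 0x4D5E, 0x1B2B]   # two ways alias in the low 8 bits
cands = partial_match(ways, 0x1A2B)
print(len(cands))                  # forwarded blocks, incl. false matches
print(full_match(cands, 0x1A2B))   # only the exact block survives
```

Forwarding the occasional aliasing block is what produces the "< 1% extra traffic" figure on the next slide, in exchange for not waiting for the full address before the bank responds.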

Aggressive Look-Up
 Significant reduction in network delay (for the address transfer)
 Increase in traffic due to false matches < 1%
 Marginal increase in link overhead: an additional 8 bits of L-wires compared to early lookup
 Adds complexity to the cache controller: needs logic to do the tag match

Outline
 Overview
 Cache Design
 Effect of Network Delay
 Wire Design Space
 Exploiting Heterogeneous Wires
 Results

Experimental Setup
 SimpleScalar with contention modeled in detail
 Single-core, 8-issue out-of-order processor
 32MB, 8-way set-associative on-chip L2 cache (S-NUCA organization)
 32KB I-cache and 32KB D-cache with a hit latency of 3 cycles
 Main memory latency: 300 cycles

Cache Models

Model | Bank Access (cycles) | Bank Count | Network Link | Description
  1   |          3           |    512     | B-wires      | Based on prior work
  2   |          6           |     64     | B-wires      | CACTI-L2
  3   |          6           |     64     | B & L-wires  | Early Lookup
  4   |          6           |     64     | B & L-wires  | Aggressive Lookup
  5   |          6           |     64     | B & L-wires  | Upper bound

Performance Results (Global Wires)
 Model 2 (CACTI-L2): average performance improvement 11%; improvement for L2-latency-sensitive benchmarks 16.3%
 Model 3 (Early Lookup): average performance improvement 14.4%; L2-latency-sensitive benchmarks 21.6%
 Model 4 (Aggressive Lookup): average performance improvement 17.6%; L2-latency-sensitive benchmarks 26.6%
 Model 6 (L-Network): average performance improvement 11.4%; L2-latency-sensitive benchmarks 16.2%

Performance Results (4X Wires)
 Wire-delay-constrained model
 Performance improvements are better
   Early lookup performs 5% better
   The aggressive model performs 28% better

Future Work
 Heterogeneous network in a CMP environment
 Hybrid network: a combination of point-to-point links and a bus for L-messages
 Effective use of L-wires: latency/bandwidth trade-off
 Use of heterogeneous wires in a D-NUCA environment
 Cache design focusing on power
   Prefetching over power-optimized wires
   Writebacks over power-optimized wires

Conclusion
 Traditional design approaches for large caches are sub-optimal
 Network parameters play a significant role in the performance of large caches
 The modified CACTI model, which includes the network overhead, performs 16.3% better than previous models
 A heterogeneous network has the potential to further improve performance
   Early lookup: 21.6%
   Aggressive lookup: 26.6%