Exploring Design Space for 3D Clustered Architectures. Manu Awasthi, Rajeev Balasubramonian, University of Utah.

Presentation transcript:

1 Exploring Design Space for 3D Clustered Architectures Manu Awasthi, Rajeev Balasubramonian University of Utah

2 3D Technologies: Multiple layers of active devices; vertical interconnects between layers. [Figure: 2D chip vs. 3D chip, with labels Layer 1, Layer 2, Device Layer 1, Device Layer 2, Silicon, Vertical Interconnect, "Very Small ~10 µm". Courtesy: K. Bernstein, IBM]

3 Benefits of 3D: Reduction of global interconnect; delay/power reduction; bandwidth; mixed-technology integration. [Figure: wire length L in 2D vs. 3D]

4 Previous Proposals: Previously in 3D… – Break and stack (folding) [Puttaswamy et al.] – Vertical stacking of active devices – Reduced intra-block latency – All layers are active: HEAT!!! [Figure: RegFile, break and stack]

5 An alternative approach? Prudent stacking can: improve performance; result in a better thermal profile. [Figure: 2D chip vs. 3D chip with Die 0 and Die 1]

6 Wire Delays and Performance

7 Clustered Architectures: Centralized front-end – I-Cache & D-Cache – LSQ, Rename, Decode – Branch Predictor. Clustered back-end – Issue Queue – Regfile, FUs. Higher clock frequency, high ILP!! [Figure: front-end, L1 D cache, crossbar/router, clusters]

8 Decentralized Cache Banks: Possibly better performance. [Figure: multiple L1 D cache banks]

9 Decentralized Cache Banks: Replicated cache banks. [Figure: multiple L1 D cache banks]
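The slide only names the replicated organization. As a minimal sketch of what replication implies, the toy model below assumes every bank holds a full copy of the data, so a store has to update all copies while a load can be served by one nearby bank. The class name `ReplicatedBanks`, the cluster-to-bank mapping, and the dictionary-based storage are all hypothetical and are not taken from the presentation.

```python
class ReplicatedBanks:
    """Toy model of replicated L1 data cache banks (illustrative only)."""

    def __init__(self, num_banks: int) -> None:
        # Each bank is modeled as an address -> value dictionary.
        self.banks = [dict() for _ in range(num_banks)]

    def store(self, addr: int, value: int) -> int:
        """Write the value into every copy; return how many banks were updated."""
        for bank in self.banks:
            bank[addr] = value
        return len(self.banks)

    def load(self, cluster: int, addr: int):
        """Read from the bank assumed to sit next to the requesting cluster."""
        local_bank = self.banks[cluster % len(self.banks)]
        return local_bank.get(addr)  # None models a miss in this toy model
```

In this sketch the cost of replication shows up as the return value of `store`: every write touches all banks, which is the traffic a real design would have to carry over the inter-cluster interconnect.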

10 Decentralized Cache Banks: Word-interleaved cache banks. [Figure: two L1 D cache banks, one holding odd words and one holding even words]
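The odd/even split on this slide can be sketched as a bank-selection function. This is an illustration under stated assumptions: the 8-byte word size (`WORD_BYTES`) and the function name `bank_for_address` are hypothetical, and only the two-bank odd/even arrangement comes from the slide.

```python
WORD_BYTES = 8  # assumed word size; the slides do not state it
NUM_BANKS = 2   # the slide shows one even-word bank and one odd-word bank


def bank_for_address(addr: int) -> int:
    """Return 0 for an even-numbered word and 1 for an odd-numbered word."""
    word_index = addr // WORD_BYTES
    return word_index % NUM_BANKS


# Consecutive words alternate between the two banks.
banks = [bank_for_address(a) for a in (0, 8, 16, 24)]
```

Unlike replication, each word lives in exactly one bank here, so an access goes to a single bank determined by its address.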

11 Outline: Introduction – Motivation – 3D Architectures – Clustered Architectures. Proposals. Results. Conclusions.

12 Architecture 1: Cache-on-cluster. [Figure: Die 0 and Die 1; legend: cache bank, cluster, inter-die interconnect, intra-die interconnect]

13 Architecture 2: Cluster-on-cluster. [Figure: Die 0 and Die 1; legend: cache bank, cluster, inter-die interconnect, intra-die interconnect]

14 Architecture 3: Staggered. [Figure: Die 0 and Die 1; legend: cache bank, cluster, inter-die interconnect, intra-die interconnect]

15 Outline: Introduction – Motivation – 3D Architectures – Clustered Architectures. Proposals. Results. Conclusions.

16 Experimental Setup: Framework – SimpleScalar, Wattch and HotSpot 3.0 – Wire model: 8X global metal plane. Benchmarks – SPEC 2K, single-threaded. Processor configuration – 8 clusters – 64 kB L1 I/D caches, 2-way set-associative – L1 data cache: word-interleaved or replicated – 2D centralized cache: base case.

17 Base Case Performances Best Case 2D Config

18 The 3D Effect 3D Replicated vs 2D Centralized

19 The 3D Effect 3D WI vs 2D Centralized

20 Comparisons: 12% improvement for best-case 3D vs. best-case 2D. [Chart labels: 3D Replicated; 3D WI; Best Case 3D - Rep; Best Case 3D - WI; Best Case 2D; 2D Case]

21 Thermal Analysis: Wattch for power numbers. HotSpot 3.0 for thermal model (grid) – 500x500 grid resolution. Interconnect power modeling – Attributed to functional units – 8X plane wires – Router + crossbar modeled as a separate entity.

22 Thermal Profiles Peak Temperature : Hottest on-chip Unit (Celsius)
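The peak-temperature metric defined on this slide (the temperature of the hottest on-chip unit) can be sketched as a maximum over per-unit temperatures. The function name `peak_temperature`, the unit names, and the temperature values below are hypothetical placeholders for illustration; they are not results from the presentation.

```python
from typing import Dict, Tuple


def peak_temperature(unit_temps_c: Dict[str, float]) -> Tuple[str, float]:
    """Return the hottest unit and its temperature in degrees Celsius."""
    if not unit_temps_c:
        raise ValueError("need at least one unit temperature")
    hottest_unit = max(unit_temps_c, key=unit_temps_c.get)
    return hottest_unit, unit_temps_c[hottest_unit]


# Placeholder values only, chosen to exercise the function.
example_temps = {"cluster0": 80.0, "cache_bank0": 70.0, "router": 75.0}
unit, temp = peak_temperature(example_temps)
```

With a grid-based thermal model, the per-unit temperatures would first have to be derived from the grid cells each unit covers; that step is outside this sketch.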

23 Outline: Introduction – Motivation – 3D Architectures – Clustered Architectures. Proposals. Results. Conclusions.

24 Conclusions: Wire delays are critical to performance – Some are more important than others. Prudent block stacking – Performance improvement of up to 12% over 2D: WI banks + Arch 3 (3D) – Better thermal profiles compared to folding.

25 Backup Slides

26 4-Cluster Arrangements: (a) Arch-1 (cache-on-cluster); (b) Arch-2 (cluster-on-cluster); (c) Arch-3 (staggered). [Figure legend: cluster, cache bank, intra-die horizontal wire, inter-die vertical wire; Die 0 and Die 1]