Design and Management of 3D CMPs using Network-in-Memory Feihui Li et al., Penn State University (ISCA 2006)

News…

Moral of the story…
- 3D technology helps in reducing wire delays.
  - Exploit it in as many ways as you can! They chose L2 caches.
- Also, 3D leads to on-chip hotspots.
  - Arrange units intelligently to reduce localized hotspots.

Major Results/Contributions
- First 3D CMP design space exploration.
- Proposal of 3D NUCA L2 caches for CMPs.
  - Comparison with the existing 2D counterparts.
  - 3D works better even without data migration.
- Proposal of NoCs as the method of communication between L2 banks.
  - "Efficiently exploit fast vertical interconnects."

Basics…
[Figures: a typical Network-on-Chip architecture; major types of integration]

Proposed: 3D Network-in-Memory
[Figure: each processing element (CPU or L2 cache bank) connects through a NIC to a single-stage router with input and output buffers; pillar nodes additionally have a dTDMA bus interface to a b-bit dTDMA bus (the communication pillar) running orthogonal to the die. dTDMA = Dynamic Time-Division Multiple Access.]
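A minimal Python sketch of the node organization this slide depicts; the bus width, class names, and fields are illustrative assumptions, not values or structures taken from the paper.

```python
# Sketch (not the paper's implementation) of the node organization above:
# every processing element attaches to a single-stage NoC router through a
# NIC, and "pillar nodes" additionally attach to a b-bit vertical dTDMA bus.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Router:
    """Single-stage router with simple input/output buffering."""
    input_buffer: list = field(default_factory=list)
    output_buffer: list = field(default_factory=list)

@dataclass
class DTDMABusInterface:
    """Interface to the b-bit vertical dTDMA bus (communication pillar)."""
    bus_width_bits: int = 128   # 'b' is an assumed value, not from the paper

@dataclass
class Node:
    """A processing element (CPU or L2 cache bank) behind a NIC and router."""
    kind: str                                      # "cpu" or "l2_bank"
    router: Router = field(default_factory=Router)
    pillar_if: Optional[DTDMABusInterface] = None  # only pillar nodes have one

    @property
    def is_pillar_node(self) -> bool:
        return self.pillar_if is not None

# Example: an L2 bank sitting on a communication pillar.
bank = Node(kind="l2_bank", pillar_if=DTDMABusInterface())
print(bank.is_pillar_node)   # True
```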

The dTDMA Bus as the Communication Pillar
- Use the dTDMA bus (VLSID 2006):
  - efficient/fast bus
  - small area/power overhead
- Do not use multi-hop routing for vertical communication:
  - the vertical distance is so small
[Figure: stacked layers connected by a vertical dTDMA bus with a bus arbiter next to each router; labeled dimensions of 1500 um (in-plane) and 10~100 um (between layers).]
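A hedged sketch of the dynamic time-slot arbitration idea behind dTDMA: the number of bus slots grows and shrinks with the number of layers that currently want the bus. The arbiter interface and slot ordering here are assumptions, not details from the VLSID 2006 design.

```python
# Dynamic TDMA arbitration, as described at a high level on this slide:
# the pillar arbiter hands out one time slot per currently-active layer,
# so the frame length tracks demand instead of being fixed.
class DTDMAArbiter:
    def __init__(self, num_layers: int):
        self.num_layers = num_layers
        self.active = set()        # layers that currently want the bus

    def request(self, layer: int):
        self.active.add(layer)

    def release(self, layer: int):
        self.active.discard(layer)

    def schedule(self):
        """Slot order for the next bus frame: one slot per active layer."""
        return sorted(self.active)

arb = DTDMAArbiter(num_layers=4)
arb.request(0); arb.request(2)
print(arb.schedule())   # [0, 2] -> a two-slot frame
arb.request(3)
print(arb.schedule())   # [0, 2, 3] -> frame grows to three slots with demand
```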

Proposals (1)
- Inter-die "communication pillars": integration of dTDMA buses and NoC routers for a fast communication interface.
- A typical NoC fails due to:
  - increased complexity
  - contention issues
  - increased power/area overhead
  - multi-hop vertical communication

3D Benefit: Increased Locality
[Figure: 2D vicinity vs. 3D vicinity around a CPU, showing the nodes reachable within 1, 2, and 3 hops; with a dTDMA pillar, banks on other layers also fall within the same hop counts.]
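A back-of-the-envelope sketch of the locality argument: once a pillar reaches any other layer in one hop, nodes on those layers fall inside the same hop budget. The unbounded-mesh model, layer count, and one-hop pillar cost are illustrative assumptions.

```python
# On an unbounded 2D mesh, 2*h*(h+1) nodes lie within h hops of a CPU; with
# die stacking, one hop over the dTDMA pillar brings other layers into reach.
def nodes_within(h: int) -> int:
    """Nodes within h hops of a point on an unbounded 2D mesh (excluding itself)."""
    return 2 * h * (h + 1)

def nodes_within_3d(h: int, layers: int) -> int:
    """Same h-hop budget, but the CPU sits on a pillar that reaches any other
    layer in one hop, leaving h-1 hops for travel on that layer."""
    same_layer = nodes_within(h)
    other_layers = (layers - 1) * (nodes_within(h - 1) + 1) if h >= 1 else 0
    return same_layer + other_layers

for h in (1, 2, 3):
    print(h, nodes_within(h), nodes_within_3d(h, layers=4))
# e.g. within 3 hops: 24 nodes in 2D vs 24 + 3*(12+1) = 63 nodes with 4 layers
```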

Proposals (2)
- Cannot increase the number of pillars arbitrarily:
  - depends on via density
  - router complexity
- So, CPUs share pillars.
  - Stacking of CPUs also has to be considered.
- CPU placement algorithm: place CPUs across dies so as to
  - maintain a decent access hop-count
  - manage the thermal profile
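An illustrative greedy placement heuristic in the spirit of these constraints (avoid stacking CPUs vertically, stay near a shared pillar, balance CPUs across dies); it is not the placement algorithm from the paper.

```python
# Greedy CPU placement sketch: pick, for each CPU, the cheapest free position
# that does not share an (x, y) column with another CPU, preferring positions
# close to a pillar and layers that hold fewer CPUs (thermal balance).
from itertools import product

def place_cpus(num_cpus, layers, mesh_x, mesh_y, pillars):
    """pillars: list of (x, y) pillar columns shared by the CPUs."""
    used_columns = set()      # (x, y) columns already holding a CPU
    placement = {}
    candidates = list(product(range(layers), range(mesh_x), range(mesh_y)))
    for cpu in range(num_cpus):
        best = None
        for (z, x, y) in candidates:
            if (x, y) in used_columns:        # never stack CPUs vertically
                continue
            hops = min(abs(x - px) + abs(y - py) for (px, py) in pillars)
            layer_load = sum(1 for (zz, _, _) in placement.values() if zz == z)
            score = (hops, layer_load)        # near a pillar, balanced across dies
            if best is None or score < best[0]:
                best = (score, (z, x, y))
        placement[cpu] = best[1]
        used_columns.add(best[1][1:])
    return placement

print(place_cpus(num_cpus=8, layers=4, mesh_x=4, mesh_y=4, pillars=[(1, 1), (2, 2)]))
```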

CPU placement example
- By not stacking CPUs directly on top of one another, this placement helps solve the localized hotspot problem.

3D L2 Caches
- Clusters: cache banks + a tag array.
  - Some clusters have CPUs, others don't.
- Cache management (migration sketched below):
  - Search
  - Placement & Replacement
  - Cache Line Migration
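A hedged sketch of the cache-line-migration idea: after repeated hits from a distant CPU, the line moves one bank closer to that CPU. The hit threshold, the one-step move, and the helper functions are assumptions, not the exact policy from the paper.

```python
# Gradual NUCA-style migration: count remote hits per line and, past a
# threshold, move the line toward the requesting CPU.
MIGRATION_THRESHOLD = 4      # assumed number of remote hits before migrating

class CacheLine:
    def __init__(self, tag, bank):
        self.tag = tag
        self.bank = bank             # (layer, x, y) of the bank holding the line
        self.remote_hits = 0

def on_l2_hit(line, cpu_pos, hop_distance, next_bank_toward):
    """hop_distance(a, b) and next_bank_toward(bank, cpu) are assumed helpers."""
    if hop_distance(line.bank, cpu_pos) > 1:
        line.remote_hits += 1
        if line.remote_hits >= MIGRATION_THRESHOLD:
            line.bank = next_bank_toward(line.bank, cpu_pos)   # gradual migration
            line.remote_hits = 0

# Toy helpers and a usage example:
def manhattan(a, b):
    return sum(abs(i - j) for i, j in zip(a, b))

def step_toward(bank, cpu):
    """Move one hop along the first dimension where bank and cpu differ."""
    bank = list(bank)
    for i, (b, c) in enumerate(zip(bank, cpu)):
        if b != c:
            bank[i] += 1 if c > b else -1
            break
    return tuple(bank)

line = CacheLine(tag=0x1A2B, bank=(0, 3, 3))
for _ in range(MIGRATION_THRESHOLD):
    on_l2_hit(line, cpu_pos=(0, 0, 0), hop_distance=manhattan, next_bank_toward=step_toward)
print(line.bank)   # the line has migrated one hop toward the CPU: (0, 2, 3)
```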

L2 Cache Management

Simulation Environment
- Simics + an in-house NoC simulator.
- All CPUs issue in-order.
  - 8 CPUs, SPARC ISA
  - Directory-based protocol for coherence between the L1s and the L2
- HS3d for temperature modeling.
- 64 MB and 32 MB L2 caches.

Performance

Important Results

Important Results (2)
- Impact of the number of "pillars" on access latency

Important Results (3)

Final Word
- 3D is feasible & scalable… and has arrived.
- Localized hotspots can be solved by placing hotter units apart.
- Power savings + performance gain even without data migration.
  - No numbers to support the claim(!)
  - Would that help the temperature issue as well?

Potential HPCA Submission
- An evaluation of temperature and IPC for a single-core 3D processor.
- Leverage clustered architectures for "temperature-aware" processor designs.
  - Basic premise: stacking cooler units (caches) on top of hotter units gives a better thermal profile for the processor (see the sketch below).
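A rough 1D resistive-stack calculation illustrating the premise; the per-layer thermal resistance and power numbers are made-up values, not measurements or results from this work.

```python
# In a 3D stack, heat from every die must flow through the dies between it and
# the heat sink, so placing the hotter units (cores) nearest the sink and the
# cooler units (caches) farther away lowers the peak temperature.
R_LAYER = 0.4        # K/W thermal resistance of one die + bond layer (assumed)
T_AMBIENT = 45.0     # deg C at the heat sink (assumed)

def peak_temp(power_per_layer):
    """power_per_layer[0] is the die next to the heat sink."""
    temps, flow = [], sum(power_per_layer)
    t = T_AMBIENT
    for p in power_per_layer:
        t += flow * R_LAYER          # all heat from this die and those above it
        temps.append(t)
        flow -= p                    # heat generated below has already passed
    return max(temps)

print(peak_temp([40, 5]))   # hot core next to the sink, cool cache above: 65.0
print(peak_temp([5, 40]))   # cache next to the sink, core stacked on top: 79.0
```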

Proposals
[Figure: three candidate architectures (Arch 1, Arch 2, Arch 3) showing how the cache banks are placed relative to the clusters.]

Proposals (2)
- Cache banks (both data and instruction) are either
  - 2-way word-interleaved, or
  - replicated (see the sketch below).
- The present study is done for an 8-cluster architecture.
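A minimal sketch of the two bank organizations; the word size and the address bit used for interleaving are assumptions for illustration.

```python
# With 2-way word interleaving, one low-order word-address bit steers each
# access to one of the two banks; with replication, both banks hold a copy and
# a read can be served by whichever bank is closer to the requester.
WORD_BYTES = 4   # assumed word size

def interleaved_bank(addr: int) -> int:
    """Bank selected by the lowest word-address bit (2-way word interleaving)."""
    return (addr // WORD_BYTES) & 0x1

def replicated_bank(addr: int, closer_bank: int) -> int:
    """With replication, any bank holds the line; pick the closer one."""
    return closer_bank

for addr in (0x0, 0x4, 0x8, 0xC):
    print(hex(addr), "-> bank", interleaved_bank(addr))
# 0x0 -> bank 0, 0x4 -> bank 1, 0x8 -> bank 0, 0xC -> bank 1
```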

Results (Performance)
- 2-way word-interleaved caches

Results (Performance)
- Replicated caches

Traffic Analysis

Traffic Analysis (2)

Results (Thermal)