Design and Management of 3D CMPs using Network-in-Memory

Presentation transcript:

Design and Management of 3D CMPs using Network-in-Memory
Paper by Feihui Li et al., Penn State University (ISCA 2006)
Presented by Ashok Ayyamani

What is this paper about?
- An architecture for chip multiprocessors with large shared L2 caches that have non-uniform access times
- Placement: 3D stacking of caches and CPUs to reduce hit latency and improve IPC
- Interconnection of CPU and cache nodes: router + bus
- A 3D NoC-based non-uniform L2 cache architecture

NUCA: Non-Uniform Cache Architecture
- Minimizes hit time for large-capacity, highly-associative caches
- Each bank has its own distinct address and latency; closer banks are accessed faster
- Variants: Static NUCA (data placement depends on the address) and Dynamic NUCA (frequently used data migrates closer to the CPU)
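A minimal sketch of the two NUCA variants, assuming a hypothetical row of banks whose latency grows with distance from the CPU; the bank count, latencies, and mapping are illustrative, not taken from the paper.

```python
# Hypothetical illustration of static vs. dynamic NUCA bank mapping.
# Bank count, base latency, and per-hop cost are made-up numbers.

NUM_BANKS = 16          # banks laid out in a row, bank 0 closest to the CPU
BASE_LATENCY = 3        # cycles to access the closest bank (assumed)
PER_HOP = 2             # extra cycles per bank of distance (assumed)

def bank_latency(bank: int) -> int:
    """Closer banks are faster: latency grows with distance from the CPU."""
    return BASE_LATENCY + PER_HOP * bank

def static_nuca_bank(address: int) -> int:
    """Static NUCA: the bank is fixed by the address bits."""
    return address % NUM_BANKS

class DynamicNuca:
    """Dynamic NUCA: frequently used lines migrate one bank closer per access."""
    def __init__(self):
        self.placement = {}  # address -> current bank

    def access(self, address: int) -> int:
        bank = self.placement.get(address, static_nuca_bank(address))
        latency = bank_latency(bank)
        # migrate one hop toward the CPU for the next access
        self.placement[address] = max(0, bank - 1)
        return latency

if __name__ == "__main__":
    d = DynamicNuca()
    print([d.access(0x1234F) for _ in range(4)])  # latency shrinks on reuse
```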

Network-in-Memory: Why?
- Large caches increase hit times, so divide them into banks that are self-contained and individually addressable
- The banks must then be interconnected efficiently: by a bus, or by a network-on-chip

Interconnection with a bus
- With an increasing number of nodes, resource contention becomes an issue, so performance degrades as nodes are added
- Not scalable, and transactional by nature
- Solution: networks-on-chip

Networks-on-Chip
- A scalable on-chip network; example: mesh
- Each node has a link to a dedicated router, and each router has a link to its 4 neighbors
- A "link" here is two unidirectional links, each with width equal to the flit size
- Flit: the unit of transfer into which packets are broken for transmission
- Bus or router? Hybrid? Nothing is perfect; we will come back to this later

Networks-on-Chip

3D Design
- Problem with large networks-on-chip: many routers increase communication delay, even with state-of-the-art routers
- The objective is to reduce the hop count, which is not always possible in 2D
- Solution: stack the layers in 3D, so that more banks are reachable within fewer hops than in 2D
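To make the hop-count argument concrete, here is a small back-of-the-envelope calculation, assuming 64 nodes arranged either as an 8x8 2D mesh or as four stacked 4x4 layers, with X-Y routing within a layer and a single-hop vertical bus between layers. The node counts are illustrative, not the paper's configuration.

```python
from itertools import product

def avg_hops_2d(width, height):
    """Average Manhattan distance over all ordered node pairs (self pairs included)."""
    nodes = list(product(range(width), range(height)))
    dists = [abs(ax - bx) + abs(ay - by)
             for (ax, ay) in nodes for (bx, by) in nodes]
    return sum(dists) / len(dists)

def avg_hops_3d(width, height, layers):
    """Same, for stacked layers; a vertical bus crosses any number of layers in 1 hop."""
    nodes = list(product(range(width), range(height), range(layers)))
    dists = [abs(ax - bx) + abs(ay - by) + (1 if az != bz else 0)
             for (ax, ay, az) in nodes for (bx, by, bz) in nodes]
    return sum(dists) / len(dists)

if __name__ == "__main__":
    print(f"8x8 2D mesh:         {avg_hops_2d(8, 8):.2f} average hops")
    print(f"4x4 x 4-layer stack: {avg_hops_3d(4, 4, 4):.2f} average hops")
```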

Benefits of 3D
- Higher packaging density
- Higher performance due to reduced average interconnect length
- Lower interconnect power consumption due to reduced total wiring length

3D Technologies
- Wafer bonding: active device layers are processed separately and interconnected at the end; this is the technology used in the paper
- Multi-Layer Buried Structures (MLBS): the front-end process repeats on a single wafer to build multiple device layers, then the back-end process builds the interconnects

Wafer Orientation
- Face-to-face: suitable for 2 layers; more than 2 layers gets complicated (larger and longer vias)
- Face-to-back: supports more layers, but with reduced inter-layer via density

Wafer Orientation

3D Issues
- Via insulation
- Inter-layer via pitch: the state of the art is 0.2 x 0.2 square microns using Silicon-on-Insulator; via pads (end points) limit via density
- Bottom line: despite the lower densities, vertical vias provide faster data transfer than 2D wire interconnects

Via pitch (figure from http://www.ltcc.de/en/whatis_des.php)
- A: via pad
- E: via pitch
- Remaining labels: not applicable here

3D Network-in-Memory
- Very small distance between layers
- Router vs. bus: routers are multi-hop, and with the additional up/down links the blocking probability increases
- Solution: a single-hop communication medium, the bus
- Intra-layer communication: routers; inter-layer communication: dTDMA bus

3D Network-in-Memory

dTDMA Bus
- Dynamic TDMA: dynamic allocation of time slots; provides rapid communication between the layers of the chip
- dTDMA bus interface: the transmitter and receiver are connected to the bus through tri-state drivers
- The tri-state drivers are controlled by independently programmed feedback shift registers
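A minimal sketch of the dynamic-TDMA idea: the number of slots in a frame grows and shrinks with the number of layers currently requesting the bus, so no bandwidth is wasted on idle layers. The frame/slot bookkeeping here is a hypothetical software model, not the hardware shift-register implementation described in the paper.

```python
class DTDMABus:
    """Toy model of a dynamic TDMA bus: one slot per active requester per frame."""

    def __init__(self, num_layers: int):
        self.num_layers = num_layers
        self.active = []          # layers currently granted slots, in slot order

    def request(self, layer: int) -> None:
        """A layer asks the arbiter for a slot; the frame grows by one slot."""
        assert 0 <= layer < self.num_layers
        if layer not in self.active:
            self.active.append(layer)

    def release(self, layer: int) -> None:
        """A layer is done transmitting; the frame shrinks by one slot."""
        if layer in self.active:
            self.active.remove(layer)

    def owner_of_cycle(self, cycle: int):
        """Which layer drives the bus in a given cycle (None if the bus is idle)."""
        if not self.active:
            return None
        return self.active[cycle % len(self.active)]

if __name__ == "__main__":
    bus = DTDMABus(num_layers=4)
    bus.request(0)
    bus.request(2)
    print([bus.owner_of_cycle(c) for c in range(6)])  # [0, 2, 0, 2, 0, 2]
    bus.release(0)
    print([bus.owner_of_cycle(c) for c in range(4)])  # [2, 2, 2, 2]
```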

Arbitration

Arbitration
- Each pillar needs an arbiter
- The arbiter should sit in the middle layer so that wire distances are as uniform as possible
- The number of control wires increases with the number of layers, so keep the number of layers to a minimum
- In the authors' experiments, the dTDMA bus was more efficient in area and power than conventional NoC routers (tables on the next slide)

Arbitration

Area and Power Overhead of the dTDMA Bus
- The number of layers should be kept to a minimum for the reasons mentioned on the last slide
- Another reason: bus contention

Limitations
- The area occupied by a pillar is wasted device area, so keep the number of inter-layer connections low, which translates into fewer pillars
- With increasing via density more vias become feasible, but density is still limited by the via pads (endpoints)
- Router complexity goes up: more ports means an increased blocking probability
- This paper uses normal routers (5 ports) plus hybrid routers (6 ports) at the pillars; the extra port serves the vertical link

NoC Router Architecture
- Generally, a router has a routing unit (RT), virtual channel allocation unit (VA), switch allocation unit (SA), and crossbar (XBAR)
- In a mesh topology there are 5 physical channels per processing element (PE)
- Virtual channels are FIFO buffers that hold flits from pending messages

NoC Router Architecture
- The paper uses 3 virtual channels per physical channel, each 1 message deep
- Each message is 4 flits; the width (b) of the router links is 128 bits
- 4 flits/packet x 128 bits/flit = 512 bits/packet = 64 bytes/packet, so a 64-byte cache line fits in one packet
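A short worked example of the flit arithmetic above, assuming the 128-bit link width and 4-flit packets from the paper; the helper names are ours.

```python
FLIT_BITS = 128            # router link width (b) from the paper
FLITS_PER_PACKET = 4       # one packet per 64-byte cache line
FLIT_BYTES = FLIT_BITS // 8

def cache_line_to_flits(line: bytes) -> list:
    """Split a 64-byte cache line into four 16-byte flits."""
    assert len(line) == FLITS_PER_PACKET * FLIT_BYTES == 64
    return [line[i:i + FLIT_BYTES] for i in range(0, len(line), FLIT_BYTES)]

if __name__ == "__main__":
    line = bytes(range(64))                       # a dummy 64-byte cache line
    flits = cache_line_to_flits(line)
    print(len(flits), [len(f) for f in flits])    # 4 [16, 16, 16, 16]
```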

NoC Router Architecture
- The paper uses a single-stage router; generally each stage takes one cycle, so 4 cycles conventionally vs. 1 cycle in this paper
- Aggressive? Maybe; look-ahead routing and speculative channel allocation can reduce the latency, and low latency is very important here
- Routers connected to pillar nodes are different: they have an extra physical channel corresponding to the FIFO buffers of the vertical link, which the router simply sees as an additional physical channel
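A rough latency model of why the single-stage router matters, assuming each hop costs the router pipeline depth plus one link cycle; the hop count, link cost, and bank access time are made-up numbers, not the paper's measurements.

```python
def l2_access_latency(hops: int, router_stages: int,
                      link_cycles: int = 1, bank_cycles: int = 6) -> int:
    """Cycles to reach a bank: per-hop router + link cost, plus the bank access itself.
    link_cycles and bank_cycles are assumed values, not taken from the paper."""
    return hops * (router_stages + link_cycles) + bank_cycles

if __name__ == "__main__":
    for stages in (4, 1):   # conventional 4-stage pipeline vs. single-stage router
        print(stages, "stage(s):", l2_access_latency(hops=5, router_stages=stages), "cycles")
```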

NoC Router Architecture

CPU Placement
- Each CPU has a dedicated pillar for fast inter-layer access
- CPUs could share pillars, but not in this paper, so we assume instant access to the pillar and to all cache banks along it
- Memory locality + vertical locality

CPU Placement

CPU Placement
- Thermal issues are a major problem in 3D; the CPUs consume most of the power, so it makes sense not to place them on top of each other in the stack
- Congestion: CPUs generate most of the L2 traffic (the rest is due to data migration); placing them one over the other increases congestion, since stacked CPUs would share the same pillar
- Hence: maximal offsetting

CPU Placement

CPU Placement
- If the via density is low, there are fewer pillars than CPU cores and sharing of pillars is inevitable
- Intelligent placement: keep CPUs close to their pillars (faster pillar access) while minimizing thermal effects

CPU Placement Algorithm

CPU Placement Algorithm
- k = 1 in the experiments; k can be increased at the expense of performance
- A lower c is desirable: less contention and better network performance
- The pillar locations are predetermined; pillars should be as far apart as possible to spread out congested areas, but not at the edges, because that would limit the number of cache banks around each pillar
- The placement pattern spans 4 layers, beyond which it repeats; thermal effects decrease with inter-layer distance
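The slide only gives the parameters (at most k CPUs per pillar, low contention c, CPUs offset across layers); the exact algorithm is in the paper. Purely as an illustration of those constraints, here is a hypothetical greedy assignment sketch, not the authors' algorithm.

```python
from math import inf

def assign_cpus_to_pillars(cpus, pillars, k=1):
    """Greedy sketch: give each CPU the nearest pillar that still has capacity.

    cpus, pillars: lists of (x, y, layer) coordinates (a pillar spans all layers).
    k: maximum number of CPUs sharing one pillar (k = 1 in the paper's experiments).
    """
    load = {p: 0 for p in pillars}
    assignment = {}
    for cpu in cpus:
        best, best_dist = None, inf
        for p in pillars:
            if load[p] >= k:
                continue                                      # pillar already full
            dist = abs(cpu[0] - p[0]) + abs(cpu[1] - p[1])    # in-layer hops to pillar
            if dist < best_dist:
                best, best_dist = p, dist
        if best is None:
            raise ValueError("not enough pillar capacity for all CPUs")
        assignment[cpu] = best
        load[best] += 1
    return assignment

if __name__ == "__main__":
    cpus = [(0, 0, 0), (3, 3, 1), (0, 3, 2), (3, 0, 3)]       # CPUs offset across layers
    pillars = [(1, 1, 0), (2, 2, 0), (1, 2, 0), (2, 1, 0)]    # predetermined pillar sites
    print(assign_cpus_to_pillars(cpus, pillars, k=1))
```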

Thermal Aware CPU Placement

Thermal Profile – Hotspots – HS3d

3D L2 Cache Management
- Cache banks are grouped into clusters; a cluster contains a set of cache banks and a separate tag array covering all cache lines in the cluster
- All banks in a cluster are connected by the NoC
- The tag array has a direct connection to the processor in the cluster
- Clusters without a local processor have a customized logic block for receiving cache requests, searching the tag array, and forwarding the request to the target cache bank
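A minimal data-structure sketch of the clustered organization described above; the field names and shapes are ours, for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CacheBank:
    """One bank inside a cluster; the per-bank capacity is left abstract here."""
    lines: dict = field(default_factory=dict)       # address -> data

@dataclass
class Cluster:
    """A set of banks plus one tag array covering every line in the cluster."""
    banks: list                                     # list of CacheBank
    tag_array: dict = field(default_factory=dict)   # address -> index of the owning bank
    local_cpu: Optional[int] = None    # None: the cluster uses a request-handling logic block

    def lookup(self, address: int):
        """Search the cluster tag array; forward to the owning bank on a hit."""
        bank_idx = self.tag_array.get(address)
        if bank_idx is None:
            return None                             # miss in this cluster
        return self.banks[bank_idx].lines[address]

if __name__ == "__main__":
    c = Cluster(banks=[CacheBank(), CacheBank()])
    c.banks[1].lines[0x40] = b"data"
    c.tag_array[0x40] = 1
    print(c.lookup(0x40), c.lookup(0x80))           # b'data' None
```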

Cache Management Policies
- Cache line search
- Cache placement
- Cache replacement
- Cache line migration

Cache Line Search: a two-step process
- (1) The processor searches the local tag array in its cluster and also sends the request to its neighbors (including the vertical neighbors reached through the pillars)
- (2) If the line is not found in any of these places, the processor multicasts the request to the remaining clusters
- If the tag match fails in all clusters, it is an L2 miss; on a match, the data is routed to the requesting processor through the NoC
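A sketch of the two-step search, with each cluster modeled as a plain set of cached addresses (its tag array); the grouping into "local + neighbors" first and "everything else" second follows the slide, while the data structures are ours.

```python
def l2_search(address, local_cluster, neighbor_clusters, remote_clusters):
    """Two-step L2 line search as described on the slide.

    Each cluster is modeled as a set of cached addresses (its tag array).
    Returns the cluster holding the line, or None on an L2 miss.
    """
    # Step 1: search the local tag array and the neighboring clusters
    # (including vertical neighbors reached through the pillars).
    for cluster in [local_cluster, *neighbor_clusters]:
        if address in cluster:
            return cluster

    # Step 2: multicast the request to all remaining clusters.
    for cluster in remote_clusters:
        if address in cluster:
            return cluster

    return None   # tag match failed everywhere: L2 miss

if __name__ == "__main__":
    local = {0x100}
    neighbors = [{0x200}, {0x300}]
    remote = [{0x400}]
    print(l2_search(0x400, local, neighbors, remote))   # found in step 2
    print(l2_search(0x500, local, neighbors, remote))   # None -> L2 miss
```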

Placement and Replacement
- The lower-order bits of the cache tag indicate the cluster
- The lower-order bits of the cache index indicate the bank
- The remaining bits indicate the precise location within the bank
- A pseudo-LRU policy is used for replacement
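A sketch of the address decomposition described above, assuming hypothetical field widths (4 clusters, 8 banks per cluster, 128 sets per bank, 64-byte lines); the real widths depend on the cache configuration in the paper.

```python
# Hypothetical field widths; the paper's actual configuration determines these.
OFFSET_BITS = 6          # 64-byte cache lines
BANK_BITS = 3            # low-order index bits -> 8 banks per cluster
SET_BITS = 7             # remaining index bits -> 128 sets per bank
CLUSTER_BITS = 2         # low-order tag bits   -> 4 clusters

def decompose(address: int):
    """Split a physical address into (cluster, bank, set, tag) per the slide's scheme."""
    index = (address >> OFFSET_BITS) & ((1 << (BANK_BITS + SET_BITS)) - 1)
    tag = address >> (OFFSET_BITS + BANK_BITS + SET_BITS)

    bank = index & ((1 << BANK_BITS) - 1)        # low-order index bits -> bank
    set_in_bank = index >> BANK_BITS             # remaining index bits -> set in the bank
    cluster = tag & ((1 << CLUSTER_BITS) - 1)    # low-order tag bits -> home cluster
    return cluster, bank, set_in_bank, tag

if __name__ == "__main__":
    print(decompose(0x1234_5678))
```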

Cache Line Migration: intra-layer data migration
- Data is migrated toward the cluster closest to the accessing CPU
- Clusters that have local processors are skipped, to avoid disturbing the L2 access patterns of the local CPU in that cluster
- With repeated access, the data eventually migrates into the requesting processor's own cluster
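A sketch of the intra-layer migration rule: step the line one cluster closer to the accessing CPU, skipping clusters that contain another processor. The one-dimensional row of clusters is a simplification of the 2D cluster layout, used only for illustration.

```python
def next_migration_cluster(current: int, target: int, has_cpu: list) -> int:
    """One intra-layer migration step along a row of clusters.

    current: cluster holding the line; target: the accessing CPU's cluster.
    has_cpu[i] is True if cluster i contains a (different) local processor
    and should be skipped, so migration does not disturb that CPU's L2 banks.
    """
    if current == target:
        return current                      # already home, nothing to do
    step = 1 if target > current else -1
    nxt = current + step
    while nxt != target and has_cpu[nxt]:
        nxt += step                         # hop over clusters with local CPUs
    return nxt

if __name__ == "__main__":
    has_cpu = [False, False, True, False, True, False]   # clusters 2 and 4 hold CPUs
    pos = 0
    while pos != 5:        # the line is repeatedly accessed by the CPU in cluster 5
        pos = next_migration_cluster(pos, 5, has_cpu)
        print("migrated to cluster", pos)                # 1, 3, 5
```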

Intra Layer Data Migration

Cache Line Migration: inter-layer data migration
- Data is migrated closer to the pillar near the accessing CPU
- Assumption: clusters adjacent to the same pillar in different layers are considered local, so no inter-layer data migration is performed
- This also helps reduce power

Inter Layer Data Migration

Cache Line Migration: lazy migration
- Used to prevent false misses: misses caused by searches for data that is in the middle of a migration
- False misses occur because a few "hot" blocks are accessed repeatedly by multiple processors
- Solution: delay the migration by a few cycles, and cancel it when a different processor accesses the same block
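A sketch of the lazy-migration idea: a migration is only committed after it has stayed pending for a few cycles, and is cancelled if a different processor touches the block in the meantime. The delay value is an illustrative assumption, not the paper's setting.

```python
class LazyMigration:
    """Delay migrations briefly and cancel them when a second CPU accesses the block."""

    def __init__(self, delay_cycles: int = 10):      # delay value is an assumption
        self.delay = delay_cycles
        self.pending = {}   # address -> (requesting cpu, cycle the migration was scheduled)

    def on_access(self, address: int, cpu: int, now: int) -> None:
        entry = self.pending.get(address)
        if entry is not None and entry[0] != cpu:
            # A different processor wants the same "hot" block: cancel to avoid
            # ping-ponging the line (and the false misses that would cause).
            del self.pending[address]
            return
        if entry is None:
            self.pending[address] = (cpu, now)

    def ready_to_migrate(self, address: int, now: int) -> bool:
        entry = self.pending.get(address)
        return entry is not None and now - entry[1] >= self.delay

if __name__ == "__main__":
    lm = LazyMigration()
    lm.on_access(0xABC, cpu=0, now=0)
    lm.on_access(0xABC, cpu=1, now=3)          # second CPU: migration cancelled
    print(lm.ready_to_migrate(0xABC, now=20))  # False
```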

Experimental Methodology
- Simics combined with a 3D NoC simulator
- 8-processor CMP running Solaris 9; in-order issue, SPARC ISA
- Private L1 caches, a large shared L2; CACTI 3.2 for cache modeling
- The dTDMA bus was integrated into the 2D NoC simulator as the vertical channel
- L1 cache coherence traffic was taken into account

System Configuration

Benchmarks
- Each application was run for 500 million cycles to warm up the L2
- Statistics were collected for the next 2 billion cycles
- The table shows the L2 cache accesses

Results: legend
- CMP-DNUCA: conventional design with perfect search
- CMP-DNUCA-2D: CMP-DNUCA-3D restricted to a single layer
- CMP-DNUCA-3D: the proposed architecture with data migration
- CMP-SNUCA-3D: the proposed architecture without data migration

Average L2 Hit Latency

Number of block Migrations

IPC

Average L2 Hit Latency under different cache sizes

Effect of Number of Pillars

Impact of Number of Layers

Conclusion
- The 3D NoC architecture reduces average L2 access latency, which improves IPC
- 3D is better than 2D even without data migration
- Placement of processors in 3D needs to consider thermal issues carefully
- The number of pillars should be chosen carefully, since it affects congestion and bandwidth

Strengths
- A novel architecture that addresses access-time issues
- Considers thermal issues and mitigates them through careful CPU placement
- Hybrid network-in-memory: router + bus; adopts dTDMA for efficient channel usage

Weaknesses
- The paper assumes one CPU per pillar; as the number of CPUs grows this may not hold, since via density does not increase and the number of pillars is fixed, so CPUs may have to share pillars
- The paper does not discuss the effect of this pillar sharing on L2 latency
- Assumes a single-stage router, which may not always be practical or feasible
- Thermal-aware CPU placement: what is the assumption on heat flow, uniform or not?

Things I did not understand
- MLBS

The next paper could cover
- L2 performance degradation due to sharing of pillars
- Face-to-face wafer bonding (back-to-face results in more wasted area)
- MLBS
- The effect of router speed (this paper assumed a single cycle for all four stages)

Questions? Thank you.