1
Design and Management of 3D CMPs using Network-in-Memory
Ashok Ayyamani
2
What is this paper about?
An architecture for chip multiprocessors with large, shared L2 caches that have non-uniform access times
Placement: stacking caches and CPUs in 3D to reduce hit latency and improve IPC
Interconnection of CPU and cache nodes: router + bus
A 3D NoC-based non-uniform L2 cache architecture
3
NUCA (Non-Uniform Cache Architecture)
Minimizes hit time for large-capacity, highly associative caches
Each bank has its own distinct address and latency; closer banks are accessed faster
Variants:
Static NUCA: data placement depends on the address
Dynamic NUCA: frequently used data is kept closer to the CPU
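To make the static/dynamic distinction concrete, here is a minimal sketch of how a bank might be chosen; it is not taken from the paper, and the bank count and promotion rule are illustrative assumptions.

```python
# Toy NUCA bank selection (illustrative; bank count and policy are assumptions).
NUM_BANKS = 16          # banks ordered by distance from the CPU (0 = closest)
LINE_BITS = 6           # 64-byte cache lines

def snuca_bank(addr):
    """Static NUCA: the bank is fixed by address bits, whatever its distance."""
    return (addr >> LINE_BITS) % NUM_BANKS

class DNuca:
    """Dynamic NUCA: each access promotes the line one bank closer to the CPU."""
    def __init__(self):
        self.bank_of = {}                                # line address -> current bank

    def access(self, addr):
        line = addr >> LINE_BITS
        bank = self.bank_of.get(line, NUM_BANKS - 1)     # new lines start far away
        self.bank_of[line] = max(0, bank - 1)            # gradual migration toward CPU
        return bank                                      # closer bank => lower latency
```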
4
Network-in-Memory: why?
Large caches increase hit times, so divide them into banks
Banks are self-contained and addressed individually
Banks must be interconnected efficiently: by a bus, or by a Network-on-Chip
5
Interconnection with bus
With an increasing number of nodes, resource contention becomes an issue and performance degrades, so a bus does not scale. It is also transactional by nature.
Solution: Networks-on-Chip
6
Networks-on-Chip: an on-chip network. Example: mesh
Scalable
Each node has a "link" to a dedicated router
Each router has a "link" to its 4 neighbors
"Link" here means two unidirectional links, each as wide as a flit
Flit: the unit of transfer into which packets are broken for transmission
Bus or router? Hybrid? We will come back to this later; nothing is perfect.
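As a concrete illustration of routing on such a mesh (not taken from the paper), here is a minimal sketch of dimension-ordered XY routing; the coordinates and mesh size are assumptions.

```python
# Dimension-ordered (XY) routing on a 2D mesh: correct X first, then Y.
def xy_route(src, dst):
    """Return the list of (x, y) router coordinates visited from src to dst."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                     # correct the X coordinate first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                     # then correct the Y coordinate
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# Hop count equals the Manhattan distance, e.g. (0,0) -> (3,2) takes 5 hops.
print(len(xy_route((0, 0), (3, 2))) - 1)   # 5
```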
7
Networks-on-Chip
8
3D Design: the problem with large Networks-on-Chip, and a solution
Many routers increase communication delay, even with state-of-the-art routers. The objective is to reduce hop count, which is not always possible in 2D.
Solution: stack the nodes in 3D, so that more banks are reachable within fewer hops than in 2D.
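A rough back-of-the-envelope illustration of why stacking helps (my own calculation, not a figure from the paper): for the same 64 nodes, the average hop distance in a 4x4x4 arrangement is lower than in an 8x8 mesh.

```python
# Average hop distance (Manhattan) between all node pairs for 64 nodes
# arranged as an 8x8 2D mesh versus a 4x4x4 3D stack.
from itertools import product
from statistics import mean

def avg_hops(dims):
    nodes = list(product(*(range(d) for d in dims)))
    return mean(sum(abs(a - b) for a, b in zip(u, v))
                for u in nodes for v in nodes if u != v)

print(avg_hops((8, 8)))       # ~5.33 average hops in 2D
print(avg_hops((4, 4, 4)))    # ~3.81 average hops in 3D
```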
9
Benefits of 3D
Higher packaging density
Higher performance, due to reduced average interconnect length
Lower interconnect power consumption, due to reduced total wiring length
10
3D Technologies: Wafer Bonding and Multi-Layer Buried Structures (MLBS)
Wafer bonding: active device layers are processed separately and interconnected at the end; this paper uses this technology
MLBS: the front-end process is repeated on a single wafer to build multiple device layers, and the back-end process then builds the interconnects
11
Wafer Orientation: Face-to-Face and Face-to-Back
Face-to-Face: suitable for 2 layers; more than 2 layers get complicated and require larger, longer vias
Face-to-Back: supports more layers, but with reduced inter-layer via density
12
Wafer Orientation
13
3D Issues: via insulation and inter-layer via pitch
The state of the art is 0.2 x 0.2 square microns using Silicon-on-Insulator
Via pads (end points) limit via density
Bottom line: despite the lower densities, inter-layer vias provide faster data transfer than 2D wire interconnects
14
Via pitch (figure from http://www.ltcc.de/en/whatis_des.php)
A: via pad
E: via pitch
The remaining labels are not relevant here
15
3D Network-in-Memory: very small distance between layers. Router or bus?
Routers are multi-hop, and with additional links (up and down) the blocking probability increases.
Solution: a single-hop communication medium, the bus
Intra-layer communication: routers
Inter-layer communication: dTDMA bus
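A minimal sketch of the hybrid idea, assuming node coordinates of the form (x, y, layer) and a fixed list of pillar locations; the function and variable names are mine, not the paper's.

```python
# Hybrid 3D routing sketch: routers within a layer, a single-hop dTDMA bus
# (a "pillar") between layers.  Coordinates and pillar sites are assumptions.
PILLARS = [(1, 1), (6, 6)]               # (x, y) locations that have vertical buses

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def hop_count(src, dst):
    """src, dst: (x, y, layer).  Count router hops plus at most one bus hop."""
    (sx, sy, sl), (dx, dy, dl) = src, dst
    if sl == dl:                                     # intra-layer: routers only
        return manhattan((sx, sy), (dx, dy))
    pillar = min(PILLARS, key=lambda p: manhattan(p, (sx, sy)))  # nearest pillar
    return (manhattan((sx, sy), pillar)              # route to the pillar
            + 1                                      # dTDMA bus is single-hop
            + manhattan(pillar, (dx, dy)))           # route on the destination layer
```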
16
3D Network-in-Memory
17
dTDMA Bus (Dynamic TDMA)
Dynamically allocates time slots, providing rapid communication between the layers of the chip
dTDMA bus interface: the transmitter and receiver are connected to the bus through tri-state drivers
The tri-state drivers are controlled by independently programmed feedback shift registers
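A minimal model of dynamic slot allocation, my own sketch of the concept rather than the hardware described in the paper: the frame only contains slots for layers that currently want to transmit, so no bus cycles are wasted on idle slots.

```python
# Dynamic TDMA sketch: the frame contains one slot per *active* transmitter,
# so the slot count grows and shrinks with demand (conceptual model only).
def dtdma_schedule(active_layers, num_cycles):
    """Yield (cycle, transmitting_layer); layers not requesting get no slot."""
    if not active_layers:
        return
    for cycle in range(num_cycles):
        yield cycle, active_layers[cycle % len(active_layers)]

# With only layers 0 and 2 active, each gets every other cycle instead of 1-in-N.
for cycle, layer in dtdma_schedule(active_layers=[0, 2], num_cycles=4):
    print(cycle, layer)
```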
18
Arbitration
19
Arbitration: each pillar needs an arbiter
The arbiter should sit in the middle so that wire distances are as uniform as possible
The number of control wires increases with the number of layers, so keep the number of layers at a minimum
After experiments, the dTDMA bus turned out to be more efficient in area and power than conventional NoC routers (tables on the next slide)
20
Arbitration
21
Area and Power Overhead of dTDMA Bus
The number of layers should be kept to a minimum for the reasons mentioned on the last slide; another reason is bus contention
22
Limitations: the area occupied by a pillar is wasted device area
Keep the number of inter-layer connections low, which translates into fewer pillars
With increasing via density more vias become feasible, but again, density is limited by the via pads (endpoints)
Router complexity goes up: more ports mean increased blocking probability
This paper uses normal routers (5 ports) plus hybrid routers (6 ports) at the inter-layer nodes; the extra port is for the vertical link
23
NoC Router Architecture
Generally, a router has a Routing Unit (RT), a Virtual Channel Allocation Unit (VA), a Switch Allocation Unit (SA), and a Crossbar (XBAR)
In a mesh topology there are 5 physical channels per processing element (PE)
Virtual channels, which are FIFO buffers, hold flits from pending messages
24
NoC Router Architecture
The paper uses 3 virtual channels (VCs) per physical channel, each 1 message deep
Each message is 4 flits, and the width (b) of the router links is 128 bits
4 flits/packet x 128 bits/flit = 512 bits/packet = 64 bytes/packet, so a 64-byte cache line fits in one packet
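The flit arithmetic above can be checked directly; the constant names below are mine.

```python
# Check that one 64-byte cache line fits in a single 4-flit packet.
FLIT_BITS = 128                # link width b
FLITS_PER_PACKET = 4
CACHE_LINE_BYTES = 64

packet_bytes = FLIT_BITS * FLITS_PER_PACKET // 8
assert packet_bytes == 64 and CACHE_LINE_BYTES <= packet_bytes

def to_flits(line: bytes):
    """Split a cache line into 16-byte flits for transmission."""
    flit_bytes = FLIT_BITS // 8
    return [line[i:i + flit_bytes] for i in range(0, len(line), flit_bytes)]

print(len(to_flits(bytes(64))))    # 4 flits
```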
25
NoC Router Architecture
The paper uses a single-stage router
Generally a router takes one cycle per pipeline stage, so 4 cycles versus 1 cycle in this paper
Aggressive? Maybe, but look-ahead routing and speculative channel allocation can reduce this, and low latency is very important
Routers connected to pillar nodes are different: they have an extra physical channel corresponding to the FIFO buffers for the vertical link, which the router simply sees as an additional physical channel
26
NoC Router Architecture
27
CPU Placement
Each CPU has a dedicated pillar for fast inter-layer access
CPUs can share pillars, but not in this paper
So we assume instant access to the pillar and to all cache banks along the pillar: memory locality plus vertical locality
28
CPU Placement
29
CPU Placement: thermal issues and congestion, the major problems in 3D
Thermal: CPUs consume most of the power, so it makes sense not to place them on top of each other in the stack
Congestion: CPUs generate most of the L2 traffic (the rest is due to data migration); placed one over the other, they cause more congestion since they share the same pillar
Solution: maximal offsetting
30
CPU Placement
31
CPU Placement: if via density is low, there are fewer pillars than CPU cores
Sharing of pillars becomes inevitable
Intelligent placement is needed: CPUs not too far from the pillars (for faster pillar access), and with minimal thermal effects
32
CPU Placement Algorithm
33
CPU Placement Algorithm
k = 1 in the experiments; k can be increased at the expense of performance
A lower 'c' is desirable: less contention and better network performance
The location of the pillars is predetermined; pillars should be as far apart as possible to reduce congested areas, but not at the edges, because that would limit the number of cache banks around each pillar
The placement pattern spans 4 layers, beyond which it is repeated; thermal effects decrease with inter-layer distance
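To illustrate the "maximal offsetting" idea, here is a toy placement check; it is not the paper's actual placement algorithm, and the grid size, pillar sites, and distance threshold are made up.

```python
# Toy check for a thermally aware placement: no two CPUs should share the same
# (x, y) footprint across layers, and each should sit near a pillar.
PILLARS = [(1, 1), (6, 6), (1, 6), (6, 1)]           # assumed pillar sites

def placement_ok(cpus, max_pillar_dist=2):
    """cpus: list of (x, y, layer).  True if vertically offset and near a pillar."""
    footprints = [(x, y) for x, y, _ in cpus]
    if len(set(footprints)) != len(footprints):       # vertical overlap => hotspots
        return False
    return all(min(abs(x - px) + abs(y - py) for px, py in PILLARS)
               <= max_pillar_dist for x, y, _ in cpus)

print(placement_ok([(1, 2, 0), (5, 6, 1), (2, 1, 2), (6, 5, 3)]))   # True
```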
34
Thermal Aware CPU Placement
35
Thermal Profile – Hotspots – HS3d
36
3D L2 Cache Management: cache banks are divided into clusters
A cluster contains a set of cache banks, with a separate tag array for all cache lines in the cluster
All banks in a cluster are connected by an NoC, and the tag array has a direct connection to the processor
Clusters without a local processor have a customized logic block for receiving cache requests, searching the tag array, and forwarding the request to the target cache bank
37
Cache Management Policies
Cache line search
Cache placement
Cache replacement
Cache line migration
38
Cache Line Search: a two-step process
(1) The processor searches the local tag array in its cluster, and also sends the request to neighboring clusters (including the vertical neighbors through the pillars)
(2) If the line is not found in any of these places, the processor multicasts the request to the remaining clusters
If the tag match fails in all clusters, the access is counted as an L2 miss; if there is a match, the corresponding data is routed to the requesting processor through the NoC
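A schematic version of the two-step search; the function and data-structure names are mine and purely illustrative.

```python
# Two-step cache line search: local cluster + neighbors first, then multicast.
def search_line(tag, home_cluster, clusters, neighbors_of):
    """clusters: {cluster_id: set_of_tags}; neighbors_of: adjacency incl. vertical."""
    step1 = [home_cluster] + list(neighbors_of[home_cluster])
    for c in step1:                        # step 1: local tag array and neighbors
        if tag in clusters[c]:
            return c                       # hit close by
    for c in clusters:                     # step 2: multicast to remaining clusters
        if c not in step1 and tag in clusters[c]:
            return c
    return None                            # tag match failed everywhere: L2 miss
```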
39
Placement and Replacement
The low-order bits of the cache tag indicate the cluster
The low-order bits of the cache index indicate the bank
The remaining bits indicate the precise location within the bank (a bit-layout sketch follows below)
A pseudo-LRU policy is used for replacement
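A sketch of how an address could be decomposed under this scheme; all field widths are illustrative assumptions, not the paper's parameters.

```python
# Illustrative address decomposition: low tag bits pick the cluster, low index
# bits pick the bank (all widths are assumptions).
OFFSET_BITS, INDEX_BITS = 6, 10           # 64-byte lines, 1024 sets
BANK_BITS, CLUSTER_BITS = 2, 3            # 4 banks/cluster, 8 clusters

def decode(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    cluster     = tag & ((1 << CLUSTER_BITS) - 1)     # low-order tag bits
    bank        = index & ((1 << BANK_BITS) - 1)      # low-order index bits
    set_in_bank = index >> BANK_BITS                  # remaining bits: spot in bank
    return cluster, bank, set_in_bank, offset
```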
40
Cache Line Migration: Intra-Layer Data Migration
Data is migrated to a cluster close to the accessing CPU
Clusters that contain processors are skipped, to avoid disturbing the L2 access patterns of the local CPU in those clusters
Because of repeated accesses, the data eventually migrates into the accessing processor's own cluster
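A toy version of gradual intra-layer migration; the cluster grid, helper names, and the exact skipping rule are my assumptions, not the paper's implementation.

```python
# Move a block one cluster closer to the accessing CPU per access, hopping over
# clusters that host another processor (toy model).
def migrate_step(block_pos, cpu_pos, has_local_cpu):
    """block_pos, cpu_pos: (col, row) cluster coordinates on one layer."""
    step = lambda b, c: b + (c > b) - (c < b)          # move one cluster toward c
    nxt = (step(block_pos[0], cpu_pos[0]), step(block_pos[1], cpu_pos[1]))
    while nxt != cpu_pos and has_local_cpu(nxt):       # skip occupied clusters
        nxt = (step(nxt[0], cpu_pos[0]), step(nxt[1], cpu_pos[1]))
    return nxt
```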
41
Intra Layer Data Migration
42
Cache Line Migration: Inter-Layer Data Migration
Data is migrated closer to the pillar nearest the accessing CPU
Assumption: clusters around the same pillar in different layers are considered local, so no inter-layer data migration is performed
This also helps reduce power
43
Inter Layer Data Migration
44
Cache Line Migration: Lazy Migration, to prevent false misses
False misses are caused by searches for data that is in the middle of migrating; they occur when a few "hot" blocks are repeatedly accessed by multiple processors
Solution: delay the migration by a few cycles, and cancel it when a different processor accesses the same block
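A minimal sketch of lazy migration; the delay window and bookkeeping structure are my assumptions, not values from the paper.

```python
# Lazy migration: postpone a planned migration for a few cycles and cancel it
# if a different processor touches the block in the meantime (conceptual only).
MIGRATION_DELAY = 8                        # cycles to wait before migrating (assumed)

pending = {}                               # block -> (requesting_cpu, fire_cycle)

def on_access(block, cpu, now):
    if block in pending and pending[block][0] != cpu:
        del pending[block]                 # another CPU wants it too: cancel migration
    else:
        pending[block] = (cpu, now + MIGRATION_DELAY)

def on_tick(now, do_migrate):
    for block, (cpu, fire) in list(pending.items()):
        if now >= fire:                    # delay elapsed without contention
            do_migrate(block, cpu)
            del pending[block]
```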
45
Experiment Methodology
Simics with a 3D NoC simulator
8-processor CMP running Solaris 9; in-order issue, SPARC ISA
Private L1 caches, large shared L2
CACTI 3.2
The dTDMA bus was integrated into the 2D NoC simulator as the vertical channel
L1 cache coherence traffic was taken into account
46
System Configuration
47
Benchmarks: each application was run for 500 million cycles to warm up the L2
Statistics were collected for the next 2 billion cycles
The table shows the L2 cache accesses
48
Results Legend
CMP-DNUCA: the conventional scheme, with perfect search
CMP-DNUCA-2D: CMP-DNUCA-3D with a single layer
CMP-DNUCA-3D: the proposed architecture with data migration
CMP-SNUCA-3D: the proposed architecture without data migration
49
Average L2 Hit Latency
50
Number of block Migrations
51
IPC
52
Average L2 Hit Latency under different cache sizes
53
Effect of Number of Pillars
54
Impact of Number of Layers
55
Conclusion: the 3D NoC architecture reduces average L2 access latency, which improves IPC
3D is better than 2D even without data migration
Processor placement in 3D needs to consider thermal issues carefully
The number of pillars should be chosen carefully, since it in turn affects congestion and bandwidth
56
Strengths
A novel architecture that solves the access-time issues
Considers thermal issues, and tries to mitigate them through proper CPU placement
A hybrid network-in-memory (router + bus) that adopts dTDMA for efficient channel usage
57
Weaknesses
The paper assumes one CPU per pillar; as the number of CPUs increases this may no longer hold, since via density does not increase and the number of pillars is fixed, so CPUs may have to share pillars. The paper does not discuss the effect of this sharing on L2 latency.
It assumes a single-stage router, which may not always be practical or feasible.
Thermal-aware CPU placement: what is the assumption on heat flow, uniform or not?
58
Things I did not understand
MLBS
59
The next paper could cover:
L2 performance degradation because of the sharing of pillars
Face-to-face wafer bonding (back-to-face results in more wasted area)
MLBS
The effect of router speed (this paper assumed a single cycle for all four stages)
60
Questions? Thank you!