1
Design and Management of 3D CMPs using Network-in-Memory
Ashok Ayyamani
2
What is this paper about?
An architecture for chip multiprocessors with large, shared L2 caches that have non-uniform access times
Placement: stacking caches and CPUs in 3D to reduce hit latency and improve IPC
Interconnection of CPU and cache nodes: router + bus
A 3D NoC-based non-uniform L2 cache architecture
3
NUCA (Non-Uniform Cache Architecture)
Minimizes hit time for large-capacity, highly associative caches
Each bank has its own distinct address and latency; closer banks are accessed faster
Variants:
Static NUCA: data placement depends on the address
Dynamic NUCA: frequently used data is kept closer to the CPU
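To make the static/dynamic distinction concrete, here is a minimal sketch of how a bank might be chosen; it is not taken from the paper, and the bank count and promotion rule are illustrative assumptions.

```python
# Toy NUCA bank selection (illustrative; bank count and policy are assumptions).
NUM_BANKS = 16          # banks ordered by distance from the CPU (0 = closest)
LINE_BITS = 6           # 64-byte cache lines

def snuca_bank(addr):
    """Static NUCA: the bank is fixed by address bits, whatever its distance."""
    return (addr >> LINE_BITS) % NUM_BANKS

class DNuca:
    """Dynamic NUCA: each access promotes the line one bank closer to the CPU."""
    def __init__(self):
        self.bank_of = {}                                # line address -> current bank

    def access(self, addr):
        line = addr >> LINE_BITS
        bank = self.bank_of.get(line, NUM_BANKS - 1)     # new lines start far away
        self.bank_of[line] = max(0, bank - 1)            # gradual migration toward CPU
        return bank                                      # closer bank => lower latency
```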
4
Network-in-Memory: why?
Large caches increase hit times, so divide them into banks
Banks are self-contained and addressed individually
Banks must be interconnected efficiently: by a bus, or by a Network-on-Chip
5
Interconnection with bus
With an increasing number of nodes, resource contention becomes an issue and performance degrades, so a bus does not scale. It is also transactional by nature.
Solution: Networks-on-Chip
6
Networks-on-Chip: an on-chip network. Example: mesh
Scalable
Each node has a "link" to a dedicated router
Each router has a "link" to its 4 neighbors
"Link" here means two unidirectional links, each as wide as a flit
Flit: the unit of transfer into which packets are broken for transmission
Bus or router? Hybrid? We will come back to this later; nothing is perfect.
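As a concrete illustration of routing on such a mesh (not taken from the paper), here is a minimal sketch of dimension-ordered XY routing; the coordinates and mesh size are assumptions.

```python
# Dimension-ordered (XY) routing on a 2D mesh: correct X first, then Y.
def xy_route(src, dst):
    """Return the list of (x, y) router coordinates visited from src to dst."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                     # correct the X coordinate first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                     # then correct the Y coordinate
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# Hop count equals the Manhattan distance, e.g. (0,0) -> (3,2) takes 5 hops.
print(len(xy_route((0, 0), (3, 2))) - 1)   # 5
```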
7
Networks-on-Chip
8
3D Design: the problem with large Networks-on-Chip, and a solution
Many routers increase communication delay, even with state-of-the-art routers. The objective is to reduce hop count, which is not always possible in 2D.
Solution: stack the nodes in 3D, so that more banks are reachable within fewer hops than in 2D.
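A rough back-of-the-envelope illustration of why stacking helps (my own calculation, not a figure from the paper): for the same 64 nodes, the average hop distance in a 4x4x4 arrangement is lower than in an 8x8 mesh.

```python
# Average hop distance (Manhattan) between all node pairs for 64 nodes
# arranged as an 8x8 2D mesh versus a 4x4x4 3D stack.
from itertools import product
from statistics import mean

def avg_hops(dims):
    nodes = list(product(*(range(d) for d in dims)))
    return mean(sum(abs(a - b) for a, b in zip(u, v))
                for u in nodes for v in nodes if u != v)

print(avg_hops((8, 8)))       # ~5.33 average hops in 2D
print(avg_hops((4, 4, 4)))    # ~3.81 average hops in 3D
```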
9
Benefits of 3D
Higher packaging density
Higher performance, due to reduced average interconnect length
Lower interconnect power consumption, due to reduced total wiring length
10
3D Technologies: Wafer Bonding and Multi-Layer Buried Structures (MLBS)
Wafer bonding: active device layers are processed separately and interconnected at the end; this paper uses this technology
MLBS: the front-end process is repeated on a single wafer to build multiple device layers, and the back-end process then builds the interconnects
11
Wafer Orientation: Face-to-Face and Face-to-Back
Face-to-Face: suitable for 2 layers; more than 2 layers get complicated and require larger, longer vias
Face-to-Back: supports more layers, but with reduced inter-layer via density
12
Wafer Orientation
13
3D Issues: via insulation and inter-layer via pitch
The state of the art is 0.2 x 0.2 square microns using Silicon-on-Insulator
Via pads (end points) limit via density
Bottom line: despite the lower densities, inter-layer vias provide faster data transfer than 2D wire interconnects
14
Via pitch (figure from http://www.ltcc.de/en/whatis_des.php)
A: via pad
E: via pitch
The remaining labels are not relevant here
15
3D Network-in-Memory: very small distance between layers. Router or bus?
Routers are multi-hop, and with additional links (up and down) the blocking probability increases.
Solution: a single-hop communication medium, the bus
Intra-layer communication: routers
Inter-layer communication: dTDMA bus
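A minimal sketch of the hybrid idea, assuming node coordinates of the form (x, y, layer) and a fixed list of pillar locations; the function and variable names are mine, not the paper's.

```python
# Hybrid 3D routing sketch: routers within a layer, a single-hop dTDMA bus
# (a "pillar") between layers.  Coordinates and pillar sites are assumptions.
PILLARS = [(1, 1), (6, 6)]               # (x, y) locations that have vertical buses

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def hop_count(src, dst):
    """src, dst: (x, y, layer).  Count router hops plus at most one bus hop."""
    (sx, sy, sl), (dx, dy, dl) = src, dst
    if sl == dl:                                     # intra-layer: routers only
        return manhattan((sx, sy), (dx, dy))
    pillar = min(PILLARS, key=lambda p: manhattan(p, (sx, sy)))  # nearest pillar
    return (manhattan((sx, sy), pillar)              # route to the pillar
            + 1                                      # dTDMA bus is single-hop
            + manhattan(pillar, (dx, dy)))           # route on the destination layer
```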
16
3D Network-in-Memory
17
dTDMA Bus (Dynamic TDMA)
Dynamically allocates time slots, providing rapid communication between the layers of the chip
dTDMA bus interface: the transmitter and receiver are connected to the bus through tri-state drivers
The tri-state drivers are controlled by independently programmed feedback shift registers
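A minimal model of dynamic slot allocation, my own sketch of the concept rather than the hardware described in the paper: the frame only contains slots for layers that currently want to transmit, so no bus cycles are wasted on idle slots.

```python
# Dynamic TDMA sketch: the frame contains one slot per *active* transmitter,
# so the slot count grows and shrinks with demand (conceptual model only).
def dtdma_schedule(active_layers, num_cycles):
    """Yield (cycle, transmitting_layer); layers not requesting get no slot."""
    if not active_layers:
        return
    for cycle in range(num_cycles):
        yield cycle, active_layers[cycle % len(active_layers)]

# With only layers 0 and 2 active, each gets every other cycle instead of 1-in-N.
for cycle, layer in dtdma_schedule(active_layers=[0, 2], num_cycles=4):
    print(cycle, layer)
```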
18
Arbitration
19
Arbitration: each pillar needs an arbiter
The arbiter should sit in the middle so that wire distances are as uniform as possible
The number of control wires increases with the number of layers, so keep the number of layers at a minimum
After experiments, the dTDMA bus turned out to be more efficient in area and power than conventional NoC routers (tables on the next slide)
20
Arbitration
21
Area and Power Overhead of dTDMA Bus
The number of layers should be kept to a minimum for the reasons mentioned on the last slide; another reason is bus contention
22
Limitations: the area occupied by a pillar is wasted device area
Keep the number of inter-layer connections low, which translates into fewer pillars
With increasing via density more vias become feasible, but again, density is limited by the via pads (endpoints)
Router complexity goes up: more ports mean increased blocking probability
This paper uses normal routers (5 ports) plus hybrid routers (6 ports) at the inter-layer nodes; the extra port is for the vertical link
23
NoC Router Architecture
Generally, a router has a Routing Unit (RT), a Virtual Channel Allocation Unit (VA), a Switch Allocation Unit (SA), and a Crossbar (XBAR)
In a mesh topology there are 5 physical channels per processing element (PE)
Virtual channels, which are FIFO buffers, hold flits from pending messages
24
NoC Router Architecture
The paper uses 3 virtual channels (VCs) per physical channel, each 1 message deep
Each message is 4 flits, and the width (b) of the router links is 128 bits
4 flits/packet x 128 bits/flit = 512 bits/packet = 64 bytes/packet, so a 64-byte cache line fits in one packet
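The flit arithmetic above can be checked directly; the constant names below are mine.

```python
# Check that one 64-byte cache line fits in a single 4-flit packet.
FLIT_BITS = 128                # link width b
FLITS_PER_PACKET = 4
CACHE_LINE_BYTES = 64

packet_bytes = FLIT_BITS * FLITS_PER_PACKET // 8
assert packet_bytes == 64 and CACHE_LINE_BYTES <= packet_bytes

def to_flits(line: bytes):
    """Split a cache line into 16-byte flits for transmission."""
    flit_bytes = FLIT_BITS // 8
    return [line[i:i + flit_bytes] for i in range(0, len(line), flit_bytes)]

print(len(to_flits(bytes(64))))    # 4 flits
```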
25
NoC Router Architecture
The paper uses a single-stage router
Generally a router takes one cycle per pipeline stage, so 4 cycles versus 1 cycle in this paper
Aggressive? Maybe, but look-ahead routing and speculative channel allocation can reduce this, and low latency is very important
Routers connected to pillar nodes are different: they have an extra physical channel corresponding to the FIFO buffers for the vertical link, which the router simply sees as an additional physical channel
26
NoC Router Architecture
27
CPU Placement
Each CPU has a dedicated pillar for fast inter-layer access
CPUs can share pillars, but not in this paper
So we assume instant access to the pillar and to all cache banks along the pillar: memory locality plus vertical locality
28
CPU Placement
29
CPU Placement: thermal issues and congestion, the major problems in 3D
Thermal: CPUs consume most of the power, so it makes sense not to place them on top of each other in the stack
Congestion: CPUs generate most of the L2 traffic (the rest is due to data migration); placed one over the other, they cause more congestion since they share the same pillar
Solution: maximal offsetting
30
CPU Placement
31
CPU Placement: if via density is low, there are fewer pillars than CPU cores
Sharing of pillars becomes inevitable
Intelligent placement is needed: CPUs not too far from the pillars (for faster pillar access), and with minimal thermal effects
32
CPU Placement Algorithm
33
CPU Placement Algorithm
k = 1 in the experiments; k can be increased at the expense of performance
A lower 'c' is desirable: less contention and better network performance
The location of the pillars is predetermined; pillars should be as far apart as possible to reduce congested areas, but not at the edges, because that would limit the number of cache banks around each pillar
The placement pattern spans 4 layers, beyond which it is repeated; thermal effects decrease with inter-layer distance
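To illustrate the "maximal offsetting" idea, here is a toy placement check; it is not the paper's actual placement algorithm, and the grid size, pillar sites, and distance threshold are made up.

```python
# Toy check for a thermally aware placement: no two CPUs should share the same
# (x, y) footprint across layers, and each should sit near a pillar.
PILLARS = [(1, 1), (6, 6), (1, 6), (6, 1)]           # assumed pillar sites

def placement_ok(cpus, max_pillar_dist=2):
    """cpus: list of (x, y, layer).  True if vertically offset and near a pillar."""
    footprints = [(x, y) for x, y, _ in cpus]
    if len(set(footprints)) != len(footprints):       # vertical overlap => hotspots
        return False
    return all(min(abs(x - px) + abs(y - py) for px, py in PILLARS)
               <= max_pillar_dist for x, y, _ in cpus)

print(placement_ok([(1, 2, 0), (5, 6, 1), (2, 1, 2), (6, 5, 3)]))   # True
```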
34
Thermal Aware CPU Placement
35
Thermal Profile – Hotspots – HS3d
36
3D L2 Cache Management: cache banks are divided into clusters
A cluster contains a set of cache banks, with a separate tag array for all cache lines in the cluster
All banks in a cluster are connected by an NoC, and the tag array has a direct connection to the processor
Clusters without a local processor have a customized logic block for receiving cache requests, searching the tag array, and forwarding the request to the target cache bank
37
Cache Management Policies
Cache line search
Cache placement
Cache replacement
Cache line migration
38
Cache Line Search: a two-step process
(1) The processor searches the local tag array in its cluster, and also sends the request to neighboring clusters (including the vertical neighbors through the pillars)
(2) If the line is not found in any of these places, the processor multicasts the request to the remaining clusters
If the tag match fails in all clusters, the access is counted as an L2 miss; if there is a match, the corresponding data is routed to the requesting processor through the NoC
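A schematic version of the two-step search; the function and data-structure names are mine and purely illustrative.

```python
# Two-step cache line search: local cluster + neighbors first, then multicast.
def search_line(tag, home_cluster, clusters, neighbors_of):
    """clusters: {cluster_id: set_of_tags}; neighbors_of: adjacency incl. vertical."""
    step1 = [home_cluster] + list(neighbors_of[home_cluster])
    for c in step1:                        # step 1: local tag array and neighbors
        if tag in clusters[c]:
            return c                       # hit close by
    for c in clusters:                     # step 2: multicast to remaining clusters
        if c not in step1 and tag in clusters[c]:
            return c
    return None                            # tag match failed everywhere: L2 miss
```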
39
Placement and Replacement
The low-order bits of the cache tag indicate the cluster
The low-order bits of the cache index indicate the bank
The remaining bits indicate the precise location within the bank (a bit-layout sketch follows below)
A pseudo-LRU policy is used for replacement
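A sketch of how an address could be decomposed under this scheme; all field widths are illustrative assumptions, not the paper's parameters.

```python
# Illustrative address decomposition: low tag bits pick the cluster, low index
# bits pick the bank (all widths are assumptions).
OFFSET_BITS, INDEX_BITS = 6, 10           # 64-byte lines, 1024 sets
BANK_BITS, CLUSTER_BITS = 2, 3            # 4 banks/cluster, 8 clusters

def decode(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    cluster     = tag & ((1 << CLUSTER_BITS) - 1)     # low-order tag bits
    bank        = index & ((1 << BANK_BITS) - 1)      # low-order index bits
    set_in_bank = index >> BANK_BITS                  # remaining bits: spot in bank
    return cluster, bank, set_in_bank, offset
```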
40
Cache Line Migration: Intra-Layer Data Migration
Data is migrated to a cluster close to the accessing CPU
Clusters that contain processors are skipped, to avoid disturbing the L2 access patterns of the local CPU in those clusters
Because of repeated accesses, the data eventually migrates into the accessing processor's own cluster
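A toy version of gradual intra-layer migration; the cluster grid, helper names, and the exact skipping rule are my assumptions, not the paper's implementation.

```python
# Move a block one cluster closer to the accessing CPU per access, hopping over
# clusters that host another processor (toy model).
def migrate_step(block_pos, cpu_pos, has_local_cpu):
    """block_pos, cpu_pos: (col, row) cluster coordinates on one layer."""
    step = lambda b, c: b + (c > b) - (c < b)          # move one cluster toward c
    nxt = (step(block_pos[0], cpu_pos[0]), step(block_pos[1], cpu_pos[1]))
    while nxt != cpu_pos and has_local_cpu(nxt):       # skip occupied clusters
        nxt = (step(nxt[0], cpu_pos[0]), step(nxt[1], cpu_pos[1]))
    return nxt
```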
41
Intra Layer Data Migration
42
Cache Line Migration: Inter-Layer Data Migration
Data is migrated closer to the pillar nearest the accessing CPU
Assumption: clusters around the same pillar in different layers are considered local, so no inter-layer data migration is performed
This also helps reduce power
43
Inter Layer Data Migration
44
Cache Line Migration: Lazy Migration, to prevent false misses
False misses are caused by searches for data that is in the middle of migrating; they occur when a few "hot" blocks are repeatedly accessed by multiple processors
Solution: delay the migration by a few cycles, and cancel it when a different processor accesses the same block
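A minimal sketch of lazy migration; the delay window and bookkeeping structure are my assumptions, not values from the paper.

```python
# Lazy migration: postpone a planned migration for a few cycles and cancel it
# if a different processor touches the block in the meantime (conceptual only).
MIGRATION_DELAY = 8                        # cycles to wait before migrating (assumed)

pending = {}                               # block -> (requesting_cpu, fire_cycle)

def on_access(block, cpu, now):
    if block in pending and pending[block][0] != cpu:
        del pending[block]                 # another CPU wants it too: cancel migration
    else:
        pending[block] = (cpu, now + MIGRATION_DELAY)

def on_tick(now, do_migrate):
    for block, (cpu, fire) in list(pending.items()):
        if now >= fire:                    # delay elapsed without contention
            do_migrate(block, cpu)
            del pending[block]
```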
45
Experiment Methodology
Simics with a 3D NoC simulator
8-processor CMP running Solaris 9; in-order issue, SPARC ISA
Private L1 caches, large shared L2
CACTI 3.2
The dTDMA bus was integrated into the 2D NoC simulator as the vertical channel
L1 cache coherence traffic was taken into account
46
System Configuration
47
Benchmarks: each application was run for 500 million cycles to warm up the L2
Statistics were collected for the next 2 billion cycles
The table shows the L2 cache accesses
48
Results Legend
CMP-DNUCA: the conventional scheme, with perfect search
CMP-DNUCA-2D: CMP-DNUCA-3D with a single layer
CMP-DNUCA-3D: the proposed architecture with data migration
CMP-SNUCA-3D: the proposed architecture without data migration
49
Average L2 Hit Latency
50
Number of block Migrations
51
IPC
52
Average L2 Hit Latency under different cache sizes
53
Effect of Number of Pillars
54
Impact of Number of Layers
55
Conclusion: the 3D NoC architecture reduces average L2 access latency, which improves IPC
3D is better than 2D even without data migration
Processor placement in 3D needs to consider thermal issues carefully
The number of pillars should be chosen carefully, since it in turn affects congestion and bandwidth
56
Strengths
A novel architecture that solves the access-time issues
Considers thermal issues, and tries to mitigate them through proper CPU placement
A hybrid network-in-memory (router + bus) that adopts dTDMA for efficient channel usage
57
Weaknesses
The paper assumes one CPU per pillar; as the number of CPUs increases this may no longer hold, since via density does not increase and the number of pillars is fixed, so CPUs may have to share pillars. The paper does not discuss the effect of this sharing on L2 latency.
It assumes a single-stage router, which may not always be practical or feasible.
Thermal-aware CPU placement: what is the assumption on heat flow, uniform or not?
58
Things I did not understand
MLBS
59
The next paper could cover:
L2 performance degradation because of the sharing of pillars
Face-to-face wafer bonding (back-to-face results in more wasted area)
MLBS
The effect of router speed (this paper assumed a single cycle for all four stages)
60
Questions? Thank you!