Design and Management of 3D CMPs using Network-in-Memory – Ashok Ayyamani
What is this paper about? An architecture for chip multiprocessors with large shared L2 caches, which have non-uniform access times. The paper places CPUs and caches in 3D to reduce hit latency and improve IPC, and interconnects the CPU and cache nodes with a hybrid of routers and a bus: a 3D NoC-based non-uniform L2 cache architecture.
NUCA (Non-Uniform Cache Architecture): minimizes hit time for large-capacity, highly associative caches. Each bank has its own distinct address and latency, so closer banks are accessed faster. Variants: Static NUCA (data placement depends on the address) and Dynamic NUCA (frequently used data is kept closer to the CPU).
Network-in-Memory: why? Large caches increase hit times, so they are divided into banks that are self-contained and individually addressed. These banks must be interconnected efficiently, either with a bus or with a Network-on-Chip.
Interconnection with a bus: as the number of nodes increases, resource contention becomes an issue and performance degrades. A bus is transactional by nature and does not scale. Solution: Networks-on-Chip.
Networks-on-Chip: a scalable on-chip network. Example: a mesh, in which each node has a link to a dedicated router and each router has links to its four neighbors. Here a "link" means two unidirectional links, each as wide as one flit (a flit is the unit of transfer into which packets are broken for transmission). Bus or router? Or a hybrid? Nothing is perfect; we will come back to this later.
Networks-on-Chip
3D Design. Problem with large Networks-on-Chip: many routers increase communication delay, even with state-of-the-art routers, and the objective of reducing hop count is not always achievable in 2D. Solution: stack the layers in 3D, so that more banks are reachable within fewer hops than in 2D.
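To make the hop-count argument concrete, here is a toy comparison (not from the paper) of how many banks are reachable within a given number of router hops when 64 banks are laid out as a flat 8x8 mesh versus four stacked 4x4 layers. It assumes, for simplicity, that the CPU sits at one corner and that a single-hop vertical pillar exists at the CPU's own node.

```python
# Toy hop-count model: banks reachable within h router hops, 2D vs. 3D.
# Assumptions (illustrative, not the paper's configuration): CPU at the
# origin of layer 0; one single-hop pillar at the CPU's node; in-layer
# distance is Manhattan distance.

def banks_within(hops, dims):
    """Count grid nodes reachable from the origin within `hops` hops.

    dims = (x, y) for a 2D mesh or (x, y, layers) for a 3D stack.
    Reaching another layer costs one extra hop through the pillar.
    """
    if len(dims) == 2:
        xs, ys, layers = dims[0], dims[1], 1
    else:
        xs, ys, layers = dims
    count = 0
    for z in range(layers):
        for y in range(ys):
            for x in range(xs):
                dist = x + y + (1 if z != 0 else 0)
                if dist <= hops:
                    count += 1
    return count

for h in range(0, 9):
    print(f"hops <= {h}: 2D 8x8 -> {banks_within(h, (8, 8)):2d} banks, "
          f"3D 4x4x4 -> {banks_within(h, (4, 4, 4)):2d} banks")
```

Even this crude model shows the 3D stack covering all 64 banks within far fewer hops than the flat mesh, which is the intuition behind stacking.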
Benefits of 3D: higher packaging density; higher performance due to reduced average interconnect length; lower interconnect power consumption due to reduced total wiring length.
3D Technologies. Wafer bonding: the active device layers are processed separately and interconnected at the end; this is the technology used in this paper. Multi-Layer Buried Structures (MLBS): the front-end process is repeated on a single wafer to build multiple device layers, and the back-end process then builds the interconnects.
Wafer Orientation. Face-to-face: suitable for 2 layers; more than 2 layers gets complicated, with larger and longer vias. Face-to-back: supports more layers, but with reduced inter-layer via density.
Wafer Orientation
3D Issues: via insulation and inter-layer via pitch. The state of the art is a 0.2 x 0.2 micron via pitch using Silicon-on-Insulator, and the via pads (end points) limit via density. Bottom line: despite the lower densities, inter-layer vias provide faster data transfer than 2D wire interconnects.
Via pitch (figure from http://www.ltcc.de/en/whatis_des.php): A is the via pad, E is the via pitch; the other labels are not relevant here.
3D Network-in-Memory: the distance between layers is very small. Router vs. bus: routers are multi-hop, and with the extra links (up and down) the blocking probability increases. Solution: a single-hop communication medium, a bus, for the vertical direction. Intra-layer communication uses routers; inter-layer communication uses a dTDMA bus.
3D Network-in-Memory
dTDMA Bus (dynamic TDMA): time slots are allocated dynamically, providing rapid communication between the layers of the chip. dTDMA bus interface: the transmitter and receiver are connected to the bus through tri-state drivers, which are controlled by independently programmed feedback shift registers.
Arbitration
Arbitration: each pillar needs an arbiter. The arbiter should sit in the middle layer so that the wire distance to each layer is as uniform as possible. The number of control wires grows with the number of layers, so the layer count should be kept small. In the authors' experiments, the dTDMA bus was more efficient in area and power than conventional NoC routers (tables on the next slide).
Arbitration
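A minimal sketch of the dynamic-slot idea behind dTDMA, assuming the arbiter simply resizes the TDMA frame to the set of layers that currently have data to send; the class and method names are my own illustration, not the paper's hardware interface.

```python
# Sketch of dynamic TDMA (dTDMA) slot allocation on a vertical pillar: the
# arbiter sizes each frame to the number of layers that currently want to
# transmit, so idle layers never consume bus cycles. Illustrative only.

class DTDMAArbiter:
    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.requests = set()       # layers with data queued for the pillar

    def request(self, layer):
        assert 0 <= layer < self.num_layers
        self.requests.add(layer)

    def release(self, layer):
        self.requests.discard(layer)

    def next_frame(self):
        """Slot schedule for the next frame: one slot per active transmitter,
        in layer order; the frame grows and shrinks dynamically."""
        return sorted(self.requests)

arbiter = DTDMAArbiter(num_layers=4)
arbiter.request(0)           # CPU layer wants to send a cache request down
arbiter.request(2)           # layer 2 wants to send a reply up
print(arbiter.next_frame())  # [0, 2] -> a 2-slot frame instead of a fixed 4
arbiter.release(2)
print(arbiter.next_frame())  # [0]    -> frame collapses to a single slot
```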
Area and Power Overhead of the dTDMA Bus: the number of layers should be kept small for the reasons mentioned on the last slide, and also to limit bus contention.
Limitations: the area occupied by a pillar is wasted device area, so the number of inter-layer connections, and hence pillars, should be kept low. Increasing via density makes more vias feasible, but density is again limited by the via pads (endpoints). Router complexity also goes up: more ports mean a higher blocking probability. This paper uses normal routers (5 ports) plus hybrid routers (6 ports); the extra port serves the vertical link.
NoC Router Architecture: a router generally has a Routing Unit (RT), a Virtual Channel Allocation Unit (VA), a Switch Allocation Unit (SA), and a Crossbar (XBAR). In a mesh topology there are 5 physical channels per processing element (PE). Virtual channels, which are FIFO buffers, hold flits from pending messages.
NoC Router Architecture: the paper uses 3 VCs per physical channel, each one message deep. Each message is 4 flits and the width (b) of the router links is 128 bits: 4 flits/packet x 128 bits/flit = 512 bits/packet = 64 bytes/packet, so a 64-byte cache line fits in one packet.
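The packetization arithmetic above can be sketched as follows; only the flit width and flit count come from the slide, and the field layout is illustrative.

```python
# Sketch of how a 64-byte cache line maps onto the router links described
# above: with 128-bit (16-byte) flits, the line splits into exactly 4 flits,
# i.e. one packet. Illustrative only.

FLIT_BYTES = 128 // 8          # link width b = 128 bits
FLITS_PER_PACKET = 4

def packetize(cache_line: bytes):
    assert len(cache_line) == FLIT_BYTES * FLITS_PER_PACKET   # 64 bytes
    return [cache_line[i * FLIT_BYTES:(i + 1) * FLIT_BYTES]
            for i in range(FLITS_PER_PACKET)]

line = bytes(range(64))        # a dummy 64-byte cache line
flits = packetize(line)
print(len(flits), "flits of", len(flits[0]), "bytes each")    # 4 flits of 16 bytes
```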
NoC Router Architecture: the paper uses a single-stage router. Generally a router takes one cycle per stage, so 4 cycles versus 1 cycle here. Aggressive? Maybe; look-ahead routing and speculative channel allocation can reduce the stage count, and low latency is very important. Routers connected to pillar nodes are different: they have an extra physical channel corresponding to the FIFO buffers of the vertical link, which the router simply sees as one more physical channel.
NoC Router Architecture
CPU Placement: each CPU has a dedicated pillar for fast inter-layer access. CPUs could share pillars, but not in this paper, so we assume instant access from a CPU to its pillar and to all cache banks along that pillar: memory locality plus vertical locality.
CPU Placement
CPU Placement: thermal issues and congestion. Thermal issues are a major problem in 3D: CPUs consume most of the power, so it makes sense not to place them on top of each other in the stack. Congestion: CPUs generate most of the L2 traffic (the rest is due to data migration), so stacking them over one another would create more congestion on the shared pillar. Hence maximal offsetting.
CPU Placement
CPU Placement: if via density is low, there are fewer pillars than CPU cores and sharing of pillars is inevitable. Intelligent placement keeps CPUs close to the pillars (for fast pillar access) while minimizing thermal effects.
CPU Placement Algorithm
CPU Placement Algorithm: k = 1 in the experiments; k can be increased at the expense of performance. A lower 'c' is desirable: less contention and better network performance. The locations of the pillars are predetermined; pillars should be as far apart as possible to spread out congested areas, but not at the edges, because that would limit the number of cache banks around them. The placement pattern spans 4 layers, beyond which it repeats, since thermal effects diminish with inter-layer distance.
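A greedy sketch of one possible reading of the placement algorithm, interpreting k as the maximum allowed hop distance from a CPU to its pillar and c as the number of CPUs that end up sharing a pillar; the grid size, pillar sites, and tie-breaking order are illustrative assumptions, not the paper's exact algorithm.

```python
# Greedy, thermally aware CPU placement sketch (illustrative assumptions:
# 4x4 mesh per layer, 4 layers, two predetermined pillars away from the edges,
# k = 1 hop from a CPU to its pillar).
import itertools

MESH, LAYERS, K = 4, 4, 1
PILLARS = [(1, 1), (2, 2)]

def nearest_pillar(x, y):
    """Manhattan distance to the closest pillar, and that pillar."""
    return min((abs(x - px) + abs(y - py), (px, py)) for px, py in PILLARS)

def place_cpus(num_cpus):
    placed = []                              # chosen (x, y, layer) positions
    share = {p: 0 for p in PILLARS}          # c: CPUs assigned to each pillar
    sites = list(itertools.product(range(MESH), range(MESH), range(LAYERS)))
    for _ in range(num_cpus):
        best = None
        for (x, y, z) in sites:
            dist, pillar = nearest_pillar(x, y)
            if dist > K or (x, y, z) in placed:
                continue                     # too far from any pillar, or taken
            stacked = sum(1 for (ox, oy, _) in placed if (ox, oy) == (x, y))
            # prefer: no CPU stacked at the same (x, y) -> fewer hotspots,
            # then a lightly loaded pillar -> lower c, then a shorter path.
            key = (stacked, share[pillar], dist)
            if best is None or key < best[0]:
                best = (key, (x, y, z), pillar)
        _, pos, pillar = best
        placed.append(pos)
        share[pillar] += 1
    return placed, share

cpus, sharing = place_cpus(8)
print(cpus)      # 8 CPUs spread over distinct (x, y) columns near the pillars
print(sharing)   # how many CPUs contend for each pillar (c per pillar)
```

The tie-breaking order encodes the goals from the previous slides: avoid stacking CPUs vertically, then spread them over pillars, then keep each CPU close to a pillar.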
Thermal Aware CPU Placement
Thermal Profile – Hotspots – HS3d
3D L2 Cache Management: the cache banks are divided into clusters. A cluster contains a set of cache banks and a separate tag array for all cache lines in the cluster; all banks in a cluster are connected by a NoC, and the tag array has a direct connection to the local processor. Clusters without a local processor have a customized logic block for receiving cache requests, searching the tag array, and forwarding the request to the target cache bank.
Cache Management Policies: cache line search, cache placement, cache replacement, and cache line migration.
Cache Line Search: a two-step process. (1) The processor searches the local tag array in its cluster and also sends the request to the neighboring clusters (including the vertical neighbors reached through the pillars). (2) If the line is not found in any of these places, the processor multicasts the request to the remaining clusters. If the tag match fails in all clusters, it is an L2 miss; on a match, the data is routed back to the requesting processor through the NoC.
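A small sketch of the two-step search, with a stubbed-out cluster/tag-array model; the cluster topology and method names are illustrative.

```python
# Two-step L2 line search: probe the local cluster and its neighbours first,
# then multicast to the remaining clusters. Illustrative model only.

class Cluster:
    def __init__(self, lines):
        self.tags = set(lines)               # tag array: addresses cached here

    def lookup(self, addr):
        return addr in self.tags

def l2_search(addr, local, neighbours, remaining):
    # Step 1: local tag array plus neighbours (including the vertical
    # neighbours reached through the pillar).
    for cluster in [local, *neighbours]:
        if cluster.lookup(addr):
            return "hit (step 1)"
    # Step 2: multicast the request to all remaining clusters.
    for cluster in remaining:
        if cluster.lookup(addr):
            return "hit (step 2)"
    return "L2 miss"                         # no tag matched anywhere

local      = Cluster({0x40, 0x80})
neighbours = [Cluster({0xC0}), Cluster(set())]
remaining  = [Cluster({0x100})]
print(l2_search(0x80,  local, neighbours, remaining))   # hit (step 1)
print(l2_search(0x100, local, neighbours, remaining))   # hit (step 2)
print(l2_search(0x200, local, neighbours, remaining))   # L2 miss
```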
Placement and Replacement: the low-order bits of the cache tag select the cluster, the low-order bits of the cache index select the bank, and the remaining bits give the precise location within the bank. A pseudo-LRU policy is used for replacement.
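A sketch of this address mapping; the bit widths chosen below (64-byte lines, 8 clusters, 16 banks per cluster, 256 sets per bank) are illustrative assumptions, not the paper's configuration.

```python
# Placement mapping sketch: low-order tag bits pick the cluster, low-order
# index bits pick the bank, and the remaining index bits give the set within
# that bank. Bit widths are illustrative assumptions.

OFFSET_BITS  = 6          # 64-byte cache line
BANK_BITS    = 4          # 16 banks per cluster   (low-order index bits)
SET_BITS     = 8          # 256 sets per bank      (remaining index bits)
CLUSTER_BITS = 3          # 8 clusters             (low-order tag bits)

def place(addr):
    index_bits = BANK_BITS + SET_BITS
    index = (addr >> OFFSET_BITS) & ((1 << index_bits) - 1)
    tag   = addr >> (OFFSET_BITS + index_bits)
    return {
        "cluster": tag & ((1 << CLUSTER_BITS) - 1),   # which cluster
        "bank":    index & ((1 << BANK_BITS) - 1),    # which bank in it
        "set":     index >> BANK_BITS,                # where in that bank
    }

print(place(0x7F3ABCD0))
```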
Cache Line Migration: intra-layer data migration. Data is migrated toward the cluster close to the accessing CPU, but clusters that contain processors are skipped, to avoid disturbing the L2 access patterns of the local CPU in those clusters. With repeated accesses, the data eventually migrates into the requesting processor's own cluster.
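A sketch of the skip-over-processor-clusters rule on a simplified 1D row of clusters; the layout and names are illustrative.

```python
# Intra-layer migration sketch: on each access the block moves one step toward
# the requesting CPU's cluster, hopping over intermediate clusters that host a
# processor so their local access patterns are not disturbed. 1D for clarity.

def next_cluster(current, cpu_cluster, has_cpu):
    """has_cpu[i] is True if cluster i contains a processor."""
    if current == cpu_cluster:
        return current                       # already home, nothing to do
    step = 1 if cpu_cluster > current else -1
    nxt = current + step
    # skip clusters with local processors, unless it is the requester's own
    while nxt != cpu_cluster and has_cpu[nxt]:
        nxt += step
    return nxt

has_cpu = [True, False, False, True, False, False, True, False]
pos = 7                                      # block starts far from CPU 0
while pos != 0:
    pos = next_cluster(pos, 0, has_cpu)
    print("block now in cluster", pos)       # 5, 4, 2, 1, 0 (skips 6 and 3)
```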
Intra Layer Data Migration
Cache Line Migration: inter-layer data migration. Data is migrated closer to the pillar near the accessing CPU. Assumption: clusters near the same pillar, even on different layers, are considered local, so no inter-layer data migration is performed; this also helps reduce power.
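A tiny sketch of the resulting locality check, assuming each cluster and each CPU is tagged with the pillar it sits next to; the data structures are illustrative.

```python
# Inter-layer rule sketch: a cluster counts as local to a CPU if it sits on the
# same pillar, whatever layer it is on, so no vertical migration is triggered;
# any migration happens within a layer, toward that pillar. Illustrative only.

def needs_migration(cluster, cpu, pillar_of_cluster, pillar_of_cpu):
    """No migration if the block already lives on the accessing CPU's pillar."""
    return pillar_of_cluster[cluster] != pillar_of_cpu[cpu]

pillar_of_cluster = {"layer2-cluster3": "P1", "layer1-cluster7": "P2"}
pillar_of_cpu     = {"cpu0": "P1"}
print(needs_migration("layer2-cluster3", "cpu0",
                      pillar_of_cluster, pillar_of_cpu))   # False: vertically local
print(needs_migration("layer1-cluster7", "cpu0",
                      pillar_of_cluster, pillar_of_cpu))   # True: migrate within its layer
```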
Inter Layer Data Migration
Cache Line Migration: lazy migration, used to prevent false misses. A false miss is caused by a search that arrives while the data is in flight during migration; false misses arise when a few "hot" blocks are accessed repeatedly by multiple processors. Solution: delay the migration by a few cycles and cancel it when a different processor accesses the same block.
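A sketch of lazy migration, assuming a migration fires only after a fixed delay and is cancelled if a different processor touches the block first; the delay value and interface are illustrative.

```python
# Lazy migration sketch: schedule a migration a few cycles in the future and
# cancel it if a different processor touches the block in the meantime, which
# avoids "false misses" on hot shared blocks. Illustrative values.

MIGRATION_DELAY = 16                         # cycles to wait before moving a block

class LazyMigrator:
    def __init__(self):
        self.pending = {}                    # block -> (requesting cpu, fire cycle)

    def on_access(self, block, cpu, now):
        if block in self.pending and self.pending[block][0] != cpu:
            del self.pending[block]          # another CPU wants it: cancel the move
        else:
            self.pending[block] = (cpu, now + MIGRATION_DELAY)

    def tick(self, now):
        """Return the blocks whose delayed migrations fire this cycle."""
        due = [b for b, (_, t) in self.pending.items() if t <= now]
        for b in due:
            del self.pending[b]
        return due

m = LazyMigrator()
m.on_access("blk42", cpu=0, now=100)         # CPU 0 triggers a migration
m.on_access("blk42", cpu=1, now=105)         # CPU 1 touches it first: cancelled
print(m.tick(now=120))                       # [] -> the block never moved
```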
Experiment Methodology: Simics combined with a 3D NoC simulator; an 8-processor CMP running Solaris 9; in-order issue, SPARC ISA; private L1 caches and a large shared L2; Cacti 3.2. The dTDMA bus was integrated into the 2D NoC simulator as the vertical channel, and L1 cache coherence traffic was taken into account.
System Configuration
Benchmarks: each application was run for 500 million cycles to warm up the L2, and statistics were collected over the next 2 billion cycles. The table shows the L2 cache accesses.
Results legend: CMP-DNUCA is the conventional design with perfect search; CMP-DNUCA-2D is CMP-DNUCA-3D restricted to a single layer; CMP-DNUCA-3D is the proposed architecture with data migration; CMP-SNUCA-3D is the proposed architecture without data migration.
Average L2 Hit Latency
Number of block Migrations
IPC
Average L2 Hit Latency under different cache sizes
Effect of Number of Pillars
Impact of Number of Layers
Conclusion: the 3D NoC architecture reduces average L2 access latency, which improves IPC, and 3D is better than 2D even without data migration. The placement of processors in 3D needs to consider thermal issues carefully, and the number of pillars must be chosen carefully, since it affects congestion and bandwidth.
Strengths: a novel architecture that addresses the access-time issue; it includes thermal issues and tries to mitigate them through proper CPU placement; a hybrid Network-in-Memory (router plus bus) that adopts dTDMA for efficient channel usage.
Weaknesses: the paper assumes one CPU per pillar; as the number of CPUs increases this assumption may not hold, since via density does not increase and the number of pillars is fixed, so CPUs may have to share pillars, and the paper does not discuss the effect of such sharing on L2 latency. It assumes a single-stage router, which may not always be practical or feasible. For the thermally aware CPU placement, what is the assumption on heat flow: uniform or not?
Things I did not understand: MLBS.
The next paper could cover: L2 performance degradation due to the sharing of pillars; face-to-face wafer bonding (face-to-back results in more wasted area); MLBS; the effect of router speed (this paper assumed a single cycle for all four stages).
Questions? Thank you.