Design and Management of 3D CMPs using Network-in-Memory. Feihui Li et al., Penn State University (ISCA 2006)
Moral of the story…
3D technology helps in reducing wire delays
–Exploit it in as many ways as you can!
–The authors chose L2 caches
Also, 3D leads to on-chip hotspots
–Arrange units intelligently to reduce localized hotspots
Major Results/Contributions
First 3D CMP design-space exploration
Proposal of 3D NUCA L2 caches for CMPs
–Comparison with the existing 2D counterparts
–3D works better even without data migration
Proposal of NoCs as the method of communication between L2 banks
–"Efficiently exploit fast vertical interconnects"
Basics…
Figures: a typical Network-on-Chip architecture, and the major types of 3D integration
Proposed: 3D Network-in-Memory
Figure: every node is a processing element (cache bank or CPU) with a NIC attached to a single-stage router with input/output buffers and b-bit links
Pillar nodes additionally have a NoC/bus interface to a b-bit dTDMA bus (Dynamic Time-Division Multiple Access) that runs orthogonally through the dies, acting as a "communication pillar"
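A minimal sketch (my own, not code from the paper) of the routing decision at a node that owns a pillar, assuming (x, y, layer) coordinates and plain XY routing for the planar part: traffic bound for another die goes straight onto the dTDMA pillar in a single step, everything else stays on the 2D NoC.

def route_at_pillar_node(cur, dst):
    """cur and dst are (x, y, layer) tuples; returns the output port to take next."""
    x, y, layer = cur
    dx, dy, dlayer = dst
    if layer != dlayer:
        return "dTDMA_pillar"                    # the bus reaches any other layer in one transaction
    if x != dx:
        return "east" if dx > x else "west"      # simple dimension-ordered (XY) routing
    if y != dy:
        return "north" if dy > y else "south"
    return "local"                               # deliver to the attached cache bank / CPU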
The dTDMA Bus as the Communication Pillar
Use the dTDMA bus (VLSID 2006)
–efficient/fast bus
–small area/power overhead
Do not use multi-hop routing for vertical communication
–the inter-die distance is tiny (10~100 um) compared with in-plane distances (on the order of 1500 um)
Figure: the routers of the stacked layers all attach to the vertical dTDMA bus and its arbiter
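The "dynamic" in dTDMA is the property the pillar relies on: the arbiter resizes the frame so that only layers with pending traffic get time slots. A tiny sketch of that idea (the slot-assignment details are my assumption, not taken from the VLSID 2006 paper):

def build_frame(active_layers):
    """One slot per currently active layer; idle layers get no slot at all."""
    return {slot: layer for slot, layer in enumerate(sorted(active_layers))}

# Example: with only layers 0 and 2 requesting, the frame shrinks to 2 slots, so each
# active layer gets the bus every 2 cycles instead of every 4 (for a 4-layer stack).
print(build_frame({0, 2}))   # -> {0: 0, 1: 2}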
Proposals (1)
Inter-die "communication pillars": integration of dTDMA buses and NoC routers for a fast communication interface
–a typical NoC fails here due to increased complexity, contention issues, increased power/area overhead, and multi-hop vertical communication
3D Benefit: Increased Locality
Figure: nodes reachable from a CPU within 1, 2, and 3 hops; with a dTDMA pillar the 3D vicinity covers far more cache banks than the 2D vicinity at the same hop count
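A back-of-the-envelope count (illustrative only, not the paper's figures) of how many banks fall within k hops of a CPU, ignoring mesh edges and assuming the dTDMA pillar sits at the CPU's own node so any other layer is one hop away:

def nodes_within_2d(k):
    """Banks within k hops of a CPU in an (unbounded) 2D mesh, CPU itself excluded."""
    return sum(1 for dx in range(-k, k + 1)
                 for dy in range(-k, k + 1)
                 if 0 < abs(dx) + abs(dy) <= k)

def nodes_within_3d(k, layers):
    """Same CPU, but a dTDMA pillar at its node reaches any other layer in one hop."""
    if k == 0:
        return 0
    return nodes_within_2d(k) + (layers - 1) * (nodes_within_2d(k - 1) + 1)

for k in (1, 2, 3):
    print(k, nodes_within_2d(k), nodes_within_3d(k, layers=4))
# -> 1: 4 vs 7,  2: 12 vs 27,  3: 24 vs 63 reachable nodes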
Proposals (2)
Cannot increase the number of pillars arbitrarily
–depends on via density
–router complexity
So, CPUs share pillars
–stacking of CPUs also has to be considered
CPU placement algorithm: place CPUs across dies so as to
–maintain a decent access hop-count
–manage the thermal profile
CPU placement example
Not stacking CPUs directly on top of one another helps solve the localized-hotspot problem (see the placement sketch below)
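A greedy sketch in the spirit of the placement algorithm (the paper's exact algorithm is not reproduced here; the slot and pillar representations are my own): never put a CPU directly above or below another CPU, prefer the layer with the fewest CPUs so far to spread heat, and keep each CPU close to its shared pillar.

def place_cpus(cpu_count, slots, pillars):
    """slots: available (x, y, layer) positions; pillars: planar (x, y) pillar locations."""
    placed = []
    for cpu in range(cpu_count):
        pillar = pillars[cpu % len(pillars)]          # CPUs share pillars round-robin
        candidates = [s for s in slots
                      if s not in placed
                      and not any(p[0] == s[0] and p[1] == s[1] for p in placed)]
        # among slots not stacked on an already placed CPU, take the one closest to the
        # shared pillar, breaking ties toward the layer holding the fewest CPUs so far
        placed.append(min(candidates,
                          key=lambda s: (abs(s[0] - pillar[0]) + abs(s[1] - pillar[1]),
                                         sum(1 for p in placed if p[2] == s[2]))))
    return placed

# e.g. 4 CPUs on a 4x4 mesh stacked over 2 dies, with pillars at two corners
slots = [(x, y, z) for x in range(4) for y in range(4) for z in range(2)]
print(place_cpus(4, slots, pillars=[(0, 0), (3, 3)]))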
3D L2 Caches
Clusters = cache banks + tag array
–some clusters have CPUs, others don't
Cache management
–search
–placement & replacement
–cache-line migration
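An assumed lookup flow (the slides name search, placement/replacement, and migration but give no pseudocode): probe the requesting CPU's own cluster first, then the remaining clusters in increasing hop order, and on a remote hit pull the line into the requester's cluster as a crude stand-in for gradual migration.

def l2_lookup(addr, cpu_cluster, clusters, hops):
    """clusters: {cluster_id: set of resident tags}; hops: {cluster_id: hops from this CPU}."""
    order = sorted(clusters, key=lambda c: (c != cpu_cluster, hops[c]))
    for cluster in order:
        if addr in clusters[cluster]:
            if cluster != cpu_cluster:              # remote hit: pull the line closer
                clusters[cluster].remove(addr)
                clusters[cpu_cluster].add(addr)     # simplification of gradual migration
            return cluster                          # cluster that supplied the data
    return None                                     # L2 miss: go to off-chip memory

clusters = {0: {"A"}, 1: {"B"}, 2: set(), 3: set()}
hops = {0: 0, 1: 2, 2: 3, 3: 4}
print(l2_lookup("B", cpu_cluster=0, clusters=clusters, hops=hops), clusters)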
L2 Cache Management
Simulation Environment
Simics + an in-house NoC simulator
All CPUs issue in-order
–8 CPUs, SPARC ISA
–directory-based protocol for coherence between the L1s and the L2
HS3D for temperature modeling
64 MB and 32 MB L2 caches
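The same setup collected as a parameter dictionary (the field names are mine; the values are the ones quoted on the slide), e.g. for driving a Simics + NoC co-simulation script:

SIM_CONFIG = {
    "num_cpus": 8,
    "isa": "SPARC",
    "issue": "in-order",
    "coherence": "directory (L1s <-> shared L2)",
    "l2_sizes_mb": [32, 64],
    "thermal_model": "HS3D",
    "noc_simulator": "in-house",
}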
Performance
Important Results
Important Results (2) Impact of # of “pillars” on access latency
Important Results (3)
Final Word
3D is feasible & scalable… and has arrived
Localized hotspots can be mitigated by placing hotter units apart
Power savings + performance gain even without data migration
–no numbers to support the claim(!)
–would that help the temperature issue as well?
Potential HPCA Submission
An evaluation of temperature and IPC for a single-core 3D processor
Leverage clustered architectures for "temperature-aware" processor designs
–basic premise: stacking cooler units (caches) on top of hotter units gives a better thermal profile for the processor
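A purely illustrative 1D thermal-resistance model of that premise (all numbers below, 0.5 K/W per interface, a 45 °C sink, a 2 W cache vs. a 20 W core, are made up): every watt dissipated far from the heat sink must flow through the layers beneath it, so a cool cache stacked with the core heats the core far less than a second core would.

def layer_temps(powers_top_to_sink, r_per_layer=0.5, t_sink=45.0):
    """1D series model: heat from the upper layers must flow through every layer below."""
    n = len(powers_top_to_sink)
    temps = [0.0] * n
    t_below = t_sink
    for i in range(n - 1, -1, -1):                   # from the layer next to the sink upward
        flow = sum(powers_top_to_sink[:i + 1])       # power crossing the interface below layer i
        t_below += r_per_layer * flow
        temps[i] = t_below
    return temps

print(layer_temps([2.0, 20.0]))    # cache over the core   -> [57.0, 56.0], core layer at 56
print(layer_temps([20.0, 20.0]))   # second core over core -> [75.0, 65.0], core layer at 65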
Proposals
Figure: three candidate architectures (Arch 1, Arch 2, Arch 3) differing in how the cache banks and clusters are arranged
Proposals (2)
Cache banks (both data and instruction) are either
–2-way word-interleaved, or
–replicated
Present study done for an 8-cluster architecture
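A hedged sketch of what 2-way word interleaving means for a cluster's banks (the word size and the even/odd split are my assumptions, not stated on the slides): consecutive words alternate between the two banks, so a line's words can be fetched from both banks in parallel. Replication instead keeps a full copy in every cluster, trading capacity for purely local hits.

WORD_BYTES = 4                       # assumed word size

def bank_of(addr):
    """2-way word interleaving: consecutive words alternate between the two banks."""
    return (addr // WORD_BYTES) % 2

print([bank_of(a) for a in range(0, 32, 4)])   # -> [0, 1, 0, 1, 0, 1, 0, 1]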
Results (Performance) 2-way word-interleaved caches
Results (Performance) Replicated caches
Traffic Analysis
Traffic Analysis (2)
Results (Thermal)