SEOUL NATIONAL UNIVERSITY DESIGN AUTOMATION LAB Issues on Designing Many-core Architectures Seokhyun Lee, Hanmin Park, Kyoung Hoon Kim, Jinho Lee and Junwhan Ahn Many-SC project Design Automation Laboratory Seoul National University
Trends
  Increase in the number of cores
  Larger bandwidth demand
  Interference in shared resources (caches, off-chip links, …)
  Larger working set size
Limitations
  Low off-chip bandwidth, high off-chip link energy
  Limited on-chip cache capacity
  Limited power budget
  Etc.
Outline
  Memory System
    On-chip Caches
    Main Memory
  Interconnection Network
  Power Budget Issue
    Near-Threshold Computing
  Workload Characterization
    Vision Applications
On-chip Caches: Partitioning
  Objective: isolation of per-core data
    Eliminate interference among cores
    Better throughput by allocating capacity based on demand
  Examples of partitioning schemes
    Way partitioning: limited scalability
    Set partitioning: limited scalability & complex decode logic
    Replacement-policy-based partitioning: no guarantee of strict isolation
  Only a limited number of schemes provide scalability with strict isolation (e.g., Vantage [Sanchez+ ISCA’11])
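Way partitioning, the first scheme listed on this slide, can be sketched for a single cache set. This is a minimal illustration and not any of the cited designs: the class name `WayPartitionedSet` and the per-core quota dictionary are assumptions made for the example, and the model ignores data shared between cores. Each core may occupy only its own quota of ways, so one core's misses never evict another core's lines.

```python
from collections import OrderedDict


class WayPartitionedSet:
    """One cache set whose ways are statically split among cores (hypothetical sketch)."""

    def __init__(self, quotas):
        # quotas: {core_id: number of ways reserved for that core}
        self.quotas = dict(quotas)
        # One LRU-ordered collection of tags per core; the oldest entry comes first.
        self.lines = {core: OrderedDict() for core in self.quotas}

    def access(self, core, tag):
        """Return True on a hit, False on a miss (the line is then filled if possible)."""
        ways = self.lines[core]
        if tag in ways:
            ways.move_to_end(tag)  # mark as most recently used
            return True
        if self.quotas[core] == 0:
            return False  # this core owns no ways, so nothing can be cached
        if len(ways) >= self.quotas[core]:
            ways.popitem(last=False)  # evict only this core's own LRU line
        ways[tag] = None
        return False


if __name__ == "__main__":
    cache_set = WayPartitionedSet({0: 2, 1: 2})  # 4-way set, 2 ways per core
    cache_set.access(0, "A")
    for tag in ["X", "Y", "Z", "W"]:  # core 1 thrashes its own ways
        cache_set.access(1, tag)
    print(cache_set.access(0, "A"))  # core 0's line is still resident
```

Because the allocation granularity is a whole way, the number of partitions cannot exceed the associativity, which is one way to see the limited scalability noted on the slide.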
On-chip Caches: 3D Stacked DRAM
  Use 3D stacked DRAM as very large on-chip caches backed by large off-chip main memory
  Existing approaches
    LH-cache with MissMap [Loh+ MICRO’11]
    Alloy cache [Qureshi+ MICRO’12]
    Hit speculation & self-balancing dispatch [Sim+ MICRO’12]
    Footprint cache [Jevdjic+ ISCA’13]
    Dynamic resizing of DRAM caches [Chang+ CMU TR]
On-chip Caches: STT-RAM
  Research in DAL
    LASIC [Ahn+ IEEE TVLSI]
    Lower-bits cache [Ahn+ ISCAS’12]
    Selectively protecting ECC [Ahn+ ASP-DAC’13]
    Write intensity prediction [Ahn+ ISLPED’13]
    DASCA [Ahn+ HPCA’14]
  Other research related to multi/many-core systems
    STT-RAM aware NoC [Mishra+ ISCA’11]
    PVA-NUCA [Sun+ ISLPED’12]
    OAP [Wang+ DATE’12]
Main Memory: Memory Controllers
  Numerous proposals for memory scheduling
    ATLAS [Kim+ HPCA’10] for multiple MCs
    SMS [Ausavarungnirun+ ISCA’12] for CPU-GPU systems
  Other research for many-core systems
    Page placement/migration for multiple MCs [Awasthi+ PACT’10]
    Application-aware channel partitioning & scheduling [Muralidhara+ MICRO’11]
Main Memory: Interface
  Various styles of interfaces/technologies
    Slow parallel buses (e.g., DDR3: multi-drop, DDR4: P2P)
    SerDes-based high-speed serial links (e.g., FB-DIMM, HMC)
    Silicon interposer (e.g., HBM)
    TSVs (e.g., Wide I/O)
    Photonic interconnect
Summary
  Memory hierarchy for many-core systems
    Bandwidth limitation vs. increasing bandwidth demand
    Becomes more important as more cores are integrated
  Two main components
    On-chip caches: data placement, partitioning, emerging memory technologies, …
    Main memory: memory controllers, interface, …
(Flat) Mesh NoCs
  Single-chip Cloud Computer (SCC) *
  TILE64™ †
  * J. Howard, et al., “A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS,” ISSCC.
  † S. Bell, “TILE64™ processor: A 64-core SoC with mesh interconnect,” ISSCC, 2008.
Alternatives: High-radix topology / Hierarchical topology
High-radix topologies
  Butterfly *
  Flattened butterfly †
  * W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann.
  † J. Kim, et al., “Flattened butterfly: A cost-efficient topology for high-radix networks,” ISCA, 2007.
Alternatives: High-radix topology / Hierarchical topology
Hierarchical topology
  Concentrated mesh *
  Bus-mesh hybrid †
  * J. Balfour and W. J. Dally, “Design tradeoffs for tiled CMP on-chip networks,” ICS.
  † R. Das, et al., “Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs,” HPCA, 2009.
Conclusion
  A mesh is not scalable in terms of latency and energy consumption.
  High-radix and hierarchical topologies are possible alternatives.
  The NoC architecture can be chosen based on the target application.
  (Possible) new issues include:
    Design space exploration (DSE) on hierarchical NoCs
    DSE on bus-NoC combinations
    Topology combinations & cluster sizes
    3D stacking → thermal issues
    Task mapping, topology, routing, and other perspectives
  * S. Bourduas and Z. Zilic, “A hybrid ring/mesh interconnect for network-on-chip using hierarchical rings for global routing,” NOCS, 2007.
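The latency side of the mesh scalability claim can be illustrated with a back-of-the-envelope hop count. The sketch below assumes a square k×k mesh, minimal (Manhattan-distance) routing, and uniform random traffic between distinct nodes; the function name `mesh_avg_hops` is made up for the example, and it counts hops only, not router or link delay.

```python
from itertools import product


def mesh_avg_hops(k):
    """Average minimal hop count between distinct nodes of a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    total_hops = 0
    pairs = 0
    for (x1, y1), (x2, y2) in product(nodes, repeat=2):
        if (x1, y1) != (x2, y2):
            # Minimal routing in a mesh follows the Manhattan distance.
            total_hops += abs(x1 - x2) + abs(y1 - y2)
            pairs += 1
    return total_hops / pairs


if __name__ == "__main__":
    for k in (4, 8, 16):  # 16, 64 and 256 nodes
        print(f"{k * k:3d} nodes: {mesh_avg_hops(k):.2f} hops on average")
```

Under these assumptions the average works out to 2k/3 hops, so it grows with the square root of the node count. High-radix and hierarchical topologies aim to cut exactly this hop count.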
Dark Silicon
  The end of multicore scaling
  [Figure: at 65 nm, 4 cores at 1.8 GHz; at 32 nm, either 2×4 cores at 1.8 GHz (8 dark) or 4 cores at 2×1.8 GHz (12 dark)]
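The core counts on this slide are consistent with a simple post-Dennard scaling model of the kind discussed in [Taylor, DAC 2012]. The sketch below is an illustration under stated assumptions, not the presenters' calculation: for an integer linear scaling factor s between nodes (about 2 from 65 nm to 32 nm), the area budget grows by s², supply voltage stays fixed, per-core switching capacitance shrinks by 1/s, and the chip power budget is constant. The function name `dark_silicon_options` is hypothetical.

```python
def dark_silicon_options(base_cores, base_ghz, s):
    """Enumerate operating points at the scaled node under a fixed power budget.

    Assumed model (not taken from the slides): the area fits base_cores * s**2
    cores, and one new-node core costs freq_mult / s of an old-node core's power.
    s must be an integer. Returns (active_cores, frequency_ghz, dark_cores) tuples.
    """
    total_cores = base_cores * s * s
    options = []
    for freq_mult in (1, s):  # keep the old frequency, or scale it by s
        # The budget equals base_cores old-node cores of power.
        active = base_cores * s // freq_mult
        options.append((active, base_ghz * freq_mult, total_cores - active))
    return options


if __name__ == "__main__":
    for active, ghz, dark in dark_silicon_options(base_cores=4, base_ghz=1.8, s=2):
        print(f"{active} cores at {ghz:.1f} GHz, {dark} dark")
```

For 4 cores at 1.8 GHz and s = 2, this yields 8 active cores at 1.8 GHz with 8 dark, or 4 active cores at 3.6 GHz with 12 dark, matching the two 32 nm options shown on the slide.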
Dark Silicon
  Power consumption as process scales [Taylor, DAC 2012]
Near-Threshold Computing
  [Figure: energy per cycle]
Architecture Candidates
  Super-threshold only
  Near-threshold, or near- to super-threshold
Parallelism Level for Vision Applications
  [Figures from [1], [2], and [3]]
  [1] C. Shi, N. Wu, and Z. Wang, "A high-speed vision processor based on pixel-parallel PE array and its applications," in IEEE Youth Conference on Information Computing and Telecommunications (YC-ICT), 2010.
  [2] C. Wu, H. Aghajan, and R. Kleihorst, "Real-Time Human Posture Reconstruction in Wireless Smart Camera Networks," in International Conference on Information Processing in Sensor Networks (IPSN '08), 2008.
  [3] S. Kyo, S. i. Okazaki, and T. Arai, "An integrated memory array processor architecture for embedded image recognition systems," in Proceedings of the 32nd International Symposium on Computer Architecture (ISCA'05), 2005.
Vision Benchmark: SD-VBS
  S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia, et al., "SD-VBS: The San Diego Vision Benchmark Suite," in IEEE International Symposium on Workload Characterization (IISWC), 2009.
Workload Characterization
  Vision benchmarks: SD-VBS, MEVBench
  Features (compared with SPEC 2006)
    Less ILP (instruction-level parallelism)
      Short register dependency distance
      Small basic block size
    Instruction mix ratio
      Computation-intensive: many FP and integer operations
      Not memory-intensive: fewer load/store operations
    Memory (VGA & HD)
      Less memory stress
      Requires only a small cache
Latency comparison
  Case studies & DSE on NoCs in homogeneous many-core architectures
  DSE from latency & energy perspectives
  [Figure: latency comparison for 16, 64, and 256 nodes]
  * R. Das, et al., “Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs,” HPCA, 2009.
Power comparison (1/2)
  Case studies & DSE on NoCs in homogeneous many-core architectures
  DSE from latency & energy perspectives
  [Figure: power comparison for 16, 64, and 256 nodes]
  * R. Das, et al., “Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs,” HPCA, 2009.
Power comparison (2/2)
  Case studies & DSE on NoCs in homogeneous many-core architectures
  DSE from latency & energy perspectives
  * R. Das, et al., “Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs,” HPCA, 2009.
Workload Characterization (SD-VBS)
  Benchmark comparison
    Condition: serial execution (not parallel)
  Vision applications
    Smaller basic block size
    Less ILP than SPEC (register dependency distance & basic block size)
  [Figure: benchmark comparison; group labels: fp, Int, Scientific App., Vision (SD-VBS), Bio-informatics]
  W. Alkohlani and J. Cook, "Towards Performance Predictive Application-Dependent Workload Characterization," in 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC), 2012.
Workload Characterization (SD-VBS)
  Instruction mix ratio
    Categories: load/store, floating-point, integer, branch
    Many FP and integer operations
    Fewer load/store operations
Workload Characterization (SD-VBS)
  Memory footprint (VGA & HD)
    Lower memory stress
Workload Characterization (SD-VBS)
  Temporal locality
    Cache size
    Number of unique cache lines between two accesses to the same cache line
    Cache line size: 64 bytes
  Vision applications
    High temporal locality
    Do not need a large cache
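The temporal-locality metric on this slide, the number of unique cache lines touched between two accesses to the same cache line, is commonly called reuse distance (or LRU stack distance). The following is a minimal sketch of how it could be computed from an address trace with 64-byte lines; the function name `reuse_distances` and the plain list used as the LRU stack are choices made for the example, and nothing here reflects the tooling actually used for the SD-VBS measurements.

```python
def reuse_distances(addresses, line_size=64):
    """Return one reuse distance per access (None for the first touch of a line).

    The distance is the number of distinct other cache lines accessed since
    the previous access to the same line.
    """
    lru_stack = []  # cache line numbers, most recently used at the end
    distances = []
    for addr in addresses:
        line = addr // line_size  # map the byte address to its cache line
        if line in lru_stack:
            idx = lru_stack.index(line)
            # Every entry above idx is a distinct line touched since the last use.
            distances.append(len(lru_stack) - 1 - idx)
            lru_stack.pop(idx)
        else:
            distances.append(None)
        lru_stack.append(line)
    return distances


if __name__ == "__main__":
    trace = [0, 64, 128, 0, 8, 64]  # byte addresses
    print(reuse_distances(trace))   # [None, None, None, 2, 0, 2]
```

In a fully associative LRU cache holding C lines, an access hits exactly when its reuse distance is less than C, so a workload whose distances are mostly small needs only a small cache. The list-based stack costs time linear in the number of distinct lines per access, which is fine for illustration but slow for long traces.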
Workload Characterization (SD-VBS)
  Less ILP
  Less cache & memory pressure