SEOUL NATIONAL UNIVERSITY DESIGN AUTOMATION LAB Issues on Designing Many-core Architectures Seokhyun Lee, Hanmin Park, Kyoung Hoon Kim, Jinho Lee and Junwhan Ahn

Presentation transcript:

Issues on Designing Many-core Architectures
Seokhyun Lee, Hanmin Park, Kyoung Hoon Kim, Jinho Lee and Junwhan Ahn
Many-SC project
Design Automation Laboratory, Seoul National University

Trends
- Increase in the number of cores
  - Larger bandwidth demand
  - Interference in shared resources (caches, off-chip links, …)
  - Larger working set size
- Limitations
  - Low off-chip bandwidth, high off-chip link energy
  - Limited on-chip cache capacity
  - Limited power budget
  - Etc.

Outline
- Memory System
  - On-chip Caches
  - Main Memory
- Interconnection Network
- Power Budget Issue
  - Near-Threshold Computing
- Workload Characterization
  - Vision Applications


On-chip Caches: Partitioning
- Objective: isolation of per-core data
  - Eliminate interference among cores
  - Better throughput by allocating capacity based on demand
- Examples of partitioning schemes
  - Way partitioning: limited scalability
  - Set partitioning: limited scalability and complex decode logic
  - Replacement-policy-based: no guarantee of strict isolation
- Few schemes provide scalability together with strict isolation (e.g., Vantage [Sanchez+ ISCA’11])
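To make the way-partitioning bullet concrete, here is a minimal sketch of a way-partitioned cache. It is not taken from the slides: the class name, the `associativity` parameter and the demo are hypothetical, and it assumes cores touch disjoint data so that each core looks up and evicts only within its own ways. It illustrates both the isolation property and why the scheme scales poorly: there cannot be more partitions than ways.

```python
from collections import OrderedDict


class WayPartitionedCache:
    """Toy set-associative cache with a fixed number of ways per core.

    Assumption: cores touch disjoint data, so each core only looks up
    and evicts within its own ways (LRU order inside each partition).
    """

    def __init__(self, num_sets, associativity, ways_per_core):
        if any(w < 1 for w in ways_per_core.values()):
            raise ValueError("every core needs at least one way")
        if sum(ways_per_core.values()) > associativity:
            # The scalability limit: partitions cannot outnumber ways.
            raise ValueError("not enough ways for the requested partitions")
        self.num_sets = num_sets
        self.ways_per_core = dict(ways_per_core)
        self.sets = [
            {core: OrderedDict() for core in ways_per_core}
            for _ in range(num_sets)
        ]

    def access(self, core, line_addr):
        """Access one cache-line address; return True on hit, False on miss."""
        index = line_addr % self.num_sets
        tag = line_addr // self.num_sets
        partition = self.sets[index][core]
        if tag in partition:
            partition.move_to_end(tag)      # mark as most recently used
            return True
        if len(partition) >= self.ways_per_core[core]:
            partition.popitem(last=False)   # evict LRU line of this core only
        partition[tag] = True
        return False


def isolation_demo():
    """Core 1 streams through the cache; core 0's two lines must survive."""
    cache = WayPartitionedCache(num_sets=1, associativity=4,
                                ways_per_core={0: 2, 1: 2})
    cache.access(0, 100)
    cache.access(0, 101)
    for addr in range(1000, 1100):
        cache.access(1, addr)
    return cache.access(0, 100), cache.access(0, 101)
```

In this sketch, `isolation_demo()` reports hits for both of core 0's lines even though core 1 streamed 100 lines through the same set, while asking for 32 one-way partitions in a 16-way cache raises `ValueError`.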

On-chip Caches: 3D Stacked DRAM
- Use 3D stacked DRAM as a very large on-chip cache backed by large off-chip main memory
- Existing approaches
  - LH-cache with MissMap [Loh+ MICRO’11]
  - Alloy cache [Qureshi+ MICRO’12]
  - Hit speculation & self-balancing dispatch [Sim+ MICRO’12]
  - Footprint cache [Jevdjic+ ISCA’13]
  - Dynamic resizing of DRAM caches [Chang+ CMU TR]

On-chip Caches: STT-RAM
- Research in DAL
  - LASIC [Ahn+ IEEE TVLSI]
  - Lower-bits cache [Ahn+ ISCAS’12]
  - Selectively protecting ECC [Ahn+ ASP-DAC’13]
  - Write intensity prediction [Ahn+ ISLPED’13]
  - DASCA [Ahn+ HPCA’14]
- Other research related to multi/many-core systems
  - STT-RAM aware NoC [Mishra+ ISCA’11]
  - PVA-NUCA [Sun+ ISLPED’12]
  - OAP [Wang+ DATE’12]

Main Memory: Memory Controllers
- Numerous proposals for memory scheduling
  - ATLAS [Kim+ HPCA’10] for multiple MCs
  - SMS [Ausavarungnirun+ ISCA’12] for CPU-GPU systems
- Other research for many-core systems
  - Page placement/migration for multiple MCs [Awasthi+ PACT’10]
  - Application-aware channel partitioning & scheduling [Muralidhara+ MICRO’11]

Main Memory: Interface
- Various styles of interfaces/technologies
  - Slow parallel buses (e.g., DDR3: multi-drop, DDR4: P2P)
  - SerDes-based high-speed serial links (e.g., FB-DIMM, HMC)
  - Silicon interposer (e.g., HBM)
  - TSVs (e.g., Wide I/O)
  - Photonic interconnect

Summary
- Memory hierarchy for many-core systems
  - Bandwidth limitation vs. increasing bandwidth demand
  - Becomes more important as more cores are integrated
- Two main components
  - On-chip caches: data placement, partitioning, emerging memory technologies, …
  - Main memory: memory controllers, interface, …


(Flat) Mesh NoCs
- Single-chip Cloud Computer (SCC) *
- TILE64™ †
* J. Howard, et al., “A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS,” ISSCC.
† S. Bell, “TILE64™ processor: A 64-core SoC with mesh interconnect,” ISSCC, 2008.

High-radix topologies
Alternatives: high-radix topology / hierarchical topology
- Butterfly *
- Flattened butterfly †
* W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann.
† J. Kim, et al., “Flattened butterfly: A cost-efficient topology for high-radix networks,” ISCA, 2007.

Hierarchical topology
Alternatives: high-radix topology / hierarchical topology
- Concentrated mesh *
- Bus-mesh hybrid †
* J. Balfour and W. J. Dally, “Design tradeoffs for tiled CMP on-chip networks,” ICS.
† R. Das, et al., “Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs,” HPCA, 2009.

Conclusion
- A flat mesh does not scale in terms of latency and energy consumption.
- High-radix and hierarchical topologies are possible alternatives.
- The NoC architecture can be chosen based on the target application.
- (Possible) new issues include:
  - Design space exploration (DSE) on hierarchical NoCs
    - DSE on bus-NoC combinations
    - Topology combinations & cluster sizes
  - 3D stacking → thermal issues
    - Task mapping, topology, and routing perspectives
* S. Bourduas and Z. Zilic, “A hybrid ring/mesh interconnect for network-on-chip using hierarchical rings for global routing,” NOCS, 2007.
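The latency side of the mesh scalability claim can be illustrated with a rough hop-count estimate. The sketch below is not from the slides; it assumes uniform random traffic and minimal routing, ignores router and link delays, and models a concentrated mesh simply as a smaller mesh whose routers are each shared by a hypothetical `concentration` of 4 nodes (traffic between nodes on the same router is ignored).

```python
from itertools import product


def avg_mesh_hops(k):
    """Average hop count between distinct routers of a k x k mesh.

    Assumes uniform random traffic and minimal (Manhattan) routing.
    """
    routers = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for ax, ay in routers
                for bx, by in routers)
    pairs = len(routers) * (len(routers) - 1)   # ordered pairs, no self-traffic
    return total / pairs


def compare(node_counts=(16, 64, 256), concentration=4):
    """Flat mesh vs. a mesh with `concentration` nodes sharing each router."""
    rows = []
    for n in node_counts:
        flat_k = int(round(n ** 0.5))
        conc_k = int(round((n // concentration) ** 0.5))
        rows.append((n, avg_mesh_hops(flat_k), avg_mesh_hops(conc_k)))
    return rows


for n, flat, conc in compare():
    print(f"{n:4d} nodes: flat mesh {flat:.2f} hops, "
          f"concentrated mesh {conc:.2f} hops")
```

Under these assumptions the average distance in a k x k mesh works out to 2k/3 hops, so it grows with the square root of the node count; concentrating four nodes per router halves k and therefore the average hop count, at the cost of higher-radix routers.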


Dark Silicon
The end of multicore scaling
[Figure: at 65 nm, 4 cores at 1.8 GHz; at 32 nm, either 2×4 cores at 1.8 GHz (8 dark) or 4 cores at 2×1.8 GHz (12 dark)]
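The figures on this slide (4 cores at 1.8 GHz at 65 nm; 8 or 12 dark cores at 32 nm) are consistent with a simple post-Dennard scaling argument, sketched below. The model is an assumption on my part, not something the slide states: the linear shrink factor is taken as s = 2 for 65 nm to 32 nm, supply voltage is held constant, and the power budget is fixed. The function name and dictionary keys are hypothetical.

```python
def dark_silicon_options(base_cores, base_ghz, s):
    """Design points after shrinking by linear factor s at a fixed power budget.

    Assumed post-Dennard model: supply voltage stays constant, so per-core
    power at a given frequency drops only by 1/s (capacitance), while the
    same area holds s**2 times as many cores.
    """
    cores_that_fit = base_cores * s ** 2
    # Fixed power budget: (active cores) x (frequency) may grow by s only.
    spend_on_cores = (base_cores * s, base_ghz)
    spend_on_frequency = (base_cores, base_ghz * s)
    return [
        {"active_cores": cores, "ghz": ghz,
         "dark_cores": cores_that_fit - cores}
        for cores, ghz in (spend_on_cores, spend_on_frequency)
    ]


for option in dark_silicon_options(base_cores=4, base_ghz=1.8, s=2):
    print(option)
```

With these inputs, 16 cores fit on the die, but the budget allows either 8 active cores at 1.8 GHz (8 dark) or 4 active cores at 3.6 GHz (12 dark), matching the two 32 nm design points on the slide.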

Dark Silicon
[Figure: power consumption as process scales [Taylor, DAC 2012]]

Near-Threshold Computing
[Figure: Claremont, Intel]

Near-Threshold Computing
[Figure: energy per cycle]

Architecture Candidate
- Super-threshold only
- Near-threshold, or near-to-super-threshold


Parallelism level for vision applications
[Figure panels: [1], [2], [3]]
[1] C. Shi, N. Wu, and Z. Wang, "A high-speed vision processor based on pixel-parallel PE array and its applications," in Information Computing and Telecommunications (YC-ICT), 2010 IEEE Youth Conference on, 2010.
[2] C. Wu, H. Aghajan, and R. Kleihorst, "Real-Time Human Posture Reconstruction in Wireless Smart Camera Networks," in Information Processing in Sensor Networks, IPSN '08. International Conference on, 2008.
[3] S. Kyo, S. i. Okazaki, and T. Arai, "An integrated memory array processor architecture for embedded image recognition systems," in Computer Architecture, ISCA'05. Proceedings. 32nd International Symposium on, 2005.

Vision Benchmark: SD-VBS
S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia, et al., "SD-VBS: The San Diego Vision Benchmark Suite," in Workload Characterization, IISWC IEEE International Symposium on, 2009.

Workload Characterization
- Vision benchmarks: SD-VBS, MEVBench
- Features (compared with SPEC 2006)
  - Less ILP (instruction-level parallelism)
    - Short register dependency distance
    - Small basic block size
  - Instruction mix ratio
    - Computation intensive: many FP and integer operations
    - Not memory intensive: fewer load/store operations
  - Memory (VGA & HD)
    - Less memory stress
    - Requires only a small cache

Q&A

Appendix

Latency comparison
Case studies & DSE on NoCs in homogeneous many-core architectures: DSE in latency & energy perspectives
[Figure panels: 16 nodes, 64 nodes, 256 nodes]
* R. Das, et al., “Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs,” HPCA, 2009.

Power comparison (1/2)
Case studies & DSE on NoCs in homogeneous many-core architectures: DSE in latency & energy perspectives
[Figure panels: 16 nodes, 64 nodes, 256 nodes]
* R. Das, et al., “Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs,” HPCA, 2009.

Power comparison (2/2)
Case studies & DSE on NoCs in homogeneous many-core architectures: DSE in latency & energy perspectives
* R. Das, et al., “Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs,” HPCA, 2009.

Workload Characterization (SD-VBS)
- Benchmark comparison
  - Condition: serial execution (not parallel)
- Vision applications
  - Smaller basic block size
  - Less ILP than SPEC (register dependency distance & basic block size)
[Figure: benchmark groups compared, including FP, integer, scientific applications, vision (SD-VBS), and bio-informatics]
W. Alkohlani and J. Cook, "Towards Performance Predictive Application-Dependent Workload Characterization," in High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, 2012.

Workload Characterization (SD-VBS)
- Instruction mix ratio
  - Categories: load/store, float, integer, branch
  - Many FP and integer operations
  - Fewer load/store operations

Workload Characterization (SD-VBS)
- Memory footprint (VGA & HD)
  - Lower memory stress

Workload Characterization (SD-VBS)
- Temporal locality → cache size
  - Metric: number of unique cache lines between two accesses to the same cache line
  - Cache line size: 64 bytes
- Vision applications
  - High temporal locality
  - Do not need a large cache
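The temporal-locality metric on this slide (unique cache lines between two accesses to the same line, with 64-byte lines) is commonly called reuse distance. The following sketch shows one simple way to compute it from an address trace. The function names and the example trace are hypothetical, and the link to cache size assumes a fully associative LRU cache, which the slide does not specify.

```python
def reuse_distances(addresses, line_size=64):
    """Reuse distance of every access in a byte-address trace.

    The distance is the number of unique cache lines touched since the
    previous access to the same line; None marks the first touch of a line.
    """
    last_seen = {}   # cache line -> index of its most recent access
    lines = []       # cache line of every access so far
    distances = []
    for i, addr in enumerate(addresses):
        line = addr // line_size
        if line in last_seen:
            between = set(lines[last_seen[line] + 1:i])
            distances.append(len(between))
        else:
            distances.append(None)
        last_seen[line] = i
        lines.append(line)
    return distances


def lru_hit_rate(distances, cache_lines):
    """Hit rate of a fully associative LRU cache holding `cache_lines` lines.

    An access hits exactly when its reuse distance is below the capacity.
    """
    hits = sum(1 for d in distances if d is not None and d < cache_lines)
    return hits / len(distances)


trace = [0, 64, 128, 0, 8, 64]      # hypothetical byte addresses
dist = reuse_distances(trace)       # [None, None, None, 2, 0, 2]
print(dist, lru_hit_rate(dist, cache_lines=3))
```

A workload whose reuse distances are mostly small, which is what the slide reports for vision applications, reaches its maximum hit rate with few cache lines; this is the sense in which high temporal locality implies a small cache requirement.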

Workload Characterization (SD-VBS)
- Less ILP
- Less cache & memory pressure

Workload Characterization (MEVBench)
[Figure panels: ILP, DLP, TLP]