
Addressing heterogeneity, failures and variability in high-performance NoCs
José Duato
Parallel Architectures Group (GAP), Technical University of Valencia (UPV), Spain

“Heterogeneity, failures and variability in NoCs”, EDCC

Outline
–Current proposals for NoCs
–Sources of heterogeneity
–Current designs
–Our proposal
–Addressing bandwidth constraints
–Addressing heat dissipation
–The role of HyperTransport and QPI
–Some current research efforts
–Conclusions

Current Server Configurations
Cluster architectures based on 2- to 8-way motherboards with 4-core chips

What is next?
“Prediction is very difficult, especially about the future” (Niels Bohr, physicist)
Extrapolating current trends, the number of cores per chip will increase at a steady rate
Main expected difficulties
–Communication among cores
  Buses and crossbars do not scale
  A Network on Chip (NoC) will be required

What is next?
Main expected difficulties
–Heat dissipation and power consumption
  Known power reduction techniques are already implemented in the cores
  Either cores are simplified (in-order cores) or better heat extraction techniques are designed
–Memory bandwidth and latency
  VLSI technology scales much faster than package bandwidth
  Multiple interconnect layers increase memory latency
  Optical interconnects, proximity communication, and 3D stacking address this problem

Most current proposals for NoCs…
Homogeneous systems
–Regular topologies and simple routing algorithms
–Load balancing strategies become simpler
–A single switch design for all the nodes
Goals
–Minimize latency
–Minimize resource consumption (silicon area)
–Minimize power consumption
–Automate design space exploration

Most current proposals…
Inherit solutions from the first single-chip switches
–Wormhole switching
  Low latency
  Small buffers (low area and power requirements)
–2D meshes
  Match the 2D layout of current chips
  Minimize wiring complexity
–Dimension-order routing
  Implemented with a finite-state machine (low latency, small area)
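
The dimension-order routing mentioned above is simple enough to be computed by a small finite-state machine rather than a table. A minimal software sketch of the decision logic follows; the port names and the coordinate convention (y growing southward) are illustrative assumptions, not taken from the slides:

```python
def xy_route(cur, dst):
    """Dimension-order (XY) routing for a 2D mesh: travel fully along
    the X dimension first, then along Y. Assumes y grows southward."""
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "S"
    if dy < cy:
        return "N"
    return "LOCAL"  # packet has arrived at its destination tile
```

Because the decision depends only on two coordinate comparisons, the hardware cost is a handful of comparators, which is why the slides describe it as low-latency and area-efficient.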

Most current proposals…
[Figure: tile layout: a CPU core with L1I$/L1D$ and L2$ (data and tags) attached to a router with N/E/S/W ports, an arbiter, a routing unit (FSM implementing DOR), and small buffers]

Sources of Heterogeneity
Architectural sources
–Access to external memory
–Devices with different functionalities
–Use of accelerators
–Simple and complex cores
Technology sources
–Manufacturing defects
–Manufacturing process variability
–Thermal issues
–3D stacking
Usage model sources
–Virtualization
–Application-specific systems

Architectural sources
Due to the existence of different kinds of devices
Access to external memory
–On-chip memory controllers
–Different numbers of cores and memory controllers
  Example: GPUs with hundreds of cores and fewer than ten memory controllers

Architectural sources
[Figure: many cores/caches funneling traffic into a few DRAM memory controllers, causing congestion at the controllers]

Architectural sources
Access to external memory
–On-chip memory controllers
–Different numbers of cores and memory controllers
  Example: GPUs with hundreds of cores and fewer than ten memory controllers
–Consequences
  Heterogeneity in the topology
  Asymmetric traffic patterns
  Congestion when accessing memory controllers

Architectural sources
Devices with different functionalities
–Cache blocks with sizes and shapes different from those of processor cores

Architectural sources
[Figure: floorplan mixing cache blocks ($) of different sizes with a memory controller (MC)]

Architectural sources
Devices with different functionalities
–Cache blocks with sizes and shapes different from those of processor cores
–Consequences
  Heterogeneity in the topology
  Asymmetric traffic patterns: different link bandwidths might be required, and different networks might be required (e.g. 2D mesh + binary tree)

Architectural sources
Using accelerators
–Efficient use of available transistors
–Increases the Flops/Watt ratio
–Next device: GPU (already planned by AMD)


Architectural sources
Using accelerators
–Efficient use of available transistors
–Increases the Flops/Watt ratio
–Next device: GPU (already planned by AMD)
Simple and complex cores
–A few complex cores to run sequential applications efficiently
–Simple cores to run parallel applications and increase the Flops/Watt ratio
–Example: Cell processor


Architectural sources
Using accelerators
–Efficient use of available transistors
–Increases the Flops/Watt ratio
–Next device: GPU (already planned by AMD)
Simple and complex cores
–A few complex cores to run sequential applications efficiently
–Simple cores to run parallel applications and increase the Flops/Watt ratio
–Example: Cell processor
Consequences
–Heterogeneity in the topology (different sizes)
–Asymmetric traffic patterns



Technology sources
Manufacturing defects
–Increase with integration scale
–Yield may drop unless fault-tolerance solutions are provided
–Solution: use alternative paths (fault-tolerant routing)
Consequences
–Asymmetries introduced in the use of links (deadlock issues)
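
The idea of routing around failed links can be illustrated with a breadth-first search over the mesh graph. This sketch (function name and fault representation are assumptions) only shows that alternative minimal-length paths exist; a real NoC routing algorithm must additionally remain deadlock-free, which is exactly the difficulty the slide points out:

```python
from collections import deque

def shortest_path(size, faulty_links, src, dst):
    """BFS over a size x size mesh, skipping links marked faulty.
    faulty_links holds undirected node pairs, e.g. {((0, 0), (1, 0))}."""
    def blocked(a, b):
        return (a, b) in faulty_links or (b, a) in faulty_links

    frontier, seen = deque([[src]]), {src}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == dst:
            return path  # BFS guarantees this is a shortest fault-free path
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in seen and not blocked(node, nxt)):
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None  # destination unreachable with the given faults
```

With the link between (0,0) and (1,0) broken, the search detours through neighboring rows, mirroring how fault-tolerant routing restores connectivity at the cost of asymmetric link usage.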

Technology sources
[Figure: mesh with faulty links; traffic between nodes A and B is rerouted along alternative paths]

Technology sources
Manufacturing process variability
–Smaller transistor sizes increase process variability
  Variations in Leff and Vth
  Variations in wire dimensions → resistance and capacitance variations
  Variations in dopant levels → variations in Vth
–Clock frequency is fixed by the slowest device
–Unacceptable as variability increases

Systematic front-end variability
[Figure: Lgate map and link frequency/delay distributions across a 32nm 4x4 NoC topology]

Systematic front-end variability
[Figure: link frequency/delay distributions across a 32nm 8x8 NoC topology]
In larger networks the scenario is even worse

Technology sources
Manufacturing process variability
–Possible solutions
  Different regions with different speeds
  Links with different speeds
  Disabled links and/or switches
–Consequences
  Unbalanced link utilization
  Irregular topologies

Technology sources
Thermal issues
–More transistors can be integrated as long as they are not all active at the same time
–Temperature controllers will dynamically adjust the clock frequency of different clock domains
Consequences
–Functional heterogeneity
–Performance drops due to congested (low-bandwidth) subpaths passing through slower regions

Technology sources
3D stacking
–The most promising technology to alleviate the memory bandwidth problem
–Will aggravate the temperature problem (heat dissipation)
Consequences
–Traffic asymmetries (number of vias vs. wires in a chip)

Technology sources
[Figure courtesy of Gabriel Loh, ISCA 2008]

Usage model sources
Virtualization
–Enables running applications from different customers on the same computer while guaranteeing security and resource availability
–Resources are dynamically assigned (increases utilization)
–At the on-chip level
  Traffic isolation between regions: deadlock issues (routing becomes complex)
  Shared caches introduce interference among regions
  Memory controllers need to be shared

Usage model sources
[Figure: chip partitioned into virtualized regions A and B with interleaved tiles]

Usage model sources
Application-specific systems
–The application to run is known beforehand (embedded systems)
  Non-uniform traffic, and some links may not be required
–Heterogeneity can lead to silicon area and power savings

Usage model sources
[Figure: APSRA design flow: the communication graph of the application to be mapped (tasks T1…Tn) and the network topology (processors P1…P13) feed a mapping function; APSRA then generates routing tables, which are compressed into compact form. Courtesy of Maurizio Palesi and Shashi Kumar, CODES 2006]

Current Designs
Current trends when designing the NoC
–Topology: 2D mesh (fits the chip layout)
–Switching: wormhole (minimum area requirements for buffers)
–Routing: DOR, implemented with logic (a finite-state machine)
  Low latency, area and power efficient
  But not suitable for the new challenges: manufacturing defects, virtualization, collective communication, power management
[Figure: router with N/E/S/W ports, arbiter, routing unit (DOR FSM logic), and small wormhole buffers without VCs]

Our Proposal
bLBDR (Broadcast Logic-Based Distributed Routing)
–Removes routing tables both at source nodes and at switches
–Enables
  FSM-based (low-latency, power/area-efficient) unicast routing
  Tree-based multicast/broadcast routing with no need for tables
  Support for most irregular topologies (e.g. those arising from manufacturing defects): most topology-agnostic routing algorithms are supported (up*/down*, SR), and DOR routing in a 2D mesh topology is supported
  Definition of multiple regions for virtualization and power management

System environment
For bLBDR to be applied, some conditions must be met:
–Message headers must contain X and Y offsets, and every switch must know its own coordinates
–Every end node must be able to communicate with every other node through a minimal path
bLBDR, on the other hand:
–Can be applied on systems with or without virtual channels
–Supports both wormhole and virtual cut-through switching

Our Proposal
–FSM-based implementation
  A set of AND, OR, and NOT gates
  2 flags per switch output port for routing
  An 8-bit register per switch output port for topology/region definition
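
The table-free routing logic can be sketched in software. The following simplified model follows the published LBDR idea that bLBDR extends: connectivity bits per output port plus routing bits that restrict which turns are permitted downstream. The names, bit layout, and exact boolean expressions here are illustrative assumptions, not the actual gate-level design:

```python
def lbdr_outputs(dx, dy, C, R):
    """Simplified LBDR-style unicast logic. dx, dy: signed offsets to
    the destination (x grows eastward, y grows southward here).
    C: connectivity bit per output port, e.g. {"N": True, ...};
    a False bit marks a missing/disabled link (irregular topology).
    R: routing bits; R[p][q] is True if a packet leaving through port
    p may later turn toward direction q.
    Returns the set of admissible output ports."""
    north, south = dy < 0, dy > 0
    east, west = dx > 0, dx < 0
    out = set()
    if C["N"] and north and (not east or R["N"]["E"]) and (not west or R["N"]["W"]):
        out.add("N")
    if C["S"] and south and (not east or R["S"]["E"]) and (not west or R["S"]["W"]):
        out.add("S")
    if C["E"] and east and (not north or R["E"]["N"]) and (not south or R["E"]["S"]):
        out.add("E")
    if C["W"] and west and (not north or R["W"]["N"]) and (not south or R["W"]["S"]):
        out.add("W")
    if dx == 0 and dy == 0:
        out.add("LOCAL")
    return out
```

Note that the whole function is pure combinational logic over a few bits, which matches the slide's claim that only AND/OR/NOT gates and a small per-port configuration register are needed. Disabling the turns N→E/W and S→E/W in R, for instance, reproduces XY dimension-order routing.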

Description

Description (2)

Description (3)

Performance
8x8 mesh, TSMC 90nm technology library (we thank Maurizio Palesi for the evaluation results)

Addressing Bandwidth Constraints
3D stacking of DRAM seems the most viable and effective approach

DRAM and Cores in a Single Stack
[Figure courtesy of Gabriel Loh, ISCA 2008]

Addressing Heat Dissipation
The most feasible techniques to reduce power consumption have already been implemented in current cores
Increasing the number of cores will increase power consumption. Options are:
–Using simpler cores (e.g. in-order cores)
  Niagara 2 has a chip TDP of 95W and a core TDP of 5.4W, which results in a 32nm scaled core TDP of 1.1W
  Atom has a chip TDP of 2.5W and a core TDP of 1.1W, which results in a 32nm scaled core TDP of 0.5W
–Using new techniques to increase heat dissipation
  Liquid cooling inside the chip
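
The 32nm scaled-TDP figures above can be approximated with a first-order area-scaling argument: power shrinks roughly with the square of the feature-size ratio. A back-of-the-envelope sketch follows; the source process nodes assumed here (65nm for Niagara 2, 45nm for Atom) are not stated on the slide, and real scaling also depends on voltage and leakage, so this is only a ballpark model:

```python
def scaled_core_tdp(core_tdp_w, old_node_nm, new_node_nm):
    """First-order estimate of a core's TDP after a process shrink:
    scale by the square of the feature-size ratio (area scaling).
    Ignores leakage and supply-voltage effects."""
    return core_tdp_w * (new_node_nm / old_node_nm) ** 2
```

Under these assumptions, scaling Atom's 1.1W core from 45nm to 32nm gives about 0.56W, close to the 0.5W the slide quotes; the residual gap is exactly the kind of second-order effect this model leaves out.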

Handling Heat Dissipation

The Role of HyperTransport and QPI
Tiled architectures reduce design cost and NoC size, and share memory controllers

The Role of HyperTransport and QPI
Tile architecture versus 4-core Opteron architecture: HT/QPI-based NoCs?

Reducing Design Cost and Time to Market
Instead of stacking a multi-core die and several DRAM dies…
Silicon carrier with multiple (smaller) multi-core dies and 3D DRAM stacks
–Shorter time to market: just shrink current dies to the next VLSI technology
–Better heat distribution, yield, and fault tolerance
–Opportunities for design space exploration and optimization
  Number of dies of each kind, component location, interconnect patterns, etc.
–Two-level interconnect: network on-chip and network on-substrate
  Network on-substrate: not a new concept; already implemented in SoCs
  Network on-substrate implemented with metal layers or silicon waveguides
  Perfect fit for HT/QPI: current chip-to-chip interconnects moved to the substrate

Example Based on 4-Core Opteron

Example Based on 4-Core Opteron (2)

Some Current Research Efforts
Implementation and evaluation of High Node Count HyperTransport (HNCHT) extensions…

Some Current Research Efforts
…based on the HTX reference card from the University of Heidelberg, to model at the system level what in the future will be within a single package

Some Current Research Efforts
The FPGA implements protocol translation, matching store, routing, and the NI

Expected Results
Working prototype with 1024 cores
–FPGA implementation of protocol translation to HNCHT
–Optimized libraries for MPI and GASNet
–Evaluation with sample parallel applications
–Extension of cache coherence protocols for using remote memory
Limitations
–Cache coherence protocols are not scalable
–Long latency when accessing remote memory
–Low bandwidth when accessing remote memory with load/store (limited by the MSHRs and load-store queue size in the Opteron)

Conclusions
–Future multi-core chips face three big challenges: power consumption (and heat dissipation), memory bandwidth, and on-chip interconnects
–Despite the simplicity and beauty of homogeneous designs, designers will be forced to consider heterogeneity
–There are many sources of heterogeneity, imposed by architecture, technology, or usage models. There is no way to escape!
–It is very challenging, but not impossible, to provide efficient, cost-effective architectural support for heterogeneity in a NoC
–Some solutions have been proposed for heat dissipation. The question is whether they will become cost-effective
–3D stacking is the most promising approach to address memory bandwidth. Its two flavors (single and multiple stacks) offer very different trade-offs
–HT/QPI fits very well with on-chip and on-substrate interconnect requirements

Addressing heterogeneity, failures and variability in high-performance NoCs
Thank you!
José Duato
Parallel Architectures Group (GAP), Technical University of Valencia (UPV), Spain