Rethinking NoCs for Spatial Neural Network Accelerators Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna Georgia Institute of Technology Synergy Lab (http://synergy.ece.gatech.edu) NOCS 2017 Oct 20, 2017
Emergence of DNN Accelerators Emerging DNN applications
Emergence of DNN Accelerators Convolutional Neural Network (CNN) Convolutional layers (feature extraction) -> Pooling layers (summarize features) -> FC layer -> "Palace" Intermediate features can be low-level patterns such as edges
Emergence of DNN Accelerators Computation in Convolutional Layers Image source: Y. Chen et al., Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, ISCA 2016 Filters are convolved over the input images to produce outputs: a sliding-window operation over the input feature maps
Emergence of DNN Accelerators Massive Parallelism in Convolutional Layers
for(n=0; n<N; n++) {              // Input feature maps (IFMaps)
 for(m=0; m<M; m++) {             // Weight filters
  for(c=0; c<C; c++) {            // IFMap/weight channels
   for(y=0; y<H-R+1; y++) {       // Output feature map row
    for(x=0; x<H-R+1; x++) {      // Output feature map column
     for(j=0; j<R; j++) {         // Weight filter row
      for(i=0; i<R; i++) {        // Weight filter column
       O[n][m][y][x] += W[m][c][j][i] * I[n][c][y+j][x+i]; }}}}}}}
Accumulation (+=) and multiplication (*) in the innermost statement; the loop nest exposes massive parallelism
Emergence of DNN Accelerators Spatial DNN Accelerator ASIC Architecture General-purpose processors are not as efficient as DNN accelerators (not good at extracting this parallelism) DaDianNao (MICRO 2014): 256 PEs (16 in each tile) Eyeriss (ISSCC 2016): 168 PEs *PE: processing element
Emergence of DNN Accelerators Spatial DNN Accelerator ASIC Architecture Global Memory (GBM), NoC, PE array, DRAM Why do we need a NoC? => To support different traffic patterns depending on the DNN that is mapped (if there were only one pattern to support, a specialized structure would be sufficient) Spatial processing over PEs Multi-bus: Eyeriss Mesh: DianNao, DaDianNao Crossbar+Mesh: TrueNorth
Challenges with Traditional NoCs Relative Area Overhead Compared to Tiny PEs (bus, crossbar switch, and mesh vs. an Eyeriss PE; NoC square size = total NoC area divided by the number of PEs, 256 PEs) Power overhead is correspondingly large Meshes and crossbars suit CMPs, where cores are large Let's look at meshes and buses
Challenges with Traditional NoCs Bandwidth AlexNet conv. layer simulation results (row-stationary dataflow; simulation details will be explained later) Comparing bus vs. mesh latency (runtime): no clear winner More PEs should lower latency, but broad-/multicasting gets serialized and bandwidth bottlenecks at the top level Bus provides low bandwidth for DNN traffic
Challenges with Traditional NoCs Dataflow-Style Processing over Spatial PEs (e.g., systolic array in the TPU, Eyeriss) There is no way to hide communication latency, and the traffic differs from that of CMPs and MPSoCs
Challenges with Traditional NoCs Unique Traffic Patterns CMPs (cores): dynamic all-to-all traffic MPSoCs (CPU, GPU, sensor, comm blocks): static fixed traffic DNN accelerators (GBM, NoC, PE array): ?
Traffic Patterns in DNN Accelerators Scatter: one-to-all or one-to-many (GBM -> PEs), e.g., filter weight and/or input feature map distribution
Traffic Patterns in DNN Accelerators Gather: all-to-one or many-to-one (PEs -> GBM), e.g., partial sum gathering
Traffic Patterns in DNN Accelerators Local: many one-to-one transfers within the PE array, e.g., partial-sum (psum) accumulation A key optimization that removes traffic between the GBM and the PE array and maximizes data reuse within the PE array
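The local pattern can be illustrated with a minimal C sketch (assumptions: a 1-D row of PEs; names such as NUM_PES and local_psum are illustrative, not the accelerator RTL). Each PE adds its partial sum to the value arriving from its neighbor and forwards the result, so only the final accumulated value travels back to the GBM.

#include <stdio.h>

#define NUM_PES 16  /* illustrative row length, not the actual PE count */

/* Each PE adds its own partial sum to the running sum received from its
 * neighbor and forwards the result. The last PE holds the fully accumulated
 * output, so only one value per row needs to travel back to the GBM. */
int main(void) {
    int local_psum[NUM_PES];
    for (int p = 0; p < NUM_PES; p++)
        local_psum[p] = p + 1;        /* stand-in for each PE's MAC result */

    int running = 0;                  /* value carried hop-by-hop on local links */
    for (int p = 0; p < NUM_PES; p++)
        running += local_psum[p];     /* one neighbor-to-neighbor transfer per hop */

    printf("accumulated psum leaving the row: %d\n", running);
    return 0;
}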
Why Not Traditional NoCs Unique Traffic Patterns CMPs: dynamic all-to-all traffic MPSoCs: static fixed traffic DNN accelerators: scatter, gather, and local traffic
Requirements for NoCs in DNN Accelerators High throughput: many PEs Area/power efficiency: tiny PEs Low latency: no latency hiding Reconfigurability: diverse neural network dimensions Optimization opportunity Three traffic patterns: specialization for each traffic pattern
Outline Motivation Microswitch Network (Topology, Routing, Reconfiguration, Flow Control) Evaluations (Latency/Throughput, Area, Power, Energy) Conclusion
Topology: Microswitch Network Top switches (Lv 0), middle switches (Lv 1, Lv 2), bottom switches (Lv 3) Distribute communication to tiny switches
Routing: Scatter Traffic Tree-based broad/multicasting
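A minimal C sketch of tree-based multicasting, assuming destinations are encoded as a PE bitmask: a branch of the scatter tree is entered only if its subtree contains at least one destination, so unicast, multicast, and broadcast all use the same mechanism. The function name scatter and the 8-PE tree size are illustrative.

#include <stdint.h>
#include <stdio.h>

/* Forward one flit down a binary scatter tree. A subtree is entered only if
 * its leaf range [lo, hi) contains at least one destination PE. */
static void scatter(uint32_t dest_mask, int lo, int hi, int flit) {
    uint32_t range_mask = ((hi - lo) == 32) ? 0xFFFFFFFFu
                                            : (((1u << (hi - lo)) - 1u) << lo);
    if ((dest_mask & range_mask) == 0)
        return;                           /* no destination below: prune this branch */
    if (hi - lo == 1) {
        printf("PE %d receives flit %d\n", lo, flit);
        return;
    }
    int mid = (lo + hi) / 2;
    scatter(dest_mask, lo, mid, flit);    /* lower subtree */
    scatter(dest_mask, mid, hi, flit);    /* upper subtree */
}

int main(void) {
    /* multicast a weight flit to PEs 0, 1, and 5 out of 8 leaf PEs */
    scatter(0x23u, 0, 8, 42);
    return 0;
}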
Routing: Gather Traffic Multiple pipelined linear networks with arbitration at switches that merge gather inputs Bandwidth is bounded by the GBM write bandwidth
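A minimal C sketch of a pipelined linear gather chain, assuming each gather microswitch forwards the in-flight upstream value if there is one and otherwise injects its own PE's output; the fixed priority to upstream data and the names used here are illustrative, not the exact RTL arbiter. The chain delivers at most one value per cycle to the GBM, which is why gather bandwidth is bounded by the GBM write bandwidth.

#include <stdbool.h>
#include <stdio.h>

#define NUM_SW 4  /* illustrative length of one gather chain */

/* One pipeline register per gather microswitch. Upstream values advance one
 * stage per cycle; a switch injects its local PE output only into an empty
 * slot, so the chain drains one value per cycle into the global buffer. */
typedef struct { bool valid; int data; } stage_t;

int main(void) {
    stage_t pipe[NUM_SW] = {{false, 0}};
    int pe_out[NUM_SW]      = {10, 20, 30, 40};        /* stand-in partial sums */
    bool pe_pending[NUM_SW] = {true, true, true, true};

    for (int cycle = 0; cycle < 2 * NUM_SW; cycle++) {
        /* drain: the last stage writes to the global buffer */
        if (pipe[NUM_SW - 1].valid)
            printf("cycle %d: GBM receives %d\n", cycle, pipe[NUM_SW - 1].data);

        /* shift stages toward the GBM */
        for (int s = NUM_SW - 1; s > 0; s--)
            pipe[s] = pipe[s - 1];
        pipe[0].valid = false;

        /* each switch injects its local PE output into any empty slot */
        for (int s = 0; s < NUM_SW; s++) {
            if (!pipe[s].valid && pe_pending[s]) {
                pipe[s].valid = true;
                pipe[s].data  = pe_out[s];
                pe_pending[s] = false;
            }
        }
    }
    return 0;
}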
Routing: Local Traffic Linear single-cycle multi-hop (SMART*) network H. Kwon et al., OpenSMART: Single cycle-Multi-hop NoC Generator in BSV and Chisel, ISPASS 2017 T. Krishna et al., Breaking the On-Chip Latency Barrier Using SMART, HPCA 2013
Microarchitecture: Microswitches As spatial accelerators distribute computation over tiny PEs, we distribute communication over tiny switches
Microswitches: Top Switch Scatter traffic: logic only required in switches connected to the global buffer Gather traffic: logic only required if a switch has multiple gather inputs Local traffic (red boxes in the figure: only necessary in some of the switches)
Microswitches: Middle Switch Scatter traffic: logic only required in switches on the scatter tree Gather traffic Local traffic (red boxes in the figure: only necessary in some of the switches)
Microswitches: Bottom Switch Scatter traffic, gather traffic, and local traffic (red boxes in the figure: only necessary in some of the switches)
Topology: Microswitch Network Top switches (Lv 0), middle switches (Lv 1, Lv 2), bottom switches (Lv 3)
Outline Motivation Microswitch Network (Topology, Routing, Reconfiguration, Flow Control) Evaluations (Latency/Throughput, Area, Power, Energy) Conclusion
Scatter Network Reconfiguration Control registers: En_Up and En_Down (first bit: upper)
Scatter Network Reconfiguration Reconfiguration logic sets En_Up and En_Down (first bit: upper) by recursively checking for destination PEs in the upper/lower subtrees
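A minimal C sketch of this reconfiguration step, assuming a binary scatter tree and a destination bitmask: each internal switch's En_Up/En_Down bit is set only if the corresponding subtree contains a destination PE. The register names follow the slide, but the indexing, helper functions, and which half is called "upper" are illustrative.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_PES 8   /* illustrative leaf count; the real array is larger */

/* Per-switch scatter enables, following the En_Up / En_Down register names. */
typedef struct { bool en_up; bool en_down; } scatter_cfg_t;

static bool subtree_has_dest(uint32_t dest_mask, int lo, int hi) {
    for (int pe = lo; pe < hi; pe++)
        if (dest_mask & (1u << pe))
            return true;
    return false;
}

/* Recursively visit the binary scatter tree and fill in one config entry per
 * internal switch (indexed in pre-order here purely for illustration). */
static int configure(uint32_t dest_mask, int lo, int hi,
                     scatter_cfg_t *cfg, int idx) {
    if (hi - lo <= 1)
        return idx;                  /* leaf: attached directly to a PE */
    int mid = (lo + hi) / 2;
    cfg[idx].en_up   = subtree_has_dest(dest_mask, lo, mid);
    cfg[idx].en_down = subtree_has_dest(dest_mask, mid, hi);
    idx++;
    idx = configure(dest_mask, lo, mid, cfg, idx);
    idx = configure(dest_mask, mid, hi, cfg, idx);
    return idx;
}

int main(void) {
    scatter_cfg_t cfg[NUM_PES] = {{false, false}};  /* N leaves -> N-1 internal switches */
    int used = configure(0x23u, 0, NUM_PES, cfg, 0);  /* destinations: PEs 0, 1, 5 */
    for (int i = 0; i < used; i++)
        printf("switch %d: en_up=%d en_down=%d\n", i, cfg[i].en_up, cfg[i].en_down);
    return 0;
}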
Scatter Network Reconfiguration Reconfiguration logic
Local Network: Linear SMART Dynamic traffic control or static traffic control H. Kwon et al., OpenSMART: Single cycle-Multi-hop NoC Generator in BSV and Chisel, ISPASS 2017 T. Krishna et al., Breaking the On-Chip Latency Barrier Using SMART, HPCA 2013 Supports multiple multi-hop local communications simultaneously
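A minimal C sketch of the single-cycle multi-hop idea, assuming each local microswitch can be preset to bypass so that a flit crosses up to HPC_MAX switches combinationally in one cycle and latches only at the end of the bypass chain; HPC_MAX, the linear placement, and the cycle model are illustrative, not the exact SMART implementation.

#include <stdio.h>

#define NUM_SW   8   /* microswitches along one local (linear) network */
#define HPC_MAX  4   /* illustrative max hops per cycle on the bypass path */

/* Count the cycles needed to move a flit from switch src to switch dst when
 * intermediate switches are preconfigured to bypass: up to HPC_MAX switches
 * are crossed combinationally per cycle, and the flit is latched only where
 * the bypass chain ends. */
static int local_hop_cycles(int src, int dst) {
    int hops = dst - src;             /* assume dst > src on the linear network */
    int cycles = 0;
    while (hops > 0) {
        int advance = (hops < HPC_MAX) ? hops : HPC_MAX;
        hops -= advance;              /* one multi-hop traversal per cycle */
        cycles++;
    }
    return cycles;
}

int main(void) {
    printf("PE0 -> PE7: %d cycle(s) with bypass, %d cycle(s) hop-by-hop\n",
           local_hop_cycles(0, 7), 7);
    return 0;
}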
Reconfiguration Policy* Scatter tree: coarse-grained (epoch-by-epoch) or fine-grained (cycle-by-cycle for each data) Gather: no reconfiguration (flow control-based, pipelined) Local: static (compiler-based) or dynamic (traffic-based) The goal is to support all possible dataflows (*accelerator dependent)
Flow Control Scatter network: on/off flow control Gather network: on/off flow control between microswitches Local network: dynamic flow control (global arbiter-based) or static flow control (SMART*) *SMART flow control: H. Kwon et al., OpenSMART: Single cycle-Multi-hop NoC Generator in BSV and Chisel, ISPASS 2017; T. Krishna et al., Breaking the On-Chip Latency Barrier Using SMART, HPCA 2013
Flow Control Why not credit-based flow control? Microswitches are tiny and close together, so on/off signals arrive with low latency over short distances, and removing credit registers reduces overhead
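A minimal C sketch of on/off flow control between two adjacent microswitches, assuming the receiver asserts an on signal while it has buffer space and the sender transmits only when the signal is high; the buffer depth and drain rate are illustrative. Unlike credit-based flow control, the sender keeps no credit counters.

#include <stdbool.h>
#include <stdio.h>

#define BUF_DEPTH 2   /* illustrative receiver buffer depth */

typedef struct { int occupancy; } rx_switch_t;   /* flit payloads omitted for brevity */

/* On/off flow control: the "on" signal is purely a function of the receiver's
 * occupancy, sampled by the sender every cycle; no credit counters are kept. */
static bool rx_on(const rx_switch_t *rx) { return rx->occupancy < BUF_DEPTH; }

int main(void) {
    rx_switch_t rx = { .occupancy = 0 };
    int to_send = 5, sent = 0, drained = 0;

    for (int cycle = 0; cycle < 12; cycle++) {
        /* sender side: transmit one flit only while the on signal is high */
        if (sent < to_send && rx_on(&rx)) {
            rx.occupancy++;
            printf("cycle %d: sent flit %d\n", cycle, sent++);
        } else if (sent < to_send) {
            printf("cycle %d: stalled (receiver signals off)\n", cycle);
        }
        /* receiver side: drain one flit every other cycle to create backpressure */
        if ((cycle % 2) && rx.occupancy > 0) {
            rx.occupancy--;
            drained++;
        }
    }
    printf("delivered %d of %d flits\n", drained, to_send);
    return 0;
}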
Outline Motivation Microswitch Network (Topology, Routing, Reconfiguration, Flow Control) Evaluations (Latency/Throughput, Area, Power, Energy) Conclusion
Evaluation Environment Target neural network: AlexNet Implementation: RTL written in Bluespec System Verilog (BSV) Accelerator dataflow: weight-stationary (no local traffic) and row-stationary (exploits local traffic)* Latency measurement: RTL simulation of the BSV implementation using Bluesim Synthesis tool: Synopsys Design Compiler Standard cell library: NanGate 15nm PDK Baseline NoCs: bus, tree, crossbar, mesh, and H-Mesh PE delay: 1 cycle * Y. Chen et al., "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," ISCA 2016
Evaluations Latency (entire AlexNet convolutional layers) The most congested links are identified from average link utilization The microswitch NoC reduces latency by 61% compared to the mesh
Evaluations Area The microswitch NoC requires only 16% of the mesh area
Evaluations Power The microswitch NoC consumes only 12% of the mesh power
Evaluations Energy Buses always broadcast, even for unicast traffic; the microswitch NoC activates only the necessary links
Conclusion Traditional NoCs are not optimal for the traffic in spatial accelerators because they are tailored for random cache-coherence traffic in CMPs The microswitch NoC is a scalable solution for all four goals of latency, throughput, area, and energy, while traditional NoCs achieve only one or two of them The microswitch NoC also provides reconfigurability, so it can support the dynamism across neural network layers
Conclusion The microswitch NoC is applicable to any spatial accelerator (e.g., cryptography, graph processing) The microswitch NoC will be available as open source; please sign up via this link: http://synergy.ece.gatech.edu/tools/microswitch-noc/ For general-purpose NoCs, OpenSMART is available: http://synergy.ece.gatech.edu/tools/opensmart Thank you!