CS295: Modern Systems: Application Case Study – Neural Network Accelerator 2
Sang-Woo Jun, Spring 2019
Many slides adapted from Hyoukjun Kwon's Georgia Tech "Designing CNN Accelerators" and Joel Emer et al., "Hardware Architectures for Deep Neural Networks," tutorial from ISCA 2017

Beware/Disclaimer: This field is advancing very quickly and is messy right now. Lots of papers and implementations keep beating each other, with seemingly contradictory results. Eyes wide open!

The Need For Neural Network Accelerators
Remember: "VGG-16 requires 138 million weights and 15.5G MACs to process one 224 × 224 input image"
A CPU at 3 GHz and 1 IPC (3 Giga Operations Per Second, or GOPS) takes 5+ seconds per image
Also significant power consumption! (Optimistically assuming 3 GOPS/thread with 8 threads using 100 W: 0.24 GOPS/W)
* Old data (2011), and performance varies greatly by implementation, with some reporting 3+ GOPS/thread on an i7; the trend is still mostly true!
Farabet et al., "NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision"
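
As a sanity check on the numbers above, here is a quick back-of-the-envelope calculation using the slide's own assumptions (15.5 GMACs per image, a CPU sustaining roughly 3 GOPS, 8 threads at about 100 W); counting one MAC as one operation is an assumption on my part.

```python
# Back-of-the-envelope check of the VGG-16 numbers above.
# Assumptions: 15.5 GMACs per 224x224 image, a CPU sustaining ~3 GOPS,
# and one MAC counted as one operation (counting it as two roughly doubles the time).
macs_per_image = 15.5e9
cpu_gops = 3e9                      # 3 GHz x 1 IPC
seconds_per_image = macs_per_image / cpu_gops
print(f"{seconds_per_image:.1f} s/image")    # ~5.2 s, consistent with "5+ seconds"

# Power efficiency: 8 threads x 3 GOPS/thread at ~100 W
gops_per_watt = (8 * 3) / 100
print(f"{gops_per_watt:.2f} GOPS/W")         # 0.24 GOPS/W
```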

Two Major Layers: Convolution and Fully-Connected
Convolution layer: many small (1x1, 3x3, 11x11, …) filters; a small number of weights per filter and relatively few weights in total vs. FC, yet over 90% of the MAC operations in a typical model
Fully-connected layer: N-to-N connections between all neurons, a large number of weights
(Figure: convolution applies filters to an input map to produce an output map; FC multiplies a weight matrix by an input vector to produce an output vector)
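
To make the asymmetry concrete, the sketch below counts weights and MACs for one convolution layer and one FC layer; the layer sizes are illustrative, not taken from any specific network. The convolution reuses each weight at every output position, while FC uses each weight exactly once.

```python
# Illustrative weight/MAC counts for one convolution layer vs. one FC layer.
# The layer sizes below are hypothetical examples, not from a specific model.

def conv_stats(c_in, c_out, k, h_out, w_out):
    """C_in x C_out filters of size KxK applied at every position of an H x W output map."""
    weights = c_in * c_out * k * k
    macs = weights * h_out * w_out   # every weight is reused at every output position
    return weights, macs

def fc_stats(n_in, n_out):
    """Every input neuron connects to every output neuron; each weight is used once."""
    weights = n_in * n_out
    macs = weights
    return weights, macs

print(conv_stats(64, 64, 3, 224, 224))   # ~37K weights, ~1.8G MACs
print(fc_stats(25088, 4096))             # ~103M weights, MACs = weights
```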

Spatial Mapping of Compute Units
Typically a 2D matrix of Processing Elements (PEs); each PE is a simple multiply-accumulator
An extremely large number of PEs gives very high peak throughput!
Is memory the bottleneck (again)?
(Figure: memory feeding a 2D grid of processing elements)

Memory Access is (Typically) the Bottleneck (Again)
100 GOPS requires over 300 billion weight/activation accesses; assuming 4-byte floats, that is 1.2 TB/s of memory traffic
AlexNet requires 724 million MACs to process a 227 x 227 image, over 2 billion weight/activation accesses; assuming 4-byte floats, that is over 8 GB of memory accesses per image, or 240 GB/s to hit 30 frames per second
An interesting question: can CPUs achieve this kind of performance? Maybe, but not at low power
"About 35% of cycles are spent waiting for weights to load from memory into the matrix unit …" – Jouppi et al., Google TPU
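
The bandwidth figures follow directly from the slide's stated assumptions (4-byte floats, roughly three weight/activation accesses per operation, no on-chip reuse); a quick check:

```python
# Rough bandwidth math behind the numbers above, under the stated assumptions
# (4-byte floats, roughly 3 weight/activation accesses per operation, no on-chip reuse).
ops_per_sec = 100e9                  # 100 GOPS target
accesses_per_op = 3                  # "over 300 billion accesses" for 100 GOPS
bytes_per_value = 4                  # fp32
print(f"{ops_per_sec * accesses_per_op * bytes_per_value / 1e12:.1f} TB/s")  # ~1.2 TB/s

# AlexNet per-frame figures from the slide
accesses_per_image = 2e9
gb_per_image = accesses_per_image * bytes_per_value / 1e9
print(f"{gb_per_image:.0f} GB/image, {gb_per_image * 30:.0f} GB/s at 30 fps")  # 8 GB, 240 GB/s
```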

Spatial Mapping of Compute Units 2
Optimization 1: an on-chip network moves data (weights/activations/outputs) between PEs and memory for reuse
Optimization 2: small, local memory on each PE, typically a register file, a special type of memory with zero-cycle latency but high spatial overhead
Cache invalidation/work assignment… how? Computation is very regular and predictable
(Figure: memory feeding processing elements, each with its own register file)
A class of accelerators deals only with problems that fit entirely in on-chip memory; this distinction is important.

Different Strategies of Data Reuse
Weight Stationary: try to maximize local weight reuse
Output Stationary: try to maximize local partial sum reuse
Row Stationary: try to maximize inter-PE data reuse of all kinds
No Local Reuse: a single (or few) global on-chip buffer, with no per-PE register file and its space/power overhead
Terminology from Sze et al., "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE 2017

Weight Stationary
Keep weights cached in PE register files; effective for convolution, especially if all weights can fit in the PEs
Each activation is broadcast to all PEs, and the computed partial sum is forwarded to other PEs to complete the computation
Intuition: each PE works on an adjacent position of an input row
(Figure: weight-stationary convolution for one row; the partial sum of a previous activation row, if any, flows in, and the partial sum is stored for the next activation row or emitted as the final sum)
Examples: nn-X, NeuFlow, and others
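
A minimal Python sketch of this dataflow is shown below; it is a functional model only, with the `pe` loop standing in for physical PEs, and is not a description of any particular accelerator.

```python
# Minimal functional sketch (not any accelerator's actual implementation) of a
# weight-stationary dataflow for one 1-D convolution row: the loop over `pe` stands in
# for physical PEs that each hold one filter weight in a local register file, every
# activation is broadcast to all PEs, and partial sums accumulate per output position.
def weight_stationary_row(activations, weights):
    num_pes = len(weights)                   # one PE per filter weight
    out_len = len(activations) - num_pes + 1
    psums = [0.0] * out_len                  # partial sums forwarded between PEs
    for x_idx, x in enumerate(activations):  # broadcast one activation at a time
        for pe in range(num_pes):            # PE `pe` keeps weights[pe] stationary
            o = x_idx - pe                   # output position this PE contributes to
            if 0 <= o < out_len:
                psums[o] += weights[pe] * x
    return psums

print(weight_stationary_row([1, 2, 3, 4, 5], [1, 0, -1]))   # [-2.0, -2.0, -2.0]
```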

Output Stationary
Keep partial sums cached in the PEs, working on a subset of the outputs at a time
Effective for FC layers, where each output depends on many inputs/weights; also for convolution layers with too many weights to keep stationary
Each weight is broadcast to all PEs, and the input is relayed to neighboring PEs
Intuition: each PE works on an adjacent position in an output sub-space
(Figure: weights × input vector = output vector, with a subset of the output cached in the PEs)
Examples: ShiDianNao, and others
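
Below is a similarly minimal, assumed sketch of the output-stationary idea for an FC layer: each PE owns one output neuron and keeps its partial sum in a local register while input elements stream past, so outputs leave the PEs only once, at the end.

```python
# Minimal functional sketch (assumed, simplified) of an output-stationary dataflow
# for an FC layer: one PE per output neuron, with its partial sum kept stationary.
def output_stationary_fc(weights, inputs):
    num_pes = len(weights)                   # one PE (and one accumulator) per output
    psums = [0.0] * num_pes                  # stationary partial sums
    for j, x in enumerate(inputs):           # stream/relay one input element at a time
        for pe in range(num_pes):            # every PE consumes the same input element
            psums[pe] += weights[pe][j] * x  # weights[pe] is the row for output `pe`
    return psums

W = [[1, 2, 3],
     [4, 5, 6]]
print(output_stationary_fc(W, [1, 1, 1]))    # [6.0, 15.0]
```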

Row Stationary
Keep as much data related to the same filter row cached as possible, across PEs: filter weights, inputs, outputs…
Not much reuse within a single PE; within a PE the dataflow is weight stationary if the filter row fits in the register file
Examples: Eyeriss, and others

Row Stationary
Lots of reuse across different PEs: filter rows are reused horizontally, input rows diagonally, and partial sums vertically
Even further reuse by interleaving multiple input rows and multiple filter rows
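
A rough functional sketch of this pattern is below; it is simplified from the description above, and the mapping details of real row-stationary designs such as Eyeriss differ.

```python
# Rough functional sketch of the row-stationary pattern described above (simplified,
# not the real Eyeriss mapping): each (filter row, input row) pair is a 1-D convolution
# handled by one PE, and partial sums from PEs in the same column are added vertically
# to produce one output row.
def conv1d(x_row, w_row):
    k = len(w_row)
    return [sum(w_row[i] * x_row[o + i] for i in range(k))
            for o in range(len(x_row) - k + 1)]

def row_stationary_conv2d(inp, filt):
    filter_rows = len(filt)
    out_rows = len(inp) - filter_rows + 1
    output = []
    for o in range(out_rows):                # one column of PEs per output row
        # PE (r, o) holds filter row r (reused across PE columns) and sees input row o+r
        row_psums = [conv1d(inp[o + r], filt[r]) for r in range(filter_rows)]
        output.append([sum(vals) for vals in zip(*row_psums)])   # vertical accumulation
    return output

inp = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
filt = [[1, 0], [0, 1]]
print(row_stationary_conv2d(inp, filt))      # [[7, 9, 11], [15, 17, 19]]
```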

No Local Reuse
While in-PE register files are fast and power-efficient, they are not space-efficient
Instead of distributed register files, use the space to build a much larger global buffer and read/write everything from there
Examples: Google TPU, and others

Google TPU Architecture

Static Resource Mapping
Sze et al., "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE 2017

Map And Fold For Efficient Use of Hardware
Requires a flexible on-chip network
Sze et al., "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE 2017

Overhead of Network-on-Chip Architectures
(Figure: throughput/overhead comparison of on-chip networks connecting PEs: bus, crossbar switch, the Eyeriss network, and a PE mesh)

Power Efficiency Comparisons
Any of the presented architectures reduces memory pressure enough that memory access is no longer the dominant bottleneck
Now what matters is power efficiency; the goal becomes reducing DRAM accesses as much as possible!
Joel Emer et al., "Hardware Architectures for Deep Neural Networks," tutorial from ISCA 2017

Power Efficiency Comparisons
* Some papers report different numbers [1], where NLR (no local reuse) with a carefully designed global on-chip memory hierarchy is superior.
[1] Yang et al., "DNN Dataflow Choice Is Overrated," arXiv 2018
Sze et al., "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE 2017

Power Consumption Comparison Between Convolution and FC Layers
Data reuse in FC is inherently low; unless we have enough on-chip buffers to keep all weights, systems methods are not going to be enough
Sze et al., "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE 2017

Next: Model Compression