ADVANCED COMPUTER ARCHITECTURE: ML Accelerators
Samira Khan, University of Virginia, Feb 6, 2019
The content and concept of this course are adapted from CMU ECE 740
AGENDA
Review from last lecture
ML accelerators
LOGISTICS
Project list posted on Piazza
Please talk to me if you do not know what to pick
Please talk to me if you have some idea for the project
Sample project proposals from many different years are available
Project Proposal due on Feb 11, 2019
Project Proposal Presentations: Feb 13, 2019
You can present using your own laptop
Groups: 1 or 2 students
Multicore Decade?
We have relied on multicore scaling for over five years: Pentium Extreme (dual-core), Core 2 Quad (quad-core), i7 980X (hex-core). How much longer will it be our primary performance scaling technique?
Why Diminishing Returns?
Transistor area is still scaling
Voltage and capacitance scaling have slowed
Result: designs are power-limited, not area-limited
Dark Silicon
Sources of dark silicon: power + limited parallelism
[Figure: the fraction of dark silicon grows from the 22 nm node to the 8 nm node]
Conclusions
Multicore performance gains are limited
Need at least 18%-40% per generation from architecture alone, without additional power
Unicore era → multicore era → accelerators??
NN Accelerators
What is Deep Learning?
[Figure: an image classified as “Volvo XC90” by a deep neural network]
Image Source: [Lee et al., Comm. ACM 2011]
Why is Deep Learning Hot Now?
Big data availability: 350M images uploaded per day, 2.5 petabytes of customer data collected hourly, 300 hours of video uploaded every minute
GPU acceleration
New ML techniques
ImageNet: Image Classification Task
[Figure: top-5 classification error (%) over time; deep CNN-based designs bring a large error-rate reduction over hand-crafted feature-based designs and approach human-level error]
[Russakovsky et al., IJCV 2015]
Deep Convolutional Neural Networks
[Figure: a stack of CONV, NORM, and POOL layers extracts high-level features; FC layers map them to classes]
Convolutions account for more than 90% of overall computation, dominating runtime and energy consumption
Convolution (CONV) Layer
A plane of input activations, a.k.a. an input feature map (fmap), of size H x W, and a filter (weights) of size R x S
Convolution (CONV) Layer
The R x S filter is overlaid on the input fmap and multiplied element-wise with the activations it covers
Convolution (CONV) Layer
The element-wise products are accumulated as partial sums (psums) into one output activation of the E x F output fmap
Convolution (CONV) Layer
Sliding window processing: the filter slides across the input fmap, producing one output activation per position
Convolution (CONV) Layer
Many input channels (C): the input fmap and the filter each have C channels, and products are accumulated across channels
Convolution (CONV) Layer
Many output channels (M): applying M filters (each C x R x S) produces an output fmap with M channels
Convolution (CONV) Layer
Many input fmaps (N): a batch of N input fmaps is processed with the same filters, producing N output fmaps
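To tie the dimension symbols together, here is a minimal, naive sketch of the complete CONV-layer loop nest (my own illustration, not code from the lecture; it assumes stride 1 and no padding, and the function name is hypothetical):

```python
import numpy as np

def conv_layer(ifmaps, filters):
    """Naive CONV layer: ifmaps is N x C x H x W, filters is M x C x R x S.
    Returns ofmaps of shape N x M x E x F with E = H-R+1, F = W-S+1
    (stride 1, no padding assumed)."""
    N, C, H, W = ifmaps.shape
    M, _, R, S = filters.shape
    E, F = H - R + 1, W - S + 1
    ofmaps = np.zeros((N, M, E, F))
    for n in range(N):                    # batch of input fmaps
        for m in range(M):                # output channels (filters)
            for e in range(E):            # output rows
                for f in range(F):        # output cols (sliding window)
                    psum = 0.0            # partial-sum accumulation
                    for c in range(C):            # input channels
                        for r in range(R):        # filter rows
                            for s in range(S):    # filter cols
                                psum += ifmaps[n, c, e + r, f + s] * filters[m, c, r, s]
                    ofmaps[n, m, e, f] = psum
    return ofmaps
```

Every innermost iteration is one MAC; the dataflows discussed later differ only in how this loop nest is tiled, reordered, and mapped onto PEs and their local storage.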
What are the ways to take advantage of parallelism?
SIMD
Dataflow
Highly-Parallel Compute Paradigms
Temporal architecture (SIMD/SIMT): a memory hierarchy, register file, and centralized control feed a grid of ALUs
Spatial architecture (dataflow processing): a memory hierarchy feeds a grid of ALUs that pass data directly to one another
Memory Access is the Bottleneck
Each MAC (multiply-and-accumulate) requires a memory read of the filter weight, the fmap activation, and the partial sum, and a memory write of the updated partial sum
Memory Access is the Bottleneck
Worst case: all memory reads and writes are DRAM accesses
Example: AlexNet [NIPS 2012] has 724M MACs, so 2896M DRAM accesses would be required
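The 2896M figure follows from the per-MAC access pattern two slides back (three reads plus one write per MAC); a quick sanity check:

```python
macs = 724e6                    # AlexNet MAC count from the slide
accesses_per_mac = 3 + 1        # read weight, activation, psum; write updated psum
dram_accesses = macs * accesses_per_mac
print(dram_accesses / 1e6)      # 2896.0 -> 2896M worst-case DRAM accesses
```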
What are the ways to address the memory bottleneck?
Memory hierarchy
Exploit locality
Memory Access is the Bottleneck
Insert extra levels of local memory hierarchy between DRAM and the ALU (DRAM → local memory → ALU → local memory → DRAM)
Memory Access is the Bottleneck
Extra levels of local memory hierarchy
Opportunity 1: data reuse
Types of Data Reuse in DNN
Convolutional reuse: CONV layers only (sliding window); reuses activations and filter weights
Types of Data Reuse in DNN
Convolutional reuse: CONV layers only (sliding window); reuses activations and filter weights
Fmap reuse: CONV and FC layers; the same activations are reused across multiple filters
Types of Data Reuse in DNN
Convolutional reuse: CONV layers only (sliding window); reuses activations and filter weights
Fmap reuse: CONV and FC layers; the same activations are reused across multiple filters
Filter reuse: CONV and FC layers (batch size > 1); the same filter weights are reused across multiple input fmaps
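As a rough way to quantify these three reuse types, here is a small counting sketch of my own (it ignores stride and edge effects; the function name and the example dimensions, loosely based on AlexNet's first CONV layer with a batch of 4, are assumptions, not values from the slides):

```python
def reuse_factors(N, M, C, E, F, R, S):
    """Approximate number of MACs that touch each piece of data in a CONV layer."""
    weight_reuse = E * F * N        # convolutional reuse (E*F positions) x filter reuse (N fmaps)
    activation_reuse = R * S * M    # convolutional reuse (R*S taps) x fmap reuse (M filters)
    psum_accumulations = C * R * S  # accumulations into each output activation
    return weight_reuse, activation_reuse, psum_accumulations

# Roughly AlexNet-CONV1-like dimensions with a batch of 4 (illustrative only)
print(reuse_factors(N=4, M=96, C=3, E=55, F=55, R=11, S=11))  # (12100, 11616, 363)
```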
Memory Access is the Bottleneck
Extra levels of local memory hierarchy
Opportunity 1: data reuse
1) Can reduce DRAM reads of filter/fmap by up to 500× (AlexNet CONV layers)
Memory Access is the Bottleneck
Extra levels of local memory hierarchy
Opportunities: 1) data reuse, 2) local accumulation
1) Can reduce DRAM reads of filter/fmap by up to 500×
2) Partial sum accumulation does NOT have to access DRAM
Memory Access is the Bottleneck
Extra levels of local memory hierarchy
Opportunities: 1) data reuse, 2) local accumulation
1) Can reduce DRAM reads of filter/fmap by up to 500×
2) Partial sum accumulation does NOT have to access DRAM
Example: DRAM accesses in AlexNet can be reduced from 2896M to 61M (best case)
Spatial Architecture for DNN
Local memory hierarchy: DRAM → global buffer (100 – 500 kB) → direct inter-PE network → PE-local register file (RF, 0.5 – 1.0 kB) inside each processing element (PE)
Low-Cost Local Data Access
How to exploit 1) data reuse and 2) local accumulation with limited, low-cost local storage? A specialized processing dataflow is required!
Normalized energy cost per access (relative to an ALU operation, measured from a commercial 65 nm process):
ALU: 1× (reference)
RF (0.5 – 1.0 kB): 1×
NoC (to another PE): 2×
Buffer (100 – 500 kB): 6×
DRAM: 200×
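These relative costs can be folded into a toy energy model for comparing dataflows; this is an illustrative sketch using the slide's normalized numbers, not a tool from the lecture:

```python
# Normalized energy per access, relative to one ALU (MAC) operation, from the slide above
ENERGY = {"ALU": 1, "RF": 1, "NoC": 2, "buffer": 6, "DRAM": 200}

def data_movement_energy(access_counts):
    """access_counts maps a storage level to its number of accesses,
    e.g. {"RF": 1_000_000, "buffer": 10_000, "DRAM": 1_000}."""
    return sum(ENERGY[level] * count for level, count in access_counts.items())

# A dataflow that serves most accesses from the RF wins by a wide margin:
print(data_movement_energy({"RF": 1_000_000, "DRAM": 1_000}))   # 1,200,000
print(data_movement_energy({"DRAM": 1_000_000}))                # 200,000,000
```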
Dataflow Taxonomy
Weight Stationary (WS)
Output Stationary (OS)
No Local Reuse (NLR)
[Chen et al., ISCA 2016]
Weight Stationary (WS)
Minimize weight-read energy consumption: maximize convolutional and filter reuse of weights
Broadcast activations and accumulate psums spatially across the PE array; each PE (W0-W7) holds its weight locally while activations and psums move between the PEs and the global buffer
[NeuFlow, Farabet et al., ICCV 2009]
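As a mental model, the weight-stationary schedule for a 1D convolution can be written as a loop nest in which each weight is fetched once and then reused against every activation that needs it. This is a simplified, single-PE sketch of mine, not NeuFlow's actual implementation:

```python
def weight_stationary_1d(x, w):
    """Weight-stationary schedule for y[i] = sum_p w[p] * x[i + p]:
    each weight is read once, held 'stationary', and reused across all
    output positions, while psums accumulate in the output buffer."""
    n_out = len(x) - len(w) + 1
    y = [0.0] * n_out
    for p, weight in enumerate(w):      # weight stays fixed in the PE register...
        for i in range(n_out):          # ...while activations stream past it
            y[i] += weight * x[i + p]   # psum is read and written on every MAC
    return y
```

In the real PE array, the accumulation over p happens spatially (psums flow from PE to PE) and the activations are broadcast to all PEs, which is what the pros and cons on the following WS slides refer to.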
Weight Stationary (WS)
Pros: good design if weights are significant
Cons: ?
Summary of Popular DNNs

Metrics              LeNet-5   AlexNet    VGG-16    GoogLeNet (v1)   ResNet-50
Top-5 error (%)      n/a       16.4       7.4       6.7              5.3
Input Size           28x28     227x227    224x224   224x224          224x224
# of CONV Layers     2         5          16        21 (depth)       49
Filter Sizes         5         3, 5, 11   3         1, 3, 5, 7       1, 3, 7
# of Channels        1, 6      —          —         —                —
# of Filters         6, 16     —          —         —                —
Stride               1         1, 4       1         1, 2             1, 2
# of Weights (CONV)  2.6k      2.3M       14.7M     6.0M             23.5M
# of MACs (CONV)     283k      666M       15.3G     1.43G            3.86G
# of Weights (FC)    58k       58.6M      124M      1M               2M
Total Weights        60k       61M        138M      7M               25.5M
Total MACs           341k      724M       15.5G     —                3.9G
Weight Stationary (WS)
Pros:
Good design if weights are significant
Reuses partial sums (ofmaps)
Cons:
Has to broadcast activations (ifmaps) and move psums (ofmaps)
Activations (ifmaps) are not that small
Broadcast is expensive in terms of power and performance
Output Stationary (OS)
Minimize partial-sum read/write energy consumption: maximize local accumulation
Broadcast/multicast filter weights and reuse activations spatially across the PE array; each PE (P0-P7) keeps its output's psum locally
[ShiDianNao, Du et al., ISCA 2015]
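In the same spirit, here is a simplified single-PE sketch of an output-stationary schedule (mine, not ShiDianNao's implementation): the partial sum stays in a local register until it is complete, while weights and activations are streamed past it:

```python
def output_stationary_1d(x, w):
    """Output-stationary schedule for y[i] = sum_p w[p] * x[i + p]:
    each output's psum stays in a local register until complete and is
    written out exactly once; weights/activations are re-fetched instead."""
    n_out = len(x) - len(w) + 1
    y = [0.0] * n_out
    for i in range(n_out):                 # one output at a time...
        psum = 0.0                         # ...its psum lives in a register
        for p, weight in enumerate(w):     # stream weights and activations
            psum += weight * x[i + p]
        y[i] = psum                        # single write of the finished output
    return y
```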
Output Stationary (OS)
Pros: reuses partial sums; this matters because the number of psum reads/writes (one per MAC operation) is far higher than the number of weight reads
Cons: ?
Summary of Popular DNNs (table repeated from above)
Output Stationary (OS)
Pros:
Reuses partial sums; this matters because the number of psum reads/writes (one per MAC operation) is far higher than the number of weight reads
Activations are passed through each PE, eliminating memory reads for them
Cons:
Weights need to be broadcast!
No Local Reuse (NLR)
Use a large global buffer (no local storage in the PEs) to reduce DRAM access energy consumption
Multicast activations, single-cast weights, and accumulate psums spatially across the PE array
No Local Reuse (NLR)
Pros:
Still has the advantage of not reading/writing all partial sums to DRAM
Avoids broadcast operations
Cons:
No local reuse
A large global buffer is expensive!
Still needs to perform multicast operations
[TPU, Jouppi et al., ISCA 2017]
Energy Efficiency Comparison
Same total area, AlexNet CONV layers, 256 PEs, batch size = 16
[Figure: normalized energy/MAC for CNN dataflows — three variants of OS (OSA, OSB, OSC), WS, and NLR]
[Chen et al., ISCA 2016]
Energy Efficiency Comparison
Same total area, AlexNet CONV layers, 256 PEs, batch size = 16
[Figure: normalized energy/MAC for CNN dataflows — OS variants (OSA, OSB, OSC), WS, NLR, and Row Stationary]
[Chen et al., ISCA 2016]
Energy-Efficient Dataflow: Row Stationary (RS)
Maximize reuse and accumulation at the RF
Optimize for overall energy efficiency instead of for only a certain data type
[Chen et al., ISCA 2016]
Summary of Popular DNNs (table repeated from above)
Goals
1. The number of MAC operations is significant; want to maximize reuse of psums
2. At the same time, want to maximize reuse of the weights that are used to calculate the psums
Row Stationary: Energy-Efficient Dataflow
Input fmap * filter = output fmap
1D Row Convolution in PE
One row of the input fmap and one filter row are loaded into the PE's register file to produce a row of partial sums
1D Row Convolution in PE
Step 1: the filter row is multiplied with the first window of the fmap row and accumulated into the first partial sum
1D Row Convolution in PE
Step 2: the window slides by one activation and the same filter row produces the second partial sum
1D Row Convolution in PE
Step 3: the window slides again and the filter row produces the third partial sum
1D Row Convolution in PE
Maximize row convolutional reuse in the RF: keep a filter row and the fmap sliding window in the RF
Maximize row psum accumulation in the RF
1D Row Convolution in PE
Maximize row convolutional reuse in the RF: keep a filter row and the fmap sliding window in the RF
Maximize row psum accumulation in the RF
Pros:
Maximizes the reuse of partial sums
Also has some reuse of weights
Cons:
How to orchestrate the activations and weights?
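A small sketch (mine, not the lecture's code) of what one PE does in this 1D row-convolution step: the filter row and a sliding window of the fmap row sit in the register file, and the psum row is accumulated locally:

```python
def row_stationary_1d(x_row, w_row):
    """1D row convolution inside one PE: filter row + fmap sliding window in
    the RF, psum row accumulated locally before being passed on."""
    R = len(w_row)
    n_out = len(x_row) - R + 1
    psums = [0.0] * n_out
    window = list(x_row[:R])              # fmap sliding window held in the RF
    for i in range(n_out):
        for p in range(R):                # filter row reused from the RF
            psums[i] += w_row[p] * window[p]
        if i + 1 < n_out:                 # slide the window by one activation
            window = window[1:] + [x_row[i + R]]
    return psums

print(row_stationary_1d([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2.0, -2.0, -2.0]
```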
2D Row Convolution in PE
PE 1: filter Row 1 * fmap Row 1
2D Row Convolution in PE
Output row 1 = filter Row 1 * fmap Row 1 + filter Row 2 * fmap Row 2 + filter Row 3 * fmap Row 3 (accumulated down one PE column)
2D Row Convolution in PE
A second PE column computes output row 2: filter Row 1 * fmap Row 2 + filter Row 2 * fmap Row 3 + filter Row 3 * fmap Row 4
2D Row Convolution in PE
A third PE column computes output row 3: filter Row 1 * fmap Row 3 + filter Row 2 * fmap Row 4 + filter Row 3 * fmap Row 5
Convolutional Reuse Maximized
Filter rows are reused across PEs horizontally (each PE row of the array holds one filter row)
Convolutional Reuse Maximized
Fmap rows are reused across PEs diagonally (the same fmap row feeds PEs along a diagonal of the array)
Maximize 2D Accumulation in PE Array
Partial sums accumulate across PEs vertically (each PE column sums its psums into one output row)
2D Row Convolution in PE
Filter rows are reused across PEs horizontally
Fmap rows are reused across PEs diagonally
Partial sums accumulate across PEs vertically
Pros:
2D row convolution avoids reading/writing psums to the global buffer; each PE passes its psum directly to the next PE
Also passes filter rows and fmap rows along to the next PEs
Cons:
How to orchestrate the psums, activations, and weights?
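To make the 2D mapping concrete, here is a self-contained sketch of my own of the logical row-stationary computation: the PE at (filter row r, output row e) convolves filter row r with fmap row e+r, and the psum rows are summed down each PE column into output row e:

```python
def conv1d_row(x_row, w_row):
    """1D row convolution performed inside one PE (filter row and fmap window in the RF)."""
    R = len(w_row)
    return [sum(w_row[p] * x_row[i + p] for p in range(R))
            for i in range(len(x_row) - R + 1)]

def row_stationary_2d(fmap, filt):
    """Logical row-stationary mapping: PE at (filter row r, output row e)
    convolves filter row r with fmap row e+r; psum rows accumulate vertically."""
    R = len(filt)                 # number of filter rows = PE rows
    E = len(fmap) - R + 1         # number of output rows = PE columns
    out_rows = []
    for e in range(E):            # one PE column per output row
        col_psum = conv1d_row(fmap[e], filt[0])            # PE in row 0 of this column
        for r in range(1, R):                              # remaining PEs in the column
            pe_psum = conv1d_row(fmap[e + r], filt[r])
            col_psum = [a + b for a, b in zip(col_psum, pe_psum)]  # vertical accumulation
        out_rows.append(col_psum)
    return out_rows

# Example: a 3x3 filter over a 5x5 fmap yields a 3x3 output
fmap = [[1, 0, 2, 1, 0], [0, 1, 1, 0, 2], [2, 1, 0, 1, 1], [1, 2, 1, 0, 0], [0, 1, 2, 1, 1]]
filt = [[1, 0, -1], [0, 1, 0], [-1, 0, 1]]
print(row_stationary_2d(fmap, filt))
```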
Dimensions Beyond 2D Convolution
1) Multiple fmaps
2) Multiple filters
3) Multiple channels
Filter Reuse in PE (1: multiple fmaps)
Filter 1, channel 1, row 1 * fmap 1, row 1 = psum 1, row 1
Filter 1, channel 1, row 1 * fmap 2, row 1 = psum 2, row 1
Filter Reuse in PE (1: multiple fmaps)
Fmap 1 and fmap 2 share the same filter row
Filter Reuse in PE (1: multiple fmaps)
Fmap 1 and fmap 2 share the same filter row
Processing in PE: concatenate the fmap rows, so filter 1, row 1 * (fmap 1 & 2, row 1) = (psum 1 & 2, row 1)
Fmap Reuse in PE (2: multiple filters)
Filter 1, channel 1, row 1 * fmap 1, row 1 = psum 1, row 1
Filter 2, channel 1, row 1 * fmap 1, row 1 = psum 2, row 1
Fmap Reuse in PE (2: multiple filters)
Filter 1 and filter 2 share the same fmap row
Fmap Reuse in PE (2: multiple filters)
Filter 1 and filter 2 share the same fmap row
Processing in PE: interleave the filter rows, so (filter 1 & 2, row 1) * fmap 1, row 1 = (psum 1 & 2, row 1)
Channel Accumulation in PE (3: multiple channels)
Filter 1, channel 1, row 1 * fmap 1, channel 1, row 1 = psum 1, row 1
Filter 1, channel 2, row 1 * fmap 1, channel 2, row 1 = psum 1, row 1
Channel Accumulation in PE (3: multiple channels)
The psum rows from channel 1 and channel 2 are accumulated: row 1 + row 1 = row 1
Channel Accumulation in PE (3: multiple channels)
Psum rows from different channels are accumulated
Processing in PE: interleave the channels, so filter 1 (channel 1 & 2) * fmap 1 (channel 1 & 2) = one psum row
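A minimal sketch (my own, with a hypothetical helper conv1d_row) of how one PE accumulates across interleaved channels: each channel's (filter row, fmap row) pair produces a psum row, and the rows are summed into a single psum row before leaving the PE:

```python
def conv1d_row(x_row, w_row):
    """1D row convolution inside one PE."""
    R = len(w_row)
    return [sum(w_row[p] * x_row[i + p] for p in range(R))
            for i in range(len(x_row) - R + 1)]

def pe_channel_accumulation(filt_rows_by_channel, fmap_rows_by_channel):
    """Interleave channels inside one PE: each channel's (filter row, fmap row)
    pair produces a psum row; the rows are summed into one psum row."""
    psum_row = None
    for w_row, x_row in zip(filt_rows_by_channel, fmap_rows_by_channel):
        row = conv1d_row(x_row, w_row)
        psum_row = row if psum_row is None else [a + b for a, b in zip(psum_row, row)]
    return psum_row

# Example: two channels, filter rows of width 3 over fmap rows of width 5
print(pe_channel_accumulation([[1, 0, -1], [0, 1, 0]],
                              [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]))  # [2, 1, 0]
```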
DNN Processing – The Full Picture
Multiple fmaps: filter 1 * (fmap 1 & 2) = (psum 1 & 2)
Multiple filters: (filter 1 & 2) * fmap 1 = (psum 1 & 2)
Multiple channels: filter 1 * fmap 1 (channels interleaved) = psum
Map rows from multiple fmaps, filters, and channels to the same PE to exploit other forms of reuse and local accumulation
Optimal Mapping in Row Stationary
An optimization compiler (mapper) takes the CNN configuration (N, M, C, H, R, E) and the hardware resources (global buffer, PE array, register files) and produces the row-stationary mapping: which filter rows, fmap rows, and channels each PE processes
[Chen et al., ISCA 2016]
Computer Architecture Analogy
DNN shape and size (program), dataflow (architecture), mapper (compiler), implementation details (µArch), mapping (binary), DNN accelerator (processor)
Compilation produces the mapping; execution takes input data and produces the processed data
[Chen et al., Micro Top-Picks 2017]