ADVANCED COMPUTER ARCHITECTURE: ML Accelerators
Samira Khan, University of Virginia
Feb 11, 2019
The content and concept of this course are adapted from CMU ECE 740
AGENDA
- Logistics
- Review from last lecture
- ML accelerators
- Branch prediction basics
LOGISTICS
- Most students talked to me about the project. Good job! Many interesting directions.
- Project proposal due on Feb 11, 2019. Need an extension? You have until Feb 14, 2019.
- Project proposal presentations: Feb 13, 2019. Unfortunately, no extension for the presentation.
NN Accelerator
- Convolution is the major bottleneck, characterized by high parallelism and high reuse
- Exploit high parallelism --> a large number of PEs
- Exploit data reuse --> maximize reuse in local PEs; next, maximize reuse in neighboring PEs; minimize accessing the global buffer
- Exploit data types, sizes, access patterns, etc.
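To make the parallelism and reuse concrete, here is a minimal single-channel 2D convolution loop nest in C (a sketch; the sizes and names are illustrative, not from the lecture's materials): every output position is independent, and every weight and input pixel is touched many times.

    enum { H = 8, W = 8, R = 3, S = 3, E = H - R + 1, F = W - S + 1 };

    /* Minimal single-channel 2D convolution: an R x S filter slides over
     * an H x W input fmap and produces an E x F output fmap. */
    void conv2d(const float in[H][W], const float w[R][S], float out[E][F])
    {
        for (int e = 0; e < E; e++)        /* every (e, f) output is      */
            for (int f = 0; f < F; f++) {  /* independent --> parallelism */
                float psum = 0.0f;
                for (int r = 0; r < R; r++)        /* each weight is reused   */
                    for (int s = 0; s < S; s++)    /* E*F times; each input   */
                        psum += in[e + r][f + s]   /* pixel feeds up to R*S   */
                              * w[r][s];           /* outputs --> reuse       */
                out[e][f] = psum;
            }
    }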
Spatial Architecture for DNN
[Figure: local memory hierarchy of a spatial architecture: off-chip DRAM; a Global Buffer (100 – 500 kB); a direct inter-PE network; and PE-local memory, a 0.5 – 1.0 kB register file inside each Processing Element (PE) alongside an ALU and control logic.]
Dataflow Taxonomy
Weight Stationary (WS)
+ Good design if weight reuse is significant
- Has to broadcast activations (ifmaps) and move psums
Output Stationary (OS)
+ Reuses partial sums (ofmaps); activations are passed through each PE, eliminating psum memory reads
- Weights need to be broadcast!
No Local Reuse (NLR)
+ Partial sums pass through PEs
- No local reuse; a large global buffer is expensive!
- Needs to perform multicast operations
Energy Efficiency Comparison
[Figure: normalized energy/MAC (0 – 2) for the CNN dataflows WS, OSA, OSB, OSC (variants of OS), and NLR; same total area, AlexNet CONV layers, 256 PEs, batch size = 16.] [Chen et al., ISCA 2016]
Energy Efficiency Comparison
[Figure: the same comparison with Row Stationary added alongside WS, OSA, OSB, OSC, and NLR.] [Chen et al., ISCA 2016]
Energy-Efficient Dataflow: Row Stationary (RS)
- Maximize reuse and accumulation at the RF
- Optimize for overall energy efficiency instead of for only a certain data type
[Chen et al., ISCA 2016]
Goals
1. The number of MAC operations is significant; want to maximize reuse of psums
2. At the same time, want to maximize reuse of the weights and activations that are used to calculate the psums
Row Stationary: Energy-efficient Dataflow
[Figure: Filter * Input Fmap = Output Fmap]
1D Row Convolution in PE
[Figure: animation of a 1D row convolution inside one PE. The filter row (weights a, b, c) and the partial sums stay in the PE's register file while a sliding window of the input fmap row (a, b, c, d, e) streams through; each step multiplies the filter row by the current window and accumulates one partial sum of the output row.]
1D Row Convolution in PE
- Maximize row convolutional reuse in the RF: keep a filter row and a sliding window of the fmap in the RF
- Maximize row psum accumulation in the RF
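As a rough software model of the step above (a sketch; the array sizes and names are made up for illustration), one PE keeps the filter row resident while the fmap row slides past:

    #define FILTER_W 3
    #define IFMAP_W  5
    #define OFMAP_W  (IFMAP_W - FILTER_W + 1)

    /* One PE computing a 1D row convolution: the filter row and psums
     * live in the register file; the ifmap row streams past as a
     * sliding window, as in the animation (filter a,b,c over fmap a..e). */
    void pe_row_conv(const float filter_rf[FILTER_W],  /* resident in RF    */
                     const float ifmap_row[IFMAP_W],   /* sliding window    */
                     float psum_row[OFMAP_W])          /* accumulated in RF */
    {
        for (int x = 0; x < OFMAP_W; x++) {
            float psum = 0.0f;
            for (int i = 0; i < FILTER_W; i++)
                psum += filter_rf[i] * ifmap_row[x + i];
            psum_row[x] = psum;
        }
    }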
2D Convolution (CONV) Layer
[Figure: an R x S filter (weights) slides over an H x W input fmap; each activation of the E x F output fmap is a multiply-and-accumulate over the whole filter.]
How would that look in a 2D row stationary dataflow?
2D Row Convolution in PE
[Figure: PE 1 computes filter Row 1 * fmap Row 1. One row of the filter, ifmap, and ofmap is mapped to one PE; different rows of the filter and ifmap are mapped to different PEs (PE 1: Row 1 * Row 1, PE 2: Row 2 * Row 2, PE 3: Row 3 * Row 3).]
2D Row Convolution in PE
All three PEs still accumulate psums for the same output row, so the psums need to move vertically.
2D Row Convolution in PE
Then the same filter is multiplied by a sliding window of activations to calculate the next row of the ofmap.
2D Row Convolution in PE
Exploit the spatial architecture and map the remaining rows onto other PEs!
[Figure: a 3x3 PE array. Column 1 (PE 1–3): filter Rows 1–3 * fmap Rows 1–3; column 2 (PE 4–6): filter Rows 1–3 * fmap Rows 2–4; column 3 (PE 7–9): filter Rows 1–3 * fmap Rows 3–5. Each column produces one ofmap row.]
Convolutional Reuse Maximized
[Figure: the same 3x3 PE array, annotated three ways.]
- Filter rows are reused across PEs horizontally
- Fmap rows are reused across PEs diagonally
- Partial sums accumulate across PEs vertically
2D Row Convolution in PE
- Filter rows are reused across PEs horizontally
- Fmap rows are reused across PEs diagonally
- Partial sums accumulate across PEs vertically
Pros
+ 2D row convolution avoids reading/writing psums to the global buffer; each PE passes them directly to the next PE
+ Also passes filter and fmap rows along to the next PEs
Cons
- How to orchestrate the psums, activations, and weights?
Convolution (CONV) Layer
[Figure: the full CONV layer: N input fmaps (each C x H x W), M filters (each C x R x S), and N output fmaps (each M x E x F).]
Our convolution is 4D!
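Written out as code (a sketch with made-up sizes; the dimension names follow the slides), the layer is a 7-deep loop nest, and every dataflow in this lecture is a different reordering and tiling of these loops:

    enum { N = 4, M = 8, C = 3, H = 8, W = 8, R = 3, S = 3,
           E = H - R + 1, F = W - S + 1 };

    /* Full CONV layer: N input fmaps, M filters, C channels,
     * R x S filter, E x F output fmaps. */
    void conv_layer(const float in[N][C][H][W],
                    const float wgt[M][C][R][S],
                    float out[N][M][E][F])
    {
        for (int n = 0; n < N; n++)      /* multiple fmaps: reuse weights  */
        for (int m = 0; m < M; m++)      /* multiple filters: reuse ifmaps */
        for (int e = 0; e < E; e++)
        for (int f = 0; f < F; f++) {
            float psum = 0.0f;
            for (int c = 0; c < C; c++)  /* multiple channels: accumulate  */
            for (int r = 0; r < R; r++)
            for (int s = 0; s < S; s++)
                psum += in[n][c][e + r][f + s] * wgt[m][c][r][s];
            out[n][m][e][f] = psum;
        }
    }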
Multiple layers and channels
Dimensions beyond 2D convolution:
1. Multiple Fmaps. Reuse: filter weights
2. Multiple Filters. Reuse: activations
3. Multiple Channels. Reuse: partial sums (accumulated across channels)
Dimensions Beyond 2D Convolution
1. Multiple Fmaps
2. Multiple Filters
3. Multiple Channels
Filter Reuse in PE (1. Multiple Fmaps)
Filter 1, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 1, Row 1
Filter 1, Channel 1, Row 1 * Fmap 2, Row 1 = Psum 2, Row 1
Both computations share the same filter row.
Processing in PE: concatenate fmap rows
Filter 1, Row 1 * [Fmap 1 Row 1 | Fmap 2 Row 1] = [Psum 1 Row 1 | Psum 2 Row 1]
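A rough model of this step (a sketch; sizes and names are illustrative): the filter row is loaded into the RF once and applied to the concatenated fmap rows back-to-back.

    #define FW 3                  /* filter row width   */
    #define IW 5                  /* one ifmap row width */
    #define OW (IW - FW + 1)      /* one psum row width  */

    /* Filter reuse: one filter row resident in the RF processes the
     * concatenated rows of two fmaps, producing two psum rows. */
    void pe_filter_reuse(const float filt_rf[FW],
                         const float fmap_rows[2][IW],
                         float psum_rows[2][OW])
    {
        for (int n = 0; n < 2; n++)            /* concatenated fmap rows */
            for (int x = 0; x < OW; x++) {
                float p = 0.0f;
                for (int i = 0; i < FW; i++)
                    p += filt_rf[i] * fmap_rows[n][x + i];
                psum_rows[n][x] = p;
            }
    }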
Fmap Reuse in PE (2. Multiple Filters)
Filter 1, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 1, Row 1
Filter 2, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 2, Row 1
Both computations share the same fmap row.
Processing in PE: interleave filter rows
[Filter 1 Row 1 | Filter 2 Row 1] * Fmap 1, Row 1 = [Psum 1 Row 1 | Psum 2 Row 1]
Channel Accumulation in PE (3. Multiple Channels)
Filter 1, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 1, Row 1
Filter 1, Channel 2, Row 1 * Fmap 1, Row 1 = Psum 1, Row 1
Accumulate psums: Row 1 + Row 1 = Row 1
Processing in PE: interleave channels
Filter 1, Channel 1 & 2, Row 1 * Fmap 1, Channel 1 & 2, Row 1 = Psum, Row 1
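A sketch of the channel-accumulation step (sizes and names made up for illustration): rows from two channels are interleaved and summed into the same psum row, so the psums never leave the PE between channels.

    #define FW 3
    #define IW 5
    #define OW (IW - FW + 1)
    #define CH 2

    /* Channel accumulation: per-channel products for the same output
     * row are summed into one psum row inside the PE. */
    void pe_channel_accum(const float filt[CH][FW],
                          const float ifmap[CH][IW],
                          float psum_row[OW])
    {
        for (int x = 0; x < OW; x++)
            psum_row[x] = 0.0f;
        for (int c = 0; c < CH; c++)           /* interleaved channels */
            for (int x = 0; x < OW; x++)
                for (int i = 0; i < FW; i++)
                    psum_row[x] += filt[c][i] * ifmap[c][x + i];
    }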
DNN Processing – The Full Picture
- Multiple fmaps: Filter 1 * Fmap 1 & 2 = Psum 1 & 2
- Multiple filters: Filter 1 & 2 * Fmap 1 = Psum 1 & 2
- Multiple channels: Filter 1 * Fmap 1 = Psum (accumulated across channels)
Map rows from multiple fmaps, filters, and channels to the same PE to exploit other forms of reuse and local accumulation.
Optimal Mapping in Row Stationary
[Figure: an optimization compiler (mapper) takes the CNN configuration (shapes and sizes) and the hardware resources (global buffer, PE array) as inputs and produces the row stationary mapping: the per-row PE assignments (PE 1: Row 1 * Row 1 through PE 9: Row 3 * Row 5) combined with the multiple-fmap, multiple-filter, and multiple-channel reuse patterns above.] [Chen et al., ISCA 2016]
Computer Architecture Analogy
[Figure: compilation/execution analogy. The DNN shape and size is the program; the mapper is the compiler; the mapping is the binary; the dataflow is the architecture; the DNN accelerator is the processor; implementation details are the microarchitecture. Execution turns input data into processed data.] [Chen et al., Micro Top-Picks 2017]
Dataflow Simulation Results
Evaluate Reuse in Different Dataflows
- Weight Stationary: minimize movement of filter weights
- Output Stationary: minimize movement of partial sums
- No Local Reuse: no PE-local storage; maximize global buffer size
- Row Stationary
Evaluation setup: same total area, 256 PEs, AlexNet, batch size = 16.
Normalized energy cost of a data access, relative to one ALU operation (1× reference): RF 1×, neighboring PE 2×, global buffer 6×, DRAM 200×.
Variants of Output Stationary
                       OSA        OSB        OSC
# Output Channels      Single     Multiple   Multiple
# Output Activations   Multiple   Multiple   Single
(The parallel output region spans output activations (E) and/or output channels (M); OSA targets CONV layers, OSC targets FC layers.)
Dataflow Comparison: CONV Layers
[Figure: normalized energy/MAC (0 – 2) for WS, OSA, OSB, OSC, NLR, and RS, broken down by data type: psums, weights, activations.]
RS optimizes for the best overall energy efficiency. [Chen et al., ISCA 2016]
Dataflow Comparison: CONV Layers
[Figure: the same comparison broken down by storage level: ALU, RF, NoC, buffer, DRAM.]
RS uses 1.4× – 2.5× lower energy than the other dataflows. [Chen et al., ISCA 2016]
Hardware Architecture for RS Dataflow
[Chen et al., ISSCC 2016]
Eyeriss DNN Accelerator
[Figure: Eyeriss chip organization. A 14×12 PE array is fed by a 108 KB global buffer SRAM; filters, input fmaps, and psums stream into the array (with decompression), and output fmaps and psums stream back (with compression and ReLU) over a 64-bit interface to off-chip DRAM; the link clock and core clock are separate domains.] [Chen et al., ISSCC 2016]
Data Delivery with On-Chip Network
[Figure: the same Eyeriss organization, highlighting the data delivery patterns of the on-chip network across the 14×12 PE array: filter delivery, fmap delivery, and psum movement.]
Chip Spec & Measurement Results
Technology: TSMC 65nm LP 1P9M
On-Chip Buffer: 108 KB
# of PEs: 168
Scratch Pad / PE: 0.5 KB
Core Frequency: 100 – 250 MHz
Peak Performance: 33.6 – 84.0 GOPS
Word Bit-width: 16-bit fixed-point
Natively supported DNN shapes: filter width 1 – 32, filter height 1 – 12, # filters 1 – 1024, # channels 1 – 1024, horizontal stride 1 – 12, vertical stride 1, 2, 4
[Die photo: 4000 µm × 4000 µm, showing the global buffer and the 168-PE spatial array.]
To support 2.66 GMACs [8 billion 16-bit inputs (16 GB) and 2.7 billion outputs (5.4 GB)], the chip only requires 208.5 MB of buffer accesses and 15.4 MB of DRAM accesses. [Chen et al., ISSCC 2016]
Eyeriss Summary
Pros
+ Minimizes movement of partial sums, weights, and fmaps within a row
+ No broadcasting
Cons
- Needs to orchestrate data movement (diagonally, vertically, and horizontally)
- Adds more optimizations on top of basic RS; the compiler needs to map and interleave
- Some may say the comparison to other dataflows is unfair!
Summary of DNN Dataflows
- Weight Stationary: minimize movement of filter weights; popular with processing-in-memory architectures
- Output Stationary: minimize movement of partial sums; different variants possible!
- No Local Reuse: no PE-local storage; maximize global buffer size
- Row Stationary: minimize movement of partial sums, weights, and activations within a row
BRANCH PREDICTION
Branch Prediction: Guess the Next Instruction to Fetch
[Figure: a 5-stage pipeline (Fetch, Decode, Execute, Memory, Writeback) running:
0x0001 LD R1, MEM[R0]
0x0002 ADD R2, R2, #1
0x0003 BRZERO 0x0001
0x0004 ADD R3, R2, #1
0x0005 MUL R1, R2, R3
0x0006 LD R2, MEM[R2]
0x0007 LD R0, MEM[R2]
Stalling fetch until the branch resolves takes 12 cycles; guessing the next fetch address with branch prediction takes 8 cycles.]
Misprediction Penalty
[Figure: the same pipeline on a misprediction. The wrong-path instructions fetched after BRZERO (0x0004 – 0x0007) must be flushed; since the branch resolves deep in the pipeline, the misprediction penalty is 4 cycles.]
Performance Analysis
- Correct guess: no penalty
- Incorrect guess: 2 bubbles
Assume no data-dependency-related stalls, 20% control flow instructions, and 70% of control flow instructions are taken. Guessing "not taken" makes every taken branch a wrong guess:
CPI = 1 + (0.20 × 0.7) × 2 = 1.28
where (0.20 × 0.7) is the probability of a wrong guess and 2 is the penalty for a wrong guess.
Can we reduce either of the two penalty terms?
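The slide's CPI model is simple enough to write down directly (a tiny sketch; the three scenarios correspond to this slide and the two predictor slides that follow):

    #include <stdio.h>

    /* CPI = 1 + (fraction of control-flow insts) * (misprediction rate)
     *          * (penalty in bubbles) */
    static double cpi(double branch_frac, double mispredict, double penalty)
    {
        return 1.0 + branch_frac * mispredict * penalty;
    }

    int main(void)
    {
        printf("guess not taken:  %.2f\n", cpi(0.20, 0.70, 2)); /* 1.28 */
        printf("always taken+BTB: %.2f\n", cpi(0.20, 0.30, 2)); /* 1.12 */
        printf("last-time, 85%%:   %.2f\n", cpi(0.20, 0.15, 2)); /* 1.06 */
        return 0;
    }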
BRANCH PREDICTION
Idea: predict the next fetch address (to be used in the next cycle)
Requires three things to be predicted at fetch stage:
1. Whether the fetched instruction is a branch
2. (Conditional) branch direction
3. Branch target address (if taken)
Observation: the target address remains the same for a conditional direct branch across dynamic instances
Idea: store the target address from the previous instance and access it with the PC
Called the Branch Target Buffer (BTB) or Branch Target Address Cache
Fetch Stage with BTB and Direction Prediction
[Figure: the program counter indexes both a direction predictor (taken?) and a cache of target addresses (BTB: Branch Target Buffer). On a BTB hit with a taken prediction, the next fetch address is the cached target address of the current branch; otherwise it is PC + inst size.]
Always taken: CPI = 1 + (0.20 × 0.3) × 2 = 1.12 (70% of branches are taken, so only 30% are mispredicted)
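A minimal software sketch of this fetch logic (a hypothetical direct-mapped BTB; the sizes and field names are made up, not from any real design):

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 1024
    #define INST_SIZE   4

    typedef struct {
        bool     valid;
        uint32_t tag;     /* PC of the branch this entry belongs to   */
        uint32_t target;  /* target address from the last execution   */
        bool     taken;   /* 1-bit direction predictor (last outcome) */
    } btb_entry_t;

    static btb_entry_t btb[BTB_ENTRIES];

    /* Fetch stage: on a BTB hit with a taken prediction, redirect fetch
     * to the cached target; otherwise fall through to PC + inst size. */
    uint32_t next_fetch_address(uint32_t pc)
    {
        btb_entry_t *e = &btb[(pc / INST_SIZE) % BTB_ENTRIES];
        if (e->valid && e->tag == pc && e->taken)
            return e->target;
        return pc + INST_SIZE;
    }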
Three Things to Be Predicted
Requires three things to be predicted at fetch stage:
1. Whether the fetched instruction is a branch
2. (Conditional) branch direction
3. Branch target address (if taken)
Third (3.): can be accomplished using a BTB; remember the target address computed the last time the branch was executed.
First (1.): can also be accomplished using a BTB; if the BTB provides a target address for the program counter, then it must be a branch. Or, we can store "branch metadata" bits (a partially decoded instruction) in the instruction cache/memory.
Second (2.): how do we predict the direction?
SOPHISTICATED DIRECTION PREDICTION
Compile time (static)
- Always not taken
- Always taken
- BTFN (backward taken, forward not taken)
- Profile based (likely direction)
- Program analysis based (likely direction)
Run time (dynamic)
- Last time prediction (single-bit)
- Two-bit counter based prediction
- Two-level prediction (global vs. local)
- Hybrid
STATIC BRANCH PREDICTION
- Profile based
- Program based
- Programmer based
What is the common disadvantage of all three techniques? They cannot adapt to dynamic changes in branch behavior. This can be mitigated by a dynamic compiler, but not at a fine granularity (and a dynamic compiler has its own overheads...).
DYNAMIC BRANCH PREDICTION
Idea: predict branches based on dynamic information (collected at run-time)
Advantages
+ Prediction is based on the history of branch execution
+ Can adapt to dynamic changes in branch behavior
+ No need for static profiling: the input-set representativeness problem goes away
Disadvantages
- More complex (requires additional hardware)
LAST TIME PREDICTOR
- Single bit per branch (stored in BTB); indicates which direction the branch went the last time it executed
- TTTTTTTTTTNNNNNNNNNN: 90% accuracy
- Always mispredicts the first and the last iteration of a loop branch:
  for (i=0; i<N; i++) { ... }
  Prediction: NTTT...T NTTT...T
  Actual:     TTTT...N TTTT...N
  Accuracy for a loop with N iterations = (N-2)/N
+ Good for loop branches of loops with a large number of iterations
- Bad for loop branches of loops with a small number of iterations
- TNTNTNTNTNTNTNTNTNTN: 0% accuracy
Last-time predictor: CPI = 1 + (0.20 × 0.15) × 2 = 1.06 (assuming 85% accuracy)
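A tiny simulation of the 1-bit predictor on such a loop branch (a sketch; the loop is executed twice so the N-1 taken / 1 not-taken pattern repeats) reproduces the (N-2)/N accuracy:

    #include <stdbool.h>
    #include <stdio.h>

    int main(void)
    {
        const int N = 10;       /* loop trip count                     */
        bool last = false;      /* 1-bit predictor state: last outcome */
        int correct = 0;
        for (int run = 0; run < 2; run++)        /* execute loop twice */
            for (int i = 0; i < N; i++) {
                bool actual = (i < N - 1);       /* taken until exit   */
                if (last == actual) correct++;   /* predict = last     */
                last = actual;                   /* 1-bit update       */
            }
        printf("accuracy = %d/%d\n", correct, 2 * N);  /* 16/20 = (N-2)/N */
        return 0;
    }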
IMPLEMENTING THE LAST-TIME PREDICTOR
[Figure: the PC is split into a tag and a BTB index; the index selects an entry of an N-bit tag table and of a one-bit-per-branch direction table. On a tag match with the bit set (taken?), next PC = BTB target; otherwise next PC = PC + 4.]
The 1-bit BHT (Branch History Table) entry is updated with the correct outcome after each execution of a branch.
STATE MACHINE FOR LAST-TIME PREDICTION
[Figure: two states, "predict not taken" and "predict taken". "Actually taken" moves to (or stays in) "predict taken"; "actually not taken" moves to (or stays in) "predict not taken".]
IMPROVING THE LAST TIME PREDICTOR
Problem: a last-time predictor changes its prediction from T to NT (or NT to T) too quickly, even though the branch may be mostly taken or mostly not taken
Solution idea: add hysteresis to the predictor so that the prediction does not change on a single different outcome
- Use two bits to track the history of predictions for a branch instead of a single bit
- Can have 2 states for T or NT instead of 1 state for each
Smith, "A Study of Branch Prediction Strategies," ISCA 1981.
TWO-BIT COUNTER BASED PREDICTION
- Each branch is associated with a two-bit counter
- One more bit provides hysteresis: a strong prediction does not change with one single different outcome
- Accuracy for a loop with N iterations = (N-1)/N
  for (i=0; i<N; i++) { ... }
  Prediction: TTTT...T TTTT...T TTTT...T
  Actual:     TTTT...N TTTT...N TTTT...N
- TNTNTNTNTNTNTNTNTNTN: 50% accuracy (assuming init to weakly taken)
+ Better prediction accuracy
- More hardware cost (but the counter can be part of a BTB entry)
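A minimal sketch of one such counter (states 0 – 1 predict not taken, 2 – 3 predict taken; a strong state needs two consecutive mispredictions before the prediction flips):

    #include <stdbool.h>
    #include <stdint.h>

    /* 0 = strong NT, 1 = weak NT, 2 = weak T, 3 = strong T */
    typedef uint8_t ctr2_t;

    bool predict_taken(ctr2_t c)
    {
        return c >= 2;
    }

    ctr2_t update(ctr2_t c, bool actually_taken)
    {
        if (actually_taken)
            return c < 3 ? c + 1 : 3;   /* count up, saturate at 3   */
        return c > 0 ? c - 1 : 0;       /* count down, saturate at 0 */
    }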