Bit-Pragmatic Deep Neural Network Computing
J. Albericio1, A. Delmás2, P. Judd2, S. Sharify2, G. O'Leary2, R. Genov2, A. Moshovos2
1: This work was done while J. Albericio was a postdoc at the UofT.
2: University of Toronto
Context
Goal: Accelerate deep learning inference with custom hardware; maximize energy efficiency.
Focus: Convolutional Neural Networks, whose runtime is dominated by inner products of activations and weights.
We will show:
There are lots of ineffectual bits -> ineffectual computation
An energy-efficient accelerator design that exploits this
Motivation – ineffectual computation
What do we mean by ineffectual bits? Consider this textbook example of binary long multiplication of one weight by one activation:

          101111    Weight
    x     001010    Activation
    --------------
          000000    (activation bit 0 = 0)
         101111     (activation bit 1 = 1)
        000000      (activation bit 2 = 0)
       101111       (activation bit 3 = 1)
      000000        (activation bit 4 = 0)
    + 000000        (activation bit 5 = 0)
    --------------
       111010110
Motivation – ineffectual computation
Zero bits lead to zero terms = ineffectual computation. Each zero bit in the multiplier produces a row of zeros to be added; we consider this ineffectual computation, making those zero bits ineffectual.
Motivation – ineffectual computation
Conversely, only the 1 bits are effectual and lead to effectual computation.
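The notion of effectual bits can be made concrete with a one-line popcount; a minimal Python sketch:

```python
def effectual_bits(x: int) -> int:
    """Count the 1 ("effectual") bits of a value: each one produces a
    nonzero shifted row in long multiplication, while each 0 bit only
    contributes a row of zeros."""
    return bin(x).count("1")

# The activation from the example: only 2 of its 6 bits are effectual.
assert effectual_bits(0b001010) == 2
```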
Current Approaches
Consider four 16-bit activations A0–A3. The current approaches to ineffectual computation in neural networks exploit reduced precision and sparsity (skipping zero values). We focus on activations, but the same applies to weights. In a way, these techniques skip ineffectual bits at a coarse granularity; even so, 1 bits can be ineffectual too.
Goal
Our goal is to go further than reduced precision and sparsity, and process only the effectual bits.
Motivation – zero bits in activations
Do we really need to go after every bit? Aren't sparsity and reduced precision enough? Even with current approaches, 59% of activation bits are ineffectual.
Process only the effectual bits
How do we compute only the effectual bits? Encode each activation as the offsets of its 1 bits, and use each offset to shift the weight.
Cycle 1: the activation 001010 has its first effectual bit at offset 1, so shift and accumulate: 101111 << 1 = 1011110.
Process only the effectual bits
Cycle 2: the next effectual bit is at offset 3, so shift and accumulate: 1011110 + (101111 << 3) = 111010110.
Two cycles, one per effectual bit, produce the full product.
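The two cycles above can be written as a shift-and-add loop over the activation's 1-bit offsets; a software model (not the hardware) of the idea:

```python
def pragmatic_multiply(weight: int, activation: int):
    """Multiply by streaming only the activation's 1-bit offsets.
    Each "cycle" consumes one effectual bit: the weight is shifted by
    that bit's position and accumulated. Cycle count equals the number
    of 1 bits, not the bit width."""
    acc, offset, cycles = 0, 0, 0
    a = activation
    while a:
        if a & 1:
            acc += weight << offset  # one cycle per effectual bit
            cycles += 1
        a >>= 1
        offset += 1
    return acc, cycles

product, cycles = pragmatic_multiply(0b101111, 0b001010)
assert product == 0b101111 * 0b001010  # 111010110, as in the example
assert cycles == 2
```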
Baseline Inner Product Unit
That is how we do one multiplication; how about computing an inner product? In our baseline accelerator, DaDianNao, each unit multiplies 16-bit activations (A0, A1, ...) by 16-bit weights (W0, W1, ...) and sums the 32-bit products in an adder tree.
Shift-and-Add Inner Product Unit
Pragmatic replaces each multiplier with a shifter: 4-bit offsets (K0, K1, ...) encode the positions of each activation's 1 bits (0–15) and shift the 16-bit weights, producing up to 31-bit terms for the adder tree. Time is proportional to the number of 1 bits.
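The unit can be modeled in software as lanes that each consume one offset per cycle while the adder tree accumulates; a sketch under that assumption:

```python
def shift_add_inner_product(weights, activations):
    """Model of the shift-and-add unit: each cycle, every lane shifts
    its weight by its activation's next 1-bit offset, and the adder
    tree accumulates the terms. The unit runs until the lane with the
    most 1 bits is done, so time tracks the max popcount."""
    # Precompute each activation's list of 1-bit offsets, LSB first.
    offsets = [[i for i in range(a.bit_length()) if (a >> i) & 1]
               for a in activations]
    acc, cycles = 0, 0
    while any(offsets):
        for w, offs in zip(weights, offsets):
            if offs:
                acc += w << offs.pop(0)  # one offset per lane per cycle
        cycles += 1
    return acc, cycles

result, cycles = shift_add_inner_product([3, 5], [0b1010, 0b0110])
assert result == 3 * 0b1010 + 5 * 0b0110
assert cycles == 2  # both activations have two 1 bits
```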
Throughput – Baseline
A "brick" = 16 elements. Each cycle, the baseline tile multiplies a brick of activations against weight bricks from 16 filters: 16 units, 16 outputs per cycle.
Throughput – Pragmatic
A "pallet" = 16 bricks: the same brick position in 16 different windows. 256 units produce 256 outputs every N cycles. Processing all bits gives the same throughput as the baseline; skipping ineffectual bits gives speedup.
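A simple throughput model makes the speedup claim concrete, assuming (our simplification) that all 16 lanes of a column advance in lock-step, so a brick takes as many cycles as its activation with the most 1 bits:

```python
def pragmatic_brick_cycles(brick):
    """Cycles to process one brick of activations: lanes advance in
    lock-step, so the brick finishes when the activation with the
    most 1 bits runs out (at least 1 cycle per brick)."""
    return max((bin(a).count("1") for a in brick), default=1) or 1

# The baseline spends 16 cycles per 16-bit brick regardless of content.
bricks = [[0b1010, 0b0001], [0b1111, 0b0011]]
prag = sum(pragmatic_brick_cycles(b) for b in bricks)  # 2 + 4 = 6
speedup = 16 * len(bricks) / prag
assert abs(speedup - 32 / 6) < 1e-9
```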
Optimization: 2-stage shifting
Complete shifters in front of the adder tree would make it very big. The positions of different 1 bits are not very far apart, so we break the shifting into two stages: a narrow per-weight shifter (e.g., 2 bits, covering shifts 0–3) before the adder tree, and a single wide second-stage shifter after it. The first-stage width trades added latency against power savings.
2-stage shifting – example
Cycles 1–4: each cycle, the per-weight first-stage shifters handle only the offsets that fall inside a narrow window, while the shared second-stage shifter advances the window.
Trade-off: potential slowdown. Here an activation with 3 one-bits takes 4 cycles, because its offsets span more windows than it has bits.
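The slowdown can be explored with a cycle-count model. This is a hedged sketch, assuming (our guess at the scheduling) that each cycle the shared second-stage shift anchors one coarse window at the smallest pending offset:

```python
def two_stage_cycles(offset_lists, fine_bits=2):
    """Cycle count under a 2-stage shifting model: per cycle, the
    shared second stage fixes one coarse window of 2**fine_bits
    offsets; only lanes whose next 1-bit offset falls inside that
    window consume it, the rest wait."""
    window = 1 << fine_bits
    pending = [list(o) for o in offset_lists]
    cycles = 0
    while any(pending):
        # Anchor the coarse window at the smallest pending offset.
        base = min(o[0] for o in pending if o) // window * window
        for offs in pending:
            if offs and base <= offs[0] < base + window:
                offs.pop(0)  # consumed this cycle
        cycles += 1
    return cycles

# Two lanes with 3 one-bits each: 5 cycles instead of the ideal 3,
# because their offsets fall in different coarse windows.
assert two_stage_cycles([[0, 2, 9], [0, 6, 7]]) == 5
assert two_stage_cycles([[0, 2, 9], [0, 6, 7]], fine_bits=16) == 3
```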
Optimization – Run-ahead
Allow columns to start the next brick without waiting for slower columns. Different columns operate on different windows, so this is safe; it requires more scheduling to deliver new weight bricks to individual columns.
Optimization – Improved Offset Encoding
Use a Booth-like encoding to remove strings of 1s. Example: an activation A0 with 6 one-bits recodes to the 4 signed terms {+2^11, -2^7, -2^5, +2^0}. Negative terms are already supported by the pipeline.
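One way to realize the recoding is to collapse each run of consecutive 1 bits into a subtraction plus a carry; a Python sketch that reproduces the slide's 6-ones-to-4-terms example:

```python
def booth_terms(x: int):
    """Booth-like recoding: a run of k >= 2 consecutive 1 bits
    starting at bit i (value 2**(i+k) - 2**i) becomes a -2**i term
    plus a carry into bit i+k, so long runs cost 2 terms instead
    of k. Returns (offset, sign) pairs whose signed sum is x."""
    terms, i = [], 0
    while x >> i:
        if (x >> i) & 1:
            k = 0
            while (x >> (i + k)) & 1:   # measure the run of 1s
                k += 1
            if k >= 2:
                terms.append((i, -1))
                x += 1 << i             # run carries into bit i+k
                i += k
            else:
                terms.append((i, +1))
                i += 1
        else:
            i += 1
    return terms

# The slide's activation: 6 one-bits, 4 signed terms {+11, -7, -5, +1}.
a0 = 0b11101100001
terms = booth_terms(a0)
assert sum(s << off for off, s in terms) == a0
assert sorted(terms) == [(0, 1), (5, -1), (7, -1), (11, 1)]
```

The carry can merge with higher 1 bits (as it does here, where the run at bits 5–6 carries into the run at bits 8–10), which is what lets the encoding beat plain popcount.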
Methodology
Performance: in-house cycle-level simulator
Datapath synthesis: Synopsys Design Compiler, TSMC 65nm
Layout: Cadence Encounter
Activity: Mentor Graphics ModelSim
Memory models: eDRAM: Destiny; SRAM: CACTI
Networks: out-of-the-box Caffe model zoo
Results
2-stage shifting: scaling the offset width
Performance with all optimizations
Energy efficiency
Scaling the Offset Width
An N-bit first-stage offset lets each weight shift by up to 2^N - 1 before the adder tree. Evaluated on 6 CNNs: reducing the offset to 2 bits causes negligible slowdown.
Performance
With run-ahead and improved offset encoding, performance reaches 4.31x vs. DaDianNao.
Energy Efficiency
A naïve design with 4-bit offsets is less efficient than the baseline. 2-bit offsets with run-ahead improve efficiency, but are still worse than STR. Finally, with improved offset encoding we get significantly better energy efficiency: 1.70x vs. DaDianNao.
Remaining Ineffectual Bits
Only 7.6% of bits are effectual -> 13x ideal speedup. Performance is lost to load imbalance within bricks and columns, so there is more potential to exploit.
Conclusion
Pragmatic offers performance proportional to numerical precision and the number of effectual bits.
Three optimizations increase performance and reduce power: 2-stage shifting, run-ahead, and improved offset encoding.
Result: 4.3x speedup and 1.7x energy efficiency over DaDianNao.
Bit-Pragmatic Deep Neural Network Computing
J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O'Leary, R. Genov, A. Moshovos
Figure 1
Table 1
Figure 2
Figure 3
Figure 4
Figure 5a) Baseline Tile
Figure 5b) Pragmatic Tile
Figure 6: PIP
Figure 7a) 2 Stage PIP
Figure 7b) 2 Stage Example
Figure 8: Run-ahead Example
Table 2: Precision Profiles
Figure 9: Perf 2 Stage
Table 3: 2 Stage Area/Power
Figure 10: Perf Run-ahead
Table 4: Run-ahead Area/Power
Figure 11: Perf IOE
Figure 12: Efficiency
Table 5: Area/Power Breakdown
Table 6: Tile Configurations
Table 7: 8-bit Designs