Bit-Pragmatic Deep Neural Network Computing

Presentation on theme: "Bit-Pragmatic Deep Neural Network Computing"— Presentation transcript:

1 Bit-Pragmatic Deep Neural Network Computing
J. Albericio¹, A. Delmás², P. Judd², S. Sharify², G. O’Leary², R. Genov², A. Moshovos²
¹ This work was done while J. Albericio was a postdoc at the UofT.

2 Context
Goal: accelerate deep learning inference with custom hardware; maximize energy efficiency.
Focus on convolutional neural networks, which are dominated by inner products of activations and weights.

3 We will show
There are lots of ineffectual bits, which lead to ineffectual computation.
An energy-efficient accelerator design can exploit this.

4 Motivation – ineffectual computation
What do we mean by ineffectual bits? Consider this textbook example of binary multiplication of one weight and one activation:

        101111   (weight)
    ×   001010   (activation)
    -----------
        000000
       101111
      000000
     101111
    000000
   000000
    -----------
      111010110

5 Motivation – ineffectual computation
Zero bits lead to zero terms = ineffectual computation. Each zero bit in the multiplier results in a row of zeros to be added. We consider this ineffectual computation, which makes these zero bits ineffectual.
[same multiplication as above, with the four all-zero rows highlighted]

6 Motivation – ineffectual computation
Conversely, only the 1 bits are effectual and lead to effectual computation.
[same multiplication, with the two non-zero rows highlighted as the work of the effectual bits]

7 Current Approaches
[diagram: four 16-bit activations A0–A3; reduced precision trims bit positions, sparsity skips zero-valued activations; note that 1s can be ineffectual too]
Consider four 16-bit activations. The current approaches to dealing with ineffectual computation in neural networks are reduced precision and sparsity. We focus on activations, but the same applies to weights. In a way, these techniques skip ineffectual bits at a coarse granularity.

8 Goal
[diagram: the same activations; beyond reduced precision and sparsity, only the effectual 1 bits remain]
Our goal is to go further and process only the effectual bits.

9 Motivation – Zero bits in activations
Do we really need to go after every bit? Aren't sparsity and reduced precision enough? Even with current approaches, 59% of bits are ineffectual.
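To see where a number like 59% comes from, here is a hypothetical measurement sketch in Python (our own illustration, not the paper's methodology): after sparsity skips zero-valued activations and reduced precision trims the bit width, count the zero bits that remain.

def ineffectual_fraction(activations, precision=16):
    """Fraction of zero (ineffectual) bits left after zero-valued
    activations are skipped and values are trimmed to `precision` bits."""
    survivors = [a & ((1 << precision) - 1) for a in activations if a != 0]
    effectual = sum(bin(a).count("1") for a in survivors)
    return 1.0 - effectual / (precision * len(survivors))

# e.g. activations already reduced to 8-bit precision:
print(ineffectual_fraction([0, 23, 0, 130, 7], precision=8))  # -> 0.625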

10 Process only the effectual bits
How do we compute using only the effectual bits? Cycle 1: the activation 001010 has its first effectual bit at offset 1, so shift the weight 101111 left by 1 and add it to the running sum.

11 Process only the effectual bits
Cycle 2: the next effectual bit of 001010 is at offset 3, so shift the weight left by 3 and add. Two effectual bits, two cycles, and the product is done.
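The scheme is easy to state in software. A minimal Python sketch (the function names are our own, not from the talk): iterate over the activation's one-bit offsets and perform one shift-and-add per offset.

def effectual_offsets(x: int):
    """Yield the bit positions of x's 1-bits: the offsets fed to the shifter."""
    pos = 0
    while x:
        if x & 1:
            yield pos
        x >>= 1
        pos += 1

def pragmatic_multiply(weight: int, activation: int) -> int:
    """One shift-and-add per effectual bit; cycles = popcount(activation)."""
    return sum(weight << k for k in effectual_offsets(activation))

# The slides' example: weight 101111 (47) x activation 001010 (10)
# takes two cycles, with offsets 1 and 3.
assert list(effectual_offsets(0b001010)) == [1, 3]
assert pragmatic_multiply(0b101111, 0b001010) == 470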

12 Baseline Inner Product Unit
So that's how we do one multiplication; how about computing an inner product? In our baseline accelerator, DaDianNao, each unit multiplies 16-bit activations by 16-bit weights (A0 × W0, A1 × W1, ...) and sums the 32-bit products through an adder tree.

13 Shift-and-Add Inner Product Unit
Each multiplier is replaced by a shifter: activation A0 is encoded as a stream of 4-bit offsets K0 (values 0–15) that shift the 16-bit weight W0 into up-to-31-bit terms for the adder tree, and likewise for A1/K1/W1. Time is proportional to the number of 1 bits.
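A cycle-level sketch of the whole unit, reusing effectual_offsets from the earlier sketch (the lane structure here is our reading of the slide): each cycle, every lane with offsets remaining shifts its weight by its next offset, and the adder tree sums the lanes.

def shift_add_inner_product(weights, activations):
    """Each lane holds one (weight, activation) pair; per cycle a lane
    contributes weight << next_offset, and the adder tree sums them.
    Cycles taken = max popcount over the lanes."""
    queues = [list(effectual_offsets(a)) for a in activations]
    acc = 0
    cycles = 0
    while any(queues):
        acc += sum(w << q.pop(0) for w, q in zip(weights, queues) if q)
        cycles += 1
    return acc, cycles

acc, cycles = shift_add_inner_product([3, 5], [0b001010, 0b000110])
assert acc == 3 * 0b001010 + 5 * 0b000110
assert cycles == 2   # both activations have two 1-bits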

14 Throughput – Baseline
[diagram: a grid of activations and weights; a "brick" = 16 elements]
16 units across 16 filters produce 16 outputs per cycle.

15 Throughput – Pragmatic
A "pallet" = 16 bricks at the same position in 16 different windows. 256 units produce 256 outputs every N cycles. Process all bits -> same throughput as the baseline. Skip ineffectual bits -> speedup. 16 filters.

16 Optimization: 2-stage shifting
Complete shifters in front of the adder tree would make it very big. The positions of successive one-bits are not very far apart, so we break the shifting into two stages: a small 2-bit shifter per weight before the adder tree (16-bit weights become up-to-19-bit terms) and one wide shifter after it (up to 31 bits). Full shifters up front: latency cost > power savings. Two-stage split: latency cost < power savings.

17 2-stage shifting – example
Cycle 1: [diagram: lanes A0/A1 apply small per-weight shifts from offsets K0/K1; the adder-tree output then gets the common base shift]

18 2-stage shifting – example
Cycle 2: [the lanes consume their next offsets; the common base shift advances]

19 2-stage shifting – example
Cycle 3: [the lanes consume their next offsets; the common base shift advances again]

20 2-stage shifting – example
Cycle 4: [the last offset is consumed]
Trade-off: potential slowdown. Here an activation has 3 ones, but the computation takes 4 cycles.
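A Python sketch of two-stage shifting under an assumed scheduling policy (the talk does not spell the schedule out): each cycle, the smallest pending offset sets a common base; a lane contributes its next term only if it falls within the 2-bit window [base, base+3], otherwise it stalls, which is where the potential slowdown comes from.

WINDOW = 4  # 2-bit first-stage shifters cover deltas 0..3

def two_stage_inner_product(weights, activations):
    """Small per-lane shift before the adder tree, one wide shift by the
    common base after it. The min-base schedule below is an assumption
    for illustration, not the paper's exact control logic."""
    queues = [list(effectual_offsets(a)) for a in activations]
    acc = 0
    cycles = 0
    while any(queues):
        base = min(q[0] for q in queues if q)
        tree = 0
        for w, q in zip(weights, queues):
            if q and q[0] < base + WINDOW:
                tree += w << (q.pop(0) - base)  # cheap 0..3 shift per lane
        acc += tree << base                     # single wide shift after the tree
        cycles += 1
    return acc, cycles

# Correctness is unchanged, but cycles can exceed the max popcount when
# offsets straddle the window, matching the slide's trade-off.
acc, cycles = two_stage_inner_product([3, 5], [0b001111, 0b1000000000])
assert acc == 3 * 0b001111 + 5 * 0b1000000000
assert cycles == 5   # 5 cycles although the max popcount is only 4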

21 Optimization – Run-ahead
Allow columns to start their next brick without waiting for slower columns.
[diagram: columns working on bricks 0, 1, and 17 at different rates]
Different columns operate on different windows, so this requires more scheduling to deliver new weight bricks to individual columns.
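A first-order cycle model of run-ahead (our simplification, not the paper's pipeline): without run-ahead, every brick step costs the slowest column's cycles; with run-ahead, each column drains its own bricks independently.

def brick_cycles(brick):
    """Cycles for one brick: the max popcount over its activations."""
    return max(bin(a).count("1") for a in brick)

def total_cycles(columns, run_ahead):
    """columns[c][i] is column c's i-th brick of activations."""
    if run_ahead:
        return max(sum(brick_cycles(b) for b in bricks) for bricks in columns)
    # lockstep: every brick step costs the slowest column's cycles
    return sum(max(brick_cycles(col[i]) for col in columns)
               for i in range(len(columns[0])))

# Two columns whose slow bricks are misaligned: lockstep pays for both
# slow bricks in sequence, run-ahead overlaps them.
cols = [[[0b1111], [0b0001]],
        [[0b0001], [0b1111]]]
assert total_cycles(cols, run_ahead=False) == 8
assert total_cycles(cols, run_ahead=True) == 5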

22 Optimization – Improved Offset Encoding
Use a Booth-like encoding to remove strings of 1s. Example: A0's offset stream K0 shrinks from 6 ones to 4 signed terms (one of them −5, i.e., subtract 2^5). Negative terms are already supported by the pipeline.
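A sketch of one such recoding: the non-adjacent form, a Booth-style signed-digit encoding (the talk does not specify which variant Pragmatic uses). A run of k ≥ 2 consecutive ones collapses into one positive and one negative term, and the recoded term count never exceeds the popcount.

def booth_offsets(x: int):
    """Recode x into signed power-of-two terms via the non-adjacent form.
    Returns (sign, offset) pairs; (-1, n) means subtract 1 << n."""
    terms = []
    pos = 0
    while x:
        if x & 1:
            digit = 2 - (x & 3)   # +1 if x % 4 == 1, -1 if x % 4 == 3
            terms.append((digit, pos))
            x -= digit
        x >>= 1
        pos += 1
    return terms

def decode(terms):
    return sum(sign * (1 << off) for sign, off in terms)

# 0b0111110 has five 1-bits but recodes to two terms: -2^1 + 2^6.
x = 0b0111110
assert booth_offsets(x) == [(-1, 1), (1, 6)]
assert decode(booth_offsets(x)) == x
assert len(booth_offsets(x)) <= bin(x).count("1")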

23 Methodology
Performance: in-house cycle-level simulator
Datapath:
Synthesis: Synopsys Design Compiler, TSMC 65nm
Layout: Cadence Encounter
Activity: Mentor Graphics ModelSim
Memory models: eDRAM: Destiny; SRAM: CACTI
Networks: out-of-the-box Caffe model zoo

24 Results
Performance: 2-stage shifting (scaling offset width); all optimizations
Energy efficiency

25 Scaling Offset Width
[datapath: an N-bit offset drives the per-lane shifter, widening 16-bit weights to 16 + 2^N − 1 bits]
Evaluated with 6 CNNs. Reducing the offset to 2 bits: negligible slowdown.

26 Performance
With run-ahead and improved offset encoding, performance goes up to 4.31× vs. DaDianNao.

27 Energy Efficiency
A naïve design with 4-bit offsets is less efficient than the baseline. 2-bit offsets with run-ahead improve efficiency, but still fall short of STR. Finally, with improved encoding we get significantly better energy efficiency: 1.70× vs. DaDianNao.

28 Remaining Ineffectual Bits
Only 7.6% of bits are effectual, which implies a 13× ideal speedup. Performance is lost to load imbalance within bricks and columns. There is more potential to exploit.

29 Conclusion
Pragmatic offers performance proportional to:
Numerical precision
Number of effectual bits
Three optimizations to increase performance and reduce power:
2-stage shifting
Run-ahead
Improved offset encoding
4.3× speedup and 1.7× energy efficiency over DaDianNao

30 Bit-Pragmatic Deep Neural Network Computing
J. Albericio¹, A. Delmás², P. Judd², S. Sharify², G. O’Leary², R. Genov², A. Moshovos²

31 Figure 1

32 Table 1

33 Figure 2

34 Figure 3

35 Figure 4

36 Figure 5a) Baseline Tile

37 Figure 5b) Pragmatic Tile

38 Figure 6: PIP

39 Figure 7a) 2 Stage PIP

40 Figure 7b) 2 Stage Example

41 Figure 8: Run-ahead Example

42 Table 2: Precision Profiles

43 Figure 9: Perf 2 Stage

44 Table 3: 2 Stage Area/Power

45 Figure 10: Perf Run-ahead

46 Table 4: Run-ahead Area/Power

47 Figure 11: Perf IOE

48 Figure 12: Efficiency

49 Table 5: Area/Power Breakdown

50 Table 6: Tile Configurations

51 Table 7: 8-bit Designs

