Bit-Pragmatic Deep Neural Network Computing

Presentation on theme: "Bit-Pragmatic Deep Neural Network Computing"— Presentation transcript:

1 Bit-Pragmatic Deep Neural Network Computing
J. Albericio¹, A. Delmás², P. Judd², S. Sharify², G. O’Leary², R. Genov², A. Moshovos²
¹ This work was done while J. Albericio was a postdoc at the UofT.

2 Context
Goal: accelerate deep learning inference with custom hardware; maximize energy efficiency.
Focus on convolutional neural networks, which are dominated by inner products of activations and weights.

3 We will show
There are lots of ineffectual bits, which lead to ineffectual computation.
An energy-efficient accelerator design can exploit this.

4 Motivation – ineffectual computation
What do we mean by ineffectual bits? Consider this textbook example of binary multiplication of one weight and one activation:

        101111   (weight)
    ×   001010   (activation)
    -----------
        000000
       101111
      000000
     101111
    000000
   000000
    -----------
      111010110

5 Motivation – ineffectual computation
Zero bits lead to zero terms = ineffectual computation. Each zero bit in the multiplier results in a row of zeros to be added. We consider this ineffectual computation, which makes these zero bits ineffectual.
[same multiplication as above, with the four all-zero rows highlighted]

6 Motivation – ineffectual computation
Conversely, only the 1 bits are effectual and lead to effectual computation.
[same multiplication, with the two non-zero rows highlighted as the work of the effectual bits]

7 Current Approaches
[diagram: four 16-bit activations A0–A3; reduced precision trims bit positions, sparsity skips zero-valued activations; note that 1s can be ineffectual too]
Consider four 16-bit activations. The current approaches to dealing with ineffectual computation in neural networks are reduced precision and sparsity. We focus on activations, but the same applies to weights. In a way, these techniques skip ineffectual bits at a coarse granularity.

8 Goal
[diagram: the same activations; beyond reduced precision and sparsity, only the effectual 1 bits remain]
Our goal is to go further and process only the effectual bits.

9 Motivation – Zero bits in activations
Do we really need to go after every bit? Aren't sparsity and reduced precision enough? Even with current approaches, 59% of bits are ineffectual.
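To see where a number like 59% comes from, here is a hypothetical measurement sketch in Python (our own illustration, not the paper's methodology): after sparsity skips zero-valued activations and reduced precision trims the bit width, count the zero bits that remain.

def ineffectual_fraction(activations, precision=16):
    """Fraction of zero (ineffectual) bits left after zero-valued
    activations are skipped and values are trimmed to `precision` bits."""
    survivors = [a & ((1 << precision) - 1) for a in activations if a != 0]
    effectual = sum(bin(a).count("1") for a in survivors)
    return 1.0 - effectual / (precision * len(survivors))

# e.g. activations already reduced to 8-bit precision:
print(ineffectual_fraction([0, 23, 0, 130, 7], precision=8))  # -> 0.625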

10 Process only the effectual bits
How do we compute using only the effectual bits? Cycle 1: the activation 001010 has its first effectual bit at offset 1, so shift the weight 101111 left by 1 and add it to the running sum.

11 Process only the effectual bits
Cycle 2: the next effectual bit of 001010 is at offset 3, so shift the weight left by 3 and add. Two effectual bits, two cycles, and the product is done.
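The scheme is easy to state in software. A minimal Python sketch (the function names are our own, not from the talk): iterate over the activation's one-bit offsets and perform one shift-and-add per offset.

def effectual_offsets(x: int):
    """Yield the bit positions of x's 1-bits: the offsets fed to the shifter."""
    pos = 0
    while x:
        if x & 1:
            yield pos
        x >>= 1
        pos += 1

def pragmatic_multiply(weight: int, activation: int) -> int:
    """One shift-and-add per effectual bit; cycles = popcount(activation)."""
    return sum(weight << k for k in effectual_offsets(activation))

# The slides' example: weight 101111 (47) x activation 001010 (10)
# takes two cycles, with offsets 1 and 3.
assert list(effectual_offsets(0b001010)) == [1, 3]
assert pragmatic_multiply(0b101111, 0b001010) == 470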

12 Baseline Inner Product Unit
So that's how we do one multiplication; how about computing an inner product? In our baseline accelerator, DaDianNao, each unit multiplies 16-bit activations by 16-bit weights (A0 × W0, A1 × W1, ...) and sums the 32-bit products through an adder tree.

13 Shift-and-Add Inner Product Unit
Each multiplier is replaced by a shifter: activation A0 is encoded as a stream of 4-bit offsets K0 (values 0–15) that shift the 16-bit weight W0 into up-to-31-bit terms for the adder tree, and likewise for A1/K1/W1. Time is proportional to the number of 1 bits.
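A cycle-level sketch of the whole unit, reusing effectual_offsets from the earlier sketch (the lane structure here is our reading of the slide): each cycle, every lane with offsets remaining shifts its weight by its next offset, and the adder tree sums the lanes.

def shift_add_inner_product(weights, activations):
    """Each lane holds one (weight, activation) pair; per cycle a lane
    contributes weight << next_offset, and the adder tree sums them.
    Cycles taken = max popcount over the lanes."""
    queues = [list(effectual_offsets(a)) for a in activations]
    acc = 0
    cycles = 0
    while any(queues):
        acc += sum(w << q.pop(0) for w, q in zip(weights, queues) if q)
        cycles += 1
    return acc, cycles

acc, cycles = shift_add_inner_product([3, 5], [0b001010, 0b000110])
assert acc == 3 * 0b001010 + 5 * 0b000110
assert cycles == 2   # both activations have two 1-bits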

14 Throughput – Baseline
[diagram: a grid of activations and weights; a "brick" = 16 elements]
16 units across 16 filters produce 16 outputs per cycle.

15 Throughput – Pragmatic
A "pallet" = 16 bricks at the same position in 16 different windows. 256 units produce 256 outputs every N cycles. Process all bits -> same throughput as the baseline. Skip ineffectual bits -> speedup. 16 filters.

16 Optimization: 2-stage shifting
Complete shifters in front of the adder tree would make it very big. The positions of successive one-bits are not very far apart, so we break the shifting into two stages: a small 2-bit shifter per weight before the adder tree (16-bit weights become up-to-19-bit terms) and one wide shifter after it (up to 31 bits). Full shifters up front: latency cost > power savings. Two-stage split: latency cost < power savings.

17 2-stage shifting – example
Cycle 1: [diagram: lanes A0/A1 apply small per-weight shifts from offsets K0/K1; the adder-tree output then gets the common base shift]

18 2-stage shifting – example
Cycle 2: [the lanes consume their next offsets; the common base shift advances]

19 2-stage shifting – example
Cycle 3: [the lanes consume their next offsets; the common base shift advances again]

20 2-stage shifting – example
Cycle 4: [the last offset is consumed]
Trade-off: potential slowdown. Here an activation has 3 ones, but the computation takes 4 cycles.
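A Python sketch of two-stage shifting under an assumed scheduling policy (the talk does not spell the schedule out): each cycle, the smallest pending offset sets a common base; a lane contributes its next term only if it falls within the 2-bit window [base, base+3], otherwise it stalls, which is where the potential slowdown comes from.

WINDOW = 4  # 2-bit first-stage shifters cover deltas 0..3

def two_stage_inner_product(weights, activations):
    """Small per-lane shift before the adder tree, one wide shift by the
    common base after it. The min-base schedule below is an assumption
    for illustration, not the paper's exact control logic."""
    queues = [list(effectual_offsets(a)) for a in activations]
    acc = 0
    cycles = 0
    while any(queues):
        base = min(q[0] for q in queues if q)
        tree = 0
        for w, q in zip(weights, queues):
            if q and q[0] < base + WINDOW:
                tree += w << (q.pop(0) - base)  # cheap 0..3 shift per lane
        acc += tree << base                     # single wide shift after the tree
        cycles += 1
    return acc, cycles

# Correctness is unchanged, but cycles can exceed the max popcount when
# offsets straddle the window, matching the slide's trade-off.
acc, cycles = two_stage_inner_product([3, 5], [0b001111, 0b1000000000])
assert acc == 3 * 0b001111 + 5 * 0b1000000000
assert cycles == 5   # 5 cycles although the max popcount is only 4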

21 Optimization – Run-ahead
Allow columns to start their next brick without waiting for slower columns.
[diagram: columns working on bricks 0, 1, and 17 at different rates]
Different columns operate on different windows, so this requires more scheduling to deliver new weight bricks to individual columns.
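A first-order cycle model of run-ahead (our simplification, not the paper's pipeline): without run-ahead, every brick step costs the slowest column's cycles; with run-ahead, each column drains its own bricks independently.

def brick_cycles(brick):
    """Cycles for one brick: the max popcount over its activations."""
    return max(bin(a).count("1") for a in brick)

def total_cycles(columns, run_ahead):
    """columns[c][i] is column c's i-th brick of activations."""
    if run_ahead:
        return max(sum(brick_cycles(b) for b in bricks) for bricks in columns)
    # lockstep: every brick step costs the slowest column's cycles
    return sum(max(brick_cycles(col[i]) for col in columns)
               for i in range(len(columns[0])))

# Two columns whose slow bricks are misaligned: lockstep pays for both
# slow bricks in sequence, run-ahead overlaps them.
cols = [[[0b1111], [0b0001]],
        [[0b0001], [0b1111]]]
assert total_cycles(cols, run_ahead=False) == 8
assert total_cycles(cols, run_ahead=True) == 5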

22 Optimization – Improved Offset Encoding
Use a Booth-like encoding to remove strings of 1s. Example: A0's offset stream K0 shrinks from 6 ones to 4 signed terms (one of them −5, i.e., subtract 2^5). Negative terms are already supported by the pipeline.
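A sketch of one such recoding: the non-adjacent form, a Booth-style signed-digit encoding (the talk does not specify which variant Pragmatic uses). A run of k ≥ 2 consecutive ones collapses into one positive and one negative term, and the recoded term count never exceeds the popcount.

def booth_offsets(x: int):
    """Recode x into signed power-of-two terms via the non-adjacent form.
    Returns (sign, offset) pairs; (-1, n) means subtract 1 << n."""
    terms = []
    pos = 0
    while x:
        if x & 1:
            digit = 2 - (x & 3)   # +1 if x % 4 == 1, -1 if x % 4 == 3
            terms.append((digit, pos))
            x -= digit
        x >>= 1
        pos += 1
    return terms

def decode(terms):
    return sum(sign * (1 << off) for sign, off in terms)

# 0b0111110 has five 1-bits but recodes to two terms: -2^1 + 2^6.
x = 0b0111110
assert booth_offsets(x) == [(-1, 1), (1, 6)]
assert decode(booth_offsets(x)) == x
assert len(booth_offsets(x)) <= bin(x).count("1")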

23 Methodology
Performance: in-house cycle-level simulator
Datapath:
Synthesis: Synopsys Design Compiler, TSMC 65nm
Layout: Cadence Encounter
Activity: Mentor Graphics ModelSim
Memory models: eDRAM: Destiny; SRAM: CACTI
Networks: out-of-the-box Caffe model zoo

24 Results
Performance: 2-stage shifting (scaling offset width); all optimizations
Energy efficiency

25 Scaling Offset Width
[datapath: an N-bit offset drives the per-lane shifter, widening 16-bit weights to 16 + 2^N − 1 bits]
Evaluated with 6 CNNs. Reducing the offset to 2 bits: negligible slowdown.

26 Performance
With run-ahead and improved offset encoding, performance goes up to 4.31× vs. DaDianNao.

27 Energy Efficiency
A naïve design with 4-bit offsets is less efficient than the baseline. 2-bit offsets with run-ahead improve efficiency, but still fall short of STR. Finally, with improved encoding we get significantly better energy efficiency: 1.70× vs. DaDianNao.

28 Remaining Ineffectual Bits
Only 7.6% of bits are effectual, which implies a 13× ideal speedup. Performance is lost to load imbalance within bricks and columns. There is more potential to exploit.

29 Conclusion
Pragmatic offers performance proportional to:
Numerical precision
Number of effectual bits
Three optimizations to increase performance and reduce power:
2-stage shifting
Run-ahead
Improved offset encoding
4.3× speedup and 1.7× energy efficiency over DaDianNao

30 Bit-Pragmatic Deep Neural Network Computing
J. Albericio¹, A. Delmás², P. Judd², S. Sharify², G. O’Leary², R. Genov², A. Moshovos²

31 Figure 1

32 Table 1

33 Figure 2

34 Figure 3

35 Figure 4

36 Figure 5a) Baseline Tile

37 Figure 5b) Pragmatic Tile

38 Figure 6: PIP

39 Figure 7a) 2 Stage PIP

40 Figure 7b) 2 Stage Example

41 Figure 8: Run-ahead Example

42 Table 2: Precision Profiles

43 Figure 9: Perf 2 Stage

44 Table 3: 2 Stage Area/Power

45 Figure 10: Perf Run-ahead

46 Table 4: Run-ahead Area/Power

47 Figure 11: Perf IOE

48 Figure 12: Efficiency

49 Table 5: Area/Power Breakdown

50 Table 6: Tile Configurations

51 Table 7: 8-bit Designs

