Stripes: Bit-Serial Deep Neural Network Computing Patrick Judd, Jorge Albericio, Tayler Hetherington*, Tor M. Aamodt*, Andreas Moshovos *
Neural networks can use very low precision data without affecting the accuracy of the output. However, the minimum precision varies between layers and between networks. [Figure: a network classifying an image as "Hello Kitty"; 16-bit data values, with per-layer precisions 10 … 7.]
Stripes vs. DaDianNao (16-bit): Performance +92%, Efficiency +57%, plus a fine-grained accuracy-performance trade-off. We exploit this precision variability by using bit-serial arithmetic, so that execution time is proportional to the precision and every bit you shave off gives additional performance and energy efficiency. Compared to the high-performance 16-bit DaDianNao accelerator, we get 92% more performance and 57% more energy efficiency with no loss in accuracy. Stripes offers additional performance and efficiency by trading off accuracy, and this trade-off can be tuned at a per-bit granularity.
Deep Neural Networks. Deep neural networks are state-of-the-art algorithms for tasks like image classification. They are made of neurons and synapses, organized into 3D convolution and fully connected layers. In this work we are only concerned with inference. [Figure: a network of 3D convolution and fully connected layers classifying an image as "Hello Kitty" [1]; annotated "~90%".] [1] http://www.clipartbest.com/cliparts/9T4/bAR/9T4bAR5bc.jpeg
Reduced Precision: Dynamic Fixed Point Representation. Value = int_p × 2^e, where int_p is a p-bit signed integer and e is an exponent shared by all neurons in a layer (slide example: 00010101000.0000). We analyze the reduced-precision tolerance of neural networks in the context of this dynamic fixed point representation.
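The representation above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the round-to-nearest and saturation behavior are assumptions, since the slides do not say how out-of-range values are handled.

```python
def quantize_layer(values, p, e):
    """Map a layer's values to p-bit signed integers that share exponent e.

    Assumption (not from the slides): round to nearest, saturate on overflow.
    """
    lo, hi = -(1 << (p - 1)), (1 << (p - 1)) - 1
    ints = []
    for v in values:
        q = round(v / 2 ** e)            # scale by the shared exponent
        ints.append(max(lo, min(hi, q)))  # clamp to the p-bit signed range
    return ints


def dequantize_layer(ints, e):
    """Recover Value = int_p * 2^e for every neuron in the layer."""
    return [q * 2 ** e for q in ints]


layer = [0.75, -1.5, 2.25, 0.1]
ints = quantize_layer(layer, p=5, e=-2)
approx = dequantize_layer(ints, e=-2)
```

With p = 5 and e = -2, values are stored in steps of 0.25, so 0.1 is rounded to 0 while the other three values are represented exactly.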
Required Precision Varies. Per-layer neuron precisions (in bits) shown on the slide: LeNet: 3, 3; GoogLeNet: 10, 7; VGG_19: 10, 13. We profile a set of popular networks to determine the minimum per-layer precisions needed to maintain the baseline accuracy. The numbers listed are the neuron precisions, since neurons on average require less precision than synapses. Notice that many of these numbers don't fit nicely into power-of-2 data types.
In an ideal world, the amount of time to compute data should be proportional to the precision. So, compared to a 16-bit baseline system, we want a speedup equal to 16/p. For example, with 9 bits of precision we would like a speedup of 16/9.
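As a small worked example of the 16/p target, the sketch below computes the ideal speedup for one precision and for a whole network whose layers use different precisions. The function names and the (baseline cycles, precision) layer description are hypothetical, introduced only for illustration.

```python
from fractions import Fraction


def ideal_speedup(p, baseline_bits=16):
    """Ideal speedup of a p-bit computation over a baseline_bits-wide baseline."""
    if not 1 <= p <= baseline_bits:
        raise ValueError("precision must be between 1 and baseline_bits")
    return Fraction(baseline_bits, p)


def network_ideal_speedup(layers, baseline_bits=16):
    """layers: list of (baseline_cycles, precision) pairs, one per layer.

    Each layer's time shrinks by p / baseline_bits; the network speedup is
    total baseline time divided by total scaled time.
    """
    base = sum(cycles for cycles, _ in layers)
    scaled = sum(Fraction(cycles * p, baseline_bits) for cycles, p in layers)
    return base / scaled


nine_bit = ideal_speedup(9)                                # Fraction(16, 9)
mixed = network_ideal_speedup([(100, 8), (100, 16)])       # Fraction(4, 3)
```

The second call shows why per-layer precisions matter: a layer that still needs 16 bits gains nothing and pulls the network-wide figure down.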
What is the computation? Inner products. The bulk of the computation in DNNs boils down to inner products, specifically an inner product of a vector of neurons and a vector of synapses. [Figure: neurons (10111110 10000000 00001010 10101010) multiplied element-wise with synapses (10101011 10101000 10111010 10101010) and summed.]
How do we exploit precision for performance? Now, how do we compute an inner product while exploiting precision? With bit-serial computation, execution time is proportional to precision: a p-bit input takes p cycles, so a single serial unit on its own has a "speedup" of 1/p.
How do we exploit precision for performance? Replicating the serial unit x16 gives Speedup = 16/p. For example, with 9-bit neurons we would get a speedup of 16/9. How do we design an efficient architecture around this, on a high-throughput SIMD accelerator like DaDianNao?
Review: 3D convolutions. Each convolution layer takes a 3D array of input neurons and a set of 3D arrays of synapses, called filters. Each filter is applied to a window in the neuron array, computing an inner product and producing a single output value. Multiple filters are applied to multiple windows in the image to produce a 3D array of output neurons.
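The window-and-filter description above can be written as a short reference loop. This is a plain-Python sketch, not how an accelerator schedules the work; the array layout (neurons[y][x][channel]), square filters, and the absence of padding are assumptions made for brevity.

```python
def conv3d_layer(neurons, filters, stride=1):
    """Apply every filter to every window of a 3D neuron array.

    neurons: H x W x C nested lists; filters: list of K x K x C nested lists.
    Returns an array indexed as out[y][x][filter].
    """
    H, W, C = len(neurons), len(neurons[0]), len(neurons[0][0])
    K = len(filters[0])  # assumed square K x K x C filters
    out = []
    for y in range(0, H - K + 1, stride):
        row = []
        for x in range(0, W - K + 1, stride):
            # One window: each filter produces one output neuron
            # as an inner product of window neurons and filter synapses.
            row.append([
                sum(neurons[y + ky][x + kx][c] * f[ky][kx][c]
                    for ky in range(K) for kx in range(K) for c in range(C))
                for f in filters
            ])
        out.append(row)
    return out


image = [[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]   # 3 x 3 x 1
ones = [[[1], [1]], [[1], [1]]]                               # 2 x 2 x 1 filter
result = conv3d_layer(image, [ones])
```

Here each of the four 2x2 windows yields one output value per filter, so `result` is a 2 x 2 x 1 array of window sums.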
DaDianNao Tile (NBin, SB, NBout). The DaDianNao pipeline consists of 16 inner product units, where each unit has 16 multipliers and an adder tree. Neurons and synapses are grouped into bricks of 16 elements. Each cycle, a brick of neurons is broadcast to all inner product units to be multiplied with the corresponding brick from each of 16 filters. After a whole window is processed, we are left with a brick of output neurons.
DaDianNao Tile. Tiles also include buffers for input and output neurons (NBin, NBout), alongside the synapse buffer (SB). The SB interface is 4Kb wide and the NBin interface is 256b wide. Data is stored in a 16-bit fixed point format. Computation is performed on an array of 16 inner product units, where each unit performs an inner product with 16 multipliers and an adder tree.
Inner Product Unit. Now consider the inner product unit from our baseline accelerator. [Figure: x16 multipliers, each taking 16-bit inputs and producing a 32-bit product, feeding an adder tree.]
Serial Inner Product Unit (SIP). If we take the inner product unit, convert the multipliers to serial, and do some straightforward optimizations, we get a Serial Inner Product unit (SIP). We also apply the scaling factor of 16, so the SIP takes as input 256 neurons bit-serially and 256 synapses bit-parallel. [Figure: x256 AND gates, each combining a 1-bit neuron input with a 16-bit synapse, feeding an adder tree and a shifting (<<) accumulator.]
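A behavioral sketch of the serial inner product may help here. It models only the arithmetic (AND, adder tree, shift-and-accumulate), not the hardware. The function name is hypothetical, and two simplifications are assumptions not taken from the slides: neurons are treated as unsigned, and bits are fed most significant bit first.

```python
def serial_inner_product(neurons, synapses, p):
    """Inner product with neurons consumed one bit per cycle.

    neurons: p-bit unsigned integers (assumption), fed MSB first (assumption).
    synapses: full-width integers, available bit-parallel every cycle.
    Returns (result, cycles); cycles equals the precision p.
    """
    if len(neurons) != len(synapses):
        raise ValueError("neuron and synapse vectors must have equal length")
    if any(not 0 <= n < (1 << p) for n in neurons):
        raise ValueError("every neuron must fit in p unsigned bits")
    acc = 0
    cycles = 0
    for bit in range(p - 1, -1, -1):
        # AND gates: a synapse passes through only if its neuron's current bit is 1;
        # the adder tree then sums the surviving synapses.
        partial = sum(s for n, s in zip(neurons, synapses) if (n >> bit) & 1)
        # Shift the running sum left by one and add the new partial sum.
        acc = (acc << 1) + partial
        cycles += 1
    return acc, cycles


neurons = [5, 3, 0, 7]
synapses = [2, -1, 4, 3]
result, cycles = serial_inner_product(neurons, synapses, p=3)
reference = sum(n * s for n, s in zip(neurons, synapses))   # 28
```

The result matches the ordinary inner product, and the cycle count equals p, which is where the dependence of execution time on precision comes from.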
Naïve Tile. I said that we need to scale the units by 16 to get the ideal 16/p performance, meaning 256 SIPs. If we naïvely scale the system vertically, we will need a 64Kb interface to the synapse buffer, which is undesirable.
Stripes Tile (NBin, SB, NBout). We want to maintain the memory interface of the baseline, so the SB keeps the same 4Kb interface width. Each SIP now takes 16 pairs of inputs. To compensate for the latency of the serial computation, we lay out more SIPs in parallel: we create a 16x16 array of SIPs where rows share synapses.
Stripes Data Mapping: bricks in the same position in different windows. In order to reuse synapses, we need to fetch neurons from the same position in different windows. Remember that neurons are transmitted bit-serially, so this bus contains one bit from each neuron in the brick. From the tile's perspective, we get the ideal 16/p throughput while only changing the compute pipeline.
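The 16/p throughput claim can be checked with a simple cycle-count model. This is a deliberately idealized sketch: the function names are hypothetical, and it assumes no memory stalls or pipeline fill, 16 filters per tile, a baseline that processes one neuron brick per cycle, and a Stripes tile that processes 16 windows at once and spends p cycles per brick.

```python
import math


def baseline_cycles(windows, bricks_per_window):
    """Baseline tile: one brick of one window per cycle (for 16 filters)."""
    return windows * bricks_per_window


def stripes_cycles(windows, bricks_per_window, p):
    """Stripes tile: 16 windows in parallel, p cycles per brick."""
    window_groups = math.ceil(windows / 16)
    return window_groups * bricks_per_window * p


base = baseline_cycles(windows=64, bricks_per_window=9)        # 576 cycles
serial = stripes_cycles(windows=64, bricks_per_window=9, p=9)  # 324 cycles
speedup = base / serial                                         # 16 / 9
```

In this model the speedup is exactly 16/p whenever the number of windows is a multiple of 16; with fewer windows some columns of the SIP array sit idle and the speedup drops below the ideal.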
Dispatcher: Providing Serial Neurons. So far we've looked at one tile, but DaDianNao is made up of 16 tiles, all fed by a central Neuron Memory (NM). Stripes adds a Dispatcher to the NM to generate the necessary stream of bit-serial neurons.
Dispatcher. The Dispatcher collects 16 bricks from the NM and streams the neurons out bit-serially, 256 bits per cycle. It maintains the wide (4Kb) interface to the neuron memory and uses the same number of interconnect wires as the baseline (256).
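Functionally, the Dispatcher turns 16 bricks of 16 neurons into p successive 256-bit words, each holding one bit from every neuron. The generator below sketches that transposition. The name is hypothetical, and the MSB-first order and unsigned neurons are assumptions made for the example.

```python
def dispatch_bit_serial(bricks, p):
    """Yield one list of bits per cycle: bit k of every neuron, MSB first.

    bricks: 16 bricks of 16 p-bit unsigned neurons each (256 neurons total).
    """
    neurons = [n for brick in bricks for n in brick]
    if any(not 0 <= n < (1 << p) for n in neurons):
        raise ValueError("every neuron must fit in p unsigned bits")
    for bit in range(p - 1, -1, -1):
        # One wire per neuron: 256 bits leave the Dispatcher each cycle.
        yield [(n >> bit) & 1 for n in neurons]


bricks = [[(b + i) % 16 for i in range(16)] for b in range(16)]
planes = list(dispatch_bit_serial(bricks, p=4))   # 4 cycles of 256 bits each
```

Reducing p shortens the stream by one cycle per bit removed, while the number of wires stays fixed at 256.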
Evaluation
- Caffe: analyze precision's effect on accuracy
- Custom simulator: baseline and Stripes
- Power/Area, tile compute logic + Dispatcher: synthesis with Synopsys Design Compiler, layout with Cadence Encounter for TSMC 65nm
- CACTI for SRAM buffers (NBin, NBout)
- Destiny for eDRAM (SB, NM)
Results
- Area & Power overheads
- Convolutions: Speedup, Energy
- Performance vs. Accuracy Trade-off
- Performance for all layers
Area and Power. SIP vs. IP: ≈ 1/8. SIP array (16x16) vs. IP array: ≈ 2x. Full-chip overhead: 32% area, 34% power. A SIP is approximately 1/8th the cost of an IP, in both area and power, so the 16x16 SIP array is approximately double the cost of the array of 16 IP units in the baseline. Overall, Stripes costs 32% more area and 34% more power than the baseline.
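The "approximately double" figure follows directly from the two numbers on the slide, as this short arithmetic check shows. The 1/8 ratio is the approximate value quoted above, and the names are hypothetical.

```python
from fractions import Fraction

# Cost of one SIP relative to one IP (approximate figure from the slide).
SIP_COST = Fraction(1, 8)


def array_cost_ratio(sip_rows=16, sip_cols=16, baseline_ips=16):
    """Cost of the SIP array relative to the baseline array of IP units."""
    return sip_rows * sip_cols * SIP_COST / baseline_ips


ratio = array_cost_ratio()   # 256 SIPs * 1/8 = 32 IP-equivalents, vs. 16 IPs
```

This covers only the compute arrays; the full-chip overheads of 32% area and 34% power are smaller because the rest of the chip, such as the memories, is included in that total.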
Convolution Layers: Speedup vs. Ideal. Avg. speedup = 2.24x, within ~2% of ideal. We will focus on convolutional layers. We plot speedup for all the networks, comparing the simulated speedup of Stripes (blue) to the ideal speedup (green). Speedups range from 5.3x for LeNet down to 1.35x for VGG_19. On average the speedup is 2.24x, which is within 2% of the ideal speedup.
Energy Efficiency. Avg. efficiency = 1.57x vs. Base WT. This graph shows energy efficiency, calculated as the relative energy to compute all the convolutional layers (E_base/E_str). We consider multiple scheduling schemes to optimize for energy. Baseline scheduling accesses the SB every cycle (efficiency = 1). Base Batch (blue) uses batching to reuse synapses read from the SB. Window Tiling (green) processes multiple windows to reuse synapses read from the SB. Stripes (red) is still more efficient, by 1.57x on average.
Performance vs. Accuracy Trade-off. Up until now we have been considering precisions that give us 100% of the baseline accuracy. What if we loosen this requirement? If we drop precisions beyond this critical point, how much more performance can we get? On the x axis is the network accuracy relative to the baseline full-precision accuracy. On the y axis is the incremental speedup relative to Stripes with 100% relative accuracy. At 99% relative accuracy: Speedup 2.48x (+11%), Efficiency +68%.
All layers: Speedup. Avg. (100%) = 1.92x; Avg. (99%) = 2.08x. Now let's consider the computation of all the layers. This graph shows speedup for the whole network when the relative accuracy is 100% (blue) and 99% (green). Stripes does not accelerate the other layers, so the speedup is diluted, mostly by the large fully connected layers of some networks, which are off-chip memory bound.
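The dilution described above is an instance of Amdahl's law: if only a fraction of the baseline execution time is accelerated, the remainder limits the overall gain. The sketch below illustrates the relationship with made-up inputs; the function name and the example fraction are hypothetical and are not measurements from this work.

```python
from fractions import Fraction


def overall_speedup(accelerated_fraction, layer_speedup):
    """Amdahl's law: whole-network speedup when only part of the time is sped up.

    accelerated_fraction: share of baseline time spent in accelerated layers,
    given as a Fraction or int to keep the arithmetic exact.
    """
    f = Fraction(accelerated_fraction)
    s = Fraction(layer_speedup)
    if not 0 <= f <= 1 or s <= 0:
        raise ValueError("fraction must be in [0, 1] and speedup positive")
    return 1 / ((1 - f) + f / s)


# Hypothetical network: 90% of baseline time in convolutions, sped up 2x.
example = overall_speedup(Fraction(9, 10), 2)   # Fraction(20, 11), about 1.82x
```

Even a small share of unaccelerated time caps the achievable speedup, which is consistent with the whole-network averages being lower than the convolution-only averages.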
Conclusion
- Stripes: practical SIMD that exploits precision for performance
- Without affecting classification accuracy: 1.92x performance, 1.57x energy efficiency, 32% area overhead
- On-the-fly per-bit precision tuning trades accuracy for additional performance: at 99% accuracy, 2.08x performance and 1.68x efficiency
Questions: patrick.judd@mail.utoronto.ca
This work is part of a larger effort to exploit numerical properties of neural networks in hardware. We invite you to read our other papers:
- Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets
- Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks
- Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing
- Stripes: Bit-Serial Deep Neural Network Computing