A Bit-Serial Method of Improving Computational Efficiency of Dot-Products
Distributed arithmetic (DA) is a bit-serial technique that greatly reduces the resource requirements of the dot-product calculation. It is so called because the arithmetic resources are not easily recognizable: “Where’s the MAC module?” It takes advantage of small tables of precomputed coefficient sums and a clever rearrangement of the math.
In signal processing, the most common operation is the dot product. DA lends itself well to FPGA implementation due to its use of lookup tables. DA can reduce gate count by 50%-80% in signal processing arithmetic!
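It turns out that the dot product is used extensively in DSP (FIR, FFT, etc.). Recall that the dot product is a sum of products: $y = \mathbf{A} \cdot \mathbf{x} = A_1 x_1 + A_2 x_2 + \dots + A_K x_K$. Written as a summation: $y = \sum_{k=1}^{K} A_k x_k$.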
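As a reference point for everything that follows, the sum of products can be sketched directly in Python. The function name `dot_product` is only illustrative; it is not taken from the slides.

```python
def dot_product(coeffs, inputs):
    """Direct sum of products: y = sum over k of A_k * x_k."""
    if len(coeffs) != len(inputs):
        raise ValueError("coefficient and input vectors must have the same length")
    # One multiply and one accumulate per term: this is the MAC that DA avoids.
    return sum(a_k * x_k for a_k, x_k in zip(coeffs, inputs))


if __name__ == "__main__":
    print(dot_product([1, 2, 3], [4, 5, 6]))
```

Each term costs one multiply-accumulate, which is the hardware DA sets out to replace.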
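Simple example: smoothing data via DSP (a low-pass filter), accomplished with an FIR filter. General form: $y[n] = \sum_{k=1}^{K} A_k \, x[n-k+1]$. So we could implement a 3-tap ($K=3$) moving-average filter: $y[n] = A_1 x[n] + A_2 x[n-1] + A_3 x[n-2]$. (In this special case, $A_1 = A_2 = A_3 = 1/3 \approx 0.33$.)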
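A minimal sketch of the moving-average filter, assuming three equal taps of exactly 1/3 and treating samples before the start of the data as zero (both are assumptions made for this illustration). `fir_filter` is a hypothetical helper name.

```python
from fractions import Fraction


def fir_filter(coeffs, samples):
    """FIR filter: out[n] = sum over k of coeffs[k] * samples[n - k].

    Samples before the start of the data are treated as zero.
    """
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, a_k in enumerate(coeffs):
            if n - k >= 0:
                acc += a_k * samples[n - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    taps = [Fraction(1, 3)] * 3  # three equal taps: a moving average
    print(fir_filter(taps, [3, 6, 9, 6, 3]))
```

Every output sample is itself a small dot product between the taps and the most recent inputs, which is why the filter is a natural target for DA.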
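Recall the goal: $y = \sum_{k=1}^{K} A_k x_k$. Each $x_k$ is a filter input (digital!), so let’s consider its two’s complement representation, scaled so that $|x_k| < 1$ for cleanliness: $x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n}$, where $b_{kn}$ is bit $n$ of $x_k$, $b_{k0}$ is the sign bit, and $N$ is the total number of bits. Putting them together: $y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right]$.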
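The scaled two's complement representation can be checked with a short sketch. It assumes the bits are given sign bit first, so that `bits[n]` carries weight 2^-n; the function name is illustrative.

```python
from fractions import Fraction


def twos_complement_fraction(bits):
    """Value of a scaled two's complement word.

    bits[0] is the sign bit (weight -1); bits[n] has weight 2**-n for n >= 1.
    """
    value = -Fraction(bits[0])
    for n, bit in enumerate(bits[1:], start=1):
        value += Fraction(bit, 2 ** n)
    return value


if __name__ == "__main__":
    print(twos_complement_fraction([0, 1, 1, 0]))  # 0.110 in binary
    print(twos_complement_fraction([1, 0, 1, 1]))  # 1.011 in binary
```

Exact fractions are used instead of floats so that the results can be compared without rounding concerns.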
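Expand the summation: $y = -\sum_{k=1}^{K} A_k b_{k0} + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n}$. Since each $b_{kn}$ is 0 or 1 (two possible values), the bracketed sum has only $2^K$ possible values. We can precompute every value of the terms that depend on the input bits ($b_{1n} \dots b_{Kn}$) and store them in a ROM of size $2^{K+1}$ (the $2^K$ sums plus their $2^K$ negatives for the sign-bit term). The bits of the $x$ inputs can then be used to address the ROM directly: a LUT!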
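The rearrangement can be sanity-checked numerically. The sketch below evaluates the expanded form bit position by bit position; the coefficient values and input words are made up for the example, and `x_bits[k][n]` is assumed to hold bit n of input k, sign bit first.

```python
from fractions import Fraction


def da_expanded(coeffs, x_bits):
    """Evaluate y = -sum_k A_k*b_k0 + sum_{n>=1} [sum_k A_k*b_kn] * 2**-n."""
    n_bits = len(x_bits[0])
    # Sign-bit term: the only one that enters with a negative weight.
    y = Fraction(-sum(a * bits[0] for a, bits in zip(coeffs, x_bits)))
    for n in range(1, n_bits):
        # Bracketed sum: depends only on one bit from each input word.
        bracket = sum(a * bits[n] for a, bits in zip(coeffs, x_bits))
        y += Fraction(bracket, 2 ** n)
    return y


if __name__ == "__main__":
    coeffs = [3, -2, 5, 1]          # illustrative integer coefficients
    x_bits = [[0, 1, 0, 1],         #  0.625
              [1, 0, 1, 1],         # -0.625
              [0, 0, 1, 0],         #  0.25
              [1, 1, 0, 0]]         # -0.5
    print(da_expanded(coeffs, x_bits))
```

The result matches the direct dot product of the same coefficients and input values, since only the order of summation has changed.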
Non-DA hardware implementation, based on the original equation. [Figure: block diagram built from 8-bit multiplier and 8-bit adder blocks.]
We said this is a ‘bit-serial’ technique, so how can we perform multiplication? Here, x is a 4-bit input and A is an 8-bit constant. Example multiplication: x = 1011 with an 8-bit A. [Figure: scaling accumulator. An AND with 1 parallel input (A) and 1 serial input (x) feeds the result register, which is shifted right by 1 each cycle.]
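A minimal software sketch of the shift-and-add multiplication, assuming x is unsigned and fed in LSB first. The value of A used in the demo is a made-up example, because the slide's value did not survive; the register is modeled wide enough that no bits fall off the right-hand end.

```python
def bit_serial_multiply(x, a, n_bits):
    """Multiply the parallel constant `a` by the serial input `x`, one bit per cycle."""
    acc = 0
    for _ in range(n_bits):
        bit = x & 1                    # serial input: current LSB of x
        partial = a if bit else 0      # AND of the parallel word with the serial bit
        # Shift the running result right by 1, then add the new partial product
        # at the top of the register.
        acc = (acc >> 1) + (partial << (n_bits - 1))
        x >>= 1
    return acc


if __name__ == "__main__":
    a = 0b01100100                     # hypothetical 8-bit constant (decimal 100)
    print(bit_serial_multiply(0b1011, a, 4))
```

No multiplier appears anywhere: each cycle needs only an AND, an add and a one-bit shift.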
So, now we substitute the scaling accumulator into our original design. Getting closer...
Let’s rearrange the hardware to match our expanded equation: we first sum the products of each input bit and its constant, then we add and scale each of those sums.
Now recall that we had the clever idea to use precomputed sums in a LUT for the bitwise addition:

Address   Data
0000      0
0001      $C_0$
0010      $C_1$
0011      $C_0 + C_1$
...       ...
1111      $C_0 + C_1 + C_2 + C_3$
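The LUT contents can be generated mechanically. This sketch assumes address bit i selects coefficient i, matching the 0001 and 0010 rows of the table; `build_da_lut` is an illustrative name and the coefficients are made up.

```python
def build_da_lut(coeffs):
    """Return a list of 2**K entries; entry `addr` is the sum of the
    coefficients whose address bit is set."""
    k = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(2 ** k)
    ]


if __name__ == "__main__":
    lut = build_da_lut([3, -2, 5, 1])  # illustrative values for C0..C3
    for addr, data in enumerate(lut):
        print(f"{addr:04b}  {data}")
```

For K = 4 this gives 16 entries, from the empty sum at address 0000 to the sum of all four coefficients at 1111.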
We need to accommodate the negative term, so we add one more address line to the LUT, called $T_s$. The ROM size is now $2^{K+1}$. $T_s$ is a timing signal: $T_s = 1$ during the sign-bit time, 0 otherwise. We also need this bit to know when the final result is ready. For all addresses with $T_s = 1$, the ROM contains the negative of the appropriate sum:

Address ($T_s$, bits)   Data
1 0001                  $-C_0$
...                     ...
1 1111                  $-(C_0 + C_1 + C_2 + C_3)$
This is an example of K=4 DA dot-product hardware. ROM size = $2^{K+1} = 2^5 = 32$. Here is our scaling accumulator. Switch SWA is put in position 2 after $T_s = 1$, at which point y contains the final result.
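The whole datapath can be mimicked in software. This is a sketch only, not a description of the slide's exact circuit: it assumes bits arrive LSB first, that the upper half of the ROM (T_s = 1) holds the negated sums, and that the accumulator halves its feedback each cycle. Names, coefficients and input words are illustrative.

```python
from fractions import Fraction


def build_rom(coeffs):
    """ROM of 2**(K+1) words: coefficient sums, then their negatives for Ts = 1."""
    k = len(coeffs)
    sums = [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(2 ** k)]
    return sums + [-s for s in sums]


def da_dot_product(coeffs, x_bits):
    """Bit-serial DA dot product. x_bits[k][n] is bit n of input k, sign bit first."""
    k = len(coeffs)
    n_bits = len(x_bits[0])
    rom = build_rom(coeffs)
    acc = Fraction(0)
    for n in range(n_bits - 1, -1, -1):       # LSB first, sign bit last
        ts = 1 if n == 0 else 0                # timing signal: sign-bit time
        addr = (ts << k) | sum(x_bits[i][n] << i for i in range(k))
        acc = acc / 2 + rom[addr]              # scaling accumulator: halve, then add
    return acc


if __name__ == "__main__":
    coeffs = [3, -2, 5, 1]
    x_bits = [[0, 1, 0, 1], [1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]]
    print(da_dot_product(coeffs, x_bits))
```

The loop runs once per bit, so an N-bit dot product takes N iterations regardless of K.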
DA computes an N-bit dot product in N cycles, with reduced area and high speed due to the ROM. However, it requires a ROM of size $2^{K+1}$, which grows exponentially with the number of input lines. With K = 16 input lines, that is $2^{17}$ = 128K words of ROM!
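The growth is easy to tabulate. The sketch below counts ROM words only, assuming the 16 on the slide refers to the number of ROM address inputs K; word width is left out.

```python
def da_rom_words(k):
    """Words in a DA ROM with K input lines plus the Ts address line."""
    return 2 ** (k + 1)


if __name__ == "__main__":
    for k in (4, 8, 12, 16):
        print(f"K = {k:2d}: {da_rom_words(k)} words")
```

Each extra input line doubles the table, which is the main practical limit on a single-ROM design.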
Bit-serial means an N-bit dot product requires N cycles... slower than parallel? One HW multiplier per product is not generally practical due to large area and power! Time-multiplexing a single parallel HW multiplier means you lose the speed gain: N cycles for DA vs K cycles for the shared multiplier. Example: K=8, N=8 takes the same time on time-multiplexed parallel HW as on bit-serial DA.
We can reduce the ROM size to $2^K$ with some tricks: replace the adder with an adder/subtractor, and $T_s$ becomes the control line for the adder/subtractor. The ROM size is reduced by half. There are other math tricks to reduce the size further, to $2^{K-1}$.
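A sketch of the adder/subtractor variant, under the same assumptions as before (LSB-first bits, illustrative names and values). The ROM holds only the 2^K positive sums, and the sign-bit cycle subtracts instead of adding. The further reduction to 2^(K-1) is not shown.

```python
from fractions import Fraction


def coefficient_sums(coeffs):
    """ROM of 2**K words holding only the (non-negated) coefficient sums."""
    k = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(2 ** k)]


def da_dot_product_addsub(coeffs, x_bits):
    """DA dot product where Ts controls an adder/subtractor instead of the ROM."""
    k = len(coeffs)
    n_bits = len(x_bits[0])
    rom = coefficient_sums(coeffs)
    acc = Fraction(0)
    for n in range(n_bits - 1, -1, -1):        # LSB first, sign bit last
        addr = sum(x_bits[i][n] << i for i in range(k))
        if n == 0:                              # Ts = 1: subtract at sign-bit time
            acc = acc / 2 - rom[addr]
        else:                                   # Ts = 0: ordinary accumulate
            acc = acc / 2 + rom[addr]
    return acc


if __name__ == "__main__":
    coeffs = [3, -2, 5, 1]
    x_bits = [[0, 1, 0, 1], [1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]]
    print(len(coefficient_sums(coeffs)), da_dot_product_addsub(coeffs, x_bits))
```

The result is unchanged while the table drops from 32 to 16 words for K = 4.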
Speed is determined by the serial nature of the input: 1 bit at a time (1 BAAT). We can expand the HW to process multiple bits at a time. Introduce the input as bit pairs ($x_{10} x_{11}$, $x_{12} x_{13}$, etc.). Shift the LSB-of-pair result by 1. Shift the accumulator feedback by 2. This requires 2 ROMs instead of 1.
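The two-bits-at-a-time scheme can be sketched as follows, assuming an even word length, two ROMs with identical contents, and subtraction at sign-bit time. All names and values are illustrative.

```python
from fractions import Fraction


def coefficient_sums(coeffs):
    """ROM of 2**K words holding the coefficient sums."""
    k = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(2 ** k)]


def da_two_bits_at_a_time(coeffs, x_bits):
    """2-BAAT DA dot product. Returns (result, cycles used)."""
    k = len(coeffs)
    n_bits = len(x_bits[0])
    if n_bits % 2:
        raise ValueError("this sketch assumes an even word length")
    rom_msb = coefficient_sums(coeffs)     # ROM addressed by the MSB of each pair
    rom_lsb = list(rom_msb)                # second ROM, same contents
    acc = Fraction(0)
    cycles = 0
    for n in range(n_bits - 2, -1, -2):    # bit pairs (n, n+1), least significant first
        addr_msb = sum(x_bits[i][n] << i for i in range(k))
        addr_lsb = sum(x_bits[i][n + 1] << i for i in range(k))
        msb_term = rom_msb[addr_msb]
        if n == 0:
            msb_term = -msb_term           # the sign bit sits in the last pair
        # LSB-of-pair result shifted by 1 (halved) before joining the MSB result.
        pair = msb_term + Fraction(rom_lsb[addr_lsb], 2)
        acc = acc / 4 + pair               # accumulator feedback shifted by 2
        cycles += 1
    return acc, cycles


if __name__ == "__main__":
    coeffs = [3, -2, 5, 1]
    x_bits = [[0, 1, 0, 1], [1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]]
    print(da_two_bits_at_a_time(coeffs, x_bits))
```

The cycle count halves to N/2 at the cost of a second ROM and an extra adder, which is the speed-for-area trade the slide describes.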
DA lends itself easily to DSP because of its easy application to the dot product. DA is easily implementable on an FPGA because of the similar architecture: LUTs (of course, it is better still on custom hardware). DA is not limited to the dot product; it will work for any algorithm where precomputed values can be leveraged.
DA is a very efficient means of mechanizing the dot product. The use of DA can save 50-80% area over the parallel approach. Like everything, DA has tradeoffs: ROM size vs. input lines, and speed vs. area (multi-ROM).
Application of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review. White, Stanley. IEEE ASSP Magazine, July 1989. (I pulled most of the basic talk info from here.)
Parallel and Pipelined Architecture Designs for Distributed Arithmetic-Based Recursive Digital Filters. Hwang, H. and Su, C. IEEE Xplore, VLSI Signal Processing IX. (This has some slight remarks about bit-parallel vs. bit-serial, plus an auto-regressive moving-average filter example.)
Distributed Arithmetic for Efficient Base-Band Processing in Real-Time GNSS Software Receivers. Waelchli, G. et al. Journal of Electrical and Computer Engineering, volume 2010. (Application to GPS.)
An FPGA-Based Parallel Distributed Arithmetic Implementation of the 1-D Discrete Wavelet Transform. Al-Haj, Ali. Informatica 29 (2005). (DSP example using a Virtex FPGA.)