A Timing-Driven Hybrid-Compression Algorithm for Faster Sum-of-Products
Sabyasachi Das, Synplicity Inc.
Sunil P. Khatri, Texas A&M University
What is a Sum-of-Products?
An IC block that adds multiple product and sum terms
Computationally intensive
Widely used in DSP, graphics and microprocessors
Example: p = a * b, q = c * d, z = p + q + e + f
[Figure: SOP block computing z from inputs a, b, c, d, e and f via the products p and q]
Examples of Sum-of-Products Blocks
Multiplication: assign z = a * b;
MAC: assign z = (a * b) + c;
2-operand addition: assign z = a + b;
Squarer: assign z = a * a;
Adder tree: assign z = a + b + c + d;
Generalized SOP: assign z = (a * b) + (c * d) + (e * f) + g + h + k;
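All of the blocks listed above are special cases of one computation. The Python sketch below is a behavioral reference model only: the helper `sop` is a hypothetical name (not from the paper) that sums a list of product terms and a list of plain addends. It says what the hardware computes, not how it is built.

```python
from math import prod
from typing import Sequence, Tuple


def sop(products: Sequence[Tuple[int, ...]], addends: Sequence[int] = ()) -> int:
    """Hypothetical behavioral model of a generalized sum-of-products:
    z = (sum of every product term) + (sum of the plain addends)."""
    return sum(prod(term) for term in products) + sum(addends)


a, b, c, d, e, f, g, h, k = 3, 5, 7, 2, 4, 6, 1, 8, 9

# Each block from the slide expressed as a special case of the generalized SOP.
assert sop([(a, b)]) == a * b                     # multiplication
assert sop([(a, b)], [c]) == (a * b) + c          # MAC
assert sop([], [a, b]) == a + b                   # 2-operand addition
assert sop([(a, a)]) == a * a                     # squarer
assert sop([], [a, b, c, d]) == a + b + c + d     # adder tree
assert sop([(a, b), (c, d), (e, f)], [g, h, k]) == (a * b) + (c * d) + (e * f) + g + h + k

print(sop([(a, b), (c, d), (e, f)], [g, h, k]))   # generalized SOP: 15 + 14 + 24 + 18
```

With the sample values above, the generalized SOP evaluates to 71.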
Structure of Sum-of-Products
A Sum-of-Products block consists of 3 parts (listed in data-flow order):
Partial Product Generator (PPGen)
Partial Product Reduction Tree (PPRT)
Final Carry-Propagation Adder (CPA)
[Figure: Inputs → PPGen → PPRT → CPA → Output]
Partial Product Reduction Tree
The PPRT reduces the number of elements in each bit column to at most two
The PPRT accounts for more than 50% of the delay of the SOP block
Hence, the performance of the PPRT is crucial to the performance of the SOP block
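The three stages, and the PPRT's goal of leaving at most two elements per column, can be illustrated with a bit-accurate Python sketch of an unsigned n × n multiplier, the simplest SOP. The helper names `ppgen`, `pprt` and `cpa` are hypothetical. The PPRT here is a plain greedy (3:2)/(2:2) reduction with no timing awareness, not the paper's algorithm, and the CPA is assumed to be a simple ripple-carry adder.

```python
from collections import defaultdict
from typing import Dict, List

Columns = Dict[int, List[int]]  # column index (bit weight) -> bits in that column


def ppgen(a: int, b: int, n: int) -> Columns:
    """PPGen: AND-array partial products of an unsigned n-bit a * b.
    Bit a_j AND b_k has weight 2**(j + k), so it goes to column j + k."""
    cols: Columns = defaultdict(list)
    for j in range(n):
        for k in range(n):
            cols[j + k].append((a >> j) & (b >> k) & 1)
    return cols


def pprt(cols: Columns) -> Columns:
    """PPRT: reduce every column to at most two bits, LSB to MSB.
    Uses a (3:2) full adder while 4+ bits remain and a (2:2) half adder
    when exactly 3 remain; each carry moves to the next column."""
    i = 0
    while i <= max(cols):  # max() is re-evaluated as carries create new columns
        bits = cols[i]
        while len(bits) > 2:
            if len(bits) >= 4:
                x, y, z = bits.pop(), bits.pop(), bits.pop()
                bits.append(x ^ y ^ z)                            # S_i
                cols[i + 1].append((x & y) | (x & z) | (y & z))   # C_{i+1}
            else:
                x, y = bits.pop(), bits.pop()
                bits.append(x ^ y)                                # S_i
                cols[i + 1].append(x & y)                         # C_{i+1}
        i += 1
    return cols


def cpa(cols: Columns) -> int:
    """CPA: add the two remaining rows with a ripple-carry adder."""
    result, carry = 0, 0
    for i in range(max(cols) + 2):  # one extra position for the final carry
        total = sum(cols.get(i, [])) + carry  # at most 2 bits + carry-in
        result |= (total & 1) << i
        carry = total >> 1
    return result


print(cpa(pprt(ppgen(13, 11, 4))))
```

The call above prints 143 (= 13 × 11). Every counter preserves the weighted sum of its column (S + 2C equals the number of ones at its inputs), which is why the CPA recovers the exact product.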
Two Existing Reduction Counters in the PPRT
(2:2) Counter (half adder): reduces 2 inputs (a_i and b_i) to 2 outputs (S_i and C_{i+1})
(3:2) Counter (full adder): reduces 3 inputs (a_i, b_i and c_i) to 2 outputs (S_i and C_{i+1})
[Figure: symbols of the (2:2) and (3:2) counters]
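Both counters output the binary count of the ones at their inputs, with S_i in column i and C_{i+1} carrying weight 2 into column i+1. The Python sketch below is a logic-level model (the function names are hypothetical) that checks this property exhaustively; it says nothing about the gate-level implementation used in the paper.

```python
from itertools import product
from typing import Tuple


def counter_2_2(a: int, b: int) -> Tuple[int, int]:
    """(2:2) counter (half adder): returns (S_i, C_{i+1})."""
    return a ^ b, a & b


def counter_3_2(a: int, b: int, c: int) -> Tuple[int, int]:
    """(3:2) counter (full adder): returns (S_i, C_{i+1})."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)  # majority of the three inputs
    return s, carry


# Check that S_i + 2 * C_{i+1} equals the number of ones at the inputs.
for bits in product((0, 1), repeat=2):
    s, c1 = counter_2_2(*bits)
    assert s + 2 * c1 == sum(bits)
for bits in product((0, 1), repeat=3):
    s, c1 = counter_3_2(*bits)
    assert s + 2 * c1 == sum(bits)
print("2:2 and 3:2 counters verified")
```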
(4:3) Reduction Counter
(4:3) Counter: reduces 4 inputs (a_i, b_i, c_i and d_i) to 3 outputs (S_i, C_{i+1} and C_{i+2})
The function of C_{i+2} is a 4-input AND gate
Faster reduction of column i (4 elements in, only 1 element left in column i)
Produces an element for the (i+2)th column earlier
Has a larger area than the other two counters
[Figure: symbol of the (4:3) counter]
Key idea: use the (4:3) counter as much as possible, in conjunction with the (3:2) and (2:2) counters
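The (4:3) counter can be modeled the same way: its three outputs encode the count of ones (0 to 4) as S_i + 2·C_{i+1} + 4·C_{i+2}. Since a count of 4 means every input is 1, C_{i+2} reduces to a 4-input AND, matching the slide. This Python sketch is a logic-level model with a hypothetical function name; the slide does not give gate-level equations for S_i and C_{i+1}, so the expressions below are simply one correct formulation, not the authors' circuit.

```python
from itertools import product
from typing import Tuple


def counter_4_3(a: int, b: int, c: int, d: int) -> Tuple[int, int, int]:
    """(4:3) counter: returns (S_i, C_{i+1}, C_{i+2}) such that
    S_i + 2*C_{i+1} + 4*C_{i+2} is the number of ones among a, b, c, d."""
    s = a ^ b ^ c ^ d              # parity of the four inputs
    c2 = a & b & c & d             # count == 4: a 4-input AND
    # Count is 2 or 3: at least one pair of inputs is set, but not all four.
    pairs = (a & b) | (a & c) | (a & d) | (b & c) | (b & d) | (c & d)
    c1 = pairs & (1 - c2)
    return s, c1, c2


for bits in product((0, 1), repeat=4):
    s, c1, c2 = counter_4_3(*bits)
    assert s + 2 * c1 + 4 * c2 == sum(bits)
    assert c2 == int(all(bits))    # C_{i+2} is exactly a 4-input AND
print("4:3 counter verified")
```

The single AND level feeding C_{i+2} is consistent with the slide's point that the (4:3) counter delivers an element to column i+2 early.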
Explanation of Our Approach
Perform column-wise reduction, from LSB to MSB
For each column (or BitSlice/BitCluster), until at most two elements remain:
Sort the inputs by arrival time
Is a (2:2) reduction fast? If yes, instantiate it
Otherwise, is a (3:2) reduction fast? If yes, instantiate it
Otherwise, instantiate a (4:3) reduction
After each reduction, re-sort the signals and continue
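The selection loop above can be sketched in Python as a timing-only model. Several pieces are assumptions, not taken from the paper: the counter delays in `DELAYS` are made-up unit values, a counter is assumed to start when its last input arrives, and because the slide does not define "fast", the hypothetical `default_is_fast` treats a counter as fast if its sum is ready no later than the latest signal already in the column. The predicate is a parameter so a real timing model can be plugged in; `reduce_columns` is likewise a hypothetical name.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical output delays in arbitrary units; NOT taken from the paper.
# "S" stays in column i, "C1" goes to column i+1, "C2" goes to column i+2.
DELAYS: Dict[str, Dict[str, float]] = {
    "2:2": {"S": 1.0, "C1": 0.5},
    "3:2": {"S": 2.0, "C1": 1.5},
    "4:3": {"S": 2.5, "C1": 2.5, "C2": 1.0},
}
ARITY = {"2:2": 2, "3:2": 3, "4:3": 4}
OFFSET = {"S": 0, "C1": 1, "C2": 2}


def default_is_fast(counter: str, column: List[float]) -> bool:
    """Hypothetical "fast" test: fed by the earliest signals of the sorted
    column, the counter's sum must be ready no later than the latest signal
    already in the column (so it does not extend the column's arrival)."""
    k = ARITY[counter]
    if len(column) < k:
        return False
    return max(column[:k]) + DELAYS[counter]["S"] <= column[-1]


def reduce_columns(
    columns: Dict[int, List[float]],
    is_fast: Callable[[str, List[float]], bool] = default_is_fast,
) -> Tuple[Dict[int, List[float]], List[Tuple[int, str]]]:
    """Reduce each column of arrival times to at most two, LSB to MSB.
    Returns the reduced columns and the (column, counter) instances chosen."""
    cols: Dict[int, List[float]] = defaultdict(list)
    for idx, times in columns.items():
        cols[idx].extend(times)
    chosen: List[Tuple[int, str]] = []
    i = min(cols, default=0)
    while i <= max(cols, default=-1):      # carries may create new columns
        col = sorted(cols[i])              # sort by arrival time
        while len(col) > 2:
            if is_fast("2:2", col):
                counter = "2:2"
            elif is_fast("3:2", col):
                counter = "3:2"
            else:
                # (4:3) needs four inputs; with only three left, fall back to (3:2).
                counter = "4:3" if len(col) >= 4 else "3:2"
            k = ARITY[counter]
            start = max(col[:k])           # assumed: starts when its last input arrives
            col = col[k:]
            for out, delay in DELAYS[counter].items():
                target = i + OFFSET[out]
                if target == i:
                    col.append(start + delay)
                else:
                    cols[target].append(start + delay)
            col.sort()                     # re-sort after each reduction
            chosen.append((i, counter))
        cols[i] = col
        i += 1
    return dict(cols), chosen
```

Under these made-up delays, `reduce_columns({0: [0.0, 0.0, 0.0, 0.0]})` finds neither smaller counter fast, instantiates one (4:3) counter and returns `{0: [2.5], 1: [2.5], 2: [1.0]}`; with inputs `[0.0, 0.0, 5.0]`, a (2:2) counter is fast enough and is chosen instead.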
An Example of Our Approach
[Figure: worked example on a partial-product array of six rows of eight bits (P00 to P57), annotated with counter outputs S00, S01, S02, S10, C01, C02, C03, C11, C12 and C13]
Results
On average, our approach improves speed by about 3.5% at an area penalty of 4.3%
Summary
A (4:3) reduction counter is designed
It reduces the elements in a given column at a faster pace
It produces an element for the (i+2)th column earlier
The (4:3) counter is used extensively, in conjunction with the existing (3:2) and (2:2) counters
A timing-driven algorithm selects which type of counter to instantiate
On average, a 3.5% improvement in speed with a 4.3% area penalty
Thank you