Combinational Circuits and Simple Synchronous Pipelines Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 20, 2008 http://csg.csail.mit.edu/6.375
Bluespec: Two-Level Compilation (Objects, Types, Higher-order functions) Lennart Augustsson @Sandburst 2000-2002 Type checking Massive partial evaluation and static elaboration Level 1 compilation Rules and Actions (Term Rewriting System) Now we call this Guarded Atomic Actions Rule conflict analysis Rule scheduling Level 2 synthesis James Hoe & Arvind @MIT 1997-2000 Object code (Verilog/C) http://csg.csail.mit.edu/6.375 February 20, 2008
Static Elaboration At compile time Software Toolflow: Hardware Inline function calls and unroll loops Instantiate modules with specific parameters Resolve polymorphism/overloading, perform most data structure operations source Software Toolflow: source Hardware Toolflow: design2 design3 design1 elaborate w/params .exe compile run1 run2.1 … run3.1 run1.1 run w/ params run w/ params run1 … run http://csg.csail.mit.edu/6.375 February 20, 2008
Combinational IFFT … * + - *j t2 t0 t3 t1 Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute * + - *j t2 t0 t3 t1 Constants t0 to t3 are different for each box and can have dramatci impact on optimizations. All numbers are complex and represented as two sixteen bit quantities. Fixed-point arithmetic is used to reduce area, power, ... http://csg.csail.mit.edu/6.375 February 20, 2008
4-way Butterfly Node BSV has a very strong notion of types * + - *i t0 k0 k1 k2 k3 function Vector#(4,Complex) bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k); BSV has a very strong notion of types Every expression has a type. Either it is declared by the user or automatically deduced by the compiler The compiler verifies that the type declarations are compatible http://csg.csail.mit.edu/6.375 February 20, 2008
BSV code: 4-way Butterfly function Vector#(4,Complex) bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k); Vector#(4,Complex) m, y, z; m[0] = k[0] * t[0]; m[1] = k[1] * t[1]; m[2] = k[2] * t[2]; m[3] = k[3] * t[3]; y[0] = m[0] + m[2]; y[1] = m[0] – m[2]; y[2] = m[1] + m[3]; y[3] = i*(m[1] – m[3]); z[0] = y[0] + y[2]; z[1] = y[1] + y[3]; z[2] = y[0] – y[2]; z[3] = y[1] – y[3]; return(z); endfunction * + - *i m y z Polymorphic code: works on any type of numbers for which *, + and - have been defined Note: Vector does not mean storage http://csg.csail.mit.edu/6.375 February 20, 2008
Combinational IFFT … stage_f function repeat it three times in0 in1 Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute stage_f function repeat it three times http://csg.csail.mit.edu/6.375 February 20, 2008
BSV Code: Combinational IFFT function Vector#(64, Complex) ifft (Vector#(64, Complex) in_data); //Declare vectors Vector#(4,Vector#(64, Complex)) stage_data; stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]); return(stage_data[3]); The for loop is unfolded and stage_f is inlined during static elaboration Note: no notion of loops or procedures during execution http://csg.csail.mit.edu/6.375 February 20, 2008
BSV Code: Combinational IFFT- Unfolded function Vector#(64, Complex) ifft (Vector#(64, Complex) in_data); //Declare vectors Vector#(4,Vector#(64, Complex)) stage_data; stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]); return(stage_data[3]); stage_data[1] = stage_f(0,stage_data[0]); stage_data[2] = stage_f(1,stage_data[1]); stage_data[3] = stage_f(2,stage_data[2]); Stage_f can be inlined now; it could have been inlined before loop unfolding also. Does the order matter? http://csg.csail.mit.edu/6.375 February 20, 2008
Bluespec Code for stage_f function Vector#(64, Complex) stage_f (Bit#(2) stage, Vector#(64, Complex) stage_in); begin for (Integer i = 0; i < 16; i = i + 1) Integer idx = i * 4; let twid = getTwiddle(stage, fromInteger(i)); let y = bfly4(twid, stage_in[idx:idx+3]); stage_temp[idx] = y[0]; stage_temp[idx+1] = y[1]; stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3]; end //Permutation for (Integer i = 0; i < 64; i = i + 1) stage_out[i] = stage_temp[permute[i]]; return(stage_out); twid’s are mathematically derivable constants http://csg.csail.mit.edu/6.375 February 20, 2008
Suppose we want to reuse some part of the circuit ... Reuse the same circuit three times to reduce area in0 … in1 in2 in63 in3 in4 Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute But why? http://csg.csail.mit.edu/6.375 February 20, 2008
Architectural Exploration: Area-Performance tradeoff in 802.11a Transmitter February 20, 2008 http://csg.csail.mit.edu/6.375
802.11a Transmitter Overview headers Must produce one OFDM symbol every 4 msec Controller Scrambler Encoder 24 Uncoded bits data Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol Interleaver Mapper IFFT Cyclic Extend One OFDM symbol (64 Complex Numbers) IFFT Transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers http://csg.csail.mit.edu/6.375 accounts for 85% area February 20, 2008
Preliminary results [MEMOCODE 2006] Dave, Gerding, Pellauer, Arvind Design Lines of Relative Block Code (BSV) Area Controller 49 0% Scrambler 40 0% Conv. Encoder 113 0% Interleaver 76 1% Mapper 112 11% IFFT 95 85% Cyc. Extender 23 3% Complex arithmetic libraries constitute another 200 lines of code http://csg.csail.mit.edu/6.375 February 20, 2008
Combinational IFFT … Reuse the same circuit three times to reduce area Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute http://csg.csail.mit.edu/6.375 February 20, 2008
Design Alternatives Reuse a block over multiple cycles we expect: f g f g we expect: Throughput to Area to decrease – less parallelism decrease – reusing a block The clock needs to run faster for the same throughput hyper-linear increase in energy http://csg.csail.mit.edu/6.375 February 20, 2008
Circular pipeline: Reusing the Pipeline Stage … in1 in2 in63 in3 in4 out0 … out1 out2 out63 out3 out4 … Bfly4 Permute Stage Counter http://csg.csail.mit.edu/6.375 February 20, 2008
Superfolded circular pipeline: Just one Bfly-4 node! … in1 in2 in63 in3 in4 out0 … out1 out2 out63 out3 out4 Permute Bfly4 64, 2-way Muxes Stage 0 to 2 4, 16-way Muxes Index: 0 to 15 4, 16-way DeMuxes Index == 15? http://csg.csail.mit.edu/6.375 February 20, 2008
Pipelining a block inQ outQ f2 f1 f3 Combinational C inQ outQ f2 f1 f3 Pipeline P inQ outQ f Folded Pipeline FP Clock: C < P FP Clock? Area? Throughput? Area: FP < C < P Throughput: FP < C < P http://csg.csail.mit.edu/6.375 February 20, 2008
Synchronous pipeline x sReg1 inQ f1 f2 f3 sReg2 outQ rule sync-pipeline (True); inQ.deq(); sReg1 <= f1(inQ.first()); sReg2 <= f2(sReg1); outQ.enq(f3(sReg2)); endrule This rule can fire only if - inQ has an element - outQ has space Atomicity: Either all or none of the state elements inQ, outQ, sReg1 and sReg2 will be updated This is real IFFT code; just replace f1, f2 and f3 with stage_f code http://csg.csail.mit.edu/6.375 February 20, 2008
Stage functions f1, f2 and f3 function f1(x); return (stage_f(1,x)); endfunction function f2(x); return (stage_f(2,x)); function f3(x); return (stage_f(3,x)); The stage_f function was given earlier http://csg.csail.mit.edu/6.375 February 20, 2008
Problem: What about pipeline bubbles? x sReg1 inQ f1 f2 f3 sReg2 outQ rule sync-pipeline (True); inQ.deq(); sReg1 <= f1(inQ.first()); sReg2 <= f2(sReg1); outQ.enq(f3(sReg2)); endrule Red and Green tokens must move even if there is nothing in the inQ! Also if there is no token in sReg2 then nothing should be enqueued in the outQ Valid bits or the Maybe type Modify the rule to deal with these conditions http://csg.csail.mit.edu/6.375 February 20, 2008
The Maybe type data in the pipeline typedef union tagged { void Invalid; data_T Valid; } Maybe#(type data_T); data valid/invalid Registers contain Maybe type values rule sync-pipeline (True); if (inQ.notEmpty()) begin sReg1 <= Valid f1(inQ.first()); inQ.deq(); end else sReg1 <= Invalid; case (sReg1) matches tagged Valid .sx1: sReg2 <= Valid f2(sx1); tagged Invalid: sReg2 <= Invalid; case (sReg2) matches tagged Valid .sx2: outQ.enq(f3(sx2)); endrule sx1 will get bound to the appropriate part of sReg1 http://csg.csail.mit.edu/6.375 February 20, 2008
Folded pipeline x sReg inQ f outQ stage The same code will work for superfolded pipelines by changing n and stage function f rule folded-pipeline (True); if (stage==0) begin sxIn= inQ.first(); inQ.deq(); end else sxIn= sReg; sxOut = f(stage,sxIn); if (stage==n-1) outQ.enq(sxOut); else sReg <= sxOut; stage <= (stage==n-1)? 0 : stage+1; endrule no for-loop notice stage is a dynamic parameter now! Need type declarations for sxIn and sxOut http://csg.csail.mit.edu/6.375 February 20, 2008
802.11a Transmitter Synthesis results (Only the IFFT block is changing) IFFT Design Area (mm2) ThroughputLatency (CLKs/sym) Min. Freq Required Pipelined 5.25 04 1.0 MHz Combinational 4.91 Folded (16 Bfly-4s) 3.97 Super-Folded (8 Bfly-4s) 3.69 06 1.5 MHz SF(4 Bfly-4s) 2.45 12 3.0 MHz SF(2 Bfly-4s) 1.84 24 6.0 MHz SF (1 Bfly4) 1.52 48 12 MHZ All these designs were done in less than 24 hours! The same source code TSMC .18 micron; numbers reported are before place and route. http://csg.csail.mit.edu/6.375 February 20, 2008
Why are the areas so similar Folding should have given a 3x improvement in IFFT area BUT a constant twiddle allows low-level optimization on a Bfly-4 block a 2.5x area reduction! http://csg.csail.mit.edu/6.375 February 20, 2008
Language notes Pattern matching syntax Vector syntax Implicit conditions http://csg.csail.mit.edu/6.375 February 20, 2008
Pattern-matching: A convenient way to extract datastructure components typedef union tagged { void Invalid; t Valid; } Maybe#(type t); case (m) matches tagged Invalid : return 0; tagged Valid .x : return x; endcase x will get bound to the appropriate part of m if (m matches (Valid .x) &&& (x > 10)) The &&& is a conjunction, and allows pattern-variables to come into scope from left to right http://csg.csail.mit.edu/6.375 February 20, 2008
Syntax: Vector of Registers suppose x and y are both of type Reg. Then x <= y means x._write(y._read()) Vector of Int x[i] means sel(x,i) x[i] = y[j] means x = update(x,i, sel(y,j)) Vector of Registers x[i] <= y[j] does not work. The parser thinks it means (sel(x,i)._read)._write(sel(y,j)._read), which will not type check (x[i]) <= y[j] parses as sel(x,i)._write(sel(y,j)._read), and works correctly Don’t ask me why http://csg.csail.mit.edu/6.375 February 20, 2008
Making guards explicit rule recirculate (True); if (p) fifo.enq(8); r <= 7; endrule rule recirculate ((p && fifo.enqG) || !p); if (p) fifo.enqB(8); r <= 7; endrule Effectively, all implicit conditions (guards) are lifted and conjoined to the rule guard http://csg.csail.mit.edu/6.375 February 20, 2008
Implicit guards (conditions) Rule rule <name> (<guard>); <action>; endrule where <action> ::= r <= <exp> | m.g(<exp>) | if (<exp>) <action> endif | <action> ; <action> m.gB(<exp>) when m.gG make implicit guards explicit http://csg.csail.mit.edu/6.375 February 20, 2008
Guards vs If’s A guard on one action of a parallel group of actions affects every action within the group (a1 when p1); (a2 when p2) ==> (a1; a2) when (p1 && p2) A condition of a Conditional action only affects the actions within the scope of the conditional action (if (p1) a1); a2 p1 has no effect on a2 ... Mixing ifs and whens (if (p) (a1 when q)) ; a2 ((if (p) a1); a2) when ((p && q) | !p) http://csg.csail.mit.edu/6.375 February 20, 2008