Combinational Circuits in Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 9, 2011L03-1
Bluespec: Two-Level Compilation Object code (Verilog/C) Rules and Actions (Term Rewriting System) Rule conflict analysis Rule scheduling James Hoe & Arvind Bluespec (Objects, Types, Higher-order functions) Level 1 compilation Type checking Massive partial evaluation and static elaboration Level 2 synthesis Lennart Augustsson Now we call this Guarded Atomic Actions February 9, 2011 L03-2http://csg.csail.mit.edu/6.375
Static Elaboration.exe compile design2design3design1 elaborate w/params run1 run2.1 … run1 run3.1 … run1 run1.1 … run w/ params run w/ params run1 … run At compile time Inline function calls and unroll loops Instantiate modules with specific parameters Resolve polymorphism/overloading, perform most data structure operations source Software Toolflow: source Hardware Toolflow: February 9, 2011 L03-3http://csg.csail.mit.edu/6.375
Combinational IFFT in0 … in1 in2 in63 in3 in4 Bfly4 x16 Bfly4 … … out0 … out1 out2 out63 out3 out4 Permute All numbers are complex and represented as two sixteen bit quantities. Fixed-point arithmetic is used to reduce area, power,... * * * * *j t2t2 t0t0 t3t3 t1t1 February 9, 2011 L03-4http://csg.csail.mit.edu/6.375
4-way Butterfly Node function Vector#(4,Complex) bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k); BSV has a very strong notion of types Every expression has a type. Either it is declared by the user or automatically deduced by the compiler The compiler verifies that the type declarations are compatible * * * * *i t0t1t2t3t0t1t2t3 k0k1k2k3k0k1k2k3 February 9, 2011 L03-5http://csg.csail.mit.edu/6.375
BSV code: 4-way Butterfly function Vector#(4,Complex) bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k); Vector#(4,Complex) m, y, z; m[0] = k[0] * t[0]; m[1] = k[1] * t[1]; m[2] = k[2] * t[2]; m[3] = k[3] * t[3]; y[0] = m[0] + m[2]; y[1] = m[0] – m[2]; y[2] = m[1] + m[3]; y[3] = i*(m[1] – m[3]); z[0] = y[0] + y[2]; z[1] = y[1] + y[3]; z[2] = y[0] – y[2]; z[3] = y[1] – y[3]; return(z); endfunction Polymorphic code: works on any type of numbers for which *, + and - have been defined * * * * *i my z Note: Vector does not mean storage February 9, 2011 L03-6http://csg.csail.mit.edu/6.375
Complex Arithmetic Addition z R = x R + y R z I = x I + y I Multiplication z R = x R * y R - x I * y I z I = x R * y I + x I * y R The actual arithmetic for FFT is different because we use a non-standard fixed point representation February 9, 2011 L03-7http://csg.csail.mit.edu/6.375
BSV code for Addition typedef struct{ Int#(t) r; Int#(t) i; } Complex#(numeric type t) deriving (Eq,Bits); function Complex#(t) \+ (Complex#(t) x, Complex#(t) y); Int#(t) real = x.r + y.r; Int#(t) imag = x.i + y.i; return(Complex{r:real, i:imag}); endfunction What is the type of this + ? February 9, 2011 L03-8http://csg.csail.mit.edu/6.375
Combinational IFFT in0 … in1 in2 in63 in3 in4 Bfly4 x16 Bfly4 … … out0 … out1 out2 out63 out3 out4 Permute stage_f function repeat stage_f three times function Vector#(64, Complex) stage_f (Bit#(2) stage, Vector#(64, Complex) stage_in); function Vector#(64, Complex) ifft (Vector#(64, Complex) in_data); February 9, 2011 L03-9http://csg.csail.mit.edu/6.375
BSV Code: Combinational IFFT function Vector#(64, Complex) ifft (Vector#(64, Complex) in_data); //Declare vectors Vector#(4,Vector#(64, Complex)) stage_data; stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]); return(stage_data[3]); The for-loop is unfolded and stage_f is inlined during static elaboration Note: no notion of loops or procedures during execution February 9, 2011 L03-10http://csg.csail.mit.edu/6.375
BSV Code: Combinational IFFT- Unfolded function Vector#(64, Complex) ifft (Vector#(64, Complex) in_data); //Declare vectors Vector#(4,Vector#(64, Complex)) stage_data; stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]); return(stage_data[3]); Stage_f can be inlined now; it could have been inlined before loop unfolding also. Does the order matter? stage_data[1] = stage_f(0,stage_data[0]); stage_data[2] = stage_f(1,stage_data[1]); stage_data[3] = stage_f(2,stage_data[2]); February 9, 2011 L03-11http://csg.csail.mit.edu/6.375
Bluespec Code for stage_f function Vector#(64, Complex) stage_f (Bit#(2) stage, Vector#(64, Complex) stage_in); begin for (Integer i = 0; i < 16; i = i + 1) begin Integer idx = i * 4; let twid = getTwiddle(stage, fromInteger(i)); let y = bfly4(twid, stage_in[idx:idx+3]); stage_temp[idx] = y[0]; stage_temp[idx+1] = y[1]; stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3]; end //Permutation for (Integer i = 0; i < 64; i = i + 1) stage_out[i] = stage_temp[permute[i]]; end return(stage_out); twid’s are mathematically derivable constants February 9, 2011 L03-12http://csg.csail.mit.edu/6.375
Higher-order functions: Stage functions f1, f2 and f3 function f0(x); return (stage_f(0,x)); endfunction function f1(x); return (stage_f(1,x)); endfunction function f2(x); return (stage_f(2,x)); endfunction What is the type of f0(x) ? function Vector#(64, Complex) f0 (Vector#(64, Complex) x); February 9, 2011 L03-13http://csg.csail.mit.edu/6.375
Suppose we want to reuse some part of the circuit... in0 … in1 in2 in63 in3 in4 Bfly4 x16 Bfly4 … … out0 … out1 out2 out63 out3 out4 Permute Reuse the same circuit three times to reduce area But why? February 9, 2011 L03-14http://csg.csail.mit.edu/6.375
Architectural Exploration: Area-Performance tradeoff in a Transmitter February 9, 2011L
802.11a Transmitter Overview ControllerScramblerEncoderInterleaverMapper IFFT Cyclic Extend headers data IFFT Transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers accounts for 85% area 24 Uncoded bits One OFDM symbol (64 Complex Numbers) Must produce one OFDM symbol every 4 sec Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol February 9, 2011 L03-16http://csg.csail.mit.edu/6.375
Preliminary results [MEMOCODE 2006] Dave, Gerding, Pellauer, Arvind Design Lines of Block Code (BSV) Controller 49 Scrambler 40 Conv. Encoder 113 Interleaver 76 Mapper 112 IFFT 95 Cyc. Extender 23 Complex arithmetic libraries constitute another 200 lines of code Relative Area 0% 1% 11% 85% 3% February 9, 2011 L03-17http://csg.csail.mit.edu/6.375
Combinational IFFT in0 … in1 in2 in63 in3 in4 Bfly4 x16 Bfly4 … … out0 … out1 out2 out63 out3 out4 Permute Reuse the same circuit three times to reduce area February 9, 2011 L03-18http://csg.csail.mit.edu/6.375
fg Design Alternatives Reuse a block over multiple cycles we expect: Throughput to Area to ff g decrease – less parallelism The clock needs to run faster for the same throughput hyper-linear increase in energy decrease – reusing a block February 9, 2011 L03-19http://csg.csail.mit.edu/6.375
Circular pipeline: Reusing the Pipeline Stage in0 … in1 in2 in63 in3 in4 out0 … out1 out2 out63 out3 out4 … Bfly4 Permute Stage Counter February 9, 2011 L03-20http://csg.csail.mit.edu/6.375
Superfolded circular pipeline: Just one Bfly-4 node! in0 … in1 in2 in63 in3 in4 out0 … out1 out2 out63 out3 out4 Bfly4 Permute Index == 15? Index: 0 to 15 64, 2-way Muxes 4, 16-way Muxes 4, 16-way DeMuxes Stage 0 to 2 February 9, 2011 L03-21http://csg.csail.mit.edu/6.375
Pipelining a block inQoutQ f2f1 f3 Combinational C inQoutQ f2f1 f3 Pipeline P inQoutQ f Folded Pipeline FP Clock? Area?Throughput? Clock: C < P FP Area: FP < C < PThroughput: FP < C < P February 9, 2011 L03-22http://csg.csail.mit.edu/6.375
Inelastic pipeline x sReg1inQ f0f1f2 sReg2outQ rule sync-pipeline (True); inQ.deq(); sReg1 <= f0(inQ.first()); sReg2 <= f1(sReg1); outQ.enq(f2(sReg2)); endrule This is real IFFT code; just replace f0, f1 and f2 with stage_f code This rule can fire only if Atomicity: Either all or none of the state elements inQ, outQ, sReg1 and sReg2 will be updated - inQ has an element - outQ has space February 9, 2011 L03-23http://csg.csail.mit.edu/6.375
Stage functions f1, f2 and f3 function f0(x); return (stage_f(0,x)); endfunction function f1(x); return (stage_f(1,x)); endfunction function f2(x); return (stage_f(2,x)); endfunction The stage_f function was given earlier February 9, 2011 L03-24http://csg.csail.mit.edu/6.375
Problem: What about pipeline bubbles? x sReg1inQ f0f1f2 sReg2outQ rule sync-pipeline (True); inQ.deq(); sReg1 <= f0(inQ.first()); sReg2 <= f1(sReg1); outQ.enq(f2(sReg2)); endrule Red and Green tokens must move even if there is nothing in the inQ! Modify the rule to deal with these conditions Also if there is no token in sReg2 then nothing should be enqueued in the outQ Valid bits or the Maybe type February 9, 2011 L03-25http://csg.csail.mit.edu/6.375
The Maybe type data in the pipeline typedef union tagged { void Invalid; data_T Valid; } Maybe#(type data_T); data valid/invalid Registers contain Maybe type values rule sync-pipeline (True); if (inQ.notEmpty()) begin sReg1 <= tagged Valid f0(inQ.first()); inQ.deq(); end else sReg1 <= tagged Invalid; case (sReg1) matches tagged Valid.sx1: sReg2 <= tagged Valid f1(sx1); tagged Invalid: sReg2 <= tagged Invalid; endcase case (sReg2) matches tagged Valid.sx2: outQ.enq(f2(sx2)); endcase endrule sx1 will get bound to the appropriate part of sReg1 February 9, 2011 L03-26http://csg.csail.mit.edu/6.375
When is this rule enabled? rule sync-pipeline (True); if (inQ.notEmpty()) begin sReg1 <= tagged Valid f0(inQ.first()); inQ.deq(); end else sReg1 <= tagged Invalid; case (sReg1) matches tagged Valid.sx1: sReg2 <= tagged Valid f1(sx1); tagged Invalid: sReg2 <= tagged Invalid; endcase case (sReg2) matches tagged Valid.sx2: outQ.enq(f2(sx2)); endcase endrule February 9, 2011 L03-27http://csg.csail.mit.edu/6.375 inQsReg1sReg2outQ NEVVNF NEVVF NEVINF NEVIF NEIVNF NEIVF NEIINF NEIIF EVVNF EVVF EVINF EVIF EIVNF EIVF EIINF EIIF yes No Yes No Yes yes No Yes No Yes1 yes Yes1 = yes but no change inQsReg1sReg2outQ
Next lecture Folded pipeline for FFT February 9, 2011L