Architectural Exploration: Area-Power tradeoff in 802.11a transmitter design Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 16, 2007 http://csg.csail.mit.edu/6.375/
This lecture has two purposes Illustrate how area-power tradeoff can be studied at a high-level for a realistic design Example: 802.11a transmitter Illustrate some features of BSV Static elaboration Combinational circuits Simple synchronous pipelines Valid bits as the Maybe type in BSV No prior understanding of 802.11a is necessary to follow this lecture http://csg.csail.mit.edu/6.375/ February 16, 2007
Bluespec: Two-Level Compilation (Objects, Types, Higher-order functions) Lennart Augustsson @Sandburst 2000-2002 Type checking Massive partial evaluation and static elaboration Level 1 compilation Rules and Actions (Term Rewriting System) Now we call this Guarded Atomic Actions Rule conflict analysis Rule scheduling Level 2 synthesis James Hoe & Arvind @MIT 1997-2000 Object code (Verilog/C) http://csg.csail.mit.edu/6.375/ February 16, 2007
Static Elaboration At compile time Software Toolflow: Hardware Inline function calls and datatypes Instantiate modules with specific parameters Resolve polymorphism/overloading source Software Toolflow: source Hardware Toolflow: design2 design3 design1 elaborate w/params .exe compile run1 run2.1 … run3.1 run1.1 run w/ params run w/ params run2 run1 … run3 http://csg.csail.mit.edu/6.375/ February 16, 2007
802.11a Transmitter Overview headers Must produce one OFDM symbol every 4 msec Controller Scrambler Encoder 24 Uncoded bits data Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol Interleaver Mapper IFFT Cyclic Extend One OFDM symbol (64 Complex Numbers) IFFT Transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers http://csg.csail.mit.edu/6.375/ accounts for 85% area February 16, 2007
Preliminary results [MEMOCODE 2006] Dave, Gerding, Pellauer, Arvind Design Lines of Relative Block Code (BSV) Area Controller 49 0% Scrambler 40 0% Conv. Encoder 113 0% Interleaver 76 1% Mapper 112 11% IFFT 95 85% Cyc. Extender 23 3% Complex arithmetic libraries constitute another 200 lines of code http://csg.csail.mit.edu/6.375/ February 16, 2007
Combinational IFFT … * + - *j t2 t0 t3 t1 Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute_1 Permute_2 Permute_3 * + - *j t2 t0 t3 t1 Constants t0 to t3 are different for each box and can have dramatci impact on optimizations. All numbers are complex and represented as two sixteen bit quantities. Fixed-point arithmetic is used to reduce area, power, ... http://csg.csail.mit.edu/6.375/ February 16, 2007
4-way Butterfly Node BSV has a very strong notion of types * + - *i t0 k0 k1 k2 k3 function Vector#(4,Complex) Bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k); BSV has a very strong notion of types Every expression has a type. Either it is declared by the user or automatically deduced by the compiler The compiler verifies that the type declarations are compatible http://csg.csail.mit.edu/6.375/ February 16, 2007
BSV code: 4-way Butterfly function Vector#(4,Complex) Bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k); Vector#(4,Complex) m = newVector(), y = newVector(), z = newVector(); m[0] = k[0] * t[0]; m[1] = k[1] * t[1]; m[2] = k[2] * t[2]; m[3] = k[3] * t[3]; y[0] = m[0] + m[2]; y[1] = m[0] – m[2]; y[2] = m[1] + m[3]; y[3] = i*(m[1] – m[3]); z[0] = y[0] + y[2]; z[1] = y[1] + y[3]; z[2] = y[0] – y[2]; z[3] = y[1] – y[3]; return(z); endfunction * + - *i m y z Polymorphic code: works on any type of numbers for which *, + and - have been defined Note: Vector does not mean storage http://csg.csail.mit.edu/6.375/ February 16, 2007
Combinational IFFT … stage_f function repeat it three times in0 in1 Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute_1 Permute_2 Permute_3 stage_f function repeat it three times http://csg.csail.mit.edu/6.375/ February 16, 2007
BSV Code: Combinational IFFT function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data); //Declare vectors SVector#(4,SVector#(64, Complex)) stage_data = replicate(newSVector); stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]); return(stage_data[3]); The for loop is unfolded and stage_f is inlined during static elaboration Note: no notion of loops or procedures during execution http://csg.csail.mit.edu/6.375/ February 16, 2007
BSV Code: Combinational IFFT- Unfolded function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data); //Declare vectors SVector#(4,SVector#(64, Complex)) stage_data = replicate(newSVector); stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]); return(stage_data[3]); stage_data[1] = stage_f(0,stage_data[0]); stage_data[2] = stage_f(1,stage_data[1]); stage_data[3] = stage_f(2,stage_data[2]); Stage_f can be inlined now; it could have been inlined before loop unfolding also. Does the order matter? http://csg.csail.mit.edu/6.375/ February 16, 2007
Bluespec Code for stage_f function SVector#(64, Complex) stage_f (Bit#(2) stage, SVector#(64, Complex) stage_in); begin for (Integer i = 0; i < 16; i = i + 1) Integer idx = i * 4; let twid = getTwiddle(stage, fromInteger(i)); let y = bfly4(twid, stage_in[idx:idx+3]); stage_temp[idx] = y[0]; stage_temp[idx+1] = y[1]; stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3]; end //Permutation for (Integer i = 0; i < 64; i = i + 1) stage_out[i] = stage_temp[permute[i]]; return(stage_out); http://csg.csail.mit.edu/6.375/ February 16, 2007
Architectural Exploration February 16, 2007 http://csg.csail.mit.edu/6.375/
Design Alternatives Reuse a block over multiple cycles we expect: f g f g we expect: Throughput to Area to decrease – less parallelism decrease – reusing a block Energy/unit work ? The clock needs to run faster for the same throughput hyper-linear increase in energy more on power issues later http://csg.csail.mit.edu/6.375/ February 16, 2007
Combinational IFFT Opportunity for reuse … in1 in2 in63 in3 in4 Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute_1 Permute_2 Permute_3 Reuse the same circuit three times http://csg.csail.mit.edu/6.375/ February 16, 2007
Circular pipeline: Reusing the Pipeline Stage … in1 in2 in63 in3 in4 out0 … out1 out2 out63 out3 out4 … Bfly4 Permute_1 64, 4-way Muxes Permute_2 Stage Counter 16 Bfly4s can be shared but not the three permutations. Hence the need for muxes Permute_3 http://csg.csail.mit.edu/6.375/ February 16, 2007
Superfolded circular pipeline: Just one Bfly-4 node! … in1 in2 in63 in3 in4 out0 out1 out2 out63 out3 out4 Bfly4 Permute_1 Permute_2 Permute_3 Stage Counter 0 to 2 Index Counter 0 to 15 64, 4-way Muxes 4, 16-way Muxes 4, 16-way DeMuxes http://csg.csail.mit.edu/6.375/ February 16, 2007
Algorithmic Improvements in0 … in1 in2 in63 in3 in4 Bfly-4 Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute_1 Permute_2 Permute_3 1. All the three permutations can be made identical more saving in area in the folded case 2. One multiplication can be removed from Bfly-4 http://csg.csail.mit.edu/6.375/ February 16, 2007
Area improvements because of change in Algorithm Design Old Area New Area (mm2) (mm2) Combinational 4.69 4.91 Simple Pipe 5.14 5.25 Folded Pipe 5.89 3.97 http://csg.csail.mit.edu/6.375/ February 16, 2007
Which design consumes the least energy to transmit a symbol? Can we quickly code up all the alternatives? single source with parameters? Not practical in traditional hardware description languages like Verilog/VHDL 5-minute break to stretch you legs http://csg.csail.mit.edu/6.375/ February 16, 2007
Pipelining a block inQ outQ f2 f1 f3 Combinational C inQ outQ f2 f1 f3 Pipeline P inQ outQ f Folded Pipeline FP Clock: C < P FP Clock? Area? Throughput? Area: FP < C < P Throughput: FP < C < P http://csg.csail.mit.edu/6.375/ February 16, 2007
Synchronous pipeline x sReg1 inQ f1 f2 f3 sReg2 outQ rule sync-pipeline (True); inQ.deq(); sReg1 <= f1(inQ.first()); sReg2 <= f2(sReg1); outQ.enq(f3(sReg2)); endrule This rule can fire only if - inQ has an element - outQ has space Atomicity: Either all or none of the state elements inQ, outQ, sReg1 and sReg2 will be updated This is real IFFT code; just replace f1, f2 and f3 with stage_f code http://csg.csail.mit.edu/6.375/ February 16, 2007
Stage functions f1, f2 and f3 function f1(x); return (stage_f(1,x)); endfunction function f2(x); return (stage_f(2,x)); function f3(x); return (stage_f(3,x)); The stage_f fucntion is given on slide 12 http://csg.csail.mit.edu/6.375/ February 16, 2007
Problem: What about pipeline bubbles? x sReg1 inQ f1 f2 f3 sReg2 outQ rule sync-pipeline (True); inQ.deq(); sReg1 <= f1(inQ.first()); sReg2 <= f2(sReg1); outQ.enq(f3(sReg2)); endrule Red and Green tokens must move even if there is nothing in the inQ! Also if there is no token in sReg2 then nothing should be enqueued in the outQ Valid bits or the Maybe type Modify the rule to deal with these conditions http://csg.csail.mit.edu/6.375/ February 16, 2007
The Maybe type data in the pipeline typedef union tagged { void Invalid; data_T Valid; } Maybe#(type data_T); data valid/invalid Registers contain Maybe type values rule sync-pipeline (True); if (inQ.notEmpty()) begin sReg1 <= Valid f1(inQ.first()); inq.deq(); end else sReg1 <= Invalid; case (sReg1) matches tagged Valid .sx1: sReg2 <= Valid f2(sx1); tagged Invalid: sReg2 <= Invalid; case (sReg2) matches tagged Valid .sx2: outQ.enq(f3(sx2)); endrule http://csg.csail.mit.edu/6.375/ February 16, 2007
Folded pipeline x sReg inQ f outQ stage The same code will work for superfolded pipelines by changing n and stage function f rule folded-pipeline (True); if (stage==1) begin sxIn= inQ.first(); inQ.deq(); end else sxIn= sReg; sxOut = f(stage,sxIn); if (stage==n) outQ.enq(sxOut); else sReg <= sxOut; stage <= (stage==n)? 1 : stage+1; endrule notice stage is a dynamic parameter now! http://csg.csail.mit.edu/6.375/ February 16, 2007
802.11a Transmitter Synthesis results (Only the IFFT block is changing) IFFT Design Area (mm2) ThroughputLatency (CLKs/sym) Min. Freq Required Pipelined 5.25 04 1.0 MHz Combinational 4.91 Folded (16 Bfly-4s) 3.97 Super-Folded (8 Bfly-4s) 3.69 06 1.5 MHz SF(4 Bfly-4s) 2.45 12 3.0 MHz SF(2 Bfly-4s) 1.84 24 6.0 MHz SF (1 Bfly4) 1.52 48 12 MHZ All these designs were done in less than 24 hours! The same source code TSMC .18 micron; numbers reported are before place and route. http://csg.csail.mit.edu/6.375/ February 16, 2007
Why are the areas so similar Folding should have given a 3x improvement in IFFT area BUT a constant twiddle allows low-level optimization on a Bfly-4 block a 2.5x area reduction! http://csg.csail.mit.edu/6.375/ February 16, 2007
Summary It is essential to do architectural exploration for better (area, power, performance, ...) designs. It is possible to do so with new design tools and methodologies, i.e., Bluespec Better and faster tools for estimating area, timing and power would dramatically increase our capability to do architectural exploration. http://csg.csail.mit.edu/6.375/ February 16, 2007
Bluespec Learnings How to write highly parameterized combinational codes How to write rules for simple synchronous pipelines Effect of dynamic vs static values on generated circuits Using Maybe types to express valid/invalid data Thanks http://csg.csail.mit.edu/6.375/ February 16, 2007
Backup slides February 16, 2007 http://csg.csail.mit.edu/6.375/
Function f for the folded pipeline is the same stage_f function but ... function SVector#(64, Complex) stage_f (Bit#(2) stage, SVector#(64, Complex) stage_in); begin for (Integer i = 0; i < 16; i = i + 1) Integer idx = i * 4; let twid = getTwiddle(stage, fromInteger(i)); let y = bfly4(twid, stage_in[idx:idx+3]); stage_temp[idx] = y[0]; stage_temp[idx+1] = y[1]; stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3]; end //Permutation for (Integer i = 0; i < 64; i = i + 1) stage_out[i] = stage_temp[permute[i]]; return(stage_out); will cause a mux to be generated http://csg.csail.mit.edu/6.375/ February 16, 2007
Folded pipeline: stage function f getTwiddle getTwiddle2 getTwiddle3 twid The rest of stage_f, i.e. Bfly-4s and permutations (shared) sx http://csg.csail.mit.edu/6.375/ February 16, 2007
Function f for the Superfolded pipeline (One Bfly-4 case) f will be invoked for 48 dynamic values of stage each invocation will modify 4 numbers in sReg after 16 invocations a permutation would be done on the whole sReg http://csg.csail.mit.edu/6.375/ February 16, 2007
Code for the Superfolded pipeline stage function function SVector#(64, Complex) f (Bit#(6) stage, SVector#(64, Complex) stage_in); begin let idx = stage `mod` 16; let twid = getTwiddle(stage `div` 16, idx); let y = bfly4(twid, stage_in[idx:idx+3]); stage_temp = stage_in; stage_temp[idx] = y[0]; stage_temp[idx+1] = y[1]; stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3]; for (Integer i = 0; i < 64; i = i + 1) stage_out[i] = stage_temp[permute[i]]; end return((idx == 15) ? stage_out: stage_temp); One Bfly-4 case http://csg.csail.mit.edu/6.375/ February 16, 2007
Nirav Dave, Mike Pellauer, Steve Gerding, Arvind MEMOCODE 2006 Experimental Results Nirav Dave, Mike Pellauer, Steve Gerding, Arvind MEMOCODE 2006 February 16, 2007 http://csg.csail.mit.edu/6.375/
Expressing these designs in Bluespec was easy All these designs were done in less than one day! Designers were experts in Bluespec Area and power estimates? Combinational Pipelined Folded (16 Bfly-4s) Super-Folded (8 Bfly-4s) Super-Folded (4 Bfly-4s) Super-Folded (2 Bfly-4s) Super-Folded (1 Bfly-4) The same source code http://csg.csail.mit.edu/6.375/ February 16, 2007
Bluespec SystemVerilog source Bluespec Tool flow Blueview Bluespec SystemVerilog source Sequence Design PowerTheater Bluespec Compiler VCD output Debussy Visualization C Bluesim Cycle Accurate Verilog 95 RTL Verilog sim RTL synthesis gates Power estimation tool files Bluespec tools 3rd party tools Legend Physical Place & Route Tapeout FPGA http://csg.csail.mit.edu/6.375/ February 16, 2007
802.11a Transmitter Synthesis results (Only the IFFT block is changing) IFFT Design Area (mm2) Symbol Latency (CLKs) ThroughputLatency (CLKs/sym) Min. Freq Required Average Power (mW) Pipelined 5.25 12 04 1.0 MHz 4.92 Combinational 4.91 10 3.99 Folded (16 Bfly-4s) 3.97 7.27 Super-Folded (8 Bfly-4s) 3.69 15 06 1.5 MHz 10.9 SF(4 Bfly-4s) 2.45 21 3.0 MHz 14.4 SF(2 Bfly-4s) 1.84 33 24 6.0 MHz 21.1 SF (1 Bfly4) 1.52 57 48 12 MHZ 34.6 TSMC .18 micron; numbers reported are before place and route. (DesignCompiler), Power numbers are from Sequence PowerTheater http://csg.csail.mit.edu/6.375/ February 16, 2007
Power can be reduced further Right now all blocks in the transmitter run on the same clock if we run IFFT faster then all other blocks also run faster Bluespec has facilities for Multiple Clock Domains and the design can be easily modified to run the earlier blocks at a lower clock rate http://csg.csail.mit.edu/6.375/ February 16, 2007