Architectural Exploration: 802.11a Transmitter
Arvind, Nirav Dave, Steve Gerding, Mike Pellauer
Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology
MIT-Nokia Architecture Group, Helsinki, June 5, 2006

2 Why architectural exploration
Architects are clever people and can think of a variety of designs
- But often cannot determine which design is best for a given metric (e.g., power)
- Too short of time and manpower to take several designs far enough for proper evaluation
Guesswork instead of architectural exploration
New design tools can change all that

3 This talk
Architectural exploration of an 802.11a transmitter
The goal is to show that it is easy and economical to do so in Bluespec
You don't have to know 802.11a or Bluespec to understand the talk

4 802.11a Transmitter Overview
[Figure: headers and data enter the chain Controller, Scrambler, Encoder, Interleaver, Mapper, IFFT, Cyclic Extend. Labels: "24 Uncoded bits" and "One OFDM symbol (64 Complex Numbers)"]
IFFT: transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers
- accounts for > 95% of the area
Must produce one OFDM symbol every 4 μsec
Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol

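As a point of reference for what the IFFT block computes, here is a minimal Python sketch of a 64-point inverse DFT written straight from the definition. It is a software model, not hardware code and not the Bluespec from the talk. The function name `inverse_dft` is hypothetical, and the result is left unnormalized (no 1/64 factor) as an assumption, since the slide does not state the scaling.

```python
import cmath


def inverse_dft(freq):
    """Direct-definition inverse DFT (unnormalized; scaling is an assumption)."""
    n = len(freq)
    return [
        # time sample t = sum over frequency bins k of X[k] * e^(+2*pi*i*k*t/n)
        sum(freq[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n))
        for t in range(n)
    ]


# One OFDM symbol: 64 frequency-domain values in, 64 time-domain values out.
symbol = inverse_dft([1] + [0] * 63)
```

This direct form costs 64 x 64 complex multiplications per symbol, which is why a structured radix-4 IFFT is the starting point for the hardware designs.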
5 Combinational IFFT
[Figure: in0 ... in63 pass through three stages of 16 Radix-4 nodes, each stage followed by a permutation (Permute_1, Permute_2, Permute_3), producing out0 ... out63. Inset: a Radix-4 node with four multipliers, labels t0 to t3, and a multiply by j]
All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power, ...

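The following Python sketch illustrates the idea of a complex number held as two sixteen-bit fixed-point quantities. The slide does not give the number format, so the choice of 15 fractional bits (`FRAC_BITS`) is an assumption, as are the helper names `to_fixed`, `fx_mul` and `cmul`. Overflow handling after the add and subtract is deliberately left out.

```python
FRAC_BITS = 15  # assumed format: 1 sign bit, 15 fractional bits


def to_fixed(x):
    """Convert a float in [-1, 1) to a 16-bit fixed-point integer, saturating."""
    return max(-32768, min(32767, round(x * (1 << FRAC_BITS))))


def fx_mul(a, b):
    """Fixed-point multiply: full-width product, then drop the extra fraction bits."""
    return (a * b) >> FRAC_BITS


def cmul(a, b):
    """Complex multiply on (real, imag) pairs of fixed-point integers."""
    ar, ai = a
    br, bi = b
    return (fx_mul(ar, br) - fx_mul(ai, bi), fx_mul(ar, bi) + fx_mul(ai, br))


# (0.5 + 0.5i) * (0.5 - 0.5i) = 0.5 + 0i
product = cmul((to_fixed(0.5), to_fixed(0.5)), (to_fixed(0.5), to_fixed(-0.5)))
```

Each complex multiply needs four real multipliers and two adders, which is part of why the number of Radix-4 nodes dominates the area.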
6 Design Tradeoffs
1. We can decrease the area by multiplexing some circuits
   - It may be a win if the throughput requirements can be met without increasing the frequency
2. Power can be lowered by lowering the frequency, which can be adjusted by changing the voltage
   - power ∝ (voltage)^2

7 Combinational IFFT: Opportunity for reuse
[Figure: the combinational IFFT datapath, in0 ... in63 to out0 ... out63, with three stages of 16 Radix-4 nodes and Permute_1, Permute_2, Permute_3]
Reuse the same circuit three times

8 Circular pipeline: Reusing the Pipeline Stage
[Figure: circular pipeline with in0 ... in63, out0 ... out63, one stage of Radix-4 nodes, Permute_1, Permute_2, Permute_3, a stage counter, and 64 4-way muxes]
16 Radix-4s can be shared but not the three permutations. Hence the need for muxes

9 Superfolded circular pipeline: Just one Radix-4 node!
[Figure: in0 ... in63, out0 ... out63, a single Radix-4 node, Permute_1, Permute_2, Permute_3, a stage counter (0 to 2), an index counter (0 to 15), 64 4-way muxes, 4 16-way muxes, and 4 16-way demuxes]
Designs with 2, 4, and 8 Radix-4 modules make sense too!

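A rough way to see what folding costs in time is to count Radix-4 evaluations. The Python sketch below assumes an idealized schedule in which every cycle performs as many Radix-4 operations as there are physical nodes, with no extra cycles for input or output. The function names are hypothetical. The 3 stages, 16 nodes per stage and the 4000 ns symbol period come from the slides.

```python
SYMBOL_PERIOD_NS = 4000  # one OFDM symbol every 4 microseconds
STAGES = 3               # three Radix-4 stages in the 64-point IFFT
NODES_PER_STAGE = 16     # 16 Radix-4 operations per stage


def cycles_per_symbol(radix4_nodes):
    """Idealized cycle count when one stage is shared by `radix4_nodes` units."""
    if NODES_PER_STAGE % radix4_nodes != 0:
        raise ValueError("node count must divide 16")
    return STAGES * (NODES_PER_STAGE // radix4_nodes)


def min_clock_mhz(radix4_nodes):
    """Lowest clock (MHz) that still finishes one symbol per symbol period."""
    return cycles_per_symbol(radix4_nodes) * 1000 / SYMBOL_PERIOD_NS


for nodes in (16, 8, 4, 2, 1):
    print(nodes, cycles_per_symbol(nodes), min_clock_mhz(nodes))
```

Under these assumptions the one-node design needs 48 cycles per symbol and therefore a 12 MHz clock, which is in line with the minimum frequencies listed for the 4-, 2- and 1-node designs on the power-estimates slide.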
10 Which design consumes the least energy to transmit a symbol?
Can we quickly code up all the alternatives?
- single source with parameters?
Not practical in traditional hardware description languages like Verilog/VHDL

11 Expressing the designs in Bluespec

12 Bluespec code: Radix-4 Node

    function Vector#(4,Complex) radix4(Vector#(4,Complex) t,
                                       Vector#(4,Complex) k);
       Vector#(4,Complex) m = newVector(),
                          y = newVector(),
                          z = newVector();

       // Multiply each input k[n] by the factor t[n]
       m[0] = k[0] * t[0]; m[1] = k[1] * t[1];
       m[2] = k[2] * t[2]; m[3] = k[3] * t[3];

       // First level of additions; i is the imaginary unit (*j in the figure)
       y[0] = m[0] + m[2]; y[1] = m[0] - m[2];
       y[2] = m[1] + m[3]; y[3] = i*(m[1] - m[3]);

       // Second level of additions
       z[0] = y[0] + y[2]; z[1] = y[1] + y[3];
       z[2] = y[0] - y[2]; z[3] = y[1] - y[3];

       return(z);
    endfunction

Polymorphic code: works on any type of numbers for which *, + and - have been defined
[Figure: Radix-4 node with four multipliers and a multiply by j]

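For readers who want to try the node in software, here is a direct Python transcription of the Radix-4 function above, using Python's built-in complex numbers in place of the `Complex` type. It is a model for experimentation, not generated from the Bluespec. With all factors in `t` set to 1 it reduces to an unnormalized 4-point inverse DFT.

```python
def radix4(t, k):
    """Software model of the Radix-4 node: t holds the factors, k the data."""
    # Multiply each input by its factor
    m = [k[n] * t[n] for n in range(4)]
    # First level of additions; 1j plays the role of i on the slide
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    # Second level of additions
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]


ones = [1, 1, 1, 1]
print(radix4(ones, [0, 1, 0, 0]))  # the four powers of i: 1, i, -1, -i
```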
13 Combinational IFFT
Can be used as a reference
[Figure: the combinational IFFT datapath, in0 ... in63 to out0 ... out63, with three stages of 16 Radix-4 nodes and Permute_1, Permute_2, Permute_3; one stage is marked "stage_f function"]
stage_f function: repeat it three times

14 Bluespec Code for Combinational IFFT

Stage function:

    function SVector#(64, Complex) stage_f(Bit#(2) stage,
                                           SVector#(64, Complex) stage_in);
       // Declarations not shown on the slide
       SVector#(64, Complex) stage_temp = newSVector;
       SVector#(64, Complex) stage_out  = newSVector;

       // 16 Radix-4 nodes, each working on four consecutive inputs
       for (Integer i = 0; i < 16; i = i + 1)
       begin
          Integer idx = i * 4;
          let twid = getTwiddle(stage, fromInteger(i));
          let y = radix4(twid, stage_in[idx:idx+3]);
          stage_temp[idx]     = y[0];
          stage_temp[idx + 1] = y[1];
          stage_temp[idx + 2] = y[2];
          stage_temp[idx + 3] = y[3];
       end

       // Permutation
       for (Integer i = 0; i < 64; i = i + 1)
          stage_out[i] = stage_temp[permute[i]];

       return(stage_out);
    endfunction

    function SVector#(64, Complex) ifft(SVector#(64, Complex) in_data);
       // Declare vectors
       SVector#(4, SVector#(64, Complex)) stage_data = replicate(newSVector);

       stage_data[0] = in_data;
       // Chain the three stages: stage s reads stage_data[s], writes stage_data[s+1]
       for (Integer stage = 0; stage < 3; stage = stage + 1)
          stage_data[stage+1] = stage_f(fromInteger(stage), stage_data[stage]);

       return(stage_data[3]);
    endfunction

The code is unfolded to generate a combinational circuit

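The slides do not show `getTwiddle` or the `permute` table, so the Python model below fills them in with one choice that is known to work. This is an assumption for illustration, not the authors' tables. The permutation is the same base-4 digit rotation after every stage, the input is loaded in base-4 digit-reversed order, and the result is unnormalized. With those choices, three applications of `stage_f` compute a 64-point inverse DFT.

```python
import cmath

N = 64
W = [cmath.exp(2j * cmath.pi * e / N) for e in range(N)]  # powers of e^(2*pi*i/64)


def radix4(t, k):
    """Radix-4 node: multiply data k by factors t, then a 4-point butterfly."""
    m = [k[n] * t[n] for n in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]


def get_twiddle(stage, group):
    """Assumed twiddle schedule: stage 0 uses all ones, stage 2 uses W^(q*group)."""
    span = 4 ** (2 - stage)
    base = (group // span) * span
    return [W[(q * base) % N] for q in range(4)]


# Assumed permutation, identical for all stages: rotate the base-4 digits.
PERMUTE = [4 * (p % 16) + p // 16 for p in range(N)]


def stage_f(stage, stage_in):
    """One IFFT stage: 16 Radix-4 nodes on consecutive groups, then a permutation."""
    temp = []
    for group in range(16):
        idx = 4 * group
        temp.extend(radix4(get_twiddle(stage, group), stage_in[idx:idx + 4]))
    return [temp[PERMUTE[p]] for p in range(N)]


def digit_reverse(p):
    """Reverse the three base-4 digits of p (assumed input ordering)."""
    return 16 * (p % 4) + 4 * ((p // 4) % 4) + p // 16


def ifft(in_data):
    """Unrolled three-stage IFFT, mirroring the loop on the slide."""
    data = [in_data[digit_reverse(p)] for p in range(N)]
    for stage in range(3):
        data = stage_f(stage, data)
    return data
```

Because the loop has a fixed trip count of three, unrolling it gives exactly the three-stage combinational structure. Using the same permutation after every stage is also what makes the later folded designs cheaper.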
15 Synchronous pipeline

    rule sync_pipeline (True);
       inQ.deq();
       sReg1 <= f1(inQ.first());
       sReg2 <= f2(sReg1);
       outQ.enq(f3(sReg2));
    endrule

[Figure: x enters inQ, then f1, sReg1, f2, sReg2, f3, outQ]
This is real IFFT code; just replace f1, f2 and f3 with stage_f code

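A cycle-by-cycle Python model can make the register behaviour of this rule concrete. One loop iteration stands for one firing of the rule, and both registers are updated from their old values, as the `<=` register writes imply. The rule on the slide has no valid bits, so the use of `None` to mark an empty register, and draining the pipeline once the input runs out, are assumptions added for the model. The name `run_sync_pipeline` is hypothetical.

```python
from collections import deque


def run_sync_pipeline(inputs, f1, f2, f3):
    """Simulate a three-stage synchronous pipeline; returns (outputs, cycles)."""
    in_q = deque(inputs)
    out_q = []
    s_reg1 = s_reg2 = None  # None marks an empty register (model assumption)
    cycles = 0
    while in_q or s_reg1 is not None or s_reg2 is not None:
        # Everything below reads the OLD register values, as in one clock cycle.
        if s_reg2 is not None:
            out_q.append(f3(s_reg2))
        next2 = f2(s_reg1) if s_reg1 is not None else None
        next1 = f1(in_q.popleft()) if in_q else None
        # Registers update together at the end of the cycle.
        s_reg1, s_reg2 = next1, next2
        cycles += 1
    return out_q, cycles


outputs, cycles = run_sync_pipeline(
    [1, 2, 3], lambda x: x + 1, lambda x: x * 2, lambda x: x - 3
)
```

In this model three inputs take five cycles: once the pipeline is full, one result leaves per cycle.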
16 Folded pipeline

    rule folded_pipeline (True);
       if (stage==1)
          begin inQ.deq(); sxIn = inQ.first(); end
       else
          sxIn = sReg;
       sxOut = f(stage, sxIn);
       if (stage==3) outQ.enq(sxOut);
       else sReg <= sxOut;
       stage <= (stage==3) ? 1 : stage+1;
    endrule

    function f (stage, sx);
       case (stage)
          1: return f1(sx);
          2: return f2(sx);
          3: return f3(sx);
       endcase
    endfunction

[Figure: x enters inQ; a single block f, selecting among f1, f2 and f3 by the stage counter, feeds sReg and outQ]
This is real IFFT code too...

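The same kind of software model works for the folded pipeline. The Python sketch below follows the rule step by step, with one loop iteration per rule firing. The helper names `make_f` and `run_folded_pipeline` are hypothetical. It shows that the folded design produces the same results as applying f1, f2 and f3 in sequence, at three cycles per input.

```python
from collections import deque


def make_f(f1, f2, f3):
    """Build the stage-selecting function f from the slide's case statement."""
    table = {1: f1, 2: f2, 3: f3}
    return lambda stage, sx: table[stage](sx)


def run_folded_pipeline(inputs, f):
    """Simulate the folded pipeline; returns (outputs, cycles)."""
    in_q = deque(inputs)
    out_q = []
    stage, s_reg, cycles = 1, None, 0
    while in_q or stage != 1:
        # Stage 1 takes fresh input; later stages take the saved value.
        sx_in = in_q.popleft() if stage == 1 else s_reg
        sx_out = f(stage, sx_in)
        if stage == 3:
            out_q.append(sx_out)
        else:
            s_reg = sx_out
        stage = 1 if stage == 3 else stage + 1
        cycles += 1
    return out_q, cycles


f = make_f(lambda x: x + 1, lambda x: x * 2, lambda x: x - 3)
outputs, cycles = run_folded_pipeline([1, 2, 3], f)
```

Only one copy of the stage logic is exercised per cycle, which is the area saving. The price is that each input now occupies the hardware for three cycles.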
17 Expressing these designs in Bluespec is easy
All these designs were done in less than one day!
- Combinational
- Pipelined
- Folded (16 Radices)
- Super-Folded (8 Radices)
- Super-Folded (4 Radices)
- Super-Folded (2 Radices)
- Super-Folded (1 Radix)
Area and power estimates?
How long will it take to write these designs in Verilog? VHDL? SystemC?

18 Bluespec Tool flow
[Figure: tool-flow diagram. Bluespec SystemVerilog source goes into the Bluespec Compiler, which produces C for the cycle-accurate Bluespec C sim and Verilog 95 RTL. Other boxes: Verilog sim, VCD output, Debussy Visualization, RTL synthesis, gates, FPGA, and Power estimation tool (Sequence Design PowerTheater)]

19 802.11a Transmitter: Synthesis results for various IFFT designs
[Table: columns are IFFT Design, Area (mm^2), Min. CLK Period (ns), Latency (clks/Sym), ns/output (req 4000); rows are Combinational, Pipelined, Folded (16 Radices), Super-Folded (8 Radices), SF (4 Radices), SF (2 Radices), SF (1 Radix)]
TSMC .18 micron; numbers reported are before place and route. Some areas will be larger after layout.

20 Algorithmic Improvements
[Figure: the combinational IFFT datapath, in0 ... in63 to out0 ... out63, with three stages of 16 Radix-4 nodes and Permute_1, Permute_2, Permute_3]
1. All three permutations can be made identical
   - more saving in area
2. One multiplication can be removed from Radix-4

21 802.11a Transmitter: Synthesis results, old vs. new IFFT designs
[Table: columns are IFFT Design, Old Area (mm^2), New Area (mm^2); rows are Combinational, Pipelined, Folded (16 Radices), Super-Folded (8 Radices), SF (4 Radices), SF (2 Radices), SF (1 Radix)]
TSMC .18 micron; numbers reported are before place and route.
??? expected

22 802.11a Transmitter: Synthesis results with new IFFT designs
[Table: columns are IFFT Design, Area (mm^2), Min. CLK Period (ns), Latency (clks/Symbol), Min. ns/output, Permitted Clock scaling; rows are Combinational, Pipelined, Folded (16 Radices), Super-Folded (8 Radices), SF (4 Radices), SF (2 Radices), SF (1 Radix)]
TSMC .18 micron; numbers reported are before place and route.

23 802.11a Transmitter with new IFFT designs: Power Estimates
IFFT Design (c1)    Area (mm^2) (c2)   Min Freq. (c3)   100MHz (c4)   Min Freq. (c5)   Energy/Symb (nJ) (c6)
Combinational       5.91               1 MHz
Pipeline (48 R-4)   6.26               1 MHz
Folded (16 R-4)     4.61               1 MHz
SF (8 R-4)                             MHz
SF (4 R-4)          2.75               3 MHz
SF (2 R-4)          2.21               6 MHz
SF (1 R-4)          1.67               12 MHz
c3 = min clock x scaling factor; c4 is raw data collected by the Sequence Design PowerTheater
c5 = c4 x c3 / 100MHz / voltage scaling (=10); c6 = c5 x 4 μsec
Work in progress

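The column formulas on this slide can be written out as a small Python helper. The function and parameter names are hypothetical, and the 10 mW input in the usage line is an invented illustration value, not a measurement from the talk. The units are assumed to be milliwatts and microseconds, which gives nanojoules per symbol.

```python
def energy_per_symbol_nj(power_at_100mhz_mw, min_freq_mhz,
                         voltage_scaling=10.0, symbol_period_us=4.0):
    """Apply the slide's formulas: c5 = c4 * c3 / 100MHz / scaling, c6 = c5 * 4us."""
    # c5: power at the minimum frequency, scaled down linearly with frequency
    # and by the voltage-scaling factor (10 on the slide)
    power_at_min_freq_mw = power_at_100mhz_mw * min_freq_mhz / 100.0 / voltage_scaling
    # c6: energy for one symbol; mW * us = nJ
    return power_at_min_freq_mw * symbol_period_us


# Illustration only: a design drawing 10 mW at 100 MHz that can run at 12 MHz.
example = energy_per_symbol_nj(10.0, 12.0)
```

The formula makes the tradeoff explicit: a smaller design needs a higher minimum frequency (c3), so it only wins on energy if its power at 100 MHz (c4) drops by more than its frequency rises.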
24 Summary
It is essential to do architectural exploration for better (area, power, performance, ...) designs.
It is possible to do so with new design tools and methodologies.
Better and faster tools for estimating area, timing and power would dramatically increase our capability to do architectural exploration.
Thanks