Architectural Exploration: 802.11a Transmitter. Arvind, Nirav Dave, Steve Gerding, Mike Pellauer. Computer Science & Artificial Intelligence Laboratory. June 5, 2006.


1 Architectural Exploration: 802.11a Transmitter
Arvind, Nirav Dave, Steve Gerding, Mike Pellauer
Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology
MIT-Nokia Architecture Group, Helsinki, June 5, 2006

2 Why architectural exploration
Architects are clever people and can think of a variety of designs.
But they often cannot determine which design is best for a given metric (e.g., power).
There is too little time and manpower to take several designs far enough for proper evaluation ⇒ guesswork instead of architectural exploration.
New design tools can change all that.

3 This talk
Architectural exploration of an 802.11a transmitter.
The goal is to show that it is easy and economical to do so in Bluespec.
You don't have to know 802.11a or Bluespec to understand the talk.

4 802.11a Transmitter Overview
(Block diagram: Controller, Scrambler, Encoder, Interleaver, Mapper, IFFT, Cyclic Extend; inputs: headers, data. Labels: 24 uncoded bits; one OFDM symbol, i.e. 64 complex numbers.)
The IFFT transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers and accounts for > 95% of the area.
The transmitter must produce one OFDM symbol every 4 μsec.
Depending upon the transmission rate, it consumes 1, 2 or 4 tokens to produce one OFDM symbol.

5 Combinational IFFT
(Diagram: in0 … in63 feed three stages, each made of 16 Radix-4 nodes followed by a permutation: Permute_1, Permute_2, Permute_3; the last stage drives out0 … out63.)
(Radix-4 node diagram: four multipliers on inputs t0 … t3, two layers of adders and subtractors, and one multiplication by j.)
All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power, ...

6 Design Tradeoffs
1. We can decrease the area by multiplexing some circuits. It may be a win if the throughput requirements can be met without increasing the frequency.
2. Power can be lowered by lowering the frequency, which can be adjusted by changing the voltage: power ∝ (voltage)²

7 Combinational IFFT: opportunity for reuse
(Diagram: the combinational IFFT again, in0 … in63 through three stages of 16 Radix-4 nodes with Permute_1, Permute_2, Permute_3 to out0 … out63.)
Reuse the same circuit three times.

8 Circular pipeline: reusing the pipeline stage
(Diagram labels: in0 … in63; 64 4-way muxes; Radix 4; Permute_1, Permute_2, Permute_3; Stage Counter; out0 … out63.)
The 16 Radix-4 nodes can be shared, but not the three permutations. Hence the need for muxes.

9 Superfolded circular pipeline: just one Radix-4 node!
(Diagram labels: in0 … in63; Radix 4; Permute_1, Permute_2, Permute_3; Stage Counter 0 to 2; Index Counter 0 to 15; 64 4-way muxes; 4 16-way muxes; 4 16-way demuxes; out0 … out63.)
Designs with 2, 4, and 8 Radix-4 modules make sense too!
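The stage counter (0 to 2) and index counter (0 to 15) on this slide suggest a simple way to count IFFT cycles for each folding factor. The following Python sketch is an idealized model, not code from the talk: the function name `radix4_schedule` and the assumption that each clock cycle processes `num_units` consecutive Radix-4 node positions of one stage are illustrative only.

```python
def radix4_schedule(num_units):
    """Return one (stage, node_indices) entry per clock cycle.

    Idealized model: 3 stages of 16 Radix-4 node positions each, with
    num_units physical Radix-4 nodes working in parallel every cycle.
    """
    if num_units < 1 or 16 % num_units != 0:
        raise ValueError("num_units must divide 16")
    schedule = []
    for stage in range(3):                    # stage counter: 0 to 2
        for base in range(0, 16, num_units):  # index counter steps by num_units
            schedule.append((stage, list(range(base, base + num_units))))
    return schedule


if __name__ == "__main__":
    for units in (16, 8, 4, 2, 1):
        print(units, "Radix-4 node(s):", len(radix4_schedule(units)), "cycles")
```

Under these assumptions the IFFT alone needs 3 × 16 / num_units cycles per symbol; the rest of the transmitter is not modeled.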

10 Which design consumes the least energy to transmit a symbol?
Can we quickly code up all the alternatives? Single source with parameters?
Not practical in traditional hardware description languages like Verilog/VHDL.

11 Expressing the designs in Bluespec

12 Bluespec code: Radix-4 Node

    function Vector#(4,Complex) radix4(Vector#(4,Complex) t,
                                       Vector#(4,Complex) k);
        Vector#(4,Complex) m = newVector(),
                           y = newVector(),
                           z = newVector();

        // Element-wise multiplication of the two input vectors
        m[0] = k[0] * t[0];  m[1] = k[1] * t[1];
        m[2] = k[2] * t[2];  m[3] = k[3] * t[3];

        // First layer of adders and subtractors
        y[0] = m[0] + m[2];  y[1] = m[0] - m[2];
        y[2] = m[1] + m[3];  y[3] = i*(m[1] - m[3]);  // i: imaginary unit (the *j block in the diagram)

        // Second layer of adders and subtractors
        z[0] = y[0] + y[2];  z[1] = y[1] + y[3];
        z[2] = y[0] - y[2];  z[3] = y[1] - y[3];

        return(z);
    endfunction

Polymorphic code: works on any type of numbers for which *, + and - have been defined.
(Radix-4 node diagram: four multipliers, two layers of adders and subtractors, one multiplication by j.)
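As a quick sanity check of the data flow in the Radix-4 node, here is a Python transliteration. It is only a sketch: it uses Python's built-in complex numbers rather than the 16-bit fixed-point `Complex` type of the talk, and it takes the `i` in the Bluespec code to be the imaginary unit (`1j` in Python).

```python
def radix4(t, k):
    """Python model of the radix4 function: multiply, then two butterfly layers."""
    # Element-wise multiplication of the two 4-element inputs
    m = [k[n] * t[n] for n in range(4)]
    # First layer of adders and subtractors (y[3] is also multiplied by j)
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    # Second layer of adders and subtractors
    z = [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]
    return z


if __name__ == "__main__":
    ones = [1, 1, 1, 1]
    # A unit impulse in each input position, with all multipliers set to 1
    for pos in range(4):
        impulse = [1 if n == pos else 0 for n in range(4)]
        print(pos, radix4(impulse, ones))
```

With all multipliers set to 1, an impulse in position p yields the sequence j^(p·n) for n = 0..3, which is the kernel of an unnormalized 4-point inverse DFT.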

13 Combinational IFFT: can be used as a reference
(Diagram: in0 … in63 through three stages of 16 Radix-4 nodes with Permute_1, Permute_2, Permute_3 to out0 … out63; one stage is labeled "stage_f function".)
Repeat the stage_f function three times.

14 Bluespec Code for Combinational IFFT

Stage function:

    function SVector#(64, Complex) stage_f(Bit#(2) stage,
                                           SVector#(64, Complex) stage_in);
        // (declarations of stage_temp and stage_out are not shown on the slide)
        begin
            // 16 Radix-4 nodes, each working on 4 consecutive elements
            for (Integer i = 0; i < 16; i = i + 1)
            begin
                Integer idx = i * 4;
                let twid = getTwiddle(stage, fromInteger(i));
                let y = radix4(twid, stage_in[idx:idx+3]);
                stage_temp[idx]     = y[0];
                stage_temp[idx + 1] = y[1];
                stage_temp[idx + 2] = y[2];
                stage_temp[idx + 3] = y[3];
            end
            //Permutation
            for (Integer i = 0; i < 64; i = i + 1)
                stage_out[i] = stage_temp[permute[i]];
        end
        return(stage_out);
    endfunction

    function SVector#(64, Complex) ifft(SVector#(64, Complex) in_data);
        //Declare vectors
        SVector#(4, SVector#(64, Complex)) stage_data = replicate(newSVector);
        stage_data[0] = in_data;
        // Chain the three stages; the loop variable is stage (the slide wrote i here)
        for (Integer stage = 0; stage < 3; stage = stage + 1)
            stage_data[stage+1] = stage_f(stage, stage_data[stage]);
        return(stage_data[3]);
    endfunction

The code is unfolded to generate a combinational circuit.

15 Synchronous pipeline
(Diagram: x → inQ → f1 → sReg1 → f2 → sReg2 → f3 → outQ)

    rule sync_pipeline (True);
        inQ.deq();
        sReg1 <= f1(inQ.first());
        sReg2 <= f2(sReg1);
        outQ.enq(f3(sReg2));
    endrule

This is real IFFT code; just replace f1, f2 and f3 with stage_f code.
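To see what this rule does cycle by cycle, the following Python sketch models one firing of the rule per loop iteration. It is an illustration under stated assumptions, not part of the talk: the function name `run_sync_pipeline`, the initial register value `init`, and the use of plain Python callables for f1, f2 and f3 are all hypothetical. The key point it models is that every right-hand side reads the register values from before the rule fires.

```python
from collections import deque


def run_sync_pipeline(inputs, f1, f2, f3, init=0):
    """Model the rule: all register reads see the values from the previous cycle."""
    in_q = deque(inputs)
    out_q = []
    s_reg1 = init
    s_reg2 = init
    while in_q:                       # the rule fires only when inQ has data
        x = in_q.popleft()            # inQ.first() followed by inQ.deq()
        next_reg1 = f1(x)             # sReg1 <= f1(inQ.first())
        next_reg2 = f2(s_reg1)        # sReg2 <= f2(sReg1), old sReg1
        out_q.append(f3(s_reg2))      # outQ.enq(f3(sReg2)), old sReg2
        s_reg1, s_reg2 = next_reg1, next_reg2   # simultaneous register update
    return out_q


if __name__ == "__main__":
    out = run_sync_pipeline([1, 2, 3, 4],
                            lambda v: v + 1, lambda v: v * 2, lambda v: v - 3)
    print(out)
```

In this model the first two values enqueued come from the registers' initial contents; from the third firing onward each output equals f3(f2(f1(x))) for the input taken two firings earlier.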

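16 Folded pipeline
(Diagram: x → inQ → f → outQ, with sReg feeding back into f and a stage register selecting f1, f2 or f3.)

    rule folded_pipeline (True);
        // In stage 1 take a new input; otherwise continue with the saved value
        if (stage==1)
            begin inQ.deq(); sxIn = inQ.first(); end
        else
            sxIn = sReg;
        sxOut = f(stage, sxIn);
        // After stage 3 emit the result; otherwise save it for the next stage
        if (stage==3) outQ.enq(sxOut);
        else sReg <= sxOut;
        stage <= (stage==3) ? 1 : stage+1;
    endrule

    function f (stage, sx);
        case (stage)
            1: return f1(sx);
            2: return f2(sx);
            3: return f3(sx);
        endcase
    endfunction

This is real IFFT code too...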
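The folded rule can be modeled in a few lines of Python to check that it computes the same result as the three-stage chain while using a single function block. This is a sketch under assumptions: `run_folded_pipeline` is a hypothetical name, f1, f2 and f3 are arbitrary Python callables, and one loop iteration stands for one firing of the rule.

```python
from collections import deque


def run_folded_pipeline(inputs, f1, f2, f3):
    """Model the folded rule; returns (outputs, number of rule firings)."""
    stage_fn = {1: f1, 2: f2, 3: f3}   # the case statement in function f
    in_q = deque(inputs)
    out_q = []
    s_reg = None
    stage = 1
    firings = 0
    # Keep firing while there is input left or an item is still in flight
    while in_q or stage != 1:
        if stage == 1:
            sx_in = in_q.popleft()     # inQ.deq(); sxIn = inQ.first()
        else:
            sx_in = s_reg              # sxIn = sReg
        sx_out = stage_fn[stage](sx_in)
        if stage == 3:
            out_q.append(sx_out)       # outQ.enq(sxOut)
        else:
            s_reg = sx_out             # sReg <= sxOut
        stage = 1 if stage == 3 else stage + 1
        firings += 1
    return out_q, firings


if __name__ == "__main__":
    outputs, firings = run_folded_pipeline(
        [1, 2, 3], lambda v: v + 1, lambda v: v * 2, lambda v: v - 3)
    print(outputs, firings)
```

Each input occupies the single block for three firings, so n inputs take 3n firings in this model; that is the area-for-throughput trade the folded design makes.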

17 Expressing these designs in Bluespec is easy
Designs: Combinational; Pipelined; Folded (16 Radices); Super-Folded (8 Radices); Super-Folded (4 Radices); Super-Folded (2 Radices); Super-Folded (1 Radix).
All these designs were done in less than one day!
Area and power estimates?
How long will it take to write these designs in Verilog? VHDL? SystemC?

18 Bluespec Tool flow
(Flow diagram labels: Bluespec SystemVerilog source; Bluespec Compiler; Verilog 95 RTL; Verilog sim; VCD output; Debussy Visualization; RTL synthesis; gates; C; Bluespec C sim; Cycle Accurate; FPGA; Power estimation tool: Sequence Design PowerTheater.)

19 802.11a Transmitter: synthesis results for various IFFT designs

    IFFT Design                Area (mm²)  Min. CLK Period (ns)  Latency (clks/Sym)  ns/output (req 4000)
    Combinational              15.15       33.0                  10                  132
    Pipelined                  15.50       12.2                  12                  49
    Folded (16 Radices)        6.26        13.0                  12                  52
    Super-Folded (8 Radices)   4.02        13.1                  15                  79
    SF (4 Radices)             2.86        13.1                  21                  157
    SF (2 Radices)             2.33        13.2                  33                  317
    SF (1 Radix)               2.00        13.2                  48                  634

TSMC 0.18 micron; numbers reported are before place and route. Some areas will be larger after layout.

20 Algorithmic Improvements
(Diagram: the combinational IFFT, in0 … in63 through three stages of 16 Radix-4 nodes with Permute_1, Permute_2, Permute_3 to out0 … out63.)
1. All three permutations can be made identical ⇒ more savings in area.
2. One multiplication can be removed from the Radix-4 node.

21 802.11a Transmitter: synthesis results, old vs. new IFFT designs

    IFFT Design                Old Area (mm²)  New Area (mm²)
    Combinational              15.15           5.91
    Pipelined                  15.50           6.26
    Folded (16 Radices)        6.26            4.61
    Super-Folded (8 Radices)   4.02            3.57
    SF (4 Radices)             2.86            2.75
    SF (2 Radices)             2.33            2.21
    SF (1 Radix)               2.00            1.67

TSMC 0.18 micron; numbers reported are before place and route.
(Slide annotations: "???", "expected")

22 802.11a Transmitter: synthesis results with new IFFT designs

    IFFT Design                Area (mm²)  Min. CLK Period (ns)  Latency (clks/Symbol)  Min. ns/output  Permitted Clock scaling
    Combinational              5.91        33.0                  10                     132             30
    Pipelined                  6.26        12.0                  12                     49              83
    Folded (16 Radices)        4.61        13.0                  12                     52              77
    Super-Folded (8 Radices)   3.57        13.1                  15                     79              51
    SF (4 Radices)             2.75        13.1                  21                     157             25
    SF (2 Radices)             2.21        13.1                  33                     314             13
    SF (1 Radix)               1.67        13.1                  57                     629             6

TSMC 0.18 micron; numbers reported are before place and route.
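The "Permitted Clock scaling" column appears to be the 4000 ns symbol budget divided by the minimum time per output. The Python sketch below reproduces that column under one labeled assumption: the number of clock cycles per output (4, 4, 4, 6, 12, 24, 48) is not a column of the table; it is inferred here by dividing "Min. ns/output" by "Min. CLK Period". The function and variable names are illustrative.

```python
SYMBOL_BUDGET_NS = 4000.0  # one OFDM symbol every 4 microseconds


def permitted_clock_scaling(min_clk_period_ns, cycles_per_output):
    """How many times the clock can be slowed while still meeting the budget."""
    min_ns_per_output = min_clk_period_ns * cycles_per_output
    return SYMBOL_BUDGET_NS / min_ns_per_output


# (design, min clock period in ns from the table, ASSUMED cycles per output)
DESIGNS = [
    ("Combinational", 33.0, 4),
    ("Pipelined", 12.0, 4),
    ("Folded (16 Radices)", 13.0, 4),
    ("Super-Folded (8 Radices)", 13.1, 6),
    ("SF (4 Radices)", 13.1, 12),
    ("SF (2 Radices)", 13.1, 24),
    ("SF (1 Radix)", 13.1, 48),
]

if __name__ == "__main__":
    for name, period_ns, cycles in DESIGNS:
        scaling = permitted_clock_scaling(period_ns, cycles)
        print(f"{name:26s} {scaling:6.2f}  (rounded: {round(scaling)})")
```

Rounded to the nearest integer, these ratios match the scaling values listed in the table.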

23 802.11a Transmitter with new IFFT designs: power estimates (work in progress)

    IFFT Design (c1)    Area (mm²) (c2)  Min Freq. (c3)  Power (mW) @ 100 MHz (c4)  Power (mW) @ Min Freq. (c5)  Energy/Symbol (nJ) (c6)
    Combinational       5.91             1 MHz           398.6                      0.399                        1.594
    Pipeline (48 R-4)   6.26             1 MHz           438.6                      0.439                        1.754
    Folded (16 R-4)     4.61             1 MHz           475.6                      0.476                        1.902
    SF (8 R-4)          3.57             1.5 MHz         299.7                      0.446                        1.798
    SF (4 R-4)          2.75             3 MHz           166.2                      0.499                        1.994
    SF (2 R-4)          2.21             6 MHz           98.7                       0.592                        2.369
    SF (1 R-4)          1.67             12 MHz          66.2                       0.794                        3.178

c3 = min clock x scaling factor; c4 is raw data collected by the Sequence Design PowerTheater.
c5 = c4 x c3 / 100 MHz / voltage scaling (= 10); c6 = c5 x 4 μsec.
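The footnote formulas for c5 and c6 can be applied directly to a row of the table. The Python sketch below does this for illustration; the function names are hypothetical, and the voltage-scaling factor of 10 and the 4 μs symbol time are taken from the footnote. Since milliwatts times microseconds gives nanojoules, no further unit conversion is needed.

```python
def power_at_min_freq_mw(power_at_100mhz_mw, min_freq_mhz, voltage_scaling=10.0):
    """c5 = c4 x c3 / 100 MHz / voltage scaling."""
    return power_at_100mhz_mw * min_freq_mhz / 100.0 / voltage_scaling


def energy_per_symbol_nj(power_mw, symbol_time_us=4.0):
    """c6 = c5 x 4 microseconds (mW x us = nJ)."""
    return power_mw * symbol_time_us


if __name__ == "__main__":
    # Combinational row: c3 = 1 MHz, c4 = 398.6 mW
    c5 = power_at_min_freq_mw(398.6, 1.0)
    c6 = energy_per_symbol_nj(c5)
    print(f"c5 = {c5:.4f} mW, c6 = {c6:.4f} nJ")
```

For the combinational row this gives about 0.399 mW and 1.594 nJ, in line with the table.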

24 Summary
It is essential to do architectural exploration for better (area, power, performance, ...) designs.
It is possible to do so with new design tools and methodologies.
Better and faster tools for estimating area, timing and power would dramatically increase our capability to do architectural exploration.
Thanks

